A selection of datasets for machine learning

Hello, reader!

This article is a guide to open datasets for machine learning. To start, I will collect a selection of interesting and (relatively) fresh datasets. As a bonus, at the end of the article I will add useful links for finding datasets on your own.

Fewer words, more data.

A selection of datasets for machine learning:

Deaths and battles from Game of Thrones - this dataset combines three data sources, each of which is based on information from the book series.
Global Terrorism Database - More than 180,000 terrorist attacks worldwide, 1970-2017.
Bitcoin historical data - Bitcoin data at 1-minute intervals from selected exchanges, January 2012 - March 2019.
FIFA 19 complete player dataset - 18k+ FIFA 19 players, ~90 attributes extracted from the latest FIFA database.
YouTube video statistics - daily statistics of trending videos on YouTube.
Suicide rates overview, 1985 to 2016 - a comparison of socio-economic information with suicide rates by year and country.
Huge Stock Market Dataset - historical daily prices and volumes of all US stocks and ETFs.
World Development Indicators - development indicators for countries around the world.
Kaggle Machine Learning & Data Science Survey 2017 - A great insight into the state of data science and machine learning.
Gun violence data - a comprehensive record of more than 260,000 US gun violence incidents in 2013-2018.
Chest X-ray (pneumonia) - 5,863 images, 2 categories.
Gender recognition by voice - this database was created to identify a voice as male or female based on the acoustic properties of voice and speech. The dataset consists of 3,168 recorded voice samples collected from men and women.
Student alcohol consumption - the data was obtained from a survey of students in mathematics and Portuguese language courses in secondary school. It contains a lot of interesting social, gender and educational information about the students.
Malaria cell dataset - cell images for malaria detection.
Young people survey - data on the preferences, interests, habits, opinions and fears of young people.
World university rankings - explore the best universities in the world.
Credit Card Fraud Detection - anonymized credit card transactions labeled as fraudulent or genuine.
Heart disease dataset - this database contains 76 attributes, such as age, gender, type of chest pain, resting blood pressure and others.
European football database - 25,000+ matches, plus player and team attributes for European professional football.
Wine Reviews - 130k wine reviews with variety, location, winery, price and description.
Baidu Apolloscapes - a large dataset for recognizing 26 semantically different objects such as cars, bicycles, pedestrians, buildings, street lamps, etc.
Comma.ai - more than seven hours of highway driving. The dataset includes information about the vehicle's speed, acceleration, steering angle and GPS coordinates.
Flower Recognition - this dataset contains 4,242 images of flowers. The data was collected from Flickr, Google Images and Yandex Images.
Daily market price of every cryptocurrency - historical cryptocurrency prices for all tokens.
Chocolate bar ratings - expert ratings of more than 1,700 chocolate bars.
The health insurance market - data on health and dental plans in the US health insurance market.
Heartbeat sounds - a classification of heartbeat abnormalities by stethoscope.
Anime recommendations database - recommendation data from 76,000 users on myanimelist.net.
Blood cell images - 12,500 images: 4 different cell types.
Chest Radiography - more than 112,000 chest radiographs from more than 30,000 unique patients.
Homicide reports, 1980-2014 - the Murder Accountability Project is the most complete database of homicides in the United States currently available.
Used cars database - more than 370,000 used cars. The data content is in German, so you must first translate it if you do not speak German.
US government open data portal - data, tools and resources for research, web and mobile application development, and data visualization.
National Center for Chronic Disease Prevention and Health Promotion (NCCDPHP) - the center works to reduce the risk factors for chronic diseases.
The UK's largest collection of social, economic and demographic resources.
EconData - several thousand economic time series, prepared by a number of US government agencies and distributed in various formats and media.
Center for coastal research - interesting data about the sea and its biological composition. Here you can find datasets ranging from analyses of a Red Sea model to studies of temperature and currents over the narrow southern California shelf.
Sign language digits dataset - a sign language digits dataset from Ayrancı Anadolu High School in Ankara, Turkey.
Red wine quality - a simple and clear practice dataset for regression or classification modeling.
Tables of the English Premier League (1968-2019).
HotpotQA Dataset - a dataset of questions and answers that lets you build question-answering systems whose answers are easier to explain.
xView is one of the largest publicly available sets of aerial imagery of the earth. It contains images of various scenes from around the world, annotated with bounding boxes.
Labelme - Large annotated image dataset.
ImageNet - an image dataset for new algorithms, organized according to the WordNet hierarchy, in which hundreds and thousands of images represent each node of the hierarchy.
LSUN - an image dataset divided into scenes and categories, with partially labeled data.
MS COCO - large-scale dataset for the detection and segmentation of objects.
COIL100 - 100 different objects imaged at every angle in a 360-degree rotation.
Visual Genome - a dataset with ~100 thousand images annotated in detail.
Google's Open Images - a collection of 9 million URLs to images “tagged with tags covering more than 6,000 categories” under a Creative Commons license.
Labelled Faces in the Wild - a collection of 13,000 labeled face images for use in applications that rely on face recognition technology.
Stanford Dogs Dataset - contains 20,580 images of 120 dog breeds.
Indoor Scene Recognition - a dataset for recognizing building interiors. Contains 15,620 images and 67 categories.
Oxford's Robotic Car - more than 100 repetitions of one route through Oxford, recorded over the course of a year. The dataset captures various combinations of weather conditions, traffic and pedestrians, as well as longer-term changes such as road works.
Cityscape Dataset - a large dataset containing recordings of a hundred street scenes in 50 cities.
KUL Belgium Traffic Sign Dataset - more than 10,000 annotations of thousands of different traffic signs in Belgium.
LISA Laboratory for Intelligent & Safe Automobiles - dataset with road signs, traffic lights, recognized vehicles and trajectories of movement.
Bosch Small Traffic Light Dataset - Dataset with 24,000 annotated traffic lights.
WPI datasets - dataset for recognition of traffic lights, pedestrians and road markings.
Berkeley DeepDrive - a huge dataset for self-driving systems. It contains over 100,000 videos with more than 1,100 hours of driving recorded at different times of day and in different weather conditions.
MIMIC-III - a dataset with de-identified health data of ~40,000 intensive care patients (demographic data, vital signs, laboratory tests and medications).
Amazon Reviews - about 35 million Amazon reviews spanning 18 years. The data includes product and user information, ratings and the text of the review itself.
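
Many of the datasets above ship as plain CSV files, and the first step is usually to bring them down to a workable size. As a rough illustration, here is a minimal sketch of how minute-level price data (like the Bitcoin set above) could be collapsed into daily bars using only the Python standard library. The column names `Timestamp` (Unix seconds) and `Close`, the helper `daily_bars` and the sample rows are my assumptions for the example, not the actual layout of any dataset listed here; check the real header of the file you download.

```python
import csv
import io
import math
from datetime import datetime, timezone

# Made-up sample in the ASSUMED layout: Unix-second timestamp and a closing price.
SAMPLE = """Timestamp,Close
1325376000,4.50
1325376060,4.70
1325376120,NaN
1325376180,4.40
1325462400,5.00
1325462460,5.20
"""


def daily_bars(csv_file, time_col="Timestamp", price_col="Close"):
    """Collapse minute-level rows into (open, high, low, close) per UTC day.

    Assumes rows are in chronological order, so the first price seen on a
    day is the open and the last one is the close.
    """
    bars = {}
    for row in csv.DictReader(csv_file):
        raw_price = row[price_col].strip()
        if not raw_price:
            continue  # skip rows with an empty price
        price = float(raw_price)
        if math.isnan(price):
            continue  # skip minutes without trades
        moment = datetime.fromtimestamp(int(float(row[time_col])), tz=timezone.utc)
        day = moment.date().isoformat()
        if day not in bars:
            bars[day] = [price, price, price, price]
        else:
            bar = bars[day]
            bar[1] = max(bar[1], price)  # high
            bar[2] = min(bar[2], price)  # low
            bar[3] = price               # close = latest price so far
    return {day: tuple(bar) for day, bar in bars.items()}


bars = daily_bars(io.StringIO(SAMPLE))
for day, (open_, high, low, close) in bars.items():
    print(day, open_, high, low, close)
```

For a real file, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open(path, newline="")`, and adjust the column names to match.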

Useful links for searching datasets:

Kaggle - of course, the meeting place for all fans of machine learning competitions.
Google Dataset Search - search for datasets across the entire Internet. If necessary, you can also add your own datasets.
Machine Learning Repository is a collection of databases, domain theories and data generators that are used by the machine learning community to empirically analyze machine learning algorithms.
VisualData - dataset search for computer vision, with a convenient classification by category.
DATA USA - a complete set of publicly available US data with visualization, description and infographics.

That brings our short selection to an end. If you have something to add or share, write in the comments.

Knowledge to all!

Subscribe to the Neuron channel in Telegram (@neurondata) - fresh articles and news from the world of data science appear there every week. Thanks to everyone who helps with useful links, especially Igor Mariarty, Andrey Bondarenko and Matthew Kochergin.

Source: https://habr.com/ru/post/452392/
