This article is a guide to open datasets for machine learning. It starts with a selection of interesting and relatively fresh datasets and, as a bonus, ends with useful links for finding datasets on your own.
Gender recognition by voice - this database was created to identify a voice as male or female based on the acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples collected from male and female speakers.
Student alcohol consumption - data was obtained during a survey of students in mathematics and Portuguese language courses in high school. It contains a lot of interesting social, gender and educational information about students.
Malaria Cell Datasets - cell images for malaria detection.
Surveys of young people - data on the preferences, interests, habits, opinions and fears of young people.
Credit Card Fraud Detection - a dataset of anonymized credit card transactions labeled as fraudulent or genuine.
Heart disease dataset - this database contains 76 attributes, including age, gender, chest pain type and resting blood pressure.
European football database - 25,000+ matches, plus player and team attributes, for European professional football.
Wine Reviews - 130k wine reviews with variety, location, winery, price and description.
Baidu Apolloscapes - a large dataset for recognizing 26 semantically different object classes, such as cars, bicycles, pedestrians, buildings and street lamps.
Comma.ai - more than seven hours of highway driving. The dataset includes information about vehicle speed, acceleration, steering angle and GPS coordinates.
Flowers Recognition - this dataset contains 4,242 images of flowers. The images were collected from Flickr, Google Images and Yandex Images.
Chest Radiography - more than 112,000 chest radiographs from more than 30,000 unique patients.
Homicide Reports, 1980-2014 - data from the Murder Accountability Project, the most complete database of homicides in the United States currently available.
Used cars database - more than 370,000 used car listings. The data content is in German, so you will need to translate it first if you do not read German.
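For German-language data like the used cars listings, one lightweight approach is to map the German category values onto English ones before analysis. Below is a minimal sketch, assuming the data is a CSV file. The column names (`gearbox`, `fuelType`, `notRepairedDamage`), the sample rows and the mapping itself are illustrative assumptions, not taken from the dataset's documentation, so check them against the real header and values first.

```python
import csv
import io

# Hypothetical German -> English mapping; extend it after inspecting the real data.
DE_TO_EN = {
    "manuell": "manual",
    "automatik": "automatic",
    "benzin": "petrol",
    "diesel": "diesel",
    "ja": "yes",
    "nein": "no",
}

def translate_row(row, mapping=DE_TO_EN):
    """Return a copy of row with known German values replaced by English ones."""
    return {
        key: mapping.get(value.strip().lower(), value) if value else value
        for key, value in row.items()
    }

def translate_csv(text, mapping=DE_TO_EN):
    """Parse CSV text and translate every cell that has an entry in the mapping."""
    reader = csv.DictReader(io.StringIO(text))
    return [translate_row(row, mapping) for row in reader]

# Tiny made-up sample standing in for the real file.
SAMPLE = """brand,gearbox,fuelType,notRepairedDamage
volkswagen,manuell,benzin,nein
bmw,automatik,diesel,ja
"""

rows = translate_csv(SAMPLE)
for row in rows:
    print(row)
```

Values without an entry in the mapping (brand names, for example) pass through unchanged, so the mapping can be grown gradually as unfamiliar values turn up.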
The home of the US government's open data - data, tools and resources for research, web and mobile application development, and data visualization.
National Center for Chronic Disease Prevention and Health Promotion (NCCDPHP) - the center works to reduce the risk factors for chronic diseases.
The UK's largest collection of social, economic and demographic resources.
EconData - several thousand economic time series, prepared by a number of US government agencies and distributed in various formats and media.
The center for coastal research - interesting data about the sea and its biological composition. The datasets here range from an analysis of Red Sea model data to a study of temperature and currents over the narrow southern California shelf.
HotpotQA Dataset - a dataset of questions and answers for building question answering systems whose answers are easier to understand.
xView is one of the largest publicly available sets of aerial imagery of the earth. It contains images of various scenes from around the world, annotated with bounding boxes.
ImageNet - an image dataset for new algorithms, organized according to the WordNet hierarchy, in which each node is represented by hundreds or thousands of images.
LSUN - an image dataset divided into scenes and categories, with partially labeled data.
MS COCO - a large-scale dataset for object detection and segmentation.
COIL100 - 100 different objects photographed from every angle over a full circular rotation.
Google's Open Images - a collection of 9 million URLs to images annotated with labels spanning more than 6,000 categories, available under a Creative Commons license.
Labelled Faces in the Wild - a collection of 13,000 labeled face images for developing applications that use face recognition technology.
Indoor Scene Recognition - a dataset for recognizing building interiors. It contains 15,620 images in 67 categories.
Oxford's Robotic Car - more than 100 repetitions of a single route through Oxford, recorded over the course of a year. The dataset captures various combinations of weather, traffic and pedestrians, as well as longer-term changes such as road works.
Cityscape Dataset - a large dataset containing recordings of a hundred street scenes in 50 cities.
WPI datasets - datasets for recognizing traffic lights, pedestrians and road markings.
Berkeley DeepDrive - a huge dataset for self-driving systems. It contains over 100,000 videos with more than 1,100 hours of driving footage recorded at different times of day and in different weather conditions.
MIMIC-III - a dataset with de-identified health data for ~40,000 intensive care patients (demographic data, vital signs, laboratory tests and medications).
Amazon Reviews - contains about 35 million Amazon reviews spanning 18 years. The data includes product and user information, ratings and the text of each review.
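With a review dataset that carries ratings and text, a useful first sanity check is the rating distribution, since heavily skewed ratings affect how a model should be trained and evaluated. Below is a minimal sketch, assuming the reviews are stored as one JSON object per line. The field names `overall` and `reviewText` and the sample records are assumptions for illustration, and the actual schema of the Amazon Reviews files may differ.

```python
import json
from collections import Counter

def rating_histogram(lines, rating_field="overall"):
    """Count reviews per rating, skipping blank or malformed lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            review = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate the occasional broken record
        if not isinstance(review, dict):
            continue
        rating = review.get(rating_field)
        if rating is not None:
            counts[rating] += 1
    return counts

# Made-up records standing in for the real file (field names are assumed).
SAMPLE_LINES = [
    '{"overall": 5, "reviewText": "Great product"}',
    '{"overall": 1, "reviewText": "Broke after a day"}',
    '{"overall": 5, "reviewText": "Would buy again"}',
    '',
    'not json',
]

histogram = rating_histogram(SAMPLE_LINES)
for rating, count in sorted(histogram.items()):
    print(rating, count)
```

Because the function only iterates over lines, an open file handle can be passed in place of the sample list, which avoids loading millions of reviews into memory at once.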
Useful links for searching datasets:
Of course, Kaggle is the meeting place for all fans of machine learning competitions.
Machine Learning Repository is a collection of databases, domain theories and data generators that are used by the machine learning community to empirically analyze machine learning algorithms.
VisualData - a search engine for computer vision datasets, conveniently classified by category.
DATA USA - a comprehensive set of publicly available US data with visualizations, descriptions and infographics.
That brings our short selection to an end. If you have something to add or share, write in the comments.
Knowledge to everyone!
Subscribe to the Neuron channel on Telegram (@neurondata), where fresh articles and news from the world of data science appear every week. Thanks to everyone who helps with useful links, especially Igor Mariarty, Andrey Bondarenko and Matthew Kochergin.