Fuel for AI: a selection of open datasets for machine learning

Related Open Data Community Projects (Linked Open Data Cloud Project). Many datasets on this diagram may include data protected by copyright, and they are not mentioned in this article.

If you are not doing your AI right now, then others will do it for you instead of you. Nothing more prevents you from creating a system based on machine learning. There is an open library of deep learning TensorFlow , a large number of algorithms for learning in the Torch library, a framework for implementing distributed processing of unstructured and semi-structured Spark data and many other tools that facilitate work.

Add to this the availability of large computational power, and you will realize that for complete happiness, only one ingredient is missing - data. A huge amount of data is publicly available, but it is not easy to understand which of the open datasets should pay attention to, which of them are suitable for testing ideas, and which can be useful as a means of testing potential products or their properties before you accumulate your own proprietary data.

We understood this question and collected data on datasets that meet the criteria of openness, relevance, speed of work and proximity to real tasks.

Computer vision

Visual genom

Open data for machine learning is like free electricity for the electric vehicle market. Therefore, research groups that do not pursue direct financial gain make a great contribution to the process of obtaining new datasets. Thus, an international team of researchers, which included scientists from Stanford University, as well as representatives of Yahoo and Snapchat, developed a new Visual Genom database and an image evaluation algorithm that will allow artificial intelligence systems to understand what is happening in the images. All images in the Visual Genome database are marked in such a way as to contain information about all the objects in the image, their features and connections.

ImageNet

Previously, researchers from Stanford University presented ImageNet datasets, which contains more than a million images, marked by the content shown in the snapshot of the event. Many companies that create an API for working with images have REST interfaces that use labels suspiciously similar to the WordNet 1000-categorical hierarchy from ImageNet.

MIAS (Mammographic Image Analysis Society)

Datasets by mammograms, in which doctors can recognize cancer by using algorithms. The array is a real chest snapshots with known types of diseases.

Landsat8

Landsat-8 is an Earth remote sensing satellite launched into orbit in 2013. The satellite collects and stores multi-spectral images of medium resolution (30 meters per point). Landsat-8 data is available from 2015, along with some sample snapshots of 2013–14. All new Landsat-8 images appear every day in just a few hours after their creation.

MNIST (Mixed National Institute of Standards and Technology) database of handwritten digits

A handwritten number database with a prepared set of learning values in the amount of 60,000 images for training and 10,000 images for testing. Figures taken from the US Census Bureau sample set (with the addition of test samples written by students at American universities) are normalized in size and have a fixed image size. This database is a standard proposed by the National Institute of Standards and Technology of the USA for the purpose of calibrating and comparing image recognition methods.

Chars74K

The next stage of evolution for those who passed handwritten numbers. This dataset includes 74,000 images of various characters (alphabet, numbers, etc.).

Open Source Biometric Recognition Data

Biometric recognition data (front face image), obtained using the open source engine.

SVHN

House numbers from Google Street View. 73,257 numbers for training, 26,032 numbers for testing, and 531,131 a somewhat less complicated sample to use as additional training data.

Natural languages

Common Crawl Corpus

The body of these web pages of more than 540 terabytes - consists of more than 5 billion web pages. This dataset is freely available on Amazon S3.

Yelp open dataset

Yelp is a site for searching local services, such as restaurants or barbershops, with the ability to add and view ratings and reviews of these services. For many years of work, I have accumulated a huge amount of data from service users. The dataset includes 4,700,000 reviews for 156,000 companies from more than 1,000,000 users.

Wikitext

The dataset is a collection of text from more than 100 million word usage, extracted from verified Good and Selected Wikipedia articles.

Maluuba datasets

This set of CNN news articles contains 120,000 pairs of questions + context / answers. Questions are written by people in a natural language. Questions may not have answers, and answers may be multilingual. The Maluuba dataset is designed to help create “smart” chat bots that can support decision making in difficult conditions.

The Children's Book Test

Basic data consisting of pairs (questions + context / answers) extracted from children's books available under the Gutenberg Project, aimed at creating and distributing an electronic universal library. The project, founded in 1971, provides for the digitization and preservation in text format of various works of world literature - mostly texts that are freely available in all popular world languages. More than 53,000 documents are available for free download.

Twitter Sentiment Analysis

Dataset analysis of the tone of "comments" on Twitter. Contains 1,578,627 tweets showing positive and negative attitudes.

Speech

Google audioset

Comprehensive dictionary of sound events. 632 classes of audio events and a collection of 2,084,320 voice 10-second segments from a video on YouTube (more than 5 thousand hours of audio recordings).

2000 HUB5 English

English language dataset containing transcripts of 40 telephone conversations in English. The 2000 HUB5 English data is focused on speaking by phone with a specific task of speech-to-text transcription.

TED-LIUM

Audio recordings of 1495 performances on TED with full decoding.

Dataset dataset

mldata

Mldata (machine learning data set repository) is a machine learning data set repository containing more than 800 publicly available archive data sets with ratings, views, comments.

UCI Machine Learning repository

The largest repository of real and model machine learning problems, leading its history since 1987. It contains real data on applied problems of biology, medicine, physics, engineering, sociology and other fields that have become classic for the work of various algorithms. Datasets of this repository are often used by the scientific community for empirical analysis of machine learning algorithms. Includes interesting text data from UCI's Spambase spam letters, which can be used as a platform for learning personalized spam filters.

Datasets for "The Elements of Statistical Learning"

Created under the guidance of Stanford University professor Trevor Hasti, datasets for Statistical Training Elements are sets of data in various categories, such as the mineral density of skeleton bones, countries, galaxies, marketing data, spam, postal codes, and many others.

Amazon Web Services (AWS)

AWS offers several interesting datasets, including all Enron e-mail, Google Books syntax n-grams, NASA NEX data (information about climate, geology, and the state of the world's flora of more than 20 terabytes) and much more.

Kaggle

This is a platform where all users can share their datasets. They have more than 350 datasets and more than 200 of them appear as recommended by the platform.

Awesome public datasets

Several hundred datasets classified in various categories in different areas. Alas, does not contain the description.

data.world

The data.world project itself speaks of itself as a “social network for people with datasets”, but it is more correct to describe it as “GitHub for data”. This is a place where you can search, copy, analyze and download datasets. In addition, you can upload your data to data.world and use it to collaborate with other users.

One of the key differences in data.world is the tools they created to simplify working with data. The system supports SQL queries for exploring data and combining several datasets, they also have an SDK that simplifies working with data in the tool of your choice (for details, read the tutorial on the data.world Python SDK ).

Developers often forget that when creating new AI solutions or products, the most difficult part is not the algorithms, but the collection and marking of a collection of data. Standard datasets can be used for validation or as a starting point for building a more specialized solution.

Another popular misconception is the idea that solving problems with a single dataset is tantamount to carefully crafting an entire product. Use these datasets to validate or test your ideas, but do not forget to test or prototype a product, and obtain new, more reliable data that will help refine your product. Successful companies, whose business is built on data, usually pay a lot of attention to collecting new, proprietary data that will increase productivity without increasing risks.

Sources (links you will also find more examples of interesting datasets):

→ Open Data for Deep Learning
→ KDNuggets
→ Fueling the Gold Rush: The Greatest Public Datasets for AI

Source: https://habr.com/ru/post/339496/

All Articles

Fuel for AI: a selection of open datasets for machine learning

Computer vision

Natural languages

Speech

Dataset dataset

More articles: