
Another GitHub 2: machine learning, datasets and Jupyter Notebooks





Although the Internet offers many sources of free machine learning software, GitHub remains an important hub for all kinds of open source tools used by the machine learning and data analysis community.



This collection covers repositories on machine learning, datasets and Jupyter Notebooks, ranked by number of stars. The previous part covered popular repositories for learning data visualization and deep learning.



Machine learning



Awesome machine learning

38,809, 9,615



An impressive list of frameworks, libraries and software, organized by language and category (computer vision, natural language processing, etc.). The repository also lists free books on machine learning, (mostly) free machine learning courses, and data science blogs.


Scikit-learn

34,067, 16,698



A Python module for machine learning, under development since 2007 and built on the SciPy, NumPy and Matplotlib libraries. It is distributed under the BSD 3-Clause license. Scikit-learn is a general-purpose tool with algorithms for classification, regression and clustering, as well as methods for data preparation and model evaluation.



PredictionIO

11,703, 1,903



An open source machine learning framework that supports event collection, algorithm deployment, evaluation, and templates for common tasks such as classification and recommendation. It connects to existing applications through a REST API or SDK. PredictionIO is built on scalable open source services such as Hadoop, HBase (and other databases), Elasticsearch and Spark.



Dive Into Machine Learning

9,163, 1,673



Material for newcomers to the topic. The repository contains a collection of IPython tutorials for the Scikit-learn library, which implements a large number of machine learning algorithms, along with links to Python machine learning resources and more general material on data analysis. The author also links to many other tutorials on the subject.



Pattern

6,845, 1,353



A web mining module for Python with tools for data mining, natural language processing (part-of-speech tagging, n-gram search, sentiment analysis, WordNet), machine learning, network analysis and visualization. The well-documented module was created at the Computational Linguistics and Psycholinguistics research center (CLiPS) at the University of Antwerp, Belgium. The repository includes more than 50 usage examples.
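To make "n-gram search" concrete, here is a minimal plain-Python sketch of extracting and counting n-grams. It does not use Pattern's own API; the `ngrams` helper, the whitespace tokenizer and the sample sentence are assumptions made only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text; a real pipeline would use a proper tokenizer.
text = "the cat sat on the mat and the cat slept"
tokens = text.split()

# Count how often each bigram (n = 2) occurs in the text.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))
```

A library such as Pattern layers tokenization, tagging and search on top of this basic idea.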



GoLearn

6,374, 867



An actively developed machine learning library for Go. It aims to give developers a full-featured, easy-to-use and easily customizable package. GoLearn implements the Fit/Predict interface that many will know from Scikit-learn.
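The Scikit-learn-style interface mentioned here is commonly understood as a pair of methods: `fit` to train on labeled data and `predict` to label new data. The sketch below illustrates that convention in plain Python. It is not GoLearn or Scikit-learn code, and the `ToyCentroidClassifier` class and its data are hypothetical.

```python
from collections import defaultdict


class ToyCentroidClassifier:
    """Toy classifier that follows the fit/predict convention."""

    def fit(self, X, y):
        # Accumulate per-class feature sums and counts.
        sums = {}
        counts = defaultdict(int)
        for row, label in zip(X, y):
            if label not in sums:
                sums[label] = [0.0] * len(row)
            sums[label] = [s + v for s, v in zip(sums[label], row)]
            counts[label] += 1
        # The centroid of a class is the mean of its training rows.
        self.centroids_ = {
            label: [s / counts[label] for s in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        # Assign each row the label of the closest centroid.
        return [
            min(self.centroids_, key=lambda lbl: sq_dist(row, self.centroids_[lbl]))
            for row in X
        ]


X_train = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]]
y_train = ["small", "small", "large", "large"]

model = ToyCentroidClassifier().fit(X_train, y_train)
print(model.predict([[0.9, 1.1], [5.1, 4.9]]))
```

Because every model exposes the same two methods, one algorithm can be swapped for another without changing the surrounding code.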



Vowpal Wabbit

6,189, 1,519



Vowpal Wabbit pushes the boundaries of machine learning with techniques such as hashing, allreduce, learning2search, and active and interactive learning. It is designed for fast modeling of massive datasets and supports parallel learning. Particular attention is paid to reinforcement learning, with several contextual bandit algorithms implemented.
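The "hashing" in this list usually refers to the hashing trick: feature names are hashed straight into a fixed-size vector, so no feature dictionary has to be kept in memory. The sketch below shows the general idea only. It is not Vowpal Wabbit's implementation; the MD5 hash, the bucket count and the `hash_features` name are assumptions chosen for a deterministic illustration.

```python
import hashlib


def hash_features(tokens, num_buckets=16):
    """Map tokens to a fixed-length count vector via the hashing trick."""
    vector = [0] * num_buckets
    for token in tokens:
        # A stable hash keeps the mapping identical across runs;
        # MD5 is a stand-in here, not what any particular library uses.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % num_buckets
        vector[index] += 1
    return vector


features = hash_features(["price", "location", "price"], num_buckets=8)
print(features)
```

Distinct features can collide in one bucket. That is the accepted trade-off for constant memory use on very large feature spaces.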



NuPIC (Numenta Platform for Intelligent Computing)

5,852, 1,570



NuPIC implements the machine learning algorithms of Hierarchical Temporal Memory (HTM). Broadly, HTM is an attempt to model the computations of the neocortex of the human brain, with a focus on storing and recalling spatial and temporal patterns. HTM is a memory system: it is not programmed and is not given a separate algorithm for each task; instead, it learns how to solve the problem. NuPIC suits a wide range of problems, in particular anomaly detection in patterns.



aerosolve

4,522, 570



aerosolve tries to stand apart from other libraries by concentrating on human-friendly debugging tools, Scala code for training, image content analysis code that is convenient for ranking, and flexibility and control over features. The library is meant for sparse, interpretable features such as those that commonly occur in search (search keywords, filters) or pricing (number of rooms, location, price).



Code for Machine Learning for Hackers

3,467, 2,220



The companion repository for the book “Machine Learning for Hackers”. All code is written in R, a language for statistical computing (the de facto standard among statistical software) and graphics. The code relies on numerous R packages. Topics covered include common classification, ranking and regression tasks, as well as statistical procedures such as principal component analysis and multidimensional scaling.



Datasets on Github



Awesome public datasets

31,852, 5,361



Another repository of impressive size: a list divided into 30 topics, including biology, sports, museums, natural language, etc. It includes several hundred datasets, most of which are free. It also links to other collections of big data.



OpenAddresses

1,664, 745



The official repository of OpenAddresses.io, a free and open global collection of street addresses. The project includes street names, house numbers, zip codes and geographic coordinates.



Open Exoplanet Catalog

583, 176



A catalog of all known planets outside the Solar System. The database used to be updated within 24 hours of a new planet's discovery, but the project is now, unfortunately, barely being developed.



CitySDK

510, 149



A toolkit for US Census Bureau data, adapted for integration with other open datasets. It offers convenient functions for working with the Census API and building your own custom dataset: statistics, cartographic GeoJSON, lat/lng, etc.



openFDA

353, 84



openFDA is a US Food and Drug Administration (FDA) project that aims to give researchers and developers a collection of public datasets through an API, along with documentation and examples of using the data. It includes information on drug side effects, drug labeling, reports of drug withdrawals from the market, and changes in the prescription formula.



CERN Open Data Portal

247, 88



The source code for the Open Data Portal of CERN, the European Organization for Nuclear Research. The portal is described as “an access point to a growing spectrum of data obtained from CERN research.”



IPython (Jupyter) Notebooks



A list of useful GitHub repositories made up of IPython (Jupyter) notebooks focused on working with data and machine learning.



Python Machine Learning Book

9,655, 3,674



The companion repository for the first edition of the book “Python Machine Learning” (a separate repository exists for the second edition). The book covers handling missing values, converting categorical variables into formats suitable for machine learning, selecting informative features, and compressing data by projecting it onto lower-dimensional subspaces.
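Two of these preprocessing steps, filling in missing values and converting a categorical variable into numbers, can be sketched in plain Python. This is only an illustration under stated assumptions, not code from the book: the helper names `impute_mean` and `one_hot` and the sample data are hypothetical, and mean imputation is just one of several possible strategies.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the known values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Encode categories as 0/1 columns, one per sorted category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical sample data: one numeric column with a gap, one categorical.
prices = [10.0, None, 14.0]
sizes = ["M", "L", "S"]

filled_prices = impute_mean(prices)
categories, encoded_sizes = one_hot(sizes)

print(filled_prices)
print(categories, encoded_sizes)
```

One-hot encoding avoids implying an order between categories, which a plain integer code would do.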



Example Data Science Notebook

4,156, 1,463



A repository of training materials, code and data for various data analysis and machine learning projects. The notebook walks through the basic principles of data analysis using the Iris dataset and serves as an excellent illustration of building a data science workflow. The key guidelines followed in the repository are drawn from the book The Elements of Data Analytic Style (Jeff Leek, 2015).



Learn data science

2,197, 1,228



A collection of notebooks and datasets covering four algorithmic topics: linear regression, logistic regression, random forests and K-Means clustering. Learn Data Science is based on materials created for the Open Data Science Training project.
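As an illustration of the last of these topics, here is a minimal K-Means sketch in plain Python. It is not taken from the repository: the `kmeans` function, the fixed iteration count, the choice of the first k points as initial centroids, and the sample points are all assumptions made for a short, deterministic example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points, k, iterations=10):
    # Deterministic initialization: take the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Practical implementations typically use randomized initialization and stop once assignments no longer change, rather than running a fixed number of iterations.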



IPython Notebooks

2,106, 1,226



The repository contains a variety of IPython notebooks, from an overview of the language and of IPython's functionality to examples of using popular data analysis libraries. It offers a comprehensive collection of materials on machine learning, deep learning and big data processing frameworks, drawn from Andrew Ng's Machine Learning course (Coursera), Intro to TensorFlow for Deep Learning (Udacity) and a Spark course (edX).



Scikit-learn Tutorial

963, 573



A repository for learning the Scikit-learn library, which implements a large number of machine learning algorithms, both supervised and unsupervised. Scikit-learn is built on top of SciPy (Scientific Python).



Machine learning

543, 336



A series of very detailed IPython Notebook tutorials based on Andrew Ng's Machine Learning course (Stanford University), Tom Mitchell's course (Carnegie Mellon University) and Christopher M. Bishop's book Pattern Recognition and Machine Learning.



This list is by no means exhaustive, so we welcome comments listing your favorite (or your own) repositories.

Source: https://habr.com/ru/post/445530/


