
Neural networks: solving the mushroom classification problem with TensorFlow and Python

TensorFlow is Google's framework for building and working with neural networks. It lets you abstract away the internals of machine learning and concentrate on the problem you are actually solving. It is a very powerful tool that lets you create, train, and use neural networks of almost any known type. I did not find any sensible text on this topic on Habr, so I am writing my own. Below I describe a solution to the mushroom problem using the TensorFlow library. By the way, the approach described below is suitable for predictions in almost any field: for example, the likelihood that a person will develop cancer in the future, or an opponent's cards in poker.



Task



The essence of the problem: determine whether a mushroom is edible from its input parameters. The peculiarity is that these parameters are categorical rather than numerical. For example, the “cap shape” parameter can be “flat”, “convex”, or “conical”. The mushroom data set for training the network is taken from the UCI machine learning repository. The problem can be regarded as a kind of Hello World of machine learning, alongside the iris problem, where the flower's parameters are expressed as numbers.



Sources



You can download all the sources from my repository on GitHub: link . Do this to see the code in action. Use only the source files from the repository, because they preserve the required indentation and encoding. The whole process is analyzed in detail below.


Preparation



It is assumed that you already have TensorFlow installed. If not, you can install it by following the link . The code below uses the tf.contrib API, so it needs a TensorFlow 1.x release.



Source code



    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function

    import os

    import numpy as np
    import pandas as pd
    import tensorflow as tf
    from sklearn.model_selection import train_test_split


    # Prepare the mushroom data.
    # Converts the source CSV file into files that TensorFlow can load:
    # every categorical value becomes a number (0 or 1).
    def prepare_data(data_file_name):
        # The CSV file has no header row; the column names are taken
        # from the description file 'agaricus-lepiota.names'
        header = ['class', 'cap_shape', 'cap_surface', 'cap_color', 'bruises',
                  'odor', 'gill_attachment', 'gill_spacing', 'gill_size',
                  'gill_color', 'stalk_shape', 'stalk_root',
                  'stalk_surface_above_ring', 'stalk_surface_below_ring',
                  'stalk_color_above_ring', 'stalk_color_below_ring',
                  'veil_type', 'veil_color', 'ring_number', 'ring_type',
                  'spore_print_color', 'population', 'habitat']
        df = pd.read_csv(data_file_name, sep=',', names=header)

        # A "?" marks a missing value.
        # Drop every row that has at least one of them.
        df.replace('?', np.nan, inplace=True)
        df.dropna(inplace=True)

        # The class column holds 'e' (edible) or 'p' (poisonous).
        # Encode it as a number: 0 is poisonous, 1 is edible.
        df['class'].replace('p', 0, inplace=True)
        df['class'].replace('e', 1, inplace=True)

        # All remaining columns are categorical, but TensorFlow
        # needs numbers. The pandas function get_dummies
        # performs the conversion.
        cols_to_transform = header[1:]
        df = pd.get_dummies(df, columns=cols_to_transform)

        # Split the data into two parts:
        # the larger one for training, the smaller one for testing
        df_train, df_test = train_test_split(df, test_size=0.1)

        # Count the rows and feature columns in each set
        num_train_entries = df_train.shape[0]
        num_train_features = df_train.shape[1] - 1
        num_test_entries = df_test.shape[0]
        num_test_features = df_test.shape[1] - 1

        # Save to temporary CSV files first, because the counts have to be
        # written in front of the CSV content before TensorFlow can load it
        df_train.to_csv('train_temp.csv', index=False)
        df_test.to_csv('test_temp.csv', index=False)

        # Write the final files, with the counts at the very beginning
        with open('train_temp.csv') as src, open('mushroom_train.csv', 'w') as dst:
            dst.write(str(num_train_entries) + ',' + str(num_train_features)
                      + ',' + src.read())
        with open('test_temp.csv') as src, open('mushroom_test.csv', 'w') as dst:
            dst.write(str(num_test_entries) + ',' + str(num_test_features)
                      + ',' + src.read())

        # Remove the temporary files, they are no longer needed
        os.remove('train_temp.csv')
        os.remove('test_temp.csv')


    # Test data in the form TensorFlow expects
    def get_test_inputs():
        x = tf.constant(test_set.data)
        y = tf.constant(test_set.target)
        return x, y


    # Training data in the form TensorFlow expects
    def get_train_inputs():
        x = tf.constant(training_set.data)
        y = tf.constant(training_set.target)
        return x, y


    # Feature vectors of two mushrooms to classify
    # (expected result: poisonous, edible)
    def new_samples():
        return np.array(
            [[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
             [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]],
            dtype=np.int)


    if __name__ == "__main__":
        MUSHROOM_DATA_FILE = "agaricus-lepiota.data"

        # Prepare the mushroom data for TensorFlow
        # by creating two CSV files (training and test)
        prepare_data(MUSHROOM_DATA_FILE)

        # Load the prepared data
        training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
            filename='mushroom_train.csv',
            target_dtype=np.int,
            features_dtype=np.int,
            target_column=0)

        test_set = tf.contrib.learn.datasets.base.load_csv_with_header(
            filename='mushroom_test.csv',
            target_dtype=np.int,
            features_dtype=np.int,
            target_column=0)

        # Declare that all features are numeric (98 of them in total)
        feature_columns = [tf.contrib.layers.real_valued_column("", dimension=98)]

        # Build a three-layer DNN with 10, 20 and 10 neurons per layer
        classifier = tf.contrib.learn.DNNClassifier(
            feature_columns=feature_columns,
            hidden_units=[10, 20, 10],
            n_classes=2,
            model_dir="/tmp/mushroom_model")

        # Train the network
        classifier.fit(input_fn=get_train_inputs, steps=2000)

        # Evaluate the accuracy on the test set
        accuracy_score = classifier.evaluate(input_fn=get_test_inputs,
                                             steps=1)["accuracy"]
        print("\nAccuracy: {0:f}\n".format(accuracy_score))

        # Classify the two new mushroom samples
        predictions = list(classifier.predict_classes(input_fn=new_samples))
        print("Predicted classes for the new samples: {}\n".format(predictions))


Loading and preparing data from the repository



We will load the data for training and testing the neural network from the machine learning repository, which was created for exactly this purpose. The data comes as two files: agaricus-lepiota.data and agaricus-lepiota.names. The first has 8124 rows and 23 columns: the class label plus 22 parameters. Each row describes one mushroom, and each parameter is encoded as a single-letter abbreviation of its full value. The legend for all the abbreviations is in agaricus-lepiota.names.



The data from the repository must be processed to bring it into a form that TensorFlow accepts. First we import several libraries.



    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function

    import os

    import numpy as np
    import pandas as pd
    import tensorflow as tf
    from sklearn.model_selection import train_test_split


Then we create a header from the mushroom parameter names, so that it is clear which column of the data file corresponds to which parameter. The header is attached to the data when the file is read. We build it as a list whose elements are taken from the file agaricus-lepiota.names.



    header = ['class', 'cap_shape', 'cap_surface', 'cap_color', 'bruises',
              'odor', 'gill_attachment', 'gill_spacing', 'gill_size',
              'gill_color', 'stalk_shape', 'stalk_root',
              'stalk_surface_above_ring', 'stalk_surface_below_ring',
              'stalk_color_above_ring', 'stalk_color_below_ring',
              'veil_type', 'veil_color', 'ring_number', 'ring_type',
              'spore_print_color', 'population', 'habitat']
    df = pd.read_csv(data_file_name, sep=',', names=header)


Now we need to deal with missing data. In agaricus-lepiota.data a missing parameter is marked with the symbol "?". There are many methods for handling such cases, but we will simply delete every row that has at least one missing parameter.



    df.replace('?', np.nan, inplace=True)
    df.dropna(inplace=True)
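The same filtering rule can be shown without pandas. This is only a minimal sketch on made-up rows; the helper name drop_incomplete and the sample values are assumptions for illustration, not part of the original code.

```python
MISSING = "?"  # marker used for a missing value in agaricus-lepiota.data


def drop_incomplete(rows):
    """Keep only the rows that contain no missing-value marker."""
    return [row for row in rows if MISSING not in row]


# Three made-up rows; the second one has a missing value.
rows = [
    ["p", "x", "s", "e"],
    ["e", "b", "?", "c"],
    ["e", "x", "y", "c"],
]
complete = drop_incomplete(rows)
print(len(rows), "->", len(complete))
```

Dropping rows is the simplest strategy, and it costs data: every row with even one unknown parameter is lost.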


Next, we manually replace the symbolic edibility parameter with a numeric one. That is, “p” and “e” are replaced by 0 and 1.



    df['class'].replace('p', 0, inplace=True)
    df['class'].replace('e', 1, inplace=True)


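After that, the rest of the data can be converted into numbers. The pandas get_dummies function does this: it replaces each categorical column with a set of 0/1 indicator columns, one for each value that occurs in the column.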



    cols_to_transform = header[1:]
    df = pd.get_dummies(df, columns=cols_to_transform)
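To make the conversion concrete, here is a simplified pure-Python stand-in for what get_dummies produces. It is a sketch under assumptions: the function one_hot and the sample rows are invented for illustration, and it only mirrors the column_value naming pattern, not the full behavior of pandas.

```python
def one_hot(rows, columns):
    """Expand categorical columns into 0/1 indicator columns.

    rows    -- list of dicts mapping a column name to a category value
    columns -- names of the columns to expand
    """
    # Collect the distinct values of every column, sorted for a stable order.
    categories = {c: sorted({row[c] for row in rows}) for c in columns}
    names = ["{}_{}".format(c, v) for c in columns for v in categories[c]]
    encoded = [
        [1 if row[c] == v else 0 for c in columns for v in categories[c]]
        for row in rows
    ]
    return names, encoded


mushrooms = [
    {"cap_shape": "x", "odor": "p"},
    {"cap_shape": "b", "odor": "a"},
    {"cap_shape": "x", "odor": "a"},
]
names, encoded = one_hot(mushrooms, ["cap_shape", "odor"])
print(names)
for row in encoded:
    print(row)
```

Each original column contributes as many indicator columns as it has distinct values, and exactly one of them is 1 in every row.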


Any neural network needs to be trained. In addition, it needs to be checked on data it has not seen, to estimate how accurately it will work in real conditions. To do this, we divide our data set into two parts: a training set and a test set. The first is larger than the second, as it should be.



 df_train, df_test = train_test_split(df, test_size=0.1) 
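For intuition, a shuffle-and-cut split can be sketched in a few lines of plain Python. This is not the scikit-learn implementation: the name split_rows, the fixed seed, and the rounding rule are assumptions made for this illustration.

```python
import random


def split_rows(rows, test_size=0.1, seed=0):
    """Shuffle the rows and cut off a test_size share as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_rows(range(50), test_size=0.1)
print(len(train), len(test))
```

Shuffling before cutting matters: if the file happens to be ordered in some way, taking the last 10% without shuffling would give a test set that does not resemble the training set.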


One last thing. The TensorFlow loader used below expects each data file to begin with the number of rows and the number of feature columns. We extract this information from our training and test sets and then write it at the start of the resulting CSV files.



    # Count the rows and feature columns in each set
    num_train_entries = df_train.shape[0]
    num_train_features = df_train.shape[1] - 1
    num_test_entries = df_test.shape[0]
    num_test_features = df_test.shape[1] - 1

    # Save to temporary CSV files
    df_train.to_csv('train_temp.csv', index=False)
    df_test.to_csv('test_temp.csv', index=False)

    # Write the final CSV files, with the counts at the very beginning
    with open('train_temp.csv') as src, open('mushroom_train.csv', 'w') as dst:
        dst.write(str(num_train_entries) + ',' + str(num_train_features)
                  + ',' + src.read())
    with open('test_temp.csv') as src, open('mushroom_test.csv', 'w') as dst:
        dst.write(str(num_test_entries) + ',' + str(num_test_features)
                  + ',' + src.read())
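The effect of this step on the file is easy to miss, so here is a small sketch of it on an in-memory CSV. The helper names with_counts and read_counts and the tiny sample table are assumptions for illustration only.

```python
import csv
import io


def with_counts(csv_text, n_rows, n_features):
    """Put the row and feature counts in front of the CSV text."""
    return "{},{},{}".format(n_rows, n_features, csv_text)


def read_counts(text):
    """Read the two counts back from the first line."""
    first_row = next(csv.reader(io.StringIO(text)))
    return int(first_row[0]), int(first_row[1])


# A made-up table: one class column and two indicator columns.
plain = "class,odor_a,odor_p\n1,1,0\n0,0,1\n"
prefixed = with_counts(plain, n_rows=2, n_features=2)
print(prefixed.splitlines()[0])
print(read_counts(prefixed))
```

Because the counts are glued to the existing first line, they share that line with the column names, and the data rows below it stay unchanged.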


As a result, you should get two files: mushroom_train.csv for training and mushroom_test.csv for testing.



Feeding the generated data into TensorFlow



Now that we have downloaded the mushroom data from the repository and processed it into CSV files, we can pass the files to TensorFlow for training. This is done with the load_csv_with_header() function provided by the framework itself:



    training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
        filename='mushroom_train.csv',
        target_dtype=np.int,
        features_dtype=np.int,
        target_column=0)

    test_set = tf.contrib.learn.datasets.base.load_csv_with_header(
        filename='mushroom_test.csv',
        target_dtype=np.int,
        features_dtype=np.int,
        target_column=0)


The load_csv_with_header() function builds a data set from the files we assembled above. Besides the data file name, it takes target_dtype, which is the type of the value being predicted. In our case the network has to learn to predict whether a mushroom is edible or poisonous, which is expressed as 1 or 0, so target_dtype is an integer type. features_dtype is the type of the input features. In our case this is also an integer (the features were originally strings, but, as you remember, we converted them to numbers). Finally, target_column is the index of the column that the network has to predict, that is, the edibility column.
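The roles of these arguments can be illustrated with a simplified pure-Python stand-in for such a loader. This is not TensorFlow's code: the function load_with_header, its validation check, and the sample text are assumptions made for this sketch, and it always parses values as int instead of taking dtype arguments.

```python
import csv
import io


def load_with_header(text, target_column=0):
    """Split count-prefixed CSV text into feature rows and target values."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    n_samples, n_features = int(header[0]), int(header[1])

    data, target = [], []
    for row in reader:
        values = [int(v) for v in row]
        target.append(values.pop(target_column))  # the value to predict
        data.append(values)                       # everything else is input

    if len(data) != n_samples or any(len(r) != n_features for r in data):
        raise ValueError("counts in the header do not match the data")
    return data, target


text = "2,2,class,odor_a,odor_p\n1,1,0\n0,0,1\n"
data, target = load_with_header(text)
print(data)
print(target)
```

The returned pair corresponds to the data and target attributes that the article's code reads from training_set and test_set.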



Creating the TensorFlow classifier object



That is, an object of the class that directly produces the predictions. In other words, the neural network itself.



    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=98)]

    classifier = tf.contrib.learn.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=[10, 20, 10],
        n_classes=2,
        model_dir="/tmp/mushroom_model")


The first parameter is feature_columns: the mushroom parameters. Note that its value is created on the line just above. The dimension argument there is 98, which is the number of 0/1 feature columns that get_dummies produced from the mushroom parameters; edibility is not counted.
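Where a number like 98 comes from can be sketched as follows: each categorical column contributes one indicator column per distinct value that actually occurs in the data. The function name one_hot_width and the sample rows are assumptions for illustration.

```python
def one_hot_width(rows):
    """Total number of indicator columns after one-hot encoding."""
    n_columns = len(rows[0])
    return sum(len({row[i] for row in rows}) for i in range(n_columns))


# Three made-up mushrooms with three categorical parameters each.
rows = [
    ("x", "s", "n"),
    ("b", "s", "y"),
    ("x", "y", "w"),
]
print(one_hot_width(rows))  # 2 + 2 + 3 distinct values
```

Since only values present in the data are counted, the width depends on which rows survive the cleaning step, so the dimension passed to the classifier has to match the actual number of feature columns in the prepared CSV files.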



hidden_units is the number of neurons in each hidden layer of the neural network. Choosing the right number of layers and neurons is something of an art in machine learning, and only experience teaches how to pick these values well. We took these numbers simply because they are listed in one of the TensorFlow tutorials. And they work.
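One rough way to compare layer choices is to count the trainable parameters of a fully connected network. The sketch below assumes plain dense layers with one bias per neuron and one output unit per class; the output layer the library actually builds may differ, and the function name count_parameters is invented for this example.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# 98 inputs, hidden layers of 10, 20 and 10 neurons, 2 assumed outputs.
sizes = [98, 10, 20, 10, 2]
print(count_parameters(sizes))
```

Most of the parameters sit in the first layer, because it connects all 98 inputs to the first 10 neurons.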



n_classes is the number of classes to predict. We have two of them: edible and not edible.



model_dir is the path where the trained neural network model is saved. Later it is used for predictions, so that the network does not have to be trained every time.



Training



For convenience later on, we create two functions:



    def get_test_inputs():
        x = tf.constant(test_set.data)
        y = tf.constant(test_set.target)
        return x, y


    def get_train_inputs():
        x = tf.constant(training_set.data)
        y = tf.constant(training_set.target)
        return x, y


Each function supplies its own set of input data: one for training, one for testing. x and y are TensorFlow constants that the framework needs in order to work. Without going into details, just accept that these functions act as intermediaries between the data and the neural network.



We train the network:



 classifier.fit(input_fn=get_train_inputs, steps=2000) 


The first parameter takes the input function defined just above; the second is the number of training steps. Again, this figure was used in one of the TensorFlow tutorials, and a feel for this setting comes with experience.
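To get a feel for what the number of steps does, here is a toy sketch in which one step is one gradient-descent update. It trains a plain logistic regression, a much simpler model than the network in the article, and it is not what DNNClassifier does internally; the functions train and log_loss, the learning rate, and the four sample rows are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(features, labels, steps, learning_rate=0.5):
    """Run `steps` full-batch gradient-descent updates."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the pre-activation
            for j, xi in enumerate(x):
                grad_w[j] += err * xi / n
            grad_b += err / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def log_loss(features, labels, weights, bias):
    """Average log loss of the model on the given data."""
    total = 0.0
    for x, y in zip(features, labels):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


# Four made-up one-hot rows: the first indicator means "poisonous" (0).
features = [[1, 0], [1, 0], [0, 1], [0, 1]]
labels = [0, 0, 1, 1]

for steps in (0, 20, 200):
    weights, bias = train(features, labels, steps)
    print(steps, log_loss(features, labels, weights, bias))
```

On this toy data the loss keeps shrinking as the step count grows. On real data more steps eventually stop helping, which is why the value is usually tuned by experiment.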



Next, we evaluate the trained network. This is done with the test data set created above. The result is the accuracy (accuracy_score), which estimates how accurate the network's future predictions will be.



    accuracy_score = classifier.evaluate(input_fn=get_test_inputs,
                                         steps=1)["accuracy"]
    print("\nAccuracy: {0:f}\n".format(accuracy_score))
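Accuracy here means the share of test mushrooms whose predicted class matches the true one. A minimal sketch of that calculation, with an invented helper name and made-up labels:

```python
def accuracy(predicted, actual):
    """Share of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("both lists must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


predicted = [0, 1, 1, 0]
actual = [0, 1, 0, 0]
print(accuracy(predicted, actual))  # 3 of 4 match
```

An accuracy of 1.0 on the test set means every test mushroom was classified correctly; it does not guarantee the same result on mushrooms unlike those in the data set.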


Let's test it in action



Now the neural network is ready, and you can try to predict the edibility of a mushroom with it.



 def new_samples(): return np.array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]], dtype=np.int) 


The function above returns the data of two completely new mushrooms that were present in neither the training set nor the test set (in fact, they were simply pulled out of the latter). Imagine, for example, that you bought them at the market and are trying to figure out whether you can eat them. The code below determines this:



    predictions = list(classifier.predict_classes(input_fn=new_samples))
    print("Predicted classes for the new samples: {}\n".format(predictions))


The result should be the following:



    Predicted classes for the new samples: [0, 1]


This means that the first mushroom is poisonous and the second is perfectly edible. In this way you can make predictions from any data, be it mushrooms, people, animals, or anything else. It is enough to form the input data correctly. You could predict, for example, the probability of a future arrhythmia in a patient, or the direction in which stock prices will move on the exchange.
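Reading the output relies on the encoding chosen during data preparation, where 0 stands for poisonous and 1 for edible. A small sketch that turns the numeric classes back into words; the dictionary and function names are assumptions for illustration.

```python
# The same encoding that was applied to the class column earlier.
CLASS_NAMES = {0: "poisonous", 1: "edible"}


def describe(predictions):
    """Translate predicted class numbers into readable labels."""
    return [CLASS_NAMES[p] for p in predictions]


print(describe([0, 1]))
```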

Source: https://habr.com/ru/post/419917/


