
Python typing

These days everyone seems to be talking (writing, thinking) about machine learning, neural networks, and artificial intelligence in general. Just last year, ML was compared to teenage sex: everyone wants it, but no one is doing it. Today, everyone worries that AI will leave us without work. Although, judging by recent Gartner research, you can relax: by 2020 AI will create more jobs than it eliminates. So, dear friend, learn ML, and you will be happy.





Note: we continue the series of full-length articles from Hacker magazine. The author's spelling and punctuation have been preserved.



**ML models in Python on Azure Functions**


In this article we want to show ML on a practical case: a project we did for Aktion-press (an online subscription service). I am sure this example will be useful to many. Why many? Because the problem we solved was "sorting a huge number of emails and routing them to the right address". The problem of gigantic correspondence that managers have to sort and forward to the appropriate departments is almost universal, and it should be solved with modern methods.



So, after consulting with the customer, we decided to develop a machine learning model to automate the sorting of letters as much as possible.



Machine learning model



It will hardly surprise you that we chose Python as the language for this solution. It is the historically established choice, it is high-level, and most importantly it has many useful libraries for machine learning. I describe them below.



Honestly, there is not much to say about the ML itself in this case. A set of simple binary classifiers based on logistic regression showed promising results and let us step back somewhat from the model itself and focus on data preparation and on building the text embedding. The repository has since served as the basis for three other independent projects, performed well in several classification experiments, and proved to be a reliable foundation for getting into development quickly. So the purpose of this section is not to demonstrate know-how; it lays the groundwork for the next section, which is dedicated to operationalization.



Here I will share my experience and give a few recommendations so that you can experiment with this code yourself or reuse it.



To preserve confidentiality, the original dataset has been replaced with a similar publicly available set for categorizing McDonald's feedback. See the file data/data.csv.



The data itself came in CSV files with three columns: Id, Text and Class. Since NLTK has no built-in support for reading data from CSV files, we wrote our own module that reads the files in a folder either as a single pandas dataframe or as text in NLTK format, that is, as lists of paragraphs, sentences, words, and so on.
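To make the idea of such a reader concrete, here is a minimal sketch using only the standard library. It is not the project's CsvCorpusReader: the function name `read_csv_corpus`, the simple regex tokenizer, and the two sample rows (including the class label "Other") are assumptions made for illustration.

```python
import csv
import io
import re


def read_csv_corpus(files, text_column="Text"):
    """Yield (row_id, words) pairs from CSV files with Id, Text and Class columns."""
    for handle in files:
        for row in csv.DictReader(handle):
            # Split the text into lowercase alphabetic words.
            words = re.findall(r"[a-z]+", row[text_column].lower())
            yield row["Id"], words


# Hypothetical two-row sample standing in for data/data.csv.
sample = io.StringIO(
    "Id,Text,Class\n"
    "1,The service was very slow,SlowService\n"
    "2,Great fries!,Other\n"
)
docs = list(read_csv_corpus([sample]))
```

The real module goes further and can also return paragraph and sentence structure or a pandas dataframe; this sketch only covers the flat "list of words per document" case.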



And here is the code that initializes this home-grown CsvCorpusReader with the client data. The implementation of the class is in the file lib\corpus.py. We strongly recommend that you also look through the file Experiments\TrainingExperiment.py.



    #%% create corpus
    # CsvCorpusReader is the project's own class from lib\corpus.py
    corpus = CsvCorpusReader(r".\data", ["data.csv"],
                             encoding="utf8",
                             default_text_selector=lambda row: row["Text"])


Once initialization is complete, the words have to be extracted from the documents and normalized. After a series of experiments, we settled on a set of helper functions that wrap the whole process and hide the calls to the NLTK and Gensim libraries behind an easy-to-use configuration layer.



Below, we tell the extractor to return each document as a flat list of words, discarding the paragraph and sentence structure (see keep_levels=Levels.Nothing). Then we convert each word to lower case, discard the stop words, and stem what remains. At the final stage, we remove low-frequency words, assuming that they are either typos or have no significant effect on the classification.



Please note that the code below focuses exclusively on data samples in English, while the original version implemented Russian lemmatization using PyMorphy2, which made it possible to achieve a more accurate classification for the Russian language.



    #%% tokenize the text
    from collections import Counter

    # generate_processor and Levels are helpers from the project's own modules
    stop_words = ['would', 'like', 'mcdonald']
    text_processor = generate_processor(keep_alpha_only=True,
                                        to_lower=True,
                                        stopwords_langs=['english'],
                                        add_stopwords=stop_words,
                                        stemmer_langs=['english'])
    docs_factory = lambda: corpus.words(keep_levels=Levels.Nothing, **text_processor)

    # count every word across the corpus, then drop the rare ones
    word_frequencies = Counter((word for doc in docs_factory() for word in doc))
    min_word_freq = 3
    docs = [[word for word in doc if word_frequencies[word] >= min_word_freq]
            for doc in docs_factory()]
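The low-frequency filter at the end of the block above is easy to check on a toy corpus. The sketch below isolates just that step; the function name `drop_rare_words` and the stemmed sample words are made up for illustration.

```python
from collections import Counter


def drop_rare_words(docs, min_word_freq):
    """Remove words whose corpus-wide count is below min_word_freq."""
    frequencies = Counter(word for doc in docs for word in doc)
    return [[w for w in doc if frequencies[w] >= min_word_freq] for doc in docs]


# Three tiny "documents" of already stemmed words (hypothetical data).
toy_docs = [["slow", "servic", "slow"], ["slow", "fri"], ["servic", "cold"]]
filtered = drop_rare_words(toy_docs, min_word_freq=2)
```

Here "fri" and "cold" each occur once in the whole corpus, so they are dropped, while "slow" and "servic" survive. Note that the count is taken over the corpus, not per document.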


Once the corpus is tokenized, the next step is to build the embedding. The code below converts each document into a series of numbers that a classifier can work with.



We tested several approaches (including BoW, TF-IDF, LSI, RP, and w2v), but in our case the classic LSI model with 500 extracted topics gave the best results (AUC = 0.98). First, the code checks whether a serialized model already exists in the shared folder. If there is none, the code trains a new model on the prepared data and saves the result to disk. If a model is found, it is simply loaded into memory. The code then transforms the dataset and repeats the process for the next embedding.



In terms of classification quality, the LSI model outperformed the much more powerful word2vec-based vectorization and other more complex approaches, and there are several possible reasons for this.



The most obvious one is that the types of emails we were looking for had predictable, repetitive word patterns, as in the case of auto-replies (for example, "Thank you for your letter ... I will not be in the office until ... If the question is urgent ..."). Something simple, such as TF-IDF, is therefore quite enough to process them. LSI follows the same idea, and it can be viewed as a way of adding synonym handling on top. The word2vec model, on the other hand, was trained on Wikipedia and probably introduces unnecessary noise through complex synonym structures, "blurring" the patterns in the messages and therefore reducing classification accuracy.



This approach has shown that old and fairly simple methods are still worth trying, even in the era of word2vec and recurrent neural networks.



    #%% convert to Bag of Words representation
    import os
    from gensim import corpora, models

    dictionary_path = os.path.join(preprocessing_path, 'dictionary.bin')
    if os.path.exists(dictionary_path):
        dictionary = corpora.Dictionary.load(dictionary_path)
    else:
        dictionary = corpora.Dictionary(docs)
        dictionary.save(dictionary_path)

    docs_bow = [dictionary.doc2bow(doc) for doc in docs]
    nested_partial_print(docs_bow)

    #%% convert to tf-idf representation
    tfidf_path = os.path.join(preprocessing_path, 'tfidf.bin')
    if os.path.exists(tfidf_path):
        model_tfidf = models.TfidfModel.load(tfidf_path)
    else:
        model_tfidf = models.TfidfModel(docs_bow)
        model_tfidf.save(tfidf_path)

    docs_tfidf = nested_to_list(model_tfidf[docs_bow])

    #%% train and convert to LSI representation
    lsi_path = os.path.join(preprocessing_path, 'lsi.bin')
    lsi_num_topics = 500
    if os.path.exists(lsi_path):
        model_lsi = models.LsiModel.load(lsi_path)
    else:
        model_lsi = models.LsiModel(docs_tfidf, id2word=dictionary, num_topics=lsi_num_topics)
        model_lsi.save(lsi_path)

    docs_lsi = model_lsi[docs_tfidf]
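The same "load if present, otherwise train and save" pattern appears three times above. As a rough sketch, it could be factored into one helper. The helper name `load_or_build` is hypothetical, and pickle is used here only as a stand-in for Gensim's own save and load methods, so that the example runs without any third-party packages.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Return the object cached at `path`, building and saving it on first use."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    obj = build()
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return obj


calls = []


def build_vocabulary():
    calls.append(1)  # record that the expensive step actually ran
    return {"slow": 0, "servic": 1}


cache_path = os.path.join(tempfile.mkdtemp(), "dictionary.bin")
first = load_or_build(cache_path, build_vocabulary)   # builds and saves
second = load_or_build(cache_path, build_vocabulary)  # loads from disk
```

The scoring script relies on exactly this property: because the transformations were saved during training, it can load them and never has to retrain.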


As always, some obligatory routine code is unavoidable. It will come in handy when we prepare the data for machine learning with scikit-learn.



As I said above, we use several binary classifiers instead of one multi-class classifier. That is why we create a binary target for one of the classes (in this sample, SlowService). You can change the value of the class_to_find variable and re-execute the code below to train each classifier separately. The scoring script is designed to work with several models and loads them automatically from the selected folder. Finally, the training and test sets are formed; rows with gaps are excluded entirely.



    #%% create target
    import pandas as pd
    from sklearn.model_selection import train_test_split

    class_to_find = "SlowService"
    df["Target"] = df.apply(lambda row: 1 if class_to_find in row["Class"] else 0, axis=1)
    df.groupby(by=["Target"]).count()

    #%% create features and targets dataset
    # docs_features: one dense row of feature values per document (its construction is not shown in this excerpt)
    features = pd.DataFrame(docs_features,
                            columns=["F" + str(i) for i in range(lsi_num_topics)])
    notnul_idx = features.notnull().all(axis=1)  # keep only rows without gaps
    features = features[notnul_idx]
    df_notnull = df[notnul_idx]
    target = df_notnull[["Target"]]
    plot_classes_scatter(features.values, target["Target"].values)

    #%% split dataset to train and test
    train_idx, test_idx = train_test_split(df_notnull.index.values, test_size=0.3, random_state=56)
    df_train = df_notnull.loc[train_idx]
    features_train = features.loc[train_idx]
    target_train = target.loc[train_idx]
    df_test = df_notnull.loc[test_idx]
    features_test = features.loc[test_idx]
    target_test = target.loc[test_idx]


Now we move on to training the classifier (in our case, logistic regression) and then save the model in the same shared folder that we used earlier for the embedding transformations.



As you can see in the code below, we follow a special format for the model file name: class_{0}_thresh_{1}.bin. It lets the scoring step recover the class name and the corresponding threshold value later on.



And one final note before we continue. As a development tool, I chose Visual Studio Code. It is an easy-to-use, lightweight editor that even provides basic IntelliSense features (code completion and hints) for a dynamic language like Python. In addition, the Jupyter and Python extensions combined with the IPython kernel let you execute code cell by cell and visualize the result without re-running the whole script, which is always convenient for ML tasks. Yes, it looks like standard Jupyter, but with IntelliSense and a focus on code and git. I recommend trying it, at least while you are working with the sample, because many other VS Code features are used here to make development productive.



As for the code below, the cell that plots the ROC threshold values is an example of using the Jupyter extension. Click the Run cell button above the cell to see the TP and FP rates plotted against the threshold in the results panel on the right. We used this chart heavily during our work: because of a pronounced class imbalance in the dataset, the optimal cutoff was always around 0.04 instead of the usual 0.5. If you cannot use VS Code for testing, you can simply run the script with standard Python tools and, after viewing the results in a separate window, change the threshold that ends up in the file name.



    #%% train logistic classifier
    import joblib
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve

    classifier = LogisticRegression()
    classifier.fit(features_train, target_train["Target"])

    #%% score on test
    scores_test = classifier.predict_proba(features_test)[:, 1]
    # roc_curve returns false positive rates, true positive rates and thresholds
    fp_test, tp_test, tsh = roc_curve(target_test["Target"], scores_test)

    #%% plot ROC threshold values
    pd.DataFrame(nested_to_list(zip(tsh, tp_test, fp_test, fp_test-tp_test)),
                 columns=['Threshold', 'True Positive Rate', 'False Positive Rate', 'Difference']
                 ).plot(x='Threshold')
    plt.xlim(0, 1)
    plt.ylim([0, 1])
    plt.grid()
    plt.show()

    #%% save model
    threshold = 0.25
    model_filename = 'class_{0}_thresh_{1}.bin'.format(class_to_find, threshold)
    joblib.dump(classifier, os.path.join(model_path, model_filename))
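To illustrate why an imbalanced dataset pushes the cutoff far below 0.5, here is a small sketch that picks the threshold with the largest gap between the true positive rate and the false positive rate. This criterion (known as Youden's J) is an assumption chosen for the example; the article itself only describes reading the cutoff off the chart. The scores and labels are made up.

```python
def rates_at(threshold, scores, labels):
    """Return (true positive rate, false positive rate) for score > threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def best_threshold(scores, labels, candidates):
    """Pick the candidate with the largest TPR - FPR gap."""
    def gap(threshold):
        tpr, fpr = rates_at(threshold, scores, labels)
        return tpr - fpr
    return max(candidates, key=gap)


# Made-up scores for an imbalanced sample: 2 positives, 8 negatives.
scores = [0.30, 0.06, 0.05, 0.03, 0.03, 0.02, 0.02, 0.01, 0.01, 0.01]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
chosen = best_threshold(scores, labels, [0.5, 0.25, 0.04, 0.01])
```

With these numbers, a cutoff of 0.5 flags nothing at all, while 0.04 catches both positives at the cost of one false alarm. Which trade-off is acceptable depends on what a misrouted email costs, so treat the criterion as a starting point rather than a rule.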


Now it is time for the scoring script, Score\run.py. There is very little new in it: most of the code is taken from the training experiment discussed above. Check out the contents of this file in the GitHub repository.



The input is a CSV file to be scored. The output is two files: one contains the predicted classes, the other contains the identifiers of the rows that could not be scored. I will explain the reason for working through files later, when we talk about operationalization.



To close this section, I want to explain why we use several binary classifiers instead of one multi-class classifier. First, it was much easier to get started, and to work on and tune performance for each class separately. This approach also lets you use different mathematical models for different classes, as with auto-replies, which often have a fairly rigid structure and can be handled with a simple bag of words. And from an IT professional's point of view, something like the code below can simplify deployment: you can plug in new models or change existing ones without affecting the others.



    # load every saved classifier and score the batch with it
    model_dir = os.path.join('..', 'model')
    model_paths = [path for path in os.listdir(model_dir) if path.startswith('class_')]
    for model_path in model_paths:
        model = joblib.load(os.path.join(model_dir, model_path))
        res = model.predict_proba(features_notnull)[:, 1]
        # file name format: class_{name}_thresh_{threshold}.bin
        class_name = model_path.split('_')[1]
        threshold = float(model_path.rsplit('.', 1)[0].split('_')[-1])
        result.loc[:, "class_" + class_name] = res > threshold
        result.loc[:, "class_" + class_name + "_score"] = res
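The file naming convention and the parsing done in the loop above can be checked as a round trip. The two helper names below are hypothetical; the format string and the parsing expressions are the ones used in the article's code.

```python
def model_filename(class_name, threshold):
    """Build a file name in the class_{0}_thresh_{1}.bin format."""
    return "class_{0}_thresh_{1}.bin".format(class_name, threshold)


def parse_model_filename(filename):
    """Recover (class name, threshold) from a model file name."""
    stem = filename.rsplit(".", 1)[0]  # drop the ".bin" extension only
    parts = stem.split("_")
    return parts[1], float(parts[-1])


name = model_filename("SlowService", 0.25)
parsed = parse_model_filename(name)
```

One limitation follows directly from this scheme: a class name that itself contains an underscore would be cut short by `split("_")`, so class names need to stay underscore-free.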


You can even try the code right now on your own data from your local PC, with no operationalization at all.





In VS Code, you can even open the Debug section (Ctrl + Shift + D), select Score (Python) as the configuration, and click Start Debugging to step through the code line by line in the editor. When the algorithms have finished, the results can be found in the files input.scores.csv and input.unscorable.csv in the Debug folder.



Operationalization



Python support in Azure Functions is still in early preview, so using it for mission-critical tasks is not advisable. But ML workloads are often not mission-critical, so the convenience of implementation may outweigh the difficulty of adapting to a preview version.



So, at this stage we had two scripts. The Experiments\TrainingExperiment.py script trains the model and stores the fitted transformations and the trained model in a shared directory; it is assumed that this training script is re-run on a local machine as needed. The Score\run.py script runs daily and sorts new emails as they arrive.



In this section, we will talk about operationalizing the process with Azure Functions. Functions are easy to use, let you bind a script to many different triggers (HTTP, queues, BLOB storage objects, WebHooks, and so on), provide several automatic output bindings, and are inexpensive: on the Consumption plan you pay only 0.000016 dollars per gigabyte-second of RAM used. But there are limitations: a function cannot run for more than ten minutes or use more than 1.5 GB of RAM. If this does not suit you, you can always switch to a dedicated plan based on App Service while keeping the other benefits of the serverless approach. For our simple logistic regression and batches of several hundred emails, however, the Consumption plan turned out to be optimal.
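The pricing arithmetic above can be worked through in a few lines. This is a back-of-the-envelope sketch only: it uses the price and the limits quoted in the text, the function name `execution_cost` and the sample workload (1 GB for two minutes a day) are assumptions, and it ignores any per-execution charge or free grant that the actual price list may include.

```python
PRICE_PER_GB_SECOND = 0.000016  # dollars, the figure quoted above
MAX_SECONDS = 10 * 60           # ten-minute execution limit
MAX_MEMORY_GB = 1.5             # memory limit


def execution_cost(memory_gb, seconds):
    """Estimate the cost of one run in dollars, enforcing the quoted limits."""
    if seconds > MAX_SECONDS or memory_gb > MAX_MEMORY_GB:
        raise ValueError("exceeds the Consumption plan limits")
    return memory_gb * seconds * PRICE_PER_GB_SECOND


# Hypothetical workload: one daily batch using 1 GB for two minutes.
daily = execution_cost(memory_gb=1.0, seconds=120)
monthly = daily * 30
```

For this hypothetical workload the daily run costs a fraction of a cent, which is why a once-a-day batch job fits the plan so well.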



From the programmer's point of view, a function is a folder that bears the name of the function itself (in our case, simply Score) and contains two files: function.json, which describes the bindings, and run.py, which holds the script.





You can create function.json by hand or configure it through the Azure Portal. The file we ended up with is shown below. The first binding, inputcsv, runs the script each time a file whose name matches mail-classify/input/{input_file_name}.csv appears in the default Azure BLOB storage. The other two bindings save the output files after the function has executed successfully. Here we save them to a separate output folder, with names made of the input file name plus the suffix scored or unscorable. So you can drop a file with any identifying name, for example a GUID, into the input folder, and after a while two new files named after that GUID will appear in the output folder.



    {
      "bindings": [
        {
          "name": "inputcsv",
          "type": "blobTrigger",
          "path": "mail-classify/input/{input_file_name}.csv",
          "connection": "apmlstor",
          "direction": "in"
        },
        {
          "name": "scoredcsv",
          "type": "blob",
          "path": "mail-classify/output/{input_file_name}.scored.csv",
          "connection": "apmlstor",
          "direction": "out"
        },
        {
          "name": "unscorablecsv",
          "type": "blob",
          "path": "mail-classify/output/{input_file_name}.unscorable.csv",
          "connection": "apmlstor",
          "direction": "out"
        }
      ],
      "disabled": false
    }
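The way the {input_file_name} placeholder links the input path to the two output paths can be imitated with a regular expression. This is only an approximation written for illustration: the function name `output_paths` and the sample blob name are made up, and the real runtime's matching rules may differ in detail.

```python
import re

# Rough equivalent of "mail-classify/input/{input_file_name}.csv".
INPUT_PATTERN = re.compile(r"^mail-classify/input/(?P<input_file_name>.+)\.csv$")


def output_paths(blob_path):
    """Return the (scored, unscorable) output paths for a matching input blob."""
    match = INPUT_PATTERN.match(blob_path)
    if match is None:
        return None  # a blob outside the input folder would not match the trigger path
    name = match.group("input_file_name")
    return (
        "mail-classify/output/{0}.scored.csv".format(name),
        "mail-classify/output/{0}.unscorable.csv".format(name),
    )


paths = output_paths("mail-classify/input/batch-001.csv")
```

Writing the results to a separate output folder also keeps them from matching the input path, so the function does not trigger itself.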


The run.py script for Azure Functions is almost the same as our original "non-operationalized" version. The only change concerns how functions pass incoming and outgoing data. Regardless of the type of input and output (HTTP request, queue message, BLOB file ...), the content is stored in a temporary file, and the path to that file is written to an environment variable named after the corresponding binding. In our case, for example, each time the function executes, a file named "...\Binding[GUID]\inputcsv" is created and its path is stored in the inputcsv environment variable. The same happens for each output file. With this logic in mind, we made a few small changes to the script.



    # read file
    input_path = os.environ['inputcsv']
    input_dir = os.path.dirname(input_path)
    input_name = os.path.basename(input_path)
    corpus = CsvCorpusReader(input_dir, [input_name], encoding="utf8",
                             default_text_selector=lambda row: row["Text"])
    [...]
    # write unscorables
    unscorable_path = os.environ['unscorablecsv']
    ids_null.to_csv(unscorable_path, index=False)  # pandas DataFrame
    [...]
    # write scored emails
    output_path = os.environ['scoredcsv']
    result.to_csv(output_path)  # pandas DataFrame


These are all the changes needed for the service to start whenever a CSV file appears in the BLOB storage and to produce the files containing the predictions.
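The binding mechanism described above is easy to emulate locally. The sketch below sets the three environment variables to temporary file paths, as the runtime is described as doing, and runs a stand-in for the script. The "scoring" here is a placeholder assumption: a row simply counts as unscorable when its text is empty.

```python
import csv
import os
import tempfile

work_dir = tempfile.mkdtemp()
# Emulate the runtime: each binding name becomes an environment variable
# that holds the path of a temporary file.
for binding in ("inputcsv", "scoredcsv", "unscorablecsv"):
    os.environ[binding] = os.path.join(work_dir, binding)

# Hypothetical input batch.
with open(os.environ["inputcsv"], "w", newline="", encoding="utf8") as handle:
    csv.writer(handle).writerows([["Id", "Text"], ["1", "Slow service"], ["2", ""]])


def run():
    """Read the input binding and write one Id list per output binding."""
    with open(os.environ["inputcsv"], newline="", encoding="utf8") as handle:
        rows = list(csv.DictReader(handle))
    scorable = [row["Id"] for row in rows if row["Text"].strip()]
    unscorable = [row["Id"] for row in rows if not row["Text"].strip()]
    for binding, ids in (("scoredcsv", scorable), ("unscorablecsv", unscorable)):
        with open(os.environ[binding], "w", newline="", encoding="utf8") as handle:
            csv.writer(handle).writerows([["Id"]] + [[row_id] for row_id in ids])
    return scorable, unscorable


result = run()
```

Because the script only ever sees file paths, the same code runs unchanged on a developer machine and inside the function host.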



To be honest, we tested other triggers too, but found that Python's most powerful feature, modules, becomes its curse in a serverless system. A module in Python is not a static library that is simply linked in, as in many other languages, but code that is executed every time it is imported. For long-running solutions such as services this is almost imperceptible, but in Azure Functions the script is executed in full on every invocation, and that is quite costly each time. This makes HTTP triggers hard to use with Python, but batch processing based on CSV files, which is popular in many ML scenarios, brings the cost per data row down to a reasonable minimum.
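The point that a module is code executed at import time can be demonstrated directly. The sketch below writes a throwaway module whose body sleeps for 0.2 seconds, standing in for a heavy library's import-time work, and times two imports in the same process. The module name `heavy_dependency_demo` is made up.

```python
import importlib
import os
import sys
import tempfile
import time

# Write a throwaway module whose body does "expensive" work at import time.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "heavy_dependency_demo.py"), "w") as handle:
    handle.write("import time\ntime.sleep(0.2)\nREADY = True\n")
sys.path.insert(0, module_dir)
importlib.invalidate_caches()


def timed_import(name):
    """Import a module and return it with the elapsed time in seconds."""
    start = time.perf_counter()
    module = importlib.import_module(name)
    return module, time.perf_counter() - start


module, first = timed_import("heavy_dependency_demo")  # runs the module body
_, second = timed_import("heavy_dependency_demo")      # served from sys.modules
```

A long-lived service pays the first cost once and then enjoys the second. A script that is started from scratch for every invocation pays the first cost every time, which is why one import per batch of rows is so much cheaper than one per request.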



If you cannot do without real-time triggers in Python, you can try switching to a dedicated Azure App Service plan, which lets you significantly increase the computing resources of the host and speed up imports. In our case, the ease of implementation and the low cost of the Consumption plan outweighed the benefits of faster execution.



Before proceeding, let's see how to simplify development with Visual Studio Code. At the time of writing, the Functions CLI could generate initial Python templates but offered no debugging. The runtime environment, however, is not hard to imitate with built-in VS Code features. The .vscode\launch.json file, which holds the debugging configuration, helps here. The JSON below defines a debug configuration named Score (Python) that tells VS Code to run ${workspaceRoot}/Score/run.py with ${workspaceRoot}/Score as the working directory, and sets the environment variables the way Azure Functions would for the bindings. Open the Debug section (Ctrl + Shift + D) in VS Code, select Score (Python), and click Start Debugging.



    [...]
    {
      "name": "Score (Python)",
      "type": "python",
      "request": "launch",
      "stopOnEntry": true,
      "pythonPath": "${config:python.pythonPath}",
      "console": "integratedTerminal",
      "program": "${workspaceRoot}/Score/run.py",
      "cwd": "${workspaceRoot}/Score",
      "env": {
        "inputcsv": "${workspaceRoot}/Score/debug/input.csv",
        "scoredcsv": "${workspaceRoot}/Score/debug/output.csv",
        "unscorablecsv": "${workspaceRoot}/Score/debug/unscorable.csv"
      },
      "debugOptions": [
        "RedirectOutput",
        "WaitOnAbnormalExit"
      ]
    }
    [...]


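When the script is run cell by cell through the Jupyter extension, the debug configuration is not applied and these environment variables are not set. So the script checks whether it is running under IPython and, if so, sets the variables itself, pointing them at files in the Debug folder.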



    # when run cell by cell in IPython, emulate the Azure Functions bindings
    if "IPython" in sys.modules and 'Score' not in os.getcwd():
        os.environ['inputcsv'] = os.path.join('debug', 'input.csv')
        os.environ['scoredcsv'] = os.path.join('debug', 'input.scores.csv')
        os.environ['unscorablecsv'] = os.path.join('debug', 'input.unscorable.csv')
        os.chdir('Score')




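Now everything is ready to be deployed to Azure. Python in Azure Functions is version 2.7 by default. To use 3.6, the Python wiki for Functions suggests placing a Python distribution in D:\home\site\tools. This folder comes before the default Python 2.7 in PATH, so the python.exe found there is the one that gets used.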



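This can be done by hand through Kudu, but we wrapped it in a separate setup function. The function checks the Python version and, if it is not 3.6, downloads a zipped (.zip) Python distribution and extracts it to D:\home\site\tools.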



    import os
    import sys

    tools_path = 'D:\\home\\site\\tools'
    if not sys.version.startswith('3.6'):
        # in python 2.7
        import urllib
        print('Installing Python Version 3.6.3')
        from zipfile import ZipFile
        if not os.path.exists(tools_path):
            os.makedirs(tools_path)
            print("Created [{}]".format(tools_path))
        python_url = 'https://apmlstor.blob.core.windows.net/wheels/python361x64.zip'
        python_file = os.path.join(tools_path, 'python.zip')
        urllib.urlretrieve(python_url, python_file)
        print("Downloaded Python 3.6.3")
        python_zip = ZipFile(python_file, 'r')
        python_zip.extractall(tools_path)
        python_zip.close()
        print("Extracted Python to [{}]".format(tools_path))
        print("Please rerun this function again to install required pip packages")
        sys.exit(0)


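Next comes pip. Pip has an API that can be called from Python, so packages can be installed from the same Python script. Pure Python packages (langid, pymorphy) install this way without problems. Others need a C++ compiler to build. App Service has no Visual C++ available, so such packages have to be installed as prebuilt binaries (wheels). Not all of them can be obtained through pip in this form, so we collected the wheels for the ML packages and placed them in Azure Blob Storage, from which Azure can download them.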



    import pip

    def install_package(package_name):
        # pip.main is available only in pip versions before 10
        pip.main(['install', package_name])

    install_package('https://apmlstor.blob.core.windows.net/wheels/numpy-1.13.1%2Bmkl-cp36-cp36m-win_amd64.whl')
    install_package('https://apmlstor.blob.core.windows.net/wheels/pandas-0.20.3-cp36-cp36m-win_amd64.whl')
    install_package('https://apmlstor.blob.core.windows.net/wheels/scipy-0.19.1-cp36-cp36m-win_amd64.whl')
    install_package('https://apmlstor.blob.core.windows.net/wheels/scikit_learn-0.18.2-cp36-cp36m-win_amd64.whl')
    install_package('https://apmlstor.blob.core.windows.net/wheels/gensim-2.3.0-cp36-cp36m-win_amd64.whl')
    install_package('https://apmlstor.blob.core.windows.net/wheels/nltk-3.2.4-py2.py3-none-any.whl')
    install_package('langid')
    install_package('pymorphy2')
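Newer pip releases no longer expose pip.main, so on a current interpreter the same step is usually done by running pip as a subprocess of the running Python. The sketch below is an alternative under that assumption, not the article's code; the `dry_run` flag is a hypothetical addition that lets you inspect the command without installing anything.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build the command that installs a package with the running interpreter's pip."""
    return [sys.executable, "-m", "pip", "install", package]


def install_package(package, dry_run=False):
    """Install a package (or a wheel URL); with dry_run, only return the command."""
    command = pip_install_command(package)
    if not dry_run:
        subprocess.check_call(command)
    return command


command = install_package("langid", dry_run=True)  # only builds the command
```

Using sys.executable ensures that the package lands in the interpreter that is actually running the script, which matters on a host where two Python versions coexist.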


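One more thing: NLTK also needs its data files (here, the punkt tokenizer and the stop word lists). They are downloaded once the packages have been installed.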



    import os
    import nltk

    # keep the NLTK data next to the project's own lib folder
    nltk_path = os.path.abspath(os.path.join('..', 'lib', 'nltk_data'))
    if not os.path.exists(nltk_path):
        os.makedirs(nltk_path)
        print("INFO: Created {0}".format(nltk_path))
    nltk.download('punkt', download_dir=nltk_path)
    nltk.download('stopwords', download_dir=nltk_path)


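The setup function only has to be run when the environment is being prepared. Note that it has to be run twice: the first run installs Python 3.6, and the second installs the packages.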



Conclusion



In this article we showed how Azure Functions can be used to operationalize ML models written in Python. The sample code is available on GitHub.




Source: https://habr.com/ru/post/343670/


