Over the last couple of years you mostly hear that Python and scikit-learn are something of a gold standard in data science.
I want to tell you about an alternative for machine learning development: a library written in C++.
TMVA (Toolkit for Multivariate Data Analysis with ROOT) is an open-source library of machine learning algorithms that ships alongside the big data analysis package ROOT and is installed together with it. The installation is described in detail in the manual, so we will not cover it here.
Until recently, this was considered the main site of the project, but, as you can see, it has not been updated for quite some time. That is no reason for skepticism or panic: development is now carried out by a new team of CERN developers.
CERN (the European Organization for Nuclear Research) was a pioneer in creating software for analyzing large amounts of data. It was there that the object-oriented framework ROOT was developed, which has found applications well beyond the world of physics.
In ROOT, data is stored in a very compact *.root format, but you can also work with any text format. For simplicity, we will use plain csv/txt text files when working with TMVA.
Unfortunately, at the moment TMVA offers only supervised learning algorithms.
(Figure: the correlation matrix as TMVA draws it.)
(Figure: the feature ratios.)
(Figure: the ROC curve, which TMVA draws in a non-standard way.)
So, imagine that we already have ROOT installed and two text files: one with the "good" (signal) events and one with the events that need to be classified against them (or used to build a regression for prediction). To feed these two files as input, you need to bring the file header into the required format:
id/F:Param1/I:Param2/I:Param3/F
2,59,1,0
3,85,0,44
4,39,0,78
...
TMVA supports two data types: Float and Integer (the Reader works with Float only).
The default delimiter is a comma.
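As a side note (this is not part of the TMVA workflow itself, and the file names here are only placeholders), a comma-separated file with such a header can also be loaded into a plain ROOT TTree by hand with TTree::ReadFile, which takes the branch descriptor from the first line of the file:

#include "TFile.h"
#include "TTree.h"

// Sketch: load "data.csv" (placeholder name), whose first line is the
// branch descriptor "id/F:Param1/I:Param2/I:Param3/F", into a TTree.
void ReadCsvExample()
{
    TFile* f = new TFile("data_tree.root", "RECREATE");
    TTree* t = new TTree("data", "events read from a CSV file");

    // The branch descriptor is read from the header line of the file;
    // the third argument sets the delimiter (a comma here).
    t->ReadFile("data.csv", "", ',');

    t->Write();
    f->Close();
}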
You can view the full list of algorithms in the User Guide.
Now let's move on to the code.
#include "TMVA/Types.h" #include "TMVA/Factory.h" #include "TMVA/Tools.h" using std::cout; //For Reader std::string outputListFileName; void Model_BDT() { std::cout << std::endl; std::cout << "===> Start TMVAClassification" << std::endl; // ROOT-, : , , RO-) TFile* outFputFile = new TFile("Model.root", "RECREATE"); // ( MakeClass, weights, xml TMVA::Factory *factory = new TMVA::Factory("TMVAClassification_Model",outFputFile,"V:!Silent:Color:Transformations=I:DrawProgressBar:AnalysisType=Classification"); // TString sigFile="Signal.csv"; TString bkgFile ="Background.csv"; cout << ">>>> Adding variables phase\n"; factory->AddVariable("Param1",'I'); factory->AddVariable("Param2",'I'); factory->AddVariable("Param3",'F'); //Id factory->AddSpectator("id", 'F'); Double_t sigWeight = 1.0; // overall weight for all signal events Double_t bkgWeight = 1.0; // overall weight for all background events factory->SetInputTrees( sigFile, bkgFile, sigWeight, bkgWeight ); cout << ">>>> Cutting\n"; // Param1 Param3; - TCut preselectionCut("Param1 > 0. && Param3<350.0"); TCut mycutS = ""; // n- Background, , TCut mycutB = "id%100==0"; // factory->PrepareTrainingAndTestTree(mycutS, mycutB, "nTrain_Signal=16000:nTest_Signal=1451:nTrain_Background=800000:nTest_Background=118416:VerboseLevel=Debug"); // Boosted Decision and Regression Trees, factory->BookMethod(TMVA::Types::kBDT, "BDT", "MaxDepth=5:NTrees=2000:MinNodeSize=9%:PruneStrength=10:SeparationType=GiniIndex"); // help factory->PrintHelpMessage("BDT"); //, cout << ">>>> doing TrainAllMethods\n"; factory->TrainAllMethods(); cout << ">>>> doing TestAllMethods\n"; factory->TestAllMethods(); cout << ">>>> doing EvaluateAllMethods\n"; factory->EvaluateAllMethods(); // Save the output outFputFile->Close(); std::cout << "===> Wrote root file: " << outFputFile->GetName() << std::endl; std::cout << "===> TMVAClassification is done!" << std::endl; delete factory; }
You can run the macro from the terminal with the command "root Model_BDT.C".
After everything has finished running, you can open the ROOT browser from the ROOT prompt with the command "TBrowser b;" and admire the many nice plots.
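If you prefer to pick out individual plots without the browser, you can also open the output file from a small macro. Note that the internal directory and histogram name used below (Method_BDT/BDT/MVA_BDT_S) are an assumption and may differ between TMVA versions, so check the actual layout in the browser first:

#include "TFile.h"
#include "TH1.h"

// Sketch: open the training output and draw the BDT response for signal.
// The path "Method_BDT/BDT/MVA_BDT_S" is assumed and may vary by version.
void DrawBDTResponse()
{
    TFile* outFile = TFile::Open("Model.root");
    if (!outFile || outFile->IsZombie()) return;

    TH1* hSig = nullptr;
    outFile->GetObject("Method_BDT/BDT/MVA_BDT_S", hSig);
    if (hSig) hSig->Draw();
}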
In the next article I want to describe how to write a Reader, which lets you apply the resulting model to any other data and export the selected array at a chosen cut-off threshold.
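To give a taste of what that looks like, here is a minimal sketch of the TMVA::Reader API; the weight-file path follows from the Factory job name and method name used above, and how the input values are obtained is left to the caller:

#include "TMVA/Reader.h"

// Minimal sketch of applying the trained BDT with TMVA::Reader.
// The weight file path follows from the job name "TMVAClassification_Model"
// and the method name "BDT" used in the training macro above.
float ScoreEvent(float param1, float param2, float param3)
{
    static TMVA::Reader reader("!Color:!Silent");
    static Float_t p1, p2, p3, id;
    static bool booked = false;

    if (!booked) {
        // The Reader accepts Float_t only, even for integer inputs
        reader.AddVariable("Param1", &p1);
        reader.AddVariable("Param2", &p2);
        reader.AddVariable("Param3", &p3);
        reader.AddSpectator("id", &id);
        reader.BookMVA("BDT", "weights/TMVAClassification_Model_BDT.weights.xml");
        booked = true;
    }

    p1 = param1; p2 = param2; p3 = param3; id = 0;
    return reader.EvaluateMVA("BDT"); // larger value = more signal-like
}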
Source: https://habr.com/ru/post/306242/