
Fast Data Mining, or a C# vs Python Performance Comparison (pandas-numpy-scikit-learn)

Hello! While getting to know Apache Spark, I ran into the fact that after a fairly modest increase in the complexity of the data preparation algorithms, calculations became extremely slow. So I decided to implement something in C# and compare its performance with a solution of a similar class on the Python stack (pandas-numpy-scikit-learn); similar, because both are executed on the local machine. Data preparation in C# was done with the built-in tools (LINQ), and the linear regression was calculated with the Extreme Optimization library.

The task is “B. Predicting customer spending” from the November Sberbank Data Science Journey competition.

It should be emphasized right away that this article covers only the performance comparison of the two platforms, not the quality of the model or of the predictions.
So, first a brief description of the sequence of steps implemented in C# (the code snippets appear below):

1. Load the data from CSV. The Fast CSV Reader library was used (a hypothetical loading sketch follows this list).
2. Filter expense transactions and group them by month.
3. Add to each client the categories in which they had no operations. To avoid a lengthy nested-loop search, a Bloom filter was used; a C# implementation was found here.
4. Build the feature array via the hashing trick. Since no ready-made implementation for C# could be found, I had to write it myself, downloading and adapting a MurmurHash3 hashing implementation.
5. The regression calculation itself.
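The loading step itself is not shown in the article, so below is a minimal hypothetical sketch of what it could look like with the same LumenWorks CsvReader API that appears later in addPeriodMcc. The file path and the column order (customer_id; tr_day; mcc_code; amount) are assumptions for illustration, not taken from the original:

// Hypothetical sketch: load transactions.csv into the list used by the grouping step.
// Assumed columns: customer_id, tr_day ("day time"), mcc_code, amount.
List<Client> lClientsTrans = new List<Client>();
string fname = @"j:\hadoop\Contest\Contest\transactions.csv"; // path assumed by analogy with tr_mcc_codes.csv
using (CsvReader csv = new CsvReader(new StreamReader(fname), true))
{
    while (csv.ReadNextRecord())
    {
        Int32 cid = Convert.ToInt32(csv[0]);                // customer_id
        Int16 day = Convert.ToInt16(csv[1].Split(' ')[0]);  // day number before the space
        Int16 mcc = Convert.ToInt16(csv[2]);                // mcc_code
        double amount = Convert.ToDouble(csv[3], System.Globalization.CultureInfo.InvariantCulture);
        lClientsTrans.Add(new Client(cid, day, mcc, amount));
    }
}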

The solution in Jupyter Notebook (hereinafter JN) looks like this (I omit the library imports, since they were not included in the measured time):

%%time
# Load the transactions
transactions = pd.read_csv('.//JN//SBSJ//transactions.csv')
all_cuses = transactions.customer_id.unique()

# Load the MCC codes
mcc = pd.read_csv('.//JN//SBSJ//tr_mcc_codes.csv', sep=';')
all_mcc = mcc.mcc_code.unique()

# Keep only expense transactions
transactions = transactions[transactions.amount < 0].copy()
transactions['day'] = transactions.tr_day.apply(lambda dt: dt.split()[0]).astype(int)
transactions.day += 29 - transactions['day'].max()%30

# Group the days into months
transactions['month_num'] = (transactions.day) // 30
train_transactions = transactions[transactions.month_num < 15]

# Full grid of all customer/mcc/month combinations (training months 11-14)
grid = list(product(*[all_cuses, all_mcc, range(11, 15)]))
train_grid = pd.DataFrame(grid, columns = ['customer_id', 'mcc_code', 'month_num'])
train = pd.merge(train_grid,
                 train_transactions.groupby(['month_num', 'customer_id', 'mcc_code'])[['amount']].sum().reset_index(),
                 how='left').fillna(0)

# Add the amounts of the two previous months as features
for month_shift in range(1, 3):
    train_shift = train.copy()
    train_shift['month_num'] = train_shift['month_num'] + month_shift
    train_shift = train_shift.rename(columns={"amount" : 'amount_{0}'.format(month_shift)})
    train_shift = train_shift[['month_num', 'customer_id', 'mcc_code', 'amount_{0}'.format(month_shift)]]
    train = pd.merge(train, train_shift, on=['month_num', 'customer_id', 'mcc_code'], how='left').fillna(0)

train['year_num'] = (train.month_num) // 12

# Hashing trick
hasher = FeatureHasher(n_features=6, input_type='string')
train_sparse = \
    hasher.fit_transform(train[['year_num', 'month_num', 'customer_id', 'mcc_code']].astype(str).as_matrix())
train_sparse2 = sparse.hstack([train_sparse,
                               np.log(np.abs(train[['amount_1', 'amount_2']]) + 1).as_matrix(),])

# Convert to a dense format the model accepts
d = list(train_sparse2.toarray())

# Fit the regression
clf = LinearRegression()
clf.fit(d, np.log(np.abs(train['amount']) + 1))

# Output the results
print('Coefficients: \n', clf.coef_)
print('Intercept: \n', clf.intercept_)
print("\nRMSLE: ")
np.sqrt(mse(np.log(np.abs(train['amount']) + 1), clf.predict(d)))
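For completeness, the omitted imports can be inferred from the names used in the cell; presumably something along these lines (an assumption, since the original does not show them):

import pandas as pd
import numpy as np
from itertools import product
from scipy import sparse
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse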

Now for the C# implementation in more detail. Experiments showed that classes like DataTable are very wasteful with memory, so a simple list of Client objects was used:

[Serializable]
public class Client
{
    private Int32 name;
    private Int16 period;
    private Int16 year;
    private Int16 mcc;
    private double amount;
    private double amount1;
    private double amount2;

    // get/set accessors
    ...
}

Next, reading and grouping the data:

// Filter the expenses and group by month
List<Client> lTransGrouped = lClientsTrans.AsParallel()
    .Where(row => row.getAmount() < 0)
    .GroupBy(row => new
    {
        // shift the day numbers so the months align with the last day of the data
        month = (row.getPeriod() + 29 - Convert.ToInt16(maxNumDay) % 30) / 30,
        mcc = row.getMcc(),
        cid = row.getName()
    })
    .Select(grp => new Client(
        grp.Key.cid,
        Convert.ToInt16(grp.Key.month),
        grp.Key.mcc,
        Math.Log(Math.Abs(grp.Sum(r => r.getAmount())) + 1))).ToList();
lClientsTrans = null;
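To see what the month expression does, take a hypothetical maxNumDay of 456 (the real value depends on the data): 456 % 30 = 6, so every day number is shifted up by 29 - 6 = 23, the last day becomes 479, and integer division by 30 then numbers the months so that the final month is exactly 30 days long. This is the same alignment performed by the line transactions.day += 29 - transactions['day'].max() % 30 in the JN version.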

Then the missing operation types are added using the Bloom filter. It could have been done without one, but then either the execution time would grow (a full scan for each type) or the memory consumption would (if all types were appended outright for every client and then aggregated).

public static List<Client> addPeriodMcc(List<Client> lTransGrouped, Int16 maxNumMon)
{
    List<Client> lMcc = new List<Client>();
    string fnameMcc = @"j:\hadoop\Contest\Contest\tr_mcc_codes.csv";

    // Reader for the mcc_code file
    CsvReader csvMccReader = new CsvReader(new StreamReader(fnameMcc), true, ';');

    // Collect all MCC codes
    while (csvMccReader.ReadNextRecord())
    {
        Int16 mcc = Convert.ToInt16(csvMccReader[0]);
        lMcc.Add(new Client(0, 0, mcc, 0));
    }

    // Rows for the client/month/mcc combinations missing from the transactions
    List<Client> lNewMcc = new List<Client>();

    // Unique client IDs
    var lTransCID = lTransGrouped.AsParallel().Select(a => a.getName()).Distinct();
    Console.WriteLine("Unique CID: " + lTransCID.Count());

    // Bloom filter capacity
    int capacity = lTransGrouped.Count() * 6;

    // Create the filter
    var filter = new Filter<string>(capacity);

    // Register every existing client/month/mcc combination
    foreach (var i in lTransGrouped)
        filter.Add(i.getName().ToString() + i.getPeriod() + i.getMcc());

    // For every client, month and mcc: if the combination is absent, add it with a zero amount
    foreach (var cid in lTransCID)
        for (Int16 m = 0; m <= maxNumMon; m++)
            foreach (var mcc in lMcc)
                if (filter.Contains(cid.ToString() + m.ToString() + mcc.getMcc().ToString()) != true)
                    lNewMcc.Add(new Client(cid, m, mcc.getMcc(), 0));

    lTransCID = null;
    lMcc = null;

    Console.WriteLine("Count lNewMcc: " + lNewMcc.Count);
    Console.WriteLine("Count lTransGrouped: " + lTransGrouped.Count);

    // Merge the generated zero rows with the real ones
    List<Client> lTransFull = lNewMcc.Union(lTransGrouped).ToList();
    Console.WriteLine("Count lTransFull: " + lTransFull.Count);

    lTransGrouped = null;
    lNewMcc = null;
    return lTransFull;
}
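For readers unfamiliar with the structure: a Bloom filter is a bit array plus several hash functions, where Add sets k bits and Contains tests them, allowing false positives but never false negatives. The sketch below only illustrates the idea; it is not the Filter<string> library used in the benchmark, which sizes its bit array from the expected element count and uses stronger hash functions:

// Minimal illustrative Bloom filter (not the library implementation used in the article).
public class SimpleBloomFilter
{
    private readonly bool[] bits;
    private readonly int k;

    public SimpleBloomFilter(int capacity, int hashes = 3)
    {
        bits = new bool[capacity];
        k = hashes;
    }

    // A cheap seeded string hash; a real filter would use better functions.
    private int Index(string item, int seed)
    {
        unchecked
        {
            int h = 17 + seed * 31;
            foreach (char c in item) h = h * 31 + c;
            return (h & 0x7FFFFFFF) % bits.Length;
        }
    }

    public void Add(string item)
    {
        for (int s = 0; s < k; s++) bits[Index(item, s)] = true;
    }

    public bool Contains(string item)
    {
        for (int s = 0; s < k; s++)
            if (!bits[Index(item, s)]) return false;
        return true; // may be a false positive, never a false negative
    }
}

In the article's usage a false positive merely means that one missing client/month/mcc row is not generated, which slightly under-fills the grid; a tolerable trade-off for the memory saved.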

The stage that adds the amounts of the previous months:

public static List<Client> addAmounts(List<Client> lTransFull)
{
    List<Client> lTransFullA2;

    // Sort so that consecutive months of the same client/mcc pair are adjacent
    lTransFullA2 = lTransFull.OrderBy(a => a.getName())
        .ThenBy(a => a.getMcc())
        .ThenBy(a => a.getYear())
        .ThenBy(a => a.getPeriod()).ToList();

    int name = 0;
    int month = 0;
    int year = 0;
    int mcc = 0;
    int i = 0;

    foreach (var l in lTransFullA2)
    {
        name = l.getName();
        mcc = l.getMcc();
        year = l.getYear();
        month = l.getPeriod();

        // Amount of the previous month
        if (i > 0 && name == lTransFullA2[i - 1].getName()
                  && mcc == lTransFullA2[i - 1].getMcc()
                  && year == lTransFullA2[i - 1].getYear()
                  && month == lTransFullA2[i - 1].getPeriod() + 1)
        {
            l.setAmount1(lTransFullA2[i - 1].getAmount());
        }

        // Amount of the month before that
        if (i > 1 && name == lTransFullA2[i - 2].getName()
                  && mcc == lTransFullA2[i - 2].getMcc()
                  && year == lTransFullA2[i - 2].getYear()
                  && month == lTransFullA2[i - 2].getPeriod() + 2)
        {
            l.setAmount2(lTransFullA2[i - 2].getAmount());
        }
        i++;
    }
    return lTransFullA2;
}

Next, the hashing-trick array is filled, the data is put into a format the model understands, and the regression itself is computed:

int n_features = 6;

// Target vector (log amounts)
Extreme.Mathematics.LinearAlgebra.SparseVector<double> v = Vector.CreateSparse<double>(lTransFullA2.Count);

// Feature matrix (hash buckets + the two previous amounts); md is a class-level field also written by hashing_vectorizer
md = Matrix.Create<double>(lTransFullA2.Count, n_features + 2);

// Fill the hashing-trick part in parallel
Parallel.For(0, lTransFullA2.Count(), i => hashing_vectorizer(lTransFullA2[i], i, n_features));

for (int i = 0; i < lTransFullA2.Count; i++)
{
    md[i, n_features] = lTransFullA2[i].getAmount1();
    md[i, n_features + 1] = lTransFullA2[i].getAmount2();
    v.AddAt(i, lTransFullA2[i].getAmount());
}

lTransFullA2 = null;
GC.Collect(2, GCCollectionMode.Forced);

var model = new LinearRegressionModel(v, md);

// Compute the regression
model.MaxDegreeOfParallelism = 8;
model.Compute();

// Print the model summary
Console.WriteLine(model.Summarize());

// Free memory
GC.Collect(2, GCCollectionMode.Forced);

And finally, the implementation of the hashing trick:

public static void hashing_vectorizer(Client f, int i, int n)
{
    int[] x = new int[n];

    // Hash each categorical feature into one of n buckets, with a +1/-1 sign
    string s = f.getYear().ToString();
    int idx = getIndx(s, n);
    x[idx] += calcBit(s);
    md[i, idx] = x[idx];

    s = f.getPeriod().ToString();
    idx = getIndx(s, n);
    x[idx] += calcBit(s);
    md[i, idx] = x[idx];

    s = f.getName().ToString();
    idx = getIndx(s, n);
    x[idx] += calcBit(s);
    md[i, idx] = x[idx];

    s = f.getMcc().ToString();
    idx = getIndx(s, n);
    x[idx] += calcBit(s);
    md[i, idx] = x[idx];
}

// Sign (+1/-1) from the bit parity of the XOR-folded bytes of the string
public static int calcBit(string s)
{
    byte b = 0;
    b = Convert.ToByte(s[0]);
    for (int i = 1; i < s.Count(); i++)
        b ^= Convert.ToByte(s[i]);

    bool result = true;
    while (b >= 1)
    {
        result ^= (b & 0x01) != 0;
        b = Convert.ToByte(b >> 1);
    }
    if (result) return -1;
    else return 1;
}

// Bucket index via MurmurHash3
public static int getIndx(string str, int n)
{
    Encoding encoding = new UTF8Encoding();
    byte[] input = encoding.GetBytes(str);
    uint h = MurMurHash3.Hash(input);
    return Convert.ToInt32(h % n);
}
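To make the mechanics concrete, here is a hypothetical walk-through for a single value (the literal "5411" is invented for illustration; it relies on the getIndx/calcBit helpers and the MurMurHash3 port above):

// Hypothetical illustration of where one categorical value lands.
string s = "5411";        // e.g. an MCC code rendered as a string
int n = 6;                // number of hash buckets, as in the article
int idx = getIndx(s, n);  // bucket index in [0, n), via MurmurHash3 % n
int sign = calcBit(s);    // +1 or -1 from the bit parity of the XOR-folded bytes
// hashing_vectorizer then accumulates: x[idx] += sign; md[row, idx] = x[idx];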

The outputs of the two programs are almost identical (RMSLE about 1.6).
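For reference, the reported metric is computed in log space over the absolute amounts, i.e. RMSLE = sqrt( (1/n) * sum_i (log(|amount_i| + 1) - prediction_i)^2 ), which is exactly what the np.sqrt(mse(...)) line of the JN version evaluates.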



Now we come to the most interesting part: the test results. All tests were run on an i7-2600 (8 hardware threads, though most of the time only one or two were busy), 12 GB of RAM, Windows 7.

To determine how execution time depends on data volume, the calculations were run on 1.7, 3.4, 5.1, and 6.8 million source records (the contents of the transactions.csv file). But since data preparation filters the data down to months 11-14, the graph shows the record counts after filtering.



As you can see, the C# version is roughly twice as fast. The situation with memory consumption is similar. The figures do not count the memory occupied by Visual Studio (the C# version ran in debug mode) or by the browser (localhost:8888); peak values were taken for the comparison:



With a further increase in the sample size, JN started using the page file, at which point everything slowed down dramatically.

Thus, we can see that C# makes it possible to process a considerably larger volume of data much faster than JN, since RAM is the hard limit here.

On the other hand, matplotlib's visualization tools let you analyze the data almost on the fly, and C# requires writing noticeably more code. Therefore, when memory or speed becomes insufficient, the best option is to use the JN stack to debug the model on a limited sample and to write the final implementation in C#.

Source: https://habr.com/ru/post/318484/

