📜 ⬆️ ⬇️

Organization of storage of intermediate tables for the CART algorithm

Good day to all who read. First set out the background of this issue. It all started with the fact that a friend asked me for a job to help him make an application (WinForm in C #) that would create a binary decision tree using the CART algorithm, based on the statistics in the table on the form. The point is to divide the source table into 2 parts according to the calculated criteria. In order to divide the source table, we need to find the average for each of the columns:

//     public static double[] GetAverageValue(DataTable table) { #region      var sr = new double[14]; for (int j = 1; j < table.Columns.Count - 1; j++) { double sum = 0.0; for (int i = 0; i < table.Rows.Count; i++) { sum = sum + Convert.ToDouble(table.Rows[i][j]); } sr[j - 1] = sum / table.Rows.Count; } #endregion return sr; } 


The first and last columns in the calculation are not taken into account (in the first value is Year, in the last value is 0 or 1). This function will return a one-dimensional array of averages, which will be accessed according to the index.
Then there is the actual search for the Gini coefficient for each of the columns, which will show us which column to take to divide our table, select the minimum among them and, accordingly, the column index with the minimum coefficient will be necessary for us to split. Then this column is selected and all values ​​greater than the average for this column are sent to one table, for example 1.1, and all values ​​less than the average to table 1.2. Below is the code. which is responsible for the first partitioning of the source table:

 //         public static int[] GetMoreAverageAndLessAverage(double average, double[] columns, int[] T) { int moreAverage = 0; int valueTMoreAverage = 0; int valueTInvertMoreAverage = 0; int lessAverage = 0; int valueTLessAverage = 0; int valueTLessInvertAverage = 0; var moreAverageValue = new int[6]; for (int i = 0; i < columns.Length; i++ ) { if (columns[i] >= average) { moreAverage = moreAverage + 1; if (T[i] == 1) { valueTMoreAverage = valueTMoreAverage + 1; } else { valueTInvertMoreAverage = valueTInvertMoreAverage + 1; } } else { lessAverage = lessAverage + 1; if (T[i] == 1) { valueTLessAverage = valueTLessAverage + 1; } else { valueTLessInvertAverage = valueTLessInvertAverage + 1; } } } moreAverageValue[0] = moreAverage; moreAverageValue[1] = valueTMoreAverage; moreAverageValue[2] = valueTInvertMoreAverage; moreAverageValue[3] = lessAverage; moreAverageValue[4] = valueTLessAverage; moreAverageValue[5] = valueTLessInvertAverage; return moreAverageValue; } //   Gini(T,param) public static double GetValueGini(int[] currentCollumnsAndT, int lengthCollumns) { var gini = 0.0; if (currentCollumnsAndT != null) { double gini1 = (Convert.ToDouble(currentCollumnsAndT.GetValue(0)) / lengthCollumns) * (1 - ((Convert.ToDouble(currentCollumnsAndT.GetValue(1)) / Convert.ToDouble(currentCollumnsAndT.GetValue(0))) * (Convert.ToDouble(currentCollumnsAndT.GetValue(1)) / Convert.ToDouble(currentCollumnsAndT.GetValue(0)))) - ((Convert.ToDouble(currentCollumnsAndT.GetValue(2)) / Convert.ToDouble(currentCollumnsAndT.GetValue(0))) * (Convert.ToDouble(currentCollumnsAndT.GetValue(2)) / Convert.ToDouble(currentCollumnsAndT.GetValue(0))))); double gini2 = (Convert.ToDouble(currentCollumnsAndT.GetValue(3)) / lengthCollumns) * (1 - ((Convert.ToDouble(currentCollumnsAndT.GetValue(4)) / Convert.ToDouble(currentCollumnsAndT.GetValue(3))) * (Convert.ToDouble(currentCollumnsAndT.GetValue(4)) / Convert.ToDouble(currentCollumnsAndT.GetValue(3)))) - ((Convert.ToDouble(currentCollumnsAndT.GetValue(5)) / Convert.ToDouble(currentCollumnsAndT.GetValue(3))) * (Convert.ToDouble(currentCollumnsAndT.GetValue(5)) / Convert.ToDouble(currentCollumnsAndT.GetValue(3))))); gini = gini1 + gini2; } return gini; } //    public static double[] GetValueGiniIsPrimaryDivide(DataTable table) { var currentCollumn = new double[table.Rows.Count]; var average = GetAverageValue(table); //      ,    GetValueGiniIsPrimaryDivide ( ) var T = new int[table.Rows.Count]; var gini = new double[table.Columns.Count - 1]; for (int i = 0; i < table.Rows.Count; i++) { T[i] = Convert.ToInt16(table.Rows[i][table.Columns.Count - 1]); } for (int i = 1; i < table.Columns.Count - 1; i++) { for (int j = 0; j < table.Rows.Count; j++) { currentCollumn[j] = 0.0; currentCollumn[j] = Convert.ToDouble(table.Rows[j][i].ToString()); } var bufer = average[i - 1]; gini[i] = GetValueGini(GetMoreAverageAndLessAverage(bufer, currentCollumn, T), table.Rows.Count); } var min = Convert.ToDouble(gini.GetValue(0)); int indexMin = 0; for (int i = 0; i < gini.Length; i++) { if (Convert.ToDouble(gini.GetValue(i)) <= min) { min = Convert.ToDouble(gini.GetValue(i)); indexMin = i; } } var returnValue = new double[2]; returnValue[0] = min; returnValue[1] = indexMin; return returnValue; } } 

')
The table itself, recall, is obtained from the DataGridView. And all of the above is called this way:

 #region   DataGrid DataTable var table = new DataTable(); for (int j = 0; j < dataGridView1.Columns.Count; j++) { table.Columns.Add(dataGridView1.Columns[j].Name); } for (int i = 0; i < dataGridView1.Rows.Count - 1; i++) { table.Rows.Add(); for (int j = 0; j < dataGridView1.Columns.Count; j++) { table.Rows[i][j] = dataGridView1[j, i].Value.ToString(); } } #endregion var giniIsMin = SeparationTable.GetValueGiniIsPrimaryDivide(table); 


Then each intermediate table goes through all the same steps. The signal to complete the division of the table is that in the last column (T) all values ​​are either all equal to 1 or all are equal to 0. At each subsequent stage one column does not participate in the calculations, for which the Gini coefficient was minimal at the previous step. Thus a binary decision tree is built. So, now the actual question itself is: how can you organize the storage of intermediate tables, so that you can short-cut them all and that it’s important not to lose them. We suggested the option to store them in a three-dimensional array, putting all intermediate tables at positions 1, 11, 12, 111, 112, and so on, but we rejected this option due to its “gluttony” to memory. If anyone understood at least something or tell me how to organize this.

Source: https://habr.com/ru/post/177797/


All Articles