⬆️ ⬇️

Work with large Excel files

What is a large file? So what is really big? When I was being, I thought it was a file for 50-60 thousand lines of records. And I would remain in such ignorance until now, but I had to carry out one project in which I had to work with files of 600-800 thousand lines. Going to the hell out - under the cut:







What first



')

And first, my friends, we rushed into the simplest thing you can think of. Interop.Excell and everything. It seemed. Yeah, schaz. As shown by the test tests, this method of opening led to the fact that in an hour 200 thousand lines of Excel were read, the application actively consumed the RAM, and pushed back the rest of the processes on the machine. It all ended as expected, but the investigative experiment had to be completed - at 260,000 the application fell into OutOfMemory on a machine with 4 GB. It became clear that the problem could not be solved in the forehead



Google it





How many wonderful discoveries to us ... Google brought, oddly enough, in msdn, where I met two methods of opening very large files: DOM and SAX. I don’t remember after the prescription of times, but some of them fell off due to OutOfMemory, which was already disgusted at that moment, and the second one was completely unsuitable in terms of access to data. Why - read below.



From what, from what





Our exelki are made. For anyone who decided to dig the format a little deeper, it will not be a secret that, unlike the binary xls, xlsx is essentially a zip archive with data. It is enough to change the extension with pens and unpack the archive into a folder - and we will get the entire internal structure of the document, which is nothing but a set of xml files and related information. As it turned out, there is no text data in the root xml. Instead, we have a set of indexes that refer to an auxiliary file that contains key / value pairs. One of the above methods to open that file is possible, but you need to dig into the accompanying files and pull text values ​​from them. Darkness.



And the darkness receded





After long ordeals and wailing, the following was born:



Our favorite usings that some individuals forget to include:

using System;

using System.Collections. Generic ;

using System.Data;

using System.Data.OleDb;

using System.IO;

using System.Linq;

using DocumentFormat.OpenXml;

using DocumentFormat.OpenXml.Packaging;

using DocumentFormat.OpenXml.Spreadsheet;




* This source code was highlighted with Source Code Highlighter .

using System;

using System.Collections. Generic ;

using System.Data;

using System.Data.OleDb;

using System.IO;

using System.Linq;

using DocumentFormat.OpenXml;

using DocumentFormat.OpenXml.Packaging;

using DocumentFormat.OpenXml.Spreadsheet;




* This source code was highlighted with Source Code Highlighter .

using System;

using System.Collections. Generic ;

using System.Data;

using System.Data.OleDb;

using System.IO;

using System.Linq;

using DocumentFormat.OpenXml;

using DocumentFormat.OpenXml.Packaging;

using DocumentFormat.OpenXml.Spreadsheet;




* This source code was highlighted with Source Code Highlighter .






Actually, the code itself:

public delegate void MessageHave( string message);



public delegate void _DataLoaded( List < string > data);



public delegate void _NewProcent( int col);



public static _DataLoaded DataLoaded;



public static _NewProcent NewProcent;



public static MessageHave MessageHave_Event;



public static void ReadData( object data)

{

// " "-" "

var keyValuePair = (KeyValuePair< string , string >)data;

using ( var cnn = new OleDbConnection( @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" +

keyValuePair.Key + @";Extended Properties=" "Excel 12.0;HDR=No;IMEX=1" "" )

)

{

int calc = 1000;

MessageHave_Event( " " );

cnn.Open();

try

{

var cmd = new OleDbCommand( String .Format( "select * from [{0}]" , keyValuePair.Value), cnn);

using (OleDbDataReader dr = cmd.ExecuteReader())

{

var lines = new List < string >();

int id = 0;

if (dr != null )

while (dr.Read())

{

string text = "" ;

for ( int i = 0; i < dr.FieldCount; ++i)

{

if (dr[i] != null )

text += dr[i] + "^" ; //

else

text += "^" ;

}

lines.Add(text);



id++;

if (id == calc)

{

NewProcent(id);

calc += 1000;

}

}

DataLoaded(lines);

}

cnn.Close();

}

catch (Exception ex)

{

MessageHave_Event( "Exception: " + ex.Message);

cnn.Close();

}

}

}




* This source code was highlighted with Source Code Highlighter .

public delegate void MessageHave( string message);



public delegate void _DataLoaded( List < string > data);



public delegate void _NewProcent( int col);



public static _DataLoaded DataLoaded;



public static _NewProcent NewProcent;



public static MessageHave MessageHave_Event;



public static void ReadData( object data)

{

// " "-" "

var keyValuePair = (KeyValuePair< string , string >)data;

using ( var cnn = new OleDbConnection( @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" +

keyValuePair.Key + @";Extended Properties=" "Excel 12.0;HDR=No;IMEX=1" "" )

)

{

int calc = 1000;

MessageHave_Event( " " );

cnn.Open();

try

{

var cmd = new OleDbCommand( String .Format( "select * from [{0}]" , keyValuePair.Value), cnn);

using (OleDbDataReader dr = cmd.ExecuteReader())

{

var lines = new List < string >();

int id = 0;

if (dr != null )

while (dr.Read())

{

string text = "" ;

for ( int i = 0; i < dr.FieldCount; ++i)

{

if (dr[i] != null )

text += dr[i] + "^" ; //

else

text += "^" ;

}

lines.Add(text);



id++;

if (id == calc)

{

NewProcent(id);

calc += 1000;

}

}

DataLoaded(lines);

}

cnn.Close();

}

catch (Exception ex)

{

MessageHave_Event( "Exception: " + ex.Message);

cnn.Close();

}

}

}




* This source code was highlighted with Source Code Highlighter .

public delegate void MessageHave( string message);



public delegate void _DataLoaded( List < string > data);



public delegate void _NewProcent( int col);



public static _DataLoaded DataLoaded;



public static _NewProcent NewProcent;



public static MessageHave MessageHave_Event;



public static void ReadData( object data)

{

// " "-" "

var keyValuePair = (KeyValuePair< string , string >)data;

using ( var cnn = new OleDbConnection( @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" +

keyValuePair.Key + @";Extended Properties=" "Excel 12.0;HDR=No;IMEX=1" "" )

)

{

int calc = 1000;

MessageHave_Event( " " );

cnn.Open();

try

{

var cmd = new OleDbCommand( String .Format( "select * from [{0}]" , keyValuePair.Value), cnn);

using (OleDbDataReader dr = cmd.ExecuteReader())

{

var lines = new List < string >();

int id = 0;

if (dr != null )

while (dr.Read())

{

string text = "" ;

for ( int i = 0; i < dr.FieldCount; ++i)

{

if (dr[i] != null )

text += dr[i] + "^" ; //

else

text += "^" ;

}

lines.Add(text);



id++;

if (id == calc)

{

NewProcent(id);

calc += 1000;

}

}

DataLoaded(lines);

}

cnn.Close();

}

catch (Exception ex)

{

MessageHave_Event( "Exception: " + ex.Message);

cnn.Close();

}

}

}




* This source code was highlighted with Source Code Highlighter .






The code showed a performance of about 15-20 minutes on files of 600-800 thousand records.



If someone finds the implementation curve - do not kick much :) I will listen to all comments



Source: https://habr.com/ru/post/139706/



All Articles