CPython vs. IronPython: MD5 Hash Calculation

At some point the project needed an auto-update mechanism for its client application. Because the client worked with domestic crypto providers, which are easier to access from .NET, it was written in IronPython. C# was not chosen because Python was already in active use on the server side and we did not want to do much relearning.

It seemed simple. We wrote a script that calculates MD5 hashes for the files that make up the application, combines them into a single file with lines of the form “relative path:md5”, and places that file in the directory from which nginx serves static files. On startup, the client application fetches this file, runs a similar script, and compares the result against the reference.
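The client-side comparison described above could be sketched roughly as follows. This is only an illustration, not the project's actual code: the helper names md5_of_file, parse_manifest and files_to_update are hypothetical, and the manifest is assumed to contain one "relative path:md5" entry per line with a leading slash in the path, as the generating script below produces.

```python
import hashlib
import os


def md5_of_file(path):
    """Return the hex MD5 digest of a file's contents."""
    with open(path, 'rb') as fd:
        return hashlib.md5(fd.read()).hexdigest()


def parse_manifest(text):
    """Parse lines of the form 'relative/path:md5' into a dict."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the last colon: the hex digest never contains one.
        rel_path, digest = line.rsplit(':', 1)
        entries[rel_path] = digest.strip()
    return entries


def files_to_update(reference, root):
    """List relative paths whose local copy is missing or differs."""
    stale = []
    for rel_path, digest in sorted(reference.items()):
        local = os.path.join(root, rel_path.lstrip('/'))
        if not os.path.isfile(local) or md5_of_file(local) != digest:
            stale.append(rel_path)
    return stale
```

With the reference manifest downloaded as text, files_to_update(parse_manifest(text), 'app') would return the paths that need to be fetched again.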

But then a small detail came to light: under IronPython the script ran several times slower. And that was on fairly fast hardware; a user's machine could be much weaker. Optimization began, and in the course of it the idea came up to compare the performance of CPython and IronPython on this example. Accordingly, the article looks at three separate results: for CPython, for IronPython, and for IronPython with an adapted script.

Configuration

Core i5 650 3.20 GHz
8 GB of RAM
Windows 7 Enterprise x64
Python 2.7.1
IronPython 2.7.3

The directory with the application files served as the input for the script. It contains the IronPython runtime itself, additional libraries, and other necessary files: about 350 files in total, ranging in size from kilobytes to three megabytes.
Script code:

 1| import os
 2| import hashlib
 3|
 4| def getMD5sum(fileName):
 5|     m = hashlib.md5()
 6|     fd = open(fileName, 'rb')
 7|     b = fd.read()
 8|     m.update(b)
 9|     fd.close()
10|     return m.hexdigest()
11|
12| output = ''
13| rootpath = 'app'
14|
15| for dirname, dirnames, filenames in os.walk(rootpath):
16|     for filename in filenames:
17|         fname = os.path.join(dirname, filename).replace('\\', '/')
18|         md5sum = getMD5sum(fname)
19|         output += '{0}:{1}\n'.format(fname.replace(rootpath, ''), md5sum)
20|
21| f = open('./checksums.csv', 'w')
22| f.write(output)
23| f.close()

The same script adapted for IronPython:

 1| import os
 2| import System.IO
 3| from System.Security.Cryptography import MD5CryptoServiceProvider
 4|
 5| def getMD5sum(fileName):
 6|     b = System.IO.File.ReadAllBytes(fileName)
 7|     md5 = MD5CryptoServiceProvider()
 8|     hash = md5.ComputeHash(b)
 9|     result = ''
10|     for b in hash:
11|         result += b.ToString("x2")
12|     return result
13|
14| output = ''
15| rootpath = 'app'
16|
17| for dirname, dirnames, filenames in os.walk(rootpath):
18|     for filename in filenames:
19|         fname = os.path.join(dirname, filename).replace('\\', '/')
20|         md5sum = getMD5sum(fname)
21|         output += fname.replace(rootpath, '', 1) + ':' + md5sum + '\n'
22|
23| System.IO.File.WriteAllText('checksums.csv', output)

In essence, the whole adaptation boils down to rewriting the file reading/writing and the hash calculation on top of .NET. This gives a decent performance boost, because ipy itself is written in C# and most of its “batteries” are just wrappers around .NET. In this light, the difference between line 19 of the original script and line 21 of the adapted one may be of interest:

 19| output += '{0}:{1}\n'.format(fname.replace(rootpath, ''), md5sum)

 21| output += fname.replace(rootpath, '', 1) + ':' + md5sum + '\n'

In ipy, the second option was faster. As for CPython, I could not see a difference exceeding the statistical error.
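A comparison of this kind can be reproduced in isolation with the standard timeit module. The sketch below is not the measurement used in the article: the sample path and the function names with_format and with_concat are made up for illustration, and the absolute numbers will depend on the interpreter and the machine.

```python
import timeit

fname = '/lib/example.dll'                   # hypothetical relative path
md5sum = 'd41d8cd98f00b204e9800998ecf8427e'  # MD5 of empty input


def with_format():
    # Variant from line 19 of the original script
    return '{0}:{1}\n'.format(fname, md5sum)


def with_concat():
    # Variant from line 21 of the adapted script
    return fname + ':' + md5sum + '\n'


# Both variants must build exactly the same line
assert with_format() == with_concat()

for func in (with_format, with_concat):
    elapsed = timeit.timeit(func, number=100000)
    print('{0}: {1:.4f} s'.format(func.__name__, elapsed))
```

Running the same file under both interpreters shows whether the gap between the two variants is specific to ipy.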

Results

So, the cold-start results (averages):

CPython: ~ 0.06 s.
IronPython: ~ 0.33 s.
IronPython (adapted script): ~ 0.16 s.

It is plain to see that the same script runs more than five times faster under CPython than under IronPython. The script adapted for ipy is still slower than CPython, but its result is already quite acceptable.

There is another nuance: on the client, this script has to be embedded in the application itself. What matters, then, is not so much the cold-start time as the time of the execution proper, excluding interpreter startup. Let's reproduce this behavior by running the code in a loop.

Typical results:

CPython          ipy              ipy (adapt.)
0:00:00.057000   0:00:00.327000   0:00:00.161000
0:00:00.056000   0:00:00.243000   0:00:00.093000
0:00:00.055000   0:00:00.234000   0:00:00.099000
0:00:00.058000   0:00:00.228000   0:00:00.096000
0:00:00.055000   0:00:00.236000   0:00:00.093000
0:00:00.055000   0:00:00.236000   0:00:00.093000
0:00:00.055000   0:00:00.225000   0:00:00.093000
0:00:00.055000   0:00:00.261000   0:00:00.092000
0:00:00.057000   0:00:00.240000   0:00:00.092000
0:00:00.057000   0:00:00.227000   0:00:00.093000
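
The timing loop itself is not shown in the article. The format of the values resembles printed datetime.timedelta objects, so one way such a loop could look is sketched below. The names time_runs and sample_task are hypothetical, and the workload is a stand-in for the hashing script.

```python
from datetime import datetime


def time_runs(task, runs=10):
    """Call task() `runs` times and return a list of timedelta durations."""
    durations = []
    for _ in range(runs):
        started = datetime.now()
        task()
        durations.append(datetime.now() - started)
    return durations


def sample_task():
    # Placeholder workload; in the benchmark this would be the hashing code.
    sum(range(100000))


if __name__ == '__main__':
    for delta in time_runs(sample_task):
        # A timedelta prints in a form such as 0:00:00.057000
        print(delta)
```

Because every iteration runs inside an interpreter that is already started, only the first value includes one-time costs such as imports and warm-up.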

Findings

These results already allow a more or less plausible conclusion. Roughly 0.07 seconds (the gap between the first and the subsequent iterations of the adapted script, 0.161 s versus about 0.093 s) is spent just on getting IronPython itself up and running. In that time the script running under native Python manages to finish. CPython starts virtually instantly: as you can see, its first iteration was as fast as the following ones. At the same time, even code optimized for ipy and running hot is still roughly 1.7 times slower than under CPython (about 0.093 s versus 0.056 s).
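
As a rough check, these figures can be recomputed from the timings in the table above. The sketch treats the first iteration as the warm-up run and averages the rest; the helper names hot_mean and warmup_cost are made up for this illustration.

```python
# Timings in seconds, copied from the table above (first value = first iteration).
cpython = [0.057, 0.056, 0.055, 0.058, 0.055, 0.055, 0.055, 0.055, 0.057, 0.057]
ipy = [0.327, 0.243, 0.234, 0.228, 0.226, 0.236, 0.225, 0.261, 0.240, 0.227]
ipy_adapted = [0.161, 0.093, 0.099, 0.096, 0.093, 0.093, 0.093, 0.092, 0.092, 0.093]


def hot_mean(samples):
    """Average of all iterations except the first (warm-up) one."""
    rest = samples[1:]
    return sum(rest) / len(rest)


def warmup_cost(samples):
    """Extra time of the first iteration compared with the hot average."""
    return samples[0] - hot_mean(samples)


for name, samples in (('ipy', ipy), ('ipy (adapt.)', ipy_adapted)):
    ratio = hot_mean(samples) / hot_mean(cpython)
    print('{0}: warm-up {1:.3f} s, hot runs {2:.2f}x CPython'.format(
        name, warmup_cost(samples), ratio))
```

On these numbers the warm-up cost comes out at roughly 0.07 to 0.09 s, and the adapted script's hot runs take about 1.7 times as long as CPython's.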

Using the same code for CPython and IronPython seems to make little sense if performance matters at all. Performance is not the only limitation of sharing code with IronPython, though: there are other nuances and bugs unrelated to speed, but they are beyond the scope of this article. I should add that abandoning IronPython is out of the question as well. It copes quite successfully with the duties assigned to it.

I will be glad to hear constructive criticism.

UPD
mstyura offered a more optimized version of the script for ipy with a more interesting result:

from System.IO import StreamWriter, Directory, SearchOption, File, Path
from System import String, BitConverter, Environment, Array
from System.Security.Cryptography import MD5CryptoServiceProvider

def getMD5sum(fileName):
    # Hash the file as a stream instead of reading it into memory first
    stm = File.OpenRead(fileName)
    md5 = MD5CryptoServiceProvider()
    hash = md5.ComputeHash(stm)
    stm.Close()
    # BitConverter.ToString yields "AB-CD-..."; convert to lowercase hex
    return BitConverter.ToString(hash).Replace("-", "").ToLower()

rootpath = 'app'
workingDir = Environment.CurrentDirectory
# Work from inside rootpath so that the enumerated paths are relative to it
Environment.CurrentDirectory = rootpath

appFiles = Directory.EnumerateFiles('.', '*', SearchOption.AllDirectories)
# Note: File.OpenWrite does not truncate an existing checksums.csv
output = StreamWriter(File.OpenWrite(Path.Combine(workingDir, 'checksums.csv')))

for _, file in enumerate(appFiles):
    output.Write(file.replace(".", "", 1).replace("\\", "/"))
    output.Write(":")
    output.WriteLine(getMD5sum(file))

output.Close()
Environment.CurrentDirectory = workingDir

The result of this option:
0:00:00.116000
0:00:00.063000
0:00:00.064000
0:00:00.063000
0:00:00.059000
0:00:00.059000
0:00:00.058000
0:00:00.058000
0:00:00.058000
0:00:00.059000

It is clear that with a little more work it would overtake CPython: they are almost on a par. The first run is of course still slow, but it too has become faster, apparently because the Python libraries are no longer imported or used. But if you just add import os and call os.walk(rootpath) a single time, the time of the first iteration grows to ~0.145 s! Apparently, though, it is this function itself that is so heavy: if you call something simple like os.getcwd() instead, the speed barely changes.

Source: https://habr.com/ru/post/149612/
