Good day.
Working at the institute, I deal with a lot of semi-structured information. "Semi" here means that the data are all broadly similar, but as a rule they end up scattered across local folders on employees' computers, in .xls, .txt or binary formats. The information comes from various instruments (level sensors, temperature, flow velocity, atmospheric pressure, humidity and so on, up to 20-30 different parameters). Each instrument dumps its data in its own format, either ASCII or binary, which is then processed, and the output is ASCII again. In short, the usual story; you can picture the chaos yourself.
I wanted to put all of this into one common database, so that I would no longer have to hunt for the right version of the right data in the right folder, which eats a lot of time. I have experience developing various systems (mostly geo-information ones), but everything built before held only processed data, and those systems were made to order; there was never an automation suite for our own use.
Processing all of this is fairly standard, nothing new or interesting: checking the time series for completeness (interpolating where needed), building a pile of different plots, running various models on the data, processing the model output (another pile of plots), and producing statistics. The last part is what this article is about.
Statistics
It would not be so sad if all I needed were the ordinary indicators (minimum, maximum, mean, deviation, etc.); but I also need pivot tables of various kinds.
Simple
Say we have velocity data and need to build a table of this form:
| Velocity gradation, cm/s | Frequency, % |
|---|---|
| 0-10 | 60% |
| 10-20 | 30% |
| 20-30 | 10% |
The original series is about 10k velocity values. I am no Excel guru and, short of VBA, I don't know a simple way to do this (sure, you can drag formulas around, sort, and so on, or build pivot tables in Excel itself, but doing that by hand for every single instrument is brute force).
In the database (I use PostgreSQL) the task looks like this:
```sql
CREATE TABLE tbldata (
    dataid serial NOT NULL,
    measurementdate timestamp without time zone,
    velocity double precision,
    direction double precision
);

SELECT 100 * (count(velocity) / (SELECT count(*) FROM tbldata)::float)
FROM tbldata
WHERE velocity > 0 AND velocity < 10;
```
One query per gradation; it is easy to automate and runs quickly. You can write a stored procedure that takes the start value, the step, and the number of steps and returns SETOF RECORD. In short, this problem is solved.
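For illustration, here is a minimal Python sketch of the same idea without a stored procedure: one query per gradation, driven by a loop. This is my own sketch, not the author's code; it assumes the tbldata table defined above and uses psycopg2, which appears later in the article.

```python
import psycopg2

def percentage_by_gradation(conn, start=0.0, step=10.0, nsteps=10):
    # One row per velocity gradation: (lower bound, upper bound, percentage).
    sql = """
        SELECT 100 * (count(*) / (SELECT count(*) FROM tbldata)::float)
        FROM tbldata
        WHERE velocity >= %s AND velocity < %s;
    """
    rows = []
    with conn.cursor() as cur:
        for i in range(nsteps):
            lo = start + i * step
            cur.execute(sql, (lo, lo + step))
            rows.append((lo, lo + step, cur.fetchone()[0]))
    return rows
```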
More difficult
The problems begin when a table of this form is needed:
Frequency by compass point, %:

| Velocity gradation, cm/s | N | NE | E | SE | S | SW | W | NW |
|---|---|---|---|---|---|---|---|---|
| 0-10 | 10.8 | 8.2 | 1.3 | 1.3 | 2.1 | 10.1 | 6.9 | 25.4 |
| 10-20 | 4.0 | 0.1 | 0.1 | 1.6 | 3.3 | 0.6 | 0.1 | 10.9 |
| 20-30 | 1.8 | 0.0 | 0.0 | 1.2 | 3.4 | 0.1 | 0.0 | 2.2 |
| 30-40 | 0.7 | 0.0 | 0.0 | 0.8 | 1.2 | 0.0 | 0.0 | 0.3 |
| 40-50 | 0.1 | 0.0 | 0.0 | 0.3 | 0.2 | 0.0 | 0.0 | 0.1 |
| 50-60 | 0.1 | 0.0 | 0.0 | 0.2 | 0.1 | 0.0 | 0.0 | 0.0 |
| 60-70 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 70-80 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 |
That is, you need to calculate the percentage of gradations for two variables:
```sql
SELECT 100 * (count(*) / (SELECT count(*) FROM tbldata)::float)
FROM tbldata
WHERE velocity > 0 AND velocity < 10
  AND (direction > 337.5 OR direction < 22.5);
```
Then you can take advantage of the excellent crosstab function (from the tablefunc extension). To do this, slightly modify the SQL by adding the velocity gradation number and the direction gradation number:
```sql
SELECT 1 AS VelGrd, 1 AS DirGrid,
       100 * (count(*) / (SELECT count(*) FROM tbldata)::float)
FROM tbldata
WHERE velocity > 0 AND velocity < 10
  AND (direction > 337.5 OR direction < 22.5);
```
Then wrap it all into a function that takes the start, step, and number of steps for both velocity and direction, and returns the UNION ALL of all these queries.
As a result, our query will look like this.
```sql
SELECT *
FROM crosstab('SELECT * FROM fnMakePercentageByTwoVariables(0,10,10,22.5,45,8)')
    AS tbldata(velGrid int, dirGrid1 double precision, …, dirGrid8 double precision);
```
Very bad
Frequency by compass point, %, with totals, exceedance and per-direction statistics:

| Velocity gradation, cm/s | N | NE | E | SE | S | SW | W | NW | Total | Exceedance, % |
|---|---|---|---|---|---|---|---|---|---|---|
| 0-10 | 10.8 | 8.2 | 1.3 | 1.3 | 2.1 | 10.1 | 6.9 | 25.4 | 66.1 | 100.0 |
| 10-20 | 4.0 | 0.1 | 0.1 | 1.6 | 3.3 | 0.6 | 0.1 | 10.9 | 20.6 | 33.9 |
| 20-30 | 1.8 | 0.0 | 0.0 | 1.2 | 3.4 | 0.1 | 0.0 | 2.2 | 8.8 | 13.3 |
| 30-40 | 0.7 | 0.0 | 0.0 | 0.8 | 1.2 | 0.0 | 0.0 | 0.3 | 3.1 | 4.5 |
| 40-50 | 0.1 | 0.0 | 0.0 | 0.3 | 0.2 | 0.0 | 0.0 | 0.1 | 0.7 | 1.4 |
| 50-60 | 0.1 | 0.0 | 0.0 | 0.2 | 0.1 | 0.0 | 0.0 | 0.0 | 0.4 | 0.7 |
| 60-70 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2 | 0.3 |
| 70-80 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.1 |
| Total | 17.6 | 8.3 | 1.4 | 5.5 | 10.3 | 10.9 | 7.1 | 39.0 | 100.0 | |
| Mean velocity, cm/s | 10.7 | 2.6 | 2.9 | 22.6 | 19.9 | 4.3 | 3.3 | 9.2 | 9.7 | |
| Maximum velocity, cm/s | 76.2 | 57.8 | 50.2 | 78.3 | 61.1 | 48.8 | 42.9 | 62.5 | 78.3 | |
To be honest, at this point my patience ran out: on top of the previous query you need three more UNIONs and a JOIN. And what if you also want the standard deviation and something else? One more UNION, and so on for as long as your imagination holds. The exceedance (which is really just a cumulative percentage counted from the bottom up) I still don't know how to compute other than by hand. In short, if the layout of the table never changes, you can live with this.
Much worse
But then the task becomes comparing the readings of two instruments, and it turns out that every column has to be split into two:
Frequency by compass point, %, per device:

| Velocity gradation, cm/s | N, Device 1 | N, Device 2 | ... | NW, Device 1 | NW, Device 2 |
|---|---|---|---|---|---|
| 0-10 | 10.8 | 8.2 | ... | 1.3 | 1.3 |
| ... | ... | ... | ... | ... | ... |
Are we also interested in comparing instrument readings side by side? I certainly am. When I thought about such a comparison, I got very sad. There was no option faster than loading the data into SPSS, recoding everything, and getting exactly the output I needed without tons of hatred. There are also tables where, alongside the velocity gradations, you need the cumulative percentage as well, which again had to be done by hand in Excel.
Solution
My main working tool is Python, and I use Django to develop applications. In the end I decided: to hell with it all, enough pain and suffering. Stored procedures are fast and good, but they are hard for me to maintain. Besides, execution speed is not the deciding factor here: the people who dealt with this before me practically counted everything by hand, so waiting an extra minute instead of shoveling a data series for a couple of hours is pure bliss.
And so salvation arrived in the form of Pandas, which I had not seen a single mention of on Habr. The main charm of "pandas" is its data types. The one-dimensional one is Series. In essence, a Series has everything I was missing in dict; a lot of wheels get reinvented to extend dict functionality simply out of ignorance of pandas. You can address it like a dictionary by key, s['a'], or as an attribute, s.a. On top of that, pandas structures work with vectorized functions from numpy (I cannot live without numpy.where). Series combine into a DataFrame, an already two-dimensional structure, and DataFrames into a Panel. All of this works with slices, sub-indexing, index sets and mixed data; in short, everything is divine!
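A tiny illustration of these points (with made-up values), just to show the dict-style and attribute-style access and that numpy's vectorized functions apply directly to a Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, -2.0, 3.0], index=['a', 'b', 'c'])
print(s['a'], s.a)             # the same element, by key and as an attribute
print(np.where(s > 0, s, 0))   # numpy's vectorized functions work on the values
```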
Now, back to my task of getting the table I need.
First, let's connect to Postgres and get the query result:
```python
import psycopg2
import pandas as pd
import numpy as np
from pandas.io.sql import frame_query

conn_string = "host='localhost' dbname='test' user='postgres' password='**********'"
conn = psycopg2.connect(conn_string)
```
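The snippet above only opens the connection; the DataFrame df that the rest of the code works with still has to be read from the database. The exact query is not shown, so something along these lines is assumed (frame_query is the old pandas helper imported above):

```python
# Assumed query: the rest of the code only needs the 'velocity' and 'direction' columns.
df = frame_query("SELECT velocity, direction FROM tbldata", conn)
```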
Create gradations and labels for them.
```python
def makeVelocityLabels(c):
    # Turn the bin labels produced by pd.cut, like "(0, 10]", into "0-10"
    ret = dict()
    for levidx in xrange(len(c.levels)):
        lev = c.levels[levidx]
        ret[levidx] = str(lev).replace('(', "").replace(", ", "-").replace(']', "")
    return ret

velGrid = np.linspace(0, 100, 11)
dirGrid = np.linspace(-22.5, 337.5, 9)
# The eight compass points used as direction labels
dirLabels = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
velLabels = makeVelocityLabels(pd.cut(velGrid, velGrid)).values()
```
The observant eye will have noticed the cut function. It discretizes a continuous series, effectively turning a variable into a categorical one. Where I used to pile up numpy.where constructions like this:
```python
for i in xrange(len(velGrid) - 1):
    velocity[np.where((velocity > velGrid[i]) & (velocity < velGrid[i + 1]))] = velGrid[i]
```
now I simply do:
```python
pd.cut(velocity, velGrid)
```
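A minimal, self-contained illustration of what cut returns (the sample values here are made up):

```python
import numpy as np
import pandas as pd

velocity = pd.Series([3.2, 12.5, 27.1, 8.4])
velGrid = np.linspace(0, 100, 11)     # bin edges 0, 10, 20, ..., 100
print(pd.cut(velocity, velGrid))      # labels like (0, 10], (10, 20], (20, 30]
```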
Next, create a new DataFrame, put categorical variables in it:
```python
resultDf = pd.DataFrame(index=df.index, columns=df.columns)
```
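The listing above only allocates an empty frame; the step that actually fills resultDf with the discretized values is not spelled out, so here is one possible reconstruction (my assumption), using the grids and label lists defined earlier and plain numpy binning:

```python
# Reconstruction/assumption: fill resultDf with labelled gradations.
# Velocities are assumed to lie in 0..100 and fall into bins "0-10" ... "90-100";
# directions are shifted by 22.5 deg so that the northern sector
# (337.5..360 and 0..22.5) forms a single bin.
velCodes = np.digitize(df.velocity.values, velGrid) - 1
resultDf.velocity = [velLabels[i] for i in velCodes]

shifted = (df.direction.values + 22.5) % 360
dirCodes = np.digitize(shifted, np.linspace(0, 360, 9)) - 1
resultDf.direction = [dirLabels[i] for i in dirCodes]
```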
Now the most interesting part begins. Since the directions go into the table header (essentially I am crossing velocity with direction), the statistics (minimum, maximum and so on) have to be computed per direction as well.
Let us see how many measurements we have for each direction gradation.
```python
totalDir = resultDf.groupby("direction").size()
```

```
direction
  97   503  1251   592  2773   736   388   773
```
I would recommend using count() instead of size(): count() returns only the number of non-missing values, whereas size() counts every row.
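A quick toy example of the difference: size() counts every row in a group, count() only the non-missing values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"direction": ["N", "N", "NE"],
                    "velocity":  [1.0, np.nan, 3.0]})
print(toy.groupby("direction").size())               # N: 2, NE: 1
print(toy.groupby("direction")["velocity"].count())  # N: 1, NE: 1  (NaN excluded)
```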
Next, we calculate the per-direction statistics (mean and maximum) and build the pivot table itself.
```python
dfDGr = df.copy()
dfDGr.direction = resultDf.direction
meanVel = dfDGr.groupby("direction").mean()
maxVel = dfDGr.groupby("direction").max()
totalVel = resultDf.groupby("velocity").size()
pivot = 100 * pd.pivot_table(resultDf, rows='velocity', cols='direction',
                             aggfunc=len, fill_value=0) / totalDir.sum()
```

which gives:

```
direction
velocity
0-10    1.279348  6.916913  10.839308  8.224378  25.432307  2.066639  1.265289  10.066076
10-20   0.056235  0.126529   3.992689  0.070294  10.881485  3.303810  1.560523   0.632644
20-30   0.000000  0.014059   1.813581  0.014059   2.235344  3.388163  1.223113   0.112470
30-40   0.014059  0.000000   0.674821  0.000000   0.323352  1.237171  0.801350   0.014059
40-50   0.000000  0.014059   0.140588  0.000000   0.056235  0.224940  0.253058   0.042176
50-60   0.014059  0.000000   0.084353  0.014059   0.042176  0.112470  0.168705   0.000000
60-70   0.000000  0.000000   0.028118  0.000000   0.014059  0.014059  0.126529   0.000000
70-80   0.000000  0.000000   0.014059  0.000000   0.000000  0.000000  0.056235   0.000000
```
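As an aside, rows=/cols= is the old pivot_table spelling; in current pandas the keywords are index= and columns=. For a pure frequency table, pd.crosstab gives the same counts even more directly (a sketch over the same resultDf, not the author's code):

```python
# Percentage of observations in each velocity/direction cell via pd.crosstab;
# divides by the total row count (same as totalDir.sum() when nothing is missing).
freq = 100 * pd.crosstab(resultDf.velocity, resultDf.direction) / len(resultDf)
```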
It only remains to add the statistics to the table, round it, and put the columns in the right order.
```python
def getCumulate(s):
    # Cumulative percentage counted from the bottom row upwards
    ns = s.copy()
    for val in xrange(s.size):
        ns[ns.index[val]] = np.sum(s[s.index[val]:s.index[s.size - 1]])
    return ns

# Row/column labels match the headers of the target table above
pivot["Total"] = 100 * np.round(totalVel / float(np.sum(totalVel)), 3)
pivot["Exceedance, %"] = getCumulate(pivot["Total"])

totalDirFr = pd.DataFrame(100 * totalDir / float(np.sum(totalDir))).T
totalDirFr.index = ["Total"]
pivot = pivot.append(totalDirFr)

meanVelFr = meanVel.T
meanVelFr.index = ["Mean velocity, cm/s"]
pivot = pivot.append(meanVelFr)

maxVelFr = maxVel.T
maxVelFr.index = ["Maximum velocity, cm/s"]
pivot = pivot.append(maxVelFr)
```
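By the way, getCumulate is just a cumulative sum taken from the bottom row upwards, so for the same Series it can be replaced by a one-liner:

```python
def getCumulate(s):
    # Reverse, take the cumulative sum, reverse back: bottom-up cumulative total.
    return s[::-1].cumsum()[::-1]
```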
Summary:
The result is exactly what was needed. Less blood? Yes. Easier to modify? Yes.
In fact, the whole pain comes from the variables not being categorical to begin with; if categories were fed in right away, everything would be even simpler. On my weak laptop the whole script runs in 0.2 seconds on 7113 values.
Thanks for your attention.