Problem of minutes

Life, as they say, is the best playwright. So the task of life puts no less interesting than the authors of entertaining books in mathematics.
Just 2 days ago, a seemingly routine, uninteresting and easy task made me nervous and began to doubt my own adequacy of perception of reality.

So, a brief introduction: I work in the field of IP telephony, a lot of voice transit traffic passes through our nodes to different directions, each operator in this area conducts their own call counts, when billing there are often significant differences in readings, then t occurs. n dispute, and reconciliation begins in directions and days.
Record about each call is stored in the so-called CDR-format (Call Detail Record: ru.wikipedia.org/wiki/Call_Detail_Record ). The format is strictly not regulated, everyone is free to keep there what he needs. However, each call has a unique identifier, i.e. The GID (Global Call ID), which is recorded unchanged on all intermediate nodes, is the one that allows you to unambiguously indicate any call, since all other data of the call may not coincide (the time for setting up and ending a call almost always differs due to the inaccuracy of the hours used, the duration of the call due to delays in transmission and rounding often differs by one or two seconds [by the way, on large volumes this is a lot of money, for the sake of interest, we considered how much money a mobile operator can make from the air simply by adding one second to each conversation: millions of dollars a month], etc.)
This time I had to analyze data in one direction 992 (Tajikistan). I was provided with CDR records of the originator of calls (the caller) in CSV format (comma separated values, although I usually use a semicolon :) in the first 5 days, I unloaded my data for the same period and began to compare.
Since all data is almost uniquely different, the analysis is performed by CallID. The originator claimed that he sent us less minutes than we indicated in the invoice. Accordingly, my task was to find the missing calls from the originator in my data (if any), calculate their duration and provide them to the originator.

Next comes a small scriptbook written on the knee and slightly processed for this publication (COMPANY1 is the call originator, COMPANY2 is the call terminator, that is, we):
================================================= =========

#!/bin/bash

# CallID.
sort -u -f -k7 -t';' COMPANY1.csv >COMPANY1-sorted.csv
sort -u -f -k6 -t';' COMPANY2.csv >COMPANY2-sorted.csv

#
echo " COMPANY1:"
SEC=0
while read sec; do
SEC=$(($SEC+$sec))
done < <(cut -f6 -d';' COMPANY1-sorted.csv)
echo $(($SEC/60))

echo " COMPANY2:"
SEC=0
while read sec; do
SEC=$(($SEC+$sec))
done < <(cut -f5 -d';' COMPANY2-sorted.csv)
echo $(($SEC/60))

# ""
cut -f7 -d';' COMPANY1-sorted.csv >COMPANY1-callid.csv
cut -f6 -d';' COMPANY2-sorted.csv >COMPANY2-callid.csv
# CallID ,
# , join
# , , , .. CallID
# , CallID
# , , ,
comm -1 -3 COMPANY1-callid.csv COMPANY2-callid.csv | join --check-order -i -1 1 -2 6 -t';' - COMPANY2-sorted.csv >COMPANY1-COMPANY2-absent.csv

echo " , COMPANY1:"
SEC=0
while read sec; do
SEC=$(($SEC+$sec))
done < <(cut -f6 -d';' COMPANY1-COMPANY2-absent.csv)
echo $(($SEC/60))

================================================= =========
Run:
')
The amount of minutes in the COMPANY1 file:
21291
The amount of minutes in the COMPANY2 file:
24789
The sum of minutes of calls that are not in the COMPANY1 file:
2731

Voila
However, wait! In the file of the originator company, 21291 minutes, the sum of the missing minutes, found on the basis of the data of our file, is 2731 minutes, which means that they should total 24789 minutes, as in our file. And 21291 + 2731 = 24022, where did 767 minutes go to?
And it was here that I slowly began to go crazy. The script was rechecked from and to, various intermediate calculations were used to make sure that there were no errors and the number of records matches, the usual diff was tried before the comm utility, I even checked the source files in the required columns for isdigit (), etc. etc.
Nothing helped! The number of minutes in a magical way did not coincide, although at first glance everything is correct. And only the next day, one intelligent person (Sergey Vlasov, vsu @ ALTLinux Team) helped me find these minutes. Try it and you find them :-)
For the purity of the experiment, I post the script itself and two source files (removing the names and IP from them) on my ftp:
ftp://82.208.44.206/pub/minutes_task.zip

UPD: habracut for some reason refuses to perceive the text as in the text parameter, it turns out only “Read more” :(
UPD2 : Asked to describe the CSV format (Comma Separated Values): this is a simple text file, where the values are separated by a comma (comma), and in this case a semicolon. Moreover, the order of the fields is not regulated, so only with your eyes you can determine where and what. Here is an example:
XXX.XXX.XXX.11; COMPANY1; 992927233890; 2008-06-01 19: 31: 35 ;; 9; 780a890bddce4248b839040046464636; 16;
Here is the sixth field - the duration of the call in seconds. Seventh - the same CallID

Source: https://habr.com/ru/post/43802/

All Articles

Problem of minutes

More articles: