
Using Bash in SQL style

Greetings! This short article highlights some aspects of using Bash to analyze files in an SQL-like style. It should be interesting for beginners; perhaps experienced users will also find something new for themselves.

Task structure:

Given: an internal reporting system (described in detail below).
Must: find outdated reports.
So, fire up Bash, open a separate terminal for the man pages, and let's go)

Everyone interested, welcome under the cut.

We have: an internal reporting system in the form of a folder with projects. In each project, the conf folder holds the report-building configurations, whose "schema" fields name the Hive databases from whose tables the reports are built. In the reports folder, the reports themselves are laid out in folders named after the configurations. Each report is a JSON file containing statistics on Hive tables in an array of "table" objects, as well as a creation date in the "created_date" field. We take that field instead of the file creation date, since it is right there in the data. We need to find reports that reference tables modified after the report was created.
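To make the layout concrete, here is a rough sketch of the assumed structure (all the concrete names are invented for illustration):

 projects/
   project1/
     conf/
       run1.conf        # contains a line like:  "schema": "schema1",
     reports/
       run1/
         report1.json   # contains "created_date": "2017-08-07 07:09:18.024907"
                        # and an array of objects with "table": "table1", etc.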

Why SQL style? Bash offers powerful tools for working with text split into columns (usually by whitespace), which closely resembles processing tables in SQL.

Tools: grep, sed, awk, sort, uniq, join, xargs (plus hdfs dfs for Hive).

Let's get started. For starters, let's simply list the tables used:

  grep -r "\"table\":" projects/*/reports/* | ... 

It gives the data in the following form:
projects/project1/reports/run1/report1.json: "table": "table1",
projects/project1/reports/run2/report2.json: "table": "table2",
projects/project2/reports/run3/report3.json: "table": "table3",
...
  ... | sed 's/:/ /g' | awk '{print $1 " " $3}' | sed 's/[\r\n",:]//g' | ... ... | sort -k 1b,1 | uniq > report_tables 

We replace ':' with a space so that the file name is cleanly separated from the "table" column, print the first column (the report file) and the third (the table name), clean up the garbage with sed, then sort and deduplicate into our first table, report_tables. The sort -k 1b,1 sorts on the first field in exactly the form join expects its inputs (it is the invocation recommended by join's own documentation).
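Assembled end to end (assuming the elided parts above hide nothing essential), the whole step might look like this:

 grep -r "\"table\":" projects/*/reports/* \
   | sed 's/:/ /g' \
   | awk '{print $1 " " $3}' \
   | sed 's/[\r\n",:]//g' \
   | sort -k 1b,1 | uniq > report_tables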

Then, in the same way, we build the report_dates table, only this time we grep for created_date and output a few more columns (the date and the time):

 grep -r "\"created_date\":" projects/*/reports/* | sed 's/:/ /g' | ... ... | awk '{print $1 " " $3"T"$4":"$5":"$6}' | sed 's/[\r\n",:]//g' | ... ... | sort -k 1b,1 | uniq > report_dates 
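The result is one line per report file with its creation timestamp. With the sample data above it would look roughly like this (note that the sed has stripped the colons, so the time reads as HHMMSS):

 projects/project1/reports/run1/report1.json 2017-08-07T070918.024907
 ...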

Now join them, merging the report file name and the table name into one column, and we get a table of report files, tables, and the creation date of the report:

 join report_tables report_dates | awk '{print $1"#"$2 " " $3}' | ... ... | sort -k 1b,1 > report_table_date 
projects/project1/reports/run1/report1.json#table1 2017-08-07T070918.024907
projects/project1/reports/run1/report1.json#table2 2017-08-07T070918.024907
projects/project1/reports/run1/report1.json#table3 2017-08-07T070918.024907
...

The first part seems ready. Now, by analogy, we extract the databases (schemas) that are used:

 grep -r "schema\":" projects/*/conf/* | sed 's/:/ /g' | ... ... | awk '{print $3 " " $1}' | sed 's/[\r\n":,]//g' | ... ... | sort -k 1b,1 | uniq > schema_configs 
schema1 projects/project1/conf/run1.conf
schema1 projects/project1/conf/run2.conf
schema2 projects/project2/conf/run1.conf
...

Here comes the first difficulty. The previous table is keyed by report files, while this one is keyed by config files. We need to establish a correspondence between them:

 cat schema_configs | awk '{print $2}' | sort | uniq | ... 

Now let's think. We can't simply write xargs -n1 find ..., because we would lose the config line itself, and we still need it. So we iterate in a loop instead. Fine: add a pipe and go:
 ... | while read line; do <statements>; done | sort -k 1b,1 > config_reports 

Now we write everything that goes inside <statements>:
 dir=$(dirname "$line"); dir2=$(dirname "$dir"); ... run=$(echo "$line" | sed 's/.*\///' | sed 's/\.conf//g'); ... reps=$(find "$dir2/reports/$run/" -name '*.json'); ... for r in $reps; do echo "$line" "$r"; done 

Looks scary, but it's simple. dirname strips everything after the last slash from a path, and applying it twice takes us up two levels from the config file ($dir2). The run=... expression pulls the run.conf file name out of $line and cuts off the extension, yielding the name of the launch configuration. reps then holds the names of the report files for this config, and, looping over them, we print the config file $line alongside the report file $r. Sort and write to config_reports:
projects/project1/conf/run1.conf projects/project1/reports/run1/report1.json
projects/project1/conf/run1.conf projects/project1/reports/run1/report2.json
projects/project1/conf/run2.conf projects/project1/reports/run2/report3.json
...
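For readability, here is the same loop assembled into a standalone sketch (again, assuming the elided parts hide nothing essential):

 cat schema_configs | awk '{print $2}' | sort | uniq \
   | while read line; do
       dir=$(dirname "$line")       # e.g. projects/project1/conf
       dir2=$(dirname "$dir")       # e.g. projects/project1
       run=$(echo "$line" | sed 's/.*\///; s/\.conf//')   # e.g. run1
       reps=$(find "$dir2/reports/$run/" -name '*.json')
       for r in $reps; do echo "$line" "$r"; done
     done | sort -k 1b,1 > config_reports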

That was the most important part of the job: establishing the correspondence between the space of configs and the space of reports. It remains only to determine the last-modification dates of the tables in the databases used, and we will have all the information we need; all that's left is to join it together correctly. Go:

 cat schema_configs | awk '{print $1}' | sort | uniq | ... ... |sed 's/^/path_in_hive/g' | sed 's/$/\.db/g' | ... ... | xargs -n1 -I dr hdfs dfs -ls dr | sed 's/\// /g' | ... ... | sed 's/\.db//g' | awk '{print $12 " " $13 " " $6"T"$7}' | ... ... | sort -k 1b,1 | uniq > schema_tables 

Despite the length, everything is simple. First we take schema_configs and extract the unique schemas from it; then, with sed, we prepend the path to the Hive warehouse and append the .db suffix. For each such line we run hdfs dfs -ls, which lists all the tables in the given database together with their last-modification dates. We replace the slashes with spaces, pull out the database name, the table name, and the modification date, then sort, deduplicate, and save to schema_tables.
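hdfs dfs -ls prints permissions, replication, owner, group, size, date, time, and path; after the slashes are replaced with spaces, the exact awk column numbers ($12, $13) depend on the depth of the warehouse path. The resulting schema_tables should look roughly like this (values invented):

 schema1 table1 2017-08-09T11:42
 schema1 table2 2017-08-05T09:13
 ...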

Now the final part:

 # configs - tables
 join schema_configs schema_tables | awk '{print $2 " " $3 " " $4}' | ...
 ... | sort -k 1b,1 | uniq > config_tables
 # reports - tables hive dates
 join config_reports config_tables | awk '{print $2"#"$3 " " $4}' | ...
 ... | sort -k 1b,1 > report_table_hive_dates
 # final!
 join report_table_date report_table_hive_dates | sed 's/#/ /g' | ...
 ... | awk '{if ($3<$4) print $1}' | sort | uniq > outdated_reports

First we join schema_configs and schema_tables on the schema (database) field, getting config_tables: the config, the table, and its last-modification date. Then we join config_reports and config_tables to finally get the correspondence between a report and a table in Hive; the report file name and the table name are merged into one field with the help of '#'. The final touch: join report_table_date and report_table_hive_dates, split the report file name and the table name back apart by replacing '#' with a space, and print those reports whose creation date is less than the table's modification date; deduplicate, and the work is done.
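One caveat worth spelling out: awk's $3<$4 is a string comparison, and string order matches chronological order only when both timestamps use exactly the same format; since the sed earlier stripped the colons out of the report timestamps, the Hive dates would need the same normalization. A tiny check (values invented):

 echo "report1.json 2017-08-07T070918.024907 2017-08-09T1142" \
   | awk '{if ($2<$3) print $1 " is outdated"}'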

Conclusion


Nine fairly simple lines of Bash turned out to be enough to solve this problem. We then run this script from cron, and the web front-end, by consulting the outdated_reports file, can show (or not show) a "Report is outdated" banner for each report.

Code here

Source: https://habr.com/ru/post/343758/

