In the conf folder there are report-building configurations; each one names a Hive database in its "schema" field, and the report is built from the tables of that database. In the reports folder, the reports themselves sit in subfolders named after the configurations. Each report is a JSON file with statistics on Hive tables in an array of "table" objects, plus the creation date in the "created_date" field. We take that instead of the file's creation date, since it is right there in the data. The task: find the reports that mention tables which were changed after the report was built.
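To have something to poke at, here is a minimal sandbox reproducing the layout described above (all the names, and the exact JSON shape, are assumptions for illustration):

mkdir -p projects/project1/{conf,reports/run1}
echo '"schema": "schema1",' > projects/project1/conf/run1.conf
# the array key ("tables") is assumed; what matters below are the "table" lines inside it
cat > projects/project1/reports/run1/report1.json <<'EOF'
{
  "created_date": "2017-08-07 07:09:18.024907",
  "tables": [
    {
      "table": "table1"
    },
    {
      "table": "table2"
    }
  ]
}
EOF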
The workhorse template for digging through text like this is the usual pipeline:

awk '...' log | sort -k field_n | uniq -c | sort -n -r | head -n N
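For instance, a hedged instantiation of that template: the top five client addresses in an access log, assuming the address is the first field:

awk '{print $1}' access.log | sort -k 1 | uniq -c | sort -n -r | head -n 5

Here, though, the job calls for joins rather than counts.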
grep -r "\"table\":" projects/*/reports/* | ...
projects/project1/reports/run1/report1.json:    "table": "table1",
projects/project1/reports/run2/report2.json:    "table": "table2",
projects/project2/reports/run3/report3.json:    "table": "table3",
...
Continue the pipeline: keep the file name and the table name, strip the JSON punctuation, and sort for the upcoming join:

... | sed 's/:/ /g' | awk '{print $1 " " $3}' | sed 's/[\r\n",:]//g' | sort -k 1b,1 | uniq > report_tables
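One sample line through that pipe (the input is assumed to look like the grep output above):

echo 'projects/project1/reports/run1/report1.json:    "table": "table1",' \
  | sed 's/:/ /g' | awk '{print $1 " " $3}' | sed 's/[\r\n",:]//g'
# -> projects/project1/reports/run1/report1.json table1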
grep -r "\"created_date\":" projects/*/reports/* | sed 's/:/ /g' | ... ... | awk '{print $1 " " $3"T"$4":"$5":"$6}' | sed 's/[\r\n",:]//g' | ... ... | sort -k 1b,1 | uniq > report_dates
Join the two on the file name, fusing the file name and the table name into one field with a # so that later joins can key on the pair:

join report_tables report_dates | awk '{print $1"#"$2 " " $3}' | sort -k 1b,1 > report_table_date
projects/project1/reports/run1/report1.json#table1 2017-08-07T070918.024907
projects/project1/reports/run1/report1.json#table2 2017-08-07T070918.024907
projects/project1/reports/run1/report1.json#table3 2017-08-07T070918.024907
...
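A side note on all the sort -k 1b,1 calls: join(1) wants both inputs sorted the same way on the join field, and 1b,1 restricts the sort key to the first field while ignoring leading blanks (sorting whole lines can produce a different order). A toy example:

printf 'k2 b\nk1 a\n' | sort -k 1b,1 > left
printf 'k1 x\nk2 y\n' | sort -k 1b,1 > right
join left right
# -> k1 a x
#    k2 b y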
grep -r "schema\":" projects/*/conf/* | sed 's/:/ /g' | ... ... | awk '{print $3 " " $1}' | sed 's/[\r\n":,]//g' | ... ... | sort -k 1b,1 | uniq > schema_configs
schema1 projects/project1/conf/run1.conf
schema1 projects/project1/conf/run2.conf
schema2 projects/project2/conf/run1.conf
...
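Again, one assumed line through that pipe:

echo 'projects/project1/conf/run1.conf: "schema": "schema1",' \
  | sed 's/:/ /g' | awk '{print $3 " " $1}' | sed 's/[\r\n":,]//g'
# -> schema1 projects/project1/conf/run1.conf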
Next we need the mapping from each config to its report files. Take the config paths:

cat schema_configs | awk '{print $2}' | sort | uniq | ...

Tacking on xargs -n1 find ... will not do here: that would lose the config path itself, and we still need it on every output line. So we iterate in a loop instead. Put down the pipe and go:

... | while read line; do <statements>; done | sort -k 1b,1 > config_reports

where the statements are:
dir=$(dirname $line); dir2=$(dirname $dir)
run=$(echo $line | sed "s/.*\///" | sed 's/\.conf//g')
reps=$(find $dir2/reports/$run/ -name '*.json')
for r in $reps; do echo $line $r; done
dirname cuts a file path off at the last slash, so applying it twice climbs two levels up from the config file ($dir2 is the project directory). The next expression, run=..., pulls the run.conf file name out of $line and trims the extension, leaving the name of the launch configuration. reps then collects the report files for that config, and the loop prints the config file $line next to each report file $r. Re-sort and write config_reports.
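Put together, the whole step reads like this (an equivalent sketch; basename replaces the two seds):

cat schema_configs | awk '{print $2}' | sort | uniq \
  | while read line; do
      dir2=$(dirname $(dirname $line))   # up from conf/runN.conf to the project dir
      run=$(basename $line .conf)        # launch configuration name
      for r in $(find $dir2/reports/$run/ -name '*.json'); do
        echo $line $r                    # config path next to report path
      done
    done | sort -k 1b,1 > config_reports

The result: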
projects/project1/conf/run1.conf projects/project1/reports/run1/report1.json
projects/project1/conf/run1.conf projects/project1/reports/run1/report2.json
projects/project1/conf/run2.conf projects/project1/reports/run2/report3.json
...
Time to ask HDFS when the tables actually changed:

cat schema_configs | awk '{print $1}' | sort | uniq | sed 's/^/path_in_hive/g' | sed 's/$/\.db/g' | xargs -n1 -I dr hdfs dfs -ls dr | sed 's/\// /g' | sed 's/\.db//g' | awk '{print $12 " " $13 " " $6"T"$7}' | sed 's/://g' | sort -k 1b,1 | uniq > schema_tables
We take schema_configs and pull the unique schemas out of it, then with sed prepend the path to the Hive warehouse and append the .db suffix. For each such line we run hdfs dfs -ls: it shows all the tables in the given database together with the dates of their last change. We swap every slash for a space, pull out the database name, the table name and the change date, and strip the colons from the time so that the format agrees with report_dates (where, remember, the character class ate the colons; otherwise the string comparison at the end would misfire around hour boundaries). Re-sort and write schema_tables.
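A hedged sample of what one hdfs dfs -ls line goes through (the warehouse path /path/in/hive/store is made up; with a different path depth the database and table land in fields other than $12 and $13, so adjust the awk accordingly):

echo 'drwxrwxr-x - hive hive 0 2017-08-07 08:15 /path/in/hive/store/schema1.db/table1' \
  | sed 's/\// /g' | sed 's/\.db//g' | awk '{print $12 " " $13 " " $6"T"$7}' | sed 's/://g'
# -> schema1 table1 2017-08-07T0815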
# configs - tables
join schema_configs schema_tables | awk '{print $2 " " $3 " " $4}' | sort -k 1b,1 | uniq > config_tables

# reports - tables - hive dates
join config_reports config_tables | awk '{print $2"#"$3 " " $4}' | sort -k 1b,1 > report_table_hive_dates

# final!
join report_table_date report_table_hive_dates | sed 's/#/ /g' | awk '{if ($3<$4) print $1}' | sort | uniq > outdated_reports
We join schema_configs and schema_tables on the database field and get config_tables: the config, the table, and the date it was last modified. Then we join config_reports and config_tables to finally get the correspondence between a report and a table in Hive; the report file name and the table name are again fused into one field with a #. The final touch: join report_table_date and report_table_hive_dates, split the report file name and the table name back apart with a space, and print those reports whose creation date is less than the table's change date; dedupe, and the work is done.
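A quick sanity check of that last comparison on one made-up joined line (both dates colon-free, so the plain string comparison is safe):

echo 'report1.json#table1 2017-08-07T070918.024907 2017-08-07T0815' \
  | sed 's/#/ /g' | awk '{if ($3<$4) print $1}'
# -> report1.json  (the table changed after the report was built)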
Whatever renders the reports can now look into the outdated_reports file and show a "Report is outdated" banner for the report (or not).

Source: https://habr.com/ru/post/343758/