jl-sql: SQL queries by JSON logs on the command line

Nobody is interested in the introduction, so I'll start right away with examples of using

% cat log.json

 {"type": "hit", "client": {"ip": "127.1.2.3"}} {"type": "hit", "client": {"ip": "127.2.3.4"}} {"type": "hit", "client": {"ip": "127.3.4.5"}} {"type": "hit", "client": {"ip": "127.3.4.5"}} {"type": "hit", "client": {"ip": "127.1.2.3"}} {"type": "click", "client": {"ip": "127.1.2.3"}} {"type": "click", "client": {"ip": "127.2.3.4"}}

Execute the query:

 % cat log.json | jl-sql 'SELECT client.ip, COUNT(*) AS count WHERE type = "hit" GROUP BY client.ip'

 {"client":{"ip":"127.1.2.3"},"count":2} {"client":{"ip":"127.2.3.4"},"count":1} {"client":{"ip":"127.3.4.5"},"count":2}

Short description

As you can see, to execute the query, the jl-sql utility is used, which accepts a stream of JSON objects separated by a newline character ( "\n" ) as input.

It is worth noting here that the utility is based on the jl-sql-api library, on the basis of which you can quite simply implement data processing in an arbitrary format, not limited only to JSON.

The SQL engine supports WHERE , GROUP BY , HAVING , ORDER BY and {LEFT|INNER} JOIN . As for JOIN , there is a restriction on the expression in ON : support for joins is implemented only by the exact correspondence of two fields, the so-called Equi Join :

 SELECT id, @town.name AS townName JOIN `towns.json` ON @towns.id = townId

Examples

Definition of unique ip-addresses

 % cat log.json | jl-sql 'SELECT DISTINCT client.ip AS ip'

 {"ip":"127.1.2.3"} {"ip":"127.2.3.4"} {"ip":"127.3.4.5"}

Counting the number of unique addresses for each group

 % cat log.json | jl-sql 'SELECT type, COUNT(DISTINCT client.ip) AS ipCount GROUP BY type'

 {"type":"click","ipCount":2} {"type":"hit","ipCount":3}

Rebuilding facilities

With the help of alias ( AS ), you can not only assign aliases to the fields, but also create complex structures within the object:

 % echo '{"a": 1, "b": 2}' | jl-sql 'SELECT a AS sub.bar.first, b AS sub.foo.second'

 {"sub":{"bar":{"first":1},"foo":{"second":2}}}

Delete and modify

In addition to SELECT , DELETE and UPDATE are also supported.

 % cat log.json | jl-sql -b :ip=127.1.2.3 'DELETE WHERE client.ip = :ip'

 % cat log.json | jl-sql -b :ip=127.1.2.3 'UPDATE SET banned = true WHERE client.ip = :ip'

These examples also show the use of bindings (the -b / --bind option), more in detail about them is described in the corresponding section .

Work with dates

 % echo '{"ts": "2017-01-18 00:00:00"}' | jl-sql 'SELECT ts - INTERVAL 2 MONTH 3 DAY 4 HOUR 5 MINUTE 6 SECOND AS past'

The keyword INTERVAL allows you to "add" and "take away" periods of time from a certain date. All this is similar to using INTERVAL in MySQL.

Since JSON does not provide a separate data type for dates, strings are used to save them. The utility understands formats from RFC2822 or ISO 8601 . Others may be used, but the results will be unpredictable.

It should be noted that the system time zone is used for parsing and date manipulation. If this does not suit you, then you can set the TZ environment variable with the desired time zone before running jl-sql .

A more detailed description of working with dates can be found in the documentation on GitHub .

`JOIN`

For JOIN , at least one more data source is required, in terms of SQL, this is called a “table”, as such sources are regular files:

 % cat banned.json

 {"ip": "127.1.2.3"}

 % cat log.json | jl-sql -b :banned=banned.json 'SELECT * INNER JOIN {:banned} ON @banned.ip = client.ip'

 {"type":"hit","client":{"ip":"127.1.2.3"}} {"type":"hit","client":{"ip":"127.1.2.3"}} {"type":"click","client":{"ip":"127.1.2.3"}}

In this example, a new concept was introduced - the data source ( @banned ), more details about this and JOIN in general can be found in the corresponding section .

Productivity and consumed resources

The utility is designed for processing large logs. Depending on the type, the request can be executed either in streaming mode or in the mode of using temporary storage . A huge advantage of streaming mode is the ability to perform jl-sql in conjunction, for example, with tail -f for filtering and reformatting logs in real time.

streaming mode - a mode in which the request processes incoming data immediately, without waiting for the end of the incoming stream. The computational complexity of such processing is O(n) by CPU and O(1) by memory. In this mode, only the simplest queries that do not require sorting can be executed: SELECT <fields...> , SELECT ... WHERE expression .
the mode of using temporary storage - to fulfill the request, the incoming stream must be sorted. This means that you may need temporary storage in the FS. This category includes queries with GROUP BY , ORDER BY and JOIN . The sizes of the internal buffers are given by the -B and -S options (see the description of the options ).

For sorting large volumes of jl-sql uses the system utility sort , which allows the use of more productive native sorting.

Installation

The code is written in JavaScript under Node.js, so the easiest way to install the package is through npm :

 # npm install -g jl-sql

Attention: Node.js version is required not less than 6.0.0

Dirty details

Sorting

Sorting can be set by the standard ORDER BY expression [{DESC|ASC}] , the sorting is supported by several fields at once, including multidirectional ones.

The unstructured data at the input introduces its own corrections: the field by which sorting is performed may be absent in some objects or may have different types of data in different objects. Therefore, it is advisable to always explicitly specify the data type with the STRING() and NUMBER() functions:

Sort by rows

 ORDER BY STRING(field)

Sort by numbers

 ORDER BY NUMBER(field)

If you do not explicitly specify the type of sorting, the utility will try to determine the type according to the rules described here . If the type could not be determined, then the sorting will occur in rows.

Values are converted to a string according to the following rules:

Data type	String representation
line	source line
number	decimal string
boolean	`"true"` / `"false"`
null	`"null"`
missing field	`""` (empty string)
an object	N / A ^*
array	N / A ^*

^* - conversion of strings and arrays to a string is not specified, so relying on the order of these values in the total set should not be.

I recommend always explicitly setting the data type. current default behavior can be changed or completely removed as potentially dangerous.

Binding

In order to solve the problem of data screening on the command line, the utility allows you to use binders (substitutions):

 jl-sql -b :type=hit jl-sql 'SELECT client.ip, COUNT(*) AS count WHERE type = :type GROUP BY client.ip'

Here the option -b :type=hit specifies a substitution with the name :type and the value "hit" . This makes it possible to use conventional shell-level screening for SQL queries.

Similarly, you can use substitutions for the file names used in the JOIN :

 jl-sql -b :towns=towns.json 'SELECT id, @town.name AS townName JOIN {:towns} ON @towns.id = townId'

This will allow the use of autocomplete file names in your favorite shell.

More specifically about JOIN can be read in the appropriate section JOIN .

`JOIN`

JOIN support required an introduction to the syntax of a new entity - the name of the data source (you can call it a "table"). The problem here is that in "classic" SQL you can always determine where in the ident1.ident2 identifier ident1.ident2 name of the table, and where is the name of the field. With JSON, everything is more complicated: objects can have different nesting, so without special syntax it is impossible to say exactly what the user had in mind when he entered ident1.ident2 because ident1 can be either a field name or a "table" name.

To resolve this ambiguity, the @ident syntax was @ident , where @ is the prefix indicating that the identifier next to it is the name of the "table".

Let's return to the example from the beginning of the article and analyze it in more detail:

 % cat log.json | jl-sql -b :banned=banned.json 'SELECT * INNER JOIN {:banned} ON @banned.ip = client.ip'

So let's start from the beginning:

-b :banned=banned.json - create a binding with the name of the source file, this is not an obligatory step, but it allows your shell to auto-complete petey, and also eliminates the need to escape special characters in the path

INNER JOIN {:banned} - here {:banned} is a special syntax for substitution of binding in JOIN . Without the use of binders, this line would look like INNER JOIN `banned.json` . The use of back quotes is mandatory here. otherwise, the dot ( . ) will be interpreted in a special way.

@banned.ip - here @banned is the name of the table. In this case, the name was derived automatically from the name of the binding, but you can explicitly set it via alias: INNER JOIN {:banned} AS @someName , then the call to this table will occur as @someName.ip

I recommend to always use binders, both for files and for regular data. This eliminates many problems.

Currently only two types of JOIN are INNER JOIN : INNER JOIN and LEFT JOIN . If you do not specify a JOIN type in the query, then the INNER JOIN will be used.

Kau has already been said earlier, the expression in ON must have the form ON @table.primary = foreign , that is, the exact correspondence of the field @table.primary (the table to be connected) to the foreign field (the main table).

Comparison operators

For convenience, the operator = (and its alias == ) does not take into account the type of the value and tries to compare the values as liberal as possible, therefore the value of the expression 1 = "1" will be true . In contrast, the === operator also takes into account the type when comparing, so the expression 1 === "1" will already be false . Comparison rules are similar to those used in JavaScript, more details can be found here .

Comparing Objects and Arrays

The behavior of the operators = and === not defined for arrays and objects, so this comparison should be avoided at the moment. In the future, the comparison of objects and arrays will be implemented.

Command line options

 % jl-sql -h

 Usage: jl-sql [OPTIONS] SQL OPTIONS: -h, --help show this help -I, --ignore-json-error ignore broken JSON -v, --verbose display additional information -B, --sort-in-memory-buffer-length=ROWS save up to ROWS rows for in-memory sort -S, --sort-external-buffer-size=SIZE use SIZE bytes for `sort` memory buffer -T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp -b, --bind=BIND=VALUE+ bind valiable See full documentation at https://github.com/avz/jl-sql

More detail on interesting options

-I : ignore errors in JSON. By default, jl-sql fails (non-zero return codes) after detecting the first error in JSON. This option allows you to override this behavior and simply ignore such errors. The warning output in stderr remains
-B : set a limit on the number of lines / objects that will be sorted in RAM without using external sorting through the system utility sort
-S : set a limit on the amount of RAM used by the sort utility as a buffer (see the description of the -S option in man sort )
-T : directory to place temporary sorting files

jl-sql: SQL queries by JSON logs on the command line

Short description

Examples

Definition of unique ip-addresses

Counting the number of unique addresses for each group

Rebuilding facilities

Delete and modify

Work with dates

`JOIN`

Productivity and consumed resources

Installation

Dirty details

Sorting

Binding

`JOIN`

Comparison operators

Comparing Objects and Arrays

Command line options

Links

More articles: