Nobody is interested in the introduction, so I'll start right away with examples of using
% cat log.json
{"type": "hit", "client": {"ip": "127.1.2.3"}} {"type": "hit", "client": {"ip": "127.2.3.4"}} {"type": "hit", "client": {"ip": "127.3.4.5"}} {"type": "hit", "client": {"ip": "127.3.4.5"}} {"type": "hit", "client": {"ip": "127.1.2.3"}} {"type": "click", "client": {"ip": "127.1.2.3"}} {"type": "click", "client": {"ip": "127.2.3.4"}}
Execute the query:
% cat log.json | jl-sql 'SELECT client.ip, COUNT(*) AS count WHERE type = "hit" GROUP BY client.ip'
{"client":{"ip":"127.1.2.3"},"count":2} {"client":{"ip":"127.2.3.4"},"count":1} {"client":{"ip":"127.3.4.5"},"count":2}
As you can see, to execute the query, the jl-sql utility is used, which accepts a stream of JSON objects separated by a newline character ( "\n"
) as input.
It is worth noting here that the utility is based on the jl-sql-api library, on the basis of which you can quite simply implement data processing in an arbitrary format, not limited only to JSON.
The SQL engine supports WHERE
, GROUP BY
, HAVING
, ORDER BY
and {LEFT|INNER} JOIN
. As for JOIN
, there is a restriction on the expression in ON
: support for joins is implemented only by the exact correspondence of two fields, the so-called Equi Join :
SELECT id, @town.name AS townName JOIN `towns.json` ON @towns.id = townId
% cat log.json | jl-sql 'SELECT DISTINCT client.ip AS ip'
{"ip":"127.1.2.3"} {"ip":"127.2.3.4"} {"ip":"127.3.4.5"}
% cat log.json | jl-sql 'SELECT type, COUNT(DISTINCT client.ip) AS ipCount GROUP BY type'
{"type":"click","ipCount":2} {"type":"hit","ipCount":3}
With the help of alias ( AS
), you can not only assign aliases to the fields, but also create complex structures within the object:
% echo '{"a": 1, "b": 2}' | jl-sql 'SELECT a AS sub.bar.first, b AS sub.foo.second'
{"sub":{"bar":{"first":1},"foo":{"second":2}}}
In addition to SELECT
, DELETE
and UPDATE
are also supported.
% cat log.json | jl-sql -b :ip=127.1.2.3 'DELETE WHERE client.ip = :ip'
% cat log.json | jl-sql -b :ip=127.1.2.3 'UPDATE SET banned = true WHERE client.ip = :ip'
These examples also show the use of bindings (the -b
/ --bind
option), more in detail about them is described in the corresponding section .
% echo '{"ts": "2017-01-18 00:00:00"}' | jl-sql 'SELECT ts - INTERVAL 2 MONTH 3 DAY 4 HOUR 5 MINUTE 6 SECOND AS past'
The keyword INTERVAL
allows you to "add" and "take away" periods of time from a certain date. All this is similar to using INTERVAL
in MySQL.
Since JSON does not provide a separate data type for dates, strings are used to save them. The utility understands formats from RFC2822 or ISO 8601 . Others may be used, but the results will be unpredictable.
It should be noted that the system time zone is used for parsing and date manipulation. If this does not suit you, then you can set the TZ
environment variable with the desired time zone before running jl-sql
.
A more detailed description of working with dates can be found in the documentation on GitHub .
JOIN
For JOIN
, at least one more data source is required, in terms of SQL, this is called a “table”, as such sources are regular files:
% cat banned.json
{"ip": "127.1.2.3"}
% cat log.json | jl-sql -b :banned=banned.json 'SELECT * INNER JOIN {:banned} ON @banned.ip = client.ip'
{"type":"hit","client":{"ip":"127.1.2.3"}} {"type":"hit","client":{"ip":"127.1.2.3"}} {"type":"click","client":{"ip":"127.1.2.3"}}
In this example, a new concept was introduced - the data source ( @banned
), more details about this and JOIN in general can be found in the corresponding section .
The utility is designed for processing large logs. Depending on the type, the request can be executed either in streaming mode or in the mode of using temporary storage . A huge advantage of streaming mode is the ability to perform jl-sql in conjunction, for example, with tail -f
for filtering and reformatting logs in real time.
O(n)
by CPU and O(1)
by memory. In this mode, only the simplest queries that do not require sorting can be executed: SELECT <fields...>
, SELECT ... WHERE expression
.GROUP BY
, ORDER BY
and JOIN
. The sizes of the internal buffers are given by the -B
and -S
options (see the description of the options ).For sorting large volumes of jl-sql uses the system utility sort
, which allows the use of more productive native sorting.
The code is written in JavaScript under Node.js, so the easiest way to install the package is through
npm
:
# npm install -g jl-sql
Attention: Node.js version is required not less than 6.0.0
Sorting can be set by the standard ORDER BY expression [{DESC|ASC}]
, the sorting is supported by several fields at once, including multidirectional ones.
The unstructured data at the input introduces its own corrections: the field by which sorting is performed may be absent in some objects or may have different types of data in different objects. Therefore, it is advisable to always explicitly specify the data type with the STRING()
and NUMBER()
functions:
Sort by rows
ORDER BY STRING(field)
Sort by numbers
ORDER BY NUMBER(field)
If you do not explicitly specify the type of sorting, the utility will try to determine the type according to the rules described here . If the type could not be determined, then the sorting will occur in rows.
Values are converted to a string according to the following rules:
Data type | String representation |
---|---|
line | source line |
number | decimal string |
boolean | "true" / "false" |
null | "null" |
missing field | "" (empty string) |
an object | N / A * |
array | N / A * |
* - conversion of strings and arrays to a string is not specified, so relying on the order of these values in the total set should not be.
I recommend always explicitly setting the data type. current default behavior can be changed or completely removed as potentially dangerous.
In order to solve the problem of data screening on the command line, the utility allows you to use binders (substitutions):
jl-sql -b :type=hit jl-sql 'SELECT client.ip, COUNT(*) AS count WHERE type = :type GROUP BY client.ip'
Here the option -b :type=hit
specifies a substitution with the name :type
and the value "hit"
. This makes it possible to use conventional shell-level screening for SQL queries.
Similarly, you can use substitutions for the file names used in the JOIN
:
jl-sql -b :towns=towns.json 'SELECT id, @town.name AS townName JOIN {:towns} ON @towns.id = townId'
This will allow the use of autocomplete file names in your favorite shell.
More specifically about JOIN
can be read in the appropriate section JOIN .
JOIN
JOIN
support required an introduction to the syntax of a new entity - the name of the data source (you can call it a "table"). The problem here is that in "classic" SQL you can always determine where in the ident1.ident2
identifier ident1.ident2
name of the table, and where is the name of the field. With JSON, everything is more complicated: objects can have different nesting, so without special syntax it is impossible to say exactly what the user had in mind when he entered ident1.ident2
because ident1
can be either a field name or a "table" name.
To resolve this ambiguity, the @ident
syntax was @ident
, where @
is the prefix indicating that the identifier next to it is the name of the "table".
Let's return to the example from the beginning of the article and analyze it in more detail:
% cat log.json | jl-sql -b :banned=banned.json 'SELECT * INNER JOIN {:banned} ON @banned.ip = client.ip'
So let's start from the beginning:
-b :banned=banned.json
- create a binding with the name of the source file, this is not an obligatory step, but it allows your shell to auto-complete petey, and also eliminates the need to escape special characters in the path
INNER JOIN {:banned}
- here {:banned}
is a special syntax for substitution of binding in JOIN
. Without the use of binders, this line would look like INNER JOIN `banned.json`
. The use of back quotes is mandatory here. otherwise, the dot ( .
) will be interpreted in a special way.
@banned.ip
- here @banned
is the name of the table. In this case, the name was derived automatically from the name of the binding, but you can explicitly set it via alias: INNER JOIN {:banned} AS @someName
, then the call to this table will occur as @someName.ip
I recommend to always use binders, both for files and for regular data. This eliminates many problems.
Currently only two types of JOIN are INNER JOIN
: INNER JOIN
and LEFT JOIN
. If you do not specify a JOIN type in the query, then the INNER JOIN
will be used.
Kau has already been said earlier, the expression in ON
must have the form ON @table.primary = foreign
, that is, the exact correspondence of the field @table.primary
(the table to be connected) to the foreign
field (the main table).
For convenience, the operator =
(and its alias ==
) does not take into account the type of the value and tries to compare the values as liberal as possible, therefore the value of the expression 1 = "1"
will be true
. In contrast, the ===
operator also takes into account the type when comparing, so the expression 1 === "1"
will already be false
. Comparison rules are similar to those used in JavaScript, more details can be found here .
The behavior of the operators =
and ===
not defined for arrays and objects, so this comparison should be avoided at the moment. In the future, the comparison of objects and arrays will be implemented.
% jl-sql -h
Usage: jl-sql [OPTIONS] SQL OPTIONS: -h, --help show this help -I, --ignore-json-error ignore broken JSON -v, --verbose display additional information -B, --sort-in-memory-buffer-length=ROWS save up to ROWS rows for in-memory sort -S, --sort-external-buffer-size=SIZE use SIZE bytes for `sort` memory buffer -T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp -b, --bind=BIND=VALUE+ bind valiable See full documentation at https://github.com/avz/jl-sql
More detail on interesting options
-I
: ignore errors in JSON. By default, jl-sql
fails (non-zero return codes) after detecting the first error in JSON. This option allows you to override this behavior and simply ignore such errors. The warning output in stderr remains-B
: set a limit on the number of lines / objects that will be sorted in RAM without using external sorting through the system utility sort
-S
: set a limit on the amount of RAM used by the sort
utility as a buffer (see the description of the -S
option in man sort
)-T
: directory to place temporary sorting filesThanks for attention.
Source: https://habr.com/ru/post/319722/
All Articles