How to work effectively with JSON in R?
This article is a continuation of previous publications.
As a rule, the main source of data in JSON format will be a REST API. Besides platform independence and human readability, JSON lets systems exchange unstructured data with a complex tree structure.
This is very convenient when building APIs: it is easy to version communication protocols and easy to keep the information exchange flexible. At the same time, the complexity of the data structure (nesting can run 5, 6, 10 or even more levels deep) is not frightening, since writing a flexible parser for each individual record is not that difficult.
Data processing tasks also include obtaining data from external sources, including in JSON format. R has a good set of packages, in particular jsonlite, designed to convert JSON into R objects (a list, or a data.frame if the data structure allows).
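A minimal sketch of that conversion (toy JSON strings, field names invented for illustration): a uniform array becomes a data.frame, while a ragged tree falls back to nested lists.

```r
library(jsonlite)

# A toy JSON array of uniform objects: fromJSON() can flatten it
# straight into a data.frame.
j_flat <- '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'
df <- fromJSON(j_flat)
str(df)

# A toy JSON object with ragged nesting: fromJSON() falls back to
# nested lists, since no rectangular shape exists.
j_tree <- '{"root": {"leaf": 1, "branch": {"leaf": [1, 2, 3]}}}'
l <- fromJSON(j_tree)
str(l)
```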
However, in practice two classes of tasks regularly arise for which jsonlite and packages like it become extremely inefficient: parsing one huge JSON download, and consolidating many small JSON responses into a single data.frame. An example of such a structure is shown in the illustrations.
Why are these classes of tasks problematic?
As a rule, a download from an information system in JSON format is one indivisible block of data. To parse it correctly, you have to read all of it and traverse its entire volume.
Induced problems: the entire volume has to be read and traversed before any of it can be turned into a data.frame.
Similar tasks arise, for example, when reference data needed by a business process must be collected through a batch of API requests. Such reference tables are also expected to be unified and ready for embedding into an analytical pipeline, with potential loading into a database, and that again leads to the need to transform the collected data into a data.frame.
Induced problems: NULL objects that are perfectly valid inside lists but cannot "fit" into a data.frame. They complicate postprocessing and even the basic merging of individual rows into a data.frame, no matter which tool is used (rbindlist, bind_rows, map_dfr or rbind).
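A minimal sketch of that NULL problem (toy records, hypothetical fields): a NULL that is legitimate inside a parsed list has no cell value in a data.frame, so ragged records must be reconciled by hand before row-binding.

```r
library(jsonlite)

# Two toy API responses; the second has "price": null.
recs <- lapply(
  c('{"id": 1, "price": 10}', '{"id": 2, "price": null}'),
  fromJSON
)

# Inside a parsed list, NULL is a legitimate value...
str(recs[[2]])

# ...but there is no data.frame cell it can map to: flattening simply
# loses the field, so the two records no longer share a shape.
names(unlist(recs[[1]]))   # "id" "price"
names(unlist(recs[[2]]))   # "id"

# Row-binding such ragged records is fragile regardless of the tool,
# so NULLs are first replaced with NA to restore a common shape.
recs_fixed <- lapply(recs, function(r) {
  idx <- vapply(r, is.null, logical(1))
  r[idx] <- NA
  as.data.frame(r)
})
df <- do.call(rbind, recs_fixed)
df
```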
jq: a way out

In particularly difficult situations, the very convenient jsonlite approach of "convert everything to R objects" fails badly for the reasons given above. It is good if processing can be completed at all; it is worse when you have to throw up your hands and give up halfway through.
An alternative to this approach is to use a JSON preprocessor that operates directly on data in JSON format: the jq library and its R wrapper jqr. Practice shows that it is not just rarely used; very few people have heard of it at all, which is a great pity.
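A minimal first taste (toy JSON, hypothetical field names; the filter syntax is standard jq): jqr::jq() applies a jq filter to a JSON string and returns JSON, so all pruning happens before anything touches R objects.

```r
library(jqr)
library(jsonlite)

j <- '[{"id": 1, "junk": {"a": 1}}, {"id": 2, "junk": {"a": 2}}]'

# Drop the unwanted subtree while the data is still JSON...
slim <- jq(j, 'del(.[].junk)')

# ...and only then hand the simplified JSON to jsonlite.
df <- fromJSON(as.character(slim))
df
```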
The advantage of the jq library: all structural manipulations are carried out at the JSON level, leaving only the final conversion into a data.frame using jsonlite.

The processing code shrinks to fit on one screen and may look something like this:
```r
cont <- httr::content(r3, as = "text", encoding = "UTF-8")

m <- cont %>%
  # drop movie fields that are not needed downstream
  jqr::jq('del(.[].movie.rating, .[].movie.genres, .[].movie.trailers)') %>%
  jqr::jq('del(.[].movie.countries, .[].movie.images)') %>%
  # drop schedule fields
  jqr::jq('del(.[].schedules[].hall, .[].schedules[].language, .[].schedules[].subtitle)') %>%
  # drop cinema fields
  jqr::jq('del(.[].cinema.location, .[].cinema.photo, .[].cinema.phones)') %>%
  jqr::jq('del(.[].cinema.goodies, .[].cinema.subway_stations)')

# unnest: one record per schedule entry
m2 <- m %>%
  jqr::jq('[.[] | {date, movie, schedule: .schedules[], cinema}]')

df <- fromJSON(m2) %>%
  as_tibble()
```
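The same pattern on a self-contained toy payload (every field name here is invented for illustration): prune with del(), unnest the inner array into one record per element, and convert to a rectangular structure only once, at the very end.

```r
library(jqr)
library(jsonlite)

# A toy payload shaped roughly like the real API response.
cont <- '[{"date": "2019-04-01",
           "movie": {"title": "A", "rating": 7.1},
           "schedules": [{"time": "10:00"}, {"time": "12:00"}],
           "cinema": {"name": "X", "phones": ["123"]}}]'

# Prune unwanted subtrees while the data is still JSON.
m <- jq(cont, 'del(.[].movie.rating)')
m <- jq(m, 'del(.[].cinema.phones)')

# Unnest: one output record per element of the schedules array.
m2 <- jq(m, '[.[] | {date, movie, schedule: .schedules[], cinema}]')

# A single conversion at the end; flatten = TRUE expands the nested
# objects into columns like movie.title and schedule.time.
df <- fromJSON(as.character(m2), flatten = TRUE)
df
```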
jq is very elegant and fast! If this is relevant to you: download it, install it, get to grips with it. It speeds up processing and simplifies life for you and your colleagues.
Previous publication: "How to start using R in Enterprise. An example of a practical approach".
Source: https://habr.com/ru/post/448950/