
Moving a data feed: loading it into a relational database

Using the R programming language together with PostgreSQL, a free relational database management system, can significantly speed up and simplify loading data into the database.



Structuring files


Before starting to load data into PostgreSQL, you should sort the files by type into different directories. R makes such OS-level operations quite simple:

#### 1. Set directory to FTP folder where files arrive from Adobe
## Has ~2000 files in it from 2 years of data
setwd("~/Downloads/datafeed/")

#### 2. Sort files into three separate folders
## Note: list.files() expects a regular expression, not a glob

## Manifests - plain text files
if(!dir.exists("manifest")){
  dir.create("manifest")
  lapply(list.files(pattern = "\\.txt$"),
         function(x) file.rename(x, paste("manifest", x, sep = "/")))
}

## Server calls - .tsv.gz
if(!dir.exists("servercalls")){
  dir.create("servercalls")
  lapply(list.files(pattern = "\\.tsv\\.gz$"),
         function(x) file.rename(x, paste("servercalls", x, sep = "/")))
}

## Lookup files - .tar.gz
if(!dir.exists("lookup")){
  dir.create("lookup")
  lapply(list.files(pattern = "\\.tar\\.gz$"),
         function(x) file.rename(x, paste("lookup", x, sep = "/")))
}


With a larger number of file types, it would be better to wrap this logic in a function instead of repeating the code for each type. The idea stays the same either way: check whether the directory exists, create it if it does not, and then move the matching files into it.
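As a sketch of that generalization, the three near-identical blocks above can be collapsed into one helper; the function name `sort_by_type` is an assumption for illustration, not part of the original scripts:

```r
# Hypothetical helper generalizing the per-type blocks: create the target
# directory if needed, then move every file matching the pattern into it
sort_by_type <- function(pattern, dirname) {
  if (!dir.exists(dirname)) {
    dir.create(dirname)
  }
  lapply(list.files(pattern = pattern),
         function(x) file.rename(x, file.path(dirname, x)))
  invisible(NULL)
}

# One call per file type instead of three copied blocks
sort_by_type("\\.txt$",      "manifest")
sort_by_type("\\.tsv\\.gz$", "servercalls")
sort_by_type("\\.tar\\.gz$", "lookup")
```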

Connecting and loading data into PostgreSQL via R


Once the files are structured, you can start loading them into PostgreSQL using the RPostgreSQL R package. RPostgreSQL is DBI-compliant, so the connection code looks the same for any supported database engine. When loading the data, it is almost always necessary to force a text format (the `colClasses = "character"` argument in R). This need arises from the constant changes and extensions Adobe Analytics (an enterprise-level suite for collecting and working with site statistics) makes to its data feed; a text column format prevents data loss.

library(RPostgreSQL)

# Connect to database
conn = dbConnect(dbDriver("PostgreSQL"), user="postgres", password="",
                 host="localhost", port=5432, dbname="adobe")

# Set directory to avoid having to use paste to build paths
setwd("~/Downloads/datafeed/servercalls")

# Read column headers for server calls
column_headers <- read.delim("~/Downloads/datafeed/lookup/column_headers.tsv",
                             stringsAsFactors=FALSE)

# Loop over entire list of files.
# Setting colClasses to "character" is the only way to guarantee all data loads:
# file formats or implementations can change over time; fix the schema in the
# database after the data is loaded
for(file in list.files()){
  print(file)
  df <- read.csv2(file, sep = "\t", header = FALSE,
                  stringsAsFactors = FALSE, colClasses = "character")
  dbWriteTable(conn, name = 'servercalls', value = df,
               row.names = FALSE, append = TRUE)
  rm(df)
}

# Run ANALYZE in PostgreSQL so that the query planner has accurate information
dbGetQuery(conn, "analyze servercalls")


With a small amount of code, a table was created, the data was loaded, and PostgreSQL collected the statistics it needs to plan queries efficiently.
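Since everything was loaded as text, the schema can then be tightened inside PostgreSQL, as the comment in the loading loop suggests. A minimal sketch, assuming a `date_time` column exists; the actual column list comes from column_headers.tsv:

```r
library(RPostgreSQL)

conn <- dbConnect(dbDriver("PostgreSQL"), user = "postgres", password = "",
                  host = "localhost", port = 5432, dbname = "adobe")

# Convert a text column to a proper type after the load;
# "date_time" is an illustrative column name, not guaranteed to match your feed
dbGetQuery(conn, "ALTER TABLE servercalls
                    ALTER COLUMN date_time TYPE timestamp
                    USING date_time::timestamp")

dbDisconnect(conn)
```

This keeps the load itself lossless while still giving queries real types to work with.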

Loading lookup tables into PostgreSQL


Along with the server call data loaded into the database, you must also load the lookup tables. Their format is stable, so RPostgreSQL can be left to infer the correct column types.

library(RPostgreSQL)

# Connect to database
conn = dbConnect(dbDriver("PostgreSQL"), user="postgres", password="",
                 host="localhost", port=5432, dbname="adobe")

setwd("~/Downloads/datafeed/lookup/")

# Create a function due to repetitiveness.
# Since we're loading lookup tables with mostly the same values each time,
# record the source file in the table; note that `file` is picked up from
# the enclosing loop below
loadlookup <- function(tblname){
  df <- read.csv2(paste(tblname, ".tsv", sep=""), sep = "\t",
                  header = FALSE, stringsAsFactors = FALSE)
  df$file <- file
  dbWriteTable(conn, name = tblname, value = df,
               row.names = FALSE, append = TRUE)
}

# Untar files, then load each lookup table for that day
for(file in list.files(pattern = "\\.tar\\.gz$")){
  print(file)
  untar(file)
  for(tbl in c("browser_type", "browser", "color_depth", "column_headers",
               "connection_type", "country", "event", "javascript_version",
               "languages", "operating_systems", "plugins", "referrer_type",
               "resolution", "search_engines")){
    loadlookup(tbl)
  }
}
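Once the lookup tables are in place, they can be joined to the server call data in SQL. The column names below are assumptions: with `header = FALSE`, read.csv2 assigns default names (V1, V2, …), and the position of the browser id within servercalls depends on column_headers.tsv, so check the actual schema before running anything like this:

```r
library(RPostgreSQL)

conn <- dbConnect(dbDriver("PostgreSQL"), user = "postgres", password = "",
                  host = "localhost", port = 5432, dbname = "adobe")

# Hypothetical query: resolve numeric browser ids to names.
# "s.browser", "b.v1" and "b.v2" are assumed column names, not verified
top_browsers <- dbGetQuery(conn, "
  SELECT b.v2 AS browser_name, count(*) AS calls
  FROM servercalls s
  JOIN browser b ON s.browser = b.v1
  GROUP BY b.v2
  ORDER BY calls DESC
  LIMIT 10")

dbDisconnect(conn)
```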

Source: https://habr.com/ru/post/278897/

