```
ROWNUM;VEHICLE_NUM;FULL_NAME;BLANK_NUM;VEHICLE_BRAND_MODEL;INN;OGRN
1;"248197";" «-»";"017263";"FORD FOCUS";"7734653292";"1117746207578"
2;"249197";" «-»";"017264";"FORD FOCUS";"7734653292";"1117746207578"
3;"245197";" «-»";"017265";"FORD FOCUS";"7734653292";"1117746207578"
```
url <- "http://data.mos.ru/datasets/download/655" colnames = c("RowNumber", "RegPlate", "LegalName", "DocNum", "Car", "INN", "OGRN", "Void") rawdata <- read.table(url, header = TRUE, sep = ";", colClasses = c("numeric", rep("character",6), NA), col.names = colnames, strip.white = TRUE, blank.lines.skip = TRUE, stringsAsFactors = FALSE, encoding = "UTF-8")
Now you can begin to analyze and visualize the data.

```
library(data.table)

ptn <- "^(.+?) (.+)$"   # regexp pattern to match the first word
dt <- data.table(rawdata)[, list(RegPlate, LegalName, Car, OGRN,
                                 OrgType  = gsub(ptn, "\\1", toupper(LegalName)),
                                 CarBrand = gsub(ptn, "\\1", toupper(Car)))]
rm(rawdata)   # Free some memory
```
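To see what the non-greedy pattern extracts, here is a small illustration on made-up strings (not taken from the data set):

```
# Illustration only: the non-greedy (.+?) captures everything up to the first space
gsub(ptn, "\\1", toupper(c("ooo taxi-city", "mercedes-benz e200")))
## [1] "OOO"           "MERCEDES-BENZ"
# Strings without a space are returned unchanged
```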
```
sort(table(dt$OrgType))
```

```
## 
##     1   392   649 17118 17680
```
The data looks quite plausible: individual entrepreneurs lead in the number of licenses obtained (reducing the tax burden?), followed by limited liability companies, open and closed joint-stock companies, and even one non-commercial partnership. Let's count how many distinct organizations of each type there are.

```
dt[, list(N = length(unique(OGRN))), by = OrgType][order(N, decreasing = TRUE)]
```
```
##    OrgType     N
## 1:         12352
## 2:           563
## 3:            14
## 4:             6
## 5:             1
```
```
sort(unique(dt[grep("^M.*", CarBrand), CarBrand]))
```

```
##  [1] "M214"                 "MASERATI"             "MAZDA"
##  [4] "MAZDA-"               "MERCEDES"             "MERCEDES-BENZ"
##  [7] "MERCEDES-BENZ-"       "MERCEDES-BENZ-S500"   "MERCEDES-BENZC"
## [10] "MERCEDES-BENZE200K"   "MERCEDES-BENZE220CDI" "MERCEDES-BNZ"
## [13] "MERCERDES-BENZ"       "MERCRDES"             "MERCRDES-BENZ"
## [16] "MERSEDES-"            "MERSEDES-BENZ"        "METROCAB"
## [19] "MG"                   "MINI"                 "MITSUBISHI"
```
Unfortunately, the large number of car brands is largely due to errors in the data: the same brand, MERCEDES-BENZ, appears under many different spellings. The data must be cleaned before analysis. The stringdist package measures how similar two strings are:

```
library(stringdist)

1 - stringdist(c("MERCEDES", "MERSEDES", "MAZDA", "RENAULT", "SAAB"),
               "MERCEDES", method = "jw", p = 0.1)
```

```
## [1] 1.0000 0.9417 0.5950 0.3452 0.0000
```
At first glance, the data-cleaning task looks simple: for each record, pick the most similar value from a reference directory. Unfortunately, this approach does not always work. First, a reference directory may simply not exist (as in the present case). Second, some situations require manual correction even with an accurate directory. For example, from the method's point of view, three makes are equally suitable as a replacement for the incorrect value "BAZ":

```
1 - stringdist("BAZ", c("VAZ", "UAZ", "ZAZ"), method = "jw", p = 0.1)
```

```
## [1] 0.7778 0.7778 0.7778
```
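One way to break such ties, and the idea behind the method below, is to weight the similarity by how often each candidate occurs in the data set. A small sketch with invented frequencies:

```
# Illustration only: weight similarity by (made-up) word frequencies
sim    <- 1 - stringdist("BAZ", c("VAZ", "UAZ", "ZAZ"), method = "jw", p = 0.1)
counts <- c(VAZ = 120, UAZ = 15, ZAZ = 3)   # hypothetical counts, same order as the candidates
names(which.max(sim * counts))              # the most frequent candidate wins
## [1] "VAZ"
```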
Below, a semi-automatic correction method is used, which significantly eases the work of a data-cleansing specialist: corrections are proposed programmatically, and the analyst either accepts them or fixes them by hand. For each candidate replacement, both its string similarity and its frequency in the data set are taken into account; only words whose similarity exceeds a threshold t contribute (more on choosing t later). For each possible value of the car make, a recommended reference value is thus determined from the same data set. The "make - proposed fix" pairs are written to a csv file. After review and corrections, the corrected csv file is loaded back and serves as a dictionary.

```
bestmatch.gen <- function(wc, t = 0) {
  # wc = counts of all base text words
  # t  = threshold: only words with similarity above the threshold count
  bestmatch <- function(a) {
    sim <- 1 - stringdist(toupper(a), toupper(names(wc)), method = "jw", p = 0.1)
    # Compute weights and implicitly cut off everything below the threshold
    weights <- sim * wc * (sim > t)
    # Return the candidate with the maximum combined weight
    names(sort(weights, decreasing = TRUE)[1])
  }
  bestmatch
}
```
The threshold value t is chosen empirically. Here is how the function behaves with the threshold parameter t = 0.7.

```
bm07 <- bestmatch.gen(table(dt$CarBrand), t = 0.7)
s <- c("FORD", "RENO", "MERS", "PEGO")
sapply(s, bm07)
```
```
##            FORD            RENO            MERS            PEGO 
##          "FORD"       "RENAULT" "MERCEDES-BENZ"       "PEUGEOT"
```
At first glance, everything worked wonderfully, but it is too early to rejoice: well-represented brands with similar names can "pull" other, correct names toward themselves.

```
s <- c("HONDA", "CHRYSLER", "VOLVO")
sapply(s, bm07)
```

```
##        HONDA     CHRYSLER        VOLVO 
##    "HYUNDAI"  "CHEVROLET" "VOLKSWAGEN"
```
Let's try to increase the threshold value t.

```
bm09 <- bestmatch.gen(table(dt$CarBrand), t = 0.9)
s <- c("HONDA", "CHRYSLER", "VOLVO")
sapply(s, bm09)
```

```
##      HONDA   CHRYSLER      VOLVO 
##    "HONDA" "CHRYSLER"    "VOLVO"
```
Is everything good now? Nearly. Cutting off dissimilar strings too aggressively makes the algorithm treat some erroneous values as correct. Such errors will have to be fixed manually.

```
s <- c("CEAT", "CVEVROLET")
sapply(s, bm09)
```

```
##        CEAT   CVEVROLET 
##      "CEAT" "CVEVROLET"
```
Now everything is ready to build the dictionary file of unique car-make values. Since the file will have to be edited by hand, it is convenient to include additional fields: whether the proposed replacement differs from the original value (this is not always obvious), how often the make occurs, and a flag that draws attention to a record based on some statistical characteristic of the set. In this case, we want to catch situations in which the algorithm proposes infrequent (presumably erroneous) values as correct.

```
ncb <- table(dt$CarBrand)
scb <- names(ncb)            # Source car brands
acb <- sapply(scb, bm09)     # Auto-generated replacements

cbdict_out <- data.table(ncb)[, list(
    SourceName = scb,
    AutoName   = acb,
    SourceFreq = as.numeric(ncb),
    AutoFreq   = as.numeric(ncb[acb]),
    Action     = ordered(scb == acb, labels = c("CHANGE", "KEEP")),
    DictName   = acb
)]

# Add an alert flag: raised when the suggested word is itself a low-frequency value
cbdict_out <- cbdict_out[, Alert := ordered(
    AutoFreq <= quantile(AutoFreq, probs = 0.05, na.rm = TRUE),
    labels = c("GOOD", "ALERT"))]

write.table(cbdict_out[order(SourceName),
                       list(Alert, Action, SourceName, AutoName,
                            SourceFreq, AutoFreq, DictName)],
            "cbdict_out.txt", sep = ";", quote = TRUE,
            col.names = TRUE, row.names = FALSE, fileEncoding = "UTF-8")
```
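Before editing the csv by hand, it can be convenient to look at the records the algorithm is least sure about. A small sketch (the filter is illustrative, not part of the original listing):

```
# Illustrative check: flagged or changed records, rarest source values first
head(cbdict_out[Alert == "ALERT" | Action == "CHANGE"][order(SourceFreq)], 5)
```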
You need to check and edit the values of the DictName field and save the file as "cbdict_in.txt" for the subsequent load.

```
if (file.exists("cbdict_in.txt")) url <- "cbdict_in.txt" else url <- "cbdict_out.txt"

cbdict_in <- read.table(url, header = TRUE, sep = ";",
                        colClasses = c(rep("character", 4), "numeric", "numeric", "character"),
                        encoding = "UTF-8")
cbdict <- cbdict_in$DictName
names(cbdict) <- cbdict_in$SourceName
```
Now correct the car-make values in the data table.

```
dt[, CarBrand := cbdict[CarBrand]]
dt[is.na(CarBrand), CarBrand := "UNKNOWN"]
```
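The correction works through named-vector indexing: each original value looks up its dictionary entry by name. A small illustration with an invented dictionary:

```
# Illustration only: a tiny made-up dictionary used the same way as cbdict
demo_dict <- c(MERS = "MERCEDES-BENZ", RENO = "RENAULT", FORD = "FORD")
demo_dict[c("FORD", "MERS", "ZAZ")]
##            FORD            MERS            <NA> 
##          "FORD" "MERCEDES-BENZ"              NA 
# Values missing from the dictionary give NA, hence the "UNKNOWN" fallback above
```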
After cleaning, the number of unique car-make values decreased almost twofold.

```
length(unique(dt$CarBrand))
```

```
## [1] 72
```
Now let's see which licensees operate the largest fleets.

```
st <- dt[, list(NumCars = length(RegPlate)), by = list(OGRN, LegalName)]
head(st[order(NumCars, decreasing = TRUE)], 10)
```

```
##              OGRN LegalName NumCars
##  1: 1137746197104        «»     866
##  2: 1037727000893       «-»     751
##  3: 1067746273198       « »     547
##  4: 1037789018849        «»     541
##  5: 1127746010700    «-24 »     406
##  6: 1057748223653        «»     349
##  7: 5067746596297        «»     288
##  8: 1027739272175     «14 »     267
##  9: 1137746133250       « »     255
## 10: 5077746757688        «»     238
```
Unfortunately, the data set in question stores only legal information about licensees, not their trade names. By the organization name and OGRN it is possible to find on the Internet under which brand each taxi fleet operates, but this process is not automatic and is rather laborious. The search results for the largest fleets are collected in the file "top10orgs.csv".

```
top10orgs <- data.table(read.table("top10orgs.csv", header = TRUE, sep = ";",
                                   colClasses = "character", encoding = "UTF-8"))
```
We use the built-in data.table capabilities to JOIN the two tables.

```
setkey(top10orgs, OGRN)
setkey(st, OGRN)
st[top10orgs][order(NumCars, decreasing = TRUE), list(OrgBrand, EasyPhone, NumCars)]
```

```
##     OrgBrand EasyPhone NumCars
##  1: 781 81 82 866
##  2: 956 956 8 956 751
##  3: - 641 11 11 547
##  4: 500 0 500 541
##  5: 24 777 66 24 406
##  6: 777 5 777 349
##  7: 940 88 88 288
##  8: 14 707 2 707 267
##  9: Cabby 21 21 989 255
## 10: 927 11 11 238
```
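For readers unfamiliar with data.table, X[Y] performs a join on the key columns set with setkey(). A minimal illustration with made-up tables:

```
# Illustration only: X[Y] joins on the common key column "id"
X <- data.table(id = c("a", "b", "c"), n = c(10, 20, 30))
Y <- data.table(id = c("b", "c"), label = c("beta", "gamma"))
setkey(X, id)
setkey(Y, id)
X[Y]   # rows of X matching the keys of Y, with Y's extra columns appended
##    id  n label
## 1:  b 20  beta
## 2:  c 30 gamma
```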
Which makes do the different types of organizations prefer? Let's build the top three for each organization type.

```
st <- dt[, list(AGGR = length(RegPlate)), by = list(OrgType, CarBrand)]
# Rank within each organization type (r = 1 marks the most frequent make)
st.r <- st[, list(CarBrand, AGGR,
                  r = (1 + length(AGGR) - rank(AGGR, ties.method = "first"))),
           by = list(OrgType)]
st.out <- st.r[r <= 3][, list(r, OrgType, cval = paste0(CarBrand, " (", AGGR, ")"))]
# Reshape the data and hide the r column
library(reshape2)   # provides dcast(); newer data.table versions export their own dcast()
dcast(st.out, r ~ OrgType, value.var = "cval")[-1]
```

```
## 
## 1    FORD (212) CHEVROLET (2465) VOLVO (1)       KIA (192)    FORD (3297)
## 2 RENAULT (175)      FORD (2238)      <NA> CHEVROLET (115) RENAULT (2922)
## 3 HYUNDAI (122)   RENAULT (1996)      <NA>       FORD (53) HYUNDAI (2812)
```
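The expression 1 + length(AGGR) - rank(AGGR, ties.method = "first") turns an ascending rank into a descending one, so r = 1 corresponds to the largest value. A quick check on a toy vector:

```
# Illustration only: descending rank, 1 = the largest value
x <- c(5, 40, 12)
1 + length(x) - rank(x, ties.method = "first")
## [1] 3 1 2
```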
Summing over all license holders gives the overall brand structure; rare makes (fewer than 1000 licenses) are collapsed into an "Other brands" category.

```
st <- dt[, list(N = length(RegPlate)), by = CarBrand]   # Summary table
st <- st[, CarBrand := reorder(CarBrand, N)]
piedata <- rbind(st[N >= 1000][order(N, decreasing = TRUE)],
                 data.table(CarBrand = "Other brands", N = sum(st[N < 1000]$N)))
piedata
```

```
##          CarBrand    N
##  1:          FORD 5800
##  2:       RENAULT 5093
##  3:       HYUNDAI 4727
##  4:     CHEVROLET 4660
##  5:           KIA 2220
##  6:         SKODA 2073
##  7:        NISSAN 1321
##  8:    VOLKSWAGEN 1298
##  9:        TOYOTA 1075
## 10: MERCEDES-BENZ 1039
## 11:  Other brands 6534
```
For the chart, we want to fix exactly this ordering of makes. If we do not, automatic sorting will move "Other brands" from last place to first.

```
piedata <- piedata[, CarBrand := factor(CarBrand, levels = CarBrand, ordered = TRUE)]
```
To build the chart, we use ggplot2.

```
library(ggplot2)

pie <- ggplot(piedata, aes(x = "", y = N, fill = CarBrand)) +
    geom_bar(stat = "identity") +
    coord_polar(theta = "y")
pie
```
The basic pie needs some polishing: value labels, a readable palette, and the default theme elements stripped away.

```
piedata <- piedata[, pos := cumsum(N) - 0.5 * N]

pie <- ggplot(piedata, aes(x = "", y = N, fill = CarBrand)) +
    geom_bar(color = "black", stat = "identity", width = 0.5) +
    geom_text(aes(label = N, y = pos), x = 1.4, color = "black", size = 5) +
    scale_fill_brewer(palette = "Paired", name = "Car make") +
    coord_polar(theta = "y") +
    theme_bw() +
    theme(panel.border     = element_blank(),
          panel.grid.major = element_blank(),
          axis.ticks       = element_blank(),
          axis.title.x     = element_blank(),
          axis.title.y     = element_blank(),
          axis.text.x      = element_blank(),
          legend.title     = element_text(face = "plain", size = 16))
pie
```
How does the brand structure differ across organization types? A stacked bar chart is convenient for the comparison.

```
st <- dt[, list(N = length(RegPlate)), by = list(OrgType, CarBrand)]   # Summary table
cbsort <- st[, list(S = sum(N)), keyby = CarBrand]                     # Order by total number
setkey(st, CarBrand)
st <- st[cbsort]                                                       # Join
topcb <- st[S >= 1000][order(S)]
bottomcb <- st[S < 1000, list(CarBrand = "Other brands", OrgType, N = sum(N)), by = OrgType]
bottomcb <- bottomcb[, list(CarBrand, OrgType, N, S = sum(N))]
bardata <- rbind(bottomcb, topcb)
bardata <- bardata[, CarBrand := factor(CarBrand, levels = unique(CarBrand), ordered = TRUE)]

bar <- ggplot(bardata, aes(x = CarBrand, weight = N, fill = OrgType)) +
    geom_bar() +
    coord_flip() +
    scale_fill_brewer(palette = "Spectral", name = "") +
    labs(y = "Number of cars", x = "Car make") +
    theme_bw()
bar
```
Finally, let's look at "lucky" registration plates, where all three digits of the plate are the same.

```
# Plates whose three digits are identical ("lucky" numbers)
ln <- dt[grep("^[^0-9]([0-9])\\1{2}.+$", RegPlate),
         list(CarBrand, LuckyNum = gsub("^[^0-9]([0-9]{3}).+$", "\\1", RegPlate))]
ln <- ln[, list(N = .N), by = list(CarBrand, LuckyNum)]
ln <- ln[, Luck := sum(N), by = list(CarBrand)]   # Total number of lucky plates per car make
ln <- ln[, CarBrand := reorder(CarBrand, Luck)]

heatmap <- ggplot(ln, aes(x = CarBrand, y = LuckyNum)) +
    geom_tile(aes(fill = as.character(N)), color = "black") +
    scale_fill_brewer(palette = "YlOrRd", name = "Number of \"lucky\" plates:") +
    labs(x = "Car make", y = "Lucky number") +
    theme_bw() +
    theme(panel.grid.major   = element_blank(),
          axis.text.x        = element_text(angle = 45, hjust = 1),
          axis.title.y       = element_text(vjust = 0.3),
          legend.position    = "top",
          legend.title.align = 1)
heatmap
```
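To check what the backreference pattern actually matches, here is a small test on made-up plate numbers:

```
# Illustration only: \\1{2} requires the captured digit to repeat, i.e. three identical digits
plates <- c("A777MP", "B123CD", "E555KX", "H001OP")
grepl("^[^0-9]([0-9])\\1{2}.+$", plates)
## [1]  TRUE FALSE  TRUE FALSE
```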
Source: https://habr.com/ru/post/217963/