Laptops Market Analysis with Python

Introduction

In this article I will talk about the state of today's Russian laptop market. We will carry out all of the analysis with Python code. I think it will be useful both to those who are shopping for a laptop and to those who want to practice writing Python.

Let's start

For the analysis we need a data set. Unfortunately, I could not find web services at Russian online laptop stores, so I had to download the price list of one of them (I will not name it) and pull prices and basic parameters from it (in my opinion these are: processor frequency, screen diagonal, RAM size, hard disk size, and video card memory size). I then analyzed the following questions:

The average cost of a laptop
Average hardware parameters of laptops
The most expensive / cheapest laptop configuration
Which configuration parameter has the most effect on the price?
Predicting the price of a specified configuration
Plot of the distribution of configurations and prices

The code

I saved the price list that I managed to get in CSV format. To work with it, we need to import the csv module:

import csv
import re
import random

We also import the modules for random numbers and regular expressions, which we will need later.

Next, we create a function for reading and retrieving laptops:

def get_notebooks():
    reader = csv.reader(open('data.csv'), delimiter=';', quotechar='|')
    # create_notebook returns None for rows that are not laptops; drop those
    return filter(lambda x: x is not None, map(create_notebook, reader))

Everything is simple here: we read the data.csv file and filter the rows by the result of the create_notebook function, since not every position in the price list is a laptop. Here is that function:

def create_notebook(raw):
    # raw[0] is the description, raw[1] is the price; rows that fail to parse are skipped.
    # The unit strings (G, Mb, Gb, rub.) must match the spelling used in the price list.
    try:
        notebook = Notebook()
        notebook.vendor = raw[0].split(' ')[0]
        notebook.model = raw[0].split(' ')[1]
        notebook.cpu = getFloat(r"(\d+)\,(\d+)\sG", raw[0].split('/')[0])
        notebook.monitor = getFloat(r"(\d+)\.(\d+)\''", raw[0].split('/')[1])
        notebook.ram = getInt(r"(\d+)Mb", raw[0].split('/')[2])
        notebook.hdd = getInt(r"(\d+)Gb", raw[0].split('/')[3])
        notebook.video = getInt(r"(\d+)Mb", raw[0].split('/')[4])
        notebook.price = getInt(r"(\d+)\srub.", raw[1])
        return notebook
    except Exception:
        return None

As you can see, I decided that the analysis would not take the vendor, model and processor type into account (of course, things are not that simple, but still). This function also relies on my own helper functions:

def getFloat(regex, raw):
    m = re.search(regex, raw).groups()
    return float(m[0] + '.' + m[1])

def getInt(regex, raw):
    m = re.search(regex, raw).groups()
    return int(m[0])
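
To see the helper functions at work, here is a minimal sketch that parses one made-up row. It is written for Python 3, unlike the article's Python 2 listings, and the sample string with its unit spellings (GHz, Mb, Gb) is an assumption for illustration only; the real price list is not shown in the article and may be spelled differently.

```python
import re

def get_float(regex, raw):
    """Join two captured digit groups into a float, e.g. ('2', '26') -> 2.26."""
    groups = re.search(regex, raw).groups()
    return float(groups[0] + '.' + groups[1])

def get_int(regex, raw):
    """Return the first captured digit group as an int."""
    return int(re.search(regex, raw).group(1))

# Hypothetical row description; fields are separated by '/'.
sample = "Acme X100 2,26 GHz/15.4''/2048Mb/250Gb/512Mb"
fields = sample.split('/')

cpu = get_float(r"(\d+),(\d+)\sG", fields[0])      # comma as decimal separator
monitor = get_float(r"(\d+)\.(\d+)''", fields[1])  # diagonal in inches
ram = get_int(r"(\d+)Mb", fields[2])
hdd = get_int(r"(\d+)Gb", fields[3])
video = get_int(r"(\d+)Mb", fields[4])
print(cpu, monitor, ram, hdd, video)
```

If a field does not match, re.search returns None and the attribute access raises an exception, which is what lets create_notebook skip rows that are not laptops.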

I want to note that Python code is best written in the style of data sets rather than OOP structures, because the language leans toward that style. Still, to bring some order to our domain (laptops), I introduced a class, as you may have noticed above (notebook = Notebook()):

class Notebook:
    pass

Great, now we have a structure in memory that is ready for analysis (2005 different configurations and their prices). Let's begin:

Average laptop cost:

def get_avg_price():
    # read the price list once and reuse it
    notebooks = get_notebooks()
    print sum([n.price for n in notebooks]) / len(notebooks)

We run the code and see that $1K, the usual benchmark price for a computer, still holds:

>>> get_avg_price()
34574

Average hardware parameters of laptops

def get_avg_parameters():
    # read the price list once and reuse it
    notebooks = get_notebooks()
    count = len(notebooks)
    print "cpu {0}".format(sum([n.cpu for n in notebooks]) / count)
    print "monitor {0}".format(sum([n.monitor for n in notebooks]) / count)
    print "ram {0}".format(sum([n.ram for n in notebooks]) / count)
    print "hdd {0}".format(sum([n.hdd for n in notebooks]) / count)
    print "video {0}".format(sum([n.video for n in notebooks]) / count)

Ta-da, the averaged configuration is in our hands:

>>> get_avg_parameters()
cpu 2.0460798005
monitor 14.6333167082
ram 2448
hdd 243
video 289
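
The whole numbers for ram, hdd and video above come from Python 2, where dividing one integer by another drops the fractional part. Here is a minimal Python 3 sketch of a reusable average helper; the RAM sizes are made up and the helper name is my own.

```python
def average(values):
    """Arithmetic mean; the input is materialized once so it may be a generator."""
    values = list(values)
    return sum(values) / len(values)

# Hypothetical RAM sizes in Mb, not taken from the price list.
ram_sizes = [1024, 2048, 4096]

print(average(ram_sizes))                 # true division, keeps the fraction
print(sum(ram_sizes) // len(ram_sizes))   # floor division, like Python 2 with ints
```

In Python 3, `/` always returns a float, so `//` is needed to reproduce the truncated figures printed in the article.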

The most expensive / cheap laptop configuration:

The two functions are identical except for the use of min / max, so only one is shown.

def get_max_priced_notebook():
    # read the price list once and reuse it
    notebooks = get_notebooks()
    maxprice = max([n.price for n in notebooks])
    maxconfig = filter(lambda x: x.price == maxprice, notebooks)[0]
    print "cpu {0}".format(maxconfig.cpu)
    print "monitor {0}".format(maxconfig.monitor)
    print "ram {0}".format(maxconfig.ram)
    print "hdd {0}".format(maxconfig.hdd)
    print "video {0}".format(maxconfig.video)
    print "price {0}".format(maxconfig.price)

>>> get_max_priced_notebook()
cpu 2.26
monitor 18.4
ram 4096
hdd 500
video 1024
price 181660

>>> get_min_priced_notebook()
cpu 1.6
monitor 8.9
ram 512
hdd 8
video 128
price 8090
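
The listing for get_min_priced_notebook is not shown in the article. As a rough Python 3 sketch of the same idea, min and max accept a key function, which finds the cheapest and the most expensive item without a separate filter step. The namedtuple is a hypothetical stand-in for the Notebook class; the first and last rows repeat the output above, and the middle row is invented.

```python
from collections import namedtuple

# Hypothetical stand-in for the article's Notebook objects.
Notebook = namedtuple('Notebook', 'cpu monitor ram hdd video price')

notebooks = [
    Notebook(1.6, 8.9, 512, 8, 128, 8090),
    Notebook(2.0, 15.4, 2048, 250, 256, 30000),   # invented row
    Notebook(2.26, 18.4, 4096, 500, 1024, 181660),
]

# key= tells min/max which attribute to compare.
cheapest = min(notebooks, key=lambda n: n.price)
priciest = max(notebooks, key=lambda n: n.price)
print(cheapest.price, priciest.price)
```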

Which configuration option has the most effect on its price?

It would be very interesting to find out which configuration parameter we pay the most for. My guess was that it is most likely the monitor diagonal and the processor frequency, and I think this is worth checking.

To begin with, our set of configuration parameters needs a small modification. Because the units of the different parameters differ by orders of magnitude, we need to bring them to a common scale, i.e. normalize them. So let's get started:

def normalized_set_of_notebooks():
    notebooks = get_notebooks()
    # maximum of every parameter over the whole set
    cpu = max([n.cpu for n in notebooks])
    monitor = max([n.monitor for n in notebooks])
    ram = max([n.ram for n in notebooks])
    hdd = max([n.hdd for n in notebooks])
    video = max([n.video for n in notebooks])
    # float() avoids integer division for the integer parameters
    rows = map(lambda n: [n.cpu / cpu, n.monitor / monitor, float(n.ram) / ram, float(n.hdd) / hdd, float(n.video) / video, n.price], notebooks)
    return rows

In this function I find the maximum value of each parameter and then build the resulting list of laptops, in which each parameter is represented as a coefficient between 0 and 1 that shows the ratio of the parameter to the maximum value in the set. For example, 2048 Mb of memory gives the configuration a coefficient of ram = 0.5 (2048/4096).
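
The division by the column maximum can be shown on a single column. This is a small Python 3 sketch with made-up RAM sizes; the helper name is my own.

```python
def normalize_column(values):
    """Divide every value by the column maximum, so the largest becomes 1.0."""
    top = max(values)
    return [v / top for v in values]

ram_sizes = [512, 2048, 4096]  # hypothetical RAM sizes in Mb
print(normalize_column(ram_sizes))  # [0.125, 0.5, 1.0]
```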

We will measure the contribution of each parameter in rubles. For clarity, we will store these weights in a list:

#cpu, monitor, ram, hdd, video
koes = [0, 0, 0, 0, 0]

I propose to calculate these coefficients for each configuration, and then determine the average value of all coefficients, which will give us averaged data on the weight of each of the configuration elements.

def analyze_params():
    koeshistory = []
    # our laptops
    notes = normalized_set_of_notebooks()
    for i in range(len(notes)):
        koes = [0, 0, 0, 0, 0]
        # set coefficients
        set_koes(notes[i], koes)
        # save history of coefficients, one row per laptop
        koeshistory.append(koes)
        # show progress
        if i % 100 == 0:
            print i
            print koes
    # print the coefficients averaged over all laptops
    print get_avg_koes(koeshistory)

How will we set the coefficients for each configuration? My approach is as follows:

we randomly increase or decrease the value of one of the coefficients
then we check whether multiplying the vector of parameters by the vector of coefficients (which, let me remind you, are in rubles) has brought us closer to the price of the configuration
if we got closer, we keep the change; if not, we undo it
we repeat these steps until we are within the specified accuracy of the price

Here is the implementation of this algorithm:

def set_koes(note, koes, error=500):
    price = get_price(note, koes)
    lasterror = abs(note[5] - price)
    while lasterror > error:
        k = random.randint(0, 4)
        old = koes[k]
        # we change the coefficient
        inc = (random.random() * 2 - 1) * (error * (1 - error / lasterror))
        # do not let the coefficient become less than zero
        koes[k] = max(0, old + inc)
        # get the price when taking into account coefficients
        price = get_price(note, koes)
        # get the current error
        curerror = abs(note[5] - price)
        # check if we are closer to the price shown in the price list
        if lasterror < curerror:
            # restore the saved value; subtracting inc is wrong after clamping to zero
            koes[k] = old
        else:
            lasterror = curerror

inc is the variable responsible for increasing or decreasing the coefficient. It is calculated this way so that the step is larger when the error is larger, which lets us approach the desired result faster and more precisely.

The vector multiplication that gives the price looks like this:

def get_price(note, koes):
    return sum([note[i] * koes[i] for i in range(5)])
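
The search described above can be tried on a single row. The following is a self-contained Python 3 sketch, not the article's exact code: the function name fit_koes, the max_steps cap, the fixed seed and the sample row are all my own assumptions. The cap is there because the step shrinks as the error approaches the tolerance, so the loop may take a long time to cross it.

```python
import random

def get_price(note, koes):
    """Dot product of the five normalized parameters and their weights."""
    return sum(note[i] * koes[i] for i in range(5))

def fit_koes(note, error=500, max_steps=100000):
    """Random search: keep a random change only if it does not increase the error."""
    koes = [0.0] * 5
    last_error = abs(note[5] - get_price(note, koes))
    steps = 0
    while last_error > error and steps < max_steps:
        steps += 1
        k = random.randint(0, 4)          # pick one of the five weights
        old = koes[k]
        step = (random.random() * 2 - 1) * error * (1 - error / last_error)
        koes[k] = max(0.0, old + step)    # weights never go below zero
        cur_error = abs(note[5] - get_price(note, koes))
        if cur_error > last_error:
            koes[k] = old                 # undo a change that moved us away
        else:
            last_error = cur_error
    return koes, last_error

random.seed(0)  # fixed seed so the run is repeatable
row = [0.8, 0.7, 0.5, 0.5, 0.25, 30000]  # hypothetical normalized row and price
koes, err = fit_koes(row)
print(koes, err)
```

Many different weight vectors reproduce the same price for a single row, so the weights found for one laptop say little on their own; it is the averaging over all laptops that the article relies on.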

The time has come to perform the analysis:

>>> analyze_params()
cpu, monitor, ram, hdd, video

[15455.60675667684, 20980.560483811361, 12782.535270304281, 17819.904629585861, 14677.889529808042]

We obtained this set by averaging the coefficients found for each configuration:

def get_avg_koes(koeshistory):
    koes = [0, 0, 0, 0, 0]
    # sum every column over all rows of the history
    for row in koeshistory:
        for i in range(5):
            koes[i] += row[i]
    for i in range(5):
        koes[i] /= len(koeshistory)
    return koes
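
The same column-wise average can be written more compactly with zip. This is a Python 3 sketch; the function name and the two history rows are invented for illustration.

```python
def average_columns(rows):
    """Column-wise mean of equally long rows."""
    return [sum(column) / len(rows) for column in zip(*rows)]

# Hypothetical coefficient history: one row of five weights per laptop.
history = [
    [100.0, 200.0, 300.0, 400.0, 500.0],
    [300.0, 400.0, 500.0, 600.0, 700.0],
]
print(average_columns(history))  # [200.0, 300.0, 400.0, 500.0, 600.0]
```

zip(*rows) turns the list of rows into a sequence of columns, so each column can be summed directly.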

So we have the desired set. What do these figures tell us? We can rank the parameters by their weight:

Monitor diagonal
Hard disk capacity
CPU frequency
Video memory size
RAM size

I would like to note that this method is far from ideal, and you may get different results. Still, my assumption that the processor frequency and the display diagonal are the most important parameters in a configuration was partially confirmed.

Forecasting the price of the specified configuration

With such a rich data set, it would be nice to be able to predict the price of a given configuration. This is what we will do.

To begin with, we convert our collection of laptops into a list:

def get_notebooks_list():
    return map(lambda n: [n.cpu, n.monitor, n.ram, n.hdd, n.video, n.price], get_notebooks())

Next, we need a function that measures the distance between two vectors. The Euclidean distance looks like a good option:

import math

def euclidean(v1, v2):
    d = 0.0
    # only the first len(v1) components are compared
    for i in range(len(v1)):
        d += (v1[i] - v2[i]) ** 2
    return math.sqrt(d)
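
A quick check of the distance function on the classic 3-4-5 triangle, written as a Python 3 sketch with zip instead of an index loop. The vectors are made up.

```python
import math

def euclidean(v1, v2):
    """Square root of the summed squared differences, component by component."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

print(euclidean([0, 0], [3, 4]))  # 5.0

# zip stops at the shorter vector, so a trailing price column in v2 is ignored,
# just as range(len(v1)) ignores it in the loop version.
print(euclidean([0, 0], [3, 4, 99999]))  # 5.0
```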

The square root of the sum of squared differences shows clearly and effectively how much one vector differs from another. How is this function useful to us? It's simple: given a vector with the parameters that interest us, we go over the whole data set and find the nearest neighbor, whose price we already know. Here is how we do it:

def getdistances(data, vec1):
    distancelist = []
    for i in range(len(data)):
        vec2 = data[i]
        distancelist.append((euclidean(vec1, vec2), i))
    # nearest rows first
    distancelist.sort()
    return distancelist

Next, we can complicate the task a little and, with it, improve the accuracy of the result. To do this, we introduce a function that uses the k weighted nearest neighbors classification method:

Weighted nearest neighbors is a metric classification algorithm based on evaluating the similarity of objects. The object being classified is assigned to the class of the training-set objects that are closest to it.

So we take the average price among a certain number of nearest neighbors, which smooths out the influence of vendor pricing or the specifics of a configuration:

def knnestimate(data, vec1, k=3):
    dlist = getdistances(data, vec1)
    avg = 0.0
    # average the prices of the k nearest rows
    for i in range(k):
        idx = dlist[i][1]
        avg += data[idx][5]
    avg /= k
    return avg

* the last three algorithms are taken from Toby Segaran's book “Programming Collective Intelligence”
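
Note that knnestimate takes a plain average, so every one of the k neighbors counts equally. One common way to make the estimate weighted is to let closer neighbors count more, for example with inverse-distance weights. The Python 3 sketch below illustrates that idea only; the function name, the eps constant and the sample rows are my own assumptions and are not taken from the article or the book.

```python
import math

def euclidean(v1, v2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def weighted_knn(data, query, k=3, eps=0.1):
    """Average the prices of the k nearest rows, each weighted by 1 / (distance + eps)."""
    nearest = sorted(data, key=lambda row: euclidean(query, row))[:k]
    weights = [1.0 / (euclidean(query, row) + eps) for row in nearest]
    total = sum(w * row[-1] for w, row in zip(weights, nearest))
    return total / sum(weights)

# Hypothetical rows: [cpu, monitor, price]
data = [
    [2.0, 15.0, 30000],
    [2.0, 15.0, 32000],
    [1.6, 10.0, 15000],
    [2.4, 17.0, 45000],
]
print(weighted_knn(data, [2.0, 15.0], k=2))
print(weighted_knn(data, [2.0, 15.0], k=3))
```

With k=2 both neighbors sit at distance zero and get equal weights, so the result is their plain average. With k=3 the more distant third row pulls the estimate up only slightly.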

And here is what we get:

>>> knnestimate(get_notebooks_list(), [2.4, 17, 3062, 250, 512])
31521.0

>>> knnestimate(get_notebooks_list(), [2.0, 15, 2048, 160, 256])
27259.0
>>> knnestimate(get_notebooks_list(), [2.0, 15, 2048, 160, 128])
20848.0

The prices look like market prices, and that is good enough. However, this implementation does not take into account the relative importance of parameters such as the processor frequency and the monitor diagonal (for that we would need to add to the comparison function the weight vector that we calculated in the previous section).
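
Why the raw vectors underweight the processor frequency can be seen from the units: a difference of a few megabytes is numerically much larger than a difference of a few tenths of a gigahertz. The Python 3 sketch below multiplies each component difference by a scale factor before measuring the distance. The function name, the sample vectors and the assumed maxima (3.0 GHz, 4096 Mb) are illustrative; the per-parameter weights from the previous section could be folded into the same scale vector.

```python
import math

def scaled_distance(v1, v2, scale):
    """Euclidean distance after multiplying each component difference by its scale."""
    return math.sqrt(sum((s * (a - b)) ** 2 for a, b, s in zip(v1, v2, scale)))

# Hypothetical [cpu GHz, ram Mb] vectors.
query, a, b = [2.0, 2048], [2.4, 2048], [2.0, 2058]

# Unscaled: 10 Mb of RAM looks about 25 times farther away than 0.4 GHz.
print(scaled_distance(query, a, [1, 1]), scaled_distance(query, b, [1, 1]))

# Dividing by an assumed maximum per parameter makes the CPU gap dominate.
scale = [1 / 3.0, 1 / 4096]
print(scaled_distance(query, a, scale), scaled_distance(query, b, scale))
```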

Plot of the distribution of configurations and prices

I would like to see the whole picture, i.e. to plot the distribution of configurations and prices on the market. OK, let's do it.

First you need to install the matplotlib library. Then import it into our project:

from pylab import *

We also need to create two data sets for the abscissa and ordinate:

def power_of_notebooks_config():
    # "power" of a configuration: product of its five normalized parameters
    return map(lambda x: x[0] * x[1] * x[2] * x[3] * x[4], normalized_set_of_notebooks())

def config_prices():
    return map(lambda x: x[5], normalized_set_of_notebooks())

And the function in which we build the distribution graph:

def draw_market():
    # price on the x axis, configuration power on the y axis, blue dots
    plot(config_prices(), power_of_notebooks_config(), 'bo', linewidth=1.0)

    xlabel('price (Rub)')
    ylabel('config_power')
    title('Russian Notebooks Market')
    grid(True)
    show()

And what do we get:

In conclusion

So, we managed to carry out a small analysis of the Russian laptop market, and to play a little with Python along the way.

The source code of the project is available at:

http://code.google.com/p/runm/source/checkout

I apologize for the lack of proper syntax highlighting; Habr would not accept the output of my highlighting engine (pygments).

Source: https://habr.com/ru/post/68355/
