Introduction
In this article I will talk about the state of today's Russian laptop market. We will do all the analysis in Python code. I think it will be useful both to those who are looking for a laptop and to those who want to practice writing Python.
Let's start
![diy-03-425 [1] diy-03-425[1]](https://habrastorage.org/getpro/habr/post_images/fc2/f39/bf4/fc2f39bf4d47ef799715d2d7073b2fef.jpg)
For the analysis we need a data set. Unfortunately, I could not find any web services offered by Russian online laptop stores, so I had to download the price list of one of them (I will not name it) and extract the prices and basic parameters from it (in my opinion these are: processor frequency, screen diagonal, RAM size, hard disk size and video card memory size). I then analyzed the following questions:
- The average cost of a laptop
- Average hardware parameters of laptops
- The most expensive / cheapest laptop configuration
- Which configuration option has the most effect on its price?
- Predicting the price of the specified configuration
- Plot of the distribution of configurations and prices
The code
I saved the price list I managed to get in CSV format. To work with it we need to import the csv module:

    import csv
    import re
    import random

We also import the modules for random numbers and regular expressions, which we will need later.
Next, we write a function that reads the file and returns the laptops:

    def get_notebooks():
        # Read the price list and keep only the rows that parsed as laptops.
        with open('data.csv') as f:
            reader = csv.reader(f, delimiter=';', quotechar='|')
            return [n for n in map(create_notebook, reader) if n is not None]

Everything is simple here: we read the data.csv file and filter the rows by the result of the create_notebook function, since not every position in the price list is a laptop. Here is that function:

    def create_notebook(raw):
        # raw[0] is the description with '/'-separated specs, raw[1] is the price.
        # Any row that does not match the expected layout is not a laptop.
        try:
            notebook = Notebook()
            notebook.vendor = raw[0].split(' ')[0]
            notebook.model = raw[0].split(' ')[1]
            specs = raw[0].split('/')
            notebook.cpu = getFloat(r"(\d+),(\d+)\sG", specs[0])
            notebook.monitor = getFloat(r"(\d+)\.(\d+)''", specs[1])
            notebook.ram = getInt(r"(\d+)Mb", specs[2])
            notebook.hdd = getInt(r"(\d+)Gb", specs[3])
            notebook.video = getInt(r"(\d+)Mb", specs[4])
            notebook.price = getInt(r"(\d+)\srub\.", raw[1])
            return notebook
        except Exception:
            return None

As you can see, I decided not to use the vendor, model and processor type in the analysis (of course, not everything is so simple, but nonetheless). This function also relies on my own helper functions:

    def getFloat(regex, raw):
        # Join the two captured digit groups into one decimal number.
        m = re.search(regex, raw).groups()
        return float(m[0] + '.' + m[1])

    def getInt(regex, raw):
        # The first captured group is the integer value.
        m = re.search(regex, raw).groups()
        return int(m[0])

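To see what these helpers do with a row, here is a minimal sketch for Python 3. The sample description string is made up for illustration: the real price list is not shown in this article, so its exact layout is an assumption, and the names get_float and get_int are just local stand-ins for the helpers above.

```python
import re

def get_float(regex, raw):
    # Join the two captured digit groups into one decimal number.
    whole, frac = re.search(regex, raw).groups()
    return float(whole + '.' + frac)

def get_int(regex, raw):
    # Take the first captured group as an integer.
    return int(re.search(regex, raw).group(1))

# Hypothetical row: the real price list may be laid out differently.
sample = "Acme X100 2,26 GHz/15.6''/2048Mb/250Gb/512Mb"
specs = sample.split('/')

cpu = get_float(r"(\d+),(\d+)\sG", specs[0])
monitor = get_float(r"(\d+)\.(\d+)''", specs[1])
ram = get_int(r"(\d+)Mb", specs[2])
print(cpu, monitor, ram)
```

If a field is missing, re.search returns None and the attribute access raises an exception, which is what lets create_notebook discard rows that are not laptops.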
I want to note that in Python it is usually better to work with plain data sets than with OOP structures, since the language leans toward that style. Still, to bring some order to our domain (laptops), I introduced a class, as you may have noticed above (notebook = Notebook()):

    class Notebook:
        pass

Great, now we have the structure in memory and it is ready for analysis (2005 different configurations and their prices). Let's begin.
Average laptop cost

    def get_avg_price():
        notebooks = get_notebooks()
        print(sum([n.price for n in notebooks]) / len(notebooks))

We run the code and see that $1K, the usual benchmark price for a computer, still holds:

    >> get_avg_price()
    34574

Average hardware parameters of laptops

    def get_avg_parameters():
        notebooks = get_notebooks()
        # Print the mean of every numeric parameter, one per line.
        for name in ("cpu", "monitor", "ram", "hdd", "video"):
            avg = sum([getattr(n, name) for n in notebooks]) / len(notebooks)
            print("{0} {1}".format(name, avg))

Ta-da, and we have the average configuration in our hands:

    >> get_avg_parameters()
    cpu 2.0460798005
    monitor 14.6333167082
    ram 2448
    hdd 243
    video 289

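The whole numbers for ram, hdd and video come from Python 2, where dividing one integer by another truncates the result. A small sketch for Python 3 showing both behaviours; the function names and the sample values are illustrative and not part of the project:

```python
def average(values):
    # True division: the result is a float in Python 3.
    return sum(values) / len(values)

def truncated_average(values):
    # Floor division, matching what Python 2 printed for integer columns.
    return sum(values) // len(values)

ram = [512, 2048, 4096]  # illustrative values in Mb
print(average(ram), truncated_average(ram))
```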
The most expensive / cheapest laptop configuration
The two functions are identical except for the use of min / max.

    def get_max_priced_notebook():
        notebooks = get_notebooks()
        maxprice = max([n.price for n in notebooks])
        # Take the first configuration that has the maximum price.
        maxconfig = [n for n in notebooks if n.price == maxprice][0]
        print("cpu {0}".format(maxconfig.cpu))
        print("monitor {0}".format(maxconfig.monitor))
        print("ram {0}".format(maxconfig.ram))
        print("hdd {0}".format(maxconfig.hdd))
        print("video {0}".format(maxconfig.video))
        print("price {0}".format(maxconfig.price))

    >> get_max_priced_notebook()
    cpu 2.26
    monitor 18.4
    ram 4096
    hdd 500
    video 1024
    price 181660

    >> get_min_priced_notebook()
    cpu 1.6
    monitor 8.9
    ram 512
    hdd 8
    video 128
    price 8090

Which configuration option has the most effect on its price?
It would be very interesting to find out which configuration parameter we pay the most for. My guess was that it is most likely the screen diagonal and the processor frequency, but it is worth checking.
To begin with, our set of configuration parameters needs a small modification. Because the parameters are measured in units of very different magnitude, we need to bring them to a common scale, i.e. normalize them. Let's get started:

    def normalized_set_of_notebooks():
        notebooks = get_notebooks()
        # Maximum of every parameter over the whole set
        cpu = max([n.cpu for n in notebooks])
        monitor = max([n.monitor for n in notebooks])
        ram = max([n.ram for n in notebooks])
        hdd = max([n.hdd for n in notebooks])
        video = max([n.video for n in notebooks])
        # Each parameter becomes a ratio to its maximum; the price stays in rubles
        rows = [[n.cpu / cpu, n.monitor / monitor, float(n.ram) / ram,
                 float(n.hdd) / hdd, float(n.video) / video, n.price]
                for n in notebooks]
        return rows

In this function I find the maximum value of each parameter and then build the resulting list of laptops, in which each parameter is represented as a coefficient (between 0 and 1) showing the ratio of its value to the maximum in the set. For example, 2048 Mb of memory gives a configuration the coefficient ram = 0.5 (2048/4096).
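The same scaling on a single column, as a minimal Python 3 sketch. The helper name normalize_by_max and the sample values are illustrative only:

```python
def normalize_by_max(values):
    # Divide every value by the column maximum so the largest becomes 1.0.
    top = max(values)
    return [float(v) / top for v in values]

ram_sizes = [512, 2048, 4096]  # illustrative values in Mb
print(normalize_by_max(ram_sizes))  # [0.125, 0.5, 1.0]
```

Note that this divides by the maximum only, so the smallest value does not map to 0 unless it is 0 itself.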
We will measure the contribution of each parameter in rubles. For clarity, we will store these weights in a list:

    # cpu, monitor, ram, hdd, video
    koes = [0, 0, 0, 0, 0]

I propose to calculate these coefficients for each configuration and then average them over all configurations, which gives us averaged data on the weight of each configuration element.

    def analyze_params():
        koeshistory = []
        # our laptops
        notes = normalized_set_of_notebooks()
        for i in range(len(notes)):
            koes = [0, 0, 0, 0, 0]
            # set coefficients
            set_koes(notes[i], koes)
            # save history of coefficients (flat list, five values per laptop)
            koeshistory.extend(koes)
            # show progress
            if i % 100 == 0:
                print(i)
                print(koes)
        # average the coefficients over all configurations
        print("cpu, monitor, ram, hdd, video")
        print(get_avg_koes(koeshistory))

How will we set the coefficients for each configuration? My approach is as follows:
- randomly increase or decrease the value of one of the coefficients
- check whether multiplying the parameter vector by the coefficient vector (let me remind you that in our case these are rubles) now gives a result closer to the price of the configuration
- if we got closer, keep the change and continue; if not, undo it
- repeat these steps until we are within the specified accuracy of the price
Here is the implementation of this algorithm:

    def set_koes(note, koes, error=500):
        price = get_price(note, koes)
        lasterror = abs(note[5] - price)
        while lasterror > error:
            k = random.randint(0, 4)
            # we change the coefficient
            inc = (random.random() * 2 - 1) * (error * (1 - float(error) / lasterror))
            previous = koes[k]
            # do not let the coefficient become less than zero
            koes[k] = max(previous + inc, 0)
            # get the price when taking the coefficients into account
            price = get_price(note, koes)
            # get the current error
            curerror = abs(note[5] - price)
            # check whether we got closer to the price shown in the price list
            if lasterror < curerror:
                koes[k] = previous
            else:
                lasterror = curerror

inc is the variable responsible for increasing / decreasing a coefficient. It is calculated this way because the value should be larger when the error is larger, so that we approach the desired result quickly and more accurately.
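To make the step size concrete: the random factor lies between -1 and 1, so the magnitude of inc is bounded by error * (1 - error / lasterror). A minimal Python 3 sketch of that bound; the helper name max_step is hypothetical and not part of the project:

```python
def max_step(error, lasterror):
    # Upper bound on |inc| for a given tolerance and current error.
    return error * (1 - float(error) / lasterror)

for lasterror in (30000, 5000, 1000, 600):
    print(lasterror, round(max_step(500, lasterror), 1))
```

Far from the target the bound is close to the tolerance itself (500 rubles by default), and it shrinks toward zero as the current error approaches the tolerance.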
The vector multiplication that gives the price looks like this:

    def get_price(note, koes):
        return sum([note[i] * koes[i] for i in range(5)])

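A worked example of this product, as a Python 3 sketch. The row and the weights below are invented for illustration; only the shape (five normalized parameters followed by the price) follows the article:

```python
def get_price(note, koes):
    # Dot product of the five normalized parameters and their ruble weights.
    return sum(n * k for n, k in zip(note[:5], koes))

note = [1.0, 0.5, 0.5, 0.25, 0.0, 30000]  # illustrative row, last item is the listed price
koes = [10000, 20000, 8000, 4000, 6000]   # illustrative weights in rubles
print(get_price(note, koes))  # 25000.0
```

Here the estimate is 25000 rubles against a listed price of 30000, so the error that set_koes would keep reducing is 5000.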
The time has come to perform the analysis:

    >> analyze_params()
    cpu, monitor, ram, hdd, video
    [15455.60675667684, 20980.560483811361, 12782.535270304281, 17819.904629585861, 14677.889529808042]

We obtained this set by averaging the coefficients calculated for each configuration:

    def get_avg_koes(koeshistory):
        # koeshistory is a flat list holding five coefficients per configuration
        koes = [0, 0, 0, 0, 0]
        for j in range(len(koeshistory)):
            koes[j % 5] += koeshistory[j]
        # number of configurations in the history
        count = len(koeshistory) / 5
        for i in range(5):
            koes[i] /= count
        return koes

So, we have the desired set. What can we say from these figures? We can build a ranking of the parameters:
- Monitor diagonal
- Hard disk capacity
- CPU frequency
- Video Card Volume
- RAM size
I would like to note that this method is far from ideal and you may get different results. Nevertheless, my assumption that the processor frequency and the screen diagonal are the most important parameters of a configuration was partially confirmed.
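The ranking can be read directly off the averaged coefficients printed earlier. A minimal Python 3 sketch that sorts the parameter names by those weights; the numbers are copied from the output above:

```python
names = ["cpu", "monitor", "ram", "hdd", "video"]
koes = [15455.60675667684, 20980.560483811361, 12782.535270304281,
        17819.904629585861, 14677.889529808042]

# Sort parameter names by their weight, largest first.
ranking = [name for weight, name in sorted(zip(koes, names), reverse=True)]
print(ranking)  # ['monitor', 'hdd', 'cpu', 'video', 'ram']
```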
Forecasting the price of the specified configuration
With such a rich data set it would be nice to be able to predict the price of a given configuration. This is what we will do.
To begin with, we will transform our collection of laptops into a list:

    def get_notebooks_list():
        return [[n.cpu, n.monitor, n.ram, n.hdd, n.video, n.price] for n in get_notebooks()]

Next, we need a function that can determine the distance between two vectors. A good option, as I see it, is the Euclidean distance:

    import math

    def euclidean(v1, v2):
        d = 0.0
        # Only the first len(v1) coordinates are compared
        for i in range(len(v1)):
            d += (v1[i] - v2[i]) ** 2
        return math.sqrt(d)

The square root of the sum of squared differences shows quite clearly and effectively how much one vector differs from another. How is this function useful to us? It is simple: when we get a vector with the parameters that interest us, we go over our entire collection and find its nearest neighbor, whose price we already know. Great! Here is how we do it:

    def getdistances(data, vec1):
        distancelist = []
        for i in range(len(data)):
            vec2 = data[i]
            distancelist.append((euclidean(vec1, vec2), i))
        # nearest rows first
        distancelist.sort()
        return distancelist

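A quick check of the two functions on toy data, written as a compact Python 3 sketch. The rows are invented; as in the article, the last item of each row plays the role of the price and is ignored because the query vector is shorter:

```python
import math

def euclidean(v1, v2):
    # zip stops at the shorter vector, so the trailing price is not compared.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def getdistances(data, vec1):
    # (distance, row index) pairs, nearest first.
    return sorted((euclidean(vec1, row), i) for i, row in enumerate(data))

data = [[0, 0, 100], [3, 4, 200], [6, 8, 300]]  # toy rows
print(getdistances(data, [0, 0]))  # [(0.0, 0), (5.0, 1), (10.0, 2)]
```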
Next, we can make the task a little more sophisticated and improve the accuracy of the result. To do this, we introduce a function that uses the k weighted nearest neighbors classification method:
Weighted nearest neighbors is a metric classification algorithm based on evaluating the similarity of objects. The object being classified is assigned to the class to which its nearest objects in the training set belong.
Here we simply take the average price among a certain number of nearest neighbors, which evens out the influence of vendor pricing or the specifics of a configuration:

    def knnestimate(data, vec1, k=3):
        # sorted distances to every row
        dlist = getdistances(data, vec1)
        avg = 0.0
        # average the prices of the k nearest rows
        for i in range(k):
            idx = dlist[i][1]
            avg += data[idx][5]
        avg /= k
        return avg

* the last 3 algorithms are taken from Toby Segaran's book “Programming Collective Intelligence”
And here is what we get:

    >> knnestimate(get_notebooks_list(), [2.4, 17, 3062, 250, 512])
    31521.0
    >> knnestimate(get_notebooks_list(), [2.0, 15, 2048, 160, 256])
    27259.0
    >> knnestimate(get_notebooks_list(), [2.0, 15, 2048, 160, 128])
    20848.0

The prices look like market prices, and that is quite enough, although this implementation does not take into account how much more important, for example, the processor frequency and screen diagonal are (for that we would need to add the weight vector we calculated in the previous section to the comparison function).
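One possible way to bring weights into the comparison is to scale every coordinate difference before squaring it. This is only a sketch for Python 3 under that assumption: the function scaled_euclidean is hypothetical, it is not part of the project, and which scale vector to use (for example, one derived from the coefficients of the previous section) is left open.

```python
import math

def scaled_euclidean(v1, v2, scale):
    # Multiply each coordinate difference by its weight before squaring.
    return math.sqrt(sum((s * (a - b)) ** 2 for a, b, s in zip(v1, v2, scale)))

# Illustrative scales: a weight of 0 switches a coordinate off entirely.
print(scaled_euclidean([0, 0], [3, 4], [1, 0]))  # 3.0
print(scaled_euclidean([0, 0], [3, 4], [1, 1]))  # 5.0
```

With all weights equal to 1 this reduces to the plain Euclidean distance used above.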
Plot of the distribution of configurations and prices
I would like to take in the whole picture, i.e. plot the distribution of configurations and prices on the market. OK, let's do it.
First you need to install the matplotlib library. Next, import it into our project:

    from pylab import *

We also need to create two data sets for the abscissa and ordinate:

    def power_of_notebooks_config():
        # "power" of a configuration: the product of its normalized parameters
        return [x[0] * x[1] * x[2] * x[3] * x[4] for x in normalized_set_of_notebooks()]

    def config_prices():
        return [x[5] for x in normalized_set_of_notebooks()]

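For a single row, the "power" value works out as follows. A minimal Python 3 sketch; the helper name config_power and the sample row are illustrative:

```python
from functools import reduce
from operator import mul

def config_power(row):
    # Product of the five normalized parameters; the price (index 5) is left out.
    return reduce(mul, row[:5], 1.0)

row = [0.5, 0.5, 1.0, 0.5, 1.0, 30000]  # illustrative normalized row
print(config_power(row))  # 0.125
```

Because this is a product of ratios no greater than 1, a single weak parameter pulls the whole value down, and most configurations end up with small values.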
And the function that draws the distribution plot:

    def draw_market():
        # price on the x axis, configuration power on the y axis
        plot(config_prices(), power_of_notebooks_config(), 'bo', linewidth=1.0)
        xlabel('price (Rub)')
        ylabel('config_power')
        title('Russian Notebooks Market')
        grid(True)
        show()

And what do we get:

In conclusion
So, we managed to carry out a small analysis of the Russian laptop market, and to play around a little with Python along the way.
The source code of the project is available at:
http://code.google.com/p/runm/source/checkout
I apologize for the poor syntax highlighting: Habr did not want to accept the output of my engine (pygments).