When opening an online store, its owner usually reasons like this:
- I have access to a supplier of underwear, hardware... (everyone inserts their own), so why not open an online store? It's cool, I've heard you can earn a lot on the Internet, it's promising and profitable.
As a result, thousands of sites selling the same goods appear, cluttering up the Internet more and more.
It turns out the same pants are now sold by yet another, 1001st store. Instead of money, the entrepreneur usually gets a headache in the form of SEO, schm-EO, and disproportionate costs for contextual advertising.
The online store folds before it even has time to appear.
I suggest going a different way.
Purpose (aka Theory):
Search for unoccupied niches for trade.
The ideal situation: there is demand, there is no supply, and contextual advertising is cheap. So we are looking for "gold".

Let's talk about Web Data Mining: extracting data from the Internet and then analyzing it.
Initial data:
In my experiment to test the theory, I will start from WHAT Internet users are searching for in search engines.
At the moment there are several sources to obtain such data.
- Database of keywords collected from various sources (old databases can be found for free).
- Search suggestions (hints) from Yandex and Google.
- Yandex's "Live" technology, which shows user queries in real time.
Since retrieving data from search engines is not an easy task, to begin with we will make do with a small base of 30 million phrases found roaming the expanses of the Internet.
Preparation of the initial data:
- For further analysis, convert all phrases to lower case.
- Clean phrases of unnecessary characters (we are only interested in letters and digits: [a..z], [A..Z], [0..9]).
- Remove profanity, porn, and other "stop" words like "free", "download", "torrent".
After that, the base shrinks by about 30%.
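To make the preparation step concrete, here is a minimal Python sketch of the same three operations. The stop-word list contains only the three examples named above, and the allowed-character set (Latin plus Cyrillic letters and digits) is my assumption about the original filter.

```python
import re

# Illustrative stop words; the real list is much larger and keeps growing.
STOP_WORDS = {"free", "download", "torrent"}

# Keep only letters, digits and spaces; drop everything else.
# (Including Cyrillic here is an assumption about the original filter.)
UNWANTED = re.compile(r"[^a-zа-яё0-9 ]+")

def normalize(phrase: str) -> str:
    """Lower-case a phrase, strip unwanted characters, collapse whitespace."""
    phrase = phrase.lower()
    phrase = UNWANTED.sub(" ", phrase)
    return " ".join(phrase.split())

def keep(phrase: str) -> bool:
    """Reject phrases that contain any stop word."""
    return not any(word in STOP_WORDS for word in phrase.split())

def prepare(phrases):
    """Full preparation pass: normalize, then filter by stop words."""
    cleaned = (normalize(p) for p in phrases)
    return [p for p in cleaned if p and keep(p)]
```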
Required data:
So, we are interested in the parameters that characterize supply and demand.
Sources:
- Yandex.Direct API (Budget Forecast: CreateNewForecast, GetForecast)
(free, no restrictions)
- Google Adwords API (trafficEstimatorService Forecast)
(API usage is paid)
- Yandex.Wordstat (http://wordstat.yandex.ru/)
(free, but unstable; IPs sending a large number of requests get banned quickly)
- * Yandex.Spros (http://direct.yandex.ru/spros) (a new service; bans come less quickly and it works more stably)
- * Search on Yandex.Direct (http://direct.yandex.ru/search) (from here you can pull the number of ads for a key phrase; no bans observed)
Asterisks mark the services that I used to test my theory.
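Before collecting anything, it helps to fix what gets stored per phrase. The sketch below is one possible record of the supply-and-demand parameters plus the "gold" test from the theory section; the field names and thresholds are illustrative assumptions, not values from the experiment.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PhraseStats:
    """Supply-and-demand parameters collected for one key phrase."""
    phrase: str
    ads_count: Optional[int] = None        # supply: number of Direct ads
    monthly_queries: Optional[int] = None  # demand: searches per month
    forecast_cpc: Optional[float] = None   # cost of contextual advertising

def is_gold(s: PhraseStats,
            min_queries: int = 1000,   # illustrative thresholds, not
            max_ads: int = 1,          # values from the experiment
            max_cpc: float = 0.1) -> bool:
    """The 'gold' criterion: demand exists, supply is almost absent,
    and contextual advertising is cheap."""
    return (s.monthly_queries is not None and s.monthly_queries >= min_queries
            and s.ads_count is not None and s.ads_count <= max_ads
            and s.forecast_cpc is not None and s.forecast_cpc <= max_cpc)
```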
Data collection:
Stage 1. Since collecting data through the API is long and resource-intensive, to start with we use search on Yandex.Direct: each phrase is matched with the number of ads shown for it.
Here the first pitfall appeared: the number of ads depends on the time of day.
Therefore, we have to walk the database twice.
The first pass: round-the-clock collection over the whole base.
The second pass: over the resulting sample (phrases with at most one ad), from 9 am to 6 pm.
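A minimal sketch of the stage-1 collector, assuming the Yandex.Direct search page accepts the phrase as a `text` query parameter and that ad blocks can be located with a CSS selector; both assumptions have to be checked against the live page before running at scale.

```python
import requests
from bs4 import BeautifulSoup

DIRECT_SEARCH_URL = "http://direct.yandex.ru/search"  # from the sources list above

def count_ads(phrase: str, session: requests.Session,
              selector: str = "div.ad") -> int:
    """Fetch the Yandex.Direct search page for a phrase and count ad blocks.

    The `text` parameter name and the `div.ad` selector are assumptions;
    inspect the real page markup and adjust before use.
    """
    resp = session.get(DIRECT_SEARCH_URL, params={"text": phrase}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return len(soup.select(selector))

session = requests.Session()
# Pass 1: run over the whole base around the clock.
# Pass 2: re-check phrases with at most one ad, but only between 9:00 and
# 18:00, since the number of ads depends on the time of day.
```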
Stage 2. Having a list of phrases with 0 or 1 ads, we obtain the number of search-engine queries for each phrase. At the start of stage 2 the list is about 10% of the initial volume.
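Stage 2 itself is a simple filter-and-lookup. In the sketch below, `fetch_frequency` is a placeholder for whichever demand source is actually used (Wordstat, Yandex.Spros, or a Direct API forecast), since their request formats are not shown here.

```python
from typing import Callable, Dict

def collect_demand(phrases_with_ads: Dict[str, int],
                   fetch_frequency: Callable[[str], int],
                   max_ads: int = 1) -> Dict[str, int]:
    """For phrases with at most `max_ads` ads, look up how often they are
    searched. `fetch_frequency` stands in for the real data source."""
    demand = {}
    for phrase, ads in phrases_with_ads.items():
        if ads <= max_ads:
            demand[phrase] = fetch_frequency(phrase)
    return demand
```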
We parallelize the collection through lists of proxy servers; for this, a proxy search-and-ranking system was written that tracks each proxy's connection speed and ban status (a sketch of the idea follows below).
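A rough sketch of that proxy search-and-ranking idea: each proxy is probed once, its response time is recorded, failures mark it as banned, and work is rotated over the fastest survivors.

```python
import itertools
import time
import requests

class ProxyPool:
    """Every proxy carries a measured connection speed and a 'banned' flag;
    requests are spread over the fastest proxies that are still alive."""

    def __init__(self, proxies):
        # proxies: iterable of "host:port" strings
        self.state = {p: {"speed": None, "banned": False} for p in proxies}

    def measure(self, test_url="http://direct.yandex.ru/search", timeout=5):
        """Probe every proxy once; record its response time or ban it."""
        for proxy, info in self.state.items():
            try:
                start = time.time()
                requests.get(test_url,
                             proxies={"http": f"http://{proxy}"},
                             timeout=timeout)
                info["speed"] = time.time() - start
            except requests.RequestException:
                info["banned"] = True

    def ranked(self):
        """Alive proxies, fastest first."""
        alive = [(p, i["speed"]) for p, i in self.state.items()
                 if not i["banned"] and i["speed"] is not None]
        return [p for p, _ in sorted(alive, key=lambda x: x[1])]

    def rotate(self):
        """Endless round-robin over the ranked proxies."""
        return itertools.cycle(self.ranked())
```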
Result:
The theory is confirmed. Unoccupied niches DO exist, and in completely different areas! The experiment is still ongoing.
(Proof: sapper shovels.)
But:
- The output contained a lot of garbage, which I had to look through by hand, picking the monetizable queries out of the list.
- The stop-word list has grown considerably; I could not even imagine what kind of nastiness network users are searching for.
- To automate the process further, additional filters are needed (I do not know which ones yet), but at least a classifier (one possible sketch follows after this list).
- Bolt on analysis of Direct and AdWords bid prices.
- Build my own phrase base through Yandex "Live".
- And in the end, get PROFFIT :)
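As an example of the "at least a classifier" idea, here is a toy scikit-learn sketch that learns to separate monetizable queries from junk, using the manually reviewed lists as labels; the training phrases below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: phrases already reviewed by hand, labelled as
# "money" (monetizable) or "junk". Real labels come from the manual pass.
phrases = ["buy sapper shovel", "sapper shovel price",
           "download movie free", "torrent games free"]
labels = ["money", "money", "junk", "junk"]

# Bag-of-words features plus a naive Bayes classifier.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(phrases, labels)

# Score new phrases coming out of the pipeline.
print(classifier.predict(["buy folding shovel", "watch series free"]))
```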