As we automated a large online store and began to match products automatically

The article is more technical than about business, but we will also summarize some results from a business point of view. Most attention will be paid to the automatic comparison of goods from different sources.

Work online store consists of a fairly large number of components. And whatever the plan, make a profit right now, or grow and look for investors, or, for example, develop related areas, at least have to close such questions:
')

Work with providers. To sell something unnecessary, you must first buy something unnecessary.
Directory management. Someone has a narrow specialization, and someone sells hundreds of thousands of different products.
Retail pricing management. There will have to take into account the prices of suppliers and the prices of competitors, and affordable financial instruments.
Work with a warehouse. In principle, you can not have your own warehouse, and take the goods from the partners' warehouses, but somehow the question is.
Marketing. Here, filling the site with content, placement on sites, advertising (online and offline), promotions and much more.
Reception and processing of orders. Call center, shopping cart on the site, orders via instant messengers, orders through platforms and marketplaces.
Delivery.
Accounting and other internal systems.

The store, which we will talk about, does not have a narrow specialization, but offers a lot of everything from cosmetics to a mini-tractor. I will tell you how we work with suppliers, monitoring competitors, managing the catalog and pricing (wholesale and retail), working with wholesale customers. A little touch on the topic of the warehouse.

In order to better understand some technical solutions, it will not be out of place to know that
At some point we decided that technological things, if possible, will be done not for ourselves, but universal. And, perhaps, after several attempts it will be released to develop a new business. It turns out, conditionally, a startup within the company.

So we consider a separate system, more or less universal, with which the rest of the company's infrastructure is integrated.

What is the problem of working with suppliers?

And a lot of them, in fact. Just bring some:

There are many suppliers themselves. We have about 400. Everyone needs to be given some time.
There is no single way to get offers from suppliers. Someone sends to the mail on a schedule, someone on request, someone uploads to file sharing services, someone places on the site. There are many ways, including sending a file via Skype.
There is no uniform data format. I even drew a picture on this topic (it is below, the tables symbolize different formats).
There is a concept of minimum retail and minimum wholesale prices that must be observed in order to continue working with the supplier. Often they are provided in their own format.
The nomenclature of each supplier has its own. As a result, the same product is called differently, and there is no unique key by which they can be simply compared. Therefore, we compare difficult.
The system of placing an order with a supplier is not automated. We order someone via skype, someone in our personal account, someone send an Excel file with a list of orders every evening.

We learned how to cope with these problems. In addition to the latter, the last work in progress. Now there will be technical details, and then we will consider the following list.

We collect data

As it was

Provider files were manually collected from various sources and prepared. Preparation included renaming a specific template and editing the content. Depending on the file, it was necessary to delete the non-standard, products that are not available, rename the columns or recalculate the currency, collect data from different tabs on one.

As it became

First of all, we learned to check mail and pick up letters with attachments from there. Then they automated the work with direct links and links to Yandex and Google disks. This resolved the issue of receiving bids from about 75% of our suppliers. We also noticed that it is through these channels that the offers are more often updated, so the real percentage of automation has turned out more. We still get some price lists in messengers.

Secondly, we no longer handle files manually. To do this, we have added supplier profiles, where you can specify which column and tab to take, how to determine the currency and availability, delivery time, and the schedule of the supplier.

It turned out flexibly. Naturally, not everything was taken into account the first time, but now there is enough flexibility to customize the processing of all 400 providers, taking into account the fact that everyone has different file formatting.

As for file formats, we understand xls, xlsx, csv, xml (yml). In our case, this was enough.

Also figured out how to filter the records. We have a list of stop words, and if the supplier’s offer contains it, then we do not process it. The technical details are as follows: on a small list, it is possible and even better “on the forehead,” on large lists, the Bloom filter is faster. They experimented with it and left everything as it is, because the gain is felt on the list an order of magnitude greater than ours.

Another important thing is the supplier’s schedule. Our suppliers work on different schedules, besides they are located in different countries, days off in which do not coincide. And the delivery time is usually specified as a number or range of numbers in work days. When we will form retail and wholesale prices, we will have to somehow estimate the time when we can deliver the goods to the customer. To do this, we have configured configurable calendars, and in the settings of each provider, you can specify which calendar it works for.

I had to make a configuration of discounts and extra charges depending on the category and manufacturer. It so happens that the supplier has a common file for all partners, but there are agreements on discounts with some partners. Thanks to this, it was still possible to add or subtract VAT if necessary.

By the way, the configuration of the rules of discounts and markups leads us to the next topic. After all, before applying them you need to find out what kind of product.

How the mapping works

A small example of how the same product can be named from different suppliers to understand what you have to work with:

Monitor LG LCD 22MP48D-P
21.5 "LG 22MP48D-P Black (16: 9, 1920x1080, IPS, 60 Hz, DVI + D-Sub (VGA) interfaces)
COMP - Computer Peripherals - Monitors LG 22MP48D-P
up to 22 "inclusive LG Monitor LG 22MP48D-P (21.5", black, IPS LED 5ms 16: 9 DVI matte 250cd 1920x1080 D-Sub FHD) 22MP48D-P
Monitors LG 22 "LG 22MP48D-P Glossy-Black (IPS, LED, 1920x1080, 5 ms, 178 ° / 178 °, 250 cd / m, 100M: 1, + DVI) Monitor
LCD Monitors LG Monitor LCD 22 "IPS 22MP48D-P LG 22MP48D-P
LG Monitor 21.5 "LG 22MP48D-P gl.Black IPS, 1920x1080, 5ms, 250 cd / m2, 1000: 1 (Mega DCR), D-Sub, DVI-D (HDCP), vesa 22MP48D-P.ARUZ
LG Monitor LG 22MP48D-P Black 22MP48D-P.ARUZ
LG 22MP48D-P 22MP48D-P monitor
Monitors LG 22MP48D-P Glossy-Black 22MP48D-P
21.5 "LG Flatron 22MP48D-P gl.Black Monitor (IPS, 1920x1080, 16: 9, 178/178, 250cd / m2, 1000: 1, 5ms, D-Sub, DVI-D) (22MP48D-P) 22MP48D-P
Monitor 22 "LG 22MP48D-P
LG 22MP48D-P IPS DVI
LG LG 21.5 "22MP48D-P IPS LED, 1920x1080, 5ms, 250cd / m2, 5Mln: 1, 178 ° / 178 °, D-Sub, DVI, Tilt, VESA, Glossy Black 22MP48D-P
LG 21.5 "22MP48D-P (16: 9, IPS, VGA, DVI) 22MP48D-P
Monitor 21.5`` LG 22MP48D-P Black
LG MONITOR 21.5 "LG 22MP48D-P Glossy-Black (IPS, LED, 1920x1080, 5 ms, 178 ° / 178 °, 250 cd / m, 100M: 1, + DVI) 22MP48D-P
LG Monitor LCD 21.5 '' [16: 9] 1920x1080 (FHD) IPS, nonGLARE, 250cd / m2, H178 ° / V178 °, 1000: 1, 16.7M Color, 5ms, VGA, DVI, Tilt, 2Y, Black OK 22MP48D -P
LCD LG 21.5 "22MP48D-P black {IPS LED 1920x1080 5ms 16: 9 250cd 178 ° / 178 ° DVI D-Sub} 22MP48D-P.ARUZ
IDS_Monitors LG LG 22 "LCD 22MP48D 22MP48D-P
21.5 "16x9 LG Monitor LG 21.5" 22MP48D-P black IPS LED 5ms 16: 9 DVI matte 250cd 1920x1080 D-Sub FHD 2.7kg 22MP48D-P.ARUZ
Monitor 21.5 "LG 22MP48D-P [Black]; 5ms; 1920x1080, DVI, IPS

As it was

Comparison dealt with 1C (third-party paid module). As for convenience / speed / accuracy, such a system made it possible to maintain a catalog with 60 thousand products in stock at this level by 6 people. That is, every day became obsolete and disappeared from the proposals of suppliers as many matched products as new ones were created. Very approximately - 0.5% of the catalog size, i.e. 300 products.

How it became: a general description of the approach

Just above, I gave an example of what we need to compare. Exploring the topic of comparison, I was a little surprised that ElasticSearch is popular for the task of comparison, in my opinion, it has conceptual limitations. As for our technology stack, we use MS SQL Server for data storage, but the mapping runs on our own infrastructure, and since there is a lot of data and it needs to be processed quickly, we use data structures optimized for a specific task and try not to need to access the disk or database and other slow systems.

Obviously, the comparison problem can be solved in many ways and obviously, none of them will give one hundred percent accuracy. Therefore, the basic idea is to try to combine these methods, rank them according to accuracy and speed and apply descending accuracy according to speed.

The execution plan for each of our algorithms (with the proviso about degenerate cases) can be briefly presented with such a general sequence:

Tokenization. We break the source line into something meaningful independent parts. You can do it once and continue to use in all algorithms.

Normalization of tokens. In an amicable way, you need to bring the words of a natural language to a total number and declension, and identifiers like “15” (this is Cyrillic, if anything) to convert to Latin. And bring all to one register.

Categorization of tokens. We are trying to understand what each part means. For example, you can select a category, manufacturer, color, and so on.

Search for the best candidate for a match.

Estimation of the likelihood that the source line and the best candidate actually represent the same product.

The first two points are common to all algorithms that are currently available, and then improvisations begin.

Tokenization. Here we did just break the line into parts by special characters such as space, slash, and so on. The character set turned out to be significant over time, but we didn’t use anything complicated in the algorithm itself.

Then we need to normalize tokens. Convert them to lower case. Instead of bringing everything to the nominative case, we simply cut off the endings. We also have a small dictionary, and we translate our tokens into English. Among other things, the translation saves us from synonyms, Russian words that are similar in meaning are translated equally into English. Where they failed to translate, we change the Cyrillic characters that are similar in spelling to Latin. (It’s not at all superfluous, as it turned out. Even where you don’t expect a dirty trick, for example, in the line “Samsung UE43NU7100U” Cyrillic E may well meet).

Categorization of tokens. We can distinguish a category, manufacturer, model, article, EAN, color. We have a directory where the data is structured. We have data on competitors that we provide trading platforms. When processing them, where possible, we structure the data. We can correct errors or typographical errors, for example, the manufacturer or color that occurs only once in all our sources, not to be considered as a manufacturer and color, respectively. As a result, we have a large dictionary of possible manufacturers, models, articles, colors, and a token categorization is just a dictionary search for O (1). Theoretically, you can have an open list of categories and some sort of clever classification algorithm, but our basic approach works well, and categorization is not a bottleneck.

It should be noted here that sometimes the supplier provides already structured data, for example, the item is in a separate cell in the table, or the supplier makes a discount from retail when wholesale, and retail prices can be obtained in yml (xml) format. Then we save the data structure, and heuristically divide tokens into categories only from unstructured data.

And now about what algorithms and in what order we apply.

Exact and almost exact matches

The easiest case. Rows are broken into tokens, brought them to one form. Then came up with a hash function that is not sensitive to the order of tokens. Moreover, comparing the hash, we can keep all the data in memory, conditional 16 megabytes per dictionary with a million keys we can afford. In practice, the algorithm performed better than simple string comparison.

With regard to hashing, it suggests the use of "exclusive or" and a function like this:

public static long GetLongHashCode(IEnumerable<string> tokens) { long hash = 0; foreach (var token in tokens.Distinct()) { hash ^= GetLongHashCode(token); } return hash; }

The most interesting thing at this stage is getting the hash of a separate line. In practice, it turned out that 32 bits are not enough, a lot of collisions result. And also - you can't just take the source code of a function from the framework and change the type of the return value, there are fewer collisions for individual lines, but after the "exclusive or" they still occur, so we wrote our own. In essence, we simply added nonlinearity from the input data to the function from the framework. It became definitely better, with the new function with collision, we met only once on our millions of records, recorded and postponed until better times.

Thus, we are looking for matches without regard to the order of words and their form. This search works for O (1).

Unfortunately, rarely, but it also happens: “ABC 42 Type 16” and “ABC 16 Type 42”, and these are two different goods. We also learned how to deal with such things, but more on that later.

Comparison of Man-Confirmed Products

We have the goods mapped manually (most often these are the goods mapped automatically, but got on a manual check). In fact, we do the same thing in this case, only now we have added a dictionary of matching hashes, the search for which did not change the time complexity of the algorithm.

Manually mapped strings simply lie in the database, just in case, such raw data will allow in the future to change the hashing algorithm, recalculate everything and lose nothing.

Attribute Mapping

The first two algorithms are fast and accurate, but not enough. Next we apply attribute mapping.

Previously, we have already submitted data in the form of normalized tokens and even sorted them into categories. In this chapter, I call token categories attributes.

The most reliable attribute is EAN (https://ru.wikipedia.org/wiki/European_Article_Number). EAN matches provide an almost 100% guarantee that they are one and the same product. The EAN mismatch, however, does not say anything, because one product may have different EANs. All is good, but in our data EAN is rare, so its effect on the comparison is at the level of error.

Item less reliable. It often gets something weird right from the supplier’s structured data, but in any case, at this stage, we use it.

As in the previous stage, here we use dictionaries (search for O (1)), and the hash from (manufacturer + model + part number) is used as the key. Hashing allows you to perform all operations in memory. At the same time, we also take into account the color, if it is the same or not, then we believe that the goods have coincided.

Search for the best match

The previous stages were simple, fast and fairly reliable, but, unfortunately, they cover less than half of the comparisons.

In the search for the best match lies a simple idea: the coincidence of rare tokens has a large weight, the coincidence of frequent tokens has a small weight. Tokens containing numbers are valued more than alphabetic. Tokens that are matched in the same order are valued more than tokens that are swapped. Long matches are better than short ones.

Now it remains to come up with a fast data structure that can take all this into account at the same time and is stored in memory on the directory in a couple of million entries.

We thought up to present our catalog in the form of a dictionary of dictionaries, at the first level the key will be the hash from the manufacturer (the data in the catalog is structured, we know the manufacturer), the value is the dictionary. Now the second level. The key on the second level is the hash of the token, the value is a list of id products from the catalog where this token is found. And in this case, we use including a combination of tokens in the order in which they are found in our catalog. What to use as a combination, and what not to decide, depending on the number of tokens, their length, and so on, is a compromise between speed, accuracy, and the required memory. In the figure, I simply depicted this structure, without hashes and without normalization.

If an average of 20 tokens is used for each item, then in our lists, which are values of a nested dictionary, the link to the item will be found on average 20 times. There will be no more than 20 times more different tokens than the products in the catalog. Approximately you can calculate the memory required for a directory of a million entries: 20 million keys with 4 bytes for each, 20 million id products with 4 bytes for each, the overhead of organizing dictionaries and lists (the order is the same, but since the size of lists and dictionaries we do not know in advance, but increase on the move, multiply by two). Total - 480 megabytes. In reality, we got a little more tokens for the product, and we need up to 800 megabytes per catalog of a million products. What is acceptable, the capabilities of modern iron allow you to simultaneously store in memory more than a hundred directories of this size.

Let's go back to the algorithm. Having the string that we need to match, we can determine the manufacturer (we have a categorization algorithm), and then get tokens using the same algorithm as for products from the catalog. Here I mean including a combination of tokens.

Then everything is relatively simple. For each token, we can quickly find all the products in which it is found, evaluate the weight of each match, taking into account all that we talked about before - length, frequency, numbers or special characters, and evaluate the “similarity” of all the candidates found. In reality, there are also optimizations here, we do not consider all the candidates, first we form a small list of matches of tokens with a large weight, and we do not apply matches of low weights to all products, but only to this list.

We choose the best match, we look at the coincidence of tokens, which turned out to be categorized, and we consider the matching score. Next we have two thresholds P1 and P2, P1 <P2. If the assessment turned out more than the threshold value P2 - human participation is not required, everything happens automatically. If between two values - we offer to look at the comparison manually, until then it will not participate in pricing. If there is less than P1, then most likely there is no such item in the catalog, we do not return anything.

Returning to the lines “ABC 42 Type 16” and “ABC 16 Type 42”. The solution is surprisingly simple - if several products have the same hashes, we do not match them by hash. And the last algorithm will take into account the order of tokens. Theoretically, such lines in the price list of the supplier cannot be compared to something arbitrary, where the numbers 16 and 42 are not found at all. Practically with such necessity we did not face.

Speed and accuracy of comparison

Now for the speed of all this. The time it takes to prepare dictionaries linearly depends on the size of the catalog. The time that is required for direct comparison linearly depends on the number of goods compared. All data structures involved in the search do not change after creation. This gives us the opportunity to use multi-threading at the mapping stage. Preparatory work for a catalog of one million entries takes about 40-80 seconds. Comparison works at a speed of 20-40 thousand records per second and does not depend on the size of the directory. Then, however, you need to save the results. The chosen approach is generally beneficial for large volumes, but a file with a dozen records will be processed disproportionately long. Therefore, we use the cache and recalculate our search structures every 15 minutes.

True, the data for comparison must be read somewhere (most often it is an Excel file), and the correlated sentences must be saved somewhere, and this also takes time. So the total number of 2-4 thousand records per second.

In order to assess the accuracy, we have prepared a test suite of approximately 20,000 manually checked comparisons of different suppliers from different categories. The algorithm after each change was tested on this data. The results were as follows:

the goods are in the catalog and were compared correctly - 84%
the product is in the catalog, but was not compared, manual comparison is required - 16%
the goods are in the catalog and were compared incorrectly - 0.2%
the product is not in the catalog, and the program correctly identified it - 98.5%
there is no product in the catalog, but the program has compared it to some of the goods - 1.5%

In 80% of cases when the product has been matched, manual confirmation is not required (we automatically confirm the comparison), among such automatically confirmed sentences there is 0.1% error.

By the way, 0.1% of errors is a lot, it turns out. Per million records mapped is a thousand records mapped incorrectly. And a lot of it because buyers just such records are best found. Well, how not to order a tractor at the price of headlights from this tractor. However, this thousand errors are at the start of work on a million sentences, they were gradually corrected. The quarantine for suspicious prices, which closes this question, we had later, the first couple of months we worked without it.

There is one more category of errors, not related to matching, these are the wrong prices of our suppliers. Partly therefore we do not consider the price in comparison. We decided that since we have additional information in the form of a price, we will try using it to determine not only our own, but also other people's mistakes.

Finding the wrong prices

This is the part that we are actively experimenting over now. The basic version is, and it does not allow to sell the phone at the price of the cover, but I have a feeling that you can do better.

For each product we find the boundaries of reasonable prices for suppliers. Depending on what kind of data we have, we take into account the prices of suppliers for this product, prices of competitors, prices of suppliers for goods of this manufacturer in this category. Those prices that do not fall within the boundaries, we quarantine and ignore in all our algorithms. You can manually mark such a suspicious price as normal, then we remember this for this product and recalculate the boundaries of reasonable prices.

Directly, the algorithm for calculating the maximum and minimum acceptable prices is constantly changing, we are looking for a compromise between the number of false positives and the number of incorrect prices detected.

We use the median values in the calculations (the average gives the worst result) and do not analyze the distribution form yet. The analysis of the distribution form is the place where, in my opinion, the algorithm can be improved.

Work with database

From all the above, we can conclude that we frequently update the data on suppliers and competitors often and in large quantities, and working with the database can become a bottleneck. In principle, we initially paid attention to this and tried to achieve maximum performance. When working with a large number of records, we do the following:

we delete indexes from the table with which we work
disable full text indexing on this table
delete all records with a specific condition (for example, all offers of specific suppliers that are currently being processed)
insert new records with BULK COPY
re-create indexes
enable full text indexing

Bulk copy works at a speed of 10-40 thousand records per second, why such a large scatter remains to be seen, but it is also acceptable.

Deleting records takes about the same time as inserting. Some more time is required to re-create the indexes.

By the way, for each directory we have a separate database. We create them on the fly. And now I will tell why we have more than one directory.

What is the problem of directory maintenance?

And there are a lot of them too. Now we will list:

The catalog contains about 400 thousand products from completely different categories. It is impossible to professionally understand each of the categories.
It is necessary to follow a certain style, to follow the rules common to the entire catalog for naming, selecting subcategories, and so on. Thus, we are trying to achieve a coherent and logical directory structure.
The same product can be created several times, and this is a problem. Without a tool that analyzes similar names, the creation of duplicates occurs constantly.
It is reasonable to add to the catalog those products that suppliers have in stock. In this case, you need to have priorities for product categories.
We need several directories. One of his, we are his own, the other - the catalog of the aggregator, it is updated by api. The meaning of the second catalog is that the site aggregator works only with its own catalog, and, accordingly, accepts proposals in its nomenclature. This is another place where it turned out to be a mapping.

We considered that it is logical and correct to keep the catalog in the same place where comparisons are made. So we can prompt the users who administer the catalog what the supplier has, but not in our catalog.

How do we keep the catalog

It will be about the catalog without detailed characteristics, characteristics - a separate big story, about it somehow another time.

We chose the following basic properties:

manufacturer
category
model
vendor code
Colour
EAN

To begin with, they made api for getting a catalog from an external source, and then worked on the convenience of creating, editing and deleting records.

How search works

Convenience of managing a catalog, first of all, is the ability to quickly find a product in a catalog or a supplier's offer, and there are some nuances. For example, you need to be able to look for the string “LG 21.5" 22MP48D-P (16: 9, IPS, VGA, DVI) 22MP48D-P ”for the query“ 2MP48 ”.

Full-text search sql server out of the box is not suitable, because it does not know how, and search using LIKE '% 2MP48%' is too slow.

Our solution is fairly standard, we use N-grams. More precisely - the trigrams. And already on trigrams we build a full-text index and perform a full-text search. I'm not sure that we use space in this case very rationally, but this solution came up with speed, depending on the request, from 50 to 500 milliseconds, sometimes up to a second on an array of three million records.

Let me clarify that the string “LG 21.5" 22MP48D-P (16: 9, IPS, VGA, DVI) 22MP48D-P ”is converted to the string“ lg2 g21 215 152 522 22m 2mp mp4 p48 48d 8dp dp1 p16 169 69i 9ip ips psv svg vga gad adv dvi vi2 i22 ”, which is saved in a separate field that participates in the full-text index.

By the way, the trigrams will still be useful to us.

Creating a new product

For the most part, products in the catalog are created according to the offers of the supplier. That is, we already have information that the supplier offers “LG Monitor LCD 21.5” [16: 9] 1920x1080 (FHD) IPS, nonGLARE, 250cd / m2, H178 ° / V178 °, 1000: 1, 16.7M Color, 5ms, VGA, DVI, Tilt, 2Y, Black OK 22MP48D-P ”at a price of 120 dollars, and it has from 5 to 10 units in stock.

When creating a product, first of all, we need to make sure that such a product has not yet been created in the catalog. We solve this problem in four stages.

First, if we have a product in our catalog, the supplier’s offer is likely to be matched to this product automatically.

Secondly, before we show the user the form of creating a new product, we will perform a search on trigrams and show the most relevant results. (technically this is done using CONTAINSTABLE).

Thirdly, as we fill in the fields of a new product, we will show similar existing products. This solves two problems: it helps to avoid duplicates and maintain the style in the names, similar products can be used as a sample.

And fourth, remember we broke the lines into tokens, normalized them, counted hashes? We will do the same and simply will not let us create products with the same hashes.

Even at this stage we are trying to help the user. In line, which is in the price list, we will try to determine the manufacturer, category, article, EAN and color of the goods. First, by tokens (we can divide them into categories), then, if we don’t, we’ll find the most similar product by trigrams. And, if it is similar enough, fill the manufacturer and category.

Editing a product works almost the same, just not everything is applicable.

How we form our prices

The task is this: to keep a balance between the number and marginality of sales, in fact - to achieve maximum profit. All other aspects of the store's work are also about this, but exactly what happens at the price formation stage has the greatest impact.

At a minimum, we will need information about the offers of suppliers and competitors. You should also consider the minimum retail and wholesale prices and delivery costs, as well as financial instruments such as loans and installment plans.

We collect the prices of competitors

To begin with, we have a lot of profiles of our own prices. There is a profile for retail, there are several for wholesale customers. All of them are created and configured in our system.

Accordingly, the competitors for each profile are different. In retail - other retail stores, in wholesale sales - our suppliers.

Everything is clear with suppliers, and for retail we collect data on competitors in several ways. First, some aggregators provide information on all prices for all products that are on the site. In our own nomenclature, but we can match products, so it works automatically. And this is almost enough. Secondly, we have competitor parsers. Since they are not yet automated and exist in the form of console applications (which sometimes fall down), we rarely use them.

Customize profile

In the profile we have the ability to customize different ranges of markups depending on the price of goods from the supplier, category, manufacturer, supplier.It is still possible to indicate which suppliers for which category or manufacturer we work with, and with which not - with which competitors we take into account.

Then we set up financial instruments, indicate which installments are available and how much the bank will take for itself.

And already within the boundaries of margins, we form our own prices, trying to keep that same balance in the first place, and make our warehouse goods sell better in the second place. This is in a nutshell, but in fact I do not dare to explain in simple words what is happening there.

I can tell you what is not happening. Unfortunately, we are not yet able to forecast demand and take into account the costs of storing goods in a warehouse.

Integration with third-party systems

Important part from the point of view of business, but not interesting from the technical point of view. In a nutshell, I will say that we are able to give data to third-party systems (including incrementally, that is, we understand what has changed since the last exchange) and are able to make mailings.

Mailings are customizable, so (and not only so) we deliver our offers to wholesale customers.

Another way to work with wholesale customers is b2b portal. He is still in active development, will work literally in a month.

Accounts, change logging

Another uninteresting question. We have each user account.

In short, you can say the following: if ORM is used, then it has a built-in change tracking mechanism. If you get into it (in our case it is EF Core and there is even an API there), then you can get logging in almost two lines.

For the change history, we made an interface, and now we can trace who changed what in the system settings, who edited or compared certain products, and so on.

Logs can be considered statistics, which we are doing. We know who created or edited how many products, confirmed how many comparisons manually, and how many rejected, we can view each change.

A bit about the overall system

We have one database for accounts and things that are not dependent on the directory, one database for logs, and a database for each directory. So, catalog queries are easier, and data analysis is easier, and the code is clearer.

By the way, the logging system is self-written; we really need to group logs related to one request or one hard task, besides, we need basic functionality for analyzing them. With ready-made solutions, this was difficult, plus this is another dependency that needs to be maintained.

The web interface is made on ASP.NET Core and bootstrap, while heavy operations are performed by a Windows service.

Another feature that went to the project is, in my opinion, the different models for reading and writing data. We did not implement the full-fledged QQRS, but we took one of the concepts from there. We write to the database through repositories, but the objects that are used for writing never leave the update / create / delete methods. Mass update done through BULK COPY. A separate model and a separate data access layer are made for reading, so we read only what is needed at a particular moment. It turned out that you can use ORM, and at the same time avoid heavy queries, calls to the database at uncertain moments (as with lazy loading), problems N + 1. And we use the model for reading as DTO.

Of the major dependencies, we have ASP.NET Core, several third-party nuget packages, and MS SQL Server. While it is possible, we try not to depend on many third-party systems. In order to fully deploy the project locally, it is enough to install SQL Server, pick up the source code from the version control system and build the project. The necessary databases will be created automatically, and nothing else is needed. You may have to change one or two lines in the configuration.

What did not

Not yet done a system of knowledge on the project. We want to do wikis and tips on the place. Not done simple intuitive interface, the one that is not bad, but for an untrained person a bit confused. CI / CD is only in the plans.

Not done processing the detailed characteristics of the goods. We also plan, but there is no specific time yet.

Results from a business perspective

Since the start of active development until launch, two people worked on the project for 7 months. At the start we had a prototype made in our free time. The most difficult were given the integration with existing systems.

In the three months that we are in production, the amount of goods available for wholesale customers has increased from 70 thousand to 230 thousand, the number of goods on the site from 60 thousand to 140 thousand. The site is always late, because it needs characteristics, pictures, descriptions of goods. On the aggregator, we upload 106 thousand sentences instead of 40 thousand three months ago. The number of people working with the catalog has not changed.

We work with 425 suppliers, this number has almost doubled in three months. We track prices over a thousand competitors. Well, as we track - we have a system for parsing, but in most cases we take ready data from those who regularly provide them.

About sales, unfortunately, I can not tell, I myself do not have reliable data. Demand is seasonal, and you cannot directly compare the month to the previous month. And for the year too much has happened to distinguish the influence of our system from all the factors. Very, very conditional, plus or minus a kilometer, the growth of the catalog, more flexible and competitive prices and the associated sales growth have already paid off the development and implementation.

Another result is a project that is essentially not related to the infrastructure of a particular store, and you can make a public service out of it. It was conceived from the very beginning, and this plan almost worked. Unfortunately, the boxed solution did not work out. To offer a project as a service, where you can register, tick “I agree”, and that works “as is”, without adapting to the client, you need to rework the interface, add flexibility and create a wiki. And also make the infrastructure easily scalable and eliminate a single point of failure. Now we have only regular backups from the reliability assurance tools. As an enterprise solution, I think we are ready to solve business problems. Things are easy - to find a business.

By the way, we have already attracted one third-party client, having the most basic functionality. The guys needed a tool for comparing products, and the inconvenience associated with active development did not frighten them.

Source: https://habr.com/ru/post/456604/

All Articles

As we automated a large online store and began to match products automatically

What is the problem of working with suppliers?

We collect data

As it was

As it became

How the mapping works

As it was

How it became: a general description of the approach

Exact and almost exact matches

Comparison of Man-Confirmed Products

Attribute Mapping

Search for the best match

Speed ​​and accuracy of comparison

Finding the wrong prices

Work with database

What is the problem of directory maintenance?

How do we keep the catalog

How search works

Creating a new product

How we form our prices

We collect the prices of competitors

Customize profile

Integration with third-party systems

Accounts, change logging

A bit about the overall system

What did not

Results from a business perspective

More articles:

Speed and accuracy of comparison