Good day. A huge reference directory of barcodes with product names, categories and brands has finally appeared in open access.
We have been working on it for 8 years, and it now holds about 3 million barcodes in the EAN (EAN-13, EAN-8) and UPC (UPC-A, UPC-E) standards.
What is there?
It is a table of barcodes and the corresponding product names. Every record has a category, and many have a brand.
The range of goods is very wide. Heavy machinery is not covered, but probably every consumer segment is: pharmaceuticals, perfumes, cosmetics, food, toys, sex-shop goods, books, office supplies, hardware, tools and so on.
The original online version of the directory is stored on the Universe-HTT server.
The open version is hosted on GitHub. Please note that the source tree contains only a fragment of the database. The full file is in the release.
Why is it needed?
Anyone who has searched (most often unsuccessfully) the Internet or anywhere else for a barcode reference already knows why it is needed. For everyone else, here are the useful properties of such a large body of data:
- First of all, this is a list of products with "solid" identifiers. That is, you take an arbitrary product, for example one lying on your bedside table, and by the barcode printed on the package you can match it with the same product sitting in a warehouse somewhere in Rio de Janeiro.
- As a consequence, electronic document exchange between enterprises becomes easier, since the problem of synchronizing most (but of course not all) goods disappears.
- You can quickly open a new store without keying incoming goods into the accounting system by hand: you pull them from the directory by barcode instead (a heavily idealized example, but oh well).
The uses listed above and their variations are fairly mundane. There are much more interesting applications of this directory:
- Analysis of product vocabularies
- Training neural networks to classify goods and normalize their names
- Development of "intelligent" systems for comparing quotes from different sources
- Comparative analysis of sales and other operations across unrelated enterprises
- ... the rest of the list is up to your imagination
Presentation format
The database is a UTF-8 text file with tab-separated fields.
The record structure is as follows:
- ID: Internal Item ID
- UPCEAN: Barcode
- Name: Product Name
- CategoryID: Internal Category ID
- CategoryName: Category name. Since the category directory is hierarchical, this name is composite: it runs from the top level down to the terminal level the product belongs to. Levels are separated by a slash ('/')
- BrandID: Internal Brand ID
- BrandName: Brand Name
Internal identifiers are hardly of interest to anyone. We export them only for our own purposes, in case a record needs to be pinpointed exactly when questions come in from outside.
Records in the open file are sorted alphabetically by product name.
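As a rough illustration of the record structure above, here is a minimal Python sketch that reads such a file. The names `Item` and `read_items` are made up for this example, the sample row is invented, and whether the real file starts with a header row is an assumption (hence the `has_header` switch).

```python
import csv
import io
from typing import Iterator, List, NamedTuple, TextIO


class Item(NamedTuple):
    """One record of the directory (hypothetical type for this sketch)."""
    item_id: str
    barcode: str
    name: str
    category_id: str
    category_path: List[str]  # CategoryName split on '/'
    brand_id: str
    brand_name: str           # empty when the product has no brand


def read_items(stream: TextIO, has_header: bool = True) -> Iterator[Item]:
    """Yield records from a tab-separated text stream.

    Assumption: the seven fields come in the order listed in the article.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header row, if the file has one
    for row in reader:
        if len(row) < 7:
            continue  # skip short or malformed lines
        item_id, barcode, name, cat_id, cat_name, brand_id, brand_name = row[:7]
        path = [part for part in cat_name.split("/") if part]
        yield Item(item_id, barcode, name, cat_id, path, brand_id, brand_name)


if __name__ == "__main__":
    # Invented sample data, only to show the shape of a record.
    sample = (
        "ID\tUPCEAN\tName\tCategoryID\tCategoryName\tBrandID\tBrandName\n"
        "1\t4006381333931\tExample pen\t10\tOffice/Pens\t5\tExampleBrand\n"
    )
    for item in read_items(io.StringIO(sample)):
        print(item.barcode, item.name, item.category_path)
```

For the real file you would pass something like `open(path, encoding="utf-8", newline="")` instead of the in-memory sample. With about 3 million records, iterating lazily as above avoids loading everything into memory at once.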
Special features
If you carefully review the data presented, you will notice that, unlike most of the similar reference books available on the Internet (both free and paid), intensive work was carried out on the product names.
A few words about how we do it.
First of all, the directory (administered in the OpenPapyrus system) is processed automatically using the technology that I once described on Habr.
I would like to say that this technology does everything for us. Alas, it does not. A lot of work has to be done in semi-automatic and manual modes.
Many names have to be "deciphered": in the original source they can contain unbelievable abbreviations and completely ignore our product naming conventions :)
All barcodes published in open access are guaranteed to have been checked for compliance with one of four standards (EAN-13, EAN-8, UPC-A, UPC-E) and to include a check digit. Possible defects and problems are described below.
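For readers who want to repeat such a check themselves, here is a minimal Python sketch of the standard GS1 check-digit rule (weights 3 and 1 alternating from the rightmost payload digit). This is not our actual validation code, and the function names are made up for the example. UPC-E is left out on purpose: its check digit is computed on the code expanded to UPC-A, which this sketch does not do.

```python
def gs1_check_digit(payload: str) -> int:
    """Check digit for a digit string that does not yet include it."""
    total = 0
    for position, digit in enumerate(reversed(payload)):
        # The rightmost payload digit gets weight 3, the next one 1, and so on.
        weight = 3 if position % 2 == 0 else 1
        total += int(digit) * weight
    return (10 - total % 10) % 10


def has_valid_check_digit(code: str) -> bool:
    """True for an 8-, 12- or 13-digit code whose last digit is a correct check digit."""
    if not (code.isascii() and code.isdigit()):
        return False
    if len(code) not in (8, 12, 13):  # EAN-8, UPC-A, EAN-13
        return False
    return gs1_check_digit(code[:-1]) == int(code[-1])


if __name__ == "__main__":
    for code in ("4006381333931", "036000291452", "73513537", "4006381333932"):
        print(code, has_valid_check_digit(code))
```

Note that a correct check digit only says the digits are self-consistent. It says nothing about whether the code is registered or whether the product name attached to it is right.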
Completeness and relevance
To the typical question "Are all barcodes in the directory?" the answer is just as typical: no, and they cannot be.
If we measure completeness by the probability that a barcode which happens to catch your eye is missing from the directory, that probability is 10-15 percent (a very rough estimate of mine and, as you understand, a biased one). In any case, there is nothing of comparable size in the public domain.
Geographical coverage (by countries in which goods are sold) is significant: Russia, Ukraine, Belarus, the United States, the United Kingdom, the European Union, South Africa, Brazil, Malaysia and many others.
Product names are mainly in Russian and English. We usually ignore sources in other languages, since we do not understand them (as an exception, there are items in Spanish, Czech and other languages).
We update the directory on the Universe-HTT server every few months, once enough data has accumulated in the preliminary buffer. The last load was in June of this year, so most of the newest items are probably missing. However, surprising as it may seem, new barcodes do not appear very often. Many products have been sold in retail under the same codes for years.
We also plan to update the open version of the directory from time to time.
Sources
Where do we get all this data? Mostly from the Internet. We collect various price lists and open reports, including those of government agencies (for example, some states in the United States publish procurement data).
Flaws
The directory contains a number of defects. There are not many of them, but they need to be reported.
Defective codes
First of all, there are barcodes that are mistakenly interpreted as UPC-A when in reality they are EAN-13 codes without a check digit. The reason is that the original source (we no longer know which one) contained the EAN-13 code without its check digit, but the last of the remaining 12 digits happened to satisfy the check-digit rule for UPC-A, so our modest algorithm decided the code was UPC-A. This could have been corrected, but we noticed it too late and never got around to a mass fix.
Problems of this kind are vanishingly rare, but, alas, they exist.
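The ambiguity is easy to reproduce. The Python sketch below (an illustration, not our production algorithm; `readings` is a made-up name) takes a 12-digit string and returns both ways it could be read: as a complete UPC-A, if the last digit passes the check, and as an EAN-13 whose check digit was dropped. If the last digit of a truncated EAN-13 is treated as random, it passes the UPC-A check roughly one time in ten.

```python
from typing import Dict


def gs1_check_digit(payload: str) -> int:
    """GS1 check digit: weights 3, 1, 3, ... from the rightmost payload digit."""
    total = sum(
        int(digit) * (3 if position % 2 == 0 else 1)
        for position, digit in enumerate(reversed(payload))
    )
    return (10 - total % 10) % 10


def readings(code12: str) -> Dict[str, str]:
    """Possible interpretations of a 12-digit string.

    'upc_a' is present only when the last digit is a valid UPC-A check digit.
    'ean13_restored' is always present: the same 12 digits treated as an
    EAN-13 payload, with the missing check digit appended.
    """
    if len(code12) != 12 or not (code12.isascii() and code12.isdigit()):
        raise ValueError("expected exactly 12 ASCII digits")
    result = {}
    if gs1_check_digit(code12[:-1]) == int(code12[-1]):
        result["upc_a"] = code12
    result["ean13_restored"] = code12 + str(gs1_check_digit(code12))
    return result


if __name__ == "__main__":
    # Both readings exist for the first string, so the digits alone cannot decide.
    print(readings("036000291452"))
    # The second string fails the UPC-A check, so only one reading is left.
    print(readings("400638133393"))
```

When both keys are present, the digits alone cannot tell you which reading is right. Some extra evidence is needed, such as the source the code came from.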
Gross mismatches
Next, some goods are mixed up. That is, in some (extremely rare) cases a barcode is paired with a name that has nothing to do with it.
Private codes
Some barcodes may be private. EAN-13 codes that start with 2 are discarded on intake, but sometimes something goes wrong and private codes slip through: either ones starting with '2', or ones that start with some other digit but are private all the same, that is, not registered with any of the organizations responsible for this (GS1, for example).
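The intake rule mentioned above fits in a few lines. Here is a minimal Python sketch with made-up function names. It only applies the "EAN-13 starting with 2" rule from the text and does not verify check digits. It cannot detect unregistered codes under other prefixes, since that would require a lookup against a registry.

```python
from typing import Iterable, List


def is_private_ean13(code: str) -> bool:
    """True for a 13-digit code that starts with '2' (the rule described above)."""
    return (
        len(code) == 13
        and code.isascii()
        and code.isdigit()
        and code.startswith("2")
    )


def drop_private(codes: Iterable[str]) -> List[str]:
    """Keep every code that is not a private EAN-13, preserving order."""
    return [code for code in codes if not is_private_ean13(code)]


if __name__ == "__main__":
    # The first code is an invented in-store style number; the others are kept.
    print(drop_private(["2012345678900", "4006381333931", "036000291452"]))
```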
Classification
However hard we tried to build a good classification for the directory, we did not get very far. A third of the items belong to the default group, that is, they are not classified at all. The rest may well have a wrong category.
Not all products are linked to brands, although we worked very hard on this.
How to help?
If you would like to help expand the directory, we will be grateful for any data you send about barcodes known to you. I strongly doubt there will be volunteers, but just in case: it is not hard to find me through my profile information.
Anyone who manages to implement automatic classification of the directory's items and shares their ideas and results will earn the title of an incredibly kind person. For our part, we promise to keep the public informed about the progress of our own research in this area.
Greed
If you like the directory, give it a star on GitHub. If you like it a lot, star the OpenPapyrus project as well, because all administration and management of the directory is done with it.
Terms of Use
There are none. Use it however you wish. If you link to us, thank you. If not, we will survive.
Bitter regrets
Not wanting to pass off necessity as virtue, I will admit that we had hoped to monetize this directory somehow. However, no notable success in that field has been achieved over the years. So we decided: better that it be shared than go to waste. That, roughly, is what our motives look like.
Thank you for your attention.