Matching problems and how to deal with them

Good day! My name is Alex Bulavin, and I represent the Sbertech Competence Center for Big Data. Business representatives, product owners and analysts often ask me questions on the same topic: matching. What is it? Why and how is it done? The question “Why might it not work?” is especially popular. In this article I will try to answer them.

Let's start with an everyday example. I have a little son. He recently mastered the mobile phone and now loves to carry it with him so that, like a grown-up, he can call someone whenever he pleases and talk about some “very important” topic. He calls only mom, dad and grandmother. Grandmother gets the most calls: sometimes he calls her 10 times a day to tell her what happened to him 5 minutes ago.

At kindergarten he has a friend, Denis, who also has a mobile phone. When they meet, they compare phones like grown-ups, but they never call each other. I once asked my son:
- Why don't you call your friend and chat about this and that, discuss your affairs?
- Dad, I don’t need to. We see each other at kindergarten every day anyway, and if anything comes up, we'll talk there. Things can wait.

I wondered why. It turned out that neither he nor Denis knows his own phone number, so they cannot exchange them. There is no communication because there are no keys.

What is matching?

New means of interaction in society create new opportunities and bind people more closely, and systems record these connections. Matching is a type of connection that links a subject to itself. For example, the same car may be offered for sale on several classified-ad sites, and we want to link these ads and treat them as a single whole.

Why do you need matching?

Information today is an asset that can be monetized. Accordingly, additional information provides additional value, higher profits or lower costs, through the development of new features, a qualitative change in existing products or even the creation of new ones.

As a rule, our product is clearly associated with certain objects that we want to know more about. The more additional information we receive from new sources, the more pressing the task of combining information from all sources into a single information space, as if these were attributes of one system.

Matching difficulties

Retrieving and linking data looks like a standard technical task. But for a number of reasons it can be difficult or even impossible:

No shared keys

People, organizations, objects: in each new system everything is registered anew and receives its own new identifier.

No keys at all

For some data types, IDs are assigned to individual events or messages in a stream, not to the object that interests us. For example, a system records loan applications, and if the same person submits two separate applications, they get two different IDs. The person himself has no ID in the system.

Key records are not unique

Ideally, each object corresponds to exactly one ID, but in practice this is not the case. For example, a car changes its color or its vehicle passport (PTS) number, or a person changes their name or gender. Formally, for an automated system this is a new object. The same problem arises if an operator, instead of searching for an existing record in the database, simply creates a new one because that is easier for him.

Key records are erroneous or intentionally corrupted

For example, on social networks a page owner may distort his first and last name or replace them with fictional ones. You can find dozens of Ivan Urgants or Vladimir Pozners, and they are not namesakes.

Key records are not permanent

At different times the same ID can point to different objects or subjects. For example, phone numbers change owners.

It turns out that either the objects of the two systems cannot be linked by key at all, or the share and quality of linked records are below the desired level. You can try to assemble the key from a combination of several information fields, a composite key. But new difficulties arise here:

Composite key fields are not always filled

No one promises that ordinary information fields will be “not null”. It is a matter of luck. The more fields in a composite key, the greater the likelihood that some keys will not work.

The fields in the composite key have different filling standards

For example, the address of an organization's office is filled in arbitrarily: d.5 k.2 office 16; house 5 building 2 office 16; 5-2-16. Or a phone number: +7 (495) 344-3 ..., 8-495-344 ..., 495344 ....

In addition, the information fields in a composite key suffer from the difficulties mentioned earlier: they may also be non-unique, erroneous, deliberately distorted or not permanent.
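The null-field problem is easy to see in a small sketch. The function name `composite_key` and the field names below are made up for illustration. The idea is only that a composite key exists when every one of its fields is filled, so each extra field removes some records from the match.

```python
def composite_key(record, fields):
    """Build a composite key from the given fields.

    Returns None as soon as one field is missing or empty, because
    such a record cannot take part in a match on this key.
    """
    parts = []
    for field in fields:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            return None
        # Minimal clean-up so that " Moscow " and "moscow" give the same key.
        parts.append(str(value).strip().casefold())
    return tuple(parts)


# Hypothetical record with one empty field.
person = {"last_name": "Ivanov", "city": " Moscow ", "birth_date": None}

two_field_key = composite_key(person, ["last_name", "city"])
three_field_key = composite_key(person, ["last_name", "city", "birth_date"])
```

Here the two-field key can be built, while the three-field key cannot, because `birth_date` is empty.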

Quantity vs quality

How to overcome the above difficulties and achieve 100% matching? We should start with the question: does one really need to achieve such high levels of quality? Maybe 70% is enough for solving a business problem?

We have a composite key consisting of a set of attributes. Each attribute is filled in with a certain probability and is suitable for use as a key element with a certain probability. The probability that the entire composite key is usable is the product of these probabilities over all attributes of the key (assuming the attributes are independent). This must then be multiplied by the probability that the object is present in both systems at all. The result is the probability of a match. Multiplying it by the total number of entities gives a forecast of the number of matches.
The fewer attributes in a composite key, the higher the probability of a match, and the closer it is to the probability that the object is present in both systems. But the number of matches then grows and often exceeds the forecast, because with fewer attributes in the composite key the probability of an erroneous match increases.

Simply put, as the number of attributes in a composite key decreases, both the number of correctly matched objects and the number of erroneously matched objects increase. Quantity competes with quality. Depending on the business problem, you can choose a matching strategy that shifts the result towards either quantity or quality.
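The forecast described above can be worked through as plain arithmetic. This is a rough sketch, not a formula from any particular system. It assumes the attributes are independent, and all the numbers (attribute probabilities, overlap of the two systems, number of entities) are invented for the example.

```python
from math import prod


def expected_matches(attribute_probs, p_in_both, n_entities):
    """Forecast the match probability and the number of matches.

    attribute_probs: for each key attribute, the probability that it is
                     filled in and usable as a key element.
    p_in_both:       probability that an object is present in both systems.
    n_entities:      total number of entities.
    """
    p_key = prod(attribute_probs)   # the whole composite key is usable
    p_match = p_key * p_in_both     # ...and the object exists on both sides
    return p_match, p_match * n_entities


# Three-attribute key vs. the same key with one attribute dropped.
p3, n3 = expected_matches([0.9, 0.8, 0.7], p_in_both=0.5, n_entities=1_000_000)
p2, n2 = expected_matches([0.9, 0.8], p_in_both=0.5, n_entities=1_000_000)
```

With these made-up inputs the three-attribute key gives a match probability of about 0.25, and dropping one attribute raises it to about 0.36. The sketch only covers the forecast. It says nothing about how many of the extra matches are erroneous.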

Enrichment, filtering, normalization

Is it possible to increase quality and quantity at the same time? Certainly. But you need to spend more resources, sometimes a lot more, on additional data processing.
“Holes” in the data can be filled from other fields of the source. The city can be derived from the phone area code, the TIN or the region code. Gender can be derived from the first and last name or by analyzing the author's text. There are many enrichment algorithms.
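As a minimal sketch of one such enrichment, here is how a missing city might be filled from the phone area code. The lookup table is a tiny hand-made sample and the function name `enrich_city` is hypothetical. The code also assumes Russian numbers with a three-digit area code in front of a seven-digit subscriber number.

```python
import re

# Tiny illustrative sample; a real table would be much larger.
AREA_CODE_TO_CITY = {
    "495": "Moscow",
    "499": "Moscow",
    "812": "Saint Petersburg",
}


def enrich_city(record):
    """Return a copy of the record with 'city' filled from the phone code.

    The record is returned unchanged if the city is already known or the
    area code cannot be resolved.
    """
    if record.get("city"):
        return record
    digits = re.sub(r"\D", "", record.get("phone") or "")
    if len(digits) >= 10:
        # Assumption: the last 10 digits are area code (3) + number (7).
        city = AREA_CODE_TO_CITY.get(digits[-10:-7])
        if city:
            return {**record, "city": city}
    return record


enriched = enrich_city({"name": "Ivanov", "phone": "+7 (495) 123-45-67", "city": None})
```

An enriched value is a guess, not a fact, so it may be worth storing a flag that says where it came from.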

Next, the data should be passed through filters. Filters can be standard or specific, tied to how the data of a particular source is filled in and transformed. A standard filter, for example, removes non-printable characters, duplicates, repeated characters, parentheses, quotes and spaces.
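A standard filter of this kind might look roughly like the sketch below. The exact set of characters to drop is my assumption. In particular, only repeated non-alphanumeric characters are collapsed here, so that legitimate double letters inside names survive.

```python
import re
import unicodedata

# Characters removed outright: parentheses, square brackets and quotes.
BRACKETS_AND_QUOTES = str.maketrans("", "", "()[]\"'«»")


def standard_filter(value):
    """Apply a generic clean-up to a single field value."""
    # Turn every run of whitespace (tabs, newlines, NBSP) into one space.
    value = re.sub(r"\s+", " ", value)
    # Drop control and format characters (Unicode categories C*),
    # e.g. NUL bytes and zero-width spaces.
    value = "".join(ch for ch in value if unicodedata.category(ch)[0] != "C")
    # Drop brackets and quotes.
    value = value.translate(BRACKETS_AND_QUOTES)
    # Collapse repeated non-alphanumeric characters ("--" becomes "-").
    value = re.sub(r"(\W)\1+", r"\1", value)
    return value.strip()


cleaned = standard_filter("  «Romashka»\u200b  (LLC)\x00 ")
```

For the sample input the result is `Romashka LLC`.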

Specific filters include detecting and replacing characters from another keyboard layout that look the same in both languages, for example the letter O typed in the English layout in the name Olya. Another example is detecting and replacing characters from another layout that sound the same or almost the same in both languages (for example, the name Света typed as Sveta).
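The look-alike case can be sketched as follows. The mapping covers only the most obvious Latin/Cyrillic twins and is my own selection, written with Unicode escapes so the two alphabets cannot be confused in the listing. A word is only touched if it already contains at least one Cyrillic letter, so purely Latin words stay as they are.

```python
# Latin letters mapped to visually identical Cyrillic letters.
LATIN_TO_CYRILLIC = str.maketrans({
    "A": "\u0410", "B": "\u0412", "C": "\u0421", "E": "\u0415",
    "H": "\u041d", "K": "\u041a", "M": "\u041c", "O": "\u041e",
    "P": "\u0420", "T": "\u0422", "X": "\u0425",
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
})


def is_cyrillic(ch):
    """True for characters in the basic Cyrillic block."""
    return "\u0400" <= ch <= "\u04ff"


def fix_mixed_layout(text):
    """Replace Latin look-alikes inside words that are otherwise Cyrillic."""
    fixed = []
    for word in text.split(" "):
        if any(is_cyrillic(ch) for ch in word):
            word = word.translate(LATIN_TO_CYRILLIC)
        fixed.append(word)
    return " ".join(fixed)


# "Olya" in Cyrillic, but with a Latin "O" as the first letter.
mixed = "O\u043b\u044f"
repaired = fix_mixed_layout(mixed)
```

After the fix the first letter is the Cyrillic U+041E, so the value compares equal to the same name typed entirely in the Russian layout.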

Normalization can include translation into another language, transliteration, bringing values to a standard pattern (name, brand, telephone, address, gender), replacing short names with full names, and replacing slang and diminutive forms.
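Bringing a phone number to a single pattern is probably the simplest case of normalization. The sketch below assumes Russian numbers, where a leading 7 or 8 is a prefix and the national number has 10 digits. The function name `normalize_phone` and the sample numbers are made up.

```python
import re


def normalize_phone(raw):
    """Reduce a phone number to its 10-digit national form.

    Returns None when the value cannot be brought to that pattern.
    """
    digits = re.sub(r"\D", "", raw)
    # "+7 ..." and "8 ..." are the same prefix written in two ways.
    if len(digits) == 11 and digits[0] in "78":
        digits = digits[1:]
    return digits if len(digits) == 10 else None


variants = ["+7 (495) 123-45-67", "8-495-123-45-67", "4951234567"]
normalized = {normalize_phone(v) for v in variants}
```

All three spellings collapse into one value, which can then be used as a key field.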

Even with the same key composition, different data sources often require different criteria, because each source is filled with data in its own way. To select the criteria correctly, it is advisable to collect and analyze statistics on how the source fields are filled. Quality can be improved by using a frequency coefficient for a field's values in the source (for example, for car make or last name) and a “capacity” coefficient (for example, for the name of a settlement, depending on how large its population is).
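One possible reading of the frequency coefficient is the share of records in the source that carry a given value. A rare surname is stronger evidence than a very common one, so the coefficient can decide whether a value is allowed into the key. The function names and the 0.5 threshold below are arbitrary choices for the sketch.

```python
from collections import Counter


def frequency_coefficients(values):
    """Share of each non-empty value among all non-empty values."""
    counts = Counter(v for v in values if v)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def usable_as_key(value, coefficients, max_share=0.5):
    """A value is usable when it is not too common in the source."""
    return coefficients.get(value, 0.0) <= max_share


surnames = ["Ivanov", "Ivanov", "Ivanov", "Bulavin", None]
coefficients = frequency_coefficients(surnames)
```

In this toy source "Ivanov" covers three quarters of the filled records and would be rejected as a key element on its own, while "Bulavin" would pass.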

When several match keys are used at once, the coefficients can serve as a condition for using a particular key. Other criteria, such as how completely a field is filled, can be used in the same way. Matches on different keys between the same sources can also be combined without any conditions, and the result is quite acceptable.

Other Matches

There are other matching algorithms, sometimes completely different from those listed above. One example is a match on a weak key through a link to another object that has already been matched on a strong key, provided the capacity of such a link is small by definition.

Let's give an example. Any car or apartment has, on average, from 1 to 5 owners over its entire history. If an apartment or car is matched between two systems by a strong key, then its owner, who is clearly associated with it, can be matched by even the weakest parameter, for example last name and first name.
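The car example can be sketched like this. I assume the VIN as the strong key and the owner's full name as the weak key. Both choices, the data layout and the function name `match_owners` are illustrative. The weak key is compared only among the few owners of the same car, and an owner is linked only when exactly one candidate is found.

```python
def match_owners(cars_a, cars_b):
    """Match owners between two systems through their cars.

    cars_a, cars_b: dict mapping VIN -> list of owners,
                    each owner being {"id": ..., "name": ...}.
    Returns a sorted list of (owner_id_a, owner_id_b) pairs.
    """
    pairs = []
    # Strong key: only cars known to both systems are considered.
    for vin in cars_a.keys() & cars_b.keys():
        by_name = {}
        for owner in cars_b[vin]:
            by_name.setdefault(owner["name"].casefold(), []).append(owner["id"])
        # Weak key: full name, compared only within this one car.
        for owner in cars_a[vin]:
            candidates = by_name.get(owner["name"].casefold(), [])
            if len(candidates) == 1:  # skip ambiguous cases
                pairs.append((owner["id"], candidates[0]))
    return sorted(pairs)


cars_a = {
    "VIN1": [{"id": "a1", "name": "Ivanov Ivan"}, {"id": "a2", "name": "Petrov Petr"}],
    "VIN2": [{"id": "a3", "name": "Sidorov Sidor"}],
}
cars_b = {
    "VIN1": [{"id": "b7", "name": "ivanov ivan"}],
    "VIN3": [{"id": "b8", "name": "Sidorov Sidor"}],
}
owner_pairs = match_owners(cars_a, cars_b)
```

Only `a1` and `b7` are linked. The two Sidorovs stay unmatched because their cars differ, which is exactly the protection that the strong key gives to the weak one.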

Objects in a social network or a similar data structure with many stable connections can be matched on weak keys that belong not to the object itself but to its surroundings. The objects being matched may or may not also have a weak key of their own. In effect, this turns into an algorithm the saying attributed to the ancient Greek poet Euripides: "Tell me who your friends are, and I will tell you who you are."
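One simple way to match by surroundings is to compare sets of neighbours. The sketch below uses the Jaccard index, which is my choice of similarity measure and not necessarily the one used in practice. It assumes the neighbours already carry identifiers that are comparable across the two sources, and it compares every pair of profiles, so it only suits small inputs.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_by_neighbours(profiles_a, profiles_b, threshold=0.5):
    """Link each profile in A to the most similar profile in B.

    profiles_a, profiles_b: dict mapping profile id -> set of neighbour ids.
    A link is made only if the best similarity reaches the threshold.
    """
    matches = {}
    for id_a, friends_a in profiles_a.items():
        best_id, best_score = None, 0.0
        for id_b, friends_b in profiles_b.items():
            score = jaccard(friends_a, friends_b)
            if score > best_score:
                best_id, best_score = id_b, score
        if best_id is not None and best_score >= threshold:
            matches[id_a] = best_id
    return matches


profiles_a = {"a1": {"x", "y", "z"}, "a2": {"p"}}
profiles_b = {"b1": {"x", "y", "w"}, "b2": {"q"}}
neighbour_matches = match_by_neighbours(profiles_a, profiles_b)
```

Profiles `a1` and `b1` share two of four distinct friends and are linked, while `a2` has no overlap with anyone and stays unmatched.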

For two sources that each hold one or several photos explicitly associated with object identifiers, you can match on the photos. Objects or faces are detected in a photo and compared with objects or faces in the other source. Google's neural network services like “Is your portrait in a museum?” work on this principle: they match the face in a photo you upload with the faces of medieval people in museum portraits. The match criterion is deliberately loose so that a distant but sufficient similarity is enough.

If you have a large amount of authored text in different sources, you can try text mining algorithms to link authors. This is something like handwriting analysis, except that the style of the text is analyzed rather than the shape of the writing.

Big data

Improving match quality requires different algorithms, which in turn require a lot of resources. The more algorithms, the more resources. And what if there is a lot of data, it changes constantly, and it needs to be processed quickly and inexpensively?

Most likely, traditional methods will not be enough to store, process and match the data. It is worth thinking about a big data infrastructure. There are now quite a few such solutions, from different vendors and for any budget.

At Sberbank, for example, matching of internal corporate data is implemented as a component of the data lake on Hadoop, Spark and HBase. This solution can process large volumes of heterogeneous unstructured data and run calculations on the large cluster where the data is stored, without overhead. It uses open source software and commodity servers, which makes the solution fairly cheap and effective for this class of tasks. Much has been written about Big Data on Hadoop. I, for one, quite like the way DataArt does it.

Our MatchBox

MatchBox is an automatic normalization and matching system that we use in Sberbank's data lake. It was recently developed at the Sbertech Big Data Competence Center.
MatchBox is mainly used to build and keep up to date a single semantic data layer and a single client profile. The system automatically combines information from a large number of sources into a single information super-entity and can be integrated into the process of updating information sources. This enriches our knowledge of the bank's current and potential customers: their socio-demographic, psychological and behavioral features and consumer preferences.

MatchBox works with data of any quality, uses libraries of validated normalization and matching algorithms, has a configurator for user rules, and runs fully automatically by event, on a schedule or as a service. MatchBox scales, and the number of regularly processed sources is limited only by the resource quotas of the cluster.

This is what we have achieved thanks to the introduction of MatchBox:

high processing speed for large volumes at a low cost
combining a large number of sources
full automation of matches on a big data cluster and, as a result, low cost
regular matching: when used in recurring business processes, financial costs are reduced
high matching quality thanks to normalization, enrichment and selection of the optimal configuration, which increases the value of the final data
variability of matching through rule configuration, which reduces the cost of implementing new initiatives
unification of the published result (low cost of introducing new initiatives)
building chains of matches

We are now also exploring and piloting complex combinatorial matches, graph matches and photo matches, as well as a validation subsystem for sources that require high accuracy.

I hope this article helps answer questions about matching, advance your understanding of your own data problems and find approaches to solving them.

Source: https://habr.com/ru/post/354564/
