
The missed opportunities of BigData

More has been written about the incredible future of BigData multiplied by artificial intelligence than is contained in the collected works of the Strugatsky brothers and Jules Verne combined. All of it argues, and not entirely without reason, that the huge volumes of collected data, processed with the help of, say, Deep Learning, will let us identify every scammer, prevent questionable transactions and predict the highest-yielding markets. The financial industry, we are told, will all by itself become fully automated under the control of a wise artificial intelligence.

To some extent that will probably come true. Even today the degree of automation has reached a level that seemed like science fiction ten years ago. All true... But, as we know, the "little things" hold plenty of surprises. One such trifle is that the lion's share of the data that could and should be used to fight fraud and to forecast markets is text. The written, video and other material generated every day amounts to billions of lines, and analyzing it with human operators is practically pointless. Some will object that this is not so, that most of the data are ordinary tables which statistical methods handle perfectly well. At first glance they would seem to be right: banks in the TOP-30 report widespread use of BigData. Look closer, though, and, judging by the statements of the same Alfa-Bank, it is mainly a matter of structured transactional data.

But even when analyzing structured data we find that these mountains of numbers run up against individual columns that carry additional meaning: product names, organization names without any TIN, surnames and other, so to speak, "unstructured data". A typical record is sketched below.
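To make the point concrete, here is an invented example of such a record in Python: the table itself is formally structured, yet the columns that matter most for analysis are free text. Every field name and value is purely illustrative.

```python
# An illustrative "structured" record: the table is tidy, but the meaning
# hides in free-text columns. All values are made up.
transaction = {
    "amount": 14250.00,                      # cleanly typed, easy to aggregate
    "date": "2017-05-12",
    "payee": 'IP Ivanov "Stroymaterialy"',   # organization name, no TIN given
    "purpose": "payment under contract 17/04 for cement M500, 25 kg, 30 pcs",
}

# Statistical methods handle "amount" and "date" well; pulling the goods,
# quantities and counterparty out of "payee" and "purpose" is a text problem.
```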
Another huge reservoir is the data sets of price lists, apartments, cars and much else. Here again someone will object: "But practically everywhere there are product catalogs, there is the TN VED, OKVED-2 and much more." That remark already contains the answer to many questions. All these reference books are industry-specific and incomplete, they lack full descriptions and rules of use, and human imagination sometimes knows no bounds. In other areas, such as arrays of contracts, job ads and posts on the Internet, there are no reference books at all.

What unites all these problems is the recognition that no statistical methods, not even neural networks, can solve them without systems for semantic and semiotic search and analysis. A simple example is the task of fighting fraud in mortgage lending, or in issuing a loan for the purchase of a used car. Anyone would want a data set that answers a few questions: Is the apartment or car against which the loan is requested present in the sale listings? What is the price per square meter in the same or a neighboring building, or the price of a similar car? What is the price within the settlement, within the agglomeration, and so on? A sketch of such a check is given below.
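As an illustration, here is a minimal Python sketch of such a check, assuming the sale listings have already been collected and parsed into a simple table. The field names, prices and the 25 percent deviation threshold are invented for the example and do not come from any real system.

```python
from statistics import median

# Hypothetical, already-parsed sale listings: one dict per ad.
listings = [
    {"building_id": "B-17", "area_m2": 54.0, "price": 5_130_000},
    {"building_id": "B-17", "area_m2": 61.5, "price": 5_780_000},
    {"building_id": "B-18", "area_m2": 48.0, "price": 4_420_000},  # neighboring house
]

def median_price_per_m2(ads):
    """Median price per square meter over a set of listings."""
    return median(ad["price"] / ad["area_m2"] for ad in ads)

def check_collateral(declared_price, area_m2, building_id, neighbors, max_deviation=0.25):
    """Flag an application whose declared price per m2 deviates too far
    from the market median in the same or neighboring buildings."""
    relevant = [ad for ad in listings
                if ad["building_id"] == building_id or ad["building_id"] in neighbors]
    if not relevant:
        return "no market data"   # would fall back to city / agglomeration level
    market = median_price_per_m2(relevant)
    declared = declared_price / area_m2
    deviation = abs(declared - market) / market
    return "suspicious" if deviation > max_deviation else "plausible"

print(check_collateral(9_500_000, 55.0, "B-17", neighbors={"B-18"}))  # -> suspicious
```

The hard part, of course, is not this comparison but getting the listings into such a table in the first place, which is exactly the extraction problem discussed below.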

Downloading data from websites "as is" is no longer a technically difficult task. Having obtained such a database, we end up with millions of records of unstructured information, a database that falls squarely into the BigData category. Analyzing job-offer databases in order to verify that the salary stated in a certificate is adequate, or assessing the reliability of the younger generation without analyzing social networks, is by and large an impossible task.

Recently, state bodies of various kinds have shown growing interest in semantic data analysis. As an example, an electronic auction for the development of the "analytical subsystem of the AIS FTS", which is essentially a subsystem for semantic text analysis, was posted on the state procurement website in May 2017.

Unfortunately, behind the triumphant reports there somehow remains a whole pool of problems and missed opportunities. Let us try to look into at least some of them.

First, there is the availability of the data volume itself. The volume and speed of incoming data today rule out processing by human operators. The consequence is an urgent need for products on the market that handle data quality and data mining tasks automatically, with an extraction rate of at least 80-90 percent at very high processing speed. And, no less important, the error rate should be no more than 1-1.5 percent. An attentive reader may point out that there are various distributed solutions, such as Hadoop and the like, that address poor performance. True, but many people forget that such processes are cyclical: what has just been extracted must be added to reference books, search indexes and so on. Data that do not intersect within one stream may intersect with data from another stream. Consequently, the number of parallel branches should be minimized and the throughput within a single stream should be as high as possible; the sketch below shows why.
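Here is a minimal sketch of that cycle: a single-stream extractor that feeds every newly found entity straight back into its reference dictionary, so that later records in the same stream resolve against it. The records, the normalization rule and the dictionary are all invented for the example.

```python
# Reference dictionary of canonical organization names, built up as we go.
known_orgs = {"ooo romashka": "OOO Romashka"}

def normalize(raw: str) -> str:
    """Crude normalization: lowercase, drop quotes, collapse whitespace."""
    return " ".join(raw.lower().replace('"', " ").split())

def extract_org(record: str) -> str:
    """Resolve a free-text organization mention against the growing dictionary."""
    key = normalize(record)
    if key in known_orgs:
        return known_orgs[key]
    # New entity: register it immediately so later records in the stream match it.
    known_orgs[key] = record.strip()
    return known_orgs[key]

stream = ['OOO "Romashka"', "ZAO Vektor", "zao  vektor", "ooo romashka"]
for rec in stream:
    print(extract_org(rec))

# The third record matches only because the second one already updated the
# dictionary. Split the stream across independent workers and that match is lost.
```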

Second, there is the real percentage of data that actually gets used. According to some Western sources, the share of "dark" or hidden data in different countries reaches half or more. The main reasons it cannot be used are weak structure combined with low quality. Let me clarify right away that structure and quality are two completely different problems. Unstructured data is hard to decompose into components, hard to build dependencies from and hard to compare, yet it can be absolutely reliable and valid in essence. Invalid, that is, low-quality, data can be well structured and still fail to correspond to objects in the "real" world. For example, a postal address may be beautifully spread across its fields and yet not exist in nature.
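To illustrate the difference, here is a small sketch with two independent checks: one for structure and one for validity against a reference register (in Russia that role is played by address registers such as FIAS; the register below is a stand-in with invented contents).

```python
# Stand-in reference register of existing addresses: (city, street, house) triples.
REGISTER = {("Moscow", "Tverskaya st", "7")}

def is_well_structured(addr: dict) -> bool:
    """Structural check: every expected field is present and non-empty."""
    return all(addr.get(k) for k in ("city", "street", "house"))

def is_valid(addr: dict) -> bool:
    """Quality check: the address actually exists in the reference register."""
    return (addr.get("city"), addr.get("street"), addr.get("house")) in REGISTER

addr = {"city": "Moscow", "street": "Tverskaya st", "house": "1024"}
print(is_well_structured(addr))  # True  - neatly spread across the fields
print(is_valid(addr))            # False - no such house in the register
```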

Third, there is the weakness of Western systems in the semantics of the Russian language. Analysts themselves often overlook this problem when choosing systems to work with data. Solution providers and system integrators promise that the issue is easily solved, because "our solution is already present in many countries." What usually goes unsaid is that those deployments are either in international organizations working in English, or in a language of the same Romance group, or the implementation is not fully localized. In our experience, every attempt to localize the semantic search products known on the Russian market has failed, reaching a quality level no higher than 60-70 percent of what is possible.

Fourth, different participants in the process may have different ideas about the rules for classifying particular entities. This is not just a matter of there being several systems within one information landscape; often, within the same system, essentially identical objects are described and classified differently. And the reason is not the inattention or negligence of individual employees but the context or conditions in which the action was performed: national traditions, a different cultural code. Making the rules unambiguous under these conditions is simply impossible.

Thus the challenge of using big data, artificial intelligence and the rest in practice requires a broader view, better captured by the term Data Science. And when designing BigData solutions, the issues of cleaning and extracting data deserve separate and no less important attention. Otherwise, as the well-known saying goes, an automated mess is still a mess.

Source: https://habr.com/ru/post/329390/

