
Breaking ETL Barriers with Spark Streaming, by Concur: a meetup report

Today I attended a talk titled “Breaking the ETL Barrier with Spark Streaming and Real Time Txn Volume Forecasting” and decided to write up my travel notes. The notes turned out a little cynical, but, I hope, interesting.







The meetup was organized by Concur, which mainly serves corporate clients, providing them with a set of financial and travel services. The material was interesting, the level was easy, and this review will be short.


In short, the idea is to replace ETL with roughly the same number of processes that read transaction logs and send them through Kafka to Spark Streaming, where they can be “better processed and analyzed” and then loaded into OLAP (as before). In other words, it is still essentially ETL, but real-time rather than batch, and more programmable.



A short disclaimer: I have not worked with Spark, I have only read popular articles about Kafka, I don't know Java, I know Scala and Akka only superficially, and I have never written my own monads, so I cannot evaluate the technical side of the proposed solution. I got to the meetup almost by accident. Something like that.



And a little about the organizers. The meetup was held by a small initiative team from Concur. Concur's office is located almost across the street from Expedia's, which these days means nothing, because 90% of the people at both companies actually sit in India.







I use their services at work, and what I see appears to be a combination of two services.



The first service resembles Expedia: I can plan and book all the details of a business trip, tickets, car, hotel, choosing among options (closer, cheaper, faster); there is good integration with hotels and airlines; and the policies of my employer are taken into account. For example, there are “preferred” flights, and if I do not use them, I must provide a justification and get approval from my manager. Well done.



The second service handles expense reports and other near-financial matters.



Of course, accounting sees all of this differently and in more complicated terms. Overall the system is very complex and large, even huge, and it is integrated with many other systems: hotels, payments, and so on.



The speaker began with the problems caused by the overall complexity and size of the system, which they want to overcome. According to him, Concur has about 28,000 of either databases or ETL processes; either way, the figure is impressive. “At night” they transfer data from the OLTP databases to the OLAP databases in order to generate reports and analyze any discrepancies. “Night” is relative to the customer's local time zone, which does not make life much easier.



Then the speaker added that some ETL runs last up to 10 hours, and began to trample the current architecture into the dirt:







This livened me up a bit, but I did not ask any questions: the people in the hall nodded and approved each nail in the lid of this coffin, so I did not quibble. But how 28,000 databases can be called a “monolith” or “hard to scale”, I do not understand. Yes, it is fashionable now to call everything that, and I understand that the initiative group whose talk I landed at is essentially campaigning for a local revolution (and a lifetime of payroll for two or three cities' worth of Indian programmers), so any epithet is permissible.



Then the speaker said that there is a 100% perfect solution to the problem, and the audience somehow immediately agreed with him. That is, the 28,000 on the left, of course, do not scale. But the same 28,000 on the right, just like that, suddenly do.







Then things got a bit more fun: we quickly ran through the existing options and dismissed them a little as well. I did not manage to photograph the slide that had more reports on the right, but the speaker did not dwell on them either: a report is just a report, big deal, a report is not architecture! Alerts got a separate mention; the second part of the event was devoted to them, but more on that later.







The speaker verbally compared Spark Streaming with its competitors and separately noted the ability to handle both events and batch data with the same code. The code, of course, should be written in Scala. Exactly-once semantics, guaranteed delivery, and built-in data replication were mentioned separately and repeatedly.
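To make the “one code path for streams and batches” claim concrete, here is a minimal sketch of my own in Scala; it is not from the talk, and it uses Spark's Structured Streaming API rather than the DStream API that a 2015 talk presumably meant. The broker address and the topic name txn-log are assumptions. The same transformation function is applied to a streaming read of the topic and to a batch read of it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object SameCodeForStreamAndBatch {
  // One transformation, usable for both streaming and batch DataFrames:
  // count transactions per one-minute window.
  def txnVolume(df: DataFrame): DataFrame =
    df.selectExpr("CAST(value AS STRING) AS body", "timestamp")
      .groupBy(window(col("timestamp"), "1 minute"))
      .count()

  def main(args: Array[String]): Unit = {
    // Requires the spark-sql-kafka connector on the classpath.
    val spark = SparkSession.builder.appName("txn-volume").getOrCreate()

    // Streaming source: Kafka topic "txn-log" (hypothetical name).
    val streamed = txnVolume(
      spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "txn-log")
        .load())

    streamed.writeStream.outputMode("update").format("console").start()

    // Batch source: the very same topic read as a bounded DataFrame,
    // pushed through the very same transformation.
    val batched = txnVolume(
      spark.read.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "txn-log")
        .load())

    batched.show()

    spark.streams.awaitAnyTermination()
  }
}
```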







And here, finally, is the closing slide:







At the bottom left, OLTP is those 28,000 databases; they were, are, and will be.

At the bottom right, the blue OLAP is where the ETL data currently flows. Replication stays, reports stay; nothing changes here.

Between them, ETL processes currently run on complex schedules, not shown in the diagram.



Everything else is the future, but a very bright one!



At the bottom left, “P” is a producer; there are many of them, and the reader can probably already guess how many there will be. A producer reads transaction logs and sends them to Kafka, as protobuf of course, it is so much faster! And Kafka is 5-20 times faster than RabbitMQ. In the future they will eliminate the OLTP databases and replace them with the service bus, because Event Store! But not yet, unfortunately: the UI gets in the way, and there they are powerless ...
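For orientation only, a sketch of what such a producer could look like in Scala with the stock Kafka client. This is my own illustration, not Concur's code: the topic name, key, and payload are made up, and a raw byte array stands in for the protobuf-encoded record mentioned in the talk.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TxnLogProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.ByteArraySerializer")

    // Values are sent as raw bytes; in the design from the talk these bytes
    // would be a protobuf-encoded transaction-log record.
    val producer = new KafkaProducer[String, Array[Byte]](props)

    val payload: Array[Byte] = "txn-id=42;amount=10.00".getBytes("UTF-8")
    // Key (here a hypothetical source-database id) controls partitioning,
    // so records from one source stay ordered within a partition.
    val record = new ProducerRecord[String, Array[Byte]]("txn-log", "db-0042", payload)

    producer.send(record)
    producer.flush()
    producer.close()
  }
}
```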



Then the data goes to Spark Streaming, where we write simple Scala code and Spark SQL queries. The result is written to OLAP. It will not be completely correct, because of eventual consistency, but that is normal; it is everywhere now.
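A sketch of this Spark side, again my own and again using Structured Streaming rather than the DStream API of that era: the Kafka stream is exposed as a SQL view, an ordinary Spark SQL query runs over it, and each micro-batch is appended to an OLAP-side table over JDBC. The topic, connection URL, and table names are placeholders, not from the talk.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StreamToOlap {
  def main(args: Array[String]): Unit = {
    // Requires the spark-sql-kafka connector and a JDBC driver on the classpath.
    val spark = SparkSession.builder.appName("stream-to-olap").getOrCreate()

    // Read the transaction-log topic (name assumed) from Kafka and expose
    // it to Spark SQL as a temporary view.
    spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "txn-log")
      .load()
      .selectExpr(
        "CAST(key AS STRING)   AS source_db",
        "CAST(value AS STRING) AS body",
        "timestamp")
      .createOrReplaceTempView("txn_events")

    // "Simple Scala code and Spark SQL queries": a plain SQL query
    // over the streaming view.
    val cleaned = spark.sql(
      "SELECT source_db, body, timestamp FROM txn_events WHERE body IS NOT NULL")

    // Append each micro-batch to an OLAP-side staging table over JDBC.
    val writeBatch: (DataFrame, Long) => Unit = (batch, _) =>
      batch.write
        .mode("append")
        .format("jdbc")
        .option("url", "jdbc:postgresql://olap-host/reporting")
        .option("dbtable", "txn_events_staging")
        .save()

    cleaned.writeStream
      .foreachBatch(writeBatch)
      .start()
      .awaitTermination()
  }
}
```

The OLAP table lags the OLTP source by however far the stream is behind, which is the eventual consistency the speaker shrugged off.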



And above it is Hadoop. In principle it is not needed, but let it be: some of the data will arrive as files rather than over the service bus. Alas, it happens. Hence Hadoop is necessary.



About 15 minutes of questions and answers revolved around this slide.



Then came the chaotic second part, about 15 minutes, where they mentioned adding Twitter Trees. The hall was silent; there were no questions.



It turned out that all of this is still only planned, and the prototype may be working by August. I will have to go to the next meetup to see whether the prototype got off the ground.

Source: https://habr.com/ru/post/262785/


