
Lean Big Data on 6 Google services


Hello Habr! I want to tell you how we built our own Big Data.

Every startup wants to build something cheap, high-quality and flexible. Usually this does not happen, but we seem to have succeeded! Below is a description of our solution and a lot of my purely subjective opinion on the matter.
And yes, the secret is that it uses 6 Google services and almost no code of our own.

What was needed?


I work at a fun Singapore startup, Bubbly, which makes a voice-based social network. The trick is that you can use it without a smartphone, from a good old Nokia. The user calls a special number and can listen to messages, record their own messages, and so on. Everything is voice; you do not even need to be able to read to use it.

In Southeast Asia we have tens of millions of users, but since the service works through mobile operators, nobody knows about us in other countries. These users generate a huge amount of activity that we want to log and analyze in every possible way.

In general, these are the tasks that practically everyone needs, all the time.

Why reinvent the wheel?


It would seem: why build something if there are ready-made solutions? Here is what guided me:

1. I do not want to use Mixpanel (sorry guys!)


2. If you want your “own” solution so badly, why not spin up Hadoop with all the trimmings?

Because we are simply not up to it. It is really difficult!


MySQL is clearly not suitable for the task, because we have too much data for it.

Briefly, how it works for us


  1. We upload all “events” from users from our servers to Google Big Query (a minimal loading sketch follows this list).
  2. We use Google Spreadsheets to run Big Query queries and do the subsequent data processing. All the logic lives in Spreadsheets and the scripts attached to it.
  3. Next, we visualize the data with Google Charts.
  4. We host these charts on Google Drive.
  5. The charts are assembled into a single dashboard on Google Sites.
  6. Finally, Google Analytics sits on top of Google Sites and keeps an eye on the users of all this analytics.
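
The article does not show the loader code, so here is only a minimal sketch of step 1 under a few assumptions: the @google-cloud/bigquery Node.js client library, and the made-up names my-project, analytics and events.

```typescript
// Step 1 sketch: streaming user "events" into BigQuery.
// Assumptions: @google-cloud/bigquery Node.js client; "my-project",
// "analytics" and "events" are placeholder names.
import {BigQuery} from '@google-cloud/bigquery';

const bigquery = new BigQuery({projectId: 'my-project'});

interface UserEvent {
  user_id: string;
  event_name: string;
  created_at: string; // ISO timestamp
}

async function uploadEvents(events: UserEvent[]): Promise<void> {
  // Streaming insert: rows become queryable almost immediately.
  await bigquery.dataset('analytics').table('events').insert(events);
}

uploadEvents([
  {user_id: '42', event_name: 'message_recorded', created_at: new Date().toISOString()},
]).catch(console.error);
```

A batch load job would work just as well here; streaming is shown only because it is the shortest to demonstrate.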

The advantages of this approach (no, Google does not pay me to promote them, which is a pity)


Big Query - Pros


Improvements:
I really wanted Big Query to behave as if it were schemaless: just add events to the system and not think about anything. So a piece of code was attached to the loader that checks the current table schema in Big Query and compares it with what it is about to load. If there are new columns, they are added to the table via the Big Query API. A sketch of this check is below.
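
The loader itself is not reproduced in the article, so this is only a sketch of the idea, under the same assumptions as above (Node.js client, placeholder names). New columns are added as nullable STRINGs purely for simplicity.

```typescript
// Pseudo-schemaless loading: compare the live table schema with an
// incoming event and add any missing columns via the BigQuery API.
import {BigQuery} from '@google-cloud/bigquery';

const bigquery = new BigQuery({projectId: 'my-project'});

async function ensureColumns(event: Record<string, unknown>): Promise<void> {
  const table = bigquery.dataset('analytics').table('events');
  const [metadata] = await table.getMetadata();
  const existing = new Set<string>(
    metadata.schema.fields.map((f: {name: string}) => f.name),
  );

  const missing = Object.keys(event).filter((k) => !existing.has(k));
  if (missing.length === 0) return;

  // Added as nullable STRINGs for simplicity; a real loader would
  // infer the type from the event payload.
  metadata.schema.fields.push(
    ...missing.map((name) => ({name, type: 'STRING', mode: 'NULLABLE'})),
  );
  await table.setMetadata(metadata);
}
```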

Google Spreadsheets - Pros

There is nothing better than spreadsheets for data analysis. That is my axiom. And for this task Spreadsheets fits better than MS Excel (no matter how much I love it), for several reasons.

Improvements:
The script from the tutorial has been slightly modified. Now it checks every sheet in the spreadsheet. If “A1” is written in cell A1, it means that a Big Query query sits in A2. The script puts the results of that query on the same sheet.

This is needed so that everyday use does not touch the code at all: create a new sheet, write a query, get the result. A sketch of the script is below.
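
The real thing is a Google Apps Script bound to the spreadsheet (a modified version of Google's tutorial script, which is not reproduced here). The sketch below only shows the idea, assuming the BigQuery advanced service is enabled; the project id and the output cell layout are my assumptions. It uses TypeScript annotations; drop them to paste it straight into the Apps Script editor.

```typescript
// Scan every sheet; where the marker is in A1, run the query from A2
// and write the result back onto the same sheet.
const PROJECT_ID = 'my-project'; // placeholder

function refreshAllSheets(): void {
  for (const sheet of SpreadsheetApp.getActiveSpreadsheet().getSheets()) {
    if (sheet.getRange('A1').getValue() !== 'A1') continue; // marker described above
    const sql = String(sheet.getRange('A2').getValue());
    if (!sql) continue;

    // For long-running queries you would poll Jobs.getQueryResults
    // until jobComplete; omitted here for brevity.
    const results = BigQuery.Jobs.query({query: sql}, PROJECT_ID);
    const header = results.schema.fields.map((f) => f.name);
    const rows = (results.rows || []).map((r) => r.f.map((c) => c.v));

    // Header on row 4, data from row 5 down (an arbitrary layout choice).
    sheet.getRange(4, 1, 1, header.length).setValues([header]);
    if (rows.length > 0) {
      sheet.getRange(5, 1, rows.length, rows[0].length).setValues(rows);
    }
  }
}
```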


Google Charts - Pros



Google Sites / Google Drive - Pros


Google Analytics

Recursion! Our dashboard has about 30 users, which is enough to analyze usage statistics for the resource. Not surprisingly, Google Sites integrates with Google Analytics in a couple of clicks. Page traffic objectively shows which data is the most interesting, so the system can be improved in that direction.

About the cost of the decision


I believe that in any system the most expensive part is development time, the man-days spent on development and support. In this sense this solution is ideal, since almost no code is written for it. The whole project was done by one person, in parallel with other tasks, and the first version was ready in a month.

There is, of course, a suspicion that the integration between the Google services may break (their tutorial shows this has already happened) and that this will take some support effort. But I do not expect anything terrible.

As for direct costs, Big Query is the only thing in the whole system that costs money: you pay for data storage and for queries. But it is peanuts! We write 60 million events a day and have never paid more than 200 USD per month.
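
As a rough sanity check of those numbers: the article does not state the average event size, so the figure below is an assumption; multiply the result by the current Big Query storage and query prices to get a dollar figure.

```typescript
// Back-of-envelope volume estimate for the cost claim above.
const eventsPerDay = 60_000_000;
const avgEventBytes = 500; // assumed; not stated in the article
const gbPerMonth = (eventsPerDay * 30 * avgEventBytes) / 1e9;
console.log(`~${gbPerMonth.toFixed(0)} GB of new events per month`); // ~900 GB
```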

Important Add-in for Big Query


By default, Big Query scans the entire table. If events for all time are stored in one place, queries become slower and more expensive over time.

The most recent data is always the most interesting, so we ended up splitting the events into monthly tables. Every month the events table is archived off as events_201401Jan, events_201402Feb, and so on.

To make such a structure convenient to work with, we extended the SQL language a bit. Fortunately, everything goes through our own script in Spreadsheets, which can parse and preprocess our queries as needed, so we added a few commands of our own. A hypothetical sketch of that kind of preprocessing is below.
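
The actual commands are not listed in this copy of the article, so everything in the sketch below is invented for illustration: a made-up LAST_MONTHS(n) macro that the Spreadsheets script expands into the n most recent monthly tables before sending the query to Big Query (in the legacy SQL dialect of that era, a comma in FROM is a table union).

```typescript
// Hypothetical query preprocessing: expand LAST_MONTHS(n) into a
// comma-separated list of monthly tables like analytics.events_201401Jan.
const MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'];

function monthlyTable(year: number, monthIndex: number): string {
  const mm = String(monthIndex + 1).padStart(2, '0');
  return `analytics.events_${year}${mm}${MONTHS[monthIndex]}`;
}

function expandLastMonths(sql: string, now: Date = new Date()): string {
  return sql.replace(/LAST_MONTHS\((\d+)\)/g, (_match, n) => {
    const tables: string[] = [];
    for (let i = 0; i < Number(n); i++) {
      const d = new Date(now.getFullYear(), now.getMonth() - i, 1);
      tables.push(monthlyTable(d.getFullYear(), d.getMonth()));
    }
    return tables.join(', ');
  });
}

// Example: "SELECT COUNT(*) FROM LAST_MONTHS(3)" becomes a query over
// the three most recent monthly events tables.
```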

Future plans:




You can see how it all works in the example here.

I would like knowledgeable people to share their opinions. That is what this article was written for.

PS Please send any errors in the text and my anglicisms to me in a private message, I will fix everything.

PPS I am not a programmer at all, so my code may be scary (but it works!). I will be glad to hear constructive criticism.

Source: https://habr.com/ru/post/230243/

