When choosing a service for storing my data, an important factor is how long that service will live. At a minimum, I need to be able to read the saved data even after the project authors' enthusiasm runs out along with the money for hosting and a database. With this in mind, I went looking for database services that could store my project's user data for free. A promising option was Parse.com, which I already wrote about in the earlier article “Website without a backend”. But in January 2016 we learned that Parse.com would live only one more year before shutting down. So I decided to move user data storage to a git repository published on GitHub.
User <=> GitHub Pages <=> Separate API on paid VDS <=> git-repository on SSD on the same VDS <=> repository on GitHub
By the criterion above, the weak link in this chain is the API on a paid VDS, which may one day become inaccessible forever. But thanks to the rest of the chain, the data remains available on GitHub in both human- and machine-readable form. Since the user on GitHub Pages talks to the API via JavaScript, the main page of the project will still load. And if GitHub ever has to be excluded from the chain for any reason, everything can be moved to another repository hosting service, for example Bitbucket.
Why not use a pull-request system instead of an API? I found 3 reasons:
As a result, I concluded that a solution with code on my own server could not be avoided. All server-side communication with GitHub happens only through git, and an SSH key used for pushing to GitHub keeps the process simple.
While thinking about the structure of such data, I also looked into whether using GitHub for this purpose is legitimate. Let us consider in more detail the points of the Terms of Service that bear directly on the question.
You must be a human. Accounts registered by "bots" or other automated methods are not permitted.
The account that pushes the repository to GitHub must be created manually (with a valid email).
Machine accounts, however, are permitted: such an account may be directed by multiple people, but it must be used exclusively for performing automated tasks, and only one free machine account is allowed in addition to a personal account.
All commits and pushes are made from a single account, and these actions fit the description of a machine user, since they are automated.
If the account's bandwidth usage significantly exceeds the average bandwidth usage of other GitHub customers, GitHub reserves the right to disable the account or throttle file hosting until bandwidth consumption is reduced.
The project uses small text files, and pushes happen on a schedule at short intervals, so in practice this clause should not be violated.
In its help materials, GitHub provides additional guidance.
We recommend repositories be kept under 1GB each. This limit is easy to stay within if large files are kept out of the repository. If your repository exceeds 1GB, you might receive a polite email from GitHub Support requesting that you reduce the size of the repository to bring it back down.
In addition, we place a strict limit of files exceeding 100 MB in size.
In theory, this part could become a problem. Over two years of the book-tracking service's life, about 8,000 records have been saved. The repository is about 7 MB in size, and the largest file, a service file with an index of records, is about 500 KB. Given that the service limits the length of the submitted texts, the limit will not be exceeded any time soon if the service is used as intended. Later on, sharding could be considered.
In the same place, GitHub points out that it (like git itself) was not designed for storing backups or large SQL files. We are not going to use GitHub to store SQL files; our data structure is different. And since the data is meant to be readable on GitHub itself not only by machines but also by people, it can hardly be called a pure backup either.
Designing a functional but non-redundant data structure that can live in a git repository is only possible within a specific project. One could probably describe some universal approach to this, but I will describe the final version, keeping only the essentials.
So, the Bookprint project needs to store, for each user: an id, an alias, the date of the last update and, most importantly, the list of books they have read. For each book read we store: an id, the title, the author, the date of reading, the user's notes and the date of the last update. The idea of the project is that the user enters the book as free text, in whatever form they like. This lets us avoid maintaining a registry of all existing books; in other words, each book record is related only to the user who created it.
JSON was chosen as the data format for its lack of redundancy and its good fit with the idea of storing data in git. If each JSON value is kept on a separate line, you get a readable diff on GitHub.
Since, apart from users and their books, we do not yet need to store anything else, we create a separate directory for each user, named after the user id. Inside it we keep a JSON file with basic information about the user. The books directory also lives there, and it contains a separate JSON file for each book, with the book id as the file name.
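As an illustration, here is a minimal sketch of the record shapes and of writing a record with one value per line. The field names, file layout and function are my assumptions based on the description above, not the project's actual code:

```typescript
import { writeFileSync } from "fs";

// Hypothetical layout (assumed paths):
//   users/<user-id>/user.json            - basic info about the user
//   users/<user-id>/books/<book-id>.json - one file per book read
// The interfaces mirror the fields listed in the article; exact names are assumed.
interface User {
  id: string;
  alias: string;
  updatedAt: string; // date of the last update
}

interface Book {
  id: string;
  title: string;  // free-form input by the user
  author: string;
  readAt: string; // date of reading
  notes: string;
  updatedAt: string;
}

// An indent of 2 makes JSON.stringify put every value on its own line,
// which is what gives the readable diff on GitHub.
function saveBook(path: string, book: Book): void {
  writeFileSync(path, JSON.stringify(book, null, 2) + "\n", "utf8");
}
```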
Now for the auxiliary files. Although every record of a book read belongs to a specific user, the API needs to fetch a book quickly by its id. I went down the path of creating an auxiliary file, an index of books: a CSV file containing the id and the full path to the book's record. This file could have been avoided either by searching for books within a specific user (which would cost extra time to find the file in the user's directory) or by making a composite id that includes the user id (redundancy and a non-atomic id).
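A minimal sketch of how such an index might be maintained and queried, assuming a hypothetical index/books.csv with one "id,path" pair per line (the file name and location are my assumptions):

```typescript
import { appendFileSync, existsSync, readFileSync } from "fs";

const INDEX_FILE = "index/books.csv"; // assumed name and location

// Append a new book to the index when its record file is created.
function addToIndex(bookId: string, recordPath: string): void {
  appendFileSync(INDEX_FILE, `${bookId},${recordPath}\n`, "utf8");
}

// Resolve a book id to the path of its JSON record. A linear scan is enough
// at the current scale (about 8,000 records, roughly a 500 KB index file).
function findBookPath(bookId: string): string | undefined {
  if (!existsSync(INDEX_FILE)) return undefined;
  for (const line of readFileSync(INDEX_FILE, "utf8").split("\n")) {
    const [id, path] = line.split(",");
    if (id === bookId) return path;
  }
  return undefined;
}
```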
The next auxiliary files are latest_books.json and latest_books_with_notes.json, which hold a fixed number of recently added books, and latest_users.json, which holds a fixed number of recently registered users. Thanks to them, the service can show the latest added books with notes and the most recently active users.
Since we use GitHub, we can display some of the information in the repository itself using markdown. To do this, every time new information is added we rebuild README.md and, separately, latest_books_with_notes.md from the JSON files described above. Most importantly, we can also rebuild the user pages themselves with the list of what they have read.
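For example, rebuilding one of these markdown pages could look roughly like the sketch below. Only the file names come from the article; the JSON field names and the page layout are my assumptions:

```typescript
import { readFileSync, writeFileSync } from "fs";

// Shape assumed for the entries in latest_books_with_notes.json.
interface LatestBook {
  title: string;
  author: string;
  user: string;
  notes: string;
}

// Regenerate latest_books_with_notes.md from the JSON file of the same name.
function rebuildLatestBooksPage(): void {
  const books: LatestBook[] = JSON.parse(
    readFileSync("latest_books_with_notes.json", "utf8")
  );
  const sections = books.map(
    (b) => `- **${b.title}** by ${b.author} (${b.user})\n\n  > ${b.notes}`
  );
  writeFileSync("latest_books_with_notes.md", sections.join("\n\n") + "\n", "utf8");
}
```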
The user directories are grouped by the first characters of the id so that no single path level accumulates too many entries (see the sketch below).
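A small sketch of such path construction, assuming a two-character prefix; the prefix length and directory names are illustrative, not the project's actual ones:

```typescript
import { join } from "path";

// Group user directories by the first characters of the id so that no single
// directory level accumulates too many entries.
function userDir(userId: string): string {
  const prefix = userId.slice(0, 2);
  return join("users", prefix, userId);
}

// userDir("83820536") -> "users/83/83820536"
```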
Unlike Parse.com, there is no way to store passwords, but even before this I used uLogin for authentication: it aggregates dozens of social networks and sites and does not require the user to register. Working with uLogin is simple. After a successful login it hands the client an access token, which the client sends to the server; there you can call the uLogin server to validate the token and get some useful information, for example the name of the provider network and the user's id in it. With this information, user data can be tied directly to the account in the social network the user chose, which means that if uLogin ever shuts down, it can be replaced with a similar service (including one of our own). That is why, as the user id, I decided to use a combined id of the form id-provider, for example 83820536-yandex. This approach also made it possible to avoid storing any non-public data at all.
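A rough sketch of this server-side validation step. The endpoint and the response fields used here (network, uid) follow uLogin's token-check flow as I recall it, so treat them as assumptions and verify against the current uLogin documentation:

```typescript
// Validate a uLogin access token on the server and build the combined user id.
async function verifyULoginToken(token: string, host: string): Promise<string> {
  const url =
    `https://ulogin.ru/token.php?token=${encodeURIComponent(token)}` +
    `&host=${encodeURIComponent(host)}`;
  const response = await fetch(url);
  const data = await response.json();
  if (data.error) {
    throw new Error(`uLogin rejected the token: ${data.error}`);
  }
  // Combined id of the form "<id in the network>-<network>", e.g. "83820536-yandex".
  return `${data.uid}-${data.network}`;
}
```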
When planning the service, I provided for the scenario of a user losing access to a social network. That scenario played out recently with the blocking of LinkedIn in Russia, and a user asked for help. The project has a “copy records from another account” feature. Since all the data is public, there is no harm in anyone being able to copy anyone else's list; it is only reasonable to add some restrictions so that users do not shoot themselves in the foot. In the end the user used the copy feature and got his records back, although he now signs in to the service through VK.
Now to the question of user authorization when working with the API. In the early stages of development, after authentication I generated a random access token, associated it with the user id (the mapping was kept in the application cache) and returned the token to the client, which was supposed to include it in every API request. Later, however, I switched to the usual mechanism of sessions and cookies. Cookies have solid advantages. First, you can set the HttpOnly flag; this does not eliminate every XSS attack, but at least one scenario becomes impossible. Second, cookies are sent automatically when the client is JavaScript running in the browser, which is exactly our case.
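A minimal sketch of issuing such a cookie with plain Node, just to show the HttpOnly flag in action; the cookie name and lifetime are placeholders of my own:

```typescript
import { ServerResponse } from "http";
import { randomBytes } from "crypto";

// After successful authentication, keep the session server-side and hand the
// browser only an opaque id in an HttpOnly cookie, unreadable from page scripts.
function issueSessionCookie(res: ServerResponse): string {
  const sessionId = randomBytes(32).toString("hex");
  res.setHeader(
    "Set-Cookie",
    `session=${sessionId}; HttpOnly; Secure; Path=/; Max-Age=${60 * 60 * 24 * 30}`
  );
  return sessionId; // associate this id with the user id on the server
}
```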
Many server-side frameworks make it very easy to implement the “remember me” mechanism with a long-lived cookie, from which the framework restores the user's session; the rest of the authorization procedure happens inside the framework. You do, of course, have to define the entities and the mechanisms for storing data in the file system, but that depends heavily on the data structure. I will only note that you need to provide something like transactions for git commits, so that changes to several entities are merged into a single commit (a sketch follows below).
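A sketch of such a “transaction”, assuming the data repository lives in a local directory with an SSH remote already configured; the paths, branch name and commit message are placeholders:

```typescript
import { execFileSync } from "child_process";

// All files touched by one logical change are staged and committed together,
// then pushed to GitHub over SSH.
function commitTransaction(repoDir: string, changedFiles: string[], message: string): void {
  const git = (...args: string[]) =>
    execFileSync("git", args, { cwd: repoDir, encoding: "utf8" });

  git("add", "--", ...changedFiles); // stage every file of this change
  git("commit", "-m", message);      // one commit per logical change
  git("push", "origin", "master");   // the remote uses an SSH key
}

// Example: a new book record also updates the index and the auxiliary files.
// commitTransaction("/srv/data-repo", [
//   "users/83/83820536-yandex/books/42.json",
//   "index/books.csv",
//   "latest_books.json",
// ], "Add book 42 for user 83820536-yandex");
```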
The advantages of this approach:
- It is free.
- A full history of all changes.
- A ready-made mechanism for rolling changes back.
- The ability to quickly get a complete copy of all the data.
- Fragments of the data can be fetched without going through the API, using GitHub's raw file URLs (see the sketch below).
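A sketch of reading a record through GitHub's raw file URLs; the owner, repository, branch and path here are placeholders for illustration:

```typescript
// Anyone can read an individual record directly from the repository,
// bypassing the API, via raw.githubusercontent.com.
async function fetchBookRecord(owner: string, repo: string, path: string) {
  const url = `https://raw.githubusercontent.com/${owner}/${repo}/master/${path}`;
  const response = await fetch(url);
  if (!response.ok) throw new Error(`GitHub returned ${response.status}`);
  return response.json(); // the record is a plain JSON file
}

// fetchBookRecord("some-user", "books-data", "users/83/83820536-yandex/books/42.json");
```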
The disadvantages:
- Read and write speed. Using an SSD on the hosting improves the situation; I recommend DigitalOcean's SSD hosting.
- Scaling may be difficult. Push and pull are not fast enough to synchronize storage in real time; sharding by user might help.
- Search engine optimization remains an open question. The project is GitHub Pages plus Angular, and search engines do not see such pages. Perhaps the markdown files will make it into the index, or the social network pages to which records are exported will be indexed.
- Implementing search requires extra effort.
- There is no localization support for the markdown pages on GitHub. Duplicating the data could help, but it is not pretty.
- There is no familiar query language. Reads and writes have to be implemented by hand at a fairly low level.
The project using GitHub as a data store has had some time to prove itself: as of this article's publication, it has been running for more than three months. Perhaps nothing has fallen apart only because there has been no serious load and no bad actors. Or perhaps the idea is viable, at least for small projects.
If you still don't know where to keep the list of books you have read, but would like to put together a report on the past year, you are welcome to try it.
The source code for all parts of the project:
Source: https://habr.com/ru/post/317662/