
Hello,
I'm going to tell you about the technical side of how I built a new web RSS reader in 16 weeks and almost lost my mind in the process.
Skipping the long back story, let's say it all began in February of this year, when David (dmiloshev, UI designer) and I decided to build a prototype of our brainchild together.
"Together" meaning just the two of us: there were no scrums, no meetings, no "collective intelligence", and the entire technical part was done by me alone.
If I had to describe the whole article in one sentence, it would come out as:
NoSQL, MongoDB, node.js, my brain, evented I/O, queues, outputs, git, nginx, memcached, Google Reader, Atom, TTL, PHP, ZF, jQuery, conclusions.
I. Technology
1. PHP / Zend Framework + a bit more
All I had at the very start was a small wrapper that makes working with ZF a bit more convenient. It includes Dependency Injection, Table Data Gateway, Transfer Object, more convenient config handling, and it is built with Phing with tasks already defined for almost every occasion. All in all, it is very pleasant to work with.
Architecturally, the PHP application consists of the following layers (a small sketch follows the list):
- Routing / Controller / View - self-explanatory
- Service - ACL, validation, caching and logging live here. You can safely hang REST off it and get an excellent API.
- Gateway - the body of the business logic. Each entity in the system has its own gateway. Completely abstracted from the database.
- Mapper - this is where the actual work with the database happens.
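To make the layering a bit more concrete, here is a minimal sketch of the call chain. The real code is PHP/ZF; the JavaScript below (like the other sketches in this article) and all class and method names are purely illustrative:

```js
// Purely illustrative: Service wraps Gateway, Gateway wraps Mapper,
// and only the Mapper knows anything about the database.
class FeedMapper {
  constructor(db) { this.feeds = db.collection('feeds'); }
  findByUrl(url) { return this.feeds.findOne({ url }); }
}

class FeedGateway {                       // business logic, storage-agnostic
  constructor(mapper) { this.mapper = mapper; }
  async resolve(url) {
    const feed = await this.mapper.findByUrl(url);
    if (!feed) throw new Error('unknown feed');
    return feed;
  }
}

class FeedService {                       // ACL, validation, caching, logging
  constructor(gateway, acl, log) { Object.assign(this, { gateway, acl, log }); }
  async subscribe(user, url) {
    if (!this.acl.can(user, 'subscribe')) throw new Error('forbidden');
    this.log.info('subscribe', { user: user.id, url });
    return this.gateway.resolve(url);     // a REST controller calls this layer
  }
}
```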
A few more principles I tried to keep in mind while designing:
- KISS
- No reinventing the wheel
- Any logical part of the system should be able to scale horizontally.
- All procedures dealing with external sources should be performed in the background and not freeze the user interface.
- Any heavy load should be absorbed by queues, CPU power and timeouts, not by the number of processes or open connections
- Any data can be recovered.
- You need to log everything, not just errors
- "It can be cached"
- David: "That box needs to be 1 pixel further left..." =)
2. nginx
No comment.
3. git
It once made it clear to me that I was not as smart as I thought.
4. MongoDB
We had used it in production before, on another project, but very cautiously, so we never tested it to the full. Lately the hype around NoSQL, sharding, map-reduce and no-SPOF has grown especially strong, and I decided it was time to grow out of my short pants. At the very least, it broke up the general routine nicely and gave me a good shake.
The documentation is very detailed, so I was able to grasp the full depth of MongoDB within the first two weeks. After years of working with relational databases I had to turn my brain inside out a little, but I managed to solve all the non-trivial problems on my own, without resorting to questions on the forums.
I'm slightly afraid of running it in production. To stay on top of possible problems, I regularly read the mailing lists and study the issues other people run into.
At the moment it is configured as master-master, which is not fully supported, but in our case it should work as intended. Later we will shard it, and that will definitely be easier than with, say, MySQL.
5. Memcached
There is nothing much to say. Simple as a door. Although in the future I'd like to try it over UDP... just for fun.
6. Memcacheq
There are plenty of alternatives to it today, but I can say it performed very well in production on the previous project.
It's also nice that it doesn't need a special driver: it speaks the memcached protocol (which came in handy in the next section).
7. node.js
This is probably the most interesting thing that happened to me during these four months. Server-side evented I/O is very exciting, even more so than the differential. I immediately wanted to rewrite all the PHP in Ruby. But those are just dreams.
The thing is, I discovered it quite recently and completely by accident. After that, a lot of things fell into place, both in the system itself and in my head. I had to rewrite quite a lot, but the result warms the soul, and I hope it will please future users too.
I smoked this page down to the filter; at the moment I use: mongoose, kiwi, step, memcache, streamlogger, hashlib, consolelog, eyes, daemon.
Of my own libraries I wrote jsonsocket, whose name, I think, speaks for itself. I haven't gotten around to putting it on github. Now I dream of turning it into bsonsocket. And of course I had to write things for working with queues, plus a layer for talking to the Gateway layer in PHP (more on that later).
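Since the library isn't published, here is only a rough sketch of the idea behind a jsonsocket: newline-delimited JSON messages on top of a plain TCP socket. This is not my actual code, just an illustration:

```js
// Sketch: wrap a net.Socket so both sides exchange one JSON object per line.
const net = require('net');
const { EventEmitter } = require('events');

class JsonSocket extends EventEmitter {
  constructor(socket) {
    super();
    this.socket = socket;
    let buffer = '';
    socket.on('data', (chunk) => {
      buffer += chunk.toString('utf8');
      let idx;
      while ((idx = buffer.indexOf('\n')) !== -1) {   // one message per line
        const line = buffer.slice(0, idx);
        buffer = buffer.slice(idx + 1);
        if (line.trim()) this.emit('message', JSON.parse(line));
      }
    });
  }
  send(obj) {
    this.socket.write(JSON.stringify(obj) + '\n');
  }
}

// Usage example: a tiny echo server speaking JSON messages.
const server = net.createServer((raw) => {
  const js = new JsonSocket(raw);
  js.on('message', (msg) => js.send({ echo: msg }));
});
server.listen(8124);
```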
I also hooked up prowl, so the background now sends a push message to my phone once an hour with random quotes from the tower (along with a bit of statistics such as memory usage, etc.).
Many libraries (modules) are very raw, so sometimes I had to fix things by hand right in someone else's code (there was no time to prepare proper patches). And the dear node.js gentlemen don't worry much about backward compatibility, so you can often run into libraries that simply don't work.
8. jQuery
For me, this is almost a synonym for client-side javascript.
Used plugins: blockUI, validate, form, tooltip, hotkeys, easing, scrollTo, text-overflow and a couple of smaller ones.
II. Development
I won't delve into the specifics of the service itself; technically it is almost Google Reader (GR).
While David pushed gray boxes around in Photoshop and thought through the business logic, I started with basic data modeling and then moved straight on to the feed download system.
1. Feed Pull
It would seem simple enough: fetch the URL, download the XML, parse it, write it to the database. But there are nuances.
- Each feed must be uniquely identified so it can be saved in the system.
- Each entry must also be identified, to avoid duplicates.
- Support for things like If-Modified-Since and ETag (there is a sketch after this list)
- Redirect handling
- Different versions of RSS / Atom
- Extensions from various services, for example gr:date-published
- The HTML inside each post has to be sanitized, but not completely: keep the good tags and filter out all kinds of heresy
- Finding and processing the favicon was not the most pleasant part... for example, LiveJournal does not return a content-type, so you have to resort to magic.mime
- Apparently very few people read the specification, so the XML may be invalid, or VERY invalid.
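For illustration, a minimal conditional fetch with If-Modified-Since / ETag could look roughly like this. The real pull code lives in PHP and handles far more edge cases; this node sketch covers plain http only, and the stored lastModified / etag values are assumed to come from the previous successful pull:

```js
const http = require('http');

function pullFeed(feedUrl, lastModified, etag, callback) {
  const headers = {};
  if (lastModified) headers['If-Modified-Since'] = lastModified;
  if (etag) headers['If-None-Match'] = etag;

  http.get(feedUrl, { headers }, (res) => {
    if (res.statusCode === 304) {          // nothing changed, skip parsing
      res.resume();
      return callback(null, { notModified: true });
    }
    if (res.statusCode >= 300 && res.statusCode < 400 && res.headers.location) {
      // follow the redirect; real code must resolve relative URLs and cap the depth
      res.resume();
      return pullFeed(res.headers.location, lastModified, etag, callback);
    }
    let xml = '';
    res.setEncoding('utf8');
    res.on('data', (chunk) => { xml += chunk; });
    res.on('end', () => callback(null, {
      xml,
      lastModified: res.headers['last-modified'],  // remember for the next pull
      etag: res.headers['etag'],
    }));
  }).on('error', callback);
}
```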
Findings:
External sources vary wildly, and many simply don't care about the standards. They cannot be trusted: content validation has to be as strict as when dealing with user input.
The beautiful code didn't survive. It got overgrown with conditions and exceptions.
Every item took a lot of time, more than it might seem at first glance.
2. Update
Now I wanted all existing feeds to update automatically, preferably taking into account the TTL (update rate) of each individual feed. I also wanted to spread that TTL across the time of day. I did not rely on the TTL from the protocol, because according to my research it is either absent entirely or doesn't match reality. Either way, it is not enough.
I started thinking about my own system for determining how often each feed should be updated, and here is what came out:
- TTL: the average distance in seconds between entries in the feed within an hour (minimum 2 minutes, maximum 1 hour)
- Each feed keeps a list of average TTLs for each of the 24 hours over the last 10 days.
- Based on the actual data for the last 10 days, a forecast is built for the next day: the expected average TTL for each hour.
- Every time a feed is updated, the system recalculates its actual average TTL for the current hour (0-23).
So, for the system to update a feed every 2 minutes at lunchtime, the feed has to show that kind of activity during that period for 10 days; until then the system adapts gradually, updating it more and more often. And at night, for example, the same feed will be updated once an hour.
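A simplified sketch of the forecast step, assuming we already have the per-hour average TTLs for the last 10 days (the data layout and function name are invented; the clamping limits come from the description above):

```js
const MIN_TTL = 2 * 60;     // 2 minutes
const MAX_TTL = 60 * 60;    // 1 hour

// hourlyTtls: one array of 24 average TTLs (seconds) per day, for the last 10 days
function forecastTtl(hourlyTtls) {
  const forecast = [];
  for (let hour = 0; hour < 24; hour++) {
    const samples = hourlyTtls
      .map((day) => day[hour])
      .filter((ttl) => typeof ttl === 'number');
    const avg = samples.length
      ? samples.reduce((sum, ttl) => sum + ttl, 0) / samples.length
      : MAX_TTL;                                   // no data: poll once an hour
    forecast.push(Math.min(MAX_TTL, Math.max(MIN_TTL, Math.round(avg))));
  }
  return forecast;   // forecast[h] = how often to poll this feed during hour h
}
```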
The update procedure itself is essentially the Feed Pull described earlier, which, by and large, doesn't care whether it is adding or updating.
And here we smoothly arrive at the point where I want to push all of this onto queues. But I'll talk about how those are organized a little later.
By the way, the plan is to bolt on PubSub, and also to launch our own hub.
3. Discovery
Clearly, the skill set of a convenient RSS reader should include finding RSS/Atom feeds on any HTML page. When a user simply types in a site address (for example, www.pravda.ru), the system should go there and find the feeds he can actually subscribe to.
But this procedure is complicated by the fact that it cannot be done right inside the user's request; that is not the web server's job at all, it has to be done asynchronously. On a user request we first check whether such a feed already exists in the database, then we look into the discovery cache (which lives for 2 hours), and if nothing is found we put the task into a queue and wait at most 5 seconds (how exactly we wait, I'll explain later). If the task hasn't finished within that time, we end the script and return JSON in the style of {wait: true}. After a short timeout the client side repeats the same request to the server. As soon as the background finishes the task, its result shows up in the discovery cache.
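On the client this turns into a simple polling loop. A hedged jQuery sketch; the endpoint, field names and callback are invented, only the {wait: true} contract comes from the description above:

```js
function discover(siteUrl) {
  $.getJSON('/discovery', { url: siteUrl }, function (res) {
    if (res.wait) {
      // the background task is still running: ask again a bit later
      setTimeout(function () { discover(siteUrl); }, 2000);
      return;
    }
    renderFoundFeeds(res.feeds);   // hypothetical UI callback
  });
}
```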
Several nuances associated with this procedure:
- Different encodings: sometimes the encoding is specified neither in the HTTP headers nor in the page itself... you have to detect it byte by byte (which does not always work)
- There can be two identical feeds on one page, one RSS and one Atom; in that case you need to pick just one of them
- Each found feed has to be requested additionally to make sure it works and to get its real title and description
- Redirects
- Icons (same problems)
- Standards and validity (same story)
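The core of the discovery step itself is just pulling the alternate links out of the HTML. A rough sketch, regex-based only for brevity; the real thing also has to deal with the problems listed above:

```js
const { URL } = require('url');

const FEED_TYPES = ['application/rss+xml', 'application/atom+xml'];

function findAlternates(html, baseUrl) {
  const links = html.match(/<link\b[^>]*>/gi) || [];
  return links
    .filter((tag) => /rel=["']?alternate["']?/i.test(tag))
    .map((tag) => {
      const type = (tag.match(/type=["']?([^"'\s>]+)/i) || [])[1];
      const href = (tag.match(/href=["']?([^"'\s>]+)/i) || [])[1];
      return { type, href };
    })
    .filter((link) => link.href && FEED_TYPES.includes(link.type))
    .map((link) => new URL(link.href, baseUrl).href);   // resolve relative hrefs
}
```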
I consider this part rather crude, because we are trying to build a service for people who shouldn't have to know what RSS or Atom is. For example, if I, as a perfectly ordinary user, suddenly want to subscribe to my favorite and one-and-only vkontakte.ru, I will get exactly nothing. At a minimum, in the future we want to implement something along the lines of GR's generated feeds; beyond that, a convenient, human-friendly search.
By the way, it often happens that there is no alternate link on the given page, but there is one somewhere on other pages of the same site. There is an idea to write a background crawler that would quietly look for RSS/Atom feeds on the sites users enter most often.
Findings:
Working with heterogeneous external sources feels like digging through a giant trash can in search of a document someone threw away by accident.
This part needs further work. From a usability standpoint, simply looking for alternate links on the given page is not enough; we need something more universal.
4. Interface
The next thing I really wanted to see was an interface where I could subscribe to a feed, add it as a bookmark in the left column, click it and read its contents.
I won't go into the details of the interface implementation; I'll just say that I did all the layout and UI myself. That was very inefficient and kept distracting me from other tasks. But jQuery saved time.
In total I spent two weeks on the reading view and the general interface (not counting the rather stressful tweaks and rework that came later). After that we had a rather nice toy glowing on our monitors, pleasing both the eye and the soul.
Folders
Of course we are all for minimalism, but I cannot imagine working with a reader without folders. And, excuse me gentlemen, their usability in Google Reader leaves much to be desired. We tried to make them as accessible and simple as possible.
But I never thought they could be such a problem technically. The interface is one thing, but on the server side I had to strain quite a bit to make it all work properly - see the next section.
Where possible, I tried to use CSS sprites.
All JS and CSS are each collected into a single file, minified and compressed with gzip. The average page (with all the statics) weighs 300 KB; with a warm cache, 100 KB.
And for ie6 we have a special page.
Findings:
The interface itself looks very simple, but I wouldn't say the same about its implementation.
In the end, with everything compressed and Firebug turned off, it runs briskly.
In total I count 28 screens at the moment, and a million use cases.
5. Read / unread entries
This turned out to be a rather non-trivial task for a system that can potentially have a bazillion feeds and even more subscribers. Most importantly, it has to scale horizontally.
In each Entry entity I keep the list of users who have read it. That list could hold a million identifiers without hurting performance, thanks to MongoDB's architecture. Additional information (reading time and so on) is stored in a separate collection; it is not indexed and is needed purely for statistics, so everything works quite fast.
For each user we store the date when his counters were last updated, for every feed he is subscribed to.
When the user refreshes the page, the system counts, for each feed, the new entries that appeared after that date and adds them to the unread counter (a simple increment). When the user reads an entry, a simple decrement happens.
Selecting the unread entries of a single feed is also very simple.
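In MongoDB terms the scheme above boils down to a few simple operations. A mongo-shell sketch with invented collection and field names; feedId, userId, entryId and lastCounterUpdate stand for real values:

```js
// Unread entries of one feed for one user: everything this user hasn't read yet.
db.entries.find({ feedId: feedId, readBy: { $ne: userId } });

// New entries since the last counter update, used to bump the unread counter
// (assuming the counter document already exists).
var fresh = db.entries.count({ feedId: feedId, published: { $gt: lastCounterUpdate } });
db.counters.update({ userId: userId, feedId: feedId }, { $inc: { unread: fresh } });

// Marking an entry as read: remember the user on the entry, decrement the counter.
db.entries.update({ _id: entryId }, { $addToSet: { readBy: userId } });
db.counters.update({ userId: userId, feedId: feedId }, { $inc: { unread: -1 } });
```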
But selecting only the unread entries across a whole folder is already a problem. I won't go into the nuances, but it comes down to the fact that there are simply no joins in MongoDB. It can't be solved with one simple query or even several, only through CodeWScope, which can't be indexed; and scaling it means map/reduce. Right now this is a potential bottleneck.
5.1. Unread on top
If you have used Google Reader, you probably know its "show only unread" option. With it, if a feed has no entries you haven't read, you are looking at an empty page. At first we did the same, but testing showed that users don't even realize they have this option turned on. They don't understand why the feed is empty, where its entries are, and where they went.
David suggested a very interesting solution: unread entries simply float to the top, and read ones sink down. It cost me a few days of racking my brain over how best to implement this, especially for folders.
Findings:
NoSQL is great in terms of speed and scalability, but some seemingly trivial things turned out to be quite hard to do with it.
Denormalization is good. A counter drifting out of sync should not be treated as a disaster, but for any denormalized data you need a full recalculation routine (running in the background, of course).
Map/reduce in MongoDB is still too raw for production. A little testing showed that it locks everything to hell while it runs. The developers promise to improve it in version 1.6. For now we get by without it.
Schema-less rules.
8. Sharing
This is a feature that lets you pin any entry from a feed you read onto your own page. In short, it means that some authoritative guy A, while reading various feeds, can instantly save particular entries (the most interesting and useful ones, of course) into his own stream(s) - similar to Shared Items in GR. And other users can subscribe directly to his "Shared Items" stream, just like to any other feed.
One of the core concepts of our service is convenient distribution of information. A technically interesting task for me was building the chains of shares; MongoDB with its schema-less nature helped a lot here.
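Just to illustrate how schema-less storage helps here, a hypothetical shape for a shared entry document (not our actual schema, all field names invented):

```js
// Schema-less storage lets the whole share chain live right on the entry,
// with no extra join tables.
var sharedEntry = {
  title: 'Some post',
  sourceFeedId: 'feed123',              // where the entry originally came from
  shareChain: [                         // who re-shared it from whom, in order
    { userId: 'userA', at: new Date() },
    { userId: 'userB', via: 'userA', at: new Date() },
  ],
};
```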
An interesting point:
Google recently announced a new ReShare feature in Buzz. In that article (linked), in the "A little more background" part, I ran into the very points David and I had discussed at length 4 months ago and had come to the same conclusions about. Our implementation of sharing is very similar.
9. Node.js, background, queues
Initially the daemons were written in PHP, and rather crookedly at that. Apart from MongoDB, this was the shakiest part of the app for me, since PHP is simply not meant for that kind of thing.
But when I stumbled upon node.js (just two weeks ago), my soul started singing and I could "sleep" calmly again. The problem was that there was no time at all to rewrite all the background code that was already implemented in PHP (feed-pull, discovery, feed-info).
A very brief dig through node's capabilities led me to a compromise solution: child processes.
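The compromise looks roughly like this: node keeps the existing PHP code alive by spawning php-cli as a child process and exchanging one JSON message per line over stdio. A sketch only; the script name and message format are made up:

```js
const { spawn } = require('child_process');

function startPhpWorker(script) {
  const child = spawn('php', [script]);   // e.g. 'worker.php' (hypothetical)
  child.stdout.setEncoding('utf8');

  let buffer = '';
  child.stdout.on('data', (chunk) => {
    buffer += chunk;
    let idx;
    while ((idx = buffer.indexOf('\n')) !== -1) {
      const line = buffer.slice(0, idx);
      buffer = buffer.slice(idx + 1);
      if (line.trim()) handleResult(JSON.parse(line));   // task finished in PHP
    }
  });

  return {
    run(task) { child.stdin.write(JSON.stringify(task) + '\n'); },
    stop() { child.kill('SIGTERM'); },                   // POSIX signal
  };
}

function handleResult(result) {
  console.log('php worker finished task', result);
}
```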
9.1. Queue manager
This is the first node daemon. Its job is to read the queues, hand tasks out to workers and keep an eye on how they are doing.
- One manager can serve many queues
- Any number of managers can run in the system, one per server.
- Each one can be configured in its own way; for example, different managers can work with different sets of queues
- Each queue's configuration can differ and has the following parameters (a sample appears below):
- The maximum number of workers running simultaneously (the actual number adjusts to the load)
- The size of the task buffer (tuned depending on the task type and the number of workers)
- The maximum idle time of a worker (it is killed automatically to free memory if it sits idle)
- The maximum lifetime of a worker (if it is php-cli, it shouldn't live long; better to restart it now and then)
- The maximum memory usage of a worker (as soon as it is exceeded, we kill it)
- A timeout for task execution (if the worker gets stuck on a task, kill it and return the task to the queue)
- The number of times a task may be put back into the queue
- When a task is selected from a queue, a Lock is placed on it (memcache is used for locks)
- If the task has a result, it will be saved in memcache.
- Each queue has its own worker; it must be a JS class with a specific interface.
- At the moment only one such worker exists: import (more on it later)
- There is also WorkerPhp.js, which runs php-cli as a child process and talks to it in JSON
- The life of such a worker (process) doesn't end with a single task: it keeps taking them one after another until the manager notices it has "fattened up" noticeably and dismisses it
- In practice, no more than 4 PHP processes per queue run at the same time.
- Understands POSIX Signals
- In the case of correct completion (not kill -9), it carefully returns all running tasks from memory back to the queue
- Each manager opens a port with a REPL interface; you can log in and ask it how things are going. You can also change its configuration on the fly, without a restart.
By the way, any uncaught exception takes down only the one task flow, not the whole process, which is good.
And all of this is 500 lines of code (with comments).
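To make the list above more concrete, a queue manager configuration could look something like this. All names and numbers here are invented for the example:

```js
const queues = {
  'feed-pull': {
    maxWorkers: 4,            // upper bound, the actual count adapts to the load
    bufferSize: 50,           // how many task ids are prefetched from the queue
    workerIdleTimeout: 60,    // seconds before an idle worker is killed
    workerMaxLifetime: 600,   // php-cli should not live too long
    workerMaxMemory: 64 * 1024 * 1024,  // kill the worker above this
    taskTimeout: 30,          // stuck task: kill the worker, requeue the task
    maxRetries: 3,            // how many times one task may be requeued
    worker: 'WorkerPhp',      // runs php-cli as a child process
  },
  'import': {
    maxWorkers: 10,
    bufferSize: 10,
    taskTimeout: 300,
    maxRetries: 1,
    worker: 'WorkerImport',   // native JS worker class
  },
};
```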
Findings:
Evented I/O is how most server applications ought to work. Blocking should happen only where it is truly needed.
Proxying PHP through node showed good results and saved time.
A whole pile of work is handled by a single process (not counting php-cli). The JS workers inside it run asynchronously and very briskly.
9.2. Controller - Publish / Subscribe hub
It often happens that you need to run a batch of tasks (say, 100) in parallel, and asynchronously at that. But a queue is a black hole: throwing 100 tasks into it and then polling memcache for results every second is expensive.
You could also bypass the queue: open a socket straight to a manager, ask it to run the tasks and wait for the reply on the same connection. But that option doesn't work either, since there may be a dozen managers and we don't know which one to talk to... in short, it's just wrong.
So I wrote a controller (in node). There is exactly one for the whole system, and it is as simple as a stool (a rough sketch follows the list):
- All managers open a permanent connection with the controller.
- Whenever any task produces a result or fails, the manager reports it to the controller in detail.
- You can connect to it “from the other side” and subscribe to a specific task or list of tasks.
- As information on tasks is received, the controller notifies all subscribers.
- If the subscriber expects a lot of tasks, the controller notifies him as they arrive.
- There is a client for PHP (blocking)
- Garbage collection
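A stripped-down sketch of the controller idea: managers publish task results, other clients subscribe to task ids and get notified as results arrive. The wire format (one JSON object per line) and the port are my assumptions for the example:

```js
const net = require('net');

const subscribers = {};   // taskId -> [sockets waiting for that task]

const server = net.createServer((socket) => {
  socket.setEncoding('utf8');
  let buffer = '';
  socket.on('data', (chunk) => {
    buffer += chunk;
    let idx;
    while ((idx = buffer.indexOf('\n')) !== -1) {
      const msg = JSON.parse(buffer.slice(0, idx));
      buffer = buffer.slice(idx + 1);

      if (msg.subscribe) {
        // a PHP script (or anyone else) wants to know when these tasks finish
        msg.subscribe.forEach((taskId) => {
          (subscribers[taskId] = subscribers[taskId] || []).push(socket);
        });
      } else if (msg.result) {
        // a manager reports a finished (or failed) task
        (subscribers[msg.result.taskId] || []).forEach((sub) => {
          sub.write(JSON.stringify(msg.result) + '\n');
        });
        delete subscribers[msg.result.taskId];   // simple garbage collection
      }
    }
  });
  socket.on('close', () => {
    // drop the closed socket from all subscription lists
    Object.keys(subscribers).forEach((taskId) => {
      subscribers[taskId] = subscribers[taskId].filter((s) => s !== socket);
    });
  });
});
server.listen(8200);
```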
It is precisely in the "discovery" procedure described above that the PHP script calls the controller and waits up to 5 seconds for the task result, after which it returns the user to the interface.
Findings:
The Publish/Subscribe pattern is very effective in non-blocking environments.
A 100% result is not required. If 5 tasks out of 100 fail to complete for some reason, that is usually no tragedy and we just keep going.
9.3. Feed-updater (background update)
A node process, one for the whole system. It periodically queries the database for the list of feeds due for an update (based on the TTL data) and throws them into the queue.
9.4. Queues
To avoid race conditions, a unique md5 identifier is generated for each task. Only that identifier goes into the queue, while the task data itself is stored in memcached, because almost all tasks have a variable size and memcacheq doesn't handle that well - nor should it have to. When a manager takes a task, it puts a lock on it, which is also just a record in memcached. This prevents identical tasks from re-entering the queue while they are being executed.
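A sketch of that bookkeeping: the queue carries only an md5 task id, the payload and the lock live in memcached. Here `cache` and `queue` stand for whatever memcached and memcacheq clients are in use (get/set/add and push are assumed method names, not a real driver API):

```js
const crypto = require('crypto');

function taskId(type, payload) {
  // identical payloads produce identical ids, so duplicates collapse into one task
  return crypto.createHash('md5')
    .update(type + JSON.stringify(payload))
    .digest('hex');
}

function enqueue(queue, cache, type, payload, done) {
  const id = taskId(type, payload);
  cache.set('task:' + id, JSON.stringify({ type, payload }), 3600, () => {
    queue.push(id, done);                 // only the id goes into memcacheq
  });
}

function take(cache, id, done) {
  // memcached "add" stores a key only if it doesn't exist yet: a cheap lock
  cache.add('lock:' + id, '1', 60, (err) => {
    if (err) return done(new Error('task is already being processed'));
    cache.get('task:' + id, (err2, raw) => done(err2, raw && JSON.parse(raw)));
  });
}
```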
I plan to look at Redis as an alternative to all of this, because memcached here is really being used for something it wasn't meant for: if it goes down, the whole queue is lost.
I also split the queues into two groups, user and system, with the former taking priority.
In practice this simply meant adding a "feed-pull-sys" queue, which the background updater uses without getting in the way of user tasks.
Findings:
This implementation is still very raw.
The queue must be recoverable after any crash.
A more advanced locking scheme is needed - a mutex?
User and background processes must have different priorities.
10. Import / Export
Here is another interesting point I want to talk about. Every decent reader is expected to support import/export in OPML format. The catch is that a user can upload an OPML file with hundreds of feeds that aren't in our system yet, and then he would have to wait for all of them to load. And there may be a dozen such users at the same time.
Node to the rescue. A new worker called "import" appeared (up to 10 of them can run at a time). After the OPML file is uploaded and validated, PHP throws a task into the queue and returns the user to the interface, to a progress bar. Meanwhile "import" picks the task up, scatters smaller tasks into the feed-pull queue, and then waits for them to finish via the controller, updating a counter as it goes. The user watches the progress bar creep along; he can even leave the page, take a walk and come back. It's nice.
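A rough sketch of the first thing the "import" worker has to do: pull the feed URLs out of an uploaded OPML file so they can be fanned out as feed-pull tasks. Real code would use a proper XML parser; a regex and the function name here are just for illustration:

```js
function opmlFeedUrls(xml) {
  const outlines = xml.match(/<outline\b[^>]*xmlUrl\s*=\s*["'][^"']+["'][^>]*>/gi) || [];
  return outlines.map((tag) => tag.match(/xmlUrl\s*=\s*["']([^"']+)["']/i)[1]);
}

// Example:
// opmlFeedUrls('<opml><body><outline xmlUrl="http://example.com/rss"/></body></opml>')
// -> [ 'http://example.com/rss' ]
```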
III. Findings
- Don't reinvent the wheel. Virtually any task already has a ready-made solution that just needs a little adaptation.
- The simpler the product ends up for the user, the harder its implementation. The consumer, however, is unlikely to notice.
- Don't overestimate yourself. Building a grown-up product single-handedly "in a week" is impossible (though I don't like that word).
- Motivation, though, sometimes does the impossible.
- The product will never be perfect. A running application is always a trade-off between time and quality.
- If you work alone, you really miss having a team to brainstorm with. Use collective intelligence whenever you can.
- Context switching eats a lot of time. It is much more effective when one developer handles tasks of the same type.
- If you intend to build your own startup and run on ideas, forget about your personal life and Friday beer.
This is only the beginning, and there is a lot of planned work ahead, including things we can, without fear of the word, call innovations. So, to be continued...
Meanwhile, I want to announce a semi-closed launch next week.
My colleague will write about the project itself in a separate article in the next few days.
I would be grateful for any technical comments, advice and constructive criticism.