
2000 hours alone, or how an RSS reader was made / I am a robocop

I. Am. Robocop. Hello,



I'm going to share the technical side of how I built a new web RSS reader in 16 weeks, and how I almost lost my mind along the way.

Skipping the long backstory, let's say everything began in February of this year, when David ( dmiloshev , UI designer) and I decided to build a prototype of our creation together.

“Alone” because there were no scrums, no meetings, no “collective intelligence”: the entire technical side was done by me alone.



If I were asked to describe the whole article in one sentence, it would come out as:

NoSQL, MongoDB, node.js, my brain, evented I/O, queues, findings, git, nginx, memcached, Google Reader, Atom, TTL, PHP, ZF, jQuery, conclusions.



I. Technology



1. PHP / ZendFramework + something else

All I had from the very beginning was a small framework of my own that makes working with ZF a bit more convenient. It provides Dependency Injection, Table Data Gateway, Transfer Object and more convenient work with configs, and it comes wired up with Phing, with tasks already defined for almost every occasion. All in all, it is very pleasant to work with.


Architecturally, the PHP application consists of the following layers (a rough sketch of the idea follows the list):

  1. Routing / Controller / View: self-explanatory.
  2. Service: ACL, validation, caching and logging live here. You can safely bolt REST onto it and get an excellent API.
  3. Gateway: the body of the business logic. Each entity in the system has its own gateway, completely abstracted from the database.
  4. Mapper: this is where the actual work with the database happens.
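
To make the layering concrete, here is a tiny sketch of the idea. The real code is PHP/ZF; this is written in JavaScript only to match the node.js examples later, and every name in it is illustrative rather than taken from the project.

```javascript
// Illustrative layering only: the real implementation is PHP / Zend Framework.
// Each layer knows nothing about the layer above it.

class FeedMapper {                                   // Mapper: direct work with the database
  constructor(db) { this.db = db; }
  findByUrl(url) { return this.db.collection('feeds').findOne({ url: url }); }
}

class FeedGateway {                                  // Gateway: business logic, storage-agnostic
  constructor(mapper) { this.mapper = mapper; }
  async getByUrl(url) {
    const feed = await this.mapper.findByUrl(url);
    if (!feed) throw new Error('feed not found');
    return feed;
  }
}

class FeedService {                                  // Service: ACL, validation, caching, logging
  constructor(gateway, acl, cache) {
    this.gateway = gateway; this.acl = acl; this.cache = cache;
  }
  async getByUrl(user, url) {
    this.acl.assertCanRead(user, 'feeds');
    return this.cache.remember('feed:' + url, () => this.gateway.getByUrl(url));
  }
}
```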
There were a few more points I tried to keep in mind while designing.



2. nginx

No comment.



3. git

It once made it clear to me that I am not as smart as I thought I was.



4. Mongodb

We had previously used it in production on another project, but very cautiously, so we never got to test it in full. Lately the fashion for NoSQL, sharding, map-reduce and no-SPOF has grown especially strong, and I decided it was time to grow out of my short pants. At the very least, it broke up the general routine nicely and gave me a good shake.

The documentation is very thorough, so I managed to grasp the full depth of MongoDB within the first two weeks. After years of working with relational databases I had to turn my brain inside out a little, but I could solve all the non-trivial problems on my own, without resorting to questions on forums.

I am still a little afraid of running it in production. To stay on top of possible problems, I regularly read the groups and study the issues other people run into.

At the moment it is configured as master-master, which is not fully supported, but in our case it should work as intended. In the future we will shard it, and that will definitely be easier than with MySQL.



5. Memcached

There is not much to say: simple as a door. Except that in the future I want to try it over UDP... just for fun.



6. Memcacheq

There are plenty of alternatives to it these days, but I can say it proved itself very well in production on the previous project.

And it is nice that it does not need a special driver: it speaks the memcached protocol (which came in handy in the next section).



7. node.js

This is probably the most interesting thing that happened to me in these four months. Server-side evented I/O is very exciting. I immediately wanted to rewrite all the PHP into Ruby. But those are just dreams.

The thing is, I discovered it quite recently and quite by accident. After that, a lot of things fell into place, both in the system itself and in my head. I had to rewrite quite a bit, but the result warms my soul, and I hope it will please future users as well.

I smoked that page right down to the filter; at the moment I use: mongoose, kiwi, step, memcache, streamlogger, hashlib, consolelog, eyes, daemon.

Of my own libraries I wrote jsonsocket, whose name, I think, speaks for itself. I have not gotten around to putting it on GitHub yet, and I now dream of turning it into bsonsocket. Naturally, I also had to write things for working with the queues, plus a layer for talking to the Gateway layer in PHP (more on that later).
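
Since jsonsocket itself is not on GitHub, here is only a hypothetical sketch of the idea its name implies: newline-delimited JSON over a plain TCP socket. None of this is the actual library code.

```javascript
// Hypothetical sketch of the jsonsocket idea: newline-delimited JSON over TCP.
const net = require('net');

function createJsonServer(port, onMessage) {
  return net.createServer((socket) => {
    let buffer = '';
    socket.on('data', (chunk) => {
      buffer += chunk.toString('utf8');
      let i;
      while ((i = buffer.indexOf('\n')) !== -1) {   // one JSON document per line
        const line = buffer.slice(0, i);
        buffer = buffer.slice(i + 1);
        if (line.trim()) onMessage(JSON.parse(line), socket);
      }
    });
  }).listen(port);
}

function sendJson(socket, obj) {
  socket.write(JSON.stringify(obj) + '\n');         // frame messages by newline
}

// Usage: echo every message back to the sender
createJsonServer(7070, (msg, socket) => sendJson(socket, { echo: msg }));
```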

I also hooked up prowl ; once an hour the background now pushes a message to my phone with random quotes from bash.org (along with small stats such as memory usage, etc.).

Many libraries (modules) are very raw, so sometimes I had to edit someone else's code by hand right in place (there was no time to make proper patches). And the good gentlemen of node.js do not worry much about backward compatibility, so you can often run into libraries that simply do not work.



8. jQuery

For me, this is almost a synonym for client-side javascript.

Used plugins: blockUI, validate, form, tooltip, hotkeys, easing, scrollTo, text-overflow and a couple of smaller ones.



II. Development



I will not delve into the specifics of the service itself; technically, it is almost a Google Reader (GR).

While David was pushing gray boxes around in Photoshop and thinking through the business logic, I started with basic data modeling and then moved straight on to the feed download system.



1. Feed Pull



It would seem that everything is simple here: pull the URL, fetch the XML, parse it, write it to the database. But there are nuances.
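
In its naive form the loop looks roughly like the sketch below. It is a JavaScript illustration (the real code is PHP), and fetchUrl() and parseFeed() are assumed helpers, not functions from the project; the findings that follow explain why reality is uglier.

```javascript
// The naive feed pull, ignoring all the nuances.
async function pullFeed(db, feed) {
  const xml = await fetchUrl(feed.url);            // network errors, redirects, timeouts...
  const parsed = parseFeed(xml);                   // broken XML, weird encodings, missing dates...
  for (const entry of parsed.entries) {
    // validate as strictly as user input: external sources cannot be trusted
    if (!entry.guid || (!entry.title && !entry.content)) continue;
    await db.collection('entries').updateOne(
      { feedId: feed._id, guid: entry.guid },      // de-duplicate by guid
      { $setOnInsert: Object.assign({ feedId: feed._id, fetchedAt: new Date() }, entry) },
      { upsert: true }
    );
  }
}
```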

Findings:

External sources vary wildly, and many simply spit on the standards. They cannot be trusted: content validation has to be as strict as when dealing with user input.

The perfect code did not survive; it became overgrown with conditions and exceptions.

Each point took a lot of time, more than it might seem at first glance.



2. Update



Now I wanted all existing streams to be updated automatically, preferably taking into account the TTL (update rate) of each individual stream. I also wanted to spread that TTL across the time of day. I did not rely on the TTL declared in the feed itself, because according to my research it either does not exist at all or does not match reality. Either way, it is not enough.



I started thinking about my own system for determining how often to update the streams, and here is what came out of it. Say the system updates some stream every 2 minutes around lunchtime; to earn that, the stream has to keep receiving updates regularly during that period for about 10 days, and up to that point the system will smoothly adapt, updating it more and more often. At night, for example, the same stream might be updated only once an hour.
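
A rough sketch of how such an adaptive, time-of-day-aware TTL could be computed from the timestamps of previously fetched entries. This is my illustration of the idea, not the actual implementation; the constants and names are assumptions.

```javascript
// Adaptive TTL: look at how often new entries actually appeared in each hour
// of the day over the last N days and derive a polling interval for that hour.
const MIN_INTERVAL = 2 * 60;          // seconds: never poll more often than every 2 minutes
const MAX_INTERVAL = 60 * 60;         // seconds: and never less often than once an hour

function ttlForHour(entryTimestamps, hour, days) {
  days = days || 10;
  const horizon = Date.now() - days * 24 * 3600 * 1000;
  const hits = entryTimestamps.filter(function (t) {
    return t >= horizon && new Date(t).getHours() === hour;
  }).length;
  if (hits === 0) return MAX_INTERVAL;              // a quiet hour: once per hour is enough
  const interval = (days * 3600) / hits;            // average gap between new entries in that hour
  return Math.min(MAX_INTERVAL, Math.max(MIN_INTERVAL, Math.round(interval)));
}

// Example: a feed that posts every 2 minutes around lunchtime for 10 days
// ends up with ttlForHour(...) close to 120 seconds for that hour.
```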



The update procedure itself is essentially the Feed Pull described earlier; pulling a feed and updating it are one and the same thing.

And here we smoothly arrive at the point where I wanted to push all of this onto queues. But I will tell you about how they are organized a little later.



By the way, there are plans to hook up PubSub, and also to launch our own hub.



3. Discovery



Clearly, the skill set of a convenient RSS reader should include finding RSS/Atom feeds on any HTML page. When a user simply types in a site address (for example, www.pravda.ru ), the system should go there and look for the feeds it can offer for subscription.



But this procedure is complicated by the fact that it cannot be done directly within the user's request, since that is not the web server's job at all: it has to be done asynchronously. On the user's request we first check whether such a stream already exists in the database, then we look in the discovery cache (which lives for 2 hours), and if we found nothing, we put the task into a queue and wait for a maximum of 5 seconds (I'll explain exactly how we wait later). If the task has not finished in that time, we end the script, returning JSON along the lines of {wait: true}. The client side then repeats the same request after a short timeout. As soon as the task completes in the background, its result appears in the discovery cache.
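
The client side of this is plain polling. A sketch in jQuery; the endpoint URL and any response fields besides {wait: true} are assumptions, not the real API.

```javascript
// Poll the discovery endpoint until the background task is done.
function discoverFeeds(siteUrl, onDone) {
  $.getJSON('/discovery', { url: siteUrl }, function (res) {
    if (res.wait) {
      // the task is still running in the background: ask again after a short timeout
      setTimeout(function () { discoverFeeds(siteUrl, onDone); }, 2000);
    } else {
      onDone(res.feeds || []);
    }
  });
}

// Usage
discoverFeeds('http://www.pravda.ru', function (feeds) {
  console.log('feeds found:', feeds);
});
```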



There are several nuances connected with this procedure. I consider this part rather crude, because we are trying to build a service for people who should not have to know what RSS or Atom is. For example, if I, as a perfectly ordinary user, suddenly want to subscribe to my favorite and one-of-a-kind vkontakte.ru, I will get exactly nothing. At a minimum, we want to eventually implement something like GR's generated feeds. Beyond that, a convenient, human search.

By the way, it often turns out that there is no alternate link on the given page, but there is one somewhere on other pages of the same site. There was a thought of writing a background crawler that would quietly look for RSS/Atom feeds on the sites users enter most often.



Findings:

When dealing with external sources of every kind, it feels like digging through a giant trash can in search of a document that was thrown away by accident.

This part needs specific improvements. From a usability point of view, a simple search for alternates on the given page is not enough; we need something more universal.



4. Interface



The next thing I really wanted was an interface where I could subscribe to a stream, add it as a bookmark to the left-hand column, click it and read its contents.

I will not go into the details of the interface implementation; I will just say that I did all the layout and UI myself. That was very inefficient and distracted me from other tasks, but jQuery saved time.

In total I spent two weeks on the reading view and the general interface (not counting the rather stressful improvements and reworkings later on). After that we had a rather nice toy glowing on our monitors, pleasing both eye and soul.



Folders

Of course, we are minimalist guys, but I cannot imagine working with a reader without folders. And, excuse me, gentlemen, their usability in Google Reader leaves much to be desired. We tried to make them as accessible and simple as possible.

But I never thought they could be such a problem technically. The interface is one thing, but on the server side I had to strain quite a bit to make them work properly; see the next section.



Where possible, I tried to use CSS sprites.

All JS and CSS are combined into a single file, minified and gzip-compressed. An average page (with all static assets) weighs 300 KB; with a warm cache, 100 KB.

And for IE6 we have a special page.



Findings:

The interface itself looks very light, but I would not say the same about its implementation.

In the end, with everything compressed and Firebug turned off, it works briskly.

In total I count 28 screens at the moment, and a million use cases.



5. Read / unread entries



It turned out to be a rather non-trivial task for a system that could potentially hold a gazillion streams and even more subscribers. Most importantly, it has to scale horizontally.

In each Entry entity I keep a list of the users who have read it. This list could potentially hold a million identifiers and it still would not hurt performance, thanks to MongoDB's architecture. Additional information (reading time and so on) is stored in a separate collection; it is not indexed and is needed purely for statistics, so everything works quite fast.



For each user we store the date his counters were last updated, across all the streams he is subscribed to.

When the user refreshes the page, the system finds, for each stream, the number of new entries that appeared after that date and adds it to the unread count (a simple increment). When the user reads an entry, a simple decrement happens.
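
In mongo-shell terms the bookkeeping looks roughly like this; the collection and field names are my guesses, not the real schema.

```javascript
// A user reads an entry: remember him in the entry's readers list
// and decrement his unread counter for that feed.
db.entries.update({ _id: entryId }, { $addToSet: { readers: userId } });
db.subscriptions.update({ userId: userId, feedId: feedId }, { $inc: { unread: -1 } });

// On page refresh: count the entries that appeared after the last counter update,
// add them to the unread counter, then move the date forward.
var fresh = db.entries.count({ feedId: feedId, published: { $gt: lastCounterUpdate } });
db.subscriptions.update({ userId: userId, feedId: feedId }, { $inc: { unread: fresh } });
db.users.update({ _id: userId }, { $set: { countersUpdatedAt: new Date() } });
```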



Selecting the unread entries of a single stream is also very simple.



But selecting only the unread entries within a folder is already a problem. I will not go into the nuances, but it boils down to the fact that there are simply no joins in MongoDB. It cannot be solved with one query or even several, only through CodeWScope, which cannot be indexed; to scale it you need map/reduce. This is currently a potential bottleneck.



5.1. Unread on top



If any of you have used Google Reader, you probably know the “show only unread” option. With it, if a stream has no entries you have not read yet, you are looking at a blank page. At first we did the same, but testing showed that users do not even realize they have this option turned on. They do not understand why the stream is empty, why there are no entries in it and where they went.

David proposed a very interesting solution: unread entries simply float to the top, and read ones sink down. It cost me a few days of brain-breaking to work out how best to implement this, especially for folders.



Findings:

NoSQL is good in terms of speed and scalability, but some seemingly trivial things turned out to be quite hard to do with it.

Denormalization is good. The fact that some counter will eventually drift should not be treated as a problem, but for every piece of denormalized data you need a routine that fully recalculates it (in the background, of course).

Map/reduce in MongoDB is still too raw for production. A bit of testing showed that it locks everything to hell while it runs. The developers promise to improve it in version 1.6; for now we get by without it.

Schema-less rules.



8. Sharing



This is a feature that lets you pin any entry from a stream you read onto your own page. In short, it means that some authoritative guy A, while reading various feeds, can instantly save particular entries (the most interesting and useful ones, of course) to his own stream(s), similar to Shared Items in GR. Other users can then subscribe directly to his “Shared Items” stream, just like to any other feed.



One of the core concepts of our service is convenient distribution of information. A technically interesting task for me was implementing the construction of sharing chains. MongoDB, with its schema-less nature, helped a lot here.



An interesting point:

Google recently announced the new ReShare feature in Buzz. In the linked article, under “A little more background”, I came across the very points David and I had discussed at length 4 months ago; I had arrived at the same conclusions. Our implementation of sharing is very similar.



9. Node.js, background, queues



Initially the daemons were written in PHP, and rather crookedly at that. Apart from MongoDB, this was the shakiest part of the app for me, since PHP is simply not meant for this kind of thing.

But when I stumbled upon node.js (that was just two weeks ago), my soul started singing, and I could “sleep” peacefully again. The problem was that there was no time at all to rewrite into it all the background code already implemented in PHP (feed-pull, discovery, feed-info).

A very quick dig through node's capabilities led me to a compromise solution: child processes.



9.1. Queue manager



This is the first node daemon. Its job is to read the queues, hand tasks out to the workers and monitor their progress. By the way, any uncaught exception takes down only a single worker, not the whole process, which is good.

And all of that is 500 lines of code (comments included).
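
A heavily trimmed sketch of what such a manager boils down to, assuming a memcached/memcacheq client with a get/set/delete API and php-cli workers started via child_process. The client API, ports and key names are assumptions; the real daemon is, as said, about 500 lines.

```javascript
// Minimal queue-manager loop: pop a task id from memcacheq, fetch its payload
// from memcached, hand it to a php-cli worker via child_process, repeat.
const { execFile } = require('child_process');
const memcache = require('memcache');                  // assumed client module

const cache = new memcache.Client(11211, 'localhost'); // payloads and locks
const queue = new memcache.Client(22201, 'localhost'); // memcacheq speaks the memcached protocol
cache.connect();
queue.connect();

function poll(queueName) {
  queue.get(queueName, (err, taskId) => {              // "get <queue>" pops one item
    if (err || !taskId) return setTimeout(() => poll(queueName), 500);
    cache.get('task:' + taskId, (err2, payload) => {
      if (err2 || !payload) return poll(queueName);
      cache.set('lock:' + taskId, '1', () => {}, 300); // lock while the task runs
      execFile('php', ['worker.php', queueName, payload], (err3) => {
        if (err3) console.error('worker failed:', err3); // one failed task, not the whole daemon
        cache.delete('lock:' + taskId, () => {});
        poll(queueName);                               // take the next task
      });
    });
  });
}

poll('feed-pull');
poll('discovery');
```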



Findings:

Evented I/O is how most server applications ought to work. Blocking should exist only where it is truly needed.

Proxying PHP through node showed good results and saved time.

A whole heap of work is handled by just one process (not counting php-cli). The JS workers inside it run asynchronously and very briskly.



9.2. Controller - Publish / Subscribe hub



It often happens that you need to run a batch of tasks (say, 100) in parallel, and asynchronously at that. But a queue is a black hole. Throwing 100 tasks into it and then polling memcache for the results every second is expensive.

You could bypass the queue and use a socket to contact a manager directly, asking it to run the tasks and waiting for the answer on the same connection. But that option does not work either: there may be a dozen managers, and we do not know which of them to talk to... in short, it is just wrong.



So I created a controller (in node). There is just one for the whole system, and it is as simple as a stool. It is this controller that the PHP script calls in the “discovery” procedure described above, waiting up to 5 seconds for the task's result before returning the user to the interface.
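
Roughly, such a stool-simple hub can be sketched like this: one net server where workers publish results by task id and waiting clients subscribe to those ids, with a timeout after which the client is told to come back later. The wire protocol and names here are my assumptions, not the actual code.

```javascript
// Publish/Subscribe hub over newline-delimited JSON.
// Clients send {subscribe: taskId, timeout: ms}; workers send {publish: taskId, result: ...}.
const net = require('net');

const waiting = {};                                   // taskId -> sockets waiting for it

net.createServer((socket) => {
  let buf = '';
  socket.on('data', (chunk) => {
    buf += chunk.toString('utf8');
    let i;
    while ((i = buf.indexOf('\n')) !== -1) {
      const msg = JSON.parse(buf.slice(0, i));
      buf = buf.slice(i + 1);

      if (msg.subscribe) {
        const id = msg.subscribe;
        (waiting[id] = waiting[id] || []).push(socket);
        setTimeout(() => {
          const subs = waiting[id] || [];
          const k = subs.indexOf(socket);
          if (k !== -1) {                             // no result yet: tell the client to retry later
            subs.splice(k, 1);
            socket.write(JSON.stringify({ id: id, wait: true }) + '\n');
          }
        }, msg.timeout || 5000);
      } else if (msg.publish) {
        (waiting[msg.publish] || []).forEach((s) => {
          s.write(JSON.stringify({ id: msg.publish, result: msg.result }) + '\n');
        });
        delete waiting[msg.publish];
      }
    }
  });
  socket.on('error', () => {});                       // a dead client should not kill the hub
}).listen(7071);
```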



Findings:

The Publish/Subscribe pattern is very effective in non-blocking environments.

A 100% result is not required. If 5 tasks out of 100 fail to complete for some reason, that is usually not a disaster and we simply keep going.



9.3. Feed-updater (background update)



A node process, one for the whole system. It periodically queries the database for the list of feeds that are due for an update (based on the TTL data) and throws them into the queue.
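
A sketch of that loop, assuming each feed document carries a nextUpdateAt computed from its TTL. enqueueTask() stands for the queue code described in the next section, and all names and handles are illustrative.

```javascript
// Background updater, one per system: every minute pick the feeds that are due
// and push them into the low-priority system queue.
function tick(db, cache, queue) {
  db.collection('feeds')
    .find({ nextUpdateAt: { $lte: new Date() } })
    .toArray((err, feeds) => {
      if (err) return console.error(err);
      feeds.forEach((feed) => {
        enqueueTask(cache, queue, 'feed-pull-sys', { feedId: feed._id }, () => {});
        db.collection('feeds').update(               // schedule the next run by the feed's TTL
          { _id: feed._id },
          { $set: { nextUpdateAt: new Date(Date.now() + feed.ttl * 1000) } },
          () => {}
        );
      });
    });
}

setInterval(() => tick(db, cache, queue), 60 * 1000); // db, cache, queue come from the surrounding app
```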



9.4. Queues



To avoid race conditions, a unique md5 identifier is generated for each task. It is this identifier that goes into the queue, while the task's data is stored in memcached, because almost all tasks have a non-fixed size and memcacheq does not handle that well, nor should it. When a manager picks up a task, it sets a lock on it, which is also just a record in memcached. This keeps identical tasks from re-entering the queue while they are being executed.
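
The enqueue side, sketched with the same assumed memcache client as in the manager sketch above: the md5 of the payload serves both as the queue item and as the de-duplication key.

```javascript
// Enqueue a task: only its md5 id goes into memcacheq; the payload and the lock
// live in memcached. Key names and the client API are assumptions.
const crypto = require('crypto');

function enqueueTask(cache, queue, queueName, payload, done) {
  const body = JSON.stringify(payload);
  const id = crypto.createHash('md5').update(queueName + body).digest('hex');

  cache.get('lock:' + id, (err, lock) => {
    if (lock) return done(null, { id: id, skipped: true });     // identical task already running
    cache.set('task:' + id, body, (e1) => {                     // payload, addressable by id
      if (e1) return done(e1);
      queue.set(queueName, id, (e2) => done(e2, { id: id }), 0); // push the id into memcacheq
    }, 3600);
  });
}
```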

I plan to look at Redis as an alternative to all of this, because memcached is being used here for something it was not meant for: if it goes down, the whole queue is lost.



I also split the queues into two groups, user and system, with the former taking priority.

In practice this simply meant adding a “feed-pull-sys” queue, which the background updater uses without getting in the way of user tasks.



Findings:

This implementation is still very raw.

The queue must be recoverable after any crash.

A more advanced locking scheme is needed; a mutex, perhaps?

User and background processes must have different priorities.



10. Import / Export



Here is another interesting point worth mentioning. Every decent reader is expected to support import/export in OPML format. The catch is that a user can upload an OPML with hundreds of feeds that are not in our system yet, and then he would have to wait until they all load. And there may be a dozen such users at once.



Node to the rescue. A new worker called "import" appeared (at the moment up to 10 of them can run at once). After the OPML file is uploaded and validated, PHP throws a task into the queue and returns the user to the interface, to a progress bar. Meanwhile, "import" picks the task up, scatters smaller tasks into the feed-pull queue and then waits for them to finish via the controller, updating a counter as it goes. The user watches the progress bar creep forward. He can even leave the page, take a walk and come back. It's nice.
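
The core of such an "import" worker, sketched on top of the hypothetical enqueueTask() from the queue section and a waitForTask() subscription to the controller; parseOpml() and the progress keys are assumptions as well.

```javascript
// Split the OPML into individual feed-pull tasks, wait for each of them through
// the controller, and bump a progress counter that the interface polls.
function runImport(cache, queue, opmlXml, importId, done) {
  const urls = parseOpml(opmlXml);                   // feed URLs listed in the OPML
  let finished = 0;

  cache.set('import:' + importId + ':total', String(urls.length), () => {}, 3600);

  urls.forEach((url) => {
    enqueueTask(cache, queue, 'feed-pull', { url: url }, (err, task) => {
      if (err) return done(err);
      waitForTask(task.id, 30000, () => {            // subscribe on the controller, with a timeout
        finished += 1;
        cache.set('import:' + importId + ':done', String(finished), () => {}, 3600);
        if (finished === urls.length) done(null);    // the progress bar has reached 100%
      });
    });
  });
}
```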



III. Findings

This is only the beginning, and a lot of planned work is still ahead, including things I am not afraid to call innovations. So, to be continued...



In the meantime, I want to announce a semi-closed launch next week.

My colleague will write about the project itself in a separate article in the next few days.



I would be grateful for any technical comments, advice and constructive criticism.

Source: https://habr.com/ru/post/95526/


