I offer my translation of Dan Pupius's article about the architecture of the Medium service and the technologies behind it. I want to emphasize that this is a translation, so the pronoun "I" in the text below refers to the author of the original article, not to the translator.
Medium is a network: a place where people exchange stories and ideas that matter, a place where you grow, and where readers have spent 1.4 billion minutes, which comes to 2.6 millennia.
We have over 25 million unique readers per month, and tens of thousands of posts are published every week. But we want the measure of success on Medium to be not views but viewpoints; we want value to come from the quality of an idea, not the credentials of its author. We want Medium to be a place where ideas are advanced by discussion and where words still matter.
I lead the engineering team at Medium. I used to be an engineer at Google, where I worked on Google+ and Gmail and was one of the co-founders of the Closure project. In a past life I raced snowboards, jumped out of airplanes, and lived in the jungle.
I could not be more proud of the team. It is an amazing group of talented, curious, thoughtful people who have come together to do great work.
We work in cross-functional, goal-driven teams, so while some people specialize, anyone can work on any part of the stack. We believe that exposure to different disciplines makes an engineer better. I have written about our other values before.
Teams have a lot of freedom in how they organize their work, but as a company we set quarterly goals and encourage an iterative approach. We use GitHub for code review and bug tracking, and Google Apps for email, documents, and spreadsheets. We use Slack heavily, along with its bots, and many teams use Trello.
From the very beginning we ran on EC2. The main application was written in Node.js, and we migrated to DynamoDB before the public launch.
We had a Node server for image processing, which delegated the actual work to GraphicsMagick. Another server acted as an SQS queue worker and performed background tasks. We used SES for email, S3 for static assets, CloudFront as a CDN, and nginx as a reverse proxy. For monitoring we used DataDog, and PagerDuty for alerts.
The editor on the site was based on TinyMCE. By launch we were already using the Closure Compiler and some components of the Closure Library, but we used Handlebars for templating.
For a site that looks as simple as Medium, it may be surprising how much complexity hides behind the scenes. It's just a blog, right? You could probably throw something together on Rails in a couple of days. :)
Anyway, enough philosophizing. Let's start from the bottom.
We use Amazon Virtual Private Cloud. For configuration management we use Ansible, which lets us keep configuration in version control and roll out changes in an easy, controlled way.
We follow a service-oriented architecture and run about a dozen services (depending on how you count; several of them are much smaller than the others). The main questions when deploying a new service are how specific the work it performs is, how likely a change is to cut across several other services, and what its resource-consumption profile looks like.
Our main application is still written in Node, which lets us share code between the server and the client, something we make heavy use of in the editor and in post transformations. Node works well for us, but performance problems appear whenever we block the event loop. To mitigate this, we run several instances per machine and route the "expensive" requests to dedicated instances, isolating them. We also hook into the V8 runtime to see which operations take a long time; mostly the delays come from object materialization during JSON deserialization.
We have several support services written in Go. We found that Go applications are very easy to build and deploy. We like its type safety without the verbosity of the JVM and the fine-tuning that Java requires. Personally, I am a fan of using opinionated languages on a team: it increases consistency, reduces ambiguity, and definitely reduces the chance of shooting yourself in the foot.
Right now we serve static assets through CloudFlare, although we send 5% of traffic through Fastly and 5% through CloudFront to keep their caches warm in case we need to switch in an emergency. We recently turned on CloudFlare for application traffic as well, primarily for DDoS protection, but we were also pleased with the performance improvement.
We use nginx and HAProxy together as reverse proxies and load balancers to cover the entire Venn diagram of our needs.
We still use DataDog for monitoring and PagerDuty for alerts, and now we make extensive use of the ELK stack (Elasticsearch, Logstash, Kibana) to debug problems in production.
DynamoDB is still our main database, but it has not been entirely smooth sailing. One of the perennial problems we have run into is [hot keys](https://medium.com/medium-eng/how-medium-detects-hotspots-in-dynamodb-using-elasticsearch-logstash-and-kibana-aaa3d6632cfd) (translator's note: keys that receive a large number of requests in a short period of time), during viral events or when fanning posts out to millions of followers. We have a Redis cluster that we use as a cache in front of Dynamo, which solves the problem for read operations. Optimizations for developer convenience and optimizations for stability often pull in different directions, but we are working on it.
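The article does not show the cache code; as a minimal read-through sketch of the idea (using the go-redis v8 client, with fetchPostFromDynamo standing in as a hypothetical loader for the real DynamoDB call), it looks roughly like this:

```go
package main

import (
	"context"
	"encoding/json"
	"time"

	"github.com/go-redis/redis/v8"
)

// fetchPostFromDynamo is a hypothetical stand-in for the real DynamoDB read.
func fetchPostFromDynamo(ctx context.Context, postID string) (map[string]string, error) {
	return map[string]string{"id": postID, "title": "Hello"}, nil
}

// GetPost is a read-through cache: hot keys are served from Redis, and only
// misses fall through to Dynamo, which shields hot partitions from read storms.
func GetPost(ctx context.Context, rdb *redis.Client, postID string) (map[string]string, error) {
	key := "post:" + postID

	if cached, err := rdb.Get(ctx, key).Result(); err == nil {
		var post map[string]string
		if json.Unmarshal([]byte(cached), &post) == nil {
			return post, nil
		}
	}

	post, err := fetchPostFromDynamo(ctx, postID)
	if err != nil {
		return nil, err
	}
	if raw, err := json.Marshal(post); err == nil {
		// A short TTL keeps the cache from serving stale posts for too long.
		rdb.Set(ctx, key, raw, 5*time.Minute)
	}
	return post, nil
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	_, _ = GetPost(context.Background(), rdb, "42")
}
```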
We started using Amazon Aurora for some newer data; it allows more flexible querying and filtering than Dynamo.
To store the relationships between the entities that make up the Medium network, we use Neo4j, running one master and two replicas. People, posts, tags, and collections are nodes in the graph. Edges are created when entities are created and when people take actions such as following, recommending, or highlighting. We walk the graph to filter and recommend posts.
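To give a feel for what such a traversal does, here is a conceptual sketch in Go (not Medium's actual Neo4j code; the node and edge names are made up): recommend the posts that the people you follow have recommended, skipping what you have already seen.

```go
package main

import "fmt"

// A conceptual in-memory stand-in for the kind of graph stored in Neo4j:
// nodes are users and posts, edges are "follows" and "recommends".
type Graph struct {
	follows    map[string][]string // user -> users they follow
	recommends map[string][]string // user -> posts they recommended
}

// RecommendPosts walks two hops: the users someone follows, then the posts
// those users recommended, skipping posts the reader has already seen.
func (g *Graph) RecommendPosts(userID string, seen map[string]bool) []string {
	var out []string
	picked := map[string]bool{}
	for _, followee := range g.follows[userID] {
		for _, post := range g.recommends[followee] {
			if seen[post] || picked[post] {
				continue
			}
			picked[post] = true
			out = append(out, post)
		}
	}
	return out
}

func main() {
	g := &Graph{
		follows:    map[string][]string{"alice": {"bob", "carol"}},
		recommends: map[string][]string{"bob": {"post-1", "post-2"}, "carol": {"post-2", "post-3"}},
	}
	fmt.Println(g.RecommendPosts("alice", map[string]bool{"post-1": true}))
	// Output: [post-2 post-3]
}
```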
From the very beginning we have been data-hungry and have invested in our analytics infrastructure to help us make business and product decisions. More recently we have also been able to feed the same data pipeline back into production systems, powering features such as Explore.
We use Amazon Redshift as our data warehouse; it provides the scalable storage and processing that our other systems are built on. We continuously import our core data (users, posts) from Dynamo into Redshift, and event logs (post viewed, post scrolled) from S3 into Redshift.
Job execution is scheduled by Conduit, our internal tool that manages scheduling, data dependencies, and monitoring. We use assertion-based scheduling: jobs run only once their dependencies are satisfied (for example, a daily job that depends on an entire day of logs). The system has proven indispensable: data producers are decoupled from consumers, which simplifies configuration and keeps the pipeline predictable and debuggable.
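Conduit is internal and not described in detail, but the shape of assertion-based scheduling can be sketched roughly like this (all types and names below are hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

// A hypothetical sketch of assertion-based scheduling: a job declares the data
// partitions it depends on, and the scheduler runs it only once every
// dependency is asserted to exist (e.g. a full day of logs has landed in S3).
type Job struct {
	Name      string
	DependsOn []string // partition keys such as "logs/2015-11-08"
	Run       func() error
}

type Scheduler struct {
	available map[string]bool // partitions known to exist
}

func (s *Scheduler) Assert(partition string) { s.available[partition] = true }

func (s *Scheduler) ready(j Job) bool {
	for _, dep := range j.DependsOn {
		if !s.available[dep] {
			return false
		}
	}
	return true
}

// TryRun runs the job if and only if all of its dependencies are satisfied.
func (s *Scheduler) TryRun(j Job) bool {
	if !s.ready(j) {
		return false
	}
	if err := j.Run(); err == nil {
		// The job's output becomes a partition that downstream jobs can depend on.
		s.Assert("out/" + j.Name)
	}
	return true
}

func main() {
	s := &Scheduler{available: map[string]bool{}}
	daily := Job{
		Name:      "daily-rollup",
		DependsOn: []string{"logs/" + time.Now().Format("2006-01-02")},
		Run:       func() error { fmt.Println("rolling up"); return nil },
	}
	fmt.Println(s.TryRun(daily)) // false: the day's logs are not there yet
	s.Assert("logs/" + time.Now().Format("2006-01-02"))
	fmt.Println(s.TryRun(daily)) // true: dependency satisfied, job runs
}
```

The payoff described in the article follows from this structure: producers only assert that data exists, consumers only declare what they need, and the scheduler ties them together.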
Although SQL queries against Redshift work well for us, we still need to get data into and out of Redshift. For ETL we are increasingly turning to Apache Spark because of its flexibility and its ability to scale with our needs. Over time Spark will probably become the backbone of our data pipeline.
We use Protocol Buffers for our data schemas (and schema-evolution rules) to keep all layers of the distributed system in sync, including mobile apps, web services, and data warehousing. We annotate the schemas with configuration details such as table and index names, and with validation rules such as maximum string lengths or permitted numeric ranges.
People need to stay on the same page too: mobile and web developers can wire things up the same way, and data scientists can interpret fields the same way. We help people work with the data by treating the schemas as the spec, documenting messages and fields carefully, and publishing documentation generated from the .proto files.
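The article does not show the annotation mechanism itself (at Medium it lives in .proto options); as a hypothetical illustration of the idea, here is the same kind of information, storage details plus validation rules travelling with the schema, expressed as plain Go data:

```go
package main

import "fmt"

// A hypothetical illustration: validation and storage details travel with the
// schema. Medium attaches them as options in .proto files; here the same
// information is shown as plain Go data so the shape is clear.
type FieldRule struct {
	Name      string
	MaxLength int   // for strings; 0 means unlimited
	Min, Max  int64 // for numbers; ignored when both are zero
}

type MessageSchema struct {
	Table  string // backing table name, e.g. in DynamoDB
	Index  string // secondary index name
	Fields map[string]FieldRule
}

var postSchema = MessageSchema{
	Table: "posts",
	Index: "posts-by-author",
	Fields: map[string]FieldRule{
		"title":       {Name: "title", MaxLength: 100},
		"readingTime": {Name: "readingTime", Min: 0, Max: 24 * 60},
	},
}

// ValidateString checks a string field against its declared maximum length.
func (m MessageSchema) ValidateString(field, value string) error {
	if r, ok := m.Fields[field]; ok && r.MaxLength > 0 && len(value) > r.MaxLength {
		return fmt.Errorf("%s exceeds max length %d", field, r.MaxLength)
	}
	return nil
}

func main() {
	fmt.Println(postSchema.ValidateString("title", "Hello, Medium")) // <nil>
}
```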
Our image processing server is written in Go and uses a waterfall strategy to serve processed images. The servers use groupcache, which provides an alternative to memcache and helps avoid doing the same work twice. The in-memory cache is backed by a persistent cache in S3; beyond that, images are processed on demand. This gives our designers the freedom to change how images are presented and optimized for different platforms without having to run big batch jobs to regenerate resized images.
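A rough sketch of that waterfall using the groupcache library (assuming a recent version whose getter takes a context.Context; fetchFromS3 and renderOriginal are hypothetical helpers, not Medium's code):

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"github.com/golang/groupcache"
)

// Hypothetical helpers standing in for the real S3 lookup and on-demand
// image processing; only the waterfall structure is the point here.
func fetchFromS3(key string) ([]byte, error)    { return nil, errors.New("not cached in S3") }
func renderOriginal(key string) ([]byte, error) { return []byte("rendered:" + key), nil }

// The waterfall: in-memory/peer cache (groupcache) -> persistent S3 cache ->
// process the image on demand. groupcache also deduplicates concurrent
// requests for the same key, so the work is not done twice.
var thumbnails = groupcache.NewGroup("thumbnails", 64<<20, groupcache.GetterFunc(
	func(ctx context.Context, key string, dest groupcache.Sink) error {
		if data, err := fetchFromS3(key); err == nil {
			return dest.SetBytes(data)
		}
		data, err := renderOriginal(key) // cache miss everywhere: do the work
		if err != nil {
			return err
		}
		// A real server would also write the result back to S3 here.
		return dest.SetBytes(data)
	}))

func main() {
	var img []byte
	if err := thumbnails.Get(context.Background(), "post-42/cover_800x600.jpg",
		groupcache.AllocatingByteSliceSink(&img)); err != nil {
		panic(err)
	}
	fmt.Println(string(img))
}
```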
Although the image processing server is currently used mainly for scaling and cropping, early versions of the site supported color filters, blurring, and other effects. Processing animated GIFs was a huge pain, but that is a topic for a separate post.
The neat Text Shots feature is implemented by a small Go server that uses PhantomJS as the renderer.
I have always wanted to swap the rendering engine for something like Pango, but in practice being able to lay out the image in HTML is far more flexible and convenient. And the feature is used at a frequency that makes the load easy to handle.
We let people set up custom domains for their Medium publications. We wanted single sign-on across every domain, as well as HTTPS everywhere, so making it all work was not trivial. We have a set of HAProxy servers that manage certificates and route traffic to the main application servers. Some manual work is still required when setting up a domain, but we have automated much of it through integration with Namecheap. Certificate provisioning and publication binding are handled by dedicated services.
On the web side we prefer to stay close to the metal. We have our own single-page application (SPA) framework, which uses Closure as its standard library. We use Closure Templates for rendering on both the client and the server, and the Closure Compiler for code minification and splitting into modules. The editor is the most complex part of our web app; Nick has already written about it.
Both of our mobile apps are native, with minimal use of web views.
On iOS we use a mix of home-grown frameworks and built-in components. On the network layer we use NSURLSession for making requests and Mantle for parsing JSON into models. We have a caching layer built on top of NSKeyedArchiver. We have a generic mechanism for rendering list items with shared styling, which lets us quickly build new lists with different content types. The post view is built with a UICollectionView and a custom layout. We use shared components to render the full post and the post preview.
We build every commit and distribute the build to our employees so we can try out the app as quickly as possible. The cadence of App Store releases is tied to the review cycle, but we try to ship even when the changes are minimal.
For tests we use XCTest and OCMock.
On Android, we use the latest SDK and support libraries. We do not use any comprehensive frameworks, preferring instead to establish consistent patterns for common problems. We use Guava for everything Java is missing, and we reach for third-party tools to solve narrower problems. We define our API with protocol buffers and then generate the objects we use in the app.
We use mockito and robolectric. We write high-level tests that spin up activities and poke around: we create a basic version when we first add a screen or before a refactor, and they evolve as we reproduce bugs to guard against regressions. We write low-level tests to exercise individual classes as we add features; they let us see how our classes interact.
Every commit is automatically published to the Play Store as an alpha build and distributed to our employees (this also applies to the app for our internal version of Medium, Hatch). On Fridays we usually promote the alpha to our beta testers and let them play with it over the weekend; on Monday we roll it out to everyone. Since the latest code is always release-ready, whenever we find a serious bug we publish the fix right away. If we are worried about a new feature, we let the beta testers play with it a little longer; if we are happy with everything, we release more often.
All our clients use feature flags issued by the server, which we call variants. They are used for A/B testing and for gating unfinished features.
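The article does not describe how variants are assigned; a minimal sketch of the usual approach (deterministic bucketing by user ID, with made-up variant names and percentages) might look like this in Go:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// A hypothetical sketch of server-issued feature flags ("variants"): the
// server decides which variants a user gets, and the clients only consume
// the resulting map. Names and percentages are made up for illustration.
type Variant struct {
	Name    string
	Percent uint32 // 0-100: share of users who get the variant
}

// bucket deterministically maps a user to 0-99 per experiment, so the same
// user always lands in the same group for the same variant.
func bucket(userID, variant string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(userID + ":" + variant))
	return h.Sum32() % 100
}

// VariantsFor computes the set of variants the server would hand to a client.
func VariantsFor(userID string, all []Variant) map[string]bool {
	out := make(map[string]bool, len(all))
	for _, v := range all {
		out[v.Name] = bucket(userID, v.Name) < v.Percent
	}
	return out
}

func main() {
	experiments := []Variant{
		{Name: "new_reading_list", Percent: 50},    // A/B test at 50%
		{Name: "half_finished_editor", Percent: 0}, // unfinished feature, off for everyone
	}
	fmt.Println(VariantsFor("user-123", experiments))
}
```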
There is a lot more around the product that I have not mentioned: Algolia, which let us move faster on search; SendGrid for inbound and outbound email; Urban Airship for notifications; SQS for queue processing; Bloomd for Bloom filters; PubSubHubbub and Superfeedr for RSS; and so on and so forth.
We practice continuous integration and continuous delivery, shipping as quickly as possible. All of it is orchestrated by Jenkins.
Historically we used Make as the build system, but for newer projects we are migrating to Pants.
We have both unit tests and HTTP-level functional tests. All commits are tested before being merged. Together with the team at Box, we have been working on using Cluster Runner to distribute the tests and make testing fast; it also integrates nicely with GitHub.
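The article does not show what an HTTP-level functional test looks like; as a minimal sketch in Go using the standard net/http/httptest package (the handler and route are hypothetical), the idea is to exercise a route over real HTTP rather than call the handler directly:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"testing"
)

// healthHandler is a hypothetical endpoint standing in for a real application
// route; the point is the shape of an HTTP-level functional test.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	fmt.Fprint(w, `{"status":"ok"}`)
}

// TestHealthEndpoint starts an in-process HTTP server and hits the route
// over the wire, checking status code and body.
func TestHealthEndpoint(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(healthHandler))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/health")
	if err != nil {
		t.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		t.Fatalf("expected 200, got %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		t.Fatalf("read body: %v", err)
	}
	if !strings.Contains(string(body), `"ok"`) {
		t.Errorf("unexpected body: %s", body)
	}
}
```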
We deploy to the staging environment as fast as we can, currently every 15 minutes, and successful builds then become candidates for promotion to production. The main application servers are usually deployed about five times a day, sometimes up to ten.
We do blue/green deploys. Before releasing to production, we route traffic to a canary instance and monitor the error rate before proceeding. Rollbacks are done via DNS.
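The article does not spell out how that gate works; as a rough sketch of the idea (thresholds and names are hypothetical), the canary's error rate is compared against a limit before the new stack is promoted:

```go
package main

import "fmt"

// CanaryStats is a hypothetical summary of traffic sent to a canary instance
// during a blue/green deploy.
type CanaryStats struct {
	Requests int
	Errors   int
}

// ShouldPromote gates the rollout: only promote the new (green) stack when
// the canary has seen enough traffic and its error rate is under a threshold.
func ShouldPromote(s CanaryStats, minRequests int, maxErrorRate float64) bool {
	if s.Requests < minRequests {
		return false // not enough signal yet
	}
	rate := float64(s.Errors) / float64(s.Requests)
	return rate <= maxErrorRate
}

func main() {
	fmt.Println(ShouldPromote(CanaryStats{Requests: 5000, Errors: 3}, 1000, 0.001))  // true
	fmt.Println(ShouldPromote(CanaryStats{Requests: 5000, Errors: 50}, 1000, 0.001)) // false
}
```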
A lot! There is plenty left to do to improve the product and to make reading and writing posts better. We are also starting to work on monetization for authors and publishers. This is new territory for us, and we are entering it with an open mind. We believe the future demands new mechanisms for funding content, and we want to make sure our features encourage quality and value.
We are always interested in talking to highly motivated engineers with experience building end-user products. We have no requirements about which languages you already know, since we believe a good engineer can pick up a new discipline quickly, but you should be curious, knowledgeable, decisive, and empathetic. iOS, Android, Node, or Go are all good starting points.
We are also building out a Product Science team, so we are looking for people with experience building data pipelines and large analytics systems.
I am also looking for engineering leaders to help grow the team. They should be interested in organizational theory, want to stay hands-on, and be ready to lead people.
You can read more at Medium Engineering .
Blogs: Nathaniel Felsen, Jamie Talbot, Nick Santos, Jon Crosby, Rudy Winnacker, Dan Benson, Jean Hsu, Jeff Lu, Daniel McCartney, and Kate Lee.
Source: https://habr.com/ru/post/332860/