
Caching Tutorial Part 1

A fairly detailed and interesting presentation of the material on caching and how to use it. Part 2 is published separately.

The author, Mark Nottingham, is a recognized expert on the HTTP protocol and web caching. He chairs the IETF HTTPbis Working Group, took part in editing HTTP/1.1 Part 6: Caching, and is currently involved in the development of HTTP/2.0.

The text is distributed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
From the translator: please report typos and inaccuracies via private message. Thank you.



A web cache sits between one or more web servers and a client (or many clients) and watches requests as they come in, keeping copies of the responses (HTML pages, images, and files, collectively known as representations; translator's note: I will use the word "content", which sounds more natural to my ear) for its own use. Then, if another request arrives for the same URL, the cache can use the previously saved response instead of asking the server again.
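For illustration only, here is a minimal sketch of that idea in Python (not from the original article): a toy in-memory cache keyed by URL that reuses a stored response when one exists and otherwise asks the server. The fetch_from_origin function and the URL are hypothetical placeholders, and real caches also apply the freshness and validation rules described later.

```python
# A toy illustration of the basic idea: keep responses keyed by URL
# and reuse them on repeated requests for the same URL.

cache = {}  # maps URL -> stored response body

def fetch_from_origin(url):
    # Hypothetical placeholder for a real request to the origin server.
    return f"<html>response body for {url}</html>"

def get(url):
    if url in cache:                   # cache hit: reuse the saved copy
        return cache[url]
    response = fetch_from_origin(url)  # cache miss: ask the server
    cache[url] = response              # save a copy for next time
    return response

print(get("http://example.com/"))  # fetched from the "origin"
print(get("http://example.com/"))  # served from the cache
```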

There are two main reasons why a web cache is used:

1. Reduced latency: since the response is served from the cache (which is "closer" to the client), it takes less time to retrieve and display the content. This makes the web feel more responsive (translator's note: "responsive" here means quick to answer a request, not emotionally responsive).

2. Reduced network traffic: reusing content cuts down the amount of data sent to the client. This, in turn, saves money if the client pays for traffic, and keeps bandwidth requirements lower and more manageable.

Types of web caches


Browser cache

If you explore the settings of any modern web browser (for example, Internet Explorer, Safari, or Mozilla), you will probably notice a "Cache" setting. It lets you set aside a section of your computer's hard disk to store previously viewed content. The browser cache works according to fairly simple rules: it usually checks that the stored content is still "fresh" once per browser session.

This cache is especially useful when the user clicks the "Back" button or follows a link to a page they have just viewed. Likewise, if you reuse the same navigation images across your site, they will be served from the browser cache almost instantly.

Proxy cache

A proxy cache works on the same principle, but on a much larger scale. Proxies serve hundreds or thousands of users; large corporations and Internet providers often set them up on their firewalls or deploy them as standalone devices (intermediaries).

Since proxies are part of neither the client nor the origin server, but sit out on the network, requests have to be routed to them somehow. One way is to use the browser's settings to tell it manually which proxy to contact; another is to use an interception proxy, where the network itself redirects web requests to the proxy, so clients need no configuration and may not even know the proxy exists.
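As an illustration of the first approach (explicit client configuration), the sketch below uses Python's requests library to send a request through an explicitly configured proxy. The proxy address 10.0.0.1:3128 is a made-up example; an interception proxy would need no such configuration at all.

```python
import requests

# Explicitly tell the HTTP client which proxy to use (hypothetical address).
proxies = {
    "http": "http://10.0.0.1:3128",
    "https": "http://10.0.0.1:3128",
}

# The request goes to the proxy, which may answer from its cache
# or forward the request to the origin server.
response = requests.get("http://example.com/", proxies=proxies)
print(response.status_code)
```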

Proxy caches are a kind of shared cache: instead of serving a single person, they work with a large number of users and are therefore very good at reducing latency and network traffic, mainly because popular content is requested many times over.

Gateway Cache

Also known as "reverse proxy caches" or "surrogate caches", gateway caches are intermediaries too, but instead of being deployed by network administrators to save bandwidth, they are usually deployed by webmasters themselves to make their sites more scalable, reliable, and better performing.

Requests can be routed to gateway caches in a number of ways, but typically some form of load balancer is used.

Content delivery networks (CDNs) distribute gateway caches throughout the Internet (or some part of it) and serve cached content on behalf of interested websites. Speedera and Akamai are examples of CDNs.

This tutorial focuses mostly on browser and proxy caches, although some of the information applies to gateway caches as well.

Why should I use it?


Caching is one of the most misunderstood technologies on the Internet. Webmasters, in particular, fear losing control over their site, because proxies can "hide" their users from them, making it difficult to see who is actually visiting.

Unfortunately for them, even if web caches did not exist, there are too many variables on the Internet for site owners to get a precise picture of how users interact with their site. If this is a major concern for you, this tutorial will show you how to obtain the statistics you need without making your site cache-unfriendly.

Another concern is that a cache can serve content that is out of date, or stale.

On the other hand, if you design your website responsibly, caching can help it load faster and keep the load on your server and Internet connection within acceptable limits. The difference can be dramatic: a site that does not work well with caches may take several seconds to load, while one that takes advantage of caching can seem instantaneous. Users will appreciate the fast load times and may visit more often.

Think of it this way: many large Internet companies spend millions of dollars building server farms around the world to replicate their content, all in order to get data to their users as fast as possible. Caches do the same thing for you, and they sit even closer to the end user.

CDNs are an interesting development from this point of view because, unlike many proxy caches, their gateway caches act in the interests of the website being cached. However, even when you use a CDN, you still have to account for the proxy caching and browser caching that happen downstream.

In short, proxy and browser caches will be used whether you like it or not. Remember: if you do not configure your site to be cached correctly, it will be cached according to whatever defaults the caches apply.

How web caches work


All caches have a set of rules that they use to decide when to serve stored content, if it is available. Some of these rules are set by the protocols (HTTP/1.0 and HTTP/1.1), and some are set by the cache's administrator (the browser user or the proxy administrator).

Generally speaking, these are the most common rules (don't worry if you don't understand the details yet; they are explained below, and a rough code sketch of this decision logic follows the list):

  1. If the response's headers tell the cache not to keep it, it won't.
  2. If the request is authenticated or secure (that is, HTTPS), the content won't be cached.
  3. Cached content is considered "fresh" (that is, it can be sent to the client without checking with the origin server) if:
    • It has an expiry time or another age-controlling header set, and it has not yet expired.
    • The cache has checked the content recently, and it was last modified a relatively long time ago.

    Fresh content is served directly from the cache, without checking with the origin server.
  4. If the content is stale, the origin server will be asked to validate it, that is, to tell the cache whether the stored copy is still good.
  5. Under certain circumstances (for example, when it is disconnected from the network) a cache can serve stale responses without checking with the origin server.
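To make these rules more concrete, here is the rough, heavily simplified sketch in Python mentioned above, showing the kind of decision a cache makes when it already holds a copy of a response. It looks only at the Cache-Control and Expires headers and at the request scheme; real caches also consider many other directives, heuristic freshness based on Last-Modified, and more. The header names are real HTTP headers, but the helper functions and data layout are assumptions for illustration only.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def is_cacheable(url, response_headers):
    """Rough version of rules 1-2: may this response be stored at all?"""
    cache_control = response_headers.get("Cache-Control", "")
    if "no-store" in cache_control:      # rule 1: headers say "do not keep"
        return False
    if url.startswith("https://"):       # rule 2 (simplified): secure traffic
        return False
    return True

def is_fresh(response_headers, now=None):
    """Rough version of rule 3: may the stored copy be served without asking the server?"""
    now = now or datetime.now(timezone.utc)
    expires = response_headers.get("Expires")
    if expires:
        try:
            return parsedate_to_datetime(expires) > now  # not yet expired
        except (TypeError, ValueError):
            return False
    return False  # no explicit freshness info: treat as stale in this sketch

# Example: a response that expires far in the future is still fresh.
headers = {"Cache-Control": "public", "Expires": "Fri, 01 Jan 2100 00:00:00 GMT"}
print(is_cacheable("http://example.com/logo.png", headers))  # True
print(is_fresh(headers))                                     # True
```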

If a response has no validator (an ETag or Last-Modified header) and carries no explicit freshness information, it will usually (but not always) be considered uncacheable.

Freshness and validation are the most important mechanisms by which a cache works with content. A fresh representation is available immediately from the cache; a validated representation avoids re-sending the entire response if it has not changed.
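Here is a sketch of what validation looks like in practice, assuming Python's requests library and a hypothetical URL: the client re-requests the resource with If-None-Match (the stored ETag) and/or If-Modified-Since (the stored Last-Modified date); if the content has not changed, the server can answer 304 Not Modified with no body, and the cached copy is reused.

```python
import requests

url = "http://example.com/article.html"   # hypothetical resource

# First request: store the body along with its validators.
first = requests.get(url)
stored_body = first.content
etag = first.headers.get("ETag")
last_modified = first.headers.get("Last-Modified")

# Later, revalidate instead of downloading the whole response again.
conditional_headers = {}
if etag:
    conditional_headers["If-None-Match"] = etag
if last_modified:
    conditional_headers["If-Modified-Since"] = last_modified

second = requests.get(url, headers=conditional_headers)
if second.status_code == 304:
    body = stored_body        # unchanged: reuse the cached copy
else:
    body = second.content     # changed (or no validators): use the new response
```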

Source: https://habr.com/ru/post/203548/

