SaveWeb: site history

The modern Internet develops quickly and changes constantly. Where there used to be simple pages of text, there are now “live” sites with original designs, whose lives we influence and which in turn change themselves, adapting to the world around them like chameleons.

When I realized how fast time flies, burning the old and creating the new, I wanted to stop for a moment, turn around and look back. It would be interesting to return to the past: there you can see something you have never seen before, something that no longer exists in the present and will be different again in the future. Satisfying simple curiosity, comparing how a site looked a year ago with how it looks now: this is what pushed me to create SaveWeb, which is still very young but already seems to know how to stop time :)

Idea

The idea of regularly saving the state of sites as screenshots had been floating around in my head for a long time. At first I did not dare to implement it because of the many obvious difficulties: it needs servers, a lot of disk space to store the data, plenty of bandwidth and other resources that are hard to recoup. In effect, the idea is unprofitable from birth, and everyone tried to convince me of this. However, curiosity, interest and a modest historical significance won out, and once I had convinced myself first of all, I set about building a prototype.

Backend

For the prototype I took the 10,000 most popular sites from the Alexa ranking, after removing a certain amount of garbage (unfortunately, not all of it). Then I wrote two Python scripts: the controller and the screenshot taker itself (hereafter the shoter). The shoter is a console application that accepts certain parameters and produces either a saved screenshot (on success) or an error code. The screenshot itself is created with WebKit (Qt) running under Xvfb. The controller manages the whole process: it handles the logic, processes the results returned by the shoters and provides multithreading. As experiments have shown, this is the most stable and efficient solution for batch “screenshotting” under *nix.
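
As a rough illustration of the controller and shoter split described above (not the actual SaveWeb code), the sketch below runs one console process per URL on a thread pool and collects the exit codes. Everything named here is an assumption: `run_shoter`, `run_batch`, the timeout, the `-1` code for a hung process and the convention that exit code 0 means success. The real shoter command line is not given in the article, so `fake_shoter_command` is a hypothetical stand-in that only imitates a program returning an error code.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_shoter(command, timeout=60):
    """Run one shoter process and return its exit code (0 is assumed to mean success)."""
    try:
        return subprocess.run(command, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        return -1  # hypothetical code for a shoter that hung and was killed


def run_batch(urls, build_command, workers=4):
    """Run one shoter per URL on a thread pool and map each URL to its exit code."""
    urls = list(urls)  # the URLs are iterated twice below
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = pool.map(lambda url: run_shoter(build_command(url)), urls)
        return dict(zip(urls, codes))


def fake_shoter_command(url):
    """Stand-in for the real shoter: exits with 0 for http(s) URLs, 2 otherwise."""
    script = "import sys; sys.exit(0 if sys.argv[1].startswith('http') else 2)"
    return [sys.executable, "-c", script, url]


if __name__ == "__main__":
    print(run_batch(["http://example.com", "ftp://example.com"], fake_shoter_command))
```

Threads are enough here because each worker spends its time waiting for an external process, so the controller itself does very little work.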

Frontend

All of the (simple) logic is written in PHP. Sites can be viewed as of a specific date. The site database is updated automatically: if the URL you enter is not there yet, it soon will be. On a site's page you can also press the R key and wander through the histories of other sites. That is about all the functionality there is :)

Design

The minimalistic design was drawn long before development started (as a way of documenting the idea), and the markup for it was done in a day.

Constraints

Unfortunately, I cannot fully finance the project out of my own pocket (for those who want to help financially there is a button on the site, and I will be very grateful), so resources are somewhat limited:
- A site, as SaveWeb understands it, is just its main page (in effect, a domain name). Individual pages are not crawled. This is more of a conceptual limitation, and it can be lifted in the future if resources permit (a lot of them would be needed)
- JPEG with compression is used (not very strong, but sometimes noticeable)
- The screenshot is taken at 1280x1024 and then scaled down to a width of 1024 pixels (to save space)
- The height of the screenshot is cropped to 3072 pixels, so if the page is taller (longer) than that, not everything fits into the “frame”
- Hotlink protection (a ban on embedding images directly from the server) is temporarily turned on
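
The size limits in the list above come down to simple arithmetic. Here is a minimal sketch, assuming the crop to 3072 pixels is applied after the image has been scaled down (the order is not stated above); the function name `final_size` and its parameters are made up for illustration.

```python
def final_size(page_height, shot_width=1280, out_width=1024, max_height=3072):
    """Return (width, height) of the stored image for a page rendered at shot_width."""
    scale = out_width / shot_width              # 1024 / 1280 = 0.8
    scaled_height = round(page_height * scale)  # height after scaling the width down
    # Assumption: anything below max_height is cut off after scaling.
    return out_width, min(scaled_height, max_height)


if __name__ == "__main__":
    print(final_size(3000))  # a page that fits into the "frame"
    print(final_size(5000))  # a long page that gets cropped
```

Under this assumption, a page rendered 3000 pixels tall is stored at 1024x2400, while a 5000-pixel page would scale to 4000 pixels and is cut to 1024x3072.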

Future plans

- Naturally, I would like to remove as many restrictions as possible.
- Collect meta-information about the sites: particles, PR, Alexa rank and other indicators and parameters, and plot graphs from them. I want to complement each site snapshot with other, no less useful and interesting information.
- An English version of the site (coming soon)
- There are many ideas on how to optimize the system as a whole, cut resource costs and so on.
- Add more features to the service interface for greater convenience and clarity (for example, a comparison mode)

Summary

Right now SaveWeb stops time about once a month. Several iterations have already been completed, so there is already something to look at. For now there are enough funds and resources to run the project for six months.

Recursive SaveWeb. Not much changed in two weeks

Recursive SaveWeb. If you compare the two dates, you can see small changes :)

Why all this?

I have described the creation of a simple but ambitious project whose idea had intrigued me for a long time. I am very pleased that a prototype (or an alpha version, if you like) already exists. Sites are already being saved and history is already being recorded, which means the main goal has been achieved. However, I am not going to stop there, and I really hope that in a few years we will still be able to look at the results of SaveWeb. After all, the longer it runs, the more significant its existence becomes.

With this post I would like to ask the Habr community for its opinion of the project and to collect feedback in the form of ideas and thoughts. Thank you in advance for your attention.

But what about the Wayback Machine?

Commenters ask how SaveWeb differs from that well-known project. I expected such questions, so I will answer them as part of the article :)

The main difference is in goals and approach. The Wayback Machine primarily preserves the informational content of a resource and tries to cover as many sites as possible, whatever they are. The idea of SaveWeb is to convey what the Internet used to look like, not to save everything. There is no need to save millions of sites nobody has heard of; it is enough to save the popular, high-traffic ones, the very ones that change like chameleons.

However, the Wayback Machine has one huge and indisputable advantage: time. Nothing can be done about that, since time flies. As they say, better (to start) late than never ;)

Source: https://habr.com/ru/post/114921/
