
Dropbox: inside view

In this article I look at the internals of the popular cloud storage service Dropbox. In particular, I cover how the Dropbox protocol works and present statistics on its use in several European countries. I also compare it with other services, such as iCloud, Google Drive, and SkyDrive.

The article is purely technical. There will be no summary tables of cost per GB and no analysis of how much extra space you can get for inviting "friends."

The text is based on the research paper "Inside Dropbox: Understanding Personal Cloud Storage Services" (PDF).

Over the past few years, cloud storage services have surged in popularity. All the major players and several young startups have joined the arms race. Yet almost everything about how these services work internally, and the real numbers on their use, is a closely guarded secret. We are fed only figures that have passed through the marketing department, which of course differ somewhat from reality. So let's dig deeper together with Idilio Drago, Anna Sperotto, Marco Mellia, Ramin Sadre, Maurizio M. Munafò and Aiko Pras, the authors of the study.

Introduction


The Dropbox client is written mainly in Python and uses third-party libraries such as librsync. It supports all major desktop operating systems: Windows, Mac, and Linux. The choice of Python suggests that the client was designed to be easy to port across platforms.

The basic unit of the system is a block (chunk) of up to 4 MB. A larger file is split into several blocks, and the system handles each block independently of the others. A SHA-256 hash is computed for each block and stored as part of the file's metadata. Dropbox reduces the amount of data transferred by sending only the differences (deltas) for modified blocks. In addition, the client keeps all file metadata locally, synchronizes it with the server, and sends only the changes since the previous version (incremental updates).
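The block-and-hash idea can be sketched in a few lines of Python. This is a minimal illustration, not the client's actual code: the function names block_hashes and changed_blocks are my own, and the sketch only detects which blocks changed. It does not implement the librsync-style delta encoding within a block.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, the maximum block size described in the article


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Split data into fixed-size blocks and return the SHA-256 hex digest of each."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(old: list[str], new: list[str]) -> list[int]:
    """Return indices of blocks in `new` that are absent from `old` or differ from it."""
    return [i for i, digest in enumerate(new) if i >= len(old) or old[i] != digest]


# Hypothetical example: a 9 MB file (three blocks) whose last byte changes.
original = bytes(9 * 1024 * 1024)
modified = original[:-1] + b"\x01"
print(changed_blocks(block_hashes(original), block_hashes(modified)))  # [2]
```

Only the third block has a new hash, so only that block would need to be considered for upload.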

Dropbox uses two types of servers: control (management) servers and data storage servers. The management servers are operated by Dropbox itself, while the data servers run at Amazon (Amazon S3, EC2). Communication with the servers uses HTTPS.

The domain names used by Dropbox always end with dropbox.com. The table below lists the subdomains for management and data servers.

Subdomain            Hosting  Description
client-lb / clientX  Dropbox  Metadata
notifyX              Dropbox  Notifications
api                  Dropbox  API control
www                  Dropbox  Web servers
d                    Dropbox  Event logs
dl                   Amazon   Direct links
dl-clientX           Amazon   Client storage
dl-debugX            Amazon   Back traces
dl-web               Amazon   Web storage
api-content          Amazon   API storage


Dropbox: inside


Since Dropbox uses HTTPS to encrypt the traffic between the client and its servers, simply intercepting that traffic yields nothing useful. For the study, we installed Squid and routed all traffic from a Linux computer through this proxy. SSL-bump was also enabled on the proxy so that the SSL traffic could be decrypted. The final step was to install a self-signed certificate on Squid and to replace the certificate inside the running Dropbox application. This configuration makes it possible to decrypt and inspect Dropbox traffic.




The illustration shows the protocol Dropbox uses to upload locally modified blocks to its servers. After the client registers with the management servers at clientX.dropbox.com, it issues the list command to fetch metadata changes, that is, the difference between the local copy and what is on the server. As soon as a local file changes, the client sends the commit_batch command (to client-lb.dropbox.com) with the modified metadata. The server then replies with need_blocks, listing the blocks it is missing, and the client uploads those blocks to Amazon (dl-clientX.dropbox.com). The server confirms each stored block with an OK response.

Finally, the client sends commit_batch to the server once more and receives confirmation that all blocks have been received. Data storage transactions can run in parallel.
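The exchange described above can be mimicked with a small in-memory simulation. Everything except the command names commit_batch, need_blocks and OK is an assumption made for illustration: the classes FakeMetadataServer and FakeStorageServer, the method store_block and the function upload are hypothetical, and the real protocol runs over HTTPS with far more state.

```python
import hashlib


class FakeMetadataServer:
    """Hypothetical stand-in for the management server (client-lb.dropbox.com)."""

    def __init__(self):
        self.stored = set()  # hashes of blocks already present in storage

    def commit_batch(self, hashes):
        """Return the hashes that are still missing (the need_blocks reply)."""
        return [h for h in hashes if h not in self.stored]


class FakeStorageServer:
    """Hypothetical stand-in for the Amazon storage server (dl-clientX.dropbox.com)."""

    def __init__(self, metadata):
        self.metadata = metadata
        self.blocks = {}

    def store_block(self, digest, block):
        """Save one block and acknowledge it."""
        self.blocks[digest] = block
        self.metadata.stored.add(digest)
        return "OK"


def upload(blocks, metadata, storage):
    """Run commit_batch, send the needed blocks, then commit_batch again."""
    by_hash = {hashlib.sha256(b).hexdigest(): b for b in blocks}
    hashes = list(by_hash)
    need_blocks = metadata.commit_batch(hashes)  # first commit_batch
    for digest in need_blocks:
        if storage.store_block(digest, by_hash[digest]) != "OK":
            raise RuntimeError("block was not acknowledged")
    remaining = metadata.commit_batch(hashes)  # second commit_batch
    return need_blocks, remaining


metadata = FakeMetadataServer()
storage = FakeStorageServer(metadata)
sent, remaining = upload([b"block one", b"block two"], metadata, storage)
print(len(sent), remaining)  # 2 []
```

Uploading the same blocks a second time sends nothing, because the first commit_batch reply is already empty. That is the point of exchanging hashes before data.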

Control protocol

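Dropbox uses several groups of management servers. Judging by the subdomain table above, they include metadata servers (client-lb / clientX), notification servers (notifyX), and event log servers (d).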



Data set and popularity

We chose to monitor Dropbox passively. To capture traffic we used the open source tool Tstat. Tstat collects a wide range of information about TCP connections and reports more than a hundred parameters per connection. We took a few extra steps to analyze Dropbox specifically.

Since Dropbox uses HTTPS, we relied on the server name in the certificate, which is *.dropbox.com in every certificate Dropbox uses. This was important for classifying the traffic correctly.

We supplemented this information with the responses of the DNS servers that the clients queried. This let us link IP addresses to server names.
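Once a server name is known, mapping it to a server role is a simple lookup against the subdomain table from the introduction. The sketch below is my own illustration, not part of Tstat: the function classify is hypothetical, and I assume that the X in names such as clientX or dl-clientX stands for a numeric suffix.

```python
import re

# Subdomain prefix -> (hosting, description), copied from the subdomain table.
SERVERS = {
    "client-lb": ("Dropbox", "Metadata"),
    "client": ("Dropbox", "Metadata"),
    "notify": ("Dropbox", "Notifications"),
    "api": ("Dropbox", "API control"),
    "www": ("Dropbox", "Web servers"),
    "d": ("Dropbox", "Event logs"),
    "dl": ("Amazon", "Direct links"),
    "dl-client": ("Amazon", "Client storage"),
    "dl-debug": ("Amazon", "Back traces"),
    "dl-web": ("Amazon", "Web storage"),
    "api-content": ("Amazon", "API storage"),
}

SUFFIX = ".dropbox.com"


def classify(hostname):
    """Return (hosting, description) for a Dropbox server name, or None if unknown."""
    host = hostname.lower().rstrip(".")
    if not host.endswith(SUFFIX):
        return None
    label = host[: -len(SUFFIX)]
    if not label or "." in label:  # *.dropbox.com covers a single label only
        return None
    prefix = re.sub(r"\d+$", "", label)  # assumption: X is a trailing number
    return SERVERS.get(prefix)


print(classify("dl-client42.dropbox.com"))  # ('Amazon', 'Client storage')
print(classify("notify3.dropbox.com"))      # ('Dropbox', 'Notifications')
print(classify("example.com"))              # None
```

With this mapping, each observed connection can be attributed either to the control plane (Dropbox) or to data storage (Amazon).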

Tstat also exposed the unencrypted device and directory information that the client exchanges with the notification server.

The data was collected by Tstat installations at four vantage points in Europe. The records labeled Home 1 and Home 2 come from customers of a well-known Internet service provider (ISP) that offers access over ADSL and fiber. The data labeled Campus 1 and Campus 2 was collected at universities. The measurements ran from March 24, 2012 to May 5, 2012.

Name      Access type       Number of IP addresses  Data volume (GB)
Campus 1  Wired             400                     5,320
Campus 2  Wired / wireless  2,528                   55,054
Home 1    FTTH / ADSL       18,785                  509,909
Home 2    ADSL              13,723                  301,448

The graph below shows how many distinct IP addresses contacted a cloud storage service at least once on a given day.



The second graph shows how much data was transferred to this cloud storage per day.



I would like to draw attention to the following:

For comparison, here is the usage data for YouTube and Dropbox at Campus 2.



The table shows the total Dropbox traffic that we monitored during our measurements.

             Campus 1  Campus 2   Home 1     Home 2   Total
Requests     167,189   1,902,824  1,438,369  693,086  4,204,666
Volume (GB)  146       1,814      1,153      506      3,624
Devices      283       6,609      3,350      1,313    11,561
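To put the totals in perspective, the volume can be divided by the number of devices at each vantage point. This is my own back-of-the-envelope calculation from the table, not a figure from the study, and the function name gb_per_device is hypothetical. I read the values 1.814 and 3.624 in the volume row as 1,814 GB and 3,624 GB.

```python
# Volume (GB) and device counts copied from the traffic table.
TRAFFIC = {
    "Campus 1": {"volume_gb": 146, "devices": 283},
    "Campus 2": {"volume_gb": 1814, "devices": 6609},
    "Home 1": {"volume_gb": 1153, "devices": 3350},
    "Home 2": {"volume_gb": 506, "devices": 1313},
}


def gb_per_device(name):
    """Average Dropbox volume per device over the whole measurement period."""
    row = TRAFFIC[name]
    return row["volume_gb"] / row["devices"]


for name in TRAFFIC:
    print(f"{name}: {gb_per_device(name):.2f} GB per device")
```

These averages cover the whole measurement period of about six weeks, so they say nothing about how unevenly the traffic is spread across devices.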

Traffic analysis

The graphs show the cumulative distribution function (CDF) of the number of blocks transferred.



It turned out that in more than 80% of cases, no more than 10 blocks are transferred when storing data. The analysis shows that the main Dropbox use case is ongoing work with small, frequently changing files.
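For readers unfamiliar with this kind of graph, here is how an empirical CDF and a statement like "80% of cases do not exceed 10 blocks" are computed. The sample values are invented for the example and are not measurements from the study. The function names are hypothetical as well.

```python
def empirical_cdf(samples):
    """Return sorted (value, fraction of samples <= value) pairs."""
    ordered = sorted(samples)
    n = len(ordered)
    cdf = []
    for i, value in enumerate(ordered, start=1):
        if cdf and cdf[-1][0] == value:
            cdf[-1] = (value, i / n)  # repeated value: keep the highest fraction
        else:
            cdf.append((value, i / n))
    return cdf


def fraction_at_most(samples, threshold):
    """Share of samples that do not exceed the threshold."""
    return sum(1 for s in samples if s <= threshold) / len(samples)


# Invented block counts per storage transfer, for illustration only.
blocks_per_transfer = [1, 1, 1, 2, 2, 3, 5, 8, 20, 50]
print(empirical_cdf(blocks_per_transfer))
print(fraction_at_most(blocks_per_transfer, 10))  # 0.8
```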

As we discussed above, Dropbox uses central data storage servers. This immediately leads to a question about the speed of the service for users who are geographically far from the servers.

The maximum throughput we observed was close to 10 Mbit/s, on files larger than 1 MB. The average throughput at Campus 2 was 462 kbit/s for writes and 797 kbit/s for reads. At Campus 1 it was 359 kbit/s for writes and 783 kbit/s for reads.



The graphs also show that throughput depends strongly on the number of blocks: the more blocks, the lower the throughput.

Changes in Dropbox 1.4.0

Starting with version 1.4.0, Dropbox added two new commands, store_batch and retrieve_batch, which allow the client to work with several blocks at once. This change should significantly improve service throughput.
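A toy model shows why batching should help. I assume that every acknowledged request costs one extra round trip on top of the raw transfer time. The 100 ms round-trip time, the 1 MB payload and the function effective_throughput are my assumptions. Only the 10 Mbit/s peak rate comes from the measurements above.

```python
import math


def effective_throughput(total_bytes, n_blocks, bandwidth_bps, rtt_s, batch_size=1):
    """Bits per second when each batch of blocks costs one extra round trip."""
    round_trips = math.ceil(n_blocks / batch_size)
    transfer_time = total_bytes * 8 / bandwidth_bps
    return total_bytes * 8 / (transfer_time + round_trips * rtt_s)


# Assumed scenario: 1 MB of data split into n blocks, 10 Mbit/s link, 100 ms RTT.
for n in (1, 10, 100):
    per_block = effective_throughput(1_000_000, n, 10e6, 0.1)
    batched = effective_throughput(1_000_000, n, 10e6, 0.1, batch_size=n)
    print(f"{n:>3} blocks: {per_block / 1e6:.2f} Mbit/s one by one, "
          f"{batched / 1e6:.2f} Mbit/s in one batch")
```

In this model the per-block acknowledgements dominate as soon as the blocks are small, which matches the trend in the graphs. Sending all blocks in a single batch brings the rate back to the single-block case.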

Number of devices

The graph shows the number of Dropbox installations for users at home. In about 60% of cases, users have only 1 device with Dropbox. 25% of home users have 2 devices using Dropbox.


Average usage time

The graph shows how long Dropbox is typically in use. To measure usage time, we looked at how long the client stayed in contact with the notification server. Since the client always keeps this connection open, or reopens it when it drops, this is a good way to estimate usage time.


The graph shows that in most cases a Dropbox session lasts less than 4 hours. The exception is Campus 1, where there are many workstations and computers that run around the clock.
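The estimate can be sketched as interval merging: consecutive connections to the notification server that follow each other closely are treated as one session. The 60-second gap threshold, the function session_durations and the sample timestamps are assumptions for illustration, not parameters from the study.

```python
def session_durations(connections, max_gap=60.0):
    """Merge (start, end) connection times in seconds into sessions.

    Connections that start within max_gap seconds of the previous end are
    treated as a reconnect within the same session. Returns session lengths.
    """
    sessions = []
    for start, end in sorted(connections):
        if sessions and start - sessions[-1][1] <= max_gap:
            sessions[-1][1] = max(sessions[-1][1], end)  # extend current session
        else:
            sessions.append([start, end])  # start a new session
    return [end - start for start, end in sessions]


# Invented example: two connections 30 s apart, then another one much later.
print(session_durations([(0, 100), (130, 200), (1000, 1100)]))  # [200, 100]
```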

Initial data


The source data used in this article is available for download for further analysis (Baseline).

Note that the original paper contains more information. It may answer questions you have after reading this article.

Source: https://habr.com/ru/post/163189/

