Recently, my colleagues and I discussed the LinkedIn social network and raised an interesting question. This site has the function of displaying the path from your account to some other account. The maximum number of steps is three, i.e. the path is as follows: You-> Your Friend-> Your Friend's Friend-> The person you are looking for. The question is, how is the search for this path? Is he searched every time you log on to someone else's account? Or are all paths for all users already calculated in advance? The first option leads to large time costs, the second - to large storage costs.
My colleagues and I were lucky, just in May of this year, representatives of LinkedIn at the JavaOne 2008 conference in San Francisco conducted two presentations about the architecture of their social network.
This blog has an excellent overview of the reports, and
here is its translation. By the way, both posts contain links to original presentations of LinkedIn representatives, if anyone is interested.
')
So it turned out that the entire graph of users occupies 12 GB and is constantly kept in memory. But at the beginning of each session, a sample is created from the general graph personally for a given user, i.e. how would look at the social network from his point of view. This personal graph is stored in the cache throughout the session and takes approximately 2 MB for each user. It is updated only if the user himself makes any changes to it, for example, adds or deletes one of his friends. If the changes come from other users, the graph remains unchanged. Obviously, the search for the path to any account occurs in this column.
Thus, LinkedIn implements a hybrid version, i.e. the search for the path occurs on the fly, but at the same time some part of the information (personal graph) is built in advance and stored in memory. An inquisitive reader may ask the question: “As in the case of a full graph, so in the case of a personal graph, we will have to go through all possible paths from the user to the desired person. What then is the advantage of a personal graph? ”The thing is that graphs are represented as either arrays or linear lists. In this and in another case, in order to find an element, at worst it is necessary to go through all the existing elements. In the full column of such elements there are 22 million (exactly as many LinkedIn registered users). The size of the personal graph is obviously much smaller. For example, if a user, each of his friends and friends of their friends has 30 contacts, then the size of such a graph will be only 27931. In addition, in a personal graph to make sure that the path exists, you do not need to go through all possible paths, it’s quite simple check if the required account is in the column. After all, if this graph was built as a “personal look at the network” with a maximum number of steps equal to three, then the presence of any account in it automatically means that the path to the user within three steps exists and it generally makes sense to look for it.
And finally, a small offtopic. I was surprised to find that the graph of the entire social network has 22 million peaks and 120 million edges. This means that each account has an average of just 5.45 friends. Apparently this is due to the fact that a lot of people are registered, but then do not use their account.