Good morning, Habr users. In this post I would like to share my experience with the Twitter API, in particular with crawling a large number of users and collecting information about each one (account creation date, username, screen_name, user's webpage, number of tweets, number of friends, number of followers, location).
This is my first post, so please don't judge too harshly; constructive criticism is welcome.
Task: There are about 100 active and respected Twitter users (T0). For these users, I needed to get their lists of friends (T1) and the personal data of each of those users. T2 (the friends of the users in T1) and T3 are obtained in the same way.
As a result, we have a user base T = T0 + T1 + T2 + T3. Since a Twitter user has about 1286 friends on average (a statistic obtained from data on about 80 million accounts), the number of users in each group grows very quickly:
- T0: 100 users
- T1: ~ 42,000 unique users
- T2: ~ 5,200,000 unique users
- T3: ~ 80 million unique users
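The level-by-level expansion above can be sketched roughly as follows. This is only an illustration, not the code used in the project: `expand_levels` and `get_friends` are hypothetical names, and the sketch assumes that "unique users" means a user is counted only at the first level where they appear.

```python
def expand_levels(seeds, get_friends, depth):
    """Return [T0, T1, ..., T_depth] as sets of user IDs.

    get_friends is a hypothetical callable: user ID -> iterable of friend IDs.
    Assumption: a user belongs only to the first level where they are seen.
    """
    seen = set(seeds)
    levels = [set(seeds)]
    for _ in range(depth):
        frontier = set()
        for user in levels[-1]:
            for friend in get_friends(user):
                if friend not in seen:
                    seen.add(friend)
                    frontier.add(friend)
        levels.append(frontier)
    return levels


# Toy friendship graph standing in for real API data.
toy_graph = {1: [2, 3], 2: [3, 4], 3: [1, 5], 4: [], 5: []}
toy_levels = expand_levels([1], lambda u: toy_graph.get(u, []), depth=2)
```

Deduplicating against everything seen so far is what keeps each level from being a plain multiple of the previous one.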
When crawling this many accounts, the first problem we face is the API request limit. We can make 150 requests per hour without authorization and 350 requests per hour with it. In addition, these 150/350 requests are split across two 30-minute intervals, so a single user can execute only 75/175 requests in 30 minutes. This is clearly not enough to collect this amount of data. To get around it, I used a database of about 3000 accounts (bots) from a botnet that I had developed for the same customer (if anyone is interested, I can describe the botnet's functionality and some "pitfalls" in a separate post). That gave me a budget of roughly 0.5 million requests per 30 minutes (3000 × 175 = 525,000), and from then on the bottleneck was the speed of processing API responses and writing the data to the database.
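To make the arithmetic concrete, here is a minimal sketch of the pooled request budget and of rotating through accounts within one 30-minute window. The `AccountPool` class and its methods are hypothetical names for illustration; the actual project may have distributed requests differently.

```python
from collections import deque

AUTH_LIMIT_PER_WINDOW = 175  # authorized requests per account per 30 minutes


def pooled_budget(account_count, per_window=AUTH_LIMIT_PER_WINDOW):
    """Total requests available to the whole pool in one window."""
    return account_count * per_window


class AccountPool:
    """Round-robin over accounts, tracking the quota left in the current window."""

    def __init__(self, accounts, per_window=AUTH_LIMIT_PER_WINDOW):
        self.per_window = per_window
        self.remaining = {account: per_window for account in accounts}
        self.queue = deque(accounts)

    def acquire(self):
        """Return an account that still has quota, or None if all are exhausted."""
        for _ in range(len(self.queue)):
            account = self.queue[0]
            self.queue.rotate(-1)
            if self.remaining[account] > 0:
                self.remaining[account] -= 1
                return account
        return None

    def reset_window(self):
        """Call when a new 30-minute window starts."""
        for account in self.remaining:
            self.remaining[account] = self.per_window
```

When `acquire()` returns `None`, the caller would wait for the next window and call `reset_window()`.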
To communicate with the API, I did not reinvent the wheel and used the abraham oauth library, which is well known in narrow circles. I only modified it slightly so that it could use multi_curl (remember that we need to make a lot of requests).
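The idea of issuing many requests at once can be illustrated with a small Python analogue. This is not the modified library: it uses a thread pool instead of multi_curl, and `fetch_all` and the `fetch` callable are hypothetical names (in real use `fetch` would perform one signed HTTP request).

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(requests, fetch, max_workers=20):
    """Run fetch(request) for every request concurrently.

    Results are returned in the same order as the input requests.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, requests))


# Stand-in for a network call, so the sketch runs without credentials.
demo_results = fetch_all([1, 2, 3], lambda request: request * 2)
```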
To get a user's friends list, I used the friends/ids API method. This method returns a list of the IDs of the user's friends. If the number of friends exceeds 5000, the result is paginated (I took at most 5000 friends per user and made no additional requests when there were more than 5k).
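A minimal sketch of cursor-based paging with a 5000-ID cap, in the spirit of the description above. `fetch_page` is a hypothetical callable standing in for one friends/ids request; the sketch assumes the response is a dict with `ids` and `next_cursor` keys, that the first page is requested with cursor -1, and that a `next_cursor` of 0 means there are no more pages.

```python
def get_friend_ids(fetch_page, user_id, max_ids=5000):
    """Collect friend IDs page by page, stopping once max_ids is reached.

    With pages of up to 5000 IDs and max_ids=5000, a full first page
    means exactly one request is made.
    """
    ids = []
    cursor = -1  # assumed convention for "first page"
    while cursor != 0 and len(ids) < max_ids:
        page = fetch_page(user_id, cursor)
        ids.extend(page["ids"])
        cursor = page.get("next_cursor", 0)
    return ids[:max_ids]


# Fake two-page response used instead of a real API call.
def fake_fetch_page(user_id, cursor):
    pages = {
        -1: {"ids": [1, 2], "next_cursor": 10},
        10: {"ids": [3], "next_cursor": 0},
    }
    return pages[cursor]
```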
Once we have the friends of all users, we need to get the data for each user. For this we use the wonderful users/lookup method: we take IDs from the database in batches of 100 and parse the returned data.
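Splitting the stored IDs into batches of 100 might look like the sketch below. `chunked` and `lookup_params` are hypothetical helper names, and the comma-separated `user_id` parameter is an assumption about how the batch is passed to users/lookup.

```python
def chunked(ids, size=100):
    """Yield consecutive slices of ids with at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def lookup_params(batch):
    """Build request parameters for one batch (assumed comma-separated IDs)."""
    return {"user_id": ",".join(str(user_id) for user_id in batch)}


batch_sizes = [len(batch) for batch in chunked(list(range(250)))]
```

At 100 users per request, the roughly 80 million accounts in T3 would need on the order of 800,000 lookup requests.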
As a result, we get a fairly large user database. The following are some statistics:
- average number of tweets ~ 4317
- average number of friends ~ 1286
- average number of followers ~ 35045