Social networks, including Twitter, provide a huge amount of information about what people think on all sorts of topics, so the desire to automate and improve methods of assessing public opinion from social network data is understandable.
Suppose we need to evaluate the emotional coloring of tweets, for example, to carry out sociological measurements (see here on whether such measurements can replace classical opinion polls). The obvious approach is to take a dictionary of emotionally colored words, in which emotion is expressed numerically, and score tweets by the words from this dictionary that they contain. But a problem arises: such dictionaries are rare, small, and may become outdated; moreover, they do not match the “living” language used on social networks. It seems reasonable to extend an available dictionary with new words, assigning each one the emotional score of the tweets in which it occurs (more precisely, the arithmetic mean over all tweets containing the word). This is, in fact, the task set in the course “Introduction to Data Science”. The question arises: is such an extension legitimate? Will the resulting dictionary depend on the tweets used to expand it? More precisely, how different will two dictionaries be if they are obtained from the same initial dictionary but extended on different tweets?
Tweet rating
You can get tweets, or, more precisely, access to the stream of new tweets, by registering an application on the Twitter website and using the Python module oauth2 (for more details, see the description of the corresponding project in the course mentioned above).
By itself, a tweet looks like this (a dictionary with nested lists and dictionaries); the tweet text itself is the value of the u'text' key:
{u'contributors': None, u'truncated': False, u'text': u'...', u'in_reply_to_status_id': None, u'id': 608365231232978944L, u'favorite_count': 0, u'source': u'Twitter for iPhone', u'retweeted': False, u'coordinates': None, u'timestamp_ms': u'1433880548662',
u'entities': {u'user_mentions': [], u'symbols': [], u'trends': [], u'hashtags': [], u'urls': []},
u'in_reply_to_screen_name': None, u'id_str': u'608365231232978944', u'retweet_count': 0, u'in_reply_to_user_id': None, u'favorited': False,
u'user': {u'follow_request_sent': None, ..., u'default_profile_image': False, u'id': 906813948, u'verified': False, ..., u'profile_text_color': u'000000', u'followers_count': 186, u'profile_sidebar_border_color': u'000000', u'id_str': u'906813948', u'profile_background_color': u'000000', u'listed_count': 0, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'utc_offset': -18000, u'statuses_count': 1197, u'description': u'...', u'friends_count': 184, u'location': u'CCTX', u'profile_link_color': u'AF65D4', u'profile_image_url': u'http://pbs.twimg.com/profile_images/608142391472570368/b0RxTzZS_normal.jpg', u'following': None, u'geo_enabled': True, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/906813948/1431466945', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'Abigail Garcia', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 8431, u'screen_name': u'AbigailG_23', u'notifications': None, u'url': None, u'created_at': u'Fri Oct 26 21:33:39 +0000 2012', u'contributors_enabled': False, u'time_zone': u'Central Time (US & Canada)', u'protected': False, u'default_profile': False, u'is_translator': False},
u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Tue Jun 09 20:09:08 +0000 2015', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None}
To save space, it is better to store only the text itself. Tweets in other languages can also be discarded, since only English tweets are analyzed from here on.
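A minimal sketch of this filtering step, assuming the stream has been saved as one JSON object per line. The function name extract_english_texts is hypothetical; the 'lang' and 'text' keys are the ones visible in the tweet shown above.

```python
import json


def extract_english_texts(lines):
    """Yield the text of every English tweet in an iterable of JSON lines.

    Assumption: each non-empty line is one JSON object from the stream.
    Lines that are blank, not valid JSON, or not tweets are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(tweet, dict):
            continue
        # Keep only English tweets that actually carry a text field.
        if tweet.get("lang") == "en" and "text" in tweet:
            yield tweet["text"]
```

For example, list(extract_english_texts(open("stream.txt"))) would collect the English texts from a saved stream, where stream.txt is a placeholder file name.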
The next step is dictionary-based evaluation of tweets. I used a dictionary of 2,500 words, each of which is assigned a value from -5 to 5.
Tweet evaluation scheme:
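A minimal sketch of such a scheme, under two assumptions that the text does not spell out: a tweet is split into lowercase words with a simple regular expression, and its score is the sum of the dictionary values of the words it contains (words missing from the dictionary count as 0). The names tokenize and score_tweet are illustrative.

```python
import re


def tokenize(text):
    """Split a tweet into lowercase words (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def score_tweet(text, scores):
    """Score a tweet as the sum of the dictionary values of its words.

    Assumption: words that are not in the dictionary contribute 0.
    """
    return sum(scores.get(word, 0) for word in tokenize(text))
```

With a toy dictionary {"good": 3, "bad": -3}, the tweet "Good good BAD day" contains "good" twice and "bad" once, so score_tweet sums 3 + 3 - 3 + 0.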
Of course, this approach cannot score an individual tweet accurately, but the aggregate emotional assessment of a large number of messages is often quite accurate (see the article linked above).
The procedure for compiling a new dictionary is also simple: each word is assigned a score equal to the arithmetic mean of the scores of all tweets that contain it.
Scheme of compiling a new dictionary:
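A minimal sketch of this procedure, under stated assumptions: tweets are scored by summing dictionary values, a word that occurs several times in one tweet is counted once for that tweet, and words already in the initial dictionary keep their original values. The names tokenize, score_tweet and expand_dictionary are illustrative.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split a tweet into lowercase words (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def score_tweet(text, scores):
    """Assumed scoring rule: sum of dictionary values, unknown words count 0."""
    return sum(scores.get(word, 0) for word in tokenize(text))


def expand_dictionary(tweets, scores):
    """Return a copy of `scores` extended with every new word in `tweets`.

    Each new word gets the arithmetic mean of the scores of all tweets
    that contain it. Assumption: words already present keep their values.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text in tweets:
        tweet_score = score_tweet(text, scores)
        for word in set(tokenize(text)):  # count each word once per tweet
            totals[word] += tweet_score
            counts[word] += 1
    expanded = dict(scores)
    for word, total in totals.items():
        if word not in scores:
            expanded[word] = total / counts[word]
    return expanded
```

In this sketch a word seen only in one positive and one equally negative tweet ends up with a score of 0, which is the averaging behaviour described above.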
Comparison of dictionaries
Next, we turn to the comparison of two dictionaries obtained from the same initial dictionary but expanded on different series of tweets. Since dictionaries are interesting not in themselves but for how they will evaluate subsequent tweets, I compared them by how they score an independent series of tweets. For each dictionary, we can build a vector whose i-th coordinate is the score that the dictionary gives to the i-th tweet of the series. The task is thus reduced to comparing two vectors, each corresponding to one dictionary, with coordinates equal to the numerical scores of the tweets.
Parameters that were calculated
correlation - the closer to +1, the better. Dictionaries with a correlation of +1 “behave” the same way.
mean difference of the vectors - by how much does the average tweet score differ between the two dictionaries?
standard deviation of the mean difference - used to judge whether the difference between the scores that the two dictionaries give a tweet can be treated as “random” error.
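The three parameters above can be sketched as follows. This is an illustration under assumptions rather than the original code: the correlation is taken to be the Pearson coefficient, and the standard deviation of the mean is taken to be the sample standard deviation of the per-tweet differences divided by the square root of the number of tweets. The names pearson and compare_scores are hypothetical.

```python
import math
import statistics


def pearson(a, b):
    """Pearson correlation coefficient of two equally long score vectors."""
    mean_a = statistics.mean(a)
    mean_b = statistics.mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def compare_scores(a, b):
    """Return (correlation, mean difference, standard deviation of the mean).

    `a` and `b` hold the scores that two dictionaries give to the same
    independent series of tweets, in the same order.
    """
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    # Assumed definition: standard error of the mean difference.
    std_of_mean = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return pearson(a, b), mean_diff, std_of_mean
```

Dividing the mean difference by the standard deviation of the mean shows how many standard deviations the mean difference lies from zero.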
What happened
If dictionaries are created on the basis of 8 thousand tweets, then:
correlation - 0.66
average difference - 0.105
standard deviation - 0.042
That is, the mean difference deviates from zero by 2.5 standard deviations (assuming the error is random), which is, of course, a bit too much. Still, in principle, the dictionaries can be said to rate tweets similarly.
If each dictionary is built on a base of 60 thousand tweets, the results are much better:
correlation - 0.89
average difference - 0.00086
standard deviation - 0.0080
That is, the deviation of the average difference from zero is 0.1 standard deviations, which allows us to conclude that the error (difference) is “random”.
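The two ratios quoted above can be checked directly from the reported figures. The helper name deviation_in_sigmas is hypothetical.

```python
def deviation_in_sigmas(mean_difference, std_of_mean):
    """How many standard deviations the mean difference lies from zero."""
    return abs(mean_difference) / std_of_mean


# Figures reported for dictionaries built on 8 thousand tweets.
small_base = deviation_in_sigmas(0.105, 0.042)
# Figures reported for dictionaries built on 60 thousand tweets.
large_base = deviation_in_sigmas(0.00086, 0.0080)
print(small_base, large_base)
```

The first ratio comes out at about 2.5 and the second at about 0.1, matching the values given in the text.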
Thus, dictionaries expanded on a base of 60 thousand tweets or more practically do not depend on the base itself. In practice, this means that 30 minutes of downloading the Twitter stream (60 thousand filtered messages) is enough to obtain a new, expanded dictionary of 16.5 thousand words instead of the 2.5 thousand in the initial dictionary.
Further work is to verify that such an expanded dictionary is not only “unique” but also correct: for example, by adding a procedure that calibrates its values against some known base, or a final check of the resulting dictionary against a part of the initial dictionary held out from the expansion.