📜 ⬆️ ⬇️

How to predict an Oscar winner according to social networks or how I spent my day off

It was a snowy Sunday, moreover, forgiven, and in the morning it was decided to throw off the blanket and start preparing his body, swept away during the Maslenitsa, for the summer beach season. Peter is not very supportive of outdoor sports this season, the gym membership has ended, so after 5 km of cross-country skiing the energy required release.
Of course, it was not possible to simply stick to the Internet, and I remembered the idea of ​​predicting the winner of the Oscar in 2018 , the results of which will be known very soon on March 4th. This idea was formed in communication with one interesting person, so thank him for the idea.

I didn’t want to mess with the formation of a data set, kaggle didn’t have one either, but I wanted to do something unusual and interesting. Adjusted the problem: to determine public opinion about the winner of the Oscar?

But in the beginning it is necessary to figure out what is going on in the film industry and who, at least, is nominated.

What is Oscar (version 20! 8)


The 90th Oscar awards ceremony for merits in the field of cinema for 2017 will be held on March 4, 2018 in the Dolby Theater (Hollywood, Los Angeles). Comedian Jimmy Kimmel will hold the ceremony for the second year in a row. The nominees were announced on January 23, 2018 ( who cares ).
')
So, all the nominations are not interesting to me, so we will explore public attention in the following nominations: best film, best actor, best actress, best soundtrack. Denote the data for the preparation of requests.

Nomination best movie



Twitter as a public opinion survey platform


But first you need to provide access to the Twitter API.

CONSUMER_KEY = '' CONSUMER_SECRET = '' OAUTH_TOKEN = '' OAUTH_TOKEN_SECRET = '' auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET) twitter_api = twitter.Twitter(auth=auth) 

Since I did not have a data set, I had to think a little and formulate criteria for evaluating public opinion on Twitter:

1) The need to search and process messages distributed at the current time (resent), which will determine the changes in public opinion, trends. Use the Twitter API method for this.

  tweet=twitter_api.search.tweets(q=(e1.get()), count="100") p = json.dumps(tweet) res2 = json.loads(p) 

2) The need to determine the potential distribution, i.e. I currently have 10 subscribers, I publish a message that retweets 2 of my subscribers, which have 25 subscribers each. T. o. the number of spreads is 2, and potentially possible is 10 + 25 + 25 = 60.

  i=0 while i<len(res2['statuses']): tweet=str(i+1)+') '+str(res2['statuses'][i]['created_at'])+' '+(res2['statuses'][i]['text'])+'\n' retweet_count.append(res2['statuses'][i]['retweet_count']) followers_count.append(res2['statuses'][i]['user']['followers_count']) friends_count.append(res2['statuses'][i]['user']['friends_count']) print u' ', sum(retweet_count) print u' ', sum(followers_count)+sum(friends_count) 

3) The need to determine the tone of messages, as well as the attitude of positive to negative. To do this, we form two dictionaries of positive and negative words. Using the Bayes formula (link), we define the conditional probability of the message tonality.

 def format_sentence(sent): return({word: True for word in nltk.word_tokenize(sent.decode('utf-8'))}) pos = [] with open("pos_tweets.txt") as f: for i in f: pos.append([format_sentence(i), 'pos']) neg = [] with open("neg_tweets.txt") as f: for i in f: neg.append([format_sentence(i), 'neg']) training = pos[:int((.8)*len(pos))] + neg[:int((.8)*len(neg))] test = pos[int((.8)*len(pos)):] + neg[int((.8)*len(neg)):] classifier = NaiveBayesClassifier.train(training) classifier.show_most_informative_features() 

4) (Optional) Definition of message language. In different countries, cinema is perceived differently. We refer to the mentality.

 stopwords = nltk.corpus.stopwords.words('english') en_stop = get_stop_words('en') stemmer = SnowballStemmer("english") #print stopwords[:10] total_word=[] lang=[] while i<len(res2['statuses']): lang.append(res2['statuses'][i]['lang']) w7=Label(window,text=u"  ", font = "Times") w7.place(relx=0.65, rely=0.1) f = Figure(figsize=(6, 4)) a = f.add_subplot(111) t = Counter(lang).keys() y_pos = np.arange(len(t)) performance = Counter(lang).values() error = np.random.rand(len(t)) s = Counter(lang).values() a.barh(y_pos,s) a.set_yticks(y_pos) a.set_yticklabels(t) a.invert_yaxis() a.set_ylabel(u' ') a.set_xlabel(u'') canvas = FigureCanvasTkAgg(f, master=window) canvas.show() canvas.get_tk_widget().place(relx=0.52, rely=0.12)#pack(side=TOP, fill=BOTH, expand=1) canvas._tkcanvas.place(relx=0.52, rely=0.12)#pack(side=TOP, fill=BOTH, expand=1) 

The output is presented in charts.


Distribution of references

image

Attitude of Twitter users to nominated films

image

Display the language distribution


That is, find out about which films speak in which languages ​​(respectively, and countries).

  1. Call me your name
    image
  2. Dark times
    image
  3. Dunkirk
    image
  4. Away
    image
  5. Lady bird
    image
  6. Ghost thread
    image
  7. Secret Dossier
    image
  8. Water form
    image
  9. Three billboards on the border of Ebbing, Missouri
    image

Nomination best actor


(presented in tabular form)

image

Nomination best actress


(presented in tabular form)

image

Suggestions for improvement


Of course, this “slice” of public opinion can only partly show an attitude to films. For more in-depth analysis, it is necessary to collect data from Twitter during the intermediate time, especially when information leads appear. But, at the moment, the leader of public opinion are: The form of water . I think even in the evening to see!

Also, social networks allow you to analyze and classify the audience. Reference to the repository and dictionaries for learning the tonality model .

Source: https://habr.com/ru/post/349436/


All Articles