
A magical introduction to classification algorithms

Translation of an article by Brian Berend.

When you first start studying data analysis, classification algorithms are among the first things you learn. Their essence is simple: you take information about a specific observation (a data point) and use it to assign that observation to a particular group, or class.

A good example is a spam filter for email. It must mark incoming letters (that is, observations) as "spam" or "not spam" based on information about the letter (the sender, the number of words starting with capital letters, and so on).


This is a good example, but a boring one. Spam classification is the stock example in lectures, presentations, and conference talks, so you have probably heard about it more than once. What if we talked about a different, more interesting classifier? Something weirder? Something more... magical?


That's right! Today we will talk about the Sorting Hat from the world of Harry Potter. We will pull some data from the web, analyze it, and build a classifier that sorts characters into the Hogwarts houses. It should be fun!

Note:
Our classifier will not be particularly sophisticated, so consider it a "first pass" at the problem, demonstrating some basic techniques for extracting text from the web and analyzing it. Moreover, given the relatively small sample size, we will skip classical techniques such as cross-validation. We will simply collect some data, build a simple rule-based classifier, and evaluate the result.

Second note:
The idea for this post was inspired by Brian Lange's wonderful presentation at PyData Chicago 2016. The video and slides are available online.

Step One: Retrieve Data from the Network


In case you have spent the last 20 years in a cave: the Sorting Hat is a magical hat that places incoming students into the four Hogwarts houses: Gryffindor, Slytherin, Hufflepuff, and Ravenclaw. Each house has its own character traits, and when the hat is placed on a student's head, it reads their mind and decides which house is the best fit. By this definition, the Sorting Hat is a multiclass classifier (it sorts into more than two groups), unlike a binary classifier (which sorts into exactly two groups), such as a spam filter.

To sort students into houses, we need to know certain things about them. Fortunately, plenty of data is available on harrypotter.wikia.com. The site contains articles on almost every aspect of the Harry Potter universe, including students and houses. A nice bonus: Fandom, which runs the site, provides an easy-to-use API and a lot of excellent documentation. Hooray!



Let's start by importing pandas and requests. The former will be used to organize the data, the latter to make the API requests that retrieve it.

We also need a sensible way to loop over all the Hogwarts students and record the house the Sorting Hat placed them in (this will be the "ground truth" against which we compare our own sorting results). On the site, articles are grouped into categories such as "Hogwarts students" and "Films", and the API lets us list all the articles within a given category.

Take Ravenclaw as an example. We put all the data into the info variable and then load it into a pandas DataFrame.

# Import the modules
import pandas as pd
import requests

# Get the articles in the 'Ravenclaws' category
category = 'Ravenclaws'
url = 'http://harrypotter.wikia.com/api/v1/Articles/List?expand=1&limit=1000&category=' + category
requested_url = requests.get(url)
json_results = requested_url.json()
info = json_results['items']

ravenclaw_df = pd.DataFrame(info)

print('Number of articles: {}'.format(len(info)))
print('')
ravenclaw_df.head()

Number of articles: 158


You can follow along with the complete analysis in Rodeo!

Note:
If you use Rodeo, our Python IDE, just copy and paste the code above into the Editor or the Terminal. The results will appear in the History or Terminal window. Bonus: you can drag the windows with the mouse to rearrange and resize them.



Using the same approach, let's collect the article ID, URL, and title for every student in each of the four houses:


# Loop through the four houses
houses = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']
mydf = pd.DataFrame()

# Get the ID, URL and title of each student article
for house in houses:
    url = "http://harrypotter.wikia.com/api/v1/Articles/List?expand=1&limit=1000&category=" + house + 's'
    requested_url = requests.get(url)
    json_results = requested_url.json()
    info = json_results['items']

    house_df = pd.DataFrame(info)
    house_df = house_df[house_df['type'] == 'article']
    house_df.reset_index(drop=True, inplace=True)
    house_df.drop(['abstract', 'comments', 'ns', 'original_dimensions', 'revision', 'thumbnail', 'type'], axis=1, inplace=True)
    house_df['house'] = pd.Series([house]*len(house_df))
    mydf = pd.concat([mydf, house_df])

mydf.reset_index(drop=True, inplace=True)

# Check the result
print('Number of student articles: {}'.format(len(mydf)))
print('')
print(mydf.head())
print('')
print(mydf.tail())

Number of student articles: 748


Retrieving article contents


Now that we have the article IDs, we can start requesting article content. But some of the articles are simply HUGE, containing an incredible amount of detail. Just look at the articles on Harry Potter and Lord Voldemort!
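For a sense of the structure, here is a quick sketch that peeks at the section titles of a single article, using the same AsSimpleJson endpoint as the full code below and the mydf frame we just built (the specific article inspected is arbitrary):

import requests

# Peek at the section titles of one article (a spot-check;
# 'mydf' is the DataFrame of student articles built above)
sample_id = mydf['id'].iloc[0]
url = 'http://harrypotter.wikia.com/api/v1/Articles/AsSimpleJson?id=' + str(sample_id)
sections = requests.get(url).json()['sections']
print([s['title'] for s in sections])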



The articles on all the key characters contain a section called "Personality and traits". That is the logical place to get the information our Sorting Hat will use to make its decisions. However, not every article has this section, so restricting ourselves to it noticeably shrinks the number of characters we can work with.

The code below extracts the "Personality and traits" section from each article and computes its length (in characters). It then merges this data with our initial DataFrame mydf on the article ID (this takes a little while).

# Extract the 'Personality and traits' section from each article
# If an article has no such section, record an empty string
# This takes a few minutes
text_dict = {}
for iden in mydf['id']:
    url = 'http://harrypotter.wikia.com/api/v1/Articles/AsSimpleJson?id=' + str(iden)
    requested_url = requests.get(url)
    json_results = requested_url.json()
    sections = json_results['sections']
    contents = [sections[i]['content'] for i, x in enumerate(sections) if sections[i]['title'] == 'Personality and traits']

    if contents:
        paragraphs = contents[0]
        texts = [paragraphs[i]['text'] for i, x in enumerate(paragraphs)]
        all_text = ' '.join(texts)
    else:
        all_text = ''
    text_dict[iden] = all_text

# Turn the dictionary into a DataFrame and compute the length of each 'Personality and traits' section
text_df = pd.DataFrame.from_dict(text_dict, orient='index')
text_df.reset_index(inplace=True)
text_df.columns = ['id', 'text']
text_df['text_len'] = text_df['text'].map(lambda x: len(x))

# Merge with the original data
mydf_all = pd.merge(mydf, text_df, on='id')
mydf_all.sort_values('text_len', ascending=False, inplace=True)

# Keep only the characters whose 'Personality and traits' section is non-empty
mydf_relevant = mydf_all[mydf_all['text_len'] > 0]

print('Number of useable articles: {}'.format(len(mydf_relevant)))
print('')
mydf_relevant.head()

Number of useable articles: 94



Step Two: Obtain House Characteristics with NLTK


Now that we have our students, we need to sort them into houses. To do that, we need a list of characteristics for each house. Let's start with the descriptions on harrypotter.wikia.com.

trait_dict = {}
trait_dict['Gryffindor'] = ['bravery', 'nerve', 'chivalry', 'daring', 'courage']
trait_dict['Slytherin'] = ['resourcefulness', 'cunning', 'ambition', 'determination', 'self-preservation', 'fraternity', 'cleverness']
trait_dict['Ravenclaw'] = ['intelligence', 'wit', 'wisdom', 'creativity', 'originality', 'individuality', 'acceptance']
trait_dict['Hufflepuff'] = ['dedication', 'diligence', 'fairness', 'patience', 'kindness', 'tolerance', 'persistence', 'loyalty']

Note that all the words are nouns. That is intentional: we want consistency in how the character traits are described. Some traits were not listed as nouns on the wiki, so we converted them to noun form.


Having a list of characteristics for each house, we can simply scan the text column and count how many times the corresponding words appear in each character's description. Sounds easy, right?
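Here is a minimal sketch of that naive counting approach, using the trait_dict defined above (and assuming NLTK's tokenizer data is available):

from nltk import word_tokenize

# Naive scoring sketch: count how many of a house's trait words
# appear in a snippet of text
def naive_score(text, traits):
    tokens = [t.lower() for t in word_tokenize(text)]
    return sum(tokens.count(trait) for trait in traits)

print(naive_score('Alice was known for her bravery and courage.',
                  trait_dict['Gryffindor']))  # prints 2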



Unfortunately, that's not the whole story. Here is an excerpt from the "Personality and traits" section about Neville Longbottom:

When he was younger, Neville was clumsy, forgetful, shy, and many considered him ill-suited to Gryffindor house because he seemed timid. With the support of his friends, to whom he was very loyal, the encouragement of Professor Remus Lupin to face his fears in his third year of study, and the motivation of knowing that his parents' torturers were walking free, Neville became braver, more self-assured, and dedicated to the fight against Lord Voldemort and his Death Eaters.

Words such as "loyal", "braver", and "dedicated" ought to count toward certain houses, but they won't, because they are adjectives and our lists contain nouns. Likewise, words like "bravely" and "braveness" would not be counted. For our classification algorithm to work properly, we need to identify synonyms, antonyms, and other word forms.

Synonyms


We can look up synonyms using the synsets function from WordNet, the lexical database of English included in the nltk module (NLTK stands for Natural Language Toolkit). A "synset" is a "synonym set": a collection of synonyms, or "lemmas". The synsets function returns the synonym sets associated with a given word.

Puzzled? Let's run the code, and then analyze it:

from nltk.corpus import wordnet as wn

# Get the synsets for a few of our trait words
foo1 = wn.synsets('bravery')
print("Synonym sets associated with the word 'bravery': {}".format(foo1))

foo2 = wn.synsets('fairness')
print('')
print("Synonym sets associated with the word 'fairness': {}".format(foo2))

foo3 = wn.synsets('wit')
print('')
print("Synonym sets associated with the word 'wit': {}".format(foo3))

foo4 = wn.synsets('cunning')
print('')
print("Synonym sets associated with the word 'cunning': {}".format(foo4))

foo4 = wn.synsets('cunning', pos=wn.NOUN)
print('')
print("Synonym sets associated with the *noun* 'cunning': {}".format(foo4))
print('')

# Print the lemmas ('synonyms') in each synset
foo_list = [foo1, foo2, foo3, foo4]
for foo in foo_list:
    for synset in foo:
        print((synset.name(), synset.lemma_names()))

Synonym sets associated with the word 'bravery': [Synset('courage.n.01'), Synset('fearlessness.n.01')]

Synonym sets associated with the word 'fairness': [Synset('fairness.n.01'), Synset('fairness.n.02'), Synset('paleness.n.02'), Synset('comeliness.n.01')]

Synonym sets associated with the word 'wit': [Synset('wit.n.01'), Synset('brain.n.02'), Synset('wag.n.01')]

Synonym sets associated with the word 'cunning': [Synset('craft.n.05'), Synset('cunning.n.02'), Synset('cunning.s.01'), Synset('crafty.s.01'), Synset('clever.s.03')]

Synonym sets associated with the *noun* 'cunning': [Synset('craft.n.05'), Synset('cunning.n.02')]

('courage.n.01', ['courage', 'courageousness', 'bravery', 'braveness'])
('fearlessness.n.01', ['fearlessness', 'bravery'])
('fairness.n.01', ['fairness', 'equity'])
('fairness.n.02', ['fairness', 'fair-mindedness', 'candor', 'candour'])
('paleness.n.02', ['paleness', 'blondness', 'fairness'])
('comeliness.n.01', ['comeliness', 'fairness', 'loveliness', 'beauteousness'])
('wit.n.01', ['wit', 'humor', 'humour', 'witticism', 'wittiness'])
('brain.n.02', ['brain', 'brainpower', 'learning_ability', 'mental_capacity', 'mentality', 'wit'])
('wag.n.01', ['wag', 'wit', 'card'])
('craft.n.05', ['craft', 'craftiness', 'cunning', 'foxiness', 'guile', 'slyness', 'wiliness'])
('cunning.n.02', ['cunning'])

That's a lot of output. Let's look at a few points and potential problems:


As you can see, the synsets function can return unwanted synonym sets. For example, paleness.n.02 ("having a naturally light complexion") and comeliness.n.01 ("looking good and attractive") are also associated with the word "fairness". These traits clearly have nothing to do with Hufflepuff (even though Neville Longbottom did grow up handsome), so we will have to exclude such synsets from the analysis by hand.
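A minimal sketch of that manual exclusion, keeping only the two "fairness" senses we actually want (the full whitelist appears later in the relevant_synsets dictionary):

# Keep only the equity/impartiality senses of 'fairness';
# drop paleness.n.02 and comeliness.n.01
fairness_synsets = wn.synsets('fairness', pos=wn.NOUN)
wanted = {'fairness.n.01', 'fairness.n.02'}
kept = [s for s in fairness_synsets if s.name() in wanted]
print(kept)  # [Synset('fairness.n.01'), Synset('fairness.n.02')]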

In other words: getting synonyms is harder than it sounds.



Antonyms and word forms


Once we have collected all the synonyms, we need to take care of antonyms and of other word forms (for "bravery", say, that means "brave", "bravely", and "braver"). nltk can do a lot of the hard work for us, but we will still have to construct adverbs and the comparative and superlative forms of adjectives by hand.

# Print the antonyms and derived word forms for every lemma
# in the synsets associated with 'bravery'
foo1 = wn.synsets('bravery')
for synset in foo1:
    for lemma in synset.lemmas():
        print("Synset: {}; Lemma: {}; Antonyms: {}; Word Forms: {}".format(synset.name(), lemma.name(), lemma.antonyms(), lemma.derivationally_related_forms()))
        print("")

Synset: courage.n.01; Lemma: courage; Antonyms: [Lemma('cowardice.n.01.cowardice')]; Word Forms: [Lemma('brave.a.01.courageous')]
Synset: courage.n.01; Lemma: courageousness; Antonyms: []; Word Forms: [Lemma('brave.a.01.courageous')]
Synset: courage.n.01; Lemma: bravery; Antonyms: []; Word Forms: []
Synset: courage.n.01; Lemma: braveness; Antonyms: []; Word Forms: [Lemma('brave.a.01.brave'), Lemma('audacious.s.01.brave')]
Synset: fearlessness.n.01; Lemma: fearlessness; Antonyms: [Lemma('fear.n.01.fear')]; Word Forms: [Lemma('audacious.s.01.fearless'), Lemma('unafraid.a.01.fearless')]
Synset: fearlessness.n.01; Lemma: bravery; Antonyms: []; Word Forms: []

Putting it all together


The code below creates a list of synonyms, antonyms, and word forms for each house characteristic. In the interest of full disclosure: because the word-form rules are crude, some of the generated words will be spelled incorrectly, but that does no harm; misspelled forms simply won't match anything.

# Manually pick the relevant synsets for each house
relevant_synsets = {}
relevant_synsets['Ravenclaw'] = [wn.synset('intelligence.n.01'), wn.synset('wit.n.01'), wn.synset('brain.n.02'), wn.synset('wisdom.n.01'), wn.synset('wisdom.n.02'), wn.synset('wisdom.n.03'), wn.synset('wisdom.n.04'), wn.synset('creativity.n.01'), wn.synset('originality.n.01'), wn.synset('originality.n.02'), wn.synset('individuality.n.01'), wn.synset('credence.n.01'), wn.synset('acceptance.n.03')]
relevant_synsets['Hufflepuff'] = [wn.synset('dedication.n.01'), wn.synset('commitment.n.04'), wn.synset('commitment.n.02'), wn.synset('diligence.n.01'), wn.synset('diligence.n.02'), wn.synset('application.n.06'), wn.synset('fairness.n.01'), wn.synset('fairness.n.01'), wn.synset('patience.n.01'), wn.synset('kindness.n.01'), wn.synset('forgivingness.n.01'), wn.synset('kindness.n.03'), wn.synset('tolerance.n.03'), wn.synset('tolerance.n.04'), wn.synset('doggedness.n.01'), wn.synset('loyalty.n.01'), wn.synset('loyalty.n.02')]
relevant_synsets['Gryffindor'] = [wn.synset('courage.n.01'), wn.synset('fearlessness.n.01'), wn.synset('heart.n.03'), wn.synset('boldness.n.02'), wn.synset('chivalry.n.01'), wn.synset('boldness.n.01')]
relevant_synsets['Slytherin'] = [wn.synset('resourcefulness.n.01'), wn.synset('resource.n.03'), wn.synset('craft.n.05'), wn.synset('cunning.n.02'), wn.synset('ambition.n.01'), wn.synset('ambition.n.02'), wn.synset('determination.n.02'), wn.synset('determination.n.04'), wn.synset('self-preservation.n.01'), wn.synset('brotherhood.n.02'), wn.synset('inventiveness.n.01'), wn.synset('brightness.n.02'), wn.synset('ingenuity.n.02')]

# Function that returns the word forms of a lemma
def get_forms(lemma):
    drfs = lemma.derivationally_related_forms()
    output_list = []
    if drfs:
        for drf in drfs:
            drf_pos = str(drf).split(".")[1]
            if drf_pos in ['n', 's', 'a']:
                output_list.append(drf.name().lower())
                if drf_pos in ['s', 'a']:
                    # Add '-ness' nouns, '-ly' adverbs, and comparative & superlative forms
                    if len(drf.name()) == 3:
                        last_letter = drf.name()[-1:]
                        output_list.append(drf.name().lower() + last_letter + 'er')
                        output_list.append(drf.name().lower() + last_letter + 'est')
                        output_list.append(drf.name().lower()+'ness')
                        output_list.append(drf.name().lower()+'ly')
                    elif drf.name()[-4:] in ['able', 'ible']:
                        output_list.append(drf.name().lower()+'r')
                        output_list.append(drf.name().lower()+'st')
                        output_list.append(drf.name().lower()+'ness')
                        output_list.append(drf.name()[:-1].lower()+'y')
                    elif drf.name()[-1:] == 'e':
                        output_list.append(drf.name().lower()+'r')
                        output_list.append(drf.name().lower()+'st')
                        output_list.append(drf.name().lower()+'ness')
                        output_list.append(drf.name().lower()+'ly')
                    elif drf.name()[-2:] == 'ic':
                        output_list.append(drf.name().lower()+'er')
                        output_list.append(drf.name().lower()+'est')
                        output_list.append(drf.name().lower()+'ness')
                        output_list.append(drf.name().lower()+'ally')
                    elif drf.name()[-1:] == 'y':
                        output_list.append(drf.name()[:-1].lower()+'ier')
                        output_list.append(drf.name()[:-1].lower()+'iest')
                        output_list.append(drf.name()[:-1].lower()+'iness')
                        output_list.append(drf.name()[:-1].lower()+'ily')
                    else:
                        output_list.append(drf.name().lower()+'er')
                        output_list.append(drf.name().lower()+'est')
                        output_list.append(drf.name().lower()+'ness')
                        output_list.append(drf.name().lower()+'ly')
        return output_list
    else:
        return output_list

# Copy the trait dictionary; we will extend the copy with synonyms and
# word forms, and build an antonym dictionary alongside it
import copy
new_trait_dict = copy.deepcopy(trait_dict)
antonym_dict = {}

# Add synonyms and their word forms to the (new) trait dictionary;
# add antonyms (and their word forms) to the antonym dictionary
for house, traits in trait_dict.items():
    antonym_dict[house] = []
    for trait in traits:
        synsets = wn.synsets(trait, pos=wn.NOUN)
        for synset in synsets:
            if synset in relevant_synsets[house]:
                for lemma in synset.lemmas():
                    new_trait_dict[house].append(lemma.name().lower())
                    if get_forms(lemma):
                        new_trait_dict[house].extend(get_forms(lemma))
                    if lemma.antonyms():
                        for ant in lemma.antonyms():
                            antonym_dict[house].append(ant.name().lower())
                            if get_forms(ant):
                                antonym_dict[house].extend(get_forms(ant))
    new_trait_dict[house] = sorted(list(set(new_trait_dict[house])))
    antonym_dict[house] = sorted(list(set(antonym_dict[house])))

# Spot-check the results
print("Gryffindor traits: {}".format(new_trait_dict['Gryffindor']))
print("")
print("Gryffindor anti-traits: {}".format(antonym_dict['Gryffindor']))
print("")

Gryffindor traits: ['bold', 'bolder', 'boldest', 'boldly', 'boldness', 'brass', 'brassier', 'brassiest', 'brassily', 'brassiness', 'brassy', 'brave', 'bravely', 'braveness', 'braver', 'bravery', 'bravest', 'cheek', 'cheekier', 'cheekiest', 'cheekily', 'cheekiness', 'cheeky', 'chivalry', 'courage', 'courageous', 'courageouser', 'courageousest', 'courageously', 'courageousness', 'daring', 'face', 'fearless', 'fearlesser', 'fearlessest', 'fearlessly', 'fearlessness', 'gallantry', 'hardihood', 'hardiness', 'heart', 'mettle', 'nerve', 'nervier', 'nerviest', 'nervily', 'nerviness', 'nervy', 'politesse', 'spunk', 'spunkier', 'spunkiest', 'spunkily', 'spunkiness', 'spunky']

Gryffindor anti-traits: ['cowardice', 'fear', 'timid', 'timider', 'timidest', 'timidity', 'timidly', 'timidness']

# Function that checks whether the house dictionaries share any words
from itertools import combinations

def test_overlap(dict):
    results = []
    house_combos = combinations(list(dict.keys()), 2)
    for combo in house_combos:
        results.append(set(dict[combo[0]]).isdisjoint(dict[combo[1]]))
    return results

# Run the tests; we want to see "False" for both
print("Any words overlap in trait dictionary? {}".format(sum(test_overlap(new_trait_dict)) != 6))
print("Any words overlap in antonym dictionary? {}".format(sum(test_overlap(antonym_dict)) != 6))

Any words overlap in trait dictionary? False
Any words overlap in antonym dictionary? False



Step Three: Sort the Students into Houses


It's time to sort the students into houses! Our classification algorithm works as follows:

- Tokenize each student's "Personality and traits" text and convert the tokens to lowercase.
- For each house, count the tokens that appear in that house's trait dictionary and subtract the number of tokens that appear in its antonym dictionary; the difference is the house's score.
- Assign the student to the house with the highest score. If two or more houses share the highest score, return "Tie!".


Suppose, for example, that a student's "Personality and traits" section consists of the single sentence "Alice was brave". Then Alice gets 1 point for Gryffindor and 0 points for the other houses, so Alice is sorted into Gryffindor.

# Import 'word_tokenize', which splits text into words and punctuation
from nltk import word_tokenize

# Function that sorts a student into a house
def sort_student(text):
    text_list = word_tokenize(text)
    text_list = [word.lower() for word in text_list]
    score_dict = {}
    houses = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']

    for house in houses:
        score_dict[house] = (sum([True for word in text_list if word in new_trait_dict[house]]) - sum([True for word in text_list if word in antonym_dict[house]]))

    sorted_house = max(score_dict, key=score_dict.get)
    sorted_house_score = score_dict[sorted_house]
    if sum([True for i in score_dict.values() if i == sorted_house_score]) == 1:
        return sorted_house
    else:
        return "Tie!"

# Test the function
print(sort_student('Alice was brave'))
print(sort_student('Alice was British'))

Gryffindor
Tie!

Looks like the function works. Let's apply it to our data and see what happens!

# Suppress a pandas warning
pd.options.mode.chained_assignment = None

mydf_relevant['new_house'] = mydf_relevant['text'].map(lambda x: sort_student(x))
mydf_relevant.head(20)




 print("Match rate: {}".format(sum(mydf_relevant['house'] == mydf_relevant['new_house']) / len(mydf_relevant))) print("Percentage of ties: {}".format(sum(mydf_relevant['new_house'] == 'Tie!') / len(mydf_relevant))) 

Match rate: 0.2553191489361702
Percentage of ties: 0.32978723404255317



Hmm. Not the results we were hoping for. Let's find out why Voldemort ended up in Hufflepuff.



# Tokenize the text about Tom Riddle (the longest article, hence row 0)
tom_riddle = word_tokenize(mydf_relevant['text'].values[0])
tom_riddle = [word.lower() for word in tom_riddle]

# Instead of computing scores, collect the matched words themselves
# so we can inspect them
words_dict = {}
anti_dict = {}
houses = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']
for house in houses:
    words_dict[house] = [word for word in tom_riddle if word in new_trait_dict[house]]
    anti_dict[house] = [word for word in tom_riddle if word in antonym_dict[house]]

print(words_dict)
print("")
print(anti_dict)

{'Slytherin': ['ambition'], 'Ravenclaw': ['intelligent', 'intelligent', 'mental', 'individual', 'mental', 'intelligent'], 'Hufflepuff': ['kind', 'loyalty', 'true', 'true', 'true', 'loyalty'], 'Gryffindor': ['brave', 'face', 'bold', 'face', 'bravery', 'brave', 'courageous', 'bravery']}
{'Slytherin': [], 'Ravenclaw': ['common'], 'Hufflepuff': [], 'Gryffindor': ['fear', 'fear', 'fear', 'fear', 'fear', 'fear', 'cowardice', 'fear', 'fear']}

As you can see, Slytherin scored (1 - 0) = 1 point, Ravenclaw (6 - 1) = 5 points, Hufflepuff (6 - 0) = 6 points, and Gryffindor (8 - 9) = -1 point.

It is interesting that Voldemort's "Personality and traits" section, the longest of any student's, matched our dictionaries on only 31 words. The other students must therefore have matched far fewer. In other words, we are making classification decisions on very little data, which explains the high error rate and the large share of ties.
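To check that hunch, a small diagnostic sketch (not part of the original analysis) can count dictionary matches per student:

# Diagnostic: how many trait/antonym words does each student's text match?
def match_count(text):
    tokens = [t.lower() for t in word_tokenize(text)]
    houses = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']
    return sum(tokens.count(w)
               for house in houses
               for w in new_trait_dict[house] + antonym_dict[house])

mydf_relevant['matches'] = mydf_relevant['text'].map(match_count)
print(mydf_relevant['matches'].describe())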

Conclusions


The classifier we built doesn't perform very well (only slightly better than random guessing), but don't forget that our approach was deliberately simplistic. Modern spam filters are very sophisticated and don't classify solely on the presence of particular words, so our algorithm could likewise be improved to take more information into account.
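One possible direction, sketched here as an illustration rather than a finished solution: instead of generating word forms with hand-written suffix rules, reduce each token in the article text to a WordNet base form with morphy, so that inflected forms like "braver" collapse back to "brave" automatically:

from nltk.corpus import wordnet as wn
from nltk import word_tokenize

# Sketch: normalize each token to a WordNet base form before matching,
# trying the noun, adjective, adverb and verb lemmatizers in turn
def normalize(text):
    tokens = [t.lower() for t in word_tokenize(text)]
    base_forms = []
    for tok in tokens:
        base = (wn.morphy(tok, wn.NOUN) or wn.morphy(tok, wn.ADJ)
                or wn.morphy(tok, wn.ADV) or wn.morphy(tok, wn.VERB) or tok)
        base_forms.append(base)
    return base_forms

print(normalize('Neville became braver'))  # e.g. ['neville', 'become', 'brave']

This would replace the crude get_forms suffix rules with WordNet's own morphological analysis.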



Source: https://habr.com/ru/post/331352/

