
To recap: we found that our classifier rated the idea of going to an Italian restaurant 5 times more positively than going to a Mexican one.
The following lists of names are used in ConceptNet. They are contained in the conceptnet5.vectors.evaluation.bias module. I would like to add other ethnic groups. For this, it may be necessary to take into account not only given names but also surnames.

    NAMES_BY_ETHNICITY = {
        # The first two lists are from the Caliskan et al. appendix describing the
        # Word Embedding Association Test.
        'White': [
            'Adam', 'Chip', 'Harry', 'Josh', 'Roger', 'Alan', 'Frank', 'Ian', 'Justin',
            'Ryan', 'Andrew', 'Fred', 'Jack', 'Matthew', 'Stephen', 'Brad', 'Greg', 'Jed',
            'Paul', 'Todd', 'Brandon', 'Hank', 'Jonathan', 'Peter', 'Wilbur', 'Amanda',
            'Courtney', 'Heather', 'Melanie', 'Sara', 'Amber', 'Crystal', 'Katie',
            'Meredith', 'Shannon', 'Betsy', 'Donna', 'Kristin', 'Nancy', 'Stephanie',
            'Bobbie-Sue', 'Ellen', 'Lauren', 'Peggy', 'Sue-Ellen', 'Colleen', 'Emily',
            'Megan', 'Rachel', 'Wendy'
        ],

        'Black': [
            'Alonzo', 'Jamel', 'Lerone', 'Percell', 'Theo', 'Alphonse', 'Jerome',
            'Leroy', 'Rasaan', 'Torrance', 'Darnell', 'Lamar', 'Lionel', 'Rashaun',
            'Tyree', 'Deion', 'Lamont', 'Malik', 'Terrence', 'Tyrone', 'Everol',
            'Lavon', 'Marcellus', 'Terryl', 'Wardell', 'Aiesha', 'Lashelle', 'Nichelle',
            'Shereen', 'Temeka', 'Ebony', 'Latisha', 'Shaniqua', 'Tameisha', 'Teretha',
            'Jasmine', 'Latonya', 'Shanise', 'Tanisha', 'Tia', 'Lakisha', 'Latoya',
            'Sharise', 'Tashika', 'Yolanda', 'Lashandra', 'Malika', 'Shavonn',
            'Tawanda', 'Yvette'
        ],

        # This list comes from statistics about common Hispanic-origin names in the US.
        'Hispanic': [
            'Juan', 'José', 'Miguel', 'Luís', 'Jorge', 'Santiago', 'Matías', 'Sebastián',
            'Mateo', 'Nicolás', 'Alejandro', 'Samuel', 'Diego', 'Daniel', 'Tomás',
            'Juana', 'Ana', 'Luisa', 'María', 'Elena', 'Sofía', 'Isabella', 'Valentina',
            'Camila', 'Valeria', 'Ximena', 'Luciana', 'Mariana', 'Victoria', 'Martina'
        ],

        # The following list conflates religion and ethnicity, I'm aware. So do given names.
        #
        # This list was cobbled together from searching baby-name sites for common Muslim names,
        # as spelled in English. I did not ultimately distinguish whether the origin of the name
        # is Arabic or Urdu or another language.
        #
        # I'd be happy to replace it with something more authoritative, given a source.
        'Arab/Muslim': [
            'Mohammed', 'Omar', 'Ahmed', 'Ali', 'Youssef', 'Abdullah', 'Yasin', 'Hamza',
            'Ayaan', 'Syed', 'Rishaan', 'Samar', 'Ahmad', 'Zikri', 'Rayyan', 'Mariam',
            'Jana', 'Malak', 'Salma', 'Nour', 'Lian', 'Fatima', 'Ayesha', 'Zahra', 'Sana',
            'Zara', 'Alya', 'Shaista', 'Zoya', 'Yasmin'
        ]
    }

    def name_sentiment_table():
        frames = []
        for group, name_list in sorted(NAMES_BY_ETHNICITY.items()):
            lower_names = [name.lower() for name in name_list]
            sentiments = words_to_sentiment(lower_names)
            sentiments['group'] = group
            frames.append(sentiments)

        # Put together the data we got from each ethnic group into one big table
        return pd.concat(frames)

    name_sentiments = name_sentiment_table()

    # Swarm plot: one point per name, grouped by ethnicity
    plot = seaborn.swarmplot(x='group', y='sentiment', data=name_sentiments)
    plot.set_ylim([-10, 10])

    # Bar plot of the mean sentiment per group, with capped error bars
    plot = seaborn.barplot(x='group', y='sentiment', data=name_sentiments, capsize=.1)

We can use the statsmodels package to find out, among other things, how pronounced the observed effect is.

    ols_model = statsmodels.formula.api.ols('sentiment ~ group', data=name_sentiments).fit()
    ols_model.fvalue
    # 13.041597745167659

In production code, relying on global variables (such as model and embeddings) would not be worth it. But raw research code has a great advantage: it lets us track the results of each stage and draw conclusions. To avoid redoing too much work by hand, we will write a function that repeats some of the operations performed above.

    def retrain_model(new_embs):
        """
        Repeat the steps above with a new set of word embeddings.
        """
        global model, embeddings, name_sentiments
        embeddings = new_embs
        pos_vectors = embeddings.loc[pos_words].dropna()
        neg_vectors = embeddings.loc[neg_words].dropna()
        vectors = pd.concat([pos_vectors, neg_vectors])
        targets = np.array([1 for entry in pos_vectors.index] + [-1 for entry in neg_vectors.index])
        labels = list(pos_vectors.index) + list(neg_vectors.index)

        train_vectors, test_vectors, train_targets, test_targets, train_labels, test_labels = \
            train_test_split(vectors, targets, labels, test_size=0.1, random_state=0)

        # Note: newer scikit-learn releases spell these arguments
        # loss='log_loss' and max_iter=100.
        model = SGDClassifier(loss='log', random_state=0, n_iter=100)
        model.fit(train_vectors, train_targets)

        accuracy = accuracy_score(model.predict(test_vectors), test_targets)
        print("Accuracy of sentiment: {:.2%}".format(accuracy))

        name_sentiments = name_sentiment_table()
        ols_model = statsmodels.formula.api.ols('sentiment ~ group', data=name_sentiments).fit()
        print("F-value of bias: {:.3f}".format(ols_model.fvalue))
        print("Probability given null hypothesis: {:.3}".format(ols_model.f_pvalue))

        # Show the results on a swarm plot, with a consistent Y-axis
        plot = seaborn.swarmplot(x='group', y='sentiment', data=name_sentiments)
        plot.set_ylim([-10, 10])

Perhaps the problem lies in GloVe. Its vectors are built from all the sites processed by the Common Crawl robot (including many highly questionable ones, plus about 20 copies of Urban Dictionary, a dictionary of urban slang). Maybe that is the problem? What if we take good old word2vec, trained on Google News?

The source of the word2vec files that we managed to find is a file on Google Drive. Download it and save it as data/word2vec-googlenews-300.bin.gz.
    # Use a ConceptNet function to load word2vec into a Pandas frame from its binary format
    from conceptnet5.vectors.formats import load_word2vec_bin
    w2v = load_word2vec_bin('data/word2vec-googlenews-300.bin.gz', nrows=2000000)

    # word2vec is case-sensitive, so case-fold its labels
    w2v.index = [label.casefold() for label in w2v.index]

    # Now we have duplicate labels, so drop the later (lower-frequency) occurrences of the same label
    w2v = w2v.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
    retrain_model(w2v)
    # Accuracy of sentiment: 94.30%
    # F-value of bias: 15.573
    # Probability given null hypothesis: 7.43e-09

word2vec turned out to be even worse. Its F-value exceeds 15, so the differences in sentiment between ethnic groups are more pronounced.

ConceptNet is a knowledge graph with built-in functions for computing word vectors. Its training process includes a special stage that identifies and removes some sources of algorithmic racism and sexism by adjusting the numerical values. The idea of this stage is based on the article Debiasing Word Embeddings, generalized to take several forms of prejudice into account. As far as I know, no other semantic system with a similar feature exists yet.

We export the ConceptNet vectors and publish them as a package called ConceptNet Numberbatch. The stage that removes human bias was added in April 2017. Let's load the vectors for English words and retrain our sentiment analysis model on them.

Download numberbatch-en-17.04b.txt.gz, unpack it into the data/ folder, and retrain the model.

    retrain_model(load_embeddings('data/numberbatch-en-17.04b.txt'))
    # Accuracy of sentiment: 97.46%
    # F-value of bias: 3.805
    # Probability given null hypothesis: 0.0118

The bias is noticeably smaller than with GloVe or word2vec. The value of the metric has decreased by more than 3 times relative to GloVe and by approximately 4 times relative to word2vec.
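The bias metric compared here is the F-value that statsmodels reports for the 'sentiment ~ group' model. With a single categorical predictor, this is the F statistic of a one-way analysis of variance: the variance between group means divided by the variance within groups. The sketch below computes it by hand on toy numbers to show what the metric reacts to. The helper one_way_f_value and the toy arrays are assumptions made for illustration, not part of the article's code.

```python
import numpy as np

def one_way_f_value(groups):
    """One-way ANOVA F statistic: between-group variance / within-group variance."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    # Spread of the group means around the grand mean, weighted by group size
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Spread of the individual values around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    return float((ss_between / (k - 1)) / (ss_within / (n - k)))

# Hypothetical sentiment scores: two groups with nearly equal means...
similar = [np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.5, 3.5])]
# ...and two groups whose means are far apart
shifted = [np.array([1.0, 2.0, 3.0]), np.array([5.0, 6.0, 7.0])]

print(one_way_f_value(similar))  # 0.375
print(one_way_f_value(shifted))  # 24.0
```

The further apart the group means are relative to the scatter inside each group, the larger the F-value. With the values reported above, 13.04 / 3.805 ≈ 3.4 for GloVe and 15.573 / 3.805 ≈ 4.1 for word2vec.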
In general, the variation in sentiment caused by changing the names in a text decreased significantly, which is what we wanted, because the sentiment of a text should not depend on names at all.

ConceptNet takes into account and eliminates only some of the sources of algorithmic racism. But this is a good start.

With ConceptNet Numberbatch, the accuracy of sentiment prediction also increased. The bias in the word2vec and GloVe data has nothing to do with accuracy.

Source: https://habr.com/ru/post/337272/