A reminder of where we left off: our classifier rated the idea of going to an Italian restaurant roughly five times better than going to a Mexican one.
Let's measure the bias more rigorously. Here are lists of names associated with different ethnic backgrounds, adapted from ConceptNet. They are contained in the conceptnet5.vectors.evaluation.bias module. I would like to add other ethnic groups; for that, it may be necessary to take surnames into account as well, not just given names.

```python
NAMES_BY_ETHNICITY = {
    # The first two lists are from the Caliskan et al. appendix describing the
    # Word Embedding Association Test.
    'White': [
        'Adam', 'Chip', 'Harry', 'Josh', 'Roger', 'Alan', 'Frank', 'Ian',
        'Justin', 'Ryan', 'Andrew', 'Fred', 'Jack', 'Matthew', 'Stephen',
        'Brad', 'Greg', 'Jed', 'Paul', 'Todd', 'Brandon', 'Hank', 'Jonathan',
        'Peter', 'Wilbur', 'Amanda', 'Courtney', 'Heather', 'Melanie', 'Sara',
        'Amber', 'Crystal', 'Katie', 'Meredith', 'Shannon', 'Betsy', 'Donna',
        'Kristin', 'Nancy', 'Stephanie', 'Bobbie-Sue', 'Ellen', 'Lauren',
        'Peggy', 'Sue-Ellen', 'Colleen', 'Emily', 'Megan', 'Rachel', 'Wendy'
    ],

    'Black': [
        'Alonzo', 'Jamel', 'Lerone', 'Percell', 'Theo', 'Alphonse', 'Jerome',
        'Leroy', 'Rasaan', 'Torrance', 'Darnell', 'Lamar', 'Lionel',
        'Rashaun', 'Tyree', 'Deion', 'Lamont', 'Malik', 'Terrence', 'Tyrone',
        'Everol', 'Lavon', 'Marcellus', 'Terryl', 'Wardell', 'Aiesha',
        'Lashelle', 'Nichelle', 'Shereen', 'Temeka', 'Ebony', 'Latisha',
        'Shaniqua', 'Tameisha', 'Teretha', 'Jasmine', 'Latonya', 'Shanise',
        'Tanisha', 'Tia', 'Lakisha', 'Latoya', 'Sharise', 'Tashika',
        'Yolanda', 'Lashandra', 'Malika', 'Shavonn', 'Tawanda', 'Yvette'
    ],

    # This list comes from statistics about common Hispanic-origin names in the US.
    'Hispanic': [
        'Juan', 'José', 'Miguel', 'Luís', 'Jorge', 'Santiago', 'Matías',
        'Sebastián', 'Mateo', 'Nicolás', 'Alejandro', 'Samuel', 'Diego',
        'Daniel', 'Tomás', 'Juana', 'Ana', 'Luisa', 'María', 'Elena',
        'Sofía', 'Isabella', 'Valentina', 'Camila', 'Valeria', 'Ximena',
        'Luciana', 'Mariana', 'Victoria', 'Martina'
    ],

    # The following list conflates religion and ethnicity, I'm aware. So do given names.
    #
    # This list was cobbled together from searching baby-name sites for common Muslim names,
    # as spelled in English. I did not ultimately distinguish whether the origin of the name
    # is Arabic or Urdu or another language.
    #
    # I'd be happy to replace it with something more authoritative, given a source.
    'Arab/Muslim': [
        'Mohammed', 'Omar', 'Ahmed', 'Ali', 'Youssef', 'Abdullah', 'Yasin',
        'Hamza', 'Ayaan', 'Syed', 'Rishaan', 'Samar', 'Ahmad', 'Zikri',
        'Rayyan', 'Mariam', 'Jana', 'Malak', 'Salma', 'Nour', 'Lian',
        'Fatima', 'Ayesha', 'Zahra', 'Sana', 'Zara', 'Alya', 'Shaista',
        'Zoya', 'Yasmin'
    ]
}
```
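The code below relies on the words_to_sentiment helper defined earlier in the article. Roughly, it looks up each word's embedding and turns the classifier's log probabilities into a single log-odds score. Here is a sketch of that earlier definition, for reference only; the exact code lives in the preceding section:

```python
def words_to_sentiment(words):
    # Look up each word's embedding; words missing from the vocabulary are dropped
    vecs = embeddings.loc[words].dropna()
    # predict_log_proba returns log probabilities for the [negative, positive] classes;
    # their difference is the log odds of positive sentiment
    predictions = model.predict_log_proba(vecs)
    log_odds = predictions[:, 1] - predictions[:, 0]
    return pd.DataFrame({'sentiment': log_odds}, index=vecs.index)
```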
Now let's build one table of these names, their ethnic group, and the sentiment score our classifier assigns to them:

```python
def name_sentiment_table():
    frames = []
    for group, name_list in sorted(NAMES_BY_ETHNICITY.items()):
        lower_names = [name.lower() for name in name_list]
        sentiments = words_to_sentiment(lower_names)
        sentiments['group'] = group
        frames.append(sentiments)

    # Put together the data we got from each ethnic group into one big table
    return pd.concat(frames)

name_sentiments = name_sentiment_table()
```
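Note that names missing from the embedding vocabulary are silently dropped by dropna(), so it is worth checking how many names in each group actually received a score; a quick check:

```python
# Count how many names from each group made it into the table
print(name_sentiments.groupby('group').size())
```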
Let's look at the distribution of sentiment for each group on a swarm plot:

```python
plot = seaborn.swarmplot(x='group', y='sentiment', data=name_sentiments)
plot.set_ylim([-10, 10])
```
And the mean sentiment for each group, with confidence intervals, on a bar plot:

```python
plot = seaborn.barplot(x='group', y='sentiment', data=name_sentiments, capsize=.1)
```
We can use the statsmodels package to find out, among other things, how pronounced the observed effect is. We fit a linear (OLS) model that predicts sentiment from group membership, and take its F-value as the measure of bias:

```python
ols_model = statsmodels.formula.api.ols('sentiment ~ group', data=name_sentiments).fit()
ols_model.fvalue
# 13.041597745167659
```
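The fitted statsmodels results object also exposes the p-value of this F-test, i.e. the probability of seeing a spread this large if group membership actually had no effect on sentiment. We will print both numbers when retraining later, so here is the same check on its own:

```python
print("F-value of bias: {:.3f}".format(ols_model.fvalue))
print("Probability given null hypothesis: {:.3}".format(ols_model.f_pvalue))
```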
Next we want to retrain the model on different word embeddings. Rewriting all of the code cleanly (in particular, getting rid of global variables such as model and embeddings) would not be worth the effort here, and raw research code has one great advantage: it lets us inspect the results of each stage and draw conclusions. To avoid repeating too much work by hand, we'll write a function that reruns some of the operations performed above.

```python
def retrain_model(new_embs):
    """
    Repeat the steps above with a new set of word embeddings.
    """
    global model, embeddings, name_sentiments
    embeddings = new_embs
    pos_vectors = embeddings.loc[pos_words].dropna()
    neg_vectors = embeddings.loc[neg_words].dropna()
    vectors = pd.concat([pos_vectors, neg_vectors])
    targets = np.array([1 for entry in pos_vectors.index] +
                       [-1 for entry in neg_vectors.index])
    labels = list(pos_vectors.index) + list(neg_vectors.index)

    train_vectors, test_vectors, train_targets, test_targets, train_labels, test_labels = \
        train_test_split(vectors, targets, labels, test_size=0.1, random_state=0)

    # Note: in newer scikit-learn versions, use loss='log_loss' and max_iter=100 instead
    model = SGDClassifier(loss='log', random_state=0, n_iter=100)
    model.fit(train_vectors, train_targets)

    accuracy = accuracy_score(model.predict(test_vectors), test_targets)
    print("Accuracy of sentiment: {:.2%}".format(accuracy))

    name_sentiments = name_sentiment_table()
    ols_model = statsmodels.formula.api.ols('sentiment ~ group', data=name_sentiments).fit()
    print("F-value of bias: {:.3f}".format(ols_model.fvalue))
    print("Probability given null hypothesis: {:.3}".format(ols_model.f_pvalue))

    # Show the results on a swarm plot, with a consistent Y-axis
    plot = seaborn.swarmplot(x='group', y='sentiment', data=name_sentiments)
    plot.set_ylim([-10, 10])
```
Maybe the problem is in GloVe itself? The GloVe archive is built from every site processed by the Common Crawl robot, which includes a lot of highly questionable content, plus about 20 copies of Urban Dictionary, a dictionary of street slang. Maybe that is the problem, and taking the good old word2vec, trained on Google News, would help? The most authoritative of the word2vec files we managed to find is this file on Google Drive. Download it and save it as data/word2vec-googlenews-300.bin.gz.
```python
# Use a ConceptNet function to load word2vec into a Pandas frame from its binary format
from conceptnet5.vectors.formats import load_word2vec_bin
w2v = load_word2vec_bin('data/word2vec-googlenews-300.bin.gz', nrows=2000000)

# word2vec is case-sensitive, so case-fold its labels
w2v.index = [label.casefold() for label in w2v.index]

# Now we have duplicate labels, so drop the later (lower-frequency) occurrences of the same label
w2v = w2v.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')

retrain_model(w2v)
# Accuracy of sentiment: 94.30%
# F-value of bias: 15.573
# Probability given null hypothesis: 7.43e-09
```
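If the conceptnet5 package is not available, roughly the same frame can be assembled with gensim instead. This is my own sketch, not code from the article, and it assumes gensim 4.x:

```python
import pandas as pd
from gensim.models import KeyedVectors

# Load the binary Google News vectors, keeping only the 2,000,000 most frequent words
kv = KeyedVectors.load_word2vec_format('data/word2vec-googlenews-300.bin.gz',
                                       binary=True, limit=2000000)

# Wrap them in a DataFrame indexed by word, mirroring what load_word2vec_bin returns
w2v = pd.DataFrame(kv.vectors, index=kv.index_to_key)
```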
It turns out that word2vec is even worse: its F-value exceeds 15, meaning the differences in sentiment between ethnic groups are even more pronounced.

Could ConceptNet do better? ConceptNet is a knowledge graph that comes with built-in machinery for computing word embeddings. Its training process includes a dedicated step that identifies and removes some sources of algorithmic racism and sexism by adjusting the numeric values. The idea of this step is based on the paper Debiasing Word Embeddings, generalized so that it addresses several forms of prejudice at once. As far as I know, no other semantic system with a similar feature exists yet.

We compute the ConceptNet vectors and publish them as a package called ConceptNet Numberbatch. The step that removes human bias was added in April 2017. Let's load the vectors for English words and retrain our sentiment model on them: download numberbatch-en-17.04b.txt.gz, save it in the data/ folder, and unzip it.
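One way to unzip it directly from Python, as a small sketch using only the standard library (the paths are the ones assumed above):

```python
import gzip
import shutil

# Decompress the downloaded archive next to itself
with gzip.open('data/numberbatch-en-17.04b.txt.gz', 'rb') as f_in, \
        open('data/numberbatch-en-17.04b.txt', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)
```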
Now retrain the model:

```python
retrain_model(load_embeddings('data/numberbatch-en-17.04b.txt'))
# Accuracy of sentiment: 97.46%
# F-value of bias: 3.805
# Probability given null hypothesis: 0.0118
```
Did switching embeddings eliminate the bias entirely? No, but it is far smaller than with GloVe or word2vec: the F-value decreased more than threefold relative to GloVe and roughly fourfold relative to word2vec. Overall, the swings in sentiment as we substitute different names have shrunk considerably, which is exactly what we wanted, because the sentiment of a text should not depend on names at all.
ConceptNet takes into account and eliminates only some of the sources of algorithmic racism, but this is a good start. Note also that with ConceptNet Numberbatch the accuracy of sentiment prediction went up: the bias in the word2vec and GloVe data has nothing to do with accuracy, so removing it costs us nothing.

Source: https://habr.com/ru/post/337272/