Abstract
Recently, many women have come forward with their stories of sexual assault, both to raise awareness and to empower other women to do the same. Nevertheless, society remains a toxic place for victims of sexual assault and often blames women for the assault they endured. This research examines the language used in tweets surrounding the Harvey Weinstein scandal, focusing on tweets about Weinstein and several of the women who came forward. A heuristic is developed to measure the level of sexism contained in a tweet, and the tweets are clustered using their GloVe word embeddings, with the resulting groups of terms visualized.
Introduction
In the past few months, a barrage of sexual assault allegations has been flying left and right, and Men's Rights Activists have been crying out about how they are being victimized, painting the women coming forward as predatory (oh the irony). They point to the fact that so many women are coming forward at once as evidence of some crusade to unjustly attack men. However, the answer to why this is happening now, and all at once, can't be seen through the lens of male victimhood. The main reason so many women are coming forward now is simply that so many other women are doing it, making it harder for public perception to focus its crosshairs on any individual woman. Society has long used slut shaming and victim blaming to rationalize assault and to shame victims into silence, and just because more women feel empowered to come forward doesn't mean that the victim blaming has suddenly stopped. This analysis attempts to extract and measure the language used to talk about both the women and the assaulter in the wake of sexual assault allegations. Specifically, this paper examines tweets referencing Harvey Weinstein and the victims of sexual assault who came forward, posted between 10/5/2017 and 10/28/2017.
Methodology
Data
The data used for this project are various collections of tweets. First, a sample of tweets from 2015 and 2016 was drawn to establish a baseline for word frequencies, so that common words could be normalized by their typical frequency and the results would not be dominated by them. Next, tweets were pulled from Twitter search results containing the names Harvey Weinstein, Annabella Sciorra, Zoe Brock, Asia Argento, and Louisette Geiss, between the dates 10/5/2017 and 10/28/2017.
Processing
spaCy, a natural language processing library, was used for text processing and tokenization, as well as for extracting word embeddings. The embeddings are GloVe word embeddings provided by the Stanford NLP Group: 300-dimensional vectors that represent a compressed semantic representation of a word based on word co-occurrences in the training corpus. The embeddings are a fascinating and rich representation of words, allowing for some interesting semantic manipulations of word relationships (e.g. \(vector['bird'] - vector['air'] + vector['water'] \simeq vector['fish']\)). Sentiment analysis is computed using a pre-trained model provided by the Python library NLTK, the VADER Sentiment Intensity Analyzer, which is specifically trained to detect sentiments expressed in social media (Hutto & Gilbert, 2014).
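The snippet below is a minimal sketch of this processing setup. The spaCy model name (en_core_web_lg) and the example tweet are assumptions, since the paper does not specify them; VADER's lexicon must also be downloaded once via nltk.download('vader_lexicon').

```python
# Minimal sketch of the processing setup. The model name and example
# text are assumptions; run nltk.download('vader_lexicon') once first.
import spacy
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nlp = spacy.load("en_core_web_lg")  # spaCy model shipping with GloVe vectors
sia = SentimentIntensityAnalyzer()

doc = nlp("An example tweet about the allegations.")
for token in doc:
    print(token.text, token.vector.shape)  # per-token GloVe embedding

# VADER returns neg/neu/pos plus a normalized "compound" score in [-1, 1]
print(sia.polarity_scores(doc.text)["compound"])
```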
A metric that represents the level of sexism in a block of text is difficult to define. For the purposes of this research, sexism is gauged using two factors: gender polarity and sentiment. Gender polarity measures how closely a word is associated with the word "woman" versus the word "man". The metric is calculated as follows:
\(\text{gender polarity} = \ln\left(\frac{\text{cosine distance}(word, \text{"man"})}{\text{cosine distance}(word, \text{"woman"})}\right)\)
where a polarity greater than zero means the word is more closely associated with "woman" and a polarity less than zero means it is more closely associated with "man". The sentiment score is the compound polarity score from the VADER Sentiment Intensity Analyzer. The sexism score is then calculated as follows:
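A minimal sketch of this metric follows, reusing the `nlp` object loaded earlier; the function name is illustrative.

```python
# Sketch of the gender-polarity metric defined above; `nlp` is the spaCy
# model loaded earlier. scipy's cosine() returns cosine *distance*
# (1 - cosine similarity), as the formula requires.
import numpy as np
from scipy.spatial.distance import cosine

def gender_polarity(word: str, nlp) -> float:
    w = nlp.vocab[word].vector
    d_man = cosine(w, nlp.vocab["man"].vector)
    d_woman = cosine(w, nlp.vocab["woman"].vector)
    # > 0: closer to "woman"; < 0: closer to "man"
    return np.log(d_man / d_woman)
```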
\(\text{sexism score} = -\,\text{gender polarity} \times \frac{\text{word sentiment} + \text{sentence sentiment} + \text{tweet sentiment}}{3}\)
where values greater than zero are considered sexist and values less than zero are considered not sexist. This sexism score only accounts for malevolent sexism towards women and benevolent sexism towards men, meaning it won't measure the inverse. This is a known limitation, but one that cannot be addressed without a more complex model and a generous amount of training data.
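The sketch below combines the two pieces above into the full score; the three sentiment terms are VADER compound scores computed at the word, sentence, and tweet level, and `gender_polarity`, `nlp`, and `sia` come from the earlier snippets.

```python
# Sketch of the sexism score. The three sentiment terms are VADER
# compound scores at word, sentence, and tweet level.
def sexism_score(word: str, sentence: str, tweet: str) -> float:
    sentiment = (
        sia.polarity_scores(word)["compound"]
        + sia.polarity_scores(sentence)["compound"]
        + sia.polarity_scores(tweet)["compound"]
    ) / 3
    # > 0 flags sexism: negative sentiment around woman-associated terms,
    # or positive sentiment around man-associated terms
    return -gender_polarity(word, nlp) * sentiment
```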
Analysis
The data is first preprocessed and tokenized using spaCy, and the sexism scores are calculated. N-grams of 1, 2, and 3 words are extracted, and the occurrences of each n-gram are counted. The counts are then normalized by the counts for that n-gram in the baseline dataset; n-grams that do not appear in the baseline dataset are assigned a baseline count of 1. The most frequent terms are then extracted and their word embeddings are compressed into 2 dimensions using t-SNE dimensionality reduction. This allows the data to be plotted as a scatter plot and improves clustering, since clustering algorithms typically do not perform well on high-dimensional data. The n-grams are then clustered in the 2-dimensional space using K-Means with k = 10.
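A condensed sketch of this pipeline is shown below, assuming `tweets` and `baseline_tweets` are lists of tweet strings already loaded; the top-term cutoff of 500 and the t-SNE perplexity are illustrative choices, not values stated in the paper.

```python
# Condensed sketch of the analysis pipeline. `tweets` and `baseline_tweets`
# are assumed lists of tweet strings; the top-term cutoff and t-SNE
# perplexity are illustrative assumptions.
from collections import Counter
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(texts):
    counts = Counter()
    for text in texts:
        tokens = [t.lower_ for t in nlp(text) if not t.is_punct]
        for n in (1, 2, 3):
            counts.update(ngrams(tokens, n))
    return counts

counts = count_ngrams(tweets)
baseline = count_ngrams(baseline_tweets)

# Normalize by baseline frequency; unseen n-grams get a baseline count of 1
normalized = {g: c / baseline.get(g, 1) for g, c in counts.items()}
top_terms = sorted(normalized, key=normalized.get, reverse=True)[:500]

# Embed each n-gram (spaCy averages token vectors for multi-word spans),
# project to 2-D with t-SNE, then cluster with K-Means (k = 10)
vectors = np.array([nlp(t).vector for t in top_terms])
coords = TSNE(n_components=2, perplexity=30).fit_transform(vectors)
labels = KMeans(n_clusters=10, n_init=10).fit_predict(coords)
```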
Results
Conclusions
One major limitation of this research is the quantification of sexism. The current implementation creates a heuristic that approximates certain forms of sexism; however, it has several drawbacks. First, it operates under the assumption that any tweet posted about a person in this timeframe is in reference to the sexual assault allegations, and that tweets carry some degree of sexist charge if they use gender-directed terms (terms that have a gender polarity) and have a strong sentiment. If either the gender polarity or the sentiment is close to zero, a phrase is not considered sexist; however, non-sexist tweets do not always satisfy this condition. As was mentioned at the CUSP Hackathon, this method does not account for a negative tweet that is expressing sympathy towards women. To address this, a context-aware model, such as an LSTM neural network, could be used to provide a better sexism metric. This would require a very large dataset given the complexity of the problem, which would be a large feat to assemble.
References
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.