To see my full implementation, you can visit my GitHub repo: https://github.com/antonecg/SuperYule
Ultimately, I found the utility of this metric limited and had trouble finding clear drivers of complexity. One measure did not give me enough information to model human behavior.
My next approach was to find more ways to measure behavior. While researching ways to measure personality, I came across the work of the Cambridge Psychometrics Centre.\cite{Kosinski2016} Researchers there used latent Dirichlet allocation (which I examine further below) to create a dataset that scores words by their association with Big Five personality profiles.
Using my Reddit dataset and the Cambridge Psychometrics Centre's results, I created an algorithm that scored each word on its association with the personality traits openness, conscientiousness, extraversion, agreeableness, and neuroticism.
import re

from nltk.stem import PorterStemmer

def neuroscore(entry):
    # Sum the neuroticism scores of a post's words.
    # neurodict is a global lexicon mapping word stems to neuroticism scores.
    d = {}                                              # per-post word-frequency tally
    stemmer = PorterStemmer()
    totalscorepost = 0.0
    for w in re.findall(r"[a-z']+", entry.lower()):     # simple word tokenizer
        w = stemmer.stem(w)
        d[w] = d.get(w, 0) + 1
        try:
            totalscorepost += float(neurodict[w])
        except KeyError:                                # word not in the lexicon; skip it
            continue
    return totalscorepost
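As a quick illustration (using a made-up lexicon of a few words, since the actual Cambridge word lists are not reproduced here), the function can be called like this:

stemmer = PorterStemmer()
# Hypothetical entries; in practice the scores come from the Cambridge word lists.
raw_lexicon = {"worried": 0.8, "anxious": 0.9, "calm": -0.5}
neurodict = {word_stem: score
             for word, score in raw_lexicon.items()
             for word_stem in [stemmer.stem(word)]}

print(neuroscore("I am so worried and anxious about this news"))  # sums the scores of matching stems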
I was able to create personality scores for the posts, which could be an area for further study. I would be interested in seeing whether personality converges across the posts in a thread or subreddit. The big issue I found with this approach was context-dependence: different subreddits had distinct vocabularies and their own in-group language. For example, “That’s bad” could mean many different things depending on the context.
I then decided to try my own implementation of latent Dirichlet allocation (LDA). Rather than relying on an association with arbitrary personality types, my implementation created word groups called topics and sorted the words of each document into these topics.
The algorithm first chooses a fixed number of topics and randomly assigns each word in each document to one of them. It then repeatedly reassigns words: for every word, it considers the proportion of the document's words currently assigned to each topic, and the proportion of each topic's assignments that come from that word, and uses those two proportions to pick the word's new topic.
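My notebook is linked below; the following is only a minimal sketch of that reassignment loop in the collapsed-Gibbs style, where the topic count K, the smoothing parameters alpha and beta, and the iteration count are illustrative choices rather than values from my implementation.

import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=50, alpha=0.1, beta=0.01):
    # docs: list of tokenized documents (lists of words); K: number of topics.
    vocab = {w for doc in docs for w in doc}
    V = len(vocab)
    # Step 1: randomly assign every word occurrence to a topic.
    assignments = [[random.randrange(K) for _ in doc] for doc in docs]
    doc_topic = [[0] * K for _ in docs]                 # words in doc d assigned to topic k
    topic_word = [defaultdict(int) for _ in range(K)]   # times word w is assigned to topic k
    topic_total = [0] * K                               # total assignments per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = assignments[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    # Step 2: repeatedly reassign each word based on the two proportions.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1                    # remove the current assignment
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by (share of this document's words in the topic)
                # times (share of the topic's assignments that are this word), smoothed.
                weights = [(doc_topic[d][t] + alpha) *
                           (topic_word[t][w] + beta) / (topic_total[t] + beta * V)
                           for t in range(K)]
                k = random.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word                                   # each topic's most-counted words describe it

Running something like this on a list of tokenized Reddit comments and printing each topic's most frequent words yields the kind of per-topic word lists discussed next.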
For more details, see: https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2
Here is my implementation: [https://github.com/antonecg/RedditLDA/blob/master/Reddit-%20LDA.ipynb] With it, I was able to break any Reddit thread down into different topics. However, I was underwhelmed by the results. The topics were an interesting explanatory tool and could be useful for social science research, but they did not have the predictive power I was looking for.
I realized that if I wanted to predict how an online community shares, reacts to, and interprets the news, I would need to move past statistical NLP and into neural NLP.

Neural NLP

Seeing the limitations of statistical NLP methods, I decided to explore neural NLP methods and create my own implementation.
This research was extremely computationally intensive, but fortunately, I had access to a rack-mounted server with a Tesla K80 and could run days-long jobs.
I researched three areas of neural NLP: word embeddings, recurrent neural networks, and transformers. I implemented a transformer architecture based on Google's BERT and trained it with my Reddit training set.
The word embedding technique I researched was Word2Vec, created by Tomas Mikolov at Google. Word2Vec uses a shallow two-layer neural network to generate a vector for each word from massive blocks of text.
For example, the word “help” would be represented as a vector of numbers. By loose analogy with the personality scores above, one could imagine one dimension capturing something like neuroticism (say 90/100) and another extraversion (40/100), although in practice the individual dimensions are not directly interpretable. Using cosine similarity between vectors, one can then find similar words.
For an excellent introduction to word2vec, see: [https://jalammar.github.io/illustrated-word2vec/]
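As an illustration of that similarity lookup (separate from the pre-trained embeddings I discuss next), here is a minimal sketch that trains a tiny Word2Vec model with the gensim library on a handful of toy sentences; the corpus and parameters such as vector_size are placeholder choices.

from gensim.models import Word2Vec

# Toy corpus: each "sentence" is a list of tokens (real training uses millions of sentences).
sentences = [
    ["please", "help", "me", "with", "this"],
    ["can", "anyone", "assist", "me", "with", "this"],
    ["thanks", "for", "the", "help"],
    ["thanks", "for", "the", "assist"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50, seed=1)

print(model.wv["help"])                       # the learned vector for "help"
print(model.wv.most_similar("help", topn=3))  # nearest words by cosine similarity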
This seemed like the natural next step after my research into latent Dirichlet allocation, but I found that word embeddings are generally pre-trained on very large datasets and so are not specific to any one community's context. I would not be able to train a meaningful embedding on a single Reddit thread, and thus could not use word embeddings to predict how online communities would interpret information.
All of the models I had researched or implemented up to this point relied on a “bag of words” view of language: they looked only at individual words rather than building any notion of context.
However, we are living in a time of rapid progress with deep learning, and new algorithms are able to develop an understanding of context and have a memory.
While a traditional feed-forward network treats each input independently, a recurrent neural network feeds its hidden state back in at the next time step. Each step's output depends on both the current input and the state left over from previous inputs, which gives the network a memory of the sequence it has processed so far.
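A minimal sketch of that recurrence in NumPy, with illustrative sizes and untrained random weights:

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))    # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))   # hidden -> hidden (the memory loop)
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The new hidden state depends on the current input AND the previous hidden state.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)                      # start with an empty memory
sequence = rng.normal(size=(5, input_size))    # e.g. five word vectors from a post
for x_t in sequence:
    h = rnn_step(x_t, h)                       # h carries information from earlier words forward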