To call the Reddit API, you must first set up a connection. The most straightforward way is to register an application with Reddit, which authenticates via OAuth. I registered a web application and stated “Research” as the purpose.
As part of this registration, I was granted a client ID and a client secret. The following code shows how the connection is created in Python:
import praw

# Credentials granted when the application was registered with Reddit
reddit = praw.Reddit(client_id='xxxxx',
                     client_secret='xxxxx',
                     user_agent='xxxxx')
Then I needed to specify which thread to request from the API. Each Reddit thread has a unique key that can be pulled from its URL; that key identifies the logical entity I named “submission”:
submission = reddit.submission(id='xxxxx')   # the thread's unique key, taken from its URL
This uses the PRAW API to pull data related to a thread.
This returns an unstructured forest of comments: at the top is the first post in the thread, which can have multiple comments, each of which can in turn have further comments. It was therefore necessary to structure the data.
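To make the nesting concrete, the forest can be walked recursively with PRAW: each top-level comment carries a replies attribute holding its child comments. A minimal sketch, assuming the submission object created above (the walk function is purely illustrative and not part of the pipeline):

def walk(comment, depth=0):
    # Print each comment indented by its depth in the tree
    print(' ' * 2 * depth, comment.body[:40])
    for reply in comment.replies:
        walk(reply, depth + 1)

submission.comments.replace_more(limit=0)   # drop unresolved "load more comments" stubs
for top_level in submission.comments:
    walk(top_level)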
I sorted the comments oldest-first with the setting:
submission.comment_sort = 'old'
Then I created a for loop to iterate over the comments. For each comment, the loop reads the time at which the comment was created and appends it to a list:
ltime = []                                   # creation times of all comments in the thread
for comment in submission.comments.list():
    ltime.append(comment.created_utc)        # UTC timestamp of when the comment was posted
In the for loop, I implemented either tokenisation or stemming depending on the experiment.
Tokenisation simply turns a sentence string into a list of words. In natural language processing, treating text as such an unordered collection is known as the “bag of words” technique. It makes processing easy but ignores sentence structure entirely.
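A tokeniser of this kind can be as simple as a regular-expression split of the comment body. The sketch below shows one possible implementation of the words() helper that appears in the stemming code further down; the exact tokeniser used may differ:

import re

def words(entry):
    # Turn a comment body into a flat bag of words: lower-case it and pull
    # out runs of letters, discarding punctuation and word order.
    return re.findall(r"[a-z']+", entry.lower())

words("Tokenisation simply turns the sentence string into a list of words.")
# -> ['tokenisation', 'simply', 'turns', 'the', 'sentence', 'string',
#     'into', 'a', 'list', 'of', 'words']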
Stemming removes word endings, so that, for example, “carefully” and “careful” both reduce to the same stem. I used a Python implementation of the Porter stemmer, developed by Martin Porter in 1979 and still maintained by him [https://tartarus.org/martin/PorterStemmer/]:
stemmer = PorterStemmer()           # Porter stemmer implementation referenced above
d = {}                              # frequency of each stem in the text
for w in words(entry):              # words() tokenises the comment text
    w = stemmer.stem(w).lower()
    d[w] = d.get(w, 0) + 1          # count occurrences of each stem
Statistical Natural Language Processing
My first experiment with statistical natural language processing implemented Udny Yule’s K characteristic \cite{yule1944}. This formed the basis for my master’s dissertation [https://www.authorea.com/users/107755/articles/277345-measuring-online-feedback-loops?commit=d79a10bf2e22949ba350e2a858b73f430b42aee5]. The metric measures the probability that two words drawn at random from a text are the same; the assumption is that less repetition means greater complexity. This agrees with Kolmogorov complexity theory \cite{kolmogorov25}: the Kolmogorov complexity of an object is the length of the shortest program that specifies it, and a text with many unique words requires a longer program to describe.
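In symbols, if $V(f)$ is the number of distinct words occurring exactly $f$ times in a text of $N$ words in total, Yule’s characteristic is usually written as $K = 10^4 \cdot \frac{\sum_f f^2 V(f) - N}{N^2}$, so a text that repeats itself heavily scores a high $K$. The code below computes the related quantities $M_1$, the number of distinct stems, and $M_2 = \sum_f f^2 V(f)$ over the stemmed words of a post: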
from itertools import groupby

M1 = float(len(d))   # number of distinct stems
M2 = sum([len(list(g)) * (freq ** 2) for freq, g in groupby(sorted(d.values()))])   # f^2 times the number of stems occurring f times, summed over f
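M1 and M2 then combine into a single score per post. That final step is not shown above; one common formulation that matches the “less repetition means more complexity” reading is the inverse form of Yule’s K (Yule’s I), sketched here as an assumption rather than the exact code used:

try:
    complexity = (M1 * M1) / (M2 - M1)   # Yule's I: larger values mean less repetition
except ZeroDivisionError:                # every stem occurs exactly once, so M2 == M1
    complexity = 0.0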
I scored each Reddit post by its complexity and plotted the complexity across time within a Reddit thread. I then charted the average complexity to measure whether the thread was increasing or decreasing in complexity through time.
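A sketch of how that plotting step can be wired together, assuming the submission object from earlier and assuming the M1/M2 calculation above has been wrapped in a hypothetical yule_complexity(text) function:

import matplotlib.pyplot as plt

# Pair each comment's creation time with its complexity score, oldest first
pairs = sorted((c.created_utc, yule_complexity(c.body))   # yule_complexity() is the assumed wrapper
               for c in submission.comments.list())
times = [t for t, _ in pairs]
scores = [s for _, s in pairs]

# Running average as one way to chart average complexity through the thread
running_avg = [sum(scores[:i + 1]) / (i + 1) for i in range(len(scores))]

plt.plot(times, scores, '.', label='per-comment complexity')
plt.plot(times, running_avg, label='average complexity')
plt.xlabel('comment creation time (UTC timestamp)')
plt.ylabel('complexity score')
plt.legend()
plt.show()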