Downloading large amounts of Reddit data
I really want to start learning more NLP, but to do that I first need large amounts of text data. Here are some quick and easy ways to get large amounts of text data from the internet, including Google BigQuery.
I still want to learn more NLP.
But before I really dig into how the models work under the hood, I want to get a feel for them at a much higher level. I want to make cool stuff. And before I can make said cool stuff, I need a ton of text data.
Parsing the dumped JSON data
There is, conveniently, an ongoing project that makes Reddit post and comment data publicly available. The dumps can easily be downloaded from PushShift.io: https://files.pushshift.io/reddit/. However, they are BIG downloads. They are also compressed with zstandard (.zst), which is a less common format, and each file is a long series of JSON objects (one per line) which need to be parsed to extract the information you are interested in.
Below is a short script I used to stream the compressed data, extract all comments from the subreddit dataisbeautiful, and write those out to a new gzipped JSON file (one object per line).
import zstandard as zstd
import json
import sys
import gzip

infile = sys.argv[1]  # path to the compressed dump, e.g. RC_2019-08.zst
ofilepath = infile + "_dataisbeautiful.json.gz"

dctx = zstd.ZstdDecompressor()
output_file = gzip.open(ofilepath, "wb")

with open(infile, "rb") as fh:
    with dctx.stream_reader(fh) as reader:
        previous_line = b""
        while True:
            chunk = reader.read(65536)
            if not chunk:
                break
            # Split on newlines; the last element may be an incomplete line,
            # so carry it over and prepend it to the next chunk
            lines = (previous_line + chunk).split(b"\n")
            previous_line = lines[-1]
            for line in lines[:-1]:
                try:
                    obj = json.loads(line)
                except json.JSONDecodeError:
                    continue
                if obj.get("subreddit", "").lower() == "dataisbeautiful":
                    outline = json.dumps(obj) + "\n"
                    output_file.write(outline.encode("utf-8"))

output_file.close()
print("Done!")
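The output is gzipped, newline-delimited JSON, so it is easy to load back in later. Here is a minimal sketch of reading it into pandas, assuming the script above was run on a dump named RC_2019-08.zst (the file names here are just examples):

import gzip
import json
import pandas as pd

# Load the extracted comments back into a data frame
# (the file name matches whatever the script above produced)
rows = []
with gzip.open("RC_2019-08.zst_dataisbeautiful.json.gz", "rt", encoding="utf-8") as fh:
    for line in fh:
        rows.append(json.loads(line))

comments = pd.DataFrame(rows)
print(comments.shape)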
Querying Google BigQuery
This Reddit data is also made available in Google BigQuery. And with Google's pretty generous free tier, you can process up to 1 TB of data every month at no cost. That is more than enough to pull a bunch of comment data to explore.
If you are only retrieving small amounts of data at a time from BigQuery (up to 16,000 rows or a CSV under 1 GB), you can just use the web console and download the results to your computer. But if you want much larger amounts, you need to use the API, which takes a few more steps to set up. There are several ways to do this; the approach below was the easiest one I found.
Steps to get BigQuery API credentials
- Go to the Google Cloud Console and enable the BigQuery API
- Go to the BigQuery API, then Credentials, and create a new service account
- Give the account a name
- Assign appropriate permissions
- Create a key for the account
- Download the key as JSON
Using Python, you can then point at the key at the beginning of your script or notebook:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/path/to/key.json"
Now you can install a couple of extra packages and use pandas to read directly from Google BigQuery. As usual, conda will make your life much easier.
conda install pandas-gbq --channel conda-forge
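If you are not using conda, the same package is also on PyPI, so installing it with pip should work as well:

pip install pandas-gbq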
Now, right from Python, you can run queries against BigQuery and get the results back. Just write out the SQL you want as a string:
import pandas as pd
import os
# Set credentials key
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/path/to/key.json"
# SQL Query to submit
quer = """SELECT r.body, r.score_hidden, r.name, r.author, r.subreddit, r.score
FROM `fh-bigquery.reddit_comments.2019_08` r
WHERE r.subreddit = "dataisbeautiful"
and r.body != '[removed]'
and r.body != '[deleted]'
LIMIT 20
"""
# Submit and get the results as a pandas dataframe
res = pd.read_gbq(quer, project_id="YOUR-PROJECT-NAME")
This may take a couple of minutes to run, but afterwards you will have your results as a nice data frame to do anything you want with.
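From there it is just normal pandas. As a quick sketch of what you might do next (the column names are simply the ones selected in the query above, and the output file name is made up):

# Quick look at the returned comments, then save a local copy
print(res.shape)
print(res[["author", "subreddit", "score"]].head())
res.to_csv("dataisbeautiful_comments_2019_08.csv", index=False)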