Downloading large amounts of Reddit data

I really want to start learning more NLP. But to do that, I first need large amounts of text data. Here are some quick and easy ways to get it from the internet, including Google BigQuery.


I still want to learn more NLP.

But before I dig into how the models actually work under the hood, I want to start using them at a much higher level. I want to make cool stuff. And before I can make said cool stuff, I need a ton of text data.

Parsing the dumped JSON data

There is, conveniently, an ongoing project that makes Reddit post and comment data publicly available. The dumps can easily be downloaded from PushShift.io: https://files.pushshift.io/reddit/. However, they are BIG downloads. They are also compressed with zstandard (a less common format), and each file is a long stream of newline-delimited JSON objects which need to be parsed to extract the information you are interested in.

Below is a short script I used to read in the compressed data, extract all comments from the subreddit dataisbeautiful, and write them to a new gzipped JSON file.

import zstandard as zstd
import json
import sys
import gzip


infile = sys.argv[1]  # path to the compressed .zst dump file

# Streaming decompressor (some newer dumps may need
# zstd.ZstdDecompressor(max_window_size=2**31) to decompress)
dctx = zstd.ZstdDecompressor()

# Output: gzipped, newline-delimited JSON containing only the matching comments
ofilepath = infile + "_dataisbeautiful.json.gz"
output_file = gzip.open(ofilepath, "wb")

with open(infile, "rb") as fh:
    with dctx.stream_reader(fh) as reader:
        previous_line = b""
        while True:
            chunk = reader.read(65536)
            if not chunk:
                break

            # Prepend any partial line carried over from the previous chunk,
            # then split on newlines; each complete line is one JSON object
            lines = (previous_line + chunk).split(b"\n")
            for line in lines[:-1]:
                try:
                    obj = json.loads(line)
                except json.JSONDecodeError:
                    continue
                if obj.get("subreddit", "").lower() == "dataisbeautiful":
                    outline = json.dumps(obj) + "\n"
                    output_file.write(outline.encode("utf-8"))

            # The last element may be an incomplete line; hold it for the next chunk
            previous_line = lines[-1]

output_file.close()

print("Done!")

Querying from Google BigQuery

This Reddit data is also made available in Google BigQuery. And with Google's pretty generous free tier, you can process up to 1 TB of data for free every month. That is more than enough to pull a bunch of comment data to explore.
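Each query's scan size counts against that 1 TB, so it can be worth checking how much data a query would process before actually running it. Below is a minimal sketch of a dry run using the google-cloud-bigquery client, which pandas-gbq (installed below) pulls in as a dependency; it assumes the credentials described in the next section are already set up, and the query is just an example.

from google.cloud import bigquery

# Dry-run the query: BigQuery reports the bytes it would scan without running it
client = bigquery.Client(project="YOUR-PROJECT-NAME")
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = """SELECT body
FROM `fh-bigquery.reddit_comments.2019_08`
WHERE subreddit = "dataisbeautiful"
"""

job = client.query(sql, job_config=job_config)
print("This query would process {:.2f} GB".format(job.total_bytes_processed / 1e9))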

If you are only retrieving small amounts of data at a time from BigQuery (up to 16,000 rows, or a CSV under 1 GB), you can simply use the web console and download the results to your computer. But if you want much larger amounts, you need to use the API, which takes a few more steps to set up. There are several different ways to do this, but the approach below was the easiest I found.

Steps to get BigQuery API credentials

  1. Go to the Google Cloud Console and enable the BigQuery API.
  2. Under the BigQuery API, go to Credentials and create a new service account.
  3. Give the account a name.
  4. Assign it the appropriate permissions.
  5. Create a key for the account.
  6. Download the key as JSON.

Using Python, you can then point to that key file at the beginning of your script or notebook:

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/path/to/key.json"
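As a side note, if you would rather not rely on an environment variable, the read_gbq call used further down also accepts a credentials object built directly from the key file. A minimal sketch, assuming the same key path:

from google.oauth2 import service_account

# Build a credentials object from the downloaded key instead of using the env variable
credentials = service_account.Credentials.from_service_account_file("/path/to/key.json")

# ...and later pass it along explicitly:
# pd.read_gbq(query, project_id="YOUR-PROJECT-NAME", credentials=credentials)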

Now you can install a couple of extra packages and use pandas to read directly from Google BigQuery. Just like other times, conda will make your life much easier.

conda install pandas-gbq --channel conda-forge

Now, right from Python, you can run queries against BigQuery and get the results back. Just write out the SQL you want as a string and submit it:

import pandas as pd
import os

# Set credentials key
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/path/to/key.json"


# SQL Query to submit
quer = """SELECT r.body, r.score_hidden, r.name, r.author, r.subreddit, r.score
FROM `fh-bigquery.reddit_comments.2019_08` r
WHERE r.subreddit = "dataisbeautiful" 
and r.body != '[removed]'
and r.body != '[deleted]'
LIMIT 20
"""

# Submit and get the results as a pandas dataframe
res = pd.read_gbq(quer, project_id="YOUR-PROJECT-NAME")

This may take a couple of minutes to run, but afterwards you will have your results as a nice data frame to do anything you want with.
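From here it is a regular pandas data frame, so for example you can take a quick look and save a local copy so the same query does not need to be re-run (the output file name below is just a placeholder):

# Quick look at what came back
print(res.shape)
print(res.head())

# Save a local copy so the query (and its quota cost) is not repeated
res.to_csv("dataisbeautiful_comments_2019_08.csv", index=False)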