FlaiReddit

FlaiReddit is a text classification web app, deployed on Heroku, that predicts the ‘flair’ tag of a Reddit post from the subreddit r/india. The project covers the major steps of any applied machine learning pipeline: data collection, processing, an optimized classifier, and deployment.


Data Collection - Web Scraper

We use the pushshift.io API to make calls and extract JSON payloads. The data extractor pulls posts from a wide time period to reduce the bias towards a few hot topics.

  • You can save and load your progress at checkpoints (especially useful for online collection and storage).
  • Approximately 600 posts can be extracted per second; however, because the subreddit is moderated, only 20% of the data is actually available.
  • All logs are written to crawler.log, and warnings are displayed.
  • To save space, posts with empty flairs are removed batch-wise.
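The points above can be sketched roughly as follows. This is a minimal illustration, not the project's crawler: the helper names (`build_params`, `drop_empty_flairs`, `save_checkpoint`, `load_checkpoint`) are hypothetical, the window size is assumed to be in hours, and no network call is made. The query parameters (`subreddit`, `before`, `after`, `size`) and the `link_flair_text` field follow the Pushshift submission search and Reddit post JSON.

```python
import json
from pathlib import Path

# Pushshift submission search endpoint (not called in this sketch).
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_params(before, hours, size=1000, subreddit="india"):
    """Build query parameters for one time window ending at `before`."""
    return {
        "subreddit": subreddit,
        "before": before,
        "after": before - hours * 3600,  # window start, in Unix seconds
        "size": size,
    }


def drop_empty_flairs(posts):
    """Keep only posts whose flair is present and non-empty."""
    return [p for p in posts if p.get("link_flair_text")]


def save_checkpoint(path, current, posts):
    """Persist the crawler position and the posts collected so far."""
    Path(path).write_text(json.dumps({"current": current, "posts": posts}))


def load_checkpoint(path):
    """Restore the crawler position and posts from a checkpoint file."""
    state = json.loads(Path(path).read_text())
    return state["current"], state["posts"]


params = build_params(before=1577836800, hours=12)
sample = [
    {"title": "A", "link_flair_text": "Politics"},
    {"title": "B", "link_flair_text": None},
    {"title": "C"},
]
kept = drop_empty_flairs(sample)
```

A real run would pass `params` to something like `requests.get(PUSHSHIFT_URL, params=params)` and filter each returned batch before appending it to the checkpoint.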

Usage

from modules.crawler import Crawler

start_time = ...  # Replace ... with the Unix timestamp at which scraping should begin
end_time = ...  # Replace ... with the Unix timestamp at which scraping should end
scraper = Crawler(size=1000, difference=12, sleep=0.5, start=start_time)

while scraper.current > end_time:
    scraper.query()  # Query the database
scraper.dump()  # Dump the stats and csv
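The two timestamps in the usage snippet are plain Unix epoch seconds. A small sketch of how they could be computed from calendar dates is shown below; the helper `to_unix` and the example dates are illustrative only. Judging from the loop condition, the crawler appears to walk backwards in time, so the start date is assumed to be later than the end date.

```python
from datetime import datetime, timezone


def to_unix(year, month, day):
    """Return the Unix timestamp (seconds) for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


start_time = to_unix(2020, 1, 1)  # example: where scraping begins
end_time = to_unix(2019, 1, 1)    # example: where scraping stops
```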

A committed notebook is available on Kaggle.

Exploratory Data Analysis

Extensive analysis has been done: important words are visualized through word clouds, and an in-depth explanation of these and of the preprocessing is available in my notebook.

A bag-of-words (BOW) baseline model is also implemented at the end.
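For readers unfamiliar with the bag-of-words representation behind the baseline, here is a minimal standard-library sketch of the vectorization step. It is an illustration only; the notebook's implementation may differ (for example, by using a library vectorizer), and the tokenizer and example titles are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(texts):
    """Map every distinct token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}


def bow_vector(text, vocab):
    """Count how often each vocabulary token occurs in the text."""
    vector = [0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[tok]] = count
    return vector


titles = ["Election results announced", "Cricket match results"]
vocab = build_vocabulary(titles)
vectors = [bow_vector(t, vocab) for t in titles]
```

Each post becomes a vector of token counts, which any standard classifier can then consume.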

Training the Model [BERT, TFIDF]

We set the seed for reproducibility and use BERT (base, uncased), freezing all layers apart from the last one. The weights are saved for easier inference at:
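A rough sketch of the two ideas, seeding and freezing, is given below. It is not the training code from this project: `set_seed` and `freeze_all_but` are hypothetical helpers, and the parameter-name prefixes in the usage note assume the Hugging Face naming for a BERT sequence classifier. The freezing helper works on any object that exposes `named_parameters()`, as PyTorch modules do.

```python
import random


def set_seed(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


def freeze_all_but(model, trainable_prefixes):
    """Disable gradients for every parameter whose name does not start
    with one of the given prefixes; return the names left trainable."""
    prefixes = tuple(trainable_prefixes)
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

With a Hugging Face `BertForSequenceClassification` model, one plausible call would be `freeze_all_but(model, ["bert.encoder.layer.11.", "classifier."])`, which keeps only the final encoder block and the classification head trainable. Whether "the last layer" here means the final encoder block, the head, or both is an assumption.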

Model Summary [Inference Time]:

Model            Micro-F1   Macro-F1   Inference
TFIDF Combined   0.51       0.50       331 Samples/s
BERT             0.60       0.59       2.37 Samples/s
TFIDF            0.49       0.48       273 Samples/s
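To make the two F1 columns concrete, the sketch below computes micro- and macro-averaged F1 for single-label predictions using only the standard library. The flair names and predictions are made-up toy data, not results from this project.

```python
def f1(tp, fp, fn):
    """F1 score from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro-F1, macro-F1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = dict.fromkeys(labels, 0)
    fp = dict.fromkeys(labels, 0)
    fn = dict.fromkeys(labels, 0)
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


y_true = ["Politics", "Politics", "Politics", "Sports", "Food"]
y_pred = ["Politics", "Politics", "Sports", "Sports", "Politics"]
micro, macro = micro_macro_f1(y_true, y_pred)
```

On this toy data the micro score is 0.6 while the macro score is about 0.44, because the rare class that is never predicted correctly counts as much as the frequent ones in the macro average.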

The confusion matrix is plotted below

Confusion Matrix

WebApp - Flask TFIDF

  • For the web app we use the TFIDF model, keeping CPU and memory usage in mind [BERT BASE has 114 M parameters].
  • The app is built with Flask; the root view is a simple web page where you enter the post's link and the predicted flair is displayed.
  • The other endpoint is /auto, to which a POST request is sent and which returns the predictions as JSON.
  • Logs and error pages will be enabled in a future update.
  • The colour theme is taken from Reddit’s own theme :)
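The /auto endpoint described above could be wired up roughly as follows. This is a sketch under assumptions, not the deployed code: `parse_urls`, `predict_all`, and `create_app` are hypothetical names, the uploaded file is assumed to hold one post URL per line, and `predict` stands in for the TFIDF model. The form field name `upload_file` matches the client example in the Auto Endpoint section.

```python
def parse_urls(raw):
    """Decode an uploaded file and return its non-empty lines as URLs."""
    return [line.strip() for line in raw.decode("utf-8").splitlines() if line.strip()]


def predict_all(urls, predict):
    """Map every post URL to its predicted flair."""
    return {url: predict(url) for url in urls}


def create_app(predict):
    """Build a Flask app exposing POST /auto (Flask is imported lazily)."""
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/auto", methods=["POST"])
    def auto():
        upload = request.files["upload_file"]  # file sent by the client
        urls = parse_urls(upload.read())
        return jsonify(predict_all(urls, predict))

    return app
```

Keeping the parsing and prediction helpers outside the view function makes them easy to test without starting a server.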

Root page

Auto Endpoint

>>> import requests
>>> with open('file.txt', 'wb') as f:
...     _ = f.write(b"r/india post urls")
...
>>> base_url = "https://flaireddittest.herokuapp.com"  # or http://127.0.0.1:5000 if running locally
>>> url = f"{base_url}/auto"
>>> with open('file.txt', 'rb') as f:
...     r = requests.post(url, files={'upload_file': f})
...
>>> r
<Response [200]>
>>> r.json()
{'post_url': 'predicted tag'}

This way, you can use the app from either a user's or a developer's perspective.

HEROKU DEPLOYMENT