Categorizing posts on Craigslist

The purpose of this project is to build a classifier that sorts Craigslist posts into the correct category. This tool is useful for categorizing posts that sit in catch-all sections such as the “bartering” and “for sale” categories.

Outline

  1. Introduction
    1. Motivation for building this
    2. Exact goals
  2. The challenges that I went through
    1. Rushing to deep learning and TensorFlow
    2. Andrew Ng machine learning course
    3. Using eBay data

Import goop

Most of the lines below import the functions and libraries that we’ll be using, including two modules built for this project that live in the lib folder of the repository.

In [4]:
from __future__ import print_function

import sys
sys.path.append('..')

# Library functions that were built for this project
# or copied and pasted from elsewhere
from lib.item_selector import ItemSelector
from lib.model_performance_plotter import plot_learning_curve

import json
import pandas
from pprint import pprint
from sklearn.base import BaseEstimator
from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from time import time

"""File to load category mapping from"""
CATEGORY_FILE = 'data/categories.json'
"""File to load data set from"""
DATA_FILE = 'data/cl_posts.csv'
"""File to save the complete model into"""
MODEL_FILE = 'out/cl_model.pkl'

Load and explore the data

Use pandas to load the Craigslist posts from a CSV file, then drop any examples that have N/A or NaN fields.

In [6]:
data = pandas.read_csv(DATA_FILE)
data = data.dropna()
In [7]:
# Display the first few examples of the data set
data.head(10)
Out[7]:
title description category url
0 Are You a Married Woman Looking for Two Guys? We’re two fun, discreet married white professi… men seeking women https://chicago.craigslist.org/chc/m4w/d/are-y…
1 Producing Consultant Building effective and collaborative relations… business/mgmt https://iowacity.craigslist.org/bus/d/producin…
2 Need ride to tampa thursday for court!! I am a single mother fighting for custody of m… rideshare https://ocala.craigslist.org/rid/d/need-ride-t…
3 Corsair GS 800 Desktop Power Supply Selling my Corsair GS 800 Desktop Power Supply… computer parts – by owner https://blacksburg.craigslist.org/sop/d/corsai…
4 Free MCAT Quiz for premed students: Can you th… Free MCAT Quiz for premed students: Can you th… lessons & tutoring https://albuquerque.craigslist.org/lss/d/free-…
5 Wanted Classic Cars and Trucks Any Condition.. Call/text 1.765.613.313one Price Pending Condi… wanted – by owner https://richmondin.craigslist.org/wan/d/wanted…
6 Massage Therapist Wanted Ontario Family Chiropractic is a holistic base… healthcare https://rochester.craigslist.org/hea/d/massage…
7 Lease Take Over at Manchester Motorworks 1 bed… Manchester Motorworks is offering a 1 bedroom … sublets & temporary https://richmond.craigslist.org/sub/d/lease-ta…
8 🚗 DENVER CAR OWNERS: PAY FOR YOUR CAR BY RENTI… Turo is a peer-to-peer car sharing marketplace… et cetera https://denver.craigslist.org/etc/d/denver-car…
9 Trunk Mounted Bike Rack w/ 3 Spaces – Universa… Trunk Mounted Bike Rack w/ 3 Spaces – Universa… bicycle parts – by owner https://cosprings.craigslist.org/bop/d/trunk-m…
In [8]:
# Display the fields of the data set
list(data)
Out[8]:
['title', 'description', 'category', 'url']

Map Craigslist categories to our application categories

The categories that Craigslist uses don’t match the categories our application needs, so we remap them using a JSON mapping file.
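
As a rough illustration, the loaded map is just a dict from Craigslist category names to our own labels. A hypothetical excerpt, expressed as the Python dict it loads into (the real mapping lives in data/categories.json):

category_map_example = {
    'computer parts': 'electronics',    # hypothetical app category
    'bicycle parts': 'sporting goods',  # hypothetical app category
    'men seeking women': None,          # mapped to null => dropped below
}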

In [ ]:
"""
Load category map to convert from Craigslist categories to our own
local app categories.
"""
with open(CATEGORY_FILE) as handle:
    category_map = json.loads(handle.read())

"""Load example data using Pandas"""
In [ ]:
# data, _ = train_test_split(data, test_size=0.5)

"""Remove all examples with null fields"""


"""Strip out all "X - by owner", etc. text."""
data['category'], _ = data['category'].str.split(' -', 1).str

"""Remap all Craigslist categories to the categories for our use case"""
data['category'].replace(to_replace=category_map, inplace=True)

"""
Drop all examples with null fields again; this time the categories that
we're skipping.
"""
data = data.dropna()

print('All categories:\n', data.category.value_counts())

Training and test data split

We only need to hold out a test set here, since GridSearchCV already splits a cross-validation data set off of the training set.

In [18]:
train, test = train_test_split(data, test_size=0.1)

Data pipeline

We pipeline the process to make it clearer what’s going on, use less
memory, and make it easier to insert new steps.

FeatureUnion

A FeatureUnion concatenates multiple input features so that
the model trains on all of them at once.

selector

Selects a single column of the input data for the rest of this branch
of the pipeline.

Example:

{
    'title': 'Lagavulin 16',
    'description': 'A fine bottle this is.',
    'category': 'Alcohol & Spirits'
}

=> 'Lagavulin 16'
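
ItemSelector itself lives in lib/item_selector.py; here is a minimal sketch of what it looks like, assuming it follows the well-known scikit-learn FeatureUnion example it was copied from:

from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single column from a pandas DataFrame."""

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        # Selection is stateless, so there's nothing to learn here
        return self

    def transform(self, data):
        return data[self.key]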

vect

Turn each document’s text into a vector of token counts.

Example:

["dog cat fish", "dog cat", "fish bird", "bird"]

=>

[[0, 1, 1, 1],
 [0, 2, 1, 0],
 [1, 0, 0, 1],
 [1, 0, 0, 1]]
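
This is easy to reproduce with a bare CountVectorizer (no stop words or document-frequency cutoffs, unlike the pipeline below):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
vect = CountVectorizer()

# Columns follow the sorted vocabulary: bird, cat, dog, fish
print(vect.fit_transform(texts).toarray())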

tfidf

Deprioritize words that appear very often, such as “the”, “an”, “craigslist”, etc.

Example:

[[3, 0, 1],
 [2, 0, 0],
 [3, 0, 0]]

=>

[[ 0.81940995,  0.        ,  0.57320793],
 [ 1.        ,  0.        ,  0.        ],
 [ 1.        ,  0.        ,  0.        ]]
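
The same numbers can be checked directly; smooth_idf=False matches the setting used in the pipeline below:

from sklearn.feature_extraction.text import TfidfTransformer

counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0]]

# Each row comes out L2-normalized
tfidf = TfidfTransformer(smooth_idf=False)
print(tfidf.fit_transform(counts).toarray())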

clf

clf is the classifier that we feed the output of the data pipeline into. In this case, we choose LogisticRegression since it’s one of the best-known performers for text classification. The main alternatives are 1) LinearSVC, a linear support vector machine that, like LogisticRegression, learns a linear decision boundary over the features, and 2) neural nets, which don’t give us much of an advantage over these more classic methods unless we move to much more complicated convolutional or recurrent architectures.
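
For comparison’s sake, swapping in LinearSVC would be a one-line change to the pipeline below (a sketch, not something this notebook actually does):

from sklearn.svm import LinearSVC

# Hypothetical drop-in replacement for the 'clf' step; unlike
# LogisticRegression, LinearSVC doesn't expose predict_proba.
clf_step = ('clf', LinearSVC(class_weight='balanced'))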

In [19]:
pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('title', Pipeline([
                ('selector', ItemSelector(key='title')),
                ('vect', CountVectorizer(stop_words='english',
                                         decode_error='replace',
                                         strip_accents='ascii',
                                         max_df=0.8)),
                ('tfidf', TfidfTransformer(smooth_idf=False))
            ])),
            ('description', Pipeline([
                ('selector', ItemSelector(key='description')),
                ('vect', CountVectorizer(stop_words='english',
                                         decode_error='replace',
                                         strip_accents='ascii',
                                         binary=True,
                                         max_df=0.8,
                                         min_df=10)),
                ('tfidf', TfidfTransformer(smooth_idf=False))
            ]))
        ]
    )),
    ('clf', LogisticRegression(C=5, dual=False, class_weight='balanced'))
])

Pipeline parameters

We can optionally set pipeline parameters here to give the grid search more control over each step. In the code above, the optimal parameters are already filled in, which is why every entry below is commented out.

In [ ]:
parameters = {
    # Controls on regression model.
    # 'clf__C': [0.1, 0.3, 1, 3, 5, 10, 30, 100, 300, 1000]
    # 'clf__class_weight': [None, 'balanced'],
    # 'clf__dual': [True, False],

    # Controls on word vectorization.
    # 'union__title__vect__max_df': [0.8, 0.85, 0.9, 0.95, 1],
    # 'union__title__vect__min_df': [1, 10],
    # 'union__title__vect__ngram_range': [(1, 1), (1, 2)],
    # 'union__description__vect__ngram_range': [(1, 1), (1, 2)],
    # 'union__description__vect__max_df': [0.8, 0.85, 0.9, 0.95, 1],
    # 'union__description__vect__min_df': [1, 10, 100],

    # Controls on TfIdf normalization.
    # 'union__title__tfidf__norm': [None, 'l1', 'l2'],
    # 'union__title__tfidf__use_idf': [True, False],
    # 'union__title__tfidf__smooth_idf': [True, False],
    # 'union__title__tfidf__sublinear_tf': [False, True],
    # 'union__description__tfidf__norm': [None, 'l1', 'l2'],
    # 'union__description__tfidf__use_idf': [True, False],
    # 'union__description__tfidf__smooth_idf': [True, False],
    # 'union__description__tfidf__sublinear_tf': [False, True],
}

Train the model

In [ ]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=10)

print('Performing grid search...')
print('Pipeline: ', [name for name, __ in pipeline.steps])
print('Parameters: ')
pprint(parameters)
t0 = time()
grid_search.fit(train[['title', 'description']], train['category'])
print("Done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

score = grid_search.score(test[['title', 'description']], test['category'])
print("Test accuracy: %f" % score)

joblib.dump(grid_search, MODEL_FILE)
Performing grid search...
Pipeline:  ['union', 'clf']
Parameters:
{}
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[CV]  ................................................................
[CV]  ................................................................
[CV]  ................................................................
[CV] ....................... , score=0.7874693401652609, total=33.0min
[CV] ....................... , score=0.7887705122605603, total=33.1min
[CV] ....................... , score=0.7877126558241279, total=34.9min
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed: 36.5min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed: 36.5min finished
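
With the model dumped to disk, categorizing a new post elsewhere is just a matter of loading it back and calling predict. A sketch reusing the Lagavulin example from earlier (the predicted label depends on the mapping in data/categories.json):

import pandas
from sklearn.externals import joblib

model = joblib.load(MODEL_FILE)

post = pandas.DataFrame([{
    'title': 'Lagavulin 16',
    'description': 'A fine bottle this is.',
}])
print(model.predict(post))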

Plot the data

In [ ]:
import matplotlib.pyplot as plt

plot_learning_curve(grid_search.best_estimator_,
                    'Item Categorizer',
                    train[['title', 'description']],
                    train['category'])
plt.show()