Classifying Instagram profiles by gender

The purpose of this project is to create a model which, given an Instagram user’s profile, predicts their gender as accurately as possible. The motivation is to be able to target Instagram users of specific demographics for marketing purposes. The model is a tuned logistic regression trained on labeled, text-based profile data. Its parameters are optimized using the AUROC metric to reduce the gap between the genders in precision and recall. The resulting model achieves roughly 89% overall accuracy on a held-out test set drawn from a dataset of about 20,000 profiles, though recall still differs between the genders (85% for males versus 93% for females).

Introduction

All supporting files for this project can be found in its GitHub repository.

This project write-up assumes that the reader has a basic understanding of machine learning and statistics concepts including logistic regression, word encodings including bag-of-words, and the terminology surrounding false/true positives/negatives.

The following high-level details are of note:

  • Instagram profiles are mostly text. The modeling methods that typically perform best for text classification are regressions and neural nets. This project employs logistic regression as its model of choice.
  • The model is designed to perform equally well on both genders. This was a business constraint, and its most significant consequence was replacing accuracy with AUROC as the cross-validation optimization metric.
  • The data pipeline makes heavy use of n-grams and both word and character encodings. The use of these more complicated bag-of-words encodings improves results substantially beyond simpler 1-gram word-based encodings.
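
As a rough illustration of the encodings mentioned in the list above, the sketch below builds word and character n-grams in plain Python. The helper names word_ngrams and char_ngrams are hypothetical and exist only for this example; the project itself relies on scikit-learn's CountVectorizer, whose default tokenizer behaves somewhat differently (for instance, it drops one-character tokens).

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max (whitespace tokenization)."""
    tokens = text.lower().split()
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams with n_min <= n <= n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


# A bag-of-words vector is simply a count of each n-gram in the text.
word_bag = Counter(word_ngrams('Proud island girl'))
char_bag = Counter(char_ngrams('ab😈', 1, 2))

print(word_bag)
print(char_bag)
```

Character n-grams keep emojis and fragments of names as features, which word-level tokenization tends to discard.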

Due to the difficulty of obtaining reliable data about genders other than male and female, and the lack of marketing value in these smaller demographics, the following analysis eschews these additional labels. Rest assured, this omission is for economic as opposed to political or social reasons.

For reasons including logistical difficulty and the data constituting business trade secrets, the labeled profiles cannot be posted publicly. One way in which the results of this project can be replicated is by querying the Instagram User API and then labeling the data using Amazon Mechanical Turk.

Data Engineering and Cleaning

The code in the following section loads, organizes, and formats the labeled training data in such a way that it can later be passed into an off-the-shelf model.

In [39]:
"""
Jupyter notebook boilerplate setup code.
"""

%matplotlib inline
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
In [40]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

"""
Files from which to load datasets and labels. In this example, the labels
are separate from the rest of the user profile data, and these data are
related using a dictionary.
"""
DATA_FILES = {
    'doug_labeled_user_batch.json': 'doug_labels.json',
    'doug_finaly_labeled_cleaned_batch_2.json': 'doug_labels_batch_2.json'
}

"""
Load example data using Pandas.
"""
datasets = []
for profiles_file, labels_file in DATA_FILES.items():
    datasets.append({
        'profiles': pd.read_json(profiles_file, encoding='ISO-8859-1'),
        'labels': pd.read_json(labels_file, encoding='ISO-8859-1')
    })

"Loaded %d datasets" % len(DATA_FILES.items())
Out[40]:
'Loaded 2 datasets'

Check the loaded data

It’s best to examine the loaded data to verify that it is in the expected format.

In [41]:
total_examples = 0
for dataset in datasets:
    total_examples += len(dataset['profiles'])
"%d total examples" % total_examples
Out[41]:
'20460 total examples'
In [42]:
datasets[0]['profiles'].head()
Out[42]:
biography blocked_by_viewer connected_fb_page country_block external_url external_url_linkshimmed followed_by followed_by_viewer follows follows_viewer has_requested_viewer id is_private is_verified media profile_pic_url profile_pic_url_hd requested_by_viewer saved_media username
0 Just me. 19\nsnapchat: abbipandi False NaN False http://twitter.com/abiigaildg?s=09 http://l.instagram.com/?u=http%3A%2F%2Ftwitter… {‘count’: 256} False {‘count’: 642} False False 255372732 True False {‘count’: 182, ‘nodes’: [], ‘page_info’: {‘end… https://instagram.flju1-1.fna.fbcdn.net/t51.28… https://instagram.flju1-1.fna.fbcdn.net/t51.28… False {‘nodes’: [], ‘page_info’: {‘end_cursor’: None… abigail13d
1 Just a 23 year old living in Milwaukee False NaN False None None {‘count’: 169} False {‘count’: 223} False False 899493065 True False {‘count’: 18, ‘nodes’: [], ‘page_info’: {‘end_… https://instagram.flju1-1.fna.fbcdn.net/t51.28… https://instagram.flju1-1.fna.fbcdn.net/t51.28… False {‘nodes’: [], ‘page_info’: {‘end_cursor’: None… jkst0329
2 ✖️⚠️Follow @billz433⚠️✖️ False NaN False http://www.thiscrush.com/~nicobonta http://l.instagram.com/?u=http%3A%2F%2Fwww.thi… {‘count’: 111} False {‘count’: 52} False False 5566432352 False False {‘count’: 3, ‘nodes’: [{‘__typename’: ‘GraphIm… https://instagram.flju1-1.fna.fbcdn.net/t51.28… https://instagram.flju1-1.fna.fbcdn.net/t51.28… False {‘nodes’: [], ‘page_info’: {‘end_cursor’: None… nicobonta18
3 None False NaN False None None {‘count’: 71} False {‘count’: 511} False False 5416238465 False False {‘count’: 3, ‘nodes’: [{‘__typename’: ‘GraphIm… https://instagram.flju1-1.fna.fbcdn.net/t51.28… https://instagram.flju1-1.fna.fbcdn.net/t51.28… False {‘nodes’: [], ‘page_info’: {‘end_cursor’: None… sunnykumar7094
4 Don’t worry be happy☺\n🎉wish me 23 Feb🎂\nBigge… False NaN False None None {‘count’: 116} False {‘count’: 1143} False False 5679439304 False False {‘count’: 2, ‘nodes’: [{‘__typename’: ‘GraphIm… https://instagram.flju1-1.fna.fbcdn.net/t51.28… https://instagram.flju1-1.fna.fbcdn.net/t51.28… False {‘nodes’: [], ‘page_info’: {‘end_cursor’: None… purbashamalakar

5 rows × 22 columns

In [43]:
datasets[0]['labels'].head()
Out[43]:
gender id
0 female 255372732
1 female 899493065
2 male 5566432352
3 male 5416238465
4 male 5679439304

Notice the presence of the id field in both datasets. This field is the one upon which the profile and label data will be merged.

Combine the profiles with the labels

In this step, the profiles and labels are merged on the id field present in each. The datasets are then concatenated to produce one large set of examples.

In [44]:
for dataset in datasets:
    dataset['merged'] = pd.merge(left=dataset['profiles'],
                                 right=dataset['labels'],
                                 left_on='id',
                                 right_on='id')

data = pd.concat(map(lambda dataset: dataset['merged'], datasets))
data.fillna(value='', inplace=True)
data.head()
Out[44]:
biography blocked_by_viewer connected_fb_page country_block external_url external_url_linkshimmed followed_by followed_by_viewer follows follows_viewer id is_private is_verified media profile_pic_url profile_pic_url_hd requested_by_viewer saved_media username gender
0 Just me. 19\nsnapchat: abbipandi False False http://twitter.com/abiigaildg?s=09 http://l.instagram.com/?u=http%3A%2F%2Ftwitter… {‘count’: 256} False {‘count’: 642} False 255372732 True False {‘count’: 182, ‘nodes’: [], ‘page_info’: {‘end… https://instagram.flju1-1.fna.fbcdn.net/t51.28… https://instagram.flju1-1.fna.fbcdn.net/t51.28… False {‘nodes’: [], ‘page_info’: {‘end_cursor’: None… abigail13d female
1 Just a 23 year old living in Milwaukee False False {‘count’: 169} False {‘count’: 223} False 899493065 True False {‘count’: 18, ‘nodes’: [], ‘page_info’: {‘end_… https://instagram.flju1-1.fna.fbcdn.net/t51.28… https://instagram.flju1-1.fna.fbcdn.net/t51.28… False {‘nodes’: [], ‘page_info’: {‘end_cursor’: None… jkst0329 female
2 ✖️⚠️Follow @billz433⚠️✖️ False False http://www.thiscrush.com/~nicobonta http://l.instagram.com/?u=http%3A%2F%2Fwww.thi… {‘count’: 111} False {‘count’: 52} False 5566432352 False False {‘count’: 3, ‘nodes’: [{‘__typename’: ‘GraphIm… https://instagram.flju1-1.fna.fbcdn.net/t51.28… https://instagram.flju1-1.fna.fbcdn.net/t51.28… False {‘nodes’: [], ‘page_info’: {‘end_cursor’: None… nicobonta18 male
3 False False {‘count’: 71} False {‘count’: 511} False 5416238465 False False {‘count’: 3, ‘nodes’: [{‘__typename’: ‘GraphIm… https://instagram.flju1-1.fna.fbcdn.net/t51.28… https://instagram.flju1-1.fna.fbcdn.net/t51.28… False {‘nodes’: [], ‘page_info’: {‘end_cursor’: None… sunnykumar7094 male
4 Don’t worry be happy☺\n🎉wish me 23 Feb🎂\nBigge… False False {‘count’: 116} False {‘count’: 1143} False 5679439304 False False {‘count’: 2, ‘nodes’: [{‘__typename’: ‘GraphIm… https://instagram.flju1-1.fna.fbcdn.net/t51.28… https://instagram.flju1-1.fna.fbcdn.net/t51.28… False {‘nodes’: [], ‘page_info’: {‘end_cursor’: None… purbashamalakar male

5 rows × 23 columns
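
The profile files hold 20,460 rows in total, yet the per-gender counts reported later in this write-up sum to 20,618. One likely explanation is that some ids appear more than once in the label files, since an inner join emits one row per matching pair. The sketch below shows the effect in plain Python with made-up ids; the function names are hypothetical. In pandas, passing validate='one_to_one' to pd.merge should raise an error in the same situation.

```python
from collections import Counter


def duplicated_ids(ids):
    """Return the ids that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [i for i in counts if counts[i] > 1]


def inner_join_row_count(left_ids, right_ids):
    """Number of rows an inner join on id would produce."""
    right_counts = Counter(right_ids)
    return sum(right_counts[i] for i in left_ids)


# Made-up ids for illustration only.
profile_ids = [101, 102, 103]
label_ids = [101, 102, 102, 103]  # id 102 was labeled twice

print(duplicated_ids(label_ids))                     # ids that need attention
print(inner_join_row_count(profile_ids, label_ids))  # more rows than profiles
```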

Format gender to be compatible with AUROC

The encoding of the gender field is as a string with the value “male” or “female”. The AUROC optimization metric requires a binary (0 or 1) value. The code below re-encodes gender into a new gender_enc field in which males are encoded with the value 0 and females with the value 1.

In [45]:
data['gender_enc'] = data.apply(
    lambda x: 0 if x['gender'] == 'male' else 1,
    axis=1)
data[['gender', 'gender_enc']].head()
Out[45]:
gender gender_enc
0 female 1
1 female 1
2 male 0
3 male 0
4 male 0
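
Note that the lambda above maps every value other than 'male' to 1, including the empty strings introduced by fillna and any misspelled labels. A stricter variant, sketched below under the assumption that 'male' and 'female' are the only valid labels, raises an error on anything unexpected. The names GENDER_CODES and encode_gender are hypothetical.

```python
GENDER_CODES = {'male': 0, 'female': 1}


def encode_gender(label):
    """Map a gender label to 0/1, refusing anything outside the known labels."""
    try:
        return GENDER_CODES[label]
    except KeyError:
        raise ValueError('Unexpected gender label: %r' % (label,))


print([encode_gender(g) for g in ['female', 'male']])
```

With pandas, this could be applied as data['gender'].map(encode_gender).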

Prepare the writing_example field

One of the fields provided in the dataset is media, an array of metadata about each user’s photos, including their captions. Captions are valuable because they are freeform text written by the user. How this field is prepared also affects the results substantially.

Intuitively, there are two methods to vectorize the captions:

  1. Each caption could be encoded in isolation with each one being treated as an entirely different field.
  2. The captions could all be concatenated and encoded together.

Option #2 makes the most sense because there is no natural ordering of captions; for any given two users, there should be nothing in common between each of their first photo captions, or each of their second photo captions, etc.

It is possible that information can be lost by combining captions as in option #2. One example of this loss of data is when two photo captions with completely different sentiments are concatenated; however, the gender of the user who wrote the captions remains constant, which is the important detail.

In [46]:
def extract_writing_example(row):
    captions = []
    for medium in row.media['nodes']:
        if 'caption' not in medium:
            continue
            
        caption = medium['caption']
        if caption is not None:
            captions.append(caption)

    return ' '.join(captions)
        
data['writing_example'] = data.apply(extract_writing_example, axis=1)

data[['username', 'writing_example']].head()
Out[46]:
username writing_example
0 abigail13d
1 jkst0329
2 nicobonta18 Se tocchi la squadra giuro non torni a casa!😈😤…
3 sunnykumar7094
4 purbashamalakar Some people ask me this is my fake profile but…

Prepare the hash_tags field

This section may seem unnecessary because hash tags will already be detected and appropriately weighted by the writing_example vectorizer. The reason for synthesizing the hash_tags field is to be able to apply additional constraints to its vectorizer. One example of a beneficial tuning is binarizing the field instead of retaining the number of times that each hash tag appears in a given writing_example.

In [47]:
import re

def extract_hash_tags(row):
    hash_tags = re.findall('#[a-zA-Z]*', row['writing_example'])
    
    hash_tags = [hash_tag.replace('#', '') for hash_tag in hash_tags]
    
    return ' '.join(hash_tags)

data['hash_tags'] = data.apply(extract_hash_tags, axis=1)

data[data['hash_tags'].apply(lambda x: len(x) > 0)][['writing_example', 'hash_tags']].head()
Out[47]:
writing_example hash_tags
4 Some people ask me this is my fake profile but… trust
5 Super excited to move to a new city, but not l… plantlife plantlady plantlife diyjewelry dewyn…
6 I think my cat wants to go on a vacation 😅😅😅 #… cats catsofinstagram cats catsoninstagram vaca…
7 Vegan Peanut Butter Banana Split! Organic vega… hurricaneirma hurricaneirma hurricaneirma vega…
8 Nyder solnedgang på hørhus kollegiet #hollywoo… hollywood walkoffame favoritehero grandcanyon …
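
One limitation of the pattern '#[a-zA-Z]*' is that it stops at the first character outside A-Z, so tags containing digits or non-Latin letters are truncated or dropped. A possible alternative, shown here only as a sketch, captures any run of word characters after the #. The function name and the sample caption are made up for illustration; the outputs elsewhere in this write-up were produced with the original pattern.

```python
import re


def extract_hash_tags_original(text):
    """The pattern used in this project: ASCII letters only."""
    tags = re.findall('#[a-zA-Z]*', text)
    return ' '.join(tag.replace('#', '') for tag in tags)


def extract_hash_tags_unicode(text):
    """Alternative: \\w matches letters, digits, and underscores in any script."""
    return ' '.join(re.findall(r'#(\w+)', text))


caption = 'Looks like one eyed Kenny #35mm #minoltax700 #agfa'

print(repr(extract_hash_tags_original(caption)))
print(repr(extract_hash_tags_unicode(caption)))
```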

Prepare the first_name field

The full_name field is deceptively ill-suited to bag-of-words encoding. Recall that 1-gram bag-of-words encoding discards information about where each word occurs in the text. Furthermore, people often have last names that could function as first names for the opposite gender, e.g. “Patricia James.” This scenario would be particularly bad if the weight associating “James” with “male” were stronger than the weight associating “Patricia” with “female.”

To solve this problem, the first_name field is extracted and encoded separately as a crude way of recovering the positional information that bag-of-words discards. N-grams do not solve this problem because very few people share the same full_name, so the model would overfit to the training data. The first_name field does not entirely replace the full_name field because the latter may contain emojis or middle names that improve predictions.

In [48]:
def extract_first_name(row):
    return row['full_name'].split(' ', 1)[0]

data['first_name'] = data.apply(extract_first_name, axis=1)
data[['full_name', 'first_name']].head()
Out[48]:
full_name first_name
0 Abigail Diaz Gamio 13🍁 Abigail
1 Jaronsa Taylor Jaronsa
2 Nico Bontà😈 Nico
3 Sunny Kumar Sunny
4 purbasha….. purbasha…..
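
One edge case worth noting: split(' ', 1)[0] returns an empty string when full_name begins with a space. A slightly more defensive variant is sketched below; the function name is hypothetical, and whether leading whitespace actually occurs in this dataset has not been checked.

```python
def extract_first_token(full_name):
    """Return the first whitespace-delimited token, or '' if there is none."""
    parts = full_name.split()
    return parts[0] if parts else ''


print(repr(' Nico Bontà'.split(' ', 1)[0]))      # original approach
print(repr(extract_first_token(' Nico Bontà')))  # defensive variant
```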

Data Exploration

The section that follows will more deeply explore the dataset to identify the data’s features and trends. The goal of this investigation is to identify which encodings are optimal for analysis.

Sample the fully formatted data

Looking at a subset of the dataset with all of the fields properly formatted is the best way to spot obvious relationships and potential paths forward.

In [49]:
data = data[['username', 'first_name', 'full_name', 'biography', 'writing_example', 'hash_tags', 'gender', 'gender_enc']]
data.head(10)
Out[49]:
username first_name full_name biography writing_example hash_tags gender gender_enc
0 abigail13d Abigail Abigail Diaz Gamio 13🍁 Just me. 19\nsnapchat: abbipandi female 1
1 jkst0329 Jaronsa Jaronsa Taylor Just a 23 year old living in Milwaukee female 1
2 nicobonta18 Nico Nico Bontà😈 ✖️⚠️Follow @billz433⚠️✖️ Se tocchi la squadra giuro non torni a casa!😈😤… male 0
3 sunnykumar7094 Sunny Sunny Kumar male 0
4 purbashamalakar purbasha….. purbasha….. Don’t worry be happy☺\n🎉wish me 23 Feb🎂\nBigge… Some people ask me this is my fake profile but… trust male 0
5 chelsea_a_bear Chelsea Chelsea Hebert Super excited to move to a new city, but not l… plantlife plantlady plantlife diyjewelry dewyn… female 1
6 jimsansa Jim Jim Van Mourik I think my cat wants to go on a vacation 😅😅😅 #… cats catsofinstagram cats catsoninstagram vaca… male 0
7 jnlfunfitfoodie Jennifer Jennifer Nicole Lee “Happiest Woman Alive!” Blessed to motivate al… Vegan Peanut Butter Banana Split! Organic vega… hurricaneirma hurricaneirma hurricaneirma vega… female 1
8 peter_jordt Peter Peter Jordt Nyder solnedgang på hørhus kollegiet #hollywoo… hollywood walkoffame favoritehero grandcanyon … male 0
9 _d.kolobova_ Darina Darina Kolobova Огромное спасибо за эти 2 незабываемых недели😘… malenkayastrana malenkayastrana scetchbook ma… female 1

Several patterns are immediately apparent in this dataset:

  1. Many users make liberal use of emojis. One path forward is to perform character vectorization of all fields in which users may enter emojis.
  2. Users write freeform text with context. It may be beneficial to encode any user-inputted fields as n-grams rather than the default 1-grams to retain this context.
  3. Not everyone has a writing_example, but almost everyone has filled in at least one text field. This observation is good news for the model; users without any user-entered text fields filled in are typically less useful for marketing purposes.

These observations could be validated by checking their correctness statistically instead of visually from the small sample of ten examples.
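
A minimal sketch of such a check follows, using a handful of rows modeled on the sample above rather than the real dataset. The helper names are hypothetical, and the Unicode category "So" is only a rough proxy for emojis, not an exact test.

```python
import unicodedata

TEXT_FIELDS = ['biography', 'writing_example']


def has_any_text(row):
    """True if at least one text field is non-empty."""
    return any(row[field].strip() for field in TEXT_FIELDS)


def has_symbol(row):
    """True if any text field contains a character in category So (Symbol, other)."""
    return any(unicodedata.category(ch) == 'So'
               for field in TEXT_FIELDS for ch in row[field])


def share(rows, predicate):
    """Fraction of rows for which the predicate holds."""
    return sum(1 for row in rows if predicate(row)) / len(rows)


# Illustrative rows only; the real check would run over the full dataset.
rows = [
    {'biography': 'Just me. 19', 'writing_example': ''},
    {'biography': '', 'writing_example': ''},
    {'biography': 'Don’t worry be happy☺', 'writing_example': 'Some people ask me'},
    {'biography': '', 'writing_example': 'I think my cat wants to go on a vacation 😅'},
]

print(share(rows, has_any_text))
print(share(rows, has_symbol))
```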

Chart the frequency of word counts for each field and each gender

Following observation #3 above, charting the frequency with which users of each gender fill in text-based fields will give some idea as to how reliable the use of these fields would be.

In [50]:
field_lens_to_plot = ['writing_example', 'biography', 'full_name', 'hash_tags']

for field in field_lens_to_plot:
    for gender in [0, 1]:
        plt.figure()
        gender_data = data[data['gender_enc'] == gender]
        # Compute the word counts as a standalone Series. Assigning a new
        # column to a filtered slice triggers pandas' SettingWithCopyWarning.
        field_lens = gender_data[field].apply(
            lambda x: len(re.findall(r'\w+', x)))
        field_lens.plot(
            title="Frequency of %s for %s" % (field, 'males' if gender == 0 else 'females'),
            kind='hist', color='blue' if gender == 0 else 'red')

The only field with a substantial discrepancy between genders is biography; females seem more likely to enter more text. Unfortunately, this finding was a red herring: when the length of the biography field was added to the trained model as a predictor, predictive power did not change significantly.

Check for imbalances in the genders

A substantial imbalance in the data may require intervention.

In [51]:
"Number of males: %d; Number of females: %d" % (
    len(data[data['gender_enc'] == 0]),
    len(data[data['gender_enc'] == 1])
)
Out[51]:
'Number of males: 8909; Number of females: 11709'

There is an imbalance in gender representation within the dataset, but it is not severe enough to warrant drastic measures. One way to make the analysis more robust is to use AUROC in place of accuracy as the model optimization metric. This technique is typically used to compensate for acute class imbalance, but it also works for milder cases. One challenge is that AUROC is natively defined only for binary classification; multiclass extensions such as one-vs-rest averaging exist, but they would complicate extending the model later to support more than two gender labels.
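
To make the contrast with accuracy concrete, the sketch below computes AUROC directly from its ranking definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function and the toy data are illustrative assumptions, not part of the project code, which uses scikit-learn's built-in 'roc_auc' scorer.

```python
def auroc(y_true, scores):
    """AUROC as P(score of random positive > score of random negative)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 6 positives and 4 negatives, mimicking a mild imbalance.
y_true = [1] * 6 + [0] * 4

# A degenerate model that gives every example the same high score.
constant_scores = [0.9] * 10
accuracy = sum(1 for y in y_true if y == 1) / len(y_true)  # predicts 1 for all

# A model that ranks every positive above every negative.
good_scores = [0.9, 0.8, 0.7, 0.95, 0.85, 0.75, 0.1, 0.2, 0.3, 0.4]

print(accuracy, auroc(y_true, constant_scores))
print(auroc(y_true, good_scores))
```

The degenerate model is rewarded by accuracy simply for guessing the majority class, whereas its AUROC stays at chance level.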

Plot the most predictive words for each field

In this section, an untuned logistic regression model is trained on each field in isolation, and the most extreme weights are plotted. This illustration is not particularly actionable, but it is interesting.

In [52]:
from sklearn.linear_model import LogisticRegression

from sklearn.feature_extraction.text import CountVectorizer

from model_performance_plotter import plot_coefficients

plot_coefficients(LogisticRegression(), CountVectorizer(),
                  'Biography Most Predictive Terms',
                  data['biography'], data['gender_enc'])
plot_coefficients(LogisticRegression(), CountVectorizer(),
                  'Writing Example Most Predictive Terms',
                  data['writing_example'], data['gender_enc'])
plot_coefficients(LogisticRegression(), CountVectorizer(),
                  'Full Name Most Predictive Terms',
                  data['full_name'], data['gender_enc'])
plot_coefficients(LogisticRegression(), CountVectorizer(),
                  'Hash Tags Most Predictive Terms',
                  data['hash_tags'], data['gender_enc'])

Model Training and Validation

The final step is to train and validate the model. In practice, this section took many iterations to get to where it now is.

Split the main dataset into training and test datasets

scikit-learn’s GridSearchCV automatically splits the training dataset into cross-validation folds, so only the test dataset must be created manually.

In [53]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2)
print("Train data set size: %d" % len(train))
print("Test data set size: %d" % len(test))
Train data set size: 16494
Test data set size: 4124
In [54]:
parameters = {
    # 'clf__solver': ['liblinear', 'lbfgs', 'newton-cg', 'saga']
    # 'clf__loss': ['squared_hinge', 'hinge'],
    # 'clf__penalty': ['l1', 'l2'],
    # 'clf__C': [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100],
    # 'clf__C': [10, 15, 20, 25, 30, 35, 50, 80, 100, 120, 150],
    # 'clf__dual': [False, True],
    # 'clf__class_weight': [None, 'balanced'],
}
In [56]:
from sklearn.pipeline import Pipeline

from sklearn.pipeline import FeatureUnion

# Project-local helper module, not the Hugging Face transformers package.
from transformers import ItemSelector
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction import DictVectorizer

tfidf_transformer = TfidfTransformer()

encoding_args = {
    'decode_error': 'replace',
    'strip_accents': 'unicode',
}

word_vectorizer_args = {
    **encoding_args,
    'ngram_range': (1, 2)
}

char_vectorizer_args = {
    **encoding_args,
    'analyzer': 'char',
    'ngram_range': (1, 3)
}

word_vectorizer = CountVectorizer(**word_vectorizer_args)
char_vectorizer = CountVectorizer(**char_vectorizer_args)

transformers = {
    'username': {
        'char': char_vectorizer
    },
    'biography': {
        'word': word_vectorizer,
        'char': char_vectorizer
    },
    'full_name': {
        'word': CountVectorizer(**encoding_args),
        'char': char_vectorizer
    },
    'first_name': {
        'word': CountVectorizer(**encoding_args)
    },
    'hash_tags': {
        'word': CountVectorizer(**encoding_args, binary=True),
        'char': CountVectorizer(**char_vectorizer_args, binary=True)
    },
    'writing_example': {
        'word': word_vectorizer,
        'char': char_vectorizer
    }
}

transformer_list = []
for key, transformer_types in transformers.items():
    for transformer_type, transformer in transformer_types.items():
        transformer_list.append(
            ("%s_%s" % (key, transformer_type), Pipeline([
                ('selector', ItemSelector(key=key)),
                ('vect', transformer),
                ('tfidf', tfidf_transformer)
            ]))
        )

pipeline = Pipeline([
    ('union', FeatureUnion(transformer_list=transformer_list)),
    ('clf', LogisticRegression(C=150))
])
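
ItemSelector is imported from a module that this write-up does not show. The sketch below is a guess at what such a selector might look like, modeled on the helper of the same name that older scikit-learn documentation examples used for heterogeneous data; the author's actual implementation may differ. It simply picks one column out of a dict-like or DataFrame-like input so that each branch of the FeatureUnion sees a single text field.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single field from a dict-like or DataFrame-like input."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Nothing to learn; the selector is stateless.
        return self

    def transform(self, X):
        return X[self.key]


# Made-up rows for illustration; a pandas DataFrame would work the same way.
rows = {
    'biography': ['island girl', 'rock music'],
    'username': ['a', 'b'],
}

branch = Pipeline([
    ('selector', ItemSelector(key='biography')),
    ('vect', CountVectorizer()),
])
matrix = branch.fit_transform(rows)
print(matrix.shape)
```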
In [57]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=10, scoring=scoring, refit='AUC')
grid_search.fit(train, train['gender_enc'])

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")

best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

score = grid_search.score(test, test['gender_enc'])
print("Test score: %f" % score)

y_pred = grid_search.predict(test)
print("Test accuracy: %f" % accuracy_score(test['gender_enc'], y_pred))

# Use this to assess the probability of each classification.
# grid_search.predict_proba(test)

from sklearn.metrics import classification_report
print(classification_report(test['gender_enc'], y_pred))
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[CV]  ................................................................
[CV]  ................................................................
[CV]  ................................................................
[CV]  , AUC=0.9606773914601812, Accuracy=0.8974358974358975, total= 1.1min
[CV]  , AUC=0.9595042119976971, Accuracy=0.891778828664969, total= 1.1min
[CV]  , AUC=0.9536360530299655, Accuracy=0.8863016190649445, total= 1.1min
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  2.0min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  2.0min finished
Best score: 0.958
Best parameters set:
Test score: 0.959340
Test accuracy: 0.892338
             precision    recall  f1-score   support

          0       0.90      0.85      0.87      1821
          1       0.88      0.93      0.91      2303

avg / total       0.89      0.89      0.89      4124
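
For readers who want to check the report by hand, the sketch below computes precision, recall, and F1 for one class from raw predictions, using toy labels rather than the project's data. The function name is hypothetical. It also shows that overall accuracy is the support-weighted average of the per-class recalls; plugging in the rounded recalls and supports from the report above gives about 0.895, consistent with the reported 0.892 once rounding is taken into account.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, computed from raw predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels: 0 = male, 1 = female.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

male = precision_recall_f1(y_true, y_pred, 0)
female = precision_recall_f1(y_true, y_pred, 1)
print(male, female)

# Accuracy as the support-weighted average of per-class recall, using the
# rounded values printed in the classification report.
approx_accuracy = (0.85 * 1821 + 0.93 * 2303) / (1821 + 2303)
print(round(approx_accuracy, 3))
```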

In [58]:
from model_performance_plotter import plot_learning_curve, \
                                      plot_roc_curve, \
                                      plot_precision_recall_curve

title = 'Gender Classifier'

plot_roc_curve(title, y_pred, test['gender_enc'])

plot_precision_recall_curve(title, test['gender_enc'], grid_search.decision_function(test))

plot_learning_curve(grid_search.best_estimator_, title, train, train['gender_enc'])
Out[58]:
<module 'matplotlib.pyplot' from '/usr/local/lib/python3.6/site-packages/matplotlib/pyplot.py'>
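
One detail worth double-checking in the cell above: plot_roc_curve receives y_pred, the hard 0/1 predictions, whereas plot_precision_recall_curve receives continuous scores from decision_function. The signatures of these helpers are not shown in this write-up, so this may be intentional. If it is not, the sketch below illustrates in plain Python why it matters: a ROC curve built from hard labels has a single interior point, while one built from scores has a point per distinct threshold. The function name and data are made up for illustration.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, highest first."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
hard_labels = [1 if s >= 0.5 else 0 for s in scores]

print(len(roc_points(y_true, scores)))       # one point per distinct score
print(roc_points(y_true, hard_labels))       # only one interior point
```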

Examine cases where the model makes correct predictions

It is good practice to verify that the model is making reasonable predictions and that the labels were accurate.

In [59]:
test[test['gender_enc'] == y_pred].sample(10)
Out[59]:
username first_name full_name biography writing_example hash_tags gender gender_enc
1365 _kim_law stacey stacey 🇯🇲Proud Jamaican…Island girl🌴🇯🇲 female 1
4085 katiebiancaxox Katie Katie Bianca 3 soon to be 4 ❤️ female 1
3968 jon.bon.jovi_always immortal immortal rock 🎸☇🔥🤘 “Shot through the heart \nAnd you’re to blame … Jon’s original vocals only, isolated from the … jonbonjovi bonjovi rock rockbands rockmusic ha… male 0
9788 ylimenarod Emily Emily Doran 悲しい女の子 ( ͡° ͜ʖ ͡°)\n@intotheshade_ Looks like one eyed Kenny\n#35mm #minoltax700 … minoltax agfa minoltax agfa female 1
6936 sacredlotus17 Melissa Melissa Pattinson Here’s a sneak peek of what I’m working on at … female 1
5076 angelfesh744 Felicia Felicia female 1
5356 prescott127 Prescott Prescott 活在當下!!! live in present!!! male 0
8681 tjmacca Thomas Thomas McKenzie male 0
2099 kellyhosy Kelly Kelly Ho Join The Jobless Club and be fabulous Sexercise the wall \n@verxniques 🤰🏻#rockclimbi… rockclimbing exercise sexercise booty rockclim… female 1
1664 alfonso3892 Alfonso Alfonso Martinez male 0

Examine cases where the model makes incorrect predictions

It is also good practice to investigate the cases for which the model makes incorrect predictions. Note that in the list below, the gender field is the true label, and the opposite of this label is what the model predicted. The majority of these mistakes are due to incorrect labels.

In [62]:
test[test['gender_enc'] != y_pred].sample(10)
Out[62]:
username first_name full_name biography writing_example hash_tags gender gender_enc
9734 yigitsun97 Sanity is in the eye of the beholder. Philoxene Iskeleden bir cisim yaklasiyor kapta… female 1
10102 ztingle17 Zachry Zachry Ray Tingle Texas A&M University Class of 2018 👍🏻 Basketba… Great weekend with the family! Glad my cousins… NationalBestFriendDay NationalSiblingDay MyGor… female 1
7412 gorjuszj Shae Shae SNAP ME 👻 [GORJUSZ]. ♈️3/25. Raising a Princes… ♥️ New hair who dis ? 📞!\n#autumnhair #fallhai… autumnhair fallhair naturalhair bob dallasstyl… male 0
4982 khor_meng_yang107 Khor Khor Meng Yang 🏣SMK MIHARJA \n💒†FGA CYC\n🎂1007\nWC :khor1234… Today 🌝#04132017 Today SPAM吗😎 female 1
1992 pouria_.rs #BhMn\n#Я§ons\n❤👑 male 0
3058 dinarinus Din Din Arinus male 0
621 viv.ek.5203 Vîv Vîv Ek Simply luvable male 0
3963 santannasheneal Signature Signature by Santanna Sheneal Makeup Artist/Esthetician 🎨\nXtreme Lash Speci… The moment I’ve been waiting for. 😍😍😍\nThe onl… Beauty Bar Supply MakeupArtist Braiders Beauti… male 0
1607 tmongram Gabriele Gabriele Beddoni △ Rome,Paris⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀Illustr… ⠀⠀⠀ 1 3 windows ⠀⠀⠀⠀ ⠀⠀⠀⠀⠀ #picoftheday #inst… picoftheday instagood instamood goodvibes sony… female 1
441 symiko70 💖Symiko💖 💖Symiko💖 👬Loving My 2 Boys👬 female 1
In [20]:
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly.
import joblib

MODEL_FILE = 'ig_gender_classifier.pkl'
joblib.dump(grid_search, MODEL_FILE)
Out[20]:
['ig_gender_classifier.pkl']

Conclusion

The model achieves roughly 89% accuracy on the test set, with 90% precision and 85% recall for males and 88% precision and 93% recall for females; it is therefore somewhat better at picking out females.

In the future, this project could be improved in the following ways:

  • Investigating why the model performs better on females than males. One possible cause for this discrepancy is that there are more females in the dataset, so the model has more data with which to identify females.
  • Translating non-English text to English and then passing that through the model. One way to look at translation is that it is a poor man’s form of PCA; the model could share the weights of English terms rather than being spread thin on every input language. This experiment was attempted, but it was found to be too slow due to the need for a web request for every example.
  • Redoing the project with a neural net instead of logistic regression. Neural nets typically require at least 50,000 to 100,000 examples to perform substantially better than classical models. This experiment was attempted early on in the project, but failed due to an insufficient number of examples.
  • Incorporating user photos into the model via ensemble methods. Computer vision is expensive and slow, so this addition is unlikely to add substantial value to the end result.
