Monday, May 26, 2014

Movie Review Sentiment Analysis Predictor using Sklearn and Pandas

The goal of the project is to illustrate how much data understanding, data cleaning and data vectorization can affect the performance of a predictor, in contrast to spending most of the time trying out different classifiers.

This notebook describes the final project for the General Assembly Data Science class. The intent was to develop a model that predicts whether a movie review is positive (1) or negative (0), based only on the review text and a labelled training set.

At the highest level, the design is as follows:

  1. Import the various vectorizers (these transform sentences into a numerical encoding so that a classifier can be applied later), classifiers and supporting modules

  2. Use pandas to read in the training file (already labelled data) and the test file (what needs to be predicted)

  3. Clean up the data, e.g. by removing non-alphabetic characters and stop words, or even any word that is not in the English language

  4. Reading the data into a vectorizer of choice

  5. Cross-validation - this allows us to compare various vectorizer/classifier models using a quantitative measure of prediction quality.

  6. Prediction and output into a results file.

Below are the iterative solutions, from least to most optimized:

  1. Count vectorizer with naive Bayes and no text clean-up. Default parameters for the count vectorizer, apart from English stop-word removal.

  2. Count vectorizer with naive Bayes and text clean-up. Min_df of 3 and n-grams of 1-4 words for the count vectorizer (the code passes ngram_range=(1,4), which is inclusive at both ends).

  3. Tfidf vectorizer with naive Bayes and text clean-up. Min_df of 3 and n-grams of 1-4 words for the tfidf vectorizer, as in solution 2.

In [22]:
%pylab inline
from textblob import TextBlob
import pandas as pd
from optparse import OptionParser
import sys
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from nltk.stem.porter import PorterStemmer
import csv
from nltk.corpus import stopwords
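from sklearn.cross_validation import cross_val_score
# take each review, generate a list of words, remove stop words, vectorize and then train model with that
# (sklearn.cross_validation was the module name when this was written; later scikit-learn releases moved it to sklearn.model_selection)
from sklearn import cross_validation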
import scipy as sp
import numpy as np
from nltk.corpus import wordnet
import time

import matplotlib.pyplot as plt
Populating the interactive namespace from numpy and matplotlib

WARNING: pylab import has clobbered these variables: ['clf']
`%pylab --no-import-all` prevents importing * from pylab and numpy

Now that we have the various modules, we define a set of helper functions that clean up the text. We will use these from solution 2 onwards.
In [23]:
def IsEnglish(word):
    # Use WordNet synsets to decide whether the word is in the English language:
    # a word with no synsets is treated as non-English.
    return bool(wordnet.synsets(word))

def StringCleanUp(text):
    # Take an entire review (multiple sentences) and return a modified string that
    # contains only alphabetic, English-language words.
    # Note: punctuation is deleted rather than replaced by a space, so "didn't" becomes "didnt".
    alphabetic_only = "".join(c for c in text if c == " " or c.isalpha())
    split_text = alphabetic_only.split()
    res = [tex for tex in split_text if IsEnglish(tex)]
    restored_text = " ".join(res)
    return restored_text

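def llfun(act, pred):
    # Log loss (negative mean log-likelihood) of predicted probabilities `pred`
    # against 0/1 labels `act`. It is defined for cross-validation purposes but is
    # not called below, where plain accuracy is used instead.
    epsilon = 1e-15
    # clip predictions away from 0 and 1 so that log() stays finite
    pred = np.maximum(epsilon, pred)
    pred = np.minimum(1 - epsilon, pred)
    ll = np.sum(act * np.log(pred) + np.subtract(1, act) * np.log(np.subtract(1, pred)))
    ll = ll * -1.0 / len(act)
    return ll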
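
To see what the clean-up step does to a raw review, here is a minimal sketch that swaps the WordNet lookup for a small hypothetical vocabulary, so that it runs without NLTK data. The names `TOY_VOCAB` and `clean_up_sketch` are assumptions made for this illustration and are not part of the notebook; the lower-casing before the lookup is also an assumption.

```python
# Hypothetical stand-in for the WordNet lookup used by IsEnglish.
TOY_VOCAB = {"this", "movie", "sucked", "big", "time", "bad", "acting"}

def clean_up_sketch(text, vocab=TOY_VOCAB):
    # Same filtering rule as StringCleanUp: keep only letters and spaces.
    alphabetic_only = "".join(c for c in text if c == " " or c.isalpha())
    # Keep only tokens found in the toy vocabulary (lower-cased for the lookup).
    kept = [w for w in alphabetic_only.split() if w.lower() in vocab]
    return " ".join(kept)

cleaned = clean_up_sketch("Man, this movie sucked big time! 10/10 bad acting...")
print(cleaned)
```

Digits and punctuation disappear, and any token outside the vocabulary is dropped. Because punctuation is deleted rather than replaced by a space, a contraction such as "didn't" turns into "didnt" and is then kept or dropped depending on the word list.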
In [24]:
# read training data and then test data for prediction purposes

readfile = pd.read_csv("train1.csv")
readoutfile = pd.read_csv("test2.csv")
In [25]:
# Let's take a look at the training file input. The goal is to predict the rating (1 for positive sentiment
# and 0 for negative sentiment) based only on the review text itself.
readfile.head()
Out[25]:
rating review
0 0 Man, this movie sucked big time! I didn't even...
1 1 The 1930s. Classy, elegant Adele (marvelously ...
2 1 This film could have been a silent movie; it c...
3 0 Shamefull as it may be, this movie actually ma...
4 0 This film was terrible. OK, my favourite film ...

5 rows × 2 columns

In [26]:
# example of a review
readfile.review[0]
Out[26]:
"Man, this movie sucked big time! I didn't even manage to see the hole thing (my girlfriend did though). Really bad acting, computer animations so bad you just laugh (woman to werewolf), strange clips, the list goes on and on. Don't know if its just me or does this movie remind you of a porn movie? And I don't mean all the naked ladys... It's something about the light or something... This could maybee become a classic just because of the bad acting and all the naked women, but not because it's an original movie white a nice plot twist. My final words are: Don't see it! It's not worth the time. If you wanna see it because the nakedness there's lots of better ones to see!"
In [27]:
# SOLUTION #1 - 
# 1. Count vectorizer with naive bayes and no text clean up. Default parameters for count vectorizer.

# Now let's try putting this into a Count Vectorizer. Default params, apart from English stop-word removal
input_vectorizer = CountVectorizer(stop_words='english')

# One of the rows was corrupted, with no rating (its rating field holds the literal string "rating"). Remove it.
readfile = readfile.drop(readfile.index[readfile.rating=="rating"])

Ratings = np.array([int(x) for x in readfile.rating])

# perform vectorization using Count Vectorizer (counts # of times a word appears)
X_train = input_vectorizer.fit_transform(readfile.review)

input_vectorizer.get_feature_names()[0:50]
# these are the features used for vectorization.
Out[27]:
[u'00',
 u'000',
 u'0000000000001',
 u'00001',
 u'00015',
 u'000s',
 u'001',
 u'003830',
 u'006',
 u'007',
 u'0079',
 u'0080',
 u'0083',
 u'0093638',
 u'00am',
 u'00pm',
 u'00s',
 u'01',
 u'01pm',
 u'02',
 u'020410',
 u'029',
 u'03',
 u'04',
 u'041',
 u'05',
 u'050',
 u'06',
 u'06th',
 u'07',
 u'08',
 u'087',
 u'089',
 u'08th',
 u'09',
 u'0f',
 u'0ne',
 u'0r',
 u'0s',
 u'10',
 u'100',
 u'1000',
 u'1000000',
 u'10000000000000',
 u'1000lb',
 u'1000s',
 u'1001',
 u'100b',
 u'100k',
 u'100m']
As you can see, there are plenty of features that are somewhat nonsensical and would not intuitively make sense as sentiment predictors. This is mainly because:

  1. The review text includes various non-alphabetic characters

  2. The vectorizer uses every token, even one that appears only once in the entire corpus
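
The second cause can be illustrated with a small pure-Python sketch of document-frequency filtering, which is the idea behind the `min_df` parameter used in the later solutions. The helper `filter_by_min_df` and the toy documents are assumptions made for illustration; scikit-learn's own implementation differs in its tokenization and details.

```python
from collections import Counter

def filter_by_min_df(docs, min_df):
    # Count in how many documents each token appears (document frequency).
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    # Keep only tokens that appear in at least min_df documents.
    return sorted(tok for tok, n in df.items() if n >= min_df)

toy_docs = [
    "great movie 0093638",
    "great acting bad plot",
    "bad movie 00am",
]
vocab_all = filter_by_min_df(toy_docs, min_df=1)
vocab_min2 = filter_by_min_df(toy_docs, min_df=2)
print(vocab_all)
print(vocab_min2)
```

With `min_df=1` the one-off tokens such as `0093638` stay in the vocabulary; raising the threshold to 2 leaves only the words that recur across documents.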
In [28]:
#
# Try out different classifiers here. 
clf = MultinomialNB()
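
The classifier used throughout is multinomial naive Bayes. As a rough illustration of what such a classifier computes from word counts, here is a pure-Python sketch with add-one smoothing. The function names `train_nb` and `predict_nb` and the toy reviews are assumptions made for this example; it is not scikit-learn's `MultinomialNB` implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    # Collect per-class word counts, class document counts and the vocabulary.
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab, alpha

def predict_nb(model, doc):
    word_counts, class_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, None
    for label in sorted(class_counts):
        # Log prior plus the smoothed log likelihood of each known token.
        score = math.log(class_counts[label] / float(total_docs))
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for tok in doc.lower().split():
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[label][tok] + alpha) / denom)
        if best_score is None or score > best_score:
            best_label, best_score = label, score
    return best_label

train_docs = ["great fun movie", "terrible boring movie",
              "great acting", "boring plot terrible acting"]
train_labels = [1, 0, 1, 0]
model = train_nb(train_docs, train_labels)
pred_pos = predict_nb(model, "great fun")
pred_neg = predict_nb(model, "boring and terrible")
print(pred_pos, pred_neg)
```

Each class score is a sum of log probabilities, so words that occur more often in positive training reviews pull the prediction towards 1, and vice versa.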
In [29]:
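# cross validate: 10 folds, each fold scored by plain accuracy
cv = cross_validation.KFold(len(Ratings), k=10, indices=False)

likelihood_arr = []  # per-fold accuracy (despite the name, not a likelihood)
for traincv, testcv in cv:
  # probas holds hard 0/1 class predictions from predict(), not probabilities
  probas = clf.fit(X_train[traincv], Ratings[traincv]).predict(X_train[testcv])
  # NOTE: range(0,len(probas)-1) skips the last prediction of each fold, so the accuracy is
  # very slightly understated; range(len(probas)) would count every prediction
  correct_predictions = [Ratings[testcv][i]==probas[i] for i in range(0,len(probas)-1)]
  likelihood = sum(correct_predictions)*1.0/len(probas)    
  likelihood_arr.append(likelihood)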


print "Results: " + str( np.array(likelihood_arr).mean() )
Results: 0.85792

/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py:240: DeprecationWarning: The parameter k was renamed to n_folds and will be removed in 0.15.
  " removed in 0.15.", DeprecationWarning)

In [30]:
start_time = time.time()
# Now let's fit solution #1 on the full training set. Count Vectorizer with default params, apart from English stop-word removal
input_vectorizer = CountVectorizer(stop_words='english')

# one of the rows was corrupted, with no rating. Remove it.
readfile = readfile.drop(readfile.index[readfile.rating=="rating"])

Ratings = np.array([int(x) for x in readfile.rating])

# perform vectorization using Count Vectorizer (counts # of times a word appears)
X_train = input_vectorizer.fit_transform(readfile.review)

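# Train model on the full training set
clf2 = MultinomialNB().fit(X_train, Ratings)

# perform prediction: vectorize the test reviews with the fitted vocabulary
# (Y_train is really the test feature matrix, despite its name)
Y_train = input_vectorizer.transform(readoutfile.review)
prediction = clf2.predict(Y_train)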

print "Execution time is " + str(time.time() - start_time)

#write to output file
data = pd.DataFrame({"ID":readoutfile.ID, "Predicted":prediction.tolist()})
data.to_csv("x.csv")
Execution time is 13.0662171841

The final score for solution #1 was 0.81828, with an execution time of 15.2 seconds in the IPython notebook.
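
The cross-validation loop used for each solution boils down to three steps: split the sample indices into k folds, train on k-1 of them, and score the held-out fold by accuracy. Below is a pure-Python sketch of that procedure. The helpers `kfold_indices` and `accuracy`, the toy labels and the majority-class stand-in for the classifier are all assumptions made for illustration; this is not scikit-learn's `KFold`.

```python
def kfold_indices(n_samples, n_folds):
    # Yield (train, test) index lists for contiguous, unshuffled folds.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def accuracy(actual, predicted):
    # Fraction of positions where prediction and label agree,
    # counting every element, including the last one.
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / float(len(actual))

labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
scores = []
for train, test in kfold_indices(len(labels), n_folds=5):
    train_labels = [labels[i] for i in train]
    # Stand-in "classifier": always predict the majority training label.
    majority = 1 if sum(train_labels) * 2 >= len(train_labels) else 0
    preds = [majority] * len(test)
    scores.append(accuracy([labels[i] for i in test], preds))

mean_accuracy = sum(scores) / len(scores)
print(mean_accuracy)
```

Averaging the per-fold scores gives a single number that can be compared across vectorizer and classifier combinations, which is how the "Results" figures in this notebook are used.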

In [31]:
start_time = time.time()
# Solution #2 
# 2. Count vectorizer with naive bayes and text clean up. Min_df of 3 and ngrams of 1-4 words (ngram_range is inclusive) for count vectorizer.

input_vectorizer = CountVectorizer(stop_words='english',min_df=3,ngram_range=(1,4))

readfile = readfile.drop(readfile.index[readfile.rating=="rating"])
Ratings = np.array([int(x) for x in readfile.rating])

readfile.review = readfile.review.apply(StringCleanUp)

X_train = input_vectorizer.fit_transform(readfile.review)

# Try out different classifiers here. 
clf = MultinomialNB()

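# cross validate: 10 folds, each fold scored by plain accuracy
cv = cross_validation.KFold(len(Ratings), k=10, indices=False)

likelihood_arr = []  # per-fold accuracy (despite the name, not a likelihood)
for traincv, testcv in cv:
  probas = clf.fit(X_train[traincv], Ratings[traincv]).predict(X_train[testcv])
  # NOTE: range(0,len(probas)-1) skips the last prediction of each fold, so the accuracy is very slightly understated
  correct_predictions = [Ratings[testcv][i]==probas[i] for i in range(0,len(probas)-1)]
  likelihood = sum(correct_predictions)*1.0/len(probas)    
  likelihood_arr.append(likelihood)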


print "Results: " + str( np.array(likelihood_arr).mean() )
# Train model on the full training set
clf2 = MultinomialNB().fit(X_train, Ratings)

# perform prediction
# NOTE: the test reviews are vectorized without StringCleanUp, unlike the training reviews
Y_train = input_vectorizer.transform(readoutfile.review)
prediction = clf2.predict(Y_train)

print "Execution time is " + str(time.time() - start_time)

#write to output file
data = pd.DataFrame({"ID":readoutfile.ID, "Predicted":prediction.tolist()})
data.to_csv("x.csv")
Results: 0.87024
Execution time is 249.107995033

/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py:240: DeprecationWarning: The parameter k was renamed to n_folds and will be removed in 0.15.
  " removed in 0.15.", DeprecationWarning)

The final score for solution #2 was 0.85225, with an execution time of 249 seconds. The large increase over solution #1 (0.81828) is probably because sequences of words (n-grams of up to 4 words here, since ngram_range=(1,4) is inclusive) are a much stronger indication of sentiment than single words.
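
To make the n-gram idea concrete, the sketch below lists the word n-grams of a short phrase. The helper `word_ngrams` is a hypothetical name introduced for this example; it mirrors the inclusive range semantics of `ngram_range`, where both ends of the range are included.

```python
def word_ngrams(text, min_n, max_n):
    # Return all word n-grams with min_n <= n <= max_n (both ends inclusive).
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("not worth the time", 1, 3)
print(grams)
```

A phrase such as "not worth" carries a sentiment that neither "not" nor "worth" carries alone. Note that the notebook also passes `stop_words='english'`, and scikit-learn drops stop words before it builds the n-grams, so the features actually produced differ from this toy output.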

In [32]:
# Solution #3
# 3. Tfidf vectorizer with naive bayes and text clean up. Min_df of 3 and ngrams of 1-4 words (ngram_range is inclusive), as in solution #2.

# The goal here is to improve the vectorizer further by using tf-idf, which down-weights words that appear in many documents of the corpus
start_time = time.time()
input_vectorizer = TfidfVectorizer(stop_words='english',min_df=3,ngram_range=(1,4))
#input_vectorizer = CountVectorizer(stop_words='english')

readfile = readfile.drop(readfile.index[readfile.rating=="rating"])
Ratings = np.array([int(x) for x in readfile.rating])

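# The reviews were already cleaned in solution #2; re-applying the clean-up is redundant if the cells are run in order
readfile.review = readfile.review.apply(StringCleanUp)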

X_train = input_vectorizer.fit_transform(readfile.review)

# Try out different classifiers here. 
clf = MultinomialNB()

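# cross validate: 10 folds, each fold scored by plain accuracy
cv = cross_validation.KFold(len(Ratings), k=10, indices=False)

likelihood_arr = []  # per-fold accuracy (despite the name, not a likelihood)
for traincv, testcv in cv:
  probas = clf.fit(X_train[traincv], Ratings[traincv]).predict(X_train[testcv])
  # NOTE: range(0,len(probas)-1) skips the last prediction of each fold, so the accuracy is very slightly understated
  correct_predictions = [Ratings[testcv][i]==probas[i] for i in range(0,len(probas)-1)]
  likelihood = sum(correct_predictions)*1.0/len(probas)    
  likelihood_arr.append(likelihood)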


print "Results: " + str( np.array(likelihood_arr).mean() )
# Train model on the full training set
clf2 = MultinomialNB().fit(X_train, Ratings)

# perform prediction
# NOTE: the test reviews are vectorized without StringCleanUp, unlike the training reviews
Y_train = input_vectorizer.transform(readoutfile.review)
prediction = clf2.predict(Y_train)
print "Execution time is " + str(time.time() - start_time)

#write to output file
data = pd.DataFrame({"ID":readoutfile.ID, "Predicted":prediction.tolist()})
data.to_csv("x.csv")
Results: 0.87452
Execution time is 218.934711933

/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py:240: DeprecationWarning: The parameter k was renamed to n_folds and will be removed in 0.15.
  " removed in 0.15.", DeprecationWarning)

Solution #3 achieved a final score of 0.86096, which was also 1st place, with an execution time of 218 seconds. Further improvements are possible, but at the cost of processing time and memory consumption (e.g. by increasing the n-gram range). The main conclusion is that simple data cleansing and the choice of vectorization can have a fairly significant impact on predictor performance.
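
Solution #3 swaps raw counts for tf-idf weights. As a rough sketch of the idea, the code below uses the textbook weighting of term count times log(N / df), with no normalization. That formula is an assumption chosen for clarity: scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers differ. The function name `tfidf_sketch` and the toy documents are also hypothetical.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    # Tokenize on whitespace; a deliberately simple assumption.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Textbook weighting: term count times log(N / df).
        weights.append({t: c * math.log(n_docs / float(df[t]))
                        for t, c in tf.items()})
    return weights

toy_docs = ["movie was terrible", "movie was great", "terrible terrible movie"]
w = tfidf_sketch(toy_docs)
print(w[2])
```

A word that appears in every document, such as "movie" here, gets a weight of zero, while a word that is frequent in one review but rarer across the corpus keeps a high weight.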
