Building a Linear Regression Predictor
The challenge is to build a predictor that takes in a set of tab-delimited floating-point features and predicts a floating-point target value.
This is fundamentally a linear regression problem and requires some care to ensure that the data going into the regression model is clean and free of issues such as multicollinearity and heteroskedasticity. It is also important to detect whether there are external factors that affect the prediction but are not captured as features.
The data science approach and results are as follows:
- Data exploration
  - Summary statistics on the data.
  - Detect collinearity and heteroskedasticity issues. The goal is to check that the feature set is not polluted by redundant features or by signs of external influences.
  - Identify NaNs and replace them with mean values.
- Data clean-up
  - The only clean-up is to replace NaNs with mean values.
- Pre-processing algorithms
  - Train/test split and cross-validation (CV). There are two levels of splitting: one split is used to evaluate between different predictors, and the other is used for hyper-parameter selection via grid search (see the sketch after this list).
- Model evaluation
  - Compare models using R^2.
- Future work
  - Apply broader hyper-parameter optimization using grid search.
  - Use Hadoop, Spark, or GraphLab to deal with large datasets. This would allow larger versions of the dataset without having to load the entire dataset into one contiguous memory block.
  - Look at Spearman vs. Pearson correlation to identify whether each independent variable is linearly or non-linearly related to the target. If non-linear, perhaps introduce a polynomial term as well.
  - Try PCA for dimensionality reduction.
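As a rough illustration of the two-level splitting mentioned above, the sketch below combines an outer hold-out split (for comparing predictors) with GridSearchCV for the inner hyper-parameter search. The placeholder names X and y stand for the cleaned feature matrix and target built later in this notebook, and the alpha grid values are illustrative assumptions rather than tuned choices.
from sklearn.grid_search import GridSearchCV          # sklearn.model_selection in newer versions
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import Lasso

# Outer split: hold out 30% of the data purely for comparing fitted predictors
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=42)

# Inner split: GridSearchCV runs its own 10-fold CV on xtrain to pick hyper-parameters
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0]}        # illustrative values only
search = GridSearchCV(Lasso(), param_grid, cv=10, scoring="r2")
search.fit(xtrain, ytrain)

print "best alpha:", search.best_params_["alpha"]
print "held-out R^2:", search.best_estimator_.score(xtest, ytest)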
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.cross_validation import train_test_split
import numpy as np
import statsmodels.stats.diagnostic as std
import matplotlib.pyplot as plt
import seaborn as sns
Data Exploration
df = pd.read_csv("codetest_train.txt",sep='\t')
df.head()
corr_matrix = df.corr()  # keep "target" in df for now so it appears in the correlation matrix
df.describe()
# Highest correlated features with the target column; check whether these correlations are linear
corr_matrix["target"].iloc[np.argsort(corr_matrix["target"])][::-1][0:10]
pd.scatter_matrix(df[["f_175","f_161","f_205","target"]])
df[["f_175","f_161","f_205","target"]].corr()
From the scatter matrix and correlation matrix, notice that there is little correlation between the features and the target. Also note that some relationships might be non-linear. It would be quite time consuming to check this by visual inspection alone. However, we can use the difference between Pearson and Spearman correlation to detect non-linear relationships, and then apply a feature transformation to linearize them before performing the linear regression.
import scipy.stats as st
print st.spearmanr(df[["f_175","target"]])
print st.pearsonr(df["f_175"],df["target"])
print st.spearmanr(df[["f_161","target"]])
print st.pearsonr(df["f_161"],df["target"])
print st.spearmanr(df[["f_205","target"]])
print st.pearsonr(df["f_205"],df["target"])
Pearson vs. Spearman correlation
Above, we have done a few comparisons between Pearson and Spearman correlation. They are almost equal, which implies that the data set is roughly elliptical in distribution, so there is unlikely to be a strong benefit from using a non-linear model here. Let us move on to the feature datatypes to see if there are any non-floating-point features...
df.dtypes[df.dtypes!="float64"]
The features above are categorical columns. Categorical columns can be converted to binary (dummy) columns and merged back.
df = pd.read_csv("codetest_train.txt",sep='\t')
dummies_df = pd.get_dummies(df["f_237"])
dummies_df2 = pd.get_dummies(df["f_215"])
dummies_df.reset_index(inplace=True)
dummies_df2.reset_index(inplace=True)
x = dummies_df.merge(dummies_df2, left_on="index", right_on="index")
df.reset_index(inplace=True)
df = df.merge(x,left_on="index",right_on="index")
df.head()
df.pop("index")
df.head()
# Drop the original (non-float) categorical columns now that they have been dummy-encoded
x = df.dtypes[df.dtypes!="float64"].index
[df.pop(col) for col in x.values]
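For reference, a more compact way to get the same dummy encoding is a single get_dummies call over the two categorical columns; this is just a sketch and assumes a pandas version that accepts the columns argument.
df_compact = pd.read_csv("codetest_train.txt", sep='\t')
df_compact = pd.get_dummies(df_compact, columns=["f_237", "f_215"])   # dummy-encode both categorical columns in one call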
df = df.fillna(df.mean()) # Fill missing values with the mean of that column. Other approaches can be considered as well
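As one example of the "other approaches" mentioned in the comment above, median imputation is less sensitive to outliers than the mean. The sketch below is an alternative to the fillna call above; it uses the older sklearn Imputer API, consistent with the other sklearn imports in this notebook, and is not used in the rest of the analysis.
from sklearn.preprocessing import Imputer   # replaced by sklearn.impute.SimpleImputer in newer versions

median_imputer = Imputer(strategy="median")
df_median_filled = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)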
corr_matrix = df.corr()
def Detect_Colinearity(corr_matrix, corr_threshold):
    # Return the set of feature pairs whose pairwise correlation exceeds corr_threshold
    col_set = set()
    for col in corr_matrix.columns:
        column = corr_matrix.loc[:, col][corr_matrix.loc[:, col] != 1]
        if max(column) > corr_threshold:
            tup = tuple(sorted((column.idxmax(), col)))
            col_set.add(tup)
            print column.idxmax(), " max correlation with ", col, " is ", max(column)
    return col_set
col_set = Detect_Colinearity(corr_matrix,0.7)
col_set
def draw_scatter(colA, colB):
    # Scatter plot of two features to visually inspect their joint relationship
    plt.scatter(df[colA], df[colB])
    plt.xlabel(colA)
    plt.ylabel(colB)
    plt.show()
[draw_scatter(colA,colB) for colA,colB in col_set]
#draw_scatter("f_35","f_47")
Notice that the above feature pairs are highly correlated with each other and do not appear heteroskedastic. Hence, it would be prudent to remove one feature from each pair, since keeping just one of them provides sufficient explanatory power.
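The heteroskedasticity judgement above is based on visual inspection of the scatter plots. As a sketch of a more formal check, the statsmodels diagnostic module imported earlier as std offers a Breusch-Pagan test on the residuals of a simple regression between each correlated pair; the helper below is an illustrative assumption, not part of the original analysis (older statsmodels versions spell the function het_breushpagan).
import statsmodels.api as sm

def breusch_pagan_pvalue(frame, col_x, col_y):
    # Regress col_y on col_x and test the residuals for heteroskedasticity.
    # A small p-value suggests the residual variance is not constant.
    exog = sm.add_constant(frame[[col_x]].values)
    fitted = sm.OLS(frame[col_y].values, exog).fit()
    lm_stat, lm_pvalue, f_stat, f_pvalue = std.het_breuschpagan(fitted.resid, exog)
    return lm_pvalue

for colA, colB in col_set:
    print colA, "vs", colB, "Breusch-Pagan p-value:", breusch_pagan_pvalue(df, colA, colB)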
Data Clean-Up
# Remove one of each pair of features that are highly correlated
[df.pop(colA) for colA, colB in col_set]
Now that we have done the data clean-up, let's put it all into a single function that can be applied to both the train and test sets.
def data_clean_up(input_file, data_type="train", col_set=None):
    # This function cleans up the data in the following manner:
    # 1. Convert all categorical columns to binarized (dummy) columns and merge them back
    # 2. Remove the categorical columns as well as the first column of each binarized block
    #    (this is redundant since its value can be deduced from the other columns)
    # 3. Detect collinearity/heteroskedasticity and remove those columns
    df = pd.read_csv(input_file, sep='\t')
    x = df.dtypes[df.dtypes != "float64"].index
    total_dummies = pd.get_dummies(df[x[0]])
    total_dummies.reset_index(inplace=True)
    for col in x[1:]:
        dummies_df = pd.get_dummies(df[col])
        dummies_df.reset_index(inplace=True)
        keep_cols = range(0, dummies_df.shape[1])
        keep_cols.pop(1)  # drop the first dummy column; it can be deduced from the others
        dummies_df = dummies_df.iloc[:, keep_cols]
        total_dummies = total_dummies.merge(dummies_df, left_on="index", right_on="index")
    df.reset_index(inplace=True)
    df = df.merge(total_dummies, left_on="index", right_on="index")
    df.pop("index")
    if data_type == "test":
        # remove all the features identified by the training clean-up
        [df.pop(colA) for colA, colB in col_set]
    if data_type == "train":
        col_set = set()
        y = df.pop("target")
        [col_set.add(tuple([val, val])) for val in x.values]
    df = df.fillna(df.mean())  # Fill missing values with the column mean; other approaches can be considered as well
    corr_matrix = df.corr()
    if data_type == "train":
        # find all the columns to drop based on collinearity detection and non-float dtypes
        ss = Detect_Colinearity(corr_matrix, 1)
        col_set = col_set | ss
        [df.pop(colA) for colA, colB in col_set]
        return df, y, col_set
    return df
df_train,y_train,col_set = data_clean_up("./codetest_train.txt")
print "Features to be removed due to colinearity and non floats are " , [a for a,b in col_set]
df_test = data_clean_up("./codetest_test.txt",data_type="test",col_set=col_set)
Pre-Processing and Model Evaluation
Model evaluation is done using the R^2 score. Note that the CV versions of the regressors are used in order to further improve generalization.
import sklearn.linear_model as lm
xtrain, xtest, ytrain, ytest = train_test_split(df_train.values, y_train.values, test_size=0.3, random_state=42)
clf = lm.LassoCV(cv=10)
clf.fit(xtrain,ytrain)
ypred = clf.predict(xtest)
Lasso_score = clf.score(xtest,ytest)
print "score for Lasso Linear Regression is " , Lasso_score
clf = lm.RidgeCV(cv=10)
clf.fit(xtrain,ytrain)
ypred = clf.predict(xtest)
Ridge_score = clf.score(xtest,ytest)
print "score for Ridge Linear Regression is " ,Ridge_score
clf = lm.LassoLarsCV(cv=10)
clf.fit(xtrain,ytrain)
ypred = clf.predict(xtest)
LassoLars_score = clf.score(xtest,ytest)
print "score for LassoLarsCV Linear Regression is " , LassoLars_score
Lasso Regression performs best. Now train on the complete dataset to use the maximum amount of information before performing the prediction on the test set.
clf = lm.LassoCV(cv=60)
clf.fit(df_train.values, y_train.values)
cols = np.where(clf.coef_ != 0)  # features with non-zero Lasso coefficients
print "Selected most important features based on Lasso Regression are ", df_train.columns[cols]
clf.score(xtest, ytest)  # note: xtest overlaps the full training data here, so this score is optimistic
Output Results and Linear Regression predictor to file
results = clf.predict(df_test.values)
fid = open("results.csv",'w')
results.tofile(fid,sep=',',format='%f')
pd.to_pickle(clf,"./Part1LassoRegression.pickle")
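As a quick usage note, the pickled predictor can be loaded back later and reused without retraining; this small sketch assumes the file written above is present.
clf_loaded = pd.read_pickle("./Part1LassoRegression.pickle")   # pandas pickling round-trips arbitrary Python objects
new_predictions = clf_loaded.predict(df_test.values)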
Conclusion
The R^2 score improved significantly by retaining the categorical data as binarized features. In terms of performance, the clean-up and training take a considerable amount of time, especially as the number of cross-validation runs increases (40 CV runs were done here).