Saturday, April 8, 2017

My Path to Become a Data Scientist - Decision Making Under Uncertainty


It was the summer of 2013. A year that unbeknownst to me at the time, would mark the beginning of the most challenging journey in my life.  I was a software engineering manager at a medical device startup, Cameron Health that had recently been acquired by Boston Scientific Inc. The gusto and motivation in which I entered this role 5 years ago in 2008 had eroded over time. Now, after being acquired, I felt distant and unable to define an identity of my own let alone be an efficient leader for my team. As a manager, my core technical skills were eroded to being a glorified project manager whose focus was to push a derived tactical set of orders onto my team. I had lost the once outstanding ability to think big and inspire others towards an exciting new future. This was precisely the ability that allowed me to quickly move up the ranks within General Motors and eventually gaining a critical engineering management role at Cameron Health.


It wasn't a huge surprise when I got the message that Boston Scientific was considering transitioning all operations to the headquarters in Minnesota. After all, they had just acquired all this technology and manufacturing capability which is more or less redundant to a number of core pieces of pre-existing technology that already exists within their corporation. However, it was a surprise when I was told by my manager, the engineering director, that only key staff members were going to be retained and the process was going to be completed in a week. Clearly, management individuals were not among those who would be retained. Both me and my manager were laid off the next week, along with 15% of the staff. Another 80% of the original ~200 person staff would be laid off over the following year.


After 11 years of continuous employment and never even having to apply for another job, other than right out of graduate school, I was faced with a whole new set of experiences and emotions. Initially, I felt very confident of securing a new and better role. I took a lengthy vacation for a few weeks before getting started on that process. However, reality struck when month after month passed and interviews were scarce, let alone job offers. It dawned on me that despite the fact I had 11 years of experience, my technical skills were extremely stale and the industry I had been in, had changed dramatically in both the type of problems they were solving and the technology stack. Management experience in itself was not valued if it was not accompanied with strong hands-on skills on new technologies and experience in solving domain specific problems.


I had a few alternatives at this stage - continue with my current job search in hopes of finding a position, expanding my search to other industries or geographies, sacrificing a search for a long term career for short term employment or simply reinventing myself. There was no data, no prescribed solution or best practice that was available to guide me in the right direction. Most of the advice I received were in the lines of simply performing better at interviews or taking a less desirable job just for the sake of having a job in hopes of growing into a better role in the future. I even tried analyzing the various alternatives I had and evaluated them relative to a few set of quantitative measures such as how close my skills were to the desired skills necessary and how many jobs were available within that industry. However, all these decision analysis simply created more confusion and did not get me to any concrete steps towards a solution. At this stage, my confidence and my skills were quickly eroding. I had two mortgages and financial obligations that were pressurizing my timeline to secure a new source of cash flow.


Necessity is the mother of invention. This was certainly the case for me at this point. I was forced to get out of my comfort zone of doing job applications and networking. I started to venture out to unique conferences, hackathons and entrepreneur events that were in different industries. I started to notice trends of where the industry was moving to and over the course of a few weeks, I was convinced that data science and machine learning were going to reinvent the next 20 years of technology. However, I had absolutely zero skills in this arena. I had never used Python before and all I recalled of statistics were buzzwords that I had long forgotten their meaning. I looked at the requirements for data scientist roles at the time and it was almost greek to me. The path to transitioning from scratch to become a data scientist at the ripe old age of 35, was seemingly too challenging to overcome.


The path to success is paved by pain and suffering. I had overcome many difficulties when I first came to the US to start a new life. In situations where there isn't much information to make a decision and a deadline quickly approaching, I decided that my strategy is to pick a clear and simple vision and then put your blinders on and trudge your way towards that vision despite the odds or any challenges that may come. My vision was to be a data scientist. Now all I needed to do is to learn everything necessary to be one.


I now armed myself with this clear vision and with a continual mantra that no matter how difficult the path is, given enough time and hard work, it can be overcome. I signed up to a part time bootcamp in data science in Los Angeles. It was a 4 hour drive one way from my home in Orange County but the class gave me a first glimpse of the knowledge of a data scientist. Three months later, I signed up for a full time bootcamp in San Francisco. I was going all in with this vision. I relocated for 3 months and simply worked tirelessly every single day, morning till night, on absorbing every piece of knowledge and skill necessary to be a data scientist. However, even after the full time bootcamp, I still had a tough time getting through the interview process. Most companies would consider me too junior in skills but too senior in age.


I was humbled to the degree where I decided that I would have to work my way up again from the bottom even if it meant throwing away 11 years worth of seniority. I was willing to take any role that would give me critical experiences as a data scientist. I took on contract work, hackathon projects and even any company’s take home assignments. It was pure hustling, something that I had never done before when I was in the enclave of a cushy full time corporate career. Over the course of 5 months past the completion of my full time bootcamp, I steadily built up key experiences, won seven hackathons with 40k in winnings, third place in an international data science competition and built a solid data science portfolio. Finally, I was getting more interviews and getting further in the interviews as well.


Five months after the completion of my bootcamp, I landed a role as a data science manager at Change.org. I had a broad exposure and I was able to utilize both my management experience and my new founded skills in data science. However, I wanted to get more hands on so I left that to join 6Sense, a machine learning as a service startup where I quickly proved my worth by building an entirely automated data pipeline and machine learning model training platform that improved the performance by over 300%. Now, I work as a senior data scientist in Goldman Sachs among a team of highly qualified data scientists from some of the top universities in the world. I am also a lead instructor for the same part time bootcamp that I started my humble beginnings 3.5 years ago. They say it takes 5 years to be an expert in any domain. I feel I am well on my way to this.

The key to decision making under uncertainty is to have a clear and simple vision that will become a compass for all other tactical decisions. After that it is simply the unending focus and motivation that will drive you towards that vision.

Saturday, March 26, 2016

Deriving the Cosine Distance for Machine Learning Distance Metrics

Deriving the Cosine Distance for Machine Learning



Cosine distance is used heavily in machine learning and is a very common distance metric alongside the euclidean distance. The biggest advantage of the cosine distance is its robustness against data scale. I.e. if you have various features that have different min and max scales to it, you can directly apply cosine distance and its notion of similarity.


This blog post is to help understand how the cosine distance is derived.


Figure 1: Illustration of the Angular Distance of Theta


In Figure 1 we see vectors A and B in 2 dimensions. In practical situations, you may have many more dimensions but that just means that your vector is longer in length. Now the equation used for cosine distance is typically:




The challenge is how do we get this equation. Note that this equation is very nice since it only depends on dot products and a division. It is extremely efficient to calculate and cos of 90 degrees is 0 and cos of 0 is 1 and cos of 180 is -1, which works perfectly inline with how we see A and B as similar or dissimilar.


It comes from applying vector math on the usual Cosine Rule but with a little trick.
Cosine Rule - c2 = a2 + b2 - 2ab cos theta.
Now lets look at that formula but in vector math notation. Note that vectors are basically each data point with each feature as an independent dimension. E.g. A = (f1,f2,f3...fN)

I couldnt figure out how to add equations in blogpost so forgive the screen shot I had to take just to paste the equations...
Note that ||a-b||^2 is the length of the line between the two points. This is the same as c^2 in the cosine law equation. 

Which is what was the definition of the cosine distance we had in the beginning…

Thursday, August 28, 2014

A Methodology to Perform Linear Regression

CapitalOneInterviewAssignment_Part 1

Building a Linear Regression Predictor

The challenge is to build a predictor that takes in a set of tab delimited floating values and predicts a floating value.

This is fundamentally a linear regression problem and requires some special treatment to ensure that the data that goes into the regression model is clean and does not have issues such as multi-colinearity and hetereoskedacitic. Also, it would be important to detect if there are external factors that affect the prediction but are not captured as features.

The data science approach and results is as follows:

  1. Data exploration.
    • Summary statistics on the data.
    • Detect colinearity, heteroskedacity issues. The goal here is to check if the feature set is not polluted by either redundant features or indications of external influences.
    • Identify NAN's and replace with mean values
  2. Data Clean Up
    • The only clean up is to replace with mean values.
  3. Pre-processing algorithms
    • Cross-train split from cross-validation (CV). There are two levels of splitting. One CV split is to evaluate between different predictors. The other CV split is to perform hyper-parameter selection using gridsearch.
  4. Model evaluation
    • Compare with using R^2.
  5. Future work
    • Apply broader hyper-parameter optimization using gridsearch
    • Using Hadoop, Spark or GraphLab in order to deal with large datasets. This would allow larger versions of the dataset without having to load the entire dataset into one continuous memory block.
    • Look at spearman vs. pearson correlation to identify whether the independent variable is linear or non linear. If non linear, perhaps introduce a polynomial term as well.
    • Try PCA for dimensionality reduction
In [19]:
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.cross_validation import train_test_split
import numpy as np
import statsmodels.stats.diagnostic as std
import matplotlib.pyplot as plt
import seaborn as sns

Data Exploration

In [20]:
df = pd.read_csv("codetest_train.txt",sep='\t')
In [21]:
df.head()
Out[21]:
target f_0 f_1 f_2 f_3 f_4 f_5 f_6 f_7 f_8 ... f_244 f_245 f_246 f_247 f_248 f_249 f_250 f_251 f_252 f_253
0 3.066056 -0.652557 0.255256 -0.615412 -1.833283 -0.735654 NaN 1.115108 -0.170707 -0.350879 ... -1.607174 -1.400451 -0.920319 -0.198487 -0.944919 -0.572555 0.170449 -0.417764 -1.244350 -0.502889
1 -1.910473 1.178667 -0.093398 -0.555568 0.811153 -0.467931 -0.005488 -0.116258 -1.242854 1.984955 ... 1.281957 0.032356 -0.060830 NaN -0.061294 -0.301641 1.280799 -0.849742 0.821387 -0.259742
2 7.830711 0.180984 -0.778108 -0.919190 0.112607 0.887144 -0.761681 1.872443 -1.708934 0.134782 ... -0.236893 -0.659965 1.072506 -0.192872 0.570371 -0.267201 1.434515 1.331823 -1.146947 2.579905
3 -2.180862 0.745228 -0.245268 -1.343348 1.163072 -0.168847 -0.150516 -1.099532 0.225346 1.223015 ... 0.709195 -0.203496 -0.135737 -0.570835 1.682078 0.243042 -0.380941 0.612880 1.032715 0.400305
4 5.462784 1.217383 -1.323946 -0.958151 0.448290 -2.873187 -0.855653 0.602568 0.763372 0.019770 ... 0.892303 -0.432659 -0.877101 0.289258 0.653759 1.230144 0.456956 -0.754051 -0.025100 -0.931225

5 rows × 255 columns

In [22]:
y = df.pop("target")
df.describe()
Out[22]:
f_0 f_1 f_2 f_3 f_4 f_5 f_6 f_7 f_8 f_9 ... f_244 f_245 f_246 f_247 f_248 f_249 f_250 f_251 f_252 f_253
count 4903.000000 4928.000000 4908.000000 4910.000000 4907.000000 4912.000000 4897.000000 4904.000000 4893.000000 4911.000000 ... 4910.000000 4883.000000 4914.000000 4894.000000 4902.000000 4886.000000 4900.000000 4921.000000 4904.000000 4904.000000
mean -0.000432 0.002564 0.028875 -0.005436 -0.006764 0.005568 0.001536 -0.001016 0.009748 -0.002622 ... 0.013529 0.004937 0.023262 -0.018450 -0.009832 0.016954 -0.004944 0.016878 -0.001356 0.010331
std 0.999737 0.997931 1.019343 0.990349 1.006289 0.995800 1.004624 0.997360 0.988310 1.005418 ... 1.001454 0.997329 0.996465 1.005001 0.989225 1.011337 0.991580 1.001362 1.003412 1.006887
min -3.940976 -3.847382 -3.818473 -3.434423 -3.400202 -4.051005 -3.179496 -3.889719 -3.856546 -3.910096 ... -3.585086 -3.493958 -3.484578 -4.012033 -3.252239 -3.821459 -3.376073 -3.373395 -3.949648 -3.727771
25% -0.673040 -0.684705 -0.650854 -0.654998 -0.685200 -0.659947 -0.671828 -0.679356 -0.662404 -0.675319 ... -0.666017 -0.676178 -0.661783 -0.692160 -0.662535 -0.647772 -0.679635 -0.647463 -0.695102 -0.676679
50% -0.010543 -0.002652 0.047116 0.003579 -0.007036 -0.007822 -0.003055 -0.021591 0.016544 -0.002122 ... 0.025606 -0.027898 0.026912 -0.034966 -0.010708 0.002441 0.009770 0.019531 0.002471 0.015306
75% 0.677278 0.674600 0.718753 0.667814 0.653662 0.648748 0.678594 0.670028 0.698020 0.685997 ... 0.671749 0.670035 0.702553 0.655212 0.650979 0.709693 0.660469 0.692202 0.672384 0.705512
max 3.830801 3.995597 3.198775 4.962191 3.106121 4.295765 4.165960 3.798185 4.195049 4.842176 ... 3.365021 3.455789 3.881038 3.689781 3.629413 4.144275 3.873226 3.186501 3.724332 3.955565

8 rows × 250 columns

In [64]:
# Highest correlated features with target column. Should see if they are linear correlations
corr_matrix["target"].iloc[np.argsort(corr_matrix["target"])][::-1][0:10]
Out[64]:
target    1.000000
f_175     0.559211
f_161     0.395634
f_205     0.363730
f_195     0.263667
f_35      0.184608
f_218     0.164864
f_47      0.139391
Mexico    0.115002
f_75      0.097581
Name: target, dtype: float64
In [66]:
pd.scatter_matrix(df[["f_175","f_161","f_205","target"]])
Out[66]:
array([[<matplotlib.axes.AxesSubplot object at 0xa610c14c>,
        <matplotlib.axes.AxesSubplot object at 0xaa0315cc>,
        <matplotlib.axes.AxesSubplot object at 0xa5e2272c>,
        <matplotlib.axes.AxesSubplot object at 0xa5dbbd4c>],
       [<matplotlib.axes.AxesSubplot object at 0xa5d8fa4c>,
        <matplotlib.axes.AxesSubplot object at 0xa5d4514c>,
        <matplotlib.axes.AxesSubplot object at 0xa5cfdecc>,
        <matplotlib.axes.AxesSubplot object at 0xa5d1aaec>],
       [<matplotlib.axes.AxesSubplot object at 0xa5c70dec>,
        <matplotlib.axes.AxesSubplot object at 0xa5c2faec>,
        <matplotlib.axes.AxesSubplot object at 0xa5c63f2c>,
        <matplotlib.axes.AxesSubplot object at 0xa5c290ec>],
       [<matplotlib.axes.AxesSubplot object at 0xa5bd866c>,
        <matplotlib.axes.AxesSubplot object at 0xa5b9682c>,
        <matplotlib.axes.AxesSubplot object at 0xa5bac86c>,
        <matplotlib.axes.AxesSubplot object at 0xa5b0a54c>]], dtype=object)
In [65]:
df[["f_175","f_161","f_205","target"]].corr()
Out[65]:
f_175 f_161 f_205 target
f_175 1.000000 0.698535 0.008231 0.559211
f_161 0.698535 1.000000 -0.006862 0.395634
f_205 0.008231 -0.006862 1.000000 0.363730
target 0.559211 0.395634 0.363730 1.000000

From the scatter matrix and correlation matrix, notice that there is little correlation between the features and the target. Also note that the relationship might be non linear. It would be quite time consuming to do this with visual inspection. However, we can use the difference between pearson and spearman correlation to detect nonlinear relationships that we can then perform feature transformation upon in order to linearize prior to performing the linear regression.

In [79]:
import scipy.stats as st
print st.spearmanr(df[["f_175","target"]])
print st.pearsonr(df["f_175"],df["target"])
(0.56502924196593451, 0.0)
(0.55921068208270208, 0.0)

In [81]:
print st.spearmanr(df[["f_161","target"]])
print st.pearsonr(df["f_161"],df["target"])
(0.39713888083122312, 1.4784063942274236e-188)
(0.39563355008574125, 5.105752113862635e-187)

In [82]:
print st.spearmanr(df[["f_205","target"]])
print st.pearsonr(df["f_205"],df["target"])
(0.36077413223262689, 1.4329521566736594e-153)
(0.36373023641314262, 3.0004829719354247e-156)

Pearson vs. Spearman correlation

Above, we have done a few comparisons between Pearson and Spearman correlation. They are almost equal which implies that the data set is eliptical in distribution. There is unlikely a strong benefit of using a non-linear model for this. Let us move on to looking at the feature datatypes to see if there are any non floating value features...

In [23]:
df.dtypes[df.dtypes!="float64"]
Out[23]:
f_61     object
f_121    object
f_215    object
f_237    object
dtype: object

The features above are categorical columns. Categorical columns can be converted to binary data and merged back.

In [24]:
df = pd.read_csv("codetest_train.txt",sep='\t')
dummies_df = pd.get_dummies(df["f_237"])
dummies_df2 = pd.get_dummies(df["f_215"])
dummies_df.reset_index(inplace=True)
dummies_df2.reset_index(inplace=True)
x = dummies_df.merge(dummies_df2,left_on="index",right_on="index")
x.set_index("index")
df.reset_index(inplace=True)
df = df.merge(x,left_on="index",right_on="index")
df.head()
df.pop("index")
df.head()
Out[24]:
target f_0 f_1 f_2 f_3 f_4 f_5 f_6 f_7 f_8 ... f_251 f_252 f_253 Canada Mexico USA blue orange red yellow
0 3.066056 -0.652557 0.255256 -0.615412 -1.833283 -0.735654 NaN 1.115108 -0.170707 -0.350879 ... -0.417764 -1.244350 -0.502889 1 0 0 0 0 1 0
1 -1.910473 1.178667 -0.093398 -0.555568 0.811153 -0.467931 -0.005488 -0.116258 -1.242854 1.984955 ... -0.849742 0.821387 -0.259742 1 0 0 1 0 0 0
2 7.830711 0.180984 -0.778108 -0.919190 0.112607 0.887144 -0.761681 1.872443 -1.708934 0.134782 ... 1.331823 -1.146947 2.579905 1 0 0 0 1 0 0
3 -2.180862 0.745228 -0.245268 -1.343348 1.163072 -0.168847 -0.150516 -1.099532 0.225346 1.223015 ... 0.612880 1.032715 0.400305 0 0 1 1 0 0 0
4 5.462784 1.217383 -1.323946 -0.958151 0.448290 -2.873187 -0.855653 0.602568 0.763372 0.019770 ... -0.754051 -0.025100 -0.931225 1 0 0 0 1 0 0

5 rows × 262 columns

In [25]:
x = df.dtypes[df.dtypes!="float64"].index
[df.pop(col) for col in x.values]
Out[25]:
[0       b
 1       a
 2       b
 3       a
 4       b
 5       d
 6       d
 7       d
 8       a
 9       c
 10      c
 11      d
 12    NaN
 13      e
 14      c
 ...
 4985      d
 4986      a
 4987    NaN
 4988      c
 4989      c
 4990      a
 4991      b
 4992      e
 4993      d
 4994      a
 4995      e
 4996      c
 4997      d
 4998      d
 4999      a
 Name: f_61, Length: 5000, dtype: object, 0     D
 1     A
 2     B
 3     C
 4     E
 5     F
 6     A
 7     F
 8     B
 9     D
 10    B
 11    A
 12    A
 13    C
 14    D
 ...
 4985    C
 4986    E
 4987    A
 4988    A
 4989    F
 4990    F
 4991    E
 4992    B
 4993    F
 4994    E
 4995    B
 4996    F
 4997    F
 4998    C
 4999    B
 Name: f_121, Length: 5000, dtype: object, 0        red
 1       blue
 2     orange
 3       blue
 4     orange
 5       blue
 6     yellow
 7        red
 8     orange
 9       blue
 10    yellow
 11       NaN
 12       red
 13       red
 14    orange
 ...
 4985       red
 4986    orange
 4987       red
 4988    orange
 4989      blue
 4990    yellow
 4991    yellow
 4992       red
 4993    yellow
 4994    orange
 4995      blue
 4996       red
 4997    yellow
 4998      blue
 4999       red
 Name: f_215, Length: 5000, dtype: object, 0     Canada
 1     Canada
 2     Canada
 3        USA
 4     Canada
 5        USA
 6        USA
 7     Mexico
 8     Mexico
 9     Mexico
 10    Canada
 11    Mexico
 12    Canada
 13    Canada
 14    Mexico
 ...
 4985       USA
 4986       USA
 4987    Canada
 4988       USA
 4989       USA
 4990    Canada
 4991    Canada
 4992    Canada
 4993    Mexico
 4994       USA
 4995    Canada
 4996    Canada
 4997    Mexico
 4998       USA
 4999    Mexico
 Name: f_237, Length: 5000, dtype: object]
In [26]:
df = df.fillna(df.mean()) # Fill missing values with the mean of that column. Other approaches can be considered as well
corr_matrix = df.corr()
In [27]:
def Detect_Colinearity (corr_matrix,corr_threshold):
    col_set =set()
    for col in corr_matrix.columns:
        column = corr_matrix.loc[:,col][corr_matrix.loc[:,col]!=1]
        if max(column) > corr_threshold:
            tup = tuple(sort ((np.argmax(column), col)))
            col_set.add(tup)
            print np.argmax(column), " max correlation with " ,col ," is ", max(column)
    return col_set
col_set = Detect_Colinearity(corr_matrix,0.7)
f_218  max correlation with  f_75  is  0.701528006836
f_205  max correlation with  f_195  is  0.700222900414
f_195  max correlation with  f_205  is  0.700222900414
f_75  max correlation with  f_218  is  0.701528006836

In [28]:
col_set
Out[28]:
{('f_195', 'f_205'), ('f_218', 'f_75')}
In [29]:
def draw_scatter(colA,colB):
    plt.scatter(df[colA],df[colB])
    plt.xlabel(colA)
    plt.ylabel(colB)
    plt.show()

[draw_scatter(colA,colB) for colA,colB in col_set]
#draw_scatter("f_35","f_47")
Out[29]:
[None, None]

Notice that the above features are highly correlated to each other and not heteroskedastic as well. Hence, it would be prudent to remove one of those pairs of features since having just one of them provides sufficient explantory power.

Data Clean-Up

In [30]:
[df.pop(colA) for colA,colB in col_set]
# Remove one of the pair features that are highly correlated
Out[30]:
[0     0.539137
 1     0.561481
 2     0.750839
 3    -0.320563
 4     0.609882
 5     0.060647
 6    -0.362968
 7     0.847730
 8     0.305817
 9    -0.856485
 10    1.606802
 11   -0.953853
 12    0.876890
 13    0.914836
 14    0.267366
 ...
 4985   -0.015994
 4986    0.666057
 4987   -0.021838
 4988    0.550384
 4989   -0.368570
 4990    0.134997
 4991   -0.970546
 4992    0.010943
 4993    0.416705
 4994   -0.553957
 4995   -0.193834
 4996   -0.375043
 4997   -0.015489
 4998    0.381753
 4999    0.884792
 Name: f_195, Length: 5000, dtype: float64, 0    -0.081401
 1    -0.422620
 2    -0.559654
 3    -0.166179
 4     0.953225
 5    -0.835506
 6    -0.974140
 7    -0.879044
 8     0.101077
 9    -0.954207
 10    0.627886
 11   -0.956676
 12   -0.550488
 13   -1.348389
 14   -0.423770
 ...
 4985    0.597803
 4986   -1.274587
 4987    0.712450
 4988    0.040832
 4989   -0.408023
 4990    0.738290
 4991    0.483821
 4992    0.641688
 4993    0.486608
 4994   -0.986703
 4995   -1.677835
 4996   -0.005801
 4997   -0.156283
 4998    0.471449
 4999   -0.093550
 Name: f_218, Length: 5000, dtype: float64]

Now that we have done the data clean up, lets put it all into a single function that can be applied on the test and train set.

In [31]:
def data_clean_up (input,data_type="train",col_set=None):
    # This function cleans up the data in the following manner:
# 1. Convert all classification columns to binarized columns and merge it back
# 2. Remove the classification columns as well as the first column of the binarized (this is redundant since the value can be deduced from the other columns)
# 3. Detect colinearity and hetroskedacity and remove those columns

    df = pd.read_csv(input,sep='\t')
    x = df.dtypes[df.dtypes!="float64"].index
    total_dummies = pd.get_dummies(df[x[0]])
    total_dummies.reset_index(inplace=True)

    for col in x[1:]:
        dummies_df = pd.get_dummies(df[col])
        dummies_df.reset_index(inplace=True)
        asd = range(0,dummies_df.shape[1])
        asd.pop(1)
        dummies_df = dummies_df.iloc[:,asd]
        total_dummies = total_dummies.merge(dummies_df2,left_on="index",right_on="index")

        
    df.reset_index(inplace=True)
    df = df.merge(total_dummies,left_on="index",right_on="index")
    df.pop("index")

    if data_type == "test":
        # remove all the features identified by training clean-up
        [df.pop(colA) for colA,colB in col_set]
    if data_type=="train": 
        col_set = set()
        y = df.pop("target")
        [col_set.add(tuple([val,val])) for val in x.values]
    df = df.fillna(df.mean()) #df.mean()) # Fill missing values with the mean of that column. Other approaches can be considered as well
    corr_matrix = df.corr()
    if data_type=="train": 
        # find all the columns to drop based on colinearity/hetreskedacity detection and non-floats
        ss= Detect_Colinearity(corr_matrix,1)
        col_set = col_set|ss
        [df.pop(colA) for colA,colB in col_set]
        return df,y,col_set
    return df

df_train,y_train,col_set = data_clean_up("./codetest_train.txt")
print "Features to be removed due to colinearity and non floats are " , [a for a,b in col_set]
df_test = data_clean_up("./codetest_test.txt",data_type="test",col_set=col_set)
Features to be removed due to colinearity and non floats are  ['f_61', 'f_121', 'f_215', 'f_237']

Pre-Processing and Model Evaluation

Model evaluation is done using R^2 score. Note that CV version of regressors are used in order to further improve the generalization strength.

In [32]:
import sklearn.linear_model as lm

xtrain, xtest, ytrain, ytest = train_test_split(df_train.values, y_train.values, test_size=0.3, random_state=42)

clf = lm.LassoCV(cv=10)
clf.fit(xtrain,ytrain)
ypred = clf.predict(xtest)
Lasso_score = clf.score(xtest,ytest)
print "score for Lasso Linear Regression is " , Lasso_score


clf = lm.RidgeCV(cv=10)
clf.fit(xtrain,ytrain)
ypred = clf.predict(xtest)
Ridge_score = clf.score(xtest,ytest)
print "score for Ridge Linear Regression is " ,Ridge_score 

clf = lm.LassoLarsCV(cv=10)
clf.fit(xtrain,ytrain)
ypred = clf.predict(xtest)
LassoLars_score = clf.score(xtest,ytest)
print "score for LassoLarsCV Linear Regression is " , LassoLars_score
score for Lasso Linear Regression is  0.56259194772
score for Ridge Linear Regression is  0.544774032938
score for LassoLarsCV Linear Regression is  0.563229331306

Lasso Regression is the best. Now train with the complete dataset to gain maximum amount of info before performing the prediction on the test set.

In [15]:
clf = lm.LassoCV(cv=60)
clf.fit(df_train.values, y_train.values)
cols = np.where(clf.coef_>0)
print "Selected most important features based on Lasso Regression is " ,df.columns[cols]
Selected most important features based on Lasso Regression is  Index([u'f_10', u'f_16', u'f_19', u'f_28', u'f_34', u'f_35', u'f_53', u'f_66', u'f_69', u'f_73', u'f_85', u'f_91', u'f_93', u'f_100', u'f_115', u'f_136', u'f_141', u'f_146', u'f_168', u'f_174', u'f_192', u'f_194', u'f_196', u'f_200', u'f_205', u'f_216', u'f_219', u'f_220', u'f_233', u'f_236', u'f_239', u'f_243', u'Canada', u'Mexico'], dtype='object')

In [16]:
clf.score(xtest,ytest)
Out[16]:
0.57499834755624368

Output Results and Linear Regression predictor to file

In [17]:
results = clf.predict(df_test.values)
fid = open("results.csv",'w')
results.tofile(fid,sep=',',format='%f')
In [18]:
pd.to_pickle(clf,"./Part1LassoRegression.pickle")

Conclusion

The score / R^2 was improved significantly by maintaining the classification data. In terms of performance, the clean-up and training does take a polynomial amount of time especially as the number of cross validation runs increase. (40 CV's were done here)

Followers