Thursday, May 29, 2014

Zillow Housing Prices Trend Analysis using Python Pandas

[]

This analysis is on average home prices in US metro cities over the last 10 years. The specific focus is on the housing downturn and more recent recovery. The main conclusion is that there were a number of cities especially in the south that benefitted from the downturn. I.e. the home prices in these areas primarily increased while other cities experienced a sharp decline. On the other hand, during the recovery, those cities experienced decline while other cities recovered.

In [1]:
%pylab inline
import pandas as pd
metro_df = pd.read_csv("Metro_Zhvi_3bedroom.csv")
metro_df[1:10]
Populating the interactive namespace from numpy and matplotlib

Out[1]:
RegionName 1996-04 1996-05 1996-06 1996-07 1996-08 1996-09 1996-10 1996-11 1996-12 1997-01 1997-02 1997-03 1997-04 1997-05 1997-06 1997-07 1997-08 1997-09 1997-10
1 New York, NY 171200 171800 172200 172200 172000 171900 171200 169900 168600 168600 169500 170500 170900 171200 171400 171600 171800 172200 172700 ...
2 Los Angeles, CA 165600 166200 165600 164800 164300 164400 164400 164500 164800 165500 166100 166900 167500 168000 167900 167500 167400 168000 168800 ...
3 Chicago, IL 132700 133100 133300 133300 133300 133100 132700 132800 133300 133900 134400 134600 134300 134200 134400 134600 134700 134900 135400 ...
4 Dallas-Fort Worth, TX NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 86300 86400 86400 86200 86200 86200 86400 86600 87400 ...
5 Philadelphia, PA 90000 90300 90500 90500 90400 90400 90300 90200 90300 90600 90800 90700 90600 90400 90200 90000 89800 89800 90100 ...
6 Houston, TX 66500 66700 66800 67100 67100 67100 67200 67400 67700 68300 69100 69800 70400 71100 71700 72300 73000 73400 73600 ...
7 Washington, DC 147000 147300 147700 147600 147400 147300 147100 146900 146800 147100 147500 147600 147500 146300 144200 142900 143000 142900 143100 ...
8 Miami-Fort Lauderdale, FL 111800 111900 111800 111800 111800 112100 112300 112400 112600 113200 113900 114300 114300 114100 114100 114300 114800 115000 115000 ...
9 Atlanta, GA 96800 96600 96400 96500 96700 96600 96600 96800 97200 97700 98300 98600 98700 99100 99800 100200 100500 101100 101900 ...

9 rows × 217 columns

In [2]:
import matplotlib.pyplot as plt
res = pd.isnull(metro_df).sum()
plt.plot(res)

# Null data reduces over time
Out[2]:
[<matplotlib.lines.Line2D at 0xabe048cc>]
In [3]:
metro_df = metro_df.transpose()
In [4]:
# set column names to proper city names. Remove the 1st row since it contains the city names
metro_df.columns = metro_df.iloc[0,:]
metro_df = metro_df.ix[1:,:]
metro_df[0:10]
Out[4]:
RegionName United States New York, NY Los Angeles, CA Chicago, IL Dallas-Fort Worth, TX Philadelphia, PA Houston, TX Washington, DC Miami-Fort Lauderdale, FL Atlanta, GA Boston, MA San Francisco, CA Detroit, MI Riverside, CA Phoenix, AZ Seattle, WA Minneapolis-St Paul, MN San Diego, CA St. Louis, MO Tampa, FL
1996-04 101600 171200 165600 132700 NaN 90000 66500 147000 111800 96800 152500 218500 119200 105900 93300 148400 110400 159700 94300 78600 ...
1996-05 101900 171800 166200 133100 NaN 90300 66700 147300 111900 96600 152700 219400 119800 105800 93900 148200 110800 159700 94600 78700 ...
1996-06 102000 172200 165600 133300 NaN 90500 66800 147700 111800 96400 152900 219300 120100 105900 94300 148000 111300 159400 94800 78700 ...
1996-07 102100 172200 164800 133300 NaN 90500 67100 147600 111800 96500 153200 218600 120500 106100 94700 147700 111600 158800 94900 78700 ...
1996-08 102300 172000 164300 133300 NaN 90400 67100 147400 111800 96700 153500 218400 119500 106100 95100 147500 111700 158300 95100 78600 ...
1996-09 102500 171900 164400 133100 NaN 90400 67100 147300 112100 96600 153700 218700 117900 105900 95600 147800 112000 158000 95500 78500 ...
1996-10 102600 171200 164400 132700 NaN 90300 67200 147100 112300 96600 154000 219000 117300 105800 95900 148300 112400 157900 95700 78500 ...
1996-11 102500 169900 164500 132800 NaN 90200 67400 146900 112400 96800 154400 219500 117900 105800 96100 148800 112900 158000 96000 78700 ...
1996-12 102300 168600 164800 133300 NaN 90300 67700 146800 112600 97200 155100 220300 118500 105900 96500 149400 113500 158300 96300 79000 ...
1997-01 102500 168600 165500 133900 NaN 90600 68300 147100 113200 97700 156100 221800 119300 106200 97300 150200 114100 159200 96500 79300 ...

10 rows × 540 columns

In [5]:
# lets plot how home prices in the US has changed over time
Avg_US_Home = metro_df["United States"]
Avg_US_Home[1:].plot()

# home prices dropped in 2009  and started increase in late 2011
Out[5]:
<matplotlib.axes.AxesSubplot at 0xac3aa7cc>
In [6]:
MultipleHomes = metro_df.iloc[1:,1:10]
plot_obj = MultipleHomes.plot(figsize = (20,20),legend=False)
plot_obj.legend()

# plot across multiple cities
Out[6]:
<matplotlib.legend.Legend at 0xabca6b4c>
In [6]:
In [21]:
# would be interesting to do a correlation between cities. Derive a correlation matrix

MultipleHomes = metro_df[:]
MultipleHomes = MultipleHomes.astype(float) # This is to make the pandas correlation function work....
corr_matrix = MultipleHomes.corr()
corr_matrix.iloc[0:8,0:8]
Out[21]:
RegionName United States New York, NY Los Angeles, CA Chicago, IL Dallas-Fort Worth, TX Philadelphia, PA Houston, TX Washington, DC
RegionName
United States 1.000000 0.982080 0.983924 0.859225 0.968904 0.940225 0.905826 0.966548
New York, NY 0.982080 1.000000 0.974750 0.831541 0.962792 0.961342 0.929631 0.972398
Los Angeles, CA 0.983924 0.974750 1.000000 0.812335 0.950913 0.930821 0.914062 0.992396
Chicago, IL 0.859225 0.831541 0.812335 1.000000 0.863047 0.693303 0.640801 0.758110
Dallas-Fort Worth, TX 0.968904 0.962792 0.950913 0.863047 1.000000 0.881857 0.896538 0.933179
Philadelphia, PA 0.940225 0.961342 0.930821 0.693303 0.881857 1.000000 0.937129 0.937688
Houston, TX 0.905826 0.929631 0.914062 0.640801 0.896538 0.937129 1.000000 0.925237
Washington, DC 0.966548 0.972398 0.992396 0.758110 0.933179 0.937688 0.925237 1.000000

8 rows × 8 columns

In [115]:
# print out the list of least correlated cities to most

corr_matrix.iloc[0:,0].sort()
LeastCorrelated = corr_matrix.sort("United States",ascending=True).iloc[0:,0]
LeastCorrelated
Out[115]:
RegionName
Fort Polk South, LA   -0.877464
Athens, TX            -0.779179
Columbia, MO          -0.561690
Bismarck, ND          -0.543546
Lufkin, TX            -0.357009
Altoona, PA           -0.263352
Sulphur Springs, TX   -0.257235
Big Rapids, MI        -0.214048
New Orleans, LA       -0.181503
Clarksburg, WV        -0.097781
Silver City, NM       -0.043185
Stephenville, TX       0.044545
Connersville, IN       0.073267
Indianapolis, IN       0.154578
Las Cruces, NM         0.164161
...
Stamford, CT        0.980132
Seattle, WA         0.980450
Prescott, AZ        0.980493
Gainesville, FL     0.980575
Beaver Dam, WI      0.980984
Kahului, HI         0.981021
Milwaukee, WI       0.981355
New York, NY        0.982080
Torrington, CT      0.983499
Willimantic, CT     0.983569
Los Angeles, CA     0.983924
New Haven, CT       0.984015
Whitewater, WI      0.984205
Fayetteville, AR    0.985830
United States       1.000000
Name: United States, Length: 540, dtype: float64
In [121]:
# Lets plot a few of the least correlated cities.

metro_df[LeastCorrelated.index[0:5]].plot()
metro_df["United States"].plot(figsize=(20,20))
fig.legend()
Out[121]:
<matplotlib.legend.Legend at 0xa7678f4c>

No comments:

Post a Comment

Followers