[]

This analysis is on average home prices in US metro cities over the last 10 years. The specific focus is on the housing downturn and more recent recovery. The main conclusion is that there were a number of cities especially in the south that benefitted from the downturn. I.e. the home prices in these areas primarily increased while other cities experienced a sharp decline. On the other hand, during the recovery, those cities experienced decline while other cities recovered.¶

In [1]:

%pylab inline
import pandas as pd
metro_df = pd.read_csv("Metro_Zhvi_3bedroom.csv")
metro_df[1:10]

Populating the interactive namespace from numpy and matplotlib

Out[1]:

	RegionName	1996-04	1996-05	1996-06	1996-07	1996-08	1996-09	1996-10	1996-11	1996-12	1997-01	1997-02	1997-03	1997-04	1997-05	1997-06	1997-07	1997-08	1997-09	1997-10
1	New York, NY	171200	171800	172200	172200	172000	171900	171200	169900	168600	168600	169500	170500	170900	171200	171400	171600	171800	172200	172700	...
2	Los Angeles, CA	165600	166200	165600	164800	164300	164400	164400	164500	164800	165500	166100	166900	167500	168000	167900	167500	167400	168000	168800	...
3	Chicago, IL	132700	133100	133300	133300	133300	133100	132700	132800	133300	133900	134400	134600	134300	134200	134400	134600	134700	134900	135400	...
4	Dallas-Fort Worth, TX	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	86300	86400	86400	86200	86200	86200	86400	86600	87400	...
5	Philadelphia, PA	90000	90300	90500	90500	90400	90400	90300	90200	90300	90600	90800	90700	90600	90400	90200	90000	89800	89800	90100	...
6	Houston, TX	66500	66700	66800	67100	67100	67100	67200	67400	67700	68300	69100	69800	70400	71100	71700	72300	73000	73400	73600	...
7	Washington, DC	147000	147300	147700	147600	147400	147300	147100	146900	146800	147100	147500	147600	147500	146300	144200	142900	143000	142900	143100	...
8	Miami-Fort Lauderdale, FL	111800	111900	111800	111800	111800	112100	112300	112400	112600	113200	113900	114300	114300	114100	114100	114300	114800	115000	115000	...
9	Atlanta, GA	96800	96600	96400	96500	96700	96600	96600	96800	97200	97700	98300	98600	98700	99100	99800	100200	100500	101100	101900	...

9 rows × 217 columns

In [2]:

import matplotlib.pyplot as plt
res = pd.isnull(metro_df).sum()
plt.plot(res)

# Null data reduces over time

Out[2]:

[<matplotlib.lines.Line2D at 0xabe048cc>]

In [3]:

metro_df = metro_df.transpose()

In [4]:

# set column names to proper city names. Remove the 1st row since it contains the city names
metro_df.columns = metro_df.iloc[0,:]
metro_df = metro_df.ix[1:,:]
metro_df[0:10]

Out[4]:

RegionName	United States	New York, NY	Los Angeles, CA	Chicago, IL	Dallas-Fort Worth, TX	Philadelphia, PA	Houston, TX	Washington, DC	Miami-Fort Lauderdale, FL	Atlanta, GA	Boston, MA	San Francisco, CA	Detroit, MI	Riverside, CA	Phoenix, AZ	Seattle, WA	Minneapolis-St Paul, MN	San Diego, CA	St. Louis, MO	Tampa, FL
1996-04	101600	171200	165600	132700	NaN	90000	66500	147000	111800	96800	152500	218500	119200	105900	93300	148400	110400	159700	94300	78600	...
1996-05	101900	171800	166200	133100	NaN	90300	66700	147300	111900	96600	152700	219400	119800	105800	93900	148200	110800	159700	94600	78700	...
1996-06	102000	172200	165600	133300	NaN	90500	66800	147700	111800	96400	152900	219300	120100	105900	94300	148000	111300	159400	94800	78700	...
1996-07	102100	172200	164800	133300	NaN	90500	67100	147600	111800	96500	153200	218600	120500	106100	94700	147700	111600	158800	94900	78700	...
1996-08	102300	172000	164300	133300	NaN	90400	67100	147400	111800	96700	153500	218400	119500	106100	95100	147500	111700	158300	95100	78600	...
1996-09	102500	171900	164400	133100	NaN	90400	67100	147300	112100	96600	153700	218700	117900	105900	95600	147800	112000	158000	95500	78500	...
1996-10	102600	171200	164400	132700	NaN	90300	67200	147100	112300	96600	154000	219000	117300	105800	95900	148300	112400	157900	95700	78500	...
1996-11	102500	169900	164500	132800	NaN	90200	67400	146900	112400	96800	154400	219500	117900	105800	96100	148800	112900	158000	96000	78700	...
1996-12	102300	168600	164800	133300	NaN	90300	67700	146800	112600	97200	155100	220300	118500	105900	96500	149400	113500	158300	96300	79000	...
1997-01	102500	168600	165500	133900	NaN	90600	68300	147100	113200	97700	156100	221800	119300	106200	97300	150200	114100	159200	96500	79300	...

10 rows × 540 columns

In [5]:

# lets plot how home prices in the US has changed over time
Avg_US_Home = metro_df["United States"]
Avg_US_Home[1:].plot()

# home prices dropped in 2009  and started increase in late 2011

Out[5]:

<matplotlib.axes.AxesSubplot at 0xac3aa7cc>

In [6]:

MultipleHomes = metro_df.iloc[1:,1:10]
plot_obj = MultipleHomes.plot(figsize = (20,20),legend=False)
plot_obj.legend()

# plot across multiple cities

Out[6]:

<matplotlib.legend.Legend at 0xabca6b4c>

In [6]:

In [21]:

# would be interesting to do a correlation between cities. Derive a correlation matrix

MultipleHomes = metro_df[:]
MultipleHomes = MultipleHomes.astype(float) # This is to make the pandas correlation function work....
corr_matrix = MultipleHomes.corr()
corr_matrix.iloc[0:8,0:8]

Out[21]:

RegionName	United States	New York, NY	Los Angeles, CA	Chicago, IL	Dallas-Fort Worth, TX	Philadelphia, PA	Houston, TX	Washington, DC
RegionName
United States	1.000000	0.982080	0.983924	0.859225	0.968904	0.940225	0.905826	0.966548
New York, NY	0.982080	1.000000	0.974750	0.831541	0.962792	0.961342	0.929631	0.972398
Los Angeles, CA	0.983924	0.974750	1.000000	0.812335	0.950913	0.930821	0.914062	0.992396
Chicago, IL	0.859225	0.831541	0.812335	1.000000	0.863047	0.693303	0.640801	0.758110
Dallas-Fort Worth, TX	0.968904	0.962792	0.950913	0.863047	1.000000	0.881857	0.896538	0.933179
Philadelphia, PA	0.940225	0.961342	0.930821	0.693303	0.881857	1.000000	0.937129	0.937688
Houston, TX	0.905826	0.929631	0.914062	0.640801	0.896538	0.937129	1.000000	0.925237
Washington, DC	0.966548	0.972398	0.992396	0.758110	0.933179	0.937688	0.925237	1.000000

8 rows × 8 columns

In [115]:

# print out the list of least correlated cities to most

corr_matrix.iloc[0:,0].sort()
LeastCorrelated = corr_matrix.sort("United States",ascending=True).iloc[0:,0]
LeastCorrelated

Out[115]:

RegionName
Fort Polk South, LA   -0.877464
Athens, TX            -0.779179
Columbia, MO          -0.561690
Bismarck, ND          -0.543546
Lufkin, TX            -0.357009
Altoona, PA           -0.263352
Sulphur Springs, TX   -0.257235
Big Rapids, MI        -0.214048
New Orleans, LA       -0.181503
Clarksburg, WV        -0.097781
Silver City, NM       -0.043185
Stephenville, TX       0.044545
Connersville, IN       0.073267
Indianapolis, IN       0.154578
Las Cruces, NM         0.164161
...
Stamford, CT        0.980132
Seattle, WA         0.980450
Prescott, AZ        0.980493
Gainesville, FL     0.980575
Beaver Dam, WI      0.980984
Kahului, HI         0.981021
Milwaukee, WI       0.981355
New York, NY        0.982080
Torrington, CT      0.983499
Willimantic, CT     0.983569
Los Angeles, CA     0.983924
New Haven, CT       0.984015
Whitewater, WI      0.984205
Fayetteville, AR    0.985830
United States       1.000000
Name: United States, Length: 540, dtype: float64

In [121]:

# Lets plot a few of the least correlated cities.

metro_df[LeastCorrelated.index[0:5]].plot()
metro_df["United States"].plot(figsize=(20,20))
fig.legend()

Out[121]:

<matplotlib.legend.Legend at 0xa7678f4c>

Sri's Blog

Thursday, May 29, 2014

Zillow Housing Prices Trend Analysis using Python Pandas

No comments:

Post a Comment

About Me

Followers