Data Science Portfolio

Analysis of Boston House Prices

For this drill, we will analyze Boston house-price data from 1978.

from sklearn.model_selection import train_test_split
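# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn release.
from sklearn.datasets import load_boston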
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
from collections import Counter
from scipy.stats import ttest_ind
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
%matplotlib inline

boston = load_boston()
data = boston.data
target = boston.target

df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target
print(df)
         CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS   RAD    TAX  \
0     0.00632  18.0   2.31   0.0  0.538  6.575   65.2  4.0900   1.0  296.0   
1     0.02731   0.0   7.07   0.0  0.469  6.421   78.9  4.9671   2.0  242.0   
2     0.02729   0.0   7.07   0.0  0.469  7.185   61.1  4.9671   2.0  242.0   
3     0.03237   0.0   2.18   0.0  0.458  6.998   45.8  6.0622   3.0  222.0   
4     0.06905   0.0   2.18   0.0  0.458  7.147   54.2  6.0622   3.0  222.0   
5     0.02985   0.0   2.18   0.0  0.458  6.430   58.7  6.0622   3.0  222.0   
6     0.08829  12.5   7.87   0.0  0.524  6.012   66.6  5.5605   5.0  311.0   
7     0.14455  12.5   7.87   0.0  0.524  6.172   96.1  5.9505   5.0  311.0   
8     0.21124  12.5   7.87   0.0  0.524  5.631  100.0  6.0821   5.0  311.0   
9     0.17004  12.5   7.87   0.0  0.524  6.004   85.9  6.5921   5.0  311.0   
10    0.22489  12.5   7.87   0.0  0.524  6.377   94.3  6.3467   5.0  311.0   
11    0.11747  12.5   7.87   0.0  0.524  6.009   82.9  6.2267   5.0  311.0   
12    0.09378  12.5   7.87   0.0  0.524  5.889   39.0  5.4509   5.0  311.0   
13    0.62976   0.0   8.14   0.0  0.538  5.949   61.8  4.7075   4.0  307.0   
14    0.63796   0.0   8.14   0.0  0.538  6.096   84.5  4.4619   4.0  307.0   
15    0.62739   0.0   8.14   0.0  0.538  5.834   56.5  4.4986   4.0  307.0   
16    1.05393   0.0   8.14   0.0  0.538  5.935   29.3  4.4986   4.0  307.0   
17    0.78420   0.0   8.14   0.0  0.538  5.990   81.7  4.2579   4.0  307.0   
18    0.80271   0.0   8.14   0.0  0.538  5.456   36.6  3.7965   4.0  307.0   
19    0.72580   0.0   8.14   0.0  0.538  5.727   69.5  3.7965   4.0  307.0   
20    1.25179   0.0   8.14   0.0  0.538  5.570   98.1  3.7979   4.0  307.0   
21    0.85204   0.0   8.14   0.0  0.538  5.965   89.2  4.0123   4.0  307.0   
22    1.23247   0.0   8.14   0.0  0.538  6.142   91.7  3.9769   4.0  307.0   
23    0.98843   0.0   8.14   0.0  0.538  5.813  100.0  4.0952   4.0  307.0   
24    0.75026   0.0   8.14   0.0  0.538  5.924   94.1  4.3996   4.0  307.0   
25    0.84054   0.0   8.14   0.0  0.538  5.599   85.7  4.4546   4.0  307.0   
26    0.67191   0.0   8.14   0.0  0.538  5.813   90.3  4.6820   4.0  307.0   
27    0.95577   0.0   8.14   0.0  0.538  6.047   88.8  4.4534   4.0  307.0   
28    0.77299   0.0   8.14   0.0  0.538  6.495   94.4  4.4547   4.0  307.0   
29    1.00245   0.0   8.14   0.0  0.538  6.674   87.3  4.2390   4.0  307.0   
..        ...   ...    ...   ...    ...    ...    ...     ...   ...    ...   
476   4.87141   0.0  18.10   0.0  0.614  6.484   93.6  2.3053  24.0  666.0   
477  15.02340   0.0  18.10   0.0  0.614  5.304   97.3  2.1007  24.0  666.0   
478  10.23300   0.0  18.10   0.0  0.614  6.185   96.7  2.1705  24.0  666.0   
479  14.33370   0.0  18.10   0.0  0.614  6.229   88.0  1.9512  24.0  666.0   
480   5.82401   0.0  18.10   0.0  0.532  6.242   64.7  3.4242  24.0  666.0   
481   5.70818   0.0  18.10   0.0  0.532  6.750   74.9  3.3317  24.0  666.0   
482   5.73116   0.0  18.10   0.0  0.532  7.061   77.0  3.4106  24.0  666.0   
483   2.81838   0.0  18.10   0.0  0.532  5.762   40.3  4.0983  24.0  666.0   
484   2.37857   0.0  18.10   0.0  0.583  5.871   41.9  3.7240  24.0  666.0   
485   3.67367   0.0  18.10   0.0  0.583  6.312   51.9  3.9917  24.0  666.0   
486   5.69175   0.0  18.10   0.0  0.583  6.114   79.8  3.5459  24.0  666.0   
487   4.83567   0.0  18.10   0.0  0.583  5.905   53.2  3.1523  24.0  666.0   
488   0.15086   0.0  27.74   0.0  0.609  5.454   92.7  1.8209   4.0  711.0   
489   0.18337   0.0  27.74   0.0  0.609  5.414   98.3  1.7554   4.0  711.0   
490   0.20746   0.0  27.74   0.0  0.609  5.093   98.0  1.8226   4.0  711.0   
491   0.10574   0.0  27.74   0.0  0.609  5.983   98.8  1.8681   4.0  711.0   
492   0.11132   0.0  27.74   0.0  0.609  5.983   83.5  2.1099   4.0  711.0   
493   0.17331   0.0   9.69   0.0  0.585  5.707   54.0  2.3817   6.0  391.0   
494   0.27957   0.0   9.69   0.0  0.585  5.926   42.6  2.3817   6.0  391.0   
495   0.17899   0.0   9.69   0.0  0.585  5.670   28.8  2.7986   6.0  391.0   
496   0.28960   0.0   9.69   0.0  0.585  5.390   72.9  2.7986   6.0  391.0   
497   0.26838   0.0   9.69   0.0  0.585  5.794   70.6  2.8927   6.0  391.0   
498   0.23912   0.0   9.69   0.0  0.585  6.019   65.3  2.4091   6.0  391.0   
499   0.17783   0.0   9.69   0.0  0.585  5.569   73.5  2.3999   6.0  391.0   
500   0.22438   0.0   9.69   0.0  0.585  6.027   79.7  2.4982   6.0  391.0   
501   0.06263   0.0  11.93   0.0  0.573  6.593   69.1  2.4786   1.0  273.0   
502   0.04527   0.0  11.93   0.0  0.573  6.120   76.7  2.2875   1.0  273.0   
503   0.06076   0.0  11.93   0.0  0.573  6.976   91.0  2.1675   1.0  273.0   
504   0.10959   0.0  11.93   0.0  0.573  6.794   89.3  2.3889   1.0  273.0   
505   0.04741   0.0  11.93   0.0  0.573  6.030   80.8  2.5050   1.0  273.0   

     PTRATIO       B  LSTAT  MEDV  
0       15.3  396.90   4.98  24.0  
1       17.8  396.90   9.14  21.6  
2       17.8  392.83   4.03  34.7  
3       18.7  394.63   2.94  33.4  
4       18.7  396.90   5.33  36.2  
5       18.7  394.12   5.21  28.7  
6       15.2  395.60  12.43  22.9  
7       15.2  396.90  19.15  27.1  
8       15.2  386.63  29.93  16.5  
9       15.2  386.71  17.10  18.9  
10      15.2  392.52  20.45  15.0  
11      15.2  396.90  13.27  18.9  
12      15.2  390.50  15.71  21.7  
13      21.0  396.90   8.26  20.4  
14      21.0  380.02  10.26  18.2  
15      21.0  395.62   8.47  19.9  
16      21.0  386.85   6.58  23.1  
17      21.0  386.75  14.67  17.5  
18      21.0  288.99  11.69  20.2  
19      21.0  390.95  11.28  18.2  
20      21.0  376.57  21.02  13.6  
21      21.0  392.53  13.83  19.6  
22      21.0  396.90  18.72  15.2  
23      21.0  394.54  19.88  14.5  
24      21.0  394.33  16.30  15.6  
25      21.0  303.42  16.51  13.9  
26      21.0  376.88  14.81  16.6  
27      21.0  306.38  17.28  14.8  
28      21.0  387.94  12.80  18.4  
29      21.0  380.23  11.98  21.0  
..       ...     ...    ...   ...  
476     20.2  396.21  18.68  16.7  
477     20.2  349.48  24.91  12.0  
478     20.2  379.70  18.03  14.6  
479     20.2  383.32  13.11  21.4  
480     20.2  396.90  10.74  23.0  
481     20.2  393.07   7.74  23.7  
482     20.2  395.28   7.01  25.0  
483     20.2  392.92  10.42  21.8  
484     20.2  370.73  13.34  20.6  
485     20.2  388.62  10.58  21.2  
486     20.2  392.68  14.98  19.1  
487     20.2  388.22  11.45  20.6  
488     20.1  395.09  18.06  15.2  
489     20.1  344.05  23.97   7.0  
490     20.1  318.43  29.68   8.1  
491     20.1  390.11  18.07  13.6  
492     20.1  396.90  13.35  20.1  
493     19.2  396.90  12.01  21.8  
494     19.2  396.90  13.59  24.5  
495     19.2  393.29  17.60  23.1  
496     19.2  396.90  21.14  19.7  
497     19.2  396.90  14.10  18.3  
498     19.2  396.90  12.92  21.2  
499     19.2  395.77  15.10  17.5  
500     19.2  396.90  14.33  16.8  
501     21.0  391.99   9.67  22.4  
502     21.0  396.90   9.08  20.6  
503     21.0  396.90   5.64  23.9  
504     21.0  393.45   6.48  22.0  
505     21.0  396.90   7.88  11.9  

[506 rows x 14 columns]
df.head()
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33  36.2
df.describe()
             CRIM          ZN       INDUS        CHAS         NOX          RM         AGE         DIS         RAD         TAX     PTRATIO           B       LSTAT        MEDV
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000
mean     3.593761   11.363636   11.136779    0.069170    0.554695    6.284634   68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   12.653063   22.532806
std      8.596783   23.322453    6.860353    0.253994    0.115878    0.702617   28.148861    2.105710    8.707259  168.537116    2.164946   91.294864    7.141062    9.197104
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000    2.900000    1.129600    1.000000  187.000000   12.600000    0.320000    1.730000    5.000000
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   45.025000    2.100175    4.000000  279.000000   17.400000  375.377500    6.950000   17.025000
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   77.500000    3.207450    5.000000  330.000000   19.050000  391.440000   11.360000   21.200000
75%      3.647423   12.500000   18.100000    0.000000    0.624000    6.623500   94.075000    5.188425   24.000000  666.000000   20.200000  396.225000   16.955000   25.000000
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000  100.000000   12.126500   24.000000  711.000000   22.000000  396.900000   37.970000   50.000000
df.corr(method='pearson')
              CRIM         ZN      INDUS       CHAS        NOX         RM        AGE        DIS        RAD        TAX    PTRATIO          B      LSTAT       MEDV
CRIM      1.000000  -0.199458   0.404471  -0.055295   0.417521  -0.219940   0.350784  -0.377904   0.622029   0.579564   0.288250  -0.377365   0.452220  -0.385832
ZN       -0.199458   1.000000  -0.533828  -0.042697  -0.516604   0.311991  -0.569537   0.664408  -0.311948  -0.314563  -0.391679   0.175520  -0.412995   0.360445
INDUS     0.404471  -0.533828   1.000000   0.062938   0.763651  -0.391676   0.644779  -0.708027   0.595129   0.720760   0.383248  -0.356977   0.603800  -0.483725
CHAS     -0.055295  -0.042697   0.062938   1.000000   0.091203   0.091251   0.086518  -0.099176  -0.007368  -0.035587  -0.121515   0.048788  -0.053929   0.175260
NOX       0.417521  -0.516604   0.763651   0.091203   1.000000  -0.302188   0.731470  -0.769230   0.611441   0.668023   0.188933  -0.380051   0.590879  -0.427321
RM       -0.219940   0.311991  -0.391676   0.091251  -0.302188   1.000000  -0.240265   0.205246  -0.209847  -0.292048  -0.355501   0.128069  -0.613808   0.695360
AGE       0.350784  -0.569537   0.644779   0.086518   0.731470  -0.240265   1.000000  -0.747881   0.456022   0.506456   0.261515  -0.273534   0.602339  -0.376955
DIS      -0.377904   0.664408  -0.708027  -0.099176  -0.769230   0.205246  -0.747881   1.000000  -0.494588  -0.534432  -0.232471   0.291512  -0.496996   0.249929
RAD       0.622029  -0.311948   0.595129  -0.007368   0.611441  -0.209847   0.456022  -0.494588   1.000000   0.910228   0.464741  -0.444413   0.488676  -0.381626
TAX       0.579564  -0.314563   0.720760  -0.035587   0.668023  -0.292048   0.506456  -0.534432   0.910228   1.000000   0.460853  -0.441808   0.543993  -0.468536
PTRATIO   0.288250  -0.391679   0.383248  -0.121515   0.188933  -0.355501   0.261515  -0.232471   0.464741   0.460853   1.000000  -0.177383   0.374044  -0.507787
B        -0.377365   0.175520  -0.356977   0.048788  -0.380051   0.128069  -0.273534   0.291512  -0.444413  -0.441808  -0.177383   1.000000  -0.366087   0.333461
LSTAT     0.452220  -0.412995   0.603800  -0.053929   0.590879  -0.613808   0.602339  -0.496996   0.488676   0.543993   0.374044  -0.366087   1.000000  -0.737663
MEDV     -0.385832   0.360445  -0.483725   0.175260  -0.427321   0.695360  -0.376955   0.249929  -0.381626  -0.468536  -0.507787   0.333461  -0.737663   1.000000
# Plot the lower triangle of the correlation matrix as a heatmap.
# Variable pairs with a strong positive correlation appear red;
# pairs with a strong negative correlation appear blue.
# Note: vmax=0.3 below clips the color scale, so all correlations above 0.3 share one color.
corr = df.corr(method='pearson')
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in NumPy 1.24; the builtin bool works everywhere
mask[np.triu_indices_from(mask)] = True

fig, ax = plt.subplots(figsize=(10, 10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, vmax=0.3, center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink":0.5})
[Figure: lower-triangle heatmap of the Pearson correlation matrix]
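
The mask used for the first heatmap hides the upper triangle of the matrix, including the diagonal, so each variable pair is drawn only once. The sketch below mirrors that rule in plain Python for a small matrix; `upper_triangle_mask` is a hypothetical helper written for illustration, not part of NumPy or seaborn.

```python
def upper_triangle_mask(n):
    """Return an n x n grid that is True where col >= row.

    This mirrors setting mask[np.triu_indices_from(mask)] = True:
    True cells (diagonal and above) are the ones hidden in the heatmap.
    """
    return [[col >= row for col in range(n)] for row in range(n)]


mask_demo = upper_triangle_mask(4)
for row in mask_demo:
    # '#' marks a hidden cell, '.' marks a cell that stays visible.
    print(" ".join("#" if hidden else "." for hidden in row))
```

Only the cells strictly below the diagonal remain visible, which is why the first plot shows a triangle rather than a full square.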

correlation_matrix = df.corr().round(2)
sns.heatmap(data=correlation_matrix, annot=True)
[Figure: annotated heatmap of the correlation matrix, rounded to two decimals]
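
To make the numbers in the correlation matrix less of a black box, here is a minimal sketch of the Pearson coefficient computed by hand. It uses only the five RM and MEDV values shown in the `df.head()` output, so the result will differ from the 0.695 that `df.corr()` reports over all 506 rows. `pearson_r` is a helper written for this illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# First five rows of RM and MEDV, copied from the df.head() output.
rm = [6.575, 6.421, 7.185, 6.998, 7.147]
medv = [24.0, 21.6, 34.7, 33.4, 36.2]

r_head = pearson_r(rm, medv)
print(f"Pearson r for the first five rows: {r_head:.3f}")
```

The same function applied to the full RM and MEDV columns should agree with the corresponding entry of `df.corr(method='pearson')`.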

Several variable pairs are strongly correlated, for example RAD and TAX (0.91), NOX and DIS (-0.77), and LSTAT and MEDV (-0.74). Now let’s plot the distribution of median home values.

#plt.figure(figsize=(8,8))
g = sns.distplot(df['MEDV'])
g.set(xlabel='MEDV', ylabel='', title='Distribution of Median Home Values')
plt.show()

[Figure: Distribution of Median Home Values]

Now let’s plot a scatter plot for each of the variables against the target variable (median home values).

#plt.figure(figsize=(8,8))
g = sns.regplot(x=df['CRIM'], y=df['MEDV'], data=df)
g.set(xlabel='CRIM', ylabel='MEDV', title='')
plt.show()

[Figure: MEDV vs. CRIM scatter plot with fitted regression line]

plt.figure(figsize=(8,8))
g = sns.regplot(x=df['ZN'], y=df['MEDV'], data=df)
g.set(xlabel='ZN', ylabel='MEDV', title='')
plt.show()

[Figure: MEDV vs. ZN scatter plot with fitted regression line]

plt.figure(figsize=(8,8))
g = sns.regplot(x=df['INDUS'], y=df['MEDV'], data=df)
g.set(xlabel='INDUS', ylabel='MEDV', title='')
plt.show()

[Figure: MEDV vs. INDUS scatter plot with fitted regression line]

plt.figure(figsize=(8,8))
g = sns.regplot(x=df['CHAS'], y=df['MEDV'], data=df)
g.set(xlabel='CHAS', ylabel='MEDV', title='')
plt.show()

[Figure: MEDV vs. CHAS scatter plot with fitted regression line]

plt.figure(figsize=(8,8))
g = sns.regplot(x=df['NOX'], y=df['MEDV'], data=df)
g.set(xlabel='NOX', ylabel='MEDV', title='')
plt.show()

[Figure: MEDV vs. NOX scatter plot with fitted regression line]

#plt.figure(figsize=(8,8))
g = sns.regplot(x=df['RM'], y=df['MEDV'], data=df)
g.set(xlabel='RM', ylabel='MEDV', title='')
plt.show()

[Figure: MEDV vs. RM scatter plot with fitted regression line]

plt.figure(figsize=(8,8))
g = sns.regplot(x=df['AGE'], y=df['MEDV'], data=df)
g.set(xlabel='AGE', ylabel='MEDV', title='')
plt.show()

[Figure: MEDV vs. AGE scatter plot with fitted regression line]

plt.figure(figsize=(8,8))
g = sns.regplot(x=df['DIS'], y=df['MEDV'], data=df)
g.set(xlabel='DIS', ylabel='MEDV', title='')
plt.show()

[Figure: MEDV vs. DIS scatter plot with fitted regression line]

plt.figure(figsize=(8,8))
g = sns.regplot(x=df['RAD'], y=df['MEDV'], data=df)
g.set(xlabel='RAD', ylabel='MEDV', title='')
plt.show()

[Figure: MEDV vs. RAD scatter plot with fitted regression line]

plt.figure(figsize=(8,8))
g = sns.regplot(x=df['TAX'], y=df['MEDV'], data=df)
g.set(xlabel='TAX', ylabel='MEDV', title='')
plt.show()

[Figure: MEDV vs. TAX scatter plot with fitted regression line]

plt.figure(figsize=(8,8))
g = sns.regplot(x=df['PTRATIO'], y=df['MEDV'], data=df)
g.set(xlabel='PTRATIO', ylabel='MEDV', title='')
plt.show()

[Figure: MEDV vs. PTRATIO scatter plot with fitted regression line]

plt.figure(figsize=(8,8))
g = sns.regplot(x=df['B'], y=df['MEDV'], data=df)
g.set(xlabel='B', ylabel='MEDV', title='')
plt.show()

[Figure: MEDV vs. B scatter plot with fitted regression line]

plt.figure(figsize=(8,8))
g = sns.regplot(x=df['LSTAT'], y=df['MEDV'], data=df)
g.set(xlabel='LSTAT', ylabel='MEDV', title='')
plt.show()

[Figure: MEDV vs. LSTAT scatter plot with fitted regression line]

The above scatter plots reveal some key insights about this data, highlighted below:

-Lower crime rates are associated with higher median home value

-Higher proportions of non-retail business acres are associated with lower median home value

-More rooms are associated with higher median home value

-With some exceptions, lower pupil-teacher ratios are associated with higher median home value

-The strongest correlation appears to be between MEDV and LSTAT (r = -0.74); in other words, a lower percentage of lower-status residents is strongly associated with higher median home value
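
One way to check observations like these against the numbers is to rank the features by the strength of their correlation with MEDV. The sketch below does this using the values from the MEDV row of the `df.corr()` output; the dictionary is copied by hand and `rank_by_strength` is a helper written for this illustration.

```python
# Correlation of each feature with MEDV, copied from the df.corr() output.
medv_corr = {
    "CRIM": -0.385832, "ZN": 0.360445, "INDUS": -0.483725, "CHAS": 0.175260,
    "NOX": -0.427321, "RM": 0.695360, "AGE": -0.376955, "DIS": 0.249929,
    "RAD": -0.381626, "TAX": -0.468536, "PTRATIO": -0.507787, "B": 0.333461,
    "LSTAT": -0.737663,
}


def rank_by_strength(correlations):
    """Sort (feature, r) pairs from strongest to weakest absolute correlation."""
    return sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)


ranked = rank_by_strength(medv_corr)
for name, r in ranked:
    direction = "higher" if r > 0 else "lower"
    print(f"{name:8s} r = {r:+.3f}  (higher {name} goes with {direction} MEDV)")
```

The sign of each coefficient gives the direction of the association, and the ordering shows which scatter plots should have the tightest trend. Correlation only captures linear association, so the plots are still worth reading alongside the ranking.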

My favorite plot is the one showing MEDV vs. PTRATIO. It suggests that neighborhoods with lower home values tend to have more students per teacher. It could also indicate that there are fewer public schools in higher-income neighborhoods.

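I found the plot of MEDV vs. B interesting. B is defined as 1000(Bk-0.63)^2, where Bk is the proportion of Black residents by town. I don’t quite understand why the variable was constructed this way, but the formula is a parabola rather than a direct measure of Bk: B reaches its maximum of 396.9 at Bk = 0, and a town with Bk = 1 would only score 136.9. So the many high-MEDV towns clustered at high B values are towns with a low proportion of Black residents, not a high one. A better explanation of this formula and how it is derived is probably one thing missing from this dataset’s description.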
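
To get a feel for the formula 1000(Bk-0.63)^2, where Bk is the proportion of Black residents by town, it helps to evaluate it for a few proportions and then try to invert it. The sketch below is an illustration only; `b_from_bk` and `bk_candidates` are hypothetical helper names, not part of the dataset or any library.

```python
import math


def b_from_bk(bk):
    """Compute B = 1000 * (Bk - 0.63)**2 for a proportion Bk in [0, 1]."""
    if not 0.0 <= bk <= 1.0:
        raise ValueError("bk must be a proportion between 0 and 1")
    return 1000.0 * (bk - 0.63) ** 2


def bk_candidates(b, eps=1e-9):
    """Return every proportion in [0, 1] that would produce the given B."""
    if b < 0:
        raise ValueError("B cannot be negative")
    root = math.sqrt(b / 1000.0)
    found = set()
    for candidate in (0.63 - root, 0.63 + root):
        # Tolerate tiny floating-point overshoot at the ends of the range.
        if -eps <= candidate <= 1.0 + eps:
            found.add(min(max(candidate, 0.0), 1.0))
    return sorted(found)


for bk in (0.0, 0.25, 0.63, 1.0):
    print(f"Bk = {bk:.2f} -> B = {b_from_bk(bk):.2f}")

# A high B value from the data maps back to a single, small proportion.
print(bk_candidates(392.83))
# A low B value is ambiguous: two different proportions produce it.
print(bk_candidates(100.0))
```

Because the formula is a parabola centered at 0.63, B values above about 136.9 can only come from proportions below 0.63, while lower B values cannot be traced back to a single Bk.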

df['MEDV'].describe()
count    506.000000
mean      22.532806
std        9.197104
min        5.000000
25%       17.025000
50%       21.200000
75%       25.000000
max       50.000000
Name: MEDV, dtype: float64