For this drill, we will analyze Boston house-price data from 1978.
from sklearn.model_selection import train_test_split
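# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this import needs an older release.
from sklearn.datasets import load_boston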
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
from collections import Counter
from scipy.stats import ttest_ind
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
%matplotlib inline
boston = load_boston()
data = boston.data
target = boston.target
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target
print(df)
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0
5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3.0 222.0
6 0.08829 12.5 7.87 0.0 0.524 6.012 66.6 5.5605 5.0 311.0
7 0.14455 12.5 7.87 0.0 0.524 6.172 96.1 5.9505 5.0 311.0
8 0.21124 12.5 7.87 0.0 0.524 5.631 100.0 6.0821 5.0 311.0
9 0.17004 12.5 7.87 0.0 0.524 6.004 85.9 6.5921 5.0 311.0
10 0.22489 12.5 7.87 0.0 0.524 6.377 94.3 6.3467 5.0 311.0
11 0.11747 12.5 7.87 0.0 0.524 6.009 82.9 6.2267 5.0 311.0
12 0.09378 12.5 7.87 0.0 0.524 5.889 39.0 5.4509 5.0 311.0
13 0.62976 0.0 8.14 0.0 0.538 5.949 61.8 4.7075 4.0 307.0
14 0.63796 0.0 8.14 0.0 0.538 6.096 84.5 4.4619 4.0 307.0
15 0.62739 0.0 8.14 0.0 0.538 5.834 56.5 4.4986 4.0 307.0
16 1.05393 0.0 8.14 0.0 0.538 5.935 29.3 4.4986 4.0 307.0
17 0.78420 0.0 8.14 0.0 0.538 5.990 81.7 4.2579 4.0 307.0
18 0.80271 0.0 8.14 0.0 0.538 5.456 36.6 3.7965 4.0 307.0
19 0.72580 0.0 8.14 0.0 0.538 5.727 69.5 3.7965 4.0 307.0
20 1.25179 0.0 8.14 0.0 0.538 5.570 98.1 3.7979 4.0 307.0
21 0.85204 0.0 8.14 0.0 0.538 5.965 89.2 4.0123 4.0 307.0
22 1.23247 0.0 8.14 0.0 0.538 6.142 91.7 3.9769 4.0 307.0
23 0.98843 0.0 8.14 0.0 0.538 5.813 100.0 4.0952 4.0 307.0
24 0.75026 0.0 8.14 0.0 0.538 5.924 94.1 4.3996 4.0 307.0
25 0.84054 0.0 8.14 0.0 0.538 5.599 85.7 4.4546 4.0 307.0
26 0.67191 0.0 8.14 0.0 0.538 5.813 90.3 4.6820 4.0 307.0
27 0.95577 0.0 8.14 0.0 0.538 6.047 88.8 4.4534 4.0 307.0
28 0.77299 0.0 8.14 0.0 0.538 6.495 94.4 4.4547 4.0 307.0
29 1.00245 0.0 8.14 0.0 0.538 6.674 87.3 4.2390 4.0 307.0
.. ... ... ... ... ... ... ... ... ... ...
476 4.87141 0.0 18.10 0.0 0.614 6.484 93.6 2.3053 24.0 666.0
477 15.02340 0.0 18.10 0.0 0.614 5.304 97.3 2.1007 24.0 666.0
478 10.23300 0.0 18.10 0.0 0.614 6.185 96.7 2.1705 24.0 666.0
479 14.33370 0.0 18.10 0.0 0.614 6.229 88.0 1.9512 24.0 666.0
480 5.82401 0.0 18.10 0.0 0.532 6.242 64.7 3.4242 24.0 666.0
481 5.70818 0.0 18.10 0.0 0.532 6.750 74.9 3.3317 24.0 666.0
482 5.73116 0.0 18.10 0.0 0.532 7.061 77.0 3.4106 24.0 666.0
483 2.81838 0.0 18.10 0.0 0.532 5.762 40.3 4.0983 24.0 666.0
484 2.37857 0.0 18.10 0.0 0.583 5.871 41.9 3.7240 24.0 666.0
485 3.67367 0.0 18.10 0.0 0.583 6.312 51.9 3.9917 24.0 666.0
486 5.69175 0.0 18.10 0.0 0.583 6.114 79.8 3.5459 24.0 666.0
487 4.83567 0.0 18.10 0.0 0.583 5.905 53.2 3.1523 24.0 666.0
488 0.15086 0.0 27.74 0.0 0.609 5.454 92.7 1.8209 4.0 711.0
489 0.18337 0.0 27.74 0.0 0.609 5.414 98.3 1.7554 4.0 711.0
490 0.20746 0.0 27.74 0.0 0.609 5.093 98.0 1.8226 4.0 711.0
491 0.10574 0.0 27.74 0.0 0.609 5.983 98.8 1.8681 4.0 711.0
492 0.11132 0.0 27.74 0.0 0.609 5.983 83.5 2.1099 4.0 711.0
493 0.17331 0.0 9.69 0.0 0.585 5.707 54.0 2.3817 6.0 391.0
494 0.27957 0.0 9.69 0.0 0.585 5.926 42.6 2.3817 6.0 391.0
495 0.17899 0.0 9.69 0.0 0.585 5.670 28.8 2.7986 6.0 391.0
496 0.28960 0.0 9.69 0.0 0.585 5.390 72.9 2.7986 6.0 391.0
497 0.26838 0.0 9.69 0.0 0.585 5.794 70.6 2.8927 6.0 391.0
498 0.23912 0.0 9.69 0.0 0.585 6.019 65.3 2.4091 6.0 391.0
499 0.17783 0.0 9.69 0.0 0.585 5.569 73.5 2.3999 6.0 391.0
500 0.22438 0.0 9.69 0.0 0.585 6.027 79.7 2.4982 6.0 391.0
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0
PTRATIO B LSTAT MEDV
0 15.3 396.90 4.98 24.0
1 17.8 396.90 9.14 21.6
2 17.8 392.83 4.03 34.7
3 18.7 394.63 2.94 33.4
4 18.7 396.90 5.33 36.2
5 18.7 394.12 5.21 28.7
6 15.2 395.60 12.43 22.9
7 15.2 396.90 19.15 27.1
8 15.2 386.63 29.93 16.5
9 15.2 386.71 17.10 18.9
10 15.2 392.52 20.45 15.0
11 15.2 396.90 13.27 18.9
12 15.2 390.50 15.71 21.7
13 21.0 396.90 8.26 20.4
14 21.0 380.02 10.26 18.2
15 21.0 395.62 8.47 19.9
16 21.0 386.85 6.58 23.1
17 21.0 386.75 14.67 17.5
18 21.0 288.99 11.69 20.2
19 21.0 390.95 11.28 18.2
20 21.0 376.57 21.02 13.6
21 21.0 392.53 13.83 19.6
22 21.0 396.90 18.72 15.2
23 21.0 394.54 19.88 14.5
24 21.0 394.33 16.30 15.6
25 21.0 303.42 16.51 13.9
26 21.0 376.88 14.81 16.6
27 21.0 306.38 17.28 14.8
28 21.0 387.94 12.80 18.4
29 21.0 380.23 11.98 21.0
.. ... ... ... ...
476 20.2 396.21 18.68 16.7
477 20.2 349.48 24.91 12.0
478 20.2 379.70 18.03 14.6
479 20.2 383.32 13.11 21.4
480 20.2 396.90 10.74 23.0
481 20.2 393.07 7.74 23.7
482 20.2 395.28 7.01 25.0
483 20.2 392.92 10.42 21.8
484 20.2 370.73 13.34 20.6
485 20.2 388.62 10.58 21.2
486 20.2 392.68 14.98 19.1
487 20.2 388.22 11.45 20.6
488 20.1 395.09 18.06 15.2
489 20.1 344.05 23.97 7.0
490 20.1 318.43 29.68 8.1
491 20.1 390.11 18.07 13.6
492 20.1 396.90 13.35 20.1
493 19.2 396.90 12.01 21.8
494 19.2 396.90 13.59 24.5
495 19.2 393.29 17.60 23.1
496 19.2 396.90 21.14 19.7
497 19.2 396.90 14.10 18.3
498 19.2 396.90 12.92 21.2
499 19.2 395.77 15.10 17.5
500 19.2 396.90 14.33 16.8
501 21.0 391.99 9.67 22.4
502 21.0 396.90 9.08 20.6
503 21.0 396.90 5.64 23.9
504 21.0 393.45 6.48 22.0
505 21.0 396.90 7.88 11.9
[506 rows x 14 columns]
df.head()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
df.describe()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
mean | 3.593761 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
std | 8.596783 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
75% | 3.647423 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
df.corr(method='pearson')
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CRIM | 1.000000 | -0.199458 | 0.404471 | -0.055295 | 0.417521 | -0.219940 | 0.350784 | -0.377904 | 0.622029 | 0.579564 | 0.288250 | -0.377365 | 0.452220 | -0.385832 |
ZN | -0.199458 | 1.000000 | -0.533828 | -0.042697 | -0.516604 | 0.311991 | -0.569537 | 0.664408 | -0.311948 | -0.314563 | -0.391679 | 0.175520 | -0.412995 | 0.360445 |
INDUS | 0.404471 | -0.533828 | 1.000000 | 0.062938 | 0.763651 | -0.391676 | 0.644779 | -0.708027 | 0.595129 | 0.720760 | 0.383248 | -0.356977 | 0.603800 | -0.483725 |
CHAS | -0.055295 | -0.042697 | 0.062938 | 1.000000 | 0.091203 | 0.091251 | 0.086518 | -0.099176 | -0.007368 | -0.035587 | -0.121515 | 0.048788 | -0.053929 | 0.175260 |
NOX | 0.417521 | -0.516604 | 0.763651 | 0.091203 | 1.000000 | -0.302188 | 0.731470 | -0.769230 | 0.611441 | 0.668023 | 0.188933 | -0.380051 | 0.590879 | -0.427321 |
RM | -0.219940 | 0.311991 | -0.391676 | 0.091251 | -0.302188 | 1.000000 | -0.240265 | 0.205246 | -0.209847 | -0.292048 | -0.355501 | 0.128069 | -0.613808 | 0.695360 |
AGE | 0.350784 | -0.569537 | 0.644779 | 0.086518 | 0.731470 | -0.240265 | 1.000000 | -0.747881 | 0.456022 | 0.506456 | 0.261515 | -0.273534 | 0.602339 | -0.376955 |
DIS | -0.377904 | 0.664408 | -0.708027 | -0.099176 | -0.769230 | 0.205246 | -0.747881 | 1.000000 | -0.494588 | -0.534432 | -0.232471 | 0.291512 | -0.496996 | 0.249929 |
RAD | 0.622029 | -0.311948 | 0.595129 | -0.007368 | 0.611441 | -0.209847 | 0.456022 | -0.494588 | 1.000000 | 0.910228 | 0.464741 | -0.444413 | 0.488676 | -0.381626 |
TAX | 0.579564 | -0.314563 | 0.720760 | -0.035587 | 0.668023 | -0.292048 | 0.506456 | -0.534432 | 0.910228 | 1.000000 | 0.460853 | -0.441808 | 0.543993 | -0.468536 |
PTRATIO | 0.288250 | -0.391679 | 0.383248 | -0.121515 | 0.188933 | -0.355501 | 0.261515 | -0.232471 | 0.464741 | 0.460853 | 1.000000 | -0.177383 | 0.374044 | -0.507787 |
B | -0.377365 | 0.175520 | -0.356977 | 0.048788 | -0.380051 | 0.128069 | -0.273534 | 0.291512 | -0.444413 | -0.441808 | -0.177383 | 1.000000 | -0.366087 | 0.333461 |
LSTAT | 0.452220 | -0.412995 | 0.603800 | -0.053929 | 0.590879 | -0.613808 | 0.602339 | -0.496996 | 0.488676 | 0.543993 | 0.374044 | -0.366087 | 1.000000 | -0.737663 |
MEDV | -0.385832 | 0.360445 | -0.483725 | 0.175260 | -0.427321 | 0.695360 | -0.376955 | 0.249929 | -0.381626 | -0.468536 | -0.507787 | 0.333461 | -0.737663 | 1.000000 |
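The full matrix is hard to scan, so it can help to rank the features by the strength of their correlation with MEDV. The sketch below hard-codes the MEDV row of the matrix above rather than recomputing it, so it runs without the dataset; the names `medv_corr` and `ranked` are illustrative, not part of the original notebook.

```python
# Correlations with MEDV, copied from the MEDV row of the matrix above.
medv_corr = {
    "CRIM": -0.385832, "ZN": 0.360445, "INDUS": -0.483725, "CHAS": 0.175260,
    "NOX": -0.427321, "RM": 0.695360, "AGE": -0.376955, "DIS": 0.249929,
    "RAD": -0.381626, "TAX": -0.468536, "PTRATIO": -0.507787, "B": 0.333461,
    "LSTAT": -0.737663,
}

# Sort by absolute value so strong negative correlations rank alongside strong positive ones.
ranked = sorted(medv_corr.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, r in ranked:
    print(f"{name:8s} {r:+.3f}")
```

With the DataFrame loaded, the same ranking should be obtainable with `df.corr()['MEDV'].drop('MEDV').abs().sort_values(ascending=False)`.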
# Plot the lower triangle of the correlation matrix (the upper triangle is masked out).
# Variable pairs with a strong positive correlation appear red;
# pairs with a strong negative correlation appear blue.
corr = df.corr(method='pearson')
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in NumPy 1.24; use the built-in bool
mask[np.triu_indices_from(mask)] = True
fig, ax = plt.subplots(figsize=(10, 10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=0.3, center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.5})
<matplotlib.axes._subplots.AxesSubplot at 0x11a27ad30>
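The masking step in the cell above is easier to follow on a small matrix. This is a minimal sketch with a made-up 3x3 symmetric matrix (the values are illustrative, not taken from the dataset); it shows that `np.triu_indices_from` selects the diagonal and everything above it, leaving only the lower triangle visible.

```python
import numpy as np

# Toy symmetric "correlation" matrix; the values are invented for illustration.
toy = np.array([[1.0, 0.5, -0.2],
                [0.5, 1.0, 0.3],
                [-0.2, 0.3, 1.0]])

mask = np.zeros_like(toy, dtype=bool)      # start with nothing hidden
mask[np.triu_indices_from(mask)] = True    # hide the diagonal and the upper triangle

print(mask)

# The entries a masked heatmap would still draw: the strictly lower triangle.
visible = toy[~mask]
print(visible)
```

Because a correlation matrix is symmetric, hiding the upper triangle removes only duplicate information.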
correlation_matrix = df.corr().round(2)
sns.heatmap(data=correlation_matrix, annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x11a1bbf28>
We can see that several of the variables above have strong correlations (dark red). Now let’s plot the distribution of median home values.
#plt.figure(figsize=(8,8))
g = sns.histplot(df['MEDV'], kde=True, stat='density')  # distplot has been deprecated since seaborn 0.11
g.set(xlabel='MEDV', ylabel='', title='Distribution of Median Home Values')
plt.show()
Now let’s plot a scatter plot for each of the variables against the target variable (median home values).
#plt.figure(figsize=(8,8))
g = sns.regplot(x=df['CRIM'], y=df['MEDV'], data=df)
g.set(xlabel='CRIM', ylabel='MEDV', title='')
plt.show()
plt.figure(figsize=(8,8))
g = sns.regplot(x=df['ZN'], y=df['MEDV'], data=df)
g.set(xlabel='ZN', ylabel='MEDV', title='')
plt.show()
plt.figure(figsize=(8,8))
g = sns.regplot(x=df['INDUS'], y=df['MEDV'], data=df)
g.set(xlabel='INDUS', ylabel='MEDV', title='')
plt.show()
plt.figure(figsize=(8,8))
g = sns.regplot(x=df['CHAS'], y=df['MEDV'], data=df)
g.set(xlabel='CHAS', ylabel='MEDV', title='')
plt.show()
plt.figure(figsize=(8,8))
g = sns.regplot(x=df['NOX'], y=df['MEDV'], data=df)
g.set(xlabel='NOX', ylabel='MEDV', title='')
plt.show()
#plt.figure(figsize=(8,8))
g = sns.regplot(x=df['RM'], y=df['MEDV'], data=df)
g.set(xlabel='RM', ylabel='MEDV', title='')
plt.show()
plt.figure(figsize=(8,8))
g = sns.regplot(x=df['AGE'], y=df['MEDV'], data=df)
g.set(xlabel='AGE', ylabel='MEDV', title='')
plt.show()
plt.figure(figsize=(8,8))
g = sns.regplot(x=df['DIS'], y=df['MEDV'], data=df)
g.set(xlabel='DIS', ylabel='MEDV', title='')
plt.show()
plt.figure(figsize=(8,8))
g = sns.regplot(x=df['RAD'], y=df['MEDV'], data=df)
g.set(xlabel='RAD', ylabel='MEDV', title='')
plt.show()
plt.figure(figsize=(8,8))
g = sns.regplot(x=df['TAX'], y=df['MEDV'], data=df)
g.set(xlabel='TAX', ylabel='MEDV', title='')
plt.show()
plt.figure(figsize=(8,8))
g = sns.regplot(x=df['PTRATIO'], y=df['MEDV'], data=df)
g.set(xlabel='PTRATIO', ylabel='MEDV', title='')
plt.show()
plt.figure(figsize=(8,8))
g = sns.regplot(x=df['B'], y=df['MEDV'], data=df)
g.set(xlabel='B', ylabel='MEDV', title='')
plt.show()
plt.figure(figsize=(8,8))
g = sns.regplot(x=df['LSTAT'], y=df['MEDV'], data=df)
g.set(xlabel='LSTAT', ylabel='MEDV', title='')
plt.show()
The above scatter plots reveal some key insights about this data, highlighted below:
-Lower crime rates are associated with higher median home value
-Higher proportions of non-retail business acres are associated with lower median home value
-More rooms are associated with higher median home value
-With some exceptions, lower pupil-teacher ratios are associated with higher median home value
-The strongest correlation is between MEDV and LSTAT (r = -0.74): a lower percentage of lower-status population is strongly associated with higher median home value
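The correlations cited in this list are Pearson coefficients. As a rough illustration of what that number measures, the sketch below computes it by hand for LSTAT and MEDV using only the five rows shown in the `df.head()` output. Five rows are far too few to reproduce the full-dataset value, so the result is only expected to share its sign; the helper `pearson_r` is written here for illustration and is not part of the original notebook.

```python
from math import sqrt

# First five rows of LSTAT and MEDV, copied from the df.head() output.
lstat = [4.98, 9.14, 4.03, 2.94, 5.33]
medv = [24.0, 21.6, 34.7, 33.4, 36.2]

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r = pearson_r(lstat, medv)
print(f"Pearson r for LSTAT vs. MEDV on five rows: {r:.3f}")
```

A negative value means that MEDV tends to fall as LSTAT rises, which is the direction the regression line in the LSTAT plot shows.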
My favorite plot is the one showing MEDV vs. PTRATIO. It suggests that neighborhoods with lower home values tend to have more students per teacher. It could also indicate that there are fewer public schools in higher-income neighborhoods.
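I found the plot of MEDV vs. B interesting. B is defined as 1000(Bk-0.63)^2, where Bk is the proportion of Black residents by town. Because the formula is quadratic, B is not a direct measure of that proportion: it peaks at 396.9 when Bk = 0, falls to 0 at Bk = 0.63, and rises only to about 136.9 at Bk = 1. The many high-MEDV points clustered near B = 396.9 therefore correspond to towns with a very low proportion of Black residents, not a high one. A better explanation of this formula and how it is derived is probably one thing missing from this dataset’s description.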
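To make the B transformation concrete, the short sketch below evaluates 1000(Bk-0.63)^2 at a few proportions. It takes Bk to be the town's proportion of Black residents, as in the dataset description; the helper name `b_feature` is hypothetical.

```python
def b_feature(bk: float) -> float:
    """Evaluate B = 1000 * (Bk - 0.63)**2 for a proportion Bk between 0 and 1."""
    return 1000 * (bk - 0.63) ** 2

# The formula is a parabola with its minimum at Bk = 0.63,
# so B is largest at Bk = 0 and smaller again at Bk = 1.
for bk in (0.0, 0.25, 0.63, 1.0):
    print(f"Bk = {bk:.2f} -> B = {b_feature(bk):.2f}")
```

The value at Bk = 0 matches the maximum of 396.90 reported for B in the `df.describe()` output, and any B above roughly 136.9 can only come from a town with Bk below 0.63.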
df['MEDV'].describe()
count 506.000000
mean 22.532806
std 9.197104
min 5.000000
25% 17.025000
50% 21.200000
75% 25.000000
max 50.000000
Name: MEDV, dtype: float64