California housing I - Feature selection and data exploration
Welcome to the first article of a series that demonstrates a simple “pipeline” for cutting through data in Python with the help of some of its friends, such as Pandas and Scikit-learn.
We’ll see some ways to gain simple insight by looking at a few summary statistics of the data. We’ll also try to select or discard variables according to their importance for our target. A related process we’ll see is how to project the data into fewer dimensions while keeping all of its information (or the vast majority of it). To further understand why these methods give the output they do, we’ll do some plotting of the variables.
As an excuse for applying these techniques, we’ll try to predict the value of a property in California with the California Housing dataset, from the US Census of 1990. It’s already uploaded into Google’s Colab environment so you can start playing with it very easily. At the end of this series I’ll provide a full notebook so you can go crazy with it.
Data read
Let’s begin by looking at a summary of the sample distribution of each variable:
import pandas as pd

home_data = pd.read_csv("sample_data/california_housing_train.csv")
home_data.describe()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 |
| mean | -119.562108 | 35.625225 | 28.589353 | 2643.664412 | 539.410824 | 1429.573941 | 501.221941 | 3.883578 | 207300.912353 |
| std | 2.005166 | 2.137340 | 12.586937 | 2179.947071 | 421.499452 | 1147.852959 | 384.520841 | 1.908157 | 115983.764387 |
| min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
| 25% | -121.790000 | 33.930000 | 18.000000 | 1462.000000 | 297.000000 | 790.000000 | 282.000000 | 2.566375 | 119400.000000 |
| 50% | -118.490000 | 34.250000 | 29.000000 | 2127.000000 | 434.000000 | 1167.000000 | 409.000000 | 3.544600 | 180400.000000 |
| 75% | -118.000000 | 37.720000 | 37.000000 | 3151.250000 | 648.250000 | 1721.000000 | 605.250000 | 4.767000 | 265000.000000 |
| max | -114.310000 | 41.950000 | 52.000000 | 37937.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
Let’s give a short description of the variables, as given in the full dataset link: each observation corresponds to a block in California; longitude and latitude are self-explanatory; housing_median_age is the median age of a house in the block; total_rooms is the number of rooms in that block and, accordingly, total_bedrooms is the number of bedrooms in it; population is the number of people living in the block; households is the total number of households within the block; median_income is the median income within the block (in tens of thousands of US dollars); median_house_value is the median price for a property in the block (in US dollars).
In the table we can see that the variables total_rooms, population, households and total_bedrooms have quite large ranges of values. We can also see that the minima or maxima of several variables are suspiciously pretty numbers (for example, median_house_value tops out at exactly 500,001 and median_income at 15.0001), which hints at some capping or replacement of extreme values.
In median_house_value, our target variable, we can see that almost half of the full range of values lies in the top quartile, while the median is well below 250,000. That suggests the distribution is skewed toward lower prices, with a long tail toward higher ones. Let’s look at its mode:
home_data.median_house_value.mode()
0 500001.0
dtype: float64
This doesn’t tell us much about the skewness, but it raises an interesting question: why would the mode be exactly the maximum value and, still, so far from the sample mean and median? We’ll dig into that later on.
To get a first look at the relationships between the variables, let’s compute the pairwise correlation coefficients.
home_data.corr()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| longitude | 1.000000 | -0.925208 | -0.114250 | 0.047010 | 0.071802 | 0.101674 | 0.059628 | -0.015485 | -0.044982 |
| latitude | -0.925208 | 1.000000 | 0.016454 | -0.038773 | -0.069373 | -0.111261 | -0.074902 | -0.080303 | -0.144917 |
| housing_median_age | -0.114250 | 0.016454 | 1.000000 | -0.360984 | -0.320434 | -0.295890 | -0.302754 | -0.115932 | 0.106758 |
| total_rooms | 0.047010 | -0.038773 | -0.360984 | 1.000000 | 0.928403 | 0.860170 | 0.919018 | 0.195383 | 0.130991 |
| total_bedrooms | 0.071802 | -0.069373 | -0.320434 | 0.928403 | 1.000000 | 0.881169 | 0.980920 | -0.013495 | 0.045783 |
| population | 0.101674 | -0.111261 | -0.295890 | 0.860170 | 0.881169 | 1.000000 | 0.909247 | -0.000638 | -0.027850 |
| households | 0.059628 | -0.074902 | -0.302754 | 0.919018 | 0.980920 | 0.909247 | 1.000000 | 0.007644 | 0.061031 |
| median_income | -0.015485 | -0.080303 | -0.115932 | 0.195383 | -0.013495 | -0.000638 | 0.007644 | 1.000000 | 0.691871 |
| median_house_value | -0.044982 | -0.144917 | 0.106758 | 0.130991 | 0.045783 | -0.027850 | 0.061031 | 0.691871 | 1.000000 |
For instance, we could do a reading like this: the columns for total_rooms, total_bedrooms, population and households show that these variables are strongly correlated with one another, while none of them explains much of median_house_value.
As we’re interested in which variables are able to predict or explain median_house_value, we’ll focus on its column. However, some correlated variables might help during training but then prove to be an obstacle for generalization during testing or after deployment (a.k.a. overfitting). A good example could be latitude: while it might look like a reasonable choice (in addition to others) in the correlation table, it could lead to overfitting or even data mismatch. Suppose new data arrives about a real estate development in a location that used to have cheaper properties but has since seen an increase in their value; a model previously trained with latitude as a feature will generalize poorly there. So latitude might help in the short term, but in the long run it is intuitively a poor choice if the model is not continuously updated.
Analyzing the median_house_value column, median_income is the most correlated variable, followed by latitude, total_rooms and housing_median_age.
Pre-processing
Normalization
Normalizing/standardizing is useful for a lot of different algorithms, and for some it’s even strictly necessary. For instance, support vector machines, which we intend to use, work better on standardized data. Intuitively, this makes sense: if we have a variable with a variance of, say, 1000 that has little importance for the target and another one with a variance of 1 that matters much more for the task, the difference in scale will dominate the solution and yield poor support vectors (to put it a little extremely). In practice, not scaling the variables can also considerably slow down training.
There are many ways to scale data, but we’ll stick with standardization to zero mean and unit variance (the standard normal).
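In other words (the standard z-score formula), each variable $x$ is replaced by

$$ z = \frac{x - \mu}{\sigma}, $$

where $\mu$ and $\sigma$ are the sample mean and standard deviation of that variable.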
df_normal = home_data.copy(deep=True)
μ = pd.Series({col: 0 for col in home_data.columns}, dtype="float32")
σ = pd.Series({col: 0 for col in home_data.columns}, dtype="float32")

# Standardize each column: subtract its mean and divide by its standard deviation
for col in home_data.columns:
    μ[col] = home_data[col].mean()
    σ[col] = home_data[col].std()
    df_normal[col] = (home_data[col] - μ[col]) / σ[col]

df_normal.head(15)
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.619289 | -0.671501 | -1.079639 | 1.361655 | 1.764152 | -0.361173 | -0.075996 | -1.252506 | -1.210522 |
| 1 | 2.539496 | -0.573248 | -0.761850 | 2.296540 | 3.230346 | -0.261858 | -0.099402 | -1.081451 | -1.096713 |
| 2 | 2.494612 | -0.905436 | -0.920745 | -0.882436 | -0.866931 | -0.955326 | -0.999223 | -1.170071 | -1.048430 |
| 3 | 2.489624 | -0.928830 | -1.159087 | -0.524171 | -0.480216 | -0.796769 | -0.715753 | -0.362590 | -1.154480 |
| 4 | 2.489624 | -0.961581 | -0.682402 | -0.545731 | -0.506313 | -0.701809 | -0.622130 | -1.026424 | -1.222593 |
| 5 | 2.484637 | -0.933509 | 0.032625 | -0.576466 | -0.719837 | -0.660863 | -0.681945 | -0.282879 | -1.149307 |
| 6 | 2.484637 | -0.942866 | -0.285165 | 0.120799 | 0.333545 | 0.358431 | 0.342707 | -0.632431 | -1.076883 |
| 7 | 2.479650 | -0.372063 | 0.985994 | -0.840233 | -0.881166 | -0.918736 | -0.892596 | -1.139989 | -1.369165 |
| 8 | 2.479650 | -0.942866 | 0.429862 | 0.984123 | 1.507924 | 1.484882 | 1.442778 | -0.893731 | -1.283808 |
| 9 | 2.474663 | -0.372063 | 1.383231 | -0.526006 | -0.546646 | -0.559805 | -0.598724 | -0.887127 | -1.372614 |
| 10 | 2.474663 | -0.938187 | -1.000192 | 0.503377 | 0.620616 | 0.875048 | 0.839429 | -0.630912 | -1.041533 |
| 11 | 2.474663 | -0.947545 | -0.602955 | -0.300771 | -0.133834 | -0.215684 | -0.167018 | -1.183644 | -1.252769 |
| 12 | 2.469676 | -0.367384 | 1.542126 | -0.620503 | -0.691367 | -0.740142 | -0.754763 | -0.904788 | -1.368303 |
| 13 | 2.469676 | -0.372063 | 0.191520 | -0.075995 | -0.178911 | -0.072809 | -0.057791 | -0.351951 | -1.180345 |
| 14 | 2.459702 | -1.340557 | -1.079639 | -0.548483 | -0.382944 | -0.418672 | -0.523306 | -1.585341 | -1.399342 |
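As a side note, the same standardization can be sketched with scikit-learn’s StandardScaler instead of the manual loop above; the only (tiny) difference is that StandardScaler divides by the population standard deviation (ddof=0), while pandas’ .std() uses ddof=1. A minimal sketch, assuming the same home_data DataFrame:

from sklearn.preprocessing import StandardScaler

# Fit on the raw data and return a standardized copy with the same column names
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(home_data),
                         columns=home_data.columns)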
Feature selection
There are several methods for feature selection. First we’ll work with univariate tests, scoring each feature individually. Later we may even get into recursive feature elimination, in which a model is trained with all the variables and then pruned by getting rid of the least important ones.
For classification, possible univariate statistical tests are the χ-squared test, mutual information and the ANOVA F-test; for regression, we again have an F-test and mutual information. We’ll feed them into SelectKBest(score_func, k=...) as scoring functions.
The χ-squared test of independence is used to determine whether there is a considerable association between two nominal variables, by analyzing whether the frequencies of one variable’s classes vary across the classes of the other. It’s therefore used for classification feature selection. It requires the features to be non-negative, as it works with contingency tables.
Mutual information measures the mutual dependence between two variables by quantifying the amount of information obtained about one by observing the other. It is calculated as the KL divergence between the joint distribution and the product of the marginal distributions.
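In symbols, for two random variables $X$ and $Y$ with joint distribution $p(x, y)$ and marginals $p(x)$ and $p(y)$:

$$ I(X;Y) = D_{\mathrm{KL}}\big(P_{XY}\,\|\,P_X P_Y\big) = \sum_{x,\,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} $$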
The ANOVA F-test compares the variance of a variable between the groups defined by the values of another variable with the variance within those groups. If the ratio is significant, then some values of the second variable are responsible for a considerable part of the variance of the first one, while other values are less so.
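In its classic one-way form, with $K$ groups of sizes $n_i$, group means $\bar{x}_i$, overall mean $\bar{x}$ and $N$ samples in total, the statistic is exactly that ratio of between-group to within-group variability:

$$ F = \frac{\sum_{i=1}^{K} n_i (\bar{x}_i - \bar{x})^2 \,/\, (K-1)}{\sum_{i=1}^{K} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 \,/\, (N-K)} $$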
Methods based on the F-test estimate linear dependency between variables and are therefore prone to disregard non-linear relationships. Mutual information does not have this problem, but it requires more samples to work properly. Since $χ^2$ works with nominal features, we won’t be using it for now, but we’ll compare the other two.
from sklearn.feature_selection import SelectKBest, f_regression

anova_filter_reg = SelectKBest(f_regression, k=5)
anova_filter_reg.fit(df_normal.drop(["median_house_value"], axis=1),
                     df_normal["median_house_value"])

for i in anova_filter_reg.get_support(indices=True):
    print(df_normal.columns[i])
latitude
housing_median_age
total_rooms
households
median_income
This tells us that the important variables according to the F-test are latitude, housing_median_age, total_rooms, households and median_income, which concurs with what we saw when looking at the correlation table.
from sklearn.feature_selection import mutual_info_regression

mi_filter_reg = SelectKBest(mutual_info_regression, k=5)
mi_filter_reg.fit(df_normal.drop(["median_house_value"], axis=1),
                  df_normal["median_house_value"])

for i in mi_filter_reg.get_support(indices=True):
    print(df_normal.columns[i])
longitude
latitude
housing_median_age
total_rooms
median_income
We see that four out of the five selected features match; the two selectors differ on households vs. longitude. We anticipated that mutual information can pick up non-linear relationships while the F-test can’t, so that could be one of the reasons they don’t agree.
PCA decomposition
Another way to reduce the dimensionality of data is principal component analysis (PCA). It is an unsupervised method that projects the data onto new axes along which it has the most variance. It does so by computing the eigendecomposition of the covariance matrix and keeping only the eigenvectors along which the data has the most variance (usually enough of them to retain around 99% of it).
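To make that concrete, here is a minimal numpy sketch of the idea (just an illustration; below we use scikit-learn’s PCA, which achieves the same thing via an SVD):

import numpy as np

X = df_normal.drop(["median_house_value"], axis=1).to_numpy()

# Covariance matrix of the (already centered) data and its eigendecomposition
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: the covariance matrix is symmetric

# Sort by decreasing eigenvalue and keep enough axes to explain 99% of the variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), 0.99) + 1

X_projected = X @ eigvecs[:, :k]              # project onto the top-k principal axes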
Instantiation
from sklearn.decomposition import PCA

pca = PCA(n_components=0.99, svd_solver='auto')
Fitting
pca.fit(df_normal.drop(["median_house_value"], axis=1))
PCA(copy=True, iterated_power='auto', n_components=0.99, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
We can use the pca.components_ attribute to see which eigenvectors were found, which might be useful for plotting.
print(pca.components_)
[[ 0.07904068 -0.07631081 -0.2181667 0.48316747 0.49022067 0.47243444 0.49156034 0.04170786]
[-0.70080222 0.7013323 0.02073468 0.07747009 0.06348259 0.02977278 0.06579091 -0.03650765]
[-0.05845201 0.01421506 -0.38966987 0.09753053 -0.11547515 -0.11482791 -0.10743051 0.89272908]
[-0.06642402 -0.1028198 0.88809884 0.11535503 0.06358447 0.08414255 0.09701523 0.40305669]
[-0.10510948 -0.05497627 -0.03609806 -0.30865347 -0.37872447 0.84925327 -0.14741019 0.05446535]
[ 0.47261304 0.45743518 0.09126113 0.5681172 -0.24096963 0.12581277 -0.39828178 -0.06148691]]
We can also see why they were chosen with the variance explained by each of them:
print(f"Explained variance in data:{pca.explained_variance_},\nRatio:{pca.explained_variance_ratio_}")
Explained variance in data:[3.91167716 1.90686359 1.07259181 0.82308895 0.14375284 0.08117511],
Ratio:[0.48895963 0.23835794 0.13407397 0.10288612 0.0179691 0.01014689]
This means that, with the data normalized, the first eigenvector accounts for almost 49% of the variance, and we need six orthogonal axes, i.e. six dimensions, to explain 99% of the variance in the data.
When we’re ready to project the data into the new axes, we do it with the following code:
df_pca = pca.transform(df_normal.drop(["median_house_value"], axis=1))
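pca.transform returns a plain NumPy array; a quick sanity check, and an optional wrap back into a labelled DataFrame, could look like this (the pc column names are just made up for convenience):

print(df_pca.shape)   # expected (17000, 6) for the fit shown above

df_pca = pd.DataFrame(df_pca,
                      columns=[f"pc{i}" for i in range(pca.n_components_)])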
Visualization
Given our top features selected with the different methods and our new dataset with reduced dimensionality, let’s see how they’re distributed and how they relate to our target variable.
Value distribution
First, let’s see what a histogram says about the distribution of values in median_house_value.
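The plotting code is not shown here; one plausible way to draw it, mirroring the cufflinks calls used for the later plots (and assuming cufflinks has been imported and set up), would be:

home_data["median_house_value"].iplot(kind="histogram",
                                      bins=100,
                                      histnorm="probability",
                                      filename="cufflinks/histogram")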
We can see that it confirms our suspicions about the skewness of the data, but there’s a peculiar spike at the maximum value. This could be due to replacement of outliers and extreme values.
Households
Distribution
Let’s see the distribution of households.
We take the logarithm of the values to get rid of the long tail.
import numpy as np

pd.DataFrame(data=np.log10(home_data["households"])) \
  .iplot(kind="histogram",
         bins=100,
         histnorm="probability",
         filename="cufflinks/histogram")
We can see that the log-distribution looks like a mostly symmetric Gaussian bell. This means that the number of households per block (as per the dataset description) hovers around 500.
Households vs. price
Let’s see what the relation between the value and the households variable looks like with a scatter plot.
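The original plotting code isn’t included; a hedged sketch with cufflinks (the filename is just illustrative) could be:

home_data.iplot(kind="scatter",
                mode="markers",
                x="households",
                y="median_house_value",
                filename="cufflinks/simple-scatter")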
The samples crowd into one cluster, with some cases set apart toward higher values and fewer households.
House age
Distribution
Let’s see a histogram for the housing_median_age variable. Remember it’s one of the top correlated variables.
Again we find that outliers appear to have been replaced with the maximum value of 52. There are several prominent modes, but no obvious skewness.
Age vs. price
To the naked eye, the relationship between age and price seems mostly random. Some obvious remarks: the age variable is not continuous but takes natural numbers (years rather than dates); extreme prices (≥ $500,000) are more frequent for houses at least 15 years old; there seem to be considerably fewer data points for houses younger than 12 years; and the distribution of price along all years seems skewed toward cheaper values. It also looks like the value has higher variance, or less skewness, in some years, perhaps with a certain periodicity. Another noticeable thing is that ages above 52 appear to have been capped at 52.
Bedrooms
Distribution
We took the logarithm to get rid of the long tail. Values of the original variable are centered around approximately 500 bedrooms per block (as we saw in the first table), with more observations toward the lower end.
Price vs. bedrooms
Here we have a distribution quite similar to the plot of households vs. median_house_value.
Rooms
Distribution
There’s considerable skewness in total_rooms, so let’s try with log(total_rooms).
Value vs. rooms
We can see that the samples make a big cluster with some outliers but without any obvious relation between the variables.
Income
Distribution
This was the variable most correlated with the value in the correlation table; let’s see why.
Nothing too surprising here. Again, there seems to be some replacement of the outliers. Let’s take a natural logarithm.
Income vs. price
Scatter plot
Here we see more of an influence on the price: the positive correlation is evident. The importance given to this variable by the feature selection methods and the correlation analysis makes sense.
Joint distribution
Let’s look at a simple sample joint distribution, without any kernel smoothing, just the raw densities.
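The figure itself isn’t reproduced here; assuming it shows median_income against median_house_value, one way to build such a binned joint density (using plotly directly, which may differ from whatever produced the original figure) is:

import plotly.graph_objects as go

fig = go.Figure(go.Histogram2d(x=home_data["median_income"],
                               y=home_data["median_house_value"],
                               histnorm="probability density",
                               nbinsx=80, nbinsy=80))
fig.show()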
t-SNE
Lastly, t-distributed Stochastic Neighbor Embedding is a good resource for the visualization of high-dimensional data. As its —rather cryptic— name suggests, it embeds data into lower dimensions. It does so by converting similarities between data points to joint probabilities and then trying to minimize the KL divergence between these joint probabilities in the embedded and the high-dimensional data, via gradient descent. It focuses on the local structure of the data so it tends to cluster the samples in the embedded space, which is what we want to use it for.
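In symbols, with $p_{ij}$ the pairwise similarities in the original space and $q_{ij}$ the corresponding similarities in the embedding, t-SNE minimizes

$$ C = \mathrm{KL}(P\,\|\,Q) = \sum_{i \neq j} p_{ij} \log\frac{p_{ij}}{q_{ij}} $$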
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2,
            perplexity=30,
            early_exaggeration=20.0,
            init='pca',
            random_state=10, # for reproducibility
            verbose=1)       # for control freaks

# Bin the target into quantiles; this column is only used for colouring the plots
df_normal["quantile"] = pd.qcut(home_data["median_house_value"],
                                [0, .25, .5, .75, .9, 1.],
                                labels=False)

# Drop the target and its binned version so neither leaks into the embedding
df_tsne = tsne.fit_transform(df_normal.drop(["median_house_value", "quantile"],
                                            axis=1))
df_tsne = pd.DataFrame(data={"col0": df_tsne[:, 0],
                             "col1": df_tsne[:, 1]})
df_tsne["value"] = home_data["median_house_value"]
df_tsne["income"] = home_data["median_income"]
df_tsne["quantile"] = df_normal["quantile"]

sample_tsne = df_tsne.sample(frac=0.01)
sample_tsne["quantile"] = sample_tsne["quantile"].astype('float16') # hot fix for cufflinks

sample_tsne.iplot(kind="bubble",
                  size="income",
                  categories="quantile",
                  text="income",
                  x="col0", y="col1",
                  filename="cufflinks/simple-bubble-chart")
We can see from the color of the bubbles (given by the forced category) that the algorithm could more or less find layers of prices in just two dimensions. However, it’s visible that it still struggles to make fully non-overlapping clusters for the chosen categories (at least with this hyperparameter setup).
Let’s try with 3 components.
tsne3 = TSNE(n_components=3,
             perplexity=30,
             early_exaggeration=20.0,
             init='pca',
             random_state=10)

# Again, keep the target and its binned version out of the embedding
df_tsne3 = tsne3.fit_transform(df_normal.drop(["median_house_value", "quantile"],
                                              axis=1))
df_tsne3 = pd.DataFrame(data={"col0": df_tsne3[:, 0],
                              "col1": df_tsne3[:, 1],
                              "col2": df_tsne3[:, 2]})
df_tsne3["value"] = home_data["median_house_value"]
df_tsne3["income"] = home_data["median_income"]
df_tsne3["quantile"] = df_normal["quantile"]

sample_tsne3 = df_tsne3.sample(frac=0.01)
sample_tsne3["quantile"] = sample_tsne3["quantile"].astype('float16')

sample_tsne3.iplot(kind="scatter3d",
                   mode="markers",
                   categories="quantile",
                   text="income",
                   x="col0", y="col1", z='col2')
At first glance, the 3D scatter plot doesn’t say much, but playing with it interactively, deactivating different classes, rotating the axes and so on, shows that t-SNE found a way to make the classes almost linearly separable. To do this, select the “turntable rotation” tool in the upper right corner if you are on a computer, or open the plot on the plotly website and do the same if you are on your phone or tablet.
Conclusions
That wraps up this edition. We’ve covered how to get some insights into the data with a simple summary from Pandas’ describe() method and how to get a quick idea of whether each variable is linearly correlated with the target using the .corr() method; we normalized and standardized the data; we saw different methods in scikit-learn for selecting important features; we learned how to project the data into a lower dimensionality while keeping most of its variance with PCA; and we made different plots to gain some intuition into why those methods did what they did.
Next week, I’ll do some comparisons between different models.