California housing I - Feature selection and data exploration
Welcome to the first article of a series that demonstrates a simple “pipeline” for cutting through data in Python with the help of some of its friends, such as Pandas and Scikit-learn.
We’ll see some ways to gain simple insight by looking at a few summary statistics of the data. We’ll also try to select or discard variables according to their importance for our target. A related process we’ll see is how to project the data into fewer dimensions while keeping all of its information (or the vast majority of it). To further understand why these methods give the output they do, we’ll do some plotting of the variables.
As an excuse for applying these techniques, we’ll try to predict the value of a property in California with the California Housing dataset, from the US Census of 1990. It’s already uploaded into Google’s Colab environment so you can start playing with it very easily. At the end of this series I’ll provide a full notebook so you can go crazy with it.
Data read
Let’s begin by looking at a summary of the sample distribution of each variable:
import pandas as pd

home_data = pd.read_csv("sample_data/california_housing_train.csv")
home_data.describe()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 |
| mean | -119.562108 | 35.625225 | 28.589353 | 2643.664412 | 539.410824 | 1429.573941 | 501.221941 | 3.883578 | 207300.912353 |
| std | 2.005166 | 2.137340 | 12.586937 | 2179.947071 | 421.499452 | 1147.852959 | 384.520841 | 1.908157 | 115983.764387 |
| min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
| 25% | -121.790000 | 33.930000 | 18.000000 | 1462.000000 | 297.000000 | 790.000000 | 282.000000 | 2.566375 | 119400.000000 |
| 50% | -118.490000 | 34.250000 | 29.000000 | 2127.000000 | 434.000000 | 1167.000000 | 409.000000 | 3.544600 | 180400.000000 |
| 75% | -118.000000 | 37.720000 | 37.000000 | 3151.250000 | 648.250000 | 1721.000000 | 605.250000 | 4.767000 | 265000.000000 |
| max | -114.310000 | 41.950000 | 52.000000 | 37937.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
Let’s give a short description of the variables, as given in the full dataset link: each observation corresponds to a block in California; longitude and latitude are self-explanatory; housing_median_age is the median age of a house in the block; total_rooms is the number of rooms in that block and, accordingly, total_bedrooms is the number of bedrooms in it; population is the number of people living in the block; households is the total number of households within the block; median_income is the median income within the block (in tens of thousands of US dollars); median_house_value is the median price for a property in the block (in US dollars).
In the table we can see that the variables total_rooms, population, households and total_bedrooms have quite large ranges of values. We can also see that the minima or maxima of several variables are suspiciously pretty numbers (for example, median_house_value tops out at exactly 500,001 and median_income at 15.0001), which hints at some capping or replacement of extreme values.
In median_house_value, our target variable, we can see that almost half of the full range of values lies in the top quartile, while the median is well below 250,000. That suggests the distribution is skewed toward lower prices, with a long tail toward higher ones. Let’s look at its mode:
home_data.median_house_value.mode()
0 500001.0
dtype: float64
This doesn’t tell us much about the skewness, but it raises an interesting question: why would the mode be exactly the maximum value and, still, so far from the sample mean and median? We’ll dig into that later on.
To get a first look at the relationships between the variables, let’s compute the pairwise correlation coefficients.
home_data.corr()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| longitude | 1.000000 | -0.925208 | -0.114250 | 0.047010 | 0.071802 | 0.101674 | 0.059628 | -0.015485 | -0.044982 |
| latitude | -0.925208 | 1.000000 | 0.016454 | -0.038773 | -0.069373 | -0.111261 | -0.074902 | -0.080303 | -0.144917 |
| housing_median_age | -0.114250 | 0.016454 | 1.000000 | -0.360984 | -0.320434 | -0.295890 | -0.302754 | -0.115932 | 0.106758 |
| total_rooms | 0.047010 | -0.038773 | -0.360984 | 1.000000 | 0.928403 | 0.860170 | 0.919018 | 0.195383 | 0.130991 |
| total_bedrooms | 0.071802 | -0.069373 | -0.320434 | 0.928403 | 1.000000 | 0.881169 | 0.980920 | -0.013495 | 0.045783 |
| population | 0.101674 | -0.111261 | -0.295890 | 0.860170 | 0.881169 | 1.000000 | 0.909247 | -0.000638 | -0.027850 |
| households | 0.059628 | -0.074902 | -0.302754 | 0.919018 | 0.980920 | 0.909247 | 1.000000 | 0.007644 | 0.061031 |
| median_income | -0.015485 | -0.080303 | -0.115932 | 0.195383 | -0.013495 | -0.000638 | 0.007644 | 1.000000 | 0.691871 |
| median_house_value | -0.044982 | -0.144917 | 0.106758 | 0.130991 | 0.045783 | -0.027850 | 0.061031 | 0.691871 | 1.000000 |
For instance, we could do a reading like this: the columns for total_rooms, total_bedrooms, population and households show that these variables are strongly correlated with one another, while none of them explains much of median_house_value.
As we’re interested in which variables are able to predict or explain median_house_value, we’ll focus on its column. However, some correlated variables might help during training but then prove to be an obstacle for generalization during testing or after deployment (a.k.a. overfitting). A good example could be latitude: while it might look like a reasonable choice (in addition to others) in the correlation table, it could lead to overfitting or even data mismatch. Suppose new data arrives about a real estate development in a location that used to have cheaper properties but has since seen an increase in their value; a model previously trained with latitude as a feature will generalize poorly there. So latitude might help in the short term, but in the long run it is intuitively a poor choice if the model is not continuously updated.
Analyzing the median_house_value column, median_income is the most correlated variable, followed by latitude, total_rooms and housing_median_age.
Pre-processing
Normalization
Normalizing/standardizing is useful for a lot of different algorithms, and for some it’s even strictly necessary. For instance, support vector machines, which we intend to use, work better on standardized data. Intuitively, this makes sense: if we have a variable with a variance of, say, 1000 that has little importance for the target and another one with a variance of 1 that matters much more for the task, the difference in scale will dominate the solution and yield poor support vectors (to put it a little extremely). In practice, not scaling the variables can also considerably slow down training.
There are many ways to scale data, but we’ll stick with standardization to zero mean and unit variance (the standard normal).
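In other words (the standard z-score formula), each variable $x$ is replaced by

$$ z = \frac{x - \mu}{\sigma}, $$

where $\mu$ and $\sigma$ are the sample mean and standard deviation of that variable.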
df_normal = home_data.copy(deep=True)
μ = pd.Series({col: 0 for col in home_data.columns}, dtype="float32")
σ = pd.Series({col: 0 for col in home_data.columns}, dtype="float32")

# Standardize each column: subtract its mean and divide by its standard deviation
for col in home_data.columns:
    μ[col] = home_data[col].mean()
    σ[col] = home_data[col].std()
    df_normal[col] = (home_data[col] - μ[col]) / σ[col]

df_normal.head(15)
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.619289 | -0.671501 | -1.079639 | 1.361655 | 1.764152 | -0.361173 | -0.075996 | -1.252506 | -1.210522 |
| 1 | 2.539496 | -0.573248 | -0.761850 | 2.296540 | 3.230346 | -0.261858 | -0.099402 | -1.081451 | -1.096713 |
| 2 | 2.494612 | -0.905436 | -0.920745 | -0.882436 | -0.866931 | -0.955326 | -0.999223 | -1.170071 | -1.048430 |
| 3 | 2.489624 | -0.928830 | -1.159087 | -0.524171 | -0.480216 | -0.796769 | -0.715753 | -0.362590 | -1.154480 |
| 4 | 2.489624 | -0.961581 | -0.682402 | -0.545731 | -0.506313 | -0.701809 | -0.622130 | -1.026424 | -1.222593 |
| 5 | 2.484637 | -0.933509 | 0.032625 | -0.576466 | -0.719837 | -0.660863 | -0.681945 | -0.282879 | -1.149307 |
| 6 | 2.484637 | -0.942866 | -0.285165 | 0.120799 | 0.333545 | 0.358431 | 0.342707 | -0.632431 | -1.076883 |
| 7 | 2.479650 | -0.372063 | 0.985994 | -0.840233 | -0.881166 | -0.918736 | -0.892596 | -1.139989 | -1.369165 |
| 8 | 2.479650 | -0.942866 | 0.429862 | 0.984123 | 1.507924 | 1.484882 | 1.442778 | -0.893731 | -1.283808 |
| 9 | 2.474663 | -0.372063 | 1.383231 | -0.526006 | -0.546646 | -0.559805 | -0.598724 | -0.887127 | -1.372614 |
| 10 | 2.474663 | -0.938187 | -1.000192 | 0.503377 | 0.620616 | 0.875048 | 0.839429 | -0.630912 | -1.041533 |
| 11 | 2.474663 | -0.947545 | -0.602955 | -0.300771 | -0.133834 | -0.215684 | -0.167018 | -1.183644 | -1.252769 |
| 12 | 2.469676 | -0.367384 | 1.542126 | -0.620503 | -0.691367 | -0.740142 | -0.754763 | -0.904788 | -1.368303 |
| 13 | 2.469676 | -0.372063 | 0.191520 | -0.075995 | -0.178911 | -0.072809 | -0.057791 | -0.351951 | -1.180345 |
| 14 | 2.459702 | -1.340557 | -1.079639 | -0.548483 | -0.382944 | -0.418672 | -0.523306 | -1.585341 | -1.399342 |
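As a side note, the same standardization can be sketched with scikit-learn’s StandardScaler instead of the manual loop above; the only (tiny) difference is that StandardScaler divides by the population standard deviation (ddof=0), while pandas’ .std() uses ddof=1. A minimal sketch, assuming the same home_data DataFrame:

from sklearn.preprocessing import StandardScaler

# Fit on the raw data and return a standardized copy with the same column names
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(home_data),
                         columns=home_data.columns)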
Feature selection
There are several methods for feature selection. First we’ll work with univariate tests, scoring each feature individually. Later we may even get into recursive feature elimination, in which a model is trained with all the variables and then pruned by getting rid of the least important ones.
For classification, possible univariate statistical tests are the χ-squared test, mutual information and the ANOVA F-test; for regression, we again have an F-test and mutual information. We’ll feed them into SelectKBest(score_func, k=...) as scoring functions.
The χ-squared test of independence is used to determine whether there is a considerable association between two nominal variables, by analyzing whether the frequencies of one variable’s classes vary across the classes of the other. It’s therefore used for classification feature selection. It requires the features to be non-negative, as it works with contingency tables.
Mutual information measures the mutual dependence between two variables by quantifying the amount of information obtained about one by observing the other. It is calculated as the KL divergence between the joint distribution and the product of the marginal distributions.
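In symbols, for two random variables $X$ and $Y$ with joint distribution $p(x, y)$ and marginals $p(x)$ and $p(y)$:

$$ I(X;Y) = D_{\mathrm{KL}}\big(P_{XY}\,\|\,P_X P_Y\big) = \sum_{x,\,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} $$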
The ANOVA F-test compares the variance of a variable between the groups defined by the values of another variable with the variance within those groups. If the ratio is significant, then some values of the second variable are responsible for a considerable part of the variance of the first one, while other values are less so.
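In its classic one-way form, with $K$ groups of sizes $n_i$, group means $\bar{x}_i$, overall mean $\bar{x}$ and $N$ samples in total, the statistic is exactly that ratio of between-group to within-group variability:

$$ F = \frac{\sum_{i=1}^{K} n_i (\bar{x}_i - \bar{x})^2 \,/\, (K-1)}{\sum_{i=1}^{K} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 \,/\, (N-K)} $$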
Methods based on the F-test estimate linear dependency between variables and are therefore prone to disregard non-linear relationships. Mutual information does not have this problem, but it requires more samples to work properly. Since $χ^2$ works with nominal features, we won’t be using it for now, but we’ll compare the other two.
from sklearn.feature_selection import SelectKBest, f_regression

anova_filter_reg = SelectKBest(f_regression, k=5)
anova_filter_reg.fit(df_normal.drop(["median_house_value"], axis=1),
                     df_normal["median_house_value"])

for i in anova_filter_reg.get_support(indices=True):
    print(df_normal.columns[i])
latitude
housing_median_age
total_rooms
households
median_income
This tells us that the important variables according to the F-test are latitude, housing_median_age, total_rooms, households and median_income, which concurs with what we saw when looking at the correlation table.
from sklearn.feature_selection import mutual_info_regression

mi_filter_reg = SelectKBest(mutual_info_regression, k=5)
mi_filter_reg.fit(df_normal.drop(["median_house_value"], axis=1),
                  df_normal["median_house_value"])

for i in mi_filter_reg.get_support(indices=True):
    print(df_normal.columns[i])
longitude
latitude
housing_median_age
total_rooms
median_income
We see that four out of the five selected features match; the two selectors differ on households vs. longitude. We anticipated that mutual information can pick up non-linear relationships while the F-test can’t, so that could be one of the reasons they don’t agree.
PCA decomposition
Another way to reduce the dimensionality of data is principal component analysis (PCA). It is an unsupervised method that projects the data onto new axes along which it has the most variance. It does so by computing the eigendecomposition of the covariance matrix and keeping only the eigenvectors along which the data has the most variance (usually enough of them to retain around 99% of it).
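To make that concrete, here is a minimal numpy sketch of the idea (just an illustration; below we use scikit-learn’s PCA, which achieves the same thing via an SVD):

import numpy as np

X = df_normal.drop(["median_house_value"], axis=1).to_numpy()

# Covariance matrix of the (already centered) data and its eigendecomposition
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: the covariance matrix is symmetric

# Sort by decreasing eigenvalue and keep enough axes to explain 99% of the variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), 0.99) + 1

X_projected = X @ eigvecs[:, :k]              # project onto the top-k principal axes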
Instantiation
from sklearn.decomposition import PCA

pca = PCA(n_components=0.99, svd_solver='auto')
Fitting
pca.fit(df_normal.drop(["median_house_value"], axis=1))
PCA(copy=True, iterated_power='auto', n_components=0.99, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
We can use the pca.components_ attribute to see which eigenvectors were found, which might be useful for plotting.
print(pca.components_)
[[ 0.07904068 -0.07631081 -0.2181667 0.48316747 0.49022067 0.47243444 0.49156034 0.04170786]
[-0.70080222 0.7013323 0.02073468 0.07747009 0.06348259 0.02977278 0.06579091 -0.03650765]
[-0.05845201 0.01421506 -0.38966987 0.09753053 -0.11547515 -0.11482791 -0.10743051 0.89272908]
[-0.06642402 -0.1028198 0.88809884 0.11535503 0.06358447 0.08414255 0.09701523 0.40305669]
[-0.10510948 -0.05497627 -0.03609806 -0.30865347 -0.37872447 0.84925327 -0.14741019 0.05446535]
[ 0.47261304 0.45743518 0.09126113 0.5681172 -0.24096963 0.12581277 -0.39828178 -0.06148691]]
We can also see why they were chosen with the variance explained by each of them:
print(f"Explained variance in data:{pca.explained_variance_},\nRatio:{pca.explained_variance_ratio_}")
Explained variance in data:[3.91167716 1.90686359 1.07259181 0.82308895 0.14375284 0.08117511],
Ratio:[0.48895963 0.23835794 0.13407397 0.10288612 0.0179691 0.01014689]
This means that, with the data normalized, the first eigenvector accounts for almost 49% of the variance, and we need six orthogonal axes, i.e. six dimensions, to explain 99% of the variance in the data.
When we’re ready to project the data into the new axes, we do it with the following code:
df_pca = pca.transform(df_normal.drop(["median_house_value"], axis=1))
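pca.transform returns a plain NumPy array; a quick sanity check, and an optional wrap back into a labelled DataFrame, could look like this (the pc column names are just made up for convenience):

print(df_pca.shape)   # expected (17000, 6) for the fit shown above

df_pca = pd.DataFrame(df_pca,
                      columns=[f"pc{i}" for i in range(pca.n_components_)])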
Visualization
Given our top features selected with the different methods and our new dataset with reduced dimensionality, let’s see how they’re distributed and how they relate to our target variable.
Value distribution
First, let’s see what a histogram says about the distribution of values in median_house_value.
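The plotting code is not shown here; one plausible way to draw it, mirroring the cufflinks calls used for the later plots (and assuming cufflinks has been imported and set up), would be:

home_data["median_house_value"].iplot(kind="histogram",
                                      bins=100,
                                      histnorm="probability",
                                      filename="cufflinks/histogram")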
We can see that it confirms our suspicions about the skewness of the data, but there’s a peculiar spike at the maximum value. This could be due to replacement of outliers and extreme values.
Households
Distribution
Let’s see the distribution of households.
We take the logarithm of the values to get rid of the long tail.
import numpy as np

pd.DataFrame(data=np.log10(home_data["households"])) \
  .iplot(kind="histogram",
         bins=100,
         histnorm="probability",
         filename="cufflinks/histogram")
We can see that the log-distribution looks like a mostly symmetric Gaussian bell. This means that the number of households per block (as per the dataset description) hovers around 500.
Households vs. price
Let’s see what the relation between the value and the households variable looks like with a scatter plot.
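The original plotting code isn’t included; a hedged sketch with cufflinks (the filename is just illustrative) could be:

home_data.iplot(kind="scatter",
                mode="markers",
                x="households",
                y="median_house_value",
                filename="cufflinks/simple-scatter")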
The samples crowd into one cluster, with some cases set apart toward higher values and fewer households.
House age
Distribution
Let’s see a histogram for the housing_median_age variable. Remember it’s one of the top correlated variables.
Again we find that outliers appear to have been replaced with the maximum value of 52. There are several prominent modes, but no obvious skewness.
Age vs. price
To the naked eye, the relationship between age and price seems mostly random. Some obvious remarks: the age variable is not continuous but takes natural numbers (years rather than dates); extreme prices (≥ $500,000) are more frequent for houses at least 15 years old; there seem to be considerably fewer data points for houses younger than 12 years; and the distribution of price along all years seems skewed toward cheaper values. It also looks like the value has higher variance, or less skewness, in some years, perhaps with a certain periodicity. Another noticeable thing is that ages above 52 appear to have been capped at 52.
Bedrooms
Distribution
We took the logarithm to get rid of the long tail. Values of the original variable are centered around approximately 500 bedrooms per block (as we saw in the first table), with more observations toward the lower end.
Price vs. bedrooms
Here we have a distribution quite similar to the plot of households vs. median_house_value.
Rooms
Distribution
There’s considerable skewness in total_rooms, so let’s try with log(total_rooms).
Value vs. rooms
We can see that the samples make a big cluster with some outliers but without any obvious relation between the variables.
Income
Distribution
This was the variable most correlated with the value in the correlation table; let’s see why.
Nothing too surprising here. Again, there seems to be some replacement of the outliers. Let’s take a natural logarithm.
Income vs. price
Scatter plot
Here we see more of an influence on the price: the positive correlation is evident. The importance given to this variable by the feature selection methods and the correlation analysis makes sense.
Joint distribution
Let’s look at a simple sample joint distribution, without any kernel smoothing, just the raw densities.
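The figure itself isn’t reproduced here; assuming it shows median_income against median_house_value, one way to build such a binned joint density (using plotly directly, which may differ from whatever produced the original figure) is:

import plotly.graph_objects as go

fig = go.Figure(go.Histogram2d(x=home_data["median_income"],
                               y=home_data["median_house_value"],
                               histnorm="probability density",
                               nbinsx=80, nbinsy=80))
fig.show()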
t-SNE
Lastly, t-distributed Stochastic Neighbor Embedding is a good resource for the visualization of high-dimensional data. As its —rather cryptic— name suggests, it embeds data into lower dimensions. It does so by converting similarities between data points to joint probabilities and then trying to minimize the KL divergence between these joint probabilities in the embedded and the high-dimensional data, via gradient descent. It focuses on the local structure of the data so it tends to cluster the samples in the embedded space, which is what we want to use it for.
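In symbols, with $p_{ij}$ the pairwise similarities in the original space and $q_{ij}$ the corresponding similarities in the embedding, t-SNE minimizes

$$ C = \mathrm{KL}(P\,\|\,Q) = \sum_{i \neq j} p_{ij} \log\frac{p_{ij}}{q_{ij}} $$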
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2,
            perplexity=30,
            early_exaggeration=20.0,
            init='pca',
            random_state=10, # for reproducibility
            verbose=1)       # for control freaks

# Bin the target into quantiles; this column is only used for colouring the plots
df_normal["quantile"] = pd.qcut(home_data["median_house_value"],
                                [0, .25, .5, .75, .9, 1.],
                                labels=False)

# Drop the target and its binned version so neither leaks into the embedding
df_tsne = tsne.fit_transform(df_normal.drop(["median_house_value", "quantile"],
                                            axis=1))
df_tsne = pd.DataFrame(data={"col0": df_tsne[:, 0],
                             "col1": df_tsne[:, 1]})
df_tsne["value"] = home_data["median_house_value"]
df_tsne["income"] = home_data["median_income"]
df_tsne["quantile"] = df_normal["quantile"]

sample_tsne = df_tsne.sample(frac=0.01)
sample_tsne["quantile"] = sample_tsne["quantile"].astype('float16') # hot fix for cufflinks

sample_tsne.iplot(kind="bubble",
                  size="income",
                  categories="quantile",
                  text="income",
                  x="col0", y="col1",
                  filename="cufflinks/simple-bubble-chart")
We can see from the color of the bubbles (given by the forced category) that the algorithm could more or less find layers of prices in just two dimensions. However, it’s visible that it still struggles to make fully non-overlapping clusters for the chosen categories (at least with this hyperparameter setup).
Let’s try with 3 components.
tsne3 = TSNE(n_components=3,
             perplexity=30,
             early_exaggeration=20.0,
             init='pca',
             random_state=10)

# Again, keep the target and its binned version out of the embedding
df_tsne3 = tsne3.fit_transform(df_normal.drop(["median_house_value", "quantile"],
                                              axis=1))
df_tsne3 = pd.DataFrame(data={"col0": df_tsne3[:, 0],
                              "col1": df_tsne3[:, 1],
                              "col2": df_tsne3[:, 2]})
df_tsne3["value"] = home_data["median_house_value"]
df_tsne3["income"] = home_data["median_income"]
df_tsne3["quantile"] = df_normal["quantile"]

sample_tsne3 = df_tsne3.sample(frac=0.01)
sample_tsne3["quantile"] = sample_tsne3["quantile"].astype('float16')

sample_tsne3.iplot(kind="scatter3d",
                   mode="markers",
                   categories="quantile",
                   text="income",
                   x="col0", y="col1", z='col2')
At first glance, the 3D scatter plot doesn’t say much, but playing with it interactively, deactivating different classes, rotating the axes and so on, shows that t-SNE found a way to make the classes almost linearly separable. To do this, select the “turntable rotation” tool in the upper right corner if you are on a computer, or open the plot on the plotly website and do the same if you are on your phone or tablet.
Conclusions
That wraps up this edition. We’ve covered how to get some insights into the data with a simple summary from Pandas’ describe() method and how to get a quick idea of whether each variable is linearly correlated with the target using the .corr() method; we normalized and standardized the data; we saw different methods in scikit-learn for selecting important features; we learned how to project the data into a lower dimensionality while keeping most of its variance with PCA; and we made different plots to gain some intuition into why those methods did what they did.
Next week, I’ll do some comparisons between different models.