Ames: Feature Engineering

Grouping Neighborhoods by Region

I have never been to Ames, Iowa, so I cannot speak to the relative desirability of each neighborhood. I do know that broader geography sometimes matters more than the specific neighborhood when it comes to the desirability of housing. To investigate this potential relationship, I grouped the neighborhoods geographically, based on a publicly available map of the neighborhoods within Ames (Figure 1). I placed the 26 neighborhoods represented in the dataset into 10 geographic groups:

Figure 1: Geographic grouping scheme used to generate the "geo-group" feature.
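A grouping like this can be expressed as a simple lookup from neighborhood to area. The sketch below is a minimal illustration, not the actual scheme from Figure 1: the assignment of neighborhood codes to areas is hypothetical, only three of the ten groups are shown, and the column names "neighborhood" and "saleprice" and the toy rows are assumptions made for the example.

```python
import pandas as pd

# HYPOTHETICAL assignment of neighborhood codes to areas, for illustration only.
# The real scheme (10 groups covering all 26 neighborhoods) is the one in Figure 1.
GEO_GROUPS = {
    "Northwest": ["NoRidge", "NridgHt"],
    "Downtown": ["OldTown", "IDOTRR"],
    "Southwest": ["Edwards", "SWISU"],
}

# Invert to a flat neighborhood -> group lookup.
NEIGHBORHOOD_TO_GROUP = {
    hood: group for group, hoods in GEO_GROUPS.items() for hood in hoods
}


def add_geo_group(df, column="neighborhood"):
    """Return a copy of df with a "geo-group" column derived from `column`."""
    out = df.copy()
    out["geo-group"] = out[column].map(NEIGHBORHOOD_TO_GROUP)
    # Fail loudly if a neighborhood has no assigned group.
    unmapped = out.loc[out["geo-group"].isna(), column].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Neighborhoods without a geo-group: {sorted(unmapped)}")
    return out


# Toy rows (made-up prices, not taken from the dataset).
homes = pd.DataFrame(
    {
        "neighborhood": ["NoRidge", "OldTown", "Edwards"],
        "saleprice": [300000, 120000, 130000],
    }
)
homes = add_geo_group(homes)
```

Raising an error on unmapped neighborhoods is a deliberate choice here: a silent NaN in the new column would be easy to miss later in the modeling steps.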

The idea was that reducing the number of "geographic areas" from 26 to 10 might provide us with a more generalizable model. First, I created strip plots of the home sale prices by neighborhood (Figure 2). These strip plots show that the prices of houses within most (but not all) neighborhoods can vary widely. They also show notable differences in the price distributions of certain neighborhoods.

Figure 2: Strip plots of sale prices grouped by neighborhood

To see whether this grouping technique provided any value, I examined the distribution of home sale prices in each area (Figure 3). Even in this initial look at the data, there are visible differences in sale prices between areas. Notably, the Northwest area appears to have the highest sale prices, and the Downtown and Southwest areas the lowest.

Figure 3: Strip plots of home sale prices grouped by geographical area
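A numeric summary per area can back up what the strip plots suggest visually. The sketch below is one way to do that, assuming a DataFrame with "geo-group" and "saleprice" columns (both names are assumptions); the rows are made-up values, not data from the Ames dataset.

```python
import pandas as pd


def price_summary_by_group(df, group_col="geo-group", target_col="saleprice"):
    """Count, median, min, and max of the target per group, highest median first."""
    summary = df.groupby(group_col)[target_col].agg(["count", "median", "min", "max"])
    return summary.sort_values("median", ascending=False)


# Toy rows (made-up prices, for illustration only).
toy = pd.DataFrame(
    {
        "geo-group": ["Northwest", "Northwest", "Downtown",
                      "Downtown", "Southwest", "Southwest"],
        "saleprice": [300000, 320000, 110000, 120000, 130000, 140000],
    }
)
summary = price_summary_by_group(toy)
```

The median is used rather than the mean because sale prices tend to be right-skewed, so a few expensive homes would otherwise dominate the comparison.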

When we step back and evaluate the differences more quantitatively, we can see that the Northwest neighborhoods and the Downtown neighborhoods have linear correlation coefficients with sale price of 0.52 and -0.35, respectively (Table 1). This indicates a meaningful relationship between the geographic location of a home and its sale price, and one that is easier to digest than examining each of the neighborhoods individually.

Table 1: Comparison of group correlation coefficients (r_g) and neighborhood correlation coefficients (r_n)
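One plausible way to obtain per-group coefficients like those in Table 1 is to one-hot encode the category and take the Pearson correlation of each indicator column with the sale price (for a binary indicator this is the point-biserial correlation). This is a sketch under that assumption, not necessarily the exact procedure used for the table; the column names and toy rows are illustrative.

```python
import pandas as pd


def indicator_correlations(df, category_col, target_col="saleprice"):
    """Pearson r between each one-hot indicator of category_col and the target."""
    indicators = pd.get_dummies(df[category_col], dtype=float)
    return indicators.corrwith(df[target_col]).sort_values(ascending=False)


# Toy rows (made-up prices, for illustration only).
toy = pd.DataFrame(
    {
        "geo-group": ["Northwest", "Northwest", "Downtown",
                      "Downtown", "Southwest", "Southwest"],
        "saleprice": [300000, 320000, 110000, 120000, 130000, 140000],
    }
)
r_g = indicator_correlations(toy, "geo-group")
```

Calling the same function with the neighborhood column instead of the group column would give the r_n values for comparison.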

Extracting Multiple Features from Complex Inputs

Other columns in the dataset contained data that was difficult to interpret without further explanation. The prime example of this was the "ms-subclass" column, which originally contained numbers corresponding to different sub-classes of homes. Table 2 lists the codes along with the corresponding string definitions. From these strings, I was able to extract three features that were not contained elsewhere in the dataset:

Table 2: Features extracted from the "ms-subclass" column that were not represented elsewhere in the dataset.
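The general pattern is to map each numeric code to its description and then derive flags from keywords in that description. The sketch below shows the idea with a subset of codes; the descriptions are abbreviated from the public Ames data dictionary as I understand it (Table 2 is the authoritative list), and the three flags "is-pud", "is-split-level", and "is-multi-family" are hypothetical examples, not necessarily the three features listed in Table 2.

```python
import pandas as pd

# Subset of "ms-subclass" codes with abbreviated descriptions (see Table 2
# for the authoritative list).
MS_SUBCLASS = {
    20: "1-STORY 1946 & NEWER ALL STYLES",
    60: "2-STORY 1946 & NEWER",
    80: "SPLIT OR MULTI-LEVEL",
    90: "DUPLEX - ALL STYLES AND AGES",
    120: "1-STORY PUD (Planned Unit Development) - 1946 & NEWER",
    190: "2 FAMILY CONVERSION - ALL STYLES AND AGES",
}


def extract_subclass_features(codes):
    """Derive 0/1 flags from the text description of each sub-class code.

    The flag names are HYPOTHETICAL examples chosen for this sketch.
    """
    desc = codes.map(MS_SUBCLASS)
    if desc.isna().any():
        unknown = sorted(codes[desc.isna()].unique())
        raise ValueError(f"Unknown ms-subclass codes: {unknown}")
    flags = pd.DataFrame(
        {
            "is-pud": desc.str.contains("PUD", regex=False),
            "is-split-level": desc.str.contains("SPLIT", regex=False),
            "is-multi-family": desc.str.contains("DUPLEX|2 FAMILY"),
        },
        index=codes.index,
    )
    return flags.astype(int)


codes = pd.Series([20, 120, 80, 190], name="ms-subclass")
features = extract_subclass_features(codes)
```

Working from the description strings rather than hard-coding lists of codes keeps the rule for each flag readable, at the cost of depending on the exact wording of the descriptions.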