Ames: Feature Engineering
Grouping Neighborhoods by Region
I have never been to Ames, Iowa, so I cannot speak to the relative desirability of each neighborhood. I do know that broader geography sometimes matters more than the specific "neighborhood" in determining how desirable housing is. To investigate this potential relationship, I grouped the neighborhoods geographically, based on a publicly available map of the neighborhoods within Ames (Figure 1). I placed the 26 neighborhoods represented in the dataset into 10 geographic groups:
Northwest
Far North
Near North
Downtown
Far West
Near West
Southwest Iowa State University (ISU)
Far Southwest
South
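As a rough illustration of this grouping step, the sketch below maps neighborhood codes to areas with a lookup table. The neighborhood codes follow the dataset's naming, but the area assigned to each code, the `assign_area` helper, and the toy records are all assumptions made for the example, not the actual grouping used here.

```python
# Hypothetical mapping for illustration only: the area assigned to each
# neighborhood code is an assumption, not the grouping used in this analysis.
AREA_BY_NEIGHBORHOOD = {
    "NridgHt": "Northwest",
    "NoRidge": "Northwest",
    "OldTown": "Downtown",
    "IDOTRR": "Downtown",
    "Mitchel": "South",
}


def assign_area(neighborhood, default="Unmapped"):
    """Return the geographic area for a neighborhood code.

    Codes missing from the lookup table fall back to `default` so that
    gaps in the mapping are easy to spot.
    """
    return AREA_BY_NEIGHBORHOOD.get(neighborhood, default)


# Toy records standing in for rows of the housing data (made-up prices).
homes = [
    {"neighborhood": "NridgHt", "saleprice": 320000},
    {"neighborhood": "OldTown", "saleprice": 115000},
    {"neighborhood": "Mitchel", "saleprice": 160000},
    {"neighborhood": "Unknown", "saleprice": 150000},
]

# Add the coarser "area" feature alongside the original neighborhood.
for home in homes:
    home["area"] = assign_area(home["neighborhood"])

areas = [home["area"] for home in homes]
```

With a pandas DataFrame, the same lookup could be applied with `df["neighborhood"].map(AREA_BY_NEIGHBORHOOD)`, assuming the column is named `neighborhood`.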
The idea was that reducing the number of "geographic areas" from 26 to 10 might give us a more generalizable model. First, I created strip plots of home sale prices by neighborhood (Figure 2). These strip plots show that house prices within most (but not all) neighborhoods can vary widely. They also show notable differences in price between certain neighborhoods.
To see if this grouping technique provided any value, I first examined the distribution of home sale prices in each area (Figure 3). Based on this initial evaluation of the data, we can see that there are visible differences in sale prices between areas. Notably, the Northwest area appears to have the highest home prices, and the Downtown and Southwest areas have the lowest.
When we step back and evaluate the differences more quantitatively, the linear correlation coefficients with sale price for the Northwest neighborhoods and the Downtown neighborhoods are 0.52 and -0.35, respectively (Table 1). This indicates a substantial relationship between the geographic location of a home and its sale price, one that is easier to digest than examining each of the neighborhoods individually.
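One way to obtain a per-area correlation like this is to turn each area into a 0/1 indicator and compute its Pearson correlation with sale price. The sketch below assumes that approach. The `pearson` and `area_correlations` helpers and the toy data are hypothetical, and the numbers it produces are unrelated to Table 1.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def area_correlations(areas, prices):
    """Correlate a 0/1 indicator for each area with the sale prices."""
    result = {}
    for area in sorted(set(areas)):
        # 1.0 where the home is in this area, 0.0 everywhere else.
        indicator = [1.0 if a == area else 0.0 for a in areas]
        result[area] = pearson(indicator, prices)
    return result


# Toy data (made up): two homes per area, prices in thousands of dollars.
toy_areas = ["Northwest", "Northwest", "Downtown", "Downtown", "South", "South"]
toy_prices = [300.0, 320.0, 120.0, 110.0, 200.0, 210.0]

correlations = area_correlations(toy_areas, toy_prices)
```

A positive coefficient means homes in that area tend to sell above the overall average, and a negative one means they tend to sell below it.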
Extracting Multiple Features from Complex Inputs
Other columns in the dataset contained data that was difficult to interpret without further explanation. The prime example of this was the "ms-subclass" column, which originally contained numbers corresponding to different sub-classes of homes. Table 2 lists the codes along with the corresponding string definitions. From these strings, I was able to extract three features that were not contained elsewhere in the dataset:
The number of stories in the house
Whether the house is part of a planned unit development (i.e., whether there is a homeowners' association)
Whether more than one family occupies the house (a duplex)
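The three features above can be pulled out of the description strings with simple pattern matching. The sketch below is one possible approach, not necessarily the code used here. The descriptions are a small subset taken from the public Ames data dictionary and may be worded differently from Table 2, and the `extract_subclass_features` helper and its output keys are hypothetical names.

```python
import re

# Subset of sub-class codes and descriptions. Assumption: the wording follows
# the public Ames data dictionary and may differ slightly from Table 2.
SUBCLASS_DESCRIPTIONS = {
    20: "1-STORY 1946 & NEWER ALL STYLES",
    50: "1-1/2 STORY FINISHED ALL AGES",
    60: "2-STORY 1946 & NEWER",
    90: "DUPLEX - ALL STYLES AND AGES",
    120: "1-STORY PUD (Planned Unit Development) - 1946 & NEWER",
    160: "2-STORY PUD - 1946 & NEWER",
}

# Matches "1-STORY", "2-STORY", and half-story forms such as "1-1/2 STORY".
STORY_PATTERN = re.compile(r"(\d)(-1/2)?[- ]STORY")


def extract_subclass_features(code):
    """Derive story count, PUD flag, and multi-family flag from a code."""
    description = SUBCLASS_DESCRIPTIONS[code]

    # Story count stays None when the description does not state it.
    stories = None
    match = STORY_PATTERN.search(description)
    if match:
        stories = float(match.group(1)) + (0.5 if match.group(2) else 0.0)

    return {
        "stories": stories,
        "is_pud": "PUD" in description,
        "is_multi_family": "DUPLEX" in description or "2 FAMILY" in description,
    }


features_by_code = {
    code: extract_subclass_features(code) for code in SUBCLASS_DESCRIPTIONS
}
```

Leaving the story count as `None` for descriptions that do not state it (such as the duplex code) keeps the missing value explicit rather than guessing a number.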