Predicting Housing Prices in Ames, Iowa
Photo by Blake Wheeler on Unsplash
Executive Summary
In this project, I worked with the Ames, Iowa housing dataset with the goal of predicting home prices from the many columns in the dataset. Using a combination of data cleaning and feature engineering, I produced a dataset usable by a machine learning model. I then examined individual columns to determine which were non-collinear and a good fit for the model. Finally, I fit k-nearest neighbors and linear regression models on multiple subsets of the chosen data, obtaining a linear regression RMSE of $25,032, or 14% of the average sale price of $175,778.
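For readers unfamiliar with the metric, the snippet below is a minimal sketch of how an RMSE is computed and how the 14% figure follows from the two dollar amounts quoted above. The `rmse` helper and the toy price lists are illustrative assumptions, not the project's actual code or data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Figures quoted in the summary above.
reported_rmse = 25_032
mean_sale_price = 175_778

# RMSE expressed as a fraction of the average sale price (roughly 0.14).
relative_error = reported_rmse / mean_sale_price

# Toy example (made-up prices, only to show the helper in use).
toy_actual = [200_000, 150_000, 180_000]
toy_predicted = [210_000, 140_000, 180_000]
toy_rmse = rmse(toy_actual, toy_predicted)

print(f"Relative error: {relative_error:.1%}")
print(f"Toy RMSE: {toy_rmse:,.0f}")
```

Because RMSE squares each error before averaging, a few badly mispredicted homes raise it more than many small misses would.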
Introduction
The Ames, Iowa housing dataset is a classic exercise in machine learning. This extensive dataset contains information ranging from the unquestionably important square footage of a house down to the value of the dilapidated shed that the owner forgot was in their backyard. The multitude of data points challenges learners to perform extensive data cleaning and to think critically about the interplay of seemingly unrelated pieces of data before including them in a machine learning model.
In this project, I used feature engineering and machine learning to analyze the Ames, Iowa housing dataset.
Data Cleaning
The major steps I performed on the raw data available on Kaggle are detailed on the data cleaning subpage. The following operations were performed:
Transformation of pseudonumeric categorical data
Transformation of ambiguous categorical data
Outlier removal
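The three operations listed above can be sketched roughly as follows. This is a dependency-free illustration of one plausible reading of each step, not the code from the data cleaning subpage: the column names (`MSSubClass`, `GarageType`, `GrLivArea`), the interpretation of "ambiguous" as missing values that really mean "feature absent", and the 4,000 square foot cutoff are all assumptions made for the example.

```python
# Toy rows standing in for the raw data (values are made up).
raw_rows = [
    {"MSSubClass": 20, "GarageType": "Attchd", "GrLivArea": 1500, "SalePrice": 180_000},
    {"MSSubClass": 60, "GarageType": None, "GrLivArea": 1200, "SalePrice": 130_000},
    {"MSSubClass": 20, "GarageType": "Detchd", "GrLivArea": 5600, "SalePrice": 160_000},
]

# Hypothetical outlier threshold on above-ground living area (square feet).
MAX_LIVING_AREA = 4000


def clean(rows):
    """Return cleaned copies of the rows, dropping assumed outliers."""
    cleaned = []
    for row in rows:
        # Outlier removal: skip unusually large homes.
        if row["GrLivArea"] > MAX_LIVING_AREA:
            continue
        new_row = dict(row)
        # Pseudonumeric categorical: the code is a label, not a quantity,
        # so store it as a string to keep a model from treating it as ordered.
        new_row["MSSubClass"] = f"class_{row['MSSubClass']}"
        # Ambiguous categorical: a missing value here is assumed to mean
        # "no garage" rather than "unknown", so give it an explicit category.
        if new_row["GarageType"] is None:
            new_row["GarageType"] = "NoGarage"
        cleaned.append(new_row)
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

The function copies each row instead of mutating it, so the raw data stays available for comparison after cleaning.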
Exploratory Data Analysis (EDA)
The Ames dataset contains 83 columns of data. Some of these are more obviously useful than others, as evidenced by a correlation analysis on the cleaned dataset (Figure 1):
Figure 1: Example Linear Correlations between dataset features and "sale-price"
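As a rough illustration of the kind of correlation analysis behind a figure like this, the sketch below computes the Pearson correlation of each feature with the sale price and ranks the features by absolute correlation. The column names and values are invented for the example, and the `pearson` helper is a plain-Python stand-in for whatever library routine the analysis actually used.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Made-up columns: one feature that tracks price closely, one that does not.
columns = {
    "living_area": [1000, 1500, 2000, 2500],
    "year_sold": [2008, 2006, 2009, 2007],
    "sale_price": [120_000, 170_000, 230_000, 270_000],
}

target = columns["sale_price"]
correlations = {
    name: pearson(values, target)
    for name, values in columns.items()
    if name != "sale_price"
}

# Strongest linear relationships first, regardless of sign.
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)

for name in ranked:
    print(f"{name}: {correlations[name]:+.3f}")
```

A high correlation with the target suggests a useful feature, while a high correlation between two features flags the collinearity that the column selection step tries to avoid.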