Predicting Housing Prices in Ames, Iowa

Executive Summary

In this project, I worked with the Ames, Iowa housing dataset with the goal of predicting home sale prices from the many feature columns in the dataset. Using a combination of data cleaning and feature engineering, I produced a dataset usable by a machine learning model. I then examined individual columns to determine which were non-collinear and a good fit for the model. Finally, I fit k-nearest neighbors and linear regression models on multiple subsets of the chosen columns, obtaining a linear regression RMSE of $25,032, or about 14% of the average sale price of $175,778.
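As a quick check on how the RMSE relates to the average sale price, the arithmetic can be sketched as follows. This is a minimal illustration that assumes RMSE is computed in the standard way (the square root of the mean squared error); the function names `rmse` and `relative_error` are hypothetical and are not taken from the project code.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_error(rmse_value, mean_price):
    """Express an RMSE as a fraction of the mean target value."""
    return rmse_value / mean_price


# Figures reported in the summary: RMSE of $25,032, mean price of $175,778.
reported_ratio = relative_error(25_032, 175_778)
print(f"{reported_ratio:.1%}")  # prints 14.2%
```

Expressing the error as a fraction of the mean price makes it easier to compare against models trained on other price ranges.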

Introduction

The Ames, Iowa housing dataset is a classic exercise in machine learning. This extensive dataset contains information ranging from the unquestionably important square footage of a house down to the value of the dilapidated shed that the owner forgot was in their backyard. The multitude of data points challenges learners to perform extensive data cleaning and to think critically about the interplay of seemingly different pieces of data before including them in a machine learning model.

In this project, I used feature engineering and machine learning to analyze the Ames, Iowa housing dataset.

Data Cleaning

The major steps I performed on the raw data available on Kaggle are detailed on the data cleaning subpage. The following operations were performed:

Exploratory Data Analysis (EDA)

The Ames dataset contains 83 columns of data. Some of these are more obviously useful than others, as evidenced by a correlation analysis on the cleaned dataset (Figure 1):

Figure 1: Example Linear Correlations between dataset features and "sale-price"
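A correlation analysis of this kind can be sketched with pandas. The example below is only an illustration: the DataFrame is toy data invented for the sketch (not values from the Ames dataset), and the column names and the helper `correlations_with_target` are assumptions rather than the names used in the project.

```python
import pandas as pd

# Toy data invented for illustration; NOT values from the Ames dataset.
toy = pd.DataFrame(
    {
        "living_area": [850, 1200, 1500, 1800, 2400],
        "overall_quality": [4, 5, 6, 7, 9],
        "year_sold": [2008, 2006, 2010, 2007, 2009],
        "sale_price": [105_000, 140_000, 172_000, 205_000, 310_000],
    }
)


def correlations_with_target(df, target):
    """Pearson correlation of each numeric column with the target column,
    sorted from strongest to weakest by absolute value."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


ranked = correlations_with_target(toy, "sale_price")
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top, since they can be just as informative as positive ones. The full matrix from `numeric.corr()` can also be inspected pairwise to spot features that are collinear with each other.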

... to be continued