Jack Kolberg-Edelbrock, PhD - Ames: Data Cleaning

Many of the columns in the dataset contain either pseudo-numeric or ambiguous categorical data. To make these columns more useful in a machine learning context, I converted several of them to numeric values and renamed ambiguous categorical values to labels that are easier to interpret.

Pseudo-numeric categorical data

Several of the "generalized" columns describing the overall condition of the houses contain string values representing a spectrum of possible evaluations (Table 1). Because these values are ordered, they can be converted into numeric values that preserve their relationships with each other. In my evaluation, I assumed that the categorical values are equally spaced. I believe there is a reasonable case for using unequal spacing between the categorical values, particularly if the method of categorizing houses led to certain "percentiles" being assigned to each category.
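
The equal-spacing assumption can be illustrated with a minimal sketch. The label set (Po, Fa, TA, Gd, Ex) and its ordering are assumptions here and should be checked against Table 1; the function name, the sample values, and the unequal scores are hypothetical and serve only to show how an alternative spacing would plug in.

```python
# Assumed ordering of the quality labels, worst to best
# (hypothetical; verify against the labels listed in Table 1).
QUALITY_ORDER = ["Po", "Fa", "TA", "Gd", "Ex"]

# Equal spacing: each step up the scale adds exactly 1.
EQUAL_SPACING = {label: rank for rank, label in enumerate(QUALITY_ORDER, start=1)}

# Unequal spacing: illustrative scores only, not derived from the data.
UNEQUAL_SPACING = {"Po": 1, "Fa": 2, "TA": 4, "Gd": 7, "Ex": 10}


def encode_quality(values, mapping=EQUAL_SPACING, missing=None):
    """Replace each label with its numeric score.

    Labels that are not in the mapping (for example a missing-value
    marker) are replaced with `missing` instead of raising an error.
    """
    return [mapping.get(value, missing) for value in values]


# Hypothetical sample of one condition column.
sample = ["TA", "Gd", "Ex", "Po", "NA"]
print(encode_quality(sample))                   # [3, 4, 5, 1, None]
print(encode_quality(sample, UNEQUAL_SPACING))  # [4, 7, 10, 1, None]
```

If the data lives in a pandas DataFrame, the same dictionaries could be passed to `Series.map` on each affected column. Keeping the mapping in one place makes it easy to switch between the equal and unequal spacing and compare the effect on a model.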