ATR 301: Data Cleaning
String Standardization
To eliminate ambiguity in the weather descriptors, I converted every string in the dataset to lowercase. This prevents identical values from being split into separate categories, such as "Sky is Clear", "sky is clear", and "Sky is clear".
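A minimal sketch of this step, assuming the data is held in a pandas DataFrame. The sample rows are hypothetical, and this is not necessarily the code used for the analysis:

```python
import pandas as pd

# Hypothetical sample rows; column names follow the ones used in this write-up.
df = pd.DataFrame({
    "weather_main": ["Clear", "clear", "Clear"],
    "weather_descriptor": ["Sky is Clear", "sky is clear", "Sky is clear"],
})

# Lowercase each text column so that case variants collapse into one value.
for col in ["weather_main", "weather_descriptor"]:
    df[col] = df[col].str.lower()

# Count the distinct descriptors that remain after standardization.
unique_descriptors = df["weather_descriptor"].nunique()
```

After the conversion, the three spellings of the descriptor count as a single category.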
Removing Outliers
The dataset contained several outliers. I dropped these rows because together they made up less than 0.01% of the data. Some of the more interesting outliers were:
A reported temperature of 0 Kelvin (it does get cold in the Twin Cities, but not that cold...)
A one-hour rainfall total of ~10,000 mm
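Outliers like these can be removed with simple plausibility bounds. The sketch below assumes pandas. The column names `temp` and `rain_1h` and both thresholds are assumptions chosen for illustration, not values taken from the analysis:

```python
import pandas as pd

# Hypothetical sample; "temp" (Kelvin) and "rain_1h" (mm) are assumed column names.
df = pd.DataFrame({
    "temp": [288.3, 0.0, 275.1, 291.4],
    "rain_1h": [0.0, 0.0, 10000.0, 0.25],
})

# Assumed plausibility bounds, chosen only for this illustration.
TEMP_MIN_K = 200.0    # anything colder is treated as a sensor or entry error
RAIN_MAX_MM = 500.0   # anything wetter in one hour is treated as an error

# Flag rows that violate either bound, then keep the rest.
is_outlier = (df["temp"] < TEMP_MIN_K) | (df["rain_1h"] > RAIN_MAX_MM)
dropped_fraction = is_outlier.mean()  # share of rows removed
cleaned = df[~is_outlier].reset_index(drop=True)
```

Checking `dropped_fraction` before dropping confirms that only a negligible share of the data is lost.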
Duplicate Data
The data was provided in one-hour increments, so no day should have more than 24 rows. On examination, multiple rows are recorded for a single hour whenever two or more weather conditions are reported. As seen in rows 178 and 179, all values are identical except for the "weather_main" and "weather_descriptor" columns.
To resolve this, I converted both weather columns to lists and collapsed each duplicated hour into a single row, whose lists hold every weather value reported for that hour.
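One way to express this condensing step is a group-by on the timestamp. The sketch assumes pandas. The `date_time` and `temp` column names and the sample rows are hypothetical, and this is not necessarily the original implementation:

```python
import pandas as pd

# Hypothetical sample: the 09:00 hour is reported twice with different weather.
df = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2020-01-01 09:00", "2020-01-01 09:00", "2020-01-01 10:00",
    ]),
    "temp": [270.0, 270.0, 271.5],
    "weather_main": ["snow", "mist", "clouds"],
    "weather_descriptor": ["light snow", "mist", "overcast clouds"],
})

weather_cols = ["weather_main", "weather_descriptor"]
other_cols = [c for c in df.columns if c not in weather_cols + ["date_time"]]

# Non-weather columns are identical within an hour, so keep the first value;
# weather columns are gathered into one list per hour.
agg_spec = {col: "first" for col in other_cols}
agg_spec.update({col: list for col in weather_cols})

condensed = df.groupby("date_time", as_index=False).agg(agg_spec)
```

Each hour now appears exactly once, and hours with a single reported condition hold a one-element list, so the weather columns have a uniform type.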
Holiday Labels
Holidays in the dataset were labeled only in the row corresponding to midnight (hour = 0). This is evident from the label counts: most holidays have only five labeled rows in the dataset, one for each day on which the holiday occurs. If every hour of each holiday were labeled, there would be on the order of 24 * 5 = 120 rows.
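A correction of this kind can be sketched by spreading each midnight label across all rows of the same calendar day. The sketch assumes pandas and assumes that unlabeled rows hold a missing value. The `date_time` and `holiday` column names and the sample rows are hypothetical:

```python
import pandas as pd

# Hypothetical sample spanning midnight; only the hour = 0 row carries the label.
df = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2020-07-03 22:00", "2020-07-03 23:00",
        "2020-07-04 00:00", "2020-07-04 01:00", "2020-07-04 02:00",
    ]),
    "holiday": [None, None, "independence day", None, None],
})

# Calendar day of each row (timestamp truncated to midnight).
day = df["date_time"].dt.normalize()

# "first" returns the first non-missing value in each day, so the midnight
# label is copied to every hour of that day; other days stay missing.
df["holiday"] = df.groupby(day)["holiday"].transform("first")
```

In this sample, the two rows from July 3 remain unlabeled, while all three rows from July 4 carry the holiday label.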
The labels were corrected so that every hour of a holiday carries its label, as shown in the data sample below: