ATR 301: Data Cleaning

String Standardization

To eliminate ambiguity in weather descriptors, I converted all strings in the dataset to lowercase. This prevents identical values, such as "Sky is Clear", "sky is clear", and "Sky is clear", from being treated as distinct categories.
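A minimal sketch of this step, assuming the data is held in a pandas DataFrame. The sample rows are made up for illustration, and the column names follow the weather columns mentioned later in this document:

```python
import pandas as pd

# Made-up sample rows; only the casing differs between the descriptors.
df = pd.DataFrame({
    "weather_main": ["Clear", "clear", "Clear"],
    "weather_descriptor": ["Sky is Clear", "sky is clear", "Sky is clear"],
})

# Text columns to standardize (assumed to be the two weather columns).
text_cols = ["weather_main", "weather_descriptor"]

for col in text_cols:
    # .str.lower() lowercases each string and leaves missing values untouched.
    df[col] = df[col].str.lower()

# After standardization, the three spellings collapse into one category.
n_unique_descriptors = df["weather_descriptor"].nunique()
```

Comparing `nunique()` before and after the conversion is a quick way to see how many false categories were merged.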

Removing Outliers

The dataset contained several outliers. I dropped these rows, since together they made up less than 0.01% of the data. Some of the more interesting outliers were:

A reported temperature of 0 Kelvin (it does get cold in the Twin Cities, but not that cold...)
One-hour rainfall of ~10,000 mm
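Rows like these can be filtered with simple plausibility bounds. The sketch below assumes a pandas DataFrame with hypothetical columns `temp` (Kelvin) and `rain_1h` (mm); the sample values and the thresholds are illustrative assumptions, not the cutoffs actually used:

```python
import pandas as pd

# Made-up sample with two implausible readings.
df = pd.DataFrame({
    "temp": [288.3, 0.0, 275.1, 291.0],    # Kelvin
    "rain_1h": [0.0, 0.0, 9831.3, 0.25],   # mm in the last hour
})

# Assumed plausibility bounds: temperature above absolute zero,
# hourly rainfall below 1000 mm.
valid = (df["temp"] > 0) & (df["rain_1h"] < 1000)

# Fraction of rows that would be dropped, worth checking before dropping.
dropped_fraction = 1 - valid.mean()

df_clean = df[valid].reset_index(drop=True)
```

Checking `dropped_fraction` first confirms that the removed rows are a negligible share of the data before they are discarded.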

Duplicate Data

Data was provided in one-hour increments, so no day should have more than 24 rows. On examination, it appears that multiple rows are recorded for a single hour whenever two or more weather conditions are reported. As seen in rows 178 and 179, all values are identical except for the "weather_main" and "weather_descriptor" columns.

To resolve this, I converted both weather columns into lists and condensed duplicate hours into a single row by combining all of the reported weather descriptors for that hour into one list.
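One way to condense the duplicate hours, sketched with pandas. The `date_time` and `temp` columns and the sample rows are assumptions for illustration; the idea is to group by timestamp, collect the weather columns into lists, and keep the first value of every other column (which is identical within a group):

```python
import pandas as pd

# Made-up sample: the 07:00 hour is reported twice with different weather.
df = pd.DataFrame({
    "date_time": ["2012-10-10 07:00:00", "2012-10-10 07:00:00",
                  "2012-10-10 08:00:00"],
    "temp": [281.2, 281.2, 282.0],
    "weather_main": ["rain", "mist", "clouds"],
    "weather_descriptor": ["light rain", "mist", "broken clouds"],
})

weather_cols = ["weather_main", "weather_descriptor"]

# Non-weather columns are the same across duplicates, so keep the first value.
agg_spec = {
    col: "first"
    for col in df.columns
    if col not in weather_cols and col != "date_time"
}
# Weather columns are gathered into one list per hour.
agg_spec.update({col: list for col in weather_cols})

condensed = df.groupby("date_time", sort=False, as_index=False).agg(agg_spec)
```

After this step every hour appears exactly once, and hours with a single reported condition simply hold one-element lists.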

Holiday Labels

Holidays in the dataset were only labeled in the row corresponding to midnight (hour = 0). This is easy to verify: most holidays have only five labeled rows in the dataset, whereas labeling every hour of each holiday would give on the order of 24 * 5 rows.
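A sketch of how the midnight label could be spread across the whole day, assuming a pandas DataFrame with a `date_time` column and a `holiday` column that holds the string "none" on unlabeled rows. Both the column names and the "none" marker are assumptions for illustration:

```python
import pandas as pd

# Made-up sample: 28 hourly rows spanning one full holiday.
df = pd.DataFrame({
    "date_time": pd.date_range("2016-07-03 22:00", periods=28, freq="h"),
})
df["holiday"] = "none"
# Only the midnight row carries the label, as in the raw data.
df.loc[df["date_time"] == pd.Timestamp("2016-07-04 00:00"), "holiday"] = "independence day"

# Build a lookup from calendar day to holiday name using the labeled rows.
labeled = df[df["holiday"] != "none"]
day_to_holiday = dict(zip(labeled["date_time"].dt.normalize(), labeled["holiday"]))

# Apply the lookup to every row of the same day; other days stay "none".
df["holiday"] = df["date_time"].dt.normalize().map(day_to_holiday).fillna("none")

rows_per_holiday = df.loc[df["holiday"] != "none", "holiday"].value_counts()
```

Counting rows per holiday before and after, as in `rows_per_holiday`, is the same check described above: each occurrence of a holiday should now contribute 24 rows instead of one.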

This was corrected, as shown in the data sample below: