Big data collection and dataset cleaning

Published on August 31, 2017   9 min
Hello. This presentation is about data preparation, also sometimes referred to as data cleaning or scrubbing, or, less euphemistically, as data janitor work. I'm Matt Wong, CEO of Liquidaty, a New York City-based startup company.
Often the biggest challenge when working with data is not its availability, quality, volume, or accessibility, but rather the arduous task of preparing the data before it can be used for the particular purpose at hand. To address this challenge, corporations may resort to hiring full-time employees whose sole job is to prepare data that resides in one place so that it can be moved to another place for processing. This activity in general is what I will refer to as data preparation. So what exactly does that mean? Well, you've probably heard of new fields in technology commonly referred to as big data or machine learning. One thing these new technologies have in common is that they generally consume vast quantities of data. If there is one thing that you take away from this presentation, remember this: just as we humans prefer much of our food to be cooked before we eat it, data-consuming technologies typically need their data to be prepared, often in a very specific manner, before it can be consumed.
Let's start with a simple example. Imagine you have created a holiday card on a website, and to provide a list of recipients you'd like to import your address book from a spreadsheet. Most likely, the import process will require the data to be formatted in a very specific manner. In particular, it may require that your data columns are named in accordance with a predefined specification. Should the column be "Zip Code", "Postal Code", or just "Zip"? Should last name and first name be combined in one column, or are they required to be separate? Should there be "Street Address" or just "Address"? Will it be okay if there are spaces between those column header words, or if, for example, there is a dash between "E" and "Mail"? In addition to being particular about how the data is formatted, the application might also be particular about the content of the data. Is it acceptable to be missing a zip code, or to have a zip code that is not a number? What if the zip is not consistent with the state, or the address does not actually exist? Depending on the consuming application, these kinds of exceptions may or may not need to be identified and handled, perhaps by fixing them or perhaps by removing the entire entry with the offending data. So let's say that, in order to get your address book data into a form that is consumable by your holiday card website, you change some column names, remove some entries with missing critical data, and save the resulting spreadsheet as a CSV file. Congratulations, you've just completed an exercise in data preparation.
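The address book exercise just described can be sketched in a few lines of Python. This is only an illustration: the column names, the target schema, and the sample rows below are all invented for the example, not taken from any real import specification.

```python
import csv
import io

# Hypothetical address-book export. The headers and rows are invented
# for illustration; a real export would come from your spreadsheet.
raw = """Last Name,First Name,Street Address,Zip Code,E-Mail
Smith,Ann,12 Oak St,10001,ann@example.com
Jones,Bob,34 Elm Ave,,bob@example.com
Lee,Cara,56 Pine Rd,94105,cara@example.com
"""

# Map the export's headers onto the names a (hypothetical) card
# website might require -- the "change some column names" step.
RENAME = {
    "Last Name": "last_name",
    "First Name": "first_name",
    "Street Address": "address",
    "Zip Code": "zip",
    "E-Mail": "email",
}

def clean(reader):
    """Rename columns and drop entries whose zip is missing or non-numeric."""
    for row in reader:
        renamed = {RENAME[header]: value for header, value in row.items()}
        if renamed["zip"].isdigit():  # remove entries with missing critical data
            yield renamed

rows = list(clean(csv.DictReader(io.StringIO(raw))))

# Save the cleaned result as CSV -- here to a string; a file works the same way.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(RENAME.values()))
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Note that the row for Bob is dropped because its zip code is empty; deeper checks, such as whether the zip matches the state or the address exists, would require reference data beyond what a simple script like this can see.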