Topics Covered
- Challenges when working with data
- Requirements
- Automation
- Loan application
- Data formats
- Approach
Talk Citation
Wong, M. (2017, August 31). Big data collection and dataset cleaning [Video file]. In The Business & Management Collection, Henry Stewart Talks. Retrieved November 21, 2024, from https://doi.org/10.69645/CJAJ3391
Other Talks in the Series: Business Intelligence, Big Data, and Applications in Industry
Transcript
0:00
Hello. This presentation is about data preparation,
also sometimes referred to as data cleaning or scrubbing or, less euphemistically,
as data janitor work.
I'm Matt Wong, CEO of Liquidaty,
a New York City based startup company.
0:17
Often the biggest challenge when working with data is not its availability,
quality, volume, or accessibility,
but rather the arduous task of preparing
the data before it can be used for the particular purpose at hand.
To address this challenge,
corporations may resort to hiring full-time employees whose sole job is to
prepare data that resides in one place so
that it can be moved to another place for processing.
This activity in general is what I will refer to as data preparation.
So what exactly does that mean?
Well, you've probably heard of new fields in technology commonly
referred to as big data or machine learning.
And one thing that these new technologies have in common
is that generally they consume vast quantities of data.
If there is one thing that you take away from this presentation,
remember this:
Just as we humans prefer much of our food to be cooked before we eat it,
data-consuming technologies typically need their data to be
prepared, often in a very specific manner, before it can be consumed.
1:25
Let's start with a simple example.
Imagine you have created a holiday card on a website,
and to provide a list of recipients
you'd like to import your address book from a spreadsheet.
Most likely, the import process will
require the data to be formatted in a very specific manner.
In particular, it may require that your data columns
are named in accordance with a predefined specification.
It might be zip code or postal code or it could just be zip.
Last name and first name
might sometimes be combined;
sometimes they are required to be separate.
Should there be street address or just address?
Will it be okay if there are spaces between those column header words,
or if, for example,
there is a dash between "E" and "Mail"?
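The header variations described here can be handled by normalizing each column name before matching it. The following is a minimal sketch in Python, not something shown in the talk. The `ALIASES` table and the canonical names in it (`zip`, `email`, `address`, and so on) are hypothetical stand-ins for whatever specification a real import process would publish.

```python
import re

# Hypothetical canonical names the importing website is assumed to expect.
ALIASES = {
    "zip": "zip",
    "zipcode": "zip",
    "postalcode": "zip",
    "email": "email",
    "address": "address",
    "streetaddress": "address",
    "firstname": "first_name",
    "lastname": "last_name",
}


def normalize_header(header: str) -> str:
    """Map a column header variant to its assumed canonical name."""
    # Lowercase, then drop spaces, dashes, and any other punctuation.
    key = re.sub(r"[^a-z0-9]", "", header.lower())
    # Unknown headers are returned unchanged so nothing is silently lost.
    return ALIASES.get(key, header)
```

With this table, "Zip Code", "Postal code", and "zip" all map to the same name, and "E-Mail" is treated the same as "email".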
In addition to the application being particular about how the data is formatted,
it might also be particular about the content of the data.
Is it acceptable to be missing a zip code or to have a zip code that is not a number?
What if the zip is not consistent with the state, or the address does not actually exist?
Depending on the consuming application,
these kinds of exceptions may or may not need to be identified and handled,
perhaps by fixing them or perhaps by removing the entire entry with the offending data.
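As an illustration of identifying such exceptions, the sketch below classifies a zip code as missing, malformed, or acceptable, and then separates entries accordingly. It assumes US ZIP codes (five digits, optionally followed by a dash and four more) and a hypothetical `zip` field on each entry. Checking a zip against its state, or confirming that an address exists, would need external reference data and is not attempted here.

```python
import re

# Assumption: US ZIP codes, either 12345 or 12345-6789.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def check_zip(value):
    """Return 'missing', 'malformed', or 'ok' for a ZIP code value."""
    text = (value or "").strip()
    if not text:
        return "missing"
    if not ZIP_PATTERN.fullmatch(text):
        return "malformed"
    return "ok"


def split_entries(entries):
    """Separate entries into (kept, rejected) based on their 'zip' field."""
    kept, rejected = [], []
    for entry in entries:
        if check_zip(entry.get("zip")) == "ok":
            kept.append(entry)
        else:
            rejected.append(entry)
    return kept, rejected
```

Whether rejected entries are fixed by hand or simply discarded is a decision for the consuming application, so the function only separates them.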
So let's say that, in order to get your address book data into
a form that is consumable by your holiday card website,
you change some column names,
remove some entries with missing critical data,
and save the resulting spreadsheet as a CSV file.
Congratulations, you've just completed an exercise in data preparation.
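The three steps just described (rename columns, remove entries missing critical data, save as CSV) could be scripted roughly as follows. This is a sketch under stated assumptions rather than the speaker's method: the `RENAMES` mapping and the `REQUIRED` field list are hypothetical, and the rows are assumed to have already been read from the spreadsheet as dictionaries.

```python
import csv
import io

# Hypothetical mapping from the spreadsheet's column names to the names
# the card website is assumed to require.
RENAMES = {
    "Last Name": "last_name",
    "First Name": "first_name",
    "Street Address": "address",
    "Zip Code": "zip",
}
# Hypothetical list of fields treated as critical.
REQUIRED = ("last_name", "address", "zip")


def prepare(rows):
    """Rename columns and drop rows missing any required field."""
    prepared = []
    for row in rows:
        renamed = {RENAMES.get(k, k): (v or "").strip() for k, v in row.items()}
        if all(renamed.get(field) for field in REQUIRED):
            prepared.append(renamed)
    return prepared


def to_csv(rows):
    """Serialize prepared rows as CSV text with a header line."""
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Calling `to_csv(prepare(rows))` yields text that can be written to a `.csv` file. An entry with an empty zip code would be dropped by `prepare` before anything is written.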