Big data collection and dataset cleaning

Wong, Matt

Topics Covered

Challenges when working with data
Requirements
Automation
Loan application
Data formats
Approach

Links

Series:

Business Intelligence, Big Data, and Applications in Industry

Categories:

Technology & Operations

Talk Citation

Wong, M. (2017, August 31). Big data collection and dataset cleaning [Video file]. In The Business & Management Collection, Henry Stewart Talks. Retrieved July 3, 2025, from https://doi.org/10.69645/CJAJ3391.

Publication History

Published on August 31, 2017

Mr. Matt Wong – Liquidaty, USA

Published on August 31, 2017; duration 9 min

Transcript

0:00

Hello. This presentation is about data preparation, also sometimes referred to as data cleaning or scrubbing or, less euphemistically, as data janitor work. I'm Matt Wong, CEO of Liquidaty, a New York City-based startup company.

0:17

Often the biggest challenge when working with data is not its availability, quality, volume, or accessibility, but rather the arduous task of preparing the data before it can be used for the particular purpose at hand. To address this challenge, corporations may resort to hiring full-time employees whose sole job is to prepare data that resides in one place so that it can be moved to another place for processing. This activity, in general, is what I will refer to as data preparation. So what exactly does that mean? Well, you've probably heard of new fields in technology commonly referred to as big data or machine learning. One thing that these new technologies have in common is that they generally consume vast quantities of data. If there is one thing that you take away from this presentation, remember this: just as we humans prefer much of our food to be cooked before we eat it, data-consuming technologies typically need their data to be prepared, often in a very specific manner, before it can be consumed.

1:25

Let's start with a simple example. Imagine you have created a holiday card on a website, and to provide a list of recipients you'd like to import your address book from a spreadsheet. Most likely, the import process will require the data to be formatted in a very specific manner. In particular, it may require that your data columns are named in accordance with a predefined specification. It might be "zip code" or "postal code", or it could just be "zip". Last name and first name might sometimes be combined, and sometimes be required to be separate. Should there be "street address" or just "address"? Will it be okay if there are spaces between those column header words, or if, for example, there is a dash between "E" and "Mail"? In addition to the application being particular about how the data is formatted, it might also be particular about the content of the data. Is it acceptable to be missing a zip code or to have a zip code that is not a number? What if the zip is not consistent with the state, or if the address does not actually exist? Depending on the consuming application, these kinds of exceptions may or may not need to be identified and handled, perhaps by fixing them or perhaps by removing the entire entry with the offending data. So let's say that, in order to get your address book data into a form that is consumable by your holiday card website, you change some column names, remove some entries with missing critical data, and save the resulting spreadsheet as a CSV file. Congratulations, you've just completed an exercise in data preparation.
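
The address book exercise described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's method or any particular website's import format: the target column names in REQUIRED_COLUMNS, the alias table HEADER_ALIASES, and the rule that a valid zip is exactly five digits are all hypothetical choices made for the example. The sketch normalizes header variants such as "Zip Code", "Postal Code" and "E-Mail", sets aside entries whose zip is missing or not numeric, and writes the remaining entries as CSV text.

```python
import csv
import io
import re

# Hypothetical target schema; a real import tool would publish its own.
REQUIRED_COLUMNS = ["first_name", "last_name", "street_address", "zip", "email"]

# Hypothetical aliases: normalized header variant -> target column name.
HEADER_ALIASES = {
    "firstname": "first_name",
    "lastname": "last_name",
    "address": "street_address",
    "streetaddress": "street_address",
    "zip": "zip",
    "zipcode": "zip",
    "postalcode": "zip",
    "email": "email",
}


def normalize_header(name):
    """Lowercase a header, drop spaces, dashes and underscores, then apply aliases."""
    key = re.sub(r"[\s_\-]+", "", name.strip().lower())
    return HEADER_ALIASES.get(key, key)


def prepare_address_book(raw_csv):
    """Return (cleaned CSV text, list of rejected rows) for the raw CSV text."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    kept, rejected = [], []
    for row in reader:
        # Rename columns and trim whitespace; short rows yield None values.
        clean = {
            normalize_header(k): (v or "").strip()
            for k, v in row.items()
            if k is not None  # ignore surplus cells beyond the header
        }
        # Assumption: a usable zip is exactly five digits (US style).
        if not re.fullmatch(r"\d{5}", clean.get("zip", "")):
            rejected.append(clean)
            continue
        kept.append({col: clean.get(col, "") for col in REQUIRED_COLUMNS})

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=REQUIRED_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(kept)
    return out.getvalue(), rejected


if __name__ == "__main__":
    raw = (
        "First Name,Last Name,Address,Zip Code,E-Mail\n"
        "Ada,Lovelace,1 Main St,10001,ada@example.com\n"
        "Bob,Smith,2 Oak Ave,,bob@example.com\n"
    )
    cleaned, rejected = prepare_address_book(raw)
    print(cleaned)
    print("rejected entries:", len(rejected))
```

Rejected entries are returned rather than silently dropped, so that they could be fixed by hand instead of removed, which matches the two options mentioned in the talk. Checks such as whether the zip agrees with the state, or whether the address exists, would need external reference data and are left out of this sketch.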
