Regression with big data

Published on April 30, 2017   23 min
Hello, I'm Ramon DeGennaro, the Haslam College Business Professor of Banking and Finance at the University of Tennessee. Today, I'll be telling you about Regression with Big Data.
Analysis using huge datasets, including regression analysis, is much the same as it is with other datasets. Perhaps the biggest difference is that Big Data offers researchers much more rope, the better to hang themselves. Modern software and computers make it easy to get the research process backwards. Proper analysis begins with a question. The researcher decides which approach might answer it, then collects the necessary data. Only then does he or she estimate a model.
Today, technology and Big Data tempt researchers to turn to empirical analysis too soon. We have a choice. We can take on the difficult task of thinking about the problem, with almost no immediate gratification, or we can plunge into the empirical analysis immediately, because performing analysis using even sophisticated empirical methods requires only a little bit of coding. Immediate gratification with little or no thought wins handily in most cases. Presto, we have a result. Unfortunately, the result probably won't provide the answer to our question.
Big Data tips the scales even further toward mindless computation. Few people enjoy sorting through hundreds or even thousands of variables to decide which ones will be useful. It's much easier to resort to automated modeling techniques such as the venerable stepwise regression. That temptation is particularly strong with hundreds or even thousands of variables. We can and must do better.
Another problem with Big Data is that huge numbers of observations mean tiny standard errors, so our coefficient estimates are extremely precise. It's not clear what that precision means. A precise measurement of the wrong variable, or an estimate built on bad data or a bad model, is no help and in fact leads to overconfidence and policy errors.
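To make the point about tiny standard errors concrete, here is a minimal sketch in Python using NumPy. It is an illustration, not part of the original talk. Everything in it is an assumption chosen for the example: the data are simulated, the true slope of 0.01 is deliberately negligible, the sample sizes are arbitrary, and `simple_ols` is a hypothetical helper written only for this sketch. The idea is to fit the same one-variable regression at increasing sample sizes and watch the standard error shrink while the explanatory power stays close to zero.

```python
import numpy as np


def simple_ols(x, y):
    """Fit y = a + b*x by ordinary least squares.

    Returns (slope, standard error of slope, t-statistic, R-squared).
    Hypothetical helper written for this illustration only.
    """
    n = x.size
    x_c = x - x.mean()                       # centering absorbs the intercept
    y_c = y - y.mean()
    sxx = np.dot(x_c, x_c)
    slope = np.dot(x_c, y_c) / sxx
    resid = y_c - slope * x_c
    rss = np.dot(resid, resid)
    sigma2 = rss / (n - 2)                   # residual variance estimate
    se = np.sqrt(sigma2 / sxx)               # standard error of the slope
    r2 = 1.0 - rss / np.dot(y_c, y_c)
    return slope, se, slope / se, r2


rng = np.random.default_rng(0)
true_slope = 0.01                            # assumed: a practically negligible effect

results = {}
for n in (1_000, 30_000, 1_000_000):         # assumed sample sizes
    x = rng.standard_normal(n)
    y = true_slope * x + rng.standard_normal(n)   # noise swamps the signal
    results[n] = simple_ols(x, y)

for n, (slope, se, t_stat, r2) in results.items():
    print(f"n={n:>9,}  slope={slope:+.4f}  se={se:.5f}  t={t_stat:+.2f}  R2={r2:.6f}")
```

Under these assumptions the standard error falls roughly in proportion to one over the square root of the sample size, so with enough observations even this negligible slope should come out as highly "significant", while R-squared shows that x explains almost none of the variation in y. Precision of the estimate and importance of the effect are separate questions.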