Univariate statistics and metabolomics

Published on October 31, 2017   51 min

Other Talks in the Series: Bioinformatics for Metabolomics

My name is Ron Wehrens. I'm a scientist working at Wageningen University and Research in the Netherlands. Today, we're going to talk about "Univariate Statistics and Metabolomics". I've been doing quite a lot of work in the past few years on analyzing metabolomics data sets. Statistics is of central importance to any field of science where experimental data are analyzed, and metabolomics is no exception. The difficulty in metabolomics comes from two things. The first is that, for every sample that we analyze, we get a lot of information. The second is that we often don't know what that information means. We get signals, we get peaks from our experimental data, but we are not exactly sure what these peaks mean and with which metabolites they are associated. That second part is an annotation problem that I will not go into today. What we will try to do is cover the central ideas behind basic statistical methods and see how they can be applied in a metabolomics context.
So the central element in any statistical analysis is that we are looking at variability, and there are several sources of it. Variability may be desired, in that it gives us information, or it may be unwanted, in that it hampers us in drawing conclusions or in finding out what we want to find out. There are several standard sources of unwanted variability. The first is measurement variability. In any experiment that we do, repeating the experiment will not automatically lead to exactly the same result. Complex machines like mass spectrometers will not give you the same result every time that you do an experiment. So this measurement variability is, in a sense, unavoidable.
Then there is the variability that is associated with the system under study. If we have several different samples, for instance from several different people, these samples will be different because the people that they come from are different. This biological variability between people can be a major source of unwanted variability, and it may really hamper us in drawing conclusions. That's also one of the reasons why, in general, it's better to have as many samples as possible, and we'll come back to that later. With many samples, we hope that the biological variability will even out, and that the average will be close to the true population average that we want to estimate.
And finally, there is a kind of variability that really has to do with human errors. As humans, we are bound to make an error once in a while. In general, as soon as we can automate things, we avoid some of these errors. But then again, setting up the automation is also a human process, and if we make errors there, we can expect errors on a much grander scale.
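To make the point about sample size a bit more concrete, here is a small simulation sketch. It is an illustration and not part of the talk. It assumes that both the biological variability and the measurement variability are normally distributed, and the numbers used (a true mean of 100, a biological standard deviation of 10, a measurement standard deviation of 3) are made up for the example. The function names are hypothetical as well.

```python
import random
import statistics


def simulate_sample_means(n_samples, n_repeats=2000, true_mean=100.0,
                          biological_sd=10.0, measurement_sd=3.0, seed=0):
    """Simulate n_repeats studies of n_samples each and return each study's mean.

    All parameter values are illustrative assumptions, not real metabolomics data.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_repeats):
        values = []
        for _ in range(n_samples):
            # Biological variability: each individual differs from the population mean.
            individual_level = rng.gauss(true_mean, biological_sd)
            # Measurement variability: the instrument adds noise on top of that.
            values.append(rng.gauss(individual_level, measurement_sd))
        means.append(statistics.fmean(values))
    return means


def spread_of_means(n_samples, **kwargs):
    """Standard deviation of the study means: how far a typical average is off."""
    return statistics.stdev(simulate_sample_means(n_samples, **kwargs))


if __name__ == "__main__":
    for n in (5, 20, 80):
        print(f"n = {n:3d}  spread of the sample mean = {spread_of_means(n):.2f}")
```

Under these assumptions, the spread of the average shrinks roughly with the square root of the number of samples, so each fourfold increase in sample size should about halve it. The variability itself does not go away; only its effect on the estimated average becomes smaller.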