An introduction to statistics for statistical genetics: models and techniques common in statistical genetics
Published on July 31, 2017 18 min
Other Talks in the Series: Statistical Genetics
An introduction to statistics for statistical genetics: general statistical concepts
- Dr. Paul O'Reilly
- King's College London, UK
GxE interactions in genome-wide association studies
- Dr. David V. Conti
- University of Southern California, USA
Heritability and its uses
- Dr. Doug Speed
- Aarhus Institute of Advanced Studies, Aarhus University, Denmark and University College London, UK
This lecture, An Introduction to Statistics for Statistical Genetics, is the first talk in the statistical genetics series. I'm Dr. Paul O'Reilly, a senior lecturer in statistical genetics, performing research at King's College London. This is Part Two: Models and Techniques Common in Statistical Genetics. In this section of the talk, I will give a basic overview of several models and techniques popular in statistical genetics, with the aim of providing an introductory-level understanding of each. If you plan to use any of these approaches, then you will need to obtain further details from other lectures in this series, from statistical textbooks, or online.
In part two of the talk, I'll introduce a number of statistical models and techniques that are often used in statistical genetics. First, I'll describe hidden Markov models. Then I'll explain the process of statistical imputation, as well as a method of imputation especially tailored for application to genetic data. Next, I'll explain principal component analysis, and then mixed models, and finally, I'll describe shrinkage and regularisation methods. For each of these, I'll give the general intuition of the model or technique and explain its relevance to statistical genetics using examples from the field.
A process has the Markov property if the next state, in space or time, is governed only by the present state. A useful way to think about this in a real-life context is to consider the weather. If it is raining now, it doesn't matter too much that it was dry and sunny yesterday; it's very likely to still be raining one minute from now. Strictly speaking, weather isn't Markovian, since the recent weather or present season is also informative about future weather, but over short periods of time weather is close enough to Markovian to be a useful analogy. If we model a system or process as having the Markov property, then the model is called a Markov model. A Markov model is typically made up of some number of possible states, which in the case of weather might be raining, snowing, sunny, and overcast, along with transition probabilities of switching from one state to another. The transition probabilities may be different for different transitions. For example, going from overcast to snowing has higher probability than going from sunny to snowing. Markov models are extremely useful in analysing genetic data because the ancestral contributions to our DNA sequence are highly Markovian. Consider a chromosome that came from either your mother or father. This chromosome will be a mosaic of your grandparents' chromosomes as a result of recombination. It can be viewed as being made up of two states, either grandmother or grandfather. If, at a particular locus, the sequence is from your grandmother, then the sequence at the very next locus is most likely from your grandmother as well, but with some probability there will be a transition to sequence that came from your grandfather. Likewise, we can view our chromosomes as being a mosaic of our great-grandparents' chromosomes, or of the chromosomes of our ancestors from any number of generations ago.
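To make the idea of states and transition probabilities concrete, here is a minimal Python sketch of a Markov model. It is an illustration only and is not taken from the lecture: the transition probabilities in `WEATHER`, the switch probability of 0.05 per locus, and the helper names `simulate_chain` and `mosaic_transitions` are all assumptions chosen for the example. The only property carried over from the talk is that the next state is drawn using the current state alone, and that overcast to snowing is more probable than sunny to snowing.

```python
import random

# Hypothetical transition probabilities, invented for illustration.
# Each row gives P(next state | current state) and sums to 1.
WEATHER = {
    "raining":  {"raining": 0.90, "snowing": 0.01,  "sunny": 0.02, "overcast": 0.07},
    "snowing":  {"raining": 0.04, "snowing": 0.85,  "sunny": 0.01, "overcast": 0.10},
    "sunny":    {"raining": 0.01, "snowing": 0.001, "sunny": 0.90, "overcast": 0.089},
    "overcast": {"raining": 0.10, "snowing": 0.03,  "sunny": 0.07, "overcast": 0.80},
}


def simulate_chain(transitions, start, n_steps, rng):
    """Return a list of n_steps + 1 states, beginning with `start`.

    The next state is sampled using only the current state, which is
    exactly the Markov property.
    """
    path = [start]
    for _ in range(n_steps):
        current = path[-1]
        next_states = list(transitions[current])
        weights = [transitions[current][s] for s in next_states]
        path.append(rng.choices(next_states, weights=weights, k=1)[0])
    return path


def mosaic_transitions(switch_prob):
    """Two-state model of a chromosome as a mosaic of grandparental sequence.

    `switch_prob` is an assumed per-locus probability of switching between
    grandmother and grandfather, standing in for recombination.
    """
    stay = 1.0 - switch_prob
    return {
        "grandmother": {"grandmother": stay, "grandfather": switch_prob},
        "grandfather": {"grandfather": stay, "grandmother": switch_prob},
    }


rng = random.Random(0)  # fixed seed so repeated runs give the same paths

weather_path = simulate_chain(WEATHER, "raining", 10, rng)
print(weather_path)

# M = locus inherited from grandmother, F = locus inherited from grandfather.
mosaic = simulate_chain(mosaic_transitions(0.05), "grandmother", 40, rng)
print("".join("M" if state == "grandmother" else "F" for state in mosaic))
```

With a small switch probability, the second printout tends to show long runs of the same letter separated by occasional switches, which is the mosaic pattern described above. Raising `switch_prob` mimics a region with more recombination and produces shorter runs.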
Regions with high recombination rates will involve many transitions between these ancestral sequence states, whereas those with little recombination may correspond to only a single ancestral state. Because of the relatedness among all individuals, this means that a sample of individuals' chromosomes, and genetic variation data in general, can be well captured by Markov models. In practice, hidden Markov models, or HMMs for short, are usually employed, because the ancestral states are unknown but genotype data can be used to estimate them. Going back to the weather analogy, applying hidden Markov models is a bit like trying to estimate the state of the weather only from data on what clothes people are wearing, or whether they're applying sun cream or holding umbrellas. In genetics, our observed data are usually genotypes, and we can use these to estimate the different hypothetical ancestral sequences underlying the genotypes at a genomic locus in a sample of individuals. This has the effect of clustering the sample of DNA sequences into groups of similar sequences. HMMs are also used to estimate where there are transitions between different ancestral sequences along individual chromosomes. By capturing the structure of genetic variation data in samples of individuals in a way that reflects their present similarities and differences, as well as their ancestral histories, hidden Markov models are extremely useful in statistical genetics and have been employed in a wide range of applications, including estimating haplotypes from genotypes, identifying copy number variants, characterising population admixture, and performing genetic imputation. The problem of missing data plagues medical and scientific research,