This lecture, An Introduction to Statistics for Statistical Genetics,
is the first talk in the statistical genetics series.
I'm Dr. Paul O'Reilly,
a senior lecturer in statistical genetics,
performing research at King's College London.
This is Part Two: Models and Techniques Common in Statistical Genetics.
In this section of the talk,
I will give a basic overview of several models and techniques
popular in statistical genetics, with the aim of
providing an introductory-level understanding of each.
If you plan to use any of these approaches,
then you will need to obtain further details from other lectures in this series,
from statistical textbooks, or from online resources.
In part two of the talk,
I'll introduce a number of statistical models and techniques
that are often used in statistical genetics.
First, I'll describe hidden Markov models.
Then I'll explain the process of statistical imputation,
as well as a method of imputation especially tailored for application to genetic data.
Next, I'll explain principal component analysis, and then mixed models,
and finally, I'll describe shrinkage and regularisation methods.
For each of these I'll give the general intuition of the model or technique
and explain its relevance to statistical genetics using examples from the field.
A process has the Markov property if
the next state, in space or time, depends only on the present state and not on the states that came before it.
A useful way to think about this in a real life context is to consider the weather.
If it is raining now,
it doesn't matter too much that it was dry and sunny yesterday;
it's very likely to still be raining one minute from now.
Strictly speaking, weather isn't Markovian, since
the recent weather or present season is also informative about future weather,
but over short periods of time weather is close
enough to Markovian to be a useful analogy.
If we model a system or process as having the Markov property,
then it is called a Markov model.
A Markov model is typically made up of some number of possible states,
which in the case of weather might be
raining, snowing, sunny, and overcast,
along with transition probabilities of switching from one state to another.
The transition probabilities may be different for different transitions.
For example, going from overcast to snowing
has higher probability than going from sunny to snowing.
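As a rough illustration of the weather example, here is a minimal Python sketch of a Markov model. The four states are the ones just mentioned, but the transition probabilities are made-up numbers chosen only so that each row sums to one and overcast-to-snowing is more likely than sunny-to-snowing; the names `TRANSITIONS` and `simulate_weather` are hypothetical.

```python
import random

STATES = ["raining", "snowing", "sunny", "overcast"]

# Hypothetical transition probabilities: TRANSITIONS[current][next].
# Values are illustrative only, not estimated from any data.
TRANSITIONS = {
    "raining":  {"raining": 0.70, "snowing": 0.05, "sunny": 0.05, "overcast": 0.20},
    "snowing":  {"raining": 0.10, "snowing": 0.60, "sunny": 0.05, "overcast": 0.25},
    "sunny":    {"raining": 0.05, "snowing": 0.01, "sunny": 0.74, "overcast": 0.20},
    "overcast": {"raining": 0.25, "snowing": 0.10, "sunny": 0.20, "overcast": 0.45},
}

def simulate_weather(start, n_steps, rng):
    """Simulate a chain in which each step depends only on the current state."""
    path = [start]
    for _ in range(n_steps):
        row = TRANSITIONS[path[-1]]  # only the present state is consulted
        next_state = rng.choices(list(row), weights=list(row.values()))[0]
        path.append(next_state)
    return path

print(simulate_weather("raining", 10, random.Random(1)))
```

The key point is inside the loop: the next state is drawn using only the row for the current state, so nothing earlier in the path influences it.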
Markov models are extremely useful in analysing genetic data
because the ancestral contributions to our DNA sequence are highly Markovian.
Consider a chromosome that came from either your mother or father.
This chromosome will be a mosaic of
your grandparents' chromosomes as a result of recombination.
It can be viewed as being made up of two states,
grandmother or grandfather.
If, at a particular locus,
the sequence is from your grandmother,
then the sequence at the very next locus is most likely from your grandmother as well,
but with some probability there will be a transition
to sequence that came from your grandfather.
Likewise, we can view our chromosomes as being a mosaic of
our great-grandparents' chromosomes,
or of our ancestors from any number of generations ago.
Regions with high recombination rates will involve
many transitions between these ancestral sequence states,
whereas those with little recombination may correspond to only a single ancestral state.
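The mosaic picture can be sketched in a few lines of Python. This is a toy simulation under simplifying assumptions, not a realistic recombination model: there are only two ancestral states, and `recomb_probs` is a hypothetical list giving the probability of a switch between each pair of adjacent loci.

```python
import random

def simulate_mosaic(recomb_probs, rng):
    """Return the grandparental origin at each locus along one chromosome.

    recomb_probs[i] is the (hypothetical) probability of switching origin
    between locus i and locus i + 1, so the result has len(recomb_probs) + 1 loci.
    """
    other = {"grandmother": "grandfather", "grandfather": "grandmother"}
    state = rng.choice(["grandmother", "grandfather"])
    origins = [state]
    for p in recomb_probs:
        if rng.random() < p:  # a recombination event switches the ancestral state
            state = other[state]
        origins.append(state)
    return origins

def count_transitions(origins):
    """Count how many times the ancestral state changes along the chromosome."""
    return sum(a != b for a, b in zip(origins, origins[1:]))

rng = random.Random(42)
low = simulate_mosaic([0.001] * 200, rng)   # region with little recombination
high = simulate_mosaic([0.2] * 200, rng)    # region with a high recombination rate
print(count_transitions(low), count_transitions(high))
```

With a switch probability near zero, the whole region tends to come from a single grandparent, whereas a high switch probability produces many short segments.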
Because all individuals are related to one another,
a sample of individuals' chromosomes,
and genetic variation data in general,
can be well captured by Markov models.
In practice, hidden Markov models,
or HMMs for short,
are usually employed, because
the ancestral states cannot be observed directly but can be estimated from genotype data.
Going back to the weather analogy,
applying hidden Markov models is a bit
like trying to estimate the state of the weather only from
data on what clothes people are wearing or
whether they're applying sun cream or holding umbrellas.
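To make the analogy concrete, here is a minimal sketch of decoding hidden weather states from what people are seen doing, using the Viterbi algorithm, which is one standard way of finding the most probable hidden path in an HMM. Everything numeric here is an assumption made for illustration: the two hidden states, the three observation types, and all of the start, transition, and emission probabilities.

```python
import math

HIDDEN = ["raining", "sunny"]

# All probabilities below are hypothetical values chosen for illustration.
START = {"raining": 0.5, "sunny": 0.5}
TRANS = {
    "raining": {"raining": 0.8, "sunny": 0.2},
    "sunny":   {"raining": 0.2, "sunny": 0.8},
}
EMIT = {
    "raining": {"umbrella": 0.80, "sun cream": 0.05, "neither": 0.15},
    "sunny":   {"umbrella": 0.05, "sun cream": 0.60, "neither": 0.35},
}

def viterbi(observations):
    """Return the most probable hidden-state path for the observations."""
    # best[s] = log-probability of the best path so far that ends in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
            for s in HIDDEN}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in HIDDEN:
            # Best previous state from which to arrive in state s
            prev = max(HIDDEN, key=lambda r: best[r] + math.log(TRANS[r][s]))
            new_best[s] = (best[prev] + math.log(TRANS[prev][s])
                           + math.log(EMIT[s][obs]))
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the most probable final state
    state = max(HIDDEN, key=best.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi(["umbrella", "umbrella", "sun cream", "sun cream"]))
# → ['raining', 'raining', 'sunny', 'sunny']
```

In a genetic setting the same structure would apply, with ancestral sequences playing the role of the hidden weather states and observed genotypes playing the role of the umbrellas and sun cream.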
In genetics, our observed data are usually genotypes,
and we can use these to estimate
the different hypothetical ancestral sequences underlying
the genotypes at a genomic locus in a sample of individuals.
This has the effect of clustering the sample of
DNA sequences into groups of similar sequences.
HMMs are also used to estimate where transitions occur between
different ancestral sequences along individual chromosomes.
By capturing the structure of genetic variation data in samples of individuals
in a way that reflects their present similarities and differences, and their ancestral histories,
hidden Markov models are extremely useful in statistical genetics and have been
employed in a wide range of applications, including estimating haplotypes from genotypes,
identifying copy number variants,
characterising population admixture, and performing genetic imputation.
The problem of missing data plagues medical and scientific research,