The primary mission of the Division of Statistical Genomics is to provide leadership
and promote excellence in statistical genomics and genetic epidemiology research
at WUMS. The discipline of statistical genetics stands at the center of the BioMed21
vision, providing the statistical analytic techniques which will identify and
characterize the genotype-phenotype connections needed to make personalized
medicine a reality. The DSG intends to become a magnet for all statistical genetics research conducted throughout the university. We will also continue to develop
novel analytic methods and efficient study designs which leverage the remarkable
dual technological acceleration we are experiencing in both genotyping/sequencing efficiency and massive computing power to propel our understanding of the role of
our fundamental genetic profiles in normal variation, disease processes and
response to treatment.
We are fortunate to live in an exciting era, in which scientists are beginning to successfully dissect the genetic architecture of complex traits through large-scale,
government-organized, systematic genome-wide scans (e.g. GAIN, GEI, STAMPEED).
Some of the genetic underpinnings of the major chronic diseases of middle age (heart disease, cancer, stroke, diabetes, Alzheimer’s, etc.) are tantalizingly close to our grasp.
The days of personalized medicine are surely at hand. We have finally found a
“formula” or strategy, by which much low-hanging genetic fruit will be harvested in the
next few years. This new genomic paradigm is based upon separating the dissection
problem into two disjoint tasks: 1) gene variant discovery/validation and 2) gene characterization.
For variant discovery and validation, we are interested only in identifying which variants are important for a trait/disease and which are noise in the genome for that trait. At this stage, we are less concerned with their exact mechanism of action, which we defer to the gene characterization stage. During the gene variant discovery phase, we use simple tools to scan the genome, and we largely ignore the complexity that we know must surely exist. We search in a huge number of places in the genome, but each time we use simple one-SNP-at-a-time main effect models (perhaps adding a few linear covariates to the model, whose effects we hope to adjust away sufficiently), hoping that the underlying gene signals will be strong enough to at least be detectable, if not accurately estimated, by such simple tools applied in this way. Once we have “found” all of the gene players, we hope we can deal with the complexity issues later, when the number of variants to consider has been radically reduced from many millions to a few hundred. Only then do we consider exactly how and when genetic variants interact with specific environmental triggers to produce trait variation, and how they dynamically interact with one another.
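The one-SNP-at-a-time scan described above can be sketched as follows. This is only a minimal illustration on simulated data, not the DSG's actual pipeline: the sample size, number of SNPs, allele frequency, effect size, and the function names `slope_t_statistic` and `single_snp_scan` are all assumptions made up for the example, and covariate adjustment is omitted.

```python
import math
import random


def slope_t_statistic(genotype, phenotype):
    """t-statistic for the slope when regressing the phenotype on a single SNP."""
    n = len(genotype)
    mean_g = sum(genotype) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in genotype)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotype, phenotype))
    syy = sum((y - mean_y) ** 2 for y in phenotype)
    if sxx == 0:
        return 0.0  # monomorphic SNP: no genotype variation to test
    beta = sxy / sxx                      # main-effect (slope) estimate
    rss = syy - beta * sxy                # residual sum of squares
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of the slope
    return beta / se


def single_snp_scan(snps, phenotype):
    """Apply the same simple main-effect model to every SNP, one at a time."""
    return [slope_t_statistic(snp, phenotype) for snp in snps]


# Simulated data (all values are illustrative assumptions).
random.seed(0)
n_samples, n_snps, causal, maf = 500, 200, 57, 0.3

# Each SNP is a list of minor-allele counts (0, 1, or 2), one per sample.
snps = [
    [(random.random() < maf) + (random.random() < maf) for _ in range(n_samples)]
    for _ in range(n_snps)
]

# Only one SNP truly affects the trait; everything else is noise.
phenotype = [0.8 * snps[causal][i] + random.gauss(0, 1) for i in range(n_samples)]

t_stats = single_snp_scan(snps, phenotype)
top_snp = max(range(n_snps), key=lambda j: abs(t_stats[j]))
print("strongest signal at SNP index", top_snp)
```

With an effect this large the simulated causal SNP stands out clearly from the noise SNPs. Shrinking the effect size in the simulation is a quick way to see how soon such a simple scan loses its ability to detect a variant.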
But we are rapidly reaching the point where all such “low-hanging fruit” (by definition, the gene variants that can be found by such simplistic scanning) will be completely discovered. Many researchers suspect that these larger effect variants will not account for all of the important variation in complex traits (or we would have found them long ago with more primitive technologies), and they may in fact explain very little of the relative genetic impact on disease traits. The vast majority of important variation may arise from effects too small to be detected by such simplistic methods and models. Then what will we do? We will have to use new strategies to harvest the higher-hanging fruit.
Our ability to make progress toward discovering and characterizing the high-hanging fruit is hampered by two major challenges. The first challenge is the sheer vastness of the territory in which the important variants are hidden. There are 3.2 billion base pairs in the human genome (http://www.hapmap.org/whatishapmap.html). If one base pair were the size of a standard staple (about ½ inch), and we laid 3.2 billion of them end-to-end to form a human genome, the chain would circle the earth at the equator roughly once. Of course, much of the genome is constant across all humans, because it codes for what makes us human (and primates, and mammals, and animals, etc.). But even if we confine our attention only to easy-to-measure SNPs (the estimated prevalence of SNPs is about 1 per 1,200 base pairs, or one every 50 feet along the chain of staples around the equator), it is still a daunting task to distinguish what variation is signal and what is noise. Linkage disequilibrium patterns (correlations among nearby SNPs in the genome) are helping us quickly home in on regions harboring common large effect causative variants for disease (the low-hanging fruit), but power is low for tag SNPs for rare causative variants, and small effect variants will require unrealistic sample sizes, so much of our current LD mapping paradigm will collapse when we try to go for the high-hanging fruit.
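The staple analogy can be checked with a few lines of arithmetic. The sketch below uses the figures quoted in the text (3.2 billion base pairs, a ½ inch staple, one SNP per 1,200 base pairs). The equatorial circumference of about 24,901 miles is an outside assumption not stated in the text, and the variable names are illustrative.

```python
GENOME_BASE_PAIRS = 3.2e9   # from the text
STAPLE_INCHES = 0.5         # from the text: one staple per base pair
SNP_SPACING_BP = 1200       # from the text: about one SNP per 1,200 base pairs
INCHES_PER_FOOT = 12
FEET_PER_MILE = 5280
EQUATOR_MILES = 24901       # assumption: approximate equatorial circumference

# Total length of the staple chain, converted from inches to miles.
chain_miles = GENOME_BASE_PAIRS * STAPLE_INCHES / INCHES_PER_FOOT / FEET_PER_MILE

# How many times the chain wraps around the equator.
laps = chain_miles / EQUATOR_MILES

# Distance between consecutive SNPs along the chain, in feet.
snp_spacing_feet = SNP_SPACING_BP * STAPLE_INCHES / INCHES_PER_FOOT

# Rough number of SNPs implied by that spacing.
expected_snps = GENOME_BASE_PAIRS / SNP_SPACING_BP

print(f"chain length: {chain_miles:,.0f} miles ({laps:.2f} laps of the equator)")
print(f"SNP spacing:  {snp_spacing_feet:.0f} feet")
print(f"implied SNPs: {expected_snps:,.0f}")
```

Under these figures the chain is a little over 25,000 miles long, so it wraps the equator slightly more than once, and the implied number of SNPs to sift through is in the millions.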
The second challenge is the very complexity of the pathways and systems of the diseases and traits we are after. This is especially true for complex traits in which the component genes in a phenotype pathway have dynamic function and exhibit epistatic (gene-gene) interactions as well as gene-by-environment interactions with various exposure patterns. We used to believe that our ultimate genetic targets were non-synonymous exonic coding SNPs. But we are increasingly coming to appreciate the importance of alternative splicing junctions, promoter and regulatory variants, transcription binding variants, micro-RNAs, the role of copy number variation and epigenetic phenomena, as well as the emerging importance of metagenomics and the interplay between our genome and that of other organisms in our ecosystem. Systems biology approaches for dissecting the roles of gene networks in such complex pathways are another major research arc of statistical genetics. It may very well be that we cannot make progress with the old paradigm of using simplistic approaches in a discovery phase and deferring characterization to a later stage. We may have to confront the complexity of our networks and systems head-on, from the beginning, or we may fail at both goals of discovery and characterization. One thing is clear: the current paradigm, as successful as it now is, will reach its limit all too quickly. We need to begin thinking about the problems of the future, not the ones of the present.
Meeting these challenges will take multidisciplinary research teams dedicated to the genetic dissection of complex traits and committed to understanding how genetic variation produces trait variation. It will require interactions among clinicians and pharmacogeneticists who understand the biology of traits and the dynamic nature of phenotypic development in response to environmental exposures and treatments, bioinformaticians who can leverage the vast databases of accumulated knowledge, experimental geneticists who can bring the power of model systems and comparative genomics, and mathematicians and statistical geneticists (such as those from the DSG) who can develop new models and methods of integrating complexity and the dynamic nature of the genotype-phenotype connection in the development of complex traits. The DSG is committed, eager and excited to play its part in making personalized medicine a reality.