Slide1: SNP Resources: Finding SNPs
Discovery and Databases
Mark J. Rieder, PhD
NIEHS Variation Workshop
January 30-31, 2006
Slide2: SNP Resources: SNP discovery and cataloging
SNP discovery/genotyping: Genome-wide approaches
SNP Consortium
HapMap
The current state of SNP resources
Comprehensive SNP discovery
NIEHS SNPs - Environmental Genome Project
SNP Databases - 'How to' Manual for finding SNPs
In class - Tutorial
Slide3: Genetic Markers: Overview
RFLPs (SNPs circa 1980)
Microsatellites (SSLP; di-, tri-, tetranucleotide repeats)
1/50,000 bp
Linkage Studies - 300-400 markers (~1 Mbp)
Multi-allelic/High heterozygosity/informative
Complex genotyping assays
Single Nucleotide Polymorphisms (SNPs)
Most frequent genetic variant (base substitutions)
1/1000 bp (comparing randomly selected chromosomes)
Biallelic/less informative
Simplified genotyping platforms (+/- calling)
Slide4: Development of a genome-wide SNP map: How many SNPs?
Nickerson and Kruglyak, Nature Genetics, 2001
~10 million common SNPs (> 1-5% MAF) - 1/300 bp
How has SNP discovery progressed toward this goal?
Slide5: Finding SNPs: Marker Discovery and Methods
SNP discovery has proceeded in two distinct phases:
1 - SNP Identification
Define the alleles
Map this to a unique place in the genome
2 - SNP Characterization
Determination of the genotype in many individuals
Population frequency of SNPs
Slide6: Finding SNPs: Marker Discovery and Methods
SNP discovery has proceeded in two distinct phases:
1 - SNP Discovery**/Characterization
2 - SNP Discovery/Characterization**
Slide7: Finding SNPs: Marker Discovery and Methods
$45 million - 2 years (1999, 2001 - 2003)
Goals: Identify 300,000 SNPs and map 150,000 (April 1999)
Determine allele frequency of SNPs
If you don’t have a reference genome - how do you find SNPs?
Slide8: Finding SNPs: Sequence-based SNP Mining
Sequence Overlap - SNP Discovery
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGCATCCAGGAGATTACC
Figure labels: DNA sequencing; RT errors?; Sequencing quality
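The overlap comparison on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used by any of the projects named here: it assumes the two reads are already aligned and of equal length, and it ignores base quality, which the slide flags as a real concern. The function name `find_mismatches` is hypothetical.

```python
def find_mismatches(read_a, read_b):
    """Return (1-based position, allele in read_a, allele in read_b)
    for every position where two pre-aligned reads disagree."""
    if len(read_a) != len(read_b):
        raise ValueError("reads must be aligned to the same length")
    return [
        (index + 1, base_a, base_b)
        for index, (base_a, base_b) in enumerate(zip(read_a, read_b))
        if base_a != base_b
    ]


# The two overlapping sequences shown on the slide.
read_1 = "GTTACGCCAATACAGGATCCAGGAGATTACC"
read_2 = "GTTACGCCAATACAGCATCCAGGAGATTACC"

for position, allele_1, allele_2 in find_mismatches(read_1, read_2):
    print(f"candidate SNP at position {position}: {allele_1}/{allele_2}")
```

A mismatch found this way is only a candidate SNP; a sequencing error produces the same signal, which is why quality scores matter.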
Slide9: Finding SNPs: Sequence-based SNP Mining
RRS = Reduced Representation Sequencing
Genomic DNA (multiple individuals)
Restriction enzyme (RE) digestion to generate fragments
Clone DNA fragments into plasmid vectors
Sequence, align, and cluster
From overlap identify mismatches = SNPs
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGCATCCAGGAGATTACC
Altshuler, et al. Nature (2000)
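The "reduced representation" idea (digest, then keep only fragments in a narrow size range so the same subset of the genome is sampled from every individual) can be sketched as an in-silico digest. This is a toy illustration under stated assumptions: the recognition site AAGCTT with a cut after the first base is used only as an example and is not taken from the slides, and the size window is arbitrary. The function names are hypothetical.

```python
def digest(sequence, site, cut_offset):
    """Cut `sequence` at every occurrence of `site`.
    `cut_offset` is how many bases into the site the cut falls."""
    fragments = []
    start = 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        position = sequence.find(site, position + 1)
    fragments.append(sequence[start:])
    return [fragment for fragment in fragments if fragment]


def size_select(fragments, min_len, max_len):
    """Keep only fragments whose length falls inside the selected window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


# Toy sequence with two example sites (assumption: AAGCTT, cut after the first A).
toy_genome = "CCCCAAGCTTGGGGGGAAGCTTTT"
fragments = digest(toy_genome, site="AAGCTT", cut_offset=1)
selected = size_select(fragments, min_len=10, max_len=20)
print(fragments)
print(selected)
```

Because every individual's DNA is cut at the same sites, the size-selected fragments overlap across individuals, which is what makes the mismatch comparison possible without a reference genome.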
Slide10: Finding SNPs: Sequence-based SNP Mining
BAC = Bacterial Artificial Chromosome
Primary vector for DNA cloning in the HGP
DNA from multiple individuals
Fragment DNA
Clone large fragments into BACs (unknown sequence)
Sequence and reassemble (known sequence)
Assembly with other overlapping BACs
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGCATCCAGGAGATTACC
Slide11: Feb. 2001 - Human Genome Project and TSC
TSC and HGP: High-Resolution SNP Map
Slide12: Development of a genome-wide SNP map: How many SNPs?
Nickerson and Kruglyak, Nature Genetics, 2001
~10 million common SNPs (> 1-5% MAF) - 1/300 bp
Feb 2001 - 1.42 million (1/1900 bp)
Slide13: dbSNP - NCBI SNP database
SNP Discovery: dbSNP database
Slide14: SNP data submitted to dbSNP: Clustering
dbSNP processing of SNPs
Slide15: Finding SNPs: Marker Discovery and Methods
SNP discovery has proceeded in two distinct phases:
1 - SNP Identification**/Discovery
2 - SNP Discovery/Characterization**
Slide16: HapMap Project Proposed: Map more SNPs and genotype
Increase SNP density over the first 6-12 months
Ultimately produce a fine-scale genetic map (HapMap) that would serve as a common resource for all biomedical researchers
Genotype 600,000 SNPs genome-wide
Four populations: CEPH (Europe), Yoruban (Africa), Japanese/Chinese (Asian)
Slide17: Initiation of project planning (July 2001):
2.8 million SNPs (1.4 million validated) - 1/1900 bp
HapMap SNP Discovery: Prior to Genotyping
TACGCCTATA TCAAGGAGAT
Generate more SNPs:
Other Sources of SNPs:
Perlegen (Affymetrix chips) SNP data (chr22)
Sequence chromatograms from Celera project
Slide18: HapMap Discovery Increased SNP Density and Validated SNPs
10 million rs SNPs
5 million validated rs SNPs
Slide19: Development of a genome-wide SNP map: How many SNPs?
Nickerson and Kruglyak, Nature Genetics, 2001
~10 million common SNPs (> 1-5% MAF) - 1/300 bp
When will we have them all?
Feb 2001 - 1.42 million (1/1900 bp)
Nov 2003 - 2.0 million (1/1500 bp)
Feb 2004 - 3.3 million (1/900 bp)
Mar 2005 - 5.0 million (validated - 1/600 bp)
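The "1/N bp" densities on this slide are simply sequence length divided by SNP count. A minimal sketch of that arithmetic follows, assuming a round 3.0 Gb genome; the slides do not state the denominator used, so the constant `GENOME_SIZE_BP` and the function name `bp_per_snp` are assumptions for illustration.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: round ~3 Gb haploid genome


def bp_per_snp(n_snps, region_bp=GENOME_SIZE_BP):
    """Average spacing between SNPs, in base pairs, for a region of region_bp."""
    if n_snps <= 0:
        raise ValueError("n_snps must be positive")
    return region_bp / n_snps


# SNP counts as listed on the slide.
milestones = {
    "Feb 2001": 1_420_000,
    "Nov 2003": 2_000_000,
    "Feb 2004": 3_300_000,
    "Mar 2005": 5_000_000,
    "Goal": 10_000_000,
}

for label, count in milestones.items():
    print(f"{label}: 1 SNP per {bp_per_snp(count):.0f} bp")
```

Under this assumption the 5 million and 10 million figures reproduce the slide's 1/600 bp and 1/300 bp. The earlier milestones come out somewhat differently from the slide's rounded values, which suggests those were computed against a different amount of sequence.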
Slide20: Finding SNPs: Sequence-based SNP Mining
RANDOM Sequence Overlap - SNP Discovery
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGCATCCAGGAGATTACC
Figure labels: Genomic; RRS Library; Shotgun Overlap; DNA sequencing; Random Shotgun; Align to Reference
Slide21: SNP discovery is dependent on your sample population size
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGCATCCAGGAGATTACC
Figure labels: 2 chromosomes; 8
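Why sample size matters can be made concrete with a short calculation. A biallelic SNP is only discovered if both alleles appear among the chromosomes sequenced; for a SNP with minor allele frequency p and n independently sampled chromosomes, that probability is 1 - (1 - p)^n - p^n. The sketch below assumes random sampling and error-free sequencing; the function name is hypothetical.

```python
def detection_probability(maf, n_chromosomes):
    """Probability that both alleles of a biallelic SNP are observed
    in a random sample of n_chromosomes."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("maf must be between 0 and 0.5")
    if n_chromosomes < 1:
        raise ValueError("n_chromosomes must be at least 1")
    all_major = (1.0 - maf) ** n_chromosomes
    all_minor = maf ** n_chromosomes
    return 1.0 - all_major - all_minor


# Compare a two-chromosome overlap with larger discovery panels.
for n in (2, 8, 48, 190):
    print(f"n = {n:3d} chromosomes: P(detect 5% MAF SNP) = "
          f"{detection_probability(0.05, n):.3f}")
```

Comparing only two chromosomes finds common SNPs inefficiently and rare ones almost never, which is the motivation for the larger resequencing panels described later in the talk.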
Slide22: SNP Characterization/Genotyping
Nickerson and Kruglyak, Nature Genetics, 2001
~10 million common SNPs (> 1-5% MAF) - 1/300 bp
Mar 2005 - 5.0 million (validated/mapped - 1/600 bp)
5.0/10.0 = 50% of all common SNPs (validated)!
Slide23:
Slide24: Finding SNPs: Genotype Data Adds Value to SNPs
HapMap Genotyping Confirms SNP as 'real' and 'informative'
Minor Allele Frequency (MAF) - common or rare
MAF in different populations
Detection of SNP x SNP correlations
(Linkage Disequilibrium)
Determine haplotypes
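Two of the quantities listed above, minor allele frequency and SNP x SNP correlation, are easy to compute once genotypes exist. The sketch below is illustrative only: the genotype and haplotype inputs are made-up toy data, the r² calculation assumes phased haplotypes are already available, and the function names are hypothetical.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """MAF from diploid genotype calls such as 'GG', 'GC', 'CC'."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) > 2:
        raise ValueError("expected a biallelic SNP")
    if len(counts) < 2:
        return 0.0  # monomorphic in this sample
    return min(counts.values()) / sum(counts.values())


def r_squared(haplotypes):
    """Pairwise LD (r^2) from phased haplotypes given as
    (allele at SNP 1, allele at SNP 2) pairs, one per chromosome."""
    n = len(haplotypes)
    ref_1, ref_2 = haplotypes[0]
    p_1 = sum(h[0] == ref_1 for h in haplotypes) / n
    p_2 = sum(h[1] == ref_2 for h in haplotypes) / n
    p_12 = sum(h[0] == ref_1 and h[1] == ref_2 for h in haplotypes) / n
    d = p_12 - p_1 * p_2
    denominator = p_1 * (1 - p_1) * p_2 * (1 - p_2)
    if denominator == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denominator


# Toy data (not from any study).
toy_genotypes = ["GG", "GC", "CC", "GG"]
linked = [("G", "A"), ("G", "A"), ("C", "T"), ("C", "T")]
unlinked = [("G", "A"), ("G", "T"), ("C", "A"), ("C", "T")]

print(minor_allele_frequency(toy_genotypes))
print(r_squared(linked), r_squared(unlinked))
```

Running the same MAF calculation separately per population is what reveals population-specific SNPs, and high r² between two SNPs is what lets one genotyped SNP stand in for another.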
Slide25: Few SNPs in dbSNP had Genotype Data
Slide26: 1.58 million SNPs genotyped
71 individuals from 3 American populations
European, African and Asian ancestry
Perlegen Large-scale Genotyping Capacity
Slide27: HapMap Completion
Nature - Oct 27 (2005)
HapMap + Perlegen
Slide28: dbSNP: Increasing numbers of SNPs now have genotype data
Figure labels: Perlegen Data; HapMap Phase II; Perlegen
Slide29: Current State of dbSNP
Many SNPs left to validate and characterize.
Slide30: Increasing SNP Density: HapMap ENCODE Project
ENCODE = ENCyclopedia Of DNA Elements
Catalog all functional elements in 1% of the genome (30 Mb)
10 Regions x 500 kb/region (Pilot Project)
David Altshuler (Broad), Richard Gibbs (Baylor)
16 CEU, 16 YRI, 8 HCB, 8 JPT
Comprehensive PCR-based resequencing across these regions
Slide31: Development of a genome-wide SNP map: How many SNPs?
Nickerson and Kruglyak, Nature Genetics, 2001
~10 million common SNPs (> 1-5% MAF) - 1/300 bp
Mar 2005 - 5.0 million (validated - 1/600 bp)
~4.0 million validated SNPs with genotypes!
(HapMap confirmed, allele frequency/population, SNP x SNP correlations (LD), haplotypes)
Slide32: SNP discovery is dependent on your sample population size
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGCATCCAGGAGATTACC
Figure labels: 2 chromosomes
Slide33: Goal: Comprehensively identify all common sequence variation in candidate genes
Initial biological focus: Candidate environmental response genes involved in DNA repair, cell cycle, apoptosis, metabolism, cell signaling, and oxidative stress.
Approach: Direct resequencing of genes
Samples: PDR = 90 ethnically diverse individuals representative of U.S. population (397 genes)
EGP95 = 95 samples from 4 ethnic groups (23 HapMap Asians, 22 HapMap Europeans, 15 HapMap Yorubans, 12 African Americans, 24 Hispanic) (170 genes)
Slide34: Targeted SNP Discovery
Directed analysis: cSNPs
Complete analysis: cSNP and Haplotype Structure Analysis
Figure labels: 5’ to 3’ gene diagrams; PCR amplicons; Arg-Cys; Val-Val
Generate SNP data from complete genomic resequencing (i.e. 5’ regulatory, exon, intron, 3’ regulatory sequence)
Slide35: Nov 2005 - Zaitlen et al. Genome Research 15:1594-1600
Summary of NIEHS SNP genotypes in dbSNP
Current numbers
554 genes sequenced
12.76 Mb scanned
75,580 genotyped SNPs identified
7 million genotypes deposited in dbSNP
Slide36: Development of a genome-wide SNP map: How many SNPs?
Nickerson and Kruglyak, Nature Genetics, 2001
~10 million common SNPs (> 1-5% MAF) - 1/300 bp
NIEHS SNPs = 1/180 bp (n = 95, 4 pops)
HapMap ENCODE = 1/160 bp (n = 48, 3 pops)
Comprehensive resequencing can identify the vast majority of SNPs in a region
Slide37: Rarer and population specific SNPs are found by resequencing
SNP Discovery: dbSNP database
Figure labels: Minor Allele Freq. (MAF); dbSNP (Perlegen/HapMap); NIEHS SNPs; 15%; 25%
Slide38: NIEHS SNPs Characterization
PDR = 90 ethnically diverse individuals representative of U.S. population (397 genes - ~55,000 SNPs)
Selection of informative (high frequency, coding, etc) SNPs to be genotyped in defined populations (~7600 SNPs)
HapMap Populations
European (CEU, n = 60)
African (YRI, n = 60)
Asian (HCB, n = 45 and JPT, n = 45)
Non-HapMap Populations
Hispanic (n = 60)
African-American (n = 62)
Slide39: Illumina NIEHS SNPs Genotyping
Each well samples 1536 SNPs in one individual
For each HapMap sample 5 x 1536 (7680 genotyped SNPs)
3,000,000 genotypes generated (total ~400 samples)
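The genotype totals on this slide follow from simple multiplication. A minimal sketch of the arithmetic, using the slide's figures of 1536 SNPs per well, 5 wells per sample, and roughly 400 samples; the constant and function names are hypothetical.

```python
SNPS_PER_WELL = 1536      # from the slide
WELLS_PER_SAMPLE = 5      # from the slide: 5 x 1536
APPROX_SAMPLES = 400      # from the slide: "total ~400 samples"


def snps_per_sample(wells=WELLS_PER_SAMPLE, snps_per_well=SNPS_PER_WELL):
    """Number of SNPs genotyped in one individual."""
    return wells * snps_per_well


def total_genotypes(n_samples, wells=WELLS_PER_SAMPLE, snps_per_well=SNPS_PER_WELL):
    """Total genotype calls across all samples, assuming every call succeeds."""
    return n_samples * snps_per_sample(wells, snps_per_well)


print(snps_per_sample())
print(total_genotypes(APPROX_SAMPLES))
```

With 400 samples the product is slightly above 3 million, consistent with the slide's rounded figure; the real total depends on the exact sample count and call rate.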
Slide40: Population Allele Frequency Correlations
Illumina NIEHS SNP Genotyping
Slide41: NIEHS SNPs Genotype Data
PDR (397 genes)
SNPs characterized in six different major populations.
Slide42: Summary: The Current State of SNP Resources
SNPs have been rapidly adopted as the genetic marker of choice.
Approximately 10 million common SNPs exist in the human genome (1/300 bp).
Random SNP discovery processes generate many SNPs (TSC and HapMap).
Random approaches to SNP discovery have reached the limits of discovery and validation (1/600 bp; 50% SNP validation).
Most validated SNPs (5 million) will be genotyped by the HapMap (3 pops)
Resequencing approaches continue to catalog important variants (rarer)
NIEHS SNPs has generated SNP data on >550 candidate genes and 75K SNPs