Protein Subcellular Localization : Shan Sundararaj
University of Alberta
Edmonton, AB
ss23@ualberta.ca
Why is Localization Important? : Function is dependent on context
Co-localization of proteins of related function
Valuable annotation for new proteins
Design of proteins with specific targets
Drug targeting
Accessibility:
Membrane-bound > cytoplasmic > nuclear
Why is Localization Important? : 1974 Nobel Prize in Physiology/Medicine
George Palade
“for discoveries concerning the structural and functional organization of the cell”
1999 Nobel Prize in Physiology/Medicine
Günter Blobel
“for the discovery that proteins have intrinsic signals that govern their transport and localization in the cell”
Bacteria : Gram-positive (3-4 states): cytoplasm, cytoplasmic membrane, cell wall, extracellular
Gram-negative (5 states): cytoplasm, cytoplasmic membrane, periplasm, outer membrane, extracellular
Eukaryotic Cell : Compartmentalized
Diverse range of specific organelles:
Plants: chloroplasts, chromoplasts, other plastids
Muscle: sarcoplasm
Various endosomes, vesicles
(modified from Voet & Voet, Biochemistry; Wiley-VCH 1992)
Yet more categories… : Figure panels: Chloroplast; Mitochondrion; Yeast “specific”
Level of Annotation : As simple as two states:
membrane protein vs. non-membrane protein
secreted protein vs. non-secreted protein
Gross compartments:
cytoplasm, inner membrane, periplasm, cell wall, outer membrane, extracellular
nucleus, mitochondria, peroxisome, vacuole…
Fine compartments:
Mitochondrial matrix, bud neck, spindle pole…
Any of 1425 GO cellular compartments
Localization signaling : Proteins must have intrinsic signals for their localization – a cellular address
E.g. N-terminal signal sequences
Address analogy:
321 Nuclear Inner Membrane Lane
Nucleus, Intracellular county
Eukaryotic Cell
CL34V3M3
Localization signaling : Some signals are easily recognizable
Signal peptidase cleavage site: consensus sequence for secretion to the extracellular space
Like an address printed neatly, with a postal code
Others are difficult to understand
Outer membrane β-barrel proteins: no consensus sequence, few sequence constraints
Like a sloppy address, written in a different kind of code that we don’t understand yet
Experimental determination : Since we don’t fully understand the language of proteins, our knowledge must often come from inference
Predicting localization is like sorting mail based only on examples of where some mail has gone before
Important to have good data sets of proteins with known localizations
Datasets : Organelle_DB (http://organelledb.lsi.umich.edu/)
25095 eukaryotic proteins from subcellular proteomics studies
DBSubLoc (http://www.bioinfo.tsinghua.edu.cn/~guotao/download.html)
Combines SwissProt and PIR annotations (64051 proteins)
PSORTDB (http://db.psort.org/)
Bacterial. 1591 Gram –ve proteins, 574 Gram +ve proteins
SignalP (http://www.cbs.dtu.dk/ftp/signalp/)
940 plant and 2738 human proteins
YPL (http://bioinfo.mbb.yale.edu/genome/localize/)
2956 yeast proteins
Experimental Methods : Electron microscopy
GFP tagging / fluorescence microscopy
Subcellular fractionation + detection
Western blotting
Mass spectrometry
Electron Microscopy : Highest resolution, can work at the level of a single protein complex
Immunolabel proteins of interest in conjunction with colloidal gold, and visualize
Combined with electron tomography, can even visualize unlabeled complexes
(from Koster and Klumperman, Nat Rev Mol Cell Biol, Sep 2003, S6-10)
Fluorescence Microscopy : Tag gene at either 3’ or 5’ end
Using GFP (or RFP, YFP, CFP, etc.)
Using an epitope tag and a fluorescently labeled antibody
Take care that the tag does not remove or disrupt signal peptides!
Also use a subcellular-specific marker or stain
Visualize with confocal fluorescence microscopy and analyze images for co-localization
Specific co-labeling (yeast) : Early Golgi: Cop1
Endosome: Snf7
ER to Golgi: Sec13
Golgi apparatus: Anp1
Late Golgi: Chc1
Lipid particle: Erg6
Mitochondrion: MitoTracker
Nucleus: DAPI
Nucleolus: Sik1
Nuclear periphery: Nic96
Peroxisome: Pex3
Vacuole: FM4-64
Figure: nuclear-specific DAPI staining
Subcellular Fractionation : Tissue homogenate
Centrifuge at 1000 g
Pellet: unbroken cells, nuclei, chloroplasts
Transfer supernatant; centrifuge at 10,000 g
Pellet: mitochondria
Transfer supernatant; centrifuge at 100,000 g
Pellet: microsomal fraction (ER, Golgi, lysosomes, peroxisomes)
Supernatant: cytosol, soluble enzymes
Detergent Fractionation : Cells
Extraction with digitonin/EDTA
Supernatant: cytoplasmic fraction
Pellet: extraction with Triton X-100/EDTA
Supernatant: organelle membranes
Pellet: extraction with SDS/EDTA
Nuclear and cytoskeletal fraction (in SDS)
Fractionation Identification : Once fractionated, take compartment of interest and separate proteins
2D gel or chromatography
Identify separated proteins
Mass spectrometry for high-throughput
Western blot for specific proteins
Fractionation in proteomics
High-Throughput Experiments : Kumar et al., Genes Dev 2002, 16:707-719
Epitope-tagged >60% of ORFs, visualized with fluorescently labeled antibody
2744 localizations (44% of S. cerevisiae genes)
Huh et al., Nature 2003, 425:686-691
GFP tagged all ORFs, RFP tagged compartments
4156 localizations (75% of S. cerevisiae genes)
Combined, now nearly 87% of yeast proteins have a localization annotation
High-Throughput Experiments : Lopez-Campistrous et al., Mol Cell Proteomics, 2005
Subcellular fractionation of E. coli, 2D-gel separation, MS-MS
2,160 localizations to cytoplasm, inner membrane, periplasm, and outer membrane
Predictions from known data : Enough experimental data exist to build highly accurate computational predictors of localization
Predictions from known data : Different information used for predictions:
Sequence motifs
N-terminal: secretory signal peptides, mitochondrial targeting peptide, chloroplast transit peptide
C-terminal: peroxisome import signal, ER retention signal
Mid-sequence: nuclear localization signals
Amino acid composition
AA frequency, dipeptide composition.
Homology
- Sequence comparison to proteins of known localization
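The amino acid composition features named above can be sketched as follows. This is a minimal illustration, not the feature code of any particular predictor; the function names and the choice to concatenate both compositions into a single vector are assumptions made for the example, and the sequence is assumed to be non-empty.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(sequence):
    """Fraction of each standard amino acid in the sequence (20 features)."""
    counts = Counter(sequence)
    return {aa: counts[aa] / len(sequence) for aa in AMINO_ACIDS}


def dipeptide_composition(sequence):
    """Fraction of each adjacent residue pair in the sequence (400 features)."""
    pairs = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    total = max(len(sequence) - 1, 1)  # avoid dividing by zero for 1-residue input
    return {a + b: pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)}


def composition_features(sequence):
    """Concatenate both compositions into one fixed-length vector (hypothetical layout)."""
    seq = sequence.upper()
    aa = aa_composition(seq)
    dp = dipeptide_composition(seq)
    return [aa[k] for k in sorted(aa)] + [dp[k] for k in sorted(dp)]
```

A fixed-length vector like this (420 numbers per protein) is the kind of input that a classifier can be trained on, regardless of protein length.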
N-terminal signal peptides : Common structure of signal peptides:
positively charged n-region, followed by a hydrophobic h-region and a neutral but polar c-region.
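The n-region/h-region description can be turned into a rough heuristic check. The sketch below is only an illustration: the fixed region lengths, the thresholds, and the function names are assumptions, the hydropathy values are the Kyte-Doolittle scale, and real predictors such as SignalP use trained models rather than fixed windows.

```python
# Kyte-Doolittle hydropathy scale (higher = more hydrophobic)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def n_terminal_profile(sequence, n_len=5, h_len=12):
    """Return (net charge of n-region, mean hydropathy of h-region).

    n_len and h_len are assumed window sizes, not values from the slides.
    """
    seq = sequence.upper()
    n_region = seq[:n_len]
    h_region = seq[n_len:n_len + h_len]
    charge = sum(aa in "KR" for aa in n_region) - sum(aa in "DE" for aa in n_region)
    if not h_region:
        return charge, 0.0
    hydropathy = sum(KYTE_DOOLITTLE.get(aa, 0.0) for aa in h_region) / len(h_region)
    return charge, hydropathy


def looks_like_signal_peptide(sequence, min_charge=1, min_hydropathy=1.5):
    """Crude rule: positively charged n-region followed by a hydrophobic h-region."""
    charge, hydropathy = n_terminal_profile(sequence)
    return charge >= min_charge and hydropathy >= min_hydropathy
```

For example, a sequence starting `MKKTAIAIAVALAGFATVAQA` has two lysines in its first five residues and a strongly hydrophobic stretch after them, so it passes this rule, while an acidic N-terminus does not.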
More work to do : Multiple bacterial secretion pathways
C-terminal signal peptides
Internal mitochondrial transit peptides
Structural aspects of targeting
Gene re-localization
Still a lot to discover in how signaling works!
Computational methods for predicting localization : Expert rule based methods
Artificial Neural Nets (ANN)
Hidden Markov Models (HMM)
Naïve Bayes (NB)
Support Vector Machines (SVM)
Combination of above methods
Naïve Bayes : Assumption:
Features are conditionally independent, given class labels
Structure:
1 level tree
Class labels: root
Features: leaf nodes
Prediction:
$\mathrm{class}(f) = \arg\max_{c} P(C=c)\,P(F=f \mid C=c)$
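Under the conditional independence assumption, P(F=f | C=c) factors into a product over individual features, so the argmax can be computed as a sum of logs. The sketch below assumes binary token features (present or absent) with add-one smoothing; the class name, the smoothing, and the toy training data are all assumptions for illustration, not the classifier of any tool described here.

```python
import math
from collections import Counter, defaultdict


class TokenNaiveBayes:
    """Bernoulli naive Bayes over sets of tokens (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (assumed value)
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocabulary = set()

    def fit(self, examples):
        """examples: iterable of (token_set, class_label) pairs."""
        for tokens, label in examples:
            self.class_counts[label] += 1
            for token in set(tokens):
                self.token_counts[label][token] += 1
                self.vocabulary.add(token)
        return self

    def log_score(self, tokens, label):
        """log P(C=c) + sum over features of log P(F_i=f_i | C=c)."""
        n = self.class_counts[label]
        score = math.log(n / sum(self.class_counts.values()))
        present = set(tokens)
        for token in self.vocabulary:
            p = (self.token_counts[label][token] + self.alpha) / (n + 2 * self.alpha)
            score += math.log(p if token in present else 1.0 - p)
        return score

    def predict(self, tokens):
        """Return the class with the highest log score (the argmax over c)."""
        return max(self.class_counts, key=lambda label: self.log_score(tokens, label))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.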
Artificial Neural Network : Excellent for modeling non-linear input/output relationships
Robust to noise in training data
Widely used in bioinformatics
Support Vector Machines : Input vectors are separated into positive vs. negative instances
Map to new feature space
Find the hyperplane that separates the two classes with the largest margin (distance to the nearest instances)
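The hyperplane search can be illustrated with a small linear SVM trained by sub-gradient descent on the regularized hinge loss. This sketch skips the mapping to a new feature space (no kernel is used), and the function names, learning rate, regularization strength, and epoch count are assumptions chosen for the toy example; production tools rely on dedicated SVM libraries.

```python
def train_linear_svm(points, labels, lam=0.01, learning_rate=0.1, epochs=500):
    """Fit w, b for the decision rule sign(w.x + b); labels must be +1 or -1."""
    dim = len(points[0])
    n = len(points)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        # Gradient of the L2 regularizer (lam/2)*|w|^2
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


def svm_predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

Only points with margin below 1 contribute to the update, which is why the final hyperplane is determined by the instances closest to the boundary.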
Evaluating Predictors - Precision : # of proteins correctly labeled as “cyt” divided by the total # of proteins labeled as “cyt”
How often the label is correct
If there are 90 proteins correctly labeled as “cyt”, and 10 proteins incorrectly labeled as “cyt”, then the precision is 90/(90 + 10) = 0.90.
Figure labels: True, Predicted
Evaluating Predictors - Sensitivity : # of proteins correctly labeled as cytoplasmic divided by the total # of proteins that are cytoplasmic
“How many of the true results were retrieved” (also called “recall”)
Figure labels: True, Predicted
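Precision and sensitivity (recall) for one location can both be computed from paired lists of true and predicted labels. The function name and the example data below are illustrative; the 90 correct and 10 incorrect "cyt" labels come from the precision example, while the 30 missed cytoplasmic proteins are a made-up number added so that recall differs from precision.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Return (precision, recall) for one class label."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)  # correctly labeled
    fp = sum(1 for t, p in pairs if t != target and p == target)  # wrongly labeled as target
    fn = sum(1 for t, p in pairs if t == target and p != target)  # target proteins missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# 90 correct "cyt", 10 non-cyt proteins labeled "cyt", 30 cyt proteins missed (hypothetical)
true_labels = ["cyt"] * 90 + ["nuc"] * 10 + ["cyt"] * 30
predicted_labels = ["cyt"] * 90 + ["cyt"] * 10 + ["nuc"] * 30
precision, recall = precision_recall(true_labels, predicted_labels, "cyt")
```

Here precision is 90/100 = 0.90 and recall is 90/120 = 0.75, which shows why both numbers should be reported for a predictor.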
Predictions from known data : Different information used for predictions:
Sequence motifs
N-terminal: secretory signal peptides, mitochondrial targeting peptide, chloroplast transit peptide
C-terminal: peroxisome import signal, ER retention signal
Mid-sequence: nuclear localization signals
Amino acid composition
AA frequency, dipeptide composition, hydrophobicity
Homology
- Sequence comparison to proteins of known localization
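Short C-terminal motifs are the easiest of these signals to check directly. The sketch below uses two commonly cited patterns as assumptions, since the slides name the signal types but not their sequences: a PTS1-like peroxisomal tripeptide such as -SKL, and the ER retention tetrapeptide -KDEL or -HDEL. The dictionary and function names are hypothetical.

```python
import re

# Assumed illustrative patterns; real predictors use richer, organism-specific models.
C_TERMINAL_MOTIFS = {
    "peroxisome (PTS1-like)": re.compile(r"[SAC][KRH][LM]$"),
    "ER retention (K/HDEL)": re.compile(r"[KH]DEL$"),
}


def c_terminal_signals(sequence):
    """Return the names of motifs found at the very end of the sequence."""
    seq = sequence.strip().upper()
    return [name for name, pattern in C_TERMINAL_MOTIFS.items() if pattern.search(seq)]
```

A match is only weak evidence on its own, since short patterns occur by chance; combining motif hits with composition and homology is what gives predictors their accuracy.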
TargetP, SignalP, *P (http://www.cbs.dtu.dk/services/) : Sequence-based methods
TargetP (85-90% recall)
Predicts mitochondria/chloroplast/secreted
Contains SignalP and ChloroP
LipoP
lipoproteins and signal peptides in Gram negative bacteria
SecretomeP
non-classical secretion in eukaryotes
SignalP result : Prediction: Signal peptide
Signal peptide probability: 0.945
Signal anchor probability: 0.000
Max cleavage site probability: 0.723 between pos. 28 and 29
Figure label: Cleavage site
Common structure of signal peptides:
positively charged n-region, followed by a hydrophobic h-region and a neutral but polar c-region.
Organellar Prediction : Predotar (http://www.inra.fr/predotar/) (80% recall)
Mitochondrial and plastid sequences; N-terminal sequences
MitoPred (http://mitopred.sdsc.edu/) (82% recall)
Mitochondrial; PFAM domains, AA composition
MitoProteome (http://www.mitoproteome.org/)
Database of experimentally identified human mitochondrial proteins
MitoP (http://ihg.gsf.de/mitop2/)
Combines data from multiple experimental and computational sources to give a consensus score for each “mitochondrial” protein in yeast and human
The PSORT Family : PSORT – plant sequences
Expert rule-based system
PSORT II – eukaryotic sequences
Probabilistic tree
iPSORT – eukaryotic N-term. signal sequences
ANN
PSORT-B – bacterial sequences
WoLF PSORT – eukaryotic
Updated (2005) version of PSORT II
PSORT-B : http://www.psort.org/psortb/
PSORT-B - methods : Signal peptides: Non-cytoplasmic
AA composition/patterns
SVMs trained for each location vs. all other locations
Transmembrane helices: Inner membrane
HMMTOP
PROSITE motifs: all localizations
Outer membrane motifs: Outer membrane
Homology to proteins of known localization
SCL-BLAST
Integration with a Bayesian network
PSORT-B results : SeqID: Unannotated_bacterial2
Analysis Report:
CMSVM- Unknown [No details]
CytoSVM- Cytoplasmic [No details]
ECSVM- Unknown [No details]
HMMTOP- Unknown [No internal helices found]
Motif- Unknown [No motifs found]
OMPMotif- Unknown [No motifs found]
OMSVM- Unknown [No details]
PPSVM- Unknown [No details]
Profile- Unknown [No matches to profiles found]
SCL-BLAST- Cytoplasmic [matched 118438: Cyto. protein]
SCL-BLASTe- Unknown [No matches against database]
Signal- Unknown [No signal peptide detected]
Localization Scores:
Cytoplasmic 9.97
CytoplasmicMembrane 0.01
Periplasmic 0.01
OuterMembrane 0.00
Extracellular 0.00
Final Prediction:
Cytoplasmic 9.97
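A report in this layout is easy to post-process. The sketch below parses the "Localization Scores" block from text formatted like the example above and applies a score cutoff before accepting a final prediction. The function names are hypothetical, the parser assumes exactly this plain-text layout, and the 7.5 cutoff is an assumption for the sketch rather than something stated on the slide.

```python
REPORT = """Localization Scores:
Cytoplasmic 9.97
CytoplasmicMembrane 0.01
Periplasmic 0.01
OuterMembrane 0.00
Extracellular 0.00
Final Prediction:
Cytoplasmic 9.97
"""


def parse_localization_scores(report):
    """Return {location: score} from the 'Localization Scores' block."""
    scores = {}
    in_scores = False
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("Localization Scores"):
            in_scores = True
            continue
        if line.startswith("Final Prediction"):
            break
        if in_scores and line:
            name, value = line.rsplit(None, 1)  # split off the trailing number
            scores[name] = float(value)
    return scores


def final_prediction(scores, cutoff=7.5):
    """Return the top-scoring location, or 'Unknown' if it is below the cutoff."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= cutoff else "Unknown"
```

Returning "Unknown" when no location is confident trades recall for precision, which matters when annotating whole genomes automatically.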
Proteome Analyst : http://www.cs.ualberta.ca/~bioinfo/PA/Sub/
Proteome Analyst - Method
Proteome Analyst - Feature Extraction
Proteome Analyst: Feature Extraction : Top 3 homologs
AFP1_ARATH
AFP1_BRANA
AFP2_ARATH
KW
Plant defense; Fungicide; Signal; Multigene Family; Pyrrolidone carboxylic acid
DR: InterPro
IPR002118; IPR003614
CC: Subcellular location
Secreted
Token Set:
{Plant defense; Fungicide; Signal; Multigene Family; Pyrrolidone carboxylic acid; IPR002118; IPR003614; Secreted}
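The token set is the union of the annotation fields collected from the top homologs. A minimal sketch of that step is shown below. The dictionary layout, field names, and function name are hypothetical, and the way the tokens are split across the three homologs is invented for illustration, since the slide only shows the combined result.

```python
def build_token_set(homolog_annotations,
                    fields=("keywords", "interpro", "subcellular_location")):
    """Collect unique tokens from each annotation field, preserving first-seen order."""
    tokens = []
    for field in fields:
        for entry in homolog_annotations:
            for token in entry.get(field, []):
                if token not in tokens:
                    tokens.append(token)
    return tokens


# Hypothetical per-homolog annotations; only the combined token set is from the slide.
homologs = [
    {"id": "AFP1_ARATH",
     "keywords": ["Plant defense", "Fungicide", "Signal"],
     "interpro": ["IPR002118"],
     "subcellular_location": ["Secreted"]},
    {"id": "AFP1_BRANA",
     "keywords": ["Plant defense", "Multigene Family", "Pyrrolidone carboxylic acid"],
     "interpro": ["IPR002118", "IPR003614"],
     "subcellular_location": ["Secreted"]},
    {"id": "AFP2_ARATH",
     "keywords": ["Fungicide", "Signal"],
     "interpro": ["IPR003614"],
     "subcellular_location": ["Secreted"]},
]
token_set = build_token_set(homologs)
```

Because the tokens come from homologs rather than from the query itself, the quality of the prediction depends on how well annotated the closest database matches are.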
PASub - Results : Figure: contribution of each token (features), shown on a log scale
PASub - Interpretation : Bars represent -log probability, so a small difference in bar length is a large difference in probability!
Naïve Bayes was chosen as the classifier because the method is transparent
Each token gives a probability whose log can be summed and shown graphically
A neural network actually has higher recall
Users can change the token set and ask for an explanation with different features
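The effect of the log scale can be checked with a small calculation. In the sketch below, each token contributes a bar of height -log10 P(token | class), and the bars for one class are summed. The probabilities are invented for illustration and the function names are hypothetical; they are not values from Proteome Analyst.

```python
import math


def token_bars(tokens, prob_given_class):
    """Return {token: -log10 P(token | class)} for one class."""
    return {t: -math.log10(prob_given_class[t]) for t in tokens}


def total_bar(bars):
    """Sum of the per-token bars; a smaller total means a more probable class."""
    return sum(bars.values())


tokens = ["Signal", "Fungicide"]
# Hypothetical conditional probabilities for two classes
secreted = token_bars(tokens, {"Signal": 0.5, "Fungicide": 0.1})
cytoplasm = token_bars(tokens, {"Signal": 0.005, "Fungicide": 0.05})

difference = total_bar(cytoplasm) - total_bar(secreted)
likelihood_ratio = 10 ** difference  # how many times more likely under "secreted"
```

With these made-up numbers the bars differ by only about 2.3 units, yet that corresponds to a 200-fold difference in likelihood, which is the point of the "a little difference is a lot" remark.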
Save Time: Pre-computed Genomes : PSORTDB
http://db.psort.org
Browse, search, BLAST, download
103 Gram –ve bacteria, 45 Gram +ve bacteria
Proteome Analyst (PA-GOSUB)
http://www.cs.ualberta.ca/~bioinfo/PA/GOSUB/
Browse, search, BLAST, download
15 bacterial and 8 eukaryotic