Presentation Transcript
Added: October 30, 2007

Guidelines for Publication of Peptide and Protein Identification Data
Journal of Molecular and Cellular Proteomics Working Group on Publication Guidelines
- Steven Carr, Broad Institute of MIT and Harvard (Chair)
- Ruedi Aebersold, ETH and Institute for Systems Biology
- Michael Baldwin, University of California, San Francisco
- Al Burlingame, University of California, San Francisco
- Karl Clauser, Broad Institute of MIT and Harvard
- Alexey Nesvizhskii, Institute for Systems Biology

Why are guidelines needed?
- Dramatic increase in the number of large data set papers being published
- Inability to determine if results of peptide and protein identification
  are valid
- Lack of understanding and misuse of algorithms contribute to large false positive error rates
- Manual validation is not a useful or practical gold standard
- Published studies often do not contain enough information for the reader to assess how the data were processed and what the criteria for identification were
- Likely that we are publishing many incorrect interpretations

Publication Guidelines for Peptide and Protein Identification Data
Goals:
- Try to ensure that high quality, significant data are entering the proteomics literature
- Develop minimal guidelines for publication of peptide and protein identification data in MCP
- Initial focus on how identifications were made and validated
- Guidelines should not be burdensome, nor should they dictate what tools to use
- Initiate process requiring submission of data as a condition for acceptance of manuscript, and the logistics involved
- Proteomics community is data starved
- Value of data increases if it can be collectively analyzed/mined

Why are guidelines needed?
- Finding a peptide match in a DB is easy, but knowing whether it is correct is not

MS/MS Database Search
[Figure: an acquired MS/MS spectrum is correlated against the theoretical spectra of peptides from a sequence database to give a similarity score; both spectra are plotted against m/z, 200 to 1200]
- Algorithms: SEQUEST, Mascot, Sonar, SpectrumMill, …
- Database peptides shown: ISLLDAQSAPLR, VVEELCPTPEGK, DLLLQWCWENGK, ECDVVSNTIIAEK, GDAVFVIDALNR, VPTPNVSVVDLTNR, SYLFCMENSAEK, PEQSDLRSWTAK
- The best matching peptide in the database may be correct or incorrect
Slide courtesy of Alexey Nesvizhskii, ISB

Slide6: Threshold Model
[Figure: matches sorted by search score; those above the threshold are labelled “correct”, those below it incorrect]
- SEQUEST: Xcorr > 2.0, ΔCn > 0.1
- MASCOT: Score > 30
Slide courtesy of Alexey Nesvizhskii, ISB

Why are guidelines needed?
- Finding a peptide match in a DB is easy, but knowing whether it is correct is not
- It is almost always possible to match an MS/MS spectrum to a peptide in the database
- Incorrect matches often (but not always) result from use of low quality peptide MS/MS data to search the database
- Even high quality data can produce invalid identifications
  - e.g., when the actual peptide sequence is not in the database searched (under the search conditions used)

Why are guidelines needed?
- Unknown and variable false positive error rates are associated with each algorithm
- Commercial algorithms use thresholds and scoring methods to move the most probable hit to the top of the list
  - Recommended settings are empirically derived and are not universally applicable
- Use of conservative scoring and filtering thresholds reduces the number of misassigned peptides and proteins, but does not eliminate false positives
- Probability of a false positive assignment is much higher for “one-hit-wonders”
- Statistical methods to validate peptide assignments to MS/MS spectra have shown promising results, but are not yet widely available or accepted

Publication Guidelines for Peptide and Protein Identification Data in MCP
Working group assembled January 2004
- Ruedi Aebersold, ETH Zurich and Institute for Systems Biology
- Michael Baldwin, University of California, San Francisco
- Al Burlingame, University of California, San Francisco
- Steven Carr, Broad Institute of MIT and Harvard (Chair)
- Karl Clauser, Broad Institute of MIT and Harvard
- Alexey Nesvizhskii, Institute for Systems Biology
Additional contributions from: Robert Chalkley, Kirk Hansen, Kati Medzihradszky, UCSF; Andrew Keller, ISB; and Ron Beavis, Beavis Informatics, Ltd.
Guidelines published Mol. Cell. Proteomics June 2004; 3: 531.
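
The threshold model and the point about unknown false positive rates can be made concrete with a small sketch. It assumes a reversed ("decoy") database was searched alongside the real one, which is one common statistical approach and not something the slides prescribe. The `Match` record, the function names, and all scores below are hypothetical and made up for illustration; the decoy peptides are simply reversed copies of sequences shown on the database search slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One top-ranked peptide-spectrum match (hypothetical record layout)."""
    spectrum_id: str
    peptide: str
    score: float     # search-engine score, higher is better (made-up values)
    is_decoy: bool   # True if the hit came from the reversed database


def passing(matches, threshold):
    """Threshold model: keep only matches scoring above the threshold."""
    return [m for m in matches if m.score > threshold]


def estimated_false_positive_rate(matches, threshold):
    """Estimate the error rate among accepted hits as decoys / targets.

    This ratio is one common convention, assumed here; it returns None
    when no target hit passes the threshold.
    """
    kept = passing(matches, threshold)
    decoys = sum(1 for m in kept if m.is_decoy)
    targets = len(kept) - decoys
    if targets == 0:
        return None
    return decoys / targets


# Illustrative data only: scores are invented, not taken from any search.
matches = [
    Match("s1", "ISLLDAQSAPLR", 62.0, False),
    Match("s2", "VVEELCPTPEGK", 48.5, False),
    Match("s3", "RLPASQADLLSI", 33.0, True),   # reversed ISLLDAQSAPLR
    Match("s4", "GDAVFVIDALNR", 31.2, False),
    Match("s5", "KEASNEMCFLYS", 18.4, True),   # reversed SYLFCMENSAEK
    Match("s6", "SYLFCMENSAEK", 12.0, False),
]

for threshold in (30.0, 40.0):
    print(threshold, estimated_false_positive_rate(matches, threshold))
```

Raising the threshold removes the decoy hit in this toy data, but at the cost of also discarding a target match. That trade-off is why a fixed recommended threshold cannot be assumed to give the same error rate on every data set.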

Guideline 1
Describe search engine used and how peptide and protein assignments were made using that software
All papers must provide:
- The method and/or program used to create the “peak list” from raw data
  - Note factors that affect the quality of the subsequent database search (e.g., smoothing, de-isotoping)
- Name and version of DB search program used and parameters used for its operation
  - Include precursor-ion mass accuracy; fragment-ion mass accuracy; modifications allowed for; enzyme specified or not; any missed cleavages; etc.

Guideline 1, cont.
- Name and version of sequence database used
  - Include number of protein entries at time of search
- Scores used to interpret MS/MS data
  - Thresholds and values specific to judging certainty of identification, and a description of how they were applied
- Describe any statistical analysis that was applied to validate the results and how it was applied
  - e.g., reverse database search; PeptideProphet

Guideline 2
Provide sequence coverage observed for each protein identified
- The total number of peptides belonging to each protein must be explicitly stated (not the number of MS/MS spectra)
- Different forms of the same peptide are to be counted as only a single peptide
  - Differing charge states of the same peptide or common sample handling artifacts (e.g., oxidation) all count as 1
- Authors are encouraged to provide tables that list the sequences of all identified peptides for each protein

Guidelines 3 and 4
Increase the stringency of information required to use single peptide identifications for protein assignment
Protein assignments based on single peptide assignments must include:
- the sequence of the peptide used to make each such assignment, together with the amino acids N- and C-terminal to that peptide’s sequence
- the precursor mass and charge (not just m/z) observed
- the scores for this peptide

Guidelines 3 and 4, cont.
- Biological conclusions based on a single peptide identification of a protein, or of a posttranslationally modified form of that protein, must be supported by inclusion of the MS/MS spectrum
- Single peptides from ICAT and similar experiments are covered by this guideline as well
  - For large ICAT datasets we have not yet required that spectra for all single-peptide identifications be provided

Slide15:
- Disallow 1 Hit Wonders that are partial/non-tryptic

Guideline 6
How to count the number of unique proteins identified based on the peptides found

Slide17: Protein Inference Problem
[Diagram: a single peptide linked to both Prot A and Prot B] Protein A or protein B? Or both?
- Degenerate peptides: correspond to more than a single entry in the protein database
- Degenerate peptides are more prevalent with databases of higher eukaryotes due to the presence of:
  - related protein family members
  - alternative splice forms
  - partial sequences
- In shotgun proteomics the connectivity between peptides and proteins is lost
Slide courtesy of Alexey Nesvizhskii, ISB

Guideline 6
How to count the number of unique proteins identified based on the peptides found
- Issue: same (or very similar) protein having different names and accession numbers in the database
- Authors must demonstrate that they are aware of the problem and have taken reasonable measures to eliminate redundancy
- When a single protein member of a multi-protein family has been singled out, explain how the other members of the group were ruled out, if at all
- If a protein from a different species than that studied is identified, then this must be mentioned and justified

Guideline 8: Quantitation (under development)
How to report on methods used for deriving quantitative results from proteomic data sets
- Use of MS data directly (e.g., peak intensities, number of repeat MS/MS spectra, etc.)
- Isotopic labeling methods
- How normalization was accomplished

The community is data starved
- Inhibits refinement and comparison of new algorithms
- Integration and collective analysis likely to yield new knowledge
- MCP strongly encourages submission of all MS/MS spectra mentioned in the paper as supplemental material. We will accept dta, raw, pkl, mgf files
- MCP is moving toward accepting and serving raw or minimally processed intact LC-MS/MS data sets
- Authors are encouraged to provide access to raw MS data using group websites etc.
- Storage on journal websites is not a viable, long-term solution; public repositories are essential.

Slide21: Capacity Constraints on Repositories
- Conversion formats (e.g., mzXML and mzData) blow file sizes up 5x to 10x

Slide22: Recommendations to MMC for how to handle data now
- Follow the MCP guidelines
- Use common search algorithm(s) and database to search
- Employ method(s) for evaluation of the FP rate
- Plan to integrate data for searching to find weak associations not evident in single datasets
- Employ common/consistent annotation of results
- Store data in original instrument vendor format in as minimally processed a form as possible (exchange formats in flux)
- Files contain all the interesting info in unprocessed form
  - parent peak intensities for quantitation
  - resolution, peak spacing (charge states)
  - acquisition parameters

Slide23: Recommendations (to be added)
List of specific recommendations for MMC that follow on from the guidelines
- Data analysis (e.g., all data through multiple engines)
- Statistical estimate of FP rate (e.g., reverse DB search)
- Acceptable approaches to relative quantitation for LC-MS/MS experiments
- Results presentation
- Making raw/minimally processed data available

Slide24: Current/Future Utility Constraints on Readers/Reviewers
- If repository stores XML format, then user needs compatible tools
- ISB provides converters from most instruments to mzXML and an open source non-graphical mzXML reader
- mzData: a similar XML
  format from HUPO, but no converters or readers available yet
- Will search engines support XML files?
- Will instrument vendors' formats continue to be compatible with XML converters?
- Meetings like this need to have representatives from MS manufacturers present who are in a decision-making capacity
- Will the open source community provide viable graphical utilities for XML formats?
- Will they work on decreasing dataset size?
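
The counting rules in Guideline 2, the single-peptide assignments targeted by Guidelines 3 and 4, and the degenerate peptides behind Guideline 6 can be sketched in a few lines. This is a minimal illustration and not a method from the guidelines: the `(protein, peptide, charge)` tuple layout, the `M(ox)` modification notation, the function names, and the protein labels are all assumptions made for the example. It also strips every parenthesised modification, whereas a real analysis would collapse only sample handling artifacts and keep biologically meaningful modifications distinct.

```python
import re
from collections import defaultdict


def bare_sequence(peptide):
    """Drop parenthesised modification tags, e.g. 'M(ox)' -> 'M' (assumed notation)."""
    return re.sub(r"\([^)]*\)", "", peptide)


def distinct_peptides_per_protein(identifications):
    """Map each protein to its set of distinct peptide sequences.

    Charge state is ignored and modified forms collapse onto the bare
    sequence, so each peptide counts once per protein.
    """
    peptides = defaultdict(set)
    for protein, peptide, _charge in identifications:
        peptides[protein].add(bare_sequence(peptide))
    return dict(peptides)


def single_peptide_proteins(peptides_by_protein):
    """Proteins supported by exactly one distinct peptide ("one-hit-wonders")."""
    return sorted(p for p, seqs in peptides_by_protein.items() if len(seqs) == 1)


def degenerate_peptides(peptides_by_protein):
    """Peptides that correspond to more than one protein entry."""
    owners = defaultdict(set)
    for protein, seqs in peptides_by_protein.items():
        for seq in seqs:
            owners[seq].add(protein)
    return {seq: sorted(prots) for seq, prots in owners.items() if len(prots) > 1}


# Illustrative identifications: (protein label, peptide, precursor charge).
identifications = [
    ("ProtA", "SYLFCMENSAEK", 2),
    ("ProtA", "SYLFCM(ox)ENSAEK", 2),   # oxidised form of the same peptide
    ("ProtA", "SYLFCMENSAEK", 3),       # same peptide, different charge state
    ("ProtA", "GDAVFVIDALNR", 2),
    ("ProtB", "GDAVFVIDALNR", 2),       # shared with ProtA: degenerate
    ("ProtC", "VPTPNVSVVDLTNR", 2),
]

by_protein = distinct_peptides_per_protein(identifications)
for protein in sorted(by_protein):
    print(protein, len(by_protein[protein]))
print("single-peptide proteins:", single_peptide_proteins(by_protein))
print("degenerate peptides:", degenerate_peptides(by_protein))
```

Reporting the distinct-peptide count rather than the spectrum count keeps ProtA at two peptides even though four spectra matched it, and the degenerate-peptide map shows which protein assignments need the redundancy explanation that Guideline 6 asks for.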