PIR Integrated Resources and Data-Mining Tools for Functional Genomics and Proteomics
| Zhang-Zhi Hu, M.D. | |
| Bioinformatics Scientist, | |
| Protein Information Resource | |
| Georgetown University Medical Center, Washington, DC |
Functional Genomics and Proteomics
| Study of Biological Systems Based on Global Knowledge of Genomes, Transcriptomes, Proteomes, Metabolomes | ||
| Genome: All the Genetic Material in the Chromosomes | ||
| Transcriptome: Entire Set of Gene Transcripts | ||
| Proteome: Entire Set of Proteins | ||
| Metabolome: Entire Set of Metabolites | ||
Protein Information Resource (PIR)
| Goal: An Integrated Public Resource of Protein Informatics to Support Genomic/Proteomic Research & Scientific Discovery | ||
| Components | ||
| Database: Data Organization & Information Retrieval | ||
| Software: Data Analysis & Sequence Annotation | ||
| Challenges | ||
| Voluminous, Complex, Dynamic Data from Heterogeneous Sources | ||
| Integrated, Classification Approach | ||
| Databases: PIR-PSD, PIR-NREF, iProClass | ||
| Integrated Analysis System: Knowledge Base System | ||
| Database Interoperability: Ontology, XML, Relational Schema, iProClass Framework | ||
PIR Web Site (http://pir.georgetown.edu)
| Superfamily, Domain, and Motif Classification | ||
| Superfamily Concept | ||
| End-to-End Similarity & Same Overall Domain Architecture | ||
| Significance | ||
| Improve Sensitivity of Protein Identification | ||
| Provide Complete Clustering for Database Organization | ||
| Detect and Correct Genome Annotation Errors Systematically | ||
| Drive Other Annotations | ||
| Stimulate Evolution, Genomics and Proteomics Research | ||
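The superfamily concept above (end-to-end similarity plus the same overall domain architecture) can be sketched as a simple membership test. This is a minimal illustration, not PIR's actual classification procedure: the function name, the `min_coverage` threshold of 0.8, and the domain labels are all assumptions introduced here.

```python
def same_superfamily_candidate(len_a, len_b, aligned_a, aligned_b,
                               domains_a, domains_b, min_coverage=0.8):
    """Return True if two proteins meet both superfamily criteria.

    len_a, len_b: full sequence lengths.
    aligned_a, aligned_b: residues of each sequence covered by the alignment.
    domains_a, domains_b: ordered domain labels (the domain architecture).
    min_coverage: hypothetical cutoff for "end-to-end" similarity.
    """
    # End-to-end similarity: the alignment must span most of BOTH sequences.
    coverage = min(aligned_a / len_a, aligned_b / len_b)
    end_to_end = coverage >= min_coverage
    # Same overall domain architecture: same domains in the same order.
    same_architecture = list(domains_a) == list(domains_b)
    return end_to_end and same_architecture


# Placeholder domain labels; not taken from any real entry.
full_match = same_superfamily_candidate(300, 310, 290, 295,
                                        ["domA", "domB"], ["domA", "domB"])
partial_match = same_superfamily_candidate(300, 600, 290, 290,
                                           ["domA"], ["domA", "domB"])
```

Requiring coverage of both sequences is what separates a superfamily-level match from a local domain hit: in the second call the shorter protein is almost fully covered, but the longer one is not and carries an extra domain.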
Genome Era Challenges: Transitive Catastrophe
| Error Propagation: At least 17 Sequences Incorrectly Named as IMP Dehydrogenase or Related (Propagated to KEGG & WIT) |
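How a single misnamed sequence can spread through downstream annotations can be sketched with a toy name-transfer loop. This is an illustrative model only: the sequence identifiers and the best-hit mapping are hypothetical, and real annotation pipelines are more involved.

```python
def propagate_names(seed_names, best_hit):
    """Transfer names transitively along best-hit links.

    seed_names: dict of sequence id -> name already assigned (may be wrong).
    best_hit: dict of unnamed sequence id -> id of its most similar sequence.
    Returns a dict with every name that could be assigned by repeated transfer.
    """
    names = dict(seed_names)
    changed = True
    while changed:
        changed = False
        for query, hit in best_hit.items():
            # An unnamed sequence inherits whatever name its best hit carries.
            if query not in names and hit in names:
                names[query] = names[hit]
                changed = True
    return names


# Hypothetical case: seq1 was named incorrectly at the start.
seed = {"seq1": "IMP dehydrogenase"}
links = {"seq2": "seq1", "seq3": "seq2", "seq4": "seq3"}
result = propagate_names(seed, links)
```

Here `seq4` was never compared with `seq1` directly, yet it ends up with the same name, which is the transitive aspect of the problem.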
| PIR-NREF: Non-Redundant REFerence Protein Sequence Database | ||
| Comprehensiveness: PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept, PDB | ||
| Timeliness: Biweekly Updates (~1,000,000 Sequences) | ||
| Non-Redundancy: by Sequence Identity & Taxonomy (Species) | ||
| Source Attribution: Protein IDs and Names from Underlying Databases, Sequence, Taxonomy, Bibliography | ||
| Related Sequences: Identical Sequences from Different Species, Complete Substring, >=95% Sequence Identity | ||
| Applications | ||
| Protein Identification: Full-Scale or Species-Based Sequence Analysis and Text Search | ||
| Detection of Annotation Errors | ||
| Development of Protein Name Ontology | ||
| FTP Distribution: XML and FASTA Formats | ||
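The non-redundancy rule above (collapse by sequence identity and species while keeping source attribution) can be sketched as a grouping step. This is a minimal sketch under stated assumptions: the tuple layout, the function name, and the example accessions and sequences are hypothetical and do not reflect the actual PIR-NREF build process.

```python
from collections import defaultdict


def collapse_nonredundant(entries):
    """Merge entries that have the same sequence AND the same taxon.

    entries: iterable of (source_db, source_id, taxon_id, sequence) tuples.
    Returns a dict mapping (taxon_id, sequence) to the list of
    (source_db, source_id) pairs that were merged into that entry.
    """
    merged = defaultdict(list)
    for source_db, source_id, taxon_id, sequence in entries:
        # Normalize case so identical sequences compare equal.
        key = (taxon_id, sequence.upper())
        # Keep every source ID so attribution is not lost on merging.
        merged[key].append((source_db, source_id))
    return dict(merged)


# Made-up accessions and sequences, for illustration only.
entries = [
    ("PIR-PSD", "A0001", 9606, "MKTAYIAK"),
    ("Swiss-Prot", "P00001", 9606, "mktayiak"),
    ("TrEMBL", "Q00001", 10090, "MKTAYIAK"),
]
nref = collapse_nonredundant(entries)
```

The first two entries merge because both sequence and taxon agree. The third stays separate: it is an identical sequence from a different species, which the slide lists under related sequences rather than as a redundant copy.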
| Annotation Discrepancy of Multi-Domain Proteins |
PIR-NREF Database (http://pir.georgetown.edu/pirwww/search/pirnref.shtml)
| Challenges | ||
| Voluminous, Complex & Dynamic Data from Heterogeneous Sources in Distributed Networking Environment | ||
| Data Warehouse | ||
| Local Copy of Databases in a Unified Database Schema | ||
| Allows Local Control of Data; Update Problem | ||
| Hypertext Navigation | ||
| Browsing Model with Hypertext Links | ||
| Allows Direct Interaction; Easily Lost in Cyberspace | ||
| iProClass Approach | ||
| Data Warehouse + Hypertext Navigation | ||
| Rich Links (Links + Executive Summaries) between Database Objects | ||
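A rich link, as described above, pairs a hypertext link with an executive summary of the target object. One possible way to model it is shown below. The class name, field names, and example values (including the URL) are assumptions for illustration, not the iProClass schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RichLink:
    """A cross-reference carrying both a link and a short summary."""
    database: str   # name of the target database
    accession: str  # identifier of the target object
    url: str        # hypertext link to the target
    summary: str    # executive summary stored locally with the link

    def render(self):
        # The summary travels with the link, so a reader can judge the
        # target without leaving the current report.
        return f"{self.database}:{self.accession} - {self.summary} <{self.url}>"


# Hypothetical values; example.org is a placeholder host.
link = RichLink(
    database="ExampleDB",
    accession="X0001",
    url="http://example.org/entry/X0001",
    summary="example summary text",
)
text = link.render()
```

Storing the summary locally is the data-warehouse half of the approach; keeping the URL is the hypertext-navigation half.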
| An Integrated Platform for Describing Comprehensive Family Relationships and Structural and Functional Features of Proteins | ||
| Classification Scheme: Superfamily/Family & Domain/Motif | ||
| Superfamily/Family (Global): Full-Length Similarity with Same Domain Arrangement | ||
| Domain/Motif (Local): Structural/Functional Units & Sites | ||
| Sequence and Family Data | ||
| Non-Redundant, Annotated PIR-PSD, Swiss-Prot, TrEMBL Sequences: ~827,000 | ||
| Superfamilies (~36,000), Families (>145,000), Domains (>3700), Motifs (>1300), Post-Translational Modifications (>280) | ||
| Superfamily and Protein Summary Reports | ||
| Modular Framework: Extensibility, Flexibility, Customization | ||
iProClass - Sequence Report (I)
Bibliography Information Display
| From Curated Databases (e.g., PIR-NREF, SGD) | |
| From User Submission | |
| From Computer-Mapping (e.g., Gene Symbol) |
| Gene Ontology (GO) | ||
| Three Ontologies: Biological Process, Molecular Function, Cellular Component | ||
| Consortium: FlyBase, SGD, MGI, TAIR, WormBase, Pombase | ||
KEGG Metabolic & Regulatory Pathways
DIP Protein-Protein Interactions
iProClass - Sequence Report (II)
Protein Structural Classification
| CATH Classification |
PIR-RESID
Post-Translational Modification Database
iProClass - Superfamily Report
Integrated Protein Knowledge Base System
Protein Informatics for Expression Analysis
Knowledge Base for Functional Genomics & Proteomics
| Homology Based | ||
| Sequence & Structural Families | ||
| Functionally Linked | ||
| Genetic Association: Gene Clustering on Chromosomes, Multi-Domain Proteins | ||
| Function Association: Pathways, Biological Processes, Networks, Protein-Protein Interactions, Protein Complexes | ||
| Correlated Evolution: Related Phylogenetic Profile | ||
| Correlated Expression: mRNA/Protein Expression | ||
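The correlated-evolution idea above (related phylogenetic profiles) can be illustrated with a simple profile comparison. This is a minimal sketch: representing a profile as a presence/absence vector over genomes is a common convention, but the scoring function and the example vectors here are assumptions, not a PIR method.

```python
def profile_similarity(profile_a, profile_b):
    """Fraction of genomes in which two proteins agree on presence/absence.

    Each profile is a sequence of 0/1 flags, one per genome, in the same
    genome order for both proteins.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genomes")
    if not profile_a:
        raise ValueError("profiles must not be empty")
    matches = sum(1 for a, b in zip(profile_a, profile_b) if a == b)
    return matches / len(profile_a)


# Hypothetical profiles over five genomes.
protein_x = [1, 0, 1, 1, 0]
protein_y = [1, 0, 1, 0, 0]
score = profile_similarity(protein_x, protein_y)
```

Proteins whose profiles agree across many genomes are candidates for a functional link, to be weighed together with the other evidence types listed above.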
| Sponsors | ||
| NIH: NLM (PIR) | ||
| NSF: BDI (iProClass); ITR (Ontology) | ||
| PIR Team | ||
| Cathy Wu, Winona Barker, Robert Ledley, Hongzhan Huang, Lai-Su Yeh, Bruce Orcutt, CR Vinayaka, Zhang-Zhi Hu, Baris Suzek, Yongxing Chen, Jim Zhang, Peter Kourtesis, Jorge L. Cardenas, Leslie Arminski | ||