|
1
|
- Zhang-Zhi Hu, M.D.
- Bioinformatics Scientist,
- Protein Information Resource
- Georgetown University Medical Center, Washington, DC
|
|
2
|
- Study of Biological Systems Based on Global Knowledge of Genomes,
Transcriptomes, Proteomes, and Metabolomes
- Genome: All the Genetic Material in the Chromosomes
- Transcriptome: Entire Set of Gene Transcripts
- Proteome: Entire Set of Proteins
- Metabolome: Entire Set of Metabolites
|
|
3
|
- Goal: An Integrated Public Resource of Protein Informatics to Support
Genomic/Proteomic Research & Scientific Discovery
- Components
- Database: Data Organization & Information Retrieval
- Software: Data Analysis & Sequence Annotation
- Challenges
- Voluminous, Complex, Dynamic Data from Heterogeneous Sources
- Integrated, Classification Approach
- Databases: PIR-PSD, PIR-NREF, iProClass
- Integrated Analysis System: Knowledge Base System
- Database Interoperability: Ontology, XML, Relational Schema, iProClass Framework
|
|
5
|
- Superfamily, Domain, and Motif Classification
- Superfamily Concept
- End-to-End Similarity & Same Overall Domain Architecture
- Significance
- Improve Sensitivity of Protein Identification
- Provide Complete Clustering for Database Organization
- Detect and Correct Genome Annotation Errors Systematically
- Drive Other Annotations
- Stimulate Evolution, Genomics and Proteomics Research
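The superfamily concept above (end-to-end similarity plus the same overall domain architecture) can be sketched as a simple membership test. This is a minimal illustration, not the PIR classification procedure: the `Protein` record, the domain names, and the 80% coverage threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Protein:
    name: str
    length: int                 # sequence length in residues
    domains: Tuple[str, ...]    # ordered domain identifiers, N- to C-terminus


def is_end_to_end(a: Protein, b: Protein, aligned_a: int, aligned_b: int,
                  min_coverage: float = 0.8) -> bool:
    """True if the alignment covers most of BOTH sequences.

    aligned_a / aligned_b are the numbers of residues of each protein that
    fall inside the pairwise alignment. The 0.8 default is a hypothetical
    threshold chosen only for illustration.
    """
    return (aligned_a / a.length >= min_coverage
            and aligned_b / b.length >= min_coverage)


def same_superfamily_candidate(a: Protein, b: Protein,
                               aligned_a: int, aligned_b: int,
                               min_coverage: float = 0.8) -> bool:
    """Require identical domain architecture AND end-to-end similarity."""
    return (a.domains == b.domains
            and is_end_to_end(a, b, aligned_a, aligned_b, min_coverage))


# Hypothetical proteins: p1 and p2 share the full architecture; p3 carries
# only a subset of the domains, so a local match alone is not enough.
p1 = Protein("p1", 500, ("domA", "domB", "domB"))
p2 = Protein("p2", 480, ("domA", "domB", "domB"))
p3 = Protein("p3", 200, ("domB", "domB"))

print(same_superfamily_candidate(p1, p2, aligned_a=470, aligned_b=465))
print(same_superfamily_candidate(p1, p3, aligned_a=180, aligned_b=180))
```

Requiring both conditions is what separates a superfamily assignment from a domain-level hit: a protein that matches only one domain of a multi-domain family fails the test instead of inheriting the family's name.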
|
|
7
|
- Error Propagation: At least 17 Sequences Incorrectly Named as IMP
Dehydrogenase or Related (Propagated to KEGG & WIT)
|
|
8
|
- Non-Redundant REFerence Protein Sequence Database
- Comprehensiveness: PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept, PDB
- Timeliness: Biweekly Updates (~1,000,000 Sequences)
- Non-Redundancy: by Sequence Identity & Taxonomy (Species)
- Source Attribution: Protein IDs and Names from Underlying Databases,
Sequence, Taxonomy, Bibliography
- Related Sequences: Identical Sequences from Different Species, Complete
Substring, >=95% Sequence Identity
- Applications
- Protein Identification: Full-Scale or Species-Based Sequence Analysis
and Text Search
- Detection of Annotation Errors
- Development of Protein Name Ontology
- FTP Distribution: XML and FASTA Formats
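The non-redundancy rule above (collapse by sequence identity and species, while keeping source attribution) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the PIR-NREF build process: the tuple layout, the function names, and the source IDs are hypothetical, and only exact sequence identity is handled (the substring and >=95% identity relations are not).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (source_db, source_id, taxon_id, sequence); layout assumed for the example
Entry = Tuple[str, str, int, str]


def build_nonredundant(entries: Iterable[Entry]) -> Dict[Tuple[str, int], List[Tuple[str, str]]]:
    """Merge entries that share the same sequence AND the same taxon.

    Each merged record keeps the list of (source_db, source_id) pairs it
    came from, so attribution to the underlying databases is preserved.
    """
    merged = defaultdict(list)
    for source_db, source_id, taxon_id, sequence in entries:
        key = (sequence.upper(), taxon_id)
        merged[key].append((source_db, source_id))
    return dict(merged)


def identical_across_species(nref: Dict[Tuple[str, int], list]) -> Dict[str, Set[int]]:
    """Report sequences that occur unchanged in more than one taxon."""
    taxa_by_sequence = defaultdict(set)
    for sequence, taxon_id in nref:
        taxa_by_sequence[sequence].add(taxon_id)
    return {seq: taxa for seq, taxa in taxa_by_sequence.items() if len(taxa) > 1}


# Hypothetical entries; the source IDs are made up for illustration.
entries = [
    ("PIR-PSD", "X0001", 9606, "MKTAYIAK"),
    ("Swiss-Prot", "Y0001", 9606, "mktayiak"),   # same sequence, same species
    ("TrEMBL", "Z0001", 10090, "MKTAYIAK"),      # same sequence, other species
    ("RefSeq", "R0001", 9606, "MKTAYIAKQR"),     # different sequence
]

nref = build_nonredundant(entries)
print(len(nref))
print(identical_across_species(nref))
```

Keying on the (sequence, taxon) pair rather than on the sequence alone keeps identical sequences from different species as separate entries, which matches the "Related Sequences" bullet above.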
|
|
10
|
- Annotation Discrepancy of Multi-Domain Proteins
|
|
14
|
- Challenges
- Voluminous, Complex & Dynamic Data from Heterogeneous Sources in
Distributed Networking Environment
- Data Warehouse
- Local Copy of Databases in a Unified Database Schema
- Allows Local Control of Data; Drawback: Local Copies Must Be Kept Up to Date
- Hypertext Navigation
- Browsing Model with Hypertext Links
- Allows Direct Interaction; Drawback: Users Easily Get Lost in Cyberspace
- iProClass Approach
- Data Warehouse + Hypertext Navigation
- Rich Links (Links + Executive Summaries) between Database Objects
- An Integrated Platform for Describing Comprehensive Family
Relationships and Structural and Functional Features of Proteins
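One way to picture a "rich link" (a hyperlink stored together with an executive summary of the linked object) is as a small record attached to a protein report. The sketch below is an assumption-laden illustration, not the iProClass schema: the class names, field names, accessions, and `example.org` URLs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RichLink:
    database: str    # name of the linked database
    accession: str   # identifier of the linked object
    url: str         # hypertext link to the remote entry
    summary: str     # executive summary stored locally with the link


@dataclass
class ProteinReport:
    protein_id: str
    links: List[RichLink] = field(default_factory=list)

    def render(self) -> str:
        """Show each link with its summary, so a reader can judge the
        target without leaving the report."""
        lines = [f"Protein {self.protein_id}"]
        for link in self.links:
            lines.append(
                f"  {link.database}:{link.accession} - {link.summary} <{link.url}>"
            )
        return "\n".join(lines)


# Hypothetical report with two rich links.
report = ProteinReport(
    "PROT0001",
    [
        RichLink("StructureDB", "S123", "https://example.org/structure/S123",
                 "crystal structure, 2.1 A resolution"),
        RichLink("PathwayDB", "PW45", "https://example.org/pathway/PW45",
                 "member of a nucleotide biosynthesis pathway"),
    ],
)
print(report.render())
```

The summary is the warehouse half of the design (data held locally and searchable), and the URL is the hypertext half (a jump to the authoritative source).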
|
|
15
|
- An Integrated Platform for Describing Comprehensive Family Relationships
and Structural and Functional Features of Proteins
- Classification Scheme: Superfamily/Family & Domain/Motif
- Superfamily/Family (Global): Full-Length Similarity with Same Domain
Arrangement
- Domain/Motif (Local): Structural/Functional Units & Sites
- Sequence and Family Data
- Non-Redundant, Annotated PIR-PSD, Swiss-Prot, TrEMBL Sequences: ~827,000
- Superfamilies (~36,000), Families (>145,000), Domains (>3,700), Motifs
(>1,300), Post-Translational Modifications (>280)
- Superfamily and Protein Summary Reports
- Modular Framework: Extensibility, Flexibility, Customization
|
|
18
|
- From Curated Databases (e.g., PIR-NREF, SGD)
- From User Submission
- From Computer-Mapping (e.g., Gene Symbol)
|
|
19
|
- Gene Ontology (GO)
- Three Ontologies: Biological Process, Molecular Function, Cellular
Component
- Consortium: FlyBase, SGD, MGI, TAIR, WormBase, PomBase
|
|
29
|
- Homology Based
- Sequence & Structural Families
- Functionally Linked
- Genetic Association: Gene Clustering on Chromosomes, Multi-Domain
Proteins
- Function Association: Pathways, Biological Processes, Networks,
Protein-Protein Interactions, Protein Complexes
- Correlated Evolution: Related Phylogenetic Profile
- Correlated Expression: mRNA/Protein Expression
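The "Correlated Evolution" item can be illustrated with phylogenetic profiles: each protein is represented by a presence/absence vector across a fixed set of genomes, and proteins with similar vectors are candidates for a functional link. The sketch below uses the Jaccard index as the similarity measure; that choice, the function name, and the example profiles are assumptions for illustration, not a method stated in the slides.

```python
from typing import Sequence


def profile_similarity(p: Sequence[int], q: Sequence[int]) -> float:
    """Jaccard similarity of two presence/absence profiles.

    Each position is 1 if a homolog of the protein is present in that
    genome and 0 otherwise. Both profiles must cover the same genomes in
    the same order.
    """
    if len(p) != len(q):
        raise ValueError("profiles must cover the same set of genomes")
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


# Hypothetical profiles over five genomes.
protein_x = (1, 1, 0, 1, 0)
protein_y = (1, 1, 0, 1, 1)   # nearly the same distribution as protein_x
protein_z = (0, 0, 1, 0, 0)   # present only where protein_x is absent

print(profile_similarity(protein_x, protein_y))  # 0.75
print(profile_similarity(protein_x, protein_z))  # 0.0
```

A high score suggests that two proteins were gained and lost together, which is evidence of a shared function even when their sequences are unrelated.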
|
|
30
|
- Sponsors
- NIH: NLM (PIR)
- NSF: BDI (iProClass); ITR (Ontology)
- PIR Team
- Cathy Wu, Winona Barker, Robert Ledley, Hongzhan Huang, Lai-Su Yeh,
Bruce Orcutt, CR Vinayaka, Zhang-Zhi Hu, Baris Suzek, Yongxing Chen,
Jim Zhang, Peter Kourtesis, Jorge L. Cardenas, Leslie Arminski
|