1
PIR Integrated Resources and Data-Mining Tools for Functional Genomics and Proteomics
  • Zhang-Zhi Hu, M.D.
  • Bioinformatics Scientist,
  • Protein Information Resource
  • Georgetown University Medical Center, Washington, DC
2
Functional Genomics and Proteomics
  • Study of Biological Systems Based on Global Knowledge of Genomes, Transcriptomes, Proteomes, Metabolomes
    • Genome: All the Genetic Material in the Chromosomes
    • Transcriptome: Entire Set of Gene Transcripts
    • Proteome: Entire Set of Proteins
    • Metabolome: Entire Set of Metabolites
3
Protein Information Resource (PIR)
  • Goal: An Integrated Public Resource of Protein Informatics to Support Genomic/Proteomic Research & Scientific Discovery
  • Components
    • Database: Data Organization & Information Retrieval
    • Software: Data Analysis & Sequence Annotation
  • Challenges
    • Voluminous, Complex, Dynamic Data from Heterogeneous Sources
  • Integrated, Classification Approach
    • Databases: PIR-PSD, PIR-NREF, iProClass
    • Integrated Analysis System: Knowledge Base System
    • Database Interoperability: Ontology, XML, Relational Schema, iProClass Framework
4
PIR Web Site (http://pir.georgetown.edu)
5
Protein Family Classification
  • Superfamily, Domain, and Motif Classification
  • Superfamily Concept
    • End-to-End Similarity & Same Overall Domain Architecture
  • Significance
    • Improve Sensitivity of Protein Identification
    • Provide Complete Clustering for Database Organization
    • Detect and Correct Genome Annotation Errors Systematically
    • Drive Other Annotations
    • Stimulate Evolution, Genomics and Proteomics Research
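The superfamily concept above (end-to-end similarity plus the same overall domain architecture) can be sketched in code. This is only an illustration under stated assumptions, not PIR's classification procedure: the `Protein` record, the domain labels, the precomputed alignment length, and the 0.8 coverage cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    """Hypothetical record: name, length in residues, ordered domain labels."""
    name: str
    length: int
    domains: tuple  # domain labels from N-terminus to C-terminus


def end_to_end(a, b, aligned_len, min_coverage=0.8):
    """True if the alignment covers most of the longer sequence.

    The 0.8 cutoff is an assumed value chosen for illustration.
    """
    return aligned_len / max(a.length, b.length) >= min_coverage


def superfamily_candidates(a, b, aligned_len, min_coverage=0.8):
    """Both criteria must hold: same domain order and end-to-end similarity."""
    return a.domains == b.domains and end_to_end(a, b, aligned_len, min_coverage)


# Hypothetical proteins and alignment lengths.
p1 = Protein("prot1", 500, ("domA", "domB", "domB"))
p2 = Protein("prot2", 480, ("domA", "domB", "domB"))
p3 = Protein("prot3", 150, ("domB", "domB"))

same_12 = superfamily_candidates(p1, p2, aligned_len=470)
same_13 = superfamily_candidates(p1, p3, aligned_len=140)
```

Here `prot1` and `prot2` qualify, while `prot3` shares only a local region with `prot1` and is rejected. A purely local match of that kind is the situation in which a name can be transferred wrongly to a multi-domain protein.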
6
Genome Sequence Annotation
7
Genome Era Challenges: Transitive Catastrophe
  • Error Propagation: At Least 17 Sequences Incorrectly Named as IMP Dehydrogenase or Related (Propagated to KEGG & WIT)
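The error propagation described above can be illustrated with a toy model of name transfer. This is a sketch under assumptions, not a reconstruction of the actual cases: the sequence identifiers and the chain of transfers are hypothetical, and each new sequence simply copies the name of the sequence it was matched to.

```python
def propagate(seed_names, transfers):
    """Copy names along a chain of (target, source) annotation transfers.

    seed_names: dict of sequence id -> name assigned at the start
    transfers:  ordered list of (target_id, source_id) pairs
    """
    names = dict(seed_names)
    for target, source in transfers:
        names[target] = names[source]  # the name is copied without re-checking
    return names


# Hypothetical example: s0 carries an incorrect name.
seed = {"s0": "IMP dehydrogenase"}
chain = [("s1", "s0"), ("s2", "s1"), ("s3", "s2")]

names = propagate(seed, chain)
affected = sum(1 for n in names.values() if n == "IMP dehydrogenase")
```

One wrong name at the start of the chain ends up on all four sequences, because each transfer trusts the previous annotation rather than the original evidence.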
8
PIR-NREF Database
  • Non-Redundant REFerence Protein Sequence Database
    • Comprehensiveness: PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept, PDB
    • Timeliness: Biweekly Updates (~1,000,000 Sequences)
    • Non-Redundancy: by Sequence Identity & Taxonomy (Species)
    • Source Attribution: Protein IDs and Names from Underlying Databases, Sequence, Taxonomy, Bibliography
    • Related Sequences: Identical Sequences from Different Species, Complete Substring, >=95% Sequence Identity
  • Applications
    • Protein Identification: Full-Scale or Species-Based Sequence Analysis and Text Search
    • Detection of Annotation Errors
    • Development of Protein Name Ontology
  • FTP Distribution: XML and FASTA Formats
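The non-redundancy and source-attribution rules listed above can be sketched as follows. This is a minimal illustration, not the PIR-NREF build pipeline: the entry fields, protein IDs, names, and sequences are hypothetical. Entries are merged only when both the sequence and the taxonomy ID match, and every source ID and name is retained.

```python
from collections import defaultdict


def build_nref(entries):
    """Collapse entries with identical sequence AND species into one record,
    keeping the protein IDs and names from every underlying database."""
    merged = defaultdict(list)
    for e in entries:
        key = (e["sequence"].upper(), e["taxon_id"])
        merged[key].append((e["source_db"], e["protein_id"], e["name"]))
    return [
        {"sequence": seq, "taxon_id": tax, "sources": sources}
        for (seq, tax), sources in merged.items()
    ]


def identical_across_species(nref):
    """Related sequences: identical sequences found in more than one species."""
    taxa = defaultdict(set)
    for rec in nref:
        taxa[rec["sequence"]].add(rec["taxon_id"])
    return {seq: sorted(t) for seq, t in taxa.items() if len(t) > 1}


# Hypothetical input records (IDs, names, and sequences are made up).
entries = [
    {"source_db": "PIR-PSD", "protein_id": "P1", "name": "hypothetical protein A",
     "sequence": "MKTAYIAK", "taxon_id": 9606},
    {"source_db": "Swiss-Prot", "protein_id": "S1", "name": "protein A",
     "sequence": "MKTAYIAK", "taxon_id": 9606},
    {"source_db": "TrEMBL", "protein_id": "T1", "name": "protein A homolog",
     "sequence": "MKTAYIAK", "taxon_id": 10090},
    {"source_db": "RefSeq", "protein_id": "R1", "name": "protein B",
     "sequence": "MSLLNQ", "taxon_id": 9606},
]

nref = build_nref(entries)
related = identical_across_species(nref)
```

The two same-species records collapse into one entry that lists both source names side by side, which is what makes naming discrepancies between databases visible.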


9
PIR-NREF Report (I)
10
PIR-NREF Report (II)
  • Annotation Discrepancy of Multi-Domain Proteins
11
PIR-NREF Database (http://pir.georgetown.edu/pirwww/search/pirnref.shtml)
12
PIR Searches (I)
13
PIR Searches (II)
14
Data Integration
  • Challenges
    • Voluminous, Complex & Dynamic Data from Heterogeneous Sources in Distributed Networking Environment
  • Data Warehouse
    • Local Copy of Databases in a Unified Database Schema
    • Allows Local Control of Data; Update Problem
  • Hypertext Navigation
    • Browsing Model with Hypertext Links
    • Allows Direct Interaction; Easily Lost in Cyberspace
  • iProClass Approach
    • Data Warehouse + Hypertext Navigation
    • Rich Links (Links + Executive Summaries) between Database Objects
    • An Integrated Platform for Describing Comprehensive Family Relationships and Structural and Functional Features of Proteins
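The "rich link" idea above (a link plus an executive summary held locally) can be sketched as a small data structure. This is an assumed layout for illustration, not the iProClass schema: the class names, fields, accessions, and summary text are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RichLink:
    """A cross-reference plus a short summary stored alongside it."""
    database: str
    accession: str
    summary: str  # executive summary kept locally, so no remote lookup is needed


@dataclass
class ProteinEntry:
    entry_id: str
    name: str
    links: list = field(default_factory=list)

    def report(self):
        """Render the entry with one line per rich link."""
        lines = [f"{self.entry_id}  {self.name}"]
        for link in self.links:
            lines.append(f"  {link.database}:{link.accession}  {link.summary}")
        return "\n".join(lines)


# Hypothetical entry with made-up accessions and summaries.
entry = ProteinEntry("X0001", "example kinase")
entry.links.append(RichLink("PathwayDB", "pw001", "member of an example pathway"))
entry.links.append(RichLink("StructureDB", "st001", "example fold class"))

text = entry.report()
```

Keeping the summary next to the link combines the two approaches on this slide: the reader sees the key facts from the local copy and follows the hyperlink only when the full remote record is needed.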
15
iProClass Database
  • An Integrated Platform for Describing Comprehensive Family Relationships and Structural and Functional Features of Proteins
  • Classification Scheme: Superfamily/Family & Domain/Motif
    • Superfamily/Family (Global): Full-Length Similarity with Same Domain Arrangement
    • Domain/Motif (Local): Structural/Functional Units & Sites
  • Sequence and Family Data
    • Non-Redundant, Annotated PIR-PSD, Swiss-Prot, TrEMBL Sequences: ~827,000
    • Superfamilies (~36,000), Families (>145,000), Domains (>3,700), Motifs (>1,300), Post-Translational Modifications (>280)
    • Superfamily and Protein Summary Reports
  • Modular Framework: Extensibility, Flexibility, Customization
16
iProClass Overview
17
iProClass - Sequence Report (I)
18
Bibliography Information Display
  • From Curated Databases (e.g., PIR-NREF, SGD)
  • From User Submission
  • From Computer-Mapping (e.g., Gene Symbol)
19
Functional Classification
  • Gene Ontology (GO)
    • Three Ontologies: Biological Process, Molecular Function, Cellular Component
    • Consortium: FlyBase, SGD, MGI, TAIR, WormBase, Pombase
20
KEGG Metabolic & Regulatory Pathways
  • pathway
21
DIP Protein-Protein Interactions
22
iProClass - Sequence Report (II)
23
Protein Structural Classification
  • CATH Classification
24
PIR-RESID Post-Translational Modification Database
25
iProClass - Superfamily Report
26
Integrated Protein Knowledge Base System
27
Integrated Knowledge Base
28
Protein Informatics for Expression Analysis
29
Knowledge Base for Functional Genomics & Proteomics
  • Homology Based
    • Sequence & Structural Families
  • Functionally Linked
    • Genetic Association: Gene Clustering on Chromosomes, Multi-Domain Proteins
    • Function Association: Pathways, Biological Processes, Networks, Protein-Protein Interactions, Protein Complexes
    • Correlated Evolution: Related Phylogenetic Profile
    • Correlated Expression: mRNA/Protein Expression
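The "related phylogenetic profile" criterion above can be sketched as a comparison of presence/absence vectors across genomes. This is a minimal illustration under assumptions: the genome and protein names are hypothetical, and the fraction of matching positions is just one simple similarity measure chosen here.

```python
def profile(present_in, genomes):
    """Presence/absence vector of a protein across an ordered list of genomes."""
    return tuple(1 if g in present_in else 0 for g in genomes)


def profile_similarity(p, q):
    """Fraction of genomes in which two profiles agree (both present or both absent)."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(x == y for x, y in zip(p, q)) / len(p)


# Hypothetical genomes and occurrence sets.
genomes = ["g1", "g2", "g3", "g4"]
prof_a = profile({"g1", "g3", "g4"}, genomes)
prof_b = profile({"g1", "g3", "g4"}, genomes)
prof_c = profile({"g2"}, genomes)

sim_ab = profile_similarity(prof_a, prof_b)
sim_ac = profile_similarity(prof_a, prof_c)
```

Proteins A and B occur in exactly the same genomes and would be flagged as candidates for a functional link, while A and C never co-occur.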
30
Acknowledgments
  • Sponsors
    • NIH: NLM (PIR)
    • NSF: BDI (iProClass); ITR (Ontology)


  • PIR Team
    • Cathy Wu, Winona Barker, Robert Ledley, Hongzhan Huang, Lai-Su Yeh, Bruce Orcutt, CR Vinayaka, Zhang-Zhi Hu, Baris Suzek, Yongxing Chen, Jim Zhang, Peter Kourtesis, Jorge L. Cardenas, Leslie Arminski