PIR Integrated Resources and Data-Mining Tools for Functional Genomics and Proteomics
Zhang-Zhi Hu, M.D.
Bioinformatics Scientist,
Protein Information Resource
Georgetown University Medical Center, Washington, DC

Functional Genomics and Proteomics
Study of Biological Systems Based on Global Knowledge of Genomes, Transcriptomes, Proteomes, Metabolomes
Genome: All the Genetic Material in the Chromosomes
Transcriptome: Entire Set of Gene Transcripts
Proteome: Entire Set of Proteins
Metabolome: Entire Set of Metabolites

Protein Information Resource (PIR)
Goal: An Integrated Public Resource of Protein Informatics to Support Genomic/Proteomic Research & Scientific Discovery
Components
Database: Data Organization & Information Retrieval
Software: Data Analysis & Sequence Annotation
Challenges
Voluminous, Complex, Dynamic Data from Heterogeneous Sources
Integrated, Classification Approach
Databases: PIR-PSD, PIR-NREF, iProClass
Integrated Analysis System: Knowledge Base System
Database Interoperability: Ontology, XML, Relational Schema, iProClass Framework

PIR Web Site (http://pir.georgetown.edu)

Protein Family Classification
Superfamily, Domain, and Motif Classification
Superfamily Concept
End-to-End Similarity & Same Overall Domain Architecture
Significance
Improve Sensitivity of Protein Identification
Provide Complete Clustering for Database Organization
Detect and Correct Genome Annotation Errors Systematically
Drive Other Annotations
Stimulate Evolution, Genomics and Proteomics Research
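
The superfamily criterion above (end-to-end similarity plus the same overall domain architecture) can be sketched as a simple pairwise test. This is only an illustration, not PIR's actual classification procedure: the `Protein` record, the `same_superfamily_candidate` function, the domain names, and the 0.8 coverage threshold are all assumptions made for this example, and the aligned-region length is taken as given rather than computed.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Protein:
    """Hypothetical record: sequence length plus ordered domain names."""
    name: str
    length: int
    domains: Tuple[str, ...]


def same_superfamily_candidate(a: Protein, b: Protein, aligned_length: int,
                               min_coverage: float = 0.8) -> bool:
    """Return True if the pair meets both illustrative criteria.

    aligned_length is the length of the aligned region between a and b,
    assumed to come from some pairwise alignment tool. min_coverage is an
    assumed threshold, not a PIR-defined value.
    """
    # End-to-end similarity: the alignment must cover most of BOTH sequences.
    end_to_end = (aligned_length / a.length >= min_coverage
                  and aligned_length / b.length >= min_coverage)
    # Same overall domain architecture: same domains in the same order.
    same_architecture = a.domains == b.domains
    return end_to_end and same_architecture


# Hypothetical proteins: p1 and p2 share a full architecture, p3 is a fragment.
p1 = Protein("protA", 500, ("domX", "domY", "domY"))
p2 = Protein("protB", 510, ("domX", "domY", "domY"))
p3 = Protein("protC", 150, ("domY", "domY"))

full_match = same_superfamily_candidate(p1, p2, aligned_length=480)
partial_match = same_superfamily_candidate(p1, p3, aligned_length=140)
```

Requiring coverage of both sequences is what separates this test from a local, domain-level hit: `p3` matches part of `p1` but is rejected on both criteria.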

Genome Sequence Annotation

Genome Era Challenges
Transitive Catastrophe
Error Propagation: At least 17 Sequences Incorrectly Named as IMP Dehydrogenase or Related (Propagated to KEGG & WIT)
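
A toy model may help show how transitive annotation transfer lets one wrong name spread. The sketch below is an assumption-laden illustration, not the analysis behind the figures on this slide: the `propagate_names` function, the sequence labels A to D, and the similarity links are hypothetical, and "similar" here stands for any hit strong enough that an annotator would copy the name.

```python
from collections import deque


def propagate_names(similar, seed_names):
    """Copy names along similarity links until no unnamed neighbor remains.

    similar: dict mapping a sequence label to the labels it is similar to.
    seed_names: dict of labels that start out with a name.
    """
    names = dict(seed_names)
    queue = deque(seed_names)
    while queue:
        current = queue.popleft()
        for neighbor in similar.get(current, ()):
            if neighbor not in names:
                # The name is inherited from a neighbor, not from evidence.
                names[neighbor] = names[current]
                queue.append(neighbor)
    return names


# Hypothetical chain: A is well characterized; B resembles A end to end;
# C shares only one domain with B; D resembles C but not A or B.
similar = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}
names = propagate_names(similar, {"A": "IMP dehydrogenase"})
```

In this toy chain, D ends up with the name of A even though the two are never compared directly, which is the pattern the slide calls a transitive catastrophe.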

PIR-NREF Database
Non-Redundant REFerence Protein Sequence Database
Comprehensiveness: PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept, PDB
Timeliness: Biweekly Updates (~1,000,000 Sequences)
Non-Redundancy: by Sequence Identity & Taxonomy (Species)
Source Attribution: Protein IDs and Names from Underlying Databases, Sequence, Taxonomy, Bibliography
Related Sequences: Identical Sequences from Different Species, Complete Substring, >=95% Sequence Identity
Applications
Protein Identification: Full-Scale or Species-Based Sequence Analysis and Text Search
Detection of Annotation Errors
Development of Protein Name Ontology
FTP Distribution: XML and FASTA Formats
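
The non-redundancy and related-sequence rules listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PIR-NREF build pipeline: the function names, the tuple layout, the identifiers `id1` to `id4`, and the short sequences are all hypothetical, and the >=95% identity relation is left out because it needs an alignment step.

```python
from collections import defaultdict


def collapse_entries(entries):
    """Merge entries with the same species and identical sequence.

    entries: iterable of (source_db, protein_id, species, sequence).
    Returns {(species, sequence): [(source_db, protein_id), ...]}, so each
    merged record keeps the IDs of its underlying databases.
    """
    merged = defaultdict(list)
    for source_db, protein_id, species, sequence in entries:
        merged[(species, sequence)].append((source_db, protein_id))
    return dict(merged)


def identical_across_species(nref):
    """Find sequences that occur, unchanged, in more than one species."""
    by_sequence = defaultdict(set)
    for species, sequence in nref:
        by_sequence[sequence].add(species)
    return {seq: sp for seq, sp in by_sequence.items() if len(sp) > 1}


def is_complete_substring(shorter, longer):
    """True if one sequence is wholly contained in a different, longer one."""
    return shorter != longer and shorter in longer


# Hypothetical input records from three source databases.
entries = [
    ("PIR-PSD", "id1", "Homo sapiens", "MKTAYIAK"),
    ("Swiss-Prot", "id2", "Homo sapiens", "MKTAYIAK"),
    ("TrEMBL", "id3", "Mus musculus", "MKTAYIAK"),
    ("GenPept", "id4", "Homo sapiens", "TAYIA"),
]
nref = collapse_entries(entries)
shared = identical_across_species(nref)
```

Keying on the (species, sequence) pair keeps the human and mouse copies as separate entries while still recording them as related, which matches the "by Sequence Identity & Taxonomy" rule.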

PIR-NREF Report (I)

PIR-NREF Report (II)
Annotation Discrepancy of Multi-Domain Proteins

PIR-NREF Database (http://pir.georgetown.edu/pirwww/search/pirnref.shtml)

PIR Searches (I)

PIR Searches (II)

Data Integration
Challenges
Voluminous, Complex & Dynamic Data from Heterogeneous Sources in Distributed Networking Environment
Data Warehouse
Local Copy of Databases in a Unified Database Schema
Allows Local Control of Data; Update Problem
Hypertext Navigation
Browsing Model with Hypertext Links
Allows Direct Interaction; Easily Lost in Cyberspace
iProClass Approach
Data Warehouse + Hypertext Navigation
Rich Links (Links + Executive Summaries) between Database Objects
An Integrated Platform for Describing Comprehensive Family Relationships and Structural and Functional Features of Proteins
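
One way to picture a rich link is as a record that carries both the hyperlink and a locally stored summary, so a reader sees the key facts before leaving the page. The sketch below is an assumed data structure for illustration only, not the iProClass schema: the `RichLink` class, its field names, the `describe` helper, and the example database, identifier, and URL are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RichLink:
    """A link to an external database object plus a short local summary."""
    target_db: str   # name of the linked database
    target_id: str   # identifier of the object in that database
    url: str         # hypertext navigation: where to go for full detail
    summary: str     # data warehouse: executive summary held locally


def describe(link: RichLink) -> str:
    """Render the link as one line of a report."""
    return f"{link.target_db}:{link.target_id} - {link.summary} ({link.url})"


# Hypothetical link; example.org is a placeholder, not a real resource.
link = RichLink(
    target_db="ExampleDB",
    target_id="X0001",
    url="https://example.org/entry/X0001",
    summary="hypothetical summary text",
)
line = describe(link)
```

Only the summary needs refreshing when the remote database changes, which eases the update problem of a full warehouse copy while giving more context than a bare link.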

iProClass Database
An Integrated Platform for Describing Comprehensive Family Relationships and Structural and Functional Features of Proteins
Classification Scheme: Superfamily/Family & Domain/Motif
Superfamily/Family (Global): Full-Length Similarity with Same Domain Arrangement
Domain/Motif (Local): Structural/Functional Units & Sites
Sequence and Family Data
Non-Redundant, Annotated PIR-PSD, Swiss-Prot, TrEMBL Sequences: ~827,000
Superfamilies (~36,000), Families (>145,000), Domains (>3,700), Motifs (>1,300), Post-Translational Modifications (>280)
Superfamily and Protein Summary Reports
Modular Framework: Extensibility, Flexibility, Customization

iProClass Overview

iProClass - Sequence Report (I)

Bibliography Information Display
From Curated Databases (e.g., PIR-NREF, SGD)
From User Submission
From Computer-Mapping (e.g., Gene Symbol)

Functional Classification
Gene Ontology (GO)
Three Ontologies: Biological Process, Molecular Function, Cellular Component
Consortium: FlyBase, SGD, MGI, TAIR, WormBase, PomBase

KEGG Metabolic & Regulatory Pathways

DIP Protein-Protein Interactions

iProClass - Sequence Report (II)

Protein Structural Classification
CATH Classification

PIR-RESID
Post-Translational Modification Database

iProClass - Superfamily Report

Integrated Protein Knowledge Base System

Integrated Knowledge Base

Protein Informatics for Expression Analysis

Knowledge Base for Functional Genomics & Proteomics
Homology Based
Sequence & Structural Families
Functionally Linked
Genetic Association: Gene Clustering on Chromosomes, Multi-Domain Proteins
Function Association: Pathways, Biological Processes, Networks, Protein-Protein Interactions, Protein Complexes
Correlated Evolution: Related Phylogenetic Profile
Correlated Expression: mRNA/Protein Expression
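
The "Related Phylogenetic Profile" idea can be made concrete with a small comparison of presence/absence vectors. This is a minimal sketch under stated assumptions, not a method described in these slides: the `profile_similarity` function, the simple matching score, and the five-genome profiles are hypothetical, and real analyses often use more robust measures.

```python
def profile_similarity(p, q):
    """Fraction of genomes in which two proteins agree on presence/absence.

    p and q are equal-length sequences of 0/1 flags, one per genome,
    where 1 means a homolog of the protein is present in that genome.
    """
    if len(p) != len(q):
        raise ValueError("profiles must cover the same set of genomes")
    if not p:
        raise ValueError("profiles must not be empty")
    matches = sum(1 for x, y in zip(p, q) if x == y)
    return matches / len(p)


# Hypothetical profiles across five genomes.
prot_a = (1, 0, 1, 1, 0)
prot_b = (1, 0, 1, 1, 0)   # same pattern as prot_a
prot_c = (0, 1, 0, 1, 0)   # mostly different pattern

score_ab = profile_similarity(prot_a, prot_b)
score_ac = profile_similarity(prot_a, prot_c)
```

Proteins that are gained and lost together across genomes, like `prot_a` and `prot_b` here, score high and become candidates for a functional link even without sequence similarity.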

Acknowledgments
Sponsors
NIH: NLM (PIR)
NSF: BDI (iProClass); ITR (Ontology)
PIR Team
Cathy Wu, Winona Barker, Robert Ledley, Hongzhan Huang, Lai-Su Yeh, Bruce Orcutt, CR Vinayaka, Zhang-Zhi Hu, Baris Suzek, Yongxing Chen, Jim Zhang, Peter Kourtesis, Jorge L. Cardenas, Leslie Arminski