1
Preserving Scientific Data: New Methods for Scientific Discovery

  • John Rumble
  • National Institute of Standards and Technology
3
Data
  • “When you can measure what you are speaking about, and express it in numbers, you know something about it;


  • “But when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science.”


  • Lord Kelvin
4
Types of Data
  • Numbers: 1, 2, 3…
  • Simple text: ABCs
  • Complex text: Greek, scripts, symbols
  • Equations: E=mc²
  • Graphs
  • Diagrams
  • Pictures
  • Software
  • Rules


5
Characteristics of Scientific Databases
  • Data
    • From many publications or observations
    • Full range of independent variables
    • Large number of measurements, similar and varied
    • Large number of substances or systems
    • Large amount of metadata
  • Text
    • One or a small number of studies
    • Limited range of variables
    • Small number of measurements
    • Small number of substances or systems
    • Small amount of metadata
6
Data Preservation and Scientific Discovery
  • Data communicate measurement and calculation results
  • Preserved data collections form the foundation of scientific discovery
  • Scientific discovery explains the observable world
7
Data Preservation and Scientific Discovery
  • Historical trends in data preservation and discovery
    • Accuracy
    • Comprehensiveness
    • Explanation of essence
    • Explanation of the complex
    • Automated discovery - The future


8
Accuracy
  • Newgrange – Ireland
  • About 5,200 years old (built around 3200 BC)
  • Aligned to the rising sun at the winter solstice
  • Depended on careful observational data on the rising sun


9
Accuracy Improving
  • Stonehenge
  • 5000 years old
  • Over 100 stones
  • Complicated stone alignments
  • Marks positions of the moon and major stars as well as the sun
  • Reproducibility of several observations
10
Comprehensive
  • Galen
  • Greek physician
  • Experimental physiologist
  • Arabic copy from 800 AD
  • Pictorial, descriptive, and functional accounts
  • Representative of botanical and animal catalogs
11
Comprehensive Improving
  • Pliny the Elder
  • Roman scholar
  • Natural History (77 AD)
  • One of the earliest known encyclopedias of the natural world
  • Systematization of data
12
Extraction of Essence
  • Tycho Brahe
  • Late 16th Century
  • Danish Astronomer
  • Made precise measurements that led to Kepler’s theories
  • Led to discovery of simple relationships
13
Explanation of the Complex
  • Charles Darwin
  • Combined his own observations with those of others in geology, zoology and botany
  • A wide variety of facts and phenomena recorded
  • Theory of Evolution had to explain all these observations and measurements
14
Prediction from Data
15
Prediction from Data
  • Notes on the Spectral Lines of Hydrogen: Johann Jacob Balmer Annalen der Physik und Chemie 25 80-5 (1885)
  • “I gradually arrived at a formula which, at least for these four lines, expresses a law by which their wavelengths can be represented with striking precision… From the formula, we obtained for a fifth hydrogen line 3969.65 × 10⁻⁷ mm.”
  • The development of quantum mechanics
  • Bohr and Schrödinger
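Balmer's prediction can be reproduced with a few lines of arithmetic. The sketch below assumes the form of the formula from his 1885 paper, wavelength = h · m² / (m² − 2²), and his fitted constant h = 3645.6 × 10⁻⁷ mm. Neither the formula nor the constant appears on the slide, and the function name is only illustrative.

```python
# Balmer's 1885 formula: wavelength = h * m^2 / (m^2 - n^2), with n = 2.
# H_CONSTANT is Balmer's fitted value in units of 10^-7 mm (assumed here from
# his paper; it is not given on the slide).
H_CONSTANT = 3645.6


def balmer_wavelength(m: int, n: int = 2) -> float:
    """Return the predicted wavelength, in units of 10^-7 mm, for integer m > n."""
    if m <= n:
        raise ValueError("m must be greater than n")
    return H_CONSTANT * m**2 / (m**2 - n**2)


# m = 3..6 are the four visible lines Balmer fitted; m = 7 is his predicted fifth line.
for m in range(3, 8):
    print(f"m = {m}: {balmer_wavelength(m):.2f} x 10^-7 mm")
```

For m = 7 the formula gives 49/45 × 3645.6, which is the fifth-line value quoted above: a prediction extracted from four preserved measurements.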
16
Data and Experimentation Today
  • Today we have exciting new capabilities to observe nature better than ever before
    • Atomic force microscopes
    • Hubble Space Telescope
    • Micro-electronics and lasers
    • High-performance computers to analyze data
    • Biomacromolecule sequencing instruments

  • Generates large amounts of quality data
17
Data and Computation Today
  • We now also have the ability to create a Virtual World
        • Models and simulations of complex systems
        • Techniques to do advanced mathematics
        • Computers to execute immense calculations
        • Visualization tools to examine our virtual world

  • Requires and generates large amounts of quality data


18
Data and The Information Revolution
  • Computer at every desk
  • The Internet/WWW explosion
  • Database tools on every computer
  • Electronic publications
  • Model and simulation-based R&D
  • Virtual libraries
  • Comprehensive databases


  • Data at the very heart of the revolution


19
21st Century Science

  • From the fundamental to the complex
    • From determining the laws of nature for a few particles to understanding real systems: cells, the atmosphere, the Earth, ecology
  • From reductionism to constructionism
    • Using our basic knowledge to make models and predict behavior of real systems
20
The Face of 21st Century Science
  • Complex
  • Multi-disciplinary
  • Real systems
  • Virtual as well as physical


  • Access to quality data becomes critical
  • Long term preservation of and access to data becomes more important than ever!
21
Major Point of Today’s Talk
  • Scientific databases in the future will be an even more important source for scientific discovery
  • Preservation of data is needed for
    • New insights
    • Scientific principles
    • New knowledge
    • Understanding complex systems
  • And the discovery will be computer-aided, if not done by computers alone
22
Data Preservation in the Future
  • Yesterday
    • Collections managed by a small number of people
    • Collections readable by one scientist
    • Collections interpretable by one person
    • Discoveries made by thinking, with analysis by one person
  • Future
    • Collections managed by groups
    • Collections not readable by any individual
    • Collections interpretable only with the aid of software
    • Discoveries made by computers, with verification by people


23
Discovering Science in Preserved Data
  • Real systems are very complex
  • Large number of objects
  • Large number of independent variables
  • Collective behavior difficult to find
  • Abstraction of important features
  • Existence of unifying theory or concept
  • Multiple views
24
Discovering Science in Preserved Data
  • Too much data for any one person to understand


  • How long does it take to look over a terabyte of data?
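A rough, back-of-envelope answer to this question can be sketched in code. Every rate below is an assumption made for illustration (a decimal terabyte, 8 bytes per stored number, one number inspected per second without pause); none of these figures come from the talk.

```python
# Back-of-envelope estimate; all rates are illustrative assumptions.
TERABYTE_BYTES = 10**12               # decimal terabyte
BYTES_PER_VALUE = 8                   # assumed: one double-precision number
VALUES_PER_SECOND = 1.0               # assumed: a person inspects one value per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_read(total_bytes: int,
                  bytes_per_value: int = BYTES_PER_VALUE,
                  values_per_second: float = VALUES_PER_SECOND) -> float:
    """Return the years needed to look at every value once, reading nonstop."""
    values = total_bytes / bytes_per_value
    return values / values_per_second / SECONDS_PER_YEAR


years = years_to_read(TERABYTE_BYTES)
print(f"About {years:,.0f} years of nonstop reading")
```

Under these assumptions the answer is on the order of thousands of years. Changing the assumed rates shifts the number, but not the conclusion that no individual can read such a collection.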
25
Discovering Science in Preserved Data
  • Real systems are very complex
  • Large number of objects  - mole, species, stars, geographic points


  • How much data is needed to come up with an idea?
  • Does quality count?
26
Discovering Science in Preserved Data
  • Real systems are very complex
  • Large number of independent variables


  • How do we use metadata to describe what we preserve?
  • How do they change over time and context?
  • If we must aggregate different data sets (e.g., over the Web) to do discovery, how do we know data are comparable?
27
Discovering Science in Preserved Data
  • Real systems are very complex
  • Collective behavior difficult to find


  • How do we recognize real phenomena from artifacts?
  • What kind of data visualization and exploitation (discovery) tools will exist in 20 years?
  • Weather prediction for the next day!
28
Discovering Science in Preserved Data
  • Real systems are very complex
  • Abstraction of important features


  • How can we find what is important when we have too much data?


  • Cholesterol linkage to heart disease was found by computer-aided correlation.
29
Discovering Science in Preserved Data
  • Real systems are very complex
  • Existence of unifying theory or concept


  • Could we derive quantum mechanics from a complete database of atomic and molecular spectra?


  • What features does quantum mechanics have beyond these data?
30
Discovering Science in Preserved Data
  • Real systems are very complex
  • Multiple views


  • Quantum theory, matrix mechanics, Maxwell’s theory; quantum electrodynamics


  • Are all views of nature equally discoverable?
31
Important New Data Collections
  • International Virtual Observatory: all observations for every point in the sky
  • Structural Genomics: for living things!
  • Proteomics: for all living things
  • Climate change: water, earth, atmosphere and all they contain
  • Historic geologic: lots of years, lots of rocks
  • Chemistry on demand: 60 elements, 5 at a time, different ratios, ???
  • Biodiversity: 5M species? or 10M? or 50M?
  • Brain scans: just think, forever
32
Challenges of the Data Era
  • The technology to handle the overwhelming volume of data from new measurement techniques
  • What to capture when sensors generate too much too fast?
  • How to store, represent, manipulate and display data that are too voluminous?
  • How to find out which data are important?
33
Challenges of the Data Era
  • Making accurate virtual measurements on virtual systems
  • What is uncertainty in a calculation?
  • How do you establish traceability for a calculation?
  • What computational results should be stored, and how can those data be handled?


34
Challenges of the Data Era
  • Evaluating data quality
  • How can large amounts of data be evaluated? In real time? As new data are published?
  • How can large data sets be integrated together correctly?
  • How do you determine the quality of a calculation?
  • What does quality mean in a terabyte of data?
35
Challenges of the Data Era
  • Making exploitation of large data sets possible
  • What standards are needed for making data sets work together?
  • How can you verify discovery from data sets?
  • How can you make control decisions when you have too much data?
36
Challenges of the Data Era
  • How do we maintain full and open access to the large number of databases required for making new scientific discoveries?
  • What policies are needed for full and open access?
  • How can discoverers profit from their automated discoveries?
  • How do you get the information industry to understand the new paradigm for discovery?
37
Some Final Thoughts
  • Scientific databases in the future will be an even more important source for scientific discovery
  • Preservation of data is needed for
    • New insights
    • Scientific principles
    • New knowledge
    • Understanding complex systems
  • Will computers discover and people just verify?
38
Some Final Thoughts
  • Let’s take advantage of CODATA’s expertise, neutrality and openness to support scientific and technological advances in the future