Data Management Practices and Challenges in Geosciences Today
Chaitan Baru
San Diego Supercomputer Center (SDSC)
California Institute for Telecommunications and Information Technology (Calit2)
Data Management
Data collection
Data publication
Data access
Geosciences Data Management
Atmospheric Sciences
Meteorological data provides community focus – real time as well as archived data
Common field of interest (e.g. weather); continental scale
Ocean Sciences
Ship cruises and real time data from moorings. Increasingly, integration with more diverse data (including biological)
Field of interest is in regions (e.g. extent of cruises)
Earth Sciences
Broad range of data types: sensor data (e.g. seismic, GPS), field data collections (e.g. geologic data), remote sensing (e.g. LIDAR), analytical data (e.g. geochem, geochronology)
Broad coverage: study area within a small region (e.g. watershed), to continental and tectonics settings
Also, managing model outputs
Need to manipulate and visualize very large model outputs
GEON: A Platform for Data Integration
Example: GEONsearch
www.geongrid.org
GEONsearch and GEONworkbench
[Diagram labels: GEONsearch; Search Condition(s): spatial, temporal, concept; Log; GEON Catalog; GEON Datasets; extracted information/indexes; Web services: Gazetteer, Geologic Age]
GEON Registration
Resource Registration
Ontology Registration
Dataset Registration (hosted)
Data Item (Schema) Registration (hosted / non-hosted)
Data Item Detail Registration (values)
Service Registration
CUAHSI Hydrologic Information System
cuahsi.sdsc.edu
Integrated access to federal data sources
Web services for accessing each source
Need to map to common metadata (ontology)
Private workspace
Ability to store data and derived products in “personal digital library”
Integrated search
Ability to search federal data sources as well as digital library, with a single search command
Scientific workflows
Access to modeling and analysis tools via scientific workflow software, e.g. Kepler, ModelBuilder, D2K…
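A minimal sketch of the "single search command" idea above, assuming each source has already been wrapped so that it returns simple (variable, site) rows. The source names, the native variable names, and the COMMON_TERMS mapping are hypothetical stand-ins for illustration, not the actual CUAHSI HIS services or ontology.

```python
from dataclasses import dataclass

# Hypothetical mapping from source-specific variable names to a shared vocabulary.
COMMON_TERMS = {
    "discharge_cfs": "streamflow",
    "Q": "streamflow",
    "precip_mm": "precipitation",
}


@dataclass(frozen=True)
class Record:
    source: str
    variable: str  # term from the common vocabulary
    site: str


def normalize(source, rows):
    """Translate one source's native rows into common-vocabulary Records."""
    return [
        Record(source, COMMON_TERMS[var], site)
        for var, site in rows
        if var in COMMON_TERMS
    ]


def integrated_search(variable, sources):
    """Search every registered source, including the personal library, in one call."""
    hits = []
    for name, rows in sources.items():
        hits.extend(r for r in normalize(name, rows) if r.variable == variable)
    return hits


# Stand-ins for Web-service responses and for the user's personal digital library.
sources = {
    "agency_a": [("discharge_cfs", "site-01"), ("precip_mm", "site-02")],
    "agency_b": [("Q", "site-17")],
    "personal_library": [("Q", "my-derived-series")],
}

results = integrated_search("streamflow", sources)
for r in results:
    print(r.source, r.variable, r.site)
```

The point of the sketch is that the vocabulary mapping lives in one place, so adding a source means adding terms rather than changing the search call.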
Data Integration in CUAHSI HIS
From: “Chapter 4: System Architecture,” by Chaitan Baru, Ilya Zaslavsky, Reza Wahadj,
in Hydrologic Information Systems: A Status Report, edited by David Maidment,
http://www.ce.utexas.edu/prof/maidment/CUAHSI/HISStatusSept15.pdf
ROADNet: Real-time Observatories, Applications & Data management Networks
(courtesy: John Orcutt, Frank Vernon, SIO)
SDSC Storage Resource Broker
USArray: Background
Overview
12-year project; part of EarthScope
Continental-scale seismic observatory for lithosphere and deep Earth structure
Record local, regional and teleseismic earthquakes
Major Components
A transportable array of 400 portable, unmanned three-component broadband seismometers deployed on a uniform grid that will systematically cover the US
A flexible component of 400 portable, three-component, short-period and broadband seismographs and 2000 single-channel high-frequency recorders
A permanent array of high-quality, three-component seismic stations, coordinated as part of the US Geological Survey's Advanced National Seismic System (ANSS), to provide a reference array spanning the contiguous United States and Alaska.
URLs
http://www.earthscope.org/usarray/
http://anf.ucsd.edu/index.html
http://www.earthscope.org/usarray/usarray_assets/USArray6.mov
Courtesy: Frank Vernon, SIO, Tony Fountain, SDSC
USArray Existing Infrastructure
Infrastructure / Data Flow:
Seismic sensors connected to dataloggers
Dataloggers stream data to a central collection facility at SIO via IP-based (and other) networking
New sites initially stream data into “Prelim” ORB (object ring buffer) for QA/QC
Operational sites stream data into Production ORB
Production data streams sent from SIO to IRIS for archiving and dissemination (www.iris.edu)
Uses BRTT’s Antelope sensornet middleware throughout
Scale:
Up to 400 sites deployed at any given time
Thousands of channels of real-time streaming data
Status:
Currently in first-wave deployment (~100 sites)
Between 5 and 20 new sites (physically) installed per week
Transportable array sites will *move* every 18 months
Courtesy: Frank Vernon, SIO, Tony Fountain, SDSC
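The data flow above (new sites stream into a "Prelim" ring buffer for QA/QC, operational sites into a production buffer) can be sketched roughly as follows. This is an illustration only: the RingBuffer class, the packet layout, the site names, and the passes_qc rule are all assumptions made for the example and are not the Antelope ORB API.

```python
from collections import deque


class RingBuffer:
    """Fixed-capacity packet buffer: the oldest packet is dropped when full."""

    def __init__(self, capacity):
        self._packets = deque(maxlen=capacity)

    def put(self, packet):
        self._packets.append(packet)

    def latest(self, n):
        """Return up to the n most recent packets, oldest first."""
        return list(self._packets)[-n:] if n > 0 else []


def passes_qc(packet):
    """Hypothetical QA/QC rule: accept only packets that contain samples."""
    return len(packet["samples"]) > 0


def route(packet, operational_sites, prelim, production):
    """Send packets from operational sites to production, all others to prelim."""
    target = production if packet["site"] in operational_sites else prelim
    target.put(packet)


prelim = RingBuffer(capacity=3)
production = RingBuffer(capacity=3)
operational = {"SITE_A"}  # hypothetical site name

# An operational site streams five packets; only the newest three are retained.
for seq in range(5):
    route({"site": "SITE_A", "seq": seq, "samples": [0.1, 0.2]},
          operational, prelim, production)

# A newly installed site sends an empty packet, which lands in the prelim buffer.
route({"site": "SITE_NEW", "seq": 0, "samples": []},
      operational, prelim, production)

# Promote the new site only if its recent prelim packets pass QA/QC.
if all(passes_qc(p) for p in prelim.latest(3)):
    operational.add("SITE_NEW")
```

Keeping the two buffers separate means a misbehaving new station never reaches the stream that is forwarded for archiving.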
SOA Architecture Instantiated for USArray
Courtesy: Tony Fountain, Neil Cotofana
Cyberinfrastructure Lab for Environmental Observing Systems (CLEOS), SDSC
KEPLER & ROADNet Real-Time Scientific Workflows
Straightforward example: laser strainmeter channels in; scientific workflow; Earth-tide signal out
Architecture [diagram labels: seismic waveforms, images, other types of data; ORBserver; real-time packet buffer; near-real-time database; scientific workflow]
Courtesy: John Orcutt, SIO
LOOKING: Laboratory for Ocean Observatory Knowledge & INtegration Grid
NSF ITR Grant
Cyberinfrastructure for Ocean Observatories Initiative
Courtesy: John Orcutt, SIO
CHRONOS Federated Databases
Create a dynamic, interactive and time-calibrated framework for Earth history
Develop a network of chronostratigraphy databases
“Federated” Database Design
The following databases are part of the CHRONOS Federated Database at SDSC, based on IBM’s DB2 Information Integrator
Neptune
PaleoStrat
PaleoBiology
Janus
TimeScale
FAUNMAP
MIOMAP
Courtesy: Doug Greer, SDSC
Top-Level View of a Federated Database
[Diagram labels: Applications; Federated Database; Data Source A; Data Source B; Data Source C; Data Source D]
Federated Data Sources
Geographically Distributed
Heterogeneous
Relational Databases – most common
Spreadsheets
Non-relational Sources
Web Pages / Web Services
Flat Files
“Global Views”
Views may be “virtual”, or contain data (materialized views)
Views define data in a uniform way across the data sources
Applications can access data through these global views, using SQL
Example: Chronos Hole_Desc View
Uniform global view for hole/taxa description for Age Depth Plots application
CHRONOS Hole_Desc
Database Name
Hole_ID
Elevation
Meters_of_Section
Taxa_Count
Courtesy: Doug Greer, SDSC
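To make the global-view idea concrete, here is a small sketch that uses SQLite as a stand-in for a federated database (the CHRONOS system itself is described as being based on IBM's DB2 Information Integrator). The two source tables, their native column names, and the sample rows are invented for illustration; only the Hole_Desc column list comes from the slide, with "Database Name" written as Database_Name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two stand-in sources with different native schemas (hypothetical).
CREATE TABLE source_a_holes (hole TEXT, elev_m REAL, section_m REAL, n_taxa INTEGER);
CREATE TABLE source_b_sites (site_code TEXT, elevation REAL, thickness REAL, taxa INTEGER);

INSERT INTO source_a_holes VALUES ('A-1', -2500.0, 310.5, 42);
INSERT INTO source_b_sites VALUES ('B-7', 120.0, 85.0, 17);

-- Virtual global view: no data is copied; rows are computed on each query.
CREATE VIEW Hole_Desc AS
    SELECT 'SourceA' AS Database_Name, hole AS Hole_ID, elev_m AS Elevation,
           section_m AS Meters_of_Section, n_taxa AS Taxa_Count
    FROM source_a_holes
    UNION ALL
    SELECT 'SourceB', site_code, elevation, thickness, taxa
    FROM source_b_sites;

-- Materialized variant: a snapshot table that must be refreshed explicitly.
CREATE TABLE Hole_Desc_Materialized AS SELECT * FROM Hole_Desc;
""")

# An application queries the uniform view with plain SQL.
rows = conn.execute(
    "SELECT Database_Name, Hole_ID, Taxa_Count FROM Hole_Desc ORDER BY Hole_ID"
).fetchall()

# New data in a source shows up in the virtual view but not in the snapshot.
conn.execute("INSERT INTO source_a_holes VALUES ('A-2', -1800.0, 95.0, 9)")
view_count = conn.execute("SELECT COUNT(*) FROM Hole_Desc").fetchone()[0]
snapshot_count = conn.execute("SELECT COUNT(*) FROM Hole_Desc_Materialized").fetchone()[0]
```

The last three statements show the trade-off between the two kinds of view: the virtual view is always current, while the materialized copy avoids touching the sources on every query but goes stale until it is refreshed.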
Challenges
Efficient access to remote data
Service interfaces to allow subsetting of data at remote end
Efficient access to very large data
Parallel I/O, manipulation of tens of TB of visualization output, long-term storage of hundreds of TB of model output
Versioning of data and metadata, and providing provenance
Managing access to “regular” users vs “power” users (or, privileged users)
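As an illustration of the first challenge, subsetting at the remote end, the sketch below compares fetching a whole dataset with asking the service for only the points inside a bounding box. RemoteStore, BBox, and the synthetic grid are hypothetical; a real deployment would put fetch_subset behind a Web service interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    west: float
    south: float
    east: float
    north: float

    def contains(self, lon, lat):
        return self.west <= lon <= self.east and self.south <= lat <= self.north


class RemoteStore:
    """Stand-in for a remote data service holding (lon, lat, value) points."""

    def __init__(self, points):
        self._points = points

    def fetch_all(self):
        # The client receives everything and must filter locally.
        return list(self._points)

    def fetch_subset(self, bbox):
        # Filtering runs where the data lives; only matching points are returned.
        return [p for p in self._points if bbox.contains(p[0], p[1])]


# Synthetic 5-degree grid over a continental-scale region (values are made up).
points = [
    (lon, lat, float(lon + lat))
    for lon in range(-125, -65, 5)
    for lat in range(25, 50, 5)
]
store = RemoteStore(points)

study_area = BBox(west=-120, south=30, east=-110, north=40)
everything = store.fetch_all()
subset = store.fetch_subset(study_area)
print(f"transferred {len(subset)} of {len(everything)} points")
```

With this toy grid the subset request moves 9 points instead of 60; the same ratio matters far more when the remote holdings are measured in terabytes.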
More Challenges…
Distributed versus centralized storage
Warehousing vs federation
Or should it really be…
Distributed Curation and Centralized storage?
Long-term preservation of digital data
Opportunities
Standardize on Web service interfaces for tools, applications, and data
E.g. Web Mapping Services for map image services, services for accessing geologic maps, gravity data, sensor data, …
Develop community standards for knowledge representation
Schemas, controlled vocabularies, ontologies
Choose common representation system, e.g. OWL
“Meta-workflow” frameworks
Support inter-operation among different scientific workflow systems
There may be an opportunity to work through new GSA Division on Geoinformatics and AGU working group on IT
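For the Web Mapping Services example mentioned above, a standard interface means any client can request a map image with the same small set of parameters. The sketch below builds an OGC WMS 1.1.1 GetMap request URL; the endpoint http://example.org/wms and the layer name geologic_map are placeholders, not real services.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def wms_getmap_url(base_url, layer, bbox, width, height,
                   srs="EPSG:4326", fmt="image/png"):
    """Build a WMS 1.1.1 GetMap URL; bbox is (minx, miny, maxx, maxy)."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty string requests the default style
        "SRS": srs,
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": fmt,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and layer name, used only to show the request shape.
url = wms_getmap_url("http://example.org/wms", "geologic_map",
                     (-125, 25, -65, 50), 800, 400)

# Parse the query string back to confirm what a server would receive.
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(url)
```

Because the parameter names are fixed by the standard, the same function works against any compliant server; only the base URL and layer name change.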
Thank You!
Chaitan Baru
baru@sdsc.edu