FGIT DATA CHALLENGES Baru

Presentation Transcript

Data Management Practices and Challenges in Geosciences Today : 

Chaitan Baru
San Diego Supercomputer Center (SDSC)
California Institute for Telecommunications and Information Technology (Calit2)

Data Management : 

DATA COLLECTION
DATA PUBLICATION
DATA ACCESS

Geosciences Data Management : 

Atmospheric Sciences
  Meteorological data provides community focus: real-time as well as archived data
  Common field of interest (e.g. weather); continental scale
Ocean Sciences
  Ship cruises and real-time data from moorings
  Increasingly, integration with more diverse data (including biological)
  Field of interest is regional (e.g. extent of cruises)
Earth Sciences
  Broad range of data types: sensor data (e.g. seismic, GPS), field data collections (e.g. geologic data), remote sensing (e.g. LIDAR), analytical data (e.g. geochemistry, geochronology)
  Broad coverage: from a study area within a small region (e.g. a watershed) to continental and tectonic settings
  Also, managing model outputs: need to manipulate and visualize very large outputs from models

GEON: A Platform for Data Integration Example: GEONsearch : 

www.geongrid.org

GEONsearch and GEONworkbench : 

[Diagram labels]
GEONsearch
Search condition(s): spatial, temporal, concept
Log
GEON Catalog
GEON Datasets
Extracted information/indexes
Web services: Gazetteer, Geologic Age
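The three search conditions named on this slide (spatial, temporal, concept) can be pictured as filters applied together over catalog entries. The following is only a minimal sketch of that idea, not GEON's implementation: the record fields, the dataset names, and the choice to model the temporal condition as a geologic age range in millions of years are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical catalog entry; field names are illustrative, not GEON's schema."""
    name: str
    bbox: tuple          # (west, south, east, north) in degrees
    age_range_ma: tuple  # (youngest, oldest) in millions of years
    concepts: frozenset  # controlled-vocabulary terms attached to the dataset


def bbox_intersects(a, b):
    """True when two (west, south, east, north) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def ranges_overlap(a, b):
    """True when two (low, high) intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


def search(catalog, bbox=None, age_range_ma=None, concept=None):
    """Return names of datasets that satisfy every condition that was supplied."""
    results = []
    for rec in catalog:
        if bbox is not None and not bbox_intersects(rec.bbox, bbox):
            continue
        if age_range_ma is not None and not ranges_overlap(rec.age_range_ma, age_range_ma):
            continue
        if concept is not None and concept not in rec.concepts:
            continue
        results.append(rec.name)
    return results


# Made-up catalog entries, for illustration only.
CATALOG = [
    DatasetRecord("rocky_mtn_gravity", (-112.0, 36.0, -102.0, 45.0),
                  (0.0, 65.0), frozenset({"gravity"})),
    DatasetRecord("appalachian_geochem", (-85.0, 33.0, -75.0, 42.0),
                  (250.0, 480.0), frozenset({"geochemistry"})),
]

spatial_hits = search(CATALOG, bbox=(-110.0, 38.0, -105.0, 41.0))
concept_hits = search(CATALOG, age_range_ma=(300.0, 400.0), concept="geochemistry")
```

Omitted conditions are simply skipped, so a single entry point serves spatial-only, concept-only, or combined queries.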

GEON Registration : 

Ontology Registration
Dataset Registration (hosted)
Data Item (Schema) Registration (hosted / non-hosted)
Data Item Detail Registration (values)
Service Registration
Resource Registration

CUAHSI Hydrologic Information System (cuahsi.sdsc.edu) : 

Integrated access to federal data sources
  Web services for accessing each source
  Need to map to common metadata (ontology)
Private workspace
  Ability to store data and derived products in a “personal digital library”
Integrated search
  Ability to search federal data sources as well as the digital library with a single search command
Scientific workflows
  Access to modeling and analysis tools via scientific workflow software, e.g. Kepler, ModelBuilder, D2K…
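The "map to common metadata" step can be pictured as a lookup from each source's own variable name to a shared concept and unit. This is a rough sketch under stated assumptions: the source names, variable names, and table layout below are hypothetical and are not taken from CUAHSI HIS. The only fixed fact used is the unit conversion (1 cubic foot = 0.028316846592 cubic meters).

```python
# Hypothetical mapping table: (source, source-specific variable) ->
# (common concept, common unit, multiplier to reach the common unit).
COMMON_VOCABULARY = {
    ("source_a", "Q_cfs"): ("streamflow", "m3/s", 0.028316846592),
    ("source_b", "discharge_m3s"): ("streamflow", "m3/s", 1.0),
    ("source_b", "precip_mm"): ("precipitation", "mm", 1.0),
}


def to_common(source, variable, value):
    """Translate one observation into the shared concept and unit."""
    key = (source, variable)
    if key not in COMMON_VOCABULARY:
        raise KeyError(f"no mapping registered for {source}/{variable}")
    concept, unit, factor = COMMON_VOCABULARY[key]
    return {"concept": concept, "unit": unit, "value": value * factor}


flow_a = to_common("source_a", "Q_cfs", 100.0)
flow_b = to_common("source_b", "discharge_m3s", 2.5)
```

Once both sources report "streamflow" in the same unit, a single search or workflow step can treat them alike, which is the purpose of the common ontology.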

Data Integration in CUAHSI HIS : 

From: “Chapter 4: System Architecture,” by Chaitan Baru, Ilya Zaslavsky, Reza Wahadj, in Hydrologic Information Systems: A Status Report, edited by David Maidment, http://www.ce.utexas.edu/prof/maidment/CUAHSI/HISStatusSept15.pdf

Slide 9: 

ROADNet: Real-time Observatories, Applications & Data management Networks (courtesy: John Orcutt, Frank Vernon, SIO)

Slide 10: 

SDSC Storage Resource Broker

USArray: Background : 

Overview
  12-year project; part of EarthScope
  Continental-scale seismic observatory for lithosphere and deep Earth structure
  Record local, regional and teleseismic earthquakes
Major Components
  A transportable array of 400 portable, unmanned three-component broadband seismometers deployed on a uniform grid that will systematically cover the US
  A flexible component of 400 portable, three-component, short-period and broadband seismographs and 2000 single-channel high-frequency recorders
  A permanent array of high-quality, three-component seismic stations, coordinated as part of the US Geological Survey's Advanced National Seismic System (ANSS), to provide a reference array spanning the contiguous United States and Alaska
URLs
  http://www.earthscope.org/usarray/
  http://anf.ucsd.edu/index.html
  http://www.earthscope.org/usarray/usarray_assets/USArray6.mov
Courtesy: Frank Vernon, SIO, Tony Fountain, SDSC

USArray Existing Infrastructure : 

Infrastructure / Data Flow:
  Seismic sensors connected to dataloggers
  Dataloggers stream data to the central collection facility at SIO via IP-based (and other) networking
  New sites initially stream data into the “Prelim” ORB (object ring buffer) for QA/QC
  Operational sites stream data into the Production ORB
  Production data streams are sent from SIO to IRIS for archiving and dissemination (www.iris.edu)
  Uses BRTT’s Antelope sensornet middleware throughout
Scale:
  Up to 400 sites deployed at any given time
  Thousands of channels of real-time streaming data
Status:
  Currently in first-wave deployment (~100 sites)
  Between 5 and 20 new sites (physically) installed per week
  Transportable array sites will *move* every 18 months
Courtesy: Frank Vernon, SIO, Tony Fountain, SDSC
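The Prelim/Production data flow above can be sketched with two fixed-capacity ring buffers and a routing rule. This is an illustrative toy only: it does not use or imitate the Antelope API, and the class names, the `promote` step, and the station code are assumptions introduced for the example.

```python
from collections import deque


class RingBuffer:
    """Fixed-capacity buffer: once full, the oldest packet is dropped on each put."""

    def __init__(self, capacity):
        self._packets = deque(maxlen=capacity)

    def put(self, packet):
        self._packets.append(packet)

    def latest(self, n):
        """Return up to n of the most recent packets, oldest first."""
        return list(self._packets)[-n:] if n > 0 else []

    def __len__(self):
        return len(self._packets)


class Router:
    """Route packets to a prelim buffer until a site is marked operational."""

    def __init__(self, capacity):
        self.prelim = RingBuffer(capacity)
        self.production = RingBuffer(capacity)
        self._operational = set()

    def promote(self, site):
        """Hypothetical step: the site has passed QA/QC and becomes operational."""
        self._operational.add(site)

    def ingest(self, site, packet):
        target = self.production if site in self._operational else self.prelim
        target.put((site, packet))


router = Router(capacity=3)
for seq in range(4):            # four packets into a three-slot buffer
    router.ingest("SITE01", seq)
router.promote("SITE01")
router.ingest("SITE01", 4)      # now lands in the production buffer
```

Because the buffer has a fixed capacity, a slow consumer loses the oldest packets rather than stalling the incoming stream, which is the usual trade-off for real-time acquisition.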

SOA Architecture Instantiated for USArray : 

Courtesy: Tony Fountain, Neil Cotofana, Cyberinfrastructure Lab for Environmental Observing Systems (CLEOS), SDSC

KEPLER & ROADNet Real-Time Scientific Workflows : 

Straightforward example: laser strainmeter channels in; scientific workflow; Earth-tide signal out
Architecture: seismic waveforms, images, and other types of data; ORBserver (real-time packet buffer); near-real-time database; scientific workflow
Courtesy: John Orcutt, SIO

Slide 16: 

LOOKING: Laboratory for Ocean Observatory Knowledge & INtegration Grid
NSF ITR Grant
Cyberinfrastructure for Ocean Observatories Initiative
Courtesy: John Orcutt, SIO

CHRONOS Federated Databases : 

Create a dynamic, interactive and time-calibrated framework for Earth history
Develop a network of chronostratigraphy databases
“Federated” Database Design
  The following databases are part of the CHRONOS Federated Database at SDSC, based on IBM’s DB2 Information Integrator:
    Neptune
    PaleoStrat
    PaleoBiology
    Janus
    TimeScale
    FAUNMAP
    MIOMAP
Courtesy: Doug Greer, SDSC

Top-Level View of a Federated Database : 

[Diagram labels: Applications; Federated Database; Data Source A; Data Source B; Data Source C; Data Source D]

Federated Data Sources : 

Geographically Distributed
Heterogeneous
  Relational Databases (most common)
  Spreadsheets
  Non-relational Sources
  Web Pages / Web Services
  Flat Files
“Global Views”
  Views may be “virtual”, or contain data (materialized views)
  Views define data in a uniform way across the data sources
  Applications can access data through these global views, using SQL

Example: Chronos Hole_Desc View : 

Uniform global view for hole/taxa description for the Age Depth Plots application
CHRONOS Hole_Desc
  Database Name
  Hole_ID
  Elevation
  Meters_of_Section
  Taxa_Count
Courtesy: Doug Greer, SDSC
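A virtual global view of this kind can be sketched with SQLite standing in for the federation layer. This is not the CHRONOS implementation (which the slides say is based on IBM's DB2 Information Integrator): the two source databases, their table and column names, and the sample rows are invented for illustration. Only the view's column list follows the slide, with "Database Name" written as `Database_Name` as an assumption.

```python
import sqlite3


def build_federation():
    """Attach two stand-in source databases and define a uniform view over them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS source_a")
    conn.execute("ATTACH DATABASE ':memory:' AS source_b")

    # Each source keeps its own (hypothetical) schema and naming.
    conn.execute(
        "CREATE TABLE source_a.holes "
        "(hole_id TEXT, elev_m REAL, section_m REAL, n_taxa INTEGER)"
    )
    conn.execute(
        "CREATE TABLE source_b.sections "
        "(section_code TEXT, elevation REAL, thickness REAL, taxa INTEGER)"
    )
    conn.execute("INSERT INTO source_a.holes VALUES ('A-1', -3200.0, 410.5, 37)")
    conn.execute("INSERT INTO source_b.sections VALUES ('B-7', 1250.0, 88.0, 12)")

    # A TEMP view is used because SQLite only lets temporary views
    # reference tables that live in other attached databases.
    conn.execute(
        """
        CREATE TEMP VIEW Hole_Desc AS
        SELECT 'source_a' AS Database_Name,
               hole_id    AS Hole_ID,
               elev_m     AS Elevation,
               section_m  AS Meters_of_Section,
               n_taxa     AS Taxa_Count
        FROM source_a.holes
        UNION ALL
        SELECT 'source_b', section_code, elevation, thickness, taxa
        FROM source_b.sections
        """
    )
    return conn


conn = build_federation()
rows = conn.execute(
    "SELECT Database_Name, Hole_ID, Taxa_Count FROM Hole_Desc ORDER BY Hole_ID"
).fetchall()
```

The application queries `Hole_Desc` with ordinary SQL and never sees the per-source column names. Because the view is virtual, each query reads the sources at query time; a materialized view would instead copy the rows and need refreshing.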

Challenges : 

Efficient access to remote data
  Service interfaces to allow subsetting of data at the remote end
Efficient access to very large data
  Parallel I/O, manipulation of tens of TB of visualization output, long-term storage of hundreds of TB of model output
Versioning of data and metadata, and providing provenance
Managing access for “regular” users vs “power” (or privileged) users
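Subsetting at the remote end means the service cuts out the requested window before anything crosses the network, so the client receives only what it asked for. A minimal sketch of the server-side step follows; the function name, its parameters, and the toy grid are hypothetical and stand in for whatever a real service interface would expose.

```python
def subset_grid(grid, row_range, col_range, stride=1):
    """Return only the requested window of a 2-D grid, optionally thinned.

    row_range and col_range are half-open (start, stop) index pairs;
    stride keeps every stride-th row and column inside the window.
    """
    r0, r1 = row_range
    c0, c1 = col_range
    return [row[c0:c1:stride] for row in grid[r0:r1:stride]]


# Toy 10 x 10 grid where each cell encodes its own position (row * 10 + col).
full_grid = [[r * 10 + c for c in range(10)] for r in range(10)]

window = subset_grid(full_grid, row_range=(2, 4), col_range=(5, 8))
thinned = subset_grid(full_grid, row_range=(0, 4), col_range=(0, 4), stride=2)
```

Here the window request returns 6 of the 100 cells. The same principle, applied to multi-terabyte model output, is what makes remote access practical.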

More Challenges… : 

Distributed versus centralized storage
  Warehousing vs federation
  Or should it really be… distributed curation and centralized storage?
Long-term preservation of digital data

Opportunities : 

Standardize on Web service interfaces for tools, applications, and data
  E.g. Web Mapping Services for map image services, services for accessing geologic maps, gravity data, sensor data, …
Develop community standards for knowledge representation
  Schemas, controlled vocabularies, ontologies
  Choose a common representation system, e.g. OWL
“Meta-workflow” frameworks
  Support interoperation among different scientific workflow systems
There may be an opportunity to work through the new GSA Division on Geoinformatics and the AGU working group on IT
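As one concrete picture of a standardized service interface, a map image request to a Web Map Service is an ordinary HTTP GET with a fixed set of query parameters. The sketch below builds a WMS 1.1.1 `GetMap` URL using the parameter names defined by that standard; the endpoint and layer name are placeholders, not real services, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_getmap_url(endpoint, layer, bbox, width, height,
                     srs="EPSG:4326", image_format="image/png"):
    """Build a WMS 1.1.1 GetMap URL; bbox is (minx, miny, maxx, maxy)."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value requests the server's default style
        "SRS": srs,
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return f"{endpoint}?{urlencode(params)}"


# Placeholder endpoint and layer name, for illustration only.
url = build_getmap_url(
    "http://example.org/wms",
    layer="geologic_map",
    bbox=(-125.0, 24.0, -66.0, 50.0),
    width=800,
    height=400,
)

# Parse the query string back to confirm what a server would receive.
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
```

Because every compliant server accepts the same parameters, a client written once can draw maps from any of them, which is the interoperability argument made on this slide.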

Thank You! : 

Chaitan Baru
baru@sdsc.edu