Slide 1: Persistent Management of Distributed Data
Reagan W. Moore
University of California, San Diego
San Diego Supercomputer Center
moore@sdsc.edu
http://www.npaci.edu/DICE/
Data and Knowledge Systems Group : Data and Knowledge Systems Group Staff
Reagan Moore
Ilkai Altintas
Chaitan Baru
Sheau Yen Chen
Charles Cowart
Amarnath Gupta
George Kremenek
Bertram Ludäscher
Richard Marciano
XuFei Qian
Roman Olshanowsky
Arcot Rajasekar
Abe Singer
Michael Wan
Ilya Zaslavsky
Bing Zhu
Graduate Students
A. Bagchi
S. Bansal
A. Behere
R. Bharath
S. Bharath
M. Kulrul
L. Sui
Undergraduate Interns
N. Cotofana
M. Shumaker
J. Trang
L. Yin
+/- NN
Information Management Projects : Information Management Projects Digital Libraries
California Digital Library - Art Museum Image Consortium
DARPA/USPTO Patent digital library - SAIC, NCSA, U Virginia
NLM Visible Embryo digital library - GMU, OHSU, UC, LSU, JHU
NSF Digital Library Initiative, Phase II - UCSB, Stanford
NSF NPACI Digital Sky - Caltech
NSF National Science Education Digital Library - UCAR, Cornell, UCSB, U Mass, Columbia
Data Grid Environments
DOE Data Visualization Corridor - LLNL, OSU
DOE Particle Physics Data Grid - Stanford, Caltech
NASA Information Power Grid - NASA Ames, USC/ISI, U Texas
NIH Biomedical Informatics Research Network - Duke, Harvard
NSF Grid Physics Network - U Florida, Caltech, U Wisc, USC/ISI, U Chicago
NSF National Virtual Observatory - JHU, Caltech
NSF Southern California Earthquake Center - USC/ISI
Persistent Archives
NARA Persistent Archive - UCB, U Maryland
NHPRC Archivist workbench - U Minnesota
NSF NSDL Persistent archive for curricula modules - Cornell
Topics : Topics Data management systems
Data collections, digital libraries
Distributed data management
Data grids
Persistent data management
Persistent archives
Common infrastructure for data management
Data Collections : Data Collections Define the context for describing a collection of digital entities
Context specified by metadata attributes
Provenance, origin of the digital entities
Administrative, location of the digital entities
Technical, purpose of the digital entities
Support organization of attributes as hierarchy of sub-collections
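The collection model above can be sketched in a few lines of Python. This is a minimal illustration only: the class names `DigitalEntity` and `Collection` and the example values are hypothetical and are not part of any SRB or MCAT interface.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalEntity:
    """One digital entity plus the metadata that gives it context."""
    name: str
    provenance: dict = field(default_factory=dict)      # origin of the entity
    administrative: dict = field(default_factory=dict)  # location of the entity
    technical: dict = field(default_factory=dict)       # purpose of the entity


@dataclass
class Collection:
    """A collection holds entities and nested sub-collections."""
    name: str
    entities: list = field(default_factory=list)
    subcollections: dict = field(default_factory=dict)

    def subcollection(self, name):
        """Return the named sub-collection, creating it on first use."""
        return self.subcollections.setdefault(name, Collection(name))

    def walk(self, prefix=""):
        """Yield (path, entity) pairs for the whole hierarchy."""
        path = f"{prefix}/{self.name}"
        for entity in self.entities:
            yield f"{path}/{entity.name}", entity
        for child in self.subcollections.values():
            yield from child.walk(path)


# Hypothetical example values, for illustration only.
root = Collection("sky-survey")
images = root.subcollection("images")
images.entities.append(DigitalEntity(
    name="tile-0001.fits",
    provenance={"source": "example telescope"},
    administrative={"site": "example-archive"},
    technical={"purpose": "survey image"},
))
paths = [path for path, _ in root.walk()]
```

Keeping the three attribute groups separate mirrors the slide's provenance, administrative, and technical split; a real catalog would store them in a database rather than in memory.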
Digital Libraries : Digital Libraries Provide services on the data collection
Ingestion, loading of attribute values
Extensibility, definition of new attributes
Discovery, queries on attributes
Browsing, hierarchical listing
Presentation, formatting specified data models
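The services listed above can be illustrated with a toy in-memory catalog. The `Catalog` class and its method names are assumptions made for this sketch, not an actual digital library API.

```python
class Catalog:
    """Toy attribute catalog offering ingestion, extension, and discovery."""

    def __init__(self, attributes):
        self.attributes = set(attributes)   # attribute names the catalog knows
        self.records = {}                   # entity name -> attribute values

    def define_attribute(self, name):
        """Extensibility: register a new attribute."""
        self.attributes.add(name)

    def ingest(self, entity, **values):
        """Ingestion: load attribute values, rejecting undefined attributes."""
        unknown = set(values) - self.attributes
        if unknown:
            raise KeyError(f"undefined attributes: {sorted(unknown)}")
        self.records.setdefault(entity, {}).update(values)

    def discover(self, **criteria):
        """Discovery: return entities whose attributes match every criterion."""
        return sorted(
            entity for entity, values in self.records.items()
            if all(values.get(k) == v for k, v in criteria.items())
        )

    def browse(self, prefix):
        """Browsing: list entity names under a hierarchical prefix."""
        return sorted(e for e in self.records if e.startswith(prefix))


# Hypothetical entities and attribute values.
catalog = Catalog(["creator", "format"])
catalog.ingest("/art/img1", creator="museum-a", format="tiff")
catalog.ingest("/art/img2", creator="museum-b", format="tiff")
catalog.define_attribute("survey")
catalog.ingest("/sky/tile1", format="fits", survey="example")
tiffs = catalog.discover(format="tiff")
art = catalog.browse("/art/")
```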
Data Grids : Data Grids Manage data in a distributed environment
Logical name space, provide global identifier
Data access, storage system abstraction
Replication, disaster backup
Uniform access, common API across file systems, archives, and databases
Single sign-on, authenticate across administration domains
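As a rough sketch of how a logical name space and replication work together, the following Python keeps several physical replicas behind one global identifier and falls back to another copy when a site is unavailable. `ReplicaCatalog`, the site names, and the paths are hypothetical; dictionaries stand in for real storage systems.

```python
class ReplicaCatalog:
    """Maps a logical (global) name to replicas held on several resources."""

    def __init__(self):
        self.replicas = {}  # logical name -> list of (resource, physical path)

    def register(self, logical_name, resource, physical_path):
        """Record one more physical copy of a logical entity."""
        self.replicas.setdefault(logical_name, []).append((resource, physical_path))

    def read(self, logical_name, resources):
        """Try each replica in turn; return data from the first that responds."""
        for resource, path in self.replicas.get(logical_name, []):
            store = resources.get(resource)
            if store is not None and path in store:
                return store[path]
        raise FileNotFoundError(logical_name)


# In-memory stand-ins for two storage sites.
resources = {
    "site-a": {"/disk1/f.dat": b"payload"},
    "site-b": {"/tape/0042": b"payload"},
}
catalog = ReplicaCatalog()
catalog.register("/grid/home/f.dat", "site-a", "/disk1/f.dat")
catalog.register("/grid/home/f.dat", "site-b", "/tape/0042")

del resources["site-a"]  # simulate losing one site
data = catalog.read("/grid/home/f.dat", resources)
```

The caller only ever uses the global name; which replica answers is decided by the catalog.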
Persistent Archives : Persistent Archives Manage technology evolution
Storage system abstraction, support data migration across storage systems
Information repository abstraction, support catalog migration to new databases
Logical name space, support global persistent identifier
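One way to picture management of technology evolution: the persistent identifier stays fixed while the storage system behind it is replaced. The sketch below assumes a hypothetical `PersistentArchive` class with dictionaries standing in for storage systems; it does not describe how the SRB performs migration.

```python
class PersistentArchive:
    """Keeps logical identifiers stable while storage systems are replaced."""

    def __init__(self):
        self.stores = {}     # storage system name -> {physical key: bytes}
        self.locations = {}  # persistent id -> (storage system, physical key)

    def add_store(self, name):
        self.stores[name] = {}

    def put(self, persistent_id, store, key, data):
        self.stores[store][key] = data
        self.locations[persistent_id] = (store, key)

    def get(self, persistent_id):
        store, key = self.locations[persistent_id]
        return self.stores[store][key]

    def migrate(self, old_store, new_store):
        """Copy every entity from old_store to new_store, then retire old_store."""
        for pid, (store, key) in list(self.locations.items()):
            if store == old_store:
                self.stores[new_store][key] = self.stores[old_store][key]
                self.locations[pid] = (new_store, key)
        del self.stores[old_store]


# Hypothetical identifiers and storage names.
archive = PersistentArchive()
archive.add_store("tape-gen1")
archive.put("record-001", "tape-gen1", "vol7/blk3", b"census data")
archive.add_store("tape-gen2")
archive.migrate("tape-gen1", "tape-gen2")
after = archive.get("record-001")
```

Clients that hold "record-001" are unaffected by the migration, which is the point of the logical name space.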
Storage Resource Broker : Storage Resource Broker Integration of collection-based management of digital entities, with
Remote data access through storage system abstraction
Catalog access through information repository abstraction
Automation through collection-owned data
Capabilities : Capabilities Support legacy systems
Integrate archives with file systems
Share distributed data
Maintain persistent collection
Control data access
Uniform API : Uniform API Provide common access semantics
Map from the interface preferred by your application to the interfaces required by legacy storage systems
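The mapping described above is essentially a driver pattern. The following sketch, with hypothetical class names (`StorageDriver`, `FileSystemDriver`, `ArchiveDriver`), shows application code written once against common access semantics while each driver translates to its own back end. A dictionary stands in for the legacy archive.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Common access semantics that every storage system must provide."""

    @abstractmethod
    def write(self, name, data): ...

    @abstractmethod
    def read(self, name): ...


class FileSystemDriver(StorageDriver):
    """Maps the common calls onto an ordinary directory."""

    def __init__(self, root):
        self.root = root

    def write(self, name, data):
        with open(os.path.join(self.root, name), "wb") as handle:
            handle.write(data)

    def read(self, name):
        with open(os.path.join(self.root, name), "rb") as handle:
            return handle.read()


class ArchiveDriver(StorageDriver):
    """Stand-in for a legacy archive; a dict replaces the real system here."""

    def __init__(self):
        self._objects = {}

    def write(self, name, data):
        self._objects[name] = bytes(data)

    def read(self, name):
        return self._objects[name]


def copy(name, source, target):
    """Application code sees one interface, whatever the back end is."""
    target.write(name, source.read(name))


with tempfile.TemporaryDirectory() as scratch:
    disk = FileSystemDriver(scratch)
    archive = ArchiveDriver()
    disk.write("result.dat", b"simulation output")
    copy("result.dat", disk, archive)
    archived = archive.read("result.dat")
```

Adding support for another storage system then means writing one more driver, with no change to `copy` or other application code.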
Slide 12: [Figure: SDSC Storage Resource Broker & Meta-data Catalog architecture, labeled Common APIs. Diagram labels: Application; Access APIs (Java, NT Browsers, Web, WSDL, GridFTP, HRM, Linux I/O, DLL / Python); Consistency Management / Authorization-Authentication; Prime Server; Logical Name Space; Latency Management; Data Transport; Metadata Transport; Servers; Storage Abstraction; Catalog Abstraction; Databases (DB2, Oracle, Sybase)]
Discovery Transparencies : Discovery Transparencies Naming transparency - find a data set without knowing its name
Map from attributes to a global file name
Location transparency - access a data set without knowing where it is
Map from global file name to local file name
Access transparency - access a data set without knowing the type of storage system
Federated client-server architecture
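Naming and location transparency amount to two chained lookups. The sketch below uses hypothetical catalog contents and function names (`find`, `locate`, `resolve`) to show attributes mapping to a global file name, and the global name mapping to a resource and local file name.

```python
# Hypothetical catalog contents, for illustration only.
ATTRIBUTES = {
    "/home/collection/m31.fits": {"target": "M31", "band": "J"},
    "/home/collection/m33.fits": {"target": "M33", "band": "J"},
}
LOCATIONS = {
    "/home/collection/m31.fits": ("archive-1", "/tape/vol12/0007"),
    "/home/collection/m33.fits": ("cache-2", "/scratch/a9/m33.fits"),
}


def find(**criteria):
    """Naming transparency: attributes -> global file names."""
    return sorted(
        name for name, attrs in ATTRIBUTES.items()
        if all(attrs.get(k) == v for k, v in criteria.items())
    )


def locate(global_name):
    """Location transparency: global file name -> (resource, local file name)."""
    return LOCATIONS[global_name]


def resolve(**criteria):
    """Chain both mappings so the caller supplies attributes only."""
    return [(name, *locate(name)) for name in find(**criteria)]


hits = resolve(target="M31")
```

Access transparency would be the final step: reading the local file through a storage driver chosen by resource type, so the caller never sees whether the copy sits on tape or disk.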
Slide 14: [Figure: the SDSC Storage Resource Broker & Meta-data Catalog architecture diagram from Slide 12, with the label Transparencies in place of Common APIs]
Persistent Collection : Persistent Collection Maintain authenticity
Authenticate all accesses
Assign roles for access control lists (curation, write, annotate, read)
Manage audit trails of all operations
Collection-owned data
All accesses through the data management system
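A minimal sketch of role-based access with an audit trail follows. It assumes the four roles form a strict hierarchy (read < annotate < write < curation), which the slide does not state, and it uses a hypothetical `ManagedCollection` class through which every access must pass.

```python
# Assumption: roles are ordered from least to most privileged.
ROLES = ["read", "annotate", "write", "curation"]


class ManagedCollection:
    """All access goes through this object, which checks roles and logs."""

    def __init__(self):
        self.acl = {}    # user -> role
        self.data = {}   # entity name -> bytes
        self.audit = []  # (user, operation, entity, allowed)

    def grant(self, user, role):
        if role not in ROLES:
            raise ValueError(role)
        self.acl[user] = role

    def _check(self, user, needed, operation, entity):
        """Log the attempt, then refuse it if the user's role is too low."""
        role = self.acl.get(user)
        allowed = role is not None and ROLES.index(role) >= ROLES.index(needed)
        self.audit.append((user, operation, entity, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not {operation} {entity}")

    def put(self, user, entity, payload):
        self._check(user, "write", "put", entity)
        self.data[entity] = payload

    def get(self, user, entity):
        self._check(user, "read", "get", entity)
        return self.data[entity]


# Hypothetical users.
collection = ManagedCollection()
collection.grant("alice", "curation")
collection.grant("bob", "read")
collection.put("alice", "report.txt", b"v1")
value = collection.get("bob", "report.txt")
try:
    collection.put("bob", "report.txt", b"v2")
    denied = False
except PermissionError:
    denied = True
```

Refused attempts are logged as well as successful ones, so the audit trail covers all operations.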
Slide 16: [Figure: the SDSC Storage Resource Broker & Meta-data Catalog architecture diagram from Slide 12, with the label Persistency in place of Common APIs]
Preservation (Similar requirements to a data grid) : Preservation (Similar requirements to a data grid) Name transparency
Find a file by attributes (map from attributes to global name)
Location transparency
Access a file by a global identifier (map from global to local file name)
Access transparency
Use same API to access data in archive or file cache
Authenticity
Disaster recovery, replicate data across storage systems
Audit and process management
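The slide does not say how authenticity is verified across replicas. One common approach, assumed here, is to record a checksum at ingestion and use it to detect and repair damaged or missing copies. The function names and the choice of SHA-256 are illustrative.

```python
import hashlib


def checksum(data):
    """SHA-256 digest used as the fixity value for a replica (an assumption)."""
    return hashlib.sha256(data).hexdigest()


def replicate(data, stores, name):
    """Write one copy per storage system and return the recorded digest."""
    for store in stores:
        store[name] = data
    return checksum(data)


def verify_and_repair(stores, name, expected):
    """Replace any missing or damaged copy from a copy that still verifies."""
    good = next(
        (s[name] for s in stores if name in s and checksum(s[name]) == expected),
        None,
    )
    if good is None:
        raise RuntimeError(f"no intact replica of {name}")
    repaired = 0
    for store in stores:
        if store.get(name) != good:
            store[name] = good
            repaired += 1
    return repaired


# Dictionaries stand in for three independent storage systems.
site_a, site_b, site_c = {}, {}, {}
digest = replicate(b"archival record", [site_a, site_b, site_c], "rec-1")
site_b["rec-1"] = b"archival recXrd"  # simulate silent corruption
del site_c["rec-1"]                    # simulate a lost copy
fixed = verify_and_repair([site_a, site_b, site_c], "rec-1", digest)
```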
Slide 18: [Figure: the SDSC Storage Resource Broker & Meta-data Catalog architecture diagram from Slide 12, with the label Preservation in place of Common APIs]
Convergence of Technologies : Convergence of Technologies Data grids as basis for distributed data management
Federation of distributed resources
Creation of logical name space to automate discovery
Distributed data collections
Discovery based on attributes
Distributed data storage systems
Digital libraries
Development of services for manipulating, viewing data
Persistent archives
Management of technology evolution
Digital Entities : Digital Entities Digital entities are “images of reality”, made of
Data, the bits (zeros and ones) put on a storage system
Information, the attributes used to assign semantic meaning to the data
Knowledge, the structural relationships described by a data model
Every digital entity requires information and knowledge to be correctly interpreted and displayed
Differentiating between Data, Information, and Knowledge : Differentiating between Data, Information, and Knowledge Data
Digital object
Objects are streams of bits
Information
Any tagged data, which is treated as an attribute.
Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
Knowledge
Relationships between attributes
Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional
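The three levels can be made concrete with a small data structure: the bits, the tagged attributes, and the relationships between attributes. The `Entity` class and the example attribute names below are hypothetical and serve only to illustrate the distinction.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    data: bytes                                      # data: a stream of bits
    information: dict = field(default_factory=dict)  # information: tagged attributes
    knowledge: list = field(default_factory=list)    # knowledge: relationships

    def relate(self, subject, relation, obj):
        """Record a relationship between two attributes that already exist."""
        for attribute in (subject, obj):
            if attribute not in self.information:
                raise KeyError(attribute)
        self.knowledge.append((subject, relation, obj))

    def related(self, relation):
        """Return the attribute pairs linked by the given kind of relationship."""
        return [(s, o) for s, r, o in self.knowledge if r == relation]


# Hypothetical 2 x 2 image with illustrative attributes.
image = Entity(data=bytes([0, 255, 128, 64]))
image.information.update(
    {"width": 2, "height": 2, "acquired": "t0", "processed": "t1"}
)
image.relate("width", "structural", "height")      # spatial layout of the bits
image.relate("acquired", "temporal", "processed")  # procedural ordering
structural = image.related("structural")
```

Without the width and height attributes and the structural relationship between them, the four bytes cannot be displayed as an image, which is the sense in which interpretation requires information and knowledge.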
Knowledge Creation Roadmap : Knowledge Creation Roadmap Knowledge syntax (consensus)
RDF, XMI, Topic Map
Knowledge management (recursive operations)
Oracle parallel database
Knowledge manipulation (spatial/procedural rules)
Generation of inference rules and mapping to data models
Knowledge generation (scalable inference engine)
Application of inference rules in inference engine
Slide 23: Knowledge Based Data Grid Roadmap [Figure: grid of Knowledge, Information, and Data against Ingest Services, Management, and Access Services (Model-based Access; Data Handling System - SRB). Diagram labels: Relationships Between Concepts; Knowledge Repository for Rules; Knowledge or Topic-Based Query / Browse; Attributes, Semantics; Information Repository; Attribute-based Query; Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Feature-based Query; MCAT/HDF; Grids; XML DTD; SDLIP; XTM DTD; Rules - KQL]
Information Management Projects : Information Management Projects Digital Libraries
California Digital Library - Art Museum Image Consortium
DARPA/USPTO Patent digital library - SAIC, NCSA, U Virginia
NLM Visible Embryo digital library - George Mason University
NSF Digital Library Initiative, Phase II - UCSB, Stanford
NSF NPACI Digital Sky - Caltech 2MASS sky survey
Data Grid Environments
DOE Data Visualization Corridor - LLNL
DOE Particle Physics Data Grid - Stanford, Caltech
NASA Information Power Grid - NASA Ames
NIH Biomedical Informatics Research Network - Duke, Harvard
NSF Grid Physics Network - U Florida
NSF National Virtual Observatory - Johns Hopkins University, Caltech
NSF Southern California Earthquake Center - USC/ISI
Persistent Archives
NARA Persistent Archive - UCB, U Maryland
NHPRC Archivist workbench - U Minnesota
NSF NSDL Persistent archive for curricula modules - Cornell University
Additional Projects : Additional Projects ACS Alliance for Cell Signaling
NSF Digital Government - Fed Web
DOE - SciDAC Grid Portal
DOE - SciDAC Scientific Data Management Project
Hayden Planetarium
NASA Information Power Grid : NASA Information Power Grid Develop digital library interface to the archives at NASA Ames - SRB
Demonstrate high performance data access across both NASA and NSF resources
Demonstrate telescience through NASA resources (link an electron microscope at UCSD with an image collection and data processing at NASA Ames)
Hayden Planetarium : Hayden Planetarium Provide a data collaboration environment
Share data between NCSA (simulations of the solar system evolution), SDSC (3D visualizations), and Hayden (review)
Manage 3-6 TB of data
Provide seamless access across administration domains and storage resources
Slide 28: The SRB is great. It has been utterly essential in this project - we
could not have done this work without the SRB. We have used the SRB
as a central shared repository for raw data, derived data, and
visualization results. Data has been submitted by, and retrieved by
each of the partners in this project on a daily basis. Email back
and forth frequently includes SRB directory names into which data or
results have been stored for broad review at different sites. The
visualization animations we've produced are immediately placed into
the SRB where they are downloaded, simultaneously, by people at NCSA
and the museum in New York. The animations are reviewed, new data
generated, email exchanged, and new images rendered and put back into
the SRB to start the next review cycle. The whole thing has worked
flawlessly. I am delighted and will gladly promote its virtues at
any opportunity.
As you've seen, I do have some suggestions for future functionality
here and there. The "migration awkwardness" is one. But these
suggestions are for added features or minor interface smoothing.
They should in no way diminish the fact that it all works wonderfully!
Thanks!
-- Dave
Visible Embryo Project : Visible Embryo Project Build a digital library of images, reports for use by educators and physicians
Manage transfer of images from Armed Forces Institute of Pathology to an archive at SDSC
Provide access to the material
Use the SRB/MCAT system to assemble a digital library
Slide 30: [Figure: Visible Embryo Project network diagram. Diagram labels: SDSC, Los Angeles, Oakland, OHSU, UIC, Startap, Eolas, GMU, AFIP, JHU, DC POP, Vegas, WRL, HSCC; links OC-3 Abilene, VBNS OC-12, DS3, OC-3 GST, BEN, ATD Net; ASX200, NIC, Collab WS, MSWS, NT WS; Image Generation, Archive, Disk Cache at several sites]
National Virtual Observatory : National Virtual Observatory Federate existing sky survey image collections and catalogs
Support statistical analyses across multiple surveys
Implement services support environment
Use the SRB/MCAT to support bulk data access, replicate the major sky surveys, support large scale database record analyses
Slide 32: [Figure: National Virtual Observatory Data Grid, numbered layers 1 to 7. Diagram labels: 1. Portals and Workbenches; 2. Knowledge & Resource Management; 4. Grid (Security, Caching, Replication, Backup, Scheduling); Bulk Data Analysis, Catalog Analysis, Metadata View, Data View; Concept space; Standard Metadata format, Data model, Wire format; Information Discovery, Metadata delivery, Data Discovery, Data Delivery; Catalog Mediator, Data mediator; Catalog/Image Specific Access; Standard APIs and Protocols; Compute Resources, Catalogs, Data Archives, Derived Collections]
Particle Physics Data Grid : Particle Physics Data Grid Support replication of data sets for the high energy physics grid
Federate data collections for the BaBar experiment at SLAC using the SRB
Federate BaBar collections between SLAC and Lyon, France
Support web service interface, support derived data product catalog
Particle Physics Data Grid - Replication System : Particle Physics Data Grid - Replication System
National STEM Education Digital Library - NSDL : National STEM Education Digital Library - NSDL Provide persistent archive for educational material indexed in the NSDL repository
Develop knowledge spaces to characterize collection holdings
Map knowledge spaces to AAAS2061 concept space, and to state mandated grade level concepts
Slide 36: [Figure: NSDL architecture. Diagram labels: Usage Enhancement; Collection Building; User Interfaces; Metadata & data access-based services; Core NSDL Bus (Meta-data delivery, Data delivery, Query, Global Ids, Security, Network); Virtual Collections & Mediators; Information about collections; Delivery, Presentation, Aggregation - Channels]
National Archives and Records Administration : National Archives and Records Administration Develop prototype persistent archive for NARA digital holdings
Identify pertinent research areas for long term preservation of data, information, and knowledge
Apply data grid technology for the implementation of persistent archives.
Grid Physics Network : Grid Physics Network Develop infrastructure to support virtual data products
Create repository for derived data products
Automate the extraction of metadata from Virtual Data Language files
Automate extraction of administrative data through grid portals
GridPort + SRB Architecture : GridPort + SRB Architecture With SRB capabilities, file access is direct and uniform
Uses same authentication as portal and other Grid services
Single SRB account access allows for more flexible data management
DOE Scientific Data Management : DOE Scientific Data Management Develop knowledge management tools to mediate between biological information resources
Integrate the tools into the DOE scientific data management environment
Further Information : Further Information http://www.npaci.edu/DICE