DataCentral: Data to Discovery



Presentation Transcript


Roman Olschanowsky, Data Applications and Services

Why SDSC Data Central? : 

Why SDSC Data Central? Today's scientists and engineers are increasingly dependent on valued community data collections and comprehensive data CI:
- Need for large data: many users have large data needs that extend above and beyond what their home environments provide
- Increasingly dependent on valued data collections used community-wide
- Need a comprehensive data environment that incorporates access to the full spectrum of data-enabling resources

Why SDSC Data Central? : 

Why SDSC Data Central? SDSC has experienced increasing demand from the domain communities for collaborations on data-driven discovery, including:
- hosting, managing, and publishing data in digital libraries
- sharing data through the Web and data grids
- creating, optimizing, and porting large-scale databases
- data-intensive computing with high-bandwidth data movement
- analyzing, visualizing, rendering, and mining large-scale data
- preservation of data in persistent archives
- building collections, portals, ontologies, etc.
- providing resources, services, and expertise

Data-Driven Discovery: 

Data-Driven Discovery
- Astronomy: NVO – 93 TB
- Geosciences: SCEC – 270 TB
- Life Sciences: JCSG/SLAC – 15.7 TB
- Arts and Humanities: Japanese Art Images – 600 GB
- Engineering: TeraBridge – 4 TB
- Ocean Science: SCOOP – 7 TB

Community Data Collections and Databases: 

Community Data Collections and Databases
- Researchers need a trusted repository for managing, publishing, and sharing their data with project and community members
- Many are increasingly dependent on valued community data
- In response to the large number of requests, SDSC has formed DataCentral

What is Data Central?: 

What is DataCentral? A comprehensive data environment that incorporates access to the full spectrum of data-enabling resources
- Started as the first program of its kind to support research and community data collections and databases
- Umbrella for SDSC production data efforts, enabling "Data to Discovery"


Data to Discovery

What does SDSC Data Central offer? : 

What does SDSC DataCentral offer? SDSC has been actively collaborating with many researchers and national-scale projects on their integrated data efforts. We offer expertise and resources for:
- Public data collections and database hosting
- Long-term storage and preservation (tape and disk)
- Remote data management and access (SRB, portals)
- Data analysis, visualization, and data mining
- Professional, qualified 24/7 support

Data Resources Available through DataCentral: 

Data Resources Available through DataCentral Expertise in:
- High-performance large-data management, hosting, and publishing
- Data migration, upload, and sharing through the grid
- Database application tuning, porting, and optimization
- SQL query tuning and schema design
- Data analysis, visualization, and data mining
- Portal creation and collection publication
- Preservation of data in persistent archives, etc.
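As a minimal illustration of the kind of SQL query tuning listed above (a generic sketch using SQLite, not DataCentral's actual DB2/Oracle/MySQL tooling; the schema and values are invented), adding an index turns a full-table scan into a direct index lookup:

```python
import sqlite3

# Toy catalog: without an index, a lookup by object name scans every row;
# with one, the engine can seek directly to the match.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (name TEXT, ra REAL, dec REAL)")
conn.executemany("INSERT INTO objects VALUES (?, ?, ?)",
                 [(f"obj{i}", i * 0.01, -i * 0.01) for i in range(10000)])

query = "SELECT ra, dec FROM objects WHERE name = 'obj4242'"
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]

conn.execute("CREATE INDEX idx_objects_name ON objects(name)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]

print(before)  # plan mentions a SCAN of the table
print(after)   # plan mentions USING INDEX idx_objects_name
```

On multi-terabyte collections the same principle applies, which is why schema design and query tuning appear as a dedicated service.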

DataCentral Resources: 

DataCentral Resources
DataCentral infrastructure (HW and SW resources):
- SAM-QFS, HPSS, disk, DB servers
- Web farm
- Accounting system
- Data management tools and data analysis SW
- Appropriate space, power, cooling, UPS systems
Human resources:
- System administrators and collection specialists supporting users and applications
- 24/7 operators
- Staff expertise
(Slide graphics: 6 PB HPSS silo, SDSC servers, web-based portal, storage resources, security, networking, UPS systems, software tools, web services, 24/7 operations; photo of the SDSC machine room)

Community Self-Selection: SDSC Data Central: 

Community Self-Selection: SDSC DataCentral First program of its kind to support research and community data collections and databases
Comprehensive resources:
- Disk: 400 TB, accessible via HPC systems, Web, SRB, GridFTP
- Databases: DB2, Oracle, MySQL
- SRB: collection management
- Tape: 25 PB, accessible via file system, HPSS, Web, SRB, GridFTP
- Data collection and database hosting
- Batch-oriented access
- Collection management services
Collaboration opportunities:
- Long-term preservation
- Data technologies and tools
Examples of allocated data collections:
- Bee Behavior (Behavioral Science)
- C5 Landscape DB (Art)
- Molecular Recognition Database (Pharmaceutical Sciences)
- LIDAR (Geoscience)
- LUSciD (Astronomy)
- NEXRAD-IOWA (Earth Science)
- AMANDA (Physics)
- SIO_Explorer (Oceanography)
- Tsunami and Landsat Data (Earthquake Engineering)
- UC Merced Library Japanese Art Collection (Art)
- TeraBridge (Structural Engineering)

Services, Tools, and Technologies for Data Management and Synthesis: 

Services, Tools, and Technologies for Data Management and Synthesis
Data systems: SAM-QFS, HPSS, GPFS, SRB
Data services:
- Data migration/upload, usage, and support (SRB)
- Database selection and schema design (Oracle, DB2, MySQL)
- Database application tuning and optimization
- Portal creation and collection publication
- Data analysis (e.g. Matlab, SAS) and mining (e.g. WEKA)
DataCentral data-oriented toolkits and tools:
- Biology Workbench
- Montage (astronomy mosaicking)
- Kepler (workflow management)
- Vista volume renderer (visualization), etc.
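As a rough analogue of the clustering that mining packages like WEKA perform on hosted collections, here is a minimal k-means in pure Python (an illustrative toy, not any of the tools named above; the data points are invented):

```python
def kmeans(points, k, iters=20):
    """Minimal k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned cluster."""
    centroids = points[:: max(1, len(points) // k)][:k]  # simple spread-out init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        centroids = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

# Two well-separated 2-D blobs; the centroids converge to the blob means.
pts = ([(x * 0.1, x * 0.1) for x in range(10)]
       + [(10 + x * 0.1, 10 + x * 0.1) for x in range(10)])
print(sorted(kmeans(pts, 2)))  # approximately [(0.45, 0.45), (10.45, 10.45)]
```

Production mining runs the same kind of iterative pass over terabyte-scale collections, which is why it is bundled with the storage and database services rather than left to users' desktops.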

Public Data Collections Hosted in SDSC’s DataCentral: 

Public Data Collections Hosted in SDSC’s DataCentral

Cyberinfrastructure and Data – Integration infrastructure: 

Cyberinfrastructure and Data – Integration Infrastructure
- Storage hardware
- Networked storage (SAN)
- High-speed networking
- Sensornets
- Instruments
- Coordination, interoperability
(Diagram label: Integration)

Cyberinfrastructure and Data – Data Integration: 

Cyberinfrastructure and Data – Data Integration Life Sciences Geosciences

Cyberinfrastructure-enabled Research Examples: 

Cyberinfrastructure-enabled Research Examples

Tracking the Heavens: 

Tracking the Heavens (Images: Hubble Telescope, Palomar Telescope, Sloan Telescope)

The Virtual Observatory: 

The Virtual Observatory Premise: most observatory data is (or could be) online. So, the Internet is the world's best telescope:
- It has data on every part of the sky
- In every measured spectral band: optical, x-ray, radio, ...
- It's as deep as the best instruments
- It is up when you are up
- The "seeing" is always great
- It's a smart telescope: it links objects and data to the literature on them
Software has become a major expense: share, standardize, reuse.
Slide modified from Alex Szalay, NVO

Downloading the Night Sky: 

Downloading the Night Sky (Images: Hubble Telescope, Palomar Telescope, Sloan Telescope)
The National Virtual Observatory
- The astronomy community came together to set standards for services and data
- Interoperable, multi-terabyte online databases
- Technology-enabled, science-driven
- NVO combines over 100 TB of data from 50 ground- and space-based telescopes and instruments to create a comprehensive picture of the heavens: Sloan Digital Sky Survey, Hubble Space Telescope, Two Micron All Sky Survey, National Radio Astronomy Observatory, etc.

Using Technology to Evolve Astronomy: 

Using Technology to Evolve Astronomy (slide modified from Alex Szalay, NVO)
- Looking for needles in haystacks – the Higgs particle
- Haystacks – dark matter, dark energy
- Statistical analysis often deals with: creating uniform samples; data filtering; assembling relevant subsets; censoring bad data; "likelihood" calculations; hypothesis testing, etc.
- Traditionally these are performed on files, but most of these tasks are much better done inside a database

How NVO Works: 

How NVO Works
- Raw data comes from large-scale telescopes
- Telescopes provide a daily sweep of the sky; scientists "clean" the data, which is then converted from temporal to spatial organization, allowing indexing over both dimensions
- All NVO data on the website is available to the public without restriction (by community agreement, all data becomes public after 1 year)
- NVO databases are distributed and mirrored at multiple sites
(Images: Crab Nebula; Palomar Telescope)
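The temporal-to-spatial conversion above can be sketched as binning time-ordered observations into sky cells, so that a positional query touches only one cell instead of scanning the whole time series. This flat grid is a simplified stand-in for the hierarchical sky indexes surveys actually use (e.g. the Hierarchical Triangular Mesh), and the observation values are invented:

```python
from collections import defaultdict

def sky_cell(ra, dec, cells_per_deg=1):
    """Map a sky position (degrees) to a coarse rectangular cell ID.
    Real sky indexes use hierarchical cells; this grid shows the idea."""
    return (int(ra * cells_per_deg), int((dec + 90) * cells_per_deg))

# Observations arrive in time order: (timestamp, ra, dec, magnitude)
observations = [
    (1001, 210.80, 54.35, 12.1),
    (1002, 210.81, 54.34, 12.0),
    (1003, 83.63, 22.01, 8.4),   # near the Crab Nebula's position
]

# Spatial index: cell -> observations in that patch of sky.
index = defaultdict(list)
for obs in observations:
    index[sky_cell(obs[1], obs[2])].append(obs)

hits = index[sky_cell(83.63, 22.01)]
print(len(hits))  # 1: only the observation falling in that sky cell
```

Keeping both the timestamp and the cell makes queries over either dimension cheap, which is the point of the conversion described on the slide.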

Making Discoveries Using the NVO: 

Making Discoveries Using the NVO
- Scientists at Johns Hopkins, Caltech, and other institutions confirmed the discovery of a new brown dwarf
- Search time over 5,000,000 files went from months to minutes using NVO database tools and technologies
- Brown dwarfs are often called the "missing link" in the study of star formation; they are considered small, cool "failed stars"

Cyberinfrastructure and NVO: 

Cyberinfrastructure and NVO
- Sky surveys from major telescopes are indexed and catalogued in NVO databases by time and spatial location using the Storage Resource Broker and other tools
- NVO collections are archived at multiple sites and accessed via Grid technologies
- Software tools and web portals create an environment for the ingestion of new information, mining, discovery, and dissemination

Moving the Earth : 

Moving the Earth The Earth is constantly evolving through the movement of "plates." Plate tectonics posits that the Earth's outer shell (lithosphere) consists of seven large and many smaller moving plates. As the plates move, their boundaries collide, spread apart, or slide past one another, resulting in geological processes such as earthquakes, volcanoes, and mountain building, typically at plate boundaries.

Major Earthquakes on the San Andreas Fault, 1680-present: 

Major Earthquakes on the San Andreas Fault, 1680–present (1680 M 7.7, 1857 M 7.8, 1906 M 7.8)
How dangerous is the San Andreas Fault?
- Geoscience researchers can now use massive amounts of geological, historical, and environmental data to simulate natural disasters such as earthquakes; the focus is on understanding big earthquakes and their impact
- Simulations combine large-scale data collections, high-resolution models, and supercomputer runs
- Simulation results provide new scientific information enabling better estimation of seismic risk; emergency preparation, response, and planning; and design of the next generation of earthquake-resistant structures
- Results provide immense societal benefits, helping to save many lives and billions in economic losses
(Slide graphic: earthquake simulations)


TeraShake simulates a magnitude 7.7 earthquake along the southern San Andreas Fault close to LA, using seismic, geophysical, and other data from the Southern California Earthquake Center


How TeraShake Works How TeraShake simulates earthquakes:
- Divide Southern California into "blocks"
- For each block, gather all the data on ground surface composition, geological structures, fault information, etc.
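The block decomposition described above can be sketched as splitting the simulation region into a regular grid of cells, each carrying its own material data. This is a toy stand-in for TeraShake's actual mesh; the coordinates, resolution, and dictionary layout are invented for illustration:

```python
def make_blocks(lon_min, lon_max, lat_min, lat_max, step):
    """Split a rectangular region into uniform blocks, each of which
    would hold local data (ground composition, faults, ...)."""
    blocks = []
    lat = lat_min
    while lat < lat_max:
        lon = lon_min
        while lon < lon_max:
            blocks.append({
                "bounds": (lon, lat,
                           min(lon + step, lon_max), min(lat + step, lat_max)),
                "properties": {},  # per-block geological data goes here
            })
            lon += step
        lat += step
    return blocks

# A 4-degree by 3-degree region at 1-degree resolution -> 12 blocks.
blocks = make_blocks(-120.0, -116.0, 32.0, 35.0, 1.0)
print(len(blocks))  # 12
```

A real simulation uses far finer cells (hence the terabytes per run), but the principle is the same: each block is a unit of both data gathering and computation.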

SCEC Data Requirements: 

SCEC Data Requirements
- Resources must support a complicated orchestration of computation and data movement (parallel file system, data parking)
- The next-generation simulation will require even more resources: researchers plan to double the temporal/spatial resolution of TeraShake
"I have desired to see a large earthquake simulation for over a decade. This dream has been accomplished." – Bernard Minster, Scripps Institution of Oceanography

DataCentral-enabled Examples: 

DataCentral-enabled Examples


SURA Coastal Ocean Observing and Prediction (SCOOP) Simulations predicting the approach of Katrina to New Orleans. The wind fields of Katrina are shown as white/grey ribbons, clearly showing the hurricane vortex. The yellow-to-red coloring beneath the eye of the hurricane shows the storm surge moving across the gulf, pushed by the hurricane's wind. (Visualization: W. Benger and S. Venkataraman, CCT/LSU.) "With SDSC's Storage Resource Broker software we can access these data sets in DataCentral through the Grid from anywhere in the world." This will help SCOOP researchers make their data available to the wider coastal modeling community. The data for the 2005 hurricane season are particularly valuable, and the SCOOP collection in DataCentral covers Katrina, Rita, and Wilma, three of the strongest Category 5 hurricanes on record.


SIOExplorer: Web Exploration of Seagoing Archives "Bridging the Gap between Libraries and Data Archives" – 3,000 cruises online at SIO
- Data: 50 years of digital data, growing 200 GB per year
- Images: 99 years of SIO Archives
- Documents: reports, publications, books
Partners: UCSD Libraries, Scripps Institution of Oceanography, San Diego Supercomputer Center


AMANDA (Antarctic Muon And Neutrino Detector Array)
(Diagram labels: AMANDA-II, 1500 m, 2000 m, MAPO, airport, Amundsen-Scott South Pole Station)
The dream of constructing a radically different telescope has been realized by the innovative AMANDA-II project. Instead of sensing light, like all telescopes since the time of Galileo, AMANDA responds to a fundamental particle called the neutrino. Neutrino messengers provide a startlingly new view of the Universe. The 20 TB/yr produced requires manipulation, processing, filtering, and Monte Carlo data analyses in the search for high-energy neutrinos. A full data analysis requires a total space of 40 TB/yr.


City of Hope's Informatics Core Lab The future of genomics-enabled medicine depends on the creation of tools that allow scientists to explore the relationships between specific genetic characteristics and specific disease outcomes. Each of these tools has, at its core, the use of high-end computation and high-end data storage and integration. Landscape of tools:
- LIMS: collecting and organizing the lab data
- Microarray analysis: quantifying cellular response to stimuli
- Virtual screening of lead compounds
Together, SDSC and City of Hope push forward the limits of genomics-enabled medicine.


WMS Global Mosaic High-resolution global mosaic of the Earth
- Greyscale GeoTIFFs with geolocation tags for GIS integration
- Produced from 8,200 individual Landsat 7 scenes, each over 500 MB
- These data sets provide global imagery, elevation data, and formats for NASA World Wind and GEON's GeoFusion browser
