Session 25 Data Management

Presentation Transcript

Slide1: 

Session 25 Monday 17th July Malcolm Atkinson

Slide2: 


Introduction to Structured Data in Grids: 

Introduction to Structured Data in Grids Reminders: Distributed Systems & Data scale Significance of Structure Strategies for Data Integration Metadata Challenges A view of OGSA-DAI

Slide4: 


Foundations of Collaboration: 

Foundations of Collaboration Strong commitment by individuals To work together To take on communication challenges Mutual respect & mutual trust Distributed technology To support information interchange To support resource sharing To support data integration To support trust building Sufficient time Common goals Complementary knowledge, skills & data Can we predict when it will work? Can we find remedies when it doesn’t?

A strategy that works well: 

A strategy that works well Collaboratively constructed Shared access Data Resources Sequence databases Protein structure and Crystallography databases Sky Surveys Census data Zoo DB Mouse Atlas … Works better when linked to Funding & Publication. But funding the maintenance?

Works better with an organising nucleus: 

Works better with an organising nucleus EBI BIRN GEON SEEK / Species 2000 IVOA CaBIG … Helping to Organise Giving user support Establishing standards Sharing methods

Principles of Distributed Computing: 

Principles of Distributed Computing Issues you can’t avoid Lack of Complete Knowledge (LOCK) Latency Heterogeneity Autonomy Unreliability Change A Challenging goal balance technical feasibility against virtual homogeneity, stability and reliability Appropriate balance between usability and productivity while remaining affordable, manageable and maintainable This is NOT easy

Compound Causes of Data Growth: 

Compound Causes of Data Growth Faster devices Cheaper devices Higher-resolution all ~ Moore’s law Increased processor throughput → more derived data Cheaper & higher-volume storage Remote data more accessible Public policy to make research data available Bandwidth increases Latency doesn’t get less though

Slide10: 


Interpretational Opportunities & Challenges: 

Interpretational Opportunities & Challenges Finding & Accessing data Variety of mechanisms & policies Interpreting data Variety of forms, value systems & ontologies Independent provision & ownership Autonomous changes in availability, form, policy, … Processing data Understanding how it may be related Devising models that expose the relationships Presenting results Humans need either Derived small volumes of statistics Visualisations

Data Access and Integration: motives: 

Data Access and Integration: motives Key to Integration of Scientific Methods Publication and sharing of results Primary data from observation, simulation & experiment Encourages novel uses Allows validation of methods and derivatives Enables discovery by combining data independently collected and Decisions!

Data Access and Integration: motives: 

Data Access and Integration: motives Key to Large-scale Collaboration Economies: data production, publication & management Sharing cost of storage, management and curation Many researchers contributing increments of data Pooling annotation → rapid incremental publication And criticism Accommodates global distribution Data & code travel faster and more cheaply Accommodates temporal distribution Researchers assemble data Later (other) researchers access data

Data Access and Integration: challenges: 

Data Access and Integration: challenges Scale Many sites, large collections, many uses Longevity Research requirements outlive technical decisions Diversity No 'one size fits all' solutions will work Primary Data, Data Products, Meta Data, Administrative data, … Many Data Resources Independently owned & managed No common goals No common design Work hard for agreements on foundation types and ontologies Autonomous decisions change data, structure, policy, … Geographically distributed Petabyte of Digital Data / Hospital / Year

Data Integration: Scientific discovery: 

Data Integration: Scientific discovery Choosing data sources How do you find them? How do they describe and advertise them? Is the equivalent of Google possible? Obtaining access to that data Overcoming administrative barriers Overcoming technical barriers Understanding that data The parts you care about for your research Extracting nuggets from multiple sources Pieces of your jigsaw puzzle Combining them using sophisticated models The picture of reality in your head Analysis on scales required by statistics Coupling data access with computation Repeated Processes Examining variations, covering a set of candidates Monitoring the emerging details Coupling with scientific workflows You’re an innovator Your model ≠ their model → Negotiation & patience needed from both sides

Scientific Data: Opportunities & Challenges: 

Scientific Data: Opportunities & Challenges Opportunities Global Production of Published Data Volume Diversity Combination → Analysis → Discovery Challenges Data Huggers Meagre metadata Ease of Use Optimised integration Dependability Opportunities Specialised Indexing New Data Organisation New Algorithms Varied Replication Shared Annotation Intensive Data & Computation Challenges Fundamental Principles Approximate Matching Multi-scale optimisation Autonomous Change Legacy structures Scale and Longevity Privacy and Mobility Sustained Support / Funding

Requirements: User’s viewpoint: 

Requirements: User’s viewpoint Find Data Registries & Human communication Understand data Metadata description, Standard / familiar formats & representations, Standard value systems & ontologies Data Access Find how to interact with data resource Obtain permission (authority) Make connection Make selection Move Data In bulk or streamed (in increments)

Requirements: User’s viewpoint 2: 

Requirements: User’s viewpoint 2 Transform Data To format, organisation & representation required for computation or integration Combine data Standard DB operations + operations relevant to the application model Present results To humans: data movement + transform for viewing To application code: data movement + transform to the required format To standard analysis tools, e.g. R To standard visualisation tools, e.g. Spotfire

Requirements: Owner’s viewpoint: 

Requirements: Owner’s viewpoint Create Data Automated generation, Accession Policies, Metadata generation Storage Resources: SRM, SRB, … Preserve Data Archiving Replication Metadata Protection Provide Services with available resources Definition & implementation: costs & stability Resources: storage, compute & bandwidth

Requirements: Owner’s viewpoint 2: 

Requirements: Owner’s viewpoint 2 Protect Services Authentication, Authorisation, Accounting, Audit Reputation Protect data Comply with owner requirements – encryption for privacy, … Monitor and Control use Detect and handle failures, attacks, misbehaving users Plan for future loads and services Establish case for Continuation Usage statistics Discoveries enabled

Slide23: 


Why structure data: 

Why structure data It always is structured Without structure it is just a bag of bits Is the next 32 bits An integer Two integers Part of a double 4 characters 2 characters in Unicode … Is this a 1D, 2D or 3D array? How big is it? Where is the UUID? Of course the Author of the Application Knows this
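
The "bag of bits" point can be illustrated with a minimal Python sketch. The byte values and the little-endian layouts below are assumptions chosen for illustration; the slide names no concrete data. The same four bytes decode to quite different values depending on which structure the reader assumes.

```python
import struct

# Four arbitrary bytes (an assumed example): the same 32 bits, read several ways.
raw = bytes([0x48, 0x69, 0x21, 0x00])

def interpretations(b: bytes) -> dict:
    """Decode the same 4 bytes under several assumed little-endian layouts."""
    return {
        "one 32-bit integer": struct.unpack("<i", b)[0],
        "two 16-bit integers": struct.unpack("<hh", b),
        "four 1-byte characters": b.decode("latin-1"),
        "two UTF-16 characters": b.decode("utf-16-le"),
    }

# Print each reading in ASCII-safe form.
for layout, value in interpretations(raw).items():
    print(f"{layout}: {value!a}")
```

Nothing in the bytes themselves says which reading is right, and the "part of a double" case cannot even be decoded without the neighbouring four bytes.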

More interesting questions: 

More interesting questions How do you discover the structure? If the application developer isn’t available They are virtually never available There were lots of them who made changes Perhaps a community has defined the structure Then communicated it among themselves How do you find that community

More interesting questions 2: 

More interesting questions 2 Perhaps structure description written with the data Binary data at start of file(s) Binary data in another file How do you know the relationship between the files Binary data among the other data How do you find it How do you find these binary structure descriptions? How do you interpret them?

Structure Described textually: 

Structure Described textually Binary data is efficient TRY Separate textual description E.g. MIME types Bespoke structural description language Product specific Computing language specific Application community specific Attempt a standard data structure description language E.g. GGF DFDL Still have to discover which description applies to which data A binding problem Still have to understand the names & interpretation E.g. a field described 'Distance IEEE64bitFloat' Which distance? What units? When measured?
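
A separate textual description driving the decoding of binary data might look like the following sketch. The one-pair-per-line description format, the type names and the `TYPE_CODES` mapping are all hypothetical; this is not DFDL or any other standard language, only an illustration of the idea.

```python
import struct

# Hypothetical mini description language: one "name type" pair per line.
DESCRIPTION = """\
Distance IEEE64bitFloat
Count Int32
"""

# Assumed mapping from the type names above to struct format codes.
TYPE_CODES = {"IEEE64bitFloat": "d", "Int32": "i"}

def parse_description(text: str) -> list:
    """Turn the textual description into (field name, struct code) pairs."""
    fields = []
    for line in text.splitlines():
        if line.strip():
            name, type_name = line.split()
            fields.append((name, TYPE_CODES[type_name]))
    return fields

def decode(record: bytes, fields: list) -> dict:
    """Decode one little-endian, unpadded binary record using the description."""
    fmt = "<" + "".join(code for _, code in fields)
    values = struct.unpack(fmt, record)
    return {name: value for (name, _), value in zip(fields, values)}

fields = parse_description(DESCRIPTION)
record = struct.pack("<di", 12.5, 3)   # stand-in for data read from a file
decoded = decode(record, fields)
print(decoded)
```

The sketch recovers the layout but not the meaning: the description still does not say which distance, in what units, or when it was measured.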

Textual data is easy to use: 

Textual data is easy to use Humans can read & write it Though there is a limit as to how much! Humans can edit it Though they make errors & break structure It allows structural flexibility & extension The structure may be implicit E.g. a standard natural language text A popular format maintained by user discipline A format maintained by tools E.g. mail message headers That then make the structure explicit & maintained

Structured textual data: 

Structured textual data Semistructured data May use layout and tags to code structure E.g. field-name text newline E.g. column names, newline, comma-separated values, newline, … E.g. XML tag pairs Structure may be more or less consistent This may be improved with a schema AND schema checking E.g. XML schema, e.g. XSD Another binding problem – which schema controls which document? May be some implicit rules E.g. XML tag pairing Structure may be partially inferred E.g. recognise integers With textual exceptions, e.g. 'not yet known'
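
Partial inference of structure from comma-separated text could be sketched as below. The sample rows and column names are invented for illustration; the only rule applied is "recognise integers, keep everything else as text", so the exception 'not yet known' survives untouched.

```python
import csv
import io

# Invented sample: column names, newline, comma-separated values, newline, ...
TEXT = "name,count\nalpha,12\nbeta,not yet known\ngamma,7\n"

def infer(value: str):
    """Return an int when the text parses as one, otherwise keep the text."""
    try:
        return int(value)
    except ValueError:
        return value

def load(text: str) -> list:
    """Read the rows and apply the inference rule to every field."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: infer(value) for key, value in row.items()} for row in reader]

rows = load(TEXT)
for row in rows:
    print(row)
```

Any consumer of `rows` must still cope with a column that mixes integers and text, which is the kind of inconsistency a schema with schema checking would expose.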

Databases provide some structure: 

Databases provide some structure Manage data Manage description of structure Schema (logical and physical metadata) Constraints Authorisation rules Manage storage Often efficient layout & binary / compressed Manage Privacy E.g. guarantee encryption Provide operations Queries, updates, bulk loads, rule checks, stored procedures Interpretation challenges remain
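
As a small illustration of a database managing both the data and its structural description, here is a sketch using Python's built-in `sqlite3` module. The table and column names are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema records structure and a constraint alongside the data.
conn.execute("""
    CREATE TABLE observation (
        id INTEGER PRIMARY KEY,
        distance_km REAL NOT NULL CHECK (distance_km >= 0)
    )
""")
conn.execute("INSERT INTO observation (distance_km) VALUES (?)", (12.5,))

def try_insert(value: float) -> bool:
    """Attempt an insert; report whether the constraints accepted it."""
    try:
        conn.execute("INSERT INTO observation (distance_km) VALUES (?)", (value,))
        return True
    except sqlite3.IntegrityError:
        return False

rejected = not try_insert(-1.0)   # violates the CHECK constraint

# The structural description is itself queryable data.
schema = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'observation'"
).fetchone()[0]
count = conn.execute("SELECT COUNT(*) FROM observation").fetchone()[0]
print(rejected, count)
```

The constraint catches a negative distance, but interpretation questions remain: the schema only hints at units through a column name, and it says nothing about which distance was measured.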

Exploit structure: 

Exploit structure Go directly to parts of data Extract relevant parts Transform during this process Generate descriptions of data structure Store bindings between Structure description and data Transfer smaller volumes of data Compress exploiting structure Aids to interpretation Require a structural foundation
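
"Go directly to parts of data" can be sketched with fixed-size binary records. The record layout (a 32-bit id followed by a 64-bit float) is an assumption for illustration; because every record has a known size, one record can be fetched by offset without reading the rest.

```python
import io
import struct

# Assumed layout: little-endian int32 id followed by a float64 value.
RECORD = struct.Struct("<id")

def write_records(stream, records) -> None:
    """Append (id, value) pairs to the stream as fixed-size records."""
    for rec_id, value in records:
        stream.write(RECORD.pack(rec_id, value))

def read_record(stream, index: int) -> tuple:
    """Seek straight to record `index` and decode only that record."""
    stream.seek(index * RECORD.size)
    return RECORD.unpack(stream.read(RECORD.size))

store = io.BytesIO()                       # stands in for a large file
write_records(store, [(i, i * 1.5) for i in range(1000)])
print(read_record(store, 742))
```

Only 12 of the 12,000 stored bytes are read and transferred for this request, which is the sense in which structure allows smaller volumes of data to be moved.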

Slide32: 


Basic Strategies for Users: 

Basic Strategies for Users Use a Service provided by a Data Owner Use a self-administered workflow Use a scripted workflow Use data virtualisation services

Basic Strategies for Users: 

Basic Strategies for Users Use a Service provided by a Data Owner Easiest as pre-packaged Web-based form interfaces E.g. for BLAST jobs at EBI Now may be provided as Web Services Accessed by client portal E.g. Initiating BLAST runs in BRIDGES project No multi-source data integration Unless provided by Data Owner Opportunity for discovery restricted to that data Use a self-administered workflow Use a scripted workflow Use data virtualisation services

Basic Strategies for Users: 

Basic Strategies for Users Use a Service provided by a Data Owner Use a self-administered workflow Use a sequence of Services Plus own data Organise each step Collect and manage intermediate results Organise integration processes manually Common strategy Very laborious Error prone Tedious repetition Hard to provide to other researchers Use a scripted workflow Use data virtualisation services

Basic Strategies for Users: 

Basic Strategies for Users Use a Service provided by a Data Owner Use a self-administered workflow Use a scripted workflow Describe the steps in a Scripting Language Steps performed by Workflow Enactment Engine Many languages in use Trade off: familiarity & availability Trade off: detailed control versus abstraction Incrementally develop correct process Sharable & Editable Basis for scientific communication & validation Valuable IPR asset Repetition is now easy Parameterised explicitly & implicitly Use data virtualisation services
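
The scripted-workflow strategy can be illustrated with a toy enactment loop. This is a sketch under assumptions: the step functions, the `enact` helper and the parameter names are hypothetical and do not correspond to any particular workflow language or engine.

```python
from typing import Callable, Dict, List

Step = Callable[[Dict], Dict]

def select(state: Dict) -> Dict:
    """Keep only the values at or above the threshold parameter."""
    chosen = [x for x in state["data"] if x >= state["threshold"]]
    return {**state, "selected": chosen}

def summarise(state: Dict) -> Dict:
    """Derive a small statistic from the selected values."""
    chosen = state["selected"]
    mean = sum(chosen) / len(chosen) if chosen else None
    return {**state, "mean": mean}

# The workflow is plain data: an ordered, sharable, editable list of steps.
WORKFLOW: List[Step] = [select, summarise]

def enact(workflow: List[Step], **params) -> Dict:
    """Toy enactment engine: run each step in order and log what ran."""
    state: Dict = dict(params)
    log = []
    for step in workflow:
        state = step(state)
        log.append(step.__name__)
    state["log"] = log
    return state

first = enact(WORKFLOW, data=[1, 5, 9], threshold=4)
second = enact(WORKFLOW, data=[1, 5, 9], threshold=6)   # repetition is one call
print(first["mean"], second["mean"])
```

Because the steps and parameters are explicit, a rerun with a different threshold is a single call rather than a repeat of manual work, and the recorded log gives a starting point for validation.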

Workflow Systems: 

Workflow Systems

Example Grid3 Application: NVO Mosaic Construction: 

Example Grid3 Application: NVO Mosaic Construction NVO/NASA Montage: A small (1200 node) workflow Construct custom mosaics on demand from multiple data sources User specifies projection, coordinates, size, rotation, spatial sampling Work by Ewa Deelman et al., USC/ISI and Caltech

Basic Strategies for Users: 

Basic Strategies for Users Use a Service provided by a Data Owner Use a self-administered workflow Use a scripted workflow Use data virtualisation services Form a federation Set of data resources – incremental addition Registration & description of collected resources Warehouse data or access dynamically to obtain updated data Virtual data warehouses – automating division between collection and dynamic access Describe relevant relationships between data sources Incremental description + refinement / correction Run jobs, queries & workflows against combined set of data resources Automated distribution & transformation Example systems IBM’s Information Integrator GEON, BIRN & SEEK OGSA-DAI is an extensible framework for building such systems

Basic Strategies for Users: 

Basic Strategies for Users Use a Service provided by a Data Owner Use a self-administered workflow Use a scripted workflow Use data virtualisation services Arrange that multiple data services have common properties Arrange federations of these Arrange access presenting the common properties Expose the important differences Support integration accommodating those differences

Virtualisation variations: 

Virtualisation variations Extent to which homogeneity obtained Regular representation choices – e.g. units Consistent ontologies Consistent data model Consistent schema – integrated super-schema DB operations supported across federation Ease of adding federation elements Ease of accommodating change as federation members change their schema and policies Drill through to primary forms supported
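
A very small federation that presents a regular representation (one unit, one set of column names) over two independently designed sources could be sketched as follows. The two schemas, the site names and the `federated_query` helper are assumptions for illustration only; this is not the OGSA-DAI API or that of any of the example systems.

```python
import sqlite3

def make_source(ddl: str, insert: str, rows: list) -> sqlite3.Connection:
    """Create an independent in-memory data resource."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert, rows)
    return conn

# Two autonomous sources: different table names, column names and units.
site_a = make_source("CREATE TABLE obs (name TEXT, dist_km REAL)",
                     "INSERT INTO obs VALUES (?, ?)",
                     [("a1", 1.0), ("a2", 3.0)])
site_b = make_source("CREATE TABLE measurements (label TEXT, metres REAL)",
                     "INSERT INTO measurements VALUES (?, ?)",
                     [("b1", 2500.0)])

# Registration: each member supplies a mapping onto the common (name, km) view.
FEDERATION = [
    ("site_a", site_a, "SELECT name, dist_km FROM obs"),
    ("site_b", site_b, "SELECT label, metres / 1000.0 FROM measurements"),
]

def federated_query(min_km: float) -> list:
    """Run one selection across all members, in the common representation."""
    results = []
    for source, conn, sql in FEDERATION:
        for name, km in conn.execute(sql):
            if km >= min_km:
                results.append((source, name, km))
    return sorted(results, key=lambda row: row[2])

print(federated_query(2.0))
```

Adding a member means adding one entry to `FEDERATION`, and a schema change at one site is absorbed by editing only that site's mapping. Keeping the source name in each result leaves room to drill through to the primary form.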

Slide42: 


Metadata Definition: 

Metadata Definition Metadata is data that describes other data Any property of the other data Structure Physical organisation Usage and storage policies Destruction policies Privacy and legal constraints Provenance Aids to interpretation Known uses and users … One person’s metadata can be another person’s data

Challenges for metadata: 

Challenges for metadata All the challenges of Data E.g. authorisation, privacy, dependable storage, … Managing changes, quality, … The binding between Data & Metadata What metadata describes this data? What data does this metadata describe? Specific data All the data about a particular topic All the data that will be produced in a particular way Good abstractions for using data & metadata together Good mechanisms for generating metadata Automation & incentives

Metadata modes of use: creation: 

Metadata modes of use: creation Generate Metadata Then generate and store data that complies Generate Metadata & Data At the same time 'Atomic' operation Have already a collection of data And some metadata, e.g. structural Mine or generate further information about the data Store that as additional metadata Note constructing bindings in each case Must maintain stable and accurate bindings
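
One way to make a binding between data and metadata checkable is to generate both together and record a content digest in the metadata. The sketch below assumes this digest-based approach, which the slides do not prescribe; the `publish` and `binding_holds` names and the metadata fields are hypothetical.

```python
import hashlib

def publish(data: bytes, description: dict) -> dict:
    """Generate data and metadata together, binding them by content digest."""
    metadata = {
        **description,
        "sha256": hashlib.sha256(data).hexdigest(),
        "length": len(data),
    }
    return {"data": data, "metadata": metadata}

def binding_holds(data: bytes, metadata: dict) -> bool:
    """Check that this metadata still describes exactly this data."""
    return hashlib.sha256(data).hexdigest() == metadata["sha256"]

package = publish(b"1,2,3\n", {"format": "csv", "columns": 3})
print(binding_holds(package["data"], package["metadata"]))   # binding intact
print(binding_holds(b"1,2,4\n", package["metadata"]))        # data has changed
```

A digest detects that a binding has gone stale but does not repair it, so any update to the data still has to regenerate the metadata as part of the same operation.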

Modes of using metadata: 

Modes of using metadata Query or search metadata Use this to find specific parts Browse metadata (after query) To understand data To consider exploitation strategies Create indexes Use these to accelerate algorithms This should be done more often! Applications & tools read metadata Use it to drive selections, mappings, presentations E.g. use it to generate detailed workflows from abstract workflows E.g. construct wrappers and data transformers

Slide47: 


Simple Intermediary Pattern: 

Simple Intermediary Pattern

Persistent Intermediary Pattern: 

Persistent Intermediary Pattern

Redirector Pattern: 

Redirector Pattern

Redirector: OGSA-DAI as the consumer: 

Redirector: OGSA-DAI as the consumer

Coordinator Pattern: 

Coordinator Pattern

Data Assembly Pattern: 

Data Assembly Pattern

Slide54: 


Integrated service for Data & Metadata: 

Integrated service for Data & Metadata

Slide56: 

Picture composition by Luke Humphry based on prior art