DIAL: Distributed Interactive Analysis of Large Datasets: 

David Adams – BNL
September 16, 2005
DOSAR meeting


Contents:

DIAL project
Implementation
Components
Dataset
Transformation
Job
Scheduler
Catalogs
User interfaces
Status
Interactivity
Current results
Conclusions
More information
Contributors

DIAL project: 

DIAL = Distributed Interactive Analysis of Large datasets
Goal is to demonstrate the feasibility of doing interactive analysis of large data samples
  Analysis means production of analysis objects (e.g. histograms or ntuples) from HEP event data
  Interactive means that a user request is processed in a few minutes
  Large is whatever is available and useful for physics studies
    How large can we go?
Approach is to distribute processing over many nodes using the inherent parallelism of HEP data
  Assume each event can be independently processed
  Each node runs an independent job to process a subset of the events
  Results from each job are merged into the overall result
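
The split, process and merge approach on this slide can be sketched in a few lines of Python. This is an illustration only, not DIAL code: the toy event records, the `pt` field, the function names and the use of a local thread pool in place of real batch or grid jobs are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(events, n_subsets):
    """Divide the events into at most n_subsets disjoint, non-empty subsets."""
    subsets = [events[i::n_subsets] for i in range(n_subsets)]
    return [subset for subset in subsets if subset]


def process_subset(events):
    """Hypothetical per-job analysis: histogram 'pt' in bins of width 10."""
    histogram = Counter()
    for event in events:
        histogram[int(event["pt"] // 10) * 10] += 1
    return histogram


def merge(partial_results):
    """Merge per-job histograms by adding bin contents."""
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    return total


def analyze(events, n_jobs=4):
    """Run one independent job per subset, then merge the partial results."""
    subsets = split(events, n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        partial_results = list(pool.map(process_subset, subsets))
    return merge(partial_results)


# Toy events; real HEP events would be read from files.
events = [{"id": i, "pt": float(i % 50)} for i in range(1000)]
result = analyze(events)
```

Because each event is processed independently, the merged histogram is the same as the one a single serial job would produce over all events.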


Implementation:

DIAL software provides a generic end-to-end framework for distributed analysis
  Generic means few assumptions about the data to be processed or the application that carries out the processing
  Experiment and user must provide extensions to the system
Able to support a wide range of processing systems
  Local processing using batch systems such as LSF, Condor and PBS
  Grid processing using different flavors of CE or WMS
Provides friendly user interfaces
  Integration with ROOT
  Python binding (from GANGA)
  GUI for job submission and monitoring (from GANGA)
Mostly written in C++
Releases packaged 3-4 times/year


Components:

AJDL
  Abstract job definition language
  Dataset and transformation (application + task)
  Job
  C++ interface and XML representation for each
Scheduler
  Handles job processing
Catalogs
  Hold job definition objects and their metadata
Web services
User interfaces
Monitoring and accounting


Dataset:

Generic data description includes
  Description of the content (type of data)
    E.g. reconstructed objects, electrons, jets with cone size 0.7, …
  Number of events and their IDs
  Data location
    Typically a list of logical file names
  List of constituent datasets
    Model is inherently hierarchical
DIAL provides the following classes
  Dataset – defines the common interface
  GenericDataset – provides data and XML representation
  TextDataset – embedded collection of text files
  SingleFileDataset – data location is a single file
  EventMergeDataset – collection of single-file event datasets

Dataset (cont): 

Current approach for experiment-specific event datasets
  Add a GenericDataset subclass that is able to read experiment files and fill the appropriate data
  Use EventMergeDataset to describe large collections
  DIAL can carry out processing using the GenericDataset interface and implementation
An experiment may instead provide its own implementation of the Dataset interface
  If GenericDataset is not sufficient
  Processing components then require this implementation
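
The hierarchical dataset model described on the two Dataset slides can be pictured with a small Python sketch. It is not the DIAL class hierarchy: `SketchDataset`, its fields and the `lfn:` file names are hypothetical, chosen only to show how a merged dataset can report events and files through its constituents.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchDataset:
    """Hypothetical stand-in for a generic dataset description."""

    content: str                                    # type of data, e.g. "AOD"
    event_ids: List[int] = field(default_factory=list)
    logical_files: List[str] = field(default_factory=list)
    constituents: List["SketchDataset"] = field(default_factory=list)

    def all_event_ids(self) -> List[int]:
        """Event IDs held here plus those of all constituent datasets."""
        ids = list(self.event_ids)
        for child in self.constituents:
            ids.extend(child.all_event_ids())
        return ids

    def all_logical_files(self) -> List[str]:
        """Logical file names held here plus those of all constituents."""
        files = list(self.logical_files)
        for child in self.constituents:
            files.extend(child.all_logical_files())
        return files

    def event_count(self) -> int:
        return len(self.all_event_ids())


def merge_single_file_datasets(content, parts):
    """Describe a large collection as a parent of single-file datasets."""
    return SketchDataset(content=content, constituents=list(parts))


# Two single-file datasets combined into one merged dataset.
file_a = SketchDataset("AOD", event_ids=[1, 2, 3], logical_files=["lfn:a.root"])
file_b = SketchDataset("AOD", event_ids=[4, 5], logical_files=["lfn:b.root"])
merged = merge_single_file_datasets("AOD", [file_a, file_b])
```

The parent holds no events of its own here, so splitting it for parallel processing amounts to handing out its constituents.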


Transformation:

A transformation acts on a dataset to produce another dataset
  Fundamental user interaction with the system
Transformation has two components
  Application describes the action to take
  Task provides data to configure the application
Task is a collection of named text files
Application holds two scripts
  run – carries out the transformation
    Input is dataset.xml and output is result.xml
    Has access to the transformation
  build_task – creates transformation data from the files in the task
    E.g. compiles to build a library
    Build is platform specific
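
A rough Python sketch of the application and task split follows. Only the names `run`, `build_task`, `dataset.xml` and `result.xml` come from the slide; the classes, the `selection.txt` task file, the result format and the use of Python callables in place of scripts are assumptions for illustration.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict


@dataclass
class SketchTask:
    files: Dict[str, str]          # named text files, e.g. user code or cuts


@dataclass
class SketchApplication:
    build_task: Callable[[SketchTask, Path], None]   # prepares transformation data
    run: Callable[[Path], None]                      # dataset.xml -> result.xml


def build_task(task, workdir):
    # Stand-in for a platform-specific build: just write the task files out.
    for name, text in task.files.items():
        (workdir / name).write_text(text)


def run(workdir):
    # Stand-in for the run script: read dataset.xml, write result.xml.
    dataset_xml = (workdir / "dataset.xml").read_text()
    selection = (workdir / "selection.txt").read_text().strip()
    (workdir / "result.xml").write_text(
        f"<result selection='{selection}' input_bytes='{len(dataset_xml)}'/>"
    )


def transform(app, task, dataset_xml):
    """Apply application + task to a dataset description; return the result."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        app.build_task(task, workdir)
        (workdir / "dataset.xml").write_text(dataset_xml)
        app.run(workdir)
        return (workdir / "result.xml").read_text()


app = SketchApplication(build_task=build_task, run=run)
task = SketchTask(files={"selection.txt": "pt > 20\n"})
result_xml = transform(app, task, "<dataset/>")
```

Keeping the build step separate from the run step means the possibly slow, platform-specific build can happen once per task rather than once per job.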


Job:

The Job class holds the following data:
  Job definition: application, task, dataset and job preferences
  Status of the corresponding job
    Initialized, running, done, failed or killed
  Other history information
    Compute node, native ID, start and stop times, return code, …
Job provides an interface to
  Start or submit the job
  Update the status and history information
  Kill the job
  Fetch the result (output dataset)

Job (cont): 

Subclasses
  Provide connection to the underlying processing system
  Implement these methods
  Currently support fork, LSF and Condor
ScriptedJob
  Subclass that implements these methods by calling a script
  Scripts may then be written to support different processing systems
    Alternative to providing a subclass
  Added in release 1.20
Job copies
  Jobs may be copied locally or remotely
  Copy includes only the base class:
    All common data are available but the job cannot be started, updated or killed
    Use scheduler for these interactions
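
The job life cycle (initialized, running, done, failed or killed) and the idea of a job that delegates to an external script can be sketched as below. This is not the DIAL `ScriptedJob`: the class name, the method signatures and the use of Python's `subprocess` module as the "processing system" are assumptions.

```python
import subprocess
import sys
from enum import Enum


class Status(Enum):
    INITIALIZED = "initialized"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"
    KILLED = "killed"


class SketchScriptedJob:
    """Hypothetical job that delegates processing to an external command."""

    def __init__(self, command):
        self.command = command
        self.status = Status.INITIALIZED
        self.return_code = None
        self._process = None

    def start(self):
        self._process = subprocess.Popen(self.command)
        self.status = Status.RUNNING

    def update(self, timeout=None):
        """Refresh the status; with a timeout, wait up to that many seconds."""
        if self.status is not Status.RUNNING:
            return self.status
        if timeout is None:
            code = self._process.poll()
        else:
            try:
                code = self._process.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                code = None
        if code is not None:
            self.return_code = code
            self.status = Status.DONE if code == 0 else Status.FAILED
        return self.status

    def kill(self):
        if self.status is Status.RUNNING:
            self._process.kill()
            self._process.wait()
            self.return_code = self._process.returncode
            self.status = Status.KILLED


# A trivial "script": a Python child process that exits immediately.
job = SketchScriptedJob([sys.executable, "-c", "pass"])
job.start()
job.update(timeout=60)
```

Supporting another processing system then means swapping the command, not writing a new subclass.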


Scheduler:

Scheduler carries out processing
The following interface is defined
  add_task builds the task
    Input: application, task
    Output: job ID
  submit creates and starts a job
    Input: application, task, dataset and job preferences
    Output: job ID
  job(JobId) returns a reference to the job
    Or to a copy if the scheduler is remote
  kill(JobId) kills a job
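
The four-method interface can be written down as an abstract base class with one toy implementation. The method names follow the slide; the Python signatures, the dictionary job records and the `InProcessScheduler` class are assumptions, not the DIAL C++ interface.

```python
from abc import ABC, abstractmethod
from itertools import count


class SketchScheduler(ABC):
    """Hypothetical Python rendering of the scheduler interface."""

    @abstractmethod
    def add_task(self, application, task):
        """Build the task; return a job ID."""

    @abstractmethod
    def submit(self, application, task, dataset, preferences=None):
        """Create and start a job; return a job ID."""

    @abstractmethod
    def job(self, job_id):
        """Return the job (here: a copy of its record)."""

    @abstractmethod
    def kill(self, job_id):
        """Kill the job."""


class InProcessScheduler(SketchScheduler):
    """Toy scheduler that runs each job immediately in the calling process."""

    def __init__(self):
        self._ids = count(1)
        self._jobs = {}

    def add_task(self, application, task):
        job_id = next(self._ids)
        self._jobs[job_id] = {"status": "done", "result": None}  # nothing to build
        return job_id

    def submit(self, application, task, dataset, preferences=None):
        job_id = next(self._ids)
        record = {"status": "running", "result": None}
        self._jobs[job_id] = record
        try:
            record["result"] = application(task, dataset)
            record["status"] = "done"
        except Exception:
            record["status"] = "failed"
        return job_id

    def job(self, job_id):
        return dict(self._jobs[job_id])

    def kill(self, job_id):
        # Jobs finish inside submit here, so this only matters for async variants.
        record = self._jobs[job_id]
        if record["status"] == "running":
            record["status"] = "killed"


def count_above_threshold(task, dataset):
    return sum(1 for value in dataset if value > task["threshold"])


scheduler = InProcessScheduler()
job_id = scheduler.submit(count_above_threshold, {"threshold": 20}, [5, 25, 40])
result = scheduler.job(job_id)["result"]
```

Client code written against the base class does not need to know whether the scheduler behind it forks locally, submits to batch or talks to a remote service.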

Scheduler (cont): 

Subclasses implement this interface
LocalScheduler
  Uses a single job type and queue to handle processing, e.g. LSF
MasterScheduler
  Splits input dataset
  Uses an instance of LocalScheduler to process subjobs
  Merges results from subjobs
Analysis service
  Web service wrapper for Scheduler
  This service may run any scheduler with any job type
  WsClientScheduler subclass acts as a client to the service
Same interface for interacting with local fork, local batch or remote analysis service
Typical mode of operation is to use a remote analysis service

Scheduler (cont): 

[Figure: typical structure of a job running on a MasterScheduler]

Scheduler (cont): 

Scheduler hierarchy
  Plan to add a subclass to forward requests to a selected scheduler based on job characteristics
  This will make it possible to create a tree (or web) of analysis services
  Should allow us to scale to arbitrarily large data samples (limited only by available computing resources)
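
One way to picture the planned forwarding subclass is a scheduler that holds a list of rules and hands each request to the first child that accepts it. Everything in this sketch is hypothetical: the class names, the rule functions, the 1000-event cut and the `site` preference are invented for illustration and do not describe the planned DIAL design.

```python
class LeafScheduler:
    """Stand-in for an analysis service that actually runs jobs."""

    def __init__(self, name):
        self.name = name
        self.n_submitted = 0

    def submit(self, application, task, dataset, preferences=None):
        self.n_submitted += 1
        return f"{self.name}:{self.n_submitted}"


class ForwardingScheduler:
    """Forwards each request to the first child whose rule accepts the job."""

    def __init__(self, routes, fallback):
        self.routes = routes        # list of (rule, scheduler) pairs
        self.fallback = fallback    # used when no rule matches

    def submit(self, application, task, dataset, preferences=None):
        preferences = preferences or {}
        for rule, scheduler in self.routes:
            if rule(dataset, preferences):
                return scheduler.submit(application, task, dataset, preferences)
        return self.fallback.submit(application, task, dataset, preferences)


def is_small(dataset, preferences):
    """Assumed job characteristic: dataset size in events."""
    return len(dataset) <= 1000


def wants_site(site):
    """Build a rule that matches a (hypothetical) 'site' job preference."""
    def rule(dataset, preferences):
        return preferences.get("site") == site
    return rule


interactive = LeafScheduler("interactive")
batch_a = LeafScheduler("site-a-batch")
batch_b = LeafScheduler("site-b-batch")

# Forwarding schedulers can be nested, which gives the tree of services.
batch_tree = ForwardingScheduler([(wants_site("b"), batch_b)], fallback=batch_a)
root = ForwardingScheduler([(is_small, interactive)], fallback=batch_tree)

small_id = root.submit("app", "task", list(range(100)))
large_id = root.submit("app", "task", list(range(5000)), {"site": "b"})
```

Since a forwarding scheduler exposes the same `submit` call as its children, clients see a single service no matter how deep the tree is.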

Scheduler (cont): 

[Figure: example of a possible scheduler tree for ATLAS]


Catalogs:

DIAL provides support for different types of catalogs
  Repositories to hold instances of application, task, dataset and job
  Selection catalogs associate names and metadata with instances of these objects
  And others
MySQL implementations
Populated with ATLAS transformations and Rome AOD datasets
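
The distinction between a repository (holds the objects) and a selection catalog (maps names and metadata to them) can be sketched with two tables. The slide mentions MySQL; this sketch swaps in Python's built-in `sqlite3` with an in-memory database so that it runs anywhere, and the table layout, column names and helper functions are assumptions rather than the DIAL schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE repository (
        id   INTEGER PRIMARY KEY,
        kind TEXT NOT NULL,      -- application, task, dataset or job
        xml  TEXT NOT NULL       -- XML representation of the object
    );
    CREATE TABLE selection (
        name     TEXT PRIMARY KEY,
        id       INTEGER NOT NULL REFERENCES repository(id),
        metadata TEXT
    );
    """
)


def store(kind, xml):
    """Put an object into the repository and return its ID."""
    cursor = conn.execute(
        "INSERT INTO repository (kind, xml) VALUES (?, ?)", (kind, xml)
    )
    return cursor.lastrowid


def register(name, object_id, metadata=""):
    """Associate a name and metadata with a stored object."""
    conn.execute(
        "INSERT INTO selection (name, id, metadata) VALUES (?, ?, ?)",
        (name, object_id, metadata),
    )


def lookup(name):
    """Return (kind, xml, metadata) for a name, or None if it is unknown."""
    return conn.execute(
        "SELECT r.kind, r.xml, s.metadata FROM selection AS s "
        "JOIN repository AS r ON r.id = s.id WHERE s.name = ?",
        (name,),
    ).fetchone()


dataset_id = store("dataset", "<dataset/>")
register("example.aod", dataset_id, "content=AOD")
found = lookup("example.aod")
```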

User interfaces: 

ROOT
  Almost all DIAL classes are available at the ROOT command line
  User can access catalogs, create job definitions, and submit and monitor jobs
Python
  Interface available in PyDIAL
  Most DIAL classes are available as wrapped Python classes
  Work done by GANGA using LCG tools
GUI
  There is a GUI that supports job specification, submission and monitoring
  Provided by GANGA using PyDIAL


Status:

Releases
  Most of the above is available in the current release (1.20)
  Release 1.30 expected next month
  Release with service hierarchy in January
Use in ATLAS
  Ambition is to use DIAL to define a common interface for distributed analysis with connections to different systems
    See figure
  ATLAS distributed analysis is under review
  Also want to continue with the original goal of providing a system with interactive response for appropriate jobs
  Integration with the new USATLAS PANDA project

Possible analysis service model for ATLAS: 

[Figure: possible analysis service model for ATLAS]

Status (cont): 

Use in other experiments
  Some interest but no serious use that I know of
  Current releases depend on ATLAS software but I am willing to build a version without that dependency
    Not difficult
  A generic system like this might be of interest to smaller experiments that would like to have the capabilities but do not have the resources to build a dedicated system


Interactivity:

What does it mean for a system to be interactive?
  Literal meaning suggests user is able to directly communicate with jobs and subjobs
  Most users will be satisfied if their jobs complete quickly and do not need to interact with subjobs or care if some other agent is doing so
    Important exception is the capability to kill a job and all its subjobs
Interactivity (responsiveness) imposes requirements on the processing system (batch, WMS, …)
  High job rates (> 1 Hz)
  Low submit-to-result job latencies (< 1 minute)
  High data input rate from SE to farm (> 10 MB/job)

Current results: 

ATLAS generated data samples for the Rome physics meeting in June
  Data includes AOD (analysis oriented data), a summary of the reconstructed event data
  AOD size is 100 kB/event
  All Rome AOD datasets are available in DIAL
An interactive analysis service is deployed at BNL using a special LSF queue for job submission
The following plot shows processing times for datasets of various sizes
  LSF SUSY (green triangles) is a “real AOD analysis” producing histograms
  LSF big produces ntuples and extra time is for ntuple merging (not parallelized)
  Largest jobs take 3 hours to run serially and are processed in about 10 minutes with the DIAL interactive service
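
The last point implies a speedup that is easy to work out from the two quoted times. The short calculation below uses only those two numbers; treating "about 10 minutes" as exactly 10, and reading the speedup as a lower bound on the number of jobs running at once, are assumptions (the bound holds only if no single job runs faster than its share of the serial work).

```python
import math


def speedup(serial_minutes, parallel_minutes):
    """Ratio of serial run time to run time with the interactive service."""
    return serial_minutes / parallel_minutes


serial_minutes = 3 * 60        # "3 hours to run serially"
interactive_minutes = 10       # "about 10 minutes" with the service
gain = speedup(serial_minutes, interactive_minutes)

# Without superlinear effects, at least this many jobs ran concurrently.
min_parallel_jobs = math.ceil(gain)
```

The quoted times therefore correspond to a speedup of roughly 18, part of which is limited by the ntuple merging step that is not parallelized.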


Conclusions:

DIAL analysis service running at BNL
  Serves clients from anywhere
  Processing done locally
  Rome AOD datasets available
  Robust and responsive since June
Next steps
  Similar service running at UTA (using PBS)
  Service to select between these
  BNL service to connect to OSG CE (in place of local LSF)
  Integration with new ATLAS data management system
  Integration with PANDA

More information: 

DIAL home page:
ADA (ATLAS Distributed Analysis) home page:
Current DIAL release:


Contributors:

GANGA
  Karl Harrison, Alvin Tan
DIAL
  David Adams, Wensheng Deng, Tadashi Maeno, Vinay Sambamurthy, Nagesh Chetan, Chitra Kannan
ARDA
  Dietrich Liko
ATPROD
  Frederic Brochu, Alessandro De Salvo
AMI
  Solveig Albrand, Jerome Fulachier
ADA
  (plus those above), Farida Fassi, Christian Haeberli, Hong Ma, Grigori Rybkine, Hyunwoo Kim
