John Hicks Grid3 SC2003 and beyond


Presentation Transcript

Grid3 SC2003 and beyond: 

Grid3 SC2003 and beyond
John Hicks, TransPAC HPCC Engineer, Indiana University
PRAGMA 6, Beijing, 17-May-2004

Introduction to Grid3: 

Introduction to Grid3
Grid3 is a coordinated project between US LHC experiments (US ATLAS, US CMS), grid projects (iVDGL, GriPhyN, PPDG), and computing projects (LIGO, SDSS, BTeV)
The purpose of Grid3 is to build a multi-experiment, multi-VO grid environment
Test the infrastructure and services for production and analysis of scientific experiments
Provide a platform for technology demonstrators
Grid3 is supported by the National Science Foundation and the Department of Energy

The Grid3 Project: 

The Grid3 Project
Grid3 at SC03 had:
  28 sites
  Peak processor count of ~2800 CPUs
  6 virtual organizations (VOs): SDSS, ATLAS, iVDGL, USCMS, LIGO (now LIGO Scientific Collaboration, LSC), BTeV
  11 applications
Resources dynamically roll in and out
Applications are dynamically installed
Grid3 provides a base for a persistent grid

Science Applications: 

Science Applications
Each VO provides and maintains its applications
Applications do not require privileged access to be installed or to operate
Reserved areas for applications, data stage-in/out, and temporary files are available
Installation location information is published in the Metacomputing Directory Service (MDS)
Multiple versions of an application may exist
HEP, CS demonstrators, Astrophysics, Biology applications

Grid3 Experiments: USATLAS: 

Grid3 Experiments: USATLAS The US ATLAS group consists of 31 universities and 3 national laboratories. It is participating in the building and operation of the ATLAS (A Toroidal LHC Apparatus) experiment to be installed in one of the interaction regions at the Large Hadron Collider (LHC) at CERN, Geneva, Switzerland. http://www.usatlas.bnl.gov

Grid3 Experiments: USCMS: 

Grid3 Experiments: USCMS USCMS is a collaboration of US scientists participating in the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) at CERN in Geneva, Switzerland. http://www.uscms.org

Grid3 Experiments: LIGO: 

Grid3 Experiments: LIGO The Laser Interferometer Gravitational-Wave Observatory (LIGO) is a facility dedicated to the detection of cosmic gravitational waves and the harnessing of these waves for scientific research. It consists of two widely separated installations within the United States, one in Hanford, Washington (left) and the other in Livingston, Louisiana (right), operated in unison as a single observatory. http://www.ligo.caltech.edu

Grid3 Experiments: SDSS: 

Grid3 Experiments: SDSS The Sloan Digital Sky Survey (SDSS) is a collaboration of scientists and engineers to map one-quarter of the entire sky, determining the positions and absolute brightnesses of more than 100 million celestial objects. http://www.sdss.org

Grid3 Experiments: BTeV: 

Grid3 Experiments: BTeV The BTeV experiment is designed to challenge the Standard Model explanation of CP violation, mixing and rare decays of beauty and charm quark states. http://www-btev.fnal.gov

Grid3 Software: 

Grid3 Software
Pacman
  Packaging and installation software
  Main deployment tool for Grid3
  All software is "pacmanized"
VDT
  The Virtual Data Toolkit (VDT) is a set of grid software that can be easily installed and configured.
  The goal of the VDT is to make it as easy as possible for users to install grid software.
  It includes fundamental grid software, Virtual Data software, and utilities.

Slide11: 

Job submission and data transfer
Globus Toolkit
  The Globus Toolkit is an open source toolkit used for building grids. The toolkit components can be used independently or together to develop applications.
  These components help support and manage elements such as security, fault detection, information infrastructure, portability, resource management, data management, and communication.
  MDS: a directory service used to publish configuration information.
  RLS: the Replica Location Service (RLS) maintains and provides access to mapping information from logical names for data items to target names. These target names may represent physical locations of data items, or an entry in the RLS may map to another level of logical naming for the data item.
Condor
  Condor is an open source workload management system for compute-intensive jobs. It provides a job queuing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management.
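
The logical-to-target mapping that RLS maintains can be pictured with a small in-memory sketch. This is not the Globus RLS API: the `ReplicaCatalog` class, its methods, and the example names and URLs are all hypothetical, chosen only to show how a logical name may pass through another level of logical naming before reaching physical locations.

```python
from typing import Dict, List, Set


class ReplicaCatalog:
    """Hypothetical in-memory stand-in for a replica catalog (not the RLS API)."""

    def __init__(self) -> None:
        # logical name -> target names (physical locations or other logical names)
        self._mappings: Dict[str, List[str]] = {}

    def register(self, logical_name: str, target_name: str) -> None:
        """Record that logical_name maps to target_name (duplicates are ignored)."""
        targets = self._mappings.setdefault(logical_name, [])
        if target_name not in targets:
            targets.append(target_name)

    def resolve(self, logical_name: str) -> List[str]:
        """Return the end targets reachable from a registered logical name."""
        if logical_name not in self._mappings:
            return []
        return self._expand(logical_name, set())

    def _expand(self, name: str, seen: Set[str]) -> List[str]:
        if name in seen:  # already visited: avoids loops and duplicate results
            return []
        seen.add(name)
        if name not in self._mappings:
            return [name]  # no further mapping: treat as a physical location
        resolved: List[str] = []
        for target in self._mappings[name]:
            resolved.extend(self._expand(target, seen))
        return resolved


# Made-up names and URLs, for illustration only.
catalog = ReplicaCatalog()
catalog.register("lfn:higgs.sample.001", "lfn:higgs.sample.001.v2")
catalog.register("lfn:higgs.sample.001.v2", "gsiftp://site-a.example.org/data/higgs_001.root")
catalog.register("lfn:higgs.sample.001.v2", "gsiftp://site-b.example.org/data/higgs_001.root")

locations = catalog.resolve("lfn:higgs.sample.001")
print(locations)
```

Here the first logical name maps to a second logical name, which in turn maps to two physical replicas, so the lookup returns both replica URLs.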

Slide12: 

Job submission and data transfer (cont.)
VDS
  The Virtual Data System (VDS: Chimera, Pegasus, Sphinx, DAGMan) is open-source software which provides a method for storing the representation of the computational procedures used to generate data, those procedures themselves, and the datasets produced by them.
  This allows the auditing and lineage of derived data to be recorded, and allows the automatic on-demand re-derivation of that data.
  This is important in large collaborations, where it may be more difficult to determine how particular data was generated.
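
The idea of recording how data was derived, and re-deriving it on demand, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the VDS or Chimera interface: `ProvenanceCatalog`, `Derivation`, and the dataset and transformation names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Derivation:
    """Records which transformation and inputs produce a dataset."""
    output: str
    transformation: str
    inputs: Tuple[str, ...]


class ProvenanceCatalog:
    """Hypothetical catalog of raw data, procedures, and derivation records."""

    def __init__(self) -> None:
        self._raw: Dict[str, Any] = {}
        self._transformations: Dict[str, Callable[..., Any]] = {}
        self._derivations: Dict[str, Derivation] = {}

    def add_raw(self, name: str, value: Any) -> None:
        self._raw[name] = value

    def add_transformation(self, name: str, func: Callable[..., Any]) -> None:
        self._transformations[name] = func

    def add_derivation(self, output: str, transformation: str, inputs: Iterable[str]) -> None:
        self._derivations[output] = Derivation(output, transformation, tuple(inputs))

    def lineage(self, name: str) -> List[str]:
        """List the datasets that name depends on, ancestors first."""
        ordered: List[str] = []
        self._collect(name, ordered)
        return ordered

    def _collect(self, name: str, ordered: List[str]) -> None:
        if name in ordered:
            return
        derivation = self._derivations.get(name)
        if derivation is not None:
            for parent in derivation.inputs:
                self._collect(parent, ordered)
        ordered.append(name)

    def derive(self, name: str) -> Any:
        """Recompute a dataset on demand from its recorded derivation."""
        if name in self._raw:
            return self._raw[name]
        derivation = self._derivations[name]
        func = self._transformations[derivation.transformation]
        args = [self.derive(parent) for parent in derivation.inputs]
        return func(*args)


catalog = ProvenanceCatalog()
catalog.add_raw("events", [1, 2, 3, 4])
catalog.add_transformation("select_even", lambda xs: [x for x in xs if x % 2 == 0])
catalog.add_transformation("total", sum)
catalog.add_derivation("selected", "select_even", ["events"])
catalog.add_derivation("summary", "total", ["selected"])

summary = catalog.derive("summary")
history = catalog.lineage("summary")
print(summary, history)
```

Because only the recipe is stored, `summary` never has to be kept on disk: it can be rebuilt from `events` whenever it is requested, and `lineage` answers the question of how it was produced.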

Slide13: 

User Management
Virtual Organization Membership Service
  VOMS (Virtual Organization Membership Service) is open-source software which provides information on a user's membership within a virtual organization (VO).
  A virtual organization is an abstract entity grouping users, institutions, and resources into the same administrative domain.
  A user's membership in a VO indicates that he/she may have permission to utilize resources at individual institutions.
Grid User Management System
  Develop a model for distributed user registration
  Work with existing VO management tools, including the EDG VOMS servers used in Grid2003
  Help define requirements for new and improved VO tools
  Focus on site tools for user management
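
The relationship between users, VOs, and sites can be sketched as two lookup tables. This is a minimal illustration, not the VOMS interface: the user names, site names, table contents, and both helper functions are hypothetical, and a real deployment would query a VOMS server rather than a hard-coded dictionary.

```python
from typing import Dict, List, Set

# Hypothetical membership data: VO name -> users registered in that VO.
VO_MEMBERS: Dict[str, Set[str]] = {
    "usatlas": {"alice"},
    "uscms": {"bob"},
    "ivdgl": {"alice", "carol"},
}

# Hypothetical site policy: site name -> VOs the site accepts jobs from.
SITE_SUPPORTED_VOS: Dict[str, Set[str]] = {
    "site-a": {"usatlas", "ivdgl"},
    "site-b": {"uscms"},
}


def vos_for_user(user: str) -> List[str]:
    """Return the VOs a user belongs to, in alphabetical order."""
    return sorted(vo for vo, members in VO_MEMBERS.items() if user in members)


def may_use_site(user: str, site: str) -> bool:
    """A user may use a site if at least one of the user's VOs is supported there."""
    supported = SITE_SUPPORTED_VOS.get(site, set())
    return any(vo in supported for vo in vos_for_user(user))


print(vos_for_user("alice"))
print(may_use_site("alice", "site-a"), may_use_site("alice", "site-b"))
```

The site never needs a per-user list: it only decides which VOs it supports, and membership is answered by the VO.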

Slide14: 

Information Services
MDS based
Schemas/information needed:
  MDS core, GLUE (Grid Laboratory Uniform Environment)
  Grid3: site-specific information on Grid3 ($GRID3, $APP, $DATA, $TMP, $TMP_WIN)
  VO-specific information on Grid3 (run-time environments needed to run VO-specific applications)
  VO and application specific
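
A job or installer that relies on the site-specific variables named on this slide might first check which of them are actually set. The sketch below is an assumption-laden illustration: only the variable names come from the slide, while the function name and the example paths are hypothetical.

```python
from typing import Dict, List, Mapping, Tuple

# Variable names as listed on the slide.
GRID3_VARIABLES = ("GRID3", "APP", "DATA", "TMP", "TMP_WIN")


def read_site_locations(env: Mapping[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Split the Grid3 variables into those that are set and those that are missing."""
    found: Dict[str, str] = {}
    missing: List[str] = []
    for name in GRID3_VARIABLES:
        value = env.get(name, "")
        if value:
            found[name] = value
        else:
            missing.append(name)
    return found, missing


# Hypothetical values for illustration only.
example_env = {
    "GRID3": "/opt/grid3",
    "APP": "/opt/grid3/app",
    "DATA": "/scratch/data",
    "TMP": "/scratch/tmp",
}

found, missing = read_site_locations(example_env)
print("set:", sorted(found))
print("missing:", missing)
```

On a real worker node the same function could be called with `os.environ` instead of the example mapping.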

iVDGL Grid Operations Center (iGOC): 

iVDGL Grid Operations Center (iGOC)
The iGOC is currently located at Indiana University
The iGOC provides 24x7x365 operational support backed by Service Level Agreements (SLAs)
Support includes:
  Problem alert, tracking, and trouble ticket support
  Support for systems which host the Globus Index Information Service (GIIS), VOMS Database Service, Replica Location Service (RLS), and monitoring tools
Grid3 monitoring is coordinated through the iVDGL operations group and the iGOC

Monitoring/Interactive Analysis services: 

Monitoring/Interactive Analysis services
Ganglia: open source tool to collect cluster monitoring information such as CPU and network load, memory and disk usage
MonALISA: monitoring tool to support resource discovery, access to information, and a gateway to other information gathering systems
ACDC (Advanced Computational Data Center) Job Monitoring System: application using grid-submitted jobs to query the job managers and collect information about jobs. This information is stored in a DB and is available for aggregated queries and browsing.
Metrics Data Viewer (MDViewer): analyzes and plots information collected by the different monitoring tools, such as the DBs at iGOC.
Distributed Interactive Analysis of Large datasets (DIAL): provides a connection between interactive analysis tools (like JAS, ROOT) and data processing applications (like ATHENA).

Monitoring services: 

Monitoring services
[Diagram of the monitoring data flow. Labels: Information providers, Ganglia, MDS GRIS, Job sched agents, MonALISA, ML repository, ACDC Job DB, VO GIIS, GIIS, Server DB, Information consumers, MDViewer, Web outputs, Report.]

Ganglia snapshots: 

Ganglia snapshots

MonALISA framework: 

MonALISA framework

Interactive analysis services: 

Interactive analysis services
Metrics Data Viewer (MDViewer) analyzes and plots information collected by the different monitoring tools, such as the DBs at iGOC
Distributed Interactive Analysis of Large datasets (DIAL) provides a connection between interactive analysis tools (like JAS, ROOT) and data processing applications (like ATHENA)
Differentiate the possible information sources for MDViewer (other DBs, log files, …) and provide different GUIs (e.g. servlet)
Make DIAL grid-enabled and add a dataset catalog to it

Grid3 status tool: 

Grid3 status tool
Choose the sites from the catalog
Site list, available resources
Availability test
Site-specific information
Current map based on the Abilene weather map, soon to be SVG

Site Status tool: 

Site Status tool

Monitor Job execution: 

Monitor Job execution
Check the submitted/running/held jobs
Verify the increased load
Control the traffic
Look up the expected completion time

Grid3 at SC2003: 

Grid3 at SC2003
Users' point of view
Easy to become a site (well-defined instructions, responsive mailing list for support)
Easy to package an application for the grid (well-defined example to follow, will provide automatic installation and submission; the biology group at ANL was prepared for grid execution in less than 1 week, using Chimera-Pegasus)
ATLAS validated the full chain: event generation, simulation, reconstruction, analysis (Higgs event observed during SC03)
CMS is currently using Grid3 for effective production: more than 40000 CPU*days used by VOs in the last 2 months (real jobs, no tests)

Submissions during SC2003 week: 

Submissions during SC2003 week
Total number of jobs submitted during SC2003 week: ~3400
Successful (data produced, transferred to SE, registered in RLS): ~2300
These are raw statistics; they can be improved by resubmitting the jobs which failed for various reasons.

Job type             Sample               SUB      OK
1. Simulation        "Higgs" (200 evts)   ~1500    ~1020
1. Simulation        "Top" (200 evts)     ~1200    ~600
2. Reconstruction    "Higgs" (200 evts)   ~710     ~675

These data have been analyzed by David Adams using DIAL. The production chain was validated by the reconstruction of a Higgs trace.
Errors were varied: some had unknown causes, others were due to changes in resource availability, failed transfer or registration, competition for shared resources (RAM), or certificate issues (DOEgrid/DOEsciencegrid).
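
The submitted (SUB) and successful (OK) counts on this slide translate directly into success rates. The short calculation below uses the approximate figures from the slide; the variable and function names are illustrative choices, not part of any Grid3 tool.

```python
from typing import List, Tuple

# (job type, sample, submitted, successful); approximate counts from the slide.
rows: List[Tuple[str, str, int, int]] = [
    ("Simulation", "Higgs", 1500, 1020),
    ("Simulation", "Top", 1200, 600),
    ("Reconstruction", "Higgs", 710, 675),
]


def success_rate(submitted: int, ok: int) -> float:
    """Fraction of submitted jobs that completed successfully."""
    return ok / submitted


total_submitted = sum(row[2] for row in rows)
total_ok = sum(row[3] for row in rows)
overall_rate = success_rate(total_submitted, total_ok)

for job_type, sample, submitted, ok in rows:
    print(f"{job_type:<15}{sample:<7}{success_rate(submitted, ok):6.1%}")
print(f"{'All jobs':<22}{overall_rate:6.1%}")
```

The per-row counts add up to 3410 submitted and 2295 successful, in line with the quoted totals of ~3400 and ~2300. That gives roughly 68% success for the Higgs simulation, 50% for the Top simulation, 95% for the Higgs reconstruction, and about 67% overall.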

Statistics per VO: 

Statistics per VO
Met targets:
  Data transferred per day > 1 TB
  Number of concurrent jobs > 1100 (11/20/03)
  Number of users > 100
  Number of different applications > 11
  Number of sites running multiple applications > 10
  Rate of faults/crashes < 1/hour
  Operational support load of full demonstrator < 2 FTEs
More than 45000 CPU*days used

Grid3+ beyond SC03: 

Grid3+ beyond SC03
Main production grid has 24 sites, ~2613 CPUs
There are 7 VOs, all running RedHat Linux 7.3 and VDT 1.1.11, and using DOE grid certs
The additional VO, grid3dev, is a persistent development test-bed
Grid3dev currently has 8 sites with 154 CPUs
Grid3dev is a platform to develop code for multiple UNIX OSs and to test preproduction code
Grid3dev has recently gone from VDT 1.1.13 to VDT 1.1.14

For more info: 

For more info
GGF: http://www.ggf.org/
Globus: http://www.globus.org/
Grid2003: http://www.ivdgl.org/grid2003/
Monitoring: http://grid.uchicago.edu/metrics/

Thank you: 

Thank you
John Hicks, Indiana University
jhicks@iu.edu