Presentation Transcript
ATLAS and Grid Computing : RWL Jones
GridPP 13 5th July 2005
ATLAS Computing Timeline : Commissioning takes priority!
Computing TDR structure : The TDR describes the whole Software & Computing Project as defined within the ATLAS organization:
Massive productions on 3 Grids
Massive productions on 3 Grids (3) : July-September 2004: DC2 Geant-4 simulation (long jobs)
40% on LCG/EGEE Grid, 30% on Grid3 and 30% on NorduGrid
February-May 2005: Rome production
70% on LCG/EGEE Grid, 25% on Grid3, 5% on NorduGrid
LCG/EGEE Grid resources were always difficult to saturate by “traditional” means
New approach (Lexor-CondorG) used Condor-G to submit directly to the sites
in this way the job rate was doubled on the same total available resources
much more efficient usage of the CPU resources
the same approach is now being evaluated for Grid3/OSG Grid job submission, which also suffered from job rate problems
Massive productions on 3 Grids (4) : 73 data sets containing 6.1M events simulated and reconstructed (without pile-up)
Total simulated data: 8.5M events
Pile-up done later (1.3M events done up to last week)
Experience with LCG-2 Operations : Support for our productions was excellent from the CERN-IT-EIS team
Other LCG/EGEE structures were effectively invisible (GOC, ROCs, GGUS etc)
no communication line between experiments and the Grid Operations Centres
operational trouble info always through the EIS group
sites scheduled major upgrades or downtimes during our productions
no concept of “service” for the service providers yet!
many sites consider themselves as part of a test structure set up (and funded) by EGEE
but we consider the LCG Grid as an operational service for us!
many sites do not have the concept of “permanent disk storage” in a Storage Element
if they change something in their filing system, our catalogue has to be updated!
Second ProdSys development cycle : The experience with DC2 and the Rome production taught us that we had to re-think at least some of the ProdSys components
The ProdSys review defined the way forward:
Frederic Brochu one of the reviewers
Keep the global ProdSys architecture (system decomposition)
Replace or re-work all individual components to address the identified shortcomings of Grid middleware:
reliability and fault tolerance first of all
Re-design the Distributed Data Management system to avoid single points of failure and scaling problems
Work is now underway
target is end of Summer for integration tests
ready for LCG Service Challenge 3 from October onwards
Distributed Data Management : Accessing distributed data on the Grid is not a simple task
Several central DBs are needed to hold dataset information
“Local” catalogues hold information on local data storage
The new DDM system (right) is under test this summer
It will be used for all ATLAS data from October on (LCG Service Challenge 3)
Affects GridPP effort
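The two-level lookup described on this slide (central DBs for dataset information, "local" catalogues for local storage) can be illustrated with a minimal sketch. This is not the ATLAS DDM interface: the class names, field names and the `resolve` function are all hypothetical, and the real system's schema is not given in this talk.

```python
from dataclasses import dataclass, field


@dataclass
class CentralCatalogue:
    """Hypothetical central DB: what a dataset contains and which sites hold it."""
    dataset_files: dict = field(default_factory=dict)  # dataset -> list of logical file names
    dataset_sites: dict = field(default_factory=dict)  # dataset -> list of site names


@dataclass
class LocalCatalogue:
    """Hypothetical per-site catalogue: logical file name -> physical location."""
    site: str
    replicas: dict = field(default_factory=dict)


def resolve(dataset, central, local_catalogues):
    """Return {site: {logical name: physical location}} for one dataset.

    Step 1 asks the central catalogue for the file list and the sites.
    Step 2 asks each site's local catalogue where the files physically are.
    Files that a local catalogue does not know about are left out.
    """
    files = central.dataset_files.get(dataset, [])
    result = {}
    for site in central.dataset_sites.get(dataset, []):
        local = local_catalogues.get(site)
        if local is None:
            continue  # site is listed centrally but its catalogue is unavailable
        result[site] = {lfn: local.replicas[lfn] for lfn in files if lfn in local.replicas}
    return result
```

In this sketch a site that moves files without updating its local catalogue simply drops out of the result, which mirrors the catalogue-consistency concern raised under LCG-2 operations.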
Computing Operations : The Computing Operations organization is likely to change:
Grid Tools
Grid operations:
Tier-0 operations
re-processing of real and simulated data at Tier-1s
data distribution and placement
Software distribution and installation
Site and software installation validation and monitoring
Coordination of Service Challenges in 2005-2006
User Support
Proposal to use Frederic Brochu in front-line triage
Credited contribution
Contingent on Distributed Analysis planning
Software Installation : Software installation continues to be a challenge
Rapid roll-out of releases to the Grid is important for ATLAS UK eScience goals (3.1.4)
Vital for user code in distributed analysis
Grigori Rybkine (50/50 GridPP/ATLAS eScience):
Working towards 3.1.5, kit installation and package management in distributed analysis
Package manager implementation supports tarball and locally-built code
Essential support role
3.1.5 progressing well, 3.1.4 may have some delays because of external effort in nightly deployable packages
Current plans for EGEE/gLite : Ready to test new components as soon as they are released from the internal certification process
assume the LCG Baseline Services
Only seen the File Transfer Service & LCG File Catalogue
both being actively tested by our DDM group
FTS will be field-tested by Service Challenge 3 starting in July
LFC is in our plan for the new DDM (Summer deployment)
Not really seen the new Workload Management System nor the new Computing Element
some ATLAS informal access to pre-release versions
As soon as the performance is acceptable we will ask to have them deployed
this is NOT a blank check!
Distributed Analysis System : ATLAS and GANGA work now focused on Distributed Analysis
LCG RTAG 11 in 2003 did not produce a common analysis system project as hoped. ATLAS therefore planned to combine the strengths of various existing prototypes:
GANGA provides a Grid front-end for Gaudi/Athena jobs
DIAL provides fast, quasi-interactive, access to large local clusters
The ATLAS Production System to interface to the 3 Grid flavours
Alvin Tan
Work on the job-building GUI and Job Options Editor well received
Wish from LBL to merge JOE with Job Options Tracer project
Monitoring work also well received – prototypes perform well.
Frederic Brochu
Provided beta version of new job submission from GANGA direct to Production System
Distributed Analysis System (2) : Currently reviewing this activity to define a baseline for the development of a start-up Distributed Analysis System
All this has to work together with the DDM system described earlier
Decide a baseline “now”, so we can have a testable system by this autumn
The outcome of the review may change GridPP plans
Conclusions : ATLAS is (finally) getting effective throughput from LCG
The UK effort is making an important contribution
Distributed Analysis continues to pose a big challenge
ATLAS is taking the right management approach
GridPP effort will have to be responsive