Hungarian perspectives in CERN LHC GRID
György Vesztergombi <veszter@rmki.kfki.hu>,
Gergely Debreczeni, Csaba Hajdu and Jozsef
Kadlecsik
Abstract
Due to its size and the urgency of its task,
the LHC Computing Grid (LCG) plays a special role among the
different grid projects. Hungary, as a member state of CERN, actively participates in
the LHC project on both the experimental and the computing sides. In this talk
the present state of the LCG is
summarized.
===========================================================================
Hungarian summary
Hungarian prospects in the
CERN LHC GRID
The world's most powerful particle accelerator is being
built at CERN, near Geneva on the French-Swiss border. The accelerator, called the LHC (Large Hadron Collider),
will be completed by 2007; within four large international
collaborations, some 6,000 physicists and engineers from universities and
research institutes around the world will carry out experiments on it, among them
researchers of RMKI in Budapest.
The LHC experiments have exceptionally large
computing requirements: 5-8 PetaBytes of measurement data will be collected
each year, whose processing would require at least 200,000 of even today's
fastest PC processors, equipped with 10 PetaBytes of disk capacity. Even
assuming that storage density and processor performance continue to grow at a
similar rate in the coming years, this will still be a very large and complex
system, about two thirds of which is planned to be distributed worldwide in
"regional centres" in Europe, America and Asia.
As a consequence, the computing system needed for the LHC
will be built as a global grid, with the goal of integrating the
geographically dispersed computing elements into a single coherent virtual
whole. Solving this task poses serious challenges in many areas, such as
running scientific applications on a distributed network, developing grid
middleware, automated management of computer systems, operating
high-performance networks, handling object-oriented databases, solving
security problems, and operating a global grid.
The research and development takes place within a project
coordinated by CERN, with the participation of scientific institutes and
industrial partners. The project, called LCG (short for LHC Computing Grid),
will be tightly integrated with the European national grid initiatives and
will collaborate closely with other projects at the forefront of grid
technology and high-performance wide-area network development, such as:
* GEANT, Datagrid and DataTAG, partially sponsored by the European Union,
* GriPhyN, Globus, iVDGL and PPDG, supported in the US by the National Science Foundation and the Department of Energy.
In the first half of 2003 the introductory first Global
GRID Service, LCG-1, will be set up, with the clear goal of providing a
reliable "production" service for those working on the LHC experiments. The
service will initially consist of a small number of larger Regional Centres
distributed over three continents. The GRID Deployment Board (GDB) has been
created to coordinate the deployment of LCG. To qualify for participation in
the deployment, a country must fulfil the following conditions: it must
demonstrate that by April 2003 it can contribute to the common LCG
infrastructure with at least one centre providing at least 50 CPUs of
computing power and 5 TeraBytes of disk capacity, with the personnel for
continuous operation guaranteed at a level of at least 2 person-year
equivalents. In accordance with this condition, the installation of an LCG
cluster has begun at RMKI, which is expected to join the system in April.
This cluster is of course only a first step, a starting point for the large
LHC system needed later.
In the talk we summarize the general lessons that can be
drawn from the CERN EU Datagrid project, whose EDG testbed can be regarded as
the prototype of LCG-1. We present our ideas on how, using the relatively
modest resources available for particle physics research in Hungary, an LCG
base of the required capacity can be created by 2007. We also wish to point
out the favourable effects such a project can have on the general Hungarian
GRID community. In this respect both sides of the potential benefits must be
emphasized: on the one hand, this is the deployment of a genuinely
"production" working system, not a "demo"; on the other hand, the experience
gained here provides a starting point for developing new applications and for
working out newer, more advanced versions of the grid systems themselves.
===========================================================================
1. Introduction
The world's most powerful particle
accelerator is being constructed at CERN, the European Organization for Nuclear
Research, near Geneva on the border between France and Switzerland.
The accelerator, called the Large Hadron Collider (LHC), will start operation
in 2007 and be used as a research tool by four large collaborations of physics
researchers, including some 6,000 people from universities and laboratories
around the world, among them RMKI in Budapest, Hungary.
The computational requirements of the
experiments that will use the LHC are enormous: 5-8 PetaBytes of data will be
generated each year, the analysis of which will require some 10 PetaBytes of
disk storage and the equivalent of 200,000 of today's fastest PC processors.
Even allowing for the continuing increase in storage densities and processor
performance this will be a very large and complex computing system, and about
two thirds of the computing capacity will be installed in "regional
computing centres" spread across Europe, America and Asia.
The computing facility for LHC will thus
be implemented as a global computational grid, with the goal of integrating
large geographically distributed computing fabrics into a virtual computing
environment. There are challenging problems to be tackled in many areas,
including: distributed scientific applications; computational grid middleware,
automated computer system management; high performance networking; object
database management; security; global grid operations.
The development and prototyping work is being
organised as a project that includes many scientific institutes and industrial
partners, coordinated by CERN. The project, nicknamed LCG (after LHC Computing
Grid), will be integrated with several European national computational grid
activities, and it
will collaborate closely with other
projects involved in advanced grid technology and high performance wide area
networking, such as:
* GEANT, Datagrid and DataTAG, partially funded by the European Union,
* GriPhyN, Globus, iVDGL and PPDG, funded in the US by the National
Science Foundation
and Department of Energy.
During the first half of 2003 an initial
LCG Global GRID Service (LCG-1) will be set up, with the clear goal of
providing a reliable production service for the LHC collaborations. The service
will begin with a small number of the larger Regional Centres, including sites
on three continents.
A GRID Deployment Board (GDB) has been created
to manage the deployment of LCG. For a country to qualify for
participation in the deployment, it must have an approved plan to
contribute to the LCG common infrastructure by April 2003 with at least one
centre having a minimum capacity of 50 CPUs and 5 TeraBytes of disk space,
together with 2 FTEs available for operation and support.
In accordance with these requirements, the installation of an LCG cluster is
in progress at KFKI-RMKI, and by April it will be incorporated into the
system. This cluster, however, serves only as a seed for wider activities.
The talk will summarize the experience
gained in the CERN EU Datagrid project, whose EDG testbed can be regarded as a
prototype for LCG Phase I.
The planned gradual build-up of LCG until
2007 will be highlighted, and we will discuss how Hungary can participate
effectively in this scientific venture with relatively modest resources, and
what benefits the general GRID community can expect from this project. In this
respect we should like to emphasize both sides of the story: on the one hand,
the deployment of a "working" system; on the other hand, a starting base for
research and development of new applications and for upgrading and working out
new versions of the system itself.
2. GRID concept
Among the general
public the notion of a computing grid has a rather diffuse meaning. In this
section we use the word in a better defined, restricted sense. One
can start from I. Foster's checklist for a GRID to be a GRID:
* a GRID coordinates resources that are not subject to centralized control and live within different control domains, and it addresses the issues of security, policy, payment, membership, etc. that arise in these settings (i.e. it is not a local management system);
* a GRID uses standard, open, general-purpose protocols and interfaces (i.e. it is not an application-specific system);
* a GRID allows its constituent resources to be used in a coordinated fashion to deliver various qualities of service (response time, throughput, availability, security, and/or co-allocation of multiple resource types);
* the utility of the combined system is significantly greater than the sum of its parts in meeting complex user demands.
This can be summarized in the words of George Gilder:
"When the network is as fast as the computer's internal links, the machine
disintegrates across the net into a set of special purpose appliances."
What is GRID in the CERN sense?
The GRID
initiative at CERN arises from the need to solve a specific problem and
therefore calls for a specific solution; nevertheless this solution has the
basic properties that allow it to be regarded as a realization of the GRID
concept.
By the nature
of the problem, the mass of experimental data will be generated at four very
localized places at the LHC accelerator, in the experiments ATLAS, CMS, ALICE
and LHCb. This calls for a special Tier-0 centre, where the primordial raw
data are stored and receive their primary processing. The challenge, however,
is so vast that the Tier-0 centre can handle only the smaller part of the
overall task. Simulation and higher-level data analysis call for centres of
similar size. These so-called Tier-1 regional centres are scattered around the
world in strategic positions, providing service to the local communities
through the intermediate Tier-2 local centres. In this hierarchical system the
Tier-3 centres are the group servers, while the users with their PCs or
laptops represent the Tier-4 level.
This
architecture is very similar to that of electrical power grids: Tier-0
experiments (= power plants) supply the Tier-1 large-volume data services
(= high-tension power grid), which are connected to the Tier-2 local centres
(= medium-voltage transformer stations), which feed the Tier-3 actual working
places (= houses, offices), providing the service to the individual Tier-4
devices (= low-voltage appliances).
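The power-grid analogy can be sketched as a simple hierarchy. The following is an illustrative model only; the role descriptions paraphrase the text above, and the traversal function is a hypothetical helper, not part of any LCG software:

```python
# Illustrative sketch of the Tier-0..Tier-4 hierarchy described above.
# The roles paraphrase the text; the traversal helper is hypothetical.

TIERS = {
    "Tier-0": "experiment site: raw data storage and primary processing",
    "Tier-1": "regional centre: large-volume data services",
    "Tier-2": "local centre: serves a regional community",
    "Tier-3": "group server: institute-level resources",
    "Tier-4": "end-user device: PC or laptop",
}

def data_path(from_tier: int, to_tier: int = 0) -> list:
    """Return the chain of tiers a request traverses, e.g. Tier-4 -> Tier-0."""
    step = -1 if from_tier > to_tier else 1
    return ["Tier-%d" % t for t in range(from_tier, to_tier + step, step)]

if __name__ == "__main__":
    for tier, role in TIERS.items():
        print("%s: %s" % (tier, role))
    # a user request climbing the full hierarchy:
    print(" -> ".join(data_path(4)))
```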
This system
is a realization of a grid in Foster's sense, but there can be other, very
different variations on this theme.
The other
very characteristic property of the LCG is that it allows certified users
belonging to multi-domain Virtual Organizations to access a large amount of
resources via a single sign-on. Put crudely, this is the possibility of
organizing, on top of the physical grid, ad hoc virtually standalone SUBGRIDs
in which the resources are assigned to a given experiment or sub-experiment.
The CERN model means that the GRID (networked data processing centres and
"middleware" software) acts as the "glue" of resources between two basic
components:
* researchers, who perform their activities regardless of geographical location, interact with colleagues, and share and access data;
* scientific instruments and experiments, which provide huge amounts of data.
The main challenge for High Throughput Computing is how to maximize the amount of resources accessible to its customers.
Distributed ownership of computing resources is the major obstacle such an
environment has to overcome in order to expand the pool of resources it can
draw from.
Secure access to resources is required: a security framework that grants
resource access only to certified, identified users (e.g. an X.509 Public Key
Infrastructure). The security issues are discussed in a separate talk at this
conference.
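The single sign-on idea can be illustrated with a minimal sketch: a certificate subject (distinguished name, DN) is mapped to a Virtual Organization, and the VO in turn determines which resources may be used. All DNs, VO names and resource names below are invented examples; the real LCG used X.509 proxy certificates and VO-LDAP servers for this purpose:

```python
# Minimal sketch of VO-based authorization: a certified user's DN is mapped
# to a VO, and each VO is entitled to a set of resources.
# All names here are hypothetical examples.

VO_MEMBERS = {
    "/C=HU/O=KFKI-RMKI/CN=Jane Physicist": "cms",
    "/C=CH/O=CERN/CN=John Analyst": "alice",
}

VO_RESOURCES = {
    "cms": {"ce.kfki.example/batch", "se.kfki.example/disk"},
    "alice": {"ce.cern.example/batch"},
}

def authorize(dn: str, resource: str) -> bool:
    """Grant access only if the DN belongs to a VO entitled to the resource."""
    vo = VO_MEMBERS.get(dn)
    return vo is not None and resource in VO_RESOURCES.get(vo, set())

if __name__ == "__main__":
    print(authorize("/C=HU/O=KFKI-RMKI/CN=Jane Physicist", "se.kfki.example/disk"))
    print(authorize("/C=CH/O=CERN/CN=John Analyst", "se.kfki.example/disk"))
```

The point of the sketch is that authorization is decided per VO, not per site: adding a user to a VO immediately grants access to every resource the VO spans, which is what makes the ad hoc SUBGRIDs mentioned above possible.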
3. The European DataGrid (EDG)
The LCG will be based on the work made by the European DataGrid (EDG)
collaboration. EDG is a project funded by the European Union to exploit and build the
next generation computing infrastructure providing intensive computation and
analysis of shared large-scale databases:
* Enable data-intensive sciences by providing world-wide Grid test beds to large distributed scientific organisations.
* Start: Jan 1, 2001; End: Dec 31, 2003.
* Applications/End User Communities: HEP, Earth Observation, Biology.
* Specific Project Objectives:
  - Middleware for Jobs (Workload) and Data Management, Information Systems, Fabric & GRID management, Network Monitoring
  - Large scale testbed
  - Production quality demonstrations
  - Contribute to Open Standards and international bodies (GGF, Industry & Research forum)
EDG structure
The EDG collaboration is structured in 12 Work Packages:
* WP1: Work Load Management System
* WP2: Data Management
* WP3: Grid Monitoring / Grid Information Systems
* WP4: Fabric Management
* WP5: Storage Element
* WP6: Testbed and demonstrators
* WP7: Network Monitoring
* WP8: High Energy Physics Applications
* WP9: Earth Observation
* WP10: Biology
* WP11: Dissemination
* WP12: Management
EDG current status
* EDG currently provides a set of middleware services:
  - Job & Data Management
  - GRID & Network monitoring
  - Security, Authentication & Authorization tools
  - Fabric Management
* Runs on the Linux Red Hat 6.2 platform.
* Site install & config tools and a set of common services are available (Resource Brokers, VO-LDAP servers for Authentication, VO-based Replica Catalogs, VO-management services).
* 5 principal EDG 1.2.0 sites currently belong to the EDG-Testbed: CERN (CH), RAL (UK), NIKHEF (NL), CNAF (I), CC-Lyon (F); deployment on other EDG testbed sites (~10) is under way.
* Intense middleware development is continuously going on, concerning:
  - new features for job partitioning and check-pointing, billing and accounting;
  - new tools for Data Management and Information Systems;
  - integration of network monitoring information inside the brokering policies.
Many regional centres are eager to join LCG-1 as early as possible. However, due to the complexity of the grid middleware and its configuration, the actual deployment and installation must be done in a very well controlled manner, especially for the first few sites while the deployment and installation process is debugged. A reasonable model that is acceptable to most centres is the following. Initial deployment of the middleware will be to one site at a time, focusing on the larger Tier 1 like sites that have sufficient support staff to be dedicated to this deployment process and then to act as a resource to their regional Tier 2 centres. Once LCG-1 has been deployed to a handful of Tier 1 centres, then those centres can provide support to the Tier 2 centres. In this way Tier 2’s in many regions can be brought on line in parallel. This process provides both a level of control in the early stages, but also addresses the needs of the centres to join LCG-1 in a reasonable time. However, it is essential that all centres provide an adequate level of support during this process.
Interoperating with existing grids and on
mature clusters with non-grid users will be essential for LCG-1 in the medium
term.
As far as possible sites that are members of LCG-1
at the early production stage (July 2003) should provide at least 16 hour/day
on-call support.
The aim is to attempt to ensure that all 4 LHC
experiments have adequate resources provided.
The desire is to add centres incrementally during the first half of
2003, to reach a situation in May where the initial LCG-1 service is
distributed over 6-7 regional centres in 3 continents. The schedule for
deployment during the second half of 2003 should be updated in June-July in the
light of experience and changes of resource and site availability.
LCG-1, more than a collection of resources, represents a set of services that the LHC experiments will use to get their work done. At this point in the development of a deployment plan for LCG-1 there are many unknowns. Attempts to specify the middleware components to be used, and to specify exactly which services will be run, have not succeeded at this time; which components will be selected from among several alternatives is still to be evaluated. Some current services are only expected to work well, or at all, if they are singleton services LCG-wide. However, other candidate components may remove that restriction. The issue of whether there is then one grid or several experiment-specific ones is therefore not so simple to answer without a better understanding of the limitations of the available middleware. It is clear that the ultimate goal is to have one LCG, with all jobs able to run at all sites, restricted only by policy decisions or lack of resources based on allocation priorities.
It is recommended that a System
Architecture Team or Taskforce be formed (of people from both the Grid
Technology and Applications areas of the LCG project) and an overall system
architecture evolved into which the various services provided by the middleware
components may be plugged in a well-defined way. Through this we may better understand issues related to
scalability (how many services of each type may run and where), interoperability
(EDG, VDT and other components), and robustness and error recovery (dependencies of
services on each other and on underlying fabric services will be more clearly
spelled out). This will also go a long way towards the desirable longer-term
goal of defining interfaces rather than components. Simple pass-through or stub services could be put in place while
middleware providers work on providing additional functional components.
Issues:
In the early stages of deploying LCG-1, since the overall architecture is unspecified and the optimal use and deployment of services unclear, it is recommended that the minimum numbers of servers for each service be run initially: in some cases one at CERN only, or one at CERN, one in the US and one in Asia, or one for each experiment. However, since operational experience will guide the deployment one may need to move rapidly to wider deployment of many services in multiple locations.
A yet-to-be-formed group of technical representatives from each site or Grid operating as part of LCG-1 should make day-by-day decisions about which services should be deployed where in order to meet the goals of the particular LCG-1 environment (development, integration or production). This needs “operations specialists” to be identified at sites to work with core team “operations” people in LCG at CERN.
At minimum, the following databases will have to be available in one or more places. They must be highly available and backed up, preferably with fail-over, since important services depend on them.
In order to proceed further with any dynamic scheme for scheduling resources, or even with a manual process for high level allocation, one must assume that certain minimum services to provide information in a standard form are in place and furthermore are actually used by each site or Grid that is part of LCG-1.
Despite limitations of
the middleware, each site or group of sites that belongs to LCG-1 must agree to
somehow publish a minimum set of information about either the static or dynamic
attributes of their resources. It is
important that this is made a requirement of entry into LCG-1, so that
increasing functionality for resource scheduling and allocation may be brought
on incrementally in a coherent way.
Such information will be needed for each of the development, integration
and production environments of LCG-1.
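What "publishing a minimum set of information in a standard form" might look like can be sketched as follows. The attribute names and the JSON encoding below are hypothetical illustrations, not the actual information schema LCG-1 would use:

```python
import json

# Hypothetical minimal site record: static attributes (installed capacity)
# plus dynamic ones (current availability), exported in a machine-readable
# form that a broker or scheduler could consume.
def publish_site_info(name, cpus, disk_tb, free_cpus, free_disk_tb):
    record = {
        "site": name,
        "static": {"cpus": cpus, "disk_tb": disk_tb},
        "dynamic": {"free_cpus": free_cpus, "free_disk_tb": free_disk_tb},
    }
    return json.dumps(record)

if __name__ == "__main__":
    info = publish_site_info("budapest.example", cpus=50, disk_tb=5,
                             free_cpus=12, free_disk_tb=1.5)
    print(info)
```

The design point is the split between static and dynamic attributes: the static part can be registered once at entry into LCG-1, while the dynamic part must be refreshed for resource scheduling to work.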
Resources may be “allocated” to experiments on three separate levels.
At the first level each experiment plans for data challenges that involve specific numbers of events to be processed through the entire data processing and analysis chain. The experiments make estimates of the amount of CPU and storage resources that this processing will take. The decision to “allocate” an adequate number of resources and to support a particular experiment’s requirements for its data challenge (and eventually all its physics data) lies with the LHCC, the Computing RRB, and other CERN oversight mechanisms that balance physics goals against overall resource requests.
At the second level, since, in the next few years, there may not be adequate resources for all experiments to simultaneously engage in a data challenge, it is envisaged that during certain time periods (on the scale of weeks/months) a particular experiment may be given priority at a particular center, or even access to resources at a center that it normally does not use, on the basis of expectations of reciprocal favorable treatment for other experiments at a later time. At the moment we see no "Grid" mechanisms in place for reaching these types of agreements, other than purely administrative ones based on planning and negotiation by individual experiments with individual centers. The results of such agreements will need to somehow find their way into the published data of a site or set of sites, for use by grid middleware components involved in resource discovery and scheduling of jobs. We do not understand at the present time how that will be done.
The third and most fine-grained level of resource allocation is that which will be performed on a minute by minute or hour by hour basis, using Grid middleware components, as part of the process of dispatching jobs/workflows to appropriate sites for execution.
The aim for LCG, in the long term, is to permit any experiment to use any grid resource that the resource owner agrees is a valid use of that resource, subject to local allocation decisions.
Metrics are needed for several different processes:
* description and measurement of resources being contributed to LCG-1 (some agreed units are needed);
* allocation of resources;
* measurement of actual use of resources;
* the experiments' criteria for success of LCG-1.
Work on developing
common metrics and on “translations”
from local units of resource measurement to common units may be needed. This might merit a working group of its
own. However, it is too soon to address
this with any urgency for LCG-1 and presumably much will be learned during the
deployment of LCG-1.
We probably need to
consider a simplified “work done” metric to measure the overall value of the
services provided in aggregate, never mind how many specints, how much disk
cache, and how many tapes were available.
This may mean that the experiment jobs themselves need to be instrumented
in a uniform way to collect this information.
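A simplified "work done" metric of the kind suggested above might, for example, weight each job's CPU time by the SI2k rating of the node it ran on and sum over all jobs, regardless of the mix of hardware behind them. The job records and node ratings below are invented for illustration:

```python
# Sketch of a single aggregate "work done" metric: CPU-hours weighted by
# each node's SI2k rating, summed into kSI2k-hours. Job data is invented.

jobs = [
    {"cpu_hours": 10.0, "node_si2k": 660},  # hypothetical node ratings
    {"cpu_hours": 4.0,  "node_si2k": 400},
    {"cpu_hours": 25.0, "node_si2k": 660},
]

def work_done_ksi2k_hours(jobs):
    """Aggregate work in kSI2k-hours, independent of the hardware mix."""
    return sum(j["cpu_hours"] * j["node_si2k"] for j in jobs) / 1000.0

if __name__ == "__main__":
    print("%.1f kSI2k-hours" % work_done_ksi2k_hours(jobs))
```

Collecting the per-job inputs uniformly is exactly the instrumentation burden the text places on the experiment jobs themselves.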
Experiments need to
clearly state, in some measurable form, their criteria for success for LCG-1.
Otherwise, unless success is guaranteed by definition, it may not be perceived
as such, since there are bound to be plenty of problems.
10. LCG in Hungary
In Hungary, as in other developed countries, the informatics community is
well aware of the importance of GRID computing, and a number of R&D projects
are in progress. At the present level of services and applications, however,
there is no truly compelling demand that could not live without a computing
GRID. Despite the loud propaganda fireworks, the situation is much the same in
the US and Europe. The only project in the world that cannot be accomplished
without a GRID is the CERN LHC. One can pursue GRID research and development
in a million directions, but to have a working LHC Computing GRID by 2007,
when the accelerator starts to pour out data at a PetaBytes/year rate, one
needs to start deploying at least a working prototype in 2003.
As a member state of CERN, Hungary has a scientific interest in its
physicists participating in the analysis of these challenging data; it is
therefore in the interest of the scientific community to invest in GRID
prototype deployment in Hungary, since in this respect it is nothing else than
a special part of the detector setup. Besides being a frontline research
instrument, the deployment of part of a world-wide GRID in Hungary represents
direct technology transfer into our region. Hungary can be part of a second
revolution similar to the World Wide Web, which was developed and first
deployed as a research tool for the CERN LEP accelerator. The idea was worked
out and tested by physicists, and once it matured there were no limits to its
proliferation.
Building up the LCG as a special-purpose system for a demanding application,
yet with most of the characteristics of a universal GRID, provides a test
ground where, within well-defined boundary conditions and time-scale, on a
really large-volume process, one can work out a robust (maybe not optimal)
system that adequately models those general grid properties which cannot be
studied on reduced-scale systems. In parallel, the LCG can serve as an
education and training centre for building up GRIDs for other applications.
An additional aspect of the situation is that all the countries in our
region, from the Czech Republic to Greece, also participate in the LHC
experiments, but none of them is in a position to contemplate a Tier-1
centre, although according to a recent survey the interested physicists in
the contacted countries (Austria, Slovakia, the Czech Republic, Hungary,
Serbia, Croatia, Bulgaria, Greece and Turkey) would form a community of
400-500 LHC physicists. Since Austria declined but is willing to support
others' initiatives, Hungary has arrived in a special position: being
regionally centred and able to serve as a bridgehead toward the south, it
would be worth developing a "mini" Tier-1 centre in Budapest in the long run.
The creation of the LCG is an evolutionary process. Hungary intends to join
along two lines: first, create a Tier-2 centre dedicated to LHC physics at
KFKI-RMKI, then extend it by creating secondary centres at ELTE, Atomki and
Debrecen University, which are also connected with CERN research; those
centres will already be open to other scientific branches too. Of course,
SZTAKI is also a main actor on the scene, concentrating on most of the
non-LHC topics.
11. KFKI-RMKI cluster
The cluster consists of 50 AMD Athlon MP 2000+ CPUs and a 4 TeraByte disk
server, which is half filled at present. This cluster provides 33 kSI2k of
computing power ("SI2k" is the unit used at CERN to compare the computing
power of CPUs). Though this cluster size is at the low end of the world-wide
proposed systems, we can still be among the first 10 clusters to start
24 hours / 7 days GRID service in July 2003. Of course, this will be an
introductory "primitive" batch service, but it will be a world premiere in
which computers from 3 continents (Asia, Europe and America) work together in
such a regime.
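The quoted capacity can be cross-checked with simple arithmetic: 33 kSI2k over 50 CPUs implies roughly 660 SI2k per Athlon MP 2000+ node. The per-CPU figure is inferred here from the two numbers in the text, not quoted from CERN benchmark tables:

```python
# Back-of-envelope check of the cluster capacity figures quoted above.
cpus = 50
total_ksi2k = 33.0

# inferred rating per CPU (not an official benchmark figure)
per_cpu_si2k = total_ksi2k * 1000 / cpus
print("~%.0f SI2k per CPU" % per_cpu_si2k)
```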
The implementation of the first-ever LCG middleware is not a trivial issue,
as was discussed in general in section 4. Since there is no existing Tier-1
centre in between, we must rely on the CERN Tier-0/Tier-1 centre. At the time
of writing this talk, the middleware is available only in a test version,
consisting of RPMs: RedHat 6.1 + LCFG + EDG 1.4.4 packages. These are good
only for training on a few-CPU system. The final version, expected by the end
of May, shall contain RedHat 7.3 + LCFGng + EDG 2.0.
Depending on the available resources, we hope to expand this centre each
coming year, so that by 2007 it can provide a reasonable contribution to the
final LHC Grid.
Acknowledgment
This work was supported by OTKA grant No. T 029264.