Hungarian perspectives in CERN LHC GRID
György Vesztergombi <veszter@rmki.kfki.hu>,
Gergely Debreczeni, Csaba Hajdu and Jozsef
Kadlecsik
Abstract
Due to its size and the urgency of its task,
the LHC Computing Grid (LCG) plays a special role among the
different grid projects. Hungary, as a member state of CERN, actively participates in
the LHC project on both the experimental and the computing sides. In this talk
the present state of the LCG is
summarized.
===========================================================================
Hungarian summary
Hungarian prospects in the
CERN LHC GRID
The world's most powerful particle accelerator is being
built at CERN, near Geneva on the French-Swiss border. The accelerator, called the LHC (Large Hadron Collider),
will be completed by 2007; within four large international
collaborations, some 6,000 physicists and engineers from universities and
research institutes around the world will carry out experiments on it, among them
researchers of RMKI in Budapest.
The LHC experiments have exceptionally large
computing requirements: 5-8 PetaBytes of measurement data will be collected
each year, whose processing would require at least 200,000 of even today's
fastest PC processors, equipped with 10 PetaBytes of disk capacity. Even
assuming that storage density and processor performance continue to grow at a
similar rate in the coming years, this will still be a very large and complex
system, about two thirds of which is planned to be distributed worldwide in
"regional centres" in Europe, America and Asia.
As a consequence, the computing system needed for the LHC
will be built as a global grid, with the goal of integrating the
geographically dispersed computing elements into a single coherent virtual
whole. Solving this task poses serious challenges in many areas, such as
running scientific applications on a distributed network, developing grid
middleware, automated management of computer systems, operating
high-performance networks, handling object-oriented databases, solving
security problems, and operating a global grid.
The research and development takes place within a project
coordinated by CERN, with the participation of scientific institutes and
industrial partners. The project, called LCG (short for LHC Computing Grid),
will be tightly integrated with the European national grid initiatives and
will collaborate closely with other projects at the forefront of grid
technology and high-performance wide-area network development, such as:
* GEANT, Datagrid and DataTAG, partially sponsored by the European Union,
* GriPhyN, Globus, iVDGL and PPDG, supported in the US by the National Science Foundation and the Department of Energy.
In the first half of 2003 the introductory first Global
GRID Service, LCG-1, will be set up, with the clear goal of providing a
reliable "production" service for those working on the LHC experiments. The
service will initially consist of a small number of larger Regional Centres
distributed over three continents. The GRID Deployment Board (GDB) has been
created to coordinate the deployment of LCG. To qualify for participation in
the deployment, a country must fulfil the following conditions: it must
demonstrate that by April 2003 it can contribute to the common LCG
infrastructure with at least one centre providing at least 50 CPUs of
computing power and 5 TeraBytes of disk capacity, with the personnel for
continuous operation guaranteed at a level of at least 2 person-year
equivalents. In accordance with this condition, the installation of an LCG
cluster has begun at RMKI, which is expected to join the system in April.
This cluster is of course only a first step, a starting point for the large
LHC system needed later.
In the talk we summarize the general lessons that can be
drawn from the CERN EU Datagrid project, whose EDG testbed can be regarded as
the prototype of LCG-1. We present our ideas on how, using the relatively
modest resources available for particle physics research in Hungary, an LCG
base of the required capacity can be created by 2007. We also wish to point
out the favourable effects such a project can have on the general Hungarian
GRID community. In this respect both sides of the potential benefits must be
emphasized: on the one hand, this is the deployment of a genuinely
"production" working system, not a "demo"; on the other hand, the experience
gained here provides a starting point for developing new applications and for
working out newer, more advanced versions of the grid systems themselves.
===========================================================================
1. Introduction
The world's most powerful particle
accelerator is being constructed at CERN, the European Organization for Nuclear
Research, near Geneva on the border between France and Switzerland.
The accelerator, called the Large Hadron Collider (LHC), will start operation
in 2007 and be used as a research tool by four large collaborations of physics
researchers, including some 6,000 people from universities and laboratories
around the world, among them RMKI in Budapest, Hungary.
The computational requirements of the
experiments that will use the LHC are enormous: 5-8 PetaBytes of data will be
generated each year, the analysis of which will require some 10 PetaBytes of
disk storage and the equivalent of 200,000 of today's fastest PC processors.
Even allowing for the continuing increase in storage densities and processor
performance this will be a very large and complex computing system, and about
two thirds of the computing capacity will be installed in "regional
computing centres" spread across Europe, America and Asia.
The computing facility for LHC will thus
be implemented as a global computational grid, with the goal of integrating
large geographically distributed computing fabrics into a virtual computing
environment. There are challenging problems to be tackled in many areas,
including: distributed scientific applications; computational grid middleware,
automated computer system management; high performance networking; object
database management; security; global grid operations.
The development and prototyping work is being
organised as a project that includes many scientific institutes and industrial
partners, coordinated by CERN. The project, nicknamed LCG (after LHC Computing
Grid), will be integrated with several European national computational grid
activities, and it
will collaborate closely with other
projects involved in advanced grid technology and high performance wide area
networking, such as:
* GEANT, Datagrid and DataTAG, partially funded by the European Union,
* GriPhyN, Globus, iVDGL and PPDG, funded in the US by the National
Science Foundation
and Department of Energy.
During the first half of 2003 an initial
LCG Global GRID Service (LCG-1) will be set up, with the clear goal of
providing a reliable production service for the LHC collaborations. The service
will begin with a small number of the larger Regional Centres, including sites
on three continents.
A GRID Deployment Board (GDB) has been created
to manage the deployment of LCG. For a country to qualify for
participation in the deployment, it must have an approved plan to
contribute to the LCG common infrastructure by April 2003 with at least one
centre having a minimum capacity of 50 CPUs and 5 TeraBytes of disk space,
together with 2 FTEs available for operation and support.
In accordance with these requirements, the installation of an LCG cluster is
in progress at KFKI-RMKI, and by April it will be incorporated into the
system. This cluster, however, serves only as a seed for wider activities.
The talk will summarize the experience
gained in the CERN EU Datagrid project, whose EDG testbed can be regarded as a
prototype for LCG Phase I.
The planned gradual build-up of LCG until
2007 will be highlighted, and we will discuss how Hungary can participate
effectively in this scientific venture with relatively modest resources, and
what benefits the general GRID community can expect from this project. In this
respect we should like to emphasize both sides of the story: on the one hand,
the deployment of a "working" system; on the other hand, a starting base for
research and development of new applications and for upgrading and working out
new versions of the system itself.
2. GRID concept
Among the general
public the notion of a computing grid has a rather diffuse meaning. In this
section we use the word in a better defined, restricted sense. One
can start from I. Foster's checklist for a GRID to be a GRID:
* a GRID coordinates resources that are not subject to centralized control and live within different control domains, and it addresses the issues of security, policy, payment, membership, etc. that arise in these settings (i.e. it is not a local management system);
* a GRID uses standard, open, general-purpose protocols and interfaces (i.e. it is not an application-specific system);
* a GRID allows its constituent resources to be used in a coordinated fashion to deliver various qualities of service (response time, throughput, availability, security, and/or co-allocation of multiple resource types);
* the utility of the combined system is significantly greater than the sum of its parts in meeting complex user demands.
This can be summarized in the words of George Gilder:
"When the network is as fast as the computer's internal links, the machine
disintegrates across the net into a set of special purpose appliances."
What is GRID in the CERN sense?
The GRID
initiative at CERN arises from the need to solve a specific problem and
therefore calls for a specific solution; nevertheless this solution has the
basic properties that allow it to be regarded as a realization of the GRID
concept.
By the nature
of the problem, the mass of experimental data will be generated at four very
localized places at the LHC accelerator, in the experiments ATLAS, CMS, ALICE
and LHCb. This calls for a special Tier-0 centre, where the primordial raw
data are stored and receive their primary processing. The challenge, however,
is so vast that the Tier-0 centre can handle only the smaller part of the
overall task. Simulation and higher-level data analysis call for centres of
similar size. These so-called Tier-1 regional centres are scattered around the
world in strategic positions, providing service to the local communities
through the intermediate Tier-2 local centres. In this hierarchical system the
Tier-3 centres are the group servers, while the users with their PCs or
laptops represent the Tier-4 level.
This
architecture is very similar to that of electrical power grids: Tier-0
experiments (= power plants) supply the Tier-1 large-volume data services
(= high-tension power grid), which are connected to the Tier-2 local centres
(= medium-voltage transformer stations), which feed the Tier-3 actual working
places (= houses, offices), providing the service to the individual Tier-4
devices (= low-voltage appliances).
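The power-grid analogy can be sketched as a simple hierarchy. The following is an illustrative model only; the role descriptions paraphrase the text above, and the traversal function is a hypothetical helper, not part of any LCG software:

```python
# Illustrative sketch of the Tier-0..Tier-4 hierarchy described above.
# The roles paraphrase the text; the traversal helper is hypothetical.

TIERS = {
    "Tier-0": "experiment site: raw data storage and primary processing",
    "Tier-1": "regional centre: large-volume data services",
    "Tier-2": "local centre: serves a regional community",
    "Tier-3": "group server: institute-level resources",
    "Tier-4": "end-user device: PC or laptop",
}

def data_path(from_tier: int, to_tier: int = 0) -> list:
    """Return the chain of tiers a request traverses, e.g. Tier-4 -> Tier-0."""
    step = -1 if from_tier > to_tier else 1
    return ["Tier-%d" % t for t in range(from_tier, to_tier + step, step)]

if __name__ == "__main__":
    for tier, role in TIERS.items():
        print("%s: %s" % (tier, role))
    # a user request climbing the full hierarchy:
    print(" -> ".join(data_path(4)))
```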
This system
is a realization of a grid in Foster's sense, but there can be other, very
different variations on this theme.
The other
very characteristic property of the LCG is that it allows certified users
belonging to multi-domain Virtual Organizations to access a large amount of
resources via a single sign-on. Put crudely, this is the possibility of
organizing, on top of the physical grid, ad hoc virtually standalone SUBGRIDs
in which the resources are assigned to a given experiment or sub-experiment.
The CERN model means that the GRID (networked data processing centres and
"middleware" software) acts as the "glue" of resources between two basic
components:
* researchers, who perform their activities regardless of geographical location, interact with colleagues, and share and access data;
* scientific instruments and experiments, which provide huge amounts of data.
The main challenge for High Throughput Computing is how to maximize the amount of resources accessible to its customers.
Distributed ownership of computing resources is the major obstacle such an
environment has to overcome in order to expand the pool of resources it can
draw from.
Secure access to resources is required: a security framework that grants
resource access only to certified, identified users (e.g. an X.509 Public Key
Infrastructure). The security issues are discussed in a separate talk at this
conference.
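The single sign-on idea can be illustrated with a minimal sketch: a certificate subject (distinguished name, DN) is mapped to a Virtual Organization, and the VO in turn determines which resources may be used. All DNs, VO names and resource names below are invented examples; the real LCG used X.509 proxy certificates and VO-LDAP servers for this purpose:

```python
# Minimal sketch of VO-based authorization: a certified user's DN is mapped
# to a VO, and each VO is entitled to a set of resources.
# All names here are hypothetical examples.

VO_MEMBERS = {
    "/C=HU/O=KFKI-RMKI/CN=Jane Physicist": "cms",
    "/C=CH/O=CERN/CN=John Analyst": "alice",
}

VO_RESOURCES = {
    "cms": {"ce.kfki.example/batch", "se.kfki.example/disk"},
    "alice": {"ce.cern.example/batch"},
}

def authorize(dn: str, resource: str) -> bool:
    """Grant access only if the DN belongs to a VO entitled to the resource."""
    vo = VO_MEMBERS.get(dn)
    return vo is not None and resource in VO_RESOURCES.get(vo, set())

if __name__ == "__main__":
    print(authorize("/C=HU/O=KFKI-RMKI/CN=Jane Physicist", "se.kfki.example/disk"))
    print(authorize("/C=CH/O=CERN/CN=John Analyst", "se.kfki.example/disk"))
```

The point of the sketch is that authorization is decided per VO, not per site: adding a user to a VO immediately grants access to every resource the VO spans, which is what makes the ad hoc SUBGRIDs mentioned above possible.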
3. The European DataGrid (EDG)
The LCG will be based on the work made by the European DataGrid (EDG)
collaboration. EDG is a project funded by the European Union to exploit and build the
next generation computing infrastructure providing intensive computation and
analysis of shared large-scale databases:
* Enable data-intensive sciences by providing world-wide Grid test beds to large distributed scientific organisations.
* Start: Jan 1, 2001; End: Dec 31, 2003.
* Applications/End User Communities: HEP, Earth Observation, Biology.
* Specific Project Objectives:
  - Middleware for Jobs (Workload) and Data Management, Information Systems, Fabric & GRID management, Network Monitoring
  - Large scale testbed
  - Production quality demonstrations
  - Contribute to Open Standards and international bodies (GGF, Industry & Research forum)
EDG structure
The EDG collaboration is structured in 12 Work Packages:
* WP1: Work Load Management System
* WP2: Data Management
* WP3: Grid Monitoring / Grid Information Systems
* WP4: Fabric Management
* WP5: Storage Element
* WP6: Testbed and demonstrators
* WP7: Network Monitoring
* WP8: High Energy Physics Applications
* WP9: Earth Observation
* WP10: Biology
* WP11: Dissemination
* WP12: Management
EDG current status
* EDG currently provides a set of middleware services:
  - Job & Data Management
  - GRID & Network monitoring
  - Security, Authentication & Authorization tools
  - Fabric Management
* Runs on the Linux Red Hat 6.2 platform.
* Site install & config tools and a set of common services are available (Resource Brokers, VO-LDAP servers for Authentication, VO-based Replica Catalogs, VO-management services).
* 5 principal EDG 1.2.0 sites currently belong to the EDG-Testbed: CERN (CH), RAL (UK), NIKHEF (NL), CNAF (I), CC-Lyon (F); deployment on other EDG testbed sites (~10) is under way.
* Intense middleware development is continuously going on, concerning:
  - new features for job partitioning and check-pointing, billing and accounting;
  - new tools for Data Management and Information Systems;
  - integration of network monitoring information inside the brokering policies.
Many regional centres are eager to join LCG-1 as early as possible. However, due to the complexity of the grid middleware and its configuration, the actual deployment and installation must be done in a very well controlled manner, especially for the first few sites while the deployment and installation process is debugged. A reasonable model that is acceptable to most centres is the following. Initial deployment of the middleware will be to one site at a time, focusing on the larger Tier 1 like sites that have sufficient support staff to be dedicated to this deployment process and then to act as a resource to their regional Tier 2 centres. Once LCG-1 has been deployed to a handful of Tier 1 centres, then those centres can provide support to the Tier 2 centres. In this way Tier 2’s in many regions can be brought on line in parallel. This process provides both a level of control in the early stages, but also addresses the needs of the centres to join LCG-1 in a reasonable time. However, it is essential that all centres provide an adequate level of support during this process.
Interoperating with existing grids and on
mature clusters with non-grid users will be essential for LCG-1 in the medium
term.
As far as possible sites that are members of LCG-1
at the early production stage (July 2003) should provide at least 16 hour/day
on-call support.
The aim is to attempt to ensure that all 4 LHC
experiments have adequate resources provided.
The desire is to add centres incrementally during the first half of
2003, to reach a situation in May where the initial LCG-1 service is
distributed over 6-7 regional centres in 3 continents. The schedule for
deployment during the second half of 2003 should be updated in June-July in the
light of experience and changes of resource and site availability.
LCG-1, more than a collection of resources, represents a set of services that the LHC experiments will use to get their work done. At this point in the development of a deployment plan for LCG-1 there are many unknowns. Attempts to specify the middleware components to be used, and to specify exactly which services will be run, have not succeeded at this time; which components will be selected from among several alternatives is still to be evaluated. Some current services are only expected to work well, or at all, if they are singleton services LCG-wide. However, other candidate components may remove that restriction. The issue of whether there is then one grid or several experiment-specific ones is therefore not so simple to answer without a better understanding of the limitations of the available middleware. It is clear that the ultimate goal is to have one LCG, with all jobs able to run at all sites, restricted only by policy decisions or lack of resources based on allocation priorities.
It is recommended that a System
Architecture Team or Taskforce be formed (of people from both the Grid
Technology and Applications areas of the LCG project) and an overall system
architecture evolved into which the various services provided by the middleware
components may be plugged in a well-defined way. Through this we may better understand issues related to
scalability (how many services of each type may run and where), interoperability
(EDG, VDT and other components), and robustness and error recovery (dependencies of
services on each other and on underlying fabric services will be more clearly
spelled out). This will also go a long way towards the desirable longer-term
goal of defining interfaces rather than components. Simple pass-through or stub services could be put in place while
middleware providers work on providing additional functional components.
Issues:
In the early stages of deploying LCG-1, since the overall architecture is unspecified and the optimal use and deployment of services unclear, it is recommended that the minimum numbers of servers for each service be run initially: in some cases one at CERN only, or one at CERN, one in the US and one in Asia, or one for each experiment. However, since operational experience will guide the deployment one may need to move rapidly to wider deployment of many services in multiple locations.
A yet-to-be-formed group of technical representatives from each site or Grid operating as part of LCG-1 should make day-by-day decisions about which services should be deployed where in order to meet the goals of the particular LCG-1 environment (development, integration or production). This needs “operations specialists” to be identified at sites to work with core team “operations” people in LCG at CERN.
At minimum, the following databases will have to be available in one or more places. They must be highly available and backed up, preferably with fail-over, since important services depend on them.
In order to proceed further with any dynamic scheme for scheduling resources, or even with a manual process for high level allocation, one must assume that certain minimum services to provide information in a standard form are in place and furthermore are actually used by each site or Grid that is part of LCG-1.
Despite limitations of
the middleware, each site or group of sites that belongs to LCG-1 must agree to
somehow publish a minimum set of information about either the static or dynamic
attributes of their resources. It is
important that this is made a requirement of entry into LCG-1, so that
increasing functionality for resource scheduling and allocation may be brought
on incrementally in a coherent way.
Such information will be needed for each of the development, integration
and production environments of LCG-1.
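What "publishing a minimum set of information in a standard form" might look like can be sketched as follows. The attribute names and the JSON encoding below are hypothetical illustrations, not the actual information schema LCG-1 would use:

```python
import json

# Hypothetical minimal site record: static attributes (installed capacity)
# plus dynamic ones (current availability), exported in a machine-readable
# form that a broker or scheduler could consume.
def publish_site_info(name, cpus, disk_tb, free_cpus, free_disk_tb):
    record = {
        "site": name,
        "static": {"cpus": cpus, "disk_tb": disk_tb},
        "dynamic": {"free_cpus": free_cpus, "free_disk_tb": free_disk_tb},
    }
    return json.dumps(record)

if __name__ == "__main__":
    info = publish_site_info("budapest.example", cpus=50, disk_tb=5,
                             free_cpus=12, free_disk_tb=1.5)
    print(info)
```

The design point is the split between static and dynamic attributes: the static part can be registered once at entry into LCG-1, while the dynamic part must be refreshed for resource scheduling to work.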
Resources may be “allocated” to experiments on three separate levels.
At the first level each experiment plans for data challenges that involve specific numbers of events to be processed through the entire data processing and analysis chain. The experiments make estimates of the amount of CPU and storage resources that this processing will take. The decision to “allocate” an adequate number of resources and to support a particular experiment’s requirements for its data challenge (and eventually all its physics data) lies with the LHCC, the Computing RRB, and other CERN oversight mechanisms that balance physics goals against overall resource requests.
At the second level, since, in the next few years, there may not be adequate resources for all experiments to simultaneously engage in a data challenge, it is envisaged that during certain time periods (on the scale of weeks/months) a particular experiment may be given priority at a particular center, or even access to resources at a center that it normally does not use, on the basis of expectations of reciprocal favorable treatment for other experiments at a later time. At the moment we see no "Grid" mechanisms in place for reaching these types of agreements, other than purely administrative ones based on planning and negotiation by individual experiments with individual centers. The results of such agreements will need to somehow find their way into the published data of a site or set of sites, for use by grid middleware components involved in resource discovery and scheduling of jobs. We do not understand at the present time how that will be done.
The third and most fine-grained level of resource allocation is that which will be performed on a minute by minute or hour by hour basis, using Grid middleware components, as part of the process of dispatching jobs/workflows to appropriate sites for execution.
The aim for LCG, in the long term, is to permit any experiment to use any grid resource that the resource owner agrees is a valid use of that resource, subject to local allocation decisions.
Metrics are needed for several different processes:
* description and measurement of resources being contributed to LCG-1 (some agreed units are needed);
* allocation of resources;
* measurement of actual use of resources;
* the experiments' criteria for success of LCG-1.
Work on developing
common metrics and on “translations”
from local units of resource measurement to common units may be needed. This might merit a working group of its
own. However, it is too soon to address
this with any urgency for LCG-1 and presumably much will be learned during the
deployment of LCG-1.
We probably need to
consider a simplified “work done” metric to measure the overall value of the
services provided in aggregate, never mind how many specints, how much disk
cache, and how many tapes were available.
This may mean that the experiment jobs themselves need to be instrumented
in a uniform way to collect this information.
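A simplified "work done" metric of the kind suggested above might, for example, weight each job's CPU time by the SI2k rating of the node it ran on and sum over all jobs, regardless of the mix of hardware behind them. The job records and node ratings below are invented for illustration:

```python
# Sketch of a single aggregate "work done" metric: CPU-hours weighted by
# each node's SI2k rating, summed into kSI2k-hours. Job data is invented.

jobs = [
    {"cpu_hours": 10.0, "node_si2k": 660},  # hypothetical node ratings
    {"cpu_hours": 4.0,  "node_si2k": 400},
    {"cpu_hours": 25.0, "node_si2k": 660},
]

def work_done_ksi2k_hours(jobs):
    """Aggregate work in kSI2k-hours, independent of the hardware mix."""
    return sum(j["cpu_hours"] * j["node_si2k"] for j in jobs) / 1000.0

if __name__ == "__main__":
    print("%.1f kSI2k-hours" % work_done_ksi2k_hours(jobs))
```

Collecting the per-job inputs uniformly is exactly the instrumentation burden the text places on the experiment jobs themselves.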
Experiments need to
clearly state, in some measurable form, their criteria for success for LCG-1.
Otherwise, unless success is guaranteed by definition, it may not be perceived
as such, since there are bound to be plenty of problems.
10. LCG in Hungary
In Hungary, as in other developed countries, the informatics community is
well aware of the importance of GRID computing, and a number of R&D projects
are in progress. At the present level of services and applications, however,
there is no truly compelling demand that could not live without a computing
GRID. Despite the loud propaganda fireworks, the situation is much the same in
the US and Europe. The only project in the world that cannot be accomplished
without a GRID is the CERN LHC. One can pursue GRID research and development
in a million directions, but to have a working LHC Computing GRID by 2007,
when the accelerator starts to pour out data at a PetaBytes/year rate, one
needs to start deploying at least a working prototype in 2003.
As a member state of CERN, Hungary has a scientific interest in its
physicists participating in the analysis of these challenging data; it is
therefore in the interest of the scientific community to invest in GRID
prototype deployment in Hungary, since in this respect it is nothing else than
a special part of the detector setup. Besides being a frontline research
instrument, the deployment of part of a world-wide GRID in Hungary represents
direct technology transfer into our region. Hungary can be part of a second
revolution similar to the World Wide Web, which was developed and first
deployed as a research tool for the CERN LEP accelerator. The idea was worked
out and tested by physicists, and once it matured there were no limits to its
proliferation.
Building up the LCG as a special-purpose system for a demanding application,
yet with most of the characteristics of a universal GRID, provides a test
ground where, within well-defined boundary conditions and time-scale, on a
really large-volume process, one can work out a robust (maybe not optimal)
system that adequately models those general grid properties which cannot be
studied on reduced-scale systems. In parallel, the LCG can serve as an
education and training centre for building up GRIDs for other applications.
An additional aspect of the situation is that all the countries in our
region, from the Czech Republic to Greece, also participate in the LHC
experiments, but none of them is in a position to contemplate a Tier-1
centre, although according to a recent survey the interested physicists in
the contacted countries (Austria, Slovakia, the Czech Republic, Hungary,
Serbia, Croatia, Bulgaria, Greece and Turkey) would form a community of
400-500 LHC physicists. Since Austria declined but is willing to support
others' initiatives, Hungary has arrived in a special position: being
regionally centred and able to serve as a bridgehead toward the south, it
would be worth developing a "mini" Tier-1 centre in Budapest in the long run.
The creation of the LCG is an evolutionary process. Hungary intends to join
along two lines: first, create a Tier-2 centre dedicated to LHC physics at
KFKI-RMKI, then extend it by creating secondary centres at ELTE, Atomki and
Debrecen University, which are also connected with CERN research; those
centres will already be open to other scientific branches too. Of course,
SZTAKI is also a main actor on the scene, concentrating on most of the
non-LHC topics.
11. KFKI-RMKI cluster
The cluster consists of 50 AMD Athlon MP 2000+ CPUs and a 4 TeraByte disk
server, which is half filled at present. This cluster provides 33 kSI2k of
computing power ("SI2k" is the unit used at CERN to compare the computing
power of CPUs). Though this cluster size is at the low end of the world-wide
proposed systems, we can still be among the first 10 clusters to start
24 hours / 7 days GRID service in July 2003. Of course, this will be an
introductory "primitive" batch service, but it will be a world premiere in
which computers from 3 continents (Asia, Europe and America) work together in
such a regime.
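The quoted capacity can be cross-checked with simple arithmetic: 33 kSI2k over 50 CPUs implies roughly 660 SI2k per Athlon MP 2000+ node. The per-CPU figure is inferred here from the two numbers in the text, not quoted from CERN benchmark tables:

```python
# Back-of-envelope check of the cluster capacity figures quoted above.
cpus = 50
total_ksi2k = 33.0

# inferred rating per CPU (not an official benchmark figure)
per_cpu_si2k = total_ksi2k * 1000 / cpus
print("~%.0f SI2k per CPU" % per_cpu_si2k)
```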
The implementation of the first-ever LCG middleware is not a trivial issue,
as was discussed in general in section 4. Since there is no existing Tier-1
centre in between, we must rely on the CERN Tier-0/Tier-1 centre. At the time
of writing this talk, the middleware is available only in a test version,
consisting of RPMs: RedHat 6.1 + LCFG + EDG 1.4.4 packages. These are good
only for training on a few-CPU system. The final version, expected by the end
of May, shall contain RedHat 7.3 + LCFGng + EDG 2.0.
Depending on the available resources, we hope to expand this centre each
coming year, so that by 2007 it can provide a reasonable contribution to the
final LHC Grid.
Acknowledgment
This work was supported by OTKA grant No. T 029264.