P-GRADE: Developing and Running Parallel Programs on Supercomputers, Cluster and Grid systems
Peter Kacsuk
MTA SZTAKI
kacsuk@sztaki.hu
The P-GRADE system was originally designed by the Laboratory of Parallel and
Distributed Systems of MTA SZTAKI to support the development of parallel
programs. Developing parallel programs is considerably more difficult than
creating sequential ones. For that reason we constructed a graphical programming
environment with which even end-users who are not IT specialists, such as
meteorologists and biologists, can develop supercomputer and cluster programs.
In its original form P-GRADE provided a graphical language, a graphical editor,
a pre-compiler, PVM library support, a distributed graphical debugger, a
monitoring system, and a performance and execution visualization tool. In the
framework of the IKTA-3 project “Cluster Programming Technology and its Usage in
Meteorology”, we extended P-GRADE with new tools that significantly increase the
efficiency and reliability of P-GRADE programs executed on clusters.
As Grid systems have become a reality, a parallel program can now be executed
simultaneously on several clusters and/or supercomputers. In order to exploit
these new possibilities, P-GRADE was further developed towards the Grid. The aim
is the same whether P-GRADE is used for supercomputers, clusters or the Grid: to
hide the details of the parallel/distributed execution environment from users so
that they can concentrate on the problem to be solved. Another aim was to enable
the use of the same P-GRADE system regardless of whether the developed parallel
program will run on a supercomputer, on a cluster or in a Grid.
This extension of P-GRADE is being carried out in several stages in the
framework of the IKTA-4 project “Hungarian Supercomputing Grid”. In the first
step a new execution mode, the job execution mode, was introduced into P-GRADE.
The job execution mode is indispensable in the Grid, but it is also very useful
on supercomputers and clusters. In order to introduce the job execution mode,
P-GRADE was integrated with the Condor job management system. The advantage of
this integration is that parallel programs developed in P-GRADE are
automatically converted into Condor jobs, and Condor then takes care of running
them either in a single cluster or on several "friendly clusters" using the
"Condor flocking" technique. We demonstrated this integrated system at the Grid
Demo workshop of the CCGrid'2002 conference in Berlin. The P-GRADE/Condor system
is a good candidate for use in the Hungarian Cluster Grid, which is also based
on Condor.
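
As a rough illustration of what turning a program into a Condor job involves,
the sketch below renders a minimal Condor submit description. It is not the
file that P-GRADE generates, which this abstract does not describe. The
function name, the executable name, the output file names and the choice of
the vanilla universe are all assumptions made for the example.

```python
def render_submit_description(executable, arguments="", universe="vanilla",
                              job_name="job"):
    """Build the text of a minimal Condor submit description.

    All values are placeholders chosen for illustration; they are not taken
    from P-GRADE.
    """
    lines = [
        f"universe   = {universe}",      # assumed; the real setting is unknown
        f"executable = {executable}",
    ]
    if arguments:
        lines.append(f"arguments  = {arguments}")
    lines += [
        f"output     = {job_name}.out",  # standard output of the job
        f"error      = {job_name}.err",  # standard error of the job
        f"log        = {job_name}.log",  # Condor's event log for the job
        "queue",                         # submit one instance of the job
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "my_parallel_app" is a hypothetical executable name.
    print(render_submit_description("my_parallel_app", job_name="demo"))
```

The resulting text would be saved to a file and handed to `condor_submit`.
Generating it automatically is what spares the user from dealing with the job
manager directly.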
The drawback of the P-GRADE/Condor system is that, due to the restrictions of
Condor, it cannot support file staging, application monitoring, on-line
visualization and parallel program check-pointing. The lack of parallel program
check-pointing prevents the temporary suspension of Grid programs and their
resumption at a later time. Such a feature would be very important in the
Hungarian Cluster Grid, where the clusters serve as Grid resources only at night
and at weekends.
In order to solve the problems above we developed a new Grid layer called
PERL-GRID. The new system, called TotalGrid, combines P-GRADE and PERL-GRID. It
enables the execution of parallel programs on arbitrary Grid resources and
supports file staging, application monitoring, on-line visualization and
parallel program check-pointing. The TotalGrid system can be used not only in
scientific Grids but also to form company Grids. It was demonstrated at the 5th
EU DataGrid conference in September 2002 at Piliscsaba, using the MEANDER
ultra-short range weather forecast program package of the Hungarian Meteorology
Service. A new workflow execution mode for P-GRADE programs is under development
in the framework of the SuperGrid project. The workflow execution mode will
enable the Grid execution of very complex problems consisting of several jobs
whose dependencies are described by the workflow graph of P-GRADE.
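
The idea of jobs ordered by a dependency graph can be made concrete with a
small sketch. It assumes the workflow is given as a mapping from each job to
the set of jobs it must wait for, and it groups the jobs into "waves" whose
members have no unmet dependencies and could therefore run at the same time.
The function name, the data layout and the job names are hypothetical, and
the sketch does not reflect how P-GRADE actually represents or schedules its
workflow graph.

```python
def schedule_in_waves(dependencies):
    """Group jobs into successive waves that respect the dependencies.

    `dependencies` maps each job name to the set of job names it waits for.
    Jobs in the same wave do not depend on each other.
    """
    remaining = {job: set(deps) for job, deps in dependencies.items()}
    done = set()
    waves = []
    while remaining:
        # A job is ready once everything it waits for has finished.
        ready = sorted(job for job, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("workflow has a cycle or refers to an unknown job")
        waves.append(ready)
        done.update(ready)
        for job in ready:
            del remaining[job]
    return waves


# Hypothetical four-job workflow: two forecasts share one input step.
workflow = {
    "prepare_input": set(),
    "forecast_a": {"prepare_input"},
    "forecast_b": {"prepare_input"},
    "visualize": {"forecast_a", "forecast_b"},
}

if __name__ == "__main__":
    for number, wave in enumerate(schedule_in_waves(workflow), start=1):
        print(number, wave)
```

In this example the two forecast jobs fall into the same wave, so a Grid
scheduler would be free to place them on different resources.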
The talk will explain, compare and evaluate the P-GRADE execution modes
mentioned above, for both clusters and Grid systems.