Large-scale distributed
computing systems
Lecture 3:
Infrastructures and deployment
January 2015
Johan Montagnat
CNRS, I3S, MODALIS
http://www.i3s.unice.fr/~johan/
Course overview
► 1. Distributed computing and models
► 2. Remote services and cloud computing
► 3. Infrastructures and deployment
► 4. Workload and performance modeling
► 5. Workflows
► 6. Authentication, authorization, security
► 7. Data management
► 8. Evaluation
Master Ubinet: Large-Scale Distributed Computing (3)
Johan Montagnat
2
Course content
► 3. Infrastructures and deployment
▸ Research and production infrastructures
▸ Middleware development
▸ Deployment
▸ Operations
Recipe to make a distributed system
► Deploying, operating and using a large-scale distributed system is a long process
1. System developers design, develop and package middleware services
2. System administrators select the collection of interoperable services to run
3. System administrators deploy and operate the services
∙ Decide on service distribution
∙ Decide on operation policies
4. Application developers build applications against the services available
5. End users exploit the infrastructure through applications
► Fortunately, infrastructures are being operated, e.g.
▸ EGI European Grid (running the gLite middleware)
▸ French Aladdin/Grid5000 infrastructure (flexible middleware)
Basic components for large-scale,
intensive computing
Resources management strategies
[Figure: resource management strategies. Cluster computing: a master (batch manager) dispatches application tasks to workers within a computing cluster. Grid computing: a resources broker distributes work across sites X and Y. Cloud infrastructures: an application tasks manager obtains resources from a cloud resources manager spanning several sites.]
Main middleware components
[Figure: main middleware components. User requests reach the Workload Management service, which issues resource allocation queries to the Information Service. The Data Management service holds dataset information and indexes resources through the Information Service. Sites publish information about their computing and storage resources to the Information Service. An Authorization & Authentication service controls access at each site. Logging and real-time monitoring follow the dynamic evolution of the system.]
Functionality overview
► Authentication and authorization
▸ Identification, security, access control
► Information service
▸ Large-scale distribution of resources, volatility
► Resource broker
▸ Load distribution, reliability, heterogeneity of computing resources
► Data Management Service
▸ Storage distribution, reliability, heterogeneity of storage resources
► Logging, monitoring, accounting
▸ Trace collection, system heartbeat, security enforcement, billing
User Interface
► System entry point
▸ May be the user workstation
▸ May be an intermediate host with a heavy client / specific OS installed
► Machine available to the user with middleware service clients installed
▸ Client-side: no service, no need for inbound connections
► The system interface may be implemented at multiple levels
▸ Command line interface
▸ API
▸ Service interface
▸ Portal
► Controls client credentials
Authentication and authorization
► Authentication: user identities may be determined through different credentials
▸ Login / passwords
▸ Personal certificates (usually X509) issued by a global certification authority
▸ Physical devices (fingerprint...)
► Authorization
▸ Access control to any resource (hardware, software, data...)
▸ Requires authentication
► User identifiers belong to VOs
▸ registered in a VOMS (VO Management Service)
▸ The VOMS maps users to groups and roles
► Enables single sign-on to all grid services
► Identity delegation is often needed (for a service to act on behalf of the user)
Information System
► Collects / provides information on available resources (minimally computing and storage resources)
► Uses a standardized information schema
▸ GLUE (Grid Laboratory Uniform Environment) is standardized across different grids
► Information Indexes are built and periodically updated
▸ Trade-off between freshness of information and status update frequency
▸ Information may (and often will) be outdated
► Scalability is critical
▸ A large number of resources is federated
▸ Hierarchical distribution, caching...
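The freshness trade-off above can be sketched in a few lines. This is an illustrative model only: the `ResourceIndex` class, its TTL value and the attribute names are invented, not part of any real information service.

```python
import time

# Toy information index: entries carry a publication timestamp, and a
# lookup flags entries older than the update period (TTL) as possibly stale.
class ResourceIndex:
    def __init__(self, ttl=120.0):
        self.ttl = ttl        # seconds between expected status updates
        self.entries = {}     # resource id -> (timestamp, attributes)

    def publish(self, rid, attrs, now=None):
        self.entries[rid] = (time.time() if now is None else now, attrs)

    def lookup(self, rid, now=None):
        now = time.time() if now is None else now
        ts, attrs = self.entries[rid]
        return attrs, (now - ts) > self.ttl   # attributes + staleness flag

index = ResourceIndex(ttl=120.0)
index.publish("ce01.example.org", {"FreeSlots": 42}, now=0.0)
attrs, stale = index.lookup("ce01.example.org", now=300.0)  # five minutes later
```

A hierarchical deployment would layer such indexes, each caching its children's entries and accepting the same temporary inconsistencies.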
Workload Management System
► User computational tasks (jobs) are described
▸ Through a Job Description Language (task-based)
▸ Through a standard invocation interface (service-based)
► The Resource Broker matches job requirements with available resources and makes resource assignment decisions
▸ Uses information on available resources, data location, dynamic load, etc.
► Follows the job life cycle: input data transfer, execution and monitoring, output data transfer, etc.
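As a rough illustration of the brokering step (not gLite or Condor code; the resource attributes and ranking policy are invented), a broker can filter resources against the job's requirements, then rank survivors by data locality and load:

```python
# Match a job's requirements against resources reported by the information
# service, then rank candidates: prefer sites holding the input data, then
# the least-loaded site.
def broker(job, resources):
    candidates = [r for r in resources
                  if r["os"] == job["os"] and r["memory_gb"] >= job["min_memory_gb"]]
    if not candidates:
        return None
    candidates.sort(key=lambda r: (job["input_data"] not in r["datasets"], r["load"]))
    return candidates[0]["name"]

resources = [
    {"name": "siteX", "os": "linux", "memory_gb": 8, "load": 0.9, "datasets": {"d1"}},
    {"name": "siteY", "os": "linux", "memory_gb": 16, "load": 0.2, "datasets": set()},
]
job = {"os": "linux", "min_memory_gb": 4, "input_data": "d1"}
best = broker(job, resources)  # siteX wins: it holds the input data
```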
Data Management System
► Infrastructure-wide file management system, based on heterogeneous local storage
► Different storage performance and use
▸ Disks
▸ Tapes
▸ Mass storage devices...
► File catalogs provide a transparent view of distributed data files
► Additional services may provide data distribution, data transfer load balancing, improved reliability and performance through replication, etc.
► Data access patterns matter
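The transparent view offered by a file catalog can be illustrated with a toy mapping from logical file names (LFNs) to physical replicas; all names and URLs below are made up:

```python
# One logical name maps to several replicas on heterogeneous storage;
# resolution prefers fast media (disk over tape).
catalog = {
    "lfn:/project/sample.dat": [
        {"url": "gsiftp://se1.siteX/data/sample.dat", "medium": "disk"},
        {"url": "gsiftp://tape.siteY/arch/sample.dat", "medium": "tape"},
    ],
}

def resolve(lfn, prefer="disk"):
    replicas = sorted(catalog.get(lfn, []), key=lambda r: r["medium"] != prefer)
    return replicas[0]["url"] if replicas else None

url = resolve("lfn:/project/sample.dat")
```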
Metadata management
► The file catalogs use metadata associated to files
▸ File size, checksum...
► Other database services are commonly needed for storing user-defined metadata using more general relational schemas. They provide:
▸ A standard interface to different DBMSs
▸ A certificate-based authentication and authorization policy
► Centralized DBMSs are adopted whenever possible
► Distributed relational database systems are challenging
▸ Data distribution policies
▸ Distributed query optimization
▸ Distribution challenges
∙ Coherency of data
∙ Concurrent updates
∙ Transaction management
Middleware examples
Global computing example: Condor
► Workload management
► Heterogeneous resources
► Delivers High Throughput Computing
▸ For many experimental sciences, computing throughput matters. The focus is not instantaneous computing power but the amount of computing that can be harnessed over a long period.
▸ HTC is a 24-7-365 activity: fault tolerance is critical
► Batch-oriented system
▸ Batch extended with Job Control Languages to face grid heterogeneity
► Distributed computing IS difficult
▸ Team of ~35 faculty, full-time staff and students (U. Wisconsin)
▸ Established in 1985
▸ Faces software/middleware engineering challenges in a UNIX/Linux/Windows/OS X environment
Matchmaking heterogeneous
resources
► Run jobs in a variety of environments
▸ Local dedicated clusters (machine rooms)
▸ Local opportunistic (desktop) computers
▸ Grid environments; interface to other systems
[Figure: the matchmaking process. Job requests ("I need a Mac for this code to run", "I need a Linux box with 2 GB RAM") are matched by the Matchmaker against desktop computers and dedicated clusters.]
Matchmaking
► Condor conceptually divides people into three groups (who may or may not be the same people)
▸ Job submitters
▸ Machine owners
▸ Pool (cluster) administrators
► Job submitter preferences
▸ I need Linux, and I prefer faster machines
► Machine owner preferences
▸ I prefer jobs from the physics group
▸ I will only run jobs between 8pm and 4am
▸ I will only run certain types of jobs
▸ Jobs can be preempted if something better comes along
► System administrator preferences
▸ When can jobs preempt other jobs?
▸ Which users have higher priority?
▸ Do some groups of users have allocations of computers?
Matchmaking
► Matchmaking is two-way
▸ Job describes what it requires:
∙ I need Linux && 8 GB of RAM
▸ Machine describes what it provides:
∙ I will only run jobs from the Physics department
► Matchmaking allows preferences
▸ I need Linux, and I prefer machines with more memory but will run on any machine you provide me
► ClassAds Job Description Language (JDL)
▸ Stating facts
∙ Job's executable is analysis.exe
∙ Machine's load average is 5.6
▸ Stating preferences
∙ I require a computer with Linux
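The two-way evaluation can be mimicked in plain Python (a sketch of the idea, not Condor's ClassAd evaluator; the predicates and attribute values are invented): each ad carries a Requirements predicate evaluated against the other ad, and a match needs both to hold.

```python
# Each side states facts (plain attributes) and a requirement over the
# other side; matchmaking succeeds only if both requirements evaluate true.
job = {
    "Cmd": "analysis.exe",
    "Requirements": lambda me, other: other["OpSys"] == "LINUX"
                                      and other["Memory"] >= 8192,
}
machine = {
    "OpSys": "LINUX",
    "Memory": 16384,
    "Requirements": lambda me, other: other.get("Owner") != "untrusted",
}

def matches(a, b):
    # evaluate both Requirements in the context of both ads
    return a["Requirements"](a, b) and b["Requirements"](b, a)

ok = matches(job, machine)  # both requirements hold
```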
ClassAds JDL
► ClassAds are:
▸ Semi-structured: Attribute = Expression
▸ Schema-free, user-extensible
► Extensible declaration
▸ HasJava_1_4 = TRUE
▸ ShoeLength = 7
► Extensible matchmaking
▸ Requirements =
    OpSys == "LINUX" &&
    HasJava_1_4 == TRUE
► Example
  MyType       = "Job"
  TargetType   = "Machine"
  ClusterId    = 1377
  Owner        = "roy"
  Cmd          = "analysis.exe"
  Requirements =
    (Arch == "INTEL")
    && (OpSys == "LINUX")
    && (Disk >= DiskUsage)
    && ((Memory * 1024) >= ImageSize)
  ...
Matchmaking diagram
[Figure: matchmaking components. The Matchmaker comprises a Negotiator (matchmaking service) and a Collector (information service). (1) The condor_schedd job queue service and the machines publish ClassAds (e.g. Type = "Machine"; Requirements = "..."), including dynamic information (load...), to the Collector; (2) the Negotiator consults the Collector; (3) the Negotiator contacts the condor_schedd, which manages the job queue.]
Job submission
► Users submit jobs to a scheduler
▸ Jobs described as ClassAds
▸ Each scheduler has a queue
▸ Schedulers / queues are not centralized
► Negotiator
▸ collects the list of computers
▸ contacts each schedd (What jobs do you have to run?)
▸ compares each job to each computer to find a match
∙ Evaluates the requirements of job & machine in the context of both ClassAds
∙ If both evaluate to true, there is a match
► Fault-tolerant scheduler
▸ Resubmission
▸ Fail-over scheduler
Job submission
[Figure: job submission components. condor_submit hands the job to condor_schedd, which manages the queue; condor_shadow manages the job on the submit side. In the Matchmaker, condor_negotiator performs matchmaking and condor_collector stores ClassAds. On the execution side, condor_startd manages the computer and condor_starter manages the job.]
Deployment example
► Always a single matchmaker
► Possibly several schedulers
[Figure: deployment example. A Matchmaker (with fail-over matchmakers) serves a dedicated cluster and desktop machines; pool schedds and desktop schedds provide many job queues, backed by a fail-over schedd.]
Job policies
► Implemented through ClassAd expressions
▸ periodic_remove, periodic_hold, periodic_release and on_exit_remove attributes
▸ periodic_remove = (JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > 43200)
► Examples of user policies
▸ If my job runs too long, kill it
▸ If my job runs too long, run it somewhere else
▸ If my job finished with exit code 12, run it again
► Examples of system policies
▸ If a job has run more than 12 hours and is using more than 2GB of RAM, then kill it
▸ How many batch slots per computer?
▸ Will only run jobs during the day
▸ Authorized users (can control fair share)
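The periodic_remove expression above is just a predicate over the job's ClassAd, re-evaluated periodically. A hypothetical Python rendering (JobStatus == 2 meaning "running"; 43200 s = 12 h):

```python
# Remove a running job once it has spent more than 12 hours in its
# current status; attribute names follow the slide's expression.
def periodic_remove(ad, current_time):
    return ad["JobStatus"] == 2 and (current_time - ad["EnteredCurrentStatus"]) > 43200

ad = {"JobStatus": 2, "EnteredCurrentStatus": 1000}
keep = periodic_remove(ad, current_time=1000 + 3600)      # after 1 h: False
remove = periodic_remove(ad, current_time=1000 + 50000)   # after ~14 h: True
```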
Large-scale computing techniques
► Flocking
▸ Metascheduling: connect a scheduler to several Condor pools
► Condor-G
▸ Grid interface
▸ Including Condor-C, a Condor interface(!)
► Pilot jobs
▸ Job-based reservation of resources and application-level scheduling
Flocking
► Submit a scheduler to several pools
▸ Share Condor pools between institutions
▸ Try to run on the local pool first, then try to run on a remote pool
► Networking issues with private networks
▸ A communication broker may be needed if the scheduler is not able to communicate directly with every execute node
[Figure: a schedd talks to the matchmakers of several pools and reaches execute nodes (startd) through a broker in three steps (1, 2, 3) when direct communication is impossible.]
Condor-G
► Submit jobs to other grid systems
▸ Minimal changes to the job description
► Grids
▸ Globus 2
▸ Globus 4
▸ Amazon EC2
▸ Nordugrid
▸ Unicore
▸ PBS
▸ LSF
▸ Condor
▸ ...
[Figure: on the submit host, the schedd's queue hands jobs to condor_gridmanager, which talks to the remote grid container; a Custom Advertiser publishes to the Matchmaker (negotiator, collector).]
Condor-C
► Connect two Condor pools under different administrative domains
► Easier connectivity requirements
▸ Scheduler A only communicates with scheduler B and delegates the running of jobs (instead of contacting all startds)
▸ Schedd A still receives notifications of delegated jobs' progress
[Figure: condor_submit feeds Schedd A (job caretaker); its gridmanager talks through condor-gahp to Schedd B, which works with its matchmaker and startds; the gahp isolates the interaction with the grid system.]
Pilot jobs
► Pilot jobs are application-level scheduler jobs that, once executing on a grid resource, schedule other jobs on this resource
▸ Resource allocation through a batch system (bypasses the grid workload manager)
▸ Enables application-level scheduling
► Example: overlaying Condor on another system
▸ Submit startd as a grid job to start a new pilot
▸ Grow the Condor pool with the new startd daemon
► Limitations
▸ In push mode (e.g. Condor), the scheduler has to be able to open communication with the pilot jobs (or use a communication broker)
▸ Security is tricky (whose job is run by the pilot?)
▸ System administrators do not like pilots so much (although users just love them)
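The idea can be reduced to a small sketch (names invented; real pilot frameworks add matching, fault handling and security): once the pilot runs on a reserved slot, it drains an application-level task queue instead of going back through the grid workload manager.

```python
from collections import deque

# Pilot main loop: pull user tasks from the application's own queue and
# run them on the resource the pilot already occupies.
def run_pilot(task_queue, execute):
    done = []
    while task_queue:
        task = task_queue.popleft()   # application-level scheduling decision
        done.append(execute(task))
    return done

queue = deque(["job-1", "job-2", "job-3"])
results = run_pilot(queue, execute=lambda t: t + ":done")
```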
Condor pilot jobs
► startd can be run as a grid pilot job (Condor Glide-in)
[Figure: (1) condor_submit submits pilot P (the job is Condor itself!) through the schedd to the grid; (2) jobs J1, J2... are then submitted; the Condor Central Manager dispatches J1, J2 to the pilots P running on grid resources.]
From Pilots to Clouds
► Pilots: resource dedication through job submission
► Clouds: resource allocation
[Figure: left, pilot submission over the grid: the application tasks manager goes through the resources broker to place pilots on sites X and Y; right, cloud resource reservation: the application tasks manager reserves resources directly from the cloud resources manager.]
Wrap-up
► Condor develops extensible job management interfaces
▸ Focus on scalability
▸ Tackles challenges of large-scale clusters / grids
► Reliability, reliability, reliability
▸ Loosely coupled components
▸ Modularized code
▸ Many internal safeguards
► Mostly focused on job management
▸ Hardly any concern for processed data
▸ No other grid services
▸ Not even workload balancing
► Not a complete middleware
▸ Condor is one middleware service
Global computing example: gLite
► gLite: “Lightweight” middleware for grid computing
▸ http://www.glite.org/, doc, download, etc.
▸ Apache 2-like license
► gLite and the EGI grid
▸ The EGI infrastructure is a grid of clusters
▸ Interface to multiple batch systems, data managers, etc.
▸ Global computing model
► gLite middleware stack
▸ Reusing existing technologies
▸ Higher-level services
▸ Compatibility with (some) standards
[Figure: the stack comprises gLite components on top of the Virtual Data Toolkit (VDT), CONDOR, GT2 components (GSI, GridFTP), local systems (PBS, LSF...) and the OS (including SSL, Apache...).]
gLite Grid Middleware Services
[Figure: gLite service groups. Access: CLI, API. Security Services: Authentication, Authorization, Auditing. Information & Monitoring Services: Information & Monitoring, Application Monitoring. Data Management: Metadata Catalog, File & Replica Catalog, Storage Element, Data Movement. Workload Mgmt Services: Accounting, Job Provenance, Package Manager, Computing Element, Workload Management. All built over Connectivity.]
gLite Information system
► Lightweight Directory Access Protocol (LDAP)
▸ Directory: specialized database optimized for reading and searching; holds descriptive, attribute-based information
▸ No complex DB transactions
▸ High-volume lookup
▸ Distribution, hierarchical (e.g. c=UK, c=FR; o=CNRS; ou=I3S, ou=DR20)
▸ Replication and caching, accepting temporary inconsistencies
► Berkeley Database Information Index (BDII)
▸ LDAP database updated externally
▸ Merges a number of sources using the LDAP Data Interchange Format
► Grid Laboratory Uniform Environment (GLUE) schema
▸ Common data model for all resources
▸ Contains information on Sites, Services, Computing and Storage
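The directory structure can be modelled as a toy key-value tree (the DNs and attribute values below are invented, loosely patterned on the GLUE schema; this is not a real LDAP client):

```python
# Entries keyed by distinguished names built from hierarchical components;
# a search scans a subtree (suffix match on the DN) with an attribute filter.
directory = {
    "GlueCEUniqueID=ce01.example.org,mds-vo-name=siteX,o=grid":
        {"GlueCEInfoTotalCPUs": 128, "GlueCEStateStatus": "Production"},
    "GlueCEUniqueID=ce02.example.org,mds-vo-name=siteY,o=grid":
        {"GlueCEInfoTotalCPUs": 64, "GlueCEStateStatus": "Draining"},
}

def search(base, flt):
    return [dn for dn, attrs in directory.items()
            if dn.endswith(base) and flt(attrs)]

prod = search("o=grid", lambda a: a["GlueCEStateStatus"] == "Production")
```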
Workload Management System
► From job submission to job execution
[Figure: scheduling layers: pre-scheduling, meta-scheduling (WMS), scheduling.]
Workload Management System
► WMS = collection of services
Workload Management System
► Helps the user access computing resources
▸ resource brokering
▸ management of input and output
► Different job types
▸ parametric jobs
▸ management of simple workflows (Directed Acyclic Graphs)
▸ support for MPI jobs
► Reliability
▸ Support for complete and shallow resubmission in case of failure
► Uses the information system
▸ Collection of information from many sources (BDII, R-GMA...)
▸ Support for data management interfaces
► Support for file peeking during job execution (Job File Perusal)
Batch-oriented process
► Users describe tasks using the CONDOR ClassAds language
▸ Executable program
▸ Command line parameters
▸ Input and output sandboxes
▸ Input and output data files
▸ Node selection constraints

Example:
Executable = "application.exe";
Arguments = "-s -in config.data";
InputSandbox = {"application.exe"};
StdOutput = "out";
StdError = "err";
OutputSandbox = {"out", "err"};
InputData = {"lfn:/biomed/RadioImg08.dicom"};
DataAccessProtocol = {"gsiftp"};
Requirements = other.GlueCEUniqueID == "egee.fr";

► Computing Element (CE) front-end for each cluster
▸ Manages batch system heterogeneity
▸ Performs global (cross-site) monitoring
► Job execution environment
▸ UNIX job running under a normal user account on a UNIX node
▸ Executable may be shipped with the job (input sandbox) or pre-installed (per site)
▸ User environment/files cleaned up for every job
Job information and workload mgt
► Logging and Bookkeeping (L&B) service
▸ Tracks jobs during their lifetime (in terms of events)
► The L&B Proxy provides faster, synchronous and more efficient access to L&B services for the Workload Management Services
► Support for “site reputability ranking”
▸ Maintains recent statistics of job failures at sites
▸ Feeds back to the WMS to facilitate planning
Local users mapping: glexec
► Users have no dedicated accounts on worker nodes
▸ Pool accounts are used
► glexec changes the local UID as a function of the user identity
► Enables a VO-agent to control different users' jobs
Data Management
► Data Management System
▸ Clients, disk or tape storage resources
► Global file catalog (LFC)
▸ Critical centralized database, with possible replicas (RO)
► Common file access interface: SRM (Storage Resource Manager)
▸ Negotiable transfer protocols (GridFTP, gsidcap, RFIO, …)
▸ Various implementations from external projects
► LCG File access API
▸ Abstractions for Storage Element, File Catalog, Information System
[Figure: the client uses the LCG file access service, which resolves files through the LFC file catalog and reaches storage systems 1 and 2 via their local catalogs behind the common SRM interface.]
Metacomputing example: DIET
► DIET (Distributed Interactive Engineering Toolbox)
▸ http://graal.ens-lyon.fr/~diet
▸ Focus on distributed computing, scalability
▸ Some integration of data (data transfers for computation, no real data storage)
▸ Exploits a Remote Procedure Call (RPC) interface
► Architecture
▸ Each computing resource installs a Server Daemon (SeD)
▸ Hierarchical set of agents for load balancing: Leader Agents (LA), Master Agents (MA)
DIET components
► Client
▸ GridRPC client able to interface to DIET
► Server Daemon (SeD)
▸ Front-end to a computational server: GridRPC enabled
▸ Manages a processor (single server) or a cluster (batch interface)
▸ Application software is pre-installed on SeDs
▸ Updates information on the node (CPU availability, memory...)
▸ Records a list of locally available software
▸ Stores the list of the data available on a server (possibly with their distribution and the way to access them)
► Remote Procedure Calls
▸ Synchronous and asynchronous calls to remote procedures
▸ Wait / probe / cancel on asynchronous calls
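The call semantics (an asynchronous call returns a handle that can be probed, waited on or cancelled) resemble Python futures; the sketch below only mimics the behaviour described above, it is not the DIET or GridRPC API.

```python
from concurrent.futures import ThreadPoolExecutor

def solve(problem, x):
    # stand-in for a remote computational service on a SeD
    return {"square": x * x}[problem]

with ThreadPoolExecutor(max_workers=2) as pool:
    handle = pool.submit(solve, "square", 7)  # asynchronous call
    running = not handle.done()               # probe: non-blocking status check
    result = handle.result()                  # wait for completion
    # handle.cancel() would attempt to cancel a not-yet-started call
```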
DIET components
► Master Agent (MA)
▸ Distributed servers receiving client requests
▸ Makes scheduling decisions: picks the best computing resource
▸ Enables direct connection between the client and the computer
► Leader Agent (LA)
▸ Transmits requests and information between MAs and SeDs
▸ Holds a list of requests
▸ Knows the servers that can solve a given problem, and related information
▸ A specific LA hierarchy is deployed depending on the network topology
► Flexible and scalable scheme
▸ Various SeDs, LAs and MAs can be implemented
▸ Hierarchical load distribution
Scalability
► Most trivial deployment: one MA, directly connected to all SeDs in the platform
► Multiple MAs
▸ Deploy multiple MAs with known “neighbors”
▸ Manage several sub-systems
▸ Accept more client connections
▸ An MA attempts to answer the requests of its clients; if the MA finds no resources, the request is forwarded to other MAs
▸ No client load distribution
► Hierarchy of LAs
▸ Hierarchical load distribution to handle very large systems
▸ Experimented with up to 10,000s of SeDs
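The multi-MA lookup can be sketched as follows (class and service names invented): an MA first checks its own sub-system, then forwards unresolved requests to neighbour MAs, avoiding cycles.

```python
class MasterAgent:
    def __init__(self, name, services, neighbors=()):
        self.name = name
        self.services = set(services)    # services offered by local SeDs
        self.neighbors = list(neighbors)

    def resolve(self, service, visited=None):
        visited = set() if visited is None else visited
        visited.add(self.name)
        if service in self.services:
            return self.name
        for ma in self.neighbors:        # forward to other MAs
            if ma.name not in visited:
                found = ma.resolve(service, visited)
                if found:
                    return found
        return None

ma_b = MasterAgent("MA-B", {"matmul"})
ma_a = MasterAgent("MA-A", {"fft"}, neighbors=[ma_b])
where = ma_a.resolve("matmul")  # forwarded from MA-A to MA-B
```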
Scheduling
► Customizable scheduler
▸ Round-robin, Random, FAST...
▸ Possible scheduler plugins (application specific)
► Scheduling information collected by CoRI
▸ Static: number of processors, memory, cache and disk performance...
▸ Dynamic: execution time prediction, time since last execution, CPU load, disk space...
► Performance estimation function
▸ Needed for smart scheduling
▸ Requires application code performance analysis, which is not always possible
▸ Attempts to collect statistics from run history (NWS, FAST...)
CoRI/FAST
► Information collection: CoRI
▸ Modular/extensible collectors: CoRI-Easy, NWS, Ganglia, FAST
▸ The CoRI manager aggregates information for scheduling
▸ Extensible application-dependent metrics
► Probes: FAST
▸ Dynamic performance forecasting
▸ CPU monitoring
▸ Extended NWS network monitoring and forecasting
▸ Models the time and space needed by applications:
∙ Collects (procedure, machine, parameters) records
∙ Estimates the time for a given run
Data Management
► GridRPC model
▸ Data transmitted as procedure parameters
▸ No data persistence, no data management
► Data Tree Manager
▸ Enables persistent data on computing nodes
▸ Logical data manager hierarchy mapped on the DIET agents hierarchy
▸ Physical data manager on each SeD, interfaced to a data mover service
▸ No data access control
Fault Tolerance
► Failure detection
▸ Heartbeat + timeout method
▸ WAN: unpredictable round-trip time, trade-off between long failure detection time and accuracy
► MA/LA failure recovery
▸ Agent topology adaptation on failure detection
► SeD failure recovery
▸ Potentially large computing time wasted
▸ Redundant computations: efficient but costly
▸ Checkpointing / restarting implemented at service level
▸ Trade-off between checkpointing overhead and benefit
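The heartbeat + timeout detector reduces to one comparison; choosing the timeout embodies the trade-off mentioned above (short timeouts detect failures quickly but mis-suspect slow WAN links). The component names and values below are arbitrary.

```python
# A component is suspected failed when its last heartbeat is older than
# the timeout, measured at observation time `now`.
def suspected(last_heartbeat, now, timeout):
    return (now - last_heartbeat) > timeout

heartbeats = {"LA-1": 100.0, "SeD-3": 40.0}   # last heartbeat times (s)
now, timeout = 130.0, 60.0
failed = [c for c, t in heartbeats.items() if suspected(t, now, timeout)]
```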
GoDIET
► DIET deployment tool
▸ XML description of the resources
▸ Desired agents and server hierarchy
▸ Additional services
▸ Writes configuration files
▸ Stages files to remote resources
▸ Launches components in a timely order
Deployment and operations
The EGI Grid example
► European Grid Infrastructure
► Production infrastructure
▸ Operates a large-scale, production-quality grid infrastructure for e-Science
▸ This is a 24-7-365 activity
In 2015:
> 50 countries
> 315 sites
> 450,000 CPU cores
> 280 PB disks
> 120 PB tapes
> 200 VOs
> 20,000 users
> 45 Mjobs / month
Capacity
► Regular growth of the infrastructure since 2004
▸ CIC Portal: http://cic.gridops.org/
▸ Accounting Portal: http://www3.egee.cesga.es/
▸ Average 1.6 Mjobs / day in 2012
▸ 4.4 million normalized CPU years since 2004
▸ 1.6 billion jobs processed since 2004
User communities
► Regular growth since 2004
▸ > 20,000 registered users
gLite Software Process
[Figure: gLite software process. Software development (driven by technical input and error fixing) produces software; integration builds packages and runs integration tests (fail → back to error fixing); testing & certification deploys the packages on a testbed and runs functional tests, then scalability tests (failures are reported back as problems); on success, a release (RPM-based distribution with installation guide, release notes, etc.) goes through pre-production deployment before reaching the production infrastructure; serious problems found in production feed back into development.]
EGI Operation Services
Support Structures & Processes
Training infrastructure
Training activities
Grid management: structure
• Operations Coordination Centre (OCC)
▸ management, oversight of all operational and support activities
▸ Grid Operator on Duty
• Regional Operations Centres (ROC)
▸ providing the core of the support infrastructure, each supporting a number of resource centres within its region
• Resource centres
▸ providing resources (computing, storage, network, etc.)
• Grid User Support (GGUS)
▸ at FZK, coordination and management of user support, single point of contact for users
Grid monitoring tools
► Tools used by the Grid Operator on Duty team to detect problems
► CIC portal http://cic.gridops.org/
▸ Single entry point
▸ Integrated view of monitoring tools
► Site Functional Tests (SFT)
► Service Availability Monitoring (SAM)
► Grid Operations Centre Core Database (GOCDB)
► GIIS monitor (Gstat)
► ...
Example report
► Availability = the fraction of time the site was up
► Reliability = Availability / scheduled availability
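A worked example of the two metrics (the numbers are invented): over a 720-hour month with 684 hours of uptime and 24 hours of announced maintenance,

```python
period_hours = 720.0        # one month
uptime_hours = 684.0
scheduled_downtime = 24.0   # announced maintenance

availability = uptime_hours / period_hours                                   # 0.95
scheduled_availability = (period_hours - scheduled_downtime) / period_hours  # 0.967
reliability = availability / scheduled_availability                          # ~0.983
```

Reliability discounts scheduled downtime, so it is higher than raw availability whenever maintenance was announced.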
User support and feedback
► GGUS ticket submission and tracking portal
▸ Problem reporting, logging and traceability
▸ Collects, groups and prioritises requirements from the VOs and from within operations
▸ Dispatches tickets to the entities concerned (operations, middleware development, VO management...)
The Aladdin / Grid5000 example
► Research-oriented infrastructure
▸ Understand behavior of grid systems
▸ Between production-scale and simulation
► In 2014
▸ 7700 CPU cores deployed in 9 sites over France
Aladdin / Grid5000 architecture
► Multi-cluster environment
▸ One gateway per site
▸ Multiple clusters per gateway
► Multi-site user account management
▸ User accounts replicated on all sites
▸ NFS file sharing in each site
► Reserve resources, deploy anything
▸ Advance reservation of resource pools
▸ System image deployment on reserved nodes
▸ Node hot rebooting (with a specified system image)
▸ Root access to all nodes
► Security
▸ Confined environment (only ssh traffic from outside to the gateways, no outbound connections from worker nodes)
Aladdin / Grid5000 network
► Dedicated 10 Gbit/s links
► “Dark fiber” dedicated lambdas
► Monitoring tools
Tooling
► OAR2
▸ First-fit scheduler with resource matching (job/node properties)
▸ Advance reservation
▸ Batch and interactive jobs
▸ Walltime
▸ Hold and resume jobs
▸ Multi-queues with priority
▸ Best-effort queues (for exploiting idle resources)
▸ Epilogue/prologue scripts
▸ No daemon on compute nodes; rsh and ssh as remote execution protocols
▸ Dynamic insertion/deletion of compute nodes
▸ Logging
► kdeploy
▸ System image deployment on reserved nodes
Tooling
► Monitoring tools
► Schedules
References
► EGEE/EGI
▸ E. Laure, H. Stockinger and K. Stockinger, “Performance Engineering in Data Grids”, Concurrency and Computation: Practice & Experience, vol. 17(2-4), Feb.-April 2005.
► DIET
▸ E. Caron and F. Desprez, “DIET: A Scalable Toolbox to Build Network Enabled Servers on the Grid”, International Journal of High Performance Computing Applications, 20(3):335-352, 2006.
► Condor
▸ D. Thain, T. Tannenbaum and M. Livny, “Distributed Computing in Practice: The Condor Experience”, Concurrency and Computation: Practice and Experience, vol. 17(2-4):323-356, Feb.-April 2005.
► OAR
▸ N. Capit, G. Da Costa, Y. Georgiou, G. Huard, C. Martin, G. Mounié, P. Neyron and O. Richard, “A batch scheduler with high level components”, in Cluster Computing and Grid, 2005.