2014 SCENT HPC Summer School@GIST
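Introduction to HTCondor and the HTC@PLSI Deployment Case
National Supercomputing Institute
Supercomputing Service Center
Supercomputing Service Integration Office
박주원 (Juwon Park)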
Korea Institute of Science and Technology Information
[email protected]
HTC Service Overview
What
HTC (High Throughput Computing)
 A way of using computing resources to process a large number of mutually independent subtasks
< HPC vs HTC >
Computation need
  HPC: Large computing power
  HTC: Large computing power
Duration
  HPC: Short (hours, days)
  HTC: Long (months, years)
Interest
  HPC: How fast an individual job can complete
  HTC: How many jobs can complete over a long period of time
Communication
  HPC: Tightly coupled
  HTC: Loosely coupled
Resource Reqs.
  HPC: Must execute within a particular site with low-latency interconnects
  HTC: Independent, sequential jobs can be individually scheduled on many different computing resources across multiple sites
Why
Broader range of fields supported by supercomputers
 Fields such as astronomy and space science and medicine request processing of workloads made up of 10,000 or more subtasks
More efficient use of supercomputers
 High-performance supercomputing resources are provided efficiently to the fields that need them
HTC Service Overview
① HTC Service in XSEDE (USA)
The HTC service in XSEDE uses Open Science Grid (OSG) resources.
OSG includes a Condor pool (GLOW) provided by CHTC.
Meta-scheduling is supported through Condor-G.
The BOSCO project is under way to provide HTC service across clusters that use different kinds of local schedulers.
HTC Service Overview
② HTC Service in EGI (Europe)
Europe uses EGI (European Grid Infrastructure) to provide HTC service.
Research is also under way on "a grid of clouds" that integrates cloud resources with the existing Grid.
EGI uses GridWay as its meta-scheduler.
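HTC Service Overview
③ HTC Service @ KISTI
HTCaaS
Submission and automatic splitting of large-scale computational jobs, based on Meta-Jobs (OGF JSDL standard)
Multi-level scheduling based on agents (pilot jobs)
A data management framework for using the local storage of the compute resources
A range of client interfaces, including a web interface, a Java API, and client programs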
• Introduction
• Architecture
• Getting Started
• Administration
• Use Cases
Job Scheduler
[Figure: a user runs jobs and checks their results on computing resources through a job scheduler.]
Job Scheduler
HTCondor
Introduction to using HTCondor
Open source project out of the University of Wisconsin-Madison
http://www.cs.wisc.edu/condor
Established in 1985 to do research and development of distributed high-throughput computing
John (TJ) Knoeller
Zach Miller
Todd Tannenbaum
Miron Livny
Introduction to using HTCondor
HTCondor Features
Open Source (Free!!)
Interoperates with many types of computing grids
Manages both dedicated CPUs (clusters) and non-dedicated resources
Very configurable, adaptable
Flexible multi-clustering (flocking)
HTCondor Architecture
Definition
ClassAd
 HTCondor’s internal data representation
Matchmaking
 associating a job with a machine resource
Central Manager
 central repository for the whole pool
Submit Host
 the computer from which jobs are submitted to HTCondor
Execute Host
 the computer on which HTCondor runs jobs
HTCondor Architecture
[Figure: daemons by machine role. Central Manager: master, collector, negotiator. Submit Machine: master, schedd, shadow, with the job queue fed by submit. Execute Machine: master, startd, starter, with the running job.]
HTCondor Architecture
ClassAd
 the language that HTCondor uses to represent information about jobs (job ClassAd), machines (machine ClassAd), and the programs that implement HTCondor's functionality (called daemons)
Job ClassAd
…
Out = "out/out.2699"
Cmd = "monte_int"
TransferInput = "data.2699"
UserLog = "log.2699"
Owner = "p377han"
Requirements = ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer
…
Machine ClassAd
…
Machine = "glory254.plsi.or.kr"
OpSysAndVer = "CentOS5"
JavaVersion = "1.4.2"
CondorVersion = "$CondorVersion: 7.8.8 Mar 20 2013 BuildID: 110288 $"
HardwareAddress = "00:1b:24:78:37:65"
COLLECTOR_HOST_STRING = "glorymg01.plsi.or.kr"
SubnetMask = "255.255.255.0"
…
HTCondor Architecture
Matchmaking
 The matchmaker matches job ClassAds with machine ClassAds, taking into account:
• Requirements of both the machine and the job
• Rank of both the job and the machine
• Priorities, such as those of users and groups
[Figure: matchmaking pairs a Job ClassAd with a Machine ClassAd.]
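The matching rule can be illustrated with a small Python sketch. It is a toy model, not HTCondor's negotiator: ClassAds are plain dicts, Requirements and Rank are Python callables, user and group priorities are left out, and the machine names (node1 to node3) and their attribute values are invented for illustration.

```python
# Toy matchmaking: ClassAds are modeled as dicts and the Requirements/Rank
# expressions as Python callables (an assumption made for this sketch;
# real ClassAd expressions are evaluated by HTCondor itself).

def match(job, machines):
    """Return the name of the machine the job would be matched to, or None."""
    candidates = [
        m for m in machines
        # Both sides must accept each other.
        if job["Requirements"](m) and m["Requirements"](job)
    ]
    if not candidates:
        return None
    # Among the acceptable machines, the job's Rank picks the preferred one.
    best = max(candidates, key=job["Rank"])
    return best["Machine"]


# Hypothetical machine ads; every machine accepts any job here.
machines = [
    {"Machine": "node1", "Arch": "X86_64", "OpSys": "LINUX", "Memory": 2048,
     "KFlops": 900000, "Requirements": lambda job: True},
    {"Machine": "node2", "Arch": "X86_64", "OpSys": "LINUX", "Memory": 4096,
     "KFlops": 1200000, "Requirements": lambda job: True},
    {"Machine": "node3", "Arch": "X86_64", "OpSys": "LINUX", "Memory": 512,
     "KFlops": 2000000, "Requirements": lambda job: True},
]

# Hypothetical job ad: needs a 64-bit Linux machine with at least 1024 MB,
# and prefers the machine with the highest KFlops value.
job = {
    "Owner": "p377han",
    "RequestMemory": 1024,
    "Requirements": lambda m: (m["Arch"] == "X86_64"
                               and m["OpSys"] == "LINUX"
                               and m["Memory"] >= 1024),
    "Rank": lambda m: m["KFlops"],
}

matched = match(job, machines)
print(matched)
```

In this sketch node3 has the highest KFlops value but fails the memory requirement, so Rank only orders the machines that already satisfy Requirements.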
Getting Started
① Choose a universe for the job
② Make the job batch-ready
③ Create a job description file
④ Run condor_submit to put the job in the queue
Getting Started
① Choose a universe for the job
The universe controls how HTCondor handles a job
The many universes include:
 Standard
 Vanilla
   Allows running almost any "serial" job
   Provides automatic file transfer for input and output files
 Grid
 Parallel
 VM
Getting Started
② Make the job batch-ready
Must be able to run in the background
No interactive input
No GUI/window clicks
 The job can still use stdin (keyboard), stdout (screen), and stderr, but files are used instead of the actual devices
Getting Started
③ Create a job description file
A plain text file
File name extensions are irrelevant, although many use .sub or .submit as suffixes
File handling
 Input (ex. $ a < a.in)
 Output (ex. $ a > a.out)
 Error (ex. $ a 2> a.err)
test.jds
Universe = vanilla
Executable = a
Input = a.in
Output = a.out
Error = a.err
Log = a.log
queue
Getting Started
④ Run condor_submit to put the job in the queue
Run condor_submit, providing the name of the submit description file:
$ condor_submit test.jds
condor_submit will:
• Parse the submit description file, checking for errors
• Create a ClassAd that describes the job
• Place the job in the queue
Example
count.sh
multiple.jds
executable = count.sh
universe = vanilla
output = out/out.txt
error = out/err.txt
log = out/log.txt
Queue 50
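As written, all 50 jobs in multiple.jds name the same out/out.txt and out/err.txt. HTCondor's submit language provides the $(Process) macro (the ProcId, counting from 0) to give each job its own files. The Python sketch below only illustrates how such names expand; the file-name pattern and the cluster number 1 are assumptions, not taken from the slides.

```python
def expand(template, cluster, process):
    """Expand $(Cluster) and $(Process) the way a single queued job sees them."""
    return (template.replace("$(Cluster)", str(cluster))
                    .replace("$(Process)", str(process)))


# A variant of multiple.jds with per-job output and error files (hypothetical names).
submit_lines = [
    "executable = count.sh",
    "universe   = vanilla",
    "output     = out/out.$(Process).txt",
    "error      = out/err.$(Process).txt",
    "log        = out/log.txt",
    "queue 50",
]
submit_text = "\n".join(submit_lines) + "\n"

# stdout file names of the 50 queued jobs (ProcId runs from 0 to 49).
output_files = [expand("out/out.$(Process).txt", cluster=1, process=p)
                for p in range(50)]

print(submit_text)
print(output_files[0], output_files[-1])
```

The log file can stay shared, since each event in it is tagged with the job it belongs to.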
Example
$ condor_status
Example
$ condor_submit multiple.jds
Job ID
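A job ID has the form ClusterId.ProcId, and condor_submit reports the cluster number when it queues the jobs. The sketch below derives the individual job IDs from that report. The sample text and the cluster number 42 are illustrative assumptions, and the exact wording of the output may differ between HTCondor versions.

```python
import re

# Illustrative condor_submit output (assumed; not captured from a real run).
SAMPLE = """Submitting job(s)..................................................
50 job(s) submitted to cluster 42.
"""


def parse_submit_output(text):
    """Return the job IDs (ClusterId.ProcId) implied by condor_submit's summary line."""
    m = re.search(r"(\d+) job\(s\) submitted to cluster (\d+)\.", text)
    if m is None:
        raise ValueError("no 'submitted to cluster' line found")
    count, cluster = int(m.group(1)), int(m.group(2))
    return [f"{cluster}.{proc}" for proc in range(count)]


job_ids = parse_submit_output(SAMPLE)
print(job_ids[0], job_ids[-1])
```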
Example
$ condor_status
Example
$ condor_q
[Screenshot annotations: job ID (ClusterId.ProcId), job submitter, job status]
Example
$ condor_hold
$ condor_release
More that you can do with HTCondor
Requirements
A boolean expression
Evaluated with respect to attributes from machine ClassAd(s)
More that you can do with HTCondor
Rank
All matches that meet the requirements can be sorted by preference with a Rank expression
Like Requirements, it is evaluated against attributes from machine ClassAd(s)
test.jds
Universe = vanilla
Executable = a
Input = a.in
Output = a.out
Error = a.err
Log = a.log
Requirements = machine == "test.plsi.or.kr"
Rank = KFLOPS
Queue 500
Administrating HTCondor
① Determine the machine role
② Obtain HTCondor & Install
③ Configure HTCondor
Administrating HTCondor
① Determine the machine role (Central manager)
One Central Manager (Negotiator, Collector) per pool
Multiple submit hosts are possible
Multiple execute hosts
Central Manager
[jwpark@apu1 condor_example]$ ps -ef | grep condor_
condor    2796     1  0 Feb07 ?        02:14:20 condor_master -f
jwpark    3190  2936  0 09:56 pts/0    00:00:00 grep condor_
condor   30896  2796  0 Jun20 ?        00:04:09 condor_collector -f
condor   30897  2796  0 Jun20 ?        00:03:42 condor_negotiator -f
condor   30898  2796  0 Jun20 ?        00:02:15 condor_schedd -f
root     30900 30898  0 Jun20 ?        00:01:37 condor_procd -A /tmp/condor-lock.apu10.901845565605075/procd_pipe.SCHEDD -L /opt/condor/local/log/ProcLog.SCHEDD -R 10000000 -S 60 -C 780
Execute host
[jwpark@apu3 ~]$ ps -ef | grep condor_
condor    2132     1  0 Feb06 ?        00:43:49 condor_master -f
condor    2386  2132  0 Feb18 ?        00:19:26 condor_startd -f
root      7701  2386  0 Feb19 ?        00:39:39 condor_procd -A /tmp/condor-lock.apu30.70573943301952/procd_pipe.STARTD -L /opt/condor/local/log/ProcLog.STARTD -R 10000000 -S 60 -C 780
jwpark   24247 24223  0 09:51 pts/0    00:00:00 grep condor_
Administrating HTCondor
② Obtain HTCondor & Install
Obtain HTCondor (http://research.cs.wisc.edu/htcondor/downloads/)
Install on central manager
Install on execute hosts
Central Manager
$ groupadd condor --gid 780
$ useradd -d /home/condor -g condor --uid 780 --shell /bin/bash condor
$ su
$ ./condor_install --prefix=/opt/condor --local-dir=/opt/condor/local --type=manager,submit --owner=condor
Execute host
$ groupadd condor --gid 780
$ useradd -d /home/condor -g condor --uid 780 --shell /bin/bash condor
$ su
$ ./condor_install --prefix=/opt/condor --local-dir=/opt/condor/local --type=execute --owner=condor --central-manager=apu1-ib
Administrating HTCondor
③ Configure HTCondor
Global configuration file: /opt/condor/etc/condor_config
Edit the local configuration file (/opt/condor/local/condor_config.local)
 Verify the central manager
 Policy settings
 Security settings
 Flocking settings
$ cat /opt/condor/local/condor_config.local
## What machine is your central manager?
CONDOR_HOST = apu1-ib
…
Administrating HTCondor
③ Configure HTCondor
Policy expression
 Start
 Rank
 Suspend
 Continue
 Preempt
 Kill
Administrating HTCondor
③ Configure HTCondor
Always run jobs
START = True
RANK =
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
Administrating HTCondor
③ Configure HTCondor
Prefer “chemistry” job
START = True
RANK = Department == "Chemistry"
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
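A machine's RANK is a numeric preference over jobs, and a boolean expression that is true counts as 1.0 while one that is false counts as 0.0. The sketch below mimics that for the configuration above. The job ads are invented, and Department is assumed to be a custom attribute that each job would have to define itself (for example with +Department = "Chemistry" in its submit file); a job without the attribute is treated here as rank 0.0.

```python
def machine_rank(job_ad):
    """Mimic RANK = Department == "Chemistry": true counts as 1.0, otherwise 0.0."""
    return 1.0 if job_ad.get("Department") == "Chemistry" else 0.0


# Hypothetical job ads waiting for this machine.
jobs = [
    {"Owner": "alice", "Department": "Physics"},
    {"Owner": "bob", "Department": "Chemistry"},
    {"Owner": "carol"},  # no Department attribute
]

# The machine prefers higher-ranked jobs; ties keep their original order.
ranked = sorted(jobs, key=machine_rank, reverse=True)
print([j["Owner"] for j in ranked])
```

Because START = True, the machine still accepts the other jobs; RANK only expresses which job it would rather run.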
Administrating HTCondor
③ Configure HTCondor
Security configuration
ALLOW_WRITE = *.plsi.or.kr
ALLOW_READ = *.plsi.or.kr
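The wildcard in these entries grants access to every host under plsi.or.kr. The Python sketch below illustrates only that hostname-glob idea using the standard fnmatch module; HTCondor's own authorization matching accepts further forms (such as IP addresses and networks) and is not reproduced here. The helper name is_allowed is hypothetical.

```python
from fnmatch import fnmatch

# Pattern list mirroring the ALLOW_WRITE setting above.
ALLOW_WRITE = ["*.plsi.or.kr"]


def is_allowed(hostname, patterns):
    """Return True if the hostname matches any wildcard pattern (case-insensitive)."""
    host = hostname.lower()
    return any(fnmatch(host, pattern.lower()) for pattern in patterns)


print(is_allowed("glory254.plsi.or.kr", ALLOW_WRITE))
print(is_allowed("example.com", ALLOW_WRITE))
```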
Administrating HTCondor
③ Configure HTCondor
Flocking configuration
FLOCK_FROM = titan, venus
FLOCK_TO = titan, venus, kobic
Use Cases
U.S. Geological Survey (Monte Carlo simulation)
Groundwater distribution modeling
Use Cases
Duke University
Protein structure modeling
Use Cases
University of Wisconsin – Madison
IceCube neutrino analysis
HTC@PLSI Deployment Case
[Figure: HPC jobs are the main mission of the 4th supercomputer (Tachyon & GAIA), while HTC jobs are placed on PLSI.]
PLSI
 Small-scale resources
 Aging equipment
 Distributed environment
HTC@PLSI Deployment Case
Institution  System               CPU                     Cores     Mem
KU           Venus                Intel Xeon @2.5GHz      112       16GB
KIAS         gene (x86 cluster)   AMD Opteron @2GHz       128       8GB
GIST         titan (x86 cluster)  Intel Xeon @2.5GHz      128       16GB
KOBIC        KOBIC (SUN x4100)    AMD Opteron @2.1GHz     184       4GB
KISTI        GLORY (SUN x2100)    AMD Opteron @1.8GHz     514       2GB
KISTI        kairos (VM cluster)  Intel Xeon @2.5GHz      200       16GB
UOS          t2c (x86 cluster)    Intel(R) Xeon @2.13GHz  80 ~ 120  8GB
Remarks column entries (the rows they belong to are not recoverable from this text): (incorporated in 2014); scheduled to join in the second half of 2014; incorporation under discussion
HTC service infrastructure built using HTCondor's features (@PLSI)
Six geographically distributed sites, 1000+ cores in total
HTC@PLSI Deployment Case
Service policy (default)
HTC@PLSI Deployment Case
Service policy (when dedicated resources are requested)
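HTC@PLSI Deployment Case
HTC@PLSI Use Cases
Yonsei University
 Research on asteroid populations in the main asteroid belt (Monte Carlo simulation)
University of Seoul
 Support for medical and high-energy physics applications (GEANT4 support)
Korea Astronomy and Space Science Institute
 Collaboration on intensive survey research of small solar system bodies along the southern-sky ecliptic (planned)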
HTC@PLSI Deployment Case
Demo!!