2014 SCENT HPC Summer School @ GIST

Introduction to HTCondor and the HTC@PLSI Deployment Case

박주원
National Supercomputing Institute, Supercomputing Service Center, Supercomputing Service Integration Office
Korea Institute of Science and Technology Information

HTC Service Overview

What: HTC (High Throughput Computing)
• A way of using computing resources to process a large number of mutually independent sub-tasks

< HPC vs HTC >
| | HPC | HTC |
|---|---|---|
| Computation need | Large computing power | Large computing power |
| Duration | Short (hours, days) | Long (months, years) |
| Interest | How fast an individual job can complete | How many jobs can complete over a long period of time |
| Communication | Tightly coupled | Loosely coupled |
| Resource reqs. | Must execute within a particular site with low-latency interconnects | Independent, sequential jobs can be individually scheduled on many different computing resources across multiple sites |

Why
• Broaden the fields supported by supercomputers: fields such as astronomy and space science and medicine request processing of jobs made up of more than 10,000 sub-tasks
• Increase the efficiency of supercomputer use: provide high-performance supercomputing resources efficiently to the fields that need them

HTC Service Overview ① HTC Service in XSEDE (USA)
• The HTC service in XSEDE uses Open Science Grid (OSG) resources.
• OSG includes a Condor pool (GLOW) provided by CHTC.
• Condor-G based meta-scheduling is supported.
• The BOSCO project is under way to provide HTC service across clusters that use different kinds of local schedulers.

HTC Service Overview ② HTC Service in EGI (Europe)
• Europe uses EGI (European Grid Infra.) to provide HTC service.
• Research is also under way on "a grid of clouds" that integrates cloud resources with the existing grid.
• EGI uses GridWay as its meta-scheduler.

HTC Service Overview ③ HTC Service @ KISTI
HTCaaS, based on Meta-Jobs (OGF JSDL standard):
• Submission and automatic splitting of large-scale computational jobs
• Multi-level scheduling based on agents (pilot jobs)
• A data management framework for using the local storage of compute resources
• Various client interfaces, including a web interface, a Java API, and client programs

Outline
• Introduction
• Architecture
• Getting Started
• Administration
• Use Cases

[Figure: User, Job Scheduler (HTCondor), Computing resource; job execution and checking of results]

Introduction to using HTCondor
• Open source project out of the University of Wisconsin-Madison (http://www.cs.wisc.edu/condor)
• Established in 1985 to do research and development of distributed high-throughput computing
• John (TJ) Knoeller, Zach Miller, Todd Tannenbaum, Miron Livny

Introduction to using HTCondor: HTCondor features
• Open source (free!!)
• Interoperates with many types of computing grids
• Manages both dedicated CPUs (clusters) and non-dedicated resources
• Very configurable and adaptable
• Flexible multi-clustering (flocking)

HTCondor Architecture: Definitions
• ClassAd: HTCondor's internal data representation
• Matchmaking: associating a job with a machine resource
• Central Manager: central repository for the whole pool
• Submit Host: the computer from which jobs are submitted to HTCondor
• Execute Host: the computer on which jobs run

HTCondor Architecture
[Figure: HTCondor daemons by machine role. Central Manager: master, negotiator, collector. Submit Machine: master, schedd, shadow (submit, queue). Execute Machine: master, startd, starter (job).]

HTCondor Architecture: ClassAd
The language that HTCondor uses to represent information about jobs (job ClassAd), machines (machine ClassAd), and the programs that implement HTCondor's functionality (called daemons).

Job ClassAd:

    …
    Out = "out/out.2699"
    Cmd = "monte_int"
    TransferInput = "data.2699"
    UserLog = "log.2699"
    Owner = "p377han"
    Requirements = ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer …

Machine ClassAd:

    …
    Machine = "glory254.plsi.or.kr"
    OpSysAndVer = "CentOS5"
    JavaVersion = "1.4.2"
    CondorVersion = "$CondorVersion: 7.8.8 Mar 20 2013 BuildID: 110288 $"
    HardwareAddress = "00:1b:24:78:37:65"
    COLLECTOR_HOST_STRING = "glorymg01.plsi.or.kr"
    SubnetMask = "255.255.255.0"
    …

HTCondor Architecture: Matchmaking
The matchmaker matches job ClassAds with machine ClassAds, taking into account:
• Requirements of both the machine and the job
• Rank of both the job and the machine
• Priorities, such as those of users and groups
[Figure: Job ClassAd, Matchmaking, Machine ClassAd]

Getting Started
① Choose a universe for the job
② Make the job batch-ready
③ Create a job description file
④ Run condor_submit to put the job in the queue

Getting Started ① Choose a universe for the job
• The universe controls how HTCondor handles jobs
• The many universes include: Standard, Vanilla, Grid, Parallel, VM
• Vanilla allows running almost any "serial" job and provides automatic file transfer for input and output files

Getting Started ② Make the job batch-ready
• Must be able to run in the background
• No interactive input
• No GUI/window clicks
• The job can still use stdin (keyboard), stdout (screen), and stderr, but files are used instead of the actual devices

Getting Started ③ Create a job description file
• A plain text file
• File name extensions are irrelevant, although many use .sub or .submit as suffixes
• File handling:
    Input (e.g. $a < a.in)
    Output (e.g. $a > a.out)
    Error (e.g. $a 2> a.err)

test.jds:

    Universe   = vanilla
    Executable = a.out
    Input      = a.in
    Output     = a.out
    Error      = a.err
    Log        = a.log
    queue

Getting Started ④ Run condor_submit to put the job in the queue
Run condor_submit, providing the name of the submit description file:

    $ condor_submit test.jds

condor_submit will:
• Parse the submit description file, checking for errors
• Create a ClassAd that describes the job
• Place the job in the queue

Example: count.sh and multiple.jds
multiple.jds:

    executable = count.sh
    universe   = vanilla
    output     = out/out.txt
    error      = out/err.txt
    log        = out/log.txt
    Queue 50

Example: commands

    $ condor_status
    $ condor_submit multiple.jds
    $ condor_status
    $ condor_q
    $ condor_hold
    $ condor_release

[Screenshot annotations: condor_submit reports the Job ID; condor_q shows the Job ID (ClusterId.ProcId), the job submitter, and the job status]

More that you can do with HTCondor: Requirements
• A boolean expression
• Evaluated with respect to attributes from machine ClassAd(s)

More that you can do with HTCondor: Rank
• All matches that meet the requirements can be sorted by preference with a Rank expression
• Like Requirements, it is evaluated against attributes from machine ClassAd(s)

test.jds:

    Universe     = vanilla
    Executable   = a.out
    Input        = a.in
    Output       = a.out
    Error        = a.err
    Log          = a.log
    Requirements = machine == "test.plsi.or.kr"
    Rank         = KFLOPS
    Queue 500

Administering HTCondor
① Determine the machine role
② Obtain HTCondor & install
③ Configure HTCondor

Administering HTCondor ① Determine the machine role
• Central manager: one central manager (negotiator, collector) per pool
• Submit host: multiple submit hosts are possible
• Execute host: multiple execute hosts

Central manager:

    [jwpark@apu1 condor_example]$ ps -ef | grep condor_
    condor    2796     1  0 Feb07 ?      02:14:20 condor_master -f
    jwpark    3190  2936  0 09:56 pts/0  00:00:00 grep condor_
    condor   30896  2796  0 Jun20 ?      00:04:09 condor_collector -f
    condor   30897  2796  0 Jun20 ?      00:03:42 condor_negotiator -f
    condor   30898  2796  0 Jun20 ?      00:02:15 condor_schedd -f
    root     30900 30898  0 Jun20 ?      00:01:37 condor_procd -A /tmp/condor-lock.apu10.901845565605075/procd_pipe.SCHEDD -L /opt/condor/local/log/ProcLog.SCHEDD -R 10000000 -S 60 -C 780

Execute host:

    [jwpark@apu3 ~]$ ps -ef | grep condor_
    condor    2132     1  0 Feb06 ?      00:43:49 condor_master -f
    condor    2386  2132  0 Feb18 ?      00:19:26 condor_startd -f
    root      7701  2386  0 Feb19 ?      00:39:39 condor_procd -A /tmp/condor-lock.apu30.70573943301952/procd_pipe.STARTD -L /opt/condor/local/log/ProcLog.STARTD -R 10000000 -S 60 -C 780
    jwpark   24247 24223  0 09:51 pts/0  00:00:00 grep condor_

Administering HTCondor ② Obtain HTCondor & install
• Obtain HTCondor (http://research.cs.wisc.edu/htcondor/downloads/)
• Install on the central manager
• Install on the execute hosts

Central manager:

    $ groupadd condor --gid 780
    $ useradd -d /home/condor -g condor --uid 780 --shell /bin/bash condor
    $ su
    $ ./condor_install --prefix=/opt/condor --local-dir=/opt/condor/local --type=manager,submit --owner=condor

Execute host:

    $ groupadd condor --gid 780
    $ useradd -d /home/condor -g condor --uid 780 --shell /bin/bash condor
    $ su
    $ ./condor_install --prefix=/opt/condor --local-dir=/opt/condor/local --type=execute --central-manager=apu1-ib --owner=condor

Administering HTCondor ③ Configure HTCondor
• /opt/condor/etc/condor_config
• Edit the local configuration file (/opt/condor/local/condor_config.local):
    check the central manager
    set the policy
    set the security options
    set up flocking

    $ cat /opt/condor/local/condor_config.local
    ## What machine is your central manager?
    CONDOR_HOST = apu1-ib
    …

Administering HTCondor ③ Configure HTCondor: Policy expressions
• Start
• Rank
• Suspend
• Continue
• Preempt
• Kill

Administering HTCondor ③ Configure HTCondor: Always run jobs

    START    = True
    RANK     =
    SUSPEND  = False
    CONTINUE = True
    PREEMPT  = False
    KILL     = False

Administering HTCondor ③ Configure HTCondor: Prefer "chemistry" jobs

    START    = True
    RANK     = Department == "Chemistry"
    SUSPEND  = False
    CONTINUE = True
    PREEMPT  = False
    KILL     = False

Administering HTCondor ③ Configure HTCondor: Security configuration

    ALLOW_WRITE = *.plsi.or.kr
    ALLOW_READ  = *.plsi.or.kr

Administering HTCondor ③ Configure HTCondor: Flocking configuration

    FLOCK_FROM = titan, venus
    FLOCK_TO   = titan, venus, kobic

Use Cases
• U.S. Geological Survey: groundwater distribution modeling (Monte Carlo simulation)
• Duke University: protein structure modeling
• University of Wisconsin-Madison: IceCube neutrino analysis

HTC@PLSI Deployment Case
[Figure: HPC jobs and HTC jobs. Labels: 4th system (4호기), Tachyon & GAIA, main mission, HTC, PLSI]
• Small-scale resources
• Aging equipment
• Distributed environment

HTC@PLSI Deployment Case: Resources

| Institution | System | CPU | Cores | Mem |
|---|---|---|---|---|
| KU | Venus | Intel Xeon @ 2.5GHz | 112 | 16GB |
| KIAS | gene | AMD Opteron (x86 cluster) @ 2GHz | 128 | 8GB |
| GIST | titan | Intel Xeon (x86 cluster) @ 2.5GHz | 128 | 16GB |
| KOBIC | KOBIC | AMD Opteron (SUN x4100) @ 2.1GHz | 184 | 4GB |
| KISTI | GLORY | AMD Opteron (SUN x2100) @ 1.8GHz | 514 | 2GB |
| KISTI | kairos | Intel Xeon (VM cluster) @ 2.5GHz | 200 | 16GB |
| UOS | t2c | Intel(R) Xeon (x86 cluster) @ 2.13GHz | 80 ~ 120 | 8GB |

Remarks: integration completed in 2014; integration planned for the second half of 2014; integration under discussion.

• HTC service infrastructure built (@PLSI) using HTCondor's capabilities
• 6 geographically distributed sites
• 1000+ cores

HTC@PLSI Deployment Case: Service policy (default)

HTC@PLSI Deployment Case: Service policy (when dedicated resources are requested)

HTC@PLSI Deployment Case: HTC@PLSI use cases
• Yonsei University: research on asteroid populations in the main asteroid belt (Monte Carlo simulation)
• University of Seoul: support for medical and high-energy physics work (GEANT4 support)
• Korea Astronomy and Space Science Institute: collaboration on an intensive survey of small solar system bodies in the southern ecliptic zone (planned)

HTC@PLSI Deployment Case
Demo!!
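
A minimal sketch of a submit description file for a Monte Carlo style parameter sweep, of the kind the use cases above describe. It is an illustration, not the file used in the demo: the file name sweep.jds, the choice of 50 jobs, and passing the job index as an argument are assumptions. The names monte_int, data.N, out/out.N, and log.N echo the job ClassAd shown earlier. $(Process) is the HTCondor macro that expands to each job's ProcId (0 to 49 here), so every job reads and writes its own files instead of sharing one output file.

```
# sweep.jds (hypothetical file name)
universe                = vanilla
executable              = monte_int
# Assumption: the program accepts its job index as its only argument.
arguments               = $(Process)
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
# Assumption: input files data.0 ... data.49 exist in the submit directory.
transfer_input_files    = data.$(Process)
# $(Process) takes the values 0, 1, ..., 49, one per queued job.
output                  = out/out.$(Process)
error                   = out/err.$(Process)
log                     = log.$(Process)
queue 50
```

The out directory is expected to exist before submission, so a run would likely look like `mkdir -p out` followed by `condor_submit sweep.jds`, with `condor_q` used to watch the 50 jobs.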