NEP-101 HEP Data-Intensive Distributed Cloud Computing
Technical Review
15 October 2014
Colin Leavett-Brown, University of Victoria
Agenda:
1. Percent Completion
2. Project Progress Discussion
3. Collaboration Opportunities
4. Summary
1. Percent Completion:
2. Project Progress Discussion:
– Batch Services (CloudScheduler)
– Software Distribution (CVMFS, Shoal, Squid)
– Storage Federation (UGR)
– VM Image Distribution (Glint)
– VM Image Optimization (CernVM3)
2.1. Project Progress Discussion, Batch Services:
- Monitoring & Diagnostics
  * Work continues
  * Nagios-friendly checks (see the sketch at the end of this slide)
- Other Activities:
  * Belle-II production
  * OpenStack/Nova Fairshare Scheduler
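A minimal sketch of what "Nagios friendly" means in practice: a check script that follows the standard Nagios plugin conventions (exit codes 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN, plus a one-line status with optional perfdata). The idle-VM metric and the thresholds are illustrative assumptions, not the actual CloudScheduler monitoring code.

    #!/usr/bin/env python
    # Illustrative Nagios-style check; the metric and thresholds are assumptions,
    # not the actual CloudScheduler monitoring code.
    import sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

    def check_idle_vms(idle, warn=10, crit=25):
        """Return a Nagios (exit_code, message) pair for an idle-VM count."""
        if idle >= crit:
            return CRITICAL, "CRITICAL - %d idle VMs" % idle
        if idle >= warn:
            return WARNING, "WARNING - %d idle VMs" % idle
        return OK, "OK - %d idle VMs" % idle

    if __name__ == "__main__":
        try:
            idle_vms = int(sys.argv[1])           # e.g. parsed from scheduler status output
        except (IndexError, ValueError):
            print("UNKNOWN - usage: check_idle_vms.py <idle_vm_count>")
            sys.exit(UNKNOWN)
        code, message = check_idle_vms(idle_vms)
        print(message + " | idle=%d" % idle_vms)  # perfdata after the pipe
        sys.exit(code)

Nagios would invoke such a script via NRPE or a local command definition and alert on the exit code.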
2.1.1 Belle-II production:
• 2-15 October.
• Utilizing up to 7 clouds concurrently (including 3 commercial clouds).
• Averaging more than 1170 concurrent jobs (peaking at over 1950).
2.1.2 Belle-II production:
• 3rd highest producer in the world.
2.1.3 Fairshare Scheduler:
• Replacement for the standard OpenStack filter scheduler.
• Improves resource usage in a batch environment.
• When there are insufficient resources, queues requests rather than rejecting them.
• Allows the definition of project and user shares, for example:
  – project_shares={'ATLAS':30, 'Belle-II':30, 'HEP':5, 'staticVMs':10, 'testing':5}
  – user_shares={'p1':{'p1_u1':11, 'p1_u3':13}, 'p2':{'p2_u1':21, 'p1_u3':13}}
• Implements the SLURM fair-share algorithm, which tracks resource utilization
  (https://computing.llnl.gov/linux/slurm/priority_multifactor.html); a sketch of the
  priority calculation follows the usage listing below.
USER          | PROJECT   | USER SHARE | PROJECT SHARE | FAIR-SHARE (Vcpus) | FAIR-SHARE (Memory) | actual vcpus usage | effec. vcpus usage | priority | VMs
igable        | staticVMs | 1%         | 10.0%         | 0.142883896655     | 0.144670972976      | 2.6%               | 46.8%              | 2441     | 1
crlb          | staticVMs | 1%         | 10.0%         | 0.0240432745384    | 0.0265300227472     | 88.3%              | 89.6%              | 426      | 8
crlb          | testing   | 10.0%      | 10.0%         | 0.975288048768     | 0.982024799983      | 1.2%               | 1.2%               | 16627    | 0
batch-account | HEP       | 1%         | 10.0%         | 0.844486880279     | 0.807007869005      | 0.1%               | 2.7%               | 14093    | 0
frank         | HEP       | 1%         | 10.0%         | 0.625423100715     | 0.551355623614      | 7.4%               | 7.5%               | 10113    | 2
crlb          | HEP       | 1%         | 10.0%         | 0.837189731471     | 0.798172023564      | 0.3%               | 2.8%               | 13959    | 0
• Collaborating with developers from Istituto Nazionale di Fisica Nucleare.
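For illustration, a sketch of the classic SLURM fair-share factor referenced above, F = 2^(-usage/share), with both usage and share normalized to the whole system; a factor near 1.0 means the user is owed priority, a factor near 0 means they have already consumed more than their share. The weight and the example numbers are illustrative assumptions, not values taken from the Nova fairshare scheduler code.

    # Sketch of the classic SLURM fair-share factor, F = 2 ** (-usage / shares),
    # where both usage and shares are normalized to the whole system.
    # The weight and example values are illustrative assumptions, not the
    # actual nova fairshare-scheduler code.

    def fairshare_factor(norm_shares, norm_usage):
        """Return a value in (0, 1]: 1.0 for no usage, 0.5 when usage equals share."""
        if norm_shares <= 0.0:
            return 0.0
        return 2.0 ** (-norm_usage / norm_shares)

    def priority(norm_shares, norm_usage, weight=10000):
        """Weighted priority derived from the fair-share factor alone."""
        return int(weight * fairshare_factor(norm_shares, norm_usage))

    if __name__ == "__main__":
        # A user holding a 10% share who has recently consumed 2.6% of the system:
        print(priority(0.10, 0.026))   # high priority, usage well below share
        # The same share after consuming 88.3% of the system:
        print(priority(0.10, 0.883))   # priority collapses toward zero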
2.2. Project Progress Discussion, Software Distribution:
- In production for 9 months
- Multiple geographically distributed ATLAS squid caches (see the proxy-discovery sketch below)
- Adopted by CERN, Oxford, & others
- Included in CernVM3
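A hedged sketch of the proxy discovery that ties Shoal, Squid, and CVMFS together: a worker node asks a Shoal server for its nearest squid caches and builds a CVMFS_HTTP_PROXY string from the reply. The server URL, the /nearest endpoint, and the JSON field names are assumptions for illustration; in production the shoal-client package performs this step.

    # Sketch: ask a Shoal server for nearby squids and build a CVMFS proxy string.
    # The server URL, the "/nearest" endpoint, and the JSON field names are
    # assumptions for illustration; the shoal-client package does this in production.
    import json
    try:
        from urllib.request import urlopen   # Python 3
    except ImportError:
        from urllib2 import urlopen          # Python 2

    SHOAL_SERVER = "http://shoal.example.org/nearest"   # hypothetical server

    def nearest_squids(url=SHOAL_SERVER):
        """Return a list of 'http://host:port' proxy strings, closest first."""
        reply = json.loads(urlopen(url, timeout=10).read())
        squids = []
        for rank in sorted(reply, key=int):              # assumed: keys are ranks
            entry = reply[rank]
            squids.append("http://%s:%s" % (entry["hostname"], entry["squid_port"]))
        return squids

    if __name__ == "__main__":
        proxies = nearest_squids()
        # CVMFS takes a ';'-separated, ordered proxy list, with DIRECT as a fallback.
        print('CVMFS_HTTP_PROXY="%s;DIRECT"' % ";".join(proxies))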
2.3. Project Progress Discussion, Storage Federation:
- Supports WebDAV servers and ATLAS storage elements (SEs)
- ATLAS SE authentication via VOMS proxy; tested interactively (see the sketch below)
- Canadian sites configured
- Production testing waiting for a modification to Rucio/Aria2C, the glue between the ATLAS DB, users, and UGR
- Simulation tests being formulated
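To illustrate the interactive test mentioned above, the sketch below issues a WebDAV PROPFIND against a federation endpoint while presenting a VOMS proxy as the TLS client certificate. The endpoint URL and proxy path are placeholders, and this is not the UGR or Rucio/Aria2C production path.

    # Sketch: a WebDAV PROPFIND against a federation endpoint, authenticating with
    # a VOMS proxy used as the TLS client certificate. The endpoint URL and proxy
    # path are placeholders.
    import os
    import requests   # third-party HTTP library

    FEDERATION_URL = "https://federation.example.org/atlas/some/dataset/"  # placeholder
    PROXY = os.environ.get("X509_USER_PROXY", "/tmp/x509up_u%d" % os.getuid())

    def list_directory(url=FEDERATION_URL, proxy=PROXY):
        """Return the raw PROPFIND (depth 1) listing of a federated directory."""
        response = requests.request(
            "PROPFIND", url,
            headers={"Depth": "1"},
            cert=proxy,                                 # proxy PEM holds both cert and key
            verify="/etc/grid-security/certificates",   # grid CA directory
            timeout=30,
        )
        response.raise_for_status()
        return response.text                            # WebDAV multistatus XML

    if __name__ == "__main__":
        print(list_directory()[:2000])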
2.4. Project Progress Discussion, Image Distribution:
- Integrated with the OpenStack Dashboard:
  * Adding HTTPS & branding
  * Moving to production
  * Packaging
  * Preparing for the OpenStack Summit in November
- Uses the OpenStack development architecture and Keystone authentication
- Supports Glance, EC2, & GCE (see the replication sketch below)
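A hedged sketch of the kind of cross-cloud replication Glint automates: authenticate to two OpenStack clouds with Keystone, stream an image out of one Glance service, and upload it to the other. All URLs, credentials, and the image name are placeholders, and the code uses the public keystoneauth1/python-glanceclient APIs rather than Glint's own interfaces or Dashboard panels.

    # Sketch of Glint-style image replication between two OpenStack clouds using
    # the public Keystone and Glance client libraries. All endpoints, credentials,
    # and the image name are placeholders.
    import tempfile

    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from glanceclient import Client as GlanceClient

    def glance_client(auth_url, username, password, project_name):
        """Build a Glance v2 client from Keystone v3 password credentials."""
        auth = v3.Password(auth_url=auth_url, username=username, password=password,
                           project_name=project_name,
                           user_domain_name="Default", project_domain_name="Default")
        return GlanceClient("2", session=session.Session(auth=auth))

    def replicate(source, target, image_name):
        """Copy the named image from the source cloud's Glance to the target's."""
        src_image = next(img for img in source.images.list() if img.name == image_name)

        # Stream the image bits to a temporary file, then upload to the target cloud.
        with tempfile.NamedTemporaryFile() as blob:
            for chunk in source.images.data(src_image.id):
                blob.write(chunk)
            blob.seek(0)
            new_image = target.images.create(name=src_image.name,
                                             disk_format=src_image.disk_format,
                                             container_format=src_image.container_format)
            target.images.upload(new_image.id, blob)
        return new_image.id

    if __name__ == "__main__":
        src = glance_client("https://cloud-a.example.org:5000/v3", "user", "pass", "HEP")
        dst = glance_client("https://cloud-b.example.org:5000/v3", "user", "pass", "HEP")
        print(replicate(src, dst, "cernvm3-batch"))   # placeholder image name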
2.5. Project Progress Discussion, Image Optimization:
- Work continues on contextualization of CernVM-3:
  * Cloud-type discovery (see the sketch below).
  * Contextualization switching from a combination of puppet/cloud-init to pure cloud-init.
  * Collaborating with CERN and pushing code changes directly to the CERN repository.
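A sketch of one common approach to cloud-type discovery: probe the metadata services that the different cloud types expose. The endpoints are the standard EC2/OpenStack and GCE metadata addresses, but the probe order, labels, and timeouts are illustrative assumptions and not the actual CernVM-3 contextualization code.

    # Sketch of metadata-based cloud-type discovery. The endpoints are the standard
    # EC2/OpenStack and GCE metadata addresses; the probe order, labels, and
    # timeouts are illustrative and not the actual CernVM-3 code.
    try:
        from urllib.request import Request, urlopen   # Python 3
    except ImportError:
        from urllib2 import Request, urlopen          # Python 2

    def probe(url, headers=None, timeout=2):
        """Return True if the metadata URL answers within the timeout."""
        try:
            urlopen(Request(url, headers=headers or {}), timeout=timeout)
            return True
        except Exception:
            return False

    def discover_cloud_type():
        """Best-effort guess at the cloud hosting this VM."""
        # GCE requires the Metadata-Flavor header, which makes it unambiguous.
        if probe("http://metadata.google.internal/computeMetadata/v1/",
                 headers={"Metadata-Flavor": "Google"}):
            return "gce"
        # OpenStack exposes its own metadata tree alongside the EC2-compatible one.
        if probe("http://169.254.169.254/openstack/latest/meta_data.json"):
            return "openstack"
        if probe("http://169.254.169.254/latest/meta-data/"):
            return "ec2"
        return "unknown"

    if __name__ == "__main__":
        print(discover_cloud_type())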
3. Collaboration Opportunities:
– The latest CloudScheduler, Glint, and UGR will be available on DAIR
– All source code developed by the project is on GitHub
– Seeking to have Glint installed at CERN, on WestGrid, and at other third-party sites
– Seeking to have Glint included as an OpenStack project
– Using code provided by CERN, OpenStack, and the open source community
4. Summary:
– The project is on track and making good progress.
– Many of the pieces are already in production by the NEP-101 project group, CERN/ATLAS, Belle II, and CANFAR.