CPU efficiency of experiment jobs at CERN - Meyrin and Wigner
Nicolò Magini, Alessandro Di Girolamo, Domenico
Giordano, Edward Karavakis, Maarten Litmaath,
Valentina Mancinelli, Stefan Roiser
IT/SDC
WLCG pre-GDB on Data Access
May 13th 2014
Outline
• Introduction
• Analysis
• Outlook
CPU efficiency of experiment jobs at CERN
2014-05-13
Introduction
• Since 2013, the Wigner Data Centre in Budapest has provided an extension of the CERN Computing Centre
  - Connected to Meyrin with 2x100 Gbps
  - ~30 ms latency
• Gradually ramping up. On March 11th:
  - ~8k public slots (20%)
  - ~58k HEPSPEC06 (17%)
• Transparent for users
Impact for VOs?
• Experiments and IT are working to assess the impact of the distributed computing centre on workflows running at CERN
  - Is the network latency affecting I/O for jobs running on WNs in Wigner?
    - Storage was only in Meyrin
    - Any effect should decrease with EOS also in Wigner
• Analysis of individual workflows
  - More precise, but might be less representative
• Analysis of historical records to assess the overall impact
  - Wealth of data to mine
Other factors at play
• Multiple factors to disentangle
  - Virtualization?
    - Migration from physical machines to virtual nodes in AI (Agile Infrastructure)
    - Wigner is ~fully on AI, Meyrin is still a mix
  - Operating system?
    - Migration from SLC5 to SLC6
    - SLC5 is mostly on physical machines, SLC6 is mostly on AI
  - Hardware types?
    - Intel vs. AMD, etc.
• Not only infrastructure
  - Experiment software?
  - Different kinds of workflows?
Metrics
• CPU efficiency = CPU time / Wallclock time
  - Chosen because it is expected to be sensitive to network latency, not because it is the best target for optimization
    - All other things being equal, it will be higher on slower CPUs for the same job
• The absolute number is not useful: it changes every month
  - Different workflows and conditions
• For each activity, the difference between Meyrin and Wigner can be interesting
  - Since submission frameworks blindly submit all job types to both sites
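The definition above, and the remark about slower CPUs, can be illustrated with a minimal sketch. The model in which wallclock time is CPU time plus a fixed I/O wait is an assumption made here for illustration, and all function names and numbers are hypothetical.

```python
def cpu_efficiency(cpu_time_s: float, wallclock_time_s: float) -> float:
    """CPU efficiency = CPU time / wallclock time."""
    if wallclock_time_s <= 0:
        raise ValueError("wallclock time must be positive")
    return cpu_time_s / wallclock_time_s


def modelled_efficiency(cpu_work: float, cpu_speed: float, io_wait_s: float) -> float:
    """Toy model (assumption): wallclock = CPU time + time spent waiting on I/O."""
    cpu_time_s = cpu_work / cpu_speed
    return cpu_efficiency(cpu_time_s, cpu_time_s + io_wait_s)


# Same job and same I/O wait on two CPU speeds (arbitrary units).
fast = modelled_efficiency(cpu_work=10_000, cpu_speed=10.0, io_wait_s=100.0)
slow = modelled_efficiency(cpu_work=10_000, cpu_speed=5.0, io_wait_s=100.0)

# Same job on the fast CPU, with extra I/O wait standing in for added latency.
remote = modelled_efficiency(cpu_work=10_000, cpu_speed=10.0, io_wait_s=150.0)

print(f"fast CPU: {fast:.3f}, slow CPU: {slow:.3f}, fast CPU + extra wait: {remote:.3f}")
```

In this toy model the slower CPU shows the higher efficiency although the job finishes later, which is why the metric is used here as a latency probe and not as an optimization target.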
LHCb analysis
• Monte Carlo workflow (no input) running at CERN: 29.5k ‘homogeneous’ jobs
• Selected only virtualized WNs
• CPU efficiency of jobs in the same range at Meyrin and Wigner
• Mean values
  - Meyrin: 97.56%
  - Wigner: 94.27%
ALICE job monitoring
http://alimonitor.cern.ch
• ~15% difference SLC5 vs SLC6
  - Effect of OS or virtualization?
• ~5% difference Meyrin vs Wigner
SLC5: CERN-CREAM and CERN-L
SLC6: CERN(Wigner) and CERN(Meyrin)
ATLAS and CMS: Analysis of dashboard records
• Query Dashboard for all ATLAS and CMS jobs run at CERN
  - For “production” and “analysis” generic activities
  - Selecting jobs that at least started to run (> 10 s)
• Correlate job execution host with WN location and architecture from LSF
• Select a sub-sample of jobs running, as much as possible, in the same conditions in Wigner vs Meyrin
  - SLC6 on virtual machines
  - Reminder that the hardware mix may not be the same
• Weighted average CPU efficiency
  - Sum(CPU time) / Sum(Wallclock time)
  - Integrated over 1 month
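The selection and the weighted average described above could be sketched as follows. This is only an illustration: the record fields, host names, and the choice to apply the "> 10 s" cut to wallclock time are assumptions, not the actual Dashboard or LSF schema.

```python
from collections import defaultdict

# Hypothetical job records; field names are illustrative, not the Dashboard schema.
jobs = [
    {"host": "wn-a", "cpu_s": 900.0, "wall_s": 1000.0},
    {"host": "wn-a", "cpu_s": 5.0, "wall_s": 8.0},       # too short: dropped
    {"host": "wn-b", "cpu_s": 4000.0, "wall_s": 5000.0},
    {"host": "wn-c", "cpu_s": 1700.0, "wall_s": 2000.0},
    {"host": "wn-d", "cpu_s": 300.0, "wall_s": 400.0},   # SLC5 physical: dropped
]

# Hypothetical host attributes, standing in for the information taken from LSF.
hosts = {
    "wn-a": {"site": "Meyrin", "os": "SLC6", "virtual": True},
    "wn-b": {"site": "Wigner", "os": "SLC6", "virtual": True},
    "wn-c": {"site": "Meyrin", "os": "SLC6", "virtual": True},
    "wn-d": {"site": "Meyrin", "os": "SLC5", "virtual": False},
}

MIN_WALL_S = 10.0  # assumption: the "> 10 s" cut is applied to wallclock time


def weighted_efficiency(jobs, hosts):
    """Return Sum(CPU time) / Sum(wallclock time) per site for SLC6 virtual WNs."""
    cpu = defaultdict(float)
    wall = defaultdict(float)
    for job in jobs:
        info = hosts.get(job["host"])
        if info is None or job["wall_s"] <= MIN_WALL_S:
            continue  # unknown host, or job that did not really start
        if info["os"] != "SLC6" or not info["virtual"]:
            continue  # keep only the comparable sub-sample
        cpu[info["site"]] += job["cpu_s"]
        wall[info["site"]] += job["wall_s"]
    return {site: cpu[site] / wall[site] for site in wall}


result = weighted_efficiency(jobs, hosts)
for site, eff in sorted(result.items()):
    print(f"{site}: {eff:.3f}")
```

Dividing the sums, instead of averaging per-job ratios, weights each job by its wallclock time, so long jobs dominate the result.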
ATLAS production – successful jobs
[Plot "ATLAS production - successful jobs": vertical axis from 0 to 1; series: Meyrin SLC6 virtual, Wigner SLC6 virtual]
• Difference in CPU efficiencies ~3% - but what happened in Feb-Mar?
ATLAS analysis – successful jobs
[Plot "ATLAS analysis - successful jobs": vertical axis from 0 to 1; series: Meyrin SLC6 virtual, Wigner SLC6 virtual]
• Difference in CPU efficiencies ~20%
CMS production – successful jobs
[Plot "CMS production - CPU efficiency": vertical axis from 0 to 1; series: Meyrin SLC6 virtual, Wigner SLC6 virtual]
• Difference in CPU efficiencies ~3%
CMS analysis – successful jobs
[Plot "CMS analysis - CPU efficiency": vertical axis from 0 to 1; series: Meyrin SLC6 virtual, Wigner SLC6 virtual]
• Difference in CPU efficiencies ~6%
Next steps
1. Correlating metrics
   - E.g. CPU efficiency vs. CPU time
   → To reveal details which are lost in averages
2. Adding new metrics
   - “Events per second”
   → What we want to optimize
3. Vary selection criteria to disentangle other factors
   - E.g. Intel vs. AMD, SLC5 vs. SLC6, etc.
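As an illustration of the first step, one possible way to correlate CPU efficiency with CPU time is to bin jobs by CPU time and compute the efficiency per bin and per site. The records, bin edges, and function names below are hypothetical and only sketch the idea.

```python
from collections import defaultdict

# Hypothetical (site, cpu_s, wall_s) records, invented for illustration.
jobs = [
    ("Meyrin", 50.0, 100.0),
    ("Meyrin", 9000.0, 9500.0),
    ("Wigner", 40.0, 100.0),
    ("Wigner", 9000.0, 9600.0),
]

BIN_EDGES_S = [0.0, 600.0, 3600.0, float("inf")]  # arbitrary CPU-time bins


def cpu_time_bin(cpu_s):
    """Return the (low, high) CPU-time bin containing cpu_s."""
    for low, high in zip(BIN_EDGES_S, BIN_EDGES_S[1:]):
        if low <= cpu_s < high:
            return (low, high)
    raise ValueError("CPU time outside the configured bins")


def efficiency_by_bin(jobs):
    """Return Sum(CPU time) / Sum(wallclock time) per (site, CPU-time bin)."""
    sums = defaultdict(lambda: [0.0, 0.0])
    for site, cpu_s, wall_s in jobs:
        key = (site, cpu_time_bin(cpu_s))
        sums[key][0] += cpu_s
        sums[key][1] += wall_s
    return {key: cpu / wall for key, (cpu, wall) in sums.items()}


table = efficiency_by_bin(jobs)
for (site, (low, high)), eff in sorted(table.items()):
    print(f"{site:7s} CPU time [{low:.0f}, {high:.0f}) s: {eff:.3f}")
```

With these invented numbers the two sites differ more for short jobs than for long ones, a detail that a single wallclock-weighted average per site would hide.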
Outlook
• Overall impact of remote access is small
  - Comparable with variability in our measurements
  - Comparable with other effects, e.g. OS
• Procedure set up to measure the evolution
  - E.g. effect is expected to be reduced now that EOS is also in Wigner
• Data mining can allow us to spot problematic workflows