CPU efficiency of experiment jobs at CERN - Meyrin and Wigner

Nicolò Magini, Alessandro Di Girolamo, Domenico Giordano, Edward Karavakis, Maarten Litmaath, Valentina Mancinelli, Stefan Roiser
IT/SDC
WLCG pre-GDB on Data Access
May 13th 2014

Outline
  Introduction
  Analysis
  Outlook

Introduction
  Since 2013, the Wigner Data Centre in Budapest provides an extension of the CERN Computing Centre
    Connected with Meyrin with 2x100 Gbps
    ~30 ms latency
  Gradually ramping up. On March 11th:
    ~8k public slots (20%)
    ~58k HEPSPEC06 (17%)
  Transparent for users

Impact for VOs?
  Experiments and IT are working to assess the impact of the distributed computing centre on workflows running at CERN
  Is the network latency affecting I/O for jobs running on WNs in Wigner?
    Storage was only in Meyrin
    Any effect should decrease with EOS also in Wigner
  Analysis of individual workflows
    More precise, but might be less representative
  Analysis of historical records to assess the overall impact
    Wealth of data to mine

Other factors at play
  Multiple factors to disentangle
  Virtualization?
    Migration from physical machines to virtual nodes in AI
    Wigner is ~fully on AI, Meyrin is still a mix
  Operating system?
    Migration from SLC5 to SLC6
    SLC5 is mostly on physical machines, SLC6 is mostly on AI
  Hardware types?
    Intel vs. AMD, etc.
  Not only infrastructure
    Experiment software?
    Different kinds of workflows?
Metrics
  CPU efficiency = CPU time / Wallclock time
    Chosen because it is expected to be sensitive to network latency, not because it is the best target for optimization
    All other things equal, it will be higher on slower CPUs for the same job
  Absolute number is not useful, it changes every month
    Different workflows and conditions
  For each activity, the difference between Meyrin and Wigner can be interesting
    Since submission frameworks are submitting blindly all job types to both sites

LHCb analysis
  Monte Carlo workflow (no input) running at CERN: 29.5k 'homogeneous' jobs
  Selected only virtualized WNs
  CPU efficiency of jobs in the same range at Meyrin and Wigner
  Mean values
    Meyrin: 97.56%
    Wigner: 94.27%

ALICE job monitoring
  http://alimonitor.cern.ch
  ~15% difference SLC5 vs SLC6
    Effect of OS or virtualization?
  ~5% difference Meyrin vs Wigner
  [Plot legend: SLC5: CERN-CREAM and CERN-L; SLC6: CERN(Wigner) and CERN(Meyrin)]

ATLAS and CMS: Analysis of dashboard records
  Query Dashboard for all ATLAS and CMS jobs run at CERN
    For "production" and "analysis" generic activities
    Selecting jobs that at least started to run (> 10 s)
  Correlate job execution host with WN location and architecture from LSF
  Select sub-sample of jobs running as much as possible in same conditions in Wigner vs Meyrin
    SLC6 on virtual machines
    Reminder that hardware mix may not be the same
  Weighted average CPU efficiency
    Sum(CPU time) / Sum(Wallclock time)
    Integrated over 1 month

ATLAS production - successful jobs
  [Plot: ATLAS production - successful jobs. Vertical axis from 0 to 1; series: Meyrin SLC6 virtual, Wigner SLC6 virtual]
  Difference in CPU efficiencies ~3% - but what happened in Feb-Mar?
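The weighted average defined on the "ATLAS and CMS: Analysis of dashboard records" slide, Sum(CPU time) / Sum(Wallclock time) over jobs that ran for more than 10 s, could be computed per site roughly as below. This is only a minimal sketch: the record layout (`site`, `cpu_s`, `wallclock_s`) and the function name are hypothetical and are not the Dashboard schema, and applying the 10 s cut to wallclock time is an assumption.

```python
from collections import defaultdict

# Assumption: the "> 10 s" selection from the slides is applied to wallclock time.
MIN_WALLCLOCK_S = 10.0


def weighted_cpu_efficiency(jobs):
    """Return {site: Sum(CPU time) / Sum(wallclock time)} over the selected jobs."""
    cpu = defaultdict(float)
    wall = defaultdict(float)
    for job in jobs:
        # Skip jobs that did not really start to run.
        if job["wallclock_s"] <= MIN_WALLCLOCK_S:
            continue
        cpu[job["site"]] += job["cpu_s"]
        wall[job["site"]] += job["wallclock_s"]
    return {site: cpu[site] / wall[site] for site in wall}


# Made-up records, for illustration only (not measured data).
jobs = [
    {"site": "Meyrin", "cpu_s": 900.0, "wallclock_s": 1000.0},
    {"site": "Meyrin", "cpu_s": 50.0, "wallclock_s": 100.0},
    {"site": "Wigner", "cpu_s": 800.0, "wallclock_s": 1000.0},
    {"site": "Wigner", "cpu_s": 1.0, "wallclock_s": 5.0},  # dropped by the 10 s cut
]
eff = weighted_cpu_efficiency(jobs)
```

With these made-up records the Meyrin value is 950/1100, not the plain mean of the two per-job efficiencies (0.9 and 0.5): summing before dividing gives long jobs proportionally more weight.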
ATLAS analysis - successful jobs
  [Plot: ATLAS analysis - successful jobs. Vertical axis from 0 to 1; series: Meyrin SLC6 virtual, Wigner SLC6 virtual]
  Difference in CPU efficiencies ~20%

CMS production - successful jobs
  [Plot: CMS production - CPU efficiency. Vertical axis from 0 to 1; series: Meyrin SLC6 virtual, Wigner SLC6 virtual]
  Difference in CPU efficiencies ~3%

CMS analysis - successful jobs
  [Plot: CMS analysis - CPU efficiency. Vertical axis from 0 to 1; series: Meyrin SLC6 virtual, Wigner SLC6 virtual]
  Difference in CPU efficiencies ~6%

Next steps
  1. Correlating metrics
       E.g. CPU efficiency vs. CPU time
       To reveal details which are lost in averages
  2. Adding new metrics
       "Events per second"
       What we want to optimize
  3. Vary selection criteria to disentangle other factors
       E.g. Intel vs. AMD, SLC5 vs. SLC6, etc.

Outlook
  Overall impact of remote access is small
    Comparable with variability in our measurements
    Comparable with other effects, e.g. OS
  Procedure set up to measure the evolution
    E.g. effect is expected to be reduced now that EOS is also in Wigner
  Data mining can allow us to spot problematic workflows
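As an illustration of next step 1 (correlating CPU efficiency with CPU time), one possible approach is to compute the same weighted efficiency separately in bins of CPU time. The sketch below is an assumption about how this might be done, not the procedure used for the plots; the function name, the `(cpu_s, wallclock_s)` tuple layout and the bin edges are all hypothetical.

```python
from collections import defaultdict


def efficiency_by_cpu_time_bin(jobs, bin_edges_s):
    """Weighted CPU efficiency per CPU-time bin.

    jobs: iterable of (cpu_s, wallclock_s) pairs (hypothetical layout).
    Bin i covers [bin_edges_s[i], bin_edges_s[i + 1]); jobs outside all bins are ignored.
    """
    cpu = defaultdict(float)
    wall = defaultdict(float)
    for cpu_s, wall_s in jobs:
        for i in range(len(bin_edges_s) - 1):
            if bin_edges_s[i] <= cpu_s < bin_edges_s[i + 1]:
                cpu[i] += cpu_s
                wall[i] += wall_s
                break
    return {i: cpu[i] / wall[i] for i in sorted(wall)}


# Made-up jobs, for illustration only: two short and two long ones.
jobs = [(30.0, 100.0), (50.0, 100.0), (900.0, 1000.0), (1900.0, 2000.0)]
per_bin = efficiency_by_cpu_time_bin(jobs, [0.0, 600.0, 3600.0])
overall = sum(c for c, _ in jobs) / sum(w for _, w in jobs)
```

In this toy sample the overall weighted efficiency is 0.9, while the short-job bin sits at 0.4: the kind of detail that is lost in an average.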