presentation slides - McGill HPC

Guillimin HPC Users Meeting
September 18, 2014
Bryan Caron
[email protected]
[email protected]
McGill University / Calcul Québec / Compute Canada
Montréal, QC Canada
Outline
• Compute Canada News
• Service Interruption - October 17
• Storage System News
• Scheduler Updates
• Software and User Environment Updates
• Training News
Compute Canada News
• Resource Allocation Opportunities Competition 2015
– Announced September 15
• Three categories:
– Fast Track
– Resource Allocations Competition (RAC)
– Research Platforms and Portals (RPP)
Compute Canada News
• Fast Track
– By invitation only
– Target community: existing 2014 RAC users with minimal changes expected for 2015
– Simplified application process compared to a full RAC request
– Deadline: October 2, 2014
Compute Canada News
• Resource Allocations Competition (RAC)
– For requests larger than a default allocation
• Default allocation sizes vary between systems and sites
– Allocation duration: 1 year starting Jan 2015
– Application Deadline: October 20, 2014
Compute Canada News
• Research Platforms and Portals (RPP) ** New! **
– Application category examples:
• Resources for larger communities of researchers
• Applications that provide a public platform using CC
computing or storage
• Groups with international agreements for multi-year
computing or storage commitments
• Groups providing shared datasets accessible using
non-Compute Canada interfaces / portals
– Timelines
• Letter of Intent due September 25
• Selected projects invited for full application Oct 3
• Full proposals due October 20
Compute Canada News
• All applicants are advised to contact CC staff prior to
submitting an application and no later than Oct 1st
– All new applicants MUST contact CC staff
– Please contact us at [email protected] to
discuss your proposals
• Further information:
– https://www.computecanada.ca
– General Inquiries about the resource opportunities:
[email protected]
Service Interruption
• Guillimin Service Interruption: October 17
– Scheduled outage due to a full ETS campus-wide power
interruption for electrical maintenance
• Date: Friday October 17 (overnight period)
– All Guillimin services will be unavailable
• The start time and duration of the Guillimin service interruption will be announced soon
• Scheduling will take into account both the ETS power interruption and other priority Guillimin maintenance work
Storage System News
• Upcoming Activities
– Apply a patch to GPFS to fix a bug in version 3.5.0.19
• Either a live update with no service interruption, or an update during a future maintenance period
• Expected to let us return storage to its optimal tunings, which are currently modified to keep the system stable despite the bug in the current GPFS release
– Tape Archive (Backup) and Hierarchical Storage
Management (HSM) Integration - ongoing
Scheduler Update
• Overall stability and performance have generally improved
– Updated to Torque 4.2.8 during the August maintenance period
– Using “cpusets”: each job can only access (is pinned to) as many
CPU cores as were requested in the submission
– A few outstanding issues under review with Adaptive
Computing
• Recall: April 10 - qsub for job submission enabled
– Default PATH settings updated to include Torque commands
(qsub, qstat, …)
– Much faster response for submissions and queries compared to Moab commands (msub, canceljob, …)
– qsub submission filter: qsub -A <RAPid> is now only required if you can access multiple allocations
– New! qsub can now also be used directly from worker nodes
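The cpuset pinning described above can be observed from inside a running job. The following is a minimal sketch, assuming a Linux worker node and a Python interpreter available to the job; the helper names allowed_core_count and thread_count are hypothetical and not part of any Guillimin tooling. It compares the cores the process is allowed to run on with the total cores on the node, and caps a requested thread count accordingly.

```python
import os


def allowed_core_count():
    """Return the number of CPU cores this process is allowed to run on."""
    # os.sched_getaffinity is Linux-only; it reflects cpuset/affinity limits.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback for other platforms: total cores, with no pinning information.
    return os.cpu_count() or 1


def thread_count(requested=None):
    """Pick a thread count that never exceeds the cores the job can use."""
    allowed = allowed_core_count()
    if requested is None:
        return allowed
    return max(1, min(requested, allowed))


if __name__ == "__main__":
    print("cores on node   :", os.cpu_count())
    print("cores usable now:", allowed_core_count())
```

Inside a job pinned by a cpuset, the second number would be expected to match the cores requested at submission, so sizing thread pools from it avoids oversubscribing the pinned cores.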
Scheduler Update
• Other scheduler updates
– Jobs specifying gpus=x or mics=x are now automatically routed
to the correct queue (k20 or phi)
– The ScaleMP system is online but has only 120 cores instead of
132 due to hardware issues; please use the scalemp queue for
access
• Future work and updates
– Moab configurations to favour assignment of nodes from within
the same IB switch for MPI jobs (fewer hops)
– Additional qsub filter improvements and features
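To illustrate the automatic routing of gpus=x and mics=x requests mentioned above, here is a rough sketch of the kind of decision such a filter might make. It is not the actual Guillimin submission filter: the function name pick_queue, the example resource strings, and the choice to reject requests that mix gpus and mics are all assumptions made for illustration. Only the queue names k20 and phi come from the slide.

```python
import re


def pick_queue(resource_list):
    """Return 'k20', 'phi', or None for a Torque-style -l resource string.

    Hypothetical illustration only; the real filter's rules may differ.
    """
    counts = {"gpus": 0, "mics": 0}
    # Find every gpus=<n> or mics=<n> entry in the request.
    for key, value in re.findall(r"\b(gpus|mics)=(\d+)", resource_list):
        counts[key] = max(counts[key], int(value))

    # Assumption: a request asking for both device types is ambiguous.
    if counts["gpus"] > 0 and counts["mics"] > 0:
        raise ValueError("request mixes gpus and mics; routing is ambiguous")
    if counts["gpus"] > 0:
        return "k20"
    if counts["mics"] > 0:
        return "phi"
    return None  # no accelerator requested: leave the queue unchanged
```

For example, pick_queue("nodes=1:ppn=16:gpus=2") selects the k20 queue, while a plain CPU request such as "nodes=1:ppn=12" is left alone.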
Software Update
• New Installations
– pigz v.2.2.5 (parallel gzip) (not a module)
– pxz/4.999.9 (parallel xz) compiled with xz v.5.1.4
– FFTW/3.3-serial-intel
– NAMD/20140822-phi
– ifort_icc/14.0.4 (new default, from 14.0.1!) and ifort_icc/15.0
– intel_mpi/5.0.1 (new default!); renamed intel_mpi/14.0 to intel_mpi/4.1.1 and intel_mpi/14.0.1 to intel_mpi/4.1.3.
• Future updates
– MPSS 3.3: software stack update for Intel Phi nodes to improve functionality and performance for MPI jobs using multiple Phi nodes and cards.
– PGI license server migration + installation of version 14.7
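Because the intel_mpi modules were renamed, existing job scripts that load the old names may need updating. The sketch below shows one possible way to rewrite those names in a script's text. The helper update_module_names is hypothetical, and the mapping contains only the two renames listed above; anything else is left untouched.

```python
import re

# Old -> new module names, taken from the renames listed above.
RENAMED_MODULES = {
    "intel_mpi/14.0": "intel_mpi/4.1.1",
    "intel_mpi/14.0.1": "intel_mpi/4.1.3",
}

# Match a complete intel_mpi/<version> token, so that "intel_mpi/14.0"
# is never matched as a prefix of "intel_mpi/14.0.1".
_MODULE_TOKEN = re.compile(r"(?<![\w./-])intel_mpi/\d+(?:\.\d+)*(?![\w/-])")


def update_module_names(script_text):
    """Return script_text with renamed intel_mpi modules substituted."""

    def swap(match):
        name = match.group(0)
        # Unknown versions (for example intel_mpi/5.0.1) are kept as-is.
        return RENAMED_MODULES.get(name, name)

    return _MODULE_TOKEN.sub(swap, script_text)
```

A script could be read, passed through update_module_names, and written back; reviewing the result by hand before resubmitting jobs is still advisable.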
Software Update
• Reminder: Guillimin Hadoop Cluster
– 10 nodes available for MapReduce / Hadoop workloads
– please contact [email protected] for access
– Hadoop Talk @ Le Forum Decideo de Montréal - September 23
• by Dan Mazur of McGill HPC / Calcul Québec
– http://www.forumdecideo.com
Training News
• See ‘Training’ at www.hpc.mcgill.ca for our full calendar of
training and workshops for 2014 and to register
– all materials from previous workshops are available online
• Upcoming:
– September 23 - Xeon Phi Developer Training Event (with Intel)
– September 25 - Introduction to Linux
– October 9 - Introduction to MPI
• Recently Completed:
– September 11 - Introduction to HPC
– August 19 - Scientific Visualization Tools
– July 10 - MapReduce and Hadoop for Big Data
– June 5 - Advanced OpenMP
– May 22 - Introduction to the Xeon Phi
Training News
• See ‘Training’ calendar to register
• Co-hosted with Intel and the McGill HPC Centre of Calcul Québec / Compute Canada
User Feedback and Discussion
• Questions? Comments?
• We value your feedback.
• Guillimin Operational News for Users
– Status Pages
• http://www.hpc.mcgill.ca/index.php/guillimin-status
• http://serveurscq.computecanada.ca (all CQ systems)
– Follow us on Twitter
• http://twitter.com/McGillHPC