Guillimin HPC Users Meeting
September 18, 2014

Bryan Caron
[email protected] [email protected]
McGill University / Calcul Québec / Compute Canada
Montréal, QC Canada

Outline
• Compute Canada News
• Service Interruption - October 17
• Storage System News
• Scheduler Updates
• Software and User Environment Updates
• Training News

Compute Canada News
• Resource Allocation Opportunities Competition 2015
  – Announced September 15
• Three categories:
  – Fast Track
  – Resource Allocations Competition (RAC)
  – Research Platforms and Portals (RPP)
• Fast Track
  – By invitation only
  – Target community: existing 2014 RAC users with minimal changes expected for 2015
  – Simplified application process compared to full RAC request
  – Deadline: October 2, 2014
• Resource Allocations Competition (RAC)
  – For requests larger than a default allocation
    • default allocation sizes are variable between systems and sites
  – Allocation duration: 1 year starting Jan 2015
  – Application Deadline: October 20, 2014
• Research Platforms and Portals (RPP) ** New! **
  – Application category examples:
    • Resources for larger communities of researchers
    • Applications that provide a public platform using CC computing or storage
    • Groups with international agreements for multi-year computing or storage commitments
    • Groups providing shared datasets accessible using non-Compute Canada interfaces / portals
  – Timelines
    • Letter of Intent due September 25
    • Selected projects invited for full application Oct 3
    • Full proposals due October 20
• All applicants are advised to contact CC staff prior to submitting an application and no later than Oct 1st
  – All new applicants MUST contact CC staff
  – Please contact us at [email protected] to discuss your proposals
• Further information:
  – https://www.computecanada.ca
  – General Inquiries about the resource opportunities: [email protected]

Service Interruption
• Guillimin Service Interruption: October 17
  – Scheduled outage due to a full ETS campus-wide power interruption for electrical maintenance
    • Date: Friday October 17 (overnight period)
  – All Guillimin services will be unavailable
    • Specifics of the Guillimin service interruption start time and duration to be announced soon
    • Will take into consideration both the ETS power interruption and other priority Guillimin maintenance actions to be done

Storage System News
• Upcoming Activities
  – Apply patch fix to GPFS to fix bug in version 3.5.0.19
    • either live update with no service interruption or update during future maintenance
    • expected to return storage to optimal tunings that are currently modified to ensure stability with the bug of the current GPFS release
  – Tape Archive (Backup) and Hierarchical Storage Management (HSM) Integration - ongoing

Scheduler Update
• In general improved overall stability and performance
  – Updated to Torque 4.2.8 during the August maintenance period
  – Using "cpusets": each job can only access (is pinned to) as many CPU cores as were requested in the submission
  – A few outstanding issues under review with Adaptive Computing
• Recall: April 10 - qsub for job submission enabled
  – Default PATH settings updated to include Torque commands (qsub, qstat, …)
  – Much faster response for submissions, queries compared to Moab commands (msub, canceljob, …)
  – qsub submission filter: qsub -A <RAPid> now only required if you can access multiple allocations
  – New! qsub can now also be used directly from worker nodes
• Other scheduler updates
  – Jobs specifying gpus=x or mics=x are now automatically routed to the correct queue (k20 or phi)
  – The ScaleMP system is online but has only 120 cores instead of 132 due to hardware issues; please use the scalemp queue for access
• Future work and updates
  – Moab configurations to favour assignment of nodes from within the same IB switch for MPI jobs (fewer hops)
  – Additional qsub filter improvements and features

Software Update
• New Installations
  – pigz v.2.2.5 (parallel gzip) (not a module)
  – pxz/4.999.9 (parallel xz) compiled with xz v.5.1.4
  – FFTW/3.3-serial-intel
  – NAMD/20140822-phi
  – ifort_icc/14.0.4 (new default, from 14.0.1!) and ifort_icc/15.0
  – intel_mpi/5.0.1 (new default!); renamed intel_mpi/14.0 to intel_mpi/4.1.1 and intel_mpi/14.0.1 to intel_mpi/4.1.3
• Future updates
  – MPSS 3.3: software stack update for Intel Phi nodes to improve functionality and performance for MPI jobs using multiple Phi nodes and cards
  – PGI license server migration + installation of version 14.7
• Reminder: Guillimin Hadoop Cluster
  – 10 nodes available for MapReduce / Hadoop workloads
  – please contact [email protected] for access
  – Hadoop Talk @ Le Forum Decideo de Montréal - September 23
    • by Dan Mazur of McGill HPC / Calcul Québec
  – http://www.forumdecideo.com

Training News
• See 'Training' at www.hpc.mcgill.ca for our full calendar of training and workshops for 2014 and to register
  – all materials from previous workshops are available online
• Upcoming:
  – September 23 - Xeon Phi Developer Training Event (with Intel)
    • co-hosted with Intel and the McGill HPC Centre of Calcul Québec / Compute Canada
  – September 25 - Introduction to Linux
  – October 9 - Introduction to MPI
• Recently Completed:
  – September 11 - Introduction to HPC
  – August 19 - Scientific Visualization Tools
  – July 10 - MapReduce and Hadoop for Big Data
  – June 5 - Advanced OpenMP
  – May 22 - Introduction to the Xeon Phi
• See 'Training' calendar to register

User Feedback and Discussion
• Questions? Comments?
• We value your feedback.
• Guillimin Operational News for Users
  – Status Pages
    • http://www.hpc.mcgill.ca/index.php/guillimin-status
    • http://serveurscq.computecanada.ca (all CQ systems)
  – Follow us on Twitter
    • http://twitter.com/McGillHPC