Programming the Intel® Xeon Phi™ Coprocessor

Lab Preparation

- Log into the CSC machines and make sure the Intel Compiler module is loaded before starting this exercise (module load intel).
- Copy the file /share/adurango/pm-handson.tar.gz to a folder in your home directory and unpack it.
- Enter the PM-handson directory.

Native Programming

Enter the Native directory.

"Native" Intel® Xeon Phi™ coprocessor applications treat the coprocessor as a standalone multicore computer. Once the binary is built on the host system, it is copied to the "filesystem" on the coprocessor along with any other binaries and data it requires, and the program is then run from the ssh console. On the CSC system we do not need to copy the binaries manually, because the /home filesystem and the tools partitions are shared over NFS.

Important: run all the commands in this section from the master node where you log in.

- Edit the native.cpp file and note that it is no different from any other program that would run on an Intel® Xeon system. (A minimal sketch of such a program is shown at the end of this section.)
- Compile the program using the -mmic flag to generate a MIC binary. Remember to add the -openmp flag.
- Now log into the coprocessor in your node using: ssh mic0
- Go to the same directory.
- Now try to execute the binary by issuing the command: ./<binary name> 1024
- Did this work? It failed because our environment on the host system is not transferred by ssh to the coprocessor. We need to set the LD_LIBRARY_PATH environment variable so that it points to the directory containing the OpenMP libraries for MIC (i.e., /share/apps/intel/composerxe/lib/mic).
- Export LD_LIBRARY_PATH with the correct value.
- Re-run the previous command. Does it work now?

We can alternatively use ssh to run the binary from the host directly with the following command:

    ssh mic0 "cd $(pwd); LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH ./<binary name> 1024"

- Try to run the command line and see if it works.

Note two things about this command line. First, we set the LD_LIBRARY_PATH environment variable using MIC_LD_LIBRARY_PATH, which is defined by the Intel Compiler setup scripts to point to the location of the MIC version of the compiler libraries. Second, we use $(pwd), which the shell expands to our current working directory, to change to the same directory and execute the binary in a single command line. Sometimes it is not so simple to prepare the execution environment of our application, and it is better to encapsulate all the commands in a shell script that we can then invoke from ssh.

Now we are going to use the CSC queuing system and environment.

- Run the command: srun ./<binary name> 1024
- Did you need to set the LD_LIBRARY_PATH variable? Why do you think that is?

On the CSC computing nodes the Intel Compiler libraries are already installed in a standard directory, which is why no extra environment setup is necessary. On Intel Xeon Phi coprocessors, as on any other system, the exact configuration of your environment will depend on what software environment the system administrators provide. Be sure to check beforehand.
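For reference, the sketch below shows roughly what a native-style OpenMP program of this kind can look like, with an assumed build-and-run sequence in the comments. It is only an illustration: the source and binary names and the exact icpc invocation are assumptions, not taken from the lab's native.cpp.

    // Hypothetical sketch of a native-style OpenMP program (not the actual native.cpp).
    // Assumed build and run steps, to be adapted to your setup:
    //   icpc -mmic -openmp native.cpp -o native.mic
    //   ssh mic0 "cd $(pwd); LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH ./native.mic 1024"
    #include <cstdio>
    #include <cstdlib>
    #include <vector>
    #include <omp.h>

    int main(int argc, char *argv[]) {
        // Problem size taken from the first command-line argument (e.g., 1024).
        const int n = (argc > 1) ? std::atoi(argv[1]) : 1024;
        std::vector<double> a(n), b(n), c(n);

        // A plain OpenMP parallel loop: nothing in the source is MIC-specific.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            a[i] = i;
            b[i] = 2.0 * i;
            c[i] = a[i] + b[i];
        }

        std::printf("max threads = %d, c[n-1] = %f\n",
                    omp_get_max_threads(), c[n - 1]);
        return 0;
    }

The only MIC-specific parts of the workflow are the -mmic compilation and the fact that the resulting binary must be executed on the coprocessor itself.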
Vectorization

Enter the Vectorization directory.

One important skill to master for Intel Xeon Phi coprocessors is the ability to vectorize your application correctly.

- Inspect the serial.cpp file.
- Now compile the program using the -mmic flag and the -vec-report3 flag.
- Did any of the loops vectorize? Why?
- Now compile the program with the -guide-vec option. With this option the compiler gives suggestions on how to change the code so that it vectorizes.
- What is the compiler suggesting?

We have a number of ways to remove pointer aliasing from a source code:

- Add the -fno-alias flag to the compilation line. Does any loop vectorize now?
- Now take a look at the restrict.cpp file. What is the difference with respect to serial.cpp? Compile restrict.cpp using the -restrict flag (do not use -fno-alias). Does any loop vectorize?

We can also influence the compiler's decision to vectorize the code by using a compiler directive.

- Look at the ivdep.cpp and psimd.cpp files. What differences do you see with respect to serial.cpp? And between them? What is the difference between the pragma directives?
- Now compile both files and check whether any loop is vectorized (you will need the -openmp-simd flag for psimd.cpp).

Leveraging Intel MKL Automatic Offload

Enter the MKL directory.

If your application uses (or can use) the Intel MKL library, a simple way to take advantage of the Intel Xeon Phi coprocessor is Automatic Offload (AO).

- Take a look at the mm.cpp file. It performs a number of matrix multiplications using the DGEMM BLAS call. Note that there is no OpenMP parallelization or offloading in the code.
- Compile the application using the -mkl flag (do not use -mmic, as we want this to run on the host).
- Run the application: srun --gres=mic:1 ./<binary name> 8192. This ran the application on the host.
- Set the MKL_MIC_ENABLE environment variable to 1 and re-run the application. Do you observe any changes?

In the last run, the application used both the host and the coprocessor simultaneously to compute the matrix multiplication. Note that the number of threads reported corresponds to those available on the host, so printing this information can be misleading in an offload scenario.

- Set the OFFLOAD_REPORT environment variable to 1 and run the application. You will get a report of how the work was divided for each MKL call.
- Now run the application using 1024 for the first parameter. Do you see a report? What does this mean?
- Set the MKL_HOST_WORKDIVISION environment variable to 0 and to 1 and run the application again with 8192. See how the previous report (and the execution time) changes.
- Use the MKL_MIC_WORKDIVISION environment variable to experiment with different percentages of work sent to the coprocessor.

Note: MKL will use all available Intel Xeon Phi coprocessors by default once AO has been enabled. If you want to limit the number of coprocessors, you can use the OFFLOAD_DEVICES environment variable to specify which ones should be available to the application.

Offloading with OpenMP

To use the coprocessor from an application running natively on the host, you need to offload part of the computation (and the associated data) to the coprocessor. There are different ways to do that, but today we are going to use OpenMP 4.0 target directives. Let's start with a simple MonteCarlo example.

- Start by taking a look at mCarlo_offload.cpp and identify the section of the program that we want to offload to the coprocessor.
- Now add OpenMP directives to offload the region (a generic sketch of the directives is shown after this list):
  o Use the OpenMP target directive to offload the region.
  o Use the OpenMP declare target directive to annotate those functions that will be used inside the target region.
- Note: you will need the -mkl and -openmp flags to compile the application.
- Try running the application and verify that the check function runs on the coprocessor.
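As a reference for the two directives above, here is a hedged sketch of how a function and the region that calls it might be annotated. It is an illustration only: the function name, problem size, and reduction variable are invented and are not taken from mCarlo_offload.cpp.

    // Hypothetical sketch of OpenMP 4.0 offloading (not the actual mCarlo_offload.cpp).
    #include <cstdio>

    // declare target: also generate a coprocessor version of this function
    // so it can be called inside a target region.
    #pragma omp declare target
    double sample(int i) {
        return 1.0 / (1.0 + i);   // stand-in for the real per-sample work
    }
    #pragma omp end declare target

    int main() {
        const int n = 1 << 20;
        double sum = 0.0;

        // target: offload this region to the default device (the coprocessor);
        // the map clause moves the reduction result back to the host.
        #pragma omp target map(tofrom: sum)
        #pragma omp parallel for reduction(+: sum)
        for (int i = 0; i < n; ++i) {
            sum += sample(i);
        }

        std::printf("sum = %f\n", sum);
        return 0;
    }

Inside the target region the usual OpenMP constructs (parallel for, reductions, and so on) execute on the coprocessor's threads.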
- Now use the OMP_NUM_THREADS environment variable to complete the following table:

      Number of threads:   1    2    16    32    64    120    180    240
      Runtime (seconds):

Now let's continue with our previous matrix multiply example, a copy of which you will find in the current directory as mm_offload.cpp.

- Develop a version using the target directive that effectively shows the number of threads that will be used on the coprocessor.
- Add a target directive to the mmul call. You will need to use the map clause to transfer the matrices in and out of the coprocessor on each call.

Right now the application transfers the data too many times. We can improve that by using a target data directive that encompasses all the target regions that use the same data.

- Define a target data region around the loop performing the different matrix multiplications. You still need to keep the target directive on the mmul call so that it is offloaded to the device. (A sketch of this pattern is included at the end of this document.)
  o Note: because of a compiler bug, also leave the map clause on the target directive, although that should not be necessary.
- How do the two versions compare?

OpenMP Affinity

Thread affinity is very important for getting good performance on Intel Xeon Phi coprocessors. Let's use the matrix multiply we just developed to experiment with it.

- Using the OMP_NUM_THREADS, OMP_PROC_BIND and OMP_PLACES environment variables, fill in the following table. (With an affinity policy of false, the place definition has no effect, so that column is left blank for those rows.)

      Number of threads   Place definition   Affinity policy   Iteration average time (s.)
      (OMP_NUM_THREADS)   (OMP_PLACES)       (OMP_PROC_BIND)
      60                  -                  false
      120                 -                  false
      240                 -                  false
      60                  threads            close
      120                 threads            close
      240                 threads            close
      60                  cores              close
      120                 cores              close
      240                 cores              close
      60                  threads            spread
      120                 threads            spread
      240                 threads            spread
      60                  cores              master
      120                 cores              master
      240                 cores              master

- What is the difference between the different settings? You can set the KMP_AFFINITY environment variable to "verbose" to see the exact mapping of OpenMP threads.

Note that these results are specific to this application. Different applications will require different affinity policies for optimal performance.

Simultaneous computation between the host and the coprocessor

Enter the SimultaneousComputing directory.

For optimal use of resources you might want to use both the host and the coprocessor at the same time. Here we have a mock-up application that provides a pattern for doing so.

- Edit the simultcompute.cpp file and study how the pattern was implemented. In particular, note the OpenMP task directives surrounding the target directives. (A generic sketch of the idea follows at the end of this section.)
- Compile the application (use the -openmp flag) and run it. Does it behave as you expected?
- Set OMP_NUM_THREADS to 1 and run it again. How do you explain the output?
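To make the task-plus-target pattern concrete, here is a hedged sketch of the general idea: one task offloads part of the work with a target region while a second task keeps the host busy. The work split and the function names are invented for the illustration; the actual simultcompute.cpp may organize this differently.

    // Hypothetical sketch of overlapping host and coprocessor work with
    // OpenMP tasks wrapping target regions (not the actual simultcompute.cpp).
    #include <cstdio>

    #pragma omp declare target
    static void work(double *x, int n, double scale) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            x[i] = scale * i;
    }
    #pragma omp end declare target

    int main() {
        const int n = 1 << 20;
        double *dev_part  = new double[n];
        double *host_part = new double[n];

        #pragma omp parallel
        #pragma omp single
        {
            // Task 1: offload half of the work. The thread running this task
            // blocks inside the target region, so wrapping the offload in a
            // task lets another host thread keep working in the meantime.
            #pragma omp task
            {
                #pragma omp target map(tofrom: dev_part[0:n])
                work(dev_part, n, 2.0);
            }

            // Task 2: compute the other half on the host at the same time.
            #pragma omp task
            work(host_part, n, 3.0);

            #pragma omp taskwait
        }

        std::printf("%f %f\n", dev_part[n - 1], host_part[n - 1]);
        delete[] dev_part;
        delete[] host_part;
        return 0;
    }

With OMP_NUM_THREADS set to 1 there is only one host thread to execute both tasks, so it cannot run the host task while it is blocked inside the target region, and the two parts will typically serialize rather than overlap.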
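Finally, returning to the target data exercise on mm_offload.cpp from the offloading section, the sketch below illustrates the intended pattern: a target data region keeps the matrices resident on the device across the iteration loop, while each call still carries its own target directive. The matrix names, sizes, and the mmul signature used here are assumptions and do not come from the lab source.

    // Hypothetical sketch of a target data region around repeated offloads
    // (not the actual mm_offload.cpp; names and sizes are invented).
    #include <cstdio>

    #pragma omp declare target
    static void mmul(const double *a, const double *b, double *c, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double s = 0.0;
                for (int k = 0; k < n; ++k)
                    s += a[i * n + k] * b[k * n + j];
                c[i * n + j] = s;
            }
    }
    #pragma omp end declare target

    int main() {
        const int n = 512, iters = 4;
        double *a = new double[n * n], *b = new double[n * n], *c = new double[n * n];
        for (int i = 0; i < n * n; ++i) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

        // target data: allocate and transfer the matrices once, instead of on
        // every target region inside the loop.
        #pragma omp target data map(to: a[0:n*n], b[0:n*n]) map(tofrom: c[0:n*n])
        for (int it = 0; it < iters; ++it) {
            // The target directive is still needed so the call is offloaded; the
            // map clause here causes no extra transfers because the data is
            // already present, and is kept only because of the compiler bug
            // mentioned in the exercise.
            #pragma omp target map(to: a[0:n*n], b[0:n*n]) map(tofrom: c[0:n*n])
            mmul(a, b, c, n);
        }

        std::printf("c[0] = %f\n", c[0]);
        delete[] a; delete[] b; delete[] c;
        return 0;
    }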