Programming Exercise - PRACE Training Portal

Programming the Intel® Xeon Phi™ Coprocessor Lab

Preparation



• Log into the CSC machines and, before starting this exercise, make sure the Intel Compiler module is loaded with the command module load intel.
• Copy the file /share/adurango/pm-handson.tar.gz to a folder in your home directory and unpack it.
• Enter the PM-handson directory.
Native Programming

• Enter the Native directory.
“Native” Intel® Xeon Phi™ coprocessor applications treat the coprocessor as a standalone multicore
computer. Once the binary is built on your host system, it is copied to the “filesystem” on the
coprocessor along with any other binaries and data it requires. The program is then run from the ssh
console. In the CSC system, we do not need to copy the binaries manually because the /home
filesystem and the tools partitions are shared by NFS.
Important: Run all the commands in this section from the master node where you log in.

• Edit the native.cpp file and note that it is no different from any other program that would run on an Intel® Xeon system.
• Compile the program using the -mmic flag to generate a MIC binary. Remember to add the -openmp flag.
• Now log into the coprocessor in your node using: ssh mic0
• Go to the same directory.
• Now try to execute the binary by issuing the command: ./<binary name> 1024
• Did this work?
The reason this failed is that our environment on the host system is not transferred by ssh to the coprocessor. We need to set the LD_LIBRARY_PATH environment variable so that it points to the directory that contains the OpenMP libraries for MIC (i.e., /share/apps/intel/composerxe/lib/mic).
• Export LD_LIBRARY_PATH with the correct value.
• Re-run the previous command. Does it work now?
We can alternatively use ssh to run the binary from the host directly by using the following command:
ssh mic0 "cd $(pwd); LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH ./<binary name> 1024"

• Try to run the command line and see if it works.
Note two things about this command line. First, we set the LD_LIBRARY_PATH environment variable using MIC_LD_LIBRARY_PATH, which is defined by the Intel Compiler setup scripts to point to the location of the MIC version of the compiler libraries. Second, we use $(pwd), which is expanded by the shell to our current working directory, to change to the same directory and execute the binary in a single command line.
Sometimes it is not so simple to prepare the execution environment of our application, and it is better to encapsulate all the commands in a shell script that we can then invoke from ssh.
Now we are going to use the CSC queuing system and environment.

• Run the command: srun ./<binary name> 1024
• Did you need to set the LD_LIBRARY_PATH variable? Why do you think that is?
On the CSC compute nodes the Intel Compiler libraries are already installed in a standard directory, which is why no extra environment setup is necessary. On Intel Xeon Phi coprocessors, as in any other system, the exact configuration of your environment will depend on the software environment the system administrators provide. Be sure to check beforehand.
Vectorization

• Enter the Vectorization directory.
One important skill to master for Intel Xeon Phi coprocessors is the ability to vectorize your application correctly.

• Inspect the serial.cpp file.
• Now compile the program using the -mmic flag and the -vec-report3 flag.
• Did any of the loops vectorize? Why?
• Now compile the program with the -guide-vec option. With this option the compiler will suggest how to change the code so that it vectorizes.
• What is the compiler suggesting?

We have a number of ways to remove pointer aliasing from a source code:
• Add the -fno-alias flag to the compilation line. Does any loop vectorize now?
• Now take a look at the restrict.cpp file. What is the difference with respect to serial.cpp? (A sketch of the idea follows this list.)
• Compile restrict.cpp using the -restrict flag (do not use -fno-alias). Does any loop vectorize?
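
Below is a minimal sketch of the idea behind restrict.cpp; the function and variable names are placeholders and may differ from the actual lab source. Qualifying the pointers with restrict (accepted by the Intel compiler in C++ when the -restrict flag is used) promises that they do not alias, so the compiler is free to vectorize the loop.

    // Hypothetical example, not the actual serial.cpp/restrict.cpp code.
    // Without restrict the compiler must assume that a, b and c may overlap
    // in memory, so it cannot safely vectorize the loop.
    void add(double *restrict a, const double *restrict b,
             const double *restrict c, int n)
    {
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + c[i];   // no aliasing possible, so the loop can vectorize
    }
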
We can also influence the compiler's decision to vectorize the code by using a compiler directive.
• Look at the ivdep.cpp and psimd.cpp files. What differences do you see with respect to serial.cpp? And between the two of them? What is the difference between the pragma directives? (See the sketch after this list.)
• Now compile both files and check whether any loop is vectorized (you will need the -openmp-simd flag for psimd.cpp).
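
As a rough illustration of the difference between the two directives (the loops below are placeholders, not the actual lab code): #pragma ivdep is an Intel-specific hint telling the compiler to ignore assumed vector dependences, whereas #pragma omp simd is the portable OpenMP 4.0 directive instructing it to generate SIMD code for the loop.

    // Hypothetical loops, shown only to contrast the two directives.
    void scale_ivdep(double *a, const double *b, int n)
    {
    #pragma ivdep                  // Intel hint: assume no loop-carried dependences
        for (int i = 0; i < n; ++i)
            a[i] += b[i];
    }

    void scale_simd(double *a, const double *b, int n)
    {
    #pragma omp simd               // OpenMP 4.0: vectorize this loop
        for (int i = 0; i < n; ++i)
            a[i] += b[i];
    }
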
Leveraging Intel MKL Automatic Offload

• Enter the MKL directory.
If your application uses (or can use) the Intel MKL library, a simple way to take advantage of the Intel Xeon Phi coprocessor is to use Automatic Offload (AO).

• Take a look at the mm.cpp file. It performs a number of matrix multiplications using the DGEMM BLAS call. Note that there is no OpenMP parallelization or offloading in the code. (A sketch of such a call follows this list.)
• Compile the application using the -mkl flag (do not use -mmic, as we want this to run on the host).
• Run the application: srun --gres=mic:1 ./<binary name> 8192. This ran the application on the host.
• Set the MKL_MIC_ENABLE environment variable to 1 and re-run the application. Do you observe any changes?
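
For reference, an MKL call that AO can accelerate looks like any ordinary BLAS call; the sketch below only illustrates the general shape of such a call and is not the actual mm.cpp code.

    #include <mkl.h>

    // Plain DGEMM call (C = A * B with n x n matrices). With MKL_MIC_ENABLE=1,
    // MKL decides at run time whether to split the work between the host and
    // the coprocessor; no source change or offload directive is needed.
    void multiply(int n, const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    }
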
In the last run, the application was using both the host and the coprocessor simultaneously to compute the matrix multiplication. Note that the number of threads reported corresponds to the ones available on the host, so printing this information can be misleading in an offload scenario.

• Set the OFFLOAD_REPORT environment variable to 1 and run the application. You will get a report of how the work was divided for each MKL call.
• Now run the application using 1024 for the first parameter. Do you see a report? What does this mean?
• Set the MKL_HOST_WORKDIVISION environment variable to 0 and to 1 and run the application again with 8192. See how the previous report (and the execution time) changes.
• Use the MKL_MIC_WORKDIVISION environment variable to experiment with different percentages of work sent to the coprocessor.
Note: MKL will use all available Intel Xeon Phi coprocessors by default once AO has been enabled. If you want to limit the number of coprocessors, you can use the OFFLOAD_DEVICES environment variable to specify which ones should be available to the application.
Offloading with OpenMP
To use the coprocessor from a host native application you need to offload part of the computation (and the associated data) to the coprocessor. There are different ways to do that, but today we are going to use OpenMP 4.0 target directives.
Let’s start with a simple Monte Carlo example.


• Start by taking a look at mCarlo_offload.cpp and identify the section of the program that we want to offload to the coprocessor.
• Now add OpenMP directives to offload the region (a sketch follows this list):
    o Use the OpenMP target directive to offload the region.
    o Use the OpenMP declare target directive to annotate those functions that will be used inside the target region.
Note: You will need the -mkl and -openmp flags to compile the application.
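
A minimal sketch of the two directives is shown below; the names (payoff, simulate, nsamples) are placeholders and not the actual symbols in mCarlo_offload.cpp.

    // Functions called inside a target region must also be compiled for the device.
    #pragma omp declare target
    static double payoff(double x) { return x * x; }   // placeholder computation
    #pragma omp end declare target

    // Offload the compute loop to the coprocessor and copy the result back.
    double simulate(int nsamples)
    {
        double sum = 0.0;
    #pragma omp target map(tofrom: sum)
    #pragma omp parallel for reduction(+: sum)
        for (int i = 0; i < nsamples; ++i)
            sum += payoff((double)i / nsamples);
        return sum;
    }
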


• Try running the application and verify that the check function runs on the coprocessor.
• Now, use the OMP_NUM_THREADS environment variable to complete the following table:

    Number of threads    Runtime (seconds)
    1
    2
    16
    32
    64
    120
    180
    240
Now let’s continue with our previous matrix multiply example; you will find a copy in the current directory called mm_offload.cpp.

• Develop a version using the target directive that effectively shows the number of threads that will be used on the coprocessor.
• Add a target directive for the mmul call. You will need to use the map clause to transfer the matrices in and out of the coprocessor on each call (a sketch of both steps follows this list).
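
A hedged sketch of both steps follows; mmul, a, b, c and n are placeholder names that may differ from the actual mm_offload.cpp. The first function queries the thread count from inside a target region (and therefore reports the coprocessor’s threads), and the second offloads a single mmul call, mapping the matrices in and out on every invocation.

    #include <omp.h>
    #include <cstdio>

    // Report the number of threads available inside the target region,
    // i.e. on the coprocessor rather than on the host.
    void report_device_threads()
    {
        int nthreads = 0;
    #pragma omp target map(from: nthreads)
        nthreads = omp_get_max_threads();
        printf("Threads on the coprocessor: %d\n", nthreads);
    }

    // mmul must be compiled for the device as well.
    #pragma omp declare target
    void mmul(const double *a, const double *b, double *c, int n)
    {
    #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double s = 0.0;
                for (int k = 0; k < n; ++k)
                    s += a[i * n + k] * b[k * n + j];
                c[i * n + j] = s;
            }
    }
    #pragma omp end declare target

    // Offload one multiplication, copying the matrices in and the result out
    // on every call.
    void run_once(const double *a, const double *b, double *c, int n)
    {
    #pragma omp target map(to: a[0:n*n], b[0:n*n]) map(from: c[0:n*n])
        mmul(a, b, c, n);
    }
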
Right now the application is transferring the data too many times. We can improve that by using a target data directive that encompasses all the target regions that use the same data.

• Define a target data region around the loop doing the different matrix multiplications. You still need to keep the target directive on the mmul call so that it is offloaded to the device (a sketch follows this list).
    o Note: Because of a compiler bug, leave the map clause on the target directive as well, although it should not be necessary.
• How do the two versions compare?
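
Under the same naming assumptions, a sketch of the target data version (reusing the mmul sketch above): the matrices stay resident on the device for the whole loop, while each iteration still offloads the mmul call.

    // Keep the matrices on the coprocessor across all repetitions.
    void run_all(const double *a, const double *b, double *c, int n, int reps)
    {
    #pragma omp target data map(to: a[0:n*n], b[0:n*n]) map(from: c[0:n*n])
        {
            for (int r = 0; r < reps; ++r) {
                // Inside the target data region this map clause should be
                // redundant; it is kept only because of the compiler bug
                // mentioned above.
    #pragma omp target map(to: a[0:n*n], b[0:n*n]) map(from: c[0:n*n])
                mmul(a, b, c, n);
            }
        }
    }
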
OpenMP Affinity
Thread affinity is very important to obtain good performance on Intel Xeon Phi coprocessors. Let’s use our just-developed matrix multiply to experiment with it.

• Using the OMP_NUM_THREADS, OMP_PROC_BIND and OMP_PLACES environment variables, fill in the following table:
    Number of threads   Place definition   Affinity Policy   Iteration average time (s.)
    60                                     false
    120                                    false
    240                                    false
    60                  threads            close
    120                 threads            close
    240                 threads            close
    60                  cores              close
    120                 cores              close
    240                 cores              close
    60                  threads            spread
    120                 threads            spread
    240                 threads            spread
    60                  cores              master
    120                 cores              master
    240                 cores              master
• What is the difference between the different settings? You can set the KMP_AFFINITY environment variable to “verbose” to see the exact mapping of OpenMP threads.
Note that these results are specific to this application. Different applications will require different affinity policies for optimal performance.
Simultaneous computation between the host and the coprocessor

• Enter the SimultaneousComputing directory.
For optimal use of resources you might want to use both the host and the coprocessor at the same time.
Here we have a mock-up application that provides a pattern to do so.


• Edit the simultcompute.cpp file and study how the pattern was implemented. In particular, note the OpenMP task directives surrounding the target directives (a sketch of the pattern follows this list).
• Compile the application (use the -openmp flag) and run it. Does it behave as you expected? Set OMP_NUM_THREADS to 1 and run it again. How do you explain the output?
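
A hedged mock-up of the pattern (the function names below are placeholders, not the ones in simultcompute.cpp): one explicit task performs the offload, and while the thread executing it waits on the target region, another thread can pick up the second task and compute on the host.

    void work_on_host(double *x, int n)
    {
        for (int i = 0; i < n; ++i)
            x[i] *= 2.0;                      // placeholder host computation
    }

    #pragma omp declare target
    void work_on_device(double *y, int n)
    {
    #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] *= 2.0;                      // placeholder offloaded computation
    }
    #pragma omp end declare target

    void simultaneous(double *x, double *y, int n)
    {
    #pragma omp parallel
    #pragma omp single
        {
    #pragma omp task                          // this task blocks on the offload...
            {
    #pragma omp target map(tofrom: y[0:n])
                work_on_device(y, n);
            }
    #pragma omp task                          // ...while this one runs on the host
            work_on_host(x, n);
    #pragma omp taskwait
        }
    }

With only one OpenMP thread there is no second thread to execute the host task while the first one waits on the target region, so the two parts run one after the other instead of overlapping.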