OpenPOWER Application Optimization

2
SCOPE OF THE PRESENTATION
• Outline Tuning strategies to improve performance of programs on POWER9 processors
• Performance bottlenecks can arise in the processor front end and back end
• Lets discuss some of the bottlenecks and how we can work around them using compiler flags,
source code pragmas/attributes
• This talk refers to compiler options supported by open source compilers such as GCC. Latest
version available publicly is 9.2.0 which is what we will use for the handson. Most of it carries
over to LLVM as it is. A slight variation works with IBM proprietary compilers such as XL

POWER9 PROCESSOR
3
• Optimized for Stronger Thread Performance and Efficiency
• Increased Execution Bandwidth efficiency for a range of workloads including commercial,
cognitive and analytics
• Sophisticated instruction scheduling and branch prediction for unoptimized applications and
interpretive languages
IBM Systems / version 1.0 / November, 2019 / © 2018 IBM Corporation

4
• Shorter Pipelines with reduced disruption
• Improved Application Performance for Modern
Codes
• Higher Performance and Pipeline Utilization
• Removed instruction grouping
• Enhanced instruction fusion
• Pipeline can complete upto 128 (64-SMT4)
instructions /cycle
• Reduced Latency and Improved Scalability
• Improved pipe control of load/store
instructions
• Improved hazard avoidance

FORMAT OF TODAYS DISCUSSION
5
Brief presentation on optimization strategies
Followed by handson exercises
Initial steps -
>ssh –l student<n> orthus.nic.uoregon.edu
>ssh gorgon
Once you have a home directory make a directory with your name within the home/student<n>
>mkdir /home/student<n>/<yourname>
copy the following files into them
> cp -rf /home/users/gansys/archana/Handson .
You will see the following directories within Handson/
Task1/
Task2/
Task3/
Task4/
During the course of the presentation we will discuss the exercises inline and you can try them on the machine

6
PERFORMANCE TUNING IN THE FRONT-END
• Front end fetches and decodes the successive instructions and passes them to the backend for
processing
• POWER9 is a superscalar processor and is pipeline based so works with an advanced branch
predictor to predict the sequence and fetch instructions in advance
• We have call branches, loop branches
• Typically we use the following strategies to work around bottlenecks seen around branches –
• Unrolling, inlining using pragmas/attributes/manually in source (if compiler does not
automatically)
• Converting control to data dependence using ?: and compiling with –misel for difficult to
predict branches
• Drop hints using __builtin_expect(var, value) to simplify compiler’s scheduling
• Indirect call promotion to promote more inlining

7
PERFORMANCE TUNING IN THE BACK-END
• Backend is concerned with executing of the instructions that were fetched and
dispatched to the appropriate units
• Compiler takes care of making sure dependent instructions are far from each other
in its scheduling pass automatically
• Tuning backend performance involves optimal usage of Processor
Resources. We can tune the performance using following.
• Registers- using instructions that reduce reg usage, Vectorization /
reducing pressure on GPRs/ ensuring more throughput, Making loops
free of pointers and branches as much as possible to enable more
vectorization
• Caches – data layout optimizations that reduce footprint, using –fshort-
enums, Prefetching – hardware and software
• System Tuning- parallelization, binding, largepages, optimized libraries

8
STRUCTURE OF HANDSON EXERCISE
• All the handson exercises work on the Jacobi application
• The application has two versions – poisson2d_reference (referred to as
poisson2d_serial in Task4) and poisson2d
• Inorder to showcase an optimization impact, poisson2d is optimized and
poisson2d_reference is minimally optimized to a baseline level and the performance
of the two routines are compared
• The application internally measures the time and prints the speedup
• Higher the speedup higher is the impact of the optimization in focus
• For the handson we work with gcc (9.2.0) and pgi compilers (19.10)
• Solutions are indicated in the Solutions/ folder within each of the Task directories

9
TASK1: BASIC COMPILER FLAGS
• Here the poisson2d_reference.c is optimized at O3 level
• The user needs to optimize poisson2d.c with Ofast level
• Build and run the application poisson2d
• What is the speedup you observe and why ?
• You can generate a perf profile using perf record –e cycles ./poisson2d
• Running perf report will show you the top routines and you can compare
performance of poisson2d_reference and poisson2d to get an idea

10
TASK2: SW PREFETCHING
• Now that we saw that Ofast improved performance beyond O3 lets optimize
poisson2d_reference at Ofast and see if we can further improve it
• The user needs to optimize the poisson2d with sw prefetching flag
• Build and run the application
• What is the speedup you observe?
• Verify whether sw prefetching instructions have been added
• Grep for dcbt in the objdump file

11
TASK3: OPENMP PARALLELIZATION
• The jacobi application is highly parallel
• We can using openMP pragmas parallelize it and measure the speedup
• The source file has openMP pragmas in comments
• Uncomment them and build with openMP options –fopenmp and link with –lgomp
• Run with multiple threads and note the speedup
• OMP_NUM_THREADS=4 ./poisson2d

12
TASK3.1: OPENMP PARALLELIZATION
• Running lscpu you will see Thread(s) per core: 4
• You will see the setting as SMT=4 on the system; You can verify by running
ppc64_cpu –smt on the command line
• Run cat /proc/cpuinfo to determine the total number of threads, cores in the system
• Obtain the thread sibling list of CPU0, CPU1 etc.. Reading the file
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list 0-3
• Referring to the sibling list, Set n1, .. n4 to threads in same core and run for example-
• $(SC19_SUBMIT_CMD) time OMP_NUM_PLACES=“{0},{1},{2},{3}"
OMP_NUM_THREADS=4 ./poisson2d 1000 1000 1000
• Set n1, .. n4 to threads in different cores and run for example-
• Compare Speedups; Which one is higher?

13
TASK3.2: IMPACT OF BINDING
• Running lscpu you will see Thread(s) per core: 4
• You will see the setting as SMT=4 on the system; You can verify by running
ppc64_cpu –smt on the command line
• Run cat /proc/cpuinfo to determine the total number of threads, cores in the system
• Obtain the thread sibling list of CPU0, CPU1 etc.. Reading the file
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list 0-3
• Referring to the sibling list, Set n1, .. n4 to threads in same core and run for example-
• Set n1, .. n4 to threads in different cores and run for example-
• Compare Speedups; Which one is higher?

14
TASK4: ACCELERATE USING GPUS
• You can attempt this after the lecture on GPUs
• Jacobi application contains a large set of parallelizable loops
• Poisson2d.c contains commented openACC pragmas which should be
uncommented, built with appropriate flags and run on an accelerated platform
• #pragma acc parallel loop
• In case you want to refer to Solution - poisson2d.solution.c
• You can compare the speedup by running poisson2d without the pragmas and
running the poisson2d.solution
• For more information you can refer to the Makefile

15
TASK1: BASIC COMPILER FLAGS- SOLUTION
– This hands-on exercise illustrates the impact of the Ofast flag
– Ofast enables –ffast-math option that implements the same math function in a way
that does not require guarantees of IEEE / ISO rules or specification and avoids the
overhead of calling a function from the math library
– If you look at the perf profile, you will observe poisson2d_reference makes a call to
fmax
– Whereas poisson2d.c::main() of poisson2d generates native instructions such as
xvmax as it is optimized at Ofast

16
TASK2: SW PREFETCHING- SOLUTION
– Compiling with a prefetch flag enables the compiler to analyze the code and insert __dcbt and __dcbtst
instructions into the code if it is beneficial
– __dcbt and __dcbtst instructions prefetch memory values into L3 ; __dcbt is for load and __dcbtst is for store
– POWER9 has prefetching enabled both at HW and SW levels
– At HW level, prefetching is “ON” by default
– At the SW level, you can request the compiler to insert prefetch
instructions ; However the compiler can choose to ignore the
request if it determines that it is not beneficial to do so.
– You will find that the compiler generates prefetch instructions when the application is compiled at the Ofast level
but not when
It is compiled at the O3 level
– That is because in the O3 binary the time is dominated by __fmax call which causes the compiler to come to the
conclusion that whatever benefit we obtain by adding SW prefetch will be overshadowed by the penalty of fmax
– GCC may add further loop optimizations such as unrolling upon invocation of –fprefetch-loop-arrays

17
TASK3.1: OPENMP PARALLELIZATION
• Running the openMP parallel version you will see speedups with increasing number of OMP_NUM_THREADS
• [student02@gorgon Task3]$ OMP_NUM_THREADS=1 ./poisson2d
• 1000x1000: Ref: 2.3467 s, This: 2.5508 s, speedup: 0.92
• Likewise if you bind threads across different cores you will see greater speedup
• [student02@gorgon Task3]$ OMP_PLACES="{0},{1},{2},{3}" OMP_NUM_THREADS=4 ./poisson2d
• [student02@gorgon Task3]$ OMP_PLACES="{0},{5},{10},{15}" OMP_NUM_THREADS=4 ./poisson2d

18
TASK4: ACCELERATE USING GPUS
• Building and running poisson2d as it is, you will see no speedups
• [student02@gorgon Task4]$ make poisson2d
• /opt/pgi/linuxpower/19.10/bin/pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d_serial.c -o
poisson2d_serial.o
• /opt/pgi/linuxpower/19.10/bin/pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d.c poisson2d_serial.o -
o poisson2d
• [student02@gorgon Task4]$ ./poisson2d
• ….
• 2048x2048: 1 CPU: 5.0743 s, 1 GPU: 4.9631 s, speedup: 1.02
• If you build poisson2d.solution which is the same as poisson2d.c with the OpenACC pragmas and run them on the platform which will
accelerate by pushing the parallel portions to the GPU you will see a massive speedup
• [student02@gorgon Task4]$ make poisson2d.solution
• /opt/pgi/linuxpower/19.10/bin/pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d.solution.c
poisson2d_serial.o -o poisson2d.solution
• [student02@gorgon Task4]$ ./poisson2d.solution
• 2048x2048: 1 CPU: 5.0941 s, 1 GPU: 0.1811 s, speedup: 28.13

19
•SUMMARY
• Today we talked about
• Tuning strategies pertaining to the various units in the POWER9 HW –
• Front-end, Back-end
• Some of these strategies were compiler flags, source code pragmas that
one can apply to see improved performance of their programs
• We also saw additional ways of improving performance such as parallelization,
binding etc
• Hopefully the associated handson exercises gave you a more practical experience
in applying these concepts in optimizing an application
Disclaimer: This presentation is intended to represent the views of the author rather than IBM and the recommended solutions are not guaranteed
on sub optimal conditions

21
•
•
•
•
•
•
•
•
•
•
•

22
•
•
•
•
•
•
•

23
•
•
•
•
•
•

24
•
•
• 4 32 BIT WORDS 8 HALF-WORDS 16 BYTES
•
•
•
•
•
•
•

25
Flag Kind XL GCC/LLVM
Can be simulated
in source
Benefit Drawbacks
Unrolling -qunroll -funroll-loops
#pragma
unroll(N)
Unrolls loops ; increases
opportunities pertaining to
scheduling for compiler Increases register pressure
Inlining -qinline=auto:level=N -finline-functions
Inline always
attribute or
manual inlining
increases opportunities for
scheduling; Reduces
branches and loads/stores
Increases register
pressure; increases code
size
Enum small -qenum=small -fshort-enums -manual typedef Reduces memory footprint
Can cause issues in
alignment
isel
instructions -misel Using ?: operator
generates isel instruction
instead of branch;
reduces pressure on branch
predictor unit
latency of isel is a bit
higher; Use if branches
are not predictable easily
General
tuning
-qarch=pwr9,
-qtune=pwr9
-mcpu=power8,
-mtune=power9
Turns on platform specific
tuning
64bit
compilation-q64 -m64
Prefetching
-
qprefetch[=aggressiv
e] -fprefetch-loop-arrays
__dcbt/__dcbtst,
_builtin_prefetch reduces cache misses
Can increase memory
traffic particularly if
prefetched values are not
used
Link time
optimizatio
n -qipo -flto , -flto=thin
Enables Interprocedural
optimizations
Can increase overall
compilation time
Profile
directed
-fprofile-generate and
–fprofile-use LLVM has
an intermediate step

OpenPOWER Application Optimization

Related slideshows

Recommended for you

Recommended for you

Recommended for you

Recommended for you

Recommended for you

Recommended for you

More Related Content

What's hot

What's hot (20)

Similar to OpenPOWER Application Optimization

Similar to OpenPOWER Application Optimization (20)

More from Ganesan Narayanasamy

More from Ganesan Narayanasamy (20)

Recently uploaded

Recently uploaded (20)

OpenPOWER Application Optimization