OpenPOWER Application Optimization
2
SCOPE OF THE PRESENTATION
• Outline tuning strategies to improve the performance of programs on POWER9 processors
• Performance bottlenecks can arise in the processor front end and back end
• Let's discuss some of these bottlenecks and how to work around them using compiler flags and
source code pragmas/attributes
• This talk refers to compiler options supported by open source compilers such as GCC. The latest
publicly available version is 9.2.0, which is what we will use for the hands-on. Most of it carries
over to LLVM as is. A slight variation works with IBM proprietary compilers such as XL
POWER9 PROCESSOR
3
• Optimized for Stronger Thread Performance and Efficiency
• Increased Execution Bandwidth efficiency for a range of workloads including commercial,
cognitive and analytics
• Sophisticated instruction scheduling and branch prediction for unoptimized applications and
interpretive languages
IBM Systems / version 1.0 / November, 2019 / © 2018 IBM Corporation
4
• Shorter Pipelines with reduced disruption
• Improved Application Performance for Modern
Codes
• Higher Performance and Pipeline Utilization
• Removed instruction grouping
• Enhanced instruction fusion
• Pipeline can complete up to 128 (64 for the SMT4 core)
instructions per cycle
• Reduced Latency and Improved Scalability
• Improved pipe control of load/store
instructions
• Improved hazard avoidance

FORMAT OF TODAY'S DISCUSSION
5
Brief presentation on optimization strategies
Followed by hands-on exercises
Initial steps -
>ssh -l student<n> orthus.nic.uoregon.edu
>ssh gorgon
Once you have a home directory, make a directory with your name within /home/student<n>
>mkdir /home/student<n>/<yourname>
Copy the following files into it
> cp -rf /home/users/gansys/archana/Handson .
You will see the following directories within Handson/
Task1/
Task2/
Task3/
Task4/
During the course of the presentation we will discuss the exercises inline and you can try them on the machine
6
PERFORMANCE TUNING IN THE FRONT-END
• The front end fetches and decodes successive instructions and passes them to the back end for
processing
• POWER9 is a superscalar, pipelined processor, so it relies on an advanced branch
predictor to predict the instruction sequence and fetch instructions in advance
• Branches include call branches and loop branches
• Typically we use the following strategies to work around bottlenecks seen around branches:
• Unrolling and inlining, using pragmas/attributes or manually in source (if the compiler does not
do so automatically)
• Converting control dependence to data dependence using ?: and compiling with -misel for
difficult-to-predict branches
• Dropping hints using __builtin_expect(var, value) to simplify the compiler's scheduling
• Indirect call promotion to enable more inlining
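The two source-level strategies above, the ?: conversion and the __builtin_expect hint, can be sketched as follows. This is a minimal illustration, not code from the hands-on material: the function names and the LIKELY/UNLIKELY macros are hypothetical, and whether the compiler actually emits isel or reorders the rare path depends on the compiler version and flags.

```c
#include <stddef.h>

/* Hypothetical helper macros (not from the slides) wrapping the GCC/LLVM
 * built-in; !!(x) normalizes any non-zero value to 1. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Control dependence expressed as data dependence: the ?: form gives the
 * compiler the option of emitting a select (isel with -misel) instead of
 * a conditional branch. */
long clamp_sum(const long *v, size_t n, long limit)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        long x = v[i];
        sum += (x > limit) ? limit : x;   /* candidate for isel */
    }
    return sum;
}

/* Branch hint: negative entries are assumed to be rare, so the early
 * return is marked as the unlikely path. Returns -1 on a negative entry. */
long checked_sum(const long *v, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(v[i] < 0))
            return -1;
        sum += v[i];
    }
    return sum;
}
```

One way to check the effect is to compile with something like gcc -O3 -mcpu=power9 -misel -S and look for isel in the generated assembly for clamp_sum.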
7
PERFORMANCE TUNING IN THE BACK-END
• The back end is concerned with executing the instructions that were fetched and
dispatched to the appropriate units
• The compiler's scheduling pass automatically keeps dependent instructions far apart
from each other
• Tuning back-end performance involves optimal usage of processor
resources. We can tune performance using the following:
• Registers: using instructions that reduce register usage; vectorization, which
reduces pressure on GPRs and ensures more throughput; making loops as
free of pointers and branches as possible to enable more
vectorization
• Caches: data layout optimizations that reduce footprint, such as using -fshort-enums;
prefetching, both hardware and software
• System tuning: parallelization, binding, large pages, optimized libraries
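As a small sketch of a loop shaped for vectorization: the body has no branches, and the restrict qualifiers tell the compiler that the two arrays do not overlap, which removes one common obstacle to vectorizing pointer-based loops. The function name axpy is illustrative and not taken from the hands-on code.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
 * restrict is a promise from the caller that x and y do not alias; passing
 * overlapping arrays would be undefined behavior. */
void axpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

With GCC, adding -fopt-info-vec to a command such as gcc -O3 -mcpu=power9 reports which loops were vectorized, which is a quick way to confirm that a rewrite helped.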
8
STRUCTURE OF HANDS-ON EXERCISE
• All the hands-on exercises work on the Jacobi application
• The application has two versions: poisson2d_reference (referred to as
poisson2d_serial in Task4) and poisson2d
• In order to showcase the impact of an optimization, poisson2d is optimized while
poisson2d_reference is minimally optimized to a baseline level, and the performance
of the two routines is compared
• The application internally measures the time and prints the speedup
• The higher the speedup, the greater the impact of the optimization in focus
• For the hands-on we work with the gcc (9.2.0) and pgi (19.10) compilers
• Solutions are provided in the Solutions/ folder within each of the Task directories

9
TASK1: BASIC COMPILER FLAGS
• Here poisson2d_reference.c is optimized at the -O3 level
• The user needs to optimize poisson2d.c at the -Ofast level
• Build and run the application poisson2d
• What speedup do you observe, and why?
• You can generate a perf profile using perf record -e cycles ./poisson2d
• Running perf report will show you the top routines, and you can compare the
performance of poisson2d_reference and poisson2d to get an idea
10
TASK2: SW PREFETCHING
• Now that we have seen that -Ofast improves performance beyond -O3, let's optimize
poisson2d_reference at -Ofast and see if we can improve further
• The user needs to optimize poisson2d with the SW prefetching flag
• Build and run the application
• What speedup do you observe?
• Verify whether SW prefetching instructions have been added
• Grep for dcbt in the objdump output
11
TASK3: OPENMP PARALLELIZATION
• The Jacobi application is highly parallel
• We can parallelize it using OpenMP pragmas and measure the speedup
• The source file has OpenMP pragmas in comments
• Uncomment them, build with the OpenMP option -fopenmp, and link with -lgomp
• Run with multiple threads and note the speedup
• OMP_NUM_THREADS=4 ./poisson2d
• OMP_NUM_THREADS=16 ./poisson2d
• OMP_NUM_THREADS=32 ./poisson2d
• OMP_NUM_THREADS=64 ./poisson2d
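For orientation, a parallelized Jacobi sweep might look like the sketch below. It is a simplified stand-in rather than the actual poisson2d.c: the function name, the row-major layout, and the omission of the right-hand-side term are all assumptions made for brevity. Grids are assumed to be at least 3x3.

```c
/* One simplified Jacobi sweep over the interior of an ny-by-nx row-major
 * grid. Each interior point of `out` becomes the average of its four
 * neighbors in `in`. Returns the largest change made to any point.
 * Without -fopenmp the pragma is ignored and the loop runs serially. */
double jacobi_sweep(int ny, int nx, const double *in, double *out)
{
    double err = 0.0;

    /* Rows are independent, so they can be split across threads; the
     * max reduction combines the per-thread error values. */
    #pragma omp parallel for reduction(max:err)
    for (int y = 1; y < ny - 1; y++) {
        for (int x = 1; x < nx - 1; x++) {
            double v = 0.25 * (in[y * nx + x - 1] + in[y * nx + x + 1] +
                               in[(y - 1) * nx + x] + in[(y + 1) * nx + x]);
            double d = v - in[y * nx + x];
            if (d < 0.0)
                d = -d;
            if (d > err)
                err = d;
            out[y * nx + x] = v;
        }
    }
    return err;
}
```

The max reduction requires OpenMP 3.1 or later, which the GCC version used in the hands-on supports.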
12
TASK3.1: OPENMP PARALLELIZATION
• Running lscpu you will see Thread(s) per core: 4
• You will see the setting as SMT=4 on the system; you can verify this by running
ppc64_cpu --smt on the command line
• Run cat /proc/cpuinfo to determine the total number of threads and cores in the system
• Obtain the thread sibling list of CPU0, CPU1, etc. For example, reading the file
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list gives 0-3
• Referring to the sibling list, set n1, ..., n4 to threads in the same core and run, for example:
• $(SC19_SUBMIT_CMD) time OMP_PLACES="{0},{1},{2},{3}"
OMP_NUM_THREADS=4 ./poisson2d 1000 1000 1000
• Set n1, ..., n4 to threads in different cores and run, for example:
• $(SC19_SUBMIT_CMD) time OMP_PLACES="{0},{5},{9},{13}"
OMP_NUM_THREADS=4 ./poisson2d 1000 1000 1000
• Compare the speedups; which one is higher?

13
TASK3.2: IMPACT OF BINDING
• Running lscpu you will see Thread(s) per core: 4
• You will see the setting as SMT=4 on the system; you can verify this by running
ppc64_cpu --smt on the command line
• Run cat /proc/cpuinfo to determine the total number of threads and cores in the system
• Obtain the thread sibling list of CPU0, CPU1, etc. For example, reading the file
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list gives 0-3
• Referring to the sibling list, set n1, ..., n4 to threads in the same core and run, for example:
• $(SC19_SUBMIT_CMD) time OMP_PLACES="{0},{1},{2},{3}"
OMP_NUM_THREADS=4 ./poisson2d 1000 1000 1000
• Set n1, ..., n4 to threads in different cores and run, for example:
• $(SC19_SUBMIT_CMD) time OMP_PLACES="{0},{5},{9},{13}"
OMP_NUM_THREADS=4 ./poisson2d 1000 1000 1000
• Compare the speedups; which one is higher?
14
TASK4: ACCELERATE USING GPUS
• You can attempt this after the lecture on GPUs
• The Jacobi application contains a large set of parallelizable loops
• poisson2d.c contains commented-out OpenACC pragmas, which should be
uncommented, built with appropriate flags, and run on an accelerated platform
• #pragma acc parallel loop
• If you want to refer to the solution, see poisson2d.solution.c
• You can compare the speedup by running poisson2d without the pragmas and
running poisson2d.solution
• For more information you can refer to the Makefile
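To show where a #pragma acc parallel loop directive goes, here is a small sketch of an annotated loop nest. It is not taken from poisson2d.c: the function name and the explicit data clauses are assumptions (with managed memory, as in the course Makefile's -ta=tesla:cc70,managed, the data clauses may be unnecessary). A compiler without OpenACC support ignores the pragma and runs the loops on the CPU.

```c
/* Copy the interior of an ny-by-nx row-major grid from src to dst.
 * collapse(2) treats the two loops as one iteration space to parallelize;
 * copyin/copy describe the array sections moved to and from the device. */
void copy_interior(int ny, int nx, const double *restrict src,
                   double *restrict dst)
{
    #pragma acc parallel loop collapse(2) \
        copyin(src[0:ny * nx]) copy(dst[0:ny * nx])
    for (int y = 1; y < ny - 1; y++) {
        for (int x = 1; x < nx - 1; x++) {
            dst[y * nx + x] = src[y * nx + x];
        }
    }
}
```

Building with -Minfo=accel, as the hands-on Makefile does, makes the PGI compiler report how each annotated loop was mapped to the GPU.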
15
TASK1: BASIC COMPILER FLAGS- SOLUTION
– This hands-on exercise illustrates the impact of the -Ofast flag
– -Ofast enables the -ffast-math option, which implements math functions in a way
that does not guarantee IEEE/ISO rules or specifications and avoids the
overhead of calling a function from the math library
– If you look at the perf profile, you will observe that poisson2d_reference makes a call to
fmax
– In contrast, poisson2d.c::main() of poisson2d uses native instructions such as
xvmax because it is optimized at -Ofast
16
TASK2: SW PREFETCHING- SOLUTION
– Compiling with a prefetch flag enables the compiler to analyze the code and insert __dcbt and __dcbtst
instructions where it is beneficial
– __dcbt and __dcbtst instructions prefetch memory values into L3; __dcbt is for loads and __dcbtst is for stores
– POWER9 has prefetching enabled at both the HW and SW levels
– At the HW level, prefetching is "ON" by default
– At the SW level, you can request the compiler to insert prefetch
instructions; however, the compiler can choose to ignore the
request if it determines that it is not beneficial
– You will find that the compiler generates prefetch instructions when the application is compiled at the -Ofast level
but not when it is compiled at the -O3 level
– That is because in the -O3 binary the time is dominated by the fmax call, which leads the compiler to
conclude that whatever benefit SW prefetch provides will be overshadowed by the penalty of fmax
– GCC may add further loop optimizations such as unrolling when -fprefetch-loop-arrays is specified
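Prefetches can also be requested by hand with the GCC/LLVM built-in __builtin_prefetch. The sketch below assumes a hypothetical prefetch distance of 64 elements; a suitable distance depends on the loop body and the memory latency and has to be measured. The function name is illustrative as well.

```c
#include <stddef.h>

/* Hypothetical prefetch distance in elements; tune by measurement. */
enum { PREFETCH_AHEAD = 64 };

/* Sum an array while requesting data a fixed distance ahead of the
 * current position. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Guard keeps the address expression inside the array.
         * Second argument 0 = prefetch for reading (1 would be for writing);
         * third argument 3 = keep the data in cache if possible. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

For a simple streaming loop like this one the hardware prefetcher usually does well already, so manual prefetching tends to pay off mainly for access patterns the hardware cannot predict.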

17
TASK3.1: OPENMP PARALLELIZATION- SOLUTION
• Running the OpenMP parallel version, you will see the speedup grow as OMP_NUM_THREADS increases
• [student02@gorgon Task3]$ OMP_NUM_THREADS=1 ./poisson2d
• 1000x1000: Ref: 2.3467 s, This: 2.5508 s, speedup: 0.92
• [student02@gorgon Task3]$ OMP_NUM_THREADS=4 ./poisson2d
• 1000x1000: Ref: 2.3309 s, This: 0.6394 s, speedup: 3.65
• [student02@gorgon Task3]$ OMP_NUM_THREADS=16 ./poisson2d
• 1000x1000: Ref: 2.3309 s, This: 0.6394 s, speedup: 4.18
• Likewise, if you bind threads to different cores you will see a greater speedup than when binding them to the same core
• [student02@gorgon Task3]$ OMP_PLACES="{0},{1},{2},{3}" OMP_NUM_THREADS=4 ./poisson2d
• 1000x1000: Ref: 2.3490 s, This: 1.9622 s, speedup: 1.20
• [student02@gorgon Task3]$ OMP_PLACES="{0},{5},{10},{15}" OMP_NUM_THREADS=4 ./poisson2d
• 1000x1000: Ref: 2.3694 s, This: 0.6735 s, speedup: 3.52
18
TASK4: ACCELERATE USING GPUS- SOLUTION
• Building and running poisson2d as is, you will see no speedup
• [student02@gorgon Task4]$ make poisson2d
• /opt/pgi/linuxpower/19.10/bin/pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d_serial.c -o poisson2d_serial.o
• /opt/pgi/linuxpower/19.10/bin/pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d.c poisson2d_serial.o -o poisson2d
• [student02@gorgon Task4]$ ./poisson2d
• ….
• 2048x2048: 1 CPU: 5.0743 s, 1 GPU: 4.9631 s, speedup: 1.02
• If you build poisson2d.solution, which is the same as poisson2d.c with the OpenACC pragmas enabled, and run it on a platform that
pushes the parallel portions to the GPU, you will see a massive speedup
• [student02@gorgon Task4]$ make poisson2d.solution
• /opt/pgi/linuxpower/19.10/bin/pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution
• [student02@gorgon Task4]$ ./poisson2d.solution
• 2048x2048: 1 CPU: 5.0941 s, 1 GPU: 0.1811 s, speedup: 28.13
19
SUMMARY
• Today we talked about
• Tuning strategies pertaining to the various units in the POWER9 HW:
• Front end, back end
• Some of these strategies were compiler flags and source code pragmas that
one can apply to improve the performance of programs
• We also saw additional ways of improving performance, such as parallelization
and binding
• Hopefully the associated hands-on exercises gave you practical experience
in applying these concepts to optimize an application
Disclaimer: This presentation is intended to represent the views of the author rather than IBM, and the recommended solutions are not guaranteed
under suboptimal conditions
20
BACKUP

[Backup slides 21-24: figure content not recovered. Surviving label: 4 32-bit words, 8 half-words, 16 bytes]
25
| Flag kind | XL | GCC/LLVM | Can be simulated in source | Benefit | Drawbacks |
|---|---|---|---|---|---|
| Unrolling | -qunroll | -funroll-loops | #pragma unroll(N) | Unrolls loops; increases scheduling opportunities for the compiler | Increases register pressure |
| Inlining | -qinline=auto:level=N | -finline-functions | Inline always attribute or manual inlining | Increases opportunities for scheduling; reduces branches and loads/stores | Increases register pressure; increases code size |
| Enum small | -qenum=small | -fshort-enums | Manual typedef | Reduces memory footprint | Can cause alignment issues |
| isel instructions | | -misel | Using the ?: operator | Generates an isel instruction instead of a branch; reduces pressure on the branch predictor unit | Latency of isel is a bit higher; use if branches are not easily predictable |
| General tuning | -qarch=pwr9, -qtune=pwr9 | -mcpu=power8, -mtune=power9 | | Turns on platform-specific tuning | |
| 64-bit compilation | -q64 | -m64 | | | |
| Prefetching | -qprefetch[=aggressive] | -fprefetch-loop-arrays | __dcbt/__dcbtst, __builtin_prefetch | Reduces cache misses | Can increase memory traffic, particularly if prefetched values are not used |
| Link-time optimization | -qipo | -flto, -flto=thin | | Enables interprocedural optimizations | Can increase overall compilation time |
| Profile directed | | -fprofile-generate and -fprofile-use; LLVM has an intermediate step | | | |

Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
 
Lecture6
Lecture6Lecture6
Lecture6
 
Open Dayligth usando SDN-NFV
Open Dayligth usando SDN-NFVOpen Dayligth usando SDN-NFV
Open Dayligth usando SDN-NFV
 
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMP
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMPAlgoritmi e Calcolo Parallelo 2012/2013 - OpenMP
Algoritmi e Calcolo Parallelo 2012/2013 - OpenMP
 
Lecture5
Lecture5Lecture5
Lecture5
 
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
Fault Tolerance in Spark: Lessons Learned from Production: Spark Summit East ...
 
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
 
Apache Big Data Europe 2016
Apache Big Data Europe 2016Apache Big Data Europe 2016
Apache Big Data Europe 2016
 
MPI n OpenMP
MPI n OpenMPMPI n OpenMP
MPI n OpenMP
 
Kernel Recipes 2019 - BPF at Facebook
Kernel Recipes 2019 - BPF at FacebookKernel Recipes 2019 - BPF at Facebook
Kernel Recipes 2019 - BPF at Facebook
 
openmp.New.intro-unc.edu.ppt
openmp.New.intro-unc.edu.pptopenmp.New.intro-unc.edu.ppt
openmp.New.intro-unc.edu.ppt
 
IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache Spark
 
python_development.pptx
python_development.pptxpython_development.pptx
python_development.pptx
 
Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI Supercomputer
 
Apache Spark Performance Observations
Apache Spark Performance ObservationsApache Spark Performance Observations
Apache Spark Performance Observations
 
Docker Swarm secrets for creating great FIWARE platforms
Docker Swarm secrets for creating great FIWARE platformsDocker Swarm secrets for creating great FIWARE platforms
Docker Swarm secrets for creating great FIWARE platforms
 
The role of the cpu in the operation
The role of the cpu in the operationThe role of the cpu in the operation
The role of the cpu in the operation
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 

More from Ganesan Narayanasamy

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
Ganesan Narayanasamy
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
Ganesan Narayanasamy
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
Ganesan Narayanasamy
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
Ganesan Narayanasamy
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
Ganesan Narayanasamy
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
Ganesan Narayanasamy
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
Ganesan Narayanasamy
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
Ganesan Narayanasamy
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
Ganesan Narayanasamy
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
Ganesan Narayanasamy
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
Ganesan Narayanasamy
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
Ganesan Narayanasamy
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
Ganesan Narayanasamy
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
Ganesan Narayanasamy
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
Ganesan Narayanasamy
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
Ganesan Narayanasamy
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
Ganesan Narayanasamy
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
Ganesan Narayanasamy
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
Ganesan Narayanasamy
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
Ganesan Narayanasamy
 

More from Ganesan Narayanasamy (20)

Chip Design Curriculum development Residency program
Chip Design Curriculum development Residency programChip Design Curriculum development Residency program
Chip Design Curriculum development Residency program
 
Basics of Digital Design and Verilog
Basics of Digital Design and VerilogBasics of Digital Design and Verilog
Basics of Digital Design and Verilog
 
180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA180 nm Tape out experience using Open POWER ISA
180 nm Tape out experience using Open POWER ISA
 
Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture Workload Transformation and Innovations in POWER Architecture
Workload Transformation and Innovations in POWER Architecture
 
OpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT RoorkeeOpenPOWER Workshop at IIT Roorkee
OpenPOWER Workshop at IIT Roorkee
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
 
IBM BOA for POWER
IBM BOA for POWER IBM BOA for POWER
IBM BOA for POWER
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
OpenPOWER Latest Updates
OpenPOWER Latest UpdatesOpenPOWER Latest Updates
OpenPOWER Latest Updates
 
POWER10 innovations for HPC
POWER10 innovations for HPCPOWER10 innovations for HPC
POWER10 innovations for HPC
 
Deeplearningusingcloudpakfordata
DeeplearningusingcloudpakfordataDeeplearningusingcloudpakfordata
Deeplearningusingcloudpakfordata
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systemsAI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems
 
AI in healthcare - Use Cases
AI in healthcare - Use Cases AI in healthcare - Use Cases
AI in healthcare - Use Cases
 
AI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systemsAI in Health Care using IBM Systems/OpenPOWER systems
AI in Health Care using IBM Systems/OpenPOWER systems
 
AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems AI in Healh Care using IBM POWER systems
AI in Healh Care using IBM POWER systems
 
Poster from NUS
Poster from NUSPoster from NUS
Poster from NUS
 
SAP HANA on POWER9 systems
SAP HANA on POWER9 systemsSAP HANA on POWER9 systems
SAP HANA on POWER9 systems
 
Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9Graphical Structure Learning accelerated with POWER9
Graphical Structure Learning accelerated with POWER9
 
AI in the enterprise
AI in the enterprise AI in the enterprise
AI in the enterprise
 

Recently uploaded

Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
Safe Software
 
WPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide DeckWPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide Deck
Lidia A.
 
The Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive ComputingThe Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive Computing
Larry Smarr
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
SynapseIndia
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Erasmo Purificato
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
BookNet Canada
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
jackson110191
 
What's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptxWhat's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptx
Stephanie Beckett
 
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Mydbops
 
Best Programming Language for Civil Engineers
Best Programming Language for Civil EngineersBest Programming Language for Civil Engineers
Best Programming Language for Civil Engineers
Awais Yaseen
 
What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024
Stephanie Beckett
 
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Bert Blevins
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
ishalveerrandhawa1
 
Manual | Product | Research Presentation
Manual | Product | Research PresentationManual | Product | Research Presentation
Manual | Product | Research Presentation
welrejdoall
 
Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
Eric D. Schabell
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
SynapseIndia
 
Password Rotation in 2024 is still Relevant
Password Rotation in 2024 is still RelevantPassword Rotation in 2024 is still Relevant
Password Rotation in 2024 is still Relevant
Bert Blevins
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
Larry Smarr
 

Recently uploaded (20)

Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
 
WPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide DeckWPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide Deck
 
The Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive ComputingThe Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive Computing
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
 
What's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptxWhat's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptx
 
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
 
Best Programming Language for Civil Engineers
Best Programming Language for Civil EngineersBest Programming Language for Civil Engineers
Best Programming Language for Civil Engineers
 
What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024
 
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
Manual | Product | Research Presentation
Manual | Product | Research PresentationManual | Product | Research Presentation
Manual | Product | Research Presentation
 
Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
 
Password Rotation in 2024 is still Relevant
Password Rotation in 2024 is still RelevantPassword Rotation in 2024 is still Relevant
Password Rotation in 2024 is still Relevant
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
 

OpenPOWER Application Optimization

  • 2. SCOPE OF THE PRESENTATION
    • Outline tuning strategies to improve the performance of programs on POWER9 processors
    • Performance bottlenecks can arise in the processor front end and back end
    • Let's discuss some of the bottlenecks and how we can work around them using compiler flags and source code pragmas/attributes
    • This talk refers to compiler options supported by open source compilers such as GCC. The latest publicly available version is 9.2.0, which is what we will use for the hands-on. Most of it carries over to LLVM as is. A slight variation works with IBM proprietary compilers such as XL
  • 3. POWER9 PROCESSOR
    • Optimized for stronger thread performance and efficiency
    • Increased execution bandwidth efficiency for a range of workloads including commercial, cognitive and analytics
    • Sophisticated instruction scheduling and branch prediction for unoptimized applications and interpretive languages
    IBM Systems / version 1.0 / November, 2019 / © 2018 IBM Corporation
  • 4. POWER9 PROCESSOR (continued)
    • Shorter pipelines with reduced disruption
    • Improved application performance for modern codes
    • Higher performance and pipeline utilization
    • Removed instruction grouping
    • Enhanced instruction fusion
    • Pipeline can complete up to 128 (64-SMT4) instructions/cycle
    • Reduced latency and improved scalability
    • Improved pipe control of load/store instructions
    • Improved hazard avoidance
  • 5. FORMAT OF TODAY'S DISCUSSION
    Brief presentation on optimization strategies, followed by hands-on exercises.
    Initial steps:
    > ssh -l student<n> orthus.nic.uoregon.edu
    > ssh gorgon
    Once you have a home directory, make a directory with your name within /home/student<n>:
    > mkdir /home/student<n>/<yourname>
    Copy the following files into it:
    > cp -rf /home/users/gansys/archana/Handson .
    You will see the following directories within Handson/:
    Task1/ Task2/ Task3/ Task4/
    During the course of the presentation we will discuss the exercises inline and you can try them on the machine.
  • 6. PERFORMANCE TUNING IN THE FRONT-END
    • The front end fetches and decodes successive instructions and passes them to the back end for processing
    • POWER9 is a superscalar, pipelined processor, so it relies on an advanced branch predictor to predict the instruction sequence and fetch instructions in advance
    • We have call branches and loop branches
    • Typically we use the following strategies to work around bottlenecks seen around branches:
      • Unrolling and inlining, using pragmas/attributes or manually in source (if the compiler does not do so automatically)
      • Converting control dependence to data dependence using ?: and compiling with -misel for difficult-to-predict branches
      • Dropping hints using __builtin_expect(var, value) to simplify the compiler's scheduling
      • Indirect call promotion to promote more inlining
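The ?: and __builtin_expect strategies listed above can be sketched in C as follows. This is a minimal illustration and not code from the hands-on material: the names clamp_nonneg, sum_nonneg and checked_div are hypothetical, and whether GCC actually emits an isel instruction for the conditional expression depends on the target flags (for example -misel) and the optimization level.

```c
#include <stddef.h>

/* Data dependence instead of control dependence: the conditional
 * expression gives the compiler the option of emitting a select
 * (isel on POWER) rather than a hard-to-predict branch. */
static inline int clamp_nonneg(int x)
{
    return x < 0 ? 0 : x;
}

/* Sums the elements of v, treating negative entries as zero. */
long sum_nonneg(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += clamp_nonneg(v[i]);
    return s;
}

/* __builtin_expect(expr, 0) tells GCC that the error path is unlikely,
 * so the common path can be laid out as the fall-through case.
 * Returns 0 on success and -1 when b is zero. */
int checked_div(int a, int b, int *out)
{
    if (__builtin_expect(b == 0, 0))
        return -1;
    *out = a / b;
    return 0;
}
```

The hint only pays off when the expectation matches the real behaviour of the program, so it is best reserved for branches such as error checks whose direction is known in advance.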
  • 7. PERFORMANCE TUNING IN THE BACK-END
    • The back end is concerned with executing the instructions that were fetched and dispatched to the appropriate units
    • The compiler's scheduling pass automatically keeps dependent instructions far from each other
    • Tuning back-end performance involves optimal usage of processor resources. We can tune the performance using the following:
      • Registers: using instructions that reduce register usage; vectorization, which reduces pressure on GPRs and ensures more throughput; making loops free of pointers and branches as much as possible to enable more vectorization
      • Caches: data layout optimizations that reduce footprint, such as using -fshort-enums; prefetching, both hardware and software
      • System tuning: parallelization, binding, large pages, optimized libraries
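As a sketch of what a vectorization-friendly loop can look like, the function below has no branches in its body and uses the C99 restrict qualifier to promise the compiler that the two arrays do not overlap. The name axpy is hypothetical and the code is not taken from the Jacobi application; whether the loop is actually vectorized depends on the compiler version and on flags such as -O3 and -mcpu=power9.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
 * restrict asserts that x and y never alias, which removes the
 * pointer-overlap check that would otherwise hinder vectorization. */
void axpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

With GCC, adding -fopt-info-vec to the command line reports which loops were vectorized, which is a quick way to check the effect of a source change like this one.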
  • 8. STRUCTURE OF HANDS-ON EXERCISE
    • All the hands-on exercises work on the Jacobi application
    • The application has two versions: poisson2d_reference (referred to as poisson2d_serial in Task4) and poisson2d
    • In order to showcase the impact of an optimization, poisson2d is optimized while poisson2d_reference is only minimally optimized to a baseline level, and the performance of the two routines is compared
    • The application internally measures the time and prints the speedup
    • The higher the speedup, the greater the impact of the optimization in focus
    • For the hands-on we work with the GCC (9.2.0) and PGI (19.10) compilers
    • Solutions are provided in the Solutions/ folder within each of the Task directories
  • 9. TASK1: BASIC COMPILER FLAGS
    • Here poisson2d_reference.c is optimized at the -O3 level
    • The user needs to optimize poisson2d.c at the -Ofast level
    • Build and run the application poisson2d
    • What speedup do you observe, and why?
    • You can generate a perf profile using perf record -e cycles ./poisson2d
    • Running perf report will show you the top routines, and you can compare the performance of poisson2d_reference and poisson2d to get an idea
  • 10. TASK2: SW PREFETCHING
    • Now that we have seen that -Ofast improved performance beyond -O3, let's optimize poisson2d_reference at -Ofast and see if we can improve it further
    • The user needs to optimize poisson2d with the software prefetching flag
    • Build and run the application
    • What speedup do you observe?
    • Verify whether software prefetching instructions have been added
    • Grep for dcbt in the objdump output
  • 11. TASK3: OPENMP PARALLELIZATION
    • The Jacobi application is highly parallel
    • We can parallelize it using OpenMP pragmas and measure the speedup
    • The source file has OpenMP pragmas in comments
    • Uncomment them, build with the OpenMP option -fopenmp and link with -lgomp
    • Run with multiple threads and note the speedup:
      OMP_NUM_THREADS=4 ./poisson2d
      OMP_NUM_THREADS=16 ./poisson2d
      OMP_NUM_THREADS=32 ./poisson2d
      OMP_NUM_THREADS=64 ./poisson2d
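For readers who do not have the hands-on files, the kind of loop the OpenMP pragmas apply to can be sketched as below. This is a simplified Jacobi-style sweep under stated assumptions, not the actual poisson2d.c source: the function name jacobi_sweep, the row-major layout and the omission of a right-hand-side term are all choices made for illustration.

```c
#include <stddef.h>

/* One sweep over the interior of an ny x nx row-major grid.
 * Each interior point of anew becomes the average of its four
 * neighbours in a. Returns the largest absolute change, which a
 * caller can use as a convergence measure. */
double jacobi_sweep(size_t ny, size_t nx, const double *a, double *anew)
{
    double err = 0.0;

    if (ny < 3 || nx < 3)
        return 0.0;               /* no interior points to update */

    /* Rows are independent, so they can be divided among threads.
     * The max reduction merges the per-thread error values. */
    #pragma omp parallel for reduction(max:err)
    for (size_t iy = 1; iy < ny - 1; iy++) {
        for (size_t ix = 1; ix < nx - 1; ix++) {
            size_t c = iy * nx + ix;
            double v = 0.25 * (a[c - 1] + a[c + 1] + a[c - nx] + a[c + nx]);
            double d = v > a[c] ? v - a[c] : a[c] - v;
            anew[c] = v;
            if (d > err)
                err = d;
        }
    }
    return err;
}
```

When the file is compiled without -fopenmp the pragma is ignored and the function runs serially with the same result, which makes it easy to compare the two builds.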
  • 13. TASK3.2: IMPACT OF BINDING
    • Running lscpu you will see Thread(s) per core: 4
    • This corresponds to the SMT=4 setting on the system; you can verify it by running ppc64_cpu --smt on the command line
    • Run cat /proc/cpuinfo to determine the total number of threads and cores in the system
    • Obtain the thread sibling list of CPU0, CPU1, etc. by reading the file /sys/devices/system/cpu/cpu0/topology/thread_siblings_list, which prints 0-3 for CPU0
    • Referring to the sibling list, set n1, ..., n4 to threads in the same core and run, for example:
      $(SC19_SUBMIT_CMD) time OMP_PLACES="{0},{1},{2},{3}" OMP_NUM_THREADS=4 ./poisson2d 1000 1000 1000
    • Set n1, ..., n4 to threads in different cores and run, for example:
      $(SC19_SUBMIT_CMD) time OMP_PLACES="{0},{5},{9},{13}" OMP_NUM_THREADS=4 ./poisson2d 1000 1000 1000
    • Compare the speedups. Which one is higher?
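The place lists used for binding follow a simple pattern, which the hypothetical helper below makes explicit. It is only a sketch: build_places is not part of OpenMP or of the hands-on material, and it assumes that logical CPU ids are numbered consecutively within a core, as the sibling list 0-3 for CPU0 suggests on this machine. A stride of 1 then keeps the threads on one core, while a stride equal to the number of threads per core puts each thread on a different core.

```c
#include <stddef.h>
#include <stdio.h>

/* Writes a place list such as "{0},{4},{8},{12}" into buf, with one
 * single-CPU place per thread and the given stride between CPU ids.
 * Returns 0 on success, or -1 if buf is too small. */
int build_places(char *buf, size_t len, int nthreads, int stride)
{
    size_t used = 0;

    if (len == 0)
        return -1;
    buf[0] = '\0';

    for (int t = 0; t < nthreads; t++) {
        int w = snprintf(buf + used, len - used, "%s{%d}",
                         t ? "," : "", t * stride);
        if (w < 0 || (size_t)w >= len - used)
            return -1;            /* output would have been truncated */
        used += (size_t)w;
    }
    return 0;
}
```

The resulting string is what would be exported as OMP_PLACES before launching poisson2d; the place lists on the slide are hand-written lists of the same shape.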
  • 14. TASK4: ACCELERATE USING GPUS
    • You can attempt this after the lecture on GPUs
    • The Jacobi application contains a large set of parallelizable loops
    • poisson2d.c contains commented-out OpenACC pragmas, which should be uncommented, built with appropriate flags and run on an accelerated platform, for example:
      #pragma acc parallel loop
    • In case you want to refer to the solution, see poisson2d.solution.c
    • You can compare the speedup by running poisson2d without the pragmas and running poisson2d.solution
    • For more information you can refer to the Makefile
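A loop annotated with the pragma named on this slide might look like the sketch below. It is an assumption-laden illustration and not the content of poisson2d.solution.c: the function name jacobi_sweep_acc and the explicit copyin/copy data clauses are choices made here so that the example does not rely on managed memory.

```c
/* One Jacobi-style sweep over the interior of an ny x nx row-major
 * grid, offloaded with OpenACC when the compiler supports it.
 * copyin sends a to the device; copy moves anew both ways so that
 * the boundary values the loop never writes are preserved. */
void jacobi_sweep_acc(int ny, int nx,
                      const double *restrict a, double *restrict anew)
{
    if (ny < 3 || nx < 3)
        return;                   /* no interior points to update */

    #pragma acc parallel loop copyin(a[0:ny*nx]) copy(anew[0:ny*nx])
    for (int iy = 1; iy < ny - 1; iy++) {
        #pragma acc loop
        for (int ix = 1; ix < nx - 1; ix++) {
            int c = iy * nx + ix;
            anew[c] = 0.25 * (a[c - 1] + a[c + 1] + a[c - nx] + a[c + nx]);
        }
    }
}
```

A compiler without OpenACC support ignores the pragmas and runs the loops on the CPU, so the same source serves as both the baseline and the accelerated version.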
  • 15. TASK1: BASIC COMPILER FLAGS - SOLUTION
    • This hands-on exercise illustrates the impact of the -Ofast flag
    • -Ofast enables the -ffast-math option, which lets the compiler implement math functions without guaranteeing strict IEEE/ISO rules and avoids the overhead of calling a function from the math library
    • If you look at the perf profile, you will observe that poisson2d_reference makes a call to fmax
    • In contrast, main() in poisson2d.c generates native instructions such as xvmax, because it is optimized at -Ofast
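To make the difference concrete, the sketch below writes out by hand the kind of relaxed maximum that fast-math rules permit. The names max_relaxed and max_abs_diff are hypothetical and the code is not from poisson2d.c. The standard fmax has to return the other operand when one argument is a NaN, whereas a plain comparison makes no such promise, which is why it can map directly onto a hardware maximum or select.

```c
#include <stddef.h>

/* Maximum without the NaN handling that fmax guarantees:
 * if either argument is a NaN, this simply returns b. */
static inline double max_relaxed(double a, double b)
{
    return a > b ? a : b;
}

/* Largest absolute difference between two arrays of length n,
 * the same shape of computation as a Jacobi error check. */
double max_abs_diff(const double *a, const double *b, size_t n)
{
    double err = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = a[i] - b[i];
        d = d < 0.0 ? -d : d;
        err = max_relaxed(err, d);
    }
    return err;
}
```

Writing the maximum this way is only safe when the data is known to be free of NaNs, which is the same assumption -ffast-math makes on the programmer's behalf.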
  • 16. TASK2: SW PREFETCHING - SOLUTION
    • Compiling with a prefetch flag enables the compiler to analyze the code and insert dcbt and dcbtst instructions (__dcbt and __dcbtst when written as source-level built-ins) if it is beneficial
    • The dcbt and dcbtst instructions prefetch memory values into L3; dcbt is for loads and dcbtst is for stores
    • POWER9 has prefetching enabled at both the HW and SW levels
    • At the HW level, prefetching is "ON" by default
    • At the SW level, you can request the compiler to insert prefetch instructions; however, the compiler can choose to ignore the request if it determines that it is not beneficial to do so
    • You will find that the compiler generates prefetch instructions when the application is compiled at the -Ofast level but not when it is compiled at the -O3 level
    • That is because in the -O3 binary the time is dominated by the fmax call, which leads the compiler to conclude that whatever benefit we obtain by adding SW prefetch will be overshadowed by the penalty of fmax
    • GCC may add further loop optimizations such as unrolling when -fprefetch-loop-arrays is used
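Software prefetching can also be requested by hand with GCC's __builtin_prefetch, as in the minimal sketch below. The function name sum_with_prefetch and the PREFETCH_DISTANCE value are assumptions made for illustration, and the right distance has to be found by measurement. On POWER targets the built-in is expected to lower to a dcbt-style instruction, which can be checked by searching the objdump output in the same way as in the exercise.

```c
#include <stddef.h>

/* Hypothetical tunable: how many elements ahead of the current
 * index to prefetch. */
#define PREFETCH_DISTANCE 16

/* Sums n doubles while hinting the cache about upcoming reads.
 * Arguments to __builtin_prefetch: address, 0 = read access,
 * 3 = keep in cache (high temporal locality). */
double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        s += a[i];
    }
    return s;
}
```

For a simple streaming loop like this one the hardware prefetcher usually does the job already, so a manual hint is more likely to help with irregular or strided access patterns.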
  • 17. TASK3.1: OPENMP PARALLELIZATION
    • Running the OpenMP parallel version you will see the speedup grow with increasing OMP_NUM_THREADS:
      [student02@gorgon Task3]$ OMP_NUM_THREADS=1 ./poisson2d
      1000x1000: Ref: 2.3467 s, This: 2.5508 s, speedup: 0.92
      [student02@gorgon Task3]$ OMP_NUM_THREADS=4 ./poisson2d
      1000x1000: Ref: 2.3309 s, This: 0.6394 s, speedup: 3.65
      [student02@gorgon Task3]$ OMP_NUM_THREADS=16 ./poisson2d
      1000x1000: Ref: 2.3309 s, This: 0.6394 s, speedup: 4.18
    • Likewise, if you bind threads across different cores you will see a greater speedup than when binding them to the same core:
      [student02@gorgon Task3]$ OMP_PLACES="{0},{1},{2},{3}" OMP_NUM_THREADS=4 ./poisson2d
      1000x1000: Ref: 2.3490 s, This: 1.9622 s, speedup: 1.20
      [student02@gorgon Task3]$ OMP_PLACES="{0},{5},{10},{15}" OMP_NUM_THREADS=4 ./poisson2d
      1000x1000: Ref: 2.3694 s, This: 0.6735 s, speedup: 3.52
  • 18. TASK4: ACCELERATE USING GPUS
    • Building and running poisson2d as it is, you will see no speedup:
      [student02@gorgon Task4]$ make poisson2d
      /opt/pgi/linuxpower/19.10/bin/pgcc -c -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d_serial.c -o poisson2d_serial.o
      /opt/pgi/linuxpower/19.10/bin/pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d.c poisson2d_serial.o -o poisson2d
      [student02@gorgon Task4]$ ./poisson2d
      ….
      2048x2048: 1 CPU: 5.0743 s, 1 GPU: 4.9631 s, speedup: 1.02
    • If you build poisson2d.solution, which is the same as poisson2d.c with the OpenACC pragmas enabled, and run it on the platform, the parallel portions are pushed to the GPU and you will see a massive speedup:
      [student02@gorgon Task4]$ make poisson2d.solution
      /opt/pgi/linuxpower/19.10/bin/pgcc -DUSE_DOUBLE -Minfo=accel -fast -acc -ta=tesla:cc70,managed poisson2d.solution.c poisson2d_serial.o -o poisson2d.solution
      [student02@gorgon Task4]$ ./poisson2d.solution
      2048x2048: 1 CPU: 5.0941 s, 1 GPU: 0.1811 s, speedup: 28.13
  • 19. 19 SUMMARY • Today we talked about • Tuning strategies for the various units in the POWER9 HW: front end and back end • Some of these strategies were compiler flags and source code pragmas that you can apply to improve the performance of your programs • We also saw additional ways of improving performance, such as parallelization and binding • Hopefully the associated hands-on exercises gave you practical experience in applying these concepts to optimize an application Disclaimer: This presentation is intended to represent the views of the author rather than IBM, and the recommended solutions are not guaranteed under suboptimal conditions
  • 24. 24 [Figure: 128 bits shown as 4 32-bit words, 8 half-words, or 16 bytes]
  • 25. 25
  | Flag kind | XL | GCC/LLVM | Can be simulated in source | Benefit | Drawbacks |
  |---|---|---|---|---|---|
  | Unrolling | -qunroll | -funroll-loops | #pragma unroll(N) | Unrolls loops; increases scheduling opportunities for the compiler | Increases register pressure |
  | Inlining | -qinline=auto:level=N | -finline-functions | Inline always attribute or manual inlining | Increases opportunities for scheduling; reduces branches and loads/stores | Increases register pressure; increases code size |
  | Enum small | -qenum=small | -fshort-enums | Manual typedef | Reduces memory footprint | Can cause alignment issues |
  | isel instructions | | -misel | Using ?: operator | Generates isel instruction instead of branch; reduces pressure on branch predictor unit | Latency of isel is a bit higher; use if branches are not easily predictable |
  | General tuning | -qarch=pwr9, -qtune=pwr9 | -mcpu=power8, -mtune=power9 | | Turns on platform-specific tuning | |
  | 64-bit compilation | -q64 | -m64 | | | |
  | Prefetching | -qprefetch[=aggressive] | -fprefetch-loop-arrays | __dcbt/__dcbtst, __builtin_prefetch | Reduces cache misses | Can increase memory traffic, particularly if prefetched values are not used |
  | Link time optimization | -qipo | -flto, -flto=thin | | Enables interprocedural optimizations | Can increase overall compilation time |
  | Profile directed | | -fprofile-generate and -fprofile-use (LLVM has an intermediate step) | | | |
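A few of the "can be simulated in source" entries can be sketched together in C. This is a minimal illustration assuming GCC: the function names are hypothetical, `#pragma GCC unroll 4` is GCC's own spelling of an unroll request (GCC 8 and later) rather than the `#pragma unroll(N)` form listed in the table, and whether an isel instruction is actually emitted for the `?:` expression is up to the compiler and its flags.

```c
/* always_inline asks GCC/Clang to inline regardless of the usual heuristics. */
static inline __attribute__((always_inline)) int clamp_nonneg(int x)
{
    /* A side-effect-free ?: select; a candidate for isel instead of a branch. */
    return x < 0 ? 0 : x;
}

/* Sum the non-negative entries of v. */
int sum_clamped(const int *v, int n)
{
    int sum = 0;
    /* Request an unroll factor of 4 for the loop that follows. */
    #pragma GCC unroll 4
    for (int i = 0; i < n; ++i) {
        sum += clamp_nonneg(v[i]);
    }
    return sum;
}
```

Source-level hints like these apply to one function or loop at a time, whereas the flags in the table change behavior for a whole compilation unit.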