AMD Accelerated Computing - UFRJ

Agenda

X86 PROCESSOR EVOLUTION

THE GPU AS AN ACCELERATOR

ACCELERATED PROCESSING UNITS

INTRODUCTION TO OpenCL

AMD architecture
“Istanbul” six-core diagram

[Diagram: six cores (1 to 6), each with its own L2 cache, share an L3 cache and connect through a crossbar to the memory controller and to HyperTransport; HyperTransport leads to PCI-e and the chipset.]

• Native six-core processor
• Balanced caches
• Lower memory latency
• Fast full-duplex bus

4P/24-core system example
very good scalability

[Diagram: four processors, each with its own directly attached memory.]

• One memory controller for every processor
• Full-duplex HyperTransport links (up to 5.2GHz)
• Bus Optimization: HT Assist (Cache Probe Filtering)
• Still the only available 4P system with Direct Connect Architecture

Direct Connect Architecture 1.0
Balanced and Scalable Design to Support up to 6 Cores

[Diagram: four CPUs, each with 2 memory channels and 8 DIMMs per CPU.]

• No front side bus
• HyperTransport™ technology
• Integrated memory controller
• NUMA memory architecture

Direct Connect Architecture 2.0
Balanced and Scalable Design to Support up to 16 Cores* per CPU

[Diagram: four CPUs, each with 4 memory channels and 12 DIMMs per CPU.]

• 1-hop between processors
• Four memory channels
• Up to 50% more DIMMs
• Up to 33% increase in CPU to CPU communication speed±

What is next for x86 CPUs

• More processor cores to come
(12, 16, 16 double cores)

• More memory channels
(improves memory bandwidth per
core)

• Improved IPC
(8 per cycle is a target)

Top500 list - beyond the petaflop

Datacenters in the
USA will spend more
than $3 billion on
energy in 2009

1997: Garry Kasparov vs. IBM Deep Blue

The World’s Most Powerful GPU


2011 GPU Architecture
AMD Radeon™ HD 6900 Series
Dual graphics engines
New VLIW4 core architecture
Up to 24 SIMD engines
Up to 96 Texture Units
Upgraded render back-ends
• Improved anti-aliasing performance

Fast 256-bit GDDR5 memory interface
• Up to 5.5 Gbps

New GPU compute features

Designing very efficient GPUs
Full load: 180W; Idle: 27W

[Chart: GFLOPS/W and GFLOPS/mm2 per GPU generation, vertical axis from 0 to 16, highest data label 14.47. GPUs on the horizontal axis: ATI Radeon™ X1800 XT (Nov-05), ATI Radeon™ X1900 XTX (Jan-06), ATI Radeon™ HD 2900 PRO (Sep-07), ATI Radeon™ HD 3870 (Nov-07), ATI Radeon™ HD 4870 (Jun-08), ATI Radeon™ HD 5870 (Oct-09). Data labels as extracted, not matched to a series: 0.42, 0.92, 1.06, 1.07, 2.01, 2.21, 2.24, 4.50, 4.56, 7.50, 7.90, 14.47.]

Old and New in High Performance Computing

Old: Power is free, Transistors are expensive
New: Power expensive, Transistors free
(Can put more transistors on chip than can afford to turn on)

Old: Multiplies are slow, Memory access is fast
New: Multiplies fast, Memory slow
(up to 200 clocks to DRAM memory, 4 clocks for FP multiply)

Old: Increasing Instruction Level Parallelism via compiler innovation
New: Explicit thread and data parallelism must be exploited

GPUs: more than just gaming

Processing power – millions of operations per second
Single Core       12
Dual Core         24
Quad Core         48
Hexa Core         72
12 Cores          144
Radeon HD 5970    2700

Both use GPUs

Wii Sports - Golf
Oil exploration platform - 2010


DirectX® 11 Multi-Threading

• Application, DirectX runtime, and DirectX driver can each run in separate threads
• Tasks like loading a texture or compiling a shader can execute in parallel with the main rendering thread

DirectX® 10 DirectX® 11


Today’s GPUs focused on

GAMING

ENTERTAINMENT

PRODUCTIVITY

DirectX® 11 Tessellation

DirectX® 10 DirectX® 11

No Tessellation Tessellation

Images courtesy of Unigine Corp.


Research companies already using

Oil exploration, Weather forecast, Fluid dynamics, Nature simulation


AMD Balanced Platform

CPU is excellent for running some algorithms
• Ideal place to process if GPU is fully loaded
• Great use for additional CPU cores

GPU is ideal for data parallel algorithms like image processing, CAE, etc
• Great use for ATI Stream technology
• Great use for additional GPUs

[Diagram: Serial/Task-Parallel Workloads, Graphics Workloads, Other Highly Parallel Workloads]

Delivers optimal performance for a wide range of platform configurations

ATI Stream Technology is…

Heterogeneous: Developers leverage AMD GPUs and x86
CPUs for optimal application performance and user experience

High performance: Massively parallel, programmable GPU
architecture delivers unprecedented performance and power
efficiency

Industry Standards: OpenCL™ and DirectCompute 11 enable
cross-platform development

Sciences, Government, Engineering, Gaming, Digital Content Creation, Productivity

Improvements already reached consumers

[Chart: processor utilization on a 0% to 80% axis; the labeled series is ATI Stream.]

Adobe Flash plugin used by Youtube.com
• Better image quality and video smoothness
• Lower processor usage

GPU-accelerated video transcoding

iPod Video
HD Video

Up to 6x faster when using an AMD graphics card

Video Transcoding Sample
No GPU Acceleration
Using four CPU Cores

CPU Usage: 100%
GPU Usage: 1%
Time to finish: 1h 52m
Peak power: 145W
Total energy: 0.23 kWh
Energy Price: $0.15

Video Transcoding Sample
ATI GPU Acceleration
Using hundreds of Stream Processors

(values without GPU acceleration in parentheses)
CPU Usage: 45% (100%)
GPU Usage: 35% (1%)
Time to finish: 26m (1h 52m)
Peak power: 198W (145W)
Total energy: 0.11 kWh (0.23 kWh)
Energy Price: $0.07 ($0.15)

Today

Multi-core CPU: ~800 million transistors
TeraFLOPS-class GPU: up to 2 billion transistors

Multi-tasking
Gaming on multiple monitors
Full HD video and audio

A new era in performance evolution

Single-Core
Challenge: power consumption, complexity

Multi-Core
Challenge: power consumption, software

Heterogeneous computing
Pros: performance, power efficient
Cons: software availability

[Charts: one performance curve per approach, each marked "We are here". Axis labels: single-thread performance over time; performance over time x cores; performance over time. One chart carries a "?".]

A new era in performance evolution

[Diagram labels. CPU: Single-Core, Multi-Core, Core efficiency. GPU: Gaming, Multimedia, Software Acceleration.]

Putting it all together – The Future is Fusion

[Diagram sequence over three slides. First, the AMD “Istanbul” six-core processor (six cores with L2 caches, L3 cache, crossbar, memory controller, HyperTransport, PCI-e, chipset) next to the RV500 GPU core (2006), drawn as ring stops and client interfaces around a memory controller. Second, the same processor next to the RV700 GPU core (2008-2009). Third, the processor and the RV700 GPU core each shown around a crossbar.]

2011: welcome to the APU era!

CPU APU GPU

“Supercomputing power in a notebook platform whose
battery lasts for a full day”

One Design, Fewer Watts, Massive Capability

“Zacate”
Northbridge + Dual-Core CPU + Discrete-level AMD DirectX® 11 GPU = Fusion APU

[Diagram: four chips, labeled in the order listed on the slide: 66 sq. mm / 13 watts, 117 sq. mm / 25 watts, 59 sq. mm / 8 watts, 75 sq. mm / 18 watts.]

Graphics and Media Processing Efficiency Improvements

2010 IGP-based Platform
[Diagram: a CPU chip (CPU cores, UNB / MC) linked at ~17 GB/sec to DDR3 DIMM memory, and a separate chip holding the GPU, UVD and SB functions, reached over a ~7 GB/sec link, plus PCIe.]
• Graphics requires memory bandwidth to bring full capabilities to life
• Bandwidth pinch points and latency hold back the GPU capabilities

2011 APU-based Platform
[Diagram: a single APU chip (CPU cores, UNB, MC, GPU, UVD) linked at ~17 GB/sec to DDR3 DIMM memory, with ~27 GB/sec paths inside the chip, plus PCIe.]
• 3X bandwidth between GPU and memory
• Even the same sized GPU is substantially more effective in this configuration
• Eliminate latency and power associated with the extra chip crossing
• Substantially smaller physical footprint

“Ontario” & “Zacate” Architecture
APU
>2 x86 CPU Cores (40nm “Bobcat” core – 1 MB
L2, 64-bit FPU)
>C6 and power gating
>Array of SIMD Engines
• DX11 graphics performance
• Industry leading 3D and graphics processing
>3rd Generation Unified Video Decoder
>H.264, VC1, DivX/Xvid formats
>DDR3 800-1066, 2 DIMMs, 64 bit channel
>BGA package

Display and I/O
>Two dedicated digital display interfaces
• Configurable externally as HDMI, DVI, and/or
Display Port
• Also supports a single link LVDS for internal
panels
>Integrated VGA
>5x8 PCIe®
> “Hudson” Fusion Controller Hub

ATI Stream SDK:
OpenCL™ For Multicore x86 CPUs and GPUs
http://developer.amd.com/

The Power of Fusion: Developers leverage heterogeneous
architecture to deliver superior user experience
• First complete OpenCL™ development platform
• Certified OpenCL 1.0 compliant by the Khronos Group
• Write code that can scale well on multi-core CPUs and GPUs
• AMD delivers on the promise of OpenCL™, with both high-performance CPU and GPU technologies
• Available for download now as part of ATI Stream SDK beta program – includes documentation, samples, and developer support

OpenCL™: Game-Changing Development
Enabling Broad Adoption of GP-GPU Capabilities

• Industry standard API: Open, multiplatform development platform for heterogeneous architectures
• The power of Fusion: Leverages CPUs and GPUs for balanced system approach
• Broad industry support: Created by architects from AMD, Apple, IBM, Intel, Nvidia, Sony, etc.
• Fast track development: Ratified in December; AMD is the first company to provide a complete OpenCL solution
• Momentum: Enormous interest from mainstream developers and application ISVs

More stream-enabled applications across
all markets

Open Standards:
Maximize Developer Freedom and Addressable Market

Vendor specific: cross-platform limiters
• Apple Display Connector
• 3dfx Glide
• Nvidia CUDA
• Nvidia Cg
• Rambus
• Unified Display Interface

Vendor neutral: cross-platform enablers
• Digital Visual Interface
• OpenCL™
• DirectX®
• Certified DP
• JEDEC
• OpenGL®

Comparing OpenCL™ and DirectX® 11 DirectCompute

How will developers choose between OpenCL™ and DirectX® 11 DirectCompute?
• Feature set is similar in both APIs

DirectX® 11 DirectCompute
• Easiest path to add compute capabilities to existing DirectX applications
• Windows Vista® and Windows® 7 only

OpenCL™
• Ideal path for new applications porting to the GPU for the first time
• True multiplatform: Windows®, Linux®, MacOS
• Natural programming without dealing with a graphics API

Anatomy of OpenCL™

Language Specification
• C-based cross-platform programming interface
• Subset of ISO C99 with language extensions - familiar to developers
• Well-defined numerical accuracy - IEEE 754 rounding behavior with defined maximum error
• Online or offline compilation and build of compute kernel executables
• Includes a rich set of built-in functions

Platform Layer API

• A hardware abstraction layer over diverse computational resources
• Query, select and initialize compute devices
• Create compute contexts and work-queues

Runtime API
• Execute compute kernels
• Manage scheduling, compute, and memory resources

OpenCL Example

Scalar

void square(int n, const float *a, float *result)
{
    int i;
    for (i = 0; i < n; i++)
        result[i] = a[i] * a[i];
}

Data-Parallel

kernel void dp_square(global const float *a, global float *result)
{
    int id = get_global_id(0);  // index of this work-item
    result[id] = a[id] * a[id];
}

// dp_square executes over “n” work-items

Summary

X86 PROCESSOR EVOLUTION

THE GPU AS AN ACCELERATOR

ACCELERATED PROCESSING UNITS

INTRODUCTION TO OpenCL
http://developer.amd.com


Thank you!
roberto.brandao@amd.com
