AMD accelerated computing - UFRJ
Agenda


X86 PROCESSOR EVOLUTION



THE GPU AS AN ACCELERATOR



ACCELERATED PROCESSING UNITS



INTRODUCTION TO OpenCL
Evolving x86 Processors
AMD architecture
“Istanbul” six-core diagram

[Diagram: native six-core processor. Blocks shown: cores 1-6, one L2 cache per core, L3 cache, crossbar, HyperTransport, memory controller, chipset, PCI-e]

Balanced caches
Lower memory latency
Fast full-duplex bus
4P/24-core system example
Very good scalability

[Diagram: four processors, each with its own MEMORY]

One memory controller for every processor
Full-duplex HyperTransport links (up to 5.2GHz)
Bus optimization: HT Assist (cache probe filtering)
Still the only available 4P system with Direct Connect Architecture
Direct Connect Architecture 1.0
Balanced and Scalable Design to Support up to 6 Cores

[Diagram: four CPUs, each with 2 memory channels and 8 DIMMs per CPU]

    No front side bus
    HyperTransport™ technology
    Integrated memory controller
    NUMA memory architecture
Direct Connect Architecture 2.0
Balanced and Scalable Design to Support up to 16 Cores* per CPU

[Diagram: four CPUs, each with 4 memory channels and 12 DIMMs per CPU]

    • 1-hop between processors
    • Four memory channels
    • Up to 50% more DIMMs
    • Up to 33% increase in CPU to CPU communication speed±
What is next for x86 CPUs

• More processor cores to come (12, 16, 16 double cores)
• More memory channels (improves memory bandwidth per core)
• Improved IPC (8 per cycle is a target)
Top500 list - beyond the petaflop




                             Datacenters in the
                            USA will spend more
                             than $3 billion on
                              energy in 2009
1997: Garry Kasparov vs. IBM Deep Blue
The World’s Most Powerful GPU




                    =
2011 GPU Architecture
    AMD Radeon™ HD 6900 Series
Dual graphics engines
New VLIW4 core architecture
Up to 24 SIMD engines
Up to 96 Texture Units
Upgraded render back-ends
    Improved anti-aliasing performance

Fast 256-bit GDDR5 memory interface
    Up to 5.5 Gbps

New GPU compute features
Designing very efficient GPUs
Full load: 180W; Idle: 27W

[Chart: GFLOPS/W and GFLOPS/mm2 by GPU generation]

GPU                        Launch    GFLOPS/W   GFLOPS/mm2
ATI Radeon™ X1800 XT       Nov-05        1.07         0.42
ATI Radeon™ X1900 XTX      Jan-06        2.01         1.06
ATI Radeon™ HD 2900 PRO    Sep-07        2.21         0.92
ATI Radeon™ HD 3870        Nov-07        4.50         2.24
ATI Radeon™ HD 4870        Jun-08        7.50         4.56
ATI Radeon™ HD 5870        Oct-09       14.47         7.90
Old and New in High Performance Computing

Old: Power is free, Transistors are expensive
New: Power expensive, Transistors free
(Can put more transistors on chip than can afford to turn on)


Old: Multiplies are slow, Memory access is fast
New: Multiplies fast, Memory slow
(up to 200 clocks to DRAM memory, 4 clocks for FP multiply)


Old: Increasing Instruction Level Parallelism via compiler innovation
New: Explicit thread and data parallelism must be exploited
GPUs: more than just gaming

Processing power – millions of operations per second

    Single Core          12
    Dual Core            24
    Quad Core            48
    Hexa Core            72
    12 Cores            144
    Radeon HD 5970     2700

Both use GPUs: Wii Sports - Golf and an oil exploration platform (2010)
DirectX® 11 Multi-Threading

• Application, DirectX runtime, and DirectX driver can each run in separate threads
• Tasks like loading a texture or compiling a shader can execute in parallel with the main rendering thread

[Diagram: DirectX® 10 vs. DirectX® 11]
Today’s GPUs focused on


GAMING




ENTERTAINMENT




PRODUCTIVITY
DirectX® 11 Tessellation


[Images: DirectX® 10 (no tessellation) vs. DirectX® 11 (tessellation)]

Images courtesy of Unigine Corp.




5/26/2011
Research companies already using




Oil exploration   Weather forecast   Fluid Dynamics   Nature simulation

AMD Balanced Platform
CPU is excellent for running some algorithms
• Ideal place to process if GPU is fully loaded
• Great use for additional CPU cores

GPU is ideal for data parallel algorithms like image processing, CAE, etc.
• Great use for ATI Stream technology
• Great use for additional GPUs

[Diagram: Serial/Task-Parallel Workloads, Graphics Workloads, Other Highly Parallel Workloads]

Delivers optimal performance for a wide range of platform configurations
ATI Stream Technology is…

Heterogeneous: Developers leverage AMD GPUs and x86
CPUs for optimal application performance and user experience

High performance: Massively parallel, programmable GPU
architecture delivers unprecedented performance and power
efficiency

Industry Standards: OpenCL™ and DirectCompute 11 enable
cross-platform development




  Sciences   Government   Engineering   Gaming   Digital Content Creation   Productivity
Improvements already reached consumers



[Chart: processor utilization, axis from 0% to 80%; labeled series: ATI Stream]

Adobe Flash plugin used by Youtube.com
• Better image quality and video smoothness
• Lower processor usage
GPU-accelerated video transcoding




[Images: HD Video and iPod Video]

Up to 6x faster when using an AMD graphics card
Video Transcoding Sample
No GPU Acceleration
[Screenshot: CPU usage 100%, using four CPU cores; GPU usage 1%]

  CPU Usage: 100%     Time to finish: 1h 52m     Total energy: 0.23 kWh
  GPU Usage: 1%       Peak power: 145W           Energy price: $0.15
Video Transcoding Sample
ATI GPU Acceleration
[Screenshot: CPU usage 45%; GPU usage 35%, using hundreds of Stream Processors]

Values in parentheses are from the run without GPU acceleration.
  CPU Usage: 45% (100%)   Time to finish: 26m (1h 52m)   Total energy: 0.11 kWh (0.23 kWh)
  GPU Usage: 35% (1%)     Peak power: 198W (145W)        Energy price: $0.07 ($0.15)
FUSION TECHNOLOGY
Today




  Multi-core CPU             TeraFLOPS-class GPU

  ~800 million transistors        Up to 2 billion transistors

  Multi-tasking               Multi-monitor gaming

                                     Full HD video and audio
A new era in performance evolution

Single-Core
  Challenge: power consumption, complexity

Multi-Core
  Challenge: power consumption, software

Heterogeneous computing
  Pros: performance, power efficient
  Cons: software availability

[Charts: performance plotted against Time (single-core, single-thread), Time x Cores (multi-core), and Time (heterogeneous computing); each curve carries a "We are here" marker, and a "?" appears between the first two charts]
A new era in performance evolution

[Diagram labels: CPU (Single-Core, Multi-Core, Core efficiency); GPU (Gaming, Multimedia, Software Acceleration)]
Putting it all together – The Future is Fusion
  AMD “Istanbul” six-core processor   |   RV500 GPU Core (2006)

[Diagram, “Istanbul”: cores 1-6, one L2 cache per core, L3 cache, crossbar, HyperTransport, memory controller, PCI-e, chipset]
[Diagram, RV500: a central memory controller surrounded by four ring stops and eight client interfaces]
Putting it all together – The Future is Fusion
  AMD “Istanbul” six-core processor   |   RV700 GPU Core (2008-2009)

[Diagram, “Istanbul”: cores 1-6, one L2 cache per core, L3 cache, crossbar, HyperTransport, memory controller, PCI-e, chipset, shown beside the RV700 GPU core]
Putting it all together – The Future is Fusion
  AMD “Istanbul” six-core processor   |   RV700 GPU Core

[Diagram: a CROSSBAR block is labeled in both chips]
2011: welcome to the APU era!




CPU                    APU                      GPU

 “Supercomputing power in a notebook platform whose
            battery lasts for a full day”
One Design, Fewer Watts, Massive Capability

Northbridge + Dual-Core CPU + Discrete-level DirectX® 11 GPU = “Zacate” AMD Fusion APU

Component                          Area          Power
Northbridge                        66 sq. mm     13 watts
Dual-Core CPU                      117 sq. mm    25 watts
Discrete-level DirectX® 11 GPU     59 sq. mm     8 watts
“Zacate” AMD Fusion APU            75 sq. mm     18 watts
Graphics and Media Processing Efficiency Improvements

Graphics requires memory bandwidth to bring full capabilities to life

2010 IGP-based Platform
[Diagram labels: CPU chip, CPU cores, UNB, MC, DDR3 DIMM memory, GPU, UVD, SB functions, PCIe; bandwidth labels: ~17 GB/sec, ~17 GB/sec, ~7 GB/sec]
Bandwidth pinch points and latency hold back the GPU capabilities

2011 APU-based Platform
[Diagram labels: APU chip, CPU cores, UNB / MC, UVD, GPU, PCIe, DDR3 DIMM memory; bandwidth labels: ~27 GB/sec, ~27 GB/sec]
• 3X bandwidth between GPU and memory
• Even the same sized GPU is substantially more effective in this configuration
• Eliminate latency and power associated with the extra chip crossing
• Substantially smaller physical footprint
“Ontario” & “Zacate” Architecture
 APU
 > 2 x86 CPU cores (40nm “Bobcat” core – 1 MB L2, 64-bit FPU)
 > C6 and power gating
 > Array of SIMD Engines
    • DX11 graphics performance
    • Industry-leading 3D and graphics processing
 > 3rd Generation Unified Video Decoder
    • H.264, VC1, DivX/Xvid format
 > DDR3 800-1066, 2 DIMMs, 64-bit channel
 > BGA package

 Display and I/O
 > Two dedicated digital display interfaces
    • Configurable externally as HDMI, DVI, and/or DisplayPort
    • Also supports a single-link LVDS for internal panels
 > Integrated VGA
 > 5x8 PCIe®
 > “Hudson” Fusion Controller Hub
OpenCL
Working together
ATI Stream SDK:
OpenCL™ For Multicore x86 CPUs and GPUs
http://developer.amd.com/

The Power of Fusion: Developers leverage heterogeneous architecture to deliver superior user experience
 • First complete OpenCL™ development platform
 • Certified OpenCL 1.0 compliant by the Khronos Group
 • Write code that can scale well on multi-core CPUs and GPUs
 • AMD delivers on the promise of OpenCL™, with both high-performance CPU and GPU technologies
 • Available for download now as part of ATI Stream SDK beta program – includes documentation, samples, and developer support
OpenCL™: Game-Changing Development
Enabling Broad Adoption of GP-GPU Capabilities

 • Industry standard API: Open, multiplatform development platform for heterogeneous architectures
 • The power of Fusion: Leverages CPUs and GPUs for balanced system approach
 • Broad industry support: Created by architects from AMD, Apple, IBM, Intel, Nvidia, Sony, etc.
 • Fast track development: Ratified in December; AMD is the first company to provide a complete OpenCL solution
 • Momentum: Enormous interest from mainstream developers and application ISVs

More stream-enabled applications across all markets
Open Standards:
Maximize Developer Freedom and Addressable Market

Vendor specific: cross-platform limiters
  • Apple Display Connector
  • 3dfx Glide
  • Nvidia CUDA
  • Nvidia Cg
  • Rambus
  • Unified Display Interface

Vendor neutral: cross-platform enablers
  • Digital Visual Interface
  • OpenCL™
  • DirectX®
  • Certified DP
  • JEDEC
  • OpenGL®
Comparing OpenCL™ and DirectX® 11 DirectCompute


How will developers choose between OpenCL™ and DirectX® 11 DirectCompute?
 • Feature set is similar in both APIs

DirectX® 11 DirectCompute
 • Easiest path to add compute capabilities to existing DirectX applications
 • Windows Vista® and Windows® 7 only

OpenCL™
 • Ideal path for new applications porting to the GPU for the first time
 • True multiplatform: Windows®, Linux®, MacOS
 • Natural programming without dealing with a graphics API
Anatomy of OpenCL™


                             Language Specification
  • C-based cross-platform programming interface
  • Subset of ISO C99 with language extensions - familiar to developers
  • Well-defined numerical accuracy - IEEE 754 rounding behavior with defined maximum error
  • Online or offline compilation and build of compute kernel executables
  • Includes a rich set of built-in functions



                                 Platform Layer API

  • A hardware abstraction layer over diverse computational resources
  • Query, select and initialize compute devices
  • Create compute contexts and work-queues



                                     Runtime API
  • Execute compute kernels
  • Manage scheduling, compute, and memory resources
OpenCL Example

                                       Scalar

   void square(int n, const float *a, float *result)
   {
      int i;
      for (i=0; i<n; i++)
         result[i] = a[i] * a[i];
   }



                                  Data-Parallel

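   // One work-item squares one element; buffers live in global memory.
   __kernel void dp_square(__global const float *a, __global float *result)
   {
     int id = get_global_id(0);   // index of this work-item
     result[id] = a[id] * a[id];
   }

   // dp_square executes over “n” work-items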
Summary


X86 PROCESSOR EVOLUTION



THE GPU AS AN ACCELERATOR



ACCELERATED PROCESSING UNITS


INTRODUCTION TO OpenCL
http://developer.amd.com

Thank you!
roberto.brandao@amd.com
