-
Compressing Structured Tensor Algebra
Authors:
Mahdi Ghorbani,
Emilien Bauer,
Tobias Grosser,
Amir Shaikhha
Abstract:
Tensor algebra is a crucial component for data-intensive workloads such as machine learning and scientific computing. As the complexity of data grows, scientists often encounter a dilemma between the highly specialized dense tensor algebra and efficient structure-aware algorithms provided by sparse tensor algebra. In this paper, we introduce DASTAC, a framework to propagate the tensors's captured…
▽ More
Tensor algebra is a crucial component for data-intensive workloads such as machine learning and scientific computing. As the complexity of data grows, scientists often encounter a dilemma between the highly specialized dense tensor algebra and efficient structure-aware algorithms provided by sparse tensor algebra. In this paper, we introduce DASTAC, a framework to propagate the tensors's captured high-level structure down to low-level code generation by incorporating techniques such as automatic data layout compression, polyhedral analysis, and affine code generation. Our methodology reduces memory footprint by automatically detecting the best data layout, heavily benefits from polyhedral optimizations, leverages further optimizations, and enables parallelization through MLIR. Through extensive experimentation, we show that DASTAC achieves 1 to 2 orders of magnitude speedup over TACO, a state-of-the-art sparse tensor compiler, and StructTensor, a state-of-the-art structured tensor algebra compiler, with a significantly lower memory footprint.
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
Verifying Peephole Rewriting In SSA Compiler IRs
Authors:
Siddharth Bhat,
Alex Keizer,
Chris Hughes,
Andrés Goens,
Tobias Grosser
Abstract:
There is an increasing need for domain-specific reasoning in modern compilers. This has fueled the use of tailored intermediate representations (IRs) based on static single assignment (SSA), like in the MLIR compiler framework. Interactive theorem provers (ITPs) provide strong guarantees for the end-to-end verification of compilers (e.g., CompCert). However, modern compilers and their IRs evolve a…
▽ More
There is an increasing need for domain-specific reasoning in modern compilers. This has fueled the use of tailored intermediate representations (IRs) based on static single assignment (SSA), like in the MLIR compiler framework. Interactive theorem provers (ITPs) provide strong guarantees for the end-to-end verification of compilers (e.g., CompCert). However, modern compilers and their IRs evolve at a rate that makes proof engineering alongside them prohibitively expensive. Nevertheless, well-scoped push-button automated verification tools such as the Alive peephole verifier for LLVM-IR gained recognition in domains where SMT solvers offer efficient (semi) decision procedures. In this paper, we aim to combine the convenience of automation with the versatility of ITPs for verifying peephole rewrites across domain-specific IRs. We formalize a core calculus for SSA-based IRs that is generic over the IR and covers so-called regions (nested scoping used by many domain-specific IRs in the MLIR ecosystem). Our mechanization in the Lean proof assistant provides a user-friendly frontend for translating MLIR syntax into our calculus. We provide scaffolding for defining and verifying peephole rewrites, offering tactics to eliminate the abstraction overhead of our SSA calculus. We prove correctness theorems about peephole rewriting, as well as two classical program transformations. To evaluate our framework, we consider three use cases from the MLIR ecosystem that cover different levels of abstractions: (1) bitvector rewrites from LLVM, (2) structured control flow, and (3) fully homomorphic encryption. We envision that our mechanization provides a foundation for formally verified rewrites on new domain-specific IRs.
△ Less
Submitted 4 July, 2024;
originally announced July 2024.
-
Strided Difference Bound Matrices
Authors:
Arjun Pitchanathan,
Albert Cohen,
Oleksandr Zinenko,
Tobias Grosser
Abstract:
A wide range of symbolic analysis and optimization problems can be formalized using polyhedra. Sub-classes of polyhedra, also known as sub-polyhedral domains, are sought for their lower space and time complexity. We introduce the Strided Difference Bound Matrix (SDBM) domain, which represents a sweet spot in the context of optimizing compilers. Its expressiveness and efficient algorithms are parti…
▽ More
A wide range of symbolic analysis and optimization problems can be formalized using polyhedra. Sub-classes of polyhedra, also known as sub-polyhedral domains, are sought for their lower space and time complexity. We introduce the Strided Difference Bound Matrix (SDBM) domain, which represents a sweet spot in the context of optimizing compilers. Its expressiveness and efficient algorithms are particularly well suited to the construction of machine learning compilers. We present decision algorithms, abstract domain operators and computational complexity proofs for SDBM. We also conduct an empirical study with the MLIR compiler framework to validate the domain's practical applicability. We characterize a sub-class of SDBMs that frequently occurs in practice, and demonstrate even faster algorithms on this sub-class.
△ Less
Submitted 4 July, 2024; v1 submitted 18 May, 2024;
originally announced May 2024.
-
A shared compilation stack for distributed-memory parallelism in stencil DSLs
Authors:
George Bisbas,
Anton Lydike,
Emilien Bauer,
Nick Brown,
Mathieu Fehr,
Lawrence Mitchell,
Gabriel Rodriguez-Canal,
Maurice Jamieson,
Paul H. J. Kelly,
Michel Steuwer,
Tobias Grosser
Abstract:
Domain Specific Languages (DSLs) increase programmer productivity and provide high performance. Their targeted abstractions allow scientists to express problems at a high level, providing rich details that optimizing compilers can exploit to target current- and next-generation supercomputers. The convenience and performance of DSLs come with significant development and maintenance costs. The siloe…
▽ More
Domain Specific Languages (DSLs) increase programmer productivity and provide high performance. Their targeted abstractions allow scientists to express problems at a high level, providing rich details that optimizing compilers can exploit to target current- and next-generation supercomputers. The convenience and performance of DSLs come with significant development and maintenance costs. The siloed design of DSL compilers and the resulting inability to benefit from shared infrastructure cause uncertainties around longevity and the adoption of DSLs at scale. By tailoring the broadly-adopted MLIR compiler framework to HPC, we bring the same synergies that the machine learning community already exploits across their DSLs (e.g. Tensorflow, PyTorch) to the finite-difference stencil HPC community. We introduce new HPC-specific abstractions for message passing targeting distributed stencil computations. We demonstrate the sharing of common components across three distinct HPC stencil-DSL compilers: Devito, PSyclone, and the Open Earth Compiler, showing that our framework generates high-performance executables based upon a shared compiler ecosystem.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
Sidekick compilation with xDSL
Authors:
Mathieu Fehr,
Michel Weber,
Christian Ulmann,
Alexandre Lopoukhine,
Martin Lücke,
Théo Degioanni,
Michel Steuwer,
Tobias Grosser
Abstract:
Traditionally, compiler researchers either conduct experiments within an existing production compiler or develop their own prototype compiler; both options come with trade-offs. On one hand, prototyping in a production compiler can be cumbersome, as they are often optimized for program compilation speed at the expense of software simplicity and development speed. On the other hand, the transition…
▽ More
Traditionally, compiler researchers either conduct experiments within an existing production compiler or develop their own prototype compiler; both options come with trade-offs. On one hand, prototyping in a production compiler can be cumbersome, as they are often optimized for program compilation speed at the expense of software simplicity and development speed. On the other hand, the transition from a prototype compiler to production requires significant engineering work. To bridge this gap, we introduce the concept of sidekick compiler frameworks, an approach that uses multiple frameworks that interoperate with each other by leveraging textual interchange formats and declarative descriptions of abstractions. Each such compiler framework is specialized for specific use cases, such as performance or prototyping. Abstractions are by design shared across frameworks, simplifying the transition from prototyping to production. We demonstrate this idea with xDSL, a sidekick for MLIR focused on prototyping and teaching. xDSL interoperates with MLIR through a shared textual IR and the exchange of IRs through an IR Definition Language. The benefits of sidekick compiler frameworks are evaluated by showing on three use cases how xDSL impacts their development: teaching, DSL compilation, and rewrite system prototyping. We also investigate the trade-offs that xDSL offers, and demonstrate how we simplify the transition between frameworks using the IRDL dialect. With sidekick compilation, we envision a future in which engineers minimize the cost of development by choosing a framework built for their immediate needs, and later transitioning to production with minimal overhead.
△ Less
Submitted 16 June, 2024; v1 submitted 13 November, 2023;
originally announced November 2023.
-
mlirSynth: Automatic, Retargetable Program Raising in Multi-Level IR using Program Synthesis
Authors:
Alexander Brauckmann,
Elizabeth Polgreen,
Tobias Grosser,
Michael F. P. O'Boyle
Abstract:
MLIR is an emerging compiler infrastructure for modern hardware, but existing programs cannot take advantage of MLIR's high-performance compilation if they are described in lower-level general purpose languages. Consequently, to avoid programs needing to be rewritten manually, this has led to efforts to automatically raise lower-level to higher-level dialects in MLIR. However, current methods rely…
▽ More
MLIR is an emerging compiler infrastructure for modern hardware, but existing programs cannot take advantage of MLIR's high-performance compilation if they are described in lower-level general purpose languages. Consequently, to avoid programs needing to be rewritten manually, this has led to efforts to automatically raise lower-level to higher-level dialects in MLIR. However, current methods rely on manually-defined raising rules, which limit their applicability and make them challenging to maintain as MLIR dialects evolve.
We present mlirSynth -- a novel approach which translates programs from lower-level MLIR dialects to high-level ones without manually defined rules. Instead, it uses available dialect definitions to construct a program space and searches it effectively using type constraints and equivalences. We demonstrate its effectiveness \revi{by raising C programs} to two distinct high-level MLIR dialects, which enables us to use existing high-level dialect specific compilation flows. On Polybench, we show a greater coverage than previous approaches, resulting in geomean speedups of 2.5x (Intel) and 3.4x (AMD) over state-of-the-art compilation flows for the C programming language. mlirSynth also enables retargetability to domain-specific accelerators, resulting in a geomean speedup of 21.6x on a TPU.
△ Less
Submitted 6 October, 2023;
originally announced October 2023.
-
Stencil-HMLS: A multi-layered approach to the automatic optimisation of stencil codes on FPGA
Authors:
Gabriel Rodriguez-Canal,
Nick Brown,
Maurice Jamieson,
Emilien Bauer,
Anton Lydike,
Tobias Grosser
Abstract:
The challenges associated with effectively programming FPGAs have been a major blocker in popularising reconfigurable architectures for HPC workloads. However new compiler technologies, such as MLIR, are providing new capabilities which potentially deliver the ability to extract domain specific information and drive automatic structuring of codes for FPGAs.
In this paper we explore domain specif…
▽ More
The challenges associated with effectively programming FPGAs have been a major blocker in popularising reconfigurable architectures for HPC workloads. However new compiler technologies, such as MLIR, are providing new capabilities which potentially deliver the ability to extract domain specific information and drive automatic structuring of codes for FPGAs.
In this paper we explore domain specific optimisations for stencils, a fundamental access pattern in scientific computing, to obtain high performance on FPGAs via automated code structuring. We propose Stencil-HMLS, a multi-layered approach to automatic optimisation of stencil codes and introduce the HLS dialect, which brings FPGA programming into the MLIR ecosystem. Using the PSyclone Fortran DSL, we demonstrate an improvement of 14-100$\times$ with respect to the next best performant state-of-the-art tool. Furthermore, our approach is 14 to 92 times more energy efficient than the next most energy efficient approach.
△ Less
Submitted 3 October, 2023;
originally announced October 2023.
-
Fortran performance optimisation and auto-parallelisation by leveraging MLIR-based domain specific abstractions in Flang
Authors:
Nick Brown,
Maurice Jamieson,
Anton Lydike,
Emilien Bauer,
Tobias Grosser
Abstract:
MLIR has become popular since it was open sourced in 2019. A sub-project of LLVM, the flexibility provided by MLIR to represent Intermediate Representations (IR) as dialects at different abstraction levels, to mix these, and to leverage transformations between dialects provides opportunities for automated program optimisation and parallelisation. In addition to general purpose compilers built upon…
▽ More
MLIR has become popular since it was open sourced in 2019. A sub-project of LLVM, the flexibility provided by MLIR to represent Intermediate Representations (IR) as dialects at different abstraction levels, to mix these, and to leverage transformations between dialects provides opportunities for automated program optimisation and parallelisation. In addition to general purpose compilers built upon MLIR, domain specific abstractions have also been developed.
In this paper we explore complimenting the Flang MLIR general purpose compiler by combining with the domain specific Open Earth Compiler's MLIR stencil dialect. Developing transformations to discover and extracts stencils from Fortran, this specialisation delivers between a 2 and 10 times performance improvement for our benchmarks on a Cray supercomputer compared to using Flang alone. Furthermore, by leveraging existing MLIR transformations we develop an auto-parallelisation approach targeting multi-threaded and distributed memory parallelism, and optimised execution on GPUs, without any modifications to the serial Fortran source code.
△ Less
Submitted 3 October, 2023;
originally announced October 2023.
-
CONVOLVE: Smart and seamless design of smart edge processors
Authors:
M. Gomony,
F. Putter,
A. Gebregiorgis,
G. Paulin,
L. Mei,
V. Jain,
S. Hamdioui,
V. Sanchez,
T. Grosser,
M. Geilen,
M. Verhelst,
F. Zenke,
F. Gurkaynak,
B. Bruin,
S. Stuijk,
S. Davidson,
S. De,
M. Ghogho,
A. Jimborean,
S. Eissa,
L. Benini,
D. Soudris,
R. Bishnoi,
S. Ainsworth,
F. Corradi
, et al. (3 additional authors not shown)
Abstract:
With the rise of Deep Learning (DL), our world braces for AI in every edge device, creating an urgent need for edge-AI SoCs. This SoC hardware needs to support high throughput, reliable and secure AI processing at Ultra Low Power (ULP), with a very short time to market. With its strong legacy in edge solutions and open processing platforms, the EU is well-positioned to become a leader in this SoC…
▽ More
With the rise of Deep Learning (DL), our world braces for AI in every edge device, creating an urgent need for edge-AI SoCs. This SoC hardware needs to support high throughput, reliable and secure AI processing at Ultra Low Power (ULP), with a very short time to market. With its strong legacy in edge solutions and open processing platforms, the EU is well-positioned to become a leader in this SoC market. However, this requires AI edge processing to become at least 100 times more energy-efficient, while offering sufficient flexibility and scalability to deal with AI as a fast-moving target. Since the design space of these complex SoCs is huge, advanced tooling is needed to make their design tractable. The CONVOLVE project (currently in Inital stage) addresses these roadblocks. It takes a holistic approach with innovations at all levels of the design hierarchy. Starting with an overview of SOTA DL processing support and our project methodology, this paper presents 8 important design choices largely impacting the energy efficiency and flexibility of DL hardware. Finding good solutions is key to making smart-edge computing a reality.
△ Less
Submitted 2 May, 2023; v1 submitted 1 December, 2022;
originally announced December 2022.
-
TensorFlow as a DSL for stencil-based computation on the Cerebras Wafer Scale Engine
Authors:
Nick Brown,
Brandon Echols,
Justs Zarins,
Tobias Grosser
Abstract:
The Cerebras Wafer Scale Engine (WSE) is an accelerator that combines hundreds of thousands of AI-cores onto a single chip. Whilst this technology has been designed for machine learning workloads, the significant amount of available raw compute means that it is also a very interesting potential target for accelerating traditional HPC computational codes. Many of these algorithms are stencil-based,…
▽ More
The Cerebras Wafer Scale Engine (WSE) is an accelerator that combines hundreds of thousands of AI-cores onto a single chip. Whilst this technology has been designed for machine learning workloads, the significant amount of available raw compute means that it is also a very interesting potential target for accelerating traditional HPC computational codes. Many of these algorithms are stencil-based, where update operations involve contributions from neighbouring elements, and in this paper we explore the suitability of this technology for such codes from the perspective of an early adopter of the technology, compared to CPUs and GPUs. Using TensorFlow as the interface, we explore the performance and demonstrate that, whilst there is still work to be done around exposing the programming interface to users, performance of the WSE is impressive as it out performs four V100 GPUs by two and a half times and two Intel Xeon Platinum CPUs by around 114 times in our experiments. There is significant potential therefore for this technology to play an important role in accelerating HPC codes on future exascale supercomputers.
△ Less
Submitted 26 August, 2022;
originally announced October 2022.
-
MOM: Matrix Operations in MLIR
Authors:
Lorenzo Chelini,
Henrik Barthels,
Paolo Bientinesi,
Marcin Copik,
Tobias Grosser,
Daniele G. Spampinato
Abstract:
Modern research in code generators for dense linear algebra computations has shown the ability to produce optimized code with a performance which compares and often exceeds the one of state-of-the-art implementations by domain experts. However, the underlying infrastructure is often developed in isolation making the interconnection of logically combinable systems complicated if not impossible. In…
▽ More
Modern research in code generators for dense linear algebra computations has shown the ability to produce optimized code with a performance which compares and often exceeds the one of state-of-the-art implementations by domain experts. However, the underlying infrastructure is often developed in isolation making the interconnection of logically combinable systems complicated if not impossible. In this paper, we propose to leverage MLIR as a unifying compiler infrastructure for the optimization of dense linear algebra operations. We propose a new MLIR dialect for expressing linear algebraic computations including matrix properties to enable high-level algorithmic transformations. The integration of this new dialect in MLIR enables end-to-end compilation of matrix computations via conversion to existing lower-level dialects already provided by the framework.
△ Less
Submitted 22 August, 2022;
originally announced August 2022.
-
Lambda the Ultimate SSA: Optimizing Functional Programs in SSA
Authors:
Siddharth Bhat,
Tobias Grosser
Abstract:
Static Single Assignment (SSA) is the workhorse of modern optimizing compilers for imperative programming languages. However, functional languages have been slow to adopt SSA and prefer to use intermediate representations based on minimal lambda calculi due to SSA's inability to express higher order constructs. We exploit a new SSA construct -- regions -- in order to express functional optimizatio…
▽ More
Static Single Assignment (SSA) is the workhorse of modern optimizing compilers for imperative programming languages. However, functional languages have been slow to adopt SSA and prefer to use intermediate representations based on minimal lambda calculi due to SSA's inability to express higher order constructs. We exploit a new SSA construct -- regions -- in order to express functional optimizations via classical SSA based reasoning. Region optimization currently relies on ad-hoc analyses and transformations on imperative programs. These ad-hoc transformations are sufficient for imperative languages as regions are used in a limited fashion. In contrast, we use regions pervasively to model sub-expressions in our functional IR. This motivates us to systematize region optimizations. We extend classical SSA reasoning to regions for functional-style analyses and transformations. We implement a new SSA+regions based backend for LEAN4, a theorem prover that implements a purely functional, dependently typed programming language. Our backend is feature-complete and handles all constructs of LEAN4's functional intermediate representation λrc within the SSA framework. We evaluate our proposed region optimizations by optimizing λrc within an SSA+regions based framework implemented in MLIR and demonstrating performance parity with the current LEAN4 backend. We believe our work will pave the way for a unified optimization framework capable of representing, analyzing, and optimizing both functional and imperative languages.
△ Less
Submitted 18 January, 2022;
originally announced January 2022.
-
Extracting Clean Performance Models from Tainted Programs
Authors:
Marcin Copik,
Alexandru Calotoiu,
Tobias Grosser,
Nicolas Wicki,
Felix Wolf,
Torsten Hoefler
Abstract:
Performance models are well-known instruments to understand the scaling behavior of parallel applications. They express how performance changes as key execution parameters, such as the number of processes or the size of the input problem, vary. Besides reasoning about program behavior, such models can also be automatically derived from performance data. This is called empirical performance modelin…
▽ More
Performance models are well-known instruments to understand the scaling behavior of parallel applications. They express how performance changes as key execution parameters, such as the number of processes or the size of the input problem, vary. Besides reasoning about program behavior, such models can also be automatically derived from performance data. This is called empirical performance modeling. While this sounds simple at the first glance, this approach faces several serious interrelated challenges, including expensive performance measurements, inaccuracies inflicted by noisy benchmark data, and overall complex experiment design, starting with the selection of the right parameters. The more parameters one considers, the more experiments are needed and the stronger the impact of noise. In this paper, we show how taint analysis, a technique borrowed from the domain of computer security, can substantially improve the modeling process, lowering its cost, improving model quality, and help validate performance models and experimental setups.
△ Less
Submitted 31 December, 2020;
originally announced December 2020.
-
Work-stealing prefix scan: Addressing load imbalance in large-scale image registration
Authors:
Marcin Copik,
Tobias Grosser,
Torsten Hoefler,
Paolo Bientinesi,
Benjamin Berkels
Abstract:
Parallelism patterns (e.g., map or reduce) have proven to be effective tools for parallelizing high-performance applications. In this paper, we study the recursive registration of a series of electron microscopy images - a time consuming and imbalanced computation necessary for nano-scale microscopy analysis. We show that by translating the image registration into a specific instance of the prefix…
▽ More
Parallelism patterns (e.g., map or reduce) have proven to be effective tools for parallelizing high-performance applications. In this paper, we study the recursive registration of a series of electron microscopy images - a time consuming and imbalanced computation necessary for nano-scale microscopy analysis. We show that by translating the image registration into a specific instance of the prefix scan, we can convert this seemingly sequential problem into a parallel computation that scales to over thousand of cores. We analyze a variety of scan algorithms that behave similarly for common low-compute operators and propose a novel work-stealing procedure for a hierarchical prefix scan. Our evaluation shows that by identifying a suitable and well-optimized prefix scan algorithm, we reduce time-to-solution on a series of 4,096 images spanning ten seconds of microscopy acquisition from over 10 hours to less than 3 minutes (using 1024 Intel Haswell cores), enabling derivation of material properties at nanoscale for long microscopy image series.
△ Less
Submitted 23 October, 2020;
originally announced October 2020.
-
Domain-Specific Multi-Level IR Rewriting for GPU
Authors:
Tobias Gysi,
Christoph Müller,
Oleksandr Zinenko,
Stephan Herhut,
Eddie Davis,
Tobias Wicky,
Oliver Fuhrer,
Torsten Hoefler,
Tobias Grosser
Abstract:
Traditional compilers operate on a single generic intermediate representation (IR). These IRs are usually low-level and close to machine instructions. As a result, optimizations relying on domain-specific information are either not possible or require complex analysis to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (IRs), lowers programs…
▽ More
Traditional compilers operate on a single generic intermediate representation (IR). These IRs are usually low-level and close to machine instructions. As a result, optimizations relying on domain-specific information are either not possible or require complex analysis to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (IRs), lowers programs level-by-level, and performs code transformations at the most suitable level. We demonstrate the effectiveness of this approach for the weather and climate domain. In particular, we develop a prototype compiler and design stencil- and GPU-specific dialects based on a set of newly introduced design principles. We find that two domain-specific optimizations (500 lines of code) realized on top of LLVM's extensible MLIR compiler infrastructure suffice to outperform state-of-the-art solutions. In essence, multi-level rewriting promises to herald the age of specialized compilers composed from domain- and target-specific dialects implemented on top of a shared infrastructure.
△ Less
Submitted 27 July, 2020; v1 submitted 26 May, 2020;
originally announced May 2020.
-
LLHD: A Multi-level Intermediate Representation for Hardware Description Languages
Authors:
Fabian Schuiki,
Andreas Kurth,
Tobias Grosser,
Luca Benini
Abstract:
Modern Hardware Description Languages (HDLs) such as SystemVerilog or VHDL are, due to their sheer complexity, insufficient to transport designs through modern circuit design flows. Instead, each design automation tool lowers HDLs to its own Intermediate Representation (IR). These tools are monolithic and mostly proprietary, disagree in their implementation of HDLs, and while many redundant IRs ex…
▽ More
Modern Hardware Description Languages (HDLs) such as SystemVerilog or VHDL are, due to their sheer complexity, insufficient to transport designs through modern circuit design flows. Instead, each design automation tool lowers HDLs to its own Intermediate Representation (IR). These tools are monolithic and mostly proprietary, disagree in their implementation of HDLs, and while many redundant IRs exists, no IR today can be used through the entire circuit design flow. To solve this problem, we propose the LLHD multi-level IR. LLHD is designed as simple, unambiguous reference description of a digital circuit, yet fully captures existing HDLs. We show this with our reference compiler on designs as complex as full CPU cores. LLHD comes with lowering passes to a hardware-near structural IR, which readily integrates with existing tools. LLHD establishes the basis for innovation in HDLs and tools without redundant compilers or disjoint IRs. For instance, we implement an LLHD simulator that runs up to 2.4x faster than commercial simulators but produces equivalent, cycle-accurate results. An initial vertically-integrated research prototype is capable of representing all levels of the IR, implements lowering from the behavioural to the structural IR, and covers a sufficient subset of SystemVerilog to support a full CPU design.
△ Less
Submitted 7 April, 2020;
originally announced April 2020.
-
Compiling Neural Networks for a Computational Memory Accelerator
Authors:
Kornilios Kourtis,
Martino Dazzi,
Nikolas Ioannou,
Tobias Grosser,
Abu Sebastian,
Evangelos Eleftheriou
Abstract:
Computational memory (CM) is a promising approach for accelerating inference on neural networks (NN) by using enhanced memories that, in addition to storing data, allow computations on them. One of the main challenges of this approach is defining a hardware/software interface that allows a compiler to map NN models for efficient execution on the underlying CM accelerator. This is a non-trivial tas…
▽ More
Computational memory (CM) is a promising approach for accelerating inference on neural networks (NN) by using enhanced memories that, in addition to storing data, allow computations on them. One of the main challenges of this approach is defining a hardware/software interface that allows a compiler to map NN models for efficient execution on the underlying CM accelerator. This is a non-trivial task because efficiency dictates that the CM accelerator is explicitly programmed as a dataflow engine where the execution of the different NN layers form a pipeline. In this paper, we present our work towards a software stack for executing ML models on such a multi-core CM accelerator. We describe an architecture for the hardware and software, and focus on the problem of implementing the appropriate control logic so that data dependencies are respected. We propose a solution to the latter that is based on polyhedral compilation.
△ Less
Submitted 24 April, 2020; v1 submitted 5 March, 2020;
originally announced March 2020.
-
A Fast Analytical Model of Fully Associative Caches
Authors:
Tobias Gysi,
Tobias Grosser,
Laurin Brandner,
Torsten Hoefler
Abstract:
While the cost of computation is an easy to understand local property, the cost of data movement on cached architectures depends on global state, does not compose, and is hard to predict. As a result, programmers often fail to consider the cost of data movement. Existing cache models and simulators provide the missing information but are computationally expensive. We present a lightweight cache mo…
▽ More
While the cost of computation is an easy to understand local property, the cost of data movement on cached architectures depends on global state, does not compose, and is hard to predict. As a result, programmers often fail to consider the cost of data movement. Existing cache models and simulators provide the missing information but are computationally expensive. We present a lightweight cache model for fully associative caches with least recently used (LRU) replacement policy that gives fast and accurate results. We count the cache misses without explicit enumeration of all memory accesses by using symbolic counting techniques twice: 1) to derive the stack distance for each memory access and 2) to count the memory accesses with stack distance larger than the cache size. While this technique seems infeasible in theory, due to non-linearities after the first round of counting, we show that the counting problems are sufficiently linear in practice. Our cache model often computes the results within seconds and contrary to simulation the execution time is mostly problem size independent. Our evaluation measures modeling errors below 0.6% on real hardware. By providing accurate data placement information we enable memory hierarchy aware software development.
△ Less
Submitted 6 January, 2020;
originally announced January 2020.
-
Accelerator Codesign as Non-Linear Optimization
Authors:
Nirmal Prajapati,
Sanjay Rajopadhye,
Hristo Djidjev,
Nandkishore Santhi,
Tobias Grosser,
Rumen Andonov
Abstract:
We propose an optimization approach for determining both hardware and software parameters for the efficient implementation of a (family of) applications called dense stencil computations on programmable GPGPUs. We first introduce a simple, analytical model for the silicon area usage of accelerator architectures and a workload characterization of stencil computations. We combine this characterizati…
▽ More
We propose an optimization approach for determining both hardware and software parameters for the efficient implementation of a (family of) applications called dense stencil computations on programmable GPGPUs. We first introduce a simple, analytical model for the silicon area usage of accelerator architectures and a workload characterization of stencil computations. We combine this characterization with a parametric execution time model and formulate a mathematical optimization problem. That problem seeks to maximize a common objective function of 'all the hardware and software parameters'. The solution to this problem, therefore "solves" the codesign problem: simultaneously choosing software-hardware parameters to optimize total performance.
We validate this approach by proposing architectural variants of the NVIDIA Maxwell GTX-980 (respectively, Titan X) specifically tuned to a predetermined workload of four common 2D stencils (Heat, Jacobi, Laplacian, and Gradient) and two 3D ones (Heat and Laplacian). Our model predicts that performance would potentially improve by 28% (respectively, 33%) with simple tweaks to the hardware parameters such as adapting coarse and fine-grained parallelism by changing the number of streaming multiprocessors and the number of compute cores each contains. We propose a set of Pareto-optimal design points to exploit the trade-off between performance and silicon area and show that by additionally eliminating GPU caches, we can get a further 2-fold improvement.
△ Less
Submitted 13 December, 2017;
originally announced December 2017.
-
PENCIL: Towards a Platform-Neutral Compute Intermediate Language for DSLs
Authors:
Riyadh Baghdadi,
Albert Cohen,
Serge Guelton,
Sven Verdoolaege,
Jun Inoue,
Tobias Grosser,
Georgia Kouveli,
Alexey Kravets,
Anton Lokhmotov,
Cedric Nugteren,
Fraser Waters,
Alastair F. Donaldson
Abstract:
We motivate the design and implementation of a platform-neutral compute intermediate language (PENCIL) for productive and performance-portable accelerator programming.
We motivate the design and implementation of a platform-neutral compute intermediate language (PENCIL) for productive and performance-portable accelerator programming.
△ Less
Submitted 22 February, 2013;
originally announced February 2013.