Luis Ceze

Seattle, Washington, United States
6K followers · 500+ connections

About

I am a computer architect, startup co-founder, and CEO. I do research in the…

Experience & Education

  • OctoAI

Publications

  • TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

    OSDI

    There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -- such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) -- requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies.

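    To give a flavor of what TVM automates, here is a minimal sketch using its tensor-expression Python API (illustrative only; exact API entry points such as te.create_schedule vary across TVM releases): declare a computation once, then compile and run it for a chosen back-end.

    import numpy as np
    import tvm
    from tvm import te

    # Declare a vector-add computation symbolically. The same declaration
    # can be re-scheduled and compiled for CPUs, GPUs, or accelerators.
    n = 1024
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    C = te.compute((n,), lambda i: A[i] + B[i], name="C")

    s = te.create_schedule(C.op)          # default schedule; tunable per target
    fadd = tvm.build(s, [A, B, C], target="llvm", name="vector_add")

    dev = tvm.cpu(0)
    a = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
    b = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
    c = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
    fadd(a, b, c)                         # run the compiled kernel
    np.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy())
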
  • General-Purpose Code Acceleration with Limited-Precision Analog Computation

    International Symposium on Computer Architecture (ISCA)

    As improvements in per-transistor speed and energy efficiency diminish, radical departures from conventional approaches are becoming critical to improving the performance and energy efficiency of general-purpose processors. We propose a solution—from circuit to compiler—that enables general-purpose use of limited-precision, analog hardware to accelerate “approximable” code—code that can tolerate imprecise execution. We utilize an algorithmic transformation that automatically converts approximable regions of code from a von Neumann model to an “analog” neural model. We outline the challenges of taking an analog approach, including restricted-range value encoding, limited precision in computation, circuit inaccuracies, noise, and constraints on supported topologies. We address these limitations with a combination of circuit techniques, a novel hardware/software interface, neural-network training techniques, and compiler support. Analog neural acceleration provides whole-application speedup of 3.3× and energy savings of 12.1× with quality loss less than 10% for all except one benchmark. These results show that using limited-precision analog circuits for code acceleration, through a neural approach, is both feasible and beneficial over a range of approximation-tolerant, emerging applications including financial analysis, signal processing, robotics, 3D gaming, compression, and image processing.

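    As an illustration of the "limited precision" constraint (this is not the paper's circuit model), the sketch below quantizes a small neural layer's weights and inputs to a restricted range and a handful of levels, then compares against the full-precision result:

    import numpy as np

    def quantize(x, levels=16, lo=-1.0, hi=1.0):
        # Clip to a restricted range and snap to a small number of levels,
        # mimicking limited-precision analog value encoding (illustrative only).
        x = np.clip(x, lo, hi)
        step = (hi - lo) / (levels - 1)
        return np.round((x - lo) / step) * step + lo

    rng = np.random.default_rng(0)
    W = rng.uniform(-1, 1, (8, 4))
    x = rng.uniform(-1, 1, 4)

    exact = np.tanh(W @ x)                        # full-precision neuron outputs
    approx = np.tanh(quantize(W) @ quantize(x))   # limited-precision model
    print("max error:", np.max(np.abs(exact - approx)))
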
  • A Limit Study of JavaScript Parallelism

    IISWC 2010

    JavaScript is ubiquitous on the web. At the same time, the language’s dynamic behavior makes optimizations challenging, leading to poor performance. In this paper we conduct a limit study on the potential parallelism of JavaScript applications, including popular web pages and standard JavaScript benchmarks. We examine dependency types and looping behavior to better understand the potential for JavaScript parallelization. Our results show that the potential speedup is very encouraging, averaging 8.9x and as high as 45.5x. Parallelizing functions themselves, rather than just loop bodies, proves more fruitful in increasing JavaScript execution speed. The results also indicate that, in our JavaScript engine, most of the dependencies manifest via virtual registers rather than hash table lookups.

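    The core arithmetic of a limit study like this one is a critical-path computation: ideal speedup is total work divided by the length of the longest dependence chain. A sketch on hypothetical trace data:

    # Ideal-parallelism limit study on a dependence-annotated trace
    # (hypothetical data; each op lists the ops it must wait for).
    trace = {
        "a": [],
        "b": ["a"],
        "c": ["a"],
        "d": ["b", "c"],
        "e": [],
    }

    def limit_speedup(trace, cost=1):
        finish = {}
        def finish_time(op):
            if op not in finish:
                deps = trace[op]
                finish[op] = cost + (max(finish_time(d) for d in deps) if deps else 0)
            return finish[op]
        total_work = cost * len(trace)            # serial execution time
        critical_path = max(finish_time(op) for op in trace)
        return total_work / critical_path

    print(limit_speedup(trace))   # 5 ops / 3-op critical path ≈ 1.67x
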
  • CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution

    ASPLOS XV

    The behavior of a multithreaded program does not depend only on its inputs. Scheduling, memory reordering, timing, and low-level hardware effects all introduce nondeterminism in the execution of multithreaded programs. This severely complicates many tasks, including debugging, testing, and automatic replication. In this work, we avoid these complications by eliminating their root cause: we develop a compiler and runtime system that runs arbitrary multithreaded C/C++ POSIX Threads programs deterministically.

    A trivial non-performant approach to providing determinism is simply deterministically serializing execution. Instead, we present a compiler and runtime infrastructure that ensures determinism but resorts to serialization rarely, for handling interthread communication and synchronization. We develop two basic approaches, both of which are largely dynamic with performance improved by some static compiler optimizations. First, an ownership-based approach detects interthread communication via an evolving table that tracks ownership of memory regions by threads. Second, a buffering approach uses versioned memory and employs a deterministic commit protocol to make changes visible to other threads. While buffering has larger single-threaded overhead than ownership, it tends to scale better (serializing less often). A hybrid system sometimes performs and scales better than either approach individually.

    Our implementation is based on the LLVM compiler infrastructure. It needs neither programmer annotations nor special hardware. Our empirical evaluation uses the PARSEC and SPLASH2 benchmarks and shows that our approach scales comparably to nondeterministic execution.

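    The following toy model illustrates the buffering approach (hypothetical sketch; the real system instruments LLVM IR and ends execution quanta deterministically by instruction counts): each thread reads a shared snapshot and buffers its stores, and buffers commit in fixed thread-ID order, so the final state is independent of scheduling.

    # Toy model of versioned-memory buffering with a deterministic commit order.
    def run_quantum(shared, thread_ops):
        # thread_ops[i] is the list of (op, addr, value) for thread i this quantum.
        snapshot = dict(shared)          # all threads read the same snapshot
        buffers = []
        for ops in thread_ops:           # conceptually parallel; order irrelevant
            buf = {}
            for op, addr, val in ops:
                if op == "store":
                    buf[addr] = val      # stores go to the thread-private buffer
                elif op == "load":
                    val = buf.get(addr, snapshot.get(addr))  # own writes first
            buffers.append(buf)
        for buf in buffers:              # commit in thread-ID order -> deterministic
            shared.update(buf)
        return shared

    mem = {"x": 0}
    ops = [[("store", "x", 1)], [("store", "x", 2)]]
    print(run_quantum(mem, ops))         # always {'x': 2}, regardless of timing
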
  • Bulk Disambiguation of Speculative Threads in Multiprocessors

    Proceedings of the 33rd Annual International Symposium on Computer Architecture

  • Evaluation of a multithreaded architecture for cellular computing

    Proceedings of the Eighth International Symposium on High-Performance Computer Architecture (HPCA)

Patents

  • Method and apparatus to trigger synchronization and validation actions upon memory access

    Issued US Patent 8,327,084

    A system and method to trigger synchronization and validation actions at memory access, in one aspect, identifies a storage class associated with a region of shared memory being accessed by a thread, determines whether the thread holds the storage class and acquires the storage class if the thread does not hold the storage class, identifies a programmable action associated with the storage class and the thread, and triggers the programmable action. One or more storage classes are respectively associated with one or more regions of shared memory. An array of storage classes associated with a thread holds one or more storage classes acquired by the thread. A configurable action table associated with a thread indicates one or more programmable actions associated with a storage class.

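    A schematic software model of the mechanism the abstract describes (all names and data structures here are hypothetical; the patent covers a hardware/runtime implementation): map memory regions to storage classes, track which classes a thread holds, and consult a per-thread action table on each access.

    # Schematic model of storage-class-triggered actions on memory access.
    region_class = {("heap", 0x1000): "shared_rw"}   # region -> storage class

    class Thread:
        def __init__(self, name, action_table):
            self.name = name
            self.held = set()                # storage classes acquired so far
            self.action_table = action_table # storage class -> programmable action

    def on_access(thread, region):
        cls = region_class[region]
        if cls not in thread.held:           # acquire the class if not held
            thread.held.add(cls)
        action = thread.action_table.get(cls)  # look up the programmable action
        if action:
            action(thread, region)           # trigger it

    t = Thread("t0", {"shared_rw": lambda th, r: print(f"{th.name}: validate {r}")})
    on_access(t, ("heap", 0x1000))           # -> t0: validate ('heap', 4096)
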
  • Enhanced reliability using deterministic multiprocessing-based synchronized replication

    Filed US 8,453,120 (published as US 2011/0283262 A1)

Courses

  • The Hardware/Software Interface

    UW CSE351, Coursera

Projects

  • DNA Data Storage

    - Present

    Using DNA for digital data storage.

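    For flavor, a textbook-style sketch of the naive encoding step, two bits per nucleotide (real systems instead use constrained codes that avoid homopolymer runs and add error correction):

    # Naive 2-bits-per-base encoding (illustrative only).
    B2N = {"00": "A", "01": "C", "10": "G", "11": "T"}
    N2B = {v: k for k, v in B2N.items()}

    def encode(data: bytes) -> str:
        bits = "".join(f"{byte:08b}" for byte in data)
        return "".join(B2N[bits[i:i+2]] for i in range(0, len(bits), 2))

    def decode(strand: str) -> bytes:
        bits = "".join(N2B[base] for base in strand)
        return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

    strand = encode(b"hi")
    print(strand, decode(strand))    # CGGACGGC b'hi'
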
  • Approximate Computing

    - Present

    The key idea in approximate computing is to trade off accuracy in computation, storage, and communication for better performance and energy efficiency. It enables effective use of more aggressive transistor technology, analog computing techniques in a more general way, and new optimizations or code transformations (e.g., using fundamentally approximate models of execution like neural networks).

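    One classic transformation in this space, loop perforation (due to other groups, but a good illustration of the accuracy-for-efficiency trade), simply skips iterations of an error-tolerant loop:

    # Loop perforation: trade accuracy for time by sampling iterations.
    def mean_exact(xs):
        return sum(xs) / len(xs)

    def mean_perforated(xs, skip=4):
        sampled = xs[::skip]          # execute only every `skip`-th iteration
        return sum(sampled) / len(sampled)

    xs = [float(i % 97) for i in range(1_000_000)]
    print(mean_exact(xs), mean_perforated(xs))   # ~4x less work, small error
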
  • Neural Processing Units

    This project takes a learning-based approach to the acceleration of approximate programs. We introduce a program transformation, called the “Parrot transformation,” that selects and trains a neural network to mimic a region of imperative code. After the learning transformation phase, the compiler replaces the original code with an invocation of a low-power accelerator called a “neural processing unit” (NPU). The NPU is tightly coupled to the processor’s speculative pipeline, since many of the accelerated code regions are small. Since neural networks produce inherently approximate results, we define a programming model that allows programmers to identify approximable code regions: code that can produce imprecise but acceptable results. Mimicking approximable code regions with an NPU is both faster and more energy efficient than executing the original code. For a set of diverse applications, NPU acceleration provides an average whole-application speedup of 2.3x and energy savings of 3.0x with quality loss at most 9.6%.

    This is a joint project of the University of Washington, Microsoft Research, and The University of Texas at Austin.

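    A software-only sketch of the core idea (hypothetical; the actual work uses compiler-selected code regions and a hardware NPU): sample an approximable function's input/output behavior, train a tiny MLP to mimic it, then invoke the network instead of the original code.

    import numpy as np

    def approximable_region(x):       # stand-in for a region of imperative code
        return np.sin(x) * x

    # Sample the region's input/output behavior ("Parrot" training data).
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (2000, 1))
    Y = approximable_region(X)

    # Train a tiny 1-8-1 MLP by full-batch gradient descent to mimic it.
    W1 = rng.normal(0, 0.5, (1, 8)); b1 = np.zeros(8)
    W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)
    for _ in range(20000):
        H = np.tanh(X @ W1 + b1)                  # forward pass
        P = H @ W2 + b2
        E = (P - Y) / len(X)                      # mean-squared-error gradient
        gW2, gb2 = H.T @ E, E.sum(0)              # backward pass
        dH = (E @ W2.T) * (1 - H ** 2)
        gW1, gb1 = X.T @ dH, dH.sum(0)
        W1 -= 0.05 * gW1; b1 -= 0.05 * gb1
        W2 -= 0.05 * gW2; b2 -= 0.05 * gb2

    x = np.array([[1.5]])
    mimic = np.tanh(x @ W1 + b1) @ W2 + b2        # the "NPU invocation"
    print(approximable_region(x).item(), mimic.item())  # exact vs. rough mimic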

Honors & Awards

  • UIUC Distinguished Alumni Award

    University of Illinois

  • IEEE TCCA Young Computer Architect Award

    IEEE

  • Sloan Foundation Fellow

    Sloan Foundation

  • NSF CAREER Award

    NSF

Organizations

  • ACM, IEEE, USENIX

