Publications
-
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
OSDI
There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -- such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) -- requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that is competitive with state-of-the-art, hand-tuned libraries for low-power CPUs, mobile GPUs, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as an FPGA-based generic deep learning accelerator. The system is open-sourced and in production use inside several major companies.
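One of the graph-level optimizations the abstract mentions, operator fusion, can be illustrated with a toy sketch (plain Python, not the TVM API): fusing a chain of elementwise operators into one loop body avoids materializing intermediate tensors in memory.

```python
# Toy illustration of elementwise operator fusion (hypothetical example,
# not TVM code): mul -> add -> relu collapsed into a single pass.

def unfused(xs):
    # Each "operator" materializes a full intermediate list.
    t1 = [x * 2 for x in xs]        # mul
    t2 = [t + 1 for t in t1]        # add
    return [max(t, 0) for t in t2]  # relu

def fused(xs):
    # One pass over the data: mul+add+relu fused into one loop body.
    return [max(x * 2 + 1, 0) for x in xs]

assert unfused([-3, 0, 4]) == fused([-3, 0, 4])
```

A real compiler performs this rewrite on an operator graph and then generates the fused loop nest for each hardware back-end; the payoff grows with tensor size, since the intermediates never touch memory.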
-
General-Purpose Code Acceleration with Limited-Precision Analog Computation
International Symposium on Computer Architecture (ISCA)
As improvements in per-transistor speed and energy efficiency diminish, radical departures from conventional approaches are becoming critical to improving the performance and energy efficiency of general-purpose processors. We propose a solution, from circuit to compiler, that enables general-purpose use of limited-precision analog hardware to accelerate “approximable” code: code that can tolerate imprecise execution. We utilize an algorithmic transformation that automatically converts approximable regions of code from a von Neumann model to an “analog” neural model. We outline the challenges of taking an analog approach, including restricted-range value encoding, limited precision in computation, circuit inaccuracies, noise, and constraints on supported topologies. We address these limitations with a combination of circuit techniques, a novel hardware/software interface, neural-network training techniques, and compiler support. Analog neural acceleration provides whole-application speedup of 3.3× and energy savings of 12.1×, with quality loss below 10% for all but one benchmark. These results show that using limited-precision analog circuits for code acceleration, through a neural approach, is both feasible and beneficial for a range of approximation-tolerant, emerging applications, including financial analysis, signal processing, robotics, 3D gaming, compression, and image processing.
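The restricted-range, limited-precision execution described above can be emulated in software to get a feel for the quality loss involved. The sketch below (a hypothetical stand-in, not the paper's circuit model) clamps values into a unit range and snaps them to a fixed number of levels, then measures the worst-case error against exact execution of a small code region.

```python
import math

def precise(x):
    # the original, exact code region
    return math.sin(x) * math.exp(-x * x)

def quantize(v, bits=8, lo=-1.0, hi=1.0):
    # clamp into the restricted "analog" range, then snap to 2^bits - 1 levels
    levels = (1 << bits) - 1
    v = min(max(v, lo), hi)
    step = (hi - lo) / levels
    return lo + round((v - lo) / step) * step

def approx(x, bits=8):
    # emulate limited-precision evaluation: quantize both input and output
    return quantize(precise(quantize(x, bits)), bits)

xs = [i / 100 for i in range(-100, 101)]
err = max(abs(precise(x) - approx(x)) for x in xs)
```

With 8 bits over [-1, 1] the worst-case error stays well under 1% of the range, which mirrors the paper's premise: for approximation-tolerant code, modest precision loss can be acceptable.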
-
A Limit Study of JavaScript Parallelism
IISWC 2010
JavaScript is ubiquitous on the web. At the same time, the language’s dynamic behavior makes optimization challenging, leading to poor performance. In this paper we conduct a limit study on the potential parallelism of JavaScript applications, including popular web pages and standard JavaScript benchmarks. We examine dependency types and looping behavior to better understand the potential for JavaScript parallelization. Our results show that the potential speedup is very encouraging: averaging 8.9x and as high as 45.5x. Parallelizing functions themselves, rather than just loop bodies, proves to be more fruitful in increasing JavaScript execution speed. The results also indicate that, in our JavaScript engine, most dependencies manifest via virtual registers rather than hash-table lookups.
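The core computation in a limit study of this kind can be sketched briefly: given a dependence DAG over dynamic work items, the ideal speedup is total work divided by the critical-path length. The example below is illustrative only (toy costs, not data from the paper).

```python
# Toy limit study: ideal speedup = total work / critical path of the
# dependence DAG (task costs and edges are made up for illustration).

def limit_speedup(cost, deps):
    # cost: {task: cycles}; deps: {task: [tasks it must wait for]}
    finish = {}
    def f(t):
        if t not in finish:
            finish[t] = cost[t] + max((f(d) for d in deps.get(t, [])), default=0)
        return finish[t]
    critical_path = max(f(t) for t in cost)
    return sum(cost.values()) / critical_path

cost = {"a": 4, "b": 3, "c": 3, "d": 2}
deps = {"d": ["b", "c"]}          # b and c can overlap with a
print(limit_speedup(cost, deps))  # 12 units of work / critical path 5 -> 2.4
```

A real study derives the DAG from an instrumented engine's dynamic trace, classifying each dependence (e.g., virtual register vs. hash-table lookup) to see which ones actually constrain parallelism.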
-
CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution
ASPLOS XV
The behavior of a multithreaded program does not depend only on its inputs. Scheduling, memory reordering, timing, and low-level hardware effects all introduce nondeterminism in the execution of multithreaded programs. This severely complicates many tasks, including debugging, testing, and automatic replication. In this work, we avoid these complications by eliminating their root cause: we develop a compiler and runtime system that runs arbitrary multithreaded C/C++ POSIX Threads programs deterministically.
A trivial non-performant approach to providing determinism is simply deterministically serializing execution. Instead, we present a compiler and runtime infrastructure that ensures determinism but resorts to serialization rarely, for handling interthread communication and synchronization. We develop two basic approaches, both of which are largely dynamic with performance improved by some static compiler optimizations. First, an ownership-based approach detects interthread communication via an evolving table that tracks ownership of memory regions by threads. Second, a buffering approach uses versioned memory and employs a deterministic commit protocol to make changes visible to other threads. While buffering has larger single-threaded overhead than ownership, it tends to scale better (serializing less often). A hybrid system sometimes performs and scales better than either approach individually.
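The buffering approach described above can be sketched in a few lines (a hypothetical, heavily simplified model, not CoreDet's actual runtime): each thread runs a round against a snapshot of shared memory while buffering its writes, and buffers are then committed in a fixed thread order, so the final state never depends on scheduling.

```python
# Toy sketch of buffered deterministic execution: versioned reads from a
# round-start snapshot, writes committed in a fixed, deterministic order.

shared = {"x": 0, "y": 0}

def run_round(thread_logs):
    # thread_logs: write buffers listed in deterministic thread order
    for buf in thread_logs:          # commit protocol: fixed order
        shared.update(buf)

# Two "threads" read the round-start snapshot and buffer their writes.
snapshot = dict(shared)
t0_buf = {"x": snapshot["x"] + 1}
t1_buf = {"y": snapshot["x"] * 10}  # sees x's old value, not t0's write
run_round([t0_buf, t1_buf])
```

Because t1 reads the snapshot rather than t0's in-flight write, and commits happen in a fixed order, every execution of this round yields the same memory state regardless of how the threads were actually interleaved.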
Our implementation is based on the LLVM compiler infrastructure. It needs neither programmer annotations nor special hardware. Our empirical evaluation uses the PARSEC and SPLASH2 benchmarks and shows that our approach scales comparably to nondeterministic execution.
-
Bulk Disambiguation of Speculative Threads in Multiprocessors
Proceedings of the 33rd Annual International Symposium on Computer Architecture
-
Evaluation of a multithreaded architecture for cellular computing
Proceedings of the Eighth International Symposium on High-Performance Computer Architecture
Patents
-
Method and apparatus to trigger synchronization and validation actions upon memory access
Issued USPTO 08327084
A system and method to trigger synchronization and validation actions at memory access, in one aspect, identifies a storage class associated with a region of shared memory being accessed by a thread, determines whether the thread holds the storage class and acquires the storage class if the thread does not hold the storage class, identifies a programmable action associated with the storage class and the thread, and triggers the programmable action. One or more storage classes are respectively associated with one or more regions of shared memory. An array of storage classes associated with a thread holds one or more storage classes acquired by the thread. A configurable action table associated with a thread indicates one or more programmable actions associated with a storage class.
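The mechanism in the claim can be sketched as a small state machine (illustrative names only, not language from the patent): regions of shared memory map to storage classes, a thread acquires a class on first access, and a configurable action table fires a programmable action keyed by (storage class, thread).

```python
# Toy sketch of storage-class-triggered actions on memory access
# (hypothetical structures; not the patented implementation).

region_class = {"heap0": "SC1", "heap1": "SC2"}   # memory region -> class
held = {"t0": set()}                              # thread -> classes held
fired = []

action_table = {                                  # (class, thread) -> action
    ("SC1", "t0"): lambda: fired.append("validate-SC1"),
}

def on_access(thread, region):
    sc = region_class[region]
    if sc not in held[thread]:
        held[thread].add(sc)                      # acquire the storage class
        action = action_table.get((sc, thread))
        if action:
            action()                              # trigger programmable action

on_access("t0", "heap0")
```

Note that the action fires only on acquisition: a second access to the same region by the same thread finds the class already held and triggers nothing, which is what makes the mechanism cheap on the common path.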
-
Enhanced reliability using deterministic multiprocessing-based synchronized replication
Filed US 8,453,120 20110283262-A1
Courses
-
The Hardware/Software Interface
UW CSE351, Coursera
Projects
-
Approximate Computing
- Present
The key idea in approximate computing is to trade off accuracy in computation, storage, and communication for better performance and energy efficiency. It enables effective use of more aggressive transistor technology, analog computing techniques in a more general way, and new optimizations or code transformations (e.g., using fundamentally approximate models of execution like neural networks).
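One classic code transformation from this research area, loop perforation, makes the accuracy-for-performance trade-off concrete (a generic illustration, not an artifact of this project): skip a fraction of loop iterations and accept a bounded quality loss.

```python
# Loop perforation sketch: execute only every 'stride'-th iteration,
# trading a small accuracy loss for proportionally less work.

def mean_exact(xs):
    return sum(xs) / len(xs)

def mean_perforated(xs, stride=2):
    sample = xs[::stride]      # skip (stride - 1) of every stride iterations
    return sum(sample) / len(sample)

xs = list(range(1000))
exact = mean_exact(xs)         # 499.5
approx = mean_perforated(xs)   # half the work; mean of the even values
```

Here perforation halves the work while the result stays within about 0.1% of the exact mean; for reductions like this, the error is often small and easy to bound, which is what makes such code "approximable."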
-
Neural Processing Units
-
This project takes a learning-based approach to the acceleration of approximate programs. We introduce a program transformation, called the “Parrot transformation,” that selects and trains a neural network to mimic a region of imperative code. After the learning phase, the compiler replaces the original code with an invocation of a low-power accelerator called a “neural processing unit” (NPU). The NPU is tightly coupled to the processor’s speculative pipeline, since many of the accelerated code regions are small. Since neural networks produce inherently approximate results, we define a programming model that allows programmers to identify approximable code regions: code that can produce imprecise but acceptable results. Mimicking approximable code regions with an NPU is both faster and more energy-efficient than executing the original code. For a set of diverse applications, NPU acceleration provides an average whole-application speedup of 2.3x and energy savings of 3.0x, with quality loss of at most 9.6%.
This is a joint project at the University of Washington, Microsoft Research, and The University of Texas at Austin.
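The Parrot transformation's train-then-replace workflow can be sketched with a much simpler surrogate than a neural network (a hypothetical illustration: linear interpolation over sampled input/output pairs stands in for NPU training): sample the pure code region, fit a cheap mimic, then check the quality loss before swapping it in.

```python
# Toy Parrot-style workflow: sample a pure code region, "train" a cheap
# surrogate (here, a lookup table with linear interpolation), verify quality.

def region(x):                    # the approximable code region
    return 0.5 * x * x + 2.0 * x

# "Training": sample the region on a grid of representative inputs.
grid = [i / 10 for i in range(-20, 21)]
table = {g: region(g) for g in grid}

def surrogate(x):
    # linear interpolation between the two nearest sampled points
    lo = max(g for g in grid if g <= x)
    hi = min(g for g in grid if g >= x)
    if lo == hi:
        return table[lo]
    w = (x - lo) / (hi - lo)
    return (1 - w) * table[lo] + w * table[hi]

# Quality check before the compiler would commit the replacement.
err = max(abs(region(x) - surrogate(x)) for x in [i / 100 for i in range(-190, 191)])
```

The real system trains a small multilayer perceptron rather than a table, and the quality check is done per application with a domain-specific error metric, but the shape of the workflow is the same: observe, mimic, validate, replace.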
Honors & Awards
-
UIUC Distinguished Alumni Award
University of Illinois
-
IEEE TCCA Young Computer Architect Award
IEEE
-
Sloan Foundation Fellow
Sloan Foundation
-
NSF CAREER Award
NSF
Organizations
-
ACM, IEEE, USENIX