Publications
-
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
OSDI
There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -- such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) -- requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that is competitive with state-of-the-art, hand-tuned libraries for low-power CPUs, mobile GPUs, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as an FPGA-based generic deep learning accelerator. The system is open-sourced and in production use inside several major companies.
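One of the graph-level optimizations the abstract mentions, operator fusion, can be illustrated with a toy sketch (plain Python, not the TVM API): fusing a chain of elementwise operators into one loop body avoids materializing intermediate tensors in memory.

```python
# Toy illustration of elementwise operator fusion (hypothetical example,
# not TVM code): mul -> add -> relu collapsed into a single pass.

def unfused(xs):
    # Each "operator" materializes a full intermediate list.
    t1 = [x * 2 for x in xs]        # mul
    t2 = [t + 1 for t in t1]        # add
    return [max(t, 0) for t in t2]  # relu

def fused(xs):
    # One pass over the data: mul+add+relu fused into one loop body.
    return [max(x * 2 + 1, 0) for x in xs]

assert unfused([-3, 0, 4]) == fused([-3, 0, 4])
```

A real compiler performs this rewrite on an operator graph and then generates the fused loop nest for each hardware back-end; the payoff grows with tensor size, since the intermediates never touch memory.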
-
General-Purpose Code Acceleration with Limited-Precision Analog Computation
International Symposium on Computer Architecture (ISCA)
As improvements in per-transistor speed and energy efficiency diminish, radical departures from conventional approaches are becoming critical to improving the performance and energy efficiency of general-purpose processors. We propose a solution, from circuit to compiler, that enables general-purpose use of limited-precision analog hardware to accelerate “approximable” code: code that can tolerate imprecise execution. We utilize an algorithmic transformation that automatically converts approximable regions of code from a von Neumann model to an “analog” neural model. We outline the challenges of taking an analog approach, including restricted-range value encoding, limited precision in computation, circuit inaccuracies, noise, and constraints on supported topologies. We address these limitations with a combination of circuit techniques, a novel hardware/software interface, neural-network training techniques, and compiler support. Analog neural acceleration provides whole-application speedup of 3.3× and energy savings of 12.1×, with quality loss below 10% for all but one benchmark. These results show that using limited-precision analog circuits for code acceleration, through a neural approach, is both feasible and beneficial for a range of approximation-tolerant, emerging applications, including financial analysis, signal processing, robotics, 3D gaming, compression, and image processing.
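The restricted-range, limited-precision execution described above can be emulated in software to get a feel for the quality loss involved. The sketch below (a hypothetical stand-in, not the paper's circuit model) clamps values into a unit range and snaps them to a fixed number of levels, then measures the worst-case error against exact execution of a small code region.

```python
import math

def precise(x):
    # the original, exact code region
    return math.sin(x) * math.exp(-x * x)

def quantize(v, bits=8, lo=-1.0, hi=1.0):
    # clamp into the restricted "analog" range, then snap to 2^bits - 1 levels
    levels = (1 << bits) - 1
    v = min(max(v, lo), hi)
    step = (hi - lo) / levels
    return lo + round((v - lo) / step) * step

def approx(x, bits=8):
    # emulate limited-precision evaluation: quantize both input and output
    return quantize(precise(quantize(x, bits)), bits)

xs = [i / 100 for i in range(-100, 101)]
err = max(abs(precise(x) - approx(x)) for x in xs)
```

With 8 bits over [-1, 1] the worst-case error stays well under 1% of the range, which mirrors the paper's premise: for approximation-tolerant code, modest precision loss can be acceptable.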
-
A Limit Study of JavaScript Parallelism
IISWC 2010
JavaScript is ubiquitous on the web. At the same time, the language’s dynamic behavior makes optimization challenging, leading to poor performance. In this paper we conduct a limit study on the potential parallelism of JavaScript applications, including popular web pages and standard JavaScript benchmarks. We examine dependency types and looping behavior to better understand the potential for JavaScript parallelization. Our results show that the potential speedup is very encouraging: averaging 8.9x and as high as 45.5x. Parallelizing functions themselves, rather than just loop bodies, proves to be more fruitful in increasing JavaScript execution speed. The results also indicate that, in our JavaScript engine, most dependencies manifest via virtual registers rather than hash-table lookups.
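The core computation in a limit study of this kind can be sketched briefly: given a dependence DAG over dynamic work items, the ideal speedup is total work divided by the critical-path length. The example below is illustrative only (toy costs, not data from the paper).

```python
# Toy limit study: ideal speedup = total work / critical path of the
# dependence DAG (task costs and edges are made up for illustration).

def limit_speedup(cost, deps):
    # cost: {task: cycles}; deps: {task: [tasks it must wait for]}
    finish = {}
    def f(t):
        if t not in finish:
            finish[t] = cost[t] + max((f(d) for d in deps.get(t, [])), default=0)
        return finish[t]
    critical_path = max(f(t) for t in cost)
    return sum(cost.values()) / critical_path

cost = {"a": 4, "b": 3, "c": 3, "d": 2}
deps = {"d": ["b", "c"]}          # b and c can overlap with a
print(limit_speedup(cost, deps))  # 12 units of work / critical path 5 -> 2.4
```

A real study derives the DAG from an instrumented engine's dynamic trace, classifying each dependence (e.g., virtual register vs. hash-table lookup) to see which ones actually constrain parallelism.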
-
CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution
ASPLOS XV
The behavior of a multithreaded program does not depend only on its inputs. Scheduling, memory reordering, timing, and low-level hardware effects all introduce nondeterminism in the execution of multithreaded programs. This severely complicates many tasks, including debugging, testing, and automatic replication. In this work, we avoid these complications by eliminating their root cause: we develop a compiler and runtime system that runs arbitrary multithreaded C/C++ POSIX Threads programs deterministically.
A trivial non-performant approach to providing determinism is simply deterministically serializing execution. Instead, we present a compiler and runtime infrastructure that ensures determinism but resorts to serialization rarely, for handling interthread communication and synchronization. We develop two basic approaches, both of which are largely dynamic with performance improved by some static compiler optimizations. First, an ownership-based approach detects interthread communication via an evolving table that tracks ownership of memory regions by threads. Second, a buffering approach uses versioned memory and employs a deterministic commit protocol to make changes visible to other threads. While buffering has larger single-threaded overhead than ownership, it tends to scale better (serializing less often). A hybrid system sometimes performs and scales better than either approach individually.
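The buffering approach described above can be sketched in a few lines (a hypothetical, heavily simplified model, not CoreDet's actual runtime): each thread runs a round against a snapshot of shared memory while buffering its writes, and buffers are then committed in a fixed thread order, so the final state never depends on scheduling.

```python
# Toy sketch of buffered deterministic execution: versioned reads from a
# round-start snapshot, writes committed in a fixed, deterministic order.

shared = {"x": 0, "y": 0}

def run_round(thread_logs):
    # thread_logs: write buffers listed in deterministic thread order
    for buf in thread_logs:          # commit protocol: fixed order
        shared.update(buf)

# Two "threads" read the round-start snapshot and buffer their writes.
snapshot = dict(shared)
t0_buf = {"x": snapshot["x"] + 1}
t1_buf = {"y": snapshot["x"] * 10}  # sees x's old value, not t0's write
run_round([t0_buf, t1_buf])
```

Because t1 reads the snapshot rather than t0's in-flight write, and commits happen in a fixed order, every execution of this round yields the same memory state regardless of how the threads were actually interleaved.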
Our implementation is based on the LLVM compiler infrastructure. It needs neither programmer annotations nor special hardware. Our empirical evaluation uses the PARSEC and SPLASH2 benchmarks and shows that our approach scales comparably to nondeterministic execution.
-
Bulk Disambiguation of Speculative Threads in Multiprocessors
Proceedings of the 33rd Annual International Symposium on Computer Architecture
-
Evaluation of a multithreaded architecture for cellular computing
Proceedings of the Eighth International Symposium on High-Performance Computer Architecture
Patents
-
Method and apparatus to trigger synchronization and validation actions upon memory access
Issued USPTO 08327084
A system and method to trigger synchronization and validation actions at memory access, in one aspect, identifies a storage class associated with a region of shared memory being accessed by a thread, determines whether the thread holds the storage class and acquires the storage class if the thread does not hold the storage class, identifies a programmable action associated with the storage class and the thread, and triggers the programmable action. One or more storage classes are respectively associated with one or more regions of shared memory. An array of storage classes associated with a thread holds one or more storage classes acquired by the thread. A configurable action table associated with a thread indicates one or more programmable actions associated with a storage class.
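The mechanism in the claim can be sketched as a small state machine (illustrative names only, not language from the patent): regions of shared memory map to storage classes, a thread acquires a class on first access, and a configurable action table fires a programmable action keyed by (storage class, thread).

```python
# Toy sketch of storage-class-triggered actions on memory access
# (hypothetical structures; not the patented implementation).

region_class = {"heap0": "SC1", "heap1": "SC2"}   # memory region -> class
held = {"t0": set()}                              # thread -> classes held
fired = []

action_table = {                                  # (class, thread) -> action
    ("SC1", "t0"): lambda: fired.append("validate-SC1"),
}

def on_access(thread, region):
    sc = region_class[region]
    if sc not in held[thread]:
        held[thread].add(sc)                      # acquire the storage class
        action = action_table.get((sc, thread))
        if action:
            action()                              # trigger programmable action

on_access("t0", "heap0")
```

Note that the action fires only on acquisition: a second access to the same region by the same thread finds the class already held and triggers nothing, which is what makes the mechanism cheap on the common path.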
-
Enhanced reliability using deterministic multiprocessing-based synchronized replication
Filed US 8,453,120 20110283262-A1
Courses
-
The Hardware/Software Interface
UW CSE351, Coursera
Projects
-
Approximate Computing
- Present
The key idea in approximate computing is to trade off accuracy in computation, storage, and communication for better performance and energy efficiency. It enables effective use of more aggressive transistor technology, analog computing techniques in a more general way, and new optimizations or code transformations (e.g., using fundamentally approximate models of execution like neural networks).
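One classic code transformation from this research area, loop perforation, makes the accuracy-for-performance trade-off concrete (a generic illustration, not an artifact of this project): skip a fraction of loop iterations and accept a bounded quality loss.

```python
# Loop perforation sketch: execute only every 'stride'-th iteration,
# trading a small accuracy loss for proportionally less work.

def mean_exact(xs):
    return sum(xs) / len(xs)

def mean_perforated(xs, stride=2):
    sample = xs[::stride]      # skip (stride - 1) of every stride iterations
    return sum(sample) / len(sample)

xs = list(range(1000))
exact = mean_exact(xs)         # 499.5
approx = mean_perforated(xs)   # half the work; mean of the even values
```

Here perforation halves the work while the result stays within about 0.1% of the exact mean; for reductions like this, the error is often small and easy to bound, which is what makes such code "approximable."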
-
Neural Processing Units
-
This project takes a learning-based approach to the acceleration of approximate programs. We introduce a program transformation, called the “Parrot transformation,” that selects and trains a neural network to mimic a region of imperative code. After the learning phase, the compiler replaces the original code with an invocation of a low-power accelerator called a “neural processing unit” (NPU). The NPU is tightly coupled to the processor’s speculative pipeline, since many of the accelerated code regions are small. Since neural networks produce inherently approximate results, we define a programming model that allows programmers to identify approximable code regions: code that can produce imprecise but acceptable results. Mimicking approximable code regions with an NPU is both faster and more energy-efficient than executing the original code. For a set of diverse applications, NPU acceleration provides an average whole-application speedup of 2.3x and energy savings of 3.0x, with quality loss of at most 9.6%.
This is a joint project at the University of Washington, Microsoft Research, and The University of Texas at Austin.
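The Parrot transformation's train-then-replace workflow can be sketched with a much simpler surrogate than a neural network (a hypothetical illustration: linear interpolation over sampled input/output pairs stands in for NPU training): sample the pure code region, fit a cheap mimic, then check the quality loss before swapping it in.

```python
# Toy Parrot-style workflow: sample a pure code region, "train" a cheap
# surrogate (here, a lookup table with linear interpolation), verify quality.

def region(x):                    # the approximable code region
    return 0.5 * x * x + 2.0 * x

# "Training": sample the region on a grid of representative inputs.
grid = [i / 10 for i in range(-20, 21)]
table = {g: region(g) for g in grid}

def surrogate(x):
    # linear interpolation between the two nearest sampled points
    lo = max(g for g in grid if g <= x)
    hi = min(g for g in grid if g >= x)
    if lo == hi:
        return table[lo]
    w = (x - lo) / (hi - lo)
    return (1 - w) * table[lo] + w * table[hi]

# Quality check before the compiler would commit the replacement.
err = max(abs(region(x) - surrogate(x)) for x in [i / 100 for i in range(-190, 191)])
```

The real system trains a small multilayer perceptron rather than a table, and the quality check is done per application with a domain-specific error metric, but the shape of the workflow is the same: observe, mimic, validate, replace.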
Honors & Awards
-
UIUC Distinguished Alumni Award
University of Illinois
-
IEEE TCCA Young Computer Architect Award
IEEE
-
Sloan Foundation Fellow
Sloan Foundation
-
NSF CAREER Award
NSF
Organizations
-
ACM, IEEE, USENIX