“I had the pleasure of working with Gennady on the same project in the IBM Rational compiler team. Gennady always has a passion for making smart and elegant changes to improve the product. He also possesses a keen sense for spotting problems. Whenever I discuss a problem with him, he is able to point out the specific areas to look for a bug. Most importantly, he has the passion to explore new concepts and to explain them to his teammates. He never gets tired of expressing his ideas and giving advice to others. Working with a passionate programmer like Gennady is always a joyful and exciting experience for me.”
About
Activity
-
Looking forward to my PhD student Amey Agrawal's #OSDI24 talk on how to tame tail latency in Large Language Model (LLM) inference efficiently, while…
Liked by Gennady Pekhimenko
-
My dad was a professor of Philosophy and his teachings continue through me and my students ... O bird of dawn! As this night…
Liked by Gennady Pekhimenko
-
Your Voice Matters in Shaping the Future of LLMs At CentML, we're on a mission to revolutionize AI optimization, and your insights are crucial. If…
Liked by Gennady Pekhimenko
Experience & Education
Publications
-
A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Efficient Data Compression
ACM
Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive.
This paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency.
CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps.
We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.
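To make the bandwidth-saving idea concrete, here is a minimal Python sketch (not the paper's implementation) of the kind of base+delta compression an assist warp could apply to a cache line before it travels over the memory bus; the 32-byte line and 4-byte word granularity are illustrative assumptions.

```python
def compress_line(words):
    """Sketch of base+delta compression for one cache line.

    words: list of 4-byte unsigned integer values (illustrative granularity).
    Returns (base, deltas, delta_bytes) if every delta fits in one signed
    byte, else None to signal the line must be stored uncompressed.
    """
    base = words[0]
    deltas = [w - base for w in words]
    if all(-128 <= d <= 127 for d in deltas):
        return base, deltas, 1
    return None  # fall back to the uncompressed line

line = [0x1000, 0x1004, 0x1008, 0x1010, 0x1014, 0x1018, 0x1020, 0x1024]
result = compress_line(line)
if result:
    base, deltas, delta_bytes = result
    compressed_size = 4 + len(deltas) * delta_bytes  # base word + one byte per delta
    print(f"{len(line) * 4} bytes -> {compressed_size} bytes")  # 32 -> 12
```

When many lines compress this way, proportionally less data crosses the off-chip bus, which is exactly the bottleneck the assist warps target.
-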
Page Overlays: An Enhanced Virtual Memory Framework to Enable Fine-grained Memory Management
ACM
Many recent works propose mechanisms demonstrating the potential advantages of managing memory at a fine (e.g., cache line) granularity---e.g., fine-grained deduplication and fine-grained memory protection. Unfortunately, existing virtual memory systems track memory at a larger granularity (e.g., 4 KB pages), inhibiting efficient implementation of such techniques. Simply reducing the page size results in an unacceptable increase in page table overhead and TLB pressure.
We propose a new virtual memory framework that enables efficient implementation of a variety of fine-grained memory management techniques. In our framework, each virtual page can be mapped to a structure called a page overlay, in addition to a regular physical page. An overlay contains a subset of cache lines from the virtual page. Cache lines that are present in the overlay are accessed from there and all other cache lines are accessed from the regular physical page. Our page-overlay framework enables cache-line-granularity memory management without significantly altering the existing virtual memory framework or introducing high overheads.
We show that our framework can enable simple and efficient implementations of seven memory management techniques, each of which has a wide variety of applications. We quantitatively evaluate the potential benefits of two of these techniques: overlay-on-write and sparse-data-structure computation. Our evaluations show that overlay-on-write, when applied to fork, can improve performance by 15% and reduce memory capacity requirements by 53% on average compared to traditional copy-on-write. For sparse data computation, our framework can outperform a state-of-the-art software-based sparse representation on a number of real-world sparse matrices. Our framework is general, powerful, and effective in enabling fine-grained memory management at low cost.
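As a rough illustration of the overlay idea (a software sketch under assumed names, not the paper's hardware design), the lookup below consults a per-page bitmap of overlay-resident cache lines and redirects each access accordingly; the 64-line page layout is an assumption for the example.

```python
LINES_PER_PAGE = 64  # e.g., 4 KB page / 64 B cache lines (illustrative)

class OverlayPage:
    """Toy model of a virtual page mapped to both a regular physical
    page and an overlay holding a subset of its cache lines."""
    def __init__(self, physical_page):
        self.physical_page = physical_page       # full backing data
        self.overlay = {}                        # line index -> data
        self.bitmap = [False] * LINES_PER_PAGE   # which lines are overlaid

    def read_line(self, line_idx):
        # Lines present in the overlay are served from there;
        # all other lines come from the regular physical page.
        if self.bitmap[line_idx]:
            return self.overlay[line_idx]
        return self.physical_page[line_idx]

    def overlay_on_write(self, line_idx, data):
        # Instead of copying the whole page (copy-on-write), record
        # only the modified cache line in the overlay.
        self.overlay[line_idx] = data
        self.bitmap[line_idx] = True
```

The overlay-on-write path is what yields the fork() savings quoted above: only dirtied lines are duplicated, not entire 4 KB pages.
-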
PocketTrend: Timely Identification and Delivery of Trending Search Content to Mobile Users
ACM
Trending search topics cause unpredictable query load spikes that hurt the end-user search experience, particularly the mobile one, by introducing longer delays. To understand how trending search topics are formed and evolve over time, we analyze 21 million queries submitted during periods where popular events caused search query volume spikes. Based on our findings, we design and evaluate PocketTrend, a system that automatically detects trending topics in real time, identifies the search content associated with the topics, and then intelligently pushes this content to users in a timely manner. In that way, PocketTrend enables a client-side search engine that can instantly answer user queries related to trending events, while at the same time reducing the impact of these trends on the datacenter workload. Our results, using real mobile search logs, show that in the presence of a trending event, up to 13-17% of the overall search traffic can be eliminated from the datacenter, with as many as 19% of all users benefiting from PocketTrend.
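The trend-detection step can be pictured with a simple sliding-window spike detector. This is only an illustrative sketch: the window length and spike threshold are invented parameters, not values from the paper.

```python
from collections import deque

def detect_trending(counts_per_minute, window=30, spike_factor=3.0):
    """Yield the minutes at which a topic's query volume spikes.

    counts_per_minute: iterable of per-minute query counts for one topic.
    A minute is flagged when its count exceeds spike_factor times the
    trailing-window average (an illustrative rule, not the paper's).
    """
    history = deque(maxlen=window)
    for minute, count in enumerate(counts_per_minute):
        if len(history) == window:
            baseline = sum(history) / window
            if baseline > 0 and count > spike_factor * baseline:
                yield minute, count
        history.append(count)

# Flat traffic with a sudden event-driven spike at the end:
traffic = [10] * 40 + [95]
print(list(detect_trending(traffic)))  # [(40, 95)]
```

Once a topic is flagged, the associated content can be pushed to clients ahead of the query wave, which is where the datacenter traffic savings come from.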
-
Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case
21st International Symposium on High-Performance Computer Architecture (HPCA)
In current systems, memory accesses to a DRAM chip must obey a set of minimum latency restrictions specified in the DRAM standard. Such timing parameters exist to guarantee reliable operation. When deciding the timing parameters, DRAM manufacturers incorporate a very large margin as a provision against two worst-case scenarios. First, due to process variation, some outlier chips are much slower than others and cannot be operated as fast. Second, chips become slower at higher temperatures, and all chips need to operate reliably at the highest supported (i.e., worst-case) DRAM temperature (85°C). In this paper, we show that typical DRAM chips operating at typical temperatures (e.g., 55°C) are capable of providing a much smaller access latency, but are nevertheless forced to operate at the largest latency of the worst-case.
Our goal in this paper is to exploit the extra margin that is built into the DRAM timing parameters to improve performance. Using an FPGA-based testing platform, we first characterize the extra margin for 115 DRAM modules from three major manufacturers. Our results demonstrate that it is possible to reduce four of the most critical timing parameters by a minimum/maximum of 17.3%/54.8% at 55°C without sacrificing correctness. Based on this characterization, we propose Adaptive-Latency DRAM (AL-DRAM), a mechanism that adaptively reduces the timing parameters for DRAM modules based on the current operating condition. AL-DRAM does not require any changes to the DRAM chip or its interface.
We evaluate AL-DRAM on a real system that allows us to reconfigure the timing parameters at runtime. We show that AL-DRAM improves the performance of memory-intensive workloads by an average of 14% without introducing any errors. We discuss and show why AL-DRAM does not compromise reliability. We conclude that dynamically optimizing the DRAM timing parameters can reliably improve system performance.
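A controller-side sketch of the adaptive idea (assumed structure and made-up timing values, not the measured profiles from the paper): select a reduced timing profile when the module's current temperature is comfortably below the worst case, and fall back to the standard values otherwise.

```python
# Illustrative timing profiles in DRAM clock cycles. These numbers are
# invented for the example; real reduced profiles would come from
# per-module characterization as described in the paper.
WORST_CASE = {"tRCD": 14, "tRAS": 34, "tWR": 15, "tRP": 14}
REDUCED_55C = {"tRCD": 10, "tRAS": 24, "tWR": 11, "tRP": 11}

def select_timings(temperature_c, characterized_ok=True):
    """Return the timing parameters to program into the memory controller.

    Uses the standard worst-case values whenever the module runs hot
    or has not been characterized at reduced latency.
    """
    if characterized_ok and temperature_c <= 55:
        return REDUCED_55C
    return WORST_CASE

print(select_timings(48))   # reduced profile at a typical temperature
print(select_timings(80))   # worst-case profile when nearing 85 degrees C
```

Because only the controller's programmed timings change, this style of adaptation needs no modification to the DRAM chip or its interface, matching the claim in the abstract.
-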
Exploiting Compressed Block Size as an Indicator of Future Reuse
21st International Symposium on High-Performance Computer Architecture (HPCA)
We introduce a set of new Compression-Aware Management Policies (CAMP) for on-chip caches that employ data compression. Our management policies are based on two key ideas. First, we show that it is possible to build a more efficient management policy for compressed caches if the compressed block size is directly used in calculating the value (importance) of a block to the cache. This leads to Minimal-Value Eviction (MVE), a policy that evicts the cache blocks with the least value, based on both the size and the expected future reuse. Second, we show that, in some cases, compressed block size can be used as an efficient indicator of the future reuse of a cache block. We use this idea to build a new insertion policy called Size-based Insertion Policy (SIP) that dynamically prioritizes cache blocks using their compressed size as an indicator.
We compare CAMP (and its global variant G-CAMP) to prior on-chip cache management policies (both size-oblivious and size-aware) and find that our mechanisms are more effective in using compressed block size as an extra dimension in cache management decisions. Our results show that the proposed management policies (i) decrease off-chip bandwidth consumption (by 8.7% in single-core), (ii) decrease memory subsystem energy consumption (by 7.2% in single-core) for memory-intensive workloads compared to the best prior mechanism, and (iii) improve performance (by 4.9%/9.0%/10.2% on average in single-/two-/four-core workload evaluations and up to 20.1%). CAMP is effective for a variety of compression algorithms and different cache designs with local and global replacement strategies.
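A toy version of the Minimal-Value Eviction idea (a sketch under assumed field names; the real policy uses hardware reuse predictors and compressed-cache metadata): score each block by expected reuse divided by compressed size, and evict the block with the minimum score.

```python
def mve_victim(blocks):
    """Pick an eviction victim under a Minimal-Value-Eviction-style rule.

    blocks: list of dicts with 'expected_reuse' (a reuse-probability
    estimate) and 'compressed_size' (bytes). A block's value grows with
    its likely reuse and shrinks with the space it occupies, so small,
    hot blocks are retained preferentially.
    """
    def value(block):
        return block["expected_reuse"] / block["compressed_size"]
    return min(blocks, key=value)

cache_set = [
    {"tag": 0xA, "expected_reuse": 0.9, "compressed_size": 64},
    {"tag": 0xB, "expected_reuse": 0.2, "compressed_size": 16},
    {"tag": 0xC, "expected_reuse": 0.6, "compressed_size": 8},
]
print(hex(mve_victim(cache_set)["tag"]))  # 0xb: low reuse, middling size
```

SIP then complements this at insertion time by using the compressed size itself as a cheap hint of likely reuse.
-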
Shifted Hamming Distance: A Fast and Accurate SIMD-Friendly Filter to Accelerate Alignment Verification in Read Mapping
Oxford Bioinformatics
Motivation: Calculating the edit-distance (i.e., minimum number of insertions, deletions, and substitutions) between short DNA sequences is the primary task performed by seed-and-extend based mappers, which compare billions of sequences.
In practice, only sequence pairs with a small edit-distance provide useful scientific data. However, the majority of sequence pairs analyzed by seed-and-extend based mappers differ by significantly more errors than what is typically allowed. Such error-abundant sequence pairs needlessly waste resources and severely hinder the performance of read mappers. Therefore, it is crucial to develop a fast and accurate filter that can rapidly and efficiently detect error-abundant string pairs and remove them from consideration before more computationally expensive methods are used.
Results: We present a simple and efficient algorithm, Shifted Hamming Distance (SHD), which accelerates the alignment verification procedure in read mapping by quickly filtering out error-abundant sequence pairs using bit-parallel and SIMD-parallel operations. SHD only filters string pairs that contain more errors than a user-defined threshold, making it fully comprehensive. It also maintains high accuracy with a moderate error threshold (up to 5% of the string length) while achieving a 3-fold speedup over the best previous algorithm (Gene Myers's bit-vector algorithm). SHD is compatible with all mappers that perform sequence alignment for verification.
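The filtering idea can be sketched in a few lines of Python. This is a simplified, non-bit-parallel rendering of the concept, not the SIMD implementation: a position counts as an unavoidable mismatch only if it disagrees under every shift of the read within the error budget.

```python
def shd_passes(read, ref, max_errors):
    """Simplified Shifted-Hamming-Distance-style filter.

    Compares the read against the reference under every shift in
    [-max_errors, +max_errors]; a position that mismatches under all
    shifts cannot be explained by up to max_errors indels, so it is
    counted as an unavoidable error. The pair is kept for full
    alignment only if those errors stay within the budget.
    (The real SHD computes this with bit-vectors and SIMD operations.)
    """
    shifts = range(-max_errors, max_errors + 1)
    match_masks = [
        [0 <= i + s < len(ref) and read[i] == ref[i + s]
         for i in range(len(read))]
        for s in shifts
    ]
    unavoidable = sum(
        not any(mask[i] for mask in match_masks)
        for i in range(len(read))
    )
    return unavoidable <= max_errors

print(shd_passes("ACGTTGCA", "ACGATGCA", 1))  # True: one substitution
print(shd_passes("ACGTTGCA", "TTTTTTTT", 1))  # False: error-abundant pair
```

Pairs rejected here never reach the expensive edit-distance computation, which is where the reported 3-fold verification speedup comes from.
-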
Rollback-Free Value Prediction with Approximate Loads
The 23rd International Conference on Parallel Architecture and Compiler Techniques (PACT'14)
This paper demonstrates how to utilize the inherent error resilience of a wide range of applications to mitigate the memory wall—the discrepancy between core and memory speed. We define a new microarchitecturally-triggered approximation technique called rollback-free value prediction. This technique predicts the value of safe-to-approximate loads when they miss in the cache without tracking mispredictions or requiring costly recovery from misspeculations. This technique mitigates the memory wall by allowing the core to continue computation without stalling for long-latency memory accesses. Our detailed study of the quality trade-offs shows that with a modern out-of-order processor, an average 8% (up to 19%) performance improvement is possible with 0.8% (up to 1.8%) average quality loss on an approximable subset of SPEC CPU 2000/2006.
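A highly simplified software analogy of the mechanism (the real technique is microarchitectural; the last-value predictor and all names here are assumptions for illustration): on a cache miss to a load marked safe-to-approximate, return a predicted value immediately and never verify or roll back.

```python
class RollbackFreeLoadPredictor:
    """Toy model: last-value prediction for safe-to-approximate loads.

    On a cache hit the true value is returned and remembered; on a miss
    the last value observed for that address is returned immediately,
    with no misprediction tracking and no rollback (hence approximate).
    """
    def __init__(self):
        self.last_value = {}

    def load(self, address, cache):
        if address in cache:                      # hit: exact value
            self.last_value[address] = cache[address]
            return cache[address]
        # Miss: predict instead of stalling. Real hardware would keep
        # executing past the long-latency memory access here.
        return self.last_value.get(address, 0)

predictor = RollbackFreeLoadPredictor()
cache = {0x10: 3.14}
print(predictor.load(0x10, cache))  # 3.14 (hit, exact)
print(predictor.load(0x20, cache))  # 0 (miss, predicted default)
```

Skipping recovery is what keeps the design cheap; the cost is bounded output-quality loss, which the paper quantifies at 0.8% on average.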
-
Linearly Compressed Pages: A Low-Complexity, Low-Latency Main Memory Compression Framework
MICRO 2013
-
Software Automatic Tuning: From Concepts to State-of-the-Art Results
Springer
Chapter 19: "Efficient Program Compilation Through Machine Learning Techniques", by Gennady Pekhimenko and Angela Demke Brown
Patents
-
Managing speculative assist threads
Issued US 12/905,202
An illustrative embodiment provides a computer-implemented process for managing speculative assist threads for data pre-fetching that analyzes collected source code and cache profiling information to identify a code region containing a delinquent load instruction and generates an assist thread, including a value for a local version number, at a program entry point within the identified code region. Upon activation of the assist thread, the local version number of the assist thread is compared to the global unique version number of the main thread for the identified code region, and an iteration distance of the assist thread relative to the main thread is compared to a predefined value. The assist thread is executed when the local version number of the assist thread matches the global unique version number of the main thread, and the iteration distance of the assist thread relative to the main thread is within a predefined range of values.
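The synchronization rule in the claim can be paraphrased in a short sketch (hypothetical names and distance bounds; not the patented implementation): the assist thread is allowed to prefetch only while its version number and its iteration distance agree with the main thread.

```python
def should_run_assist(assist_version, global_version,
                      assist_iter, main_iter,
                      min_dist=4, max_dist=64):
    """Gate an assist (prefetch) thread per the described conditions.

    The assist thread runs only if (1) its local version number matches
    the main thread's global version number for the code region, and
    (2) it stays within a bounded iteration distance ahead of the main
    thread. The distance bounds are illustrative values.
    """
    if assist_version != global_version:
        return False          # stale assist thread: do not run
    distance = assist_iter - main_iter
    return min_dist <= distance <= max_dist

print(should_run_assist(7, 7, assist_iter=40, main_iter=32))  # True
print(should_run_assist(6, 7, assist_iter=40, main_iter=32))  # False (stale)
```

The version check prevents a leftover assist thread from prefetching for a region the main thread has already left, while the distance window keeps prefetches timely rather than premature or late.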
Courses
-
Graduate Algorithms
15-750
-
Graduate Computer Architecture
15-740
-
Graduate Computer Networks
15-744
-
Graduate Machine Learning
15-781
-
Optimizing Compilers for Modern Architecture
15-745
-
Parallel Computer Architecture
18-742
-
Program Analysis
15-819
-
Semantics of Programming Languages
15-812
Honors & Awards
-
NVIDIA Graduate Fellowship, 2015-2016
NVIDIA
-
Qualcomm Innovation Fellowship (QInF'13, Honorable Mention)
Qualcomm
-
Microsoft Research Fellowship, 2013-2015
Microsoft Research
-
Alexander Graham Bell Canada Graduate Scholarship, 2012-2014
NSERC (Canada)
Languages
-
English
Full professional proficiency
-
Russian
Native or bilingual proficiency
-
Ukrainian
Native or bilingual proficiency
-
German
Elementary proficiency
Organizations
-
ACM
-
- Present
Recommendations received
-
LinkedIn User
8 people have recommended Gennady
More activity by Gennady
-
Congratulations to Rahul Bera, Adithya Ranganathan and co-authors on receiving the Best Paper Award at #ISCACONF2024 for their work “Constable:…
Liked by Gennady Pekhimenko
-
The best part of what we do is working with people we love :)
Liked by Gennady Pekhimenko
-
📢 📢 Nvidia's latest GPUs feature CUDA MPS, an often overlooked but powerful tool that allows spatial partitioning of a GPU's SMs while sharing memory.…
Liked by Gennady Pekhimenko
-
Happy Canada Day, everyone! 🇨🇦 This Canada Day, we are thrilled to welcome Ruslan Salakhutdinov as our newest investor and research advisor at…
Liked by Gennady Pekhimenko
-
I am thrilled to announce that Juliana Salazar will be joining my team as the Director of Executive Operations. Juliana brings a wealth of experience…
Liked by Gennady Pekhimenko
-
I am honored to be selected as a 2024 MLSys Rising Star ⭐! I look forward to meeting everyone at the workshop next month in Santa…
Liked by Gennady Pekhimenko
-
For as long as I can remember, I’ve been drawn to the cutting edge of technology, constantly seeking roles that allow me to be part of groundbreaking…
Liked by Gennady Pekhimenko
-
Many thanks to Lisa Hsu and Suvinay Subramanian for hosting me at the Computer Architecture Podcast with an episode on Sustainability in a Post-AI…
Liked by Gennady Pekhimenko
-
SAPEON Inc. is proud to collaborate with the University of Toronto and SK Telecom on optimizing AI models for NPUs using our advanced computing…
Liked by Gennady Pekhimenko
-
I am deeply honored to receive the 2024 ACM SIGARCH Maurice Wilkes Award for "In Memory Computing" at ISCA in Buenos Aires. Maurice Wilkes Award is…
Liked by Gennady Pekhimenko
-
Compute cost is definitely a big pain point. I got more evidence of that, as I was talking to fellow AI-founders at #CollisionConf today.
Liked by Gennady Pekhimenko