Gennady Pekhimenko

Toronto, Ontario, Canada

About

I am generally interested in the areas of systems and machine learning. My major research…


Experience & Education

  • CentML

Publications

  • A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Efficient Data Compression

    ACM

    Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive.

    This paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency.

    CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps.

    We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.

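    The core idea, spending otherwise-idle compute to shrink data before it crosses the memory bus, can be illustrated with a small software model. The sketch below packs a cache line of 32-bit integers as a base value plus byte-sized deltas when the values are close together; it is a toy illustration of the compression idea only, not the assist-warp hardware or the specific compression algorithms evaluated in the paper, and the line size is an assumed placeholder.

        # Toy model: shrink a cache line before it crosses the memory bus so that
        # fewer bytes have to move. Illustrative base+delta-style packing of one
        # line of 32-bit integers; not the CABA framework itself.
        import struct

        LINE_WORDS = 32  # assumed 128-byte line of 32-bit values

        def compress_line(words):
            """Return (is_compressed, payload_bytes) for one cache line."""
            base = words[0]
            deltas = [w - base for w in words]
            if all(-128 <= d <= 127 for d in deltas):
                # Base (4 bytes) + one signed byte per word: 36 bytes instead of 128.
                payload = struct.pack("<i", base) + bytes(d & 0xFF for d in deltas)
                return True, payload
            return False, struct.pack(f"<{LINE_WORDS}i", *words)  # incompressible line

        def decompress_line(is_compressed, payload):
            if not is_compressed:
                return list(struct.unpack(f"<{LINE_WORDS}i", payload))
            base = struct.unpack("<i", payload[:4])[0]
            return [base + (b - 256 if b > 127 else b) for b in payload[4:]]

        if __name__ == "__main__":
            line = [1000 + i for i in range(LINE_WORDS)]      # narrow value range
            ok, blob = compress_line(line)
            assert decompress_line(ok, blob) == line
            print(f"compressed={ok}, bytes transferred: {len(blob)} vs {LINE_WORDS * 4}")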
  • Page Overlays: An Enhanced Virtual Memory Framework to Enable Fine-grained Memory Management

    ACM

    Many recent works propose mechanisms demonstrating the potential advantages of managing memory at a fine (e.g., cache line) granularity---e.g., fine-grained deduplication and fine-grained memory protection. Unfortunately, existing virtual memory systems track memory at a larger granularity (e.g., 4 KB pages), inhibiting efficient implementation of such techniques. Simply reducing the page size results in an unacceptable increase in page table overhead and TLB pressure.

    We propose a new virtual memory framework that enables efficient implementation of a variety of fine-grained memory management techniques. In our framework, each virtual page can be mapped to a structure called a page overlay, in addition to a regular physical page. An overlay contains a subset of cache lines from the virtual page. Cache lines that are present in the overlay are accessed from there and all other cache lines are accessed from the regular physical page. Our page-overlay framework enables cache-line-granularity memory management without significantly altering the existing virtual memory framework or introducing high overheads.

    We show that our framework can enable simple and efficient implementations of seven memory management techniques, each of which has a wide variety of applications. We quantitatively evaluate the potential benefits of two of these techniques: overlay-on-write and sparse-data-structure computation. Our evaluations show that overlay-on-write, when applied to fork, can improve performance by 15% and reduce memory capacity requirements by 53% on average compared to traditional copy-on-write. For sparse data computation, our framework can outperform a state-of-the-art software-based sparse representation on a number of real-world sparse matrices. Our framework is general, powerful, and effective in enabling fine-grained memory management at low cost.

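    The lookup rule, check the overlay for a cache line and fall back to the regular physical page otherwise, is easy to model in software. The sketch below is a minimal, purely illustrative model of that rule and of the overlay-on-write use case; the class, names, and sizes are assumptions for illustration, not the paper's hardware design.

        # A virtual page backed by a shared physical page plus an optional
        # "overlay" that holds only the cache lines that differ from it.
        PAGE_SIZE = 4096
        LINE_SIZE = 64
        LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE

        class OverlayPage:
            def __init__(self, physical_page: bytearray):
                self.physical = physical_page      # shared, read-only base copy
                self.overlay = {}                  # line index -> private 64-byte line

            def read_line(self, line_idx: int) -> bytes:
                if line_idx in self.overlay:       # line was privatized
                    return bytes(self.overlay[line_idx])
                off = line_idx * LINE_SIZE
                return bytes(self.physical[off:off + LINE_SIZE])

            def write_line(self, line_idx: int, data: bytes) -> None:
                # Overlay-on-write: privatize only the written line instead of
                # copying the whole 4 KB page as traditional copy-on-write would.
                assert len(data) == LINE_SIZE
                self.overlay[line_idx] = bytearray(data)

        if __name__ == "__main__":
            shared = bytearray(PAGE_SIZE)          # e.g. a page shared after fork
            child_view = OverlayPage(shared)
            child_view.write_line(3, b"\xab" * LINE_SIZE)
            assert child_view.read_line(3) == b"\xab" * LINE_SIZE
            assert child_view.read_line(4) == b"\x00" * LINE_SIZE
            print(f"privatized {len(child_view.overlay)} of {LINES_PER_PAGE} lines")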
  • Toggle-Aware Compression for GPUs

    Computer Architecture Letters (Jan-Jun 2015)

  • PocketTrend: Timely Identification and Delivery of Trending Search Content to Mobile Users

    ACM

    Trending search topics cause unpredictable query load spikes that hurt the end-user search experience, particularly the mobile one, by introducing longer delays. To understand how trending search topics are formed and evolve over time, we analyze 21 million queries submitted during periods where popular events caused search query volume spikes. Based on our findings, we design and evaluate PocketTrend, a system that automatically detects trending topics in real time, identifies the search content associated to the topics, and then intelligently pushes this content to users in a timely manner. In that way, PocketTrend enables a client-side search engine that can instantly answer user queries related to trending events, while at the same time reducing the impact of these trends on the datacenter workload. Our results, using real mobile search logs, show that in the presence of a trending event, up to 13-17% of the overall search traffic can be eliminated from the datacenter, with as many as 19% of all users benefiting from PocketTrend.

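    To make the detection step concrete, here is a generic query-volume spike detector of the kind such a system needs: a topic is flagged when its count in the latest time window is far above its recent history. The windowing scheme and the mean-plus-three-sigma threshold are illustrative assumptions, not the detection algorithm PocketTrend actually uses.

        # Flag a topic as trending when its latest per-window query count is far
        # above its recent history. All thresholds here are illustrative.
        from collections import deque
        from statistics import mean, pstdev

        class TrendDetector:
            def __init__(self, history_windows: int = 24, sigma: float = 3.0):
                self.history = {}              # topic -> recent per-window counts
                self.history_windows = history_windows
                self.sigma = sigma

            def observe(self, topic: str, count: int) -> bool:
                """Record one window's query count; return True if it looks trending."""
                past = self.history.setdefault(topic, deque(maxlen=self.history_windows))
                trending = False
                if len(past) >= 3:
                    mu, sd = mean(past), pstdev(past)
                    trending = count > mu + self.sigma * max(sd, 1.0)
                past.append(count)
                return trending

        if __name__ == "__main__":
            det = TrendDetector()
            for c in [10, 12, 9, 11, 10]:
                det.observe("world cup", c)
            print(det.observe("world cup", 80))   # sudden spike -> True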
  • Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case

    21st International Symposium on High-Performance Computer Architecture (HPCA)

    In current systems, memory accesses to a DRAM chip must obey a set of minimum latency restrictions specified in the DRAM standard. Such timing parameters exist to guarantee reliable operation. When deciding the timing parameters, DRAM manufacturers incorporate a very large margin as a provision against two worst-case scenarios. First, due to process variation, some outlier chips are much slower than others and cannot be operated as fast. Second, chips become slower at higher temperatures, and all chips need to operate reliably at the highest supported (i.e., worst-case) DRAM temperature (85°C). In this paper, we show that typical DRAM chips operating at typical temperatures (e.g., 55°C) are capable of providing a much smaller access latency, but are nevertheless forced to operate at the largest latency of the worst-case.

    Our goal in this paper is to exploit the extra margin that is built into the DRAM timing parameters to improve performance. Using an FPGA-based testing platform, we first characterize the extra margin for 115 DRAM modules from three major manufacturers. Our results demonstrate that it is possible to reduce four of the most critical timing parameters by a minimum/maximum of 17.3%/54.8% at 55°C without sacrificing correctness. Based on this characterization, we propose Adaptive-Latency DRAM (AL-DRAM), a mechanism that adaptively reduces the timing parameters for DRAM modules based on the current operating condition. AL-DRAM does not require any changes to the DRAM chip or its interface.

    We evaluate AL-DRAM on a real system that allows us to reconfigure the timing parameters at runtime. We show that AL-DRAM improves the performance of memory-intensive workloads by an average of 14% without introducing any errors. We discuss and show why AL-DRAM does not compromise reliability. We conclude that dynamically optimizing the DRAM timing parameters can reliably improve system performance.

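    The mechanism boils down to programming tighter timing parameters when the module and its operating temperature leave margin, and falling back to nominal values otherwise. The sketch below captures that decision; the parameter names are standard DDR timings, but every number (the nominal values, the flat 20% reduction, the 55°C cutoff) is an illustrative placeholder rather than a characterization result from the paper.

        # Choose DRAM timing parameters for the current operating point: tightened
        # values when profiled margin exists, nominal (worst-case) values otherwise.
        NOMINAL_TIMINGS_NS = {"tRCD": 13.75, "tRP": 13.75, "tRAS": 35.0, "tWR": 15.0}

        def adaptive_timings(temperature_c: float,
                             safe_reduction: float = 0.20,    # assumed profiled margin
                             max_safe_temp_c: float = 55.0) -> dict:
            """Return the timing parameters to program for this operating point."""
            if temperature_c > max_safe_temp_c:
                # Near the worst-case temperature: no margin to exploit.
                return dict(NOMINAL_TIMINGS_NS)
            # Cooler than worst case: apply the reduction established by offline
            # profiling of this particular module (here a single flat percentage).
            return {name: round(t * (1.0 - safe_reduction), 2)
                    for name, t in NOMINAL_TIMINGS_NS.items()}

        if __name__ == "__main__":
            print(adaptive_timings(45.0))   # tightened timings
            print(adaptive_timings(85.0))   # worst-case temperature -> nominal timings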
  • Exploiting Compressed Block Size as an Indicator of Future Reuse

    21st International Symposium on High-Performance Computer Architecture (HPCA)

    We introduce a set of new Compression-Aware Management Policies (CAMP) for on-chip caches that employ data compression. Our management policies are based on two key ideas. First, we show that it is possible to build a more efficient management policy for compressed caches if the compressed block size is directly used in calculating the value (importance) of a block to the cache. This leads to Minimal-Value Eviction (MVE), a policy that evicts the cache blocks with the least value, based on both the size and the expected future reuse. Second, we show that, in some cases, compressed block size can be used as an efficient indicator of the future reuse of a cache block. We use this idea to build a new insertion policy called Size-based Insertion Policy (SIP) that dynamically prioritizes cache blocks using their compressed size as an indicator.

    We compare CAMP (and its global variant G-CAMP) to prior on-chip cache management policies (both size-oblivious and size-aware) and find that our mechanisms are more effective in using compressed block size as an extra dimension in cache management decisions. Our results show that the proposed management policies (i) decrease off-chip bandwidth consumption (by 8.7% in single-core), (ii) decrease memory subsystem energy consumption (by 7.2% in single-core) for memory-intensive workloads compared to the best prior mechanism, and (iii) improve performance (by 4.9%/9.0%/10.2% on average in single-/two-/four-core workload evaluations and up to 20.1%). CAMP is effective for a variety of compression algorithms and different cache designs with local and global replacement strategies.

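    The Minimal-Value Eviction idea can be sketched in a few lines: when a set is full, evict the block whose expected benefit per byte of cache space is lowest. The value function below (predicted reuse divided by compressed size) is an illustrative stand-in for the policy's actual formulation, and the reuse predictions are assumed inputs.

        # Minimal-Value Eviction sketch: evict the resident block with the lowest
        # value, where value weighs expected reuse against compressed footprint.
        from dataclasses import dataclass

        @dataclass
        class CacheBlock:
            tag: int
            compressed_size: int     # bytes the block occupies in the data array
            predicted_reuse: float   # assumed reuse-predictor output; higher = hotter

        def block_value(b: CacheBlock) -> float:
            # Small, hot blocks are the most valuable; large, cold blocks the least.
            return b.predicted_reuse / b.compressed_size

        def choose_victim(blocks: list) -> CacheBlock:
            """Pick the block with the minimal value as the eviction victim."""
            return min(blocks, key=block_value)

        if __name__ == "__main__":
            resident = [
                CacheBlock(tag=0x1A, compressed_size=8,  predicted_reuse=0.9),
                CacheBlock(tag=0x2B, compressed_size=64, predicted_reuse=0.4),
                CacheBlock(tag=0x3C, compressed_size=16, predicted_reuse=0.05),
            ]
            print(hex(choose_victim(resident).tag))   # -> 0x3c (lowest reuse per byte)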
  • Shifted Hamming Distance: A Fast and Accurate SIMD-Friendly Filter to Accelerate Alignment Verification in Read Mapping

    Oxford Bioinformatics

    Motivation: Calculating the edit-distance (i.e., minimum number of insertions, deletions, and substitutions) between short DNA sequences is the primary task performed by seed-and-extend based mappers, which compare billions of sequences.

    In practice, only sequence pairs with a small edit-distance provide useful scientific data. However, the majority of sequence pairs analyzed by seed-and-extend based mappers differ by significantly more errors than what is typically allowed. Such error-abundant sequence pairs needlessly waste resources and severely hinder the performance of read mappers. Therefore, it is crucial to develop a fast and accurate filter that can rapidly and efficiently detect error-abundant string pairs and remove them from consideration before more computationally expensive methods are used.

    Results: We present a simple and efficient algorithm, Shifted Hamming Distance (SHD), which accelerates the alignment verification procedure in read mapping, by quickly filtering out error-abundant sequence pairs using bit-parallel and SIMD-parallel operations. SHD only filters string pairs that contain more errors than a user-defined threshold, making it fully comprehensive. It also maintains high accuracy with moderate error threshold (up to 5% of the string length) while achieving a 3-fold speedup over the best previous algorithm (Gene Myers's bit-vector algorithm). SHD is compatible with all mappers that perform sequence alignment for verification.

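    A stripped-down illustration of the filtering idea: build one mismatch bit-vector per shift of the read against the reference, AND the vectors together, and reject the pair only if more mismatches survive than the error budget allows. The sketch uses plain Python integers as bit-vectors instead of SIMD registers and omits the amendment/refinement steps of the real SHD filter, so it is a conceptual model only.

        def shifted_mask(read: str, ref: str, k: int) -> int:
            """Mismatch bit-vector comparing read[i] with ref[i + k]; bit i <-> read position i.
            Positions with no counterpart in ref are left 0, i.e. treated as matches."""
            mask = 0
            n = len(read)
            for i in range(n):
                j = i + k
                if 0 <= j < n and read[i] != ref[j]:
                    mask |= 1 << i
            return mask

        def shd_passes(read: str, ref: str, max_errors: int) -> bool:
            """Filter check: True keeps the pair for full alignment verification."""
            combined = (1 << len(read)) - 1              # start with every position a mismatch
            for k in range(-max_errors, max_errors + 1):
                combined &= shifted_mask(read, ref, k)   # any shift that matches clears the bit
            return bin(combined).count("1") <= max_errors

        if __name__ == "__main__":
            print(shd_passes("ACGTACGTAC", "ACGAACGTAC", max_errors=2))  # near match -> True
            print(shd_passes("ACGTACGTAC", "TTTTTTTTTT", max_errors=2))  # junk pair  -> False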
  • Rollback-Free Value Prediction with Approximate Loads

    The 23rd International Conference on Parallel Architectures and Compilation Techniques (PACT'14)

    This paper demonstrates how to utilize the inherent error resilience of a wide range of applications to mitigate the memory wall—the discrepancy between core and memory speed. We define a new microarchitecturally-triggered approximation technique called rollback-free value prediction. This technique predicts the value of safe-to-approximate loads when they miss in the cache without tracking mispredictions or requiring costly recovery from misspeculations. This technique mitigates the memory wall by allowing the core to continue computation without stalling for long-latency memory accesses. Our detailed study of the quality trade-offs shows that with a modern out-of-order processor, average 8% (up to 19%) performance improvement is possible with 0.8% (up to 1.8%) average quality loss on an approximable subset of SPEC CPU 2000/2006.

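    The control flow around the technique is simple to sketch: on a cache miss to a load that has been marked safe-to-approximate, hand the core a predicted value immediately and never roll back, even if the prediction later turns out wrong. The last-value predictor and the toy cache below are illustrative choices, not the predictor design from the paper.

        # Rollback-free flow: misses on approximable loads return a prediction
        # immediately instead of stalling, and no misspeculation recovery occurs.
        class ApproxLoadUnit:
            def __init__(self, cache: dict):
                self.cache = cache                  # address -> value (toy cache)
                self.last_value = {}                # per-load-PC last observed value

            def load(self, pc: int, addr: int, approximable: bool):
                if addr in self.cache:              # hit: normal, exact path
                    value = self.cache[addr]
                    self.last_value[pc] = value
                    return value, "exact"
                if approximable:
                    # Miss on a safe-to-approximate load: predict and keep executing.
                    # The real value is fetched in the background; nothing is rolled back.
                    return self.last_value.get(pc, 0), "predicted"
                return self.stall_and_fetch(addr), "exact"

            def stall_and_fetch(self, addr: int) -> int:
                value = 0                           # stand-in for a long-latency access
                self.cache[addr] = value
                return value

        if __name__ == "__main__":
            unit = ApproxLoadUnit(cache={0x100: 42})
            print(unit.load(pc=0x7F0, addr=0x100, approximable=True))  # (42, 'exact')
            print(unit.load(pc=0x7F0, addr=0x200, approximable=True))  # (42, 'predicted')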
  • Linearly Compressed Pages: A Low-Complexity, Low-Latency Main Memory Compression Framework

    MICRO 2013

  • Software Automatic Tuning: From Concepts to State-of-the-Art Results

    Springer

    Chapter 19: Gennady Pekhimenko and Angela Demke Brown, "Efficient Program Compilation Through Machine Learning Techniques"


Patents

  • Trend response management

    Issued US 14/175,934

  • Managing speculative assist threads

    Issued US 12/905,202

    An illustrative embodiment provides a computer-implemented process for managing speculative assist threads for data pre-fetching. The process analyzes collected source code and cache profiling information to identify a code region containing a delinquent load instruction and generates an assist thread, including a value for a local version number, at a program entry point within the identified code region. Upon activation of the assist thread, its local version number is compared to the global unique version number of the main thread for the identified code region, and the iteration distance of the assist thread relative to the main thread is compared to a predefined value. The assist thread is executed when its local version number matches the global unique version number of the main thread and its iteration distance relative to the main thread is within a predefined range of values.
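
    Read literally, the activation check in the claim reduces to two conditions, which the small helper below spells out; the parameter names and the distance bounds are illustrative, not values from the patent.

        def should_run_assist_thread(local_version: int,
                                     global_version: int,
                                     iteration_distance: int,
                                     min_distance: int = 4,
                                     max_distance: int = 64) -> bool:
            """Decide whether a speculative data-prefetching assist thread may execute."""
            versions_match = (local_version == global_version)   # same code-region epoch
            distance_ok = (min_distance <= iteration_distance <= max_distance)
            return versions_match and distance_ok

        if __name__ == "__main__":
            print(should_run_assist_thread(local_version=7, global_version=7,
                                           iteration_distance=16))   # True
            print(should_run_assist_thread(local_version=6, global_version=7,
                                           iteration_distance=16))   # stale version -> False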

Courses

  • Graduate Algorithms

    15-750

  • Graduate Computer Architecture

    15-740

  • Graduate Computer Networks

    15-744

  • Graduate Machine Learning

    15-781

  • Optimizing Compilers for Modern Architecture

    15-745

  • Parallel Computer Architecture

    18-742

  • Program Analysis

    15-819

  • Semantics of Programming Languages

    15-812

Honors & Awards

  • NVIDIA Graduate Fellowship, 2015-2016

    NVIDIA

  • Qualcomm Innovation Fellowship (QInF'13, Honorable Mention)

    Qualcomm

  • Microsoft Research Fellowship, 2013-2015

    Microsoft Research

  • Alexander Graham Bell Canada Graduate Scholarship, 2012-2014

    NSERC (Canada)

Languages

  • English

    Full professional proficiency

  • Russian

    Native or bilingual proficiency

  • Ukrainian

    Native or bilingual proficiency

  • German

    Elementary proficiency

Organizations

  • ACM

