Dhruba Borthakur

Sunnyvale, California, United States
9K followers · 500+ connections

Experience & Education

  • Rockset (acquired by OpenAI)

Volunteer Experience

  • PC Committee Member

    18th ACM International Conference on Distributed and Event-Based Systems (DEBS 2024)

    - Present 7 months

    Science and Technology

    https://2024.debs.org/program-committee/

  • Panel Reviewer

    National Science Foundation (NSF)

    - 5 months

    Science and Technology

    The National Science Foundation is a United States government agency that supports fundamental research and education. I help review research proposals from universities and other educational institutions seeking NSF funding.

  • Organizing Committee Member

    XLDB 2018 at Stanford Linear Accelerator Center (SLAC)

    - 1 year

    Education

    https://conf.slac.stanford.edu/xldb2018/committees

  • Program Committee Member

    USENIX ATC 2018

    - 1 year

    Education

    https://www.usenix.org/conference/atc18#about

  • Program Committee Member

    IEEE International Conference on Peer-to-Peer and Edge Computing 2015

    - 9 months

    http://wan.poly.edu/p2p2015/organizers.html

  • Program Committee Member

    O'Reilly Strata Data Conference

    - Present 9 years 9 months

    Science and Technology

    https://conferences.oreilly.com/strata/strata-ca/public/content/about#committee

  • Program Committee Member

    Big Data Benchmarking Conference 2012

    - 7 months

    Education

    http://clds.ucsd.edu/wbdb2014.de/organizers

  • Program Committee Member

    IEEE International Conference on Peer-to-Peer Computing 2014

    - 8 months

    Education

Publications

  • Disaggregated Database Management Systems

    Springer

    Shahram Ghandeharizadeh, Philip A. Bernstein, Dhruba Borthakur, Haoyu Huang, Jai Menon & Sumit Puri
    Conference paper, first published online 28 March 2023

    Part of the Lecture Notes in Computer Science book series (LNCS, volume 13860)

    Abstract
    Modern applications demand high performance and cost efficient database management systems (DBMSs). Their workloads may be diverse, ranging from online transaction processing to analytics and decision support. The cloud infrastructure enables disaggregation of monolithic DBMSs into components that facilitate software-hardware co-design. This is realized using pools of hardware resources, i.e., CPUs, GPUs, memory, FPGA, NVM, etc., connected using high-speed networks. This disaggregation trend is being adopted by cloud DBMSs because hardware re-provisioning can be achieved by simply invoking software APIs. Disaggregated DBMSs separate processing from storage, enabling each to scale elastically and independently. They may disaggregate compute usage based on functionality, e.g., compute needed for writes from compute needed for queries and compute needed for compaction. They may also use disaggregated memory, e.g., for intermediate results in a shuffle or for remote caching. The DBMS monitors the characteristics of a workload and dynamically assembles its components that are most efficient and cost effective for the workload. This paper is a summary of a panel session that discussed the capability, challenges, and opportunities of these emerging DBMSs and disaggregated hardware systems.

  • The Aggregator-Leaf-Tailer Architecture for powering fast SQL over semi-structured data

    High Performance Transaction Systems HPTS

    A key component of the Aggregator-Leaf-Tailer architecture is a high-performance serving layer that serves complex queries, not just key-value lookups. The existence of this serving layer obviates the need for complex data pipelines and avoids much of the complexity of the lambda architecture.

  • Optimizing Space Amplification in RocksDB

    CIDR

    RocksDB is an embedded, high-performance, persistent key-value storage engine developed at Facebook. Much of our current focus in developing and configuring RocksDB is to give priority to resource efficiency instead of giving priority to the more standard performance metrics, such as response time latency and throughput, as long as the latter remain acceptable. In particular, we optimize space efficiency while ensuring read and write latencies meet service-level requirements for the intended workloads. This choice is motivated by the fact that storage space is most often the primary bottleneck when using Flash SSDs under typical production workloads at Facebook. RocksDB uses log-structured merge trees to obtain significant space efficiency and better write throughput while achieving acceptable read performance.

    This paper describes methods we used to reduce storage usage in RocksDB. We discuss how we are able to trade off storage efficiency and CPU overhead, as well as read and write amplification. Based on experimental evaluations of MySQL with RocksDB as the embedded storage engine (using TPC-C and LinkBench benchmarks) and based on measurements taken from production databases, we show that RocksDB uses less than half the storage that InnoDB uses, yet performs well and in many cases even better than the B-tree-based InnoDB storage engine. To the best of our knowledge, this is the first time a log-structured merge-tree-based storage engine has shown competitive performance when running OLTP workloads at large scale.

    Other authors
    • Siying Dong, Mark Callaghan, Leonidas Galanis, Tony Savor, Michael Stumm
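
    The space-versus-write-amplification trade-off described above can be made concrete with a small back-of-the-envelope model of leveled compaction. The sketch below assumes a fixed fanout between adjacent levels and a fixed level count; the constants are illustrative and are not taken from the paper or from RocksDB defaults.

```python
# Toy model of the space/write amplification trade-off in a leveled LSM tree.
# The fanout and level count are illustrative assumptions, not figures from
# the paper.

def leveled_lsm_amplification(fanout: int, levels: int):
    """Return rough (space_amp, write_amp) estimates for leveled compaction."""
    # With a size ratio of `fanout` between adjacent levels, all levels above
    # the last can hold at most ~1/fanout of the data in the last level, so
    # the total on-disk size is about (1 + 1/fanout) times the live data.
    space_amp = 1.0 + 1.0 / fanout
    # Each byte is flushed once and then rewritten up to `fanout` times at
    # each level as it is compacted downward.
    write_amp = 1 + fanout * levels
    return space_amp, write_amp

for fanout in (4, 10, 16):
    space_amp, write_amp = leveled_lsm_amplification(fanout, levels=5)
    print(f"fanout={fanout:>2}: space amp ~{space_amp:.2f}x, write amp ~{write_amp}x")
```
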
  • A hitchhiker's guide to fast and efficient data reconstruction in erasure-coded data centers

    SIGCOMM

    Erasure codes such as Reed-Solomon (RS) codes are being extensively deployed in data centers since they offer significantly higher reliability than data replication methods at much lower storage overheads. These codes however mandate much higher resources with respect to network bandwidth and disk IO during reconstruction of data that is missing or otherwise unavailable. Existing solutions to this problem either demand additional storage space or severely limit the choice of the system parameters. In this paper, we present Hitchhiker, a new erasure-coded storage system that reduces both network traffic and disk IO by around 25% to 45% during reconstruction of missing or otherwise unavailable data, with no additional storage, the same fault tolerance, and arbitrary flexibility in the choice of parameters, as compared to RS-based systems. Hitchhiker "rides" on top of RS codes, and is based on novel encoding and decoding techniques that will be presented in this paper. We have implemented Hitchhiker in the Hadoop Distributed File System (HDFS).

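    As a rough worked example of the reconstruction costs discussed above: rebuilding one lost block under Reed-Solomon requires reading k surviving blocks, whereas replication reads just one. The (10, 4) layout, the 256 MB block size, and the use of the midpoint of the quoted 25% to 45% range are assumptions for illustration, not measurements from the paper.

```python
# Network traffic needed to rebuild a single lost block under three schemes.
# RS(10, 4) and the 256 MB block size are assumed for illustration; the ~35%
# saving is the midpoint of the 25-45% range quoted in the abstract.

BLOCK_MB = 256
K, R = 10, 4   # k data blocks + r parity blocks per stripe

def repair_traffic_mb(scheme: str) -> float:
    if scheme == "replication":
        return BLOCK_MB              # copy one surviving replica
    if scheme == "reed_solomon":
        return K * BLOCK_MB          # decoding needs any k of the k + r blocks
    if scheme == "hitchhiker":
        return 0.65 * K * BLOCK_MB   # roughly 35% less than plain RS
    raise ValueError(f"unknown scheme: {scheme}")

for scheme in ("replication", "reed_solomon", "hitchhiker"):
    print(f"{scheme:>13}: {repair_traffic_mb(scheme):7.0f} MB read per lost block")
```
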
  • Analysis of HDFS Under HBase: A Facebook Messages Case Study

    (FAST ’14) 12th USENIX Conference on File and Storage Technologies

    We present a multilayer study of the Facebook Messages stack, which is based on HBase and HDFS. We collect and analyze HDFS traces to identify potential improvements, which we then evaluate via simulation. Messages represents a new HDFS workload: whereas HDFS was built to store very large files and receive mostly sequential I/O, 90% of files are smaller than 15MB and I/O is highly random. We find hot data is too large to easily fit in RAM and cold data is too large to easily fit in flash; however, cost simulations show that adding a small flash tier improves performance more than equivalent spending on RAM or disks. HBase’s layered design offers simplicity, but at the cost of performance; our simulations show that network I/O can be halved if compaction bypasses the replication layer. Finally, although Messages is read-dominated, several features of the stack (i.e., logging, compaction, replication, and caching) amplify write I/O, causing writes to dominate disk I/O.

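    The write amplification the study calls out comes from layering: a logical write is journaled, rewritten by compaction, and replicated. A toy model of how per-layer factors compound is sketched below; the individual factors are illustrative assumptions, not the values measured in the paper.

```python
# Toy model of layered write amplification in an HBase-on-HDFS stack.
# The per-layer factors are illustrative assumptions, not measured values.

layers = [
    ("write-ahead logging", 2.0),  # each byte written to the log and to a data file
    ("compaction",          3.0),  # data files rewritten a few times over their lifetime
    ("HDFS replication",    3.0),  # three replicas per block (the HDFS default)
]

amplification = 1.0
for name, factor in layers:
    amplification *= factor
    print(f"after {name:<20} cumulative write amplification ~{amplification:.0f}x")
```
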
  • A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook Warehouse Cluster

    HotStorage

    In this paper, we first present a study on the impact of recovery operations of erasure-coded data on the data-center network, based on measurements from Facebook’s warehouse cluster in production. To the best of our knowledge, this is the first study of its kind available in the literature. Our study reveals that recovery of RS-coded data results in a significant increase in network traffic, more than a hundred terabytes per day, in a cluster storing multiple petabytes of RS-coded data.

    To address this issue, we present a new storage code using our recently proposed Piggybacking framework, that reduces the network and disk usage during recovery by 30% in theory, while also being storage optimal and supporting arbitrary design parameters. The implementation of the proposed code in the Hadoop Distributed File System (HDFS) is underway. We use the measurements from the warehouse cluster to show that the proposed code would lead to a reduction of close to fifty terabytes of cross-rack traffic per day.

    Other authors
    • K V Rashmi
    • Nihar B Shah
    • Dikang Gu
    • Hairong Kuang
  • LinkBench: a Database Benchmark based on the Facebook Social Graph

    VLDB

    In this paper we present a new synthetic benchmark called LinkBench. LinkBench is based on traces from production databases that store "social graph" data at Facebook, a major social network. We characterize the data and query workload in many dimensions, and use the insights gained to construct a realistic synthetic benchmark. LinkBench provides a realistic and challenging test for persistent storage of social and web service data, filling a gap in the available tools for researchers, developers and administrators.

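    The core idea of a trace-informed synthetic benchmark can be sketched in a few lines: draw operation types from a measured mix and object ids from a skewed popularity distribution, then replay them against the store under test. The operation names, mix, and skew parameter below are illustrative placeholders, not the actual LinkBench workload definition.

```python
# Minimal sketch of a trace-informed synthetic workload generator. The
# operation mix and skew parameter are illustrative placeholders, not the
# real LinkBench configuration.
import random

OP_MIX = {                 # assumed fractions of the request stream
    "get_link_list": 0.50,
    "count_links":   0.20,
    "get_node":      0.15,
    "add_link":      0.10,
    "update_node":   0.05,
}

def next_request(num_ids: int, skew: float = 1.2):
    """Pick an operation and a graph object id with heavy-tailed popularity."""
    op = random.choices(list(OP_MIX), weights=list(OP_MIX.values()))[0]
    obj_id = min(num_ids, int(random.paretovariate(skew)))
    return op, obj_id

for _ in range(5):
    print(next_request(num_ids=1_000_000))
```
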
  • Petabyte Scale Data at Facebook

    SIGMOD

    At Facebook, we use various types of databases and storage systems to satisfy the needs of different applications. The solutions built around these data store systems have a common set of requirements: they have to be highly scalable, maintenance costs should be low and they have to perform efficiently. We use a sharded mySQL+memcache solution to support real-time access of tens of petabytes of data and we use TAO to provide consistency of this web-scale database across geographical distances. We use the Haystack data store for storing the 3 billion new photos we host every week. We use Apache Hadoop to mine intelligence from 100 petabytes of click logs and combine it with the power of Apache HBase to store all Facebook Messages.

    This paper describes the reasons why each of these databases is appropriate for that workload and the design decisions and tradeoffs that were made while implementing these solutions. We touch upon the consistency, availability and partitioning tolerance of each of these solutions. We touch upon the reasons why some of these systems need ACID semantics and other systems do not. We describe the techniques we have used to map the Facebook Graph Database into a set of relational tables. We speak of how we plan to do big-data deployments across geographical locations and our requirements for a new breed of pure-memory and pure-SSD based transactional database.

    Esteemed researchers in the Database Management community have benchmarked query latencies on Hive/Hadoop to be less performant than a traditional Parallel DBMS. We describe why these benchmarks are insufficient for Big Data deployments and why we continue to use Hadoop/Hive. We present an alternate set of benchmark techniques that measure capacity of a database, the value/byte in that database and the efficiency of inbuilt crowd-sourcing techniques to reduce administration costs of that database.

  • XORing Elephants: Novel Erasure Codes for Big Data

    VLDB

    Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice and their high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability.

    This paper shows how to overcome this limitation. We present a novel family of erasure codes that are efficiently repairable and offer higher reliability compared to Reed-Solomon codes. We show analytically that our codes are optimal on a recently identified tradeoff between locality and minimum distance.

    We implement our new codes in Hadoop HDFS and compare to a currently deployed HDFS module that uses Reed-Solomon codes. Our modified HDFS implementation shows a reduction of approximately 2× on the repair disk I/O and repair network traffic. The disadvantage of the new coding scheme is that it requires 14% more storage compared to Reed-Solomon codes, an overhead shown to be information theoretically optimal to obtain locality. Because the new codes repair failures faster, this provides higher reliability, which is orders of magnitude higher compared to replication.

    Other authors
    • Maheswaran Sathiamoorthy
    • Megasthenis Asteris
    • Dimitris Papailiopoulos
    • Alexandros G. Dimakis
    • Ramkumar Vadali
    • Scott Chen
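
    The trade-off quoted above (roughly 2x less repair I/O for about 14% more storage) can be put side by side for a concrete stripe layout. The RS(10, 4) configuration and 256 MB block size below are assumptions for illustration; the 2x and 14% figures are the ones stated in the abstract.

```python
# Side-by-side repair cost vs. storage overhead, using the 2x repair reduction
# and ~14% extra storage stated in the abstract. RS(10, 4) and the 256 MB
# block size are illustrative assumptions.

BLOCK_MB = 256
K, R = 10, 4

rs_storage = (K + R) / K            # 1.4x raw bytes per logical byte
lrc_storage = rs_storage * 1.14     # ~14% more storage than Reed-Solomon
rs_repair_mb = K * BLOCK_MB         # read k blocks to rebuild one lost block
lrc_repair_mb = rs_repair_mb / 2    # ~2x less repair disk I/O and traffic

print(f"Reed-Solomon: {rs_storage:.2f}x storage, {rs_repair_mb:6.0f} MB read per repair")
print(f"LRC:          {lrc_storage:.2f}x storage, {lrc_repair_mb:6.0f} MB read per repair")
```
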
  • Realtime Hadoop at Facebook

    SIGMOD

    Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. Apache HBase is a database-like layer built on Hadoop designed to support billions of messages per day. This paper describes the reasons why Facebook chose Hadoop and HBase over other systems such as Apache Cassandra and Voldemort and discusses the application's requirements for consistency, availability, partition tolerance, data model and scalability. We explore the enhancements made to Hadoop to make it a more effective realtime system, the tradeoffs we made while configuring the system, and how this solution has significant advantages over the sharded MySQL database scheme used in other applications at Facebook and many other web-scale companies. We discuss the motivations behind our design choices, the challenges that we face in day-to-day operations, and future capabilities and improvements still under development. We offer these observations on the deployment as a model for other companies who are contemplating a Hadoop-based solution over traditional sharded RDBMS deployments.

    https://dl.acm.org/doi/10.1145/1989323.1989438

  • Data Warehousing and Analytics Infrastructure at Facebook

    SIGMOD

    Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for the Facebook Advertisers, to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on the ever-increasing amount of data, a flexible infrastructure that scales up in a cost-effective manner is critical. We have leveraged, authored and contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop and Hive which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we will present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations and future capabilities and improvements that we are working on.

    https://dl.acm.org/doi/10.1145/1807167.1807278

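    A quick sanity check of the figures quoted in the abstract is worked out below: the stated footprints imply roughly a 6x compression ratio both at rest and on the daily load. The day count is derived arithmetic, not a number from the paper.

```python
# Arithmetic on the figures quoted in the abstract: 15 PB stored as 2.5 PB
# after compression, and 60 TB/day loaded as 10 TB/day after compression.
# Only the derived ratios below are computed here.

stored_logical_pb, stored_compressed_pb = 15, 2.5
daily_logical_tb, daily_compressed_tb = 60, 10

print(f"compression ratio at rest: {stored_logical_pb / stored_compressed_pb:.1f}x")
print(f"compression ratio of load: {daily_logical_tb / daily_compressed_tb:.1f}x")
print(f"compressed footprint ~= {stored_compressed_pb * 1024 / daily_compressed_tb:.0f} "
      f"days of compressed daily ingest")
```
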
  • Delay Scheduling in Hadoop

    EuroSys conference

    As organizations start to use data-intensive cluster computing systems like Hadoop and Dryad for more applications, there is a growing need to share clusters between users. However, there is a conflict between fairness in scheduling and data locality (placing tasks on nodes that contain their input data). We illustrate this problem through our experience designing a fair scheduler for a 600-node Hadoop cluster at Facebook. To address the conflict between locality and fairness, we propose a simple algorithm called delay scheduling: when the job that should be scheduled next according to fairness cannot launch a local task, it waits for a small amount of time, letting other jobs launch tasks instead. We find that delay scheduling achieves nearly optimal data locality in a variety of workloads and can increase throughput by up to 2x while preserving fairness. In addition, the simplicity of delay scheduling makes it applicable under a wide variety of scheduling policies beyond fair sharing.

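    The scheduling rule quoted above is simple enough to sketch directly: when the job at the head of the fair-share ordering has no task whose input is local to the node that just freed a slot, skip it for a bounded number of scheduling opportunities before allowing a non-local launch. The class and field names below are illustrative; this is a sketch of the idea, not the Hadoop fair scheduler's actual code.

```python
# Minimal sketch of delay scheduling. Names and the skip bound are
# illustrative; the paper tunes the wait empirically rather than using a
# fixed skip count.
from dataclasses import dataclass

MAX_SKIPS = 3  # assumed bound on how long a job may be passed over

@dataclass
class Job:
    name: str
    pending: dict           # task_id -> set of hostnames holding its input
    skips: int = 0

def assign_slot(jobs_in_fair_order: list[Job], free_node: str):
    """Pick (job, task_id, is_local) for a freed slot, or None to leave it idle."""
    for job in jobs_in_fair_order:
        # Prefer a task whose input data already lives on the free node.
        for task_id, hosts in job.pending.items():
            if free_node in hosts:
                job.skips = 0
                return job, task_id, True
        # No local task: let the job wait a little, giving later jobs a turn.
        if job.skips < MAX_SKIPS:
            job.skips += 1
            continue
        # The job has waited long enough; launch any task non-locally.
        if job.pending:
            task_id = next(iter(job.pending))
            return job, task_id, False
    return None
```
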

Patents
