Dhruba Borthakur

Sunnyvale, California, United States
9K followers · 500+ connections

Experience & Education

  • Rockset (acquired by OpenAI)

Volunteer Experience

  • PC Committee Member

    18th ACM International Conference on Distributed and Event-Based Systems (DEBS 2024)

    - Present 7 months

    Science and Technology

    https://2024.debs.org/program-committee/

  • Panel Reviewer

    National Science Foundation (NSF)

    - 5 months

    Science and Technology

    The National Science Foundation is a United States government agency that supports fundamental research and education. I help review research proposals from universities and other educational institutions seeking NSF funding.

  • Organizing Committee Member

    XLDB 2018 at Stanford Linear Accelerator Center (SLAC)

    - 1 year

    Education

    https://conf.slac.stanford.edu/xldb2018/committees

  • Program Committee Member

    USENIX ATC 2018

    - 1 year

    Education

    https://www.usenix.org/conference/atc18#about

  • Program Committee Member

    IEEE International Conference on Peer-to-Peer and Edge Computing 2015

    - 9 months

    http://wan.poly.edu/p2p2015/organizers.html

  • Program Committee Member

    O'Reilly Strata Data Conference

    - Present 9 years 9 months

    Science and Technology

    https://conferences.oreilly.com/strata/strata-ca/public/content/about#committee

  • Program Committee Member

    Big Data Benchmarking Conference 2012

    - 7 months

    Education

    http://clds.ucsd.edu/wbdb2014.de/organizers

  • Program Committee Member

    IEEE International Conference on Peer-to-Peer Computing 2014

    - 8 months

    Education

Publications

  • Disaggregated Database Management Systems

    Springer

    Shahram Ghandeharizadeh, Philip A. Bernstein, Dhruba Borthakur, Haoyu Huang, Jai Menon & Sumit Puri
    Conference paper, first published online 28 March 2023

    Part of the Lecture Notes in Computer Science book series (LNCS, volume 13860)

    Abstract
    Modern applications demand high performance and cost efficient database management systems (DBMSs). Their workloads may be diverse, ranging from online transaction processing to analytics and decision support. The cloud infrastructure enables disaggregation of monolithic DBMSs into components that facilitate software-hardware co-design. This is realized using pools of hardware resources, i.e., CPUs, GPUs, memory, FPGA, NVM, etc., connected using high-speed networks. This disaggregation trend is being adopted by cloud DBMSs because hardware re-provisioning can be achieved by simply invoking software APIs. Disaggregated DBMSs separate processing from storage, enabling each to scale elastically and independently. They may disaggregate compute usage based on functionality, e.g., compute needed for writes from compute needed for queries and compute needed for compaction. They may also use disaggregated memory, e.g., for intermediate results in a shuffle or for remote caching. The DBMS monitors the characteristics of a workload and dynamically assembles its components that are most efficient and cost effective for the workload. This paper is a summary of a panel session that discussed the capability, challenges, and opportunities of these emerging DBMSs and disaggregated hardware systems.

  • The Aggregator-Leaf-Tailer Architecture for powering fast SQL over semi-structured data

    High Performance Transaction Systems HPTS

    A key component of the Aggregator-Leaf-Tailer architecture is a high-performance serving layer that serves complex queries, not just key-value lookups. The existence of this serving layer obviates the need for complex data pipelines and avoids much of the complexity of the lambda architecture.

  • Optimizing Space Amplification in RocksDB

    CIDR

    RocksDB is an embedded, high-performance, persistent key-value storage engine developed at Facebook. Much of our current focus in developing and configuring RocksDB is to give priority to resource efficiency instead of giving priority to the more standard performance metrics, such as response time latency and throughput, as long as the latter remain acceptable. In particular, we optimize space efficiency while ensuring read and write latencies meet service-level requirements for the intended workloads. This choice is motivated by the fact that storage space is most often the primary bottleneck when using Flash SSDs under typical production workloads at Facebook. RocksDB uses log-structured merge trees to obtain significant space efficiency and better write throughput while achieving acceptable read performance.

    This paper describes methods we used to reduce storage usage in RocksDB. We discuss how we are able to trade off storage efficiency and CPU overhead, as well as read and write amplification. Based on experimental evaluations of MySQL with RocksDB as the embedded storage engine (using TPC-C and LinkBench benchmarks) and based on measurements taken from production databases, we show that RocksDB uses less than half the storage that InnoDB uses, yet performs well and in many cases even better than the B-tree-based InnoDB storage engine. To the best of our knowledge, this is the first time a log-structured merge-tree-based storage engine has shown competitive performance when running OLTP workloads at large scale.

    Other authors
    • Siying Dong, Mark Callaghan, Leonidas Galanis, Tony Savor, Michael Stumm
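
    The space-versus-write-amplification trade-off described above can be made concrete with a small back-of-the-envelope model of leveled compaction. The sketch below assumes a fixed fanout between adjacent levels and a fixed level count; the constants are illustrative and are not taken from the paper or from RocksDB defaults.

```python
# Toy model of the space/write amplification trade-off in a leveled LSM tree.
# The fanout and level count are illustrative assumptions, not figures from
# the paper.

def leveled_lsm_amplification(fanout: int, levels: int):
    """Return rough (space_amp, write_amp) estimates for leveled compaction."""
    # With a size ratio of `fanout` between adjacent levels, all levels above
    # the last can hold at most ~1/fanout of the data in the last level, so
    # the total on-disk size is about (1 + 1/fanout) times the live data.
    space_amp = 1.0 + 1.0 / fanout
    # Each byte is flushed once and then rewritten up to `fanout` times at
    # each level as it is compacted downward.
    write_amp = 1 + fanout * levels
    return space_amp, write_amp

for fanout in (4, 10, 16):
    space_amp, write_amp = leveled_lsm_amplification(fanout, levels=5)
    print(f"fanout={fanout:>2}: space amp ~{space_amp:.2f}x, write amp ~{write_amp}x")
```
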
  • A hitchhiker's guide to fast and efficient data reconstruction in erasure-coded data centers

    SIGCOMM

    Erasure codes such as Reed-Solomon (RS) codes are being extensively deployed in data centers since they offer significantly higher reliability than data replication methods at much lower storage overheads. These codes however mandate much higher resources with respect to network bandwidth and disk IO during reconstruction of data that is missing or otherwise unavailable. Existing solutions to this problem either demand additional storage space or severely limit the choice of the system parameters. In this paper, we present Hitchhiker, a new erasure-coded storage system that reduces both network traffic and disk IO by around 25% to 45% during reconstruction of missing or otherwise unavailable data, with no additional storage, the same fault tolerance, and arbitrary flexibility in the choice of parameters, as compared to RS-based systems. Hitchhiker "rides" on top of RS codes, and is based on novel encoding and decoding techniques that will be presented in this paper. We have implemented Hitchhiker in the Hadoop Distributed File System (HDFS).

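    As a rough worked example of the reconstruction costs discussed above: rebuilding one lost block under Reed-Solomon requires reading k surviving blocks, whereas replication reads just one. The (10, 4) layout, the 256 MB block size, and the use of the midpoint of the quoted 25% to 45% range are assumptions for illustration, not measurements from the paper.

```python
# Network traffic needed to rebuild a single lost block under three schemes.
# RS(10, 4) and the 256 MB block size are assumed for illustration; the ~35%
# saving is the midpoint of the 25-45% range quoted in the abstract.

BLOCK_MB = 256
K, R = 10, 4   # k data blocks + r parity blocks per stripe

def repair_traffic_mb(scheme: str) -> float:
    if scheme == "replication":
        return BLOCK_MB              # copy one surviving replica
    if scheme == "reed_solomon":
        return K * BLOCK_MB          # decoding needs any k of the k + r blocks
    if scheme == "hitchhiker":
        return 0.65 * K * BLOCK_MB   # roughly 35% less than plain RS
    raise ValueError(f"unknown scheme: {scheme}")

for scheme in ("replication", "reed_solomon", "hitchhiker"):
    print(f"{scheme:>13}: {repair_traffic_mb(scheme):7.0f} MB read per lost block")
```
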
  • Analysis of HDFS Under HBase: A Facebook Messages Case Study

    (FAST ’14) 12th USENIX Conference on File and Storage Technologies

    We present a multilayer study of the Facebook Messages stack, which is based on HBase and HDFS. We collect and analyze HDFS traces to identify potential improvements, which we then evaluate via simulation. Messages represents a new HDFS workload: whereas HDFS was built to store very large files and receive mostly sequential I/O, 90% of files are smaller than 15MB and I/O is highly random. We find hot data is too large to easily fit in RAM and cold data is too large to easily fit in flash; however, cost simulations show that adding a small flash tier improves performance more than equivalent spending on RAM or disks. HBase’s layered design offers simplicity, but at the cost of performance; our simulations show that network I/O can be halved if compaction bypasses the replication layer. Finally, although Messages is read-dominated, several features of the stack (i.e., logging, compaction, replication, and caching) amplify write I/O, causing writes to dominate disk I/O.

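    The write amplification the study calls out comes from layering: a logical write is journaled, rewritten by compaction, and replicated. A toy model of how per-layer factors compound is sketched below; the individual factors are illustrative assumptions, not the values measured in the paper.

```python
# Toy model of layered write amplification in an HBase-on-HDFS stack.
# The per-layer factors are illustrative assumptions, not measured values.

layers = [
    ("write-ahead logging", 2.0),  # each byte written to the log and to a data file
    ("compaction",          3.0),  # data files rewritten a few times over their lifetime
    ("HDFS replication",    3.0),  # three replicas per block (the HDFS default)
]

amplification = 1.0
for name, factor in layers:
    amplification *= factor
    print(f"after {name:<20} cumulative write amplification ~{amplification:.0f}x")
```
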
  • A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook Warehouse Cluster

    HotStorage

    In this paper, we first present a study on the impact of recovery operations of erasure-coded data on the data-center network, based on measurements from Facebook’s warehouse cluster in production. To the best of our knowledge, this is the first study of its kind available in the literature. Our study reveals that recovery of RS-coded data results in a significant increase in network traffic, more than a hundred terabytes per day, in a cluster storing multiple petabytes of RS-coded data.

    To address this issue, we present a new storage code using our recently proposed Piggybacking framework, that reduces the network and disk usage during recovery by 30% in theory, while also being storage optimal and supporting arbitrary design parameters. The implementation of the proposed code in the Hadoop Distributed File System (HDFS) is underway. We use the measurements from the warehouse cluster to show that the proposed code would lead to a reduction of close to fifty terabytes of cross-rack traffic per day.

    Other authors
    • K V Rashmi
    • Nihar B Shah
    • Dikang Gu
    • Hairong Kuang
  • LinkBench: a Database Benchmark based on the Facebook Social Graph

    VLDB

    In this paper we present a new synthetic benchmark called LinkBench. LinkBench is based on traces from production databases that store "social graph" data at Facebook, a major social network. We characterize the data and query workload in many dimensions, and use the insights gained to construct a realistic synthetic benchmark. LinkBench provides a realistic and challenging test for persistent storage of social and web service data, filling a gap in the available tools for researchers, developers and administrators.

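    The core idea of a trace-informed synthetic benchmark can be sketched in a few lines: draw operation types from a measured mix and object ids from a skewed popularity distribution, then replay them against the store under test. The operation names, mix, and skew parameter below are illustrative placeholders, not the actual LinkBench workload definition.

```python
# Minimal sketch of a trace-informed synthetic workload generator. The
# operation mix and skew parameter are illustrative placeholders, not the
# real LinkBench configuration.
import random

OP_MIX = {                 # assumed fractions of the request stream
    "get_link_list": 0.50,
    "count_links":   0.20,
    "get_node":      0.15,
    "add_link":      0.10,
    "update_node":   0.05,
}

def next_request(num_ids: int, skew: float = 1.2):
    """Pick an operation and a graph object id with heavy-tailed popularity."""
    op = random.choices(list(OP_MIX), weights=list(OP_MIX.values()))[0]
    obj_id = min(num_ids, int(random.paretovariate(skew)))
    return op, obj_id

for _ in range(5):
    print(next_request(num_ids=1_000_000))
```
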
  • Petabyte Scale Data at Facebook

    SIGMOD

    At Facebook, we use various types of databases and storage systems to satisfy the needs of different applications. The solutions built around these data store systems have a common set of requirements: they have to be highly scalable, maintenance costs should be low and they have to perform efficiently. We use a sharded mySQL+memcache solution to support real-time access of tens of petabytes of data and we use TAO to provide consistency of this web-scale database across geographical distances. We use the Haystack data store for storing the 3 billion new photos we host every week. We use Apache Hadoop to mine intelligence from 100 petabytes of click logs and combine it with the power of Apache HBase to store all Facebook Messages.

    This paper describes the reasons why each of these databases is appropriate for that workload and the design decisions and tradeoffs that were made while implementing these solutions. We touch upon the consistency, availability and partitioning tolerance of each of these solutions. We touch upon the reasons why some of these systems need ACID semantics and other systems do not. We describe the techniques we have used to map the Facebook Graph Database into a set of relational tables. We speak of how we plan to do big-data deployments across geographical locations and our requirements for a new breed of pure-memory and pure-SSD based transactional database.

    Esteemed researchers in the Database Management community have benchmarked query latencies on Hive/Hadoop to be less performant than a traditional Parallel DBMS. We describe why these benchmarks are insufficient for Big Data deployments and why we continue to use Hadoop/Hive. We present an alternate set of benchmark techniques that measure capacity of a database, the value/byte in that database and the efficiency of inbuilt crowd-sourcing techniques to reduce administration costs of that database.

  • XORing Elephants: Novel Erasure Codes for Big Data

    VLDB

    Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice and their high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability.

    This paper shows how to overcome this limitation. We present a novel family of erasure codes that are efficiently repairable and offer higher reliability compared to Reed-Solomon codes. We show analytically that our codes are optimal on a recently identified tradeoff between locality and minimum distance.

    We implement our new codes in Hadoop HDFS and compare to a currently deployed HDFS module that uses Reed-Solomon codes. Our modified HDFS implementation shows a reduction of approximately 2× on the repair disk I/O and repair network traffic. The disadvantage of the new coding scheme is that it requires 14% more storage compared to Reed-Solomon codes, an overhead shown to be information theoretically optimal to obtain locality. Because the new codes repair failures faster, this provides higher reliability, which is orders of magnitude higher compared to replication.

    Other authors
    • Maheswaran Sathiamoorthy
    • Megasthenis Asteris
    • Dimitris Papailiopoulos
    • Alexandros G. Dimakis
    • Ramkumar Vadali
    • Scott Chen
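
    The trade-off quoted above (roughly 2x less repair I/O for about 14% more storage) can be put side by side for a concrete stripe layout. The RS(10, 4) configuration and 256 MB block size below are assumptions for illustration; the 2x and 14% figures are the ones stated in the abstract.

```python
# Side-by-side repair cost vs. storage overhead, using the 2x repair reduction
# and ~14% extra storage stated in the abstract. RS(10, 4) and the 256 MB
# block size are illustrative assumptions.

BLOCK_MB = 256
K, R = 10, 4

rs_storage = (K + R) / K            # 1.4x raw bytes per logical byte
lrc_storage = rs_storage * 1.14     # ~14% more storage than Reed-Solomon
rs_repair_mb = K * BLOCK_MB         # read k blocks to rebuild one lost block
lrc_repair_mb = rs_repair_mb / 2    # ~2x less repair disk I/O and traffic

print(f"Reed-Solomon: {rs_storage:.2f}x storage, {rs_repair_mb:6.0f} MB read per repair")
print(f"LRC:          {lrc_storage:.2f}x storage, {lrc_repair_mb:6.0f} MB read per repair")
```
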
  • Realtime Hadoop at Facebook

    SIGMOD

    Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. Apache HBase is a database-like layer built on Hadoop designed to support billions of messages per day. This paper describes the reasons why Facebook chose Hadoop and HBase over other systems such as Apache Cassandra and Voldemort and discusses the application's requirements for consistency, availability, partition tolerance, data model and scalability. We explore the enhancements made to Hadoop to make it a more effective realtime system, the tradeoffs we made while configuring the system, and how this solution has significant advantages over the sharded MySQL database scheme used in other applications at Facebook and many other web-scale companies. We discuss the motivations behind our design choices, the challenges that we face in day-to-day operations, and future capabilities and improvements still under development. We offer these observations on the deployment as a model for other companies who are contemplating a Hadoop-based solution over traditional sharded RDBMS deployments.

    https://dl.acm.org/doi/10.1145/1989323.1989438

  • Data Warehousing and Analytics Infrastructure at Facebook

    SIGMOD

    Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for the Facebook Advertisers, to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on the ever-increasing amount of data, a flexible infrastructure that scales up in a cost-effective manner is critical. We have leveraged, authored and contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop and Hive which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we will present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations and future capabilities and improvements that we are working on.

    https://dl.acm.org/doi/10.1145/1807167.1807278

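    A quick sanity check of the figures quoted in the abstract is worked out below: the stated footprints imply roughly a 6x compression ratio both at rest and on the daily load. The day count is derived arithmetic, not a number from the paper.

```python
# Arithmetic on the figures quoted in the abstract: 15 PB stored as 2.5 PB
# after compression, and 60 TB/day loaded as 10 TB/day after compression.
# Only the derived ratios below are computed here.

stored_logical_pb, stored_compressed_pb = 15, 2.5
daily_logical_tb, daily_compressed_tb = 60, 10

print(f"compression ratio at rest: {stored_logical_pb / stored_compressed_pb:.1f}x")
print(f"compression ratio of load: {daily_logical_tb / daily_compressed_tb:.1f}x")
print(f"compressed footprint ~= {stored_compressed_pb * 1024 / daily_compressed_tb:.0f} "
      f"days of compressed daily ingest")
```
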
  • Delay Scheduling in Hadoop

    EuroSys conference

    As organizations start to use data-intensive cluster computing systems like Hadoop and Dryad for more applications, there is a growing need to share clusters between users. However, there is a conflict between fairness in scheduling and data locality (placing tasks on nodes that contain their input data). We illustrate this problem through our experience designing a fair scheduler for a 600-node Hadoop cluster at Facebook. To address the conflict between locality and fairness, we propose a simple algorithm called delay scheduling: when the job that should be scheduled next according to fairness cannot launch a local task, it waits for a small amount of time, letting other jobs launch tasks instead. We find that delay scheduling achieves nearly optimal data locality in a variety of workloads and can increase throughput by up to 2x while preserving fairness. In addition, the simplicity of delay scheduling makes it applicable under a wide variety of scheduling policies beyond fair sharing.

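    The scheduling rule quoted above is simple enough to sketch directly: when the job at the head of the fair-share ordering has no task whose input is local to the node that just freed a slot, skip it for a bounded number of scheduling opportunities before allowing a non-local launch. The class and field names below are illustrative; this is a sketch of the idea, not the Hadoop fair scheduler's actual code.

```python
# Minimal sketch of delay scheduling. Names and the skip bound are
# illustrative; the paper tunes the wait empirically rather than using a
# fixed skip count.
from dataclasses import dataclass

MAX_SKIPS = 3  # assumed bound on how long a job may be passed over

@dataclass
class Job:
    name: str
    pending: dict           # task_id -> set of hostnames holding its input
    skips: int = 0

def assign_slot(jobs_in_fair_order: list[Job], free_node: str):
    """Pick (job, task_id, is_local) for a freed slot, or None to leave it idle."""
    for job in jobs_in_fair_order:
        # Prefer a task whose input data already lives on the free node.
        for task_id, hosts in job.pending.items():
            if free_node in hosts:
                job.skips = 0
                return job, task_id, True
        # No local task: let the job wait a little, giving later jobs a turn.
        if job.skips < MAX_SKIPS:
            job.skips += 1
            continue
        # The job has waited long enough; launch any task non-locally.
        if job.pending:
            task_id = next(iter(job.pending))
            return job, task_id, False
    return None
```
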

Patents
