How to build TiDB
PingCAP
About me
● Infrastructure engineer / CEO of PingCAP
● Working on open source projects: TiDB/TiKV
https://github.com/pingcap/tidb
https://github.com/pingcap/tikv
Email: liuqi@pingcap.com
Let’s say we want to build a NewSQL Database
● From the beginning
● What’s wrong with the existing DBs?
○ Relational databases
○ NoSQL
We have a key-value store (RocksDB)
● Good start, RocksDB is fast and stable.
○ Atomic batch write
○ Snapshot
● However… It’s a local embedded kv store.
○ Can’t tolerate machine failures
○ Scalability is limited by the capacity of a single machine's disk
Let’s fix Fault Tolerance
● Use Raft to replicate data
○ Key features of Raft
■ Strong leader: the leader does most of the work and issues all log updates
■ Leader election
■ Membership changes
● Implementation:
○ Ported from etcd
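To make the replication concrete, here is a minimal Go sketch of the write path once Raft sits in front of RocksDB. The LocalKV and RaftGroup interfaces are hypothetical and for illustration only; the real TiKV is written in Rust and its Raft implementation is ported from etcd. A write is proposed to the Raft group first, and each replica applies it to its own RocksDB only after the log entry is committed.

package raftkv

// LocalKV stands for the local RocksDB instance on one machine.
type LocalKV interface {
	Put(key, value []byte) error
}

// RaftGroup stands for one Raft replication group.
type RaftGroup interface {
	// Propose appends an entry to the Raft log and returns once a
	// quorum of replicas has committed it.
	Propose(entry []byte) error
}

// Store replicates every write through Raft before applying it, so a
// single machine failure no longer loses data.
type Store struct {
	raft RaftGroup
	db   LocalKV
}

func (s *Store) Put(key, value []byte) error {
	// 1. Replicate the write through the Raft log.
	if err := s.raft.Propose(encodePut(key, value)); err != nil {
		return err
	}
	// 2. Each replica applies the committed entry to its own RocksDB.
	return s.db.Put(key, value)
}

func encodePut(key, value []byte) []byte {
	// Toy encoding, just enough for the sketch.
	return append(append([]byte("put\x00"), key...), value...)
}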
Let’s fix Fault Tolerance
[Diagram: three machines, each running a local RocksDB instance, with the data replicated between them using Raft.]
That’s cool
● Basically we have a lite version of etcd or ZooKeeper.
○ Does not support the watch command and some other features
● Let’s make it better.
How about Scalability?
● What if we SPLIT data into many regions?
○ We got many Raft groups.
○ Region = Contiguous Keys
● Hash partitioning or Range partitioning
○ Redis: Hash partitioning
○ HBase: Range partitioning
That’s Cool, but...
● But what if we want to scan data?
○ How to support API: scan(startKey, endKey, limit)
● So, we need a globally ordered map
○ Can’t use hash partitioning
○ Use range partitioning
■ Region 1 -> [a - d]
■ Region 2 -> [e - h]
■ …
■ Region n -> [w - z]
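Because the key space is a globally ordered map, routing a key to its region is just a binary search over sorted region ranges. A minimal sketch in Go, with made-up Region/Locate names rather than TiKV's actual routing code:

package routing

import (
	"bytes"
	"sort"
)

// Region covers the key range [StartKey, EndKey).
type Region struct {
	ID       uint64
	StartKey []byte
	EndKey   []byte // empty means "up to +infinity"
}

// Locate returns the region that contains key. regions must be sorted
// by StartKey and must cover the whole key space.
func Locate(regions []Region, key []byte) Region {
	// Find the first region whose EndKey is greater than key.
	i := sort.Search(len(regions), func(i int) bool {
		end := regions[i].EndKey
		return len(end) == 0 || bytes.Compare(end, key) > 0
	})
	return regions[i]
}

A scan(startKey, endKey, limit) then simply starts at Locate(regions, startKey) and walks consecutive regions until it has collected enough rows.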
How to scale? (1/2)
● That’s simple
● Just Split && Move Region 1
[Diagram: Region 1 splits into Region 1 and Region 2, which can then be moved to different nodes.]
How to scale? (2/2)
● Raft comes to the rescue again
○ Using Raft membership changes, in 2 steps:
■ Add a new replica
■ Remove the old replica
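The two steps above can be sketched as follows, using a hypothetical scheduling interface (not the real TiKV/PD API). Adding the new replica before removing the old one means the group never drops below its replication target during the move.

package scheduler

// Cluster is an invented interface standing in for the real scheduling RPCs.
type Cluster interface {
	AddReplica(regionID, nodeID uint64) error    // Raft ConfChange: add a member
	RemoveReplica(regionID, nodeID uint64) error // Raft ConfChange: remove a member
}

// MoveRegion moves one replica of a region between nodes. Adding the new
// replica first keeps the replica count at or above the target for the
// whole move, so availability never drops.
func MoveRegion(c Cluster, regionID, from, to uint64) error {
	if err := c.AddReplica(regionID, to); err != nil {
		return err
	}
	return c.RemoveReplica(regionID, from)
}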
Scale-out (initial state)
[Diagram: four nodes (A-D) holding replicas of Regions 1-3; Node A holds the Region 1 leader, marked *.]
Scale-out (add new node)
1) Transfer leadership of Region 1 from Node A to Node B
[Diagram: Node E joins the cluster; the Region 1 leader is now on Node B.]
Scale-out (balancing)
2) Add a Region 1 replica on Node E
[Diagram: Node E now holds a new Region 1 replica, so Region 1 temporarily has an extra replica.]
Scale-out (balancing)
3) Remove the Region 1 replica from Node A
[Diagram: Node A no longer holds Region 1; the replicas are balanced across Nodes B-E.]
Now we have a distributed key-value store
● We want to keep replicas in different datacenters
○ For HA: any node might crash, even an entire data center
○ And to balance the workload
● So, we need the Placement Driver (PD) to act as the cluster manager, for:
○ Replication constraints
○ Data movement
Placement Driver
● Concept comes from Spanner
● Provide a god's-eye view of the whole cluster
● Store the metadata
○ Clients cache placement information.
● Maintain the replication constraint
○ 3 replicas, by default
● Data movement
○ For balancing the workload
● It’s a cluster too, of course.
○ Thanks to Raft.
[Diagram: three Placement Driver instances forming a Raft-replicated cluster.]
Placement Driver
● Rebalance without moving data.
○ Raft: Leadership transfer extension
● Moving data is a slow operation.
● We need fast rebalance.
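A rough sketch of why leadership transfer makes rebalancing fast, with invented types standing in for PD's real scheduling operators: transferring a leader is a single Raft message, while moving a replica copies a whole region.

package pd

// Hypothetical operators; PD's real scheduling logic is far richer.

// TransferLeader is cheap: one Raft leadership-transfer message, no data copied.
type TransferLeader struct{ RegionID, ToStore uint64 }

// MoveReplica is expensive: a whole region replica has to be copied.
type MoveReplica struct{ RegionID, FromStore, ToStore uint64 }

// balanceLeaders rebalances load without moving any data: every region
// whose leader sits on the overloaded store hands leadership to a store
// that already holds a replica of that region.
func balanceLeaders(leaders map[uint64][]uint64, overloaded, target uint64) []TransferLeader {
	var ops []TransferLeader
	for _, regionID := range leaders[overloaded] {
		ops = append(ops, TransferLeader{RegionID: regionID, ToStore: target})
	}
	return ops
}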
TiKV: The whole picture
[Diagram: a client talks over RPC to four TiKV nodes (Store1-Store4). Each store holds replicas of several regions (Regions 1-3 spread across the nodes), and the replicas of each region form a Raft group. The Placement Driver oversees the cluster.]
That’s Cool, but hold on...
● It would be even cooler if we had:
○ MVCC
○ ACID transactions
■ Transaction model: Google Percolator (2PC)
MVCC (Multi-Version Concurrency Control)
● Each transaction sees a snapshot of the database as of the transaction's start time; changes made by a transaction are not visible to other transactions until it commits.
● Data is tagged with versions
○ Key_version: value
● Lock-free snapshot reads
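A toy illustration of the "Key_version: value" idea (not TiKV's actual key encoding): each key keeps multiple timestamped versions, and a snapshot read at a transaction's start timestamp simply picks the newest version at or below that timestamp, with no locks taken.

package mvcc

import "fmt"

// store maps a user key to its versioned values, newest first. In TiKV
// the versions live in RocksDB as encoded key_version entries; a plain
// map keeps this sketch short.
type store map[string][]struct {
	ts    uint64
	value string
}

// Get returns the value visible to a snapshot taken at startTS: the
// newest version written at or before startTS. No locks are needed.
func (s store) Get(key string, startTS uint64) (string, bool) {
	for _, v := range s[key] { // newest first
		if v.ts <= startTS {
			return v.value, true
		}
	}
	return "", false
}

func Example() {
	s := store{"Bob": {{ts: 7, value: "$3"}, {ts: 5, value: "$10"}}}
	fmt.Println(s.Get("Bob", 6)) // prints "$10 true": version 7 is invisible at ts 6
}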
Transaction API style (go code)
txn := store.Begin() // start a transaction
txn.Set([]byte("key1"), []byte("value1"))
txn.Set([]byte("key2"), []byte("value2"))
err := txn.Commit() // commit the transaction
if err != nil {
	txn.Rollback()
}
I want to write code like this.
Transaction Model
● Inspired by Google Percolator
● 3 column families
○ cf:lock: an uncommitted transaction is writing this cell; stores the location of the primary lock
○ cf:write: stores the commit timestamp of the data
○ cf:data: stores the data itself
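A heavily simplified sketch of the Percolator-style two-phase commit over those column families, in Go with an invented KV interface (conflict checks, rollback and crash recovery are omitted): prewrite writes the data and a lock for every key, and writing the primary key's write record is the atomic commit point.

package percolator

// KV is an invented interface over the three column families.
type KV interface {
	PutData(key string, startTS uint64, value []byte) error
	PutLock(key string, startTS uint64, primary string) error
	DelLock(key string, startTS uint64) error
	PutWrite(key string, commitTS, startTS uint64) error // records "data @ startTS"
}

// Commit runs simplified 2PC for one write batch; keys[0] is the primary.
func Commit(kv KV, keys []string, values map[string][]byte, startTS, commitTS uint64) error {
	primary := keys[0]
	// Phase 1: prewrite -- write data and a lock for every key at startTS.
	for _, k := range keys {
		if err := kv.PutData(k, startTS, values[k]); err != nil {
			return err
		}
		if err := kv.PutLock(k, startTS, primary); err != nil {
			return err
		}
	}
	// Phase 2: commit the primary. Writing its write record is the
	// commit point: from here on the transaction is committed.
	if err := kv.PutWrite(primary, commitTS, startTS); err != nil {
		return err
	}
	if err := kv.DelLock(primary, startTS); err != nil {
		return err
	}
	// Secondaries can be committed lazily; errors ignored for brevity.
	for _, k := range keys[1:] {
		kv.PutWrite(k, commitTS, startTS)
		kv.DelLock(k, startTS)
	}
	return nil
}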
Transaction Model
Key | Bal: Data | Bal: Lock | Bal: Write
Bob | 6:        | 6:        | 6: data @ 5
    | 5: $10    | 5:        | 5:
Joe | 6:        | 6:        | 6: data @ 5
    | 5: $2     | 5:        | 5:
Bob wants to transfer $7 to Joe. Initial state: Bob has $10 and Joe has $2, both committed at timestamp 5.
Transaction Model
Key | Bal: Data | Bal: Lock       | Bal: Write
Bob | 7: $3     | 7: I am Primary | 7:
    | 6:        | 6:              | 6: data @ 5
    | 5: $10    | 5:              | 5:
Joe | 6:        | 6:              | 6: data @ 5
    | 5: $2     | 5:              | 5:
Prewrite: Bob's row is chosen as the primary; the new balance is written at start timestamp 7 and the row is locked.
Transaction Model
Key | Bal: Data | Bal: Lock            | Bal: Write
Bob | 7: $3     | 7: I am Primary      | 7:
    | 6:        | 6:                   | 6: data @ 5
    | 5: $10    | 5:                   | 5:
Joe | 7: $9     | 7: primary @ Bob.bal | 7:
    | 6:        | 6:                   | 6: data @ 5
    | 5: $2     | 5:                   | 5:
Prewrite: Joe's row is locked as a secondary; its lock points to the primary lock on Bob's row.
Transaction Model (commit point)
Key | Bal: Data | Bal: Lock            | Bal: Write
Bob | 8:        | 8:                   | 8: data @ 7
    | 7: $3     | 7: I am Primary      | 7:
    | 6:        | 6:                   | 6: data @ 5
    | 5: $10    | 5:                   | 5:
Joe | 8:        | 8:                   | 8: data @ 7
    | 7: $9     | 7: primary @ Bob.bal | 7:
    | 6:        | 6:                   | 6: data @ 5
    | 5: $2     | 5:                   | 5:
Commit point: write records at commit timestamp 8 are added, each pointing to the data written at timestamp 7.
Transaction Model
Key | Bal: Data | Bal: Lock            | Bal: Write
Bob | 8:        | 8:                   | 8: data @ 7
    | 7: $3     | 7:                   | 7:
    | 6:        | 6:                   | 6: data @ 5
    | 5: $10    | 5:                   | 5:
Joe | 8:        | 8:                   | 8: data @ 7
    | 7: $9     | 7: primary @ Bob.bal | 7:
    | 6:        | 6:                   | 6: data @ 5
    | 5: $2     | 5:                   | 5:
The primary lock on Bob's row is removed; Joe's secondary lock can be cleaned up lazily, since any reader that finds it can follow the pointer to the primary and see that the transaction committed.
Transaction Model
Key | Bal: Data | Bal: Lock | Bal: Write
Bob | 8:        | 8:        | 8: data @ 7
    | 7: $3     | 7:        | 7:
    | 6:        | 6:        | 6: data @ 5
    | 5: $10    | 5:        | 5:
Joe | 8:        | 8:        | 8: data @ 7
    | 7: $9     | 7:        | 7:
    | 6:        | 6:        | 6: data @ 5
    | 5: $2     | 5:        | 5:
Final state: all locks are cleared; readers at timestamp 8 or later see Bob with $3 and Joe with $9.
TiKV: Architecture overview (Logical)
Transaction
MVCC
RaftKV
Local KV Storage (RocksDB)
● Highly layered
● Using Raft for consistency and scalability
● No distributed file system
○ For better performance and lower latency
TiKV: Highly layered (API angle)
Transaction
MVCC
RaftKV
Local KV Storage (RocksDB)
get(key)
raft_get(key)
MVCC_get(key, ver)
txn_get(key, txn_start_ts)
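One way to read the layering: each layer is implemented purely in terms of the layer below it and narrows the API. A schematic in Go with invented interfaces (the real TiKV layers are written in Rust):

package layers

import "encoding/binary"

// Each layer exposes one call and implements it with the layer below.
type LocalKV interface{ Get(key []byte) ([]byte, error) }                  // RocksDB: get(key)
type RaftKV interface{ RaftGet(key []byte) ([]byte, error) }               // raft_get(key)
type MVCCKV interface{ MVCCGet(key []byte, ver uint64) ([]byte, error) }   // MVCC_get(key, ver)
type TxnKV interface{ TxnGet(key []byte, startTS uint64) ([]byte, error) } // txn_get(key, txn_start_ts)

type mvcc struct{ raft RaftKV }

// MVCCGet encodes the version into the key (a toy encoding) and reads
// the versioned entry through the Raft layer.
func (m mvcc) MVCCGet(key []byte, ver uint64) ([]byte, error) {
	verBytes := make([]byte, 8)
	binary.BigEndian.PutUint64(verBytes, ^ver) // invert so newer versions sort first
	return m.raft.RaftGet(append(append([]byte{}, key...), verBytes...))
}

type txn struct{ mvcc MVCCKV }

// TxnGet is a snapshot read: the value as of the transaction's start timestamp.
func (t txn) TxnGet(key []byte, startTS uint64) ([]byte, error) {
	return t.mvcc.MVCCGet(key, startTS)
}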
That’s really really Cool
● We have A Distributed Key-Value Database with
○ Geo-Replication / Auto Rebalance
○ ACID Transaction support
○ Horizontal Scalability
What if we support SQL?
● SQL is simple and very productive
● We want to write code like this:
SELECT COUNT(*) FROM user
WHERE age > 20 and age < 30;
And this...
BEGIN;
INSERT INTO person VALUES('tom', 25);
INSERT INTO person VALUES('jerry', 30);
COMMIT;
First of all, map table data to key value store
● What happens behind the scenes:
CREATE TABLE user (
id INT PRIMARY KEY,
name TEXT,
email TEXT
);
Mapping table data to kv store
Key    -> Value
user/1 -> dongxu | huang@pingcap.com
user/2 -> tom | tom@pingcap.com
...    -> ...
INSERT INTO user VALUES (1, "dongxu", "huang@pingcap.com");
INSERT INTO user VALUES (2, "tom", "tom@pingcap.com");
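A toy sketch of that row-to-key-value mapping (the real TiDB encoding is more elaborate, with table prefixes and typed column encoding; the key and value shapes below just mirror the slide):

package rowcodec

import (
	"fmt"
	"strings"
)

// User mirrors: CREATE TABLE user (id INT PRIMARY KEY, name TEXT, email TEXT);
type User struct {
	ID    int64
	Name  string
	Email string
}

// EncodeRow turns one row into a key-value pair keyed by primary key,
// matching the "user/1 => dongxu | huang@pingcap.com" shape above.
func EncodeRow(u User) (key, value string) {
	key = fmt.Sprintf("user/%d", u.ID)
	value = strings.Join([]string{u.Name, u.Email}, " | ")
	return key, value
}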
Secondary index is necessary
● Global index
○ All indexes in TiDB are transactional and fully consistent
○ Stored as separate key-value pairs in TiKV
● Keyed by a concatenation of the index prefix, the indexed value, and the primary key in TiKV
○ For example: table := {id, name}, where id is the primary key. To build an index on the name column, each row stores an extra kv pair; rows (1, 'tom') and (2, 'tom') produce:
■ name_index/tom_1 => nil
■ name_index/tom_2 => nil
○ For a unique index:
■ name_index/tom => 1
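Continuing the same toy encoding, the index entries above could be produced like this (illustrative key shapes only, not TiDB's real index encoding):

package rowcodec

import "fmt"

// NonUniqueIndexKey builds an entry for a non-unique index on name: the
// primary key is appended to the indexed value so duplicates stay distinct.
// Row (1, "tom") produces "name_index/tom_1" => nil.
func NonUniqueIndexKey(name string, id int64) string {
	return fmt.Sprintf("name_index/%s_%d", name, id)
}

// UniqueIndexEntry maps the indexed value straight to the primary key.
// "name_index/tom" => 1
func UniqueIndexEntry(name string, id int64) (key string, pk int64) {
	return fmt.Sprintf("name_index/%s", name), id
}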
Indexes alone are not enough...
● Can we push down filters?
○ select count(*) from person
where age > 20 and age < 30
● It should be much faster, maybe 100x
○ Fewer RPC round trips
○ Less data transferred
Predicate pushdown
[Diagram: the TiDB Server pushes the predicate "age > 20 and age < 30" down to TiKV Node1, Node2 and Node3, which hold Regions 1, 2 and 5.]
TiDB knows that Regions 1 / 2 / 5 store the data of the person table.
But TiKV doesn’t know the schema
● A key-value database doesn't have any information about tables and rows
● The coprocessor comes to help:
○ Concept comes from HBase
○ Inject your own logic into the data nodes
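Conceptually, the pushed-down work looks like this on each TiKV node: the coprocessor scans only the rows in its local regions, applies the filter, and returns a partial count that TiDB adds up. A sketch with invented types, not the real coprocessor API:

package coprocessor

// Row is a decoded row of the person table; the schema is sent along
// with the pushed-down request because TiKV itself is schema-less.
type Row struct {
	Name string
	Age  int
}

// CountWhere runs on the TiKV side, over one region's rows only.
func CountWhere(rows []Row, minAge, maxAge int) int {
	n := 0
	for _, r := range rows {
		if r.Age > minAge && r.Age < maxAge { // age > 20 AND age < 30
			n++
		}
	}
	return n
}

// MergeCounts runs on the TiDB side: the partial counts returned by
// Regions 1, 2 and 5 are simply summed to answer SELECT COUNT(*).
func MergeCounts(partials []int) int {
	total := 0
	for _, p := range partials {
		total += p
	}
	return total
}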
What about drivers for every language?
● We have to build drivers for Java, Python, PHP, C/C++, Rust, Go…
● That takes a lot of time and code.
○ Trust me, you don’t want to do that.
OR...
● We just build a protocol layer that is compatible with MySQL. Then we have
all the MySQL drivers.
○ All the tools
○ All the ORMs
○ All the applications
● That’s what TiDB does.
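Because the wire protocol is MySQL's, any stock MySQL driver can talk to TiDB. For example, with Go's database/sql and the go-sql-driver/mysql driver (assuming a TiDB instance listening on 127.0.0.1:4000, its default port, with the person table from the earlier slides):

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // a plain MySQL driver, nothing TiDB-specific
)

func main() {
	// TiDB speaks the MySQL protocol, so the DSN is an ordinary MySQL DSN.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var count int
	err = db.QueryRow("SELECT COUNT(*) FROM person WHERE age > 20 AND age < 30").Scan(&count)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("count:", count)
}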
Schema change in distributed RDBMS?
● A must-have feature!
● But you don’t want to lock the whole table while changing schema.
○ A distributed database usually stores tons of data spanning multiple machines
● We need a non-blocking schema change algorithm
● Thanks to F1 again
○ Similar to "Online, Asynchronous Schema Change in F1" (VLDB 2013, Google)
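The core idea of the F1 paper is that a schema element moves through intermediate states, and the cluster only ever runs two adjacent, mutually compatible states at the same time, so no global table lock is needed. A schematic of those states in Go (naming follows the paper, not TiDB's actual DDL code):

package ddl

// SchemaState follows the state progression described in
// "Online, Asynchronous Schema Change in F1" (VLDB 2013).
type SchemaState int

const (
	StateAbsent     SchemaState = iota // the element does not exist yet
	StateDeleteOnly                    // deletes are applied to it, but no inserts or reads
	StateWriteOnly                     // all writes are applied, but it is not readable yet
	StatePublic                        // fully usable, after backfill/reorganization finishes
)

// Servers may lag by at most one state, so any two states live in the
// cluster at once are adjacent and compatible; that is what makes the
// change non-blocking.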
Architecture (The whole picture)
[Diagram: applications talk to MySQL clients (e.g. JDBC), which speak the MySQL protocol to TiDB; TiDB talks to TiKV over RPC. TiDB corresponds to Google's F1, TiKV to Spanner.]
Testing
● Testing a distributed system is really hard
Embed testing into your design
● Design for testing
● Get tests from community
○ Lots of tests in MySQL drivers/connectors
○ Lots of ORMs
○ Lots of applications (record-and-replay)
And more
● Fault injection
○ Hardware
■ disk error
■ network card
■ cpu
■ clock
○ Software
■ file system
■ network & protocol
And more
● Simulate everything
○ Network example :
https://github.com/pingcap/tikv/pull/916/commits/3cf0f7248b32c3c523927eed5ebf82aabea481ec
Distributed testing
● Jepsen
● Namazu
○ ZooKeeper:
■ Found ZOOKEEPER-2212, ZOOKEEPER-2080 (race): (blog article)
○ Etcd:
■ Found etcdctl bug #3517 (timing specification), fixed in #3530. The fix also resulted in a hint for #3611
■ Reproduced flaky tests {#4006, #4039}
○ YARN:
■ Found YARN-4301 (fault tolerance), reproduced flaky tests {1978, 4168, 4543, 4548, 4556}
More to come
Distributed query plan - WIP
Change history (binlog) - WIP
Run TiDB on top of Kubernetes
Thanks
Q&A
https://github.com/pingcap/tidb
https://github.com/pingcap/tikv