How to build TiDB
- 2. About me
● Infrastructure engineer / CEO of PingCAP
● Working on open source projects: TiDB/TiKV
https://github.com/pingcap/tidb
https://github.com/pingcap/tikv
Email: liuqi@pingcap.com
- 3. Let’s say we want to build a NewSQL Database
● From the beginning
● What’s wrong with the existing DBs?
○ Relational databases
○ NoSQL
- 4. We have a key-value store (RocksDB)
● Good start, RocksDB is fast and stable.
○ Atomic batch write
○ Snapshot
● However… It’s a local embedded kv store.
○ Can’t tolerate machine failures
○ Scalability is limited by the capacity of a single disk
- 5. Let’s fix Fault Tolerance
● Use Raft to replicate data
○ Key features of Raft
■ Strong leader: the leader does most of the work and issues all log updates
■ Leader election
■ Membership changes
● Implementation:
○ Ported from etcd
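To make the write path concrete, here is a minimal Go sketch, assuming a hypothetical raftGroup type (this is not TiKV's real API, which is Rust ported from etcd's Raft): a write is proposed to the Raft leader, replicated, and only then applied to the local RocksDB on every replica.

package main

import "fmt"

// Proposal is a hypothetical log entry replicated through Raft.
type Proposal struct {
    Key, Value []byte
}

// raftGroup stands in for one Raft group: the leader appends proposals
// to its log, replicates them to the followers, and marks them
// committed once a majority of replicas have persisted them.
type raftGroup struct {
    committed chan Proposal
}

// Propose hands a write to the leader; in a real implementation this
// returns only after the entry is committed by a quorum.
func (g *raftGroup) Propose(p Proposal) {
    g.committed <- p // pretend quorum replication already happened
}

func main() {
    g := &raftGroup{committed: make(chan Proposal, 1)}
    g.Propose(Proposal{Key: []byte("key1"), Value: []byte("value1")})

    // Every replica applies committed entries, in log order, to its
    // local RocksDB instance.
    p := <-g.committed
    fmt.Printf("apply to RocksDB: set %s = %s\n", p.Key, p.Value)
}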
- 6. Let’s fix Fault Tolerance
[Diagram: Machine 1, Machine 2, and Machine 3 each run RocksDB, with the replicas connected by Raft]
- 7. That’s cool
● Basically, we have a lite version of etcd or ZooKeeper.
○ Does not support the watch command and some other features
● Let’s make it better.
- 8. How about Scalability?
● What if we SPLIT data into many regions?
○ We get many Raft groups.
○ Region = Contiguous Keys
● Hash partitioning or Range partitioning
○ Redis: Hash partitioning
○ HBase: Range partitioning
- 9. That’s Cool, but...
● But what if we want to scan data?
○ How to support API: scan(startKey, endKey, limit)
● So, we need a globally ordered map
○ Can’t use hash partitioning
○ Use range partitioning
■ Region 1 -> [a - d]
■ Region 2 -> [e - h]
■ …
■ Region n -> [w - z]
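A rough Go sketch of the routing that scan(startKey, endKey, limit) relies on, with made-up region boundaries: because regions are range partitions of a globally ordered key space, a scan only has to visit the regions whose ranges overlap [startKey, endKey).

package main

import "fmt"

// Region describes a contiguous key range [StartKey, EndKey).
type Region struct {
    Name, StartKey, EndKey string
}

// A globally ordered map of the key space (normally kept by the cluster manager).
var regions = []Region{
    {"Region 1", "a", "e"},
    {"Region 2", "e", "i"},
    {"Region n", "w", "zz"},
}

// regionsForRange returns every region overlapping [start, end), which
// is exactly the set scan(startKey, endKey, limit) has to visit, in key order.
func regionsForRange(start, end string) []Region {
    var out []Region
    for _, r := range regions {
        if r.StartKey < end && start < r.EndKey {
            out = append(out, r)
        }
    }
    return out
}

func main() {
    fmt.Println(regionsForRange("c", "f")) // spans Region 1 and Region 2
}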
- 10. How to scale? (1/2)
● That’s simple
● Just Split && Move Region 1
[Diagram: Region 1 is split into Region 1 and Region 2]
- 11. How to scale? (2/2)
● Raft comes to the rescue again
○ Using Raft membership changes, in 2 steps (sketch below):
■ Add a new replica
■ Destroy old region replica
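A toy Go sketch of the two steps, with hypothetical helper names (real TiKV issues Raft ConfChange entries for this):

package main

import "fmt"

// replicas maps each region to the stores currently holding a replica.
var replicas = map[string][]string{
    "Region 1": {"Node A", "Node B", "Node C"},
}

// addReplica is step 1: a Raft membership change adds a replica on the
// target store, which catches up from a snapshot plus the log.
func addReplica(region, store string) {
    replicas[region] = append(replicas[region], store)
}

// removeReplica is step 2: once the new replica is caught up, another
// membership change destroys the old replica.
func removeReplica(region, store string) {
    var kept []string
    for _, s := range replicas[region] {
        if s != store {
            kept = append(kept, s)
        }
    }
    replicas[region] = kept
}

func main() {
    addReplica("Region 1", "Node E")    // temporarily 4 replicas
    removeReplica("Region 1", "Node A") // back to 3, now on B, C, E
    fmt.Println(replicas["Region 1"])
}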
- 12. Scale-out (initial state)
[Diagram: Nodes A, B, C, and D; Regions 1, 2, and 3 each have 3 replicas spread across the nodes, with the Region 1 leader (Region 1*) on Node A]
- 13. Scale-out (add new node)
1) Transfer leadership of Region 1 from Node A to Node B
[Diagram: Node E joins the cluster; after the transfer, the Region 1 leader (Region 1*) is on Node B]
- 14. Scale-out (balancing)
2) Add Replica on Node E
[Diagram: a new Region 1 replica is created on Node E, so Region 1 temporarily has 4 replicas]
- 15. Scale-out (balancing)
3) Remove Replica from Node A
[Diagram: the Region 1 replica on Node A is removed, bringing Region 1 back to 3 replicas]
- 16. Now we have a distributed key-value store
● We want to keep replicas in different datacenters
○ For HA: any node might crash, even a whole datacenter
○ And to balance the workload
● So, we need a Placement Driver (PD) to act as the cluster manager, for:
○ Replication constraint
○ Data movement
- 17. Placement Driver
● Concept comes from Spanner
● Provide the God’s view of the whole cluster
● Store the metadata
○ Clients have cache of placement information.
● Maintain the replication constraint
○ 3 replicas, by default
● Data movement
○ For balancing the workload
● It’s a cluster too, of course.
○ Thanks to Raft.
[Diagram: three Placement Driver instances form their own Raft group]
- 18. Placement Driver
● Moving data is a slow operation, but we need fast rebalancing.
● So: rebalance without moving data (sketch below).
○ Raft: leadership transfer extension
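A simplified Go sketch of the preference this implies for PD; the store statistics and thresholds here are invented for illustration:

package main

import "fmt"

// Store is a simplified view PD keeps of one TiKV node.
type Store struct {
    Name      string
    LeaderCnt int // regions this store currently leads
    RegionCnt int // region replicas this store currently holds
}

// rebalance prefers the cheap operation: transferring a Raft leadership
// moves load instantly without copying data; moving a replica is only
// chosen when the replica counts themselves are skewed.
func rebalance(busy, idle *Store) string {
    if busy.LeaderCnt > idle.LeaderCnt+1 {
        busy.LeaderCnt--
        idle.LeaderCnt++
        return "transfer-leader" // fast rebalance, no data moved
    }
    if busy.RegionCnt > idle.RegionCnt+1 {
        busy.RegionCnt--
        idle.RegionCnt++
        return "move-replica" // slow: add a replica, then remove the old one
    }
    return "balanced"
}

func main() {
    a := &Store{Name: "Node A", LeaderCnt: 10, RegionCnt: 30}
    e := &Store{Name: "Node E", LeaderCnt: 2, RegionCnt: 28}
    fmt.Println(rebalance(a, e)) // transfer-leader
}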
- 20. That’s Cool, but hold on...
● It could be cooler if we had:
○ MVCC
○ ACID Transaction
■ Transaction model: Google Percolator (2PC)
- 21. MVCC (Multi-Version Concurrency Control)
● Each transaction sees a snapshot of the database as of the time it began; any changes made by
this transaction are not visible to other transactions until it commits.
● Data is tagged with versions
○ Key_version: value
● Lock-free snapshot reads
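A small Go sketch of the idea, using an in-memory map instead of RocksDB: each write appends a new (version, value) pair, and a snapshot read returns the newest version not newer than the reader's timestamp.

package main

import "fmt"

// versioned simulates "Key_version: value": each write stores a new
// (key, version) pair instead of overwriting the previous value.
type versioned struct {
    version uint64
    value   string
}

var store = map[string][]versioned{} // key -> versions, oldest first

func put(key, value string, ver uint64) {
    store[key] = append(store[key], versioned{ver, value})
}

// get returns the newest version of key whose version <= the reader's
// snapshot timestamp: a lock-free snapshot read.
func get(key string, snapshotTS uint64) (string, bool) {
    vs := store[key]
    for i := len(vs) - 1; i >= 0; i-- {
        if vs[i].version <= snapshotTS {
            return vs[i].value, true
        }
    }
    return "", false
}

func main() {
    put("k", "v1", 5)
    put("k", "v2", 8)
    fmt.Println(get("k", 6)) // "v1" true: the write at version 8 is invisible at ts 6
}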
- 22. Transaction API style (go code)
txn := store.Begin() // start a transaction
txn.Set([]byte("key1"), []byte("value1"))
txn.Set([]byte("key2"), []byte("value2"))
err := txn.Commit() // commit the transaction
if err != nil {
    txn.Rollback()
}
I want to write code like this.
- 23. Transaction Model
● Inspired by Google Percolator
● 3 column families
○ cf:lock: an uncommitted transaction is writing this cell; contains the location/pointer of
the primary lock
○ cf:write: stores the commit timestamp of the data
○ cf:data: stores the data itself
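A toy Go sketch of the Percolator-style two-phase commit over these three column families; the helper functions are invented for illustration and skip conflict checks, timestamp allocation, and failure recovery:

package main

import "fmt"

// cell holds the three Percolator column families for one key.
type cell struct {
    data  map[uint64]string // data  @ start_ts  -> value
    lock  map[uint64]string // lock  @ start_ts  -> primary key
    write map[uint64]uint64 // write @ commit_ts -> start_ts of the data
}

var table = map[string]*cell{}

func getCell(key string) *cell {
    if table[key] == nil {
        table[key] = &cell{map[uint64]string{}, map[uint64]string{}, map[uint64]uint64{}}
    }
    return table[key]
}

// prewrite is phase 1: write data and a lock for every key; one key is
// chosen as the primary, and every other lock points at it.
func prewrite(startTS uint64, primary string, writes map[string]string) {
    for k, v := range writes {
        c := getCell(k)
        c.data[startTS] = v
        c.lock[startTS] = primary
    }
}

// commit is phase 2: committing the primary (writing its write record
// and erasing its lock) is the atomic commit point; the secondaries
// can be cleaned up lazily afterwards.
func commit(startTS, commitTS uint64, keys []string) {
    for _, k := range keys {
        c := getCell(k)
        c.write[commitTS] = startTS // "commit_ts: data @ start_ts"
        delete(c.lock, startTS)
    }
}

func main() {
    // Bob transfers $7 to Joe: start_ts = 7, commit_ts = 8.
    prewrite(7, "Bob", map[string]string{"Bob": "$3", "Joe": "$9"})
    commit(7, 8, []string{"Bob", "Joe"})
    fmt.Println(table["Joe"].data[7], table["Joe"].write[8]) // $9 7
}

Committing the primary lock is the single atomic commit point; if the client dies after that, readers can use the primary's state to decide whether to roll the secondary locks forward or back.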
- 24. Transaction Model
Key  | Bal:Data | Bal:Lock | Bal:Write
Bob  | 6:       | 6:       | 6: data @ 5
     | 5: $10   | 5:       | 5:
Joe  | 6:       | 6:       | 6: data @ 5
     | 5: $2    | 5:       | 5:
Bob wants to transfer $7 to Joe
- 25. Transaction Model
Key  | Bal:Data | Bal:Lock        | Bal:Write
Bob  | 7: $3    | 7: I am Primary | 7:
     | 6:       | 6:              | 6: data @ 5
     | 5: $10   | 5:              | 5:
Joe  | 6:       | 6:              | 6: data @ 5
     | 5: $2    | 5:              | 5:
- 26. Transaction Model
Key  | Bal:Data | Bal:Lock           | Bal:Write
Bob  | 7: $3    | 7: I am Primary    | 7:
     | 6:       | 6:                 | 6: data @ 5
     | 5: $10   | 5:                 | 5:
Joe  | 7: $9    | 7: Primary@Bob.bal | 7:
     | 6:       | 6:                 | 6: data @ 5
     | 5: $2    | 5:                 | 5:
- 27. Transaction Model (commit point)
Key  | Bal:Data | Bal:Lock        | Bal:Write
Bob  | 8:       | 8:              | 8: data @ 7
     | 7: $3    | 7: I am Primary | 7:
     | 6:       | 6:              | 6: data @ 5
     | 5: $10   | 5:              | 5:
Joe  | 8:       | 8:              | 8: data @ 7
     | 7: $9    | 7: Primary@Bob  | 7:
     | 6:       | 6:              | 6: data @ 5
     | 5: $2    | 5:              | 5:
- 28. Transaction Model
Key  | Bal:Data | Bal:Lock       | Bal:Write
Bob  | 8:       | 8:             | 8: data @ 7
     | 7: $3    | 7:             | 7:
     | 6:       | 6:             | 6: data @ 5
     | 5: $10   | 5:             | 5:
Joe  | 8:       | 8:             | 8: data @ 7
     | 7: $9    | 7: Primary@Bob | 7:
     | 6:       | 6:             | 6: data @ 5
     | 5: $2    | 5:             | 5:
- 29. Transaction Model
Key  | Bal:Data | Bal:Lock | Bal:Write
Bob  | 8:       | 8:       | 8: data @ 7
     | 7: $3    | 7:       | 7:
     | 6:       | 6:       | 6: data @ 5
     | 5: $10   | 5:       | 5:
Joe  | 8:       | 8:       | 8: data @ 7
     | 7: $9    | 7:       | 7:
     | 6:       | 6:       | 6: data @ 5
     | 5: $2    | 5:       | 5:
- 30. TiKV: Architecture overview (Logical)
Layers (top to bottom): Transaction → MVCC → RaftKV → Local KV Storage (RocksDB)
● Highly layered
● Using Raft for consistency and
scalability
● No distributed file system
○ For better performance and lower
latency
- 31. TiKV: Highly layered (API angle)
Transaction: txn_get(key, txn_start_ts)
MVCC: MVCC_get(key, ver)
RaftKV: raft_get(key)
Local KV Storage (RocksDB): get(key)
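A tiny Go sketch of how the call stack above composes, using the function names from the slide (the bodies are placeholders, not TiKV's real logic):

package main

import "fmt"

// Each layer only talks to the layer directly below it.

func rocksdbGet(key string) string { // Local KV Storage (RocksDB)
    return "value-of-" + key
}

func raftGet(key string) string { // RaftKV: a consistent read through the Raft leader
    return rocksdbGet(key)
}

func mvccGet(key string, ver uint64) string { // MVCC: read the newest version <= ver
    return raftGet(fmt.Sprintf("%s_%d", key, ver))
}

func txnGet(key string, txnStartTS uint64) string { // Transaction: read at the txn's snapshot
    return mvccGet(key, txnStartTS)
}

func main() {
    fmt.Println(txnGet("key1", 42))
}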
- 32. That’s really really Cool
● We have a distributed key-value database with
○ Geo-Replication / Auto Rebalance
○ ACID Transaction support
○ Horizontal Scalability
- 33. What if we support SQL?
● SQL is simple and very productive
● We want to write code like this:
SELECT COUNT(*) FROM user
WHERE age > 20 AND age < 30;
- 35. First of all, map table data to the key-value store
● What happens behind the scenes:
CREATE TABLE user (
id INT PRIMARY KEY,
name TEXT,
email TEXT
);
- 36. Mapping table data to kv store
Key      Value
user/1   dongxu | huang@pingcap.com
user/2   tom | tom@pingcap.com
...      ...
INSERT INTO user VALUES (1, "dongxu", "huang@pingcap.com");
INSERT INTO user VALUES (2, "tom", "tom@pingcap.com");
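A minimal Go sketch of this mapping, following the simplified user/<id> key layout on the slide (TiDB's real encoding uses table IDs and a typed row format):

package main

import "fmt"

// encodeRow maps one row of the user table onto a single kv pair: the
// primary key goes into the key, the remaining columns into the value.
func encodeRow(id int, name, email string) (key, value string) {
    key = fmt.Sprintf("user/%d", id)
    value = fmt.Sprintf("%s | %s", name, email)
    return
}

func main() {
    k, v := encodeRow(1, "dongxu", "huang@pingcap.com")
    fmt.Println(k, "=>", v) // user/1 => dongxu | huang@pingcap.com
}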
- 37. Secondary index is necessary
● Global index
○ All indexes in TiDB are transactional and fully consistent
○ Stored as separate key-value pairs in TiKV
● Keyed by a concatenation of the index prefix and the primary key in TiKV (sketch below)
○ For example: table := {id, name}, where id is the primary key. To build an index on the name
column, for a row r := (1, 'tom') we store another kv pair (and likewise for a row (2, 'tom')):
■ name_index/tom_1 => nil
■ name_index/tom_2 => nil
○ For a unique index
■ name_index/tom => 1
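A small Go sketch of the index keys described above; the name_index prefix and the encoding are illustrative, not TiDB's actual format:

package main

import "fmt"

// indexKey builds the key for a non-unique index on name: the indexed
// value and the primary key are concatenated, so duplicate names stay
// distinct and a range scan by name still works. The value is nil.
func indexKey(name string, id int) string {
    return fmt.Sprintf("name_index/%s_%d", name, id)
}

// uniqueIndexKey builds the key for a unique index: the indexed value
// alone forms the key, and the primary key is stored as the value.
func uniqueIndexKey(name string, id int) (key string, value int) {
    return "name_index/" + name, id
}

func main() {
    fmt.Println(indexKey("tom", 1)) // name_index/tom_1
    k, v := uniqueIndexKey("tom", 1)
    fmt.Println(k, "=>", v) // name_index/tom => 1
}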
- 38. An index alone is just not enough...
● Can we push down filters?
○ select count(*) from person
where age > 20 and age < 30
● It should be much faster, maybe 100x
○ Fewer RPC round trips
○ Less data transferred
- 39. Predicate pushdown
[Diagram: the TiDB Server pushes the predicate "age > 20 and age < 30" down to TiKV Node 1, Node 2, and Node 3, which hold Regions 1, 2, and 5; TiDB knows that Regions 1 / 2 / 5 store the data of the person table]
- 40. But TiKV doesn’t know the schema
● A key-value database doesn't have any information about tables and rows
● The coprocessor comes to help (sketch below):
○ Concept comes from HBase
○ Inject your own logic into the data nodes
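A toy Go sketch of what the pushed-down COUNT(*) looks like on the TiKV side once the coprocessor knows the schema: each region evaluates the predicate locally and returns only a partial count for TiDB to sum.

package main

import "fmt"

// Row is what the coprocessor decodes inside a region, using the
// schema information TiDB ships along with the request.
type Row struct{ Age int }

// copCount runs on each TiKV node: it filters locally and returns a
// single number instead of shipping every matching row back.
func copCount(rows []Row) int {
    n := 0
    for _, r := range rows {
        if r.Age > 20 && r.Age < 30 { // pushed-down predicate
            n++
        }
    }
    return n
}

func main() {
    // TiDB sends the same request to every region of the person table
    // and just adds up the partial counts.
    region1 := []Row{{18}, {25}, {29}}
    region2 := []Row{{40}, {22}}
    fmt.Println(copCount(region1) + copCount(region2)) // 3
}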
- 41. What about drivers for every language?
● We have to build drivers for Java, Python, PHP, C/C++, Rust, Go…
● It needs lots of time and code.
○ Trust me, you don’t want to do that.
- 42. OR...
● We just build a protocol layer that is compatible with MySQL. Then we have
all the MySQL drivers.
○ All the tools
○ All the ORMs
○ All the applications
● That’s what TiDB does.
- 43. Schema change in distributed RDBMS?
● A must-have feature!
● But you don’t want to lock the whole table while changing schema.
○ A distributed database usually stores tons of data spanning multiple machines
● We need a non-blocking schema change algorithm
● Thanks to F1 again
○ Similar to "Online, Asynchronous Schema Change in F1" (VLDB 2013, Google)
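A compact Go sketch of the state progression from the F1 paper for adding an index; the state names come from the paper, and the lease-waiting loop is simplified:

package main

import "fmt"

func main() {
    // State sequence for adding an index. Servers are never more than
    // one state apart, so no table-wide lock is required:
    //   delete-only: deletes maintain the new index, inserts don't yet
    //   write-only:  all writes maintain the index, reads still ignore it
    //   public:      after backfilling old rows, reads can use the index
    states := []string{"absent", "delete-only", "write-only", "public"}
    for _, s := range states {
        fmt.Println("wait one schema lease, then move every server to:", s)
    }
}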
- 44. Architecture (The whole picture)
[Diagram: Applications and MySQL clients (e.g. JDBC) connect to TiDB via the MySQL protocol; TiDB talks to TiKV over RPC. TiDB plays the role of F1, and TiKV plays the role of Spanner]
- 46. Embed testing into your design
● Design for testing
● Get tests from community
○ Lots of tests in MySQL drivers/connectors
○ Lots of ORMs
○ Lots of applications (record & replay)
- 47. And more
● Fault injection
○ Hardware
■ disk error
■ network card
■ cpu
■ clock
○ Software
■ file system
■ network & protocol
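One lightweight way to inject software faults is to hide a component behind an interface and wrap it with a failing decorator; a Go sketch with invented types (TiKV's real fault injection works at a lower level than this):

package main

import (
    "errors"
    "fmt"
)

// KV is the interface the rest of the system depends on.
type KV interface {
    Put(key, value string) error
}

type memKV struct{ m map[string]string }

func (s *memKV) Put(k, v string) error { s.m[k] = v; return nil }

// faultyKV injects an I/O error on demand, so tests can exercise the
// error-handling paths without a real disk or network failure.
type faultyKV struct {
    KV
    failNext bool
}

func (f *faultyKV) Put(k, v string) error {
    if f.failNext {
        f.failNext = false
        return errors.New("injected: disk error")
    }
    return f.KV.Put(k, v)
}

func main() {
    kv := &faultyKV{KV: &memKV{m: map[string]string{}}}
    kv.failNext = true
    fmt.Println(kv.Put("a", "1")) // injected: disk error
    fmt.Println(kv.Put("a", "1")) // <nil>
}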
- 48. And more
● Simulate everything
○ Network example: https://github.com/pingcap/tikv/pull/916/commits/3cf0f7248b32c3c523927eed5ebf82aabea481ec
- 49. Distributed testing
● Jepsen
● Namazu
○ ZooKeeper:
■ Found ZOOKEEPER-2212, ZOOKEEPER-2080 (race): (blog article)
○ Etcd:
■ Found etcdctl bug #3517 (timing specification), fixed in #3530. The fix also resulted in a hint of #3611
■ Reproduced flaky tests {#4006, #4039}
○ YARN:
■ Found YARN-4301 (fault tolerance), reproduced flaky tests {1978, 4168, 4543, 4548, 4556}