Need for Time Series Database
Pramit Choudhary, ML Engineer @eHarmony
Motivation
Speed Matters
We want to know what's happening NOW
Users access data through different mobile platforms and have no patience
Data is scattered around
MongoDB, Voldemort, Netezza, Hive, Whisper, maybe more
For cross-platform analytical work, data still has to be moved around (a cause for worry)
Need to simplify the database tech stack
Complexity increases as we start tracking more metrics for mobile devices
Data-Analytics Use-cases:
Most of the time we study data patterns over a period of time
e.g. 1. What are the probable times for a user to get matches? => need to start tracking the amount of time a user spends during the day
2. Feature exploration and extraction: what other features could we possibly use? => more statistical tests (t, F, z, p-value), probably?
Re-CAP
Consistency: Data remains consistent after an operation executes, e.g. after an update, all clients see the same state of the data.
Availability: Always on (no downtime)
Partition Tolerance: The system continues to function even when nodes cannot communicate with one another
Different Combinations
CA: Single-site cluster, all nodes are always in contact, e.g. SQL-type RDBMS
CP: Some data may not be accessible, but the rest is consistent and accurate, e.g. MongoDB, HBase, Redis
AP: Available under partitioning, but no guarantee of consistency, e.g. Cassandra, Riak, DynamoDB
NoSQL World
• Key-Value Store (Redis, Riak)
• Document Store (MongoDB, Couchbase)
• Column Store (Cassandra, HBase, OpenTSDB)
• Graph Store (Neo4j)
Introducing a new DB
OpenTSDB
Author: Benoit Sigoure @ StumbleUpon
What is OpenTSDB?
Open Source Time Series Database
Store trillions of data points
Sucks up all data and keeps going
Never loses precision
Scales using HBase
Note: OpenTSDB is used here as an example; KairosDB or InfluxDB may give better results.
They work on similar principles.
Author: Benoit Sigoure and Chris Larsen
Use-Cases
MongoDB and Couchbase : user profiles, product catalogs,
geospatial, financial products, social media, digital
content, gaming, metadata, events, bills and invoices
HBase and Cassandra: structured, semi-structured and unstructured data, full table scans, read-intensive operations, time series interval data, geospatial data
Other Options
Author: Oliver Hankeln
What are Time Series?
Time Series: Data points for an identity over time
Typical Identity:
Dotted string: web01.sys.cpu.user.0 ( no concept of filters )
OpenTSDB Identity:
Metric: sys.cpu.user
Tags (name/value pairs): act as filters
host=web01 cpu=0
Author: Benoit Sigoure and Chris Larsen
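The difference between the two identities can be sketched in a few lines of Python. This is only an illustration, not OpenTSDB code: the `Series` type and the `from_dotted` and `select` helpers are hypothetical, and the sketch assumes the first dotted component is the host and the last is the CPU index.

```python
from typing import Dict, List, NamedTuple


class Series(NamedTuple):
    metric: str
    tags: Dict[str, str]


def from_dotted(identity: str) -> Series:
    """Split a dotted identity such as 'web01.sys.cpu.user.0'.

    Assumption for this sketch: first component = host, last = cpu index.
    """
    parts = identity.split(".")
    host, cpu = parts[0], parts[-1]
    metric = ".".join(parts[1:-1])
    return Series(metric, {"host": host, "cpu": cpu})


def select(series: List[Series], metric: str, **wanted: str) -> List[Series]:
    """Return the series of `metric` whose tags match every wanted pair."""
    return [
        s for s in series
        if s.metric == metric
        and all(s.tags.get(name) == value for name, value in wanted.items())
    ]


all_series = [from_dotted(name) for name in (
    "web01.sys.cpu.user.0",
    "web01.sys.cpu.user.1",
    "web02.sys.cpu.user.0",
)]

# With tags, "all CPUs of web01" and "CPU 0 on every host" are both simple filters.
web01 = select(all_series, "sys.cpu.user", host="web01")
cpu0 = select(all_series, "sys.cpu.user", cpu="0")
```

With the dotted string, the same questions need string matching on positions inside the name; with tags, each dimension can be filtered independently.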
What are Time Series?
Data Point:
Metric + Tags
+ Value: 42
+ Timestamp: 1234567890
"sys.cpu.user 1234567890 42 host=web01 cpu=0"
Author: Benoit Sigoure and Chris Larsen
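As a rough sketch, the data point line above can be parsed as follows. The `DataPoint` type and both helper functions are hypothetical names for this example. The dictionary layout follows the metric, timestamp, value and tags fields that OpenTSDB's HTTP `/api/put` endpoint accepts as JSON, to the best of my knowledge; nothing is sent over the network here.

```python
from typing import Dict, NamedTuple


class DataPoint(NamedTuple):
    metric: str
    timestamp: int  # seconds since the Unix epoch
    value: float
    tags: Dict[str, str]


def parse_point(line: str) -> DataPoint:
    """Parse 'metric timestamp value tag=value [tag=value ...]'."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected: metric timestamp value tag=value [...]")
    metric, timestamp, value, *tag_fields = fields
    # Each tag field must look like name=value; dict() raises ValueError otherwise.
    tags = dict(field.split("=", 1) for field in tag_fields)
    return DataPoint(metric, int(timestamp), float(value), tags)


def to_json_body(point: DataPoint) -> dict:
    """Build a dict with the four fields of a data point (assumed put format)."""
    return {
        "metric": point.metric,
        "timestamp": point.timestamp,
        "value": point.value,
        "tags": dict(point.tags),
    }


point = parse_point("sys.cpu.user 1234567890 42 host=web01 cpu=0")
body = to_json_body(point)
```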
Architecture
Author: Benoit Sigoure and Chris Larsen
Another View
Author: slideshare
About TSDs
Write throughput
TSDs are CPU bound
Worst case: can handle 2,000 points/sec on an old 2006 dual-core CPU
Read throughput
Depends on the cardinality of a metric
And on the timespan and number of data points retrieved
Reliability
No single point of failure: there is no master daemon
Dependency: needs HBase with ZooKeeper
Has a single point of failure when running over HDFS, but none with respect to the database itself.
More info on the Wiki : http://opentsdb.net/faq.html
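To make the read-throughput point concrete, here is a back-of-the-envelope estimate of how many rows a query touches. It is a sketch under assumptions, not a measurement: it assumes one row per series per hour (the 1-hour normalization described in the row-key slide of this deck), treats cardinality as the product of distinct values per tag (an upper bound), and uses made-up tag counts. Both function names are hypothetical.

```python
import math
from typing import Dict

ROW_SPAN = 3600  # seconds covered by one row (assumed: 1 hour)


def cardinality(distinct_values_per_tag: Dict[str, int]) -> int:
    """Upper bound on the number of series for one metric."""
    return math.prod(distinct_values_per_tag.values())


def rows_scanned(series: int, start: int, end: int) -> int:
    """Rows touched when reading [start, end) seconds for `series` series."""
    first_row = start // ROW_SPAN
    last_row = (end - 1) // ROW_SPAN
    return series * (last_row - first_row + 1)


# Made-up example: 50 hosts with 8 CPUs each, queried over one day.
series = cardinality({"host": 50, "cpu": 8})
rows = rows_scanned(series, start=0, end=24 * 3600)
```

Doubling either the number of tag combinations or the timespan doubles the rows read, which is why both show up in the read-throughput bullets.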
Simplistic View of the Table
Without OpenTSDB HBase Table Representation
Author: Oliver Hankeln
OpenTSDB Magic
“Compact columns by concatenation”
Author: Oliver Hankeln
• Tags are put at the end of the row key
• Timestamp is normalized on 1hr boundaries
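The two bullets above can be sketched as a row-key builder. This is an illustration under assumptions: the UID tables are invented for the example, the 3-byte UIDs and 4-byte base timestamp are OpenTSDB's default widths as I understand them, and tags are ordered by name here purely to keep the sketch simple.

```python
import struct
from typing import Dict, Tuple

HOUR = 3600

# Hypothetical UID assignments, made up for this example.
METRIC_UIDS = {"sys.cpu.user": 1}
TAGK_UIDS = {"cpu": 1, "host": 2}
TAGV_UIDS = {"0": 1, "web01": 2}


def uid(number: int) -> bytes:
    """Encode a UID as 3 big-endian bytes (assumed default width)."""
    return number.to_bytes(3, "big")


def row_key(metric: str, timestamp: int, tags: Dict[str, str]) -> Tuple[bytes, int]:
    """Return (row key, offset in seconds within the hour)."""
    base = timestamp - timestamp % HOUR  # normalize to the 1-hour boundary
    offset = timestamp - base            # goes into the column, not the key
    key = uid(METRIC_UIDS[metric]) + struct.pack(">I", base)
    for name in sorted(tags):            # tags go at the end of the key
        key += uid(TAGK_UIDS[name]) + uid(TAGV_UIDS[tags[name]])
    return key, offset


tags = {"host": "web01", "cpu": "0"}
key, offset = row_key("sys.cpu.user", 1234567890, tags)
same_hour_key, _ = row_key("sys.cpu.user", 1234567891, tags)
```

Every point of a series that falls in the same hour shares one row key, so the key size is 3 + 4 + 6 bytes per tag under these assumptions, and only the in-hour offset varies per column.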
Row Key Size
Author: Oliver Hankeln
Benchmarks
Load Phase
Heavy Read
Heavy Read
Heavy Range Scan
Heavy Inserts
Is it being extensively used?
OVH (#3 largest cloud/hosting provider): monitors everything, including network performance, resource utilization, application performance and customer-facing metrics
35 servers, 100k writes/s, 25 TB raw data
5-day moving window of HBase snapshots
Redis cache on top for customer-facing data
Yahoo: monitoring application performance and statistics (15 servers, 280k writes/s)
Arista Networks: high-performance network monitoring
5k writes/s, uses Varnish for caching
MapR
“OpenTSDB is a widely used database intended to store
and analyze time-series data. Originally designed for
only data center monitoring, poor ingest performance
had limited the expansion of its use. This benchmark
demonstrates a viable option for new applications, such
as IoT and other real-time data-analysis applications,
using OpenTSDB running on MapR.” Ted Dunning, Chief
Application Architect
Others
Some References
Book: Time Series Databases, by Ted Dunning and Ellen Friedman (https://www.dropbox.com/s/c1zj0l0q0qmfvo8/Time_Series_Databases.pdf?dl=0)
Benchmarks: https://www.dropbox.com/s/g67yoxwabwb5s0g/PerformanceBenchMark.pdf?dl=0
Lessons learned: http://www.slideshare.net/cloudera/4-opentsdb-hbasecon
Some comparisons: http://prometheus.io/docs/introduction/comparison/
Demo
Questions?


Editor's Notes

  1. HBase has a clear lead in writes: with pre-created regions it showed up to 40K ops/sec. Cassandra also performs noticeably well during the loading phase, at around 15K ops/sec. MySQL Cluster can show much higher numbers in “just in-memory” mode.
  2. Deferred log flush does the right job for HBase during mutation operations: edits are committed to the memstore first, and the aggregated edits are then flushed to the HLog asynchronously. Cassandra has great write throughput because writes are first appended to the commit log, which is a fast operation. MongoDB’s latency suffers from its global write lock. Riak behaves more stably than MongoDB.