Need for Time series Database

Pramit Choudhary, ML Engineer @eHarmony

Motivation
Speed Matters
We want to know what's happening NOW
Users access data through different mobile platforms and have no patience
Data is scattered around
MongoDB, Voldemort, Netezza, Hive, Whisper, maybe more
For cross-platform analytical work, data is still moved around (a cause for worry)
Need to simplify the database tech stack
Complexity increases as we start tracking more metrics for mobile devices
Data-analytics use cases:
Most of the time we study data patterns over a period of time
e.g. 1. What are the probable times for a user to get matches? => need to start tracking
the amount of time the user spends during the day
2. Feature exploration and extraction: what other features could we possibly use?
=> probably more t/f/z/p statistical tests?

Re-CAP
Consistency: Data remains consistent after the execution
of an operation, e.g. after an update all clients see the same
state of the data.
Availability: Always on (no downtime)
Partition Tolerance: System continues to function even
when some nodes cannot communicate with one another

Different Combinations
CA: Single-site cluster, all nodes are always in contact, e.g.
SQL-type RDBMS
CP: Some data may not be accessible, but the rest is
consistent and accurate, e.g. MongoDB, HBase, Redis
AP: Available under partitioning, but no guarantee on
consistency, e.g. Cassandra, Riak, DynamoDB

NoSQL World
• Key-Value Store (Redis, Riak)
• Document Store (MongoDB, Couchbase)
• Column Store (Cassandra, HBase, OpenTSDB)
• Graph Store (Neo4j)

Introducing a new DB
OpenTSDB
Author: Benoit Sigoure @ StumbleUpon

What is OpenTSDB?
Open Source Time Series Database
Store trillions of data points
Sucks up all data and keeps going
Never loses precision
Scales using HBase
Note: Using this as an example, better results with KairosDB or InfluxDB.
They work on similar principles.
Author: Benoit Sigoure and Chris Larsen

Use-Cases
MongoDB and Couchbase: user profiles, product catalogs,
geospatial, financial products, social media, digital
content, gaming, metadata, events, bills and invoices
HBase and Cassandra: structured, semi-structured and
unstructured data, full table scans, read-intensive
operations, time series interval data, geospatial data

Other Options
Author: Oliver Hankeln

What are Time Series?
Time Series: Data points for an identity over time
Typical Identity:
Dotted string: web01.sys.cpu.user.0 ( no concept of filters )
OpenTSDB Identity:
Metric: sys.cpu.user
Tags (name/value pairs): act as filters
host=web01 cpu=0
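
A minimal sketch of why tags work as filters, assuming a plain in-memory list of series identities. `SeriesId`, `series_id` and `matches` are hypothetical helper names for illustration, not part of OpenTSDB. With a dotted string such as web01.sys.cpu.user.0 the host and CPU are baked into the name; with metric + tags they can be selected independently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesId:
    """Identity of one time series: a metric name plus tag name/value pairs."""
    metric: str
    tags: tuple  # sorted (name, value) pairs, kept as a tuple so it is hashable


def series_id(metric, **tags):
    # Hypothetical helper: build an identity from keyword tags.
    return SeriesId(metric, tuple(sorted(tags.items())))


def matches(sid, metric, **wanted):
    """True if the series has this metric and every requested tag value."""
    have = dict(sid.tags)
    return sid.metric == metric and all(have.get(k) == v for k, v in wanted.items())


all_series = [
    series_id("sys.cpu.user", host="web01", cpu="0"),
    series_id("sys.cpu.user", host="web01", cpu="1"),
    series_id("sys.cpu.user", host="web02", cpu="0"),
]

# Filter on one tag at a time without touching the metric name.
web01 = [s for s in all_series if matches(s, "sys.cpu.user", host="web01")]
cpu0 = [s for s in all_series if matches(s, "sys.cpu.user", cpu="0")]
```

Here `web01` holds both CPUs of one host and `cpu0` holds the first CPU of every host, all under the single metric sys.cpu.user.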

What are Time Series?
Data Point:
Metric + Tags
+ Value: 42
+ Timestamp: 123
"sys.cpu.user 1234567890 42 host=web01 cpu=0"
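
The line above can be split back into its four parts. This is a rough sketch, assuming whitespace-separated fields in the order metric, timestamp, value, tags, with at least one tag; `DataPoint` and `parse_point` are hypothetical names and the value is simply read as a float.

```python
from typing import NamedTuple


class DataPoint(NamedTuple):
    metric: str
    timestamp: int
    value: float
    tags: dict


def parse_point(line):
    """Parse '<metric> <timestamp> <value> <tag=value> ...' into a DataPoint."""
    parts = line.split()
    if len(parts) < 4:
        raise ValueError("expected metric, timestamp, value and at least one tag")
    metric, timestamp, value, *tag_parts = parts
    tags = {}
    for part in tag_parts:
        name, sep, tag_value = part.partition("=")
        if not sep or not name or not tag_value:
            raise ValueError(f"malformed tag: {part!r}")
        tags[name] = tag_value
    return DataPoint(metric, int(timestamp), float(value), tags)


point = parse_point("sys.cpu.user 1234567890 42 host=web01 cpu=0")
```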

Architecture

Another View
Author: slideshare

About TSDs
Write throughput
TSDs are CPU bound
Worst case: can handle 2000 points/sec on an old 2006 dual-core CPU
Read throughput
Depends on the cardinality of a metric
and on the timespan and number of data points retrieved
Reliability
No single point of failure; no concept of a master daemon
Dependency: needs HBase with ZooKeeper
Has a single point of failure if running over HDFS, but none with
respect to the database.
More info on the Wiki : http://opentsdb.net/faq.html

Simplistic View of the Table
HBase table representation without OpenTSDB

OpenTSDB Magic
“Compact columns by concatenation”
• Tags are put at the end of the row key
• Timestamp is normalized on 1hr boundaries
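
A hedged sketch of the two bullets above: the timestamp is rounded down to the hour and the tags go at the end of the row key. The 3-byte UID width, the 4-byte base time, the UID values in the lookup tables and the function names are all assumptions for illustration, not read from a real cluster. The seconds left over within the hour are returned separately, on the assumption that they identify the column inside the row.

```python
UID_WIDTH = 3        # assumed number of bytes per metric/tag UID
ROW_SPAN = 3600      # one row per hour, in seconds

# Hypothetical UID assignments (a real system would look these up).
METRIC_UIDS = {"sys.cpu.user": 1}
TAGK_UIDS = {"host": 1, "cpu": 2}
TAGV_UIDS = {"web01": 1, "0": 2}


def uid_bytes(uid):
    return uid.to_bytes(UID_WIDTH, "big")


def row_key(metric, timestamp, tags):
    """Return (row key bytes, offset in seconds from the hour boundary)."""
    base_time = timestamp - (timestamp % ROW_SPAN)  # normalize to the hour
    key = uid_bytes(METRIC_UIDS[metric]) + base_time.to_bytes(4, "big")
    # Tags are appended last, in a stable order, as (name UID, value UID) pairs.
    for k_uid, v_uid in sorted((TAGK_UIDS[k], TAGV_UIDS[v]) for k, v in tags.items()):
        key += uid_bytes(k_uid) + uid_bytes(v_uid)
    return key, timestamp - base_time


tags = {"host": "web01", "cpu": "0"}
key, offset = row_key("sys.cpu.user", 1234567890, tags)
same_hour_key, _ = row_key("sys.cpu.user", 1234567000, tags)
```

Under these assumptions the key is 3 + 4 + 6 bytes per tag long (19 bytes for two tags), and both timestamps land in the same row because they share the hour starting at 1234566000.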

Row Key Size

Is it being extensively used?
OVH: #3 largest cloud/hosting provider: monitors
everything, including network performance, resource
utilization, application performance, customer-facing
metrics
35 servers, 100k writes/s, 25 TB raw data
5-day moving window of HBase snapshots
Redis cache on top for customer-facing data

Yahoo: Monitoring application performance and
statistics (15 servers, 280k writes/s)
Arista Networks: High-performance network
monitoring
5k writes/s, uses Varnish for caching
MapR
“OpenTSDB is a widely used database intended to store
and analyze time-series data. Originally designed for
only data center monitoring, poor ingest performance
had limited the expansion of its use. This benchmark
demonstrates a viable option for new applications, such
as IoT and other real-time data-analysis applications,
using OpenTSDB running on MapR.” Ted Dunning, Chief
Application Architect

Some References
Book: Time Series Databases – Ted Dunning and Ellen Friedman
(https://www.dropbox.com/s/c1zj0l0q0qmfvo8/Time_Series_Databases.pdf?dl=0)
Benchmarks:
https://www.dropbox.com/s/g67yoxwabwb5s0g/PerformanceBenchMark.pdf?dl=0
Lessons learned:
http://www.slideshare.net/cloudera/4-opentsdb-hbasecon
Some comparisons:
http://prometheus.io/docs/introduction/comparison/
