Presented at Cassandra London (April 7, 2014): the challenges of time-series storage and analytics in OpenNMS, with an introduction to Newts, a new Cassandra-based time-series data store.
5. OpenNMS: What It Is
● Network Management System
○ Discovery and Provisioning
○ Service monitoring
○ Data collection
○ Event management and notifications
● Java, open source, GPLv3
● Since 1999
7. RRDTool
● Round robin database
● First released 1999
● Time-series storage
● File-based
● Constant-size
● Automatic, amortized aggregation
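The round-robin, constant-size model above is easiest to see in code. Below is a minimal Java sketch of the idea — not RRDTool's actual on-disk format — showing fixed-size archives that wrap around, with incoming samples consolidated into coarser averages at write time (the "automatic, amortized aggregation"). Archive sizes and names are illustrative assumptions.

```java
// Minimal sketch of the round-robin model behind RRDTool: fixed-size
// archives that wrap around (storage never grows) and consolidate
// samples at write time. Sizes and names are illustrative, not RRDTool's
// actual format.
public class RoundRobinSketch {
    private final double[] raw = new double[288];   // e.g. one day of 5-minute samples
    private final double[] hourly = new double[24]; // one day of consolidated hourly averages
    private int rawHead = 0, hourlyHead = 0;
    private double pendingSum = 0;
    private int pendingCount = 0;

    // Every update does the same small amount of work: overwrite one raw
    // slot, and every 12th update flush one consolidated hourly average.
    public void update(double value) {
        raw[rawHead] = value;
        rawHead = (rawHead + 1) % raw.length;

        pendingSum += value;
        if (++pendingCount == 12) {                 // 12 x 5 min = 1 hour
            hourly[hourlyHead] = pendingSum / 12;
            hourlyHead = (hourlyHead + 1) % hourly.length;
            pendingSum = 0;
            pendingCount = 0;
        }
    }

    public static void main(String[] args) {
        RoundRobinSketch rrd = new RoundRobinSketch();
        for (int i = 0; i < 10_000; i++) {
            rrd.update(Math.random() * 100);
        }
        // Storage is still 288 + 24 slots, no matter how many updates arrived.
        System.out.printf("raw=%d slots, hourly=%d slots%n", rrd.raw.length, rrd.hourly.length);
    }
}
```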
8. Consider
● 2 IOPS per update (read-update-write)
● 1 RRD per data source (storeByGroup=false)
● 100,000s of data sources → 1,000s of IOPS
● 1,000,000s of data sources → 10,000s of IOPS
● 15,000 RPM SAS drive, ~175-200 IOPS
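To make the slide's arithmetic concrete, here is a small Java sketch of the drive budget it implies; the 5-minute collection interval is an assumed figure, not stated on the slide.

```java
// Back-of-the-envelope math for the numbers above: one RRD per data
// source, 2 IOPS per update, against a 15k RPM SAS drive's capability.
public class IopsBudget {
    public static void main(String[] args) {
        long dataSources  = 1_000_000;  // "1,000,000s of data sources"
        int stepSeconds   = 300;        // assumed 5-minute collection interval
        int iopsPerUpdate = 2;          // read-update-write
        int iopsPerDrive  = 175;        // 15,000 RPM SAS, low end of ~175-200

        double updatesPerSecond = (double) dataSources / stepSeconds;
        double requiredIops     = updatesPerSecond * iopsPerUpdate;
        long   drivesNeeded     = (long) Math.ceil(requiredIops / iopsPerDrive);

        // ~3,333 updates/s -> ~6,667 IOPS -> 39 drives
        System.out.printf("%.0f updates/s -> %.0f IOPS -> %d drives%n",
                updatesPerSecond, requiredIops, drivesNeeded);
    }
}
```

Under these assumptions, a million data sources works out to dozens of 15k RPM drives doing nothing but RRD updates.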
9. Also
● Not everything is a graph
● Inflexible
● Incremental backups impractical
● ...
10. Observation #1
We collect and write a great deal; we read (graph) relatively little.
Yet we are optimized for reading everything, always.
11. Observation #2
Samples are naturally collected and graphed together in groups.
Grouping samples that are accessed together is an easy optimization.
12. Project: Newts
Goals:
● Stand-alone time-series data store
● High-throughput
● Horizontally scalable
● Grouped metric storage/retrieval
● Late-aggregating
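To tie these goals back to the two observations, here is a hypothetical Java sketch of the storage model such a store aims for; it is not the Newts API. Samples collected together are written and read back as one group (Observation #2), and raw values are kept as-is, with aggregation deferred to read time ("late-aggregating").

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical grouped, late-aggregating sample store. All names are
// illustrative; this is an in-memory stand-in for a Cassandra-backed model.
public class GroupedSampleStore {
    // One row: every metric in the group, taken at one instant.
    record Row(long timestamp, Map<String, Double> metrics) {}

    // resource (e.g. "router1:eth0") -> chronologically appended rows
    private final Map<String, List<Row>> rowsByResource = new HashMap<>();

    // A single write persists the whole group: one operation, not one
    // file update per data source.
    public void insert(String resource, long ts, Map<String, Double> metrics) {
        rowsByResource.computeIfAbsent(resource, r -> new ArrayList<>())
                .add(new Row(ts, metrics));
    }

    // Late aggregation: raw samples are scanned and averaged only when a
    // query asks for it, instead of being rolled up on every write.
    public double average(String resource, String metric, long from, long to) {
        return rowsByResource.getOrDefault(resource, List.of()).stream()
                .filter(r -> r.timestamp() >= from && r.timestamp() < to)
                .mapToDouble(r -> r.metrics().get(metric))
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        GroupedSampleStore store = new GroupedSampleStore();
        store.insert("router1:eth0", 0,   Map.of("ifInOctets", 100.0, "ifOutOctets", 40.0));
        store.insert("router1:eth0", 300, Map.of("ifInOctets", 140.0, "ifOutOctets", 60.0));
        System.out.println(store.average("router1:eth0", "ifInOctets", 0, 600)); // 120.0
    }
}
```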