This document discusses using Apache Cassandra to store and retrieve time series data more efficiently than the traditional RRDTool approach. It describes how Cassandra is well-suited for time series data due to its high write throughput, ability to store data sorted on disk, and partitioning and replication. The document also outlines a data model for storing time series metrics in Cassandra and discusses Newts, an open source time series data store built on Cassandra.
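As a rough sketch of that data-model idea (wide partitions keyed by metric and time bucket, rows clustered by timestamp so they are sorted on disk), here is a minimal example using the DataStax Python driver; the keyspace, table, and column names are illustrative assumptions, not Newts' actual schema.

```python
# Minimal time-series table in Cassandra via the DataStax Python driver.
# All names (metrics_ks, metrics, ...) are illustrative only.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS metrics_ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Partition by (metric, day) so each day's samples form one contiguous,
# disk-sorted partition; cluster by timestamp for fast range scans.
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics_ks.metrics (
        metric text,
        day    text,
        ts     timestamp,
        value  double,
        PRIMARY KEY ((metric, day), ts)
    ) WITH CLUSTERING ORDER BY (ts ASC)
""")

session.execute(
    "INSERT INTO metrics_ks.metrics (metric, day, ts, value) "
    "VALUES (%s, %s, toTimestamp(now()), %s)",
    ("cpu.load", "2024-01-01", 0.42),
)
```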
This document discusses RediSearch aggregations, which allow processing search results to produce statistical insights. Aggregations take a search query, group and reduce the results, apply transformations, and sort. Key steps include filtering results, grouping and reducing with functions like count and average, applying expressions, and sorting. Examples show finding top GitHub committers and visits by hour. Scaling aggregations to multiple nodes requires pushing processing stages to nodes and merging results, such as summing counts or taking list intersections.
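A minimal sketch of the "top GitHub committers" style of aggregation, issued through redis-py's generic command interface; the index name commits-idx and the @author field are hypothetical.

```python
# Sketch of a RediSearch aggregation: count commits per author, sort by
# the count, and keep the top ten. Index and field names are made up.
import redis

r = redis.Redis()

result = r.execute_command(
    "FT.AGGREGATE", "commits-idx", "*",
    "GROUPBY", "1", "@author",
    "REDUCE", "COUNT", "0", "AS", "commits",
    "SORTBY", "2", "@commits", "DESC",
    "LIMIT", "0", "10",          # top ten committers
)
print(result)
```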
The document discusses lessons learned from using MongoDB at the New York Times over 6 months. It covers initial setup without backups or monitoring, improving to replication and monitoring, optimizing storage, backups, restores, querying, indexing and administration. Key lessons include using replication and backups, monitoring all aspects of MongoDB and storage, optimizing data and indexes for queries, and understanding data and access patterns.
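To make a couple of those lessons concrete, here is a hedged pymongo sketch of connecting to a replica set and indexing for a known query pattern; the hosts, replica-set name, and fields are invented for illustration.

```python
# Sketch: connect to a MongoDB replica set and index for a known query
# pattern. Hosts, replica-set name, and fields are hypothetical.
from pymongo import MongoClient, ASCENDING

client = MongoClient(
    "mongodb://db1.example.com,db2.example.com/?replicaSet=rs0"
)
articles = client.newsroom.articles

# Index the fields the application actually queries on.
articles.create_index([("published", ASCENDING), ("section", ASCENDING)])

for doc in articles.find({"section": "world"}).sort("published", -1).limit(5):
    print(doc["_id"])
```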
Redis can be used as a time-series database by using the redis-timeseries module. The module provides a custom data structure and commands for storing and querying time-series data in Redis. Data can be added with a timestamp and value and queried within a time range. Downsampling aggregates and stores data at regular intervals to reduce the size of long time-series data. Global configuration allows defining downsampling rules and retention policies for all keys.
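A minimal sketch of that flow using the module's TS.* commands through redis-py's generic command interface; the key names and the one-minute bucket are assumptions for illustration.

```python
# Sketch of RedisTimeSeries usage: raw samples plus a downsampling rule.
# Key names and the one-minute bucket are illustrative.
import time

import redis

r = redis.Redis()

r.execute_command("TS.CREATE", "temp:raw", "RETENTION", 86400000)
r.execute_command("TS.CREATE", "temp:avg1m")

# Compact raw samples into one-minute averages stored in temp:avg1m.
r.execute_command("TS.CREATERULE", "temp:raw", "temp:avg1m",
                  "AGGREGATION", "avg", 60000)

now_ms = int(time.time() * 1000)
r.execute_command("TS.ADD", "temp:raw", now_ms, 21.5)

# Query the last hour of raw samples.
print(r.execute_command("TS.RANGE", "temp:raw", now_ms - 3600000, now_ms))
```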
This document discusses working with time series data using InfluxDB. It provides an overview of time series data and why InfluxDB is useful for storing and querying it. Key features of InfluxDB covered include its SQL-like query language, retention policies for managing data storage, continuous queries for aggregation, and tools for data collection, visualization and monitoring.
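A small sketch of those features with the influxdb Python client; the database and measurement names are made up, and the query shows the SQL-like InfluxQL dialect with a GROUP BY time() aggregation.

```python
# Sketch: write a point and run an InfluxQL aggregation with the
# influxdb Python client. Database and measurement names are made up.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="iot")
client.create_database("iot")

client.write_points([{
    "measurement": "temperature",
    "tags": {"sensor": "s1"},
    "fields": {"value": 21.5},
}])

# SQL-like query: mean temperature per 5-minute window over the last hour.
result = client.query(
    "SELECT MEAN(value) FROM temperature "
    "WHERE time > now() - 1h GROUP BY time(5m)"
)
print(list(result.get_points()))
```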
NBITSearch is a search engine with an open API for local stations, LANs, and the Internet. Advantages over counterparts: 1. Object indexing: it can index objects S of any type T. 2. Multifunctional indexing: it can index objects simultaneously by any set of functions F(S). 3. Very fast search: it saves time and money.
Ted Dunning – Very High Bandwidth Time Series Database Implementation. This talk will describe our work in creating time series databases with very high ingest rates (over 100 million points/second) on very small clusters. Starting with OpenTSDB and the off-the-shelf version of MapR-DB, we were able to accelerate ingest by >1000x. I will describe our techniques in detail and talk about the architectural changes required. We are also working to allow access to OpenTSDB data using SQL via Apache Drill. In addition, I will talk about how this work bears on the much-fabled Internet of Things, and tell some stories about the origins of open source big data in the 19th century at sea.
How do you store billions of time series points and access them within a few milliseconds? Chronix! Chronix is a young but mature open source project that can, for example, store about 15 GB (CSV) of time series data in 238 MB, with average query times of 21 ms. Chronix is built on top of Apache Solr, a bulletproof distributed NoSQL database with impressive search capabilities. In this code-intense session we show how Chronix achieves its efficiency in both respects: through ideal chunking, by selecting the best compression technique, by enhancing the stored data with (pre-computed) attributes, and through specialized query functions.
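Not Chronix's actual implementation, but a toy Python sketch of why chunking, compression, and pre-computed attributes pay off: delta-encode timestamps within each chunk, compress the chunk, and store min/max alongside so some queries never decompress anything.

```python
# Toy illustration (not Chronix's code) of chunking plus compression.
import gzip
import json

points = [(1_600_000_000 + i, 20.0 + (i % 10) / 10) for i in range(10_000)]

chunk_size = 1_000
chunks = []
for start in range(0, len(points), chunk_size):
    chunk = points[start:start + chunk_size]
    base = chunk[0][0]
    # Delta-encoded timestamps are small, repetitive, and compress well.
    deltas = [(t - base, v) for t, v in chunk]
    blob = gzip.compress(json.dumps(deltas).encode())
    chunks.append({
        "start": base,
        "min": min(v for _, v in chunk),   # pre-computed attributes let
        "max": max(v for _, v in chunk),   # some queries skip the blob
        "blob": blob,
    })

raw = len(json.dumps(points))
packed = sum(len(c["blob"]) for c in chunks)
print(f"raw {raw} bytes -> compressed {packed} bytes")
```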
A session about my experience writing an external inventory script from scratch for "Netbox" (an IPAM and DCIM tool from the DigitalOcean network engineering team) and pushing it upstream, where it became an official inventory script. Repo: https://github.com/AAbouZaid/netbox-as-ansible-inventory "Dynamic inventory" is one of the nicest features in Ansible: you can use an external service as the inventory for Ansible instead of the basic text-based INI file. So you can use AWS EC2 as the inventory of your hosts, or maybe OpenStack, or whatever... you can actually use any source as inventory for Ansible, and you can write your own "External Inventory Script".
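For context, a minimal skeleton of such an external inventory script: Ansible invokes it with --list for the whole inventory and --host <name> for per-host variables, and expects JSON on stdout. The groups and hosts below are placeholders, not Netbox output.

```python
#!/usr/bin/env python
# Minimal skeleton of an Ansible external inventory script. Ansible
# invokes it with --list (all groups/hosts) or --host <name> (host vars).
import json
import sys

INVENTORY = {
    "web": {"hosts": ["web1.example.com", "web2.example.com"]},
    "_meta": {"hostvars": {"web1.example.com": {"ansible_port": 22}}},
}

if len(sys.argv) > 1 and sys.argv[1] == "--list":
    print(json.dumps(INVENTORY))
elif len(sys.argv) > 2 and sys.argv[1] == "--host":
    print(json.dumps(INVENTORY["_meta"]["hostvars"].get(sys.argv[2], {})))
else:
    sys.exit("usage: inventory.py --list | --host <hostname>")
```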
The document outlines a migration plan to improve the performance and scalability of an Elasticsearch cluster. The current cluster has performance issues due to a large inverted index, outdated software version, and lack of document purge policies. The plan involves defining requirements, measuring the new infrastructure needs, installing an updated version, defining index structures, performing a remote reindex to migrate data, and adding logic to avoid downtime during migration. The new cluster will have dedicated roles, monthly indices of optimal size, and policies to retain only one year of data.
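A hedged sketch of the remote-reindex step using Elasticsearch's _reindex REST API (here via Python's requests); the hosts, index names, and the retention query are placeholders.

```python
# Sketch of the remote-reindex step: pull documents from the old cluster
# into a new monthly index on the new cluster. All names are placeholders.
import requests

resp = requests.post(
    "http://new-es.example.com:9200/_reindex",
    json={
        "source": {
            "remote": {"host": "http://old-es.example.com:9200"},
            "index": "logs",
            # Migrate only the retention window (e.g. the last year).
            "query": {"range": {"@timestamp": {"gte": "now-1y"}}},
        },
        "dest": {"index": "logs-2024-01"},
    },
    timeout=3600,
)
print(resp.json())
```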
“Just leave the garbage outside and we will take care of it for you.” This is the panacea promised by the garbage collection mechanisms built into most software stacks available today. So we don’t need to think about it anymore, right? Wrong! When misused, garbage collectors can fail miserably. When this happens, they slow down your application and lead to unacceptable pauses. In this talk we will go over different garbage collector approaches in different software runtimes and the conditions that enable them to function well. Presented at Reversim Summit 2019: https://summit2019.reversim.com/session/5c754052d0e22f001706cbd8
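As one concrete runtime example (an illustration, not the talk's material), CPython exposes knobs on its cyclic collector that let you trade memory for fewer and better-timed pauses:

```python
# One runtime's GC knobs: CPython's cyclic collector. Raising the
# generation-0 threshold trades memory for fewer collection pauses.
import gc

print(gc.get_threshold())        # default is (700, 10, 10)
gc.set_threshold(50000, 20, 20)  # collect less often in hot loops

gc.disable()                     # e.g. around a latency-critical section
try:
    pass                         # ... allocation-heavy critical work ...
finally:
    gc.enable()
    gc.collect()                 # pay the cost at a time we choose
```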
The document compares using SAS hash objects versus SQL joins to combine data from multiple tables. Hash objects store key-value pairs in memory for fast lookups, providing a potential alternative to joins. While hash objects can improve performance, especially for larger datasets, they require more code and memory than joins. The document evaluates performance differences between hash objects and joins for various scenarios and sizes of data. It also discusses additional capabilities and considerations for using hash objects.
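The hash-object idea translated into Python terms (not SAS code): load the smaller table into an in-memory map keyed on the join key, then stream the larger table past it. The table contents below are invented.

```python
# Hash lookup as a join alternative: build a map from the small table,
# probe it once per row of the large table.
customers = [(1, "Ada"), (2, "Grace"), (3, "Edsger")]
orders = [(101, 1, 9.99), (102, 3, 24.50), (103, 1, 5.00)]

# Build phase: O(n) memory for the small table.
by_id = {cust_id: name for cust_id, name in customers}

# Probe phase: one O(1) lookup per row instead of a sort-merge or
# nested-loop join over both tables.
joined = [
    (order_id, by_id[cust_id], amount)
    for order_id, cust_id, amount in orders
    if cust_id in by_id
]
print(joined)  # [(101, 'Ada', 9.99), (102, 'Edsger', 24.5), (103, 'Ada', 5.0)]
```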
This document summarizes the work done by OpenStack@IIIT-H, which uses OpenStack to run an Indian language search engine, conduct research in information extraction/retrieval/access and virtualization/cloud computing, and provide users with OpenStack, Hadoop, and other open-source software. Before OpenStack, provisioning resources was ad hoc and unmanaged, user management lacked controls, and storage had reliability and duplication issues. After implementing OpenStack, resources can be quickly and reliably provisioned on demand, usage is monitored and restricted with quotas, and storage uses Swift for reliability without data fragmentation. OpenStack also supports research and the teaching of over 350 students through hands-on projects.
They promise that IoT (the Internet of Things) will conquer the world. But what will handle the billions of bytes that flow into our servers every hour? First released in 2013, InfluxDB is used by eBay, Cisco, IBM, and other big companies. It’s production-proven time-series storage. During this talk we’re going to get acquainted with it and see how InfluxDB can help solve your problems. We’ll see how to quickly install it on the Amazon Web Services platform and how it scales. And for dessert, we’re going to draw pretty Grafana graphs using InfluxDB data.
This document summarizes several consistent hashing algorithms: Mod-N hashing is simple but makes adding or removing servers difficult as it may require reconstructing the entire hashing table. Consistent hashing addresses this by using two hash functions - one for data and one for servers, placing them on a ring to distribute data. Jump hashing improves load distribution and reduces space usage to O(1) but does not support arbitrary node removal. Maglev hashing provides constant-time lookups but slow generation on node failure. Later algorithms aim to improve aspects like bounded load distribution and memory usage versus lookup speed. Overall, consistent hashing algorithms involve tradeoffs between these factors.
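Of the algorithms above, jump hashing is compact enough to show whole; this is a direct Python transcription of the published algorithm (Lamping and Veach, 2014):

```python
# Jump consistent hash: maps a 64-bit key to one of num_buckets buckets
# using O(1) space, but buckets can only be added or removed at the end,
# so arbitrary node removal is unsupported.
def jump_hash(key: int, num_buckets: int) -> int:
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int(float(b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b

# Adding an 11th bucket moves only ~1/11 of keys, unlike mod-N rehashing.
moved = sum(jump_hash(k, 10) != jump_hash(k, 11) for k in range(100000))
print(f"{moved / 100000:.1%} of keys moved")   # roughly 9%
```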
The document discusses using the raster package in R to work with geographical grid data. It covers downloading and loading the raster package, creating raster objects and adding random values, reading in real climate data files, performing operations like cropping and aggregation, and sources for global climate data like WorldClim.
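To illustrate the aggregation step in language-neutral terms (numpy rather than R's raster package), coarsening a grid by a factor of 2 means averaging each 2x2 block of cells:

```python
# The aggregate() idea from R's raster package, sketched with numpy:
# coarsen a grid by a factor of 2 by averaging each 2x2 block of cells.
import numpy as np

rng = np.random.default_rng(0)
grid = rng.random((100, 100))        # a raster of random values

factor = 2
coarse = grid.reshape(
    grid.shape[0] // factor, factor,
    grid.shape[1] // factor, factor,
).mean(axis=(1, 3))

print(grid.shape, "->", coarse.shape)  # (100, 100) -> (50, 50)
```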
This document discusses using InfluxDB and Grafana together for analyzing IoT data. It provides benchmarks showing InfluxDB's fast performance for ingesting and querying large time series data compared to PostgreSQL. It also covers hosting InfluxDB on AWS for horizontal scalability and high availability using InfluxDB relays.
How do you master the flow of IoT data with DataStax, Apache Cassandra, and Apache Spark? IoT Breakfast session of November 19, 2015.
The document discusses topology and partitioning in Cassandra distributed hash tables (DHTs). It describes issues with poor load distribution and data distribution in traditional DHT designs. It proposes using virtual nodes, where each physical node is assigned multiple tokens, to better distribute partitions and improve performance. Configuration options for Cassandra are presented that implement virtual nodes using a random token assignment strategy.
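Concretely, virtual nodes are switched on in cassandra.yaml by giving each node many tokens; 256 is a commonly used value, and the single-token setting is left unset:

```yaml
# cassandra.yaml: enable virtual nodes by giving each physical node many
# randomly assigned tokens instead of a single hand-picked one.
num_tokens: 256
# When vnodes are enabled, leave the single-token setting unset:
# initial_token:
```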
This document discusses building scalable IoT applications using open source technologies. It begins by providing an overview of the growth of the IoT market and connected devices. It then discusses challenges with traditional "data lake" architectures for IoT data due to the high volume, velocity, and variety of IoT data. The document proposes an architecture combining stream processing for real-time data with analytics on both real-time and stored data. It discusses data access patterns and storage requirements for different types of IoT data. Finally, it provides an overview of open source technologies that can be used to build scalable IoT applications.
This document summarizes Eric Evans' presentation on using Cassandra as the backend for Wikimedia's content API. It discusses Wikimedia's goals of providing free knowledge, key metrics about Wikipedia and its architecture. It then focuses on how Wikimedia uses Cassandra, including their data model, compression techniques, and ongoing work to optimize compaction strategies and reduce node density to improve performance.
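To give a flavor of that kind of per-table tuning (illustrative CQL via the Python driver; the keyspace and table names are placeholders, not Wikimedia's schema):

```python
# Illustrative per-table compression and compaction tuning in Cassandra.
# Keyspace/table names are placeholders, not Wikimedia's actual schema.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Swap the compression codec and chunk size for a content-heavy table.
session.execute("""
    ALTER TABLE content.revisions
    WITH compression = {'class': 'LZ4Compressor',
                        'chunk_length_in_kb': 64}
""")

# Choose a compaction strategy suited to the workload (LCS favors reads).
session.execute("""
    ALTER TABLE content.revisions
    WITH compaction = {'class': 'LeveledCompactionStrategy'}
""")
```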
Among the resources offered by Wikimedia is an API providing low-latency access to full-history content, in many formats. Its results are often the product of computationally intensive transforms, and must be pre-generated and stored to meet latency expectations. Unsurprisingly, there are many challenges to providing low-latency access to such a large data-set in a demanding, globally distributed environment. This presentation covers the Wikimedia content API and its use of Apache Cassandra as storage for a diverse and growing set of use-cases. Trials, tribulations, and triumphs of both a development and an operational nature will be discussed.
Degetel / DataStax webinar of October 15, 2015. From SQL to NoSQL: Why? What are the differences? How does it work?
The Wikimedia Foundation is a non-profit and charitable organization driven by a vision of a world where every human can freely share in the sum of all knowledge. Each month Wikimedia sites serve over 18 billion page views to 500 million unique visitors around the world. Among the many resources offered by Wikimedia is a public-facing API that provides low-latency, programmatic access to full-history content and meta-data, in a variety of formats. Commonly, results from this system are the product of computationally intensive transformations, and must be pre-generated and persisted to meet latency expectations. Unsurprisingly, there are numerous challenges to providing low-latency storage of such a massive data-set in a demanding, globally distributed environment. This talk covers the Wikimedia Content API and its use of Apache Cassandra, a massively scalable distributed database, as storage for a diverse and growing set of use-cases. Trials, tribulations, and triumphs of both a development and an operational nature are discussed.
Castle is an open-source project that provides an alternative to the lower layers of the storage stack -- RAID and POSIX filesystems -- for big data workloads, and distributed data stores such as Apache Cassandra. This presentation from Berlin Buzzwords 2012 provides a high-level overview of Castle and how it is used with Cassandra to improve performance and predictability.
Banking / Insurance webinar: Take back control of your data.
Discover DataStax Enterprise, with its integrations of Apache Spark and Apache Solr, as well as its new graph data model.