OpenNMS User Conference Europe presentation on using Apache Cassandra and Newts for time-series data storage.
The raster package in R allows users to work with geographic grid data. It contains functions for reading raster files into R, performing operations on raster layers like cropping and aggregation, and visualizing raster maps. Common sources of global climate data that can be accessed in R include WorldClim, the Global Summary of Day from NOAA, and datasets available on the CGIAR website.
The document describes improvements to building histograms for database tables. It outlines collecting a histogram using samples of rows rather than a full table scan to avoid sorting all values and improve performance. The new implementation allows the user to specify a sampling percentage and constructs an equal-width histogram using multiple samples to estimate the min and max values and then bucket the values between those ranges.
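The sampling approach described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the implementation from the document: the function name, the `num_samples` and `seed` parameters, and the use of an in-memory list in place of a table are all assumptions.

```python
import random


def sampled_equal_width_histogram(values, sample_pct, num_buckets,
                                  num_samples=3, seed=0):
    """Build an equal-width histogram from random samples, not a full scan.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    sample_size = max(1, int(len(values) * sample_pct / 100))
    # Draw several independent samples; together they estimate min and max.
    samples = [rng.sample(values, sample_size) for _ in range(num_samples)]
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    # Fall back to width 1 when every sampled value is identical.
    width = (hi - lo) / num_buckets or 1
    counts = [0] * num_buckets
    for sample in samples:
        for v in sample:
            # Clamp so that v == hi lands in the last bucket.
            idx = min(int((v - lo) / width), num_buckets - 1)
            counts[idx] += 1
    return lo, hi, counts
```

No sort of the full column is needed: the cost is proportional to the number of sampled rows, at the price of min and max being estimates.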
The document proposes a solution to replace inode-based storage with a key-value store mapping objects directly to positions in large "volumes" or files to address scalability issues. It benchmarks significantly better performance for puts, gets, and concurrent operations compared to an XFS filesystem, using less RAM and avoiding compaction costs. Open tasks include replication, erasure coding, and testing on object servers.
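To illustrate the idea of mapping objects directly to positions in a volume, here is a minimal sketch. It assumes an in-memory `dict` in place of the key-value store and an `io.BytesIO` buffer in place of a large volume file; the class and method names are hypothetical and not taken from the document.

```python
import io


class VolumeStore:
    """Append-only volume plus an index of key -> (offset, length)."""

    def __init__(self):
        self.volume = io.BytesIO()  # stands in for one large volume file
        self.index = {}             # stands in for the key-value store

    def put(self, key, data):
        # Append the object at the end of the volume and remember where it is.
        offset = self.volume.seek(0, io.SEEK_END)
        self.volume.write(data)
        self.index[key] = (offset, len(data))

    def get(self, key):
        # One index lookup, then one positioned read; no per-object inode.
        offset, length = self.index[key]
        self.volume.seek(offset)
        return self.volume.read(length)
```

Because each object is addressed by offset and length, many objects share one file, so the filesystem tracks a handful of volumes rather than one inode per object.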
This document provides a history and overview of ECMAScript (ES), the standard upon which JavaScript is based. It discusses the major versions from ES3 in 1999 to ES2016. Key changes and new features are outlined for each version, including the addition of classes, modules, iterators and more in ES6/ES2015. Transpilers like Babel allow the use of new syntax by compiling ES6 to older JavaScript. Compatibility and adoption are addressed, noting a goal of evolving the language without breaking the web. Links for further reading on ES6 features and syntax are also included.
Redis can be used as a time-series database by using the redis-timeseries module. The module provides a custom data structure and commands for storing and querying time-series data in Redis. Data can be added with a timestamp and value and queried within a time range. Downsampling aggregates and stores data at regular intervals to reduce the size of long time-series data. Global configuration allows defining downsampling rules and retention policies for all keys.
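The range query and downsampling ideas can be illustrated without Redis at all. The sketch below is plain Python and does not use the module's commands; the function names and the choice of averaging as the aggregate are assumptions made for illustration.

```python
from collections import defaultdict


def range_query(points, start_ms, end_ms):
    """Return (timestamp_ms, value) points with start_ms <= timestamp <= end_ms."""
    return [(ts, v) for ts, v in points if start_ms <= ts <= end_ms]


def downsample_avg(points, bucket_ms):
    """Average points into fixed-width time buckets of bucket_ms milliseconds."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - ts % bucket_ms  # align to the start of the bucket
        buckets[bucket_start].append(value)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]
```

Storing only the per-bucket aggregate is what lets a long series shrink: one value per interval replaces every raw sample inside it.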
Be a Zen monk, the Python way. A short tech talk at Imaginea to get developers bootstrapped with the focus and philosophy of Python and their point of convergence with Zen philosophy.
Cryptocurrencies are digital currencies that have garnered significant investor attention in the financial markets. The aim of this project is to predict the daily closing price of the cryptocurrency Bitcoin, which plays a vital role in making trading decisions. Various factors affect the price of Bitcoin, making price prediction a complex and technically challenging task. To perform the prediction, a random forest model was trained on the historical time series of Bitcoin prices over several years. Features such as the opening price, highest price, lowest price, closing price, volume of Bitcoin, volume of currencies, and weighted price were used to predict the closing price of the next day. The random forest model was designed and implemented in both the PySpark and scikit-learn frameworks, and the implementations were evaluated on test data by computing measures such as the RMSE (root mean square error) and r (Pearson's correlation coefficient). PySpark was used to parallelize tree construction when training the random forest so that it can handle big data. The code is available at: https://github.com/ykpgrr/Price-Prediction-with-Random-Forest
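The two evaluation measures named above can be computed as in the following sketch. It uses only the standard library and is not the project's code; the function names are assumptions, and the linked repository may compute these measures differently.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)
```

RMSE is expressed in the same units as the price, while r only measures how closely predictions track the actual series, so the two are usually reported together.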
“Just leave the garbage outside and we will take care of it for you”. This is the panacea promised by garbage collection mechanisms built into most software stacks available today. So, we don’t need to think about it anymore, right? Wrong! When misused, garbage collectors can fail miserably. When this happens they slow down your application and lead to unacceptable pauses. In this talk we will go over the garbage collector approaches used in different software runtimes and the conditions that enable them to function well. Presented at Reversim Summit 2019 https://summit2019.reversim.com/session/5c754052d0e22f001706cbd8
This document provides an introduction and overview of several Apache Spark labs covering: a "hello world" example of Resilient Distributed Datasets (RDDs); importing and performing operations on a wine dataset using DataFrames and SQL; and using the MLlib library to perform k-means clustering on features from the wine dataset. The labs demonstrate basic Spark concepts like RDDs, DataFrames, ML pipelines, and clustering algorithms.
A session about my experience writing an external inventory script from scratch for "Netbox" (an IPAM and DCIM tool from the DigitalOcean network engineering team) and pushing it upstream to become an official inventory script. Repo: https://github.com/AAbouZaid/netbox-as-ansible-inventory "Dynamic inventory" is one of the nice features in Ansible: you can use an external service as the inventory for Ansible instead of the basic text-based INI file. So you can use AWS EC2 as the inventory of your hosts, or maybe OpenStack, or whatever. You can actually use any inventory source for Ansible, and you can write your own "External Inventory Script".
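As a rough idea of what an external inventory script looks like, the sketch below follows Ansible's script convention of answering `--list` with a JSON document of groups and `--host <name>` with that host's variables. It is not the script from the repository above: the `DEVICES` records are hypothetical stand-ins for data a real script would fetch from the NetBox API, and grouping by `role` is an assumption.

```python
import argparse
import json

# Hypothetical device records; a real script would query NetBox instead.
DEVICES = [
    {"name": "web01", "role": "web", "ip": "10.0.0.11"},
    {"name": "db01", "role": "db", "ip": "10.0.0.21"},
]


def build_inventory(devices):
    """Group hosts by role and collect per-host variables under _meta."""
    inventory = {"_meta": {"hostvars": {}}}
    for dev in devices:
        group = inventory.setdefault(dev["role"], {"hosts": []})
        group["hosts"].append(dev["name"])
        inventory["_meta"]["hostvars"][dev["name"]] = {"ansible_host": dev["ip"]}
    return inventory


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--list", action="store_true")
    parser.add_argument("--host")
    args, _ = parser.parse_known_args(argv)
    inventory = build_inventory(DEVICES)
    if args.host:
        # Variables for a single host.
        print(json.dumps(inventory["_meta"]["hostvars"].get(args.host, {})))
    else:
        # Full inventory, which is what --list asks for.
        print(json.dumps(inventory))


if __name__ == "__main__":
    main()
```

Saved as an executable file, a script of this shape can be passed to Ansible with `-i`. Including `_meta.hostvars` in the `--list` output spares Ansible from calling the script once per host.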
Bizur is a consensus algorithm invented by Elastifile to address issues with log-based algorithms like Paxos. It optimizes for strongly consistent distributed key-value stores by having independent buckets for keys that are replicated and use leader election. Reads require a majority and writes succeed with majority acknowledgement. It includes recovery mechanisms like ensuring buckets are up-to-date after leader changes and can reconfigure membership or number of buckets per shard dynamically through techniques like SMART migrations.
Presentation from the Boulder/Denver Big Data Meetup on 2/20/2020 in Boulder, CO. Topics covered: Troubleshooting Spark jobs (groupby, shuffle) for big data, tuning AWS EMR Spark clusters, EMR cluster resource utilization, writing scalable Scala for scanning S3 metadata.
Modern query engines rely heavily on hash tables for query processing. Overall query performance and memory footprint are often determined by how hash tables and the tuples within them are represented. In this work, we propose three complementary techniques to improve this representation: Domain-Guided Prefix Suppression bit-packs keys and values tightly to reduce hash table record width. Optimistic Splitting decomposes values (and operations on them) into (operations on) frequently-accessed and infrequently-accessed value slices. By removing the infrequently-accessed value slices from the hash table record, it improves cache locality. The Unique Strings Self-aligned Region (USSR) accelerates handling frequently-occurring strings, which are very common in real-world data sets, by creating an on-the-fly dictionary of the most frequent strings. This allows executing many string operations with integer logic and reduces memory pressure. We integrated these techniques into Vectorwise. On the TPC-H benchmark, our approach reduces peak memory consumption by 2–4× and improves performance by up to 1.5×. On a real-world BI workload, we measured a 2× improvement in performance and in micro-benchmarks we observed speedups of up to 25×.
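The bit-packing idea behind Domain-Guided Prefix Suppression can be illustrated with a simplified sketch. It assumes that each column's minimum and maximum are known, stores each value as an offset from its minimum, and concatenates the offsets into a single integer. The function names are hypothetical, and the actual technique integrated into Vectorwise is more involved than this.

```python
def bits_needed(lo, hi):
    """Number of bits required to store any offset in the domain [lo, hi]."""
    return max(1, (hi - lo).bit_length())


def pack(record, domains):
    """Pack one record into an integer; returns (packed, total_bits)."""
    packed, shift = 0, 0
    for value, (lo, hi) in zip(record, domains):
        packed |= (value - lo) << shift  # store the offset, not the raw value
        shift += bits_needed(lo, hi)
    return packed, shift


def unpack(packed, domains):
    """Recover the original record from its packed form."""
    record = []
    for lo, hi in domains:
        width = bits_needed(lo, hi)
        record.append((packed & ((1 << width) - 1)) + lo)
        packed >>= width
    return record
```

For example, with domains `(1990, 2020)`, `(0, 100)` and `(1000000, 1000255)`, a three-column record fits in 5 + 7 + 8 = 20 bits instead of three full-width integers.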
In Apache Cassandra Lunch #59: Functions in Cassandra, we discussed the functions that are usable inside of the Cassandra database. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live.
Ted Dunning – Very High Bandwidth Time Series Database Implementation. This talk will describe our work in creating time series databases with very high ingest rates (over 100 million points / second) on very small clusters. Starting with openTSDB and the off-the-shelf version of MapR-DB, we were able to accelerate ingest by >1000x. I will describe our techniques in detail and talk about the architectural changes required. We are also working to allow access to openTSDB data using SQL via Apache Drill. In addition, I will talk about how this work has implications regarding the much fabled Internet of Things. I will also tell some stories about the origins of open source big data in the 19th century at sea.
In Apache Cassandra Lunch #67, we discussed how to move data from open source Cassandra to DataStax Astra using dsbulk/Scylla Migrator. https://github.com/DataStax-Examples/dsbulk-to-astra/ Accompanying Blog: https://blog.anant.us/apache-cassandra-lunch-67-moving-data-from-cassandra-to-datastax-astra-with-dsbulk Accompanying YouTube: https://youtu.be/0k7RBf5vi5M Join Cassandra Lunch Weekly at 12 PM EST Every Wednesday: https://www.meetup.com/Cassandra-DataStax-DC/events/ Cassandra.Link: https://cassandra.link/ Awesome Cassandra: https://github.com/Anant/awesome-cassandra Cassandra.Lunch: https://github.com/Anant/Cassandra.Lunch
This document discusses using PHP to collect and store large amounts of physiological data from an intensive care unit. The system collected around 100,000 values per second from 40 beds, totaling over 2 trillion samples per year. Various database options were considered for storing this time series data, with custom compressed binary files chosen due to their small disk footprint. PHP was used to develop a prototype that compressed the data to around 0.57TB per year. While PHP has limitations for a production system, it was effective for rapid prototyping of compression algorithms and accessing large amounts of compressed data in "extended memory".
If you are looking to collect and store time series data, it's probably not going to be small. Don't get caught without a plan! Apache Cassandra has proven itself as a solid choice, and now you can learn how to use it. We'll look at possible data models and the choices you have to be successful. Then, let's open the hood and learn about how data is stored in Apache Cassandra. You don't need to be an expert in distributed systems to make this work and I'll show you how. I'll give you real-world examples and work through the steps. Give me an hour and I will upgrade your time series game.
iland has built a global data warehouse across multiple data centers, collecting and aggregating data from core cloud services including compute, storage and network as well as chargeback and compliance. iland's warehouse brings actionable intelligence that customers can use to manipulate resources, analyze trends, define alerts and share information. In this session, we would like to present the lessons learned around Cassandra, both at the development and operations level, but also the technology and architecture we put in action on top of Cassandra such as Redis, syslog-ng, RabbitMQ, Java EE, etc. Finally, we would like to share insights on how we are currently extending our platform with Spark and Kafka and what our motivations are.
Presented at Athens Cassandra Users Group meetup http://www.meetup.com/Athens-Cassandra-Users/events/177040142/
Apache Cassandra operations have a reputation for being simple on single-datacenter deployments and / or low-volume clusters, but they become far more complex on high-latency multi-datacenter clusters with high volume and / or high throughput: basic Apache Cassandra operations such as repairs, compactions or hints delivery can have dramatic consequences even on a healthy high-latency multi-datacenter cluster. In this presentation, Julien will first go through Apache Cassandra multi-datacenter concepts, then show multi-datacenter operations essentials in detail: bootstrapping new nodes and / or datacenters, repair strategy, Java GC tuning, OS tuning, Apache Cassandra configuration and monitoring. Based on his three years of experience managing a multi-datacenter cluster running Apache Cassandra 2.0, 2.1, 2.2 and 3.0, Julien will give you tips on how to anticipate and prevent / mitigate issues related to basic Apache Cassandra operations with a multi-datacenter cluster.
Among the resources offered by Wikimedia is an API providing low-latency access to full-history content, in many formats. Its results are often the product of computationally intensive transforms, and must be pre-generated and stored to meet latency expectations. Unsurprisingly, there are many challenges to providing low-latency access to such a large data-set, in a demanding, globally distributed environment. This presentation covers the Wikimedia content API and its use of Apache Cassandra as storage for a diverse and growing set of use-cases. Trials, tribulations, and triumphs, of both a development and operational nature will be discussed.
This document summarizes Eric Evans' presentation on using Cassandra as the backend for Wikimedia's content API. It discusses Wikimedia's goals of providing free knowledge, key metrics about Wikipedia and its architecture. It then focuses on how Wikimedia uses Cassandra, including their data model, compression techniques, and ongoing work to optimize compaction strategies and reduce node density to improve performance.
Banking / Insurance webinar: Take back control of your data
Castle is an open-source project that provides an alternative to the lower layers of the storage stack -- RAID and POSIX filesystems -- for big data workloads, and distributed data stores such as Apache Cassandra. This presentation from Berlin Buzzwords 2012 provides a high-level overview of Castle and how it is used with Cassandra to improve performance and predictability.
Degetel DataStax webinar of October 15, 2015. From SQL to NoSQL: Why? What are the differences? How does it work?
How can you master the flow of IoT data with DataStax, Apache Cassandra and Apache Spark? IoT Breakfast of November 19, 2015
Discover DataStax Enterprise with its Apache Spark and Apache Solr integrations, as well as its new graph data model.
This document provides an overview and history of the Cassandra Query Language (CQL) and discusses changes between versions 1.0 and 2.0. It notes that CQL was introduced in Cassandra 0.8.0 to provide a more stable and user-friendly interface than the native Cassandra API. Major changes in CQL 2.0 included data type changes and additional functionality like named keys, counters, and timestamps. The document outlines the roadmap for future CQL features and lists several third-party driver projects supporting CQL connectivity.
A presentation on the recent work to transition Cassandra from its naive 1-partition-per-node distribution to a proper virtual nodes implementation.
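The difference between one partition per node and virtual nodes can be sketched with a toy token ring. This is an illustration only, not Cassandra's implementation: tokens are derived here by hashing a hypothetical `node-index` label with MD5, and all function names are assumptions.

```python
import bisect
import hashlib


def token(key):
    """Map a string to a position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


def build_ring(nodes, vnodes_per_node):
    """Give every node vnodes_per_node tokens; return sorted (token, node) pairs."""
    return sorted(
        (token(f"{node}-{i}"), node)
        for node in nodes
        for i in range(vnodes_per_node)
    )


def owner(ring, key):
    """The first token at or after the key's token owns it, wrapping around."""
    tokens = [t for t, _ in ring]
    idx = bisect.bisect_left(tokens, token(key)) % len(ring)
    return ring[idx][1]
```

With `vnodes_per_node=1` each node owns a single contiguous range, which matches the 1-partition-per-node layout. With many tokens per node, each node owns many small ranges, so adding or removing a node moves a little data to or from many peers rather than a lot of data from one neighbor.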