Whether it's statistics, weather forecasting, astronomy, finance, or network management, time series data plays a critical role in analytics and forecasting. Unfortunately, while many tools exist for time series storage and analysis, few are able to scale past memory limits or provide rich query and analytics capabilities beyond what is necessary to produce simple plots. For those challenged by large volumes of data, there is much room for improvement. Apache Cassandra is a fully distributed second-generation database. Cassandra stores data in key-sorted order, making it ideal for time series, and its high throughput and linear scalability make it well suited to very large data sets. This talk will cover some of the requirements and challenges of large scale time series storage and analysis. Cassandra data and query modeling for this use-case will be discussed, and Newts, an open source Cassandra-based time series store under development at The OpenNMS Group, will be introduced.
The raster package in R allows users to work with geographic grid data. It contains functions for reading raster files into R, performing operations on raster layers like cropping and aggregation, and visualizing raster maps. Common sources of global climate data that can be accessed in R include WorldClim, the Global Summary of Day from NOAA, and datasets available on the CGIAR website.
This document discusses RediSearch aggregations, which allow processing search results to produce statistical insights. Aggregations take a search query, group and reduce the results, apply transformations, and sort. Key steps include filtering results, grouping and reducing with functions like count and average, applying expressions, and sorting. Examples show finding top GitHub committers and visits by hour. Scaling aggregations to multiple nodes requires pushing processing stages to nodes and merging results, such as summing counts or taking list intersections.
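The cross-node merge step described above can be sketched in Python. This is a hypothetical illustration of the coordinator's job for a COUNT reducer, not RediSearch's actual implementation: each shard returns its local per-group counts, and the coordinator simply sums counts for matching group keys.

```python
from collections import Counter

def merge_counts(per_node_results):
    """Merge per-group counts returned by individual shards.

    Each shard reports {group_key: count}; the coordinator
    sums the counts for matching keys.
    """
    merged = Counter()
    for result in per_node_results:
        merged.update(result)
    return dict(merged)

# e.g. top committers aggregated on two shards:
shard_a = {"alice": 120, "bob": 45}
shard_b = {"alice": 30, "carol": 7}
print(merge_counts([shard_a, shard_b]))
```

Note that averages cannot be merged this way: each shard must ship a (sum, count) pair so the coordinator can sum both parts and divide at the end.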
The Mapbox Directions Pipeline aims to always have the freshest map data available for routing. It involves getting the latest OpenStreetMap data, pre-processing it for directions, loading the new data into API servers, and then repeating the process. Each step uses its own CloudFormation stack. The pipeline downloads planet files from OpenStreetMap, pre-processes them for different transport profiles, uploads the results to S3, and updates the API CloudFormation stacks to fetch the new data.
This document provides a history and overview of ECMAScript (ES), the standard upon which JavaScript is based. It discusses the major versions from ES3 in 1999 to ES2016. Key changes and new features are outlined for each version, including the addition of classes, modules, iterators and more in ES6/ES2015. Transpilers like Babel allow the use of new syntax by compiling ES6 to older JavaScript. Compatibility and adoption are addressed, noting a goal of evolving the language without breaking the web. Links for further reading on ES6 features and syntax are also included.
Redis can be used as a time-series database by using the redis-timeseries module. The module provides a custom data structure and commands for storing and querying time-series data in Redis. Data can be added with a timestamp and value and queried within a time range. Downsampling aggregates and stores data at regular intervals to reduce the size of long time-series data. Global configuration allows defining downsampling rules and retention policies for all keys.
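The downsampling idea can be sketched in plain Python (a hypothetical illustration of the concept, not the redis-timeseries implementation): samples are grouped into fixed-width time buckets, and each bucket is reduced with an aggregation function.

```python
def downsample(samples, bucket_ms, agg=max):
    """Aggregate (timestamp, value) samples into fixed-width buckets.

    samples: iterable of (timestamp_ms, value), in any order
    bucket_ms: bucket width in milliseconds
    agg: aggregation over each bucket's values (max, min, sum, ...)
    """
    buckets = {}
    for ts, value in samples:
        # round the timestamp down to the start of its bucket
        buckets.setdefault(ts - ts % bucket_ms, []).append(value)
    return sorted((ts, agg(vals)) for ts, vals in buckets.items())

raw = [(1000, 2.0), (1500, 5.0), (2200, 3.0)]
print(downsample(raw, 1000))  # [(1000, 5.0), (2000, 3.0)]
```

Running such a rule periodically and storing only the bucketed output is what keeps long time-series small.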
Abstract: At DataRobot we deal with automation challenges every day. This talk will give insight into how we use Python tools built around Ansible, Terraform, and Docker to solve real-world problems in infrastructure and automation.
Be a Zen monk, the Python way. A short tech talk at Imaginea to get developers bootstrapped with the focus and philosophy of Python, and where the language converges with Zen philosophy.
The MinMax Cache concept keeps the minimum and maximum values of each data partition in a cache. When searching for a value, it can skip partitions that are outside the range of the minimum and maximum values, improving search speed compared to searching all partitions. Setting up a MinMax cache on columns where values increase over time allows faster pruning of partitions during searches. Testing on a dataset of 100M records showed searches were over 7 times faster with the MinMax cache enabled compared to without it.
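A minimal sketch of the pruning logic in Python (hypothetical: the partition layout, cache shape, and function names are assumptions, not the product's API):

```python
def build_minmax_cache(partitions):
    # one (min, max) pair per partition
    return [(min(p), max(p)) for p in partitions]

def search(partitions, cache, target):
    """Scan only partitions whose [min, max] range can contain target."""
    hits = []
    for part, (lo, hi) in zip(partitions, cache):
        if lo <= target <= hi:  # skip partitions whose range excludes target
            hits.extend(v for v in part if v == target)
    return hits
```

On columns whose values grow over time (timestamps, sequence IDs), partition ranges barely overlap, so most partitions are skipped outright; that pruning is what produces the reported speed-up.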
In class, we discussed min-heaps. In a min-heap, the element with the smallest key is the root of a binary tree. A max-heap always has as its root the element with the biggest key, and the key of a node is less than or equal to the key of its parent (key(node) <= key(parent)). Your task is to develop your own binary tree ADT and your own flex-heap ADT. The flex-heap ADT must implement both a min-heap and a max-heap. Note that instead of defining a removeMin or removeMax operation, your flex-heap will provide only a single remove operation. It must be implemented using your binary tree ADT. You have to implement these two ADTs in Java. The flex-heap ADT has to additionally support the dynamic switch from a min- to a max-heap and vice versa: remove() removes and returns the element with the smallest or biggest key value, depending on the heap status (min-heap vs. max-heap), and repairs the flex-heap afterwards accordingly. toggleHeap() transforms a min- to a max-heap or vice versa. switchMinHeap() transforms a max- to a min-heap. switchMaxHeap() transforms a min- to a max-heap. Binary trees must be implemented with an extendable array-list similar to what we discussed in class and in §7.3.5 of the textbook. You are not allowed to implement trees with lists. Further, you are not allowed to use any array-list, queue, vector, (binary) tree, or heap interfaces already available in Java. Your toggleHeap, switchMinHeap, and switchMaxHeap operations must run in O(n) time. All other flex-heap operations must run in either O(1) or O(log n). You may safely assume for the binary tree and flex-heap ADTs that keys are of type integer and values are of type character, so the use of generics is not required. You have to submit the following deliverables: a) Specification of the binary tree and flex-heap ADTs, including comments about assumptions and semantics (especially about the 3 added flex-heap operations). b) Pseudo code of your implementation of the binary tree and flex-heap ADTs.
Keep in mind that Java code will not be considered as pseudo code; your pseudo code must be on a higher and more abstract level. c) Well-formatted and documented Java source code and the corresponding executable jar file with at least 20 different but representative examples demonstrating the functionality of your implemented ADTs. These examples should cover all cases of your ADT functionality (e.g., all operations of your ADTs, sufficient sizes of flex-heaps, and a sufficient number of down-heap, up-heap, toggleHeap, switchMinHeap, and switchMaxHeap operations). You must have separate tests for each ADT, but the majority of tests should cover flex-heaps, because they are implemented using binary trees.
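Although the assignment requires Java, the central idea behind the O(n) toggle is language-neutral: rebuilding the array-based heap bottom-up with the opposite comparator is a standard heapify, which runs in O(n). A minimal Python illustration of just that idea (a sketch only, not a substitute for the required Java deliverables):

```python
def sift_down(a, i, cmp):
    # restore heap order below index i under comparator cmp
    n = len(a)
    while True:
        best, l, r = i, 2 * i + 1, 2 * i + 2
        if l < n and cmp(a[l], a[best]):
            best = l
        if r < n and cmp(a[r], a[best]):
            best = r
        if best == i:
            return
        a[i], a[best] = a[best], a[i]
        i = best

def toggle_heap(a, to_min):
    """Rebuild the array-based heap with the opposite ordering in O(n)."""
    cmp = (lambda x, y: x < y) if to_min else (lambda x, y: x > y)
    # bottom-up heapify: start at the last internal node
    for i in range(len(a) // 2 - 1, -1, -1):
        sift_down(a, i, cmp)
```

switchMinHeap and switchMaxHeap are then the same rebuild, guarded by the current heap status.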
The document proposes a solution to replace inode-based storage with a key-value store mapping objects directly to positions in large "volumes" or files to address scalability issues. It benchmarks significantly better performance for puts, gets, and concurrent operations compared to an XFS filesystem, using less RAM and avoiding compaction costs. Open tasks include replication, erasure coding, and testing on object servers.
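A toy sketch of the core idea in Python (hypothetical: the in-memory buffer stands in for one large on-disk volume file, and the dict stands in for the persistent key-value index; this is not the project's actual code):

```python
import io

class VolumeStore:
    """Append objects to a large volume; an index maps key -> (offset, length)."""

    def __init__(self):
        self.volume = io.BytesIO()  # stands in for one large on-disk file
        self.index = {}             # the key-value store replacing inodes

    def put(self, key, data):
        # append-only write: new objects always go at the end of the volume
        offset = self.volume.seek(0, io.SEEK_END)
        self.volume.write(data)
        self.index[key] = (offset, len(data))

    def get(self, key):
        offset, length = self.index[key]
        self.volume.seek(offset)
        return self.volume.read(length)
```

Because a lookup is one index probe plus one positioned read, there is no per-object inode or directory traversal, which is where the benchmarked gains over a filesystem come from.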
This document provides an introduction and overview of several Apache Spark labs covering: a "hello world" example of Resilient Distributed Datasets (RDDs); importing and performing operations on a wine dataset using DataFrames and SQL; and using the MLlib library to perform k-means clustering on features from the wine dataset. The labs demonstrate basic Spark concepts like RDDs, DataFrames, ML pipelines, and clustering algorithms.
Ted Dunning – Very High Bandwidth Time Series Database Implementation. This talk will describe our work in creating time series databases with very high ingest rates (over 100 million points / second) on very small clusters. Starting with openTSDB and the off-the-shelf version of MapR-DB, we were able to accelerate ingest by >1000x. I will describe our techniques in detail and talk about the architectural changes required. We are also working to allow access to openTSDB data using SQL via Apache Drill. In addition, I will talk about how this work has implications for the much fabled Internet of Things, and tell some stories about the origins of open source big data in the 19th century at sea.
This document discusses working with time series data using InfluxDB. It provides an overview of time series data and why InfluxDB is useful for storing and querying it. Key features of InfluxDB covered include its SQL-like query language, retention policies for managing data storage, continuous queries for aggregation, and tools for data collection, visualization and monitoring.
Cryptocurrencies are digital currencies that have garnered significant investor attention in the financial markets. The aim of this project is to predict the daily price, particularly the daily closing price, of the cryptocurrency Bitcoin, which plays a vital role in making trading decisions. Various factors affect the price of Bitcoin, making price prediction a complex and technically challenging task. To perform the prediction, a random forest model was trained on the historical time series, i.e., the past prices of Bitcoin over several years. Features such as the opening price, highest price, lowest price, closing price, volume of Bitcoin, volume of currencies, and weighted price were taken into consideration to predict the closing price of the next day. The random forest model was designed and implemented on both the PySpark and scikit-learn frameworks, and both implementations were evaluated by computing measures such as the RMSE (root mean square error) and r (Pearson's correlation coefficient) on test data. The PySpark framework was used to parallelize tree construction when training the random forest, in order to handle big data. Code has been made available at: https://github.com/ykpgrr/Price-Prediction-with-Random-Forest
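The two evaluation measures mentioned above can be computed directly; here is a minimal, framework-agnostic Python sketch (not the project's PySpark or scikit-learn code):

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Lower RMSE means predictions are closer to actual closing prices; r near 1 means predicted and actual prices move together.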
“Just leave the garbage outside and we will take care of it for you.” This is the panacea promised by the garbage collection mechanisms built into most software stacks available today. So, we don’t need to think about it anymore, right? Wrong! When misused, garbage collectors can fail miserably. When this happens, they slow down your application and lead to unacceptable pauses. In this talk we will go over different garbage collection approaches in different software runtimes and the conditions that enable them to function well. Presented at Reversim Summit 2019: https://summit2019.reversim.com/session/5c754052d0e22f001706cbd8
Presto Raptor is a columnar storage system designed to work natively with Presto that provides real-time analytics capabilities. Raptor is optimized for high performance on flash storage and scales to handle large volumes of data and high query throughput. Key features of Raptor include bucketed tables to co-locate related data and enable fast joins, temporal columns to optimize queries on time-series data, and physical data awareness to skip unnecessary data during queries. Raptor can be used for real-time dashboards, funnels, and event analytics on large datasets stored in its distributed database.
NBITSearch is a search engine with an open API for local stations, LAN, and the Internet. Advantages over counterparts: 1. Object indexing: it allows indexing objects S of any type T. 2. Multifunctional indexing: it allows indexing objects simultaneously by any set of functions F(S). 3. Very fast search: it saves time and money.
The document discusses topology and partitioning in Cassandra distributed hash tables (DHTs). It describes issues with poor load distribution and data distribution in traditional DHT designs. It proposes using virtual nodes, where each physical node is assigned multiple tokens, to better distribute partitions and improve performance. Configuration options for Cassandra are presented that implement virtual nodes using a random token assignment strategy.
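A minimal Python sketch of virtual nodes on a hash ring (illustrative only: the `token` and `build_ring` helpers are assumptions, and real Cassandra vnodes draw random tokens rather than hashing node names):

```python
import bisect
import hashlib

def token(s):
    # map a string to a position on the ring
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=8):
    """Give each physical node several tokens (its virtual nodes)."""
    return sorted((token(f"{node}-{i}"), node)
                  for node in nodes for i in range(vnodes))

def owner(ring, key):
    """Walk clockwise from the key's token to the next vnode's owner."""
    tokens = [t for t, _ in ring]
    idx = bisect.bisect(tokens, token(key)) % len(ring)
    return ring[idx][1]
```

With many tokens per node, each node owns many small, scattered ranges, so load spreads more evenly, and a joining or leaving node shifts data to or from many peers instead of just one.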
How do you master the flow of IoT data with DataStax, Apache Cassandra, and Apache Spark? IoT breakfast session of November 19, 2015.
This document discusses building scalable IoT applications using open source technologies. It begins by providing an overview of the growth of the IoT market and connected devices. It then discusses challenges with traditional "data lake" architectures for IoT data due to the high volume, velocity, and variety of IoT data. The document proposes an architecture combining stream processing for real-time data with analytics on both real-time and stored data. It discusses data access patterns and storage requirements for different types of IoT data. Finally, it provides an overview of open source technologies that can be used to build scalable IoT applications.
Feeling the need to contribute something to Apache Cassandra? Maybe you want to help guide the future of your favorite database? Get off the sidelines and get in the game! It's easy to say, but how do you even get started? I will outline some of the ways you can help contribute to Apache Cassandra, from minor to major. If you don't have the time or ability to submit code, there are a lot of ways you can participate. What if you do want to write some code? I can walk you through the process of creating a patch and submitting it for final approval. Got a great idea? I'll show you how to propose it to the community at large. Take it from me, participating is so much more fun than just watching the project from a distance. Time to jump in! About the Speaker: Patrick McFadin, Chief Evangelist, DataStax. Patrick McFadin is one of the leading experts on Apache Cassandra and data modeling techniques. As the Chief Evangelist for Apache Cassandra and consultant for DataStax, he has helped build some of the largest and most exciting deployments in production. Prior to DataStax, he was Chief Architect at Hobsons and an Oracle DBA/Developer for over 15 years.
A 30 minute talk I did at Cassandra Dublin and Cassandra London. Just some things I've learned along the way as I've helped some of the largest users of Cassandra be successful. Learn from other people's mistakes!
In this talk we describe the features of Cassandra that set it above the pack, and how to get the most out of them, depending on your application. In particular, we'll describe de-normalization, and detail how the algorithms behind Cassandra leverage awesome write speed to accelerate reads; and we'll explain how Cassandra achieves multi-datacenter support, tunable consistency and no single point of failure, to give a great solution for highly available systems.
Among the resources offered by Wikimedia is an API providing low-latency access to full-history content, in many formats. Its results are often the product of computationally intensive transforms, and must be pre-generated and stored to meet latency expectations. Unsurprisingly, there are many challenges to providing low-latency access to such a large data-set, in a demanding, globally distributed environment. This presentation covers the Wikimedia content API and its use of Apache Cassandra as storage for a diverse and growing set of use-cases. Trials, tribulations, and triumphs, of both a development and operational nature will be discussed.
This document summarizes Eric Evans' presentation on using Cassandra as the backend for Wikimedia's content API. It discusses Wikimedia's goals of providing free knowledge, key metrics about Wikipedia and its architecture. It then focuses on how Wikimedia uses Cassandra, including their data model, compression techniques, and ongoing work to optimize compaction strategies and reduce node density to improve performance.
Degetel DataStax webinar of October 15, 2015. From SQL to NoSQL: Why? What are the differences? How does it work?
Banking / Insurance webinar: Take back control of your data.
Castle is an open-source project that provides an alternative to the lower layers of the storage stack -- RAID and POSIX filesystems -- for big data workloads, and distributed data stores such as Apache Cassandra. This presentation from Berlin Buzzwords 2012 provides a high-level overview of Castle and how it is used with Cassandra to improve performance and predictability.
DataStax is a company that drives development of the Apache Cassandra database. It has over 400 customers including 24 Fortune 100 companies. DataStax Enterprise provides a highly available, scalable and secure database platform using Cassandra for mission critical applications. It supports analytics, search and multi-datacenter deployments across hybrid cloud environments.
Discover DataStax Enterprise with its integrations of Apache Spark and Apache Solr, as well as its new graph data model.