This document discusses using Apache Cassandra to store and manage time series data in OpenNMS. It describes some limitations of the existing RRDTool-based data storage, such as high I/O requirements for updating and aggregating data. Cassandra is presented as an alternative that is optimized for write throughput and offers flexible data modeling, high availability, and the ability to perform aggregations at read time rather than at write time. The Newts project is introduced as a standalone time series data store built on Cassandra that aims to provide fast storage and retrieval of raw samples along with flexible aggregation capabilities.
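To make the read-time aggregation idea concrete, here is a minimal Python sketch (purely illustrative and not part of Newts or OpenNMS; the function and parameter names are invented) that averages raw samples into fixed-width time buckets only when they are queried:

```python
from collections import defaultdict

def aggregate_at_read_time(samples, step_seconds=300):
    """Roll raw (timestamp, value) samples up into fixed-width buckets.

    A Newts-style store keeps only the raw samples on disk; averages (or
    other aggregates) are computed like this when data is read, not written.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - (ts % step_seconds)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# Example: three raw samples collapse into one 5-minute average bucket.
raw = [(1000, 2.0), (1100, 4.0), (1190, 6.0)]
print(aggregate_at_read_time(raw))  # {900: 4.0}
```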
The document describes improvements to building histograms for database tables. It outlines building a histogram from a sample of rows rather than a full table scan, avoiding a sort of all values and improving performance. The new implementation lets the user specify a sampling percentage and constructs an equal-width histogram from the sample, using it to estimate the minimum and maximum values and then bucketing the sampled values between those bounds.
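A minimal Python sketch of the approach described (illustrative only; the database's actual implementation and parameter names are not shown here): estimate the minimum and maximum from a row sample, then count the sampled values into equal-width buckets.

```python
import random

def sampled_equal_width_histogram(values, sample_pct=10.0, buckets=10):
    """Build an equal-width histogram from a row sample instead of a full scan.

    The min and max are estimated from the sample, the range is split into
    equal-width buckets, and the sampled values are counted into those buckets.
    """
    k = max(1, int(len(values) * sample_pct / 100.0))
    sample = random.sample(values, k)
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / buckets or 1
    counts = [0] * buckets
    for v in sample:
        counts[min(int((v - lo) / width), buckets - 1)] += 1
    return lo, hi, counts

values = list(range(1000))
print(sampled_equal_width_histogram(values, sample_pct=5, buckets=4))
```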
This document provides a history and overview of ECMAScript (ES), the standard upon which JavaScript is based. It discusses the major versions from ES3 in 1999 to ES2016. Key changes and new features are outlined for each version, including the addition of classes, modules, iterators and more in ES6/ES2015. Transpilers like Babel allow the use of new syntax by compiling ES6 to older JavaScript. Compatibility and adoption are addressed, noting a goal of evolving the language without breaking the web. Links for further reading on ES6 features and syntax are also included.
This document discusses frequent pattern mining algorithms, describing Apriori, AprioriTid, and FP-Growth. The Apriori algorithm uses candidate generation and repeated database scans to find frequent itemsets. AprioriTid tracks transaction IDs to reduce the number of scans. FP-Growth avoids candidate generation and multiple scans altogether by building a frequent-pattern (FP) tree and mining frequent patterns directly from that tree.
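As a concrete illustration of the candidate-generation-and-scan pattern that Apriori uses, here is a minimal, unoptimized Python sketch (the transactions and support threshold are invented, and the usual subset-pruning step is omitted for brevity):

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    """Plain Apriori: generate candidate itemsets level by level and keep
    those whose support (number of containing transactions) meets the threshold."""
    transactions = [set(t) for t in transactions]
    # Level 1: candidate single items.
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Count support for each candidate with one pass over the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets.
        keys = list(survivors)
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

txns = [["bread", "milk"], ["bread", "diapers", "beer"],
        ["milk", "diapers", "beer"], ["bread", "milk", "diapers", "beer"]]
print(apriori(txns, min_support=2))
```

FP-Growth reaches the same result without generating these candidate sets, by compressing the transactions into an FP-tree and mining it recursively.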
Streaming data presents new challenges for statistics and machine learning on extremely large data sets. Tools such as Apache Storm, a stream processing framework, can power a range of data analytics but lack advanced statistical capabilities. These slides are from an ApacheCon talk on developing streaming algorithms with the flexibility of both Storm and R, a statistical programming language. The talk discussed why and how to use Storm and R together to develop streaming algorithms, focusing in particular on: streaming algorithms, online machine learning algorithms, and use cases showing how to process hundreds of millions of events a day in (near) real time. See: https://apacheconna2015.sched.org/event/09f5a1cc372860b008bce09e15a034c4#.VUf7wxOUd5o
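The talk itself pairs Storm with R; as a language-neutral illustration of what an online algorithm looks like, here is Welford's running mean and variance in Python, a hypothetical stand-in for the per-event update a Storm bolt would apply:

```python
class RunningStats:
    """Welford's online algorithm: one pass over the stream, O(1) memory,
    the kind of incremental update applied to each incoming event."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for event in [12.0, 15.5, 9.3, 14.1, 11.8]:
    stats.update(event)
print(stats.mean, stats.variance)
```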
An overview of streaming algorithms: what they are, which general principles govern them, and how they fit into a big data architecture. Four specific streaming algorithms and their use cases are also presented.
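One classic streaming algorithm of the kind such an overview typically covers (not necessarily one of the four in the slides) is reservoir sampling, sketched here in Python:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of size k from a stream of unknown length,
    using O(k) memory -- a classic single-pass streaming algorithm."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)   # replace items with decreasing probability
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```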
This talk goes over how to build an end-to-end data processing system in Python, from data ingest to data analytics, machine learning, and user presentation. Developments in both established and newer tools make this particularly feasible today. In particular, the talk covers Airflow for workflow orchestration, PySpark for data processing, the Python data science libraries for machine learning and advanced analytics, and building agile microservices in Python. System architects, software engineers, data scientists, and business leaders can all benefit from attending. They should learn how to build more agile data processing systems and take away ideas on how their data systems could be simpler and more powerful.
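As a hedged illustration of the PySpark piece of such a pipeline, here is a minimal batch step (the paths, schema, and column names are invented for the example; an Airflow task would typically just run a script like this on a schedule):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Read raw event data and compute a per-user daily summary that downstream
# machine learning or presentation layers could consume.
spark = SparkSession.builder.appName("daily-summary").getOrCreate()

events = spark.read.csv("/data/raw/events.csv", header=True, inferSchema=True)

daily = (events
         .withColumn("day", F.to_date("timestamp"))
         .groupBy("user_id", "day")
         .agg(F.count("*").alias("events"),
              F.avg("duration").alias("avg_duration")))

daily.write.mode("overwrite").parquet("/data/curated/daily_summary")
```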
Slides for Alt.NET vAPRIL17 session. This session illustrates the approach we at Hibernating Rhinos took when developing a new version of RavenDB.
The document discusses using the raster package in R to work with geographical grid data. It covers downloading and loading the raster package, creating raster objects and adding random values, reading in real climate data files, performing operations like cropping and aggregation, and sources for global climate data like WorldClim.
Heapsort is a sorting algorithm built on a binary-tree-shaped data structure known as a heap. It has a worst-case runtime of O(n log n), making it useful for time-critical applications. A max-heap is a complete, left-justified binary tree in which each node's value is greater than or equal to the values of its children. Heapsort builds a heap from the input values and then repeatedly removes the largest value to produce the sorted output.
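A compact Python sketch of the algorithm described, using the usual array-backed max-heap rather than an explicit tree:

```python
def heapsort(values):
    """In-place heapsort: build a max-heap, then repeatedly swap the largest
    element to the end of the array and restore the heap property."""
    def sift_down(a, start, end):
        root = start
        while 2 * root + 1 <= end:
            child = 2 * root + 1
            if child + 1 <= end and a[child] < a[child + 1]:
                child += 1                      # pick the larger child
            if a[root] < a[child]:
                a[root], a[child] = a[child], a[root]
                root = child
            else:
                return

    a = list(values)
    n = len(a)
    for start in range(n // 2 - 1, -1, -1):     # heapify: O(n)
        sift_down(a, start, n - 1)
    for end in range(n - 1, 0, -1):             # n extractions: O(n log n)
        a[0], a[end] = a[end], a[0]
        sift_down(a, 0, end - 1)
    return a

print(heapsort([5, 1, 9, 3, 7, 2]))  # [1, 2, 3, 5, 7, 9]
```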
Apache SAMOA is a platform for mining big data streams in real time that provides algorithms, libraries, and an execution framework. It allows researchers to develop and compare stream mining algorithms, and practitioners to easily apply state-of-the-art algorithms to problems like sentiment analysis, spam detection, and recommendations. The Vertical Hoeffding Tree algorithm in SAMOA provides high parallelism and accuracy for streaming decision tree learning, outperforming native Apache Flink implementations on certain datasets and running faster on others.
Be a Zen monk, the Python way. A short tech talk at Imaginea to bootstrap developers into the focus and philosophy of Python, and the point where the language and Zen philosophy converge.
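The philosophy the talk refers to is presumably the Zen of Python (PEP 20), which the language ships as an easter egg:

```python
# Running this in any Python interpreter prints the Zen of Python (PEP 20),
# the set of aphorisms that capture the language's design philosophy.
import this
```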
We propose a pipeline for text extraction from infographics that makes use of a novel combination of data mining and computer vision techniques. The pipeline defines a sequence of steps to identify characters, cluster them into text lines, determine their rotation angle, and apply state-of-the-art OCR to recognize the text. In this paper, we formally define the pipeline and present its current implementation. In addition, we have conducted preliminary evaluations over a data corpus of 121 manually annotated infographics from a broad range of illustration types such as bar charts, pie charts, line charts, maps, and others. We assess the results of our text extraction pipeline by comparing it with two baselines. Finally, we sketch an outline for future work and possibilities for improving the pipeline. - http://ceur-ws.org/Vol-1458/
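To give a feel for the clustering and rotation-angle stages, here is a simplified, self-contained Python sketch (not the authors' implementation; the bounding boxes are synthetic, and the real pipeline additionally includes character detection and OCR):

```python
import math

def cluster_text_lines(char_boxes, y_tolerance=10):
    """Group character bounding boxes (x, y, w, h) into text lines by the
    vertical distance between their centers -- a simplified stand-in for the
    clustering stage of the pipeline."""
    centers = sorted(((x + w / 2, y + h / 2) for x, y, w, h in char_boxes),
                     key=lambda c: c[1])
    lines, current = [], [centers[0]]
    for cx, cy in centers[1:]:
        if abs(cy - current[-1][1]) <= y_tolerance:
            current.append((cx, cy))
        else:
            lines.append(current)
            current = [(cx, cy)]
    lines.append(current)
    return lines

def rotation_angle(line):
    """Estimate a text line's rotation angle (degrees) from a least-squares
    fit through its character centers."""
    n = len(line)
    mx = sum(x for x, _ in line) / n
    my = sum(y for _, y in line) / n
    num = sum((x - mx) * (y - my) for x, y in line)
    den = sum((x - mx) ** 2 for x, _ in line) or 1e-9
    return math.degrees(math.atan2(num, den))

boxes = [(10, 100, 8, 12), (22, 102, 8, 12), (34, 104, 8, 12),  # a slightly tilted line
         (10, 200, 8, 12), (22, 200, 8, 12)]                    # a horizontal line
for line in cluster_text_lines(boxes):
    print(len(line), "chars, angle ~", round(rotation_angle(line), 1), "deg")
```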
This document provides an overview of machine learning fundamentals and techniques. It defines machine learning as a field that allows computers to learn without being explicitly programmed. It also describes supervised learning techniques like regression and classification and unsupervised learning techniques like clustering. Specific algorithms discussed include linear regression, logistic regression, neural networks, K-means clustering, and anomaly detection. Frameworks for machine learning like Azure ML and Spark MLlib are also summarized.
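A small scikit-learn sketch showing one supervised and one unsupervised technique from the list (scikit-learn is used here purely for illustration; the document itself discusses Azure ML and Spark MLlib):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Unsupervised example: K-means clustering on toy 2-D points.
points = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1], [8, 8], [8.2, 7.9], [7.8, 8.1]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels:", kmeans.labels_)

# Supervised example: logistic regression on a tiny labeled set.
X = np.array([[0.1], [0.4], [0.35], [0.8], [0.9], [0.75]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)
print("P(class=1 | x=0.6):", clf.predict_proba([[0.6]])[0, 1])
```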
The DEBS Grand Challenge is a yearly event processing challenge based on real-life data, posed by the Distributed Event Based Systems (DEBS) conference. The 2015 challenge uses a taxi trip data set from New York City that includes 173 million events collected over the year 2013. This paper presents how we used WSO2 CEP, an open source, commercially available complex event processing engine, to solve the problem. With an 8-core commodity physical machine, our solution achieved a throughput of about 350,000 events/second with mean latency of less than one millisecond for both queries. With a VirtualBox VM, it achieved about 50,000 events/second with mean latencies of 1 and 6 milliseconds for the first and second queries respectively. The paper will outline the solution, present results, and discuss how we optimized the solution for maximum performance.
Linked Open Data (LOD) is about publishing and interlinking data of different origin and purpose on the web. The Resource Description Framework (RDF) is used to describe data on the LOD cloud. In contrast to relational databases, RDF does not impose a fixed, pre-defined schema. Rather, RDF allows the data schema to be modeled flexibly by attaching RDF types and properties to the entities. Our schema-level index, SchemEX, allows searching in large-scale RDF graph data. The index can be computed efficiently, with reasonable accuracy, over large-scale data sets with billions of RDF triples, the smallest information unit on the LOD cloud. SchemEX is highly needed as the size of the LOD cloud quickly increases. Due to the evolution of the LOD cloud, one observes frequent changes of the data. We show that the data schema also changes, in terms of the combinations of RDF types and properties. As isolated change sets cannot capture the dynamics of the LOD cloud, current work includes temporal clustering and finding periodicities in entity dynamics over large-scale snapshots of the LOD cloud of about 100 million triples per week, collected for more than three years.
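A tiny, hypothetical rdflib example of the schema flexibility described above, plus the type/property combination a schema-level index such as SchemEX would group this entity by (the namespace and properties are invented):

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef

# RDF types and properties are simply attached to entities; there is no
# fixed schema to declare up front.
EX = Namespace("http://example.org/")
g = Graph()

alice = URIRef("http://example.org/alice")
g.add((alice, RDF.type, EX.Person))        # rdf:type contributes to the schema-level view
g.add((alice, EX.name, Literal("Alice")))
g.add((alice, EX.worksFor, EX.ACME))       # new properties can be added at any time

# A schema-level index groups entities by their combination of types and
# properties; here we just print that combination for one entity.
types = {o for _, _, o in g.triples((alice, RDF.type, None))}
props = {p for _, p, _ in g.triples((alice, None, None)) if p != RDF.type}
print(types, props)
```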
"At OpenX we not only use the tools in big data ecosystems to solve our business problems, but also explore the cutting edge algorithms for practical uses. HyperLogLog is one of the algorithm that we use intensively in our internal system. It has really low computation cost and can easily plug into map-reduce framework (hadoop or spark). Some of the applications that worth to highlight are: * high cardinality test * distinct count of unique users over time * Visualize hyperloglog for fraud detection"
Sun Tzu said, “If you know your enemies and know yourself, you can win a hundred battles without a single loss.” Those words have never been truer than in our time. We are faced with an avalanche of data. Many believe the ability to process and gain insights from a vast array of available data will be the primary competitive advantage for organizations in the years to come. To make sense of data, you will have to face many challenges: how to collect, how to store, how to process, and how to react fast. Although you can build these systems from the ground up, doing so is a significant undertaking. There are many technologies, both open source and proprietary, that you can put together to build your analytics solution, which will likely save you effort and give you a better solution. In this session, Srinath will discuss WSO2's big data middleware offerings and explain how you can put them together to build a solution that makes sense of your data. The session will cover technologies like Thrift for collecting data, Cassandra for storing data, Hadoop for analyzing data in batch mode, and complex event processing for analyzing data in real time.
The document discusses topology and partitioning in Cassandra distributed hash tables (DHTs). It describes issues with poor load distribution and data distribution in traditional DHT designs. It proposes using virtual nodes, where each physical node is assigned multiple tokens, to better distribute partitions and improve performance. Configuration options for Cassandra are presented that implement virtual nodes using a random token assignment strategy.
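A small Python simulation of the effect described (the parameters are invented; in Cassandra the number of tokens per node is controlled by the `num_tokens` setting in cassandra.yaml):

```python
import bisect
import random
from collections import Counter

def ownership(num_nodes, tokens_per_node, ring_size=2**32, keys=100_000):
    """Simulate how evenly keys spread over nodes when each node owns
    `tokens_per_node` random tokens on the hash ring (1 = classic single-token
    assignment, larger values = virtual nodes)."""
    random.seed(42)
    ring = sorted((random.randrange(ring_size), n)
                  for n in range(num_nodes) for _ in range(tokens_per_node))
    positions = [t for t, _ in ring]
    counts = Counter()
    for _ in range(keys):
        k = random.randrange(ring_size)
        i = bisect.bisect_right(positions, k) % len(ring)  # next token clockwise owns the key
        counts[ring[i][1]] += 1
    return counts

for vnodes in (1, 256):
    c = ownership(num_nodes=6, tokens_per_node=vnodes)
    print(f"{vnodes:>3} tokens/node -> max/min load: {max(c.values())}/{min(c.values())}")
```

With a single token per node, a few nodes end up owning disproportionately large arcs of the ring; with hundreds of tokens per node the per-node load evens out considerably.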
How can you master the flow of IoT data with DataStax, Apache Cassandra, and Apache Spark? From the IoT breakfast briefing (Petit Déjeuner IoT) of November 19, 2015.
Discover DataStax Enterprise with its integrations of Apache Spark and Apache Solr, as well as its new Graph data model.
This document discusses building scalable IoT applications using open source technologies. It begins by providing an overview of the growth of the IoT market and connected devices. It then discusses challenges with traditional "data lake" architectures for IoT data due to the high volume, velocity, and variety of IoT data. The document proposes an architecture combining stream processing for real-time data with analytics on both real-time and stored data. It discusses data access patterns and storage requirements for different types of IoT data. Finally, it provides an overview of open source technologies that can be used to build scalable IoT applications.
Feeling the need to contribute something to Apache Cassandra? Maybe you want to help guide the future of your favorite database? Get off the sidelines and get in the game! That's easy to say, but how do you even get started? I will outline some of the ways you can help contribute to Apache Cassandra, from minor to major. If you don't have the time or ability to submit code, there are still a lot of ways you can participate. What if you do want to write some code? I can walk you through the process of creating a patch and submitting it for final approval. Got a great idea? I'll show you how to propose it to the community at large. Take it from me, participating is so much more fun than just watching the project from a distance. Time to jump in! About the Speaker: Patrick McFadin, Chief Evangelist, DataStax. Patrick McFadin is one of the leading experts on Apache Cassandra and data modeling techniques. As the Chief Evangelist for Apache Cassandra and a consultant for DataStax, he has helped build some of the largest and most exciting deployments in production. Prior to DataStax, he was Chief Architect at Hobsons and an Oracle DBA/developer for over 15 years.
A 30-minute talk I gave at Cassandra Dublin and Cassandra London: just some things I've learned along the way while helping some of the largest users of Cassandra be successful. Learn from other people's mistakes!
In this talk we describe the features of Cassandra that set it apart from the pack, and how to get the most out of them depending on your application. In particular, we'll describe de-normalization and detail how the algorithms behind Cassandra leverage its exceptional write speed to accelerate reads; and we'll explain how Cassandra achieves multi-datacenter support, tunable consistency, and no single point of failure, making it a great solution for highly available systems.
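To make the de-normalization and tunable-consistency points concrete, here is a small, hypothetical sketch using the DataStax Python driver (the keyspace, table, and column names are invented; it is not taken from the talk itself):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")

# De-normalization: store events already partitioned and ordered the way the
# application reads them, so a read is a single sequential partition scan.
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_user (
        user_id text,
        event_time timestamp,
        payload text,
        PRIMARY KEY ((user_id), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
""")

# Tunable consistency: require a quorum of replicas for this particular read.
query = SimpleStatement(
    "SELECT * FROM events_by_user WHERE user_id = %s LIMIT 10",
    consistency_level=ConsistencyLevel.QUORUM,
)
for row in session.execute(query, ("user-42",)):
    print(row.event_time, row.payload)
```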
Among the resources offered by Wikimedia is an API providing low-latency access to full-history content, in many formats. Its results are often the product of computationally intensive transforms, and must be pre-generated and stored to meet latency expectations. Unsurprisingly, there are many challenges to providing low-latency access to such a large data-set, in a demanding, globally distributed environment. This presentation covers the Wikimedia content API and its use of Apache Cassandra as storage for a diverse and growing set of use-cases. Trials, tribulations, and triumphs, of both a development and operational nature will be discussed.
This document summarizes Eric Evans' presentation on using Cassandra as the backend for Wikimedia's content API. It discusses Wikimedia's goals of providing free knowledge, key metrics about Wikipedia and its architecture. It then focuses on how Wikimedia uses Cassandra, including their data model, compression techniques, and ongoing work to optimize compaction strategies and reduce node density to improve performance.
The Wikimedia Foundation is a non-profit and charitable organization driven by a vision of a world in which every human can freely share in the sum of all knowledge. Each month Wikimedia sites serve over 18 billion page views to 500 million unique visitors around the world. Among the many resources offered by Wikimedia is a public-facing API that provides low-latency, programmatic access to full-history content and metadata in a variety of formats. Commonly, results from this system are the product of computationally intensive transformations and must be pre-generated and persisted to meet latency expectations. Unsurprisingly, there are numerous challenges to providing low-latency storage of such a massive data set in a demanding, globally distributed environment. This talk covers the Wikimedia Content API and its use of Apache Cassandra, a massively scalable distributed database, as storage for a diverse and growing set of use cases. Trials, tribulations, and triumphs of both a development and operational nature are discussed.