The document discusses analyzing log data using Apache Spark. It covers challenges with log data like schema mediation and feature engineering to transform log records into vectors. It also discusses visualizing the structured data using dimensionality reduction techniques like PCA and self-organizing maps to find outliers. The document provides examples of analyzing log data to identify the most frequent error levels and applications generating logs.
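To make the error-level counting concrete, here is a minimal PySpark sketch along the lines of what the deck describes: parse raw log lines and count occurrences per level and application. The log format (timestamp, level, app, message) and file paths are assumptions, not the deck's actual dataset.

```python
# Hypothetical illustration: counting log levels per application with PySpark.
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-levels").getOrCreate()

LOG_RE = re.compile(r"^(\S+ \S+) (ERROR|WARN|INFO|DEBUG) (\S+) (.*)$")

def parse(line):
    m = LOG_RE.match(line)
    if m:
        ts, level, app, msg = m.groups()
        return (ts, level, app, msg)
    return None

lines = spark.sparkContext.textFile("logs/*.log")        # placeholder path
rows = lines.map(parse).filter(lambda r: r is not None)
df = rows.toDF(["timestamp", "level", "app", "message"])

# Most frequent error levels and the applications producing them.
(df.groupBy("app", "level")
   .count()
   .orderBy("count", ascending=False)
   .show(20))
```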
This deck walks you through using Apache Drill and Apache Superset (Incubating) to explore cyber security datasets including PCAP, HTTPD log files, Syslog and more.
This document provides an overview and demonstration of using Docker for a sample web application. It begins with an introduction to Docker and its components like containers. It then demonstrates building a Python/Django application within a Docker container and connecting it to a MySQL database in another linked container. Performance is compared across different configurations, including changing the database to PostgreSQL, adding Nginx and Gunicorn, and integrating Memcached caching. The document concludes by showing how to use load testing tools with the Dockerized application setup.
IBM Informix is a database management system that provides capabilities for handling different types of data including relational tables, JSON collections, and time series data. It uses a hybrid approach that allows seamless access to different data types using SQL and NoSQL APIs. The document discusses how Informix can be used to store and analyze IoT, mobile, and sensor data from devices and gateways in both on-premises and cloud environments. It also highlights the Informix Warehouse Accelerator for in-memory analytics and how Informix can be integrated with other IBM products and services like MongoDB, Bluemix, and Cognos.
Slides from the talk I gave at PyRE.it in Reggio Emilia about developing a REST API in Python using a bit of Flask and SQLAlchemy.
www.pyre.it
www.alessandrocucci.it/pyre/restapi
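For readers who want a feel for the approach, here is a minimal Flask plus Flask-SQLAlchemy REST endpoint in the spirit of the talk; the Task model, routes, and SQLite database are illustrative assumptions rather than the slides' actual example.

```python
# Minimal sketch of a Flask + Flask-SQLAlchemy REST API (assumed example).
from flask import Flask, jsonify, request
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///tasks.db"
db = SQLAlchemy(app)

class Task(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(120), nullable=False)

@app.route("/tasks", methods=["GET"])
def list_tasks():
    # Return every task as JSON.
    return jsonify([{"id": t.id, "title": t.title} for t in Task.query.all()])

@app.route("/tasks", methods=["POST"])
def create_task():
    # Create a task from the JSON request body.
    task = Task(title=request.json["title"])
    db.session.add(task)
    db.session.commit()
    return jsonify({"id": task.id, "title": task.title}), 201

if __name__ == "__main__":
    with app.app_context():
        db.create_all()
    app.run(debug=True)
```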
This document discusses NSClient++, a simple but powerful system monitoring agent. It provides examples of using filters to customize monitoring checks, including filtering by level, source, size, load, and other attributes. The document also outlines NSClient++ version history and support/funding options for the open source project.
1. The document discusses Docker containers, Docker machines, and Docker Compose as tools for building Python development environments and deploying backend services.
2. It provides examples of using Docker to run sample Python/Django applications with MySQL and PostgreSQL databases in containers, and load testing the applications.
3. The examples demonstrate performance testing Python REST APIs with different database backends and caching configurations using Docker containers.
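As a rough illustration of the load-testing step, the sketch below hammers a containerized REST endpoint with concurrent requests and reports average latency. The URL, request count, and concurrency are placeholders; the deck itself used dedicated load-testing tools rather than a hand-rolled script.

```python
# Rough load-test sketch against a containerized REST endpoint (assumed URL).
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/api/items/"  # hypothetical Django endpoint

def timed_get(_):
    start = time.perf_counter()
    resp = requests.get(URL, timeout=10)
    return resp.status_code, time.perf_counter() - start

with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(timed_get, range(200)))

latencies = [lat for status, lat in results if status == 200]
print(f"ok={len(latencies)}/200  avg={sum(latencies)/len(latencies)*1000:.1f} ms")
```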
This document discusses using SPARQL and RDF data for data science and analytics. It provides examples of using SPARQL to perform business intelligence queries on RDF data, calculate graph measures like shortest paths, and implement clustering algorithms. Large amounts of RDF data are available for analysis from sources like Freebase, the Linked Open Data Cloud, and schemas like schema.org. SPARQL is presented as a standard query language that can be used to enable data science and analytics over RDF graphs at web-scale.
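As a small taste of the kind of analytics query described, here is a sketch that runs an aggregate SPARQL query from Python with SPARQLWrapper; the DBpedia endpoint and the query itself are illustrative choices, not taken from the slides.

```python
# Run a SPARQL analytics-style query from Python (illustrative endpoint/query).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    SELECT ?type (COUNT(?s) AS ?n)
    WHERE { ?s a ?type }
    GROUP BY ?type
    ORDER BY DESC(?n)
    LIMIT 10
""")

# Print the ten most common RDF types and their instance counts.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["type"]["value"], row["n"]["value"])
```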
How We Learned To Love The Data Center Operating System
This document discusses how Adform, an online advertising company, adopted containers and the container orchestration platform DC/OS to manage their data science workloads. It describes the challenges they faced with inconsistent infrastructure and environments. Containers provided isolation, consistent deployment, and resource management. Key DC/OS components like Marathon and Mesos helped with scheduling, deployment, and cluster management. Overall containers created a unified way for data scientists to develop models and analyze data at scale.
This document describes SHI3LD, a context-aware access control system for RDF graph stores. SHI3LD uses semantic web technologies and vocabularies to define access policies and user contexts. It evaluates policies against user contexts to determine which named graphs the user can access. This allows fine-grained, context-sensitive access control over RDF data. The system was evaluated using a SPARQL benchmark dataset, and response times increased only slightly as more user contexts and consumers were added. Future work may focus on improving context data trustworthiness and performing user-centered evaluations.
Bridging Structured and Unstructured Data with Apache Hadoop and Vertica
This document discusses bridging unstructured and structured data with Hadoop and Vertica. It describes using Hadoop to extract and structure unstructured investment data from the web. Then it uses Pig to add zip code data and store the results in Vertica. Finally, it explains how Vertica can be used for reporting and data visualization of the structured data for analysis.
This document provides an overview of using the SILK tool suite to analyze netflow data. It discusses:
- The basics of netflow and some use cases for analyzing netflow data
- The key components of the SILK architecture and how to set up SILK
- Using the SILK command line interface and PySILK API to perform basic analysis workflows like filtering, grouping, aggregating, and enriching netflow data
- Examples of investigating security incidents and characterizing network activity using SILK
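To make the filter/group/aggregate workflow concrete without assuming a SiLK installation, here is a plain-Python sketch over a hypothetical CSV flow export; it illustrates the shape of the analysis that SiLK's rwfilter/rwstats tools perform at scale, and does not use SiLK or PySiLK itself.

```python
# Plain-Python sketch of a filter -> group -> aggregate flow analysis
# (not SiLK itself; the flows.csv layout is an assumption).
import csv
from collections import Counter

bytes_by_src = Counter()
with open("flows.csv", newline="") as f:                  # hypothetical export
    for rec in csv.DictReader(f):
        if rec["proto"] == "6" and rec["dport"] == "443":  # filter: TCP to 443
            bytes_by_src[rec["sip"]] += int(rec["bytes"])  # group + aggregate

# "Top talkers"-style report: ten sources by bytes sent.
for sip, total in bytes_by_src.most_common(10):
    print(sip, total)
```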
Spark Summit EU talk by Debasish Das and Pramod Narasimha
This document describes a system called DeviceAnalyzer that builds predictive models in near-real time using Apache Spark and Apache Lucene. It discusses:
1) Integrating Spark and Lucene to enable column search capabilities in Spark and add Spark operations to Lucene.
2) Representing Spark DataFrames as Lucene documents to build a distributed Lucene index from DataFrames.
3) Using the index for tasks like searching devices matching a query, generating statistical and predictive models on retrieved devices, and finding dimensions correlated with selected devices.
4) Architectural components like Trapezium for batch, streaming, and API services and a LuceneDAO for indexing DataFrames and querying the index.
Spark Summit EU talk by Debasish Das and Pramod Narasimha
This document describes a system called DeviceAnalyzer that uses Apache Spark and Apache Lucene to build predictive models in near-real time from streaming and batch data. It discusses:
1) Integrating Spark and Lucene to index streaming and batch data for fast search and retrieval, enabling statistical and predictive modeling on the retrieved data.
2) A batch workflow that indexes batch data using Lucene, and a streaming workflow that processes streaming queries and compares or augments results.
3) Statistical and machine learning operators like summation, L1/L2 regularization, and sparse linear algebra for building models on retrieved device profiles.
Dumb and Dumber: how smart is your monitoring data?
Big Data is all the rage right now. Everyone from a social media company to your grandmother's online knitting store is suddenly a big data shop. Application monitoring tools are no exception from this trend – they collect gigabytes of monitoring data from your application every minute. But most of this data is useless. It's dumb data. More data isn't better if the data you're getting from your tools isn't helping you do your job – in fact, it's a real problem.
In this session AppDynamics will cover how to be smarter about collecting monitoring data, and how to ensure that the data we're collecting is intelligent.
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
In this session we will present a configurable FPGA-based Spark SQL acceleration architecture. It aims to leverage the FPGA's highly parallel computing capability to accelerate Spark SQL queries, while the FPGA's higher power efficiency compared to a CPU lowers power consumption at the same time. The architecture consists of SQL query decomposition algorithms and fine-grained FPGA-based Engine Units that perform basic computations such as substring, arithmetic, and logic operations. Using the query decomposition algorithm, we decompose a complex SQL query into basic operations and, according to their patterns, feed each into an Engine Unit. The Engine Units are highly configurable and can be chained together to execute complex Spark SQL queries, so a single SQL query is ultimately transformed into a hardware pipeline. We will present benchmark results comparing queries run on the FPGA-based Spark SQL acceleration architecture (Xeon E5 plus FPGA) with Spark SQL queries on a Xeon E5 alone, showing 10x to 100x improvement, and we will demonstrate one SQL query workload from a real customer.
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
In this talk, we’ll present techniques for visualizing large-scale machine learning systems in Spark. These are techniques employed by Netflix to understand and refine the machine learning models behind its famous recommender systems, which personalize the Netflix experience for 99 million members around the world. Essential to these techniques is Vegas, a new open-source Scala library that aims to be the “missing matplotlib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize machine learning models.
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
This presentation introduces how we designed and implemented a real-time processing platform using the latest Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry. A traditional production line produces a variety of isolated structured, semi-structured, and unstructured data, such as sensor data, machine screen output, log output, and database records. There are two main data scenarios: 1) picture and video data arriving at low frequency but in large volumes; and 2) continuous data arriving at high frequency, where each record is small but the total volume is very large, such as vibration data used to assess equipment quality. These data have the characteristics of streaming data: real-time, volatile, bursty, unordered, and unbounded. Making effective real-time decisions from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to Spark, we were able to build a low-latency, high-throughput, and reliable system covering data acquisition, transmission, analysis, and storage. A real user case proved that the system meets the needs of real-time decision-making: it significantly improves predictive fault repair and production line material tracking, and can reduce the labor force required on the production lines by about half.
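The sketch below shows the general shape of such a job with PySpark Structured Streaming: read sensor readings from Kafka, aggregate per machine over short event-time windows, and write results to a sink. The topic name, schema, broker address, and console sink are assumptions for illustration, not the authors' actual pipeline.

```python
# Hedged sketch of a Structured Streaming sensor-aggregation job (assumed schema/topic).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window, avg
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("sensor-stream").getOrCreate()

schema = StructType([
    StructField("machine_id", StringType()),
    StructField("vibration", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "sensor-readings")             # placeholder topic
       .load())

readings = (raw.selectExpr("CAST(value AS STRING) AS json")
            .select(from_json(col("json"), schema).alias("r"))
            .select("r.*"))

# 30-second average vibration per machine, tolerating 1 minute of late data.
stats = (readings
         .withWatermark("event_time", "1 minute")
         .groupBy(window("event_time", "30 seconds"), "machine_id")
         .agg(avg("vibration").alias("avg_vibration")))

query = (stats.writeStream
         .outputMode("update")
         .format("console")      # real deployments would write to a store/queue
         .start())
query.awaitTermination()
```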
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly impact how drivers plan their day by alerting users before they travel, helping them find the best times to travel, and, over time, learning from new IoT data such as road conditions and incidents. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As part of this work, we conducted a case study over five large metropolitan areas in the US, 2.58 billion traffic records, and 262 million weather records, to quantify the boost in accuracy of traffic prediction using weather data. We will provide an overview of our lambda architecture, with Apache Spark used to build prediction models from weather and traffic data and Spark Streaming used to score the models and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark for analyzing geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms used with the Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and there is ongoing work to scale the system to provide predictions in over 100 cities. The audience will learn about our experience scaling Spark in offline and streaming modes, building statistical and deep-learning pipelines with Spark, and techniques for working with geospatial and time-series data.
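To ground the modeling step, here is an illustrative-only PySpark ML sketch: join traffic and weather observations on a shared key and fit a regression model on the combined features. The column names, file paths, and the choice of RandomForestRegressor are assumptions, not IBM's actual pipeline.

```python
# Illustrative join-and-train sketch for traffic + weather features (assumed columns).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

spark = SparkSession.builder.appName("traffic-weather").getOrCreate()

traffic = spark.read.parquet("traffic.parquet")   # link_id, hour, hour_of_day, speed
weather = spark.read.parquet("weather.parquet")   # link_id, hour, precip, temp

data = traffic.join(weather, on=["link_id", "hour"])

assembler = VectorAssembler(
    inputCols=["hour_of_day", "precip", "temp"], outputCol="features")
rf = RandomForestRegressor(featuresCol="features", labelCol="speed")

train, test = assembler.transform(data).randomSplit([0.8, 0.2], seed=42)
model = rf.fit(train)
model.transform(test).select("speed", "prediction").show(5)
```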
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Graph is on the rise and it’s time to start learning about scalable graph analytics! In this session we will go over two Spark-based graph analytics frameworks: TinkerPop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this deep-dive-by-example presentation, we will demonstrate some common traversals and explain how, at the Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API and the powerful GraphFrame motif API as we show examples of both side by side. No need to be familiar with graphs or Spark for this presentation, as we’ll be explaining everything from the ground up!
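For a flavor of the GraphFrame motif API mentioned above, here is a small Python sketch on a made-up toy graph; it requires the graphframes package on the Spark classpath, and the vertices, edges, and motif are illustrative.

```python
# Toy GraphFrames motif query (assumed toy data; needs the graphframes package).
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("motif-demo").getOrCreate()

vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows"), ("c", "a", "follows")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)

# Find length-2 "follows" chains: x -> y -> z.
chains = (g.find("(x)-[e1]->(y); (y)-[e2]->(z)")
           .filter("e1.relationship = 'follows' AND e2.relationship = 'follows'"))
chains.show()
```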
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Building accurate machine learning models has been an art of data scientists: algorithm selection, hyperparameter tuning, feature selection, and so on. Recently, efforts to break through these “black arts” have begun. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters, and features without any manual work. In this talk, we will share how the automation system is designed to exploit the attractive advantages of Spark. An evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which provides 272 CPU cores, 2 TB of memory, and 17 TB of SSD in a 3U chassis. We will also share open challenges in training such a massive number of models on Spark, particularly from the reliability and stability standpoints. This talk covers the presentation already shown at Spark Summit SF’17 (#SFds5), but from a more technical perspective.
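This is not the NEC/authors' system, but as a hedged sketch of the underlying idea, the stock Spark ML CrossValidator with a parameter grid already automates a small version of the search the talk describes. The dataset path, feature columns, and grid values below are placeholders.

```python
# Hedged sketch: automated model search with Spark ML's CrossValidator (assumed data).
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("auto-modeling").getOrCreate()
df = spark.read.parquet("training.parquet")        # placeholder: f1..f3, label

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.001, 0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3,
                    parallelism=4)

best = cv.fit(df).bestModel   # evaluates all 9 candidate configurations
```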
Apache Spark and Tensorflow as a Service with Jim Dowling
In Sweden, from the Rise ICE Data Center at www.hops.site, we provide researchers with both Spark-as-a-Service and, more recently, Tensorflow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with Tensorflow, from Tensorframes to TensorflowOnSpark to Databricks’ Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of Python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorflowOnSpark application written in Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on Tensorflow. We will show how to debug the application using both the Spark UI and Tensorboard, and how to examine logs and monitor training.
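The sketch below is not the Hops or TensorflowOnSpark API itself; it is a hedged illustration of one simple way to combine the two systems, scoring a pre-trained Keras/TensorFlow model over a Spark DataFrame with mapPartitions so the model is loaded once per partition. The model path, feature columns, and output shape are assumptions.

```python
# Hedged sketch: scoring a TensorFlow/Keras model over Spark partitions (assumed model/data).
import numpy as np
import tensorflow as tf
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tf-on-spark-sketch").getOrCreate()
df = spark.read.parquet("features.parquet")       # placeholder: columns f1..f4

MODEL_PATH = "/models/my_model.keras"             # placeholder, must be visible on workers

def score_partition(rows):
    model = tf.keras.models.load_model(MODEL_PATH)   # one load per partition
    batch = list(rows)
    if not batch:
        return
    x = np.array([[r["f1"], r["f2"], r["f3"], r["f4"]] for r in batch])
    preds = model.predict(x, verbose=0)
    for row, pred in zip(batch, preds):
        yield (float(row["f1"]), float(pred[0]))     # placeholder output columns

scored = df.rdd.mapPartitions(score_partition).toDF(["f1", "prediction"])
scored.show(5)
```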
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, Data Scientists find themselves struggling with issues such as: low level data manipulation, lack of support for image processing, text analytics and deep learning, as well as the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released The Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, Data Scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner and how to efficiently deploy a Spark library across multiple platforms.
Next CERN Accelerator Logging Service with Jakub Wozniak
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service.
The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex.
The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments.
During this talk, Jakub will speak about the NXCALS requirements and the design choices that led to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service, and the Spark-based Extraction API, where simple changes to schema handling greatly improved the overall usability of the system. The system itself is not CERN-specific and may be of interest to other companies or institutes confronted with similar Big Data problems.
Powering a Startup with Apache Spark with Kevin Kim
Between is a mobile app for couples with 20 million downloads globally. Spark is widely used by engineers and data analysts at Between, from daily batch jobs for extracting metrics to analysis and dashboards; thanks to Spark's performance and extensibility, data operations have become extremely efficient. The entire team, including business development, global operations, and designers, consumes the resulting data, so Spark is empowering the whole company toward data-driven operation and thinking. Kevin, co-founder and data team leader at Between, will present how things are going at Between. After this presentation, listeners will know how a small and agile team lives with data (how we build our organization, culture, and technical base).
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
In many cases, Big Data becomes just another buzzword because of the lack of tools that can support both the technological requirements for developing and deploying projects and/or the fluency of communication between the different profiles of people involved in them.
In this talk, we will present Moriarty, a set of tools for fast prototyping of Big Data applications that can be deployed in an Apache Spark environment. These tools support the creation of Big Data workflows using the already existing functional blocks or supporting the creation of new functional blocks. The created workflow can then be deployed in a Spark infrastructure and used through a REST API.
For a better understanding of Moriarty, the prototyping process, and the way it hides the Spark environment from Big Data users and developers, we will present it together with a couple of examples, one based on an Industry 4.0 success case and another on a logistics success case.
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Nielsen used Databricks to test new digital advertising rating methodologies on a large scale. Databricks allowed Nielsen to run analyses on thousands of advertising campaigns using both small panel data and large production data. This identified edge cases and performance gains faster than traditional methods. Using Databricks reduced the time required to test and deploy improved rating methodologies to benefit Nielsen's clients.
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Data lineage tracking is one of the significant problems that financial institutions face when using modern big data tools. This presentation describes Spline – a data lineage tracking and visualization tool for Apache Spark. Spline captures and stores lineage information from internal Spark execution plans and visualizes it in a user-friendly manner.
Goal Based Data Production with Sim Simeonov
Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark’s APIs enables a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Have you imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and simple machine learning solution for creating an intelligent monitoring application based on straightforward data analysis using Apache Spark MLlib. Our application uses linear regression models to make predictions and check whether the platform is experiencing any operational problems that could result in revenue losses. The application monitors distributed systems and sends notifications describing the detected problem, so users can act quickly to avoid serious issues that directly impact the company's revenue, reducing the time to action. We will present an architecture for not only a monitoring system but also an active actor in our outage recoveries. At the end of the presentation you will have access to our training program's source code, which you can adapt and implement in your own company. This solution already helped prevent about US$3 million in losses last year.
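A simplified sketch of the idea described above: fit a linear regression on historical metrics, then flag recent points whose observed value deviates far from the prediction. The column names, file paths, and alert threshold are assumptions, not the speakers' actual system.

```python
# Simplified deviation-monitoring sketch with Spark ML linear regression (assumed data).
from pyspark.sql import SparkSession
from pyspark.sql.functions import abs as sql_abs, col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("revenue-monitor").getOrCreate()

history = spark.read.parquet("metrics_history.parquet")  # hour, requests, revenue
latest = spark.read.parquet("metrics_latest.parquet")

assembler = VectorAssembler(inputCols=["hour", "requests"], outputCol="features")
model = LinearRegression(featuresCol="features", labelCol="revenue") \
    .fit(assembler.transform(history))

# Flag observations that deviate from the predicted revenue by more than a threshold.
scored = model.transform(assembler.transform(latest))
alerts = (scored.withColumn("deviation", sql_abs(col("revenue") - col("prediction")))
                .filter(col("deviation") > 1000.0))       # arbitrary threshold
alerts.show()
```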
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities it provides. We cover the basic data types provided by Redis and its module system. Using an ad-serving use case, we look at how Redis can improve the performance and reduce the cost of using complex ML models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark and then load and serve it using Redis, as well as how to work with the Spark-Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples demonstrating how to use these features. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance.
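As a hedged sketch of the Spark-to-Redis hand-off (not the redis-ml module's own commands), the snippet below trains a model in Spark and pushes its parameters into Redis with redis-py so a low-latency service can score requests without Spark. The dataset, key names, and feature values are placeholders.

```python
# Hedged sketch: train in Spark, serve model parameters from Redis (assumed data/keys).
import json

import redis
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("spark-to-redis").getOrCreate()
df = spark.read.parquet("ad_events.parquet")      # placeholder training data

features = VectorAssembler(inputCols=["bid", "ctr"], outputCol="features")
model = LinearRegression(featuresCol="features", labelCol="revenue") \
    .fit(features.transform(df))

r = redis.Redis(host="localhost", port=6379)
r.set("model:ad-revenue", json.dumps({
    "coefficients": model.coefficients.toArray().tolist(),
    "intercept": model.intercept,
}))

# A serving process can now reconstruct the linear model from Redis:
params = json.loads(r.get("model:ad-revenue"))
score = sum(c * x for c, x in zip(params["coefficients"], [1.2, 0.03])) + params["intercept"]
print(score)
```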
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Here we present a general supervised framework for record deduplication and author disambiguation via Spark. This work differentiates itself in three ways:
- The use of Databricks and AWS makes this a scalable implementation. Compute costs are comparatively lower than with traditional legacy technology running on big boxes 24/7. Scalability is crucial as Elsevier's Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years.
- We create a fingerprint for each piece of content with deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while preserving semantic similarity (unlike traditional TF-IDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all of our entities, such as documents, authors, users, and journals. This standard representation reduces the recommendation problem to a pairwise similarity search, and hence can serve as a basic recommender for cross-product applications where we may not have a dedicated recommender engine.
- Traditional author-disambiguation or record-deduplication algorithms are batch processes with little to no training data. However, we have roughly 25 million authorships that are manually curated or corrected based on user feedback. It is therefore crucial to maintain historical profiles, so we have developed a machine learning implementation that handles data streams and processes them in mini-batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to turn the raw output of the pairwise similarity function into final clusters.
Lessons learned from this talk can help any company that wants to integrate its data or deduplicate its user/customer/product databases.
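The toy sketch below illustrates only the fingerprinting idea from the list above: embed record text with Spark ML's Word2Vec and compare candidate pairs by cosine similarity. The data, thresholds, and all-pairs comparison are illustrative simplifications; the production system described is a supervised, streaming pipeline with blocking.

```python
# Toy sketch: word2vec fingerprints + cosine similarity for near-duplicate detection.
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, Word2Vec

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

docs = spark.createDataFrame([
    (1, "Deep learning for author disambiguation in Scopus"),
    (2, "Author disambiguation in Scopus using deep learning"),
    (3, "A study of coral reef bleaching"),
], ["id", "title"])

tokens = Tokenizer(inputCol="title", outputCol="words").transform(docs)
w2v = Word2Vec(vectorSize=16, minCount=1, inputCol="words", outputCol="vec")
embedded = w2v.fit(tokens).transform(tokens).select("id", "vec").collect()

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Compare all pairs (fine for a toy example; real systems use blocking/LSH).
for i in range(len(embedded)):
    for j in range(i + 1, len(embedded)):
        sim = cosine(embedded[i]["vec"].toArray(), embedded[j]["vec"].toArray())
        print(embedded[i]["id"], embedded[j]["id"], round(sim, 3))
```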
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains, ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. Matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced, as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics. The result of a matrix operation often serves as input to another matrix operation, thus defining the data dependencies within a matrix program. The matrix query plan generator produces execution plans that minimize memory usage and communication overhead by partitioning the matrices based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside Spark SQL and optimize the matrix execution plan based on the Spark SQL Catalyst optimizer. We conduct case studies on a series of ML models and matrix computations with special features on different datasets: PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance comparison. Our experiments are performed on six real-world datasets: social network data (e.g., soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and a 1000 Genomes Project sample. Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement.
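For readers unfamiliar with distributed matrix computation on Spark, here is a minimal sketch using Spark's built-in BlockMatrix to multiply two sparse matrices; MatFast's optimizer and Catalyst integration are not part of stock Spark and are not shown here, and the tiny matrices are placeholders.

```python
# Minimal distributed matrix multiply with Spark's built-in BlockMatrix (toy data).
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

spark = SparkSession.builder.appName("matrix-sketch").getOrCreate()
sc = spark.sparkContext

# Sparse matrices expressed as (row, col, value) entries.
a_entries = sc.parallelize([MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 2.0)])
b_entries = sc.parallelize([MatrixEntry(0, 1, 3.0), MatrixEntry(1, 0, 4.0)])

A = CoordinateMatrix(a_entries, numRows=2, numCols=2).toBlockMatrix()
B = CoordinateMatrix(b_entries, numRows=2, numCols=2).toBlockMatrix()

C = A.multiply(B)                      # distributed matrix-matrix multiply
print(C.toLocalMatrix())               # small enough to collect for inspection
```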
Introducing Amazon Aurora Limitless Database, which lets you scale an Amazon Aurora cluster to millions of write transactions per second and manage petabytes of data, scaling relational database workloads on Aurora beyond the limits of a single Aurora writer instance without having to build custom application logic or manage multiple databases.
OSMC 2013 | Making monitoring simple? by Michael Medin
This talk introduces the new, simplified monitoring agent NSClient++. With the upcoming 0.4.2 release of NSClient++ there will finally be a new check subsystem, which not only runs on current Windows systems but is also very fast. The talk presents the new simplified syntax as well as real-time monitoring across all NSClient++ commands. In addition, there will be updates on the Linux front and on agentless monitoring.
When distributed systems fail, they usually do so in spectacular ways that often have disastrous effects on your systems and users. This baptism by fire is commonly how we learn how big data systems really work. This presentation looks at real-world examples of failures using Java big data technologies such as Hadoop, Spark, Cassandra, and Kafka.
The document summarizes a presentation given by David Golden on using MongoDB and Perl together. It discusses how MongoDB and Perl have similar data modeling approaches using dynamic data structures. It also notes some challenges of each, such as missing features in MongoDB and quirks in Perl, but says that both have enthusiastic communities. The document advocates for trying out MongoDB and getting involved with its community.
Drilling Cyber Security Data With Apache DrillCharles Givre
This deck walks you through using Apache Drill and Apache Superset (Incubating) to explore cyber security datasets including PCAP, HTTPD log files, Syslog and more.
This document provides an overview and demonstration of using Docker for a sample web application. It begins with an introduction to Docker and its components like containers. It then demonstrates building a Python/Django application within a Docker container and connecting it to a MySQL database in another linked container. Performance is compared across different configurations, including changing the database to PostgreSQL, adding Nginx and Gunicorn, and integrating Memcached caching. The document concludes by showing how to use load testing tools with the Dockerized application setup.
Informix SQL & NoSQL: Putting it all togetherKeshav Murthy
IBM Informix is a database management system that provides capabilities for handling different types of data including relational tables, JSON collections, and time series data. It uses a hybrid approach that allows seamless access to different data types using SQL and NoSQL APIs. The document discusses how Informix can be used to store and analyze IoT, mobile, and sensor data from devices and gateways in both on-premises and cloud environments. It also highlights the Informix Warehouse Accelerator for in-memory analytics and how Informix can be integrated with other IBM products and services like MongoDB, Bluemix, and Cognos.
Slides of my talk I gave @ PyRE.it in ReggioEmilia about developing a Rest Api in Python using a little bit of Flask and SqlAlchemy.
www.pyre.it
www.alessandrocucci.it/pyre/restapi
NSClient++: Monitoring Simplified at OSMC 2013Michael Medin
This document discusses NSClient++, a simple but powerful system monitoring agent. It provides examples of using filters to customize monitoring checks, including filtering by level, source, size, load, and other attributes. The document also outlines NSClient++ version history and support/funding options for the open source project.
1. The document discusses Docker containers, Docker machines, and Docker Compose as tools for building Python development environments and deploying backend services.
2. It provides examples of using Docker to run sample Python/Django applications with MySQL and PostgreSQL databases in containers, and load testing the applications.
3. The examples demonstrate performance testing Python REST APIs with different database backends and caching configurations using Docker containers.
1. The document discusses Docker containers, Docker machines, and Docker Compose as tools for building Python development environments and deploying backend services.
2. It provides examples of using Docker to run sample Python/Django applications with MySQL and PostgreSQL databases in containers, and load testing the applications.
3. The examples demonstrate performance testing Python REST APIs with different database backends and caching configurations using Docker containers.
This document discusses using SPARQL and RDF data for data science and analytics. It provides examples of using SPARQL to perform business intelligence queries on RDF data, calculate graph measures like shortest paths, and implement clustering algorithms. Large amounts of RDF data are available for analysis from sources like Freebase, the Linked Open Data Cloud, and schemas like schema.org. SPARQL is presented as a standard query language that can be used to enable data science and analytics over RDF graphs at web-scale.
How We Learned To Love The Data Center Operating Systemsaulius_vl
This document discusses how Adform, an online advertising company, adopted containers and the container orchestration platform DC/OS to manage their data science workloads. It describes the challenges they faced with inconsistent infrastructure and environments. Containers provided isolation, consistent deployment, and resource management. Key DC/OS components like Marathon and Mesos helped with scheduling, deployment, and cluster management. Overall containers created a unified way for data scientists to develop models and analyze data at scale.
Context-Aware Access Control for RDF Graph StoresSerena Villata
This document describes SHI3LD, a context-aware access control system for RDF graph stores. SHI3LD uses semantic web technologies and vocabularies to define access policies and user contexts. It evaluates policies against user contexts to determine which named graphs the user can access. This allows fine-grained, context-sensitive access control over RDF data. The system was evaluated using a SPARQL benchmark dataset, and response times increased only slightly as more user contexts and consumers were added. Future work may focus on improving context data trustworthiness and performing user-centered evaluations.
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaSteve Watt
This document discusses bridging unstructured and structured data with Hadoop and Vertica. It describes using Hadoop to extract and structure unstructured investment data from the web. Then it uses Pig to add zip code data and store the results in Vertica. Finally, it explains how Vertica can be used for reporting and data visualization of the structured data for analysis.
This document provides an overview of using the SILK tool suite to analyze netflow data. It discusses:
- The basics of netflow and some use cases for analyzing netflow data
- The key components of the SILK architecture and how to set up SILK
- Using the SILK command line interface and PySILK API to perform basic analysis workflows like filtering, grouping, aggregating, and enriching netflow data
- Examples of investigating security incidents and characterizing network activity using SILK
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit
This document describes a system called DeviceAnalyzer that builds predictive models in near-real time using Apache Spark and Apache Lucene. It discusses:
1) Integrating Spark and Lucene to enable column search capabilities in Spark and add Spark operations to Lucene.
2) Representing Spark DataFrames as Lucene documents to build a distributed Lucene index from DataFrames.
3) Using the index for tasks like searching devices matching a query, generating statistical and predictive models on retrieved devices, and finding dimensions correlated with selected devices.
4) Architectural components like Trapezium for batch, streaming, and API services and a LuceneDAO for indexing DataFrames and querying the index.
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit
This document describes a system called DeviceAnalyzer that uses Apache Spark and Apache Lucene to build predictive models in near-real time from streaming and batch data. It discusses:
1) Integrating Spark and Lucene to index streaming and batch data for fast search and retrieval, enabling statistical and predictive modeling on the retrieved data.
2) A batch workflow that indexes batch data using Lucene, and a streaming workflow that processes streaming queries and compares or augments results.
3) Statistical and machine learning operators like summation, L1/L2 regularization, and sparse linear algebra for building models on retrieved device profiles.
Dumb and Dumber: how smart is your monitoring data?tlevey
Big Data is all the rage right now. Everyone from a social media company to your grandmother's online knitting store is suddenly a big data shop. Application monitoring tools are no exception from this trend – they collect gigabytes of monitoring data from your application every minute. But most of this data is useless. It's dumb data. More data isn't better if the data you're getting from your tools isn't helping you do your job – in fact, it's a real problem.
In this session AppDynamics will cover how to be smarter about collecting monitoring data, and how to ensure that the data we're collecting is intelligent.
Similar to Analyzing Log Data With Apache Spark (20)
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
In this session we will present a Configurable FPGA-Based Spark SQL Acceleration Architecture. It is target to leverage FPGA highly parallel computing capability to accelerate Spark SQL Query and for FPGA’s higher power efficiency than CPU we can lower the power consumption at the same time. The Architecture consists of SQL query decomposition algorithms, fine-grained FPGA based Engine Units which perform basic computation of sub string, arithmetic and logic operations. Using SQL query decomposition algorithm, we are able to decompose a complex SQL query into basic operations and according to their patterns each is fed into an Engine Unit. SQL Engine Units are highly configurable and can be chained together to perform complex Spark SQL queries, finally one SQL query is transformed into a Hardware Pipeline. We will present the performance benchmark results comparing the queries with FGPA-Based Spark SQL Acceleration Architecture on XEON E5 and FPGA to the ones with Spark SQL Query on XEON E5 with 10X ~ 100X improvement and we will demonstrate one SQL query workload from a real customer.
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
In this talk, we’ll present techniques for visualizing large scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems that are used to personalize the Netflix experience for their 99 millions members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing MatPlotLib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize Machine Learning Models.
This presentation introduces how we design and implement a real-time processing platform using latest Spark Structured Streaming framework to intelligently transform the production lines in the manufacturing industry. In the traditional production line there are a variety of isolated structured, semi-structured and unstructured data, such as sensor data, machine screen output, log output, database records etc. There are two main data scenarios: 1) Picture and video data with low frequency but a large amount; 2) Continuous data with high frequency. They are not a large amount of data per unit. However the total amount of them is very large, such as vibration data used to detect the quality of the equipment. These data have the characteristics of streaming data: real-time, volatile, burst, disorder and infinity. Making effective real-time decisions to retrieve values from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to the Spark we are able to build a low-latency, high-throughput and reliable operation system involving data acquisition, transmission, analysis and storage. The actual user case proved that the system meets the needs of real-time decision-making. The system greatly enhance the production process of predictive fault repair and production line material tracking efficiency, and can reduce about half of the labor force for the production lines.
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly impact how driver’s plan their day by alerting users before they travel, find the best times to travel, and over time, learn from new IoT data such as road conditions, incidents, etc. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As a part of this work, we conducted a case study over five large metropolitans in the US, 2.58 billion traffic records and 262 million weather records, to quantify the boost in accuracy of traffic prediction using weather data. We will provide an overview of our lambda architecture with Apache Spark being used to build prediction models with weather and traffic data, and Spark Streaming used to score the model and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark to analyze geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms that were used with Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and there is work to scale the system to provide predictions in over a 100 cities. Audience will learn about our experience scaling using Spark in offline and streaming mode, building statistical and deep-learning pipelines with Spark, and techniques to work with geospatial and time-series data.
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
Graph is on the rise and it’s time to start learning about scalable graph analytics! In this session we will go over two Spark-based Graph Analytics frameworks: Tinkerpop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this Deep-Dive by example presentation, we will demonstrate some common traversals and explain how, at a Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API as well as the powerful GraphFrame Motif api as we show examples of both simultaneously. No need to be familiar with Graphs or Spark for this presentation as we’ll be explaining everything from the ground up!
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
Building accurate machine learning models has been an art of data scientists, i.e., algorithm selection, hyper parameter tuning, feature selection and so on. Recently, challenges to breakthrough this “black-arts” have got started. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches the best algorithm, parameters and features without any manual work. In this talk, we will share how the automation system is designed to exploit attractive advantages of Spark. The evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discovers the most accurate ones in minutes on a Ultra High Density Server, which employs 272 CPU cores, 2TB memory and 17TB SSD in 3U chassis. We will also share open challenges to learn such a massive amount of models on Spark, particularly from reliability and stability standpoints. This talk will cover the presentation already shown on Spark Summit SF’17 (#SFds5) but from more technical perspective.
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
In Sweden, from the Rise ICE Data Center at www.hops.site, we are providing to reseachers both Spark-as-a-Service and, more recently, Tensorflow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with Tensorflow, from Tensorframes to TensorflowOnSpark to Databrick’s Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorflowOnSpark application written on Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on Tensorflow. We will show how to debug the application using both Spark UI and Tensorboard, and how to examine logs and monitor training.
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
In Sweden, from the Rise ICE Data Center at www.hops.site, we are providing to reseachers both Spark-as-a-Service and, more recently, Tensorflow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with Tensorflow, from Tensorframes to TensorflowOnSpark to Databrick’s Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorflowOnSpark application written on Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on Tensorflow. We will show how to debug the application using both Spark UI and Tensorboard, and how to examine logs and monitor training.
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, Data Scientists find themselves struggling with issues such as: low level data manipulation, lack of support for image processing, text analytics and deep learning, as well as the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released The Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, Data Scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner and how to efficiently deploy a Spark library across multiple platforms.
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service.
The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex.
The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments.
During this talk, Jakub will speak about NXCALS requirements and design choices that lead to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service and the Spark-based Extraction API where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN specific and can be of interest to other companies or institutes confronted with similar Big Data problems.
Powering a Startup with Apache Spark with Kevin KimSpark Summit
In Between (A mobile App for couples, downloaded 20M in Global), from daily batch for extracting metrics, analysis and dashboard. Spark is widely used by engineers and data analysts in Between, thanks to the performance and expendability of Spark, data operating has become extremely efficient. Entire team including Biz Dev, Global Operation, Designers are enjoying data results so Spark is empowering entire company for data driven operation and thinking. Kevin, Co-founder and Data Team leader of Between will be presenting how things are going in Between. Listeners will know how small and agile team is living with data (how we build organization, culture and technical base) after this presentation.
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly impact how driver’s plan their day by alerting users before they travel, find the best times to travel, and over time, learn from new IoT data such as road conditions, incidents, etc. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As a part of this work, we conducted a case study over five large metropolitans in the US, 2.58 billion traffic records and 262 million weather records, to quantify the boost in accuracy of traffic prediction using weather data. We will provide an overview of our lambda architecture with Apache Spark being used to build prediction models with weather and traffic data, and Spark Streaming used to score the model and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark to analyze geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms that were used with Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and there is work to scale the system to provide predictions in over a 100 cities. Audience will learn about our experience scaling using Spark in offline and streaming mode, building statistical and deep-learning pipelines with Spark, and techniques to work with geospatial and time-series data.
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
In many cases, Big Data becomes just another buzzword because of the lack of tools that can support both the technological requirements for developing and deploying of the projects and/or the fluency of communication between the different profiles of people involved in the projects.
In this talk, we will present Moriarty, a set of tools for fast prototyping of Big Data applications that can be deployed in an Apache Spark environment. These tools support the creation of Big Data workflows using the already existing functional blocks or supporting the creation of new functional blocks. The created workflow can then be deployed in a Spark infrastructure and used through a REST API.
For a better understanding of Moriarty, the prototyping process, and the way it hides the Spark environment from Big Data users and developers, we will present it together with a couple of examples: one based on an Industry 4.0 success case and another on a logistics success case.
How Nielsen Utilized Databricks for Large-Scale Research and Development with... (Spark Summit)
Nielsen used Databricks to test new digital advertising rating methodologies on a large scale. Databricks allowed Nielsen to run analyses on thousands of advertising campaigns using both small panel data and large production data. This identified edge cases and performance gains faster than traditional methods. Using Databricks reduced the time required to test and deploy improved rating methodologies to benefit Nielsen's clients.
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov... (Spark Summit)
Data lineage tracking is one of the significant problems that financial institutions face when using modern big data tools. This presentation describes Spline – a data lineage tracking and visualization tool for Apache Spark. Spline captures and stores lineage information from internal Spark execution plans and visualizes it in a user-friendly manner.
Goal Based Data Production with Sim Simeonov (Spark Summit)
Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark’s APIs enables a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le... (Spark Summit)
Have you imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and simple machine learning solution for creating an intelligent monitoring application based on simple data analysis using Apache Spark MLlib. Our application uses linear regression models to make predictions and check whether the platform is experiencing any operational problems that could result in revenue losses. The application monitors distributed systems and provides notifications describing the problem detected, so users can act quickly to avoid serious problems that directly impact the company's revenue, reducing the time to action. We will present an architecture for not only a monitoring system, but also an active actor in our outage recoveries. At the end of the presentation you will have access to our training program source code, which you will be able to adapt and implement in your company. This solution already helped prevent about US$3 million in losses last year.
Getting Ready to Use Redis with Apache Spark with Dvir Volk (Spark Summit)
Getting Ready to Use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities it provides, covering the basic data types and the module system. Using an ad-serving use case, we look at how Redis can improve the performance and reduce the cost of using complex ML models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark and then load and serve it using Redis, as well as how to work with the Spark-Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples that demonstrate how to use these features. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance.
Deduplication and Author-Disambiguation of Streaming Records via Supervised M... (Spark Summit)
Here we present a general supervised framework for record deduplication and author disambiguation via Spark. This work differentiates itself in several ways. (1) The use of Databricks and AWS makes this a scalable implementation, with compute resources considerably lower than traditional legacy technology running big boxes 24/7. Scalability is crucial, as Elsevier's Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years. (2) We create a fingerprint for each piece of content using deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TF-IDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search, and hence offer a basic recommender for cross-product applications where we may not have a dedicated recommender engine. (3) Traditional author-disambiguation or record-deduplication algorithms are batch processes with little to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. It is therefore crucial to maintain historical profiles, so we have developed a machine learning implementation that deals with data streams and processes them in mini-batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to turn the raw pairwise-similarity output into final clusters. Lessons learned from this talk can help all sorts of companies that want to integrate their data or deduplicate their user/customer/product databases.
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization... (Spark Summit)
The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains, ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced, as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics. The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside Spark SQL and optimize the matrix execution plan based on the Spark SQL Catalyst optimizer. We conduct case studies on a series of ML models and matrix computations with special features on different datasets: PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance evaluation. Our experiments are performed on six real-world datasets: social network data (e.g., soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and a 1000 Genomes Project sample. Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement.
Introducing Amazon Aurora Limitless Database, which lets you scale an Amazon Aurora cluster to millions of write transactions per second, manage petabytes of data, and scale relational database workloads in Aurora beyond the limits of a single Aurora writer instance without creating custom application logic or managing multiple databases.
4. Challenges of log data
SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg)
FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%'
GROUP BY hostname, hour
5. Challenges of log data
[chart: query results bucketed hourly, 11:00–18:00]
SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg)
FROM LOGS WHERE level='CRIT' AND msg LIKE '%failure%'
GROUP BY hostname, hour
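For comparison, here is a minimal sketch of the same aggregation expressed with the Spark DataFrame API, assuming the log records have already been parsed into hostname, timestamp, level and msg columns (logs here is a hypothetical DataFrame; as the following slides argue, getting to that structured form is the hard part):

import org.apache.spark.sql.functions.{col, count, hour}

logs
  .where(col("level") === "CRIT" && col("msg").contains("failure"))
  .groupBy(col("hostname"), hour(col("timestamp")).as("hour"))
  .agg(count("msg").as("failures"))
  .show()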
6. Challenges of log data
[diagram: ca. 2000 — a few log sources per machine (postgres, httpd, syslog), each emitting a sparse stream of INFO/WARN/DEBUG/CRIT messages and HTTP GET/POST requests, including the occasional 404]
7. Challenges of log data
[diagram: ca. 2016 — many more log sources per machine (postgres, httpd, syslog, CouchDB, Django, haproxy, k8s, Cassandra, nginx, Rails, redis), each emitting interleaved streams of INFO/WARN/DEBUG/CRIT messages and HTTP requests, including 404s and 500s]
8. Challenges of log data
[the ca. 2016 diagram repeated many times over: at datacenter scale, dozens of hosts each run this mix of services (postgres, httpd, syslog, CouchDB, Django, haproxy, k8s, Cassandra, nginx, Rails, redis), all emitting interleaved log streams]
How many services are generating logs in your datacenter today?
10. Collecting log data
collecting: ingesting live log data via rsyslog, logstash, or fluentd
normalizing: reconciling log record metadata across sources
warehousing: storing normalized records in ES indices
analysis: caching warehoused data as Parquet files on a Gluster volume local to the Spark cluster (sketched below)
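A rough sketch of that warehousing-to-analysis hand-off, assuming the elasticsearch-hadoop (elasticsearch-spark) connector is on the classpath and a Spark 2.x SparkSession named spark; the index name, node address and Gluster path are made-up examples:

// Read normalized log records from an Elasticsearch index via the
// elasticsearch-hadoop data source, then cache them as Parquet on a
// Gluster volume mounted locally on the Spark cluster.
val esLogs = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-host:9200")
  .load("logs-2016.06/records")

esLogs.write.mode("overwrite").parquet("/mnt/gluster/logs/2016-06")

// Later analyses read the local Parquet copy instead of hitting ES.
val logs = spark.read.parquet("/mnt/gluster/logs/2016-06")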
17. Exploring structured data
logs
  .select("level").distinct
  .map { case Row(s: String) => s }
  .collect
logs
.groupBy($"level", $"rsyslog.app_name")
.agg(count("level").as("total"))
.orderBy($"total".desc)
.show
info kubelet 17933574
info kube-proxy 10961117
err journal 6867921
info systemd 5184475
…
debug, notice, emerg,
err, warning, crit, info,
severe, alert
18. Exploring structured data
logs
.groupBy($"level", $"rsyslog.app_name")
.agg(count("level").as("total"))
.orderBy($"total".desc)
.show
info kubelet 17933574
info kube-proxy 10961117
err journal 6867921
info systemd 5184475
…
logs
.select("level").distinct
.as[String].collect
debug, notice, emerg,
err, warning, crit, info,
severe, alert
19. Exploring structured data
logs
.groupBy($"level", $"rsyslog.app_name")
.agg(count("level").as("total"))
.orderBy($"total".desc)
.show
info kubelet 17933574
info kube-proxy 10961117
err journal 6867921
info systemd 5184475
…
logs
.select("level").distinct
.as[String].collect
debug, notice, emerg,
err, warning, crit, info,
severe, alert
This class must be declared outside the REPL!
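Pulling the snippets above together, here is a self-contained sketch of the exploration step, assuming the Parquet cache produced by the collection pipeline, a top-level level column and a nested rsyslog.app_name field (the path is a made-up example):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder.appName("log-exploration").getOrCreate()
import spark.implicits._

// hypothetical path to the Parquet cache built earlier
val logs = spark.read.parquet("/mnt/gluster/logs/2016-06")

// distinct severity levels, collected as plain strings
val levels = logs.select("level").distinct.as[String].collect

// message volume per (level, application), chattiest first
logs
  .groupBy($"level", $"rsyslog.app_name")
  .agg(count("level").as("total"))
  .orderBy($"total".desc)
  .show()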
21. From log records to vectors
What does it mean for two sets of categorical features to be similar?
red
green
blue
orange
-> 000
-> 010
-> 100
-> 001
pancakes
waffles
aebliskiver
omelets
bacon
hash browns
-> 10000
-> 01000
-> 00100
-> 00001
-> 00000
-> 00010
22. From log records to vectors
What does it mean for two sets of categorical features to be similar?
red
green
blue
orange
-> 000
-> 010
-> 100
-> 001
pancakes
waffles
aebliskiver
omelets
bacon
hash browns
-> 10000
-> 01000
-> 00100
-> 00001
-> 00000
-> 00010
red pancakes
orange waffles
-> 00010000
-> 00101000
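A tiny plain-Scala sketch of the encoding above, using dummy coding as on the slides (one category per factor becomes the all-zeros reference level); the category lists and their orderings are illustrative assumptions chosen to reproduce the codes shown:

// Dummy-code a categorical value: n-1 bits for n categories, with the
// first category in the list mapping to all zeros (the reference level).
def dummyCode(value: String, categories: Seq[String]): Array[Double] = {
  val idx = categories.indexOf(value)
  require(idx >= 0, s"unknown category: $value")
  Array.tabulate(categories.size - 1)(i => if (i == idx - 1) 1.0 else 0.0)
}

val colors = Seq("red", "blue", "green", "orange")
val foods  = Seq("bacon", "pancakes", "waffles", "aebliskiver", "hash browns", "omelets")

// A record with several categorical features becomes the concatenation of
// the per-feature codes, e.g. "red pancakes" -> 000 ++ 10000 = 00010000.
val redPancakes   = dummyCode("red", colors)    ++ dummyCode("pancakes", foods)
val orangeWaffles = dummyCode("orange", colors) ++ dummyCode("waffles", foods)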
29. Other interesting features
[diagram: per-host timelines of log levels (INFO, WARN, DEBUG) for host01, host02, host03]
30. Other interesting features
[diagram, continued: the same per-host log-level timelines for host01–host03]
31. Other interesting features
[diagram, continued: per-host log-level timelines for host01–host03]
32. Other interesting features
: Great food, great service, a must-visit!
: Our whole table got gastroenteritis.
: This place is so wonderful that it has ruined all other tacos for me and my family.
34. Other interesting features
INFO: Everything is great! Just checking in to let you know I’m OK.
CRIT: No requests in last hour; suspending running app containers.
35. Other interesting features
INFO: Everything is great! Just checking in to let you know I’m OK.
CRIT: No requests in last hour; suspending running app containers.
INFO: Phoenix datacenter is on fire; may not rise from ashes.
36. Other interesting features
INFO: Everything is great! Just checking in to let you know I’m OK.
CRIT: No requests in last hour; suspending running app containers.
INFO: Phoenix datacenter is on fire; may not rise from ashes.
See https://links.freevariable.com/nlp-logs/ for more!
50. Tree-based approaches
[diagram: decision trees splitting on feature values (if orange / if !orange, if red / if !red, if !gray), each path ending in a yes or no leaf]
51. Tree-based approaches
[diagram: the same decision trees splitting on feature values (if orange / if !orange, if red / if !red, if !gray) with yes/no leaves]
63. Outliers in log data
[diagram: best-match similarity scores for a sample of records (0.95, 0.97, 0.92, 0.94, 0.89, 0.91, 0.93, 0.96), with one clear outlier at 0.37]
An outlier is any record whose best match was at least 4σ below the mean.
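A minimal sketch of that rule in plain Scala. bestMatchScores is a hypothetical collection holding one best-match score per record; in the real pipeline these would come from scoring each record against the trained map:

// Flag any score at least 4 standard deviations below the mean.
def outliers(scores: Seq[Double]): Seq[Double] = {
  val mean = scores.sum / scores.size
  val sd = math.sqrt(scores.map(s => (s - mean) * (s - mean)).sum / scores.size)
  scores.filter(_ < mean - 4 * sd)
}

val flagged = outliers(bestMatchScores)   // bestMatchScores: one score per log record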
65. Out of 310 million log records, we identified 0.0012% as outliers.
68. Thirty most extreme outliers
10 Can not communicate with power supply 2.
9 Power supply 2 failed.
8 Power supply redundancy is lost.
1 Drive A is removed.
1 Can not communicate with power supply 1.
1 Power supply 1 failed.
74. On-line SOM training
while t < iterations:
    for ex in examples:
        t = t + 1
        if t == iterations:
            break
        bestMatch = closest(som_t, ex)
        for (unit, wt) in neighborhood(bestMatch, sigma(t)):
            som_{t+1}[unit] = som_t[unit] + ex * alpha(t) * wt
75. On-line SOM training
while t < iterations:
    for ex in examples:
        t = t + 1
        if t == iterations:
            break
        bestMatch = closest(som_t, ex)
        for (unit, wt) in neighborhood(bestMatch, sigma(t)):
            som_{t+1}[unit] = som_t[unit] + ex * alpha(t) * wt
at each step, we update each unit by
adding its value from the previous step…
76. On-line SOM training
while t < iterations:
    for ex in examples:
        t = t + 1
        if t == iterations:
            break
        bestMatch = closest(som_t, ex)
        for (unit, wt) in neighborhood(bestMatch, sigma(t)):
            som_{t+1}[unit] = som_t[unit] + ex * alpha(t) * wt
to the example that we considered…
77. On-line SOM training
while t < iterations:
    for ex in examples:
        t = t + 1
        if t == iterations:
            break
        bestMatch = closest(som_t, ex)
        for (unit, wt) in neighborhood(bestMatch, sigma(t)):
            som_{t+1}[unit] = som_t[unit] + ex * alpha(t) * wt
scaled by a learning factor and the
distance from this unit to its best match
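As a concrete counterpart to the pseudocode, here is a small Scala sketch of a single on-line step using the standard Kohonen update (each neighborhood unit is moved toward the example). The neighborhood function and the alpha schedule are passed in, since the actual grid layout and decay schedules are not spelled out on the slides and are assumptions here:

// som is an array of unit weight vectors, mutated in place.
def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

def onlineStep(som: Array[Array[Double]],
               ex: Array[Double],
               alpha: Double,                              // alpha(t) from the pseudocode
               neighborhood: Int => Seq[(Int, Double)]): Unit = {  // (unit, wt) pairs
  val best = som.indices.minBy(i => euclidean(som(i), ex))  // closest(som_t, ex)
  for ((unit, wt) <- neighborhood(best)) {
    val w = som(unit)
    var j = 0
    while (j < w.length) {
      w(j) += alpha * wt * (ex(j) - w(j))                  // move the unit toward ex
      j += 1
    }
  }
}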
80. Batch SOM training
for t in (1 to iterations):
    state = newState()
    for ex in examples:
        bestMatch = closest(som_{t-1}, ex)
        hood = neighborhood(bestMatch, sigma(t))
        state.matches += ex * hood
        state.hoods += hood
    som_t = newSOM(state.matches / state.hoods)
81. Batch SOM training
for t in (1 to iterations):
    state = newState()
    for ex in examples:
        bestMatch = closest(som_{t-1}, ex)
        hood = neighborhood(bestMatch, sigma(t))
        state.matches += ex * hood
        state.hoods += hood
    som_t = newSOM(state.matches / state.hoods)
update the state of every cell in the neighborhood
of the best matching unit, weighting by distance
82. Batch SOM training
for t in (1 to iterations):
    state = newState()
    for ex in examples:
        bestMatch = closest(som_{t-1}, ex)
        hood = neighborhood(bestMatch, sigma(t))
        state.matches += ex * hood
        state.hoods += hood
    som_t = newSOM(state.matches / state.hoods)
keep track of the distance weights
we’ve seen for a weighted average
83. Batch SOM training
for t in (1 to iterations):
    state = newState()
    for ex in examples:
        bestMatch = closest(som_{t-1}, ex)
        hood = neighborhood(bestMatch, sigma(t))
        state.matches += ex * hood
        state.hoods += hood
    som_t = newSOM(state.matches / state.hoods)
since we can easily merge multiple states, we
can train in parallel across many examples
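The mergeable state is what makes a Spark version natural: each partition accumulates its own (matches, hoods) sums, and the partial states are combined with treeAggregate. Below is a rough sketch of one batch iteration; the closest and neighborhood arguments stand in for the helpers named in the pseudocode (closest(som_{t-1}, ·) and neighborhood(·, sigma(t))) and their exact form is an assumption:

import org.apache.spark.rdd.RDD

// Per-partition accumulator: weighted sums of examples and of neighborhood
// weights, one slot per SOM unit.
case class SOMState(matches: Array[Array[Double]], hoods: Array[Double])

def batchStep(examples: RDD[Array[Double]],
              units: Int, dim: Int,
              closest: Array[Double] => Int,
              neighborhood: Int => Seq[(Int, Double)]): Array[Array[Double]] = {
  val zero = SOMState(Array.fill(units)(Array.fill(dim)(0.0)), Array.fill(units)(0.0))
  val state = examples.treeAggregate(zero)(
    (st, ex) => {                       // fold one example into the partition state
      for ((unit, wt) <- neighborhood(closest(ex))) {
        st.hoods(unit) += wt
        var j = 0
        while (j < dim) { st.matches(unit)(j) += ex(j) * wt; j += 1 }
      }
      st
    },
    (a, b) => {                         // merge two partition states
      var u = 0
      while (u < units) {
        a.hoods(u) += b.hoods(u)
        var j = 0
        while (j < dim) { a.matches(u)(j) += b.matches(u)(j); j += 1 }
        u += 1
      }
      a
    }
  )
  // som_t = newSOM(state.matches / state.hoods)
  state.matches.zip(state.hoods).map { case (m, h) => if (h > 0) m.map(_ / h) else m }
}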
101. Sharing models
class Model(private var entries: breeze.linalg.DenseVector[Double],
/* ... lots of (possibly) mutable state ... */ )
extends java.io.Serializable {
// lots of implementation details here
}
102. Sharing models
class Model(private var entries: breeze.linalg.DenseVector[Double],
/* ... lots of (possibly) mutable state ... */ )
extends java.io.Serializable {
// lots of implementation details here
}
case class FrozenModel(entries: Array[Double], /* ... */ ) { }
103. Sharing models
case class FrozenModel(entries: Array[Double], /* ... */ ) { }
class Model(private var entries: breeze.linalg.DenseVector[Double],
/* ... lots of (possibly) mutable state ... */ )
extends java.io.Serializable {
// lots of implementation details here
def freeze: FrozenModel = // ...
}
object Model {
def thaw(im: FrozenModel): Model = // ...
}
107. Spark and ElasticSearch
Data locality is an issue and caching is even more
important than when running from local storage.
If your data are write-once, consider exporting ES
indices to Parquet files and analyzing those instead.
108. Structured queries in Spark
Always program defensively: mediate schemas,
explicitly convert null values, etc.
Use the Dataset API whenever possible to minimize
boilerplate and benefit from query planning without
(entirely) forsaking type safety.
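For instance, a small sketch tying these points together (field names are made up; it assumes the logs DataFrame from earlier and an in-scope import spark.implicits._):

// Declared outside the REPL so Spark can generate an encoder and ship the
// class to executors (see the note on slide 19).
case class LogEntry(level: String, app_name: String, msg: String)

val entries = logs
  .select($"level", $"rsyslog.app_name".as("app_name"), $"msg")
  .na.fill(Map("level" -> "unknown", "msg" -> ""))   // convert nulls explicitly
  .as[LogEntry]                                       // typed Dataset from here on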
109. Memory and partitioning
Large JVM heaps can lead to appalling GC pauses and
executor timeouts.
Use multiple JVMs or off-heap storage (in Spark 2.0!)
Tree aggregation can save you both memory and
execution time by partially aggregating at worker nodes.
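A small sketch of the tree-aggregation point: treeAggregate performs partial combines on the workers instead of funneling every partition's result straight to the driver. Here vectors is a hypothetical RDD[Array[Double]] of equal-length feature vectors:

val dim = 100   // example vector length
val total = vectors.treeAggregate(new Array[Double](dim))(
  (acc, v) => { var i = 0; while (i < dim) { acc(i) += v(i); i += 1 }; acc },  // per-partition fold
  (a, b)   => { var i = 0; while (i < dim) { a(i)  += b(i); i += 1 }; a },     // partial merges on workers
  depth = 2
)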
110. Interoperability
Avoid brittle or language-specific model serializers
when sharing models with non-Spark environments.
JSON is imperfect but ubiquitous. However, json4s
will serialize case classes for free!
See also SPARK-13944, merged recently into 2.0.
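For example, a sketch using json4s with the FrozenModel case class from the earlier slides (NoTypeHints keeps the payload free of Scala class names, which helps non-Spark consumers):

import org.json4s.NoTypeHints
import org.json4s.jackson.Serialization

implicit val formats = Serialization.formats(NoTypeHints)

// model is an instance of the Model class sketched above.
val json: String = Serialization.write(model.freeze)          // FrozenModel -> JSON
val restored = Serialization.read[FrozenModel](json)          // JSON -> FrozenModel
val thawed = Model.thaw(restored)                             // back to a live Model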
111. Feature engineering
Favor feature engineering effort over complex or
novel learning algorithms.
Prefer approaches that train interpretable models.
Design your feature engineering pipeline so you can
translate feature vectors back to factor values.