https://aka.ms/spark-architecture https://aka.ms/distributed-programming https://twitter.com/AdiPolak ------------------------------------------------------------------------------------------------- Spark is quickly gaining steam in the Big Data analytics world, and it also has excellent features for stream processing and machine learning development. Big Data systems are a new reality. Whether we're creating microservices-based solutions or machine learning-based products, we work with data, most often big data. Apache Spark is your friend if you want to position yourself on the road to success as a Big Data Developer, Data Engineer, or Data Scientist with hands-on production experience. Come to this session to learn the Apache Spark basics and how YOU can get started with it.
Millions of websites depend on the WordPress.org infrastructure every day. Parts of it are almost a decade old, ancient in software terms, and most of it is not even based on WordPress itself. By the time of WordCamp San Antonio, we will hopefully have finished a project that drastically modernizes the theme repository and moves it over to a WordPress installation, complete with a redesigned front-end and a more robust backend. This session will go over the changes involved, and will look into how WordPress.org works and what its underlying structure looks like.
APIs are everywhere today and can be a great building block of modern applications. But all too often APIs are not truly great. Rather than love your API, developers curse it. How can you avoid that fate? In this session we'll look at the most common mistakes API providers make and how you can avoid making them too. Do you offer a bad developer experience (DX)? Poor, inconsistent API design? Unreliable services? This talk is a deep dive on not just what to avoid but what to do instead. And you'll leave knowing how to get developers to love your API, not hate it.
Why Docker? Is it cool? Is it the newest thing? Does it solve _my_ problem? In reality, as DevOps thought leaders and professionals, the question is really, "Do the benefits of adopting Docker outweigh its costs -- in terms of risk and opportunity cost -- for my company?"
This document discusses the challenges of API documentation and differences between technical authors and API writers. It outlines that API writers require both strong technical skills like coding abilities as well as writing skills, which can be difficult to find. It also notes differences in tools used for general documentation versus API documentation. Finally, it provides recommendations like improving skills through training, having realistic expectations of skills, improving standards and tools, and making documentation a shared team responsibility.
This document outlines three principles for designing mobile APIs: 1. Reduce round trips to the server by bundling data into single requests to minimize network overhead which conserves mobile resources. 2. Control verbosity by purging unnecessary data, using compression, and allowing clients to specify desired response fields to reduce payload size. 3. Restrict access by identifying request sources, denying unauthorized requests, and protecting sensitive data with a mobile-friendly security model. Examples are provided to illustrate each principle in action.
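The second principle above, controlling verbosity, can be sketched in a few lines of plain Python. This is a hypothetical illustration, not code from the slides: a server-side helper that returns only the fields a mobile client asked for, so the payload carries nothing extra over the network.

```python
def select_fields(resource, fields=None):
    """Return only the requested fields of a response dict.

    If `fields` is None, the client asked for everything, so the
    full resource is returned unchanged.
    """
    if fields is None:
        return dict(resource)
    return {k: v for k, v in resource.items() if k in fields}

# Hypothetical resource: a mobile client listing users needs only id and name.
user = {"id": 7, "name": "Ada", "bio": "Loves APIs",
        "avatar_url": "https://example.com/a.png"}
slim = select_fields(user, {"id", "name"})  # -> {"id": 7, "name": "Ada"}
```

In a real API this would typically be driven by a query parameter such as `?fields=id,name`, combined with compression at the transport layer.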
The document discusses contributing to the pandas open source project. It provides information on how to become a contributor, the number of current contributors and issues, and steps for setting up a development environment, making changes, writing tests, and submitting pull requests.
Spring Data REST allows creating RESTful APIs for data access without writing any endpoints or services. It generates REST endpoints for CRUD operations on data sources based on Spring Data repositories. With Spring Data REST, microservices require less code and fewer services and endpoints, while gaining more possibilities through HATEOAS links between resources. Documentation on Spring Data REST can be found on the Spring website and in the reference guide.
The document discusses RxJS, a library for reactive programming using observables. It begins with an introduction from a beginner and expert perspective on RxJS. It then covers topics like creating observables, best practices for importing RxJS, choosing operators, avoiding subscriptions, wrapping APIs, and the benefits of "same-shapedness". Code examples are provided for creating observables, getting input changes as an observable, using operators like switchMap, and merging multiple observable data sources.
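The core idea the deck builds on, an observable as a push-based stream that operators transform, can be sketched in a few lines. This toy version is written in plain Python rather than RxJS, purely to show the shape of `create`/`map`/`subscribe`; the real library's operator set and semantics are much richer.

```python
class Observable:
    """A toy observable: wraps a function that pushes values to an observer."""

    def __init__(self, subscribe_fn):
        self._subscribe_fn = subscribe_fn

    def subscribe(self, observer):
        # Start the stream: the producer pushes values into `observer`.
        self._subscribe_fn(observer)

    def map(self, fn):
        # Operators return a *new* observable that transforms each value,
        # which is the "same-shapedness" the talk highlights.
        return Observable(lambda obs: self._subscribe_fn(lambda v: obs(fn(v))))

def of(*values):
    """Emit a fixed sequence of values, like RxJS's `of`."""
    return Observable(lambda obs: [obs(v) for v in values])

received = []
of(1, 2, 3).map(lambda x: x * 10).subscribe(received.append)
# received is now [10, 20, 30]
```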
In general, this presentation is about API documentation, plus a quick introduction to RAML and its new version, 1.0.
Espresso Logic builds and runs RESTful servers for SQL databases, with advanced support for row/column-level security, and business logic via JavaScript events and reactive programming.
Imagine we have Ada, our data science intern. Let's run through a very simple wordcount Spark job, and find a handful of potential failure points. Dozens of failures can and should happen when running Spark jobs on commodity hardware. Given the basic foundation for infrastructure-level expectations, this talk gives Ada tools to ensure her job isn't caught dead. Once the simple example job runs reliably, with the potential to scale, our data scientist can apply the same toolset to focus on some more interesting algorithms. Turn SNAFUs into successes by anticipating and handling infra failures gracefully. Note: this talk is a Spark-focused extension of Part I, "Just Enough DevOps For Data Scientists" from Scale by The Bay 2018 https://www.youtube.com/watch?v=RqpnBl5NgW0&t=19s
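The wordcount job Ada runs boils down to two stages: a map stage that turns lines into `(word, 1)` pairs, and a reduce-by-key stage that sums counts per word. Here is that logic sketched in plain Python (with hypothetical input lines); in Spark the same shape runs distributed, and the reduce step is what triggers a shuffle.

```python
from collections import defaultdict

# Hypothetical input: in the real job these lines come from a file on HDFS/S3.
lines = ["to be or not to be", "to know is to know"]

# Map stage: each line becomes (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# reduceByKey stage: sum counts per word -- at scale, this is the shuffle,
# and it is where many of the talk's failure points (disk, partitions) live.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n
# counts -> {"to": 4, "be": 2, "or": 1, "not": 1, "know": 2, "is": 1}
```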
Anya Bida is a senior member of technical staff working on Spark tuning at Salesforce. She has a PhD from Mayo Clinic and BS from Johns Hopkins. The document discusses DevOps concepts for data scientists like handling infrastructure failures when running Spark jobs. It provides an overview of Spark operations like map, reduceByKey and saveAsTextFile. It also discusses best practices for avoiding common Spark and HDFS failures through techniques like high availability, sufficient disk space, optimizing partitions, and persisting or checkpointing data.
The document summarizes an agenda for a Big Data & Data Science event. The agenda includes: 1. A presentation on the Apache Spark ecosystem by an expert. 2. A talk on running SQL on large Hadoop clusters using SparkSQL and BigSQL. 3. A student presentation on a social data machine learning project. 4. A presentation on the Data Science Experience tool. 5. A talk on using data to detect anomalies. The event will conclude with a question and answer session.
Spark 2.0 is a major release of Apache Spark. This release has brought many changes to the APIs and libraries of Spark. So in this KnolX, we will be looking at some improvements made in Spark 2.0. In these slides we will also get an introduction to some new features in Spark 2.0, like the SparkSession API and Structured Streaming.
This document provides an overview of optimizing Spark SQL performance. It begins with introducing the speaker and their background with Spark. It then discusses reading query plans, interpreting them to understand optimizations, and tuning plans by pushing down filters, avoiding implicit casts, and other techniques. It emphasizes tracking query execution through the Spark UI to analyze jobs, stages and tasks for bottlenecks. The document aims to help understand how to maximize Spark SQL performance.
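One tuning technique the deck covers, pushing filters down, is easy to see in miniature. The sketch below is plain Python, not Spark SQL: applying a selective predicate before a join shrinks the rows the join must touch, which is the same effect the optimizer aims for when it pushes a filter below a join or into the data source.

```python
# Hypothetical tables: (order_id, region) and (order_id, item).
orders = [(1, "EU"), (2, "US"), (3, "EU")]
items = [(1, "book"), (2, "pen"), (3, "mug")]

# Pushed-down: filter first, so the join only sees the surviving rows.
eu_orders = [o for o in orders if o[1] == "EU"]
joined = [(oid, item) for oid, _ in eu_orders
          for iid, item in items if iid == oid]
# joined -> [(1, "book"), (3, "mug")]
```

In real Spark SQL you would confirm the pushdown by reading the physical plan (e.g. `PushedFilters` in a scan node) and by watching input sizes in the Spark UI, as the talk describes.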
Michal Malohlava talks about the PySparkling Water package for Spark and Python users. - Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai - To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Big Data is all about being able to access and process data in various formats, and from various sources. Apache Bahir provides extensions to distributed analytics platforms, giving them access to different data sources. In this talk we will introduce you to Apache Bahir and its various connectors that are available for Apache Spark and Apache Flink. We will also go over the details of how to build, test and deploy a Spark application using the MQTT data source for the new Apache Spark 2.0 Structured Streaming functionality.
This document discusses AppsFlyer's experience running Spark on Mesos in production for retention data processing and analytics. Key points include: - AppsFlyer processes over 30 million installs and 5 billion sessions daily for retention reporting across 18 dimensions using Spark, Mesos, and S3. - Challenges included timeouts and errors when using Spark's S3 connectors due to the eventual consistency of S3, which was addressed by using more robust connectors and configuration options. - A coarse-grained Mesos scheduling approach was found to be more stable than fine-grained, though it has limitations like static core allocation that future Mesos improvements may address. - Tuning jobs for coarse-
In this presentation we'll explain how to use the R programming language with Spark using a Databricks notebook and the SparkR package. We'll discuss how to push data wrangling to the Spark nodes for massive scale and how to bring it back to a single node so we can use open source packages on the data. We'll demonstrate converting SQL tables into R distributed data frames and how to convert R data frames to SQL tables. We'll also have a look at how to train predictive models using data distributed over the Spark nodes. Bring your popcorn. This is a fun and interesting presentation. Speaker: Bryan Cafferky
Spark is in high demand for several reasons: it offers low-latency processing by keeping data in memory, supports streaming analytics, machine learning algorithms, and graph processing. It also introduces DataFrames for easier data analysis and integrates well with Hadoop for processing large datasets. Spark can sort 100TB of data 3 times faster than MapReduce using fewer resources, making it a popular big data processing engine.
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
The document discusses creating a fast data pipeline using Apache Spark's Structured Streaming and Spark Streaming. It presents a sensor anomaly detection pipeline that uses Structured Streaming for data exploration, preparation, and anomaly detection, and Spark Streaming for online model creation and training. It compares the execution and abstraction models of Structured Streaming and Spark Streaming, and demonstrates how to build the sensor anomaly detection pipeline using Kafka sources and sinks with SQL operations, event time windows, and watermarks.
Kappa Architecture is an alternative to Lambda Architecture that simplifies real-time data processing. It uses a distributed log like Kafka to store all input data immutably to allow reprocessing from the beginning if the processing code changes. This avoids having to maintain separate batch and real-time processing systems. The ASPgems team has implemented Kappa Architecture for several clients using Kafka, Spark Streaming, and Cassandra to provide real-time analytics and metrics in sectors like telecommunications, IoT, insurance, and energy.
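The Kappa idea can be shown in miniature: every event lives in one append-only log, and changing the processing code just means replaying that log from the beginning to rebuild the derived view. The sketch below is plain Python with a list standing in for Kafka, purely to illustrate why no separate batch layer is needed.

```python
# Immutable event log (stand-in for a Kafka topic): (event_type, count) pairs.
log = [("page_view", 1), ("click", 1), ("page_view", 1)]

def build_view(events, process):
    """Replay the full log through a processing function to build a view."""
    view = {}
    for kind, n in events:
        process(view, kind, n)
    return view

# v1 of the processing code: a single total counter.
v1 = build_view(log, lambda view, kind, n:
                view.update(total=view.get("total", 0) + n))

# v2 changes the logic to count per event type. No batch system to keep in
# sync -- we simply replay the same log from offset 0 with the new code.
def count_by_kind(view, kind, n):
    view[kind] = view.get(kind, 0) + n

v2 = build_view(log, count_by_kind)
```

In the real architecture the log is Kafka with long retention, the processing function is a Spark Streaming job, and the view lands in a store like Cassandra.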
This Edureka Spark Tutorial will help you to understand all the basics of Apache Spark. This Spark tutorial is ideal for both beginners as well as professionals who want to learn or brush up Apache Spark concepts. Below are the topics covered in this tutorial: 1) Big Data Introduction 2) Batch vs Real Time Analytics 3) Why Apache Spark? 4) What is Apache Spark? 5) Using Spark with Hadoop 6) Apache Spark Features 7) Apache Spark Ecosystem 8) Demo: Earthquake Detection Using Apache Spark
In this era of ever-growing data, the need for analyzing it for meaningful business insights becomes more and more significant. There are different Big Data processing alternatives like Hadoop, Spark, Storm etc. Spark, however, is unique in providing batch as well as streaming capabilities, thus making it a preferred choice for lightning-fast Big Data analysis platforms.
This talk will provide a brief update on Microsoft’s recent history in Open Source with specific emphasis on Azure Databricks, a fast, easy and collaborative Apache Spark-based analytics service. Attendees will learn how to integrate MongoDB Atlas with Azure Databricks using the MongoDB Connector for Spark. This integration allows users to process data in MongoDB with the massive parallelism of Spark, its machine learning libraries, and streaming API.
In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, the analytics requirements are changing rapidly, forcing businesses to either keep up or be left behind.
This slide deck accompanied the presentation at #SUGUK on 20180322 in London, UK. PowerApps allows you to build business applications with no code, and is included in most Office 365 plans.
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
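The pattern behind Spark Listeners is a plain callback interface: the scheduler fires events as jobs and stages complete, and every registered listener's handler runs. The sketch below shows that shape in plain Python (names like `Scheduler` and `on_job_end` are illustrative stand-ins; in real code you extend `SparkListener` and register it with the SparkContext).

```python
class Scheduler:
    """Stand-in for the Spark scheduler that emits lifecycle events."""

    def __init__(self):
        self.listeners = []

    def add_listener(self, listener):
        self.listeners.append(listener)

    def run_job(self, job_id):
        # ... job executes ...
        # On completion, notify every registered listener.
        for listener in self.listeners:
            listener.on_job_end(job_id)

class MetricsListener:
    """A listener that records completed jobs, e.g. to ship to a dashboard."""

    def __init__(self):
        self.completed = []

    def on_job_end(self, job_id):
        self.completed.append(job_id)

scheduler = Scheduler()
metrics = MetricsListener()
scheduler.add_listener(metrics)
scheduler.run_job(42)  # metrics.completed is now [42]
```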
This document discusses processing large messages with Kafka Streams and Kafka Connect. It describes how large messages can exceed Kafka's maximum message size limit. It proposes using an S3-backed serializer to store large messages in S3 and send pointers to Kafka instead. This allows processing logic to remain unchanged while handling large messages. The serializer transparently retrieves messages from S3 during deserialization.
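The pointer pattern the document describes can be sketched with a dict standing in for S3. This is an illustration of the idea, not the actual serializer's API: payloads over the broker's size limit go to the object store, only a small pointer record travels through Kafka, and deserialization resolves the pointer transparently so downstream logic never notices.

```python
FAKE_S3 = {}            # stand-in for an S3 bucket
MAX_KAFKA_BYTES = 16    # tiny limit, purely for illustration

def serialize(key: str, payload: bytes) -> bytes:
    if len(payload) <= MAX_KAFKA_BYTES:
        return b"inline:" + payload        # small enough: send as-is
    FAKE_S3[key] = payload                 # too big: park it in "S3"
    return b"s3:" + key.encode()           # and send only the pointer

def deserialize(record: bytes) -> bytes:
    if record.startswith(b"inline:"):
        return record[len(b"inline:"):]
    # Pointer record: fetch the real payload from "S3" transparently.
    return FAKE_S3[record[len(b"s3:"):].decode()]
```

A small message round-trips inline, while a large one round-trips via the store; either way, the consumer's processing code sees the original bytes.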
By running your workloads in Kubernetes, you can focus on designing and building your applications instead of managing the infrastructure that runs them. But wait! What about the cost? With the Virtual Kubelet provider for AKS and Azure Container Instances, both Linux and Windows containers can be scheduled on a container instance as if it were a standard Kubernetes node. This configuration allows you to take advantage of both the capabilities of Kubernetes and the management value and cost benefit of container instances. In this talk you will learn how to deploy an application to AKS and ACI with Virtual Kubelet, leveraging the scalability of Kubernetes and the cost efficiency of ACI. https://dev.to/adipolak/kubernetes-and-virtual-kubelet-in-a-nutshell-gn4
This document discusses AI at scale using Apache Spark on Azure. It provides an overview of Apache Spark, how it can be used for machine learning with tools like MLlib and Databricks, and how cognitive services can be combined with Spark. It also discusses using Azure services like Databricks, HDInsight and AKS for running Spark workloads at scale, and the roles of data engineers and data scientists.
Breaking up a monolith, or switching from a desktop client to the web at scale, requires us to think about many factors: the engineering team and the knowledge it already possesses, the technologies that exist, how to build the infrastructure right, and much more. How can we use Kubernetes with Virtual Kubelet to cut costs and use the right service for the workload, whether it is a burst workload or a steady one?