Akka HTTP is a toolkit for building scalable REST services in Scala. It provides a high-level API built on top of Akka actors and Akka Streams for writing asynchronous, non-blocking, and resilient microservices. The document discusses Akka HTTP's architecture, routing DSL, directives, testing, and additional features such as file uploads and WebSockets. It also compares Akka HTTP to other Scala frameworks and outlines the pros and cons of using Akka HTTP for building REST APIs.
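To give a flavor of the routing DSL, here is a minimal, self-contained server sketch. The route, host, and port are illustrative, and it assumes the classic (pre-Akka 2.6) APIs with an explicit ActorMaterializer:

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer

object HelloServer extends App {
  implicit val system = ActorSystem("hello")
  implicit val materializer = ActorMaterializer()

  // Routing DSL: directives nest to describe the HTTP interface declaratively
  val route =
    pathPrefix("api") {
      path("hello") {
        get {
          complete("Hello from Akka HTTP")
        }
      }
    }

  Http().bindAndHandle(route, "localhost", 8080)
}
```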
Spark Streaming provides a way to consume continuous streams of data. Built on top of Spark Core, it supports Java, Scala, and Python, and its API is similar to that of Spark Core.
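A minimal DStream sketch of that Spark Core-like API, assuming a text source on a local socket (the host, port, and batch interval are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount extends App {
  val conf = new SparkConf().setAppName("streaming-word-count").setMaster("local[2]")
  // Micro-batch context: one RDD is produced per 5-second batch
  val ssc = new StreamingContext(conf, Seconds(5))

  val lines  = ssc.socketTextStream("localhost", 9999)
  val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
  counts.print()

  ssc.start()
  ssc.awaitTermination()
}
```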
Spark 2.0 is a major release of Apache Spark that brought many changes to Spark's APIs and libraries. In this KnolX we will look at some of the improvements made in Spark 2.0, and these slides also introduce new features in Spark 2.0 such as the SparkSession API and Structured Streaming.
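For instance, the SparkSession API unifies the old SQLContext and HiveContext behind a single entry point. A minimal sketch (the app name and master are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object Spark2Intro extends App {
  // One entry point instead of separate SQLContext/HiveContext
  val spark = SparkSession.builder()
    .appName("spark2-intro")
    .master("local[*]")
    .getOrCreate()

  val sc = spark.sparkContext // the underlying SparkContext is still reachable
  val df = spark.range(0, 10).toDF("id")
  df.show()
}
```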
Things were easier when all our data was offline, analyzed overnight in batches. Now our data is online, in motion, and generated constantly. For architects, developers, and their businesses, this means there is an urgent need for tools and applications that can deliver real-time (or near-real-time) streaming ETL capabilities. In this session by Konrad Malawski, author, speaker, and Senior Akka Engineer at Lightbend, you will learn how to build these streaming ETL pipelines with Akka Streams, Alpakka, and Apache Kafka, and why they matter to enterprises that are increasingly turning to streaming Fast Data applications.
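As a rough sketch of such a pipeline using the Alpakka Kafka connector (the topic names, group id, and transformation are illustrative, and the classic akka-stream-kafka API is assumed):

```scala
import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.common.serialization.StringDeserializer

object EtlPipeline extends App {
  implicit val system = ActorSystem("etl")
  implicit val materializer = ActorMaterializer()

  // Kafka topic -> transformation -> sink, with back-pressure end to end
  val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("etl-group")

  Consumer.plainSource(settings, Subscriptions.topics("raw-events"))
    .map(record => record.value().toUpperCase) // stand-in "transform" step
    .runWith(Sink.foreach(println))
}
```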
This document provides an overview of Structured Streaming with Kafka in Spark. It discusses the distinction between data collection and data ingestion and why it matters. It also covers Kafka architecture and terminology, and describes how Spark integrates with Kafka as a streaming data source. It explains checkpointing in Structured Streaming and using Kafka as a sink. The document discusses delivery semantics and how Spark supports exactly-once semantics with certain output stores. Finally, it outlines new Kafka features for exactly-once guarantees and the future of Structured Streaming.
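A hedged sketch of Kafka as both source and sink in Structured Streaming (the bootstrap servers, topic names, and checkpoint path are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object KafkaRoundTrip extends App {
  val spark = SparkSession.builder().appName("kafka-structured").getOrCreate()

  // Kafka as a streaming source
  val input = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()

  val values = input.selectExpr("CAST(value AS STRING) AS value")

  // Kafka as a sink; the checkpoint location lets the query recover after failure
  values.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "events-out")
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .start()
    .awaitTermination()
}
```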
A brief introduction to the DataSource V2 API in Spark 2.3.0, with a comparison to the previous DataSource API.
The last few years have seen the emergence of serverless as a paradigm for event streaming. Its very simple programming model has attracted developers in droves, while its ability to scale elastically has simplified operations significantly. Combined with its ubiquity across all cloud providers, serverless has become the leading choice for event processing at scale for many companies. In this talk, Sijie Guo from StreamNative explores how the serverless paradigm is applied to event streaming in Apache Pulsar, a next-generation event streaming system. Pulsar provides native support for serverless functions, where events are processed as soon as they arrive, in a streaming manner, with flexible deployment options (thread, process, or container). He describes how these serverless functions make data engineering easier and shares real-world usage of Pulsar Functions.
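As a taste of the programming model, Pulsar can run any java.util.function.Function as a "native" function. A minimal Scala sketch (the function name and logic are illustrative):

```scala
// Pulsar runs any java.util.function.Function as a "native" function:
// each incoming event is processed as soon as it arrives, and the return
// value is published to the function's output topic.
class ExclamationFunction extends java.util.function.Function[String, String] {
  override def apply(input: String): String = input + "!"
}
```

Once packaged, such a function can be submitted with the pulsar-admin CLI and run as a thread, process, or container, matching the deployment options mentioned above.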
Spark 2.0 introduces several major changes, including Dataset as the main abstraction in place of RDDs for optimized performance. The migration involves updating to Scala 2.11, replacing the old contexts with SparkSession, using the built-in CSV connector, rewriting RDD-based code against the Dataset APIs, adding checks for cross joins, and updating custom ML transformers. Migrating lets applications take advantage of many of the improvements in Spark 2.0 while addressing its breaking changes.
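A sketch of what a migrated job might look like (the input path and the Record schema are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

case class Record(id: Long, name: String)

object MigratedJob extends App {
  // SparkSession replaces SQLContext/HiveContext in Spark 2.x
  val spark = SparkSession.builder().appName("migrated-job").master("local[*]").getOrCreate()
  import spark.implicits._

  // CSV support is built into Spark 2.x, replacing the external spark-csv package
  val df = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/input.csv") // hypothetical path

  // RDD-style logic expressed against the typed Dataset API
  val ds = df.as[Record].filter(_.id > 0)
  ds.show()
}
```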
The term 'streams' has been getting pretty overloaded recently, and it's hard to know where best to use the different technologies with 'streams' in the name. In this talk by noted hAkker Konrad Malawski, we'll disambiguate what streams are and what they aren't, taking a deeper look at Akka Streams (the implementation) and Reactive Streams (the standard). You'll be introduced to a number of real-life scenarios where applying back-pressure helps keep your systems fast and healthy at the same time. While the focus is mainly on the Akka Streams implementation, the general principles apply to any kind of asynchronous, message-driven architecture.
Akka Streams is an implementation of Reactive Streams, a standard for asynchronous stream processing with non-blocking backpressure on the JVM. In this talk we'll cover the rationale behind Reactive Streams and explore the different building blocks available in Akka Streams. I'll use Scala for all coding examples, but Akka Streams also provides a full-fledged Java 8 API. After this session you will be all set and ready to reap the benefits of using Akka Streams!
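The basic building blocks are Source, Flow, and Sink. A minimal sketch using the classic (pre-Akka 2.6) materializer setup:

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

object BuildingBlocks extends App {
  implicit val system = ActorSystem("streams-demo")
  implicit val materializer = ActorMaterializer()

  // Source ~> Flow ~> Sink: demand flows upstream, so backpressure is automatic
  val numbers = Source(1 to 100)
  val doubler = Flow[Int].map(_ * 2)
  val summer  = Sink.fold[Int, Int](0)(_ + _)

  val result = numbers.via(doubler).runWith(summer) // Future[Int]
  result.foreach(println)(system.dispatcher)
}
```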
Spark can run on Kubernetes in two ways: as a static cluster or with native integration. As a static cluster, Spark pods are deployed manually, without autoscaling. Native integration treats Kubernetes as a resource manager, allowing Spark to dynamically acquire and release containers as it does on YARN. It uses Kubernetes custom controllers to create driver pods, which then launch worker pods. This provides autoscaling of resources based on job demands.
This document provides an introduction to Structured Streaming in Apache Spark. It discusses the evolution of stream processing, drawbacks of the DStream API, and advantages of Structured Streaming. Key points include:
- Structured Streaming models streams as infinite tables/datasets, allowing stream transformations to be expressed using SQL and Dataset APIs.
- It supports features like event time processing, state management, and checkpointing for fault tolerance.
- It allows stream processing to be combined more easily with batch processing using the common Dataset abstraction.
The document also provides examples of reading from and writing to different streaming sources and sinks using Structured Streaming.
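A minimal word-count sketch in that "infinite table" model, assuming a local socket source and a console sink for demonstration:

```scala
import org.apache.spark.sql.SparkSession

object StructuredWordCount extends App {
  val spark = SparkSession.builder().appName("structured-intro").master("local[*]").getOrCreate()
  import spark.implicits._

  // The stream is treated as an unbounded table: ordinary Dataset operations apply
  val lines = spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()

  val counts = lines.as[String]
    .flatMap(_.split("\\s+"))
    .groupBy("value")
    .count()

  counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
    .awaitTermination()
}
```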
The document discusses Reactive Slick, a new version of the Slick database access library for Scala that adds reactive capabilities. It allows parallel database execution and streaming of large query results using Reactive Streams. Reactive Slick is suited to composite database tasks, combining asynchronous tasks, and processing large datasets as reactive streams.
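A rough sketch of streaming a large result set with Slick 3 (the "mydb" config entry and the users table are assumptions):

```scala
import slick.jdbc.H2Profile.api._

// Hypothetical table definition
class Users(tag: Tag) extends Table[(Long, String)](tag, "users") {
  def id   = column[Long]("id", O.PrimaryKey)
  def name = column[String]("name")
  def *    = (id, name)
}

object StreamUsers extends App {
  val db = Database.forConfig("mydb") // assumes a "mydb" entry in application.conf
  val users = TableQuery[Users]

  // db.stream returns a Reactive Streams publisher: rows arrive with backpressure
  val publisher = db.stream(users.result)
  publisher.foreach(row => println(row))
}
```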
A presentation of specs2 functionality, from the simple features to the less well-known ones, plus an overview of the next release.
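One of the simple features: a minimal mutable specification (the class and expectations are illustrative):

```scala
import org.specs2.mutable.Specification

// A minimal mutable specification
class CalculatorSpec extends Specification {
  "addition" should {
    "sum two integers" in {
      1 + 1 must beEqualTo(2)
    }
  }
}
```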
This document summarizes the key challenges and solutions in building a real-time data pipeline that ingests data from a database, transforms it with Spark Streaming, and publishes the output to Salesforce. The pipeline aims for a latency of one minute with zero data loss and ordering guarantees. Challenges discussed include handling out-of-sequence and late-arrival events, schema evolution, bootstrap loading, data loss/corruption, and diagnosing issues. The proposed solutions use Kafka, checkpointing, replay capabilities, and careful broker/Connect setups to meet the pipeline's reliability requirements.
Event-driven systems come in different shapes and sizes, and the rules for payload construction are: there are no rules (but there are guidelines). Flexible payloads are both the best and worst thing about event streaming - you never quite know what to expect from each system's payloads. Just like when you met your first NoSQL datastore, this sounds like chaos! In this session we will cover strategies for designing the payloads you stream over Kafka. From which fields to include and common mistakes to avoid, to what to do when the data structure changes over time, this session has real-world advice and examples that you can apply in your own projects. We will also look at other aspects, such as when to use a self-contained data format like JSON or XML, when a serialization format like Avro is best, and how to handle the schemas. This session is recommended for anyone who wants to design their payloads right the first time and have all their applications playing nicely together.
With Viktor Klang, Deputy CTO, Lightbend, Inc. As software grows more and more interconnected, and with several generations of software having to interoperate, a new take on the integration of systems is needed: ad hoc, unversioned, and unreplicated scripts just won't suffice, and the traditional Enterprise Service Bus (ESB) concept has experienced stability, reliability, performance, and scalability problems. In this webinar, Viktor explores a new take on Enterprise Integration Patterns. First, he will explore the Reactive Streams standard, an orchestration layer where transformations are standalone, composable, reusable, and, most importantly, use asynchronous flow control (back pressure) to maintain predictable, stable behavior over time. Furthermore, he will go through how one-off workloads relate to continuous and batch workloads, and how they can be addressed by that very same orchestration layer. Finally, he will review how this type of design achieves resilience, scalability, and, ultimately, responsiveness.
Akka Streams and its amazing handling of stream back-pressure should be no surprise to anyone. But it takes a couple of use cases to really see it in action, especially ones where the amount of work grows as you process, which makes you truly value the back-pressure. This talk takes a sample web crawler use case, where each processing pass expands into a larger and larger workload, and discusses how we use the buffering capabilities of Kafka and the back-pressure with asynchronous processing in Akka Streams to handle such bursts. In addition, we provide some constructive "rants" about the architectural components, the maturity (or immaturity) you can expect, and tidbits and open-source goodies, like memory-mapped stream buffers, that can be helpful in other Akka Streams and/or Kafka use cases.
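A sketch of the buffering idea in Akka Streams (the crawl step is a stand-in, and the buffer size and parallelism are arbitrary):

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.{ActorMaterializer, OverflowStrategy}

import scala.concurrent.Future

object Crawler extends App {
  implicit val system = ActorSystem("crawler")
  implicit val materializer = ActorMaterializer()
  implicit val ec = system.dispatcher

  // Hypothetical crawl step: each page yields more URLs to visit
  def extractLinks(url: String): List[String] = List(s"$url/a", s"$url/b")

  Source(List("https://example.com"))
    .mapConcat(extractLinks)                      // workload expands on each pass
    .buffer(1000, OverflowStrategy.backpressure)  // absorb bursts, then slow the upstream
    .mapAsync(parallelism = 4)(url => Future(extractLinks(url)))
    .runWith(Sink.foreach(println))
}
```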
This document discusses telco analytics at scale using distributed stream processing. It describes using technologies like Apache Spark Streaming, Kafka, and Hadoop (HDFS, Hive, HBase) to ingest and process large volumes of streaming data from various sources in real time or near real time. Example use cases discussed include fraud detection, real-time rating, and security information and event management. It also covers strategies for distributed in-memory caching and rule processing to enable low-latency analytics at the high-throughput scales needed for telco data and applications.
Dan Persa, Senior Software Engineer at Zalando
Dan Persa has been a software engineer at Zalando since 2013 and is a member of the Fashion Store team, which is responsible for Zalando's core ecommerce business. He loves Java and Scala and more recently has been exploring Go and Node.js. He's a big fan of Clean Code and Software Craftsmanship. In addition to coding, he enjoys mentoring new developers, organizing coder dojos and reading groups, and giving tech talks. In his free time he likes to take photos and dance salsa.
tech.zalando.com
An introduction to Akka HTTP and its components: why Akka HTTP over Spray, and the high-level server API and its building blocks.
This document discusses a platform for data scientists that aims to automate routine jobs, maximize resource utilization, and let data scientists focus more on business solutions. The platform provides capabilities for data capture, analysis, modeling, and output of analytics. It seeks to reduce the time taken to turn data into insights from months to weeks. Key elements of the platform include tools for exploratory data analysis, advanced modeling, a distributed architecture, bespoke algorithms, and packaged analytics solutions.
This document provides an overview and introduction to Akka HTTP, a Scala library built on Akka Streams for HTTP-based applications. Some key points:
- Akka HTTP uses Akka Streams to model HTTP requests and responses as streaming data flows.
- It allows building both HTTP clients and servers by composing stream processing stages together.
- Common directives and operations like routing, marshalling, validation, and testing are supported through a high-level API.
- Examples demonstrate basic usage like creating a route that returns XML, running a server, and writing tests against routes.
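Route testing, for example, might look like this minimal sketch, assuming akka-http-testkit with the ScalaTest 3.0-era traits:

```scala
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.testkit.ScalatestRouteTest
import org.scalatest.{Matchers, WordSpec}

class HelloRouteSpec extends WordSpec with Matchers with ScalatestRouteTest {
  val route = path("hello") { get { complete("Hello, Akka HTTP!") } }

  "the hello route" should {
    "answer GET requests to /hello" in {
      // The ~> operators run the request through the route in-process
      Get("/hello") ~> route ~> check {
        responseAs[String] shouldEqual "Hello, Akka HTTP!"
      }
    }
  }
}
```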
Going down the microservices route makes a lot of things around creating and maintaining large systems easier, but it comes at a cost too, particularly around security. While securing monolithic applications is a relatively well-understood area, the same can't be said about microservice-based architectures. This presentation covers how adopting microservices affects the security of distributed systems, outlines the pros and cons of several standards and common practices, and offers practical suggestions for securing microservice-based systems using Play and Akka HTTP.
This document discusses strategies for building interactive streaming applications in Spark Streaming. It describes using ZooKeeper as a dynamic configuration source, allowing a Spark Streaming application's behavior to be modified at runtime. The key points are:
- ZooKeeper can track configuration changes and trigger Spark Streaming context restarts through its watch mechanism and the Curator library.
- This allows building interactive applications that adapt to configuration updates without restarting the whole streaming job.
- Examples show how Curator caches, such as node and path caches, monitor ZooKeeper for changes and restart Spark Streaming contexts in response.
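A hedged sketch of the node-cache pattern with Curator (the ZooKeeper address and znode path are placeholders, and the actual StreamingContext restart hook is elided):

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.cache.{NodeCache, NodeCacheListener}
import org.apache.curator.retry.ExponentialBackoffRetry

object ConfigWatcher extends App {
  val client = CuratorFrameworkFactory.newClient(
    "localhost:2181", new ExponentialBackoffRetry(1000, 3))
  client.start()

  // Cache a config znode and react to changes via ZooKeeper's watch mechanism
  val cache = new NodeCache(client, "/app/streaming-config")
  cache.getListenable.addListener(new NodeCacheListener {
    override def nodeChanged(): Unit = {
      val newConfig = new String(cache.getCurrentData.getData, "UTF-8")
      println(s"Config changed: $newConfig") // restart the StreamingContext here
    }
  })
  cache.start()
}
```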
This document discusses the development of a single-page web application for a student markbook using Akka actors and Akka HTTP. Key points discussed include:
- Using multiple Akka actors to retrieve student, schedule, subject, and mark data from various data services.
- A worker actor that processes the retrieved data and returns student week marks.
- A REST API with routes to get lists of students and individual student week marks.
- Initializing the application server by binding the API routes to an HTTP server.
Everyone in the Scala world is using or looking into using Akka for low-latency, scalable, distributed or concurrent systems. I'd like to share my story of developing and productionizing multiple Akka apps, including low-latency ingestion and real-time processing systems, and Spark-based applications. When does one use actors vs futures? Can we use Akka with, or in place of, Storm? How did we set up instrumentation and monitoring in production? How does one use VisualVM to debug Akka apps in production? What happens if the mailbox gets full? What is our Akka stack like? I will share best practices for building Akka and Scala apps, pitfalls and things we'd like to avoid, and a vision of where we would like to go for ideal Akka monitoring, instrumentation, and debugging facilities. Plus backpressure and at-least-once processing.
This document provides an overview of functional programming concepts in Scala. It discusses the history and advantages of functional programming, then covers the basics of Scala, including its support for both object-oriented and functional programming. Key functional programming aspects of Scala, such as immutable data, higher-order functions, and implicit parameters, are explained with examples.
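A few of those aspects in miniature (the names are illustrative):

```scala
object FpBasics extends App {
  // Higher-order function: takes a function and returns a new one
  def twice(f: Int => Int): Int => Int = x => f(f(x))
  val addFour = twice(_ + 2)
  println(addFour(1)) // 5

  // Immutable data: transformations return new collections
  val nums = List(1, 2, 3)
  val doubled = nums.map(_ * 2) // nums itself is unchanged
  println(doubled)

  // Implicit parameter supplied by the compiler from scope
  implicit val prefix: String = "Hello"
  def greet(name: String)(implicit p: String): String = s"$p, $name"
  println(greet("Scala")) // Hello, Scala
}
```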
The document discusses Akka, a toolkit for building highly concurrent, distributed, and resilient message-driven applications on the JVM. It describes key components of Akka including actors for concurrency, clusters for location-transparent resilient applications, persistence for event sourcing, and HTTP for asynchronous reactive servers. It also discusses the actor model of concurrent computation and related topics like reactive streams and advantages of asynchronous messaging.
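The actor model in miniature: a minimal classic-Akka sketch (the names are illustrative):

```scala
import akka.actor.{Actor, ActorSystem, Props}

class Greeter extends Actor {
  // Actors process one message at a time: no locks, no shared mutable state
  def receive: Receive = {
    case name: String => println(s"Hello, $name")
  }
}

object Main extends App {
  val system = ActorSystem("demo")
  val greeter = system.actorOf(Props[Greeter], "greeter")
  greeter ! "Akka" // fire-and-forget asynchronous message
}
```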
This document provides an overview of Akka HTTP, a library for building HTTP-based services using Scala and Akka. It describes the common abstractions used in Akka HTTP like HTTP requests, responses, entities, and marshalling/unmarshalling. It also explains the low-level and high-level APIs, with the low-level API providing basic request handling functionality and the high-level API using directives and routing DSL for defining routes in a more flexible way.
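A sketch of the low-level API: a plain function from HttpRequest to HttpResponse, bound directly to a server (the path and port are illustrative, and the classic pre-Akka 2.6 APIs are assumed):

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model._
import akka.stream.ActorMaterializer

object LowLevelServer extends App {
  implicit val system = ActorSystem("low-level")
  implicit val materializer = ActorMaterializer()

  // Low-level API: handle each request with an ordinary function
  val handler: HttpRequest => HttpResponse = {
    case HttpRequest(HttpMethods.GET, Uri.Path("/ping"), _, _, _) =>
      HttpResponse(entity = "pong")
    case _ =>
      HttpResponse(StatusCodes.NotFound, entity = "unknown resource")
  }

  Http().bindAndHandleSync(handler, "localhost", 8080)
}
```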
The document discusses an architecture for real-time ETL processing, using GoldenGate for change data capture from source databases, Kafka as the messaging system, and Spark jobs for streaming reconciliation and joining of data. It also covers requirements for the reconciler component, such as supporting idempotency, immutability, and schema evolution. Challenges with handling out-of-order events in Spark Streaming, and the data model used to address issues like idempotency and schema evolution, are also described.
The document discusses the process of productionalizing a financial analytics application built on Spark over multiple iterations:
1. It started with data scientists using Python and data engineers porting the code to Scala RDDs. The team then moved to DataFrames and deployed on EMR.
2. Issues with code quality and testing led to adding ScalaTest, PR reviews, and daily Jenkins builds. Architectural challenges were addressed by moving to Databricks Cloud, which provided notebooks, jobs, and throwaway clusters.
3. Future work includes using Spark SQL windows and the Dataset API for stronger typing and schema support.
The iterations improved the code, testing, deployment, and use of the latest Spark features.
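For reference, a Spark SQL window function over a hypothetical trades dataset might look like this sketch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

object RunningTotals extends App {
  val spark = SparkSession.builder().appName("windows").master("local[*]").getOrCreate()
  import spark.implicits._

  // Hypothetical trades: (account, day, amount)
  val trades = Seq(("a1", 1, 100.0), ("a1", 2, -40.0), ("a2", 1, 70.0))
    .toDF("account", "day", "amount")

  // Running total per account, ordered by day
  val w = Window.partitionBy("account").orderBy("day")
  trades.withColumn("running_total", sum("amount").over(w)).show()
}
```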
This document provides an overview of Spark Catalyst, including:
- Catalyst trees and expressions represent logical and physical query plans.
- Expressions have datatypes and operate on Row objects.
- Custom expressions can be defined.
- Code generation improves expression evaluation performance by generating Java code compiled with the Janino compiler.
Key concepts like trees, expressions, datatypes, rows, code generation, and the Janino compiler are explained through examples.
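One way to see those Catalyst plan stages is through a DataFrame's queryExecution. A small sketch (the data is illustrative):

```scala
import org.apache.spark.sql.SparkSession

object CatalystDemo extends App {
  val spark = SparkSession.builder().appName("catalyst-demo").master("local[*]").getOrCreate()
  import spark.implicits._

  val df = Seq((1, "a"), (2, "b")).toDF("id", "name").filter($"id" > 1)

  // Catalyst trees at each stage: logical -> optimized logical -> physical
  println(df.queryExecution.logical)
  println(df.queryExecution.optimizedPlan)
  println(df.queryExecution.executedPlan)

  // explain(true) prints all the stages at once
  df.explain(true)
}
```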