This presentation discusses architectural differences between Apache Apex features with Spark Streaming. It discusses how these differences effect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large scale ingestion.
Also, it will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. Further, it will discuss how these features affect time to market and total cost of ownership.
Apache Apex: Stream Processing Architecture and ApplicationsThomas Weise
Slides from http://www.meetup.com/Hadoop-User-Group-Munich/events/230313355/
This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.
This document provides an overview of building an Apache Apex application, including key concepts like DAGs, operators, and ports. It also includes an example "word count" application and demonstrates how to define the application and operators, and build Apache Apex from source code. The document outlines the sample application workflow and includes information on resources for learning more about Apache Apex.
Introduction to Apache Apex and writing a big data streaming application Apache Apex
Introduction to Apache Apex - The next generation native Hadoop platform, and writing a native Hadoop big data Apache Apex streaming application.
This talk will cover details about how Apex can be used as a powerful and versatile platform for big data. Apache apex is being used in production by customers for both streaming and batch use cases. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch. alerts, real-time actions, threat detection, etc.
Presenter : <b>Pramod Immaneni</b> Apache Apex PPMC member and senior architect at DataTorrent Inc, where he works on Apex and specializes in big data applications. Prior to DataTorrent he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs. Before that he was a technical co-founder of a mobile startup where he was an architect of a dynamic content rendering engine for mobile devices.
This is a video of the webcast of an Apache Apex meetup event organized by Guru Virtues at 267 Boston Rd no. 9, North Billerica, MA, on <b>May 7th 2016</b> and broadcasted from San Jose, CA. If you are interested in helping organize i.e., hosting, presenting, community leadership Apache Apex community, please email apex-meetup@datatorrent.com
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingApache Apex
Presenter: Devendra Tagare - DataTorrent Engineer, Contributor to Apex, Data Architect experienced in building high scalability big data platforms.
Apache Apex is a next generation native Hadoop big data platform. This talk will cover details about how it can be used as a powerful and versatile platform for big data.
Apache Apex is a native Hadoop data-in-motion platform. We will discuss architectural differences between Apache Apex features with Spark Streaming. We will discuss how these differences effect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large scale ingestion.
We will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. We will also discuss how these features affect time to market and total cost of ownership.
Ingestion and Dimensions Compute and Enrich using Apache ApexApache Apex
Presenter: Devendra Tagare - DataTorrent Engineer, Contributor to Apex, Data Architect experienced in building high scalability big data platforms.
This talk will be a deep dive into ingesting unbounded file data and streaming data from Kafka into Hadoop. We will also cover data enrichment and dimensional compute. Customer use-case and reference architecture.
From Batch to Streaming with Apache Apex Dataworks Summit 2017Apache Apex
This document discusses transitioning from batch to streaming data processing using Apache Apex. It provides an overview of Apex and how it can be used to build real-time streaming applications. Examples are given of how to build an application that processes Twitter data streams and visualizes results. The document also outlines Apex's capabilities for scalable stream processing, queryable state, and its growing library of connectors and transformations.
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
Stream data processing is becoming increasingly important to support business needs for faster time to insight and action with growing volume of information from more sources. Apache Apex (http://apex.apache.org/) is a unified big data in motion processing platform for the Apache Hadoop ecosystem. Apex supports demanding use cases with:
* Architecture for high throughput, low latency and exactly-once processing semantics.
* Comprehensive library of building blocks including connectors for Kafka, Files, Cassandra, HBase and many more
* Java based with unobtrusive API to build real-time and batch applications and implement custom business logic.
* Advanced engine features for auto-scaling, dynamic changes, compute locality.
Apex was developed since 2012 and is used in production in various industries like online advertising, Internet of Things (IoT) and financial services.
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch alerts, real-time actions, threat detection, etc.
Bio:
Pramod Immaneni is Apache Apex PMC member and senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platform and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs.
DataTorrent Presentation @ Big Data Application MeetupThomas Weise
The document introduces Apache Apex, an open source unified streaming and batch processing framework. It discusses how Apex integrates with native Hadoop components like YARN and HDFS. It then describes Apex's programming model using directed acyclic graphs of operators and streams to process data. The document outlines Apex's support for scaling applications through partitioning, windowing, fault tolerance, and guarantees on processing semantics. It provides an example of building an application pipeline and shows the logical and physical plans. In closing, it directs the reader to Apache Apex community resources for more information.
Apache Apex is a stream processing framework that provides high performance, scalability, and fault tolerance. It uses YARN for resource management, can achieve single digit millisecond latency, and automatically recovers from failures without data loss through checkpointing. Apex applications are modeled as directed acyclic graphs of operators and can be partitioned for scalability. It has a large community of committers and is in the process of becoming a top-level Apache project.
Big Data Berlin v8.0 Stream Processing with Apache Apex Apache Apex
This document discusses Apache Apex, an open source stream processing framework. It provides an overview of stream data processing and common use cases. It then describes key Apache Apex capabilities like in-memory distributed processing, scalability, fault tolerance, and state management. The document also highlights several customer use cases from companies like PubMatic, GE, and Silver Spring Networks that use Apache Apex for real-time analytics on data from sources like IoT sensors, ad networks, and smart grids.
Capital One's Next Generation Decision in less than 2 msApache Apex
This document discusses using Apache Apex for real-time decision making within 2 milliseconds. It provides performance benchmarks for Apex, showing average latency of 0.25ms for over 54 million events with 600GB of RAM. It compares Apex favorably to other streaming technologies like Storm and Flink, noting Apex's self-healing capabilities, independence of operators, and ability to meet latency and throughput requirements even during failures. The document recommends Apex for its maturity, fault tolerance, and ability to meet the goals of latency under 16ms, 99.999% availability, and scalability.
Apache Apex (incubating) is a next generation native Hadoop big data platform. This talk will cover details about how it can be used as a powerful and versatile platform for big data.
Presented by Pramod Immaneni at Data Riders Meetup hosted by Nexient on Apr 5th, 2016
Apache Apex Fault Tolerance and Processing SemanticsApache Apex
Components of an Apex application running on YARN, how they are made fault tolerant, how checkpointing works, recovery from failures, incremental recovery, processing guarantees.
BigDataSpain 2016: Introduction to Apache ApexThomas Weise
Apache Apex is an open source stream processing platform, built for large scale, high-throughput, low-latency, high availability and operability. With a unified architecture it can be used for real-time and batch processing. Apex is Java based and runs natively on Apache Hadoop YARN and HDFS.
We will discuss the key features of Apache Apex and architectural differences from similar platforms and how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, low latency SLA, high throughput and large scale ingestion.
Apex APIs and libraries of operators and examples focus on developer productivity. We will present the programming model with examples and how custom business logic can be easily integrated based on the Apex operator API.
We will cover integration with connectors to sources/destinations (including Kafka, JMS, SQL, NoSQL, files etc.), scalability with advanced partitioning, fault tolerance and processing guarantees, computation and scheduling model, state management, windowing and dynamic changes. Attendees will also learn how these features affect time to market and total cost of ownership and how they are important in existing Apex production deployments.
https://www.bigdataspain.org/
Next Gen Big Data Analytics with Apache Apex discusses Apache Apex, an open source stream processing framework. It provides an overview of Apache Apex's capabilities for processing continuous, real-time data streams at scale. Specifically, it describes how Apache Apex allows for in-memory, distributed stream processing using a programming model of operators in a directed acyclic graph. It also covers Apache Apex's features for fault tolerance, dynamic scaling, and integration with Hadoop and YARN.
Stream data from Apache Kafka for processing with Apache ApexApache Apex
Meetup presentation: How Apache Apex consumes from Kafka topics for real-time time processing and analytics. Learn about features of the Apex Kafka Connector, which is one of the most popular operators in the Apex Malhar operator library, and powers several production use cases. We explain the advanced features this operator provides for high throughput, low latency ingest and how it enables fault tolerant topologies with exactly once processing semantics.
Apache Apex: Stream Processing Architecture and Applications Comsysto Reply GmbH
• Architecture highlights: high throughput, low-latency, operability with stateful fault tolerance, strong processing guarantees, auto-scaling etc
• Application development model, unified approach for real-time and batch use cases
• Tools for ease of use, ease of operability and ease of management
• How customers use Apache Apex in production
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Dataconomy Media
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder of DataTorrent presented "Streaming Analytics with Apache Apex" as part of the Big Data, Berlin v 8.0 meetup organised on the 14th of July 2016 at the WeWork headquarters.
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop PlatformApache Apex
Internet of Things (IoT) devices are becoming more ubiquitous in consumer, business and industrial landscapes. They are being widely used in applications ranging from home automation to the industrial internet. They pose a unique challenge in terms of the volume of data they produce, and the velocity with which they produce it, and the variety of sources they need to handle. The challenge is to ingest and process this data at the speed at which it is being produced in a real-time and fault tolerant fashion. Apache Apex is an industrial grade, scalable and fault tolerant big data processing platform that runs natively on Hadoop. In this deck, you will see how Apex is being used in IoT applications and also see how the enterprise features such as dimensional analytics, real-time dashboards and monitoring play a key role.
Presented by Pramod Immaneni, Principal Architect at DataTorrent and PPMC member Apache Apex, on BrightTALK webinar on Apr 6th, 2016
Real Time Insights for Advertising TechApache Apex
A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several Data centers across the world. In batch data processing, data is collected at different geographic locations and processed at regular intervals. This system brings delay of at least 1 hour before an event is accounted for.
The goal of having real time streaming was to provide publishers, Demand Side Platforms (DSP's) and agencies actionable insights in a few minutes from the time of event generation.
This Ad Tech company uses DataTorrent RTS powered by Apex for:
• Real time reporting
• Resource monitoring
• Real time learning
• Allocation engine
Tushar Gosavi from DataTorrent will take the audience through the architecture, custom operators developed, use cases for real time and the challenges involved in implementing streaming systems at scale where multiple data centers are in play.
Tushar is a Senior Engineer at DataTorrent and has worked in distributed systems and storage domains.
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses the evolution of Netflix's data pipeline to incorporate Kafka to handle 400 billion events per day. It describes how Netflix uses Kafka clusters with different priorities and configurations. It also outlines some of the challenges of using Kafka at Netflix's scale, such as Zookeeper client issues and cluster scaling, and the solutions Netflix developed to address these challenges.
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses how Netflix evolved from using S3 and EMR to introducing Kafka and Kafka producers and consumers to handle 400 billion events per day. It covers challenges of scaling Kafka clusters and tuning Kafka clients and brokers. Finally, it outlines Netflix's roadmap which includes contributing to open source projects like Kafka and testing failure resilience.
- Apache Apex is a platform and framework for building highly scalable and fault-tolerant distributed applications on Hadoop.
- It allows developers to build any custom logic as distributed applications and ensures fault tolerance, scalability and data flow. Applications can process streaming or batch data with high throughput and low latency.
- Apex applications are composed of operators that perform processing on streams of data tuples. Operators can run in a distributed fashion across a cluster and automatically recover from failures without reprocessing data from the beginning.
Stream data processing is increasingly required to support business needs for faster actionable insight with growing volume of information from more sources. Apache Apex is a true stream processing framework for low-latency, high-throughput and reliable processing of complex analytics pipelines on clusters. Apex is designed for quick time-to-production, and is used in production by large companies for real-time and batch processing at scale.
This session will use an Apex production use case to walk through the incremental transition from a batch pipeline with hours of latency to an end-to-end streaming architecture with billions of events per day which are processed to deliver real-time analytical reports. The example is representative for many similar extract-transform-load (ETL) use cases with other data sets that can use a common library of building blocks. The transform (or analytics) piece of such pipelines varies in complexity and often involves business logic specific, custom components.
Topics include:
* Pipeline functionality from event source through queryable state for real-time insights.
* API for application development and development process.
* Library of building blocks including connectors for sources and sinks such as Kafka, JMS, Cassandra, HBase, JDBC and how they enable end-to-end exactly-once results.
* Stateful processing with event time windowing.
* Fault tolerance with exactly-once result semantics, checkpointing, incremental recovery
* Scalability and low-latency, high-throughput processing with advanced engine features for auto-scaling, dynamic changes, compute locality.
* Who is using Apex in production, and roadmap.
Following the session attendees will have a high level understanding of Apex and how it can be applied to use cases at their own organizations.
From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017Thomas Weise
https://berlinbuzzwords.de/17/session/batch-streaming-etl-apache-apex
Stream data processing is increasingly required to support business needs for faster actionable insight with growing volume of information from more sources. Apache Apex is a true stream processing framework for low-latency, high-throughput and reliable processing of complex analytics pipelines on clusters. Apex is designed for quick time-to-production, and is used in production by large companies for real-time and batch processing at scale.
This session will use an Apex production use case to walk through the incremental transition from a batch pipeline with hours of latency to an end-to-end streaming architecture with billions of events per day which are processed to deliver real-time analytical reports. The example is representative for many similar extract-transform-load (ETL) use cases with other data sets that can use a common library of building blocks. The transform (or analytics) piece of such pipelines varies in complexity and often involves business logic specific, custom components.
Topics include:
Pipeline functionality from event source through queryable state for real-time insights.
API for application development and development process.
Library of building blocks including connectors for sources and sinks such as Kafka, JMS, Cassandra, HBase, JDBC and how they enable end-to-end exactly-once results.
Stateful processing with event time windowing.
Fault tolerance with exactly-once result semantics, checkpointing, incremental recovery
Scalability and low-latency, high-throughput processing with advanced engine features for auto-scaling, dynamic changes, compute locality.
Recent project development and roadmap.
Following the session attendees will have a high level understanding of Apex and how it can be applied to use cases at their own organizations.
Real-time Stream Processing using Apache ApexApache Apex
Apache Apex is a stream processing framework that provides high performance, scalability, and fault tolerance. It uses YARN for resource management, can achieve single digit millisecond latency, and automatically recovers from failures without data loss through checkpointing. Apex applications are modeled as directed acyclic graphs of operators and can be partitioned for scalability. It has a large community of committers and is in the process of becoming a top-level Apache project.
The document discusses Cassandra and the xPatterns architecture. It describes exporting data from HDFS/Hive/Shark to Cassandra using a custom Spark job and generating REST APIs. A demo of a referral provider network dashboard is shown, which was built using exported Cassandra data to analyze medical records and provider relationships. Lessons learned from optimizing Cassandra performance from versions 0.6 to 2.0.6 are also discussed.
Data Stream Processing with Apache FlinkFabian Hueske
This talk is an introduction into Stream Processing with Apache Flink. I gave this talk at the Madrid Apache Flink Meetup at February 25th, 2016.
The talk discusses Flink's features, shows it's DataStream API and explains the benefits of Event-time stream processing. It gives an outlook on some features that will be added after the 1.0 release.
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
This document provides an overview of stream processing with Apache Flink. It discusses the rise of stream processing and how it enables low-latency applications and real-time analysis. It then describes Flink's stream processing capabilities, including pipelining of data, fault tolerance through checkpointing and recovery, and integration with batch processing. The document also summarizes Flink's programming model, state management, and roadmap for further development.
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
Apache Apex (http://apex.apache.org/) is a stream processing platform that helps organizations to build processing pipelines with fault tolerance and strong processing guarantees. It was built to support low processing latency, high throughput, scalability, interoperability, high availability and security. The platform comes with Malhar library - an extensive collection of processing operators and a wide range of input and output connectors for out-of-the-box integration with an existing infrastructure. In the talk I am going to describe how connectors together with the distributed checkpointing (a mechanism used by the Apex to support fault tolerance and high availability) provide exactly-once end-to-end processing guarantees.
Speakers:
Vlad Rozov is Apache Apex PMC member and back-end engineer at DataTorrent where he focuses on the buffer server, Apex platform network layer, benchmarks and optimizing the core components for low latency and high throughput. Prior to DataTorrent Vlad worked on distributed BI platform at Huawei and on multi-dimensional database (OLAP) at Hyperion Solutions and Oracle.
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
The document discusses lessons learned from embedding Cassandra in the xPatterns big data analytics platform. It provides an agenda that includes discussing Cassandra usage in xPatterns, the necessary developments like data modeling optimizations, robust REST APIs, geo-replication, and a demo of exporting to NoSQL APIs. Key lessons learned since Cassandra versions 0.6 to 2.0.6 are also summarized, such as the need for consistent clocks, reducing column families, and monitoring.
Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022HostedbyConfluent
Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022
Azure Event Hubs is a hyperscale PaaS event stream broker with protocol support for HTTP, AMQP, and Apache Kafka RPC that accepts and forwards several trillion (!) events per day and is available in all global Azure regions. This session is a look behind the curtain where we dive deep into the architecture of Event Hubs and look at the Event Hubs cluster model, resource isolation, and storage strategies and also review some performance figures.
Similar to Architectual Comparison of Apache Apex and Spark Streaming (20)
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Apex
Stream processing applications built on Apache Apex run on Hadoop clusters and typically power analytics use cases where availability, flexible scaling, high throughput, low latency and correctness are essential. These applications consume data from a variety of sources, including streaming sources like Apache Kafka, Kinesis or JMS, file based sources or databases. Processing results often need to be stored in external systems (sinks) for downstream consumers (pub-sub messaging, real-time visualization, Hive and other SQL databases etc.). Apex has the Malhar library with a wide range of connectors and other operators that are readily available to build applications. We will cover key characteristics like partitioning and processing guarantees, generic building blocks for new operators (write-ahead-log, incremental state saving, windowing etc.) and APIs for application specification.
YARN was introduced as part of Hadoop 2.0 to address limitations in the original MapReduce (MR1) architecture like scalability bottlenecks and underutilization of resources. YARN introduces a global ResourceManager and per-node NodeManagers to allocate cluster resources to distributed applications. It allows various distributed processing frameworks beyond MapReduce to share common cluster resources. Applications request containers for ApplicationMasters that then negotiate resources from YARN to run application components in containers across nodes. Existing MapReduce jobs can also run unchanged on YARN.
Here is how you can solve this problem using MapReduce and Unix commands:
Map step:
grep -o 'Blue\|Green' input.txt | wc -l > output
This uses grep to search the input file for the strings "Blue" or "Green" and print only the matches. The matches are piped to wc which counts the lines (matches).
Reduce step:
cat output
This isn't really needed as there is only one mapper. Cat prints the contents of the output file which has the count of Blue and Green.
So MapReduce has been simulated using grep for the map and cat for the reduce functionality. The key aspects are - grep extracts the relevant data (map
HDFS stores files as blocks that are by default 64 MB in size to minimize disk seek times. The namenode manages the file system namespace and metadata, tracking which datanodes store each block. When writing a file, HDFS breaks it into blocks and replicates each block across multiple datanodes. The secondary namenode periodically merges namespace and edit log changes to prevent the log from growing too large. Small files are inefficient in HDFS due to each file requiring namespace metadata regardless of size.
Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data TransformationsApache Apex
Presenter:
Chaitanya Chebolu, Committer for Apache Apex and Software Engineer at DataTorrent.
In this session we will cover the use-case of ingesting data from Kafka and writing to HDFS with a couple of processing operators - Parser, Dedup, Transform.
Building Your First Apache Apex (Next Gen Big Data/Hadoop) ApplicationApache Apex
This document provides an overview of building a first Apache Apex application. It describes the main concepts of an Apex application including operators that implement interfaces to process streaming data within windows. The document outlines a "Sorted Word Count" application that uses various operators like LineReader, WordReader, WindowWordCount, and FileWordCount. It also demonstrates wiring these operators together in a directed acyclic graph and running the application to process streaming data.
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Apache Apex
Presenter:
Priyanka Gugale, Committer for Apache Apex and Software Engineer at DataTorrent.
In this session we will cover introduction to Yarn, understanding yarn architecture as well as look into Yarn application lifecycle. We will also learn how Apache Apex is one of the Yarn applications in Hadoop.
Ingesting Data from Kafka to JDBC with Transformation and EnrichmentApache Apex
Presenter - Dr Sandeep Deshmukh, Committer Apache Apex, DataTorrent engineer
Abstract:
Ingesting and extracting data from Hadoop can be a frustrating, time consuming activity for many enterprises. Apache Apex Data Ingestion is a standalone big data application that simplifies the collection, aggregation and movement of large amounts of data to and from Hadoop for a more efficient data processing pipeline. Apache Apex Data Ingestion makes configuring and running Hadoop data ingestion and data extraction a point and click process enabling a smooth, easy path to your Hadoop-based big data project.
In this series of talks, we would cover how Hadoop Ingestion is made easy using Apache Apex. The third talk in this series would focus on ingesting unbounded data from Kafka to JDBC with couple of processing operators -Transform and enrichment.
Presenter: Kenn Knowles, Software Engineer, Google & Apache Beam (incubating) PPMC member
Apache Beam (incubating) is a programming model and library for unified batch & streaming big data processing. This talk will cover the Beam programming model broadly, including its origin story and vision for the future. We will dig into how Beam separates concerns for authors of streaming data processing pipelines, isolating what you want to compute from where your data is distributed in time and when you want to produce output. Time permitting, we might dive deeper into what goes into building a Beam runner, for example atop Apache Apex.
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache ApexApache Apex
Roman Shaposhnik: Director of Open Source, Pivotal; Committer, Apache Hadoop; Founder, Apache Bigtop
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex.
Chinmay Kolhatkar: Engineer, DataTorrent & Committer, Apache Apex
For ease of use and deployment, Apache Apex leverages Apache Bigtop. Apex, being part of bigtop stack, can be easily deployed in both debian and rpm based cluster system and run validation tests for installation. This talk will cover a demo on how to install apex-bigtop and use it. It also covers a test sandbox docker environment, having pre-installed bigtop-hadoop and bigtop-apex, for quickly getting started with apex.
Building Your First Apache Apex ApplicationApache Apex
This document provides an overview of building an Apache Apex application, including key concepts like DAGs, operators, and ports. It also includes an example "word count" application and demonstrates how to define the application and operators, and build Apache Apex from source code. The document outlines the sample application workflow and includes information on resources for learning more about Apache Apex.
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfNeo4j
Presented at Gartner Data & Analytics, London Maty 2024. BT Group has used the Neo4j Graph Database to enable impressive digital transformation programs over the last 6 years. By re-imagining their operational support systems to adopt self-serve and data lead principles they have substantially reduced the number of applications and complexity of their operations. The result has been a substantial reduction in risk and costs while improving time to value, innovation, and process automation. Join this session to hear their story, the lessons they learned along the way and how their future innovation plans include the exploration of uses of EKG + Generative AI.
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxSynapseIndia
Your comprehensive guide to RPA in healthcare for 2024. Explore the benefits, use cases, and emerging trends of robotic process automation. Understand the challenges and prepare for the future of healthcare automation
7 Most Powerful Solar Storms in the History of Earth.pdfEnterprise Wired
Solar Storms (Geo Magnetic Storms) are the motion of accelerated charged particles in the solar environment with high velocities due to the coronal mass ejection (CME).
UiPath Community Day Kraków: Devs4Devs ConferenceUiPathCommunity
We are honored to launch and host this event for our UiPath Polish Community, with the help of our partners - Proservartner!
We certainly hope we have managed to spike your interest in the subjects to be presented and the incredible networking opportunities at hand, too!
Check out our proposed agenda below 👇👇
08:30 ☕ Welcome coffee (30')
09:00 Opening note/ Intro to UiPath Community (10')
Cristina Vidu, Global Manager, Marketing Community @UiPath
Dawid Kot, Digital Transformation Lead @Proservartner
09:10 Cloud migration - Proservartner & DOVISTA case study (30')
Marcin Drozdowski, Automation CoE Manager @DOVISTA
Pawel Kamiński, RPA developer @DOVISTA
Mikolaj Zielinski, UiPath MVP, Senior Solutions Engineer @Proservartner
09:40 From bottlenecks to breakthroughs: Citizen Development in action (25')
Pawel Poplawski, Director, Improvement and Automation @McCormick & Company
Michał Cieślak, Senior Manager, Automation Programs @McCormick & Company
10:05 Next-level bots: API integration in UiPath Studio (30')
Mikolaj Zielinski, UiPath MVP, Senior Solutions Engineer @Proservartner
10:35 ☕ Coffee Break (15')
10:50 Document Understanding with my RPA Companion (45')
Ewa Gruszka, Enterprise Sales Specialist, AI & ML @UiPath
11:35 Power up your Robots: GenAI and GPT in REFramework (45')
Krzysztof Karaszewski, Global RPA Product Manager
12:20 🍕 Lunch Break (1hr)
13:20 From Concept to Quality: UiPath Test Suite for AI-powered Knowledge Bots (30')
Kamil Miśko, UiPath MVP, Senior RPA Developer @Zurich Insurance
13:50 Communications Mining - focus on AI capabilities (30')
Thomasz Wierzbicki, Business Analyst @Office Samurai
14:20 Polish MVP panel: Insights on MVP award achievements and career profiling
Coordinate Systems in FME 101 - Webinar SlidesSafe Software
If you’ve ever had to analyze a map or GPS data, chances are you’ve encountered and even worked with coordinate systems. As historical data continually updates through GPS, understanding coordinate systems is increasingly crucial. However, not everyone knows why they exist or how to effectively use them for data-driven insights.
During this webinar, you’ll learn exactly what coordinate systems are and how you can use FME to maintain and transform your data’s coordinate systems in an easy-to-digest way, accurately representing the geographical space that it exists within. During this webinar, you will have the chance to:
- Enhance Your Understanding: Gain a clear overview of what coordinate systems are and their value
- Learn Practical Applications: Why we need datams and projections, plus units between coordinate systems
- Maximize with FME: Understand how FME handles coordinate systems, including a brief summary of the 3 main reprojectors
- Custom Coordinate Systems: Learn how to work with FME and coordinate systems beyond what is natively supported
- Look Ahead: Gain insights into where FME is headed with coordinate systems in the future
Don’t miss the opportunity to improve the value you receive from your coordinate system data, ultimately allowing you to streamline your data analysis and maximize your time. See you there!
Implementations of Fused Deposition Modeling in real worldEmerging Tech
The presentation showcases the diverse real-world applications of Fused Deposition Modeling (FDM) across multiple industries:
1. **Manufacturing**: FDM is utilized in manufacturing for rapid prototyping, creating custom tools and fixtures, and producing functional end-use parts. Companies leverage its cost-effectiveness and flexibility to streamline production processes.
2. **Medical**: In the medical field, FDM is used to create patient-specific anatomical models, surgical guides, and prosthetics. Its ability to produce precise and biocompatible parts supports advancements in personalized healthcare solutions.
3. **Education**: FDM plays a crucial role in education by enabling students to learn about design and engineering through hands-on 3D printing projects. It promotes innovation and practical skill development in STEM disciplines.
4. **Science**: Researchers use FDM to prototype equipment for scientific experiments, build custom laboratory tools, and create models for visualization and testing purposes. It facilitates rapid iteration and customization in scientific endeavors.
5. **Automotive**: Automotive manufacturers employ FDM for prototyping vehicle components, tooling for assembly lines, and customized parts. It speeds up the design validation process and enhances efficiency in automotive engineering.
6. **Consumer Electronics**: FDM is utilized in consumer electronics for designing and prototyping product enclosures, casings, and internal components. It enables rapid iteration and customization to meet evolving consumer demands.
7. **Robotics**: Robotics engineers leverage FDM to prototype robot parts, create lightweight and durable components, and customize robot designs for specific applications. It supports innovation and optimization in robotic systems.
8. **Aerospace**: In aerospace, FDM is used to manufacture lightweight parts, complex geometries, and prototypes of aircraft components. It contributes to cost reduction, faster production cycles, and weight savings in aerospace engineering.
9. **Architecture**: Architects utilize FDM for creating detailed architectural models, prototypes of building components, and intricate designs. It aids in visualizing concepts, testing structural integrity, and communicating design ideas effectively.
Each industry example demonstrates how FDM enhances innovation, accelerates product development, and addresses specific challenges through advanced manufacturing capabilities.
Details of description part II: Describing images in practice - Tech Forum 2024BookNet Canada
This presentation explores the practical application of image description techniques. Familiar guidelines will be demonstrated in practice, and descriptions will be developed “live”! If you have learned a lot about the theory of image description techniques but want to feel more confident putting them into practice, this is the presentation for you. There will be useful, actionable information for everyone, whether you are working with authors, colleagues, alone, or leveraging AI as a collaborator.
Link to presentation recording and transcript: https://bnctechforum.ca/sessions/details-of-description-part-ii-describing-images-in-practice/
Presented by BookNet Canada on June 25, 2024, with support from the Department of Canadian Heritage.
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...Toru Tamaki
Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, Philip Torr "A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models" arXiv2023
https://arxiv.org/abs/2307.12980
Blockchain technology is transforming industries and reshaping the way we conduct business, manage data, and secure transactions. Whether you're new to blockchain or looking to deepen your knowledge, our guidebook, "Blockchain for Dummies", is your ultimate resource.
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfjackson110191
These fighter aircraft have uses outside of traditional combat situations. They are essential in defending India's territorial integrity, averting dangers, and delivering aid to those in need during natural calamities. Additionally, the IAF improves its interoperability and fortifies international military alliances by working together and conducting joint exercises with other air forces.
Quantum Communications Q&A with Gemini LLM. These are based on Shannon's Noisy channel Theorem and offers how the classical theory applies to the quantum world.
Mitigating the Impact of State Management in Cloud Stream Processing SystemsScyllaDB
Stream processing is a crucial component of modern data infrastructure, but constructing an efficient and scalable stream processing system can be challenging. Decoupling compute and storage architecture has emerged as an effective solution to these challenges, but it can introduce high latency issues, especially when dealing with complex continuous queries that necessitate managing extra-large internal states.
In this talk, we focus on addressing the high latency issues associated with S3 storage in stream processing systems that employ a decoupled compute and storage architecture. We delve into the root causes of latency in this context and explore various techniques to minimize the impact of S3 latency on stream processing performance. Our proposed approach is to implement a tiered storage mechanism that leverages a blend of high-performance and low-cost storage tiers to reduce data movement between the compute and storage layers while maintaining efficient processing.
Throughout the talk, we will present experimental results that demonstrate the effectiveness of our approach in mitigating the impact of S3 latency on stream processing. By the end of the talk, attendees will have gained insights into how to optimize their stream processing systems for reduced latency and improved cost-efficiency.
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc
Six months into 2024, and it is clear the privacy ecosystem takes no days off!! Regulators continue to implement and enforce new regulations, businesses strive to meet requirements, and technology advances like AI have privacy professionals scratching their heads about managing risk.
What can we learn about the first six months of data privacy trends and events in 2024? How should this inform your privacy program management for the rest of the year?
Join TrustArc, Goodwin, and Snyk privacy experts as they discuss the changes we’ve seen in the first half of 2024 and gain insight into the concrete, actionable steps you can take to up-level your privacy program in the second half of the year.
This webinar will review:
- Key changes to privacy regulations in 2024
- Key themes in privacy and data governance in 2024
- How to maximize your privacy program in the second half of 2024
Kief Morris rethinks the infrastructure code delivery lifecycle, advocating for a shift towards composable infrastructure systems. We should shift to designing around deployable components rather than code modules, use more useful levels of abstraction, and drive design and deployment from applications rather than bottom-up, monolithic architecture and delivery.
2. Next Gen Stream Data Processing
• Data from variety of sources (IoT, Kafka, files, social media etc.)
• Unbounded, continuous data streams
ᵒ Batch can be processed as stream (but a stream is not a batch)
• (In-memory) Processing with temporal boundaries (windows)
• Stateful operations: Aggregation, Rules, … -> Analytics
• Results stored to variety of sinks or destinations
ᵒ Streaming application can also serve data with very low latency
2
Browser
Web Server
Kafka Input
(logs)
Decompress,
Parse, Filter
Dimensions
Aggregate Kafka
Logs
Kafka
3. Batch vs. Streaming
Credit: Gyula Fóra & Márton Balassi: Large-Scale Stream Processing in the Hadoop Ecosystem
3
4. Architecture and Features
Spark Streaming Apex
Model micro-batch native streaming/data-in-motion
Language Java, Scala, client bindings Java (Scala)
API declarative compositional (DAG), declarative*
Locality data locality advanced processing locality
Latency high very low (millis)
Throughput very high very high
Scalability scheduler limit horizontal
Partitioning standard advanced (parallel pipes, unifiers)
Connector Library Limited (certification),
externally maintained
Rich library of connectors and
processors, part of Apex (Malhar)
4
5. Operability
Spark Streaming Apex
State Management RDD, user code checkpointing
Recovery RDD lineage incremental (buffer server)
Processing Sem. exactly-once* end-to-end exactly-once
Backpressure user configuration Automatic (buffer server memory
+ disk)
Elasticity yes w/ limited control yes w/ full user control
Dynamic topology no yes
Security Kerberos Kerberos, RBAC*, LDAP*
Multi-tennancy depends on cluster manager YARN, full isolation
DevOps tools basic REST API, DataTorrent RTS
5
6. Apache Apex Features
• In-memory stream processing platform
ᵒ Developed since 2012, ASF TLP since 04/2016
• Unobtrusive Java API to express (custom) logic
• Scale out, distributed, parallel
• High throughput & low latency processing
• Windowing (temporal boundary)
• Reliability, fault tolerance, operability
• Hadoop native
• Compute locality, affinity
• Dynamic updates, elasticity
6
9. Application Development Model
9
A Stream is a sequence of data tuples
A typical Operator takes one or more input streams, performs computations & emits one or more output streams
• Each Operator is YOUR custom business logic in java, or built-in operator from our open source library
• Operator has many instances that run in parallel and each instance is single-threaded
Directed Acyclic Graph (DAG) is made up of operators and streams
Directed Acyclic Graph (DAG)
Output
Stream
Tupl
e
Tupl
e
er
Operator
er
Operator
er
Operator
er
Operator
er
Operator
er
Operator
12. Scalability
• Large amount of data to process, arrival at high velocity
ᵒ Pipelining and partitioning
ᵒ Backpressure
• Partitioning
ᵒ Run same logic in multiple processes or threads
ᵒ Each partition processes a subset of the data
• Apex supports partitioning out of the box
ᵒ Different partitioning schemes
ᵒ Unification
ᵒ Static & Dynamic Partitioning
ᵒ Separation of processing logic from scaling decisions
12
13. Partitioning
13
NxM PartitionsUnifier
0 1 2 3
Logical DAG
0 1 2
1
1 Unifier
1
20
Logical Diagram
Physical Diagram with operator 1 with 3 partitions
0
Unifier
1a
1b
1c
2a
2b
Unifier 3
Physical DAG with (1a, 1b, 1c) and (2a, 2b): No bottleneck
Unifier
Unifier0
1a
1b
1c
2a
2b
Unifier 3
Physical DAG with (1a, 1b, 1c) and (2a, 2b): Bottleneck on intermediate Unifier
14. Advanced Partitioning
14
0
1a
1b
2 3 4Unifier
Physical DAG
0 4
3a2a1a
1b 2b 3b
Unifier
Physical DAG with Parallel Partition
Parallel Partition
Container
uopr
uopr1
uopr2
uopr3
uopr4
uopr1
uopr2
uopr3
uopr4
dopr
dopr
doprunifier
unifier
unifier
unifier
Container
Container
NICNIC
NICNIC
NIC
Container
NIC
Logical Plan
Execution Plan, for N = 4; M = 1
Execution Plan, for N = 4; M = 1, K = 2 with cascading unifiers
Cascading Unifiers
0 1 2 3 4
Logical DAG
15. Dynamic Partitioning
15
• Partitioning change while application is running
ᵒ Change number of partitions at runtime based on stats
ᵒ Determine initial number of partitions dynamically
• Kafka operators scale according to number of kafka partitions
ᵒ Supports re-distribution of state when number of partitions change
ᵒ API for custom scaler or partitioner
2b
2c
3
2a
2d
1b
1a1a 2a
1b 2b
3
1a 2b
1b 2c 3b
2a
2d
3a
Unifiers not shown
16. How dynamic partitioning works
16
• Partitioning decision (yes/no) by trigger (StatsListener)
ᵒ Pluggable component, can use any system or custom metric
ᵒ Externally driven partitioning example: KafkaInputOperator
• Stateful!
ᵒ Uses checkpointed state
ᵒ Ability to transfer state from old to new partitions (partitioner, customizable)
ᵒ Steps:
• Call partitioner
• Modify physical plan, rewrite checkpoints as needed
• Undeploy old partitions from execution layer
• Release/request container resources
• Deploy new partitions (from rewritten checkpoint)
ᵒ No loss of data (buffered)
ᵒ Incremental operation, partitions that don’t change continue processing
• API: Partitioner interface
17. Fault Tolerance
17
• Operator state is checkpointed to persistent store
ᵒ Automatically performed by engine, no additional coding needed
ᵒ Asynchronous and distributed
ᵒ In case of failure operators are restarted from checkpoint state
• Automatic detection and recovery of failed containers
ᵒ Heartbeat mechanism
ᵒ YARN process status notification
• Buffering to enable replay of data from recovered point
ᵒ Fast, incremental recovery, spike handling
• Application master state checkpointed
ᵒ Snapshot of physical (and logical) plan
ᵒ Execution layer change log
18. • In-memory PubSub
• Stores results emitted by operator until committed
• Handles backpressure / spillover to local disk
• Ordering, idempotency
Operator
1
Container 1
Buffer
Server
Node 1
Operator
2
Container 2
Node 2
Buffer Server
18
20. Processing Guarantees
20
At-least-once
• On recovery data will be replayed from a previous checkpoint
ᵒ No messages lost
ᵒ Default, suitable for most applications
• Can be used to ensure data is written once to store
ᵒ Transactions with meta information, Rewinding output, Feedback from
external entity, Idempotent operations
At-most-once
• On recovery the latest data is made available to operator
ᵒ Useful in use cases where some data loss is acceptable and latest data is
sufficient
Exactly-once
ᵒ At-least-once + idempotency + transactional mechanisms (operator logic) to
achieve end-to-end exactly once behavior
21. End-to-End Exactly Once
21
• Becomes important when writing to external systems
• Data should not be duplicated or lost in the external system even in case of
application failures
• Common external systems
ᵒ Databases
ᵒ Files
ᵒ Message queues
• Platform support for at least once is a must so that no data is lost
• Data duplication must still be avoided when data is replayed from checkpoint
ᵒ Operators implement the logic dependent on the external system
ᵒ Platform provides checkpointing and repeatable windowing
26. Maximize Revenue w/ real-time insights
26
PubMatic is the leading marketing automation software company for publishers. Through real-time analytics,
yield management, and workflow automation, PubMatic enables publishers to make smarter inventory
decisions and improve revenue performance
Business Need Apex based Solution Client Outcome
• Ingest and analyze high volume clicks &
views in real-time to help customers
improve revenue
- 200K events/second data
flow
• Report critical metrics for campaign
monetization from auction and client
logs
- 22 TB/day data generated
• Handle ever increasing traffic with
efficient resource utilization
• Always-on ad network
• DataTorrent Enterprise platform,
powered by Apache Apex
• In-memory stream processing
• Comprehensive library of pre-built
operators including connectors
• Built-in fault tolerance
• Dynamically scalable
• Management UI & Data Visualization
console
• Helps PubMatic deliver ad performance
insights to publishers and advertisers in
real-time instead of 5+ hours
• Helps Publishers visualize campaign
performance and adjust ad inventory in
real-time to maximize their revenue
• Enables PubMatic reduce OPEX with
efficient compute resource utilization
• Built-in fault tolerance ensures
customers can always access ad
network
27. Industrial IoT applications
27
GE is dedicated to providing advanced IoT analytics solutions to thousands of customers who are using their
devices and sensors across different verticals. GE has built a sophisticated analytics platform, Predix, to help its
customers develop and execute Industrial IoT applications and gain real-time insights as well as actions.
Business Need Apex based Solution Client Outcome
• Ingest and analyze high-volume, high speed
data from thousands of devices, sensors
per customer in real-time without data loss
• Predictive analytics to reduce costly
maintenance and improve customer
service
• Unified monitoring of all connected sensors
and devices to minimize disruptions
• Fast application development cycle
• High scalability to meet changing business
and application workloads
• Ingestion application using DataTorrent
Enterprise platform
• Powered by Apache Apex
• In-memory stream processing
• Built-in fault tolerance
• Dynamic scalability
• Comprehensive library of pre-built
operators
• Management UI console
• Helps GE improve performance and lower
cost by enabling real-time Big Data
analytics
• Helps GE detect possible failures and
minimize unplanned downtimes with
centralized management & monitoring of
devices
• Enables faster innovation with short
application development cycle
• No data loss and 24x7 availability of
applications
• Helps GE adjust to scalability needs with
auto-scaling
28. Smart energy applications
28
Silver Spring Networks helps global utilities and cities connect, optimize, and manage smart energy and smart city
infrastructure. Silver Spring Networks receives data from over 22 million connected devices, conducts 2 million
remote operations per year
Business Need Apex based Solution Client Outcome
• Ingest high-volume, high speed data from
millions of devices & sensors in real-time
without data loss
• Make data accessible to applications
without delay to improve customer service
• Capture & analyze historical data to
understand & improve grid operations
• Reduce the cost, time, and pain of
integrating with 3rd party apps
• Centralized management of software &
operations
• DataTorrent Enterprise platform, powered
by Apache Apex
• In-memory stream processing
• Pre-built operator
• Built-in fault tolerance
• Dynamically scalable
• Management UI console
• Helps Silver Spring Networks ingest &
analyze data in real-time for effective load
management & customer service
• Helps Silver Spring Networks detect
possible failures and reduce outages with
centralized management & monitoring of
devices
• Enables fast application development for
faster time to market
• Helps Silver Spring Networks scale with
easy to partition operators
• Automatic recovery from failures