This document provides an overview and summary of Apache Pulsar, a distributed streaming and messaging platform. It discusses Pulsar's benefits like data durability, scalability, geo-replication and multi-tenancy. It outlines key use cases like message queuing and data streaming. The document also summarizes Pulsar's architecture, subscription modes, connectors, and integration with other technologies like Apache Flink, Apache NiFi and MQTT. It highlights real-world customer implementations and provides demos of ingesting IoT data via Pulsar.
Apache Pulsar was developed to address several shortcomings of existing messaging systems, including geo-replication, message durability, and lower message latency. We will implement a multi-currency quoting application that feeds pricing information to a crypto-currency trading platform that is deployed around the globe. Given the volatility of crypto-currency prices, sub-second message latency is critical to traders. Equally important is ensuring consistent quotes are available in all geographical locations, i.e., the price of Bitcoin shown to a trader in the USA should be the same as that shown to a trader in Hong Kong. We will highlight the advantages of Apache Pulsar over traditional messaging systems and show how its low latency and replication across multiple geographies make it ideally suited for globally distributed, real-time applications.
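As a minimal sketch of the quoting pipeline described above (the service URL, topic name, and payload fields are illustrative assumptions, not details from the talk), a Pulsar producer in Python could look like this; keying messages by currency pair preserves per-pair ordering:

```python
import json
import time


def make_quote(pair: str, price: float) -> bytes:
    """Serialize a currency quote as JSON bytes for a Pulsar message payload."""
    return json.dumps({"pair": pair, "price": price,
                       "ts_ms": int(time.time() * 1000)}).encode("utf-8")


def publish_quote(service_url: str, topic: str, pair: str, price: float) -> None:
    """Send one quote to Pulsar (requires a running broker).

    In a geo-replicated deployment the topic's namespace would be configured
    to replicate across clusters, so every region sees the same quotes.
    """
    import pulsar  # pip install pulsar-client
    client = pulsar.Client(service_url)
    producer = client.create_producer(topic)
    # Keying by currency pair keeps per-pair ordering across partitions.
    producer.send(make_quote(pair, price), partition_key=pair)
    client.close()
```

A caller would invoke something like `publish_quote("pulsar://localhost:6650", "persistent://public/default/crypto-quotes", "BTC-USD", 64250.10)`.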
StreamNative FLiP into ScyllaDB - Scylla Summit 2022. Utilizing Apache Pulsar with Apache NiFi, Apache Flink, Apache Spark and ScyllaDB for fast IoT applications with MQTT and beyond.
This document provides an overview and summary of Apache Pulsar with MQTT for edge computing. It discusses how Pulsar is an open-source, cloud-native distributed messaging and streaming platform that supports MQTT and other protocols. It also summarizes Pulsar's key capabilities like data durability, scalability, geo-replication, and unified messaging model. The document includes diagrams showcasing Pulsar's publish-subscribe model and different subscription modes. It demonstrates how Pulsar can be used with edge devices via protocols like MQTT and how streams of data from edge can be processed using connectors, functions and SQL.
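To make the edge-device flow concrete, here is a small sketch of publishing a sensor reading over MQTT to a broker such as Pulsar with the MQTT-on-Pulsar (MoP) protocol handler enabled (the host, topic, and payload fields are illustrative assumptions; port 1883 is the conventional MQTT port):

```python
import json


def sensor_payload(device_id: str, temp_c: float) -> str:
    """Encode an edge sensor reading as the JSON string sent over MQTT."""
    return json.dumps({"device": device_id, "temp_c": temp_c})


def publish_reading(host: str, device_id: str, temp_c: float) -> None:
    """Publish one reading via MQTT (requires a reachable MQTT listener)."""
    import paho.mqtt.client as mqtt  # pip install paho-mqtt
    client = mqtt.Client()
    client.connect(host, 1883)
    # QoS 1 asks the broker to acknowledge delivery at least once.
    client.publish("sensors/room1", sensor_payload(device_id, temp_c), qos=1)
    client.disconnect()
```

Once the readings land in a Pulsar topic, they can be consumed downstream by connectors, functions, and SQL as the document describes.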
Cloud Lunch and Learn: Real-Time Streaming in Azure. Apache Pulsar is an open-source, cloud-native distributed messaging and streaming platform.
Biography Tim Spann is a Principal DataFlow Field Engineer at Cloudera where he works with Apache NiFi, MiniFi, Pulsar, Apache Flink, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a senior solutions architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, DataWorks Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science. Talk Real-Time Streaming in Any and All Clouds, Hybrid and Beyond Today, data is being generated from devices and containers living at the edge of networks, clouds and data centers. We need to run business logic, analytics and deep learning at scale as events arrive. Tools: Apache Flink, Apache Pulsar, Apache NiFi, MiNiFi, DJL.ai, Apache MXNet. References: https://www.datainmotion.dev/2019/11/introducing-mm-flank-apache-flink-stack.html https://www.datainmotion.dev/2019/08/rapid-iot-development-with-cloudera.html https://www.datainmotion.dev/2019/09/powering-edge-ai-for-sensor-reading.html https://www.datainmotion.dev/2019/05/dataworks-summit-dc-2019-report.html https://www.datainmotion.dev/2019/03/using-raspberry-pi-3b-with-apache-nifi.html Source Code: https://github.com/tspannhw/MmFLaNK FLiP Stack StreamNative
ApacheCon 2021 Apache Deep Learning 302 Tuesday 18:00 UTC Apache Deep Learning 302 Timothy Spann This talk will discuss and show examples of using Apache Hadoop, Apache Kudu, Apache Flink, Apache Hive, Apache MXNet, Apache OpenNLP, Apache NiFi and Apache Spark for deep learning applications. This is the follow up to previous talks on Apache Deep Learning 101 and 201 and 301 at ApacheCon, Dataworks Summit, Strata and other events. As part of this talk, the presenter will walk through using Apache MXNet Pre-Built Models, integrating new open source Deep Learning libraries with Python and Java, as well as running real-time AI streams from edge devices to servers utilizing Apache NiFi and Apache NiFi - MiNiFi. This talk is geared towards Data Engineers interested in the basics of architecting Deep Learning pipelines with open source Apache tools in a Big Data environment. The presenter will also walk through source code examples available in github and run the code live on Apache NiFi and Apache Flink clusters. Tim Spann is a Developer Advocate @ StreamNative where he works with Apache NiFi, Apache Pulsar, Apache Flink, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Principal Field Engineer at Cloudera, a senior solutions architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science. 
* https://github.com/tspannhw/ApacheDeepLearning302/ * https://github.com/tspannhw/nifi-djl-processor * https://github.com/tspannhw/nifi-djlsentimentanalysis-processor * https://github.com/tspannhw/nifi-djlqa-processor * https://www.linkedin.com/pulse/2021-schedule-tim-spann/
This document discusses isolation in Apache Pulsar. It introduces the presenters as experts in distributed systems and the Pulsar open source project. It then outlines ways to isolate resources in Pulsar like brokers, bookies, and clusters to separate namespaces and tenants. The key methods covered are namespace isolation policies, failure domains, anti-affinity groups, and bookie affinity groups. It provides examples of how these are configured and allows scaling resources up and down independently per namespace. Finally, it invites questions and provides contact details.
FLiP Into Trino. Flink, Pulsar, Trino, Pulsar SQL (Trino/Presto). Remember the days when you could wait until your batch data load was done and then you could run some simple queries or build stale dashboards? Those days are over; today you need instant analytics as the data is streaming in real-time. You need universal analytics where that data is. I will show you how to do this utilizing the latest cloud native open source tools. In this talk we will utilize Trino, Apache Pulsar, Pulsar SQL and Apache Flink to instantly analyze data from IoT, sensors, transportation systems, logs, REST endpoints, XML, images, PDFs, documents, text, semistructured data, unstructured data, structured data and a hundred data sources you could never dream of streaming before. I will teach how to use Pulsar SQL to run analytics on live data. Tim Spann Developer Advocate StreamNative David Kjerrumgaard Developer Advocate StreamNative https://www.starburst.io/info/trinosummit/ https://github.com/tspannhw/FLiP-Into-Trino/blob/main/README.md https://github.com/tspannhw/StreamingAnalyticsUsingFlinkSQL/tree/main/src/main/java select * from pulsar."public/default"."weather"; Apache Pulsar plus Trino = fast analytics at scale
Fluentd is an open source log collector that allows flexible collection and routing of log data. It uses JSON format for log messages and supports many input and output plugins. Fluentd can collect logs from files, network services, and applications before routing them to storage and analysis services like MongoDB, HDFS, and Treasure Data. The open source project has grown a large community contributing over 100 plugins to make log collection and processing easier.
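A minimal sketch of the kind of Fluentd pipeline described, tailing a JSON log file and routing it to MongoDB (the paths, tag, and MongoDB destination are illustrative assumptions; the `mongo` output requires the fluent-plugin-mongo plugin):

```
<source>
  @type tail
  path /var/log/app/access.log
  pos_file /var/log/fluentd/access.log.pos
  tag app.access
  <parse>
    @type json
  </parse>
</source>

<match app.**>
  @type mongo
  host localhost
  port 27017
  database logs
  collection access
</match>
```

The `tag` assigned by the source is what the `<match>` pattern routes on, so one collector can fan the same streams out to multiple storage and analysis backends.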
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Kafka, and Flink Timothy Spann Twitter - @PaasDev // Blog: www.datainmotion.dev Frequent speaker at major conferences and events. Principal DataFlow Field Engineer for streaming around Apache NiFi, NiFi Registry, MiNiFi, Kafka, Kafka Connect, Kafka Streams, Flink, Flink SQL, SMM, SRM, SR and EFM. Previously at E&Y, HPE, Pivotal & Hortonworks Question #1 What is the most difficult part of an Edge Flow? Gateway Agent Edge Data Collection Processing Data https://github.com/tspannhw/DemoJam2021 https://github.com/tspannhw/CloudDemo2021
(VIRTUAL) Hail Hydrate! From Stream to Lake Using Open Source - Timothy J Spann, StreamNative https://osselc21.sched.com/event/lAPi?iframe=no A cloud data lake that is empty is not useful to anyone. How can you quickly, scalably and reliably fill your cloud data lake with diverse sources of data you already have and new ones you never imagined you needed? Utilizing open source tools from Apache, the FLiP stack enables any data engineer, programmer or analyst to build reusable modules with low or no code. FLiP utilizes Apache NiFi, Apache Pulsar, Apache Flink and MiNiFi agents to load CDC, logs, REST, XML, images, PDFs, documents, text, semistructured data, unstructured data, structured data and a hundred data sources you could never dream of streaming before. I will teach you how to fish in the deep end of the lake and return a data engineering hero. Let's hope everyone is ready to go from 0 to Petabyte hero. https://osselc21.sched.com/event/lAPi/virtual-hail-hydrate-from-stream-to-lake-using-open-source-timothy-j-spann-streamnative
Using Apache Spark with IBM SPSS Modeler with Dr. Steve Poulin. An introduction to Apache Spark and its relevant integration with IBM SPSS Modeler. Why integrate? What type of benefits? A high-level review of the integration process, with advice on which enhanced features to pay attention to and common pitfalls to avoid.
#phillyopensource Introduction talk for data engineers on deep learning with Apache MXNet, Apache NiFi, Apache Hive, Apache Hadoop, Apache Spark, Python and other tools.
https://adtmag.com/webcasts/2021/12/influxdata-february-10.aspx?tc=page0 FLiP Stack (Apache Flink, Apache Pulsar, Apache NiFi, Apache Spark) with InfluxDB for Edge AI and IoT workloads at scale Tim Spann Developer Advocate StreamNative datainmotion.dev
Real-time stock processing with Apache NiFi, Apache Flink and Apache Kafka, with Kafka Connect apps, SMM, NiFi Registry, Schema Registry, Kafka topics, Flink SQL and NiFi.
DevFest UK & Ireland 2022: Using Apache NiFi with Apache Pulsar for fast data on-ramp. As the Pulsar community grows, more and more connectors will be added. To enhance the availability of sources and sinks and to make use of the greater Apache Streaming community, joining forces between Apache NiFi and Apache Pulsar is a perfect fit. Apache NiFi also adds the benefits of ELT, ETL, data crunching, transformation, validation and batch data processing. Once data is ready to be an event, NiFi can launch it into Pulsar at light speed. I will walk through how to get started, some use cases and demos and answer questions. https://www.devfest-uki.com/schedule https://linktr.ee/tspannhw
This document summarizes Tim Spann's presentation on codeless pipelines with Apache Pulsar and Apache Flink. The presentation discusses how StreamNative's platform uses Pulsar and Flink to enable end-to-end streaming data pipelines without code. It provides an overview of Pulsar's capabilities for messaging, stream processing, and integration with other Apache projects like Kafka, NiFi and Flink. Examples are given of ingesting IoT data into Pulsar and running real-time analytics on the data using Flink SQL.
https://github.com/tspannhw/SpeakerProfile/tree/main/2022/talks Fast Streaming into ClickHouse with Apache Pulsar https://github.com/tspannhw/FLiPC-FastStreamingIntoClickhouseWithApachePulsar https://www.meetup.com/San-Francisco-Bay-Area-ClickHouse-Meetup/events/285271332/ Fast Streaming into ClickHouse with Apache Pulsar - Meetup 2022 StreamNative - Apache Pulsar - Stream to Altinity Cloud - ClickHouse May the 4th Be With You! 04-May-2022 ClickHouse Meetup

CREATE TABLE iotjetsonjson_local (
  uuid String, camera String, ipaddress String, networktime String,
  top1pct String, top1 String, cputemp String, gputemp String,
  gputempf String, cputempf String, runtime String, host String,
  filename String, host_name String, macaddress String, te String,
  systemtime String, cpu String, diskusage String, memory String,
  imageinput String
) ENGINE = MergeTree()
PARTITION BY uuid
ORDER BY (uuid);

CREATE TABLE iotjetsonjson ON CLUSTER '{cluster}' AS iotjetsonjson_local
ENGINE = Distributed('{cluster}', default, iotjetsonjson_local, rand());

SELECT uuid, top1pct, top1, gputempf, cputempf
FROM iotjetsonjson
WHERE toFloat32OrZero(top1pct) > 40
ORDER BY toFloat32OrZero(top1pct) DESC, systemtime DESC;

SELECT uuid, systemtime, networktime, te, top1pct, top1, cputempf, gputempf, cpu, diskusage, memory, filename
FROM iotjetsonjson
ORDER BY systemtime DESC;

SELECT top1, max(toFloat32OrZero(top1pct)), max(gputempf), max(cputempf)
FROM iotjetsonjson
GROUP BY top1;

SELECT top1, max(toFloat32OrZero(top1pct)) AS maxTop1, max(gputempf), max(cputempf)
FROM iotjetsonjson
GROUP BY top1
ORDER BY maxTop1;

Tim Spann Developer Advocate StreamNative
Data insights and data-driven strategies create the competitive differentiators companies thrive off today. The need for unified messaging and streaming has never been more apparent. Pulsar started with the goal of building a global, geo-replicated infrastructure to serve Yahoo!’s messaging needs. With the increased need to process both business events (such as payment request, billing request) and operational events (such as log data, click events, etc), the team at Yahoo! set out to build a true unified infrastructure platform to handle all in-motion data. That technology became Apache Pulsar. In this talk, Matteo Merli and Sijie Guo will dive into the landscape of unified messaging and streaming, how Pulsar helps companies achieve this vision, and what the future of Pulsar will look like.
Microservices, events, containers, and orchestrators are dominating our vernacular today. As operations teams adapt to support these technologies in production, cloud-native platforms like Pivotal Cloud Foundry and Kubernetes have quickly risen to serve as force multipliers of automation, productivity and value. Apache Kafka® is providing developers a critically important component as they build and modernize applications to cloud-native architecture. This talk will explore: • Why cloud-native platforms and why run Apache Kafka on Kubernetes? • What kind of workloads are best suited for this combination? • Tips to determine the path forward for legacy monoliths in your application portfolio • Demo: Running Apache Kafka as a Streaming Platform on Kubernetes
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and Kafka Apache NiFi, Apache Flink, Apache Kafka Timothy Spann Principal Developer Advocate Cloudera Data in Motion https://budapestdata.hu/2023/en/speakers/timothy-spann/ Timothy Spann Principal Developer Advocate Cloudera (US) LinkedIn · GitHub · datainmotion.dev June 8 · Online · English talk Building Modern Data Streaming Apps with NiFi, Flink and Kafka In my session, I will show you some best practices I have discovered over the last 7 years in building data streaming applications including IoT, CDC, Logs, and more. In my modern approach, we utilize several open-source frameworks to maximize the best features of all. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Kafka. From there we build streaming ETL with Apache Flink SQL. We will stream data into Apache Iceberg. We use the best streaming tools for the current applications with FLaNK. flankstack.dev BIO Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Timothy Spann: Apache Pulsar for ML Data Science Online Camp 2023 Winter Website: https://dscamp.org Youtube: https://www.youtube.com/channel/UCeHtPZ_ZLZ-nHFMUCXY81RQ FB: https://www.facebook.com/people/Data-Science-Camp/100064240830422/
Princeton Dec 2022 Meetup: NiFi + Flink + Pulsar Streaming Data Platform for cloud-native event-driven applications https://github.com/tspannhw/pulsar-csp-ce/blob/main/weather.md https://github.com/tspannhw/create-nifi-pulsar-flink-apps https://medium.com/@tspann/using-apache-pulsar-with-cloudera-sql-builder-apache-flink-b518aa9eadff https://www.meetup.com/new-york-city-apache-pulsar-meetup/events/289674210/ For non-locals, we will broadcast live via YouTube. Sign up and we will send out the link. Location: TigerLabs in Princeton on the 2nd floor; walk up and the door will be open. The same one we were using for the old Future of Data - Princeton events, 2016-2019. Parking at the school is free. Street parking nearby is free. There are meters on some streets, and a few blocks away is a paid parking garage. We are joining forces with our friends Cloudera again on a FLiPN amazing journey into Real-Time Streaming Applications with Apache Flink, Apache NiFi, and Apache Pulsar. Discover how to stream data to and from your data lake or data mart using Apache Pulsar™ and Apache NiFi®. Learn how these cloud-native, scalable open-source projects built for streaming data pipelines work together to enable you to quickly build applications with minimal coding. |WHAT THE SESSION WILL COVER| Apache NiFi Apache Pulsar Apache Flink Flink SQL We will show you how to build apps, so download beforehand to Docker, K8s, your laptop, or the cloud.
Cloudera CSP Setup Getting Started with Cloudera Stream Processing Community Edition You may download CSP-CE here: Cloudera Stream Processing Community Edition The Cloudera CDP User's page: CDP Resources Page https://youtu.be/s80sz3NWwHo https://docs.cloudera.com/csp-ce/latest/index.html https://www.cloudera.com/downloads/cdf/csp-community-edition.html Apache Pulsar https://pulsar.apache.org/docs/getting-started-standalone/ or https://streamnative.io/free-cloud/ Cloudera + Pulsar https://community.cloudera.com/t5/Cloudera-Stream-Processing-Forum/Using-Apache-Pulsar-with-SQL-Stream-Builder/m-p/349917 https://community.cloudera.com/t5/Community-Articles/Using-Apache-NiFi-with-Apache-Pulsar-for-Streaming/ta-p/337891 |AGENDA| 6:00 - 6:30 PM EST: Food, Drink, and Networking!!! 6:30 - 7:15 PM EST: Presentation - Tim Spann, StreamNative Developer Advocate 7:15 - 8:00 PM EST: Presentation - John Kuchmek, Cloudera Principal Solutions Engineer 8:00 - 8:30 PM EST: Round Table on Real-Time Streaming, Q&A |ABOUT THE SPEAKERS| John Kuchmek is a Principal Solutions Engineer for Cloudera. Before joining Cloudera, John transitioned to the Autonomous Intelligence team where he was in charge of integrating the platforms to allow data scientists to work with various types of data. Tim Spann is a Developer Advocate for StreamNative. He works with StreamNative Cloud, Apache Pulsar™, Apache Flink®, Flink® SQL, Big Data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming.
This document provides an overview of Apache Kafka including its main components, architecture, and ecosystem. It describes how LinkedIn used Kafka to solve their data pipeline problem by decoupling systems and allowing for horizontal scaling. The key elements of Kafka are producers that publish data to topics, the Kafka cluster that stores streams of records in a distributed, replicated commit log, and consumers that subscribe to topics. Kafka Connect and the Schema Registry are also introduced as part of the Kafka ecosystem.
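The producer-topic-consumer flow described above can be sketched in a few lines of Python (the bootstrap address, topic name, and record fields are illustrative assumptions; the example uses the kafka-python client):

```python
import json


def order_record(order_id: str, amount: float) -> bytes:
    """Serialize an event as JSON bytes for a Kafka record value."""
    return json.dumps({"order_id": order_id, "amount": amount}).encode("utf-8")


def publish_order(bootstrap: str, order_id: str, amount: float) -> None:
    """Publish one record to a hypothetical 'orders' topic (needs a broker)."""
    from kafka import KafkaProducer  # pip install kafka-python
    producer = KafkaProducer(bootstrap_servers=bootstrap)
    # The key determines the partition, so records for one order stay ordered
    # within the replicated commit log.
    producer.send("orders", value=order_record(order_id, amount),
                  key=order_id.encode("utf-8"))
    producer.flush()
    producer.close()
```

Consumers subscribing to the same topic would then read these records at their own pace from the distributed log.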
During the Confluent Streaming event in Paris, Florent Ramière, Technical Account Manager at Confluent, goes beyond brokers, introducing a whole new ecosystem with Kafka Streams, KSQL, Kafka Connect, REST Proxy, Schema Registry, MirrorMaker, etc.
Apache Kafka is the data streaming broker most used by companies. It can easily manage millions of messages and is the base of many architectures built on events, micro-services, orchestration, ... and now cloud environments. OpenShift is the most widely adopted Platform as a Service (PaaS). It is based on Kubernetes and helps companies easily deploy any kind of workload in a cloud environment. Thanks to many of its features, it is the base for many architectures built on stateless applications as new Cloud Native Applications. Strimzi is an open source community that implements a set of Kubernetes Operators to help you manage and deploy Apache Kafka brokers in OpenShift environments. These slides will introduce Strimzi as a new component on OpenShift to manage your Apache Kafka clusters. Slides used at OpenShift Meetup Spain: - https://www.meetup.com/es-ES/openshift_spain/events/261284764/
Using FLaNK with InfluxDB for EdgeAI IoT at Scale Timothy from StreamNative takes you on a hands-on deep-dive on using Pulsar, Apache NiFi + Edge Flow Manager + MiniFi Agents with Apache MXNet, OpenVino, TensorFlow Lite, and other Deep Learning Libraries on the actual edge devices including Raspberry Pi with Movidius 2, Google Coral TPU and NVidia Jetson Nano. The team runs deep learning models on the edge devices, sends images, and captures real-time GPS and sensor data. Their low-coding IoT applications provide easy edge routing, transformation, data acquisition and alerting before they decide what data to stream real-time to their data space. These edge applications classify images and sensor readings in real time at the edge and then send Deep Learning results to Flink SQL and Apache NiFi for transformation, parsing, enrichment, querying, filtering and merging data to InfluxDB.
Using FLiP with InfluxDB for EdgeAI IoT at Scale. Apache Pulsar, InfluxDB, Apache Flink, StreamNative, Apache Spark, Apache NiFi, FLiP(N) stack.
Apache Pulsar Development 101 with Python PS2022_Ecosystem_v0.0 There is always the fear that a speaker cannot make it, so since I was the MC for the ecosystem track, I put together a backup talk just in case. Here it is, never seen or presented.
Timothy will introduce Apache Pulsar, an open-source distributed messaging and streaming platform. He will discuss how to build real-time applications using Pulsar with various libraries, schemas, languages, frameworks and tools. The presentation will cover what Pulsar is, its functions and components, how it compares to other technologies like Apache Kafka, its advantages, and how to integrate it with tools like Apache Flink, Apache Spark, Apache NiFi and more. A demo and Q&A will follow.
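To ground the introduction, here is a sketch of the consuming side in Python with the pulsar-client library (the service URL, topic, and subscription name are illustrative assumptions; a Shared subscription lets multiple consumers split the work):

```python
import json


def decode_message(data: bytes) -> dict:
    """Decode a JSON message payload received from Pulsar."""
    return json.loads(data.decode("utf-8"))


def consume_one(service_url: str, topic: str, subscription: str) -> dict:
    """Receive and acknowledge one message (requires a running broker)."""
    import pulsar  # pip install pulsar-client
    client = pulsar.Client(service_url)
    consumer = client.subscribe(topic,
                                subscription_name=subscription,
                                consumer_type=pulsar.ConsumerType.Shared)
    msg = consumer.receive(timeout_millis=10000)
    consumer.acknowledge(msg)  # removes the message from the backlog
    client.close()
    return decode_message(msg.data())
```

Swapping `ConsumerType.Shared` for `Exclusive`, `Failover`, or `KeyShared` is how Pulsar's different subscription modes are selected from the client.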
This document discusses how Apache Pulsar can be used as a unified messaging platform from edge to multi-cloud environments. It provides an overview of Pulsar's key features such as durability, scalability, geo-replication, and functions. It also compares Pulsar to Apache Kafka and outlines Pulsar's architecture including tenants, namespaces, topics, and message formats. Additionally, it demonstrates how Pulsar can be used with various protocols and frameworks like Kafka, MQTT, AMQP, NiFi, and Flink.
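The tenant/namespace/topic hierarchy mentioned above is visible in Pulsar's topic naming scheme itself, which a small helper makes explicit:

```python
def pulsar_topic(tenant: str, namespace: str, topic: str,
                 persistent: bool = True) -> str:
    """Build a fully qualified Pulsar topic name.

    Pulsar topics follow {persistent|non-persistent}://tenant/namespace/topic,
    so multi-tenancy is encoded directly in the name: policies set on the
    tenant or namespace apply to every topic underneath them.
    """
    scheme = "persistent" if persistent else "non-persistent"
    return f"{scheme}://{tenant}/{namespace}/{topic}"
```

For example, `pulsar_topic("public", "default", "weather")` yields `persistent://public/default/weather`, the default tenant and namespace used in many of the demos referenced in this document.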
Agenda: - Cloud Native vs. SaaS / Serverless Kafka - The Emergence of Kubernetes - Kafka on K8s Deployment Challenges - Confluent Operator as Kafka Operator - Q&A Confluent Operator enables you to: Provisioning, management and operations of Confluent Platform (including ZooKeeper, Apache Kafka, Kafka Connect, KSQL, Schema Registry, REST Proxy, Control Center) Deployment on any Kubernetes Platform (Vanilla K8s, OpenShift, Rancher, Mesosphere, Cloud Foundry, Amazon EKS, Azure AKS, Google GKE, etc.) Automate provisioning of Kafka pods in minutes Monitor SLAs through Confluent Control Center or Prometheus Scale Kafka elastically, handle fail-over & Automate rolling updates Automate security configuration Built on our first hand knowledge of running Confluent at scale Fully supported for production usage
Modern IT and application environments are increasingly complex, transitioning to cloud, and large in scale. The managed resources, services and applications in these environments generate tremendous data that needs to be observed, consumed and analyzed in real time (or later) by management tools to create insights and to drive operational actions and decisions. In this talk, Srikanth Natarajan will share Micro Focus’ adoption story of Pulsar, including the experience in consuming from and contributing to Apache Pulsar, the lessons learned, and the help that Micro Focus received from a development support partner in their Pulsar journey.
Speakers: Ravi Dubey, Senior Manager, Software Engineering, Capital One + Jeff Sharpe, Software Engineer, Capital One Capital One supports interactions with real-time streaming transactional data using Apache Kafka®. Kafka helps deliver information to internal operation teams and bank tellers to assist with assessing risk and protect customers in a myriad of ways. Inside the bank, Kafka allows Capital One to build a real-time system that takes advantage of modern data and cloud technologies without exposing customers to unnecessary data breaches, or violating privacy regulations. These examples demonstrate how a streaming platform enables Capital One to act on their visions faster and in a more scalable way through the Kafka solution, helping establish Capital One as an innovator in the banking space. Join us for this online talk on lessons learned, best practices and technical patterns of Capital One’s deployment of Apache Kafka. -Find out how Kafka delivers on a 5-second service-level agreement (SLA) for inside branch tellers. -Learn how to combine and host data in-memory and prevent personally identifiable information (PII) violations of in-flight transactions. -Understand how Capital One manages Kafka Docker containers using Kubernetes. Watch the recording: https://videos.confluent.io/watch/6e6ukQNnmASwkf9Gkdhh69?.
The document discusses OpenStack and Fibre Channel storage. It provides an overview of OpenStack, including its goals of being an open platform with broad support and empowering users. It describes core OpenStack technologies like Compute, Object Storage, and Block Storage. It outlines the history and current state of Fibre Channel support in OpenStack, including the Fibre Channel Zone Manager that automates zoning. It diagrams the high-level architecture and components involved in provisioning Fibre Channel volumes to virtual machines from OpenStack.
OSSNA Building Modern Data Streaming Apps https://ossna2023.sched.com/event/1Jt05/virtual-building-modern-data-streaming-apps-with-open-source-timothy-spann-streamnative Timothy Spann Cloudera Principal Developer Advocate Data in Motion In my session, I will show you some best practices I have discovered over the last seven years in building data streaming applications, including IoT, CDC, Logs, and more. In my modern approach, we utilize several open-source frameworks to maximize all the best features. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Pulsar. From there, we build streaming ETL with Apache Spark and enhance events with Pulsar Functions for ML and enrichment. We make continuous queries against our topics with Flink SQL. We will stream data into various open-source data stores, including Apache Iceberg, Apache Pinot, and others. We use the best streaming tools for the current applications with the open source stack - FLiPN. https://www.flipn.app/ Updates: This will be in-person with live coding based on feedback from the crowd. This will also include new data stores, new sources, and data relevant to and from the Vancouver area. This will also include updates to the platforms and inclusion of Apache Iceberg, Apache Pinot and some other new tech. https://github.com/tspannhw/SpeakerProfile Tim Spann is a Principal Developer Advocate for Cloudera. He works with Apache Kafka, Apache Flink, Flink SQL, Apache NiFi, MiniFi, Apache MXNet, TensorFlow, Apache Spark, Big Data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. 
He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science. Timothy J Spann Cloudera Principal Developer Advocate Hightstown, NJ Website: https://datainmotion.dev/
Microservices, events, containers, and orchestrators are dominating our vernacular today. As operations teams adapt to support these technologies in production, cloud-native platforms like Cloud Foundry and Kubernetes have quickly risen to serve as force multipliers of automation, productivity and value. Kafka is providing developers a critically important component as they build and modernize applications to cloud-native architecture. This talk will explore: • Why cloud-native platforms and why run Kafka on Kubernetes? • What kind of workloads are best suited for this combination? • Tips to determine the path forward for legacy monoliths in your application portfolio • Running Kafka as a Streaming Platform on Container Orchestration
Tech Talk: Unstructured Data and Vector Databases Speaker: Tim Spann (Zilliz) Abstract: In this session, I will discuss unstructured data and the world of vector databases, and we will see how they differ from traditional databases, in which cases you need one, and in which you probably don't. I will also go over similarity search, where vectors come from, and an example of a vector database architecture, wrapping up with an overview of Milvus. Introduction Unstructured data, vector databases, traditional databases, similarity search Vectors Where, What, How, Why Vectors? We'll cover a Vector Database Architecture Introducing Milvus What drives Milvus' emergence as the most widely adopted vector database Hi Unstructured Data Friends! I hope this video had all the unstructured data processing, AI and Vector Database demos you needed for now. If not, there's a ton more linked below. My source code is available here https://github.com/tspannhw/ Let me know in the comments if you liked what you saw, how I can improve and what I should show next. Thanks, hope to see you soon at a Meetup in Princeton, Philadelphia, New York City or here in the Youtube Matrix. Get Milvused! https://milvus.io/ Read my Newsletter every week! 
https://github.com/tspannhw/FLiPStackWeekly/blob/main/141-10June2024.md
For more cool unstructured data, AI and vector database videos, check out the Milvus vector database videos here: https://www.youtube.com/@MilvusVectorDatabase/videos
Unstructured Data Meetups: https://www.meetup.com/unstructured-data-meetup-new-york/ https://lu.ma/calendar/manage/cal-VNT79trvj0jS8S7 https://www.meetup.com/pro/unstructureddata/ https://zilliz.com/community/unstructured-data-meetup https://zilliz.com/event
Twitter/X: https://x.com/milvusio https://x.com/paasdev
LinkedIn: https://www.linkedin.com/company/zilliz/ https://www.linkedin.com/in/timothyspann/
GitHub: https://github.com/milvus-io/milvus https://github.com/tspannhw
Invitation to join Discord: https://discord.com/invite/FjCMmaJng6
Blogs: https://milvusio.medium.com/ https://www.opensourcevectordb.cloud/ https://medium.com/@tspann
Events: https://www.meetup.com/unstructured-data-meetup-new-york/events/301383476/?slug=unstructured-data-meetup-new-york&eventId=301383476 https://www.aicamp.ai/event/eventdetails/W2024062014
Mehul Shah - Startup Grind Princeton, 18 June 2024 - AI Advancement. Infinity Services Inc. - Artificial Intelligence Development Services. www.infinity-services.com
Startup Grind Princeton, June 18, 2024 - GenAI Event
06-18-2024 - Princeton Meetup - Introduction to Milvus
tim.spann@zilliz.com https://www.linkedin.com/in/timothyspann/ https://x.com/paasdev https://github.com/tspannhw https://github.com/milvus-io/milvus
Get Milvused! https://milvus.io/
Read my newsletter every week! https://github.com/tspannhw/FLiPStackWeekly/blob/main/142-17June2024.md
For more cool unstructured data, AI and vector database videos, check out the Milvus vector database videos here: https://www.youtube.com/@MilvusVectorDatabase/videos
Unstructured Data Meetups: https://www.meetup.com/unstructured-data-meetup-new-york/ https://lu.ma/calendar/manage/cal-VNT79trvj0jS8S7 https://www.meetup.com/pro/unstructureddata/ https://zilliz.com/community/unstructured-data-meetup https://zilliz.com/event
Twitter/X: https://x.com/milvusio https://x.com/paasdev
LinkedIn: https://www.linkedin.com/company/zilliz/ https://www.linkedin.com/in/timothyspann/
GitHub: https://github.com/milvus-io/milvus https://github.com/tspannhw
Invitation to join Discord: https://discord.com/invite/FjCMmaJng6
Blogs: https://milvusio.medium.com/ https://www.opensourcevectordb.cloud/ https://medium.com/@tspann
Expand LLMs' knowledge by incorporating external data sources into your AI applications.
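That last idea, expanding an LLM's knowledge with external data, is the retrieval-augmented generation (RAG) pattern. A minimal sketch follows, with a toy keyword retriever standing in for a real Milvus vector search and the actual chat-completion call left out:

```python
# Minimal sketch of retrieval-augmented generation (RAG): retrieve the
# most relevant external documents, then build a prompt that grounds the
# LLM's answer in them. The keyword retriever below is a toy stand-in
# for a Milvus vector search; the chat-completion call itself is omitted.
import re

def tokens(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Rank documents by word overlap with the question (toy retriever)."""
    q = tokens(question)
    ranked = sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    """Stuff the retrieved context into the prompt sent to the LLM."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [
    "Milvus is an open-source vector database for unstructured data.",
    "Apache NiFi moves and transforms data between systems.",
]
print(build_prompt("What is Milvus?", docs))
```

In a production pipeline, the retriever would embed the question with the same model used for the documents and query Milvus for the nearest vectors instead of matching keywords.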
06-12-2024 - Budapest Data Forum - Building Real-Time Pipelines with FLaNK AIM
by Timothy Spann, Principal Developer Advocate
https://budapestdata.hu/2024/en/ https://budapestml.hu/2024/en/
tim.spann@zilliz.com https://www.linkedin.com/in/timothyspann/ https://x.com/paasdev https://github.com/tspannhw https://www.youtube.com/@flank-stack
milvus, vector database, gen ai, generative ai, deep learning, machine learning, apache nifi, apache pulsar, apache kafka, apache flink
Codeless Generative AI Pipelines (GenAI with Milvus) https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience. Timothy Spann https://www.youtube.com/@FLaNK-Stack https://medium.com/@tspann https://www.datainmotion.dev/ milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI. A round-table discussion of vector databases, unstructured data, AI, big data, real-time, robots and Milvus. A lively discussion with the NJ Gen AI Meetup lead, Prasad, and Procure.FYI's co-founder.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.