Data Pipelines and Telephony Fraud Detection Using Machine Learning

Presented by
Eugene Shulga, Platform Engineer
Elana Woldenberg, Platform Engineer

Agenda
1. Data Pipelines
2. Fraud Detection

Massive amount of data
• CDRs (Call Detail Records): hundreds of millions
• SIP messages: billions
• LRN (Local Routing Number): hundreds of millions

Telnyx Recipe
• Message routing and reliable delivery (Kafka, RabbitMQ)
• Storage (Cassandra, Postgres)
• Real-time aggregation (Spark Streaming)
• Batch and ad-hoc analysis (Spark and Notebooks)
• Visualization (Kibana, Grafana)

Cloud Agnostic
Requirements
• Cannot use cloud-specific data solutions
• Flexible enough for HA
• All the services and servers are built with Docker
• Single deployment script for any cloud with Docker, Swarm and Ansible
Challenges
• Every cloud is different. Different APIs, hardware profiles, and performance
• What about data migration/replication?

FreeSWITCH Data Pipeline
[Diagram label: Fraud Detection]
• All the data flows to Apache Kafka
• Spark Streaming for real-time processing
• Cassandra and Spark batch jobs for hourly, daily, and weekly analysis
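The real-time step can be pictured as a tumbling-window count over incoming events. The sketch below is plain Python, not the Spark Streaming API; the window size and the field names `ts` and `customer` are assumptions made for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 60  # assumed window size; the slides do not state one


def window_start(ts: int, size: int = WINDOW_SECONDS) -> int:
    """Round a Unix timestamp down to the start of its tumbling window."""
    return ts - (ts % size)


def count_calls_per_window(events):
    """Count events per (window start, customer) pair.

    `events` is an iterable of dicts with the hypothetical keys
    `ts` (Unix seconds) and `customer`.
    """
    counts = Counter()
    for event in events:
        counts[(window_start(event["ts"]), event["customer"])] += 1
    return dict(counts)


events = [
    {"ts": 1000, "customer": "a"},
    {"ts": 1010, "customer": "a"},
    {"ts": 1030, "customer": "b"},
    {"ts": 1075, "customer": "a"},
]
per_window = count_calls_per_window(events)
```

The same grouping key (window, customer) could also serve for the hourly, daily, and weekly batch jobs by changing the window size.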

Kafka
Pros
• High-throughput distributed messaging
• Automatic recovery from broker failures
• Decouples data pipelines
• Handles massive data load
• Data distribution and partitioning across nodes
• Distributed log implementation
Cons
• Requires ZooKeeper and extra support/monitoring tools
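Partitioning across nodes generally relies on mapping each message key to a partition in a stable way, so that all messages for one key stay in order on one partition. The sketch below shows the idea in plain Python. CRC32 is a stand-in hash chosen for illustration and is not the hash Kafka's own partitioner uses; the partition count and message keys are also assumptions.

```python
import zlib

NUM_PARTITIONS = 6  # assumed partition count for illustration


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition with a stable hash.

    CRC32 is a stand-in; it is not the hash Kafka's partitioner uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(messages, num_partitions: int = NUM_PARTITIONS):
    """Group (key, value) messages by partition, preserving arrival order."""
    partitions = {p: [] for p in range(num_partitions)}
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


messages = [
    ("customer-1", "call started"),
    ("customer-2", "call started"),
    ("customer-1", "call ended"),
]
assigned = assign(messages)
```

Because the hash is deterministic, both `customer-1` messages land on the same partition and keep their relative order there.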

Apache Spark Programming Model
• RDD (Resilient Distributed Dataset): a collection of objects stored in memory or on disk across the cluster
• RDDs have actions and transformations
• All transformations are lazy; once an action is called, Spark creates a DAG (Directed Acyclic Graph) and submits it to the scheduler
• The task scheduler launches tasks via a cluster manager (Spark Standalone, YARN, Mesos)
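The lazy-evaluation point can be illustrated without Spark itself. The toy class below is a hypothetical stand-in for an RDD, not Spark's API: transformations only record what to do, and the recorded chain runs when an action is called. It models a linear chain rather than a full DAG.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # the recorded chain of transformations

    def map(self, fn):
        # Transformation: returns a new dataset and computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded chain executed.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def traced_double(x):
    calls.append(x)  # records when the function actually runs
    return x * 2


dataset = LazyDataset([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # still empty: nothing has executed
result = dataset.collect()         # the action triggers the whole chain
```

Deferring the work until the action is what would let a real scheduler see the whole chain before deciding how to split it into tasks.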

Spark Cassandra Integration
[Diagram: App, Spark Master (JVM), and Nodes 1 through N, each running a Spark Worker (JVM) with an Executor alongside a Cassandra instance]

Cassandra Data Modeling
CDR Use Cases
• Internal metrics/aggregates across all customers
• Historical and real-time analytics (per user, date)
• Metrics (ASR, ACD, MOU, etc.) for customers and dashboards
• Customer insights
• Access to FreeSWITCH raw CDRs for troubleshooting
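The three metrics named on this slide have standard definitions: ASR (Answer-Seizure Ratio) is answered calls divided by call attempts, ACD (Average Call Duration) is the mean duration of answered calls, and MOU (Minutes of Usage) is the total answered time in minutes. A minimal sketch follows, assuming each CDR is a dict with the hypothetical fields `answered` and `duration_sec`; real billing rules such as per-call rounding are not modeled.

```python
def cdr_metrics(cdrs):
    """Compute ASR, ACD, and MOU over a list of CDR dicts.

    Each CDR is assumed to carry `answered` (bool) and `duration_sec` (int).
    """
    attempts = len(cdrs)
    answered = [c for c in cdrs if c["answered"]]
    total_sec = sum(c["duration_sec"] for c in answered)
    return {
        # Share of call attempts that were answered, in percent.
        "asr_pct": 100.0 * len(answered) / attempts if attempts else 0.0,
        # Mean duration of answered calls, in seconds.
        "acd_sec": total_sec / len(answered) if answered else 0.0,
        # Total answered time, in minutes.
        "mou_min": total_sec / 60.0,
    }


cdrs = [
    {"answered": True, "duration_sec": 120},
    {"answered": True, "duration_sec": 60},
    {"answered": False, "duration_sec": 0},
    {"answered": False, "duration_sec": 0},
]
metrics = cdr_metrics(cdrs)
```

For these four records the function yields an ASR of 50%, an ACD of 90 seconds, and 3 minutes of usage.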

Distributed System Challenges
• Idempotency: helps with scale, greatly simplifies processing
• Partitioning: split data to handle scale and isolate failure
• Consistency model: trade-off between throughput and consistency
• Denormalization/duplication: sometimes data redundancy is good
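Idempotency is easiest to see with a write keyed by a unique identifier: replaying the same record leaves the stored state unchanged, so retries and duplicate deliveries need no special handling. The sketch below uses an in-memory dict as a stand-in for a table; the class name and the `call_id` and `duration_sec` fields are assumptions, not the presenters' schema.

```python
class CdrStore:
    """In-memory stand-in for a table keyed by a unique CDR identifier."""

    def __init__(self):
        self._rows = {}

    def upsert(self, cdr):
        # Writing the same CDR again overwrites the same key, so a replay
        # leaves the store unchanged.
        self._rows[cdr["call_id"]] = cdr

    def total_duration(self):
        return sum(row["duration_sec"] for row in self._rows.values())

    def __len__(self):
        return len(self._rows)


batch = [
    {"call_id": "c1", "duration_sec": 30},
    {"call_id": "c2", "duration_sec": 45},
]
store = CdrStore()
for _ in range(2):  # deliver the same batch twice, as a retry would
    for cdr in batch:
        store.upsert(cdr)
```

After the duplicate delivery the store still holds two rows and 75 seconds in total, which is why a pipeline built on such writes can simply reprocess a batch after a failure.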

FreeSWITCH Data Pipeline
[Diagram label: Fraud Detection]

What is fraud in Telecom?
Hint: $$$$

Fraud Detection
• How does a carrier detect usage fraud?
• What does usage fraud look like?

Steps of Fraud Detection
1. Collect the data
a. Time series
2. Process the data
a. Asynchronous
b. Scale horizontally
3. Detect anomalies
a. Static
b. Dynamic
4. Alert

Process the Data
How to handle huge datasets without sacrificing speed or quality?
• Golang + worker pools + asynchronous processing
• Telegraf + InfluxDB + Grafana
Open source vs. proprietary
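The worker-pool idea is that a fixed number of workers pull records from a shared stream and process them concurrently. The slide names Golang; the sketch below shows the same pattern with Python's `concurrent.futures` as a stand-in. The record format `call_id,duration_sec` and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_cdr(line: str) -> dict:
    """Parse a hypothetical 'call_id,duration_sec' record."""
    call_id, duration = line.split(",")
    return {"call_id": call_id, "duration_sec": int(duration)}


def process_all(lines, workers: int = 4):
    """Fan records out to a fixed pool of workers and gather the results.

    `pool.map` returns results in input order even though the work
    itself runs concurrently.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_cdr, lines))


lines = [f"call-{i},{i * 10}" for i in range(8)]
parsed = process_all(lines)
```

Capping the pool size bounds resource use per process; scaling horizontally then means running more such processes rather than growing one pool without limit.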

Detect Anomalies
Static
• Thresholds
Dynamic (Predictive)
• Statistics
- Mean / Standard Deviation
• Machine Learning
- K-Means Clustering
- Multivariate Gaussian Distribution
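The difference between a static threshold and the mean / standard deviation approach can be sketched in a few lines. The limits, the function names, and the sample history below are assumptions for illustration; the slides do not give the values used in production, and the machine-learning methods listed above are not shown here.

```python
from statistics import mean, stdev

STATIC_LIMIT = 100  # assumed fixed cap on calls per hour
Z_LIMIT = 3.0       # assumed cutoff in standard deviations


def static_alert(value, limit=STATIC_LIMIT):
    """Static check: flag any value above a fixed limit."""
    return value > limit


def zscore_alert(history, value, z_limit=Z_LIMIT):
    """Dynamic check: flag `value` if it lies more than `z_limit` sample
    standard deviations above the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value > mu  # any increase over a perfectly flat history
    return (value - mu) / sigma > z_limit


history = [10, 12, 11, 9, 13, 10, 11, 12]  # hypothetical calls per hour
static_flag = static_alert(40)
dynamic_flag = zscore_alert(history, 40)
```

With this history, 40 calls in an hour stays under the fixed limit of 100 but sits far above the account's usual level, so only the dynamic check raises a flag.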

Alert
• API / Messaging layer
• Push / Pull

Q & A
Presented by
Eugene Shulga, Platform Engineer
Elana Woldenberg, Platform Engineer
