Data Pipelines and Telephony Fraud Detection Using Machine Learning
Presented by
Eugene Shulga, Platform Engineer
Elana Woldenberg, Platform Engineer
Agenda
1. Data Pipelines
2. Fraud Detection
Data Pipelines
Massive amount of data
CDRs (Call Detail Records): hundreds of millions
SIP messages: billions
LRN (Location Routing Number): hundreds of millions
Telnyx Recipe
• Message routing and reliable delivery (Kafka, RabbitMQ)
• Storage (Cassandra, Postgres)
• Real-time aggregation (Spark Streaming)
• Batch and ad-hoc analysis (Spark and Notebooks)
• Visualization (Kibana, Grafana)
Cloud Agnostic
Requirements
• Cannot use cloud-specific data solutions
• Flexible enough for HA
• All the services and servers are built with Docker
• Single deployment script for any cloud with Docker, Swarm and Ansible
Challenges
• Every cloud is different. Different APIs, hardware profiles, and performance
• What about data migration/replication?
FreeSWITCH Data Pipeline
Fraud Detection
• All the data flows to Apache Kafka
• Spark Streaming for real-time processing
• Cassandra and Spark batch jobs for hourly, daily, and weekly analysis
Kafka
Pros
• High-throughput distributed messaging
• Automatic recovery from broker failures
• Decouples data pipelines
• Handles massive data load
• Data distribution and partitioning across nodes
• Distributed log implementation
Cons
• ZooKeeper, support/monitoring tools
Apache Spark Programming Model
• RDD (Resilient Distributed Dataset): a collection of objects stored in memory or on disk across the cluster
• RDDs have actions and transformations
• All transformations are lazy; once an action is called, Spark creates a DAG (Directed Acyclic Graph) and submits it to the scheduler
• The task scheduler launches tasks via the cluster manager (Spark Standalone, YARN, Mesos)
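The lazy-evaluation point can be illustrated without a cluster. What follows is a toy sketch in plain Python, not the Spark API: `LazyDataset` and its methods are hypothetical names chosen to mirror the RDD vocabulary. Transformations only record a step; the action runs the recorded chain.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source   # any iterable of records
        self._steps = steps     # recorded (kind, function) pairs, in order

    def map(self, fn):
        # Transformation: returns a new dataset and computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps into one pipeline (a linear "DAG").
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    def collect(self):
        # Action: triggers evaluation of the whole chain.
        return list(self._run())

    def count(self):
        # Action: evaluates the chain and counts the surviving records.
        return sum(1 for _ in self._run())


calls_seen = []


def duration_minutes(seconds):
    calls_seen.append(seconds)  # side effect that shows when work happens
    return seconds / 60


durations = LazyDataset([30, 600, 5, 1200])  # call durations in seconds
long_calls = durations.map(duration_minutes).filter(lambda m: m >= 1)

ran_before_action = list(calls_seen)  # still empty: nothing has executed
result = long_calls.collect()         # the action runs the recorded chain
```

In real Spark the recorded chain is a graph of stages that the scheduler splits into tasks across executors; the toy version only captures the "record now, run on action" behaviour.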
Spark Cassandra Integration
[Diagram: App, Spark Master (JVM), and Nodes 1 to N, each labeled with a Spark Worker (JVM), an Executor, and Cassandra]
Cassandra Data Modeling
CDR Use Cases
Internal metrics/aggregates across all customers
Historical and real-time analytics (per user, date)
Metrics (ASR, ACD, MOU, etc.) for customers and dashboards
Customer Insights
Access to FreeSWITCH raw CDRs for troubleshooting
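The metrics named on this slide have standard telecom definitions: ASR (answer-seizure ratio) is answered calls over call attempts, ACD (average call duration) is the mean duration of answered calls, and MOU is minutes of usage. A minimal sketch, assuming a simplified CDR record; the `CDR` class and its fields are illustrative, and real FreeSWITCH CDRs carry many more fields.

```python
from dataclasses import dataclass


@dataclass
class CDR:
    customer: str
    answered: bool
    billsec: int  # billed seconds; 0 when the call was not answered


def call_metrics(cdrs):
    """Compute ASR, ACD and MOU over a list of CDRs."""
    attempts = len(cdrs)
    answered = [c for c in cdrs if c.answered]
    total_seconds = sum(c.billsec for c in answered)
    return {
        # ASR: share of attempts that were answered, as a percentage
        "asr_percent": 100.0 * len(answered) / attempts if attempts else 0.0,
        # ACD: average duration of answered calls, in seconds
        "acd_seconds": total_seconds / len(answered) if answered else 0.0,
        # MOU: total minutes of usage
        "mou": total_seconds / 60.0,
    }


sample = [
    CDR("acme", True, 120),
    CDR("acme", False, 0),
    CDR("acme", True, 240),
    CDR("acme", False, 0),
]
metrics = call_metrics(sample)
```

Grouping the input by customer and date before calling `call_metrics` would give the per-user, per-date view listed above.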
Distributed System Challenges
Idempotency
Helps with scale, greatly simplifies processing
Partitioning
Split data to handle scale and isolate failure
Consistency model
Trade off between throughput and consistency
Denormalization/duplication
Sometimes data redundancy is good
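Idempotency and partitioning can be shown together in a few lines. This is a sketch under stated assumptions, not the presenters' implementation: each CDR is assumed to carry a unique `call_id` (a hypothetical field name), and an in-memory dict per partition stands in for the real store. Because writes are keyed by `call_id`, a message redelivered by the queue overwrites itself instead of being counted twice.

```python
import hashlib

NUM_PARTITIONS = 4  # illustrative value


def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Use a stable hash: Python's built-in hash() of a str varies per process.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class PartitionedStore:
    """In-memory stand-in for a partitioned table keyed by call_id."""

    def __init__(self, num_partitions=NUM_PARTITIONS):
        self.partitions = [dict() for _ in range(num_partitions)]

    def upsert(self, call_id, record):
        # Idempotent write: applying the same record twice leaves the same state.
        index = partition_for(call_id, len(self.partitions))
        self.partitions[index][call_id] = record

    def total_seconds(self):
        return sum(r["billsec"] for p in self.partitions for r in p.values())

    def __len__(self):
        return sum(len(p) for p in self.partitions)


messages = [
    ("call-1", {"billsec": 60}),
    ("call-2", {"billsec": 30}),
    ("call-1", {"billsec": 60}),  # duplicate delivery of the same CDR
]

store = PartitionedStore()
for call_id, record in messages:
    store.upsert(call_id, record)
```

The duplicate of `call-1` lands in the same partition and replaces the earlier copy, so totals stay correct even when the whole batch is replayed.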
Fraud Detection
What is fraud in Telecom?
Hint: $$$$
Fraud Detection
• How does a carrier detect usage fraud?
• What does usage fraud look like?
Steps of Fraud Detection
1. Collect the data
a. Time series
2. Process the data
a. Asynchronous
b. Scale horizontally
3. Detect anomalies
a. Static
b. Dynamic
4. Alert
Process the Data
How to handle huge datasets without sacrificing speed or quality?
Golang + worker pools + asynchronous
Telegraf + InfluxDB + Grafana
Open Source / Proprietary
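The slide names Golang worker pools. The same pattern is sketched here in Python with `concurrent.futures`, which is a substitution and not the presenters' code. The `score_batch` function and the sample batches are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def score_batch(batch):
    """Hypothetical per-batch work: total billed seconds per customer."""
    totals = {}
    for customer, seconds in batch:
        totals[customer] = totals.get(customer, 0) + seconds
    return totals


def process_all(batches, workers=4):
    """Fan batches out to a fixed pool of workers and merge the partial results."""
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map hands batches to idle workers and yields results in input order.
        for partial in pool.map(score_batch, batches):
            for customer, seconds in partial.items():
                merged[customer] = merged.get(customer, 0) + seconds
    return merged


batches = [
    [("acme", 60), ("globex", 30)],
    [("acme", 120)],
    [("globex", 90), ("initech", 15)],
]
totals = process_all(batches)
```

A fixed pool size bounds resource use on one machine. Scaling horizontally then means running more such processes, each consuming its own share of the input.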
Detect Anomalies
Static
• Thresholds
Dynamic (Predictive)
• Statistics
- Mean / Standard Deviation
• Machine Learning
- K-Means Clustering
- Multivariate Gaussian Distribution
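Three of the techniques listed above fit in a short sketch: a static threshold, a mean/standard-deviation rule, and a Gaussian density score. Everything here is illustrative: the function names, the sample traffic numbers, and the cutoffs (`k=3.0`, `epsilon`) are assumptions. The Gaussian treats features as independent (a diagonal covariance), which is a simplification of the full multivariate form. K-means is omitted.

```python
import math
from statistics import mean, pstdev, pvariance


def static_alerts(values, limit):
    """Static rule: indexes of values above a fixed threshold."""
    return [i for i, v in enumerate(values) if v > limit]


def zscore_alerts(history, recent, k=3.0):
    """Dynamic rule: indexes of recent points more than k standard
    deviations away from the historical mean."""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return [i for i, v in enumerate(recent) if v != mu]
    return [i for i, v in enumerate(recent) if abs(v - mu) / sigma > k]


def fit_gaussian(rows):
    """Per-feature mean and variance; assumes no feature is constant."""
    columns = list(zip(*rows))
    return [mean(c) for c in columns], [pvariance(c) for c in columns]


def gaussian_density(row, mus, variances):
    """Product of per-feature normal densities (independent features)."""
    p = 1.0
    for x, mu, var in zip(row, mus, variances):
        p *= math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
    return p


# Made-up calls-per-minute series for one customer.
history = [10, 12, 11, 9, 13, 10, 11, 12]
recent = [11, 14, 40]
static_hits = static_alerts(recent, limit=30)
dynamic_hits = zscore_alerts(history, recent)

# Made-up (calls per minute, average call seconds) observations.
normal_traffic = [(10, 180), (12, 170), (11, 190), (9, 175), (13, 185)]
mus, variances = fit_gaussian(normal_traffic)
epsilon = 1e-6  # density cutoff, normally tuned on labeled examples
typical_p = gaussian_density((11, 180), mus, variances)
suspicious_p = gaussian_density((40, 5), mus, variances)
```

A point is flagged when its density falls below `epsilon`. Here the burst of many very short calls scores far below the typical observation.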
Alert
API, messaging layer
Push, pull
Q & A
Presented by
Eugene Shulga, Platform Engineer
Elana Woldenberg, Platform Engineer
