Data Pipelines and Telephony Fraud Detection Using Machine Learning
- 1. Data Pipelines and Telephony Fraud Detection Using Machine Learning
Presented by
Eugene Shulga, Platform Engineer
Elana Woldenberg, Platform Engineer
- 4. Massive amount of data
CDRs (Call Detail Records)
Hundreds of millions
SIP messages
Billions
LRN (Local Routing Number)
Hundreds of millions
- 5. Telnyx Recipe
• Message routing and reliable delivery (Kafka, RabbitMQ)
• Storage (Cassandra, Postgres)
• Real-time aggregation (Spark Streaming)
• Batch and ad-hoc analysis (Spark and Notebooks)
• Visualization (Kibana, Grafana)
- 6. Cloud Agnostic
Requirements
• Cannot use cloud-specific data solutions
• Flexible enough for HA
• All the services and servers are built with Docker
• Single deployment script for any cloud with Docker, Swarm and Ansible
Challenges
• Every cloud is different: different APIs, hardware profiles, and performance
• What about data migration/replication?
- 7. FreeSWITCH Data Pipeline
Fraud Detection
• All the data flows to Apache Kafka
• Spark Streaming for real-time processing
• Cassandra and Spark batch jobs for hourly, daily, weekly analysis
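The hourly, daily, and weekly batch analysis above depends on assigning each record to a time bucket. A minimal sketch of that bucketing step follows; the function name `bucket`, the epoch-second timestamps, and the bucket label formats are illustrative assumptions, not the presenters' Spark jobs.

```python
from collections import Counter
from datetime import datetime, timezone


def bucket(ts: float, granularity: str) -> str:
    """Return the label of the hour, day, or ISO week containing ts (UTC)."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    if granularity == "hour":
        return dt.strftime("%Y-%m-%dT%H:00")
    if granularity == "day":
        return dt.strftime("%Y-%m-%d")
    if granularity == "week":
        iso = dt.isocalendar()
        return f"{iso[0]}-W{iso[1]:02d}"
    raise ValueError(f"unknown granularity: {granularity}")


# Hypothetical call start times, in seconds since the Unix epoch.
call_starts = [0, 1800, 3600, 90000]
calls_per_hour = Counter(bucket(ts, "hour") for ts in call_starts)
calls_per_day = Counter(bucket(ts, "day") for ts in call_starts)
```

The same record lands in one bucket per granularity, so hourly counts can be rolled up into daily and weekly ones.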
- 9. Kafka
Pros
• High-throughput distributed messaging
• Automatic recovery from broker failures
• Decouples data pipelines
• Handles massive data load
• Data distribution and partitioning across nodes
• Distributed log implementation
Cons
• ZooKeeper, support/monitoring tools
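To illustrate data distribution and partitioning, here is a minimal sketch of key-based partitioning over append-only logs. It uses no Kafka client; `NUM_PARTITIONS`, the record keys, and the CRC32 hash are assumptions chosen for illustration (Kafka's own clients use a different hash).

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count for the topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition with a stable hash.

    CRC32 is a stand-in; it is not the hash Kafka's clients use.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def distribute(records, num_partitions: int = NUM_PARTITIONS):
    """Append each (key, value) record to the log of its partition."""
    logs = defaultdict(list)
    for key, value in records:
        logs[partition_for(key, num_partitions)].append((key, value))
    return logs


records = [
    ("customer-a", "call-1"),
    ("customer-b", "call-2"),
    ("customer-a", "call-3"),
]
logs = distribute(records)
```

Because the hash is stable, all records for one key land in the same partition and keep their relative order there.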
- 11. Apache Spark Programming Model
• RDD (Resilient Distributed Dataset): a collection of objects stored in memory or on disk across the cluster
• RDDs have actions and transformations
• All transformations are lazy; once an action is called, Spark creates a DAG (Directed Acyclic Graph) and submits it to the scheduler
• The task scheduler launches tasks via the cluster manager (Spark Standalone, YARN, Mesos)
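The lazy-transformation idea can be shown without Spark. The toy class below is an assumption made for illustration, not Spark's API: it only records transformations and runs them when an action (`collect`) is called. It keeps a linear chain of steps, whereas Spark builds a full DAG.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded lineage of transformations

    def map(self, fn):
        # Transformation: returns a new dataset, executes nothing.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new dataset, executes nothing.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        """Action: only now is the recorded chain actually executed."""
        items = iter(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                items = map(fn, items)
            else:
                items = filter(fn, items)
        return list(items)


# Durations in seconds, converted to minutes, keeping calls of a minute or more.
durations = LazyDataset([30, 120, 600]).map(lambda s: s / 60).filter(lambda m: m >= 1)
result = durations.collect()
```

Nothing is computed when `map` and `filter` are chained; the work happens inside `collect`.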
- 13. Spark Cassandra Integration
[Diagram: App, Spark Master (JVM), and Nodes 1 to N, each running a Spark Worker (JVM) with an Executor alongside Cassandra]
- 14. Cassandra Data Modeling
CDR Use Cases
• Internal metrics/aggregates across all customers
• Historical and real-time analytics (per user, date)
• Metrics (ASR, ACD, MOU, etc.) for customers and dashboards
• Customer Insights
• Access to FreeSWITCH raw CDRs for troubleshooting
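As a sketch of the per-user, per-date metrics mentioned above, the code below aggregates ASR (answer-seizure ratio), ACD (average call duration), and MOU (minutes of use) from CDR-like records. The field names (`user`, `date`, `answered`, `duration_s`) and the sample rows are made up for illustration; the real CDR schema is not given in the slides.

```python
from collections import defaultdict

# Hypothetical CDR fields: user, date, answered (bool), duration_s (seconds).
cdrs = [
    {"user": "u1", "date": "2017-01-01", "answered": True, "duration_s": 120},
    {"user": "u1", "date": "2017-01-01", "answered": False, "duration_s": 0},
    {"user": "u1", "date": "2017-01-01", "answered": True, "duration_s": 60},
    {"user": "u2", "date": "2017-01-01", "answered": True, "duration_s": 300},
]


def call_metrics(records):
    """Aggregate ASR, ACD and MOU per (user, date) key."""
    totals = defaultdict(lambda: {"attempts": 0, "answered": 0, "seconds": 0})
    for r in records:
        t = totals[(r["user"], r["date"])]
        t["attempts"] += 1
        if r["answered"]:
            t["answered"] += 1
            t["seconds"] += r["duration_s"]
    metrics = {}
    for key, t in totals.items():
        metrics[key] = {
            "asr": t["answered"] / t["attempts"],  # answered / attempted
            "acd_s": t["seconds"] / t["answered"] if t["answered"] else 0.0,
            "mou": t["seconds"] / 60,  # minutes of answered traffic
        }
    return metrics


metrics = call_metrics(cdrs)
```

Grouping by `(user, date)` mirrors the "per user, date" access pattern, which is also a natural candidate for a table's partition key.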
- 15. Distributed System Challenges
• Idempotency: helps with scale, greatly simplifies processing
• Partitioning: split data to handle scale and isolate failure
• Consistency model: trade-off between throughput and consistency
• Denormalization/duplication: sometimes data redundancy is good
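One way to see why idempotency simplifies processing: if every write is an upsert keyed by a unique call ID, a redelivered batch leaves the final state unchanged. The in-memory `CallStore` and the `call_id` field below are hypothetical stand-ins for a keyed table, not the presenters' schema.

```python
class CallStore:
    """In-memory stand-in for a keyed table; writes are upserts by call_id."""

    def __init__(self):
        self.rows = {}

    def upsert(self, cdr):
        # Writing the same record twice leaves the same final state.
        self.rows[cdr["call_id"]] = cdr

    def total_seconds(self):
        return sum(row["duration_s"] for row in self.rows.values())


store = CallStore()
batch = [
    {"call_id": "c1", "duration_s": 30},
    {"call_id": "c2", "duration_s": 90},
]
for cdr in batch + batch:  # simulate the same batch being delivered twice
    store.upsert(cdr)
```

Had the store appended rows or incremented a counter instead, the replay would have doubled the total.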
- 20. Steps of Fraud Detection
1. Collect the data
a. Time series
2. Process the data
a. Asynchronous
b. Scale horizontally
3. Detect anomalies
a. Static
b. Dynamic
4. Alert
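The slides do not define "static" and "dynamic" detection. One common reading, assumed here, is a fixed limit versus a threshold derived from recent history. The sketch below compares the two on a made-up calls-per-minute series; the window size and the multiplier `k` are illustrative choices.

```python
from statistics import mean, stdev


def static_anomalies(series, limit):
    """Flag indexes whose value exceeds a fixed limit."""
    return [i for i, v in enumerate(series) if v > limit]


def dynamic_anomalies(series, window=5, k=3.0):
    """Flag indexes whose value exceeds mean + k * stdev of the previous window."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        threshold = mean(history) + k * stdev(history)
        if series[i] > threshold:
            flagged.append(i)
    return flagged


# Made-up time series: calls per minute for one account.
calls_per_minute = [10, 12, 11, 9, 10, 11, 60, 12, 10]
static_hits = static_anomalies(calls_per_minute, limit=50)
dynamic_hits = dynamic_anomalies(calls_per_minute)
```

The static rule needs a limit chosen in advance, while the dynamic rule adapts to each series at the cost of needing a warm-up window.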
- 21. Process the Data
How to handle huge datasets without sacrificing speed or quality?
Golang + Worker Pools
+ Asynchronous
Telegraf + InfluxDB
+ Grafana
Open Source Proprietary
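The slide names Golang worker pools; the sketch below shows the same pattern in Python with a fixed-size thread pool, purely as an illustration. `process_cdr`, the record fields, and the pool size are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def process_cdr(cdr):
    """Hypothetical per-record work: convert a duration to minutes."""
    return cdr["call_id"], cdr["duration_s"] / 60


def process_all(cdrs, workers=4):
    """Fan records out to a fixed pool of workers and gather the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_cdr, cdrs))


cdrs = [{"call_id": f"c{i}", "duration_s": 60 * i} for i in range(1, 6)]
minutes = process_all(cdrs)
```

A fixed pool bounds concurrency, so a burst of input queues up instead of spawning unbounded work.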
- 24. Q & A
Presented by
Eugene Shulga, Platform Engineer
Elana Woldenberg, Platform Engineer