Freddie Mac & KPMG Case Study – Advanced Machine Learning Data Integration with Common Data Framework (Model Robot)

Advanced Machine Learning Data Integration with Common Data Framework (Model Robot)
June 20, 2018
Presenters:
Kevin Martelli (KPMG), Managing Director - Data and Analytics
Balaji Wooputer (Freddie Mac), Director – Risk Analytics

© Freddie Mac 2
Freddie Mac
• Freddie Mac makes home possible for millions of families and individuals by providing mortgage capital to lenders.
• Since our creation in 1970, we've made housing more accessible and affordable for homebuyers and renters in communities nationwide.
• We are building a better housing finance system for homebuyers, renters, lenders, and taxpayers.

Recap: Challenge & Objective
Challenge: Quickly integrating multiple variations of vendors' semi-structured and structured loan-level data in order to make quicker and better business decisions to Re-Imagine the Mortgage Experience.
Objective: Design, develop and implement a self-learning (AI), highly flexible common data engineering framework to automate the design of the entire data munging process.
• Reusability
• Extensibility
• Reduce Development Cost
• Speed to Market
Success = Automating the design for data munging and integration in a scalable way, while reducing the time to implement a data application by 50% to 70%, thereby allowing data scientists and business analysts to easily access data.
2017 link: https://www.youtube.com/watch?v=ct6gydYAQr4

Common Data Engineering Framework (CDF)
Data Enablement & Profiling Framework
Diagram layers: Data Sources, Data Integration, Integration and Execution, Information Access, User Engagement
Data sources (structured, semi-structured and unstructured):
- Enterprise Loan Application Data Sources
- Metadata, XSD, Business Requirement
- Other Data Sources: Third-Party Data, Other Business Factors
- Shared Operational Data: Operational Data Store, Data Mart, EDW
- Vendor Provided Loan Application XML, JSON
Diagram components: Raw Data Repository, Data Exploration & Landing, Data Dictionary, Data Model, Profiling, Data Preparation, ERD, ODI, Job Automation, Execution, Enablement, Execution Logs, Analytics Tools, Actian Matrix, Hadoop Tools, Web Services, Guided Reporting and Portfolio Analytics, Customized Dashboard, Consumption
Activities:
- Create data plan and data dictionary (Automate)
- Understand and identify data lineage
- Define business requirements & rules
- Data transformation, profiling & maturity
- Develop data work flows and schedules
- Transform data & connect database to backend platform
- Define and manage model & program execution lifecycle
- Real-time model & program execution & scoring
- Automated model learnings
- Performance and log management for models and programs
- Design and create data visualization
- Business Intelligence and reporting
- Feedback to upstream systems
- Insights realized
- Connect backend and frontend systems
- Tools for additional data processing & analytics
- Generate additional insights from enabled data
Governance: Business glossary; Data security and privacy rules; Data flows lineage, transformation rules; Data models and standards; Data quality and data profiling standards
Security, Platform Mgmt, Business Continuity: Authentication & identity mgt. (Kerberos); Data masking, encryption; Availability, Backup, Recovery; Authorization and audit (data level); Patch, upgrade, operations; Job scheduling
AUTOMATION: Part of the Common Data Engineering Framework

CDF (2017)
1. Ingest
2. Discover
3. REQs. (Domain Knowledge)
4. Transform & Models
5. Action
Diagram labels: Data Discovery, Transformation Rules, Analytical Data Model, Data Munging, Data Integration, Data Model, Analytics Modeling, Data Visualization
2017 HW Summit: https://www.youtube.com/watch?v=ct6gydYAQr4

CDF Data Model (2017)
Diagram labels: Data Discovery, Business Knowledge, Domain and Data Science SMEs, Customized Enablement Configuration, Data Model

CDF Model Development Life Cycle
Common Data Framework – Model Automation
EXTRACT, INFER, CLASSIFY
1. Data Processing
• Data parsing
• Data cleaning
2. Feature Engineering & Selection
• Exploratory data analysis
• Variable transformation
• Recursive feature elimination
• Dimension reduction
3. Model Development
• Code development
• Model training
• Hyper-parameter tuning
4. Model Selection
• Model taxonomy
• Create comparison tables
• Identify candidates
• Select champion/challenger
5. Model Validation & Testing
• Define validation and testing methods
• Error analysis
• Identify business case calculation
• Model refinement
6. Model Deployment
• Finalize production code
• Develop the model documentation
• Develop model monitoring plan
• Develop the action plan
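The Model Selection activities (create comparison tables, identify candidates, select champion/challenger) can be sketched as a simple ranking. This is a minimal illustration, not the framework's actual code: the function name, the `validation_accuracy` metric, and the candidate names and scores are all hypothetical placeholders.

```python
def select_champion_challenger(comparison_table, metric="validation_accuracy"):
    """Rank candidate models by one metric; return (champion, challenger) names."""
    if len(comparison_table) < 2:
        raise ValueError("need at least two candidate models")
    ranked = sorted(comparison_table, key=lambda row: row[metric], reverse=True)
    return ranked[0]["model"], ranked[1]["model"]


# Placeholder scores for illustration only; they are not results from the deck.
comparison_table = [
    {"model": "candidate_a", "validation_accuracy": 0.61},
    {"model": "candidate_b", "validation_accuracy": 0.66},
    {"model": "candidate_c", "validation_accuracy": 0.58},
]

champion, challenger = select_champion_challenger(comparison_table)
print(champion, challenger)
```

In practice the comparison table would likely carry several metrics per candidate, and the ranking rule would be a business decision rather than a single sort key.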

Model Framework Architecture
Diagram components: Data Platform, Model Framework, Model Development, Model Warehouse, Batch Process, Streaming, Insight Service as API
Model Management Considerations:
• Reusability and Model Build
• Model Versions
• Evaluation Framework (A/B testing)
• Feature Extraction
• Internal/External API
• Message Broker

CDF Next Gen – Model Robot
Model Robot is a “Fast, Flexible Autogenerated Data Model”
Diagram labels: Idea, Strategy, Customer
• Automatically discovers the semantic layer from data discovery and generates a data model through a machine-learning-driven approach, thereby significantly reducing the need for manual data modeling.
• Enhances old-fashioned, rigid, pre-modeled and pre-configured data models into flexible, adaptive data entities.
• Accelerates data integration, resulting in faster data insights for users.

Data Modeling and Analytics Package
• Spark MLlib (Naïve Bayes + Random Forest)
• Scikit-learn
• NumPy, SciPy
• Pandas
NLP Package
• Gensim
• NLTK
Pipeline diagram
Inputs: Vendor Input JSON, Vendor Input XML, Existing Data Model, User Input Model Configuration, Target Label For Training
Steps: Data Profiling and Discovery, Feature Engineer, Model Training, Model Ensemble, Model Prediction
Other labels: Classification Engine, Training/Prediction, Metric, Persist Model, HDP-HDFS
Output: Model Prediction Output (Data Model)
Model Training
1. Based on "Learning Source Descriptions for Data Integration" (Doan, Domingos and Levy, 2000)
2. Leverage the model framework
3. Use Vendor data for training
4. 3 Core Features: Numerical, Text, Relationship
Reference: Doan AH, Domingos P, Levy A (2000) Learning source descriptions for data integration. In: Proc WebDB Workshop, pp. 81–92
Model:
• Training data: 25k JSON/XML files, 300+ attributes (e.g. 6,000,000 numerical values and 9,000 text values)
• Testing data: 1k JSON/XML files, 300+ attributes
• Training vs. validation data split: 80% vs. 20%
• Prediction accuracy: 69%
• Mapped 22 target labels

CDF Next Gen (2018) – Model Robot
Continuous Learning loop (diagram labels): Data Discovery, Business Knowledge, Feature Engineer, Training, Prediction (Spark ML), Predicted Data Model, SME Confirmation and Update
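The continuous-learning loop (prediction, SME confirmation and update, retraining) can be sketched as follows. This is an illustrative assumption about how confirmed mappings might flow back into training: both function names are hypothetical, and the wrong initial prediction for `AssetType` is invented to show a correction.

```python
def apply_sme_review(predicted, sme_updates):
    """Return the confirmed mapping: SME updates override model predictions."""
    confirmed = dict(predicted)
    confirmed.update(sme_updates)
    return confirmed


def extend_training_set(training_rows, confirmed):
    """Append confirmed (source attribute, target label) pairs for the next training run."""
    return training_rows + sorted(confirmed.items())


# Hypothetical model output: the second mapping is deliberately wrong.
predicted = {
    "AvailableBalance": "AssetAccountBalance",
    "AssetType": "AssetAccountBalance",
}
# The SME corrects only the mapping that is wrong.
sme_updates = {"AssetType": "AccountType"}

confirmed = apply_sme_review(predicted, sme_updates)
training_rows = extend_training_set([], confirmed)
print(training_rows)
```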

CDF Model Robot Design (High-level)
Data Discovery
Feature Extraction
• Numeric-based features: min, max, mean, median...
• Text-based features: TF-IDF, POS tagging, NER tagging
• Relationship-based features: depth, number of neighbors, XPath
Modeling
• Random Forest for numeric features
• Naïve Bayes for text features
• Cosine similarity for relationships
Model Ensemble
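A minimal sketch of the numeric-based and relationship-based features, using only the Python standard library. The function names and the sample XML are hypothetical, and "number of neighbors" is read here as the number of sibling elements, which is an assumption; the deck does not define it.

```python
import statistics
import xml.etree.ElementTree as ET


def numeric_features(values):
    """Summary statistics for one attribute's numeric values."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def relationship_features(root):
    """Depth and sibling count for every element path in an XML tree.

    Elements that share a path overwrite each other, which is acceptable
    for a sketch because their depth is identical.
    """
    features = {}

    def walk(node, path, depth, siblings):
        features[path] = {"depth": depth, "neighbors": siblings}
        children = list(node)
        for child in children:
            walk(child, f"{path}/{child.tag}", depth + 1, len(children) - 1)

    walk(root, f"/{root.tag}", 0, 0)
    return features


num = numeric_features([5400, 6000, 3000, 1500])
root = ET.fromstring(
    "<Loan><Asset><AvailableBalance>5400</AvailableBalance>"
    "<AssetType>MNMT</AssetType></Asset></Loan>"
)
rel = relationship_features(root)
print(num)
print(rel["/Loan/Asset/AvailableBalance"])
```

The dictionary keys in `rel` double as a simple XPath-like feature for each element.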

CDF Autogenerated Data Model Prediction
Steps: Data Discovery, Feature Extraction, Prediction, SME Review and Update
New data for the same business area:
Path        Name              Values
/path/to/a  AvailableBalance  [5400, 6000, 3000, 1500, …]
/path/to/b  AssetType         [MNMT, …]
Feature extraction and model by feature type:
• Numeric (Min, Max, Mean, STD, Percentiles, …): Random Forest
• Free-form text (Tokenize, TF-IDF, …): Naïve Bayes
• Path+Name (Tokenize, TF-IDF, …): Cosine similarity
Prediction:
Source            Predicted
AvailableBalance  AssetAccountBalance
AssetType         AccountType
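The Path+Name branch (tokenize, TF-IDF, cosine similarity) can be sketched in plain Python. This covers only the cosine-similarity part, not the Random Forest or Naïve Bayes models or the ensemble, and it is not the deck's implementation. The function names are hypothetical, only the two target labels shown on the slide are used, and the smoothed IDF formula is an assumed choice.

```python
import math
import re
from collections import Counter


def tokenize(name):
    """Split a camel-case attribute name into lowercase word tokens."""
    return [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)]


def idf_weights(documents):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(documents)
    df = Counter(token for doc in documents for token in set(doc))
    return {token: math.log((1 + n) / (1 + count)) + 1 for token, count in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; tokens outside the vocabulary are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_label(source_name, target_labels):
    """Return the target label whose tokens are most similar to the source name."""
    docs = {label: tokenize(label) for label in target_labels}
    idf = idf_weights(list(docs.values()))
    query = tfidf_vector(tokenize(source_name), idf)
    scores = {label: cosine(query, tfidf_vector(toks, idf)) for label, toks in docs.items()}
    return max(scores, key=scores.get)


targets = ["AssetAccountBalance", "AccountType"]
predictions = {name: predict_label(name, targets) for name in ["AvailableBalance", "AssetType"]}
print(predictions)
```

With these two labels the sketch reproduces the mappings in the table above, but name similarity alone would not be expected to separate all 22 target labels, which is presumably why the design also uses the value-based models.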

CDF - Outcomes
Cost
Reusability
Extensible
Collaboration
Trust
Rules
Agility
Automation
Analytics
Data
Reconcile
Data
Data
Business Rules
New
Customer
Insights

CDF Business Value Delivered
2016
• Developed a generic data engineering framework.
• Reduced the “Data Munging” time by 50%.
• Reports available 3 to 4 months after going live.
2017
• Reduced data engineering timeline by an additional 25%.
• Report generated the next business day.
2018
• Develop a self-learning AI data model.
• Reduce data engineering and data model effort timeline (target: 25%).

CDF Conceptual Architecture
Accelerate Time To Value
Timeline:
• Automated data ingestion
• Leverage significant domain knowledge to enable (Manual)
• Use learnings to decrease reliance on domain knowledge and accelerate the Analytical Data Model (ADM)
Diagram components: ElementTree, Sqoop/JDBC, Sqoop/ODI, FlatFile, RAW, DataProcessing (PySpark), DataFrame (SparkSQL) with Partition 1 … Partition N, Resilient Distributed Dataset with Partition 1 … Partition N, PartitionedORCTables, Oozie/Ranger/Ambari, Data Profiling, Data Discovery Output, Analytics Modeling, Data model
Model pipeline: Data Profiling and Discovery, Feature Engineer, Model Training, Model Ensemble
Data Modeling and Analytics Package
• Spark MLlib (Naïve Bayes + Random Forest)
• Scikit-learn
• NumPy, SciPy
• Pandas
NLP Package
• Gensim
• NLTK
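The architecture lists ElementTree for parsing vendor XML ahead of data profiling and discovery. A minimal sketch of that discovery step follows, assuming a small made-up loan document; the `discover_paths` function and the sample XML are hypothetical and stand in for the real vendor files.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample; real vendor files are not shown in the deck.
SAMPLE_XML = """
<Loan>
  <Asset><AvailableBalance>5400</AvailableBalance><AssetType>MNMT</AssetType></Asset>
  <Asset><AvailableBalance>6000</AvailableBalance><AssetType>MNMT</AssetType></Asset>
</Loan>
"""


def discover_paths(xml_text):
    """Map each leaf element's path to the list of text values seen at that path."""
    values = defaultdict(list)

    def walk(node, path):
        children = list(node)
        if not children:
            text = (node.text or "").strip()
            if text:
                values[path].append(text)
        for child in children:
            walk(child, f"{path}/{child.tag}")

    root = ET.fromstring(xml_text.strip())
    walk(root, f"/{root.tag}")
    return dict(values)


discovered = discover_paths(SAMPLE_XML)
for path, vals in discovered.items():
    print(path, vals)
```

Each path with its collected values corresponds to one row of discovery output, which the later feature-extraction step can summarize per attribute.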

CDF Autogenerated Data Model Training
• Training data:
  » 27,467 JSON/XML files
    – 338 attributes
    – 6,000,000 numerical values
    – 9,000 text values
  » 22 target labels
  » Training vs. validation data split: 80% vs. 20%
• Testing data:
  » 1,150 JSON/XML files
• Overall prediction accuracy:
  » 69%
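The 80%/20% training vs. validation split and the accuracy metric can be sketched as below. The shuffling scheme, the seed, and the function names are assumptions, and the file names are placeholders; only the file count and the split ratio come from the slide.

```python
import random


def train_validation_split(items, train_fraction=0.8, seed=0):
    """Shuffle deterministically, then cut into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true target label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Placeholder file names; 27,467 is the training file count from the slide.
files = [f"file_{i}" for i in range(27467)]
train, validation = train_validation_split(files)
print(len(train), len(validation))

# Toy labels to show the metric: two of three predictions are correct.
toy_accuracy = accuracy(["A", "B", "C"], ["A", "B", "D"])
print(toy_accuracy)
```

The separate 1,150 testing files would be scored with the same `accuracy` function after training, which is how an overall figure such as the reported 69% would be computed.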
