Freddie Mac & KPMG Case Study – Advanced Machine Learning Data Integration with Common Data Framework (Model Robot)
- 1. Advanced Machine Learning Data
Integration with Common Data
Framework (Model Robot)
June 20, 2018
Presenters:
Kevin Martelli (KPMG)
Managing Director - Data and Analytics
Balaji Wooputur (Freddie Mac)
Director – Risk Analytics
- 2. © Freddie Mac 2
Freddie Mac makes home possible for
millions of families and individuals by
providing mortgage capital to lenders.
Since our creation in 1970, we've made
housing more accessible and affordable
for homebuyers and renters in
communities nationwide.
We are building a better housing finance
system for homebuyers, renters, lenders,
and taxpayers.
Freddie Mac
- 3.
Objective: Design, develop and implement a self-learning (AI), highly flexible common
data engineering framework to automate the design of the entire data munging process.
Recap: Challenge & Objective
Reusability
Extensibility
Reduce
Development
Cost
Speed
to Market
Success = Automating the design for data munging and integration in a scalable way,
while reducing the time to implement a data application by 50% to 70%, thereby
allowing data scientists and business analysts to easily access data.
2017: https://www.youtube.com/watch?v=ct6gydYAQr4
Challenge: Quickly integrating multiple variations of vendors’ semi-structured and
structured loan-level data in order to make quicker and better business decisions to
Re-Imagine the Mortgage Experience.
- 4.
Data Enablement & Profiling Framework
Common Data Engineering Framework (CDF)
Data Enablement & Profiling Framework
Data Sources Data Integration Integration and Execution Information Access User Engagement
Data Preparation
ERD
Execution
Information Access
Analytics Tools
Actian Matrix
Hadoop Tools
Web Services
User Engagement
Guided Reporting
and Portfolio
Analytics
Customized
Dashboard
Consumption
Governance Business glossary
Data security and privacy
rules
Data flows lineage,
transformation rules
Data models and standards
Data quality and data
profiling standards
Authentication & identity
mgt. (Kerberos)
Data masking,
encryption
Availability, Backup,
Recovery
Authorization and audit
(data level)
Patch, upgrade,
operations
Job scheduling
Security, Platform Mgmt
Business Continuity
Enterprise Loan
Application Data
Sources
Metadata
XSD
Business
Requirement
Other
Data Sources
Third-Party
Data
Other Business
Factors
Shared Operational Data
Operational Data
Store
Data Mart EDW
Job Automation
Vendor
Provided
ODI
Loan Application
XML
JSON
Raw Data Repository
Data Exploration & Landing
Data Dictionary
Data Model
Profiling
Data sources
structured, semi-
structured and
unstructured.
- Create data plan and data
dictionary (Automate)
- Understand and identify data
lineage
- Define business requirements &
rules
- Data transformation, profiling &
maturity
- Develop data work flows and
schedules
- Transform data & connect
database to backend platform
- Define and manage model &
program execution lifecycle
- Real-time model & program
execution & scoring
- Model automated learnings
- Performance, log management
for models and programs.
Execution
Enablement
Execution Logs
- Design and create data
visualization
- Business Intelligence
and reporting
Feedback upstream
systems
- Insights realized
- Connect Backend and
Frontend systems
- Tools for additional data
processing & analytics
- Generate additional
insights from enabled
data.
AUTOMATION
Part of the Common Data Engineering Framework
- 5.
Data Discovery
Transformation Rules
Analytical Data Model
CDF (2017)
5. Action
1. Ingest
2. Discover
4. Transform
&
Models
3. REQs.
Domain
Knowledge
Data Munging
Data Integration
Data Model
Analytics Modeling
Data Visualization
2017 HW Summit: https://www.youtube.com/watch?v=ct6gydYAQr4
- 6.
Data
Discovery
Business
Knowledge
Domain and Data Science SMEs
Customized
Enablement
Configuration
Data Model
CDF Data Model (2017)
- 7.
CDF Model Development Life Cycle
Data
Processing
Feature
Engineering
& Selection
Model
Development
Model
Selection
Model
Validation &
Testing
Model
Deployment
• Finalize production
code
• Develop the model
documentation
• Develop model
monitoring plan
• Develop the action
plan
• Define validation and
testing methods
• Error analysis
• Identify business case
calculation
• Model refinement
• Code development
• Model training
• Hyper-parameter
tuning
• Model taxonomy
• Create comparison
tables
• Identify candidates
• Select champion/
challenger
• Exploratory data
analysis
• Variable
transformation
• Recursive feature
elimination
• Dimension
reduction
• Data parsing
• Data cleaning
EXTRACT INFER CLASSIFY
Common Data Framework – Model Automation
- 8.
Model Framework Architecture
Data Platform
Model Framework Model Development
Model Warehouse
Batch Process Streaming
Insight Service as API
Model Management Considerations:
• Reusability and Model Build
• Model Versions
• Evaluation Framework (A/B) testing
• Feature Extraction
• Internal/External API
• Message Broker
- 9.
CDF Next Gen – Model Robot
Idea
Strategy
Customer
Automatically discovers semantic layer from data
discovery and generates data model through a
machine-learning driven approach thereby
significantly reducing need for manual data
modeling.
Enhance old-fashioned, rigid, pre-modeled and
pre-configured data models into flexible,
adaptive data entities.
Accelerates data integration, resulting in faster
data insights for users.
Model Robot is “Fast, Flexible Autogenerated Data Model”
- 10.
Data Modeling and Analytics Package
• Spark MLlib (Naïve Bayes + Random
Forest)
• Scikit-learn
• NumPy, SciPy
• Pandas
NLP Package
• Gensim
• NLTK
Model
Prediction
Output (Data
Model)
HDP-HDFS
User Input Model
Configuration
Vendor
Input
JSON
Vendor
Input
XML
Existing
Data
Model
Training/Prediction
Target Label
For Training
Data
Profiling
and
Discovery
Feature
Engineer
Model
Training
Model
Ensemble
Model
Prediction
Metric
Persist
Model
Classification
Engine
Model Training
1. Based on Learning Source Descriptions for Data
Integration
2. Leverage the model framework
3. Use Vendor data for training
4. 3 Core Features: Numerical, Text, Relationship
Doan AH, Domingos P, Levy A (2000) Learning source
descriptions for data integration. In: Proc WebDB Workshop,
pp. 81–92
Model:
Training data: 25k JSON/XML Files, 300+ attributes
Testing data: 1k JSON/XML Files, 300+ attributes
Training vs Test data split: 80% vs 20%
Prediction Accuracy: 69%
Mapped 22 Target Labels
E.g. 6,000,000 numerical values and 9,000 text values.
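The 80% vs 20% split and the accuracy figure on this slide can be illustrated with a minimal sketch. The file names, the `split_files` and `accuracy` helpers, and the seed are hypothetical; they only show the arithmetic and are not the CDF training code.

```python
import random

def split_files(files, train_fraction=0.8, seed=42):
    """Shuffle a list of file names and split it into train and held-out parts."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Fraction of attributes whose predicted target label matches the expected label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical file names standing in for vendor JSON/XML files.
files = [f"vendor_file_{i}.json" for i in range(100)]
train, held_out = split_files(files)
```

Under this definition, a prediction accuracy of 69% would mean that roughly 69 of every 100 source attributes were mapped to the expected target label.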
- 11.
CDF Next Gen (2018) – Model Robot
Data
Discovery
SME
Confirmation
and Update
Predicted Data
Model
Continuous Learning
Feature
Engineer Training
Prediction
Spark ML
Business
Knowledge
- 12.
Feature
Extraction
Numeric-based
features
• Min, max, mean,
median...
Text-based features
• TF-IDF
• POS tagging
• NER tagging
Relationship-based
features
• Depth
• Number of neighbors
• XPath
Modeling
Random Forest for
numeric
Naïve-Bayes for text
features
Cosine similarity for
relationships
Data
Discovery
Model
Ensemble
CDF Model Robot Design (High-level)
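The relationship-based features listed on this slide (depth, number of neighbors, XPath) can be sketched with Python's standard `xml.etree.ElementTree`. The sample payload, its tag names, and the `relationship_features` helper are hypothetical, and "neighbors" is assumed here to mean sibling elements under the same parent; the slide does not define it.

```python
import xml.etree.ElementTree as ET

def relationship_features(xml_text):
    """Return {path: {"depth": int, "neighbors": int}} for every element."""
    root = ET.fromstring(xml_text)
    features = {}

    def walk(node, path, depth, sibling_count):
        features[path] = {"depth": depth, "neighbors": sibling_count}
        children = list(node)
        for child in children:
            # Assumption: neighbors = other elements sharing the same parent.
            walk(child, f"{path}/{child.tag}", depth + 1, len(children) - 1)

    walk(root, f"/{root.tag}", 0, 0)
    return features

# Hypothetical vendor payload; tag names are invented for illustration.
sample = """
<Loan>
  <Asset>
    <AvailableBalance>5400</AvailableBalance>
    <AssetType>MNMT</AssetType>
  </Asset>
  <Borrower/>
</Loan>
"""
feats = relationship_features(sample)
```

In this sketch repeated elements with the same path overwrite each other, which is acceptable for profiling one path per attribute but would need indexing for repeating containers.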
- 13.
CDF Autogenerated Data Model Prediction
Feature
Extraction
SME Review
and Update
Data
Discovery
New data for the same business area
/path/to/a AvailableBalance [5400, 6000, 3000, 1500, …]
/path/to/b AssetType [MNMT, …]
Numeric:
• Min, Max
• Mean, STD
• Percentiles
• …
Free-form text:
• Tokenize
• TF-IDF
• …
Path+Name:
• Tokenize
• TF-IDF
• …
Random
Forest
Naïve-
Bayes
Cosine-
similarity
Prediction
Prediction
Source Predicted
AvailableBalance AssetAccountBalance
AssetType AccountType
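The flow on this slide can be approximated in a few lines of standard-library Python. This is a simplified sketch, not the Model Robot itself: it computes numeric features for the `AvailableBalance` values shown above and matches attribute names to target labels with a plain token-count cosine similarity, which stands in for the TF-IDF, Random Forest and Naïve Bayes steps. The helper names and the two-label target list are assumptions for illustration; the real model mapped 22 target labels.

```python
import math
import re
import statistics
from collections import Counter

def numeric_features(values):
    """Summary statistics used as numeric features for one attribute."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),
    }

def tokenize(name):
    """Split a path or CamelCase attribute name into lowercase tokens."""
    pattern = r"[A-Z][a-z]+|[a-z]+|[A-Z]+(?![a-z])|\d+"
    return [t.lower() for t in re.findall(pattern, name)]

def cosine(a, b):
    """Cosine similarity between two token lists using raw token counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0

def best_label(source_name, target_labels):
    """Pick the target label whose tokens are most similar to the source name."""
    src = tokenize(source_name)
    return max(target_labels, key=lambda label: cosine(src, tokenize(label)))

# Only the two target labels shown on the slide are used here.
targets = ["AssetAccountBalance", "AccountType"]
balance_stats = numeric_features([5400, 6000, 3000, 1500])
```

With these two candidate labels, `best_label("AvailableBalance", targets)` selects `AssetAccountBalance` and `best_label("AssetType", targets)` selects `AccountType`, matching the Source/Predicted pairs on the slide. In practice an SME would still review each mapping.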
- 14.
CDF - Outcomes
Cost
Reusability
Extensible
Collaboration
Trust
Rules
Agility
Automation
Analytics
Data
Reconcile
Data
Data
Business Rules
New
Customer
Insights
- 15.
CDF Business Value Delivered
2016
2017
2018
2016
Developed a generic data
engineering framework.
Reduced the “Data Munging”
time by 50%.
Reports available 3 to 4
months after going live.
2017
Reduced data engineering
timeline by an additional
25%.
Report generated the next
business day.
2018
Develop a self-learning AI
data model.
Reduce data engineering and
data model effort timeline.
(Target 25%)
- 16.
Thank You
2018 Hortonworks Summit
Balaji Wooputur & Kevin Martelli
Jun/20/2018
- 17.
CDF Conceptual Architecture
ElementTree
Sqoop/JDBC
Data Processing (PySpark)
Oozie/Ranger/Ambari
ElementTree
Partitioned ORC Tables
Data Discovery Output
Analytics
Modeling
Sqoop/ODI
Data Profiling
and Discovery
Feature
Engineer
Model
Training
Model
Ensemble
Timeline:
• Automated data ingestion
• Leverage significant domain knowledge to enable (Manual)
• Use learnings to decrease domain knowledge and
accelerated Analytical Data Model (ADM)
RAW
Accelerate Time To Value
Data Profiling
Flat File
RAW
DataFrame (SparkSQL)
Partition 1 … Partition N
Data Modeling
and Analytics
Package
• Spark MLlib
(Naïve Bayes
+ Random
Forest)
• Scikit-learn
• NumPy, SciPy
• Pandas
NLP Package
• Gensim
• NLTK
Resilient Distributed
Dataset
Partition 1 … Partition N
Data model
- 18.
Training data:
» 27467 JSON/XML Files
– 338 attributes
– 6,000,000 numerical values
– 9,000 text values
» 22 Target Labels
» Training vs. Validation data split: 80% vs. 20%
Testing data:
» 1150 JSON/XML Files
Overall Prediction Accuracy:
» 69%
CDF Autogenerated Data Model Training
Editor's Notes
- Good afternoon, everyone. How is everyone doing?
Welcome to the Freddie Mac and KPMG case study session on Advanced ML Data Integration with Common Data Framework (Model Robot).
Glad to be here to see great ideas and innovation sessions.
This is the 2nd year in a row for Freddie Mac and KPMG to present at the HW Summit.
My name is Balaji Wooputur, and I work at Freddie Mac as Risk Analytics Director. I head the Risk Analytics team for the SF Risk division. Freddie Mac has partnered with KPMG on the HW solution for the past couple of years.
I’m here today with Kevin Martelli from KPMG. Kevin and I will co-present today’s session.
Anyone in the audience from last year’s CDF session?
I’m going to cover the following in today’s session: first, who we are; then a recap of the 2017 project objective, the 2017 “patent pending” CDF, the 2018 CDF (Model Robot), and the CDF model life cycle and model framework, which extend and reuse components from our CDF.
Let me start with who we are.
Kevin is a managing director heading KPMG’s Big Data Software Engineering Team in the KPMG COE for D&A Light. (Kevin: greetings and welcome message)
- How many of you have had a mortgage experience?
Freddie Mac was created in 1970 to expand the secondary market for mortgages in the US. Freddie Mac makes homeownership and rental housing financing more accessible and affordable.
Operating in the secondary mortgage market, we keep mortgage capital flowing by purchasing mortgage loans from lenders so they in turn can provide more loans to qualified borrowers.
Freddie Mac’s initiative to “reimagine the mortgage experience”: ways we’re putting into action the feedback, insights, and opinions to get to loan closing faster and save money.
Our mission to provide liquidity, stability, and affordability to the U.S. housing market in all economic conditions extends to all communities from coast to coast.
Mortgage loan manufacturing consists of loan origination, loan closing and servicing the loan after purchase.
Freddie Mac pools loans, then securitizes and sells them as MBS (mortgage-backed securities) to global investors.
Now, I’ll hand over to Kevin.
- Kevin
Biggest challenge: understanding and processing datasets that come from a variety of vendors and are not very standardized.
Time-consuming to understand the datasets.
Do many of the people in the audience have the same problems?
60% of time is spent cleaning, organizing and collecting data (the least enjoyable part).
To resolve this challenge, KPMG and Freddie Mac have been working on a program over the last couple of years. First we focused on the foundation, which automated processes but also allowed us to obtain data sets that could then be used to train the models to more fully automate the process.
Framework built on 4 core principles.
- Kevin
This is a busy slide, and we will not spend the time to review all aspects.
The idea is to show a conceptual flow of the complexity of producing and consuming an insight.
There is the standard data flow of identification of data sources, ingestion…
And then there are all the supporting processes: quality, security, lineage, etc.
In a perfect world all these processes work together perfectly but we all know that is not the case.
- Help compensate for deficiencies found in other areas
- The CDF is broken down into three main components. We discussed these in detail during the HW Summit last year. I want to provide a quick overview, as it is important to understand the foundation before we discuss the intelligent processing that was added.
The initial framework had 3 main components that align to the model above: Data Discovery, Transformation or Business Rules, and Analytical Model.
In Data Discovery, the program would automatically ingest semi-structured data from the vendors (mainly JSON and XML) and produce insights into the data. It would provide sample data values; min, max and mean of values; nulls; outliers; where a value fell in the object definition; etc. The Data Discovery output would allow domain experts to better understand the data in order to make determinations on how to link the data to the target data model.
Transformation rules are rules that business users can apply to the data (e.g. deriving new attributes, standardizing data, other data transformations).
Once the data is discovered and transformation rules are applied, data is fed into the analytical data model, which then automatically updates or creates new tables in Hive.
Although some parts are automated, this is still a human-intensive process; hence the need to add more intelligence to the framework.
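The kind of Data Discovery output described in this note (sample values, min, max, mean, nulls, position in the object) can be sketched as follows. This is a minimal illustration with the standard library only, assuming a small hypothetical JSON payload; the `flatten` and `profile` helpers and the field names are invented for the example and are not the CDF implementation, which the slides describe as running on PySpark.

```python
import json
import statistics
from collections import defaultdict

def flatten(obj, path=""):
    """Yield (path, value) pairs for every leaf in a nested JSON object."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{path}/{key}")
    elif isinstance(obj, list):
        for item in obj:
            yield from flatten(item, path)
    else:
        yield path, obj

def profile(records):
    """Build a per-path profile: count, nulls, sample values, numeric stats."""
    columns = defaultdict(list)
    for record in records:
        for path, value in flatten(record):
            columns[path].append(value)
    report = {}
    for path, values in columns.items():
        numbers = [
            v for v in values
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        entry = {
            "count": len(values),
            "nulls": sum(v is None for v in values),
            "sample": values[:3],
        }
        if numbers:
            entry.update(
                min=min(numbers), max=max(numbers), mean=statistics.mean(numbers)
            )
        report[path] = entry
    return report

# Hypothetical vendor records; field names are illustrative only.
raw = (
    '[{"asset": {"AvailableBalance": 5400, "AssetType": "MNMT"}},'
    ' {"asset": {"AvailableBalance": null, "AssetType": "MNMT"}},'
    ' {"asset": {"AvailableBalance": 3000}}]'
)
report = profile(json.loads(raw))
```

A domain expert could scan such a report per path to decide how each source attribute links to the target data model.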
- This slide shows the overall flow of the CDF. We wanted to highlight the middle section, where a lot of manual effort and time was required from domain experts and SMEs in order to produce a usable data model.
SMEs and domain experts would leverage the Data Discovery output and perform mapping based on it as well as on their own knowledge. This also meant finding out who generated the file, and a lot of communication.
For 1 or 2 data sources this is OK, but as sources increase a lot of time is spent, and the manual effort is risky and error-prone.
As a result, the team wanted to further automate these processes; hence the Model Robot. The idea of the Model Robot was to automate these human-intensive processes (through past learnings) while still keeping the human in the loop, but to a lesser extent: more for validation than for creation, in order to accelerate the time from ingestion to realized business value.
22 attributes
- As we started to add the intelligence to the framework, we followed a specific, defined model development framework:
The lifecycle is broken up into 6 main stages.
Data Processing: put the data into a format in which we can better understand it.
Feature Selection (research paper): an important part of the overall lifecycle of the framework, answering which features we want to leverage. The data science team was able to leverage some standard practices, such as dimension reduction, variable transformation, etc.
Model selection is important because selecting the wrong model can waste time as you try to leverage and refine the model for accuracy. Having a larger team to leverage helps to identify models that have been successful in the past on similar data sets and problems.
Model Test and Validation?
Once everything is completed you need to deploy, which is not always easy on a Hadoop ecosystem. We will get more into that on the next slide.
At the bottom we have a small workflow of the model components. We built them as components to enable reusability.
- Model Management Native Support in Hadoop….
Once we had built the model, we needed a manageable way to deploy and leverage it within the Hadoop ecosystem. There is not a straightforward way to accomplish this task. Other analytical packages, such as SAS, have applications to help manage this.
If you are using native Hadoop, how do you manage the models that you deploy?
How do you track the version?
How do you do A/B testing?
How do you execute the model?
How do you stream and run in batch?
- Balaji
Data Insight (Identify noise, separate noise data, leverage business context of the data and dynamic modeling)
Semantic Layer – Vendor metadata is not standardized. Vendor datasets are semi-structured data (key-value pairs – JSON and dynamic XML)
Data Veracity – SPOT (Single Point of Truth) with data governance emerged. Data management principles (metadata standardization, enterprise naming standards, data types, etc.)
Data Model – Semi-structured datasets have to be turned into a confirmed data model that meets the organization’s data model standards and is applied to the MPP reporting platform
- We are living in the “world of information”.
Decompose the information: identify noise in the data, and separate the actual meaningful data from the noise.
Bottom left – “Reference Doan AH”
Bottom right – Walk through sample data with sizing, etc.
Evaluating meaningful data with domain features of the data
Predicting the model with the existing data model
Bringing the human into the loop to verify/validate the predicted model
- Balaji – Training outputs provided to SMEs and domain experts.
Transforming raw data into features that represent it better,
Identifying factors and attributes useful for modeling
Training on the data outputs a predicted model (human in the loop) for continuous learning
Let me dive deep into the Model Robot design and flow
- JSON, XML (dynamic containers), key-value pairs
- Data in semi-structured and schema-less formats (run Data Discovery); output metadata and profiled data (ranges, types, etc.)
- Balaji – Feature extraction
Numeric features
min, max, mean, median…
Text features
TF-IDF, POS tagging, NER tagging…
Attribute Names
Relationship-based
XPath
Depth
Number of neighbors
Etc
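Of the text features listed in this note, TF-IDF is the easiest to show in isolation. The sketch below uses the textbook weighting tf × log(N / df) with only the standard library; the `tf_idf` helper and the sample strings are hypothetical. Library implementations such as those in scikit-learn or Spark MLlib typically add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {token: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weights

# Hypothetical free-form text values for three attributes.
docs = ["checking account balance", "savings account balance", "property type code"]
weights = tf_idf(docs)
```

Tokens that appear in fewer documents (such as "checking") receive a higher weight than tokens shared across documents (such as "account"), which is what makes the feature useful for telling attributes apart.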
- Balaji – Prediction Model
This is the moment you all are waiting for. CDF AI (Model Robot) here..
- To summarize CDF outcomes:
Collaboration, trust, automation, analytics and data ready…
CDF is a one-stop shop for ingesting data of any type/format, with Data Discovery, Data Model and Data Engineering.
We are proud to say today that our risk analysis is equipped with “Intraday and Day 1 data insights”.
CDF framework core components are reused and extended (Data Model), reducing cost for Data Integration/Engineering.
On the maturity model we are at 4. We will fine-tune ourselves to complete 4 and go to 5.
- Business value delivered
Using the generic data engineering framework approach for our next product offering in 2016, we reduced “Data Munging” time by 50% using automation, enabling analysts to generate reports within a month of release.
Our subsequent product launch in 2017 reduced the data engineering timeline by an additional 25%, allowing reports to be generated and reviewed the next business day.
Share our 2017 success story: the business outcome of addressing loan risk and providing actionable feedback to customers on the loan origination process. Data insights on loan quality.
Automated Collateral Evaluation (ACE)
Get to closing faster – no need for a traditional appraisal
Save money – no appraisal fee
Immediate certainty – automatically eligible for collateral rep and warranty relief
- Flat files are loaded onto a shared drive and manually uploaded to HDFS
Nothing like XSD files.