Advanced Machine Learning Data Integration with Common Data Framework (Model Robot)
June 20, 2018
Presenters:
Kevin Martelli (KPMG), Managing Director - Data and Analytics
Balaji Wooputur (Freddie Mac), Director - Risk Analytics
© Freddie Mac 2
Freddie Mac
 Freddie Mac makes home possible for millions of families and individuals by providing mortgage capital to lenders.
 Since our creation in 1970, we've made housing more accessible and affordable for homebuyers and renters in communities nationwide.
 We are building a better housing finance system for homebuyers, renters, lenders, and taxpayers.
© Freddie Mac 3
Recap: Challenge & Objective
Challenge: Quickly integrate multiple variations of vendors' semi-structured and structured loan-level data in order to make quicker, better business decisions and Re-Imagine the Mortgage Experience.
Objective: Design, develop and implement a self-learning (AI), highly flexible common data engineering framework to automate the design of the entire data munging process.
Core principles: Reusability, Extensibility, Reduced Development Cost, Speed to Market
Success = Automating the design for data munging and integration in a scalable way, while reducing the time to implement a data application by 50% to 70%, thereby allowing data scientists and business analysts to easily access data.
2017 HW Summit: https://www.youtube.com/watch?v=ct6gydYAQr4
© Freddie Mac 4
Common Data Engineering Framework (CDF)
 Data Enablement & Profiling Framework
Flow: Data Sources → Data Integration → Integration and Execution → Information Access → User Engagement

Data Sources
- Enterprise loan application data sources: metadata, XSD, business requirements
- Loan application data (vendor provided, ODI): XML, JSON
- Other data sources: third-party data, other business factors
- Sources may be structured, semi-structured or unstructured

Data Integration (Data Preparation)
- Raw data repository: data exploration & landing, data dictionary, data model, profiling, ERD, job automation
- Create the data plan and data dictionary (automated)
- Understand and identify data lineage
- Define business requirements & rules
- Data transformation, profiling & maturity
- Develop data workflows and schedules
- Transform data & connect the database to the backend platform

Integration and Execution
- Execution enablement; execution logs
- Shared operational data: operational data store, data mart, EDW
- Define and manage the model & program execution lifecycle
- Real-time model & program execution & scoring
- Automated model learnings
- Performance and log management for models and programs

Information Access
- Analytics tools, Actian Matrix, Hadoop tools, web services
- Connect backend and frontend systems
- Tools for additional data processing & analytics
- Generate additional insights from enabled data

User Engagement
- Guided reporting and portfolio analytics, customized dashboards, consumption
- Design and create data visualizations
- Business intelligence and reporting
- Feedback to upstream systems; insights realized

Governance
- Business glossary
- Data security and privacy rules
- Data flow lineage, transformation rules
- Data models and standards
- Data quality and data profiling standards

Security, Platform Mgmt, Business Continuity
- Authentication & identity mgt. (Kerberos)
- Data masking, encryption
- Availability, backup, recovery
- Authorization and audit (data level)
- Patch, upgrade, operations
- Job scheduling

AUTOMATION spans the framework end to end.
Part of the Common Data Engineering Framework
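The "create the data plan and data dictionary (automated)" step above is, in essence, per-attribute profiling. A minimal sketch of what that step might produce, assuming pandas in place of the production PySpark jobs; the record layout and attribute names are hypothetical, not taken from the deck:

```python
import pandas as pd

# Minimal sketch: profile one vendor JSON record set into data-dictionary rows
# (attribute name, inferred type, null count, min/max/mean, sample values).
def profile_attributes(records: list) -> pd.DataFrame:
    df = pd.json_normalize(records)  # flatten nested JSON into columns
    rows = []
    for col in df.columns:
        series = df[col]
        numeric = pd.to_numeric(series, errors="coerce")
        # Treat the attribute as numeric if most non-null values parse as numbers.
        is_numeric = numeric.notna().sum() > 0.5 * series.notna().sum()
        rows.append({
            "attribute": col,
            "inferred_type": "numeric" if is_numeric else "text",
            "nulls": int(series.isna().sum()),
            "min": numeric.min() if is_numeric else None,
            "max": numeric.max() if is_numeric else None,
            "mean": numeric.mean() if is_numeric else None,
            "samples": series.dropna().unique()[:3].tolist(),
        })
    return pd.DataFrame(rows)

# Hypothetical loan-application records, for illustration only.
records = [{"Asset": {"AvailableBalance": 5400, "AssetType": "MNMT"}},
           {"Asset": {"AvailableBalance": 3000, "AssetType": "CHKG"}}]
print(profile_attributes(records))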
© Freddie Mac 5
CDF (2017)
Cycle: 1. Ingest → 2. Discover → 3. REQs. → 4. Transform & Models → 5. Action, guided by Domain Knowledge
Components: Data Discovery, Transformation Rules, Analytical Data Model
Activities: Data Munging, Data Integration, Data Model, Analytics Modeling, Data Visualization
2017 HW Summit: https://www.youtube.com/watch?v=ct6gydYAQr4
© Freddie Mac 6
CDF Data Model (2017)
Data Discovery + Business Knowledge → Domain and Data Science SMEs → Customized Enablement Configuration → Data Model
© Freddie Mac 7
Common Data Framework – Model Automation
CDF Model Development Life Cycle
Stages: Data Processing → Feature Engineering & Selection → Model Development → Model Selection → Model Validation & Testing → Model Deployment

Data Processing
• Data parsing
• Data cleaning

Feature Engineering & Selection
• Exploratory data analysis
• Variable transformation
• Recursive feature elimination
• Dimension reduction

Model Development
• Code development
• Model training
• Hyper-parameter tuning
• Model taxonomy

Model Selection
• Create comparison tables
• Identify candidates
• Select champion/challenger

Model Validation & Testing
• Define validation and testing methods
• Error analysis
• Identify business case calculation
• Model refinement

Model Deployment
• Finalize production code
• Develop the model documentation
• Develop model monitoring plan
• Develop the action plan

Workflow components: EXTRACT → INFER → CLASSIFY
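The deck names Scikit-learn among its modeling packages (slide 10); the sketch below walks the middle stages of this lifecycle (recursive feature elimination, hyper-parameter tuning, champion/challenger selection) on synthetic data. It is a minimal illustration under those assumptions, not the production pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Synthetic data standing in for the engineered features.
X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# Feature engineering & selection: recursive feature elimination.
selector = RFE(RandomForestClassifier(random_state=0), n_features_to_select=10)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

# Model development: hyper-parameter tuning via grid search.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [50, 100], "max_depth": [5, 10]}, cv=3)
grid.fit(X_train_sel, y_train)

# Model selection: compare the tuned challenger against a baseline champion.
champion = RandomForestClassifier(random_state=0).fit(X_train_sel, y_train)
for name, model in [("champion", champion), ("challenger", grid.best_estimator_)]:
    print(name, accuracy_score(y_test, model.predict(X_test_sel)))
```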
© Freddie Mac 8
Model Framework Architecture
Components: Data Platform, Model Framework, Model Development, Model Warehouse, Batch Process, Streaming, Insight Service as API
Model Management Considerations:
• Reusability and model build
• Model versions
• Evaluation framework (A/B testing)
• Feature extraction
• Internal/external API
• Message broker
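The deck lists these considerations without describing an implementation. The sketch below is one hypothetical way to combine a versioned model warehouse with champion/challenger traffic routing; the class name, 90/10 split, and pickle-based storage are all assumptions:

```python
import hashlib
import pickle

# Hypothetical model warehouse: keeps versioned models and routes scoring
# traffic between a champion and a challenger for A/B evaluation.
class ModelWarehouse:
    def __init__(self):
        self.versions = {}   # version id -> serialized model
        self.champion = None
        self.challenger = None

    def register(self, model, version: str, as_champion: bool = False):
        self.versions[version] = pickle.dumps(model)
        if as_champion:
            self.champion = version
        else:
            self.challenger = version

    def score(self, key: str, features):
        # Deterministic 90/10 traffic split on a request key.
        bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % 10
        version = self.challenger if (bucket == 0 and self.challenger) else self.champion
        model = pickle.loads(self.versions[version])
        # Assumes a scikit-learn-style model with a predict() method.
        return version, model.predict([features])[0]
```

An "insight service as API" would then wrap score() behind a batch or streaming endpoint, with the version id logged alongside each prediction so A/B results can be attributed to the model that produced them.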
© Freddie Mac 9
CDF Next Gen – Model Robot
Idea
Strategy
Customer
 Automatically discovers the semantic layer from data discovery output and generates the data model through a machine-learning-driven approach, significantly reducing the need for manual data modeling.
 Enhances old-fashioned rigid, pre-modeled and pre-configured data models into flexible, adaptive data entities.
 Accelerates data integration, resulting in faster data insights for users.
Model Robot is a "fast, flexible, autogenerated data model."
© Freddie Mac 10
Classification Engine (HDP-HDFS)
Inputs: vendor input JSON, vendor input XML, existing data model, user input model configuration, target label for training
Pipeline (training/prediction): Data Profiling and Discovery → Feature Engineer → Model Training → Model Ensemble → Model Prediction → Metric → Persist Model
Output: Model Prediction Output (Data Model)

Data Modeling and Analytics Package:
• Spark MLlib (Naïve Bayes + Random Forest)
• Scikit-learn
• NumPy, SciPy
• Pandas

NLP Package:
• Gensim
• NLTK

Model Training:
1. Based on "Learning Source Descriptions for Data Integration"
2. Leverages the model framework
3. Uses vendor data for training
4. Three core features: numerical, text, relationship

Reference: Doan AH, Domingos P, Levy A (2000) Learning source descriptions for data integration. In: Proc. WebDB Workshop, pp. 81–92.

Model:
• Training data: 25k JSON/XML files, 300+ attributes
• Testing data: 1k JSON/XML files, 300+ attributes
• Training vs. test split: 80% vs. 20%
• Prediction accuracy: 69%
• Mapped 22 target labels
• E.g., 6,000,000 numerical values and 9,000 text values
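A hedged sketch of how training examples might be assembled from a vendor file: each leaf element becomes a candidate attribute with its path, name and value, and the existing data model supplies the target label. The sample document and label map are illustrative (consistent with the slide 13 prediction example), not the production code; ElementTree is named on the conceptual-architecture slide:

```python
import xml.etree.ElementTree as ET

# Walk a parsed XML tree and yield (xpath, attribute name, raw value)
# for every leaf element.
def leaves(elem, path=""):
    path = f"{path}/{elem.tag}"
    children = list(elem)
    if not children:
        yield path, elem.tag, (elem.text or "").strip()
    for child in children:
        yield from leaves(child, path)

# Hypothetical vendor loan-application fragment.
xml_doc = """<Loan><Asset><AvailableBalance>5400</AvailableBalance>
<AssetType>MNMT</AssetType></Asset></Loan>"""

# Target labels drawn from the existing data model (illustrative mapping).
target_labels = {"AvailableBalance": "AssetAccountBalance",
                 "AssetType": "AccountType"}

examples = [(path, name, value, target_labels.get(name))
            for path, name, value in leaves(ET.fromstring(xml_doc))]
for ex in examples:
    print(ex)
```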
© Freddie Mac 11
CDF Next Gen (2018) – Model Robot
Continuous learning loop: Data Discovery → Feature Engineer → Training (Spark ML) → Prediction → Predicted Data Model → SME Confirmation and Update (Business Knowledge) → back into Training
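A minimal sketch of the continuous-learning loop above: predictions the SMEs confirm (or correct) are appended to the training set and the model is retrained, so each review cycle shrinks the manual mapping effort. Function and parameter names are illustrative, not from the deck:

```python
# One cycle of the human-in-the-loop feedback: predict on a new batch,
# collect SME-confirmed labels, grow the training set, and retrain.
def continuous_learning_cycle(model, train_X, train_y, new_batch, sme_review):
    predictions = model.predict(new_batch)
    confirmed = sme_review(new_batch, predictions)  # SME confirms or corrects
    train_X = list(train_X) + list(new_batch)
    train_y = list(train_y) + list(confirmed)
    model.fit(train_X, train_y)                     # retrain on the union
    return model, train_X, train_y
```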
© Freddie Mac 12
CDF Model Robot Design (High-level)
Flow: Data Discovery → Feature Extraction → Modeling → Model Ensemble

Feature Extraction
• Numeric-based features: min, max, mean, median...
• Text-based features: TF-IDF, POS tagging, NER tagging
• Relationship-based features: depth, number of neighbors, XPath

Modeling
• Random Forest for numeric features
• Naïve Bayes for text features
• Cosine similarity for relationships
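The sketch below is a minimal scikit-learn rendering of this three-model design: Random Forest over numeric statistics, Naïve Bayes over TF-IDF of attribute text, and cosine similarity of path/name tokens against labeled training paths. The deck does not specify how the ensemble combines the three, so the majority vote here is an assumption:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class ModelRobotEnsemble:
    """Hypothetical ensemble of the three feature-specific models."""

    def fit(self, numeric_stats, texts, paths, labels):
        self.labels = list(labels)
        # Random Forest on numeric summary statistics (min, max, mean, ...).
        self.rf = RandomForestClassifier(random_state=0).fit(numeric_stats, labels)
        # Naive Bayes on TF-IDF of free-form attribute text.
        self.tfidf_text = TfidfVectorizer()
        self.nb = MultinomialNB().fit(self.tfidf_text.fit_transform(texts), labels)
        # Relationship branch: TF-IDF over path/name tokens, matched by cosine.
        self.tfidf_path = TfidfVectorizer(token_pattern=r"[A-Za-z]+")
        self.path_vecs = self.tfidf_path.fit_transform(paths)
        return self

    def predict(self, numeric_stats, texts, paths):
        votes = [
            self.rf.predict(numeric_stats),
            self.nb.predict(self.tfidf_text.transform(texts)),
            # Nearest labeled training path by cosine similarity.
            [self.labels[i] for i in cosine_similarity(
                self.tfidf_path.transform(paths), self.path_vecs).argmax(axis=1)],
        ]
        # Simple majority vote per attribute (ties resolved arbitrarily).
        return [max(set(col), key=list(col).count) for col in zip(*votes)]
```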
© Freddie Mac 13
CDF Autogenerated Data Model Prediction
Flow: Data Discovery → Feature Extraction → Prediction → SME Review and Update

New data for the same business area:
/path/to/a AvailableBalance [5400, 6000, 3000, 1500, …]
/path/to/b AssetType [MNMT, …]

Feature extraction:
• Numeric: min, max, mean, STD, percentiles, …
• Free-form text: tokenize, TF-IDF, …
• Path+Name: tokenize, TF-IDF, …

Models: Random Forest (numeric), Naïve Bayes (text), cosine similarity (path+name) → Prediction

Source → Predicted:
• AvailableBalance → AssetAccountBalance
• AssetType → AccountType
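A small sketch reproducing the path+name branch of this prediction: camel-case attribute names are tokenized, vectorized with TF-IDF, and each source attribute is mapped to the closest target label by cosine similarity. The tokenization scheme and the target-label list are assumptions; with these inputs the mapping matches the table above:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Split camel-case names like "AvailableBalance" into "available balance".
def camel_tokens(name: str) -> str:
    return " ".join(
        re.findall(r"[A-Z][a-z]+|[a-z]+|[A-Z]+(?![a-z])", name)).lower()

targets = ["AssetAccountBalance", "AccountType", "BorrowerName"]  # illustrative
sources = ["AvailableBalance", "AssetType"]

vec = TfidfVectorizer()
target_vecs = vec.fit_transform(camel_tokens(t) for t in targets)
source_vecs = vec.transform(camel_tokens(s) for s in sources)

# Map each source attribute to the most similar target label.
for src, sims in zip(sources, cosine_similarity(source_vecs, target_vecs)):
    print(src, "->", targets[sims.argmax()])
```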
© Freddie Mac 14
CDF - Outcomes
Cost, Reusability, Extensibility, Collaboration, Trust, Rules, Agility, Automation, Analytics, Data, Reconcile Data, Business Rules, New Customer Insights
© Freddie Mac 15
CDF Business Value Delivered
2016
• Developed a generic data engineering framework.
• Reduced "data munging" time by 50%.
• Reports available 3 to 4 months after going live.
2017
• Reduced the data engineering timeline by an additional 25%.
• Reports generated the next business day.
2018
• Develop a self-learning AI data model.
• Reduce the data engineering and data model effort timeline (target: 25%).
© Freddie Mac 16
Thank You
2018 Hortonworks Summit
Balaji Wooputur & Kevin Martelli
June 20, 2018
© Freddie Mac 17
CDF Conceptual Architecture
Ingestion: flat file / RAW sources via Sqoop/JDBC and Sqoop/ODI; XML parsed with ElementTree
Data processing (PySpark): Resilient Distributed Dataset and DataFrame (Spark SQL), Partition 1 … Partition N
Storage: RAW landing plus partitioned ORC tables
Orchestration and security: Oozie/Ranger/Ambari
Pipeline: Data Profiling and Discovery → Feature Engineer → Model Training → Model Ensemble → Data Discovery Output → Data Model → Analytics Modeling

Data Modeling and Analytics Package:
• Spark MLlib (Naïve Bayes + Random Forest)
• Scikit-learn
• NumPy, SciPy
• Pandas

NLP Package:
• Gensim
• NLTK

Timeline (accelerate time to value):
• Automated data ingestion
• Leverage significant domain knowledge to enable (manual)
• Use learnings to decrease the domain knowledge required and accelerate the Analytical Data Model (ADM)
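A hedged sketch of the raw-to-ORC leg of this diagram: vendor XML parsed with ElementTree, flattened to rows, loaded into a Spark DataFrame and written as partitioned ORC for the downstream profiling step. The sample document, output path and column names are assumptions, not taken from the deck:

```python
import xml.etree.ElementTree as ET
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdf-raw-landing").getOrCreate()

def xml_to_rows(xml_text: str, vendor: str):
    # One row per leaf element: (vendor, attribute name, raw value).
    root = ET.fromstring(xml_text)
    return [(vendor, elem.tag, (elem.text or "").strip())
            for elem in root.iter() if len(elem) == 0]

# Hypothetical vendor document.
raw = """<Loan><AvailableBalance>5400</AvailableBalance>
<AssetType>MNMT</AssetType></Loan>"""

df = spark.createDataFrame(xml_to_rows(raw, "vendor_a"),
                           ["vendor", "attribute", "value"])

# Partitioned ORC tables feed the profiling and discovery step.
df.write.mode("overwrite").partitionBy("vendor").orc("/data/cdf/raw_landing")
```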
© Freddie Mac 18
CDF Autogenerated Data Model Training
 Training data:
» 27,467 JSON/XML files
– 338 attributes
– 6,000,000 numerical values
– 9,000 text values
» 22 target labels
» Training vs. validation split: 80% vs. 20%
 Testing data:
» 1,150 JSON/XML files
 Overall prediction accuracy:
» 69%
Editor's Notes
1. Good afternoon, everyone. How is everyone doing? Welcome to the Freddie Mac and KPMG case study session on Advanced ML Data Integration with Common Data Framework (Model Robot). Glad to be here among so many great ideas and innovation sessions. This is the 2nd year in a row that Freddie Mac and KPMG have presented at the HW Summit. My name is Balaji Wooputur; I work at Freddie Mac as a Risk Analytics Director, heading the Risk Analytics team for the SF Risk division. Freddie Mac has been a partner with KPMG on the HW solution for the past couple of years. I'm here today with Kevin Martelli from KPMG; Kevin and I will co-present today's session. Anyone in the audience from last year's CDF session? I'm going to cover the following in today's session: first, who we are; a recap of the 2017 project objective; the 2017 "patent pending" CDF; the 2018 CDF (Model Robot); and the CDF model life cycle and model framework, which extend and reuse components from our CDF. Let me start with who we are. Kevin is a managing director heading KPMG's Big Data Software Engineering Team in the KPMG COE for D&A Light. ("Kevin: greetings and welcome message")
2. How many of you have had a mortgage experience? Freddie Mac was created in 1970 to expand the secondary market for mortgages in the US. Freddie Mac makes homeownership and rental housing financing more accessible and affordable. Operating in the secondary mortgage market, we keep mortgage capital flowing by purchasing mortgage loans from lenders so they in turn can provide more loans to qualified borrowers. Freddie Mac's initiative to "reimagine the mortgage experience" covers the ways we're putting feedback, insights, and opinions into action to get loans closed faster and save money. Our mission to provide liquidity, stability, and affordability to the U.S. housing market in all economic conditions extends to all communities from coast to coast. Mortgage loan manufacturing consists of loan origination, loan closing and servicing the loan after purchase. Freddie Mac pools loans, then securitizes and sells them as MBS (mortgage-backed securities) to global investors. Now, I'll hand over to Kevin.
3. Kevin. Biggest challenges: understanding and processing the datasets; a variety of vendors, not very standardized; time consuming to understand the datasets. Do many of the people in the audience have the same problems? 60% of time goes to cleaning, organizing and collecting data (the least enjoyable part). To resolve the challenge, KPMG and Freddie have been working on a program over the last couple of years. First we focused on the foundation, which automated processes but also allowed us to obtain data sets that could then be used for training the models to more fully automate the process. The framework is built on 4 core principles.
4. Kevin. This is a busy slide and we will not spend the time to review all aspects. The idea is to show a conceptual flow of the complexity of producing and consuming an insight. There is the standard data flow of identifying data sources, ingestion, and so on, and then there are all the supporting processes: quality, security, lineage, etc. In a perfect world all these processes work together perfectly, but we all know that is not the case. The framework helps compensate for deficiencies found in other areas.
5. The CDF is broken down into three main components. We discussed these in detail during the HW summit last year. I want to provide a quick overview, as it is important to understand the foundation before we discuss the intelligence processing that was added. The initial framework had 3 main components that align to the model above: Data Discovery, Transformation or Business Rules, and the Analytical Model. In Data Discovery, the program would automatically ingest semi-structured data from the vendors (mainly JSON and XML) and produce insights into the data. It would provide sample data values; min, max and mean of values; nulls; outliers; where the attribute fell in the object definition; etc. The Data Discovery output would allow domain experts to better understand the data in order to make determinations on how to link the data to the target data model. Transformation rules are rules that business users can apply to the data (i.e., transformations: derive new attributes, standardize data, data transformations, etc.). Once the data is discovered and transformation rules are applied, data is fed into the analytical data model, which then automatically updates or creates new tables in Hive. Although parts are automated, this is still a human-intensive process; hence the need to add more intelligence to the framework.
6. This slide shows the overall flow of the CDF. We wanted to highlight the middle section, where a lot of manual effort and time was required from domain experts and SMEs in order to produce a usable data model. SMEs and domain experts leverage the discovery output and perform mapping based on their knowledge, the data discovery output, who generated the file, and a lot of communication. For 1 or 2 data sources this is ok, but as sources increase a lot of time is spent and the manual effort is risky and error prone. As a result, the team wanted to further automate these processes; hence the Model Robot. The idea of Model Robot was to automate these human-intensive processes (through past learnings) while still keeping the human in the loop, but to a lesser extent, more for validation than creation, in order to accelerate the time from ingestion to realized business value. 22 attributes.
7. As we started to add the intelligence to the framework we followed a specific, defined model development framework. The lifecycle is broken up into 6 main stages. Data Processing: put the data into a format in which we can better understand it. Feature Selection (research paper): an important part of the overall lifecycle is deciding which features to leverage. The data science team was able to leverage some standard practices, such as dimension reduction, variable transformation, etc. Model selection is important because selecting the wrong model wastes time as you try to leverage and refine the model for accuracy. Having a larger team to draw on helps to identify models that have been successful in the past on similar data sets and problems. Model test and validation? Once everything is completed you need to deploy, which is not always easy on a Hadoop ecosystem. We will get more into that on the next slide. At the bottom we have a small workflow of the model components. We built them as components to enable reusability.
8. Model management has no native support in Hadoop. Once we had built the model, we needed a manageable way to deploy and leverage it within the Hadoop ecosystem. There is no straightforward way to accomplish this task. Other analytical packages, such as SAS, have applications to help manage this. If you are using native Hadoop, how do you manage the models that you deploy? How do you track the version? How do you do A/B testing? How do you execute the model? How do you stream and run in batch?
9. Balaji. Data insight (identify noise, separate noisy data, leverage the business context of the data and dynamic modeling). Semantic layer: vendor metadata is not standardized; vendor datasets are semi-structured data (key-value pairs, JSON and dynamic XML). Data veracity: SPOT (Single Point of Truth) with data governance emerged, along with data management principles (metadata standardization, enterprise naming standards, data types, etc.). Data model: semi-structured datasets must be conformed to a data model that meets the organization's data model standards and applied to the MPP reporting platform.
10. We are living in the "world of information." Decompose the information; identify noise in the data; segregate actual meaningful data from noise. Bottom left: reference, Doan AH. Bottom right: walkthrough of sample data with sizing, etc. Evaluate meaningful data against domain features of the data; predict a model from the existing data model; bring a human into the loop to verify/validate the prediction model.
11. Balaji. Training outputs are provided to SMEs and domain experts. Transform raw data into features that better represent it; identify attributes useful for modeling. Train on the data and output a predicted model (human in the loop) for continuous learning. Let me dive deep into the Model Robot design and flow: JSON, XML (dynamic containers), key-value pairs; data in semi-structured and schema-less formats; run Data Discovery to output metadata and profiled data (ranges, types, etc.).
12. Balaji – feature extraction. Numeric features: min, max, mean, median… Text features: TF-IDF, POS tagging, NER tagging… Attribute names. Relationship-based: XPath, depth, number of neighbors, etc.
13. Balaji – prediction model. This is the moment you have all been waiting for: the CDF AI (Model Robot).
14. To summarize the CDF outcomes: collaboration, trust, automation, analytics and data-ready. CDF is a one-stop shop for ingesting data of any type/format, with Data Discovery, Data Model and Data Engineering. We are proud today to say our risk analysis is equipped with intraday and day-1 data insights. The CDF framework's core components are reused and extended (Data Model), reducing the cost of data integration/engineering. On the maturity model we are at 4; we are fine-tuning ourselves to complete level 4 and go to 5.
15. Business value delivered. Using the generic data engineering framework approach for our next product offering in 2016, we reduced "data munging" time by 50% using automation, enabling analysts to generate reports within a month of release. Our subsequent product launch in 2017 reduced the data engineering timeline by a further 25%, allowing reports to be generated and reviewed the next business day. We also share our 2017 success story of business outcomes: addressing loan risk and providing actionable feedback to customers on the loan origination process. Data insights on loan quality, Automated Collateral Evaluation (ACE): get to closing faster (no need for a traditional appraisal), save money (no appraisal fee), immediate certainty (automatically eligible for collateral rep and warranty relief).
16. Flat files are loaded on a share drive and manually uploaded to HDFS. There is nothing like XSD files.