From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola East Japan
- 1. Coca-Cola East Japan Co., Ltd.
From a single droplet to a full bottle,
our journey to Hadoop at Coca-Cola East Japan
October 27, 2016
Information Systems, Enterprise Architect
& Innovation project manager
Damien Contreras
- 2.
In This Session
• About Coca-Cola East Japan
• Hadoop Journey at CCEJ
• Hadoop Projects
• Hadoop for the manufacturing industry
• Hadoop for CCEJ: What’s Next
- 3.
• Coca-Cola East Japan was established on Jul. 1, 2013
through the merger of four bottlers.
• On Apr. 1, 2015, it underwent further business integration with
Sendai Coca-Cola Bottling Co., Ltd.
• Announced MOU with Coca-Cola West on April 26, 2016 to
proceed with discussions/review of business integration
opportunities
• Japan's largest Coca-Cola Bottler, with an extensive
local network, selling the most popular beverage
brands in Japan
Data as of December 2015
About Coca-Cola East Japan
- 5.
CCEJ Data Landscape
DATA IN SILOS
(Datamart, ERP, DWH, Staging, Mainframe,…)
P2P INTERFACES
(No ESB, Multiple ETL & Interface Servers)
NO GOVERNANCE
(Multiple Data formats for same business
context, No Meta Data Mgt.)
BATCH ORIENTED
(File, Scheduler, …)
- 8.
Hadoop Journey: Production (March 2016)
Layers: Data source → Integration → Processing → Analytics System → Data Restitution
• Data sources: flat files, web services
• Integration: NiFi
• Processing: HDFS, MR, Yarn, Tez, Hive, Spark
• Analytics: KNIME, Zeppelin, Python Notebook
• Restitution: BW on Hana
• Platform: Centos, Active Directory, Ambari, Ranger
Cluster:
• 8 nodes (Azure D/DS13)
• 3TB of data
• 64 cores
• 448GB RAM
• Team: 2 people
- 9.
Hadoop Eco-system at CCEJ
• 13 nodes
• 20TB
• 104 cores
• 728GB RAM
• 1000+ tables
• 3 production systems
Layers: Data source → Integration → Processing → Analytics System → Data Restitution
Use cases: (1) Analytics (past data, forecast data), (2) Data Hub (aggregated data, visualization), (3) Master Data (centralize, lineage, governance)
• Data sources: SAP ECC, flat files, web services
• Integration: NiFi, Boomi
• Processing: HDFS, MR, Yarn, Tez, Hive, Spark, Presto, Drill, MySQL
• Analytics: KNIME, Zeppelin, AirPal, Python Notebook, Sparkling Water, Tensorflow
• Restitution: BW on Hana, HTML Report
• Platform: Centos, Active Directory, Ambari, Ranger
- 10.
Timeline (May 2015 – Oct 2016)
• Hadoop / NiFi platform: Platform POC → implementation
• VM Analytics POC → Forecast implementation
• POC VM Placement
• Flow implementation
• BW Report integration
• SAP integration (1) & MDM (3)
• Write-Off report (2)
- 12.
Vending Replenishment: The Business Case
• HIGH NBR. OF MACHINES: 550,000 VMs, on/offline
• NBR. OF SKUs PER VM: 25 SKUs, hot & cold
• EXTERNAL FACTORS: weather, city data, geo-location, events
• VENDING ROUTES: visit list per truck, logistics dependence
How to:
• Reduce the number of visits
• Optimize truck stock
• Avoid out-of-stocks
- 13.
Vending Replenishment Forecast: The Project
The Challenge:
• Deployment in 3 months
• 1½ hours to generate the forecast
• +20% accuracy versus the previous version
• 120 steps in the program
Daily flow: online/offline VM data → forecast generation → arbitration (yes/no) → visit plan → picking list
Hadoop Has Delivered:
• Feeds 5GB+ of new data every day
• Processes a high volume of data in memory (300GB+)
• Integrates data from different sources
• Generates more sophisticated forecasts than the legacy systems (14 million items)
- 15.
Staging: The Case of the “Write-Off” Report
Flow (Azure): 7 source systems → Drill generates SQL query → JSON → web server → HTML interface (verify & check, combine with master data into a combined report)
Pipeline: Transformation (conversion) → Enrichment → Aggregation → Comparison → Analytics
Challenges:
• Data set harmonization (Sales, Billing, Inventory)
• Data volume from source systems
• Complex computation logic
• Unclear functional requirements
Objectives:
• Aggregate a large number of datasets: 40+ flows, 4GB of data every day
• A single view of the data, anywhere, for Finance, Supply Chain & Commercial
• Dynamic transactions vs. static Excel
• Reduce manual work to zero
- 16.
MDM: Centralization and Dispatch
Flow: (1) MDM creation → (2) MDM registration (lineage) → (3) consistency check (rule engine) → (4) replicate data to external systems, event-driven (MDM repository → replication engine)
Challenges:
• Rule engine definition and
implementation
• MDM on Hadoop & ESB
integration
• MDM & SAP Synchronization
Objectives:
• Single MDM repository
• Centralized bridge tables &
Mapping table
• Standardization of MDM across
data landscape
• Targeted distribution / replication
of MDM to external systems
Realization:
• MySQL and Hadoop synchronization
300+ tables
• Replication engine with ESB
• MDM-Tool: Pilot with Customer
Master
• Full go-live: April 2017
- 17.
Use Case: SAP Integration / Sales Interface Report
Objectives:
• Leverage the most granular data already in Hadoop
• Leverage the processing power of Hadoop
Flow (Azure): vending sales data and legacy-format data from Company 1/2/3 (×9, ×4, ×7, ×9 flows) + MD & bridge tables → combine (bridge table & master) → calculate → 9 output tables in CCEJ format
Challenges:
• Many data formats requiring complex data transformations
• Wide variety of data sources &
technologies to transfer data
• Data mapping between systems
Realization:
• Data structure in Hadoop
• Logic for one type of sales
channel implemented
• Full go-live: April 2017
- 18.
Hadoop: What’s Next
Increase data velocity & create a true data lake
Improve data collection, quality, profiling & metadata, and propose a catalog of curated data to end users
Toward a Data Driven Decision Process
Develop Support & Operational Excellence
- 19.
I thank CCEJ management, who had the courage to believe in an Agile approach.
Thanks to my team member and comrade Vinay Mahadev for all the long hours we've put in together to make this project a reality.
- 22.
Integration Landscape Overview
Acquisition → Transformation → Restitution
• Acquisition: NiFi (Prod) pulls from SAP ECC (IDOCS, via Boomi), Oracle (JDBC), MySQL (JDBC), flat files (FTP), and other systems into Hadoop Prod
• Transformation: Hive
• Restitution: BW on Hana (JDBC), Drill over HTTP to an HTML interface for power users
Naming & partitioning conventions:
• Landing files: My_file_20161024.csv, My_file_20161025.csv
• External text tables: t_my_table_txt_p, partitioned by date (dt=20161024, dt=20161025)
• Bridge tables: t_my_bridge_table_txt_p
• One database per flow: Myflow-data
• Report tables stored as ORC: t_my_report_orc_p
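The partitioning convention above (external text tables suffixed _txt_p, daily partitions dt=YYYYMMDD) can be scripted; a minimal sketch in Python, where the base HDFS path is a hypothetical example and only the table name and partition format come from the slide:

```python
from datetime import date

def partition_ddl(table: str, day: date,
                  base: str = "/data/myflow") -> str:
    """Build the Hive DDL that registers one daily partition,
    following the dt=YYYYMMDD convention (base path is hypothetical)."""
    dt = day.strftime("%Y%m%d")
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (dt='{dt}') LOCATION '{base}/dt={dt}'")

ddl = partition_ddl("t_my_table_txt_p", date(2016, 10, 24))
print(ddl)
```

Generating the DDL per day keeps partition registration in step with the daily landing files.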
- 23.
Guidelines around NiFi Flows
• Source systems trigger NiFi listeners (web call); extraction over JDBC
• Separate Dev and Prod flows on each side (source systems and Azure)
• One processing group per flow, split by data type: Master Data / Transaction Data
• Encryption per flow
- 24.
Guidelines around NiFi Flows: Error Handling
Error handling per flow (Master Data / Transaction Data):
• Processor → on success: send data; on error: write to error log
• Every 5 mins: read from error log → re-process (retry) → update error log
Editor's Notes
- > Coca-Cola East Japan is responsible for producing and distributing all the products you love, from coffee to carbonated drinks like Fanta or Sprite, through vending machines
> 5 different bottlers, and soon 6
- Data in silos:
3 layers of technologies: mainframe and its ecosystem / highly customized SAP instances per bottler / single instance of SAP for CCEJ
Redundancy of capabilities at each level
Duplicated data, no single source of truth
Replication going both ways
Point to point :
2 endpoints, one or several interfaces
Multiple ETL tools / ESB tools, servers that stage data
No governance:
Data structure based on system requirements and vendors' convenience, e.g. two different vendors working on the same project: two IDs for the same object
Requirements and solutions developed by the vendors; knowledge of the meaning of the data kept on the vendor side
Master data managed in each system: no completeness of the master data, no single source of truth
Batch oriented:
Data transfer through flat files / fixed width files orchestrated by schedulers (once a day)
Little to no event driven flows
Multiple intermediate systems that receive the data / load then send to another system
- Started humbly:
Focused on analytics around vending machines & data exploration
HDP on Azure with CentOS VMs (a first in Japan): HDInsight came without Spark, which was difficult to install
Data manually uploaded to the cluster through FTP
-
Focused on analytics around vending machines & data exploration
Teradata : datamart
NiFi to integrate between our on-prem environment and the cloud environment, leveraging the site-to-site connection and using certificates
- First release and production use of Hadoop
Spark program running every night to generate forecast for all vending machines
Integration with NiFi on multiple systems to retrieve transactional data and master data
First attempt to integrate with BW using JDBC (not Vora as BW was not in the same data center and we had no requirements to push back data from BW to Hadoop)
NiFi :
ExecuteSQL modified to leverage setFetchSize to stream data (10,000 records)
Re-encoding to UTF8 in the processor
Fixed-width parser
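The fixed-width parsing and UTF-8 re-encoding steps above can be sketched as follows; the field layout and the Shift_JIS source encoding are hypothetical examples, not the real extracts' definitions:

```python
import csv
import io

# Hypothetical field layout: (name, width) pairs.
LAYOUT = [("vm_id", 8), ("sku", 6), ("qty", 4)]

def parse_fixed_width(line: str, layout=LAYOUT):
    """Slice one fixed-width record into trimmed fields."""
    fields, pos = [], 0
    for _name, width in layout:
        fields.append(line[pos:pos + width].strip())
        pos += width
    return fields

def to_utf8_csv(records, source_encoding="shift_jis"):
    """Decode legacy-encoded fixed-width records and emit CSV text,
    mirroring the re-encoding done in the customized processor."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for raw in records:
        writer.writerow(parse_fixed_width(raw.decode(source_encoding)))
    return out.getvalue()

print(to_utf8_csv([b"VM000001SKU0010012", b"VM000002SKU0020003"]))
```

In the real flow the same logic lives inside a NiFi processor rather than a standalone script.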
Governance around data:
Partition data based on delta extraction
Naming convention to easily identify data source systems
Atlas and AD for access rights
NiFi : guidelines around processing groups, e.g: master data / transactional data
NiFi: Trigger oriented with webService, error handling: bubbling errors to the top, retry logic to ensure extraction
- Scheduling & orchestration can be implemented through NiFi
Most of our data is linked to a date, therefore we partition tables by date
Keeping the tabular structure of the tables, Hive was the de facto solution
Many data transformations can be implemented directly in HQL
Presto to integrate heterogeneous systems / good performance when querying data
Drill to easily format output as JSON and rapidly integrate with web interfaces
NiFi (250 flows) & Boomi (700 flows) integration to have full access to SAP ECC functional modules & IDOCS
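Drill's standard REST endpoint (POST /query.json) returns query results as JSON, which is one way a web interface can consume it directly. A minimal sketch of building the request body; the host is hypothetical, and the table name is the naming-convention example from the landscape slide:

```python
import json

# Hypothetical Drill host; /query.json is Drill's REST query endpoint.
DRILL_URL = "http://drill-host:8047/query.json"

def drill_payload(sql: str) -> str:
    """Build the JSON body Drill's REST API expects for a SQL query."""
    return json.dumps({"queryType": "SQL", "query": sql})

body = drill_payload("SELECT * FROM t_my_report_orc_p LIMIT 10")
print(body)
```

The HTML interface would POST this body to DRILL_URL and render the returned JSON rows.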
Program profiling: Fire
Defined the use cases:
Analytics
Data hub and aggregate & processing data
Central repository for Master Data
20TB in HDFS
Pulling 20GB per day; 250 daily extractions in prod in NiFi (ExecuteSQL), others in Dev for CokeOne
Daily extraction volume per source system (GB/day, ~20GB total):
DM: 0.249885409
HHT: 3.571921535
CMOS: 0.657761362
SC: 4.140049335
DME: 10.447961232
MM: 0.260761793
104 cores
9 datanodes
486GB of RAM
HIVE tables:
Prod: 402 (BI: 148, Default: 123, DemandForecast: 42, dm: 7, mdm: 26, sc: 40, vm: 16)
Dev: 604 (bi: 36, fusion: 18, mdm: 306, rtr: 7, vm: 237)
- NiFi in production in October
- CCEJ Intro
Hadoop: not only a data lake, but an integration platform and business enabler:
1 MDM
2 Write-off
3 ERP integration
Landscape
-
1:
High number of Vending machines
Online Vending machines / Offline vending machines
2:
Line-ups of 50 products
On average 25 products per vending machine
Hot and cold product seasonality
New product every 3 months
3:External factors
Vending inside or outside (in offices / stations / sports centers)
Accessibility: on top of a mountain / on top of a building / in an underground station
Wide variety of customers: regulars, events (e.g. baseball)
4: vending routes
Limited number of trucks & fillers: go replenish the 20 VMs that really need it
Which products to put on the truck
- Processing 14M items (VM × products) every day:
5GB: today’s sales information, stock level, settlement information, …
300GB as we look into the past
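The 14M figure follows directly from the fleet and line-up numbers on the business-case slide; a quick check:

```python
vms = 550_000        # vending machines (business-case slide)
skus_per_vm = 25     # average SKUs per machine (business-case slide)
items = vms * skus_per_vm
print(f"{items:,}")  # 13,750,000, roughly the 14 million items forecast daily
```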
- Producing that report took 6 to 8 hours and had errors
- Instead of duplicating data (silos) we can reuse the same data to generate those reports
IDOCS, JDBC, flat files, fixed width files
Complex transformations, as we combine legacy data with the new master data numbering and definitions to aggregate data during the rollout period of the new systems
- Whenever possible we implement extractions triggered by a web service call from the source system
We group extractions by flow together under logical “Process Group”
We try to have logical subdivisions as well to help readability and maintainability
(e.g: extractor for Master data / transactional data or monthly extractors / daily extractors)
Site-to-site communication is encrypted using a single root CA certificate shared across all keystores
-
We always implement failover that retries the processing
We always implement error management
In each processing group we implement an output port called “OnError” that is linked to the parent “OnError” output port; this ensures that notifications are propagated up to the root canvas.
The top “OnError” is a remote output port available on the NiFi Azure side and implements a common handler to send errors
Each processor is linked to that output port; each flow branch has its own set of parameters defined at its start:
Service
Process
Priority
Those parameters are used by the error handling process to send notifications to administrators
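A minimal in-memory sketch of the retry-plus-error-log pattern described above; the processor, the log, and the parameter handling are stand-ins for the NiFi components, not NiFi APIs:

```python
# A stand-in, in-memory error log; the real flow persists its log.
error_log = []

def send_data(record):
    """Hypothetical downstream call; raises on failure."""
    if record.get("fail"):
        raise ConnectionError("downstream unavailable")

def process(record, params):
    """Try to send; on error, log the record together with the branch's
    parameters (service / process / priority) used for notifications."""
    try:
        send_data(record)
        return True
    except ConnectionError as exc:
        error_log.append({"record": record, "error": str(exc), **params})
        return False

def reprocess():
    """Periodic retry pass (every 5 minutes in the real flow): drain the
    log and re-run process(); records that fail again are re-logged."""
    pending, error_log[:] = list(error_log), []
    for entry in pending:
        params = {k: entry[k] for k in ("service", "process", "priority")}
        process(entry["record"], params)

demo = {"id": 1, "fail": True}
process(demo, {"service": "sales", "process": "extract", "priority": "high"})
print(len(error_log))  # 1 (record logged for later retry)
```

Once the downstream recovers, the periodic reprocess() pass empties the log.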
- Key Custom Features:
Leverage cursors on tables to stream batches of data (10,000 rows at a time)
Comma-delimited CSV file output; by default the filename is <yyyy-MM-dd_hh:mm:ss.SSS>.csv with LF as line separator
Default UTF-8 file encoding
Option to save the extracted file to a folder location (integrated PutFile processor functionality)
A Boolean option for the user to keep the source file or remove it from the folder location
UPDATE SQL functionality
Boolean option to output the CSV in Windows format (CRLF line separator) or Unix format (LF)
Also a fixed-width-to-CSV processor
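The cursor-streaming behaviour can be sketched with a generic DB-API cursor; sqlite3 stands in for the real JDBC sources, and the line_sep parameter mirrors the CRLF/LF output option:

```python
import csv
import io
import sqlite3

def stream_to_csv(conn, sql, batch_size=10_000, line_sep="\n"):
    """Run a query and write rows to CSV in fetchmany() batches, so the
    full result set never sits in memory at once; pass CRLF as line_sep
    for the Windows-CSV variant."""
    cur = conn.cursor()
    cur.execute(sql)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator=line_sep)
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        writer.writerows(rows)
    return out.getvalue()

# Tiny in-memory demo standing in for a source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
print(stream_to_csv(conn, "SELECT * FROM t ORDER BY id"))
```

The real processor writes to a file (with the timestamped filename above) instead of an in-memory buffer, but the batching logic is the same.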
- Dev environment accessible to anyone (after requesting an account)