From a single droplet to a full bottle,
our journey to Hadoop at Coca-Cola East Japan
October 27, 2016
Information Systems, Enterprise Architect
& Innovation project manager
Damien Contreras
ダミアン コントレラ
In This Session
• About Coca-Cola East Japan
• Hadoop Journey at CCEJ
• Hadoop Projects
• Hadoop for the manufacturing industry
• Hadoop for CCEJ: What's Next
About Coca-Cola East Japan
• Coca-Cola East Japan was established on Jul. 1, 2013 through the merger of four bottlers.
• On Apr. 1, 2015, it underwent further business integration with Sendai Coca-Cola Bottling Co., Ltd.
• Announced an MOU with Coca-Cola West on April 26, 2016 to proceed with discussions/review of business integration opportunities
• Japan's largest Coca-Cola bottler, with an extensive local network, selling the most popular beverage brands in Japan
Data as of December 2015
CCEJ Data Landscape
• DATA IN SILOS (Datamart, ERP, DWH, Staging, Mainframe, …)
• P2P INTERFACES (No ESB, multiple ETL & interface servers)
• NO GOVERNANCE (Multiple data formats for the same business context, no metadata management)
• BATCH ORIENTED (Files, schedulers, …)
Hadoop Journey: Genesis
July 2015
• Pilot phase
• 5 nodes
• Azure A1 → A4
• 100GB
• 70GB of RAM
• Team: 1 person
Stack (diagram, data source → integration → processing → analytics → restitution): flat files; HDFS, MR, YARN, Tez on CentOS, managed with Ambari; Hive; KNIME, WEKA
Hadoop Journey: Stability
November 2015
• Pilot phase
• 6 nodes
• Azure A4 → D & DS13
• 1TB of data
• 336GB of RAM
• Team: 2 people
Stack (diagram): flat files; NiFi; HDFS, MR, YARN, Tez on CentOS, managed with Ambari, secured with Ranger and Active Directory; Hive; KNIME, Zeppelin (Python notebook)
Hadoop Journey: Production
March 2016
• 8 nodes
• Azure D/DS13
• 3TB of data
• 64 cores
• 448GB of RAM
• Team: 2 people
Stack (diagram): flat files and web services; NiFi; HDFS, MR, YARN, Tez, Spark on CentOS, managed with Ambari, secured with Ranger and Active Directory; Hive; KNIME, Zeppelin (Python notebook); BW on HANA
Hadoop Eco-system at CCEJ
• 13 nodes
• 20TB
• 104 cores
• 728GB RAM
• 1000+ tables
• 3 production systems
Roles (diagram): Data Hub (past and forecast data), Analytics (aggregated data visualization), Master Data (centralization, lineage, governance)
Components (diagram, data source → integration → processing → analytics → restitution): flat files, web services, SAP ECC; NiFi, Boomi; HDFS, MR, YARN, Tez, Spark, Hive, Presto, Drill, MySQL on CentOS, managed with Ambari, secured with Ranger and Active Directory; KNIME, Zeppelin (Python notebook), Sparkling Water, TensorFlow, AirPal; BW on HANA, HTML reports
Timeline (May 2015 – Oct 2016)
• Platform POC → Hadoop / NiFi platform
• VM Analytics POC → Forecast implementation
• POC VM placement
• Flow implementation
• BW report integration
• SAP integration & MDM
• Write-off report
20TB
Vending Replenishment: The Business Case
• High number of machines: 550,000 VMs, online/offline
• SKUs per VM: 25 SKUs, hot & cold
• External factors: weather, city data, geo-location, events
• Vending routes: visit list per truck, logistics dependence
How to:
• Reduce the number of visits
• Optimize truck stock
• Avoid out-of-stocks
Vending Replenishment Forecast: The Project
The Challenge:
• Deployment in 3 months
• 1½ hours to generate the forecast
• +20% accuracy versus the previous version
• 120 steps in the program
Daily flow (diagram): online and offline VM data, forecast generation, arbitration (yes/no), visit plan, picking list
Hadoop Has Delivered:
• Feeds 5GB+ of new data every day
• Processes a high volume of data in-memory (300GB+)
• Integrates data from different sources
• Generates more complex forecasts than the legacy systems (14 million items)
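The slides do not show the forecast code itself; as a rough illustration of the nightly job's shape, here is a minimal PySpark sketch computing a naive per-VM/SKU moving-average forecast. The table name (vm_sales), columns and the 28-day window are hypothetical placeholders, not CCEJ's actual 120-step program or model.

```python
# Illustrative only: a naive per-VM/SKU moving-average forecast in PySpark.
# Table and column names (vm_sales, vm_id, sku, sale_date, units) are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("vm-forecast-sketch").getOrCreate()

sales = spark.table("vm_sales")  # daily sales per vending machine and SKU

# 28-day moving average per (vm_id, sku) as a stand-in for the real forecast model
w = Window.partitionBy("vm_id", "sku").orderBy("sale_date").rowsBetween(-27, 0)
forecast = (
    sales
    .withColumn("forecast_units", F.avg("units").over(w))
    .filter(F.col("sale_date") == F.lit("2016-10-24"))  # forecast base date (example)
    .select("vm_id", "sku", "forecast_units")
)

# Persist for downstream arbitration / visit planning
forecast.write.mode("overwrite").saveAsTable("vm_forecast_sketch")
```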
Staging: The Case of the “Write-off” Report
Flow (diagram): data from 7 systems plus master data is staged in Hadoop on Azure; Drill generates the SQL query and returns JSON to an HTML interface on a web server, where users verify and check the combined report
Challenges:
• Dataset harmonization (Sales, Billing, Inventory)
• Data volume from the source systems
• Complex computation logic
• Unclear functional requirements
Objectives:
• Aggregate a large number of datasets: 40+ flows, 4GB of data every day
• Single view of the data, anywhere, for Finance, Supply Chain & Commercial
• Dynamic transactions vs. static Excel
• Reduce manual work to zero
Processing steps: transformation (conversion), enrichment, aggregation, comparison, analytics
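The slide describes Drill generating a SQL query and returning JSON to the HTML interface. As a hedged sketch of that hand-off, the snippet below queries Drill's standard REST API (POST /query.json on port 8047) from Python; the host, schema and table names are assumptions for illustration.

```python
# Illustrative only: an HTML front end fetching write-off data as JSON
# through Apache Drill's REST API. Host, schema and table names are hypothetical.
import requests

DRILL_URL = "http://drill-host:8047/query.json"  # Drill's default web port is 8047

payload = {
    "queryType": "SQL",
    "query": """
        SELECT material, plant, SUM(write_off_qty) AS qty
        FROM hive.prod.t_my_report_orc_p
        WHERE dt = '20161024'
        GROUP BY material, plant
    """,
}

resp = requests.post(DRILL_URL, json=payload, timeout=120)
resp.raise_for_status()
result = resp.json()          # response includes "columns" and "rows"
for row in result["rows"]:
    print(row)
```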
MDM: Centralization and Dispatch
Flow (diagram): (1) MDM creation → (2) MDM registration in the MDM repository, with lineage → (3) consistency check by the rule engine → (4) event-driven replication of the data to external systems through the replication engine
Challenges:
• Rule engine definition and
implementation
• MDM on Hadoop & ESB
integration
• MDM & SAP Synchronization
Objectives:
• Single MDM repository
• Centralized bridge tables &
Mapping table
• Standardization of MDM across
data landscape
• Targeted distribution / replication
of MDM to external systems
Realization:
• MySQL and Hadoop synchronization
300+ tables
• Replication engine with ESB
• MDM-Tool: Pilot with Customer
Master
• Full go-live: April 2017
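As a small illustration of the consistency-check step (step 3 in the flow above), here is a minimal rule-engine sketch in Python. The rules and field names are invented for the example; CCEJ's actual rule engine and MDM schema are not described in the deck.

```python
# Illustrative only: a tiny rule-engine sketch for the MDM consistency check.
# Rules and field names are hypothetical, not CCEJ's actual rules.
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Rule = Tuple[str, Callable[[Record], bool]]

RULES: List[Rule] = [
    ("customer_id is 10 digits", lambda r: r.get("customer_id", "").isdigit()
                                           and len(r["customer_id"]) == 10),
    ("sales_org is known",       lambda r: r.get("sales_org") in {"1000", "2000"}),
    ("name is not empty",        lambda r: bool(r.get("name", "").strip())),
]

def check(record: Record) -> List[str]:
    """Return the violated rules; an empty list means the record may replicate."""
    return [name for name, rule in RULES if not rule(record)]

if __name__ == "__main__":
    candidate = {"customer_id": "1234567890", "sales_org": "1000", "name": "Store A"}
    violations = check(candidate)
    print("replicate" if not violations else f"rejected: {violations}")
```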
Use case – SAP Integration / sales interface report
Objectives:
• Leverage the most granular data already
in Hadoop
• Leverage the processing power of
Hadoop
Flow (diagram): legacy-format sales and vending data from Company 1, 2 and 3 (x9, x4 and x7 flows), plus master data and bridge tables (x9 flows), are combined and calculated in Hadoop on Azure, producing x9 output tables in CCEJ format
Challenges:
• Many data formats requiring complex data transformations
• Wide variety of data sources & technologies to transfer data
• Data mapping between systems
Realization:
• Data structure in Hadoop
• Logic for one type of sales
channel implemented
• Full go-live: April 2017
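To make the combine/calculate step more concrete, here is a hedged PySpark sketch of mapping legacy sales codes to CCEJ codes through a bridge table and aggregating the result. Table and column names (t_legacy_sales_txt_p, legacy_material, ccej_code, …) are hypothetical; only the bridge-table pattern itself comes from the slide.

```python
# Illustrative only: mapping legacy sales records to CCEJ codes with a bridge table,
# then aggregating. Table and column names are hypothetical examples.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sales-interface-sketch").getOrCreate()

legacy_sales = spark.table("t_legacy_sales_txt_p")     # legacy-format sales data
bridge       = spark.table("t_my_bridge_table_txt_p")  # legacy code -> CCEJ code

ccej_sales = (
    legacy_sales
    .join(bridge, legacy_sales.legacy_material == bridge.legacy_code, "left")
    .withColumn("material", F.coalesce(bridge.ccej_code, legacy_sales.legacy_material))
    .groupBy("material", "dt")
    .agg(F.sum("quantity").alias("quantity"), F.sum("amount").alias("amount"))
)

ccej_sales.write.mode("overwrite").saveAsTable("t_sales_interface_orc_p")
```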
Hadoop: What’s Next
Increase data velocity & Create a true Data Lake
Improve data collection, quality, profiling, meta-data &
propose a catalog of curated data to end users
Toward a Data Driven Decision Process
Develop Support & Operational Excellence
I thank CCEJ management, who had the courage to believe in an Agile approach.
Thanks to my teammate and comrade Vinay Mahadev for all the long hours we have put in together to make this project a reality.
Your turn, let's share ideas & a Coke!
Damien Contreras
Email: damien.contreras@ccej.co.jp
LinkedIn: Damien Contreras
Twitter: @dvolute
The inside of Hadoop
Integration Landscape overview
Acquisition → Transformation → Restitution (diagram):
• Acquisition: SAP ECC (IDOCs via Boomi), Oracle, MySQL, flat files and other systems over JDBC and FTP, feeding the NiFi Prod instances
• Transformation: Hadoop Prod (Hive, Drill), loaded through NiFi over JDBC
• Restitution: BW on HANA over JDBC, and an HTML interface over HTTP for power users
Data layout and naming conventions:
• Raw files are landed per flow, e.g. Myflow-data/My_file_20161024.csv and My_file_20161025.csv
• External text tables (t_my_table_txt_p, t_my_bridge_table_txt_p) live in a per-flow database (Myflow-data) and are partitioned by load date (dt=20161024, dt=20161025)
• Report tables are stored as ORC (t_my_report_orc_p)
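A minimal sketch of how these conventions translate into Hive DDL, run here through PyHive for illustration. The host, columns and exact DDL are assumptions; in CCEJ's setup the tables are created and loaded as part of the NiFi/Hive flows.

```python
# Illustrative only: creating the partitioned external text table and the ORC report
# table described by the naming convention above, via PyHive. Host, database name and
# columns are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hadoop-prod-edge", port=10000, username="etl_user")
cur = conn.cursor()

# External text table over the landed CSV files, partitioned by load date (dt=YYYYMMDD)
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS myflow_data.t_my_table_txt_p (
        material STRING, plant STRING, quantity DECIMAL(18,3)
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/myflow-data/t_my_table_txt_p'
""")

# Register the partition for one day's file drop
cur.execute("""
    ALTER TABLE myflow_data.t_my_table_txt_p
    ADD IF NOT EXISTS PARTITION (dt='20161024')
    LOCATION '/data/myflow-data/t_my_table_txt_p/dt=20161024'
""")

# ORC report table fed from the text staging table
cur.execute("""
    CREATE TABLE IF NOT EXISTS myflow_data.t_my_report_orc_p (
        material STRING, plant STRING, quantity DECIMAL(18,3)
    )
    PARTITIONED BY (dt STRING)
    STORED AS ORC
""")
cur.execute("""
    INSERT OVERWRITE TABLE myflow_data.t_my_report_orc_p PARTITION (dt='20161024')
    SELECT material, plant, quantity
    FROM myflow_data.t_my_table_txt_p WHERE dt='20161024'
""")
```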
Guidelines around NiFi flows
• Separate Dev and Prod NiFi instances, on the source-system side and on Azure
• Extractions are trigger-driven: the source system makes a web call to a NiFi listener, which starts the extraction over JDBC
• Flows are organized into processing groups, split by flow type (master data / transaction data)
• Encryption per flow (site-to-site)
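As a concrete example of the trigger-driven pattern, the snippet below shows a source system notifying NiFi over HTTPS that an extraction can start. The URL (a ListenHTTP-style endpoint), payload fields and certificate path are hypothetical.

```python
# Illustrative only: a source system notifying NiFi that today's extraction can start.
# The URL and payload fields are hypothetical; on the NiFi side a ListenHTTP (or
# HandleHttpRequest) processor would receive this call and start the extraction flow.
import requests

TRIGGER_URL = "https://nifi-prod.example.local:9443/contentListener"

event = {
    "flow": "myflow-data",        # which processing group / flow to run
    "data_type": "transaction",   # master data vs. transaction data
    "dt": "20161024",             # extraction date, matching the dt= partitions
}

resp = requests.post(TRIGGER_URL, json=event, timeout=30,
                     verify="root-ca.pem")  # the shared root CA mentioned in the notes
resp.raise_for_status()
print("NiFi trigger accepted:", resp.status_code)
```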
Guidelines around NiFi flows
Error handling per flow (master data / transaction data):
• Each processor retries; on success the data is sent on, on error the failure is written to an error log
• Every 5 minutes a recovery flow reads the error log, re-processes the failed items, updates the error log and sends the data
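The same retry / error-log / periodic-reprocess pattern, sketched in plain Python purely as a conceptual analogue; in production this logic is implemented with NiFi processors, OnError output ports and a 5-minute timer, not with this code.

```python
# Illustrative only: retry, error-log and periodic re-process pattern outside NiFi.
# Function names and the JSON-lines error log are hypothetical.
import json
from pathlib import Path

ERROR_LOG = Path("error_log.jsonl")

def send_data(item: dict) -> None:
    """Stand-in for the real processing (extraction, transformation, delivery)."""
    if item.get("poison"):
        raise RuntimeError("downstream system unavailable")

def process(item: dict, retries: int = 3) -> None:
    for attempt in range(1, retries + 1):
        try:
            send_data(item)
            return                      # success: nothing to log
        except Exception as exc:
            if attempt == retries:      # OnError: record the failure for later
                with ERROR_LOG.open("a") as f:
                    f.write(json.dumps({"item": item, "error": str(exc)}) + "\n")

def reprocess_errors() -> None:
    """Read the error log, retry each item; new failures land back in the log."""
    if not ERROR_LOG.exists():
        return
    entries = [json.loads(line) for line in ERROR_LOG.read_text().splitlines()]
    ERROR_LOG.unlink()
    for entry in entries:
        process(entry["item"])

if __name__ == "__main__":
    process({"id": 1})
    process({"id": 2, "poison": True})
    # In production this would run on a 5-minute schedule (cron, scheduler, NiFi timer)
    reprocess_errors()
```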
NiFi enhancement: example
Technical Architecture (diagram)
• Hadoop production environment on Azure: nodes 0 to 11, with Active Directory and NiFi
• Hadoop dev environment on Azure: nodes 0 to 3
• On-premises prod and dev environments: NiFi instances connecting RDBMS, FTP server and SAP ECC to the Azure clusters
Editor's Notes
1. Coca-Cola East Japan is responsible for producing and distributing all the products you love, from coffee to carbonated drinks like Fanta or Sprite → vending machines. 5 different bottlers, and soon 6.
2. Data in silos: three layers of technology (the mainframe and its ecosystem, highly customized SAP instances per bottler, a single SAP instance for CCEJ); redundant capabilities at each level; duplicated data, no single source of truth; replication going both ways. Point to point: two endpoints with one or several interfaces; multiple ETL/ESB tools and servers that stage data. No governance: data structures based on system requirements and vendor convenience (e.g. two different vendors working on the same project → two IDs for the same object); requirements → solution developed by the vendors → knowledge of the meaning of the data kept on the vendor side; master data managed in each system → no completeness of the master data, no single source of truth. Batch oriented: data transferred through flat files / fixed-width files orchestrated by schedulers (once a day); little to no event-driven flows; multiple intermediate systems that receive the data, load it, then send it to another system.
3. Started humbly: focused on analytics around vending machines & data exploration. HDP on Azure with CentOS VMs (a first in Japan), since HDInsight without Spark was difficult to install. Data manually uploaded to the cluster through FTP.
4. Focused on analytics around vending machine data exploration. Teradata: datamart. NiFi to integrate between our on-prem environment and the cloud environment, leveraging the site-to-site connection and using certificates.
5. First release and production use of Hadoop. A Spark program running every night to generate the forecast for all vending machines. Integration with NiFi on multiple systems to retrieve transactional data and master data. First attempt to integrate with BW using JDBC (not Vora, as BW was not in the same data center and we had no requirement to push data back from BW to Hadoop). NiFi: ExecuteSQL modified to leverage setFetchSize to stream data (10,000 records), re-encoding to UTF-8 in the processor, fixed-width parser. Governance around data: partition data based on delta extraction; naming convention to easily identify data source systems; Atlas and AD for access rights. NiFi: guidelines around processing groups (e.g. master data / transactional data); trigger-oriented with web services; error handling that bubbles errors to the top, with retry logic to ensure extraction.
6. Scheduling & orchestration can be implemented through NiFi. Most of our data is linked to a date, therefore we partition tables while keeping their tabular structure; Hive was the de facto solution, and many data transformations can be implemented directly in HQL. Presto to integrate heterogeneous systems, with good performance when querying data. Drill to easily format results in JSON and rapidly integrate with web interfaces. NiFi (250 flows) & Boomi (700 flows) integration to have full access to SAP ECC functional modules & IDOCs. Program profiling: Fire. Defined the use cases: analytics; data hub to aggregate & process data; central repository for master data. 20TB in HDFS, pulling 20GB per day; 250 daily extractions in NiFi (ExecuteSQL) in prod, others in dev for CokeOne. Per-source daily volume (GB): DM 0.25, HHT 3.57, CMOS 0.66, SC 4.14, DME 10.45, MM 0.26. 104 cores, 9 datanodes, 486GB of RAM. Hive tables: Prod 402 (BI 148, Default 123, DemandForecast 42, dm 7, mdm 26, sc 40, vm 16); Dev 604 (bi 36, fusion 18, mdm 306, rtr 7, vm 237).
  7. NiFi in production in October
8. CCEJ intro. Hadoop is not only a data lake but an integration platform and a business enabler: 1 MDM, 2 write-off, 3 ERP integration. Landscape.
9. 1: High number of vending machines, both online and offline. 2: Line-ups of 50 products on average, 25 products in each vending machine; hot and cold product seasonality; a new product every 3 months. 3: External factors: vending inside or outside (offices / stations / sports centers); how easily accessible (on top of a mountain, on top of a building, in an underground station); wide variety of customers: regulars, events such as baseball. 4: Vending routes: limited number of trucks & fillers → go replenish the 20 VMs that really need it, and decide what product to put on the truck.
10. Processing 14M items (VM × products) every day: 5GB of today's sales information, stock levels, settlement information, …; 300GB as we look into the past.
  11. Producing that report took 6 to 8 hours and had errors
12. Instead of duplicating data (silos), we can reuse the same data to generate those reports. Sources: IDOCs, JDBC, flat files, fixed-width files. Complex transformations, as we are combining legacy data with the new numbering and definitions of master data to aggregate data during the rollout period of the new systems.
13. Whenever possible we implement extractions triggered by a web service call from the source system. We group extractions belonging to the same flow under a logical “Process Group”. We try to have logical subdivisions as well to help readability and maintainability (e.g. extractors for master data / transactional data, or monthly / daily extractors). Site-to-site communication is encrypted using a single root CA certificate shared across all keystores.
14. We always implement failover that retries the processing, and we always implement error management. In each processing group we implement an output port called “OnError” that is linked to the parent “OnError” output port; this ensures that notifications are propagated until they reach the root canvas. The top “OnError” is a remote output port available on the NiFi Azure side and implements a common handler to send errors. Each processor is linked to that output port, and each flow branch has its own set of parameters defined at the beginning: service, process, priority. Those parameters are used by the error-handling process to send notifications to administrators.
15. Key custom features: leverage cursors on tables to stream data in batches (10,000 rows at a time); comma-delimited CSV output, by default named <yyyy-MM-dd_hh:mm:ss.SSS>.csv with LF as the line separator; UTF-8 file encoding by default; option to save the extracted file to a folder location (integrated PutFile processor functionality); a Boolean value for the user to choose whether to keep the source file or remove it from the folder location; UPDATE SQL functionality; a Boolean option to output the CSV in Windows format (CRLF line separator) or Unix format (LF). Also a fixed-width-to-CSV processor.
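As an illustration of the streaming-extraction idea behind this customized ExecuteSQL processor, here is a hedged Python sketch that fetches a result set in 10,000-row batches and writes a UTF-8, LF-terminated CSV with a timestamped filename. The DSN, query and schema are invented for the example; the real implementation is a NiFi processor, not Python.

```python
# Illustrative only: batched extraction to UTF-8 CSV, mirroring the customized
# ExecuteSQL behaviour. DSN, query and file naming are hypothetical.
import csv
from datetime import datetime

import pyodbc

BATCH_SIZE = 10_000  # stream the result set 10,000 rows at a time

conn = pyodbc.connect("DSN=source_system")          # hypothetical ODBC DSN
cur = conn.cursor()
cur.execute("SELECT material, plant, quantity FROM sales_delta WHERE dt = ?", "20161024")

out_name = datetime.now().strftime("%Y-%m-%d_%H:%M:%S.%f")[:-3] + ".csv"
with open(out_name, "w", encoding="utf-8", newline="") as f:   # UTF-8, LF line endings
    writer = csv.writer(f, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])        # header row
    while True:
        rows = cur.fetchmany(BATCH_SIZE)
        if not rows:
            break
        writer.writerows(rows)

cur.close()
conn.close()
```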
16. Dev environment accessible to anyone (after requesting an account).