Data Centric HPC for Numerical Weather Forecasting
- 1. DATA CENTRIC HPC FOR NUMERICAL WEATHER FORECASTING
James Faeldon
Delfin Jay Sabido III
Karen España
IBM Philippines, STG Labs
- 2. Extreme Weather Events
•The Philippines is home to devastating typhoons.
•19 typhoons a year and intense monsoon rains that can cause widespread flooding.
•Research collaboration by the Philippine Government, University of the Philippines and IBM (2013).
The strongest typhoons group near the Philippines
Image courtesy of NOAA
Typhoon Tracks Eastern Hemisphere
Before
After
Super Typhoon Haiyan (Nov 2013)
Image courtesy of DigitalGlobe
- 3. Coupled Models for Pre-Disaster Planning
Numerical weather model forecasts typhoon track and intensity
Machine learning model predicts affected population and damages
Optimization model recommends relief supplies pre-positioning and allocation
Typhoons can be forecast a few days in advance.
But we need more reports, better visualization and data exploration tools to reduce analysis cycles and facilitate timely decisions.
Operations Center
- 4. Operational Forecasting Schedule Runs
Data-Intensive
Compute-Intensive
Data-intensive processes are increasingly becoming the bottleneck in the operational forecasting workflow.
- 6. Operational Forecasting Data Challenges
Quality Control
Sampling
Verification
Machine Learning
Ensemble Forecasts
Update relief operations plan based on new forecast
+ 7 historical days
663 Gb per forecast
Model Output Statistics
6-hour processing and analysis window
ETL
Real-time Sensor Data:
Source      Qty   Unit Size   Total Size
AWS         733   7Kb/day     5Mb/day
Satellite   1     480Mb/day   480Mb/day
Radar       7     9Gb/day     63Gb/day
Forecast Data:
Resolution   Cells   Grid Dimensions   Total Size
12km         5.2 M   307 x 481 x 35    81Gb/forecast
4km          8.8 M   619 x 406 x 35    138Gb/forecast
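The sensor and forecast figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted on this slide. It assumes that the 663 Gb figure is the two forecast grids plus seven days of sensor data, which is an interpretation rather than something the slide states, and it assumes decimal prefixes while keeping the slide's unit labels as written.

```python
# Figures copied from the slide; unit labels (Kb, Mb, Gb) are kept as written there.
# Assumption: decimal prefixes, i.e. 1 Gb = 1,000 Mb = 1,000,000 Kb.

sensors_gb_per_day = {
    "AWS": 733 * 7 / 1_000_000,    # 733 stations x 7 Kb/day
    "Satellite": 1 * 480 / 1_000,  # 1 feed x 480 Mb/day
    "Radar": 7 * 9,                # 7 radars x 9 Gb/day
}
forecast_gb = {"12km": 81, "4km": 138}
grid_dims = {"12km": (307, 481, 35), "4km": (619, 406, 35)}


def cell_count(dims):
    """Number of cells in a 3-D grid given (nx, ny, nz)."""
    nx, ny, nz = dims
    return nx * ny * nz


daily_sensor_gb = sum(sensors_gb_per_day.values())
# Assumed composition of the per-forecast total: both grids + 7 days of sensor data.
total_gb = sum(forecast_gb.values()) + 7 * daily_sensor_gb

for res, dims in grid_dims.items():
    print(f"{res}: {cell_count(dims) / 1e6:.1f} M cells")
print(f"sensor data: {daily_sensor_gb:.1f} Gb/day")
print(f"forecast + 7 historical days: {total_gb:.0f} Gb")
```

Under that assumption the total rounds to 663, matching the slide, and the grid dimensions reproduce the quoted 5.2 M and 8.8 M cell counts.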
- 7. Project Goals
•Manage and process time-sensitive data arriving from remote sensors and weather forecasts.
•Reduce data analysis cycles to facilitate timely decisions.
- 8. Numerical Weather Model
Post-Processing
MapReduce,
NoSQL Database
Stream Pre-Processing
Data Warehouse, OLAP Database
Weather Sensors
Observations
Structured Data
Data Assimilation
Forecast Data
1. Remote sensor data in various formats.
2. Quality Control, Interpolation, Sampling, Filtering, Classification
3. High Performance Computing
4. Store structured and unstructured data for analysis and post-processing
5. Business intelligence, data mining, visualization, verification
6. Dashboards and Reports
Automated End-to-End Process
Decision Support Tool
Reports
- 9. Hardware Infrastructure
Traditional HPC (BlueGene/P)
Commodity Servers (x86)
Elastic Cloud Computing (Virtual Machines)
In-situ Big Data
MapReduce
Real-time Data Processing OLAP Visualization
Numerical Weather Models
MPP Jobs
- 10. Weather Model
•WRF ARW v3.5 limited area model
•3.4 hours of run time on 2048 BlueGene/P cores (850 MHz).
- 11. Pre-Processing
•Stream Processing, ETL, R, Python
•Multi-stage quality control of remote sensor data.
•Spatio-temporal interpolation and sampling.
•Star-schema data warehouse.
•NoSQL with MapReduce.
NetCDF, Image, CSV
Staging Files
Low-latency Stream Processing
ETL
Custom Scripts
NoSQL
Data Warehouse
BI Cubes
Observations, Forecast Raw Data
Quality Control, Sampling, Filtering
Structured point or topological data (small <1TB), emphasis on data consistency.
Gridded high-resolution data (big >1TB), emphasis on availability and scalability. Input to coupled models down the line.
Data stores for post processing…
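As an illustration of the quality control and spatial interpolation steps listed above, here is a minimal sketch in Python. It is not the project's implementation: the `Observation` record, the range-check thresholds, and the choice of inverse-distance weighting are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Observation:
    # Hypothetical record for one station reading.
    x_km: float
    y_km: float
    rain_mm: float


# Assumed plausibility limits for a rainfall reading, for illustration only.
RAIN_MIN_MM, RAIN_MAX_MM = 0.0, 300.0


def range_check(observations):
    """First QC stage: drop readings outside the plausible range."""
    return [o for o in observations if RAIN_MIN_MM <= o.rain_mm <= RAIN_MAX_MM]


def idw(observations, x_km, y_km, power=2.0):
    """Inverse-distance-weighted estimate at the point (x_km, y_km)."""
    num = den = 0.0
    for o in observations:
        d = hypot(o.x_km - x_km, o.y_km - y_km)
        if d == 0.0:
            return o.rain_mm  # target point coincides with a station
        w = 1.0 / d ** power
        num += w * o.rain_mm
        den += w
    if den == 0.0:
        raise ValueError("no observations to interpolate from")
    return num / den


raw = [
    Observation(0.0, 0.0, 10.0),
    Observation(10.0, 0.0, 30.0),
    Observation(5.0, 5.0, -999.0),  # sentinel value from a faulty sensor
]
clean = range_check(raw)
estimate = idw(clean, 5.0, 0.0)  # midpoint between the two valid stations
print(len(clean), estimate)
```

The faulty reading is filtered out before interpolation, so the estimate at the midpoint depends only on the two valid stations. A multi-stage pipeline would chain further checks (for example temporal consistency) after `range_check`.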
- 12. Post Processing
•Business Intelligence Cubes
•Multi-dimensional analysis
•Dashboards and reports
•GIS Integration
•MapReduce Views (NoSQL)
•Model Verification
•Ensemble Forecasts/MOS
•Ad-Hoc Data Mining
Multi-Dimensional Cubes
MapReduce Views
Reports and Dashboards
Reports and visualization generated using BI and data visualization tools
Custom Scripts
Coupled Models
Model Output Statistics
Reports and Dashboards
Downstream predictive models use MapReduce views as their data source
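To make the idea of a MapReduce view for model verification concrete, the sketch below computes a per-station root-mean-square error from forecast/observation pairs in plain Python. The record layout and station names are hypothetical, and the functions only mimic the map, group, and reduce pattern; an actual NoSQL store would define views through its own API.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical documents pairing a forecast value with the observed value.
records = [
    {"station": "A", "forecast": 12.0, "observed": 10.0},
    {"station": "A", "forecast": 8.0, "observed": 10.0},
    {"station": "B", "forecast": 5.0, "observed": 5.0},
]


def map_fn(doc):
    """Emit (key, value) pairs: station -> squared forecast error."""
    err = doc["forecast"] - doc["observed"]
    yield doc["station"], err * err


def reduce_fn(values):
    """Combine the squared errors for one key into an RMSE."""
    return sqrt(sum(values) / len(values))


def run_view(docs):
    groups = defaultdict(list)
    for doc in docs:                    # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)   # shuffle: group values by key
    # reduce phase: one result per key
    return {key: reduce_fn(vals) for key, vals in groups.items()}


rmse_by_station = run_view(records)
print(rmse_by_station)
```

Because this `reduce_fn` returns a finished RMSE, it cannot be re-applied to partial results; a view meant for incremental or distributed reduction would more likely emit a running sum and count per key and take the square root at query time.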
- 13. Current Challenges and Future Directions
•Improvements in geostatistics: Gridded data to topological features.
•River basins, flood-prone areas, political boundaries and other locations of interest
•Generating statistics makes for very data-intensive processing
•Potential for parallelization.
•Efficient stream processing engine for larger tuples with longer sliding windows.
•Complex quality control and verification require longer time-series statistics spanning multi-day historical observed and forecast data.
•Strategy: can we keep all data processing in memory, using caching and similar techniques?
•Efficient MapReduce views on array-based data models and other approaches.
•Improvements to the data warehousing schema.
•Ongoing improvements for handling spatio-temporal data.
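The first item above, turning gridded data into statistics for topological features such as river basins, can be sketched as a zonal aggregation. The example assumes each grid cell has already been assigned a region label (a hypothetical `region_mask`); how that mask is derived from basin or boundary polygons is outside the sketch, and the grid values are made up.

```python
from collections import defaultdict

# Hypothetical 3 x 4 rainfall grid (mm) and a matching mask of region labels.
rain_grid = [
    [10.0, 12.0, 30.0, 28.0],
    [11.0, 13.0, 32.0, 30.0],
    [9.0, 14.0, 31.0, 29.0],
]
region_mask = [
    ["basin_1", "basin_1", "basin_2", "basin_2"],
    ["basin_1", "basin_1", "basin_2", "basin_2"],
    ["basin_1", "basin_1", "basin_2", "basin_2"],
]


def zonal_stats(grid, mask):
    """Return {region: (mean, max, cell_count)} over the labelled cells."""
    cells = defaultdict(list)
    for grid_row, mask_row in zip(grid, mask):
        for value, region in zip(grid_row, mask_row):
            cells[region].append(value)
    return {
        region: (sum(vals) / len(vals), max(vals), len(vals))
        for region, vals in cells.items()
    }


stats = zonal_stats(rain_grid, region_mask)
for region, (mean, peak, n) in sorted(stats.items()):
    print(f"{region}: mean={mean:.1f} mm, max={peak:.1f} mm over {n} cells")
```

Each region's cells are aggregated independently of the others, which is why this kind of statistic is a natural candidate for the parallelization mentioned in the list.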
- 14. Summary
•Planning for extreme weather events is a time-critical workflow that involves complex analysis of large data-sets from various sources.
•Recent advances in Big Data and HPC enable the architecture of real-world disaster planning applications.
•Current integration schemes use intermediary staging files and ETL-like scripts.
•Better algorithms and techniques are needed to improve performance and integration.