TDC2019 Intel Software Day - Parallel Programming Techniques in Machine Learning

Parallel programming
in machine learning
Igor Freitas
Intel

Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on
system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com.
Performance results are based on testing as of Aug. 20, 2017 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be
absolutely secure.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and
provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel
representative to obtain the latest forecast, schedule, specifications and roadmaps.
Any forecasts of goods and services needed for Intel’s operations are provided for discussion purposes only. Intel will have no liability to make any purchase in connection with
forecasts published in this document.
ARDUINO 101 and the ARDUINO infinity logo are trademarks or registered trademarks of Arduino, LLC.
Altera, Arria, the Arria logo, Intel, the Intel logo, Intel Atom, Intel Core, Intel Nervana, Intel Saffron, Iris, Movidius, OpenVINO, Stratix and Xeon are trademarks of Intel Corporation or
its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Copyright 2019 Intel Corporation.

Parallel Programming in AI/ML
vs.
"Traditional" Parallel Programming (HPC)
Parallel Programming in ML Frameworks
Code example

HPC != Big Data Analytics != Artificial Intelligence?
https://www.intelnervana.com/framework-optimizations/

Programming model
  HPC: FORTRAN / C++ applications, MPI; high performance
  Big Data Analytics: Java, Python, Go, etc.* applications, Hadoop*; simple to use
Resource manager
  HPC: SLURM; supports large-scale startup
  Big Data Analytics: YARN*; more resilient to hardware failures
File system
  HPC: Lustre*; remote storage
  Big Data Analytics: HDFS*, Spark*; local storage
Hardware
  HPC: compute and memory focused; high-performance components
  Big Data Analytics: storage focused; standard server components
Infrastructure
  HPC: server storage on SSDs, fabric switch
  Big Data Analytics: server storage on HDDs, Ethernet switch
*Other brands and names are the property of their respective owners.

Trends in HPC + Big Data Analytics
Standards
Business viability
Performance
Code Modernization
(Vector instructions)
Many-core
FPGA, ASICs
Usability
Faster time-to-market
Lower costs (HPC in the cloud?)
Better products
Easy to maintain HW & SW
Portability
Open, common environments
Integrated solutions: Storage + Network + Processing + Memory
Public investments

Varied Resource Needs
Big Data & HPC production environments
Figure: workloads plotted by data size against compute demand, ranging from typical Big Data workloads to typical HPC workloads.
• Small Data + Small Compute, e.g. data analysis
• Big Data + Small Compute, e.g. search, streaming, data preconditioning
• Small Data + Big Compute, e.g. mechanical design, multi-physics
• Other workloads shown: High Frequency Trading, Numeric Weather Simulation, Oil & Gas Seismic, Video Survey Traffic Monitor, Personal Digital Health
System cost balance: processor, memory, interconnect, storage

Intel® AI Tools
Portfolio of software tools to expedite and enrich AI development

TOOLKITS (Application Developers)
Deep learning deployment
• OpenVINO™†: Open Visual Inference & Neural Network Optimization toolkit for inference deployment on CPU/GPU/FPGA/VPU using TensorFlow*, Caffe* & MXNet*
• Intel® Movidius™ SDK: Optimized inference deployment for all Intel® Movidius™ VPUs using TensorFlow & Caffe
Deep learning
• Intel® Deep Learning Studio‡: Open-source tool to compress the deep learning development cycle

LIBRARIES (Data Scientists)
Deep learning frameworks
Now optimized for CPU / Optimizations in progress
TensorFlow MXNet Caffe BigDL* (Spark) Caffe2 PyTorch CNTK PaddlePaddle
Machine learning libraries
• Python: Scikit-learn, Pandas, NumPy
• R: Cart, Random Forest, e1071
• Distributed: MLlib (on Spark), Mahout

FOUNDATION (Library Developers)
Analytics, machine & deep learning primitives
• Python*: Intel distribution optimized for machine learning
• DAAL: Intel® Data Analytics Acceleration Library (incl. machine learning)
• MKL-DNN, clDNN: Open-source deep neural network functions for CPU / integrated graphics
Deep learning graph compiler
• Intel® nGraph™ Compiler (Alpha): Open-sourced compiler for deep learning model computations optimized for multiple devices from multiple frameworks

† Formerly the Intel® Computer Vision SDK
Developer personas shown above represent the primary user base for each row, but are not mutually exclusive.
All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice.

Intel® Distribution of OpenVINO™ toolkit
Write once, deploy everywhere
software.intel.com/openvino-toolkit

Deep learning: Model Optimizer, Inference Engine
Computer vision: OpenCV*, OpenCL™, CV algorithms, CV library (kernel & graphic APIs)

Strong adoption + rapidly expanding capability
Agnostic, complementary to major frameworks
Cross-platform flexibility
High performance, high efficiency
• 12,000+ developers
• Over 20 customer products launched based on Intel® Distribution of OpenVINO™ toolkit
• Supports >100 public models, incl. 30+ pretrained models
• Breadth of vision product portfolio
• Optimized media encode/decode functions

An open source version is available at 01.org/openvinotoolkit

What’s Inside Intel® Distribution of OpenVINO™ toolkit

Intel® Deep Learning Deployment Toolkit
• Model Optimizer (convert & optimize) -> IR -> Inference Engine (optimized inference)
• IR = Intermediate Representation file
• 30+ pre-trained models, computer vision algorithms, samples
Traditional Computer Vision
• Optimized libraries & code samples: OpenCV*, OpenVX*, samples
For Intel® CPU & GPU/Intel® Processor Graphics

Tools & Libraries
• Increase media/video/graphics performance: Intel® Media SDK (open source version)
• OpenCL™ drivers & runtimes, for GPU/Intel® Processor Graphics
• Optimize Intel® FPGA (Linux* only): FPGA RunTime Environment (from Intel® FPGA SDK for OpenCL™), bitstreams

OS Support: CentOS* 7.4 (64 bit), Ubuntu* 16.04.3 LTS (64 bit), Microsoft Windows* 10 (64 bit), Yocto Project* version Poky Jethro v2.0.3 (64 bit)
Intel® Architecture-Based Platforms Support
Intel® Vision Accelerator Design Products & AI in Production/Developer Kits
An open source version is available at 01.org/openvinotoolkit (some deep learning functions support Intel CPU/GPU only).
OpenVX and the OpenVX logo are trademarks of the Khronos Group Inc.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

Intel® Deep Learning Deployment Toolkit
For Deep Learning Inference
Figure: trained models (Caffe*, TensorFlow*, MXNet*) -> Model Optimizer (convert & optimize) -> IR (.data) -> Inference Engine (load, infer) -> CPU, GPU, FPGA and NCS plugins.
IR = Intermediate Representation format
Model Optimizer
▪ What it is: A Python-based tool to import trained models
and convert them to Intermediate Representation.
▪ Why important: Optimizes for performance/space with
conservative topology transformations; biggest boost is
from conversion to data types matching hardware.
Inference Engine
▪ What it is: High-level inference API
▪ Why important: Interface is implemented as dynamically
loaded plugins for each hardware type. Delivers best
performance for each type without requiring users to
implement and maintain multiple code pathways.
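The plugin idea described for the Inference Engine can be illustrated with a small, self-contained sketch. This is not the OpenVINO API: `PLUGINS`, `register_plugin`, `cpu_infer`, `gpu_infer` and `infer` are hypothetical names, and the "devices" only run a toy computation. It shows how one common call can dispatch to a per-device implementation selected by name.

```python
# Hypothetical sketch of a plugin-based inference API (not the OpenVINO API).
from typing import Callable, Dict, List

# Registry mapping a device name to the function that runs inference on it.
PLUGINS: Dict[str, Callable[[List[float]], List[float]]] = {}


def register_plugin(device: str):
    """Decorator that registers an inference function for one device type."""
    def decorator(func):
        PLUGINS[device] = func
        return func
    return decorator


@register_plugin("CPU")
def cpu_infer(inputs: List[float]) -> List[float]:
    # Toy "model": ReLU applied element by element.
    return [max(0.0, x) for x in inputs]


@register_plugin("GPU")
def gpu_infer(inputs: List[float]) -> List[float]:
    # Same toy model; a real plugin would call device-specific kernels.
    return [x if x > 0.0 else 0.0 for x in inputs]


def infer(device: str, inputs: List[float]) -> List[float]:
    """Common entry point: user code stays the same for every device."""
    if device not in PLUGINS:
        raise ValueError(f"no plugin registered for device {device!r}")
    return PLUGINS[device](inputs)


if __name__ == "__main__":
    data = [-1.0, 0.5, 2.0]
    for name in ("CPU", "GPU"):
        print(name, infer(name, data))
```

With this layout, supporting another device means registering one more function; the calling code does not change.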
Figure: trained models -> Inference Engine common API (C++ / Python) -> optimized cross-platform inference. Additional model sources: Kaldi*, ONNX*. Additional plugin: GNA. Extendibility: C++, OpenCL™.
GPU = Intel CPU with integrated graphics processing unit/Intel® Processor Graphics
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

Intel® Vision Products
Write once - deploy across Intel architecture - leverage common algorithms
• Intel® CPUs (Atom®, Core™, Xeon®)
• Intel® CPUs w/ integrated graphics
• Intel® Vision Accelerator Design Products: Intel® Movidius™ VPUs & Intel® FPGAs. Add to existing Intel® architectures for accelerated DL inference capabilities
• Future accelerators (Keem Bay, etc.)
1. Intel® Distribution of OpenVINO™ toolkit: Computer vision & deep learning inference tool with common API
2. Portfolio of hardware for computer vision & deep learning inference, device to cloud
3. Ecosystem to cover the breadth of IoT vision systems

Unifying Analytics + AI on Apache Spark
High-Performance Deep Learning Framework for Apache Spark
software.intel.com/bigdl
Unified Analytics + AI Platform
Distributed TensorFlow, Keras and BigDL on Apache Spark
Reference Use Cases, AI Models, High-level APIs, Feature Engineering, etc.
https://github.com/intel-analytics/analytics-zoo

BIGDL
Bringing Deep Learning to Big Data
github.com/intel-analytics/BigDL
▪ Open Sourced Deep Learning Library for
Apache Spark*
▪ Make Deep learning more Accessible to Big
data users and data scientists.
▪ Feature parity with popular DL frameworks like
Caffe, Torch, TensorFlow etc.
▪ Easy Customer and Developer Experience
▪ Run Deep learning Applications as Standard
Spark programs;
▪ Run on top of existing Spark/Hadoop clusters
(No Cluster change)
▪ High Performance powered by Intel MKL and
Multi-threaded programming.
▪ Efficient Scale out leveraging Spark
architecture.
Figure: BigDL runs on Spark Core alongside SQL, SparkR, Streaming, MLlib, GraphX, ML Pipeline and DataFrame.
For developers looking to run deep learning on Hadoop/Spark due to familiarity or analytics use

Intel® nGraph™ Compiler
Open-source compiler enabling flexibility to run models
across a variety of frameworks and hardware
Figure: nGraph™ deep learning compiler sitting between frameworks (including future frameworks) and hardware (including GPU and future hardware).

Example: Layer Fusion
Continuous software optimizations
Before fusion: Input -> Convolution 1x1 -> Convolution 3x3 -> Convolution 1x1 -> SUM -> ReLU -> Output, with a memory read and a memory write around each separate primitive.
After fusion: Input -> Convolution 1x1 -> Convolution 3x3 -> fused primitive (Convolution 1x1 + SUM + ReLU) -> Output, so the fused steps share a single memory read and memory write (memory ops reduced).
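The effect of fusing Convolution 1x1 + SUM + ReLU can be sketched in plain Python. This is a toy model, not how an optimized primitive library is implemented: feature maps are lists of pixels, each pixel is a list of channel values, and all function names are illustrative. The unfused version makes three passes and materializes two intermediate buffers; the fused version produces each output value in one pass.

```python
# Toy illustration of layer fusion: Convolution 1x1 + SUM + ReLU.
# Feature maps are lists of pixels; each pixel is a list of channel values.
from typing import List

Pixel = List[float]
FeatureMap = List[Pixel]


def conv1x1(x: FeatureMap, weights: List[List[float]]) -> FeatureMap:
    """1x1 convolution: each output channel is a dot product over input channels."""
    return [[sum(w * v for w, v in zip(row, pixel)) for row in weights] for pixel in x]


def unfused(x: FeatureMap, weights: List[List[float]], shortcut: FeatureMap) -> FeatureMap:
    conv_out = conv1x1(x, weights)                      # pass 1: writes a full buffer
    summed = [[a + b for a, b in zip(p, q)]             # pass 2: reads and writes again
              for p, q in zip(conv_out, shortcut)]
    return [[max(0.0, v) for v in p] for p in summed]   # pass 3: reads and writes again


def fused(x: FeatureMap, weights: List[List[float]], shortcut: FeatureMap) -> FeatureMap:
    """Single pass: convolution, sum and ReLU are applied before each value is stored."""
    out = []
    for pixel, short in zip(x, shortcut):
        out.append([
            max(0.0, sum(w * v for w, v in zip(row, pixel)) + s)
            for row, s in zip(weights, short)
        ])
    return out


if __name__ == "__main__":
    x = [[1.0, 2.0], [-3.0, 0.5]]            # 2 pixels, 2 input channels
    weights = [[1.0, -1.0], [0.5, 0.5]]      # 2 output channels
    shortcut = [[0.0, 1.0], [1.0, -2.0]]     # tensor added by the SUM node
    assert fused(x, weights, shortcut) == unfused(x, weights, shortcut)
    print(fused(x, weights, shortcut))
```

Both versions compute the same result; the difference is only in how many times intermediate data is written to and read back from memory.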
HPC + AI
Optimization techniques

Integer Matrix Multiply Performance
on Intel® Xeon® Platinum 8180 Processor
Configuration Details on Slide: 13
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors
may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Source: Intel
measured as of June 2017 Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not
guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Enhanced matrix multiply performance on Intel® Xeon® Scalable Processor
Lower precision integer ops
Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or
system.

Parallel programming applied to AI

HPC techniques applied to AI
Figure: four jobs (Job 0 to Job 3), each running with 12 threads and pinned using libnumactl and kmp_affinity.
https://software.intel.com/en-us/articles/boosting-deep-learning-training-inference-performance-on-xeon-and-xeon-phi
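Running several pinned jobs side by side can be sketched as follows. This is a minimal sketch under assumptions: a 48-core machine (4 jobs x 12 threads), contiguous core IDs per job, and a hypothetical script name `train.py`. Whether contiguous core IDs belong to the same socket depends on how the system numbers its cores, so check the topology (for example with `numactl --hardware`) before relying on it.

```python
# Sketch: split CPU cores into equal, contiguous ranges, one per training job.
from typing import List


def core_ranges(total_cores: int, jobs: int) -> List[str]:
    """Return one 'first-last' core range per job, in numactl -C syntax."""
    if jobs <= 0 or total_cores <= 0 or total_cores % jobs != 0:
        raise ValueError("total_cores must be a positive multiple of jobs")
    per_job = total_cores // jobs
    return [f"{j * per_job}-{(j + 1) * per_job - 1}" for j in range(jobs)]


def job_commands(total_cores: int, jobs: int, script: str) -> List[str]:
    """Build one shell command per job, pinning each job to its own cores."""
    ranges = core_ranges(total_cores, jobs)
    per_job = total_cores // jobs
    return [
        f"OMP_NUM_THREADS={per_job} numactl -C {cores} python {script}"
        for cores in ranges
    ]


if __name__ == "__main__":
    # 4 jobs x 12 threads, as on the slide; 48 cores is an assumed machine size.
    for command in job_commands(48, 4, "train.py"):
        print(command)
```

Each job then uses only its own cores, so the jobs do not compete for the same CPUs.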

Intel Artificial Intelligence Centers of Excellence
Success cases
"Cognitive Validator of Traffic Violations"
✓ Performance 22.5x faster on "Xeon Scalable Processors"
"...traffic-fine processing that used to take 45 hours can now be done in less than 2 hours."
✓ Development of the mathematical model
"With this, we achieved 90% accuracy in the system, in addition to automating the whole project,"
said Gustavo Rocha, division head at SERPRO.
Thiago Oliveira, Infrastructure Engineering superintendent at SERPRO

TensorFlow for CPU
intra_op_parallelism_threads: Nodes that can use multiple threads to parallelize their execution will schedule the
individual pieces into this pool.
inter_op_parallelism_threads: All ready nodes are scheduled in this pool.
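import tensorflow as tf  # TensorFlow 1.x API

config = tf.ConfigProto()
config.intra_op_parallelism_threads = 44  # threads available inside a single op
config.inter_op_parallelism_threads = 44  # threads for running independent ops
sess = tf.Session(config=config)
Applying the "process affinity" technique (NUMA-aware) in TensorFlow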
Source:
https://www.tensorflow.org/guide/performance/overview#optimizing_for_cpu

Understanding the environment:
• Dual socket
• AVX-512
• 16 cores / socket
• 32 threads / socket
• Total: 64 threads

Demo code:
MNIST
Topology:
Convolution + ReLU + MaxPool +
Convolution + ReLU + MaxPool
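The slide gives only the layer sequence, so the following shape walk-through rests on assumptions that are common for MNIST examples but are not stated in the talk: a 28x28x1 input, "same"-padded stride-1 convolutions with 32 and then 64 filters, and 2x2 max pooling. The function names are illustrative.

```python
# Shape walk-through for Convolution + ReLU + MaxPool, applied twice.
# Padding, stride, filter counts and pool size are assumptions.
from typing import Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def conv_same(shape: Shape, filters: int) -> Shape:
    """'Same'-padded, stride-1 convolution keeps height and width."""
    height, width, _ = shape
    return (height, width, filters)


def max_pool(shape: Shape, size: int = 2) -> Shape:
    """Non-overlapping pooling divides height and width by the window size."""
    height, width, channels = shape
    return (height // size, width // size, channels)


def mnist_topology(input_shape: Shape = (28, 28, 1)) -> Shape:
    x = conv_same(input_shape, filters=32)  # ReLU does not change the shape
    x = max_pool(x)
    x = conv_same(x, filters=64)
    x = max_pool(x)
    return x


if __name__ == "__main__":
    print(mnist_topology())
```

Under these assumptions the feature map shrinks from 28x28 to 14x14 and then to 7x7, while the channel count grows from 1 to 32 to 64.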

• Preparing the environment with Anaconda
conda create -n tf-pip-2
pip install intel-tensorflow

python MNIST-test.py

numactl -C 0-7 python MNIST-test.py

numactl -C 0 python MNIST-test.py

numactl -C 0-15,16-31 python MNIST.py
• More cores does not mean higher performance
• 48 threads performed the same as 64 threads (102 s)
• Best time with 32 threads (83 s): 1.22x speedup
Chart "NUMACTL": time in seconds vs. number of threads
Threads   Seconds
4         271
8         140
16        112
32        83
48        102
64        105
64 cores in "default" mode
Time for 64 threads in "default" mode: 102 seconds
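The speedup figure can be reproduced from the timings. A minimal sketch, assuming the baseline is the 102-second "default" run and using the data points of the NUMACTL chart; the function and variable names are illustrative.

```python
# Speedup relative to the 64-thread "default" run, using the NUMACTL chart values.
from typing import Dict

BASELINE_SECONDS = 102  # 64 threads, "default" mode

# threads -> seconds, as read from the NUMACTL chart
NUMACTL_SECONDS: Dict[int, int] = {4: 271, 8: 140, 16: 112, 32: 83, 48: 102, 64: 105}


def speedup(baseline: float, measured: float) -> float:
    """Speedup > 1 means the measured run is faster than the baseline."""
    return baseline / measured


def best_config(timings: Dict[int, int]) -> int:
    """Return the thread count with the shortest run time."""
    return min(timings, key=timings.get)


if __name__ == "__main__":
    best = best_config(NUMACTL_SECONDS)
    print(best, round(speedup(BASELINE_SECONDS, NUMACTL_SECONDS[best]), 2))
```

For 32 threads this gives 102 / 83, about 1.23, in line with the roughly 1.22x shown on the slide.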

export KMP_BLOCKTIME=0
numactl -C 0-15,16-31 python MNIST.py
• KMP_BLOCKTIME: time, in milliseconds, that a thread waits after finishing its task before going to sleep
• 2.68x speedup
• Best time with 16 threads
• Best cost-benefit with 2 threads
Chart "NUMACTL - KMP_BLOCKTIME=0": time in seconds vs. number of threads
Threads   KMP_BLOCKTIME=0   KMP_BLOCKTIME=default
1         67                -
2         43                -
4         41                271
8         40                140
16        39                112
32        41                83
48        50                102
64        46                105
export KMP_BLOCKTIME=0
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0
numactl -C 0-15 python MNIST.py
• 16 threads: 4.86x speedup!
• Lower infrastructure cost
• More training jobs at the same time
• Larger models
• No code changes
Chart: time in seconds vs. number of threads for three configurations: NUMACTL, NUMACTL + KMP_BLOCKTIME=0, and NUMACTL + KMP_BLOCKTIME=0 + AFFINITY.

KMP_AFFINITY=granularity=fine,verbose,compact,1,0
• Controls how threads are distributed across cores and sockets
• Affects bandwidth ("memory speed")
• Compact:
  • Threads are placed close to each other
  • Data exchange between them is faster
  • Data fits in the cache
  • Little data exchange between CPU and DRAM
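The same settings can be applied from a small Python launcher instead of the shell. This is a sketch, not part of the talk: the helper names are illustrative, the default core range and script name mirror the commands on the slides, and `numactl` must be installed for `launch` to work.

```python
# Sketch: launch a training script with the OpenMP settings from the slides.
import os
import subprocess
from typing import Dict, List, Mapping


def build_env(base: Mapping[str, str]) -> Dict[str, str]:
    """Copy the environment and add the KMP_* settings used in the talk."""
    env = dict(base)
    env["KMP_BLOCKTIME"] = "0"
    env["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
    return env


def build_command(cores: str, script: str) -> List[str]:
    """Pin the Python process to the given cores with numactl -C."""
    return ["numactl", "-C", cores, "python", script]


def launch(cores: str = "0-15", script: str = "MNIST.py") -> int:
    """Run the script and return its exit code (requires numactl on the PATH)."""
    completed = subprocess.run(build_command(cores, script), env=build_env(os.environ))
    return completed.returncode
```

Keeping the environment and the command in separate helpers makes it easy to sweep thread counts or affinity settings without touching the training code.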

What’s New in Intel® Distribution of OpenVINO™ toolkit 2018 R5
▪ Extends neural network support to include LSTM (long short-term memory) from ONNX*, TensorFlow* & MXNet*
frameworks, & 3D convolutional-based networks in preview mode (CPU-only) for non-vision use cases.
▪ Introduces Neural Network Builder API (preview), providing flexibility to create a graph from simple API calls and
directly deploy via the Inference Engine.
▪ Improves Performance - Delivers significant CPU performance boost on multicore systems through new
parallelization techniques via streams. Optimizes performance on Intel® Xeon®, Core™ & Atom processors through
INT8-based primitives for Intel® Advanced Vector Extensions (Intel® AVX-512), Intel® AVX2 & SSE4.2.
▪ Supports Raspberry Pi* hardware as a host for the Intel® Neural Compute Stick 2 (preview). Offload your deep
learning workloads to this low-cost, low-power USB.
▪ Adds 3 new optimized pretrained models (for a total of 30+): Text detection of indoor/outdoor scenes, and 2
single-image super resolution networks that enhance image resolution by a factor of 3 or 4.
See product site & release notes for more details about 2018 R4.
OpenVX and the OpenVX logo are trademarks of the Khronos Group Inc.

Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. No computer system can be absolutely
secure. Check with your system manufacturer or retailer or learn more at intel.com.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced
data are accurate.
The cost reduction scenarios described are intended to enable you to get a better understanding of how the purchase of a given Intel based product, combined with a number of
situation-specific variables, might affect future costs and savings. Circumstances will vary and there may be unaccounted-for costs related to the use and deployment of a given
product. Nothing in this document should be interpreted as either a promise of or contract for a given level of costs or cost reduction.
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark,
are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should
consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with
other products. For more complete information visit intel.com/performance.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult
other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit
intel.com/benchmarks.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations
include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not
specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice. Notice Revision #20110804
Intel processors of the same SKU may vary in frequency or power as a result of natural variability in the production process.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS
GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION
INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT.
© 2018 Intel Corporation. Intel, the Intel logo, Intel Optane and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

BigDL Configuration Details
Benchmark segment: AI/ML/DL
Benchmark type: Training
Benchmark metric: Training throughput (images/sec)
Framework: BigDL master trunk with Spark 2.1.1
Topology: Inception V1, VGG, ResNet-50, ResNet-152
# of nodes: 8, 16 (multiple configurations)
Platform: Purley
Sockets: 2S
Processor:
  Intel® Xeon® Scalable Platinum 8180 Processor (Skylake): 28-core @ 2.5 GHz (base), 3.8 GHz (max turbo), 205W
  Intel® Xeon® Processor E5-2699v4 (Broadwell): 22-core @ 2.2 GHz (base), 3.6 GHz (max turbo), 145W
Enabled cores: Skylake: 56 per node, Broadwell: 44 per node
Total memory: Skylake: 384 GB, Broadwell: 256 GB
Memory configuration:
  Skylake: 12 slots * 32 GB @ 2666 MHz Micron DDR4 RDIMMs
  Broadwell: 8 slots * 32 GB @ 2400 MHz Kingston DDR4 RDIMMs
Storage:
  Skylake: Intel® SSD DC P3520 Series (2TB, 2.5in PCIe 3.0 x4, 3D1, MLC)
  Broadwell: 8 * 3 TB Seagate HDDs
Network: 1 * 10 GbE network per node
OS: CentOS Linux release 7.3.1611 (Core), Linux kernel 4.7.2.el7.x86_64
HT: On
Turbo: On
Computer type: Dual-socket server
Framework version: https://github.com/intel-analytics/BigDL
Topology version: https://github.com/google/inception
Dataset, version: ImageNet, 2012; Cifar-10
Performance command (Inception v1):
  spark-submit --class com.intel.analytics.bigdl.models.inception.TrainInceptionV1 --master spark://$master_hostname:7077 --executor-cores=36 --num-executors=16 --total-executor-cores=576 --driver-memory=60g --executor-memory=300g $BIGDL_HOME/dist/lib/bigdl-*-SNAPSHOT-jar-with-dependencies.jar --batchSize 2304 --learningRate 0.0896 -f hdfs:///user/root/sequence/ --checkpoint $check_point_folder
Data setup: Data was stored on HDFS and cached in memory before training
Java: JDK 1.8.0 update 144
MKL library version: Intel MKL 2017

Spark Configuration Details
Configurations:
4.3X for Spark MLlib thru Intel Math Kernel Library (MKL)
▪ Spark-Perf (same for before and after): 9 nodes each with Intel® Xeon® processor E5-2697A v4 @ 2.60GHz * 2 (16 cores, 32 threads); 256 GB ; 10x SSDs; 10Gbps NIC
19x for HDFS Erasure Coding in micro workload (RawErasureCoderBenchmark) and 1.25x in Terasort, plus 50+% storage capacity saving and higher failure tolerance level.
▪ RawErasureCoderBenchmark (same for before and after): single node with Intel® Xeon® processor E5-2699 v4 @ 2.20GHz *2 (22 cores, 44 threads); 256GB; 8x HDDs; 10Gbps NIC
▪ Terasort (same for before and after): 10 nodes each with Intel® Xeon® processor E5-2699 v4 @ 2.20GHz *2 (22 cores, 44 threads); 256GB; 8x HDDs; 10Gbps NIC
5.6x for HBase off heaping read in micro workload (PE) and 1.3x in real Alibaba production workload
▪ PE (same for before and after): Intel® Xeon® Processor X5670 @ 2.93GHz *2 (6 cores, 12 threads); RAM: 150 GB; 1Gbps NIC
▪ Alibaba (same for before and after): 400 nodes cluster with Intel® Xeon® processors
1.22x Spark Shuffle File Encryption performance for TeraSort and 1.28x for BigBench
▪ Terasort (same for before and after): Single node with Intel® Xeon® Processor E5-2699 v3 @ 2.30GHz *2 (18 cores, 36 threads); 128GB; 4x SSD; 10Gbps NIC
▪ BigBench (same for before and after): 6 nodes each with Intel® Xeon® Processor E5-2699 v3 @ 2.30GHz *2 (18 cores, 36 threads); 256GB; 1x SSD; 8x SATA HDD 3TB, 10Gbps NIC
1.35X Spark Shuffle RPC encryption performance for TeraSort and 1.18x for BigBench
▪ Terasort (same for before and after): 3 nodes each with Intel® Xeon® Processor E5-2699 v3 @ 2.30GHz *2 (18 cores, 36 threads); 128GB; 4x SSD; 10Gbps NIC
▪ BigBench (same for before and after): 5 nodes. 1x head node: Intel® Xeon® Processor E5-2699 v3 @ 2.30GHz *2 (18 cores, 36 threads); 384GB; 1x SSD; 8x SATA HDD 3TB, 10Gbps NIC. 4x
worker nodes: each with Intel® Xeon® processor E5-2699 v4 @ 2.20GHz *2 (22 cores, 44 threads); 384GB; 1x SSD; 8x SATA HDD 3TB, 10Gbps NIC.
10X scalability for Word2Vec E5-2630v2 * 2, 128 GB Memory, 12x HDDs; 1000Mb NIC (14 nodes)
70X scalability for LDA (Latent Dirichlet Allocation)
▪ Intel Xeon E5-2630v2 * 2, 288GB Memory, SAS Raid5, 10Gb NIC

Spark SQL Configurations
Columns: AEP configuration | DRAM configuration
Hardware
  DRAM: 192GB (12x 16GB DDR4) | 768GB (24x 32GB DDR4)
  Apache Pass: 1TB (ES2: 8 x 128GB) | N/A
  AEP mode: App Direct (Memkind) | N/A
  SSD: N/A | N/A
  CPU: Worker: Intel® Xeon® Platinum 8170 @ 2.10GHz (Thread(s) per core: 2, Core(s) per socket: 26, Socket(s): 2, CPU max MHz: 3700.0000, CPU min MHz: 1000.0000, L1d cache: 32K, L1i cache: 32K, L2 cache: 1024K, L3 cache: 36608K)
  OS: 4.16.6-202.fc27.x86_64 (BKC: WW26, BIOS: SE5C620.86B.01.00.0918.062020181644)
Software
  OAP: 1TB AEP based OAP cache | 620GB DRAM based OAP cache
  Hadoop: 8 * HDD disk (ST1000NX0313, 1-replica uncompressed & plain encoded data on Hadoop)
  Spark: 1 * Driver (5GB) + 2 * Executor (62 cores, 74GB), spark.sql.oap.rowgroup.size=1MB
  JDK: Oracle JDK 1.8.0_161
Workload
  Data scale: 2.6TB (9 queries related data is of 729.4GB in capacity)
  TPC-DS queries: 9 I/O intensive queries (Q19, Q42, Q43, Q52, Q55, Q63, Q68, Q73, Q98)
  Multi-tenants: 9 threads (fair scheduled)

Apache Cassandra Configurations
Columns: NVMe configuration | Apache Pass configuration
Server hardware
  System details: Intel® Server Board Purley Platform (2 socket)
  CPU: Dual Intel® Xeon® Platinum 8180 Processors, 28 core/socket, 2 sockets, 2 threads per core
  Hyper-Threading: Enabled
  DRAM: DDR4 dual rank 192GB total = 12 DIMMs 16GB@2667Mhz | DDR4 dual rank 384GB total = 12 DIMMs 32GB@2667Mh
  Apache Pass: N/A | AEP ES.2 1.5TB total = 12 DIMMs * 128GB Capacity each: Single Rank, 128GB, 15W
  Apache Pass mode: N/A | App-Direct
  NVMe: 4 x Intel P3500 1.6TB NVMe devices | N/A
  Network: 10Gbit on board Intel NIC
Server software
  OS: Fedora 27
  Kernel: 4.16.6-202.fc27.x86_64
  Cassandra version: 3.11.2 release | Cassandra 4.0 trunk, with App Direct patch version 2.1, software found at https://github.com/shyla226/cassandra/tree/13981 with PCJ library: https://github.com/pmem/pcj
  JDK: Oracle Hotspot JDK (JDK1.8 u131)
  Spectre/Meltdown compliant: Patched for variants 1/2/3
Cassandra parameters
  Number of Cassandra instances: 1 | 14
  Cluster nodes: One per cluster
  Garbage collector: CMS | Parallel
  JVM options (difference from default): -Xms64G -Xmx64G | -Xms20G -Xmx20G -Xmn8G -XX:+UseAdaptiveSizePolicy -XX:ParallelGCThreads=5
  Schema: cqlstress-insanity-example.yaml
  Database size per instance: 1.25 billion entries | 100 K entries
Client(s) hardware
  Number of client machines: 1 | 2
  System: Intel® Server Board model S2600WFT (2 socket)
  CPU: Dual Intel® Xeon® Platinum 8176M CPU @ 2.1Ghz, 28 core/socket, 2 sockets, 2 threads per core
  DRAM: DDR4 384GB total = 12 DIMMs 32GB@2666Mhz
  Network: 10Gbit on board Intel NIC
Client software
  OS: Fedora 27
  Kernel: 4.16.6-202.fc27.x86_64
  JDK: Oracle Hotspot JDK (JDK1.8 u131)
Workload
  Benchmark: Cassandra-Stress
  Cassandra-Stress instances: 1 | 14
  Command line to write database (NVMe):
    cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops(insert=1) n=1250000000 cl=ONE no-warmup -pop seq=1..1250000000 -mode native cql3 -node <ip_addr> -rate threads=10
  Command line to write database (Apache Pass):
    cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops(insert=1) n=100000 cl=ONE no-warmup -pop seq=1..100000 -mode native cql3 -node <ip_addr> -rate threads=10
  Command line to read database (NVMe):
    cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops(simple1=1) duration=10m cl=ONE no-warmup -pop dist=UNIFORM(1..1250000000) -mode native cql3 -node <ip_addr> -rate threads=300
  Command line to read database (Apache Pass):
    cassandra-stress user profile=/root/cassandra_4.0/tools/cqlstress-insanity-example.yaml ops(simple1=1) duration=3m cl=ONE no-warmup -pop dist=UNIFORM(1..100000) -mode native cql3 -node <ip_addr> -rate threads=320
