Big Data - EBC on the road Brazil Edition [Portuguese]
- 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
Thiago Paulino
Enterprise Solutions Architect,
Big Data AWS
Executive Briefing Conference
September 20, 2018
- 2.
What will we cover?
• Big Data and why organizations care
• Main challenges: What? How? Why?
• Demo
• Big Data and Machine Learning
• Architectures
- 3.
Visualization Variation
Big Data is defined by several V's
Volume Velocity Variety Veracity Value
- 4.
What do the analysts say?
https://www.promptcloud.com
https://john-popelaars.blogspot.com
https://ww.signiant.com
https://www.linkedin.com/pulse/world-today-data-rich-information-poor-guru-p-mohapatra-pmp/
- 5.
Organizations that generate business value from their data tend to
outperform their competitors. An Aberdeen survey found that companies
that implemented a data lake and managed to extract value from that
data achieved organic revenue growth 9% higher than their competitors.
Organic growth: Leaders 24%, Followers 15%
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
What really matters: getting value from data
- 6.
The data shift | Analytics adoption
Capture and store
new data at
PB-EB scale
Run new types of
analysis at low cost
• Machine learning
• Big data processing
• Real-time analytics
• Full-text search
New types of
analysis
- 7.
Data Lakes: an extension of the traditional approach
Data warehouse
Business intelligence
OLTP ERP CRM LOB
• Relational and non-relational data
• TB–EB scale
• Use different tools for analysis
• Low-cost storage and analytics
Devices Web Sensors Social
Data lake
Big data processing,
real-time, machine learning
- 8.
Data Lakes in the AWS cloud
Analytics
• High durability and availability at exabyte scale
• More secure, with compliance and the ability to audit
access and executions
• Fine-grained control at the stored-object level
• Better performance when accessing data
• Several ways to send and store your data
• 2x more partner integrations
• Analyze with the broadest set of analytics & ML services
Machine learning
Real-time data movement
On-premises data movement
Data Lake
on AWS
- 9.
Managed ML Service
Deep Learning AMIs
Video and Image Recognition
Conversational Interfaces
Deep-Learning Video Camera
Natural Language Processing
Language Translation
Speech Recognition
Text-to-Speech
Interactive Analysis
Hadoop & Spark
Data Warehousing
Full-text search
Real-time analytics
Dashboards & Visualizations
Dedicated Network connection
Secure appliances
Ruggedized Shipping Container
Database migration
Connect Devices to AWS
Real-time Data Streams
Real-time Video Streams
Data Lake
on AWS
Storage | Archival Storage | Data Catalog
Analytics | Machine learning
Real-time data movement | On-premises data movement
Data Lakes, Analytics, and IoT solutions from AWS
The broadest range of analytics services
- 10.
Catalog & Search
Access and search metadata
Access & User Interface
Give your users easy and secure access
DynamoDB Elasticsearch API Gateway Identity & Access
Management
Cognito
QuickSight Amazon AI EMR Redshift
Athena Kinesis RDS
Central Storage
Secure, cost-effective
Storage in Amazon S3
S3
Snowball Database Migration
Service
Kinesis Firehose Direct Connect
Data Ingestion
Get your data into S3
Quickly and securely
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified
Processing & Analytics
Use of predictive and prescriptive
analytics to gain better understanding
Security Token
Service
CloudWatch CloudTrail Key Management
Service
Data Lake components
Metadata | User access
Security/Governance
Data movement | Analytics and Machine Learning
- 11.
Catalog & Search
Access and search metadata
Access & User Interface
Give your users easy and secure access
DynamoDB Elasticsearch API Gateway Identity & Access
Management
Cognito
QuickSight Amazon AI EMR Redshift
Athena Kinesis RDS
Central Storage
Secure, cost-effective
Storage in Amazon S3
S3
Snowball Database Migration
Service
Kinesis Firehose Direct Connect
Data Ingestion
Get your data into S3
Quickly and securely
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified
Processing & Analytics
Use of predictive and prescriptive
analytics to gain better understanding
Security Token
Service
CloudWatch CloudTrail Key Management
Service
Data Lake components
Glue ETL
- 12.
Common Big Data challenges
- 13.
Which tool should I use?
One tool does not meet
every challenge…
Organize your toolkit and
use each tool for its
intended purpose
- 14.
Specific purpose.
The right tool for
the right job.
Which tool should I use?
- 15.
Which tool should I use?
- 16.
What data do I have?
Gartner:
“Through 2018, 80% of data lakes will not include effective
metadata management capabilities, making them inefficient.”
Data Lake
on AWS
Storage | Archival Storage | Data Catalog
- 17.
Data Catalog | Job Authoring | Job Execution
Compatible with Apache Hive Metastore
Integrated with AWS services
Automatic crawling
Discover
Automatically generates
ETL code
Python and Apache Spark
Edit, debug, and share
Develop
Serverless
execution
Flexible scheduling
Monitoring and alerting
Deploy
What data do I have? – AWS Glue
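As a rough illustration of the "Discover" step, the sketch below builds the arguments for a scheduled AWS Glue crawler over an S3 prefix. The crawler name, IAM role ARN, database name, bucket path, and schedule are all hypothetical placeholders, not values from the talk. The dictionary follows the shape of the Glue CreateCrawler request; the actual boto3 calls are left as comments so the snippet runs without AWS credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments shaped like a Glue CreateCrawler request."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        # The crawler scans this S3 prefix and infers table schemas.
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Glue schedules use the cron(...) expression syntax.
        request["Schedule"] = schedule
    return request


# All values below are placeholders chosen for illustration.
crawler_request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_db",
    s3_path="s3://example-data-lake/raw/sales/",
    schedule="cron(0 2 * * ? *)",
)

# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**crawler_request)
#   glue.start_crawler(Name=crawler_request["Name"])
print(crawler_request)
```

Keeping the request as plain data makes it easy to review or version before anything is created in an account.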
- 18.
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
AMAZON
QUICKSIGHT
What data do I have?
- 19.
Other ways to populate your data
catalog
Call the AWS Glue CreateTable API
Create tables manually | Run a Hive DDL statement
Apache Hive
Metastore
AWS GLUE ETL | AWS GLUE
DATA CATALOG
Import from the Apache Hive metastore
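Two of the options on this slide can be sketched side by side: a `TableInput` structure for the Glue CreateTable API and an equivalent Hive DDL statement. The table name, columns, and S3 location are invented for the example, and the CSV format settings are an assumption. Only the payloads are built here, so nothing is sent to AWS.

```python
def build_table_input(table, columns, s3_location):
    """Build a TableInput dict shaped like the Glue CreateTable request (CSV assumed)."""
    return {
        "Name": table,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [{"Name": name, "Type": type_} for name, type_ in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }


def build_hive_ddl(table, columns, s3_location):
    """Build a Hive DDL statement describing the same external CSV table."""
    cols = ",\n  ".join(f"`{name}` {type_}" for name, type_ in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical table definition used only for illustration.
columns = [("order_id", "string"), ("amount", "double")]
location = "s3://example-data-lake/raw/orders/"

table_input = build_table_input("orders", columns, location)
ddl = build_hive_ddl("orders", columns, location)

# With boto3 available, the first option could be submitted as:
#   boto3.client("glue").create_table(DatabaseName="sales_db", TableInput=table_input)
print(ddl)
```

Both routes end with the same table metadata in the catalog; which one fits better depends on whether the team already works with Hive tooling.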
- 20.
MOST important: selecting an agile framework? Start with a tool that serves the purpose.
Experiment, test and evaluate, then adopt the tool.
Let's look at an example:
How can I get started?
Evolution of the Netflix Data Pipeline
- 21.
Aggregate and load events into Hadoop/Hive
for batch processing
TRY new things
Batch | Batch + Real-time
- 22.
Chukwa front-end Kafka
Kafka front-end Kafka
ADOPT your solution
- 23.
“Amazon Kinesis Streams processes multiple terabytes of log data
per day, yet events show up in our analytics in
seconds,” says Bennett. “We can discover and respond to
issues in real time, ensuring high availability and a
great customer experience.”
FOCUS on business value
- 24.
Choose an environment that lets you test a variety of tools
Focus on tools that let you do as much as possible with an emphasis
on analysis…
AGILITY for the business
- 25.
Agility in analytics
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch Service
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data Lake
on AWS
Storage | Archival Storage | Data Catalog
Analytics | Machine learning
Real-time data movement | On-premises data movement
- 26.
Agility - Hadoop/Spark Analytics
• Distributed processing
• Several types of analysis
• Batch/Script (Hive/Pig)
• Interactive (Spark, Presto)
• Real-time (Spark)
• Machine Learning (Spark)
• NoSQL (HBase)
• For a variety of use cases
• Log and clickstream analysis
• Machine learning
• Real-time analytics
• Large-scale analytics
• Genomics
• ETL
YARN (Hadoop Resource Manager)
Batch | Script | Interactive | Real-time | Machine learning | NoSQL
Data Lake
on AWS
- 27.
Agility - Hadoop/Spark Analytics on AWS
YARN (Hadoop Resource Manager)
Batch | Script | Interactive | Real-time | Machine learning | NoSQL
Data Lake
on AWS
Amazon S3
Amazon EMR
Managed Hadoop/Spark
Object Storage
- 28.
Amazon S3 – Source of truth, multiple
clusters
Amazon S3
Source of Truth
Interactive Spark Cluster
Transient ETL Job
Amazon EMR
HDFS
EC2 Instance Memory
Intermediates stored on local disk or HDFS
Local Intermediate HDFS/Storage
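A transient ETL cluster of the kind shown here can be sketched as an Amazon EMR job-flow request that runs one Spark step and then shuts down, leaving the results in S3. The release label, instance types, role names, and S3 paths below are assumptions made for illustration; the dictionary follows the shape of the EMR RunJobFlow request, and the boto3 call is left as a comment.

```python
def build_transient_etl_cluster(name, script_s3_path, log_uri, core_nodes=2):
    """Build a request shaped like EMR RunJobFlow for a run-once Spark ETL cluster."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.17.0",  # assumed release; pick the one you need
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m4.xlarge",  # assumed instance types
            "SlaveInstanceType": "m4.xlarge",
            "InstanceCount": core_nodes + 1,  # core nodes plus one master
            # Terminate automatically once every step has finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "etl",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_path],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default role names, assumed to exist
        "ServiceRole": "EMR_DefaultRole",
    }


# Placeholder paths for illustration only.
cluster_request = build_transient_etl_cluster(
    name="nightly-etl",
    script_s3_path="s3://example-data-lake/jobs/etl.py",
    log_uri="s3://example-data-lake/logs/emr/",
)

# With boto3 available: boto3.client("emr").run_job_flow(**cluster_request)
print(cluster_request["Instances"])
```

Because input and output both live in S3, nothing of value is lost when the cluster terminates; HDFS and local disk hold only intermediates.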
- 29.
Common data catalog across clusters
Amazon S3
Interactive Spark Cluster
Amazon EMR
Amazon EMR
HDFS
Transient ETL Job
Source of Truth
HDFS
Describes Data in S3
MySQL DB
instance
Customers have options
Glue Data
Catalog
- 30.
Amazon Athena is an interactive query
service that makes it easy to analyze data
stored directly in Amazon S3
using SQL.
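To make the Athena description concrete, here is a minimal sketch that assembles an aggregate SQL query and the arguments for Athena's StartQueryExecution request. The database, table, column, and results bucket are hypothetical names, and identifiers are assumed to come from trusted code rather than user input. The boto3 call itself is shown only as a comment.

```python
def build_athena_request(database, table, group_column, output_location):
    """Build a request shaped like Athena StartQueryExecution for a top-10 count."""
    query = (
        f'SELECT "{group_column}", COUNT(*) AS total '
        f'FROM "{table}" '
        f'GROUP BY "{group_column}" '
        "ORDER BY total DESC "
        "LIMIT 10"
    )
    return {
        "QueryString": query,
        # The database is the one registered in the data catalog.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Placeholder names for illustration only.
athena_request = build_athena_request(
    database="weblogs_db",
    table="access_logs",
    group_column="status",
    output_location="s3://example-athena-results/",
)

# With boto3 available:
#   boto3.client("athena").start_query_execution(**athena_request)
print(athena_request["QueryString"])
```

The query runs against files already in S3, so there is no cluster to start and no data to load first.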
- 31.
Demo
Data catalog and analysis.
- 32.
Machine Learning and Big Data
- 33.
Big Data on top of Machine Learning
Better
Decisions
Object Storage
Databases
Data warehouse
Streaming analytics
BI
Hadoop
Spark/Presto
Elasticsearch
Better
Products Machine Learning
Deep Learning/ AI
More
Users
More
Data
Click stream
User activity
Generated content
Purchases
Clicks
Likes
Sensor data
- 34.
Agility in Machine Learning
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch Service
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data Lake
on AWS
Storage | Archival Storage | Data Catalog
Analytics | Machine learning
Real-time data movement | On-premises data movement
- 35.
Machine Learning needs new tools
Machine Learning/Deep Learning
Business
Reporting
Data Scientists
Data Engineer
IDE
Data
Catalog
Central
Storage
SageMaker
- 36.
- 37.
FINRA – Data is our core mission
Reconstruct a market with trillions of
events
• Data from brokerages and exchanges
• Stocks, options, fixed income
• Build a graph of market order
events
Analyze the data with a focus on financial
fraud
• Insider trading,
product manipulation, and much more
• Finding a needle in a haystack
- 38.
FINRA – Scaling the Data Lake
Database1
Storage
Query/Compute
Catalog
Database2
Storage
Query/Compute
Catalog
Database n
Storage
Query/Compute
Catalog
Storage
Query/
Compute
Catalog
EMR Spark | Lambda | EMR Presto | EMR HBase
herd Hive
metastore
FINRA in Data Center FINRA in AWS
Scales Silo
Amazon
S3
- 39.
Enable Machine Learning on the Data Lake
Data
Scientist
Logical ‘Database’
EMR Cluster
Single source
of data
Spark Cluster
DS-in-a-box
AuthN
Data
Scientist
Data
Scientist
Catalog
- 40.
UDSP – Inventory – Much more than R
• R 3.2.5, Python (2.7.12 and 3.4.3)
• Packages
• R: 300+ Python: 100+
• Tools for Building Packages
• gcc, gfortran, make, java, maven, ant…
• IDEs
• Jupyter, RStudio Server
• Deep Learning
• CUDA, CuDNN (if GPU present)
• Theano, Caffe, Torch
• TensorFlow
- 41.
Best practices
- 42.
Key points
• Decouple components while maintaining high performance
• Storage, analytics, metadata management, etc.
• Future-proof your analytics (how to handle new environments)
• Choose the best tool for each task
• Elasticity and multiple clusters for dedicated purposes
• Replace capacity planning with a consumption model
• Don't forget to manage your metadata
- 43.
Use the right storage service
Data structure → fixed schema, JSON, key-value
Access patterns → Store data in the format in which it will be accessed.
Data characteristics → hot, warm, cold
Cost → Best cost
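One possible reading of the hot/warm/cold guidance is an S3 lifecycle rule that moves objects to cheaper storage classes as they age. The sketch below maps each temperature to an S3 storage class and builds one rule. The prefix and the 30/90-day thresholds are illustrative assumptions, and object age is used only as a stand-in for how often data is accessed.

```python
# Assumed mapping from data "temperature" to S3 storage class.
STORAGE_CLASS_BY_TEMPERATURE = {
    "hot": "STANDARD",
    "warm": "STANDARD_IA",
    "cold": "GLACIER",
}


def build_lifecycle_rule(prefix, warm_after_days=30, cold_after_days=90):
    """Build one S3 lifecycle rule that tiers objects under a prefix by age."""
    if not 0 < warm_after_days < cold_after_days:
        raise ValueError("expected 0 < warm_after_days < cold_after_days")
    return {
        "ID": f"tiering-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": warm_after_days, "StorageClass": STORAGE_CLASS_BY_TEMPERATURE["warm"]},
            {"Days": cold_after_days, "StorageClass": STORAGE_CLASS_BY_TEMPERATURE["cold"]},
        ],
    }


# Placeholder prefix for illustration only.
rule = build_lifecycle_rule("raw/logs/")

# With boto3 available, the rule could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake",
#       LifecycleConfiguration={"Rules": [rule]},
#   )
print(rule)
```

New objects start in the hot class by default, so the rule only needs to describe the warm and cold transitions.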
- 44.
Thank you!