© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Bob Griffiths, AWS Solutions Architect Manager
John Hitchingham, FINRA Engineering
August 14, 2017
FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
Overview of Big Data Services

What is big data?
When your data sets become so large and complex that you have to start innovating around how to collect, store, process, analyze, and share them.
Big Data services on AWS

Collect: AWS Import/Export, AWS Direct Connect, AWS Snowball, AWS Database Migration Service, Amazon Kinesis
Store: Amazon S3, Amazon Glacier, Amazon RDS, Amazon Aurora, Amazon DynamoDB
Process & Analyze: Amazon EMR, Amazon EC2, Amazon Machine Learning, Amazon Redshift, Amazon Kinesis Analytics, Amazon Athena, AWS Glue, AWS Data Pipeline, Amazon Elasticsearch Service, Amazon QuickSight
Scale as your data and business grows

The volume, variety, and velocity at which data is being generated leave organizations with new questions to answer.

Data Lake
• Central Storage – secure, cost-effective storage in Amazon S3
• Data Ingestion – get your data into S3 quickly and securely: Kinesis Firehose, Direct Connect, Snowball, Database Migration Service
• Catalog & Search – access and search metadata: DynamoDB, Elasticsearch Service
• Access & User Interface – give your users easy and secure access: API Gateway, Directory Service, Cognito
• Processing & Analytics – use predictive and prescriptive analytics to gain better understanding: Athena, QuickSight, EMR, Amazon Redshift
• Protect & Secure – use entitlements to ensure data is secure and users’ identities are verified: IAM, CloudWatch, CloudTrail, KMS

Store and analyze all your data, structured and unstructured, from all of your sources, in one centralized location at low cost.

Quickly ingest data without needing to force it into a predefined schema, enabling ad hoc analysis by applying schemas on read, not on write.

Separating your storage and compute allows you to scale each component as required and attach multiple data processing and analytics services to the same data set.

Scale
• Use only the services you need
• Scale only the services you need

Cost
• Pay only for what you use
• Discounts through Reserved Instances, Spot Instances, and upfront commitments
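The ingestion path above (Kinesis Firehose delivering into S3) can be sketched with boto3. This is a minimal sketch under assumptions: the stream name and the newline-delimited JSON record format are hypothetical choices, not anything prescribed by the deck.

```python
import json


def make_record(payload: dict) -> dict:
    """Encode one event as a Firehose record (newline-delimited JSON,
    so the objects Firehose lands in S3 are line-oriented)."""
    return {"Data": (json.dumps(payload) + "\n").encode("utf-8")}


def send_to_firehose(stream_name: str, payload: dict) -> None:
    """Deliver one record; Firehose buffers records and writes them to S3."""
    import boto3  # imported lazily so make_record stays dependency-free

    firehose = boto3.client("firehose")
    firehose.put_record(DeliveryStreamName=stream_name, Record=make_record(payload))


# Usage (stream name is hypothetical; the delivery stream must already exist):
#   send_to_firehose("trade-events", {"symbol": "ABC", "qty": 100})
```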
Security and scale
• Visibility and control of all APIs and retrievals
• Encryption of all data at each step
• Store an exabyte of data or more in Amazon S3
• Analyze GB to PB using standard tools
• Control egress and ingress points using VPCs

Agility & actionable insights

Big data does not mean just batch:
• Data can be streamed in and processed in real time
• It can be used to respond quickly to requests and actionable events, generating business value

You can mix and match:
• On-premises and cloud
• Custom development and managed services
FINRA’s Managed Data Lake

To solve its market regulation challenges, FINRA’s Technology team has spent the past three years pioneering a managed cloud service to operate big data workloads and perform analytics at large scale. The results of FINRA’s innovations have been significant. To achieve these gains and operate its big data ecosystem, FINRA Technology has built a set of cutting-edge tools, processes, and know-how.

FINRA’s experience
• A 30% operating cost reduction, in both labor and infrastructure
• A 5x increase in operational resiliency
• The business is able to perform analytics at an unprecedented scale and depth
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
Legacy pain points – infrastructure and ops
• Did not scale well as volumes and workloads increased
• Duplication of effort in data management (data lifecycle, retention, versioning, etc.)
• Data sync issues – manual effort to keep data in sync
• Costly system maintenance and upgrades
Legacy pain points – analytics and data science

Business analysts, data scientists, and data analysts had to ask: What data do we have? What format is it in? Where do I get it?

Data engineers and ops then had to get that data for them: not on disk – pull from tape; wait for tapes from offsite; prepare and format. Then: “Oops, I need more data…” Repeat! “I need the data in a different format…” Repeat! And so on.
Summary of cloud drivers
• Fast-growing data volumes year over year
• High cost of pre-building for peak
• Escalating costs of in-house technology infrastructure
• Long time-to-market for finding insights in data
• Appliance platforms facing obsolescence and end-of-life as a result of new big data technologies

Keep spending more on legacy infrastructure, or redirect dollars to the core business of regulation?
FINRA cloud program business objectives
• Discover data easily
• Access (all the) data easily
• Increase the power of analytic tools
• Make data processing resilient
• Make data processing cost effective
Could this be achieved in the cloud?
Cloud architectural principles

Manage Data Consistently
• Define, store, and share our data as an enterprise asset
• All data should be enabled for analytics
• Protect data in a holistic manner (data at rest and data in transit)

Integrate our Portfolio
• Shared solutions for common business processes across the organization
• All “business” data in the cloud will be tracked by a centralized Data Management System so that FINRA can manage the data lifecycle in a productive and cost-effective manner
• All FINRA-developed applications will have service interfaces

Operational Resiliency
• Multi-AZ components and fail-over
• Auto-scaling and load balancing to achieve high availability
• No logon to servers or services for routine operations
• Applications should include automated operations jobs to handle known failure scenarios, recovery, data issues, and notifications
From data puddles to Data Lake

[Diagram: In the FINRA data center, each database (Database 1 … Database n) was a silo with its own storage, query/compute, and catalog. In AWS, a single Amazon S3 storage layer and a shared catalog (herd plus a Hive metastore) serve multiple compute engines – EMR Spark, EMR Presto, EMR HBase, and Lambda – and each layer scales independently.]
Data processing stream on Data Lake

[Diagram: Data files from broker-dealers, exchanges, and third-party providers are ingested, validated, and run through ETL (normalize, enrich, reformat) into the catalog and storage layer. From there, automated surveillance patterns and human analytics (analysts, data scientists, regulatory users) consume the data.]

• Centralized catalog
• 100s of EMR clusters
• As many Lambda functions as needed
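The ingest-then-validate step in this pipeline is commonly wired as an S3-event-triggered Lambda function. A minimal sketch, assuming that pattern: the bucket/key parsing follows the standard S3 event shape, while the pipe-delimited validation rule is purely illustrative.

```python
def parse_s3_event(event: dict):
    """Extract (bucket, key) pairs from a standard S3 PUT event."""
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]


def is_valid_line(line: str, expected_fields: int = 5) -> bool:
    """Illustrative check: a pipe-delimited row with the expected field count."""
    return len(line.rstrip("\n").split("|")) == expected_fields


def handler(event, context):
    """Lambda entry point: validate each newly landed object."""
    import boto3  # lazy import keeps the helpers above testable offline

    s3 = boto3.client("s3")
    for bucket, key in parse_s3_event(event):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        bad = [i for i, line in enumerate(body.splitlines()) if not is_valid_line(line)]
        if bad:
            # Downstream ETL only sees objects that pass validation.
            raise ValueError(f"{key}: {len(bad)} malformed rows")
```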
Power of parallelization

[Diagram: ETL Job 1 … ETL Job n, each with its own input and result.]

Workloads run in parallel for workload isolation to meet SLAs.
Processing scales to meet demand.
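One way to get this isolation is a transient EMR cluster per ETL job, as the deck's "100s of EMR clusters" suggests. A hedged sketch using boto3's run_job_flow: the release label, instance types, sizing, and script location are placeholders, not FINRA's actual configuration.

```python
def job_flow_params(name: str, script_s3_uri: str, workers: int = 10) -> dict:
    """Build run_job_flow arguments for a transient, auto-terminating cluster."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.8.0",          # placeholder, era-appropriate
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m4.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "r4.4xlarge", "InstanceCount": workers},
            ],
            # The cluster tears itself down when its only step finishes,
            # so each job pays for compute only while it runs.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(name: str, script_s3_uri: str, workers: int = 10) -> str:
    import boto3

    emr = boto3.client("emr")
    return emr.run_job_flow(**job_flow_params(name, script_s3_uri, workers))["JobFlowId"]
```

Because storage lives in S3 and the catalog is shared, any number of such clusters can run side by side against the same data.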
[Chart: Daily order volume in billions, 11/1 through 11/29, ranging from roughly 0 to 5 billion per day.]

[Chart: AWS EMR compute nodes on EC2 by hour of day, 2016-10-17 through 2016-10-24, scaling between 0 and roughly 12,000 nodes.]
Catalog for centralized data management
http://finraos.github.io/herd

Unified catalog
• Schemas
• Versions
• Encryption type
• Storage policies

Lineage and Usage
• Track publishers and consumers
• Easily identify jobs and derived data sets

Shared Metastore
• Common definition of tables and partitions
• Use with Spark, Presto, Hive, etc.
• Faster instantiation of clusters
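Registering a new data set version with a catalog like herd happens over a REST API. The sketch below is generic and hypothetical: the /dataObjects endpoint and payload shape are illustrative stand-ins, not herd's actual schema (see finraos.github.io/herd for the real API).

```python
import json
from urllib import request


def registration_payload(namespace, object_name, version, s3_prefix, schema_columns):
    """Illustrative catalog entry: where the data lives and what its schema is."""
    return {
        "namespace": namespace,
        "objectName": object_name,
        "version": version,
        "storage": {"type": "S3", "prefix": s3_prefix},
        "schema": [{"name": n, "type": t} for n, t in schema_columns],
    }


def register(base_url, payload):
    """POST the entry to a (hypothetical) /dataObjects endpoint."""
    req = request.Request(
        base_url.rstrip("/") + "/dataObjects",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.status
```

Once registered, downstream jobs discover the data through the catalog instead of scanning S3 prefixes by hand.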
Catalog and the Data Lake ecosystem

[Diagram: The Data Catalog, backed by a Hive metastore, knows every object/file in S3 object storage. Analysts and data scientists explore it through the Data Catalog UI. Processing jobs use custom handlers to request object info (and optionally DDL), query the data, store results back in S3, and register their output in the catalog. Consumers include validation, ETL, machine surveillance (Lambda, EMR), and interactive analytics (EMR, Redshift Spectrum).]
Analytics – one-stop shop for data

[Diagram: Data analysts and data scientists connect through JDBC clients, with authentication and authorization, to a logical “database”: a shared metastore over Table 1 … Table N.]
Achieve interactive query speed with Data Lake

All queries ran against a table of 2,469,171,608 rows:

• select count(*) from TABLE_1 where trade_date = cast('2016-08-09' as date)
  Output: 1 row. ORC: 4s. TXT/BZ2: 1m56s.
• select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1
  Output: 12 rows. ORC: 3s. TXT/BZ2: 1m51s.
• select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1
  Output: 8,364 rows. ORC: 5s. TXT/BZ2: 2m5s.
• select * from TABLE_1 where col2 = cast('2016-08-10' as date) and col3='I' and col4='CR' and col5 between 100000.0 and 103000.0
  Output: 760 rows. ORC: 10s. TXT/BZ2: 2m3s.
Test config:
• Presto 0.167.0.6t (Teradata) on EMR
• Data on S3 (external tables)
• Cluster size: 60 worker nodes x r4.4xlarge

Key points:
• Use ORC (or Parquet) for performant queries
• Grow the data store with no additional work
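The speedups in the table above come from columnar storage plus partition pruning on the date column. A sketch of landing a day's data as ORC under a Hive-style partition path; the bucket, prefix, and column names are hypothetical, and the ORC conversion assumes pyarrow is available.

```python
def partition_key(prefix: str, trade_date: str, part: int = 0) -> str:
    """Hive-style partition path, so engines like Presto can prune by trade_date
    instead of scanning the whole table."""
    return f"{prefix}/trade_date={trade_date}/part-{part:05d}.orc"


def write_orc(columns: dict, local_path: str) -> None:
    """Convert a dict of column-name -> list-of-values to an ORC file."""
    import pyarrow as pa
    from pyarrow import orc

    orc.write_table(pa.table(columns), local_path)


# Usage (names are illustrative):
#   write_orc({"symbol": ["ABC", "XYZ"], "qty": [100, 250]}, "part-00000.orc")
#   key = partition_key("s3://finra-trades/table_1", "2016-08-09")
```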
[Chart: Main production data store (a bucket on S3) growing from 0 to roughly 4.5 PB.]

• Data footprint grows seamlessly
• All data is accessible for interactive (or batch) query from the moment it is stored
Or scale out with multiple clusters…

[Diagram: Users A and B connect via JDBC clients and a JDBC app to Clusters A, B, and C, all sharing one metastore and the same logical “database” (Table 1 … Table N) – still one copy of the data!]
Data needs for data science and ML
• Allow discovery and exploration
• Bring disparate sources of data together
• Allow users to focus on the problem, not the infrastructure
• Safeguard information with a high degree of security and least-privilege access
A single way to access all of the data

[Diagram: Before the Data Lake, each data scientist reached N separate logical data repositories only through data engineers (Data Engineer 1 … N). With the Data Lake, data scientists work self-service against a single logical repository, accelerating discovery.]
Data science on the Data Lake

[Diagram: A data scientist connects through a JDBC client to an EMR cluster, through a notebook interface to a Spark cluster, or through a notebook or shell to a “DS-in-a-box” environment – all authenticated, all driven by the catalog, and still one copy of the data!]
Universal Data Science Platform (UDSP)
• An environment (EC2) for each data scientist
• Simple provisioning interface
• The right instance type (memory or GPU) for the job
• Access to all the data in the Data Lake
• Shut off when not in use for savings
• Secure (LDAP AuthN/Z + encryption)
UDSP – inventory – not just R
• R 3.2.5, Python (2.7.12 and 3.4.3)
• Packages: R 300+, Python 100+
• Tools for building packages: gcc, gfortran, make, java, maven, ant…
• IDEs: Jupyter, RStudio Server
• Deep learning: CUDA, CuDNN (if GPU present); Theano, Caffe, Torch; TensorFlow
Some business benefits with Data Lake
• Market volume changes are no longer disruptive technology events
• Regulatory analysts can now interactively analyze 1000x more market events (billions of rows vs. millions before)
• Easy reprocessing of data when there are upstream data errors – it used to take weeks to find capacity; now it can be done in a day or days
• Querying order route detail went from tens of minutes to seconds
• Quicker turnaround to provide data for oversight
• Machine learning model development is easier
Want to hear more?
Feel free to contact me:
john.hitchingham@finra.org
Thank you!

Recommended for you

Big Data & Analytics - Innovating at the Speed of Light
Big Data & Analytics - Innovating at the Speed of LightBig Data & Analytics - Innovating at the Speed of Light
Big Data & Analytics - Innovating at the Speed of Light

Amazon Web Services proporciona una amplia gama de servicios que le ayudarán a crear e implementar aplicaciones de análisis de big data de forma rápida y sencilla. AWS ofrece un acceso rápido a recursos de TI económicos y flexibles, algo que permitirá escalar prácticamente cualquier aplicación de big data con rapidez, incluidos almacenamiento de datos, análisis de clics, detección de elementos fraudulentos, motores de recomendación, proceso ETL impulsado por eventos, informática sin servidor y procesamiento del Internet de las cosas. Con AWS no necesita hacer grandes inversiones iniciales de tiempo o dinero para crear y mantener la infraestructura. En su lugar, puede aprovisionar exactamente el tipo y el tamaño adecuado de los recursos que necesita para impulsar sus aplicaciones de análisis de big data. Puede obtener acceso a tantos recursos como necesite, prácticamente al instante, y pagar únicamente por los utilice.

awsamazon web servicesservicios financieros en la nube
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business Outcomes

Using AWS to design and build your data architecture has never been easier to gain insights and uncover new opportunities to scale and grow your business. Join this workshop to learn how you can gain insights at scale with the right big data applications.

cloudawss3
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift

(1) Amazon Redshift is a fully managed data warehousing service in the cloud that makes it simple and cost-effective to analyze large amounts of data across petabytes of structured and semi-structured data. (2) It provides fast query performance by using massively parallel processing and columnar storage techniques. (3) Customers like NTT Docomo, Nasdaq, and Amazon have been able to analyze petabytes of data faster and at a lower cost using Amazon Redshift compared to their previous on-premises solutions.


More Related Content

What's hot

Building Big Data Applications with Serverless Architectures - June 2017 AWS...
Amazon Web Services

BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
Amazon Web Services

Big Data Architectural Patterns and Best Practices on AWS
Amazon Web Services

Data Warehousing in the Era of Big Data: Intro to Amazon Redshift
Amazon Web Services

AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
Amazon Web Services

Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Amazon Web Services

(BDT313) Amazon DynamoDB For Big Data
Amazon Web Services

Building a Data Lake on AWS
Amazon Web Services

(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:I...
Amazon Web Services

(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
Amazon Web Services

DynamodbDB Deep Dive
Amazon Web Services

Module 2 - Datalake
Lam Le

ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
Amazon Web Services

BDA311 Introduction to AWS Glue
Amazon Web Services

Supercharging the Value of Your Data with Amazon S3
Amazon Web Services

(BDT317) Building A Data Lake On AWS
Amazon Web Services

Introduction to Amazon Athena
Amazon Web Services

Amazon Kinesis
Amazon Web Services

AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
Amazon Web Services

Big Data Architectural Patterns and Best Practices on AWS
Amazon Web Services


Similar to FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud

클라우드에서의 데이터 웨어하우징 & 비즈니스 인텔리전스
Amazon Web Services Korea

AWS Big Data Solution Days
Amazon Web Services

Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Amazon Web Services LATAM

Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Amazon Web Services

AWS Big Data Platform
Amazon Web Services

State of the Union: Database & Analytics
Amazon Web Services

Finding Meaning in the Noise: Understanding Big Data with AWS Analytics
Amazon Web Services

Big Data & Analytics - Innovating at the Speed of Light
Amazon Web Services LATAM

Modern Data Architectures for Business Outcomes
Amazon Web Services

Getting Started with Amazon Redshift
Amazon Web Services

AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)
Amazon Web Services

Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Amazon Web Services

Cloud Data Integration Best Practices
Darren Cunningham

Welcome & AWS Big Data Solution Overview
Amazon Web Services

AWS reinvent 2019 recap - Riyadh - Database and Analytics - Assif Abbasi
AWS Riyadh User Group

AWS Webcast - Informatica - Big Data Solutions Showcase
Amazon Web Services

Modern Data Architectures for Business Outcomes
Amazon Web Services

Database and Analytics on the AWS Cloud
Amazon Web Services

Track 6 Session 1_進入 AI 領域的第一步驟_資料平台的建置.pptx
Amazon Web Services

Big Data@Scale
Amazon Web Services


More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Esegui pod serverless con Amazon EKS e AWS Fargate
Costruire Applicazioni Moderne con AWS
Come spendere fino al 90% in meno con i container e le istanze spot
Open banking as a service
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Computer Vision con AWS
Database Oracle e VMware Cloud on AWS i miti da sfatare
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
API moderne real-time per applicazioni mobili e web
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Tools for building your MVP on AWS
How to Build a Winning Pitch Deck
Building a web application without servers
Fundraising Essentials
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Introduzione a Amazon Elastic Container Service


Details of description part II: Describing images in practice - Tech Forum 2024
 
The Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive ComputingThe Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive Computing
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
 
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
 
UiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs ConferenceUiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs Conference
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
 
Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
 
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
 
Quality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of TimeQuality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of Time
 

FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Bob Griffiths, AWS Solutions Architect Manager John Hitchingham, FINRA Engineering August 14, 2017 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
  • 2. Overview of Big Data Services
  • 3. What is big data? When your data sets become so large and complex you have to start innovating around how to collect, store, process, analyze, and share them.
  • 4. Collect AWS Import/Export AWS Direct Connect Amazon Kinesis Amazon EMR Amazon EC2 Process & Analyze Amazon Glacier Amazon S3 Store Amazon Machine Learning Amazon Redshift Amazon DynamoDB Amazon Kinesis Analytics Amazon QuickSight AWS Database Migration Service AWS Data Pipeline Amazon RDS, Amazon Aurora Big Data services on AWS Amazon Elasticsearch Service Amazon Athena AWS Glue AWS Snowball
  • 5. Scale as your data and business grow. The volume, variety, and velocity at which data is being generated leave organizations with new questions to answer.
  • 6. Data Lake Central Storage Secure, cost-effective storage in Amazon S3 Data Ingestion Get your data into S3 quickly and securely Kinesis Firehose, Direct Connect, Snowball, Database Migration Service Catalog & Search Access and search metadata Access & User Interface Give your users easy and secure access Processing & Analytics Use of predictive and prescriptive analytics to gain better understanding DynamoDB Elasticsearch Service API Gateway Directory Service Cognito Athena, QuickSight, EMR, Amazon Redshift IAM, CloudWatch, CloudTrail, KMS Protect & Secure Use entitlements to ensure data is secure and users’ identities are verified
  • 7. Store and analyze all your data—structured and unstructured—from all of your sources, in one centralized location at low cost. Quickly ingest data without needing to force it into a predefined schema, enabling ad-hoc analysis by applying schemas on read, not write. Separating your storage and compute allows you to scale each component as required and attach multiple data processing and analytics services to the same data set. Scale
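Schema-on-read, as described above, means raw data lands in the lake untyped and each consumer applies the structure it needs at query time. A minimal stdlib sketch (field names and values are illustrative, not FINRA's actual formats):

```python
import json
from io import StringIO

# Raw events land in the lake as-is -- no schema enforced at write time.
raw_events = StringIO(
    '{"symbol": "ABC", "qty": "100", "price": "9.50", "extra": "ignored"}\n'
    '{"symbol": "XYZ", "qty": "250", "price": "12.10"}\n'
)

# Schema-on-read: the reader decides which fields matter and how to type them.
read_schema = {"symbol": str, "qty": int, "price": float}

def apply_schema(line, schema):
    record = json.loads(line)
    return {field: cast(record[field]) for field, cast in schema.items()}

trades = [apply_schema(line, read_schema) for line in raw_events]
total_qty = sum(t["qty"] for t in trades)
print(total_qty)  # 350
```

A different consumer could read the same raw bytes with a different schema (e.g. keeping `extra`), which is exactly the ad-hoc flexibility the slide contrasts with schema-on-write.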
  • 8. Cost • Use only the services you need • Scale only the services you need • Pay only for what you use • Discounts through Reserved Instances, Spot pricing, and upfront commitments
  • 9. Visibility/control of all APIs and retrievals Encryption of all data at each step Store an exabyte of data or more in Amazon S3 Analyze GB to PB using standard tools Control egress and ingress points using VPCs Security and scale
  • 10. Big data does not mean just batch • Can be streamed in • Processed in real time • Can be used to respond quickly to requests and actionable events, generate business value You can mix and match • On-premises and cloud • Custom development and managed services Agility & actionable insights
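The "not just batch" point is that streamed events can be acted on as they arrive. A toy stdlib sketch of that pattern (the event shape and the 3% threshold are illustrative, not from the deck):

```python
# Stream-style processing: each event is handled on arrival and can trigger
# an action immediately, rather than waiting for a nightly batch run.
def event_stream():
    for price in [100.0, 100.5, 99.8, 104.2, 100.1]:
        yield {"symbol": "ABC", "price": price}

alerts = []
last_price = None
for event in event_stream():
    if last_price is not None:
        move = abs(event["price"] - last_price) / last_price
        if move > 0.03:  # react immediately to a >3% move
            alerts.append(event)
    last_price = event["price"]

print(len(alerts))  # 2
```

In a real deployment the generator would be replaced by a consumer of a streaming service such as Amazon Kinesis; the per-event loop body is the part that stays the same.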
  • 12. FINRA’s experience: to solve its market regulation challenges, FINRA’s Technology team has spent the past three years pioneering a managed cloud service to operate big data workloads and perform analytics at large scale. The results have been significant: a 30% operating cost reduction in both labor and infrastructure; a 5x increase in operational resiliency; and the ability to perform analytics at unprecedented scale and depth. To achieve these gains and operate its big data ecosystem, FINRA Technology has built a set of cutting-edge tools, processes, and know-how.
  • 14. Legacy pain points – infrastructure and ops: did not scale well as volumes and workloads increased; duplication of effort in data management (data lifecycle, retention, versioning, etc.); data sync issues – manual effort to keep data in sync; costly system maintenance and upgrades
  • 15. Legacy pain points – analytics and data science. Business analysts, data scientists, and data analysts ask: What data do we have? What format is it in? Where do I get it? Data engineers and ops respond: get this data for them… it’s not on disk – pull from tape… wait for tapes from offsite… prepare and format. Oops, I need more data… repeat! I need the data in a different format… repeat! And so on.
  • 16. Summary of cloud drivers • Fast-growing data volumes YoY • High cost of pre-building for peak • Escalating costs of in-house technology infrastructure • Long time-to-market for finding insights in data • Appliance platforms facing obsolescence and end-of-life as a result of new big data technologies Keep spending more on legacy infrastructure, or redirect dollars to the core business of regulation?
  • 17. FINRA cloud program business objectives • Discover data easily • Access (all the) data easily • Increase the power of analytic tools • Make data processing resilient • Make data processing cost effective Could this be achieved in the cloud?
  • 18. Cloud architectural principles Manage Data Consistently • Define, store, and share our data as an enterprise asset • All data should be enabled for analytics • Protect data in a holistic manner (data at rest and data in transit) Integrate our Portfolio • Shared solutions for common business processes across the organization • All "business" data in the cloud will be tracked by a centralized Data Management System so that FINRA can manage the data lifecycle in a productive and cost-effective manner • All FINRA-developed applications will have service interfaces Operational Resiliency • Multi-AZ components and fail-over • Auto-scaling and load balancing to achieve high availability • No logon to servers or services for routine operations • Applications should include automated operations jobs to handle known failure scenarios, recovery, data issues, and notifications
  • 19. From data puddles to Data Lake: in the FINRA data center, each database (Database1, Database2 … Databasen) was a silo with its own storage, query/compute, and catalog; in AWS, a single storage layer (Amazon S3), shared query/compute (EMR Spark, EMR Presto, EMR HBase, Lambda), and a shared catalog (herd, Hive metastore) scale together
  • 20. Data processing stream on Data Lake Catalog & Storage ETL Normalize, Enrich, Reformat Human Analytics Validation Ingest Broker Dealers Exchanges 3rd Party Providers Data Files Analyst Data Scientist Regulatory User • Centralized Catalog • 100s of EMR clusters • As many Lambda functions as needed Patterns Automated Surveillance
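The ingest → validation → ETL flow above can be sketched with a few stdlib functions. This is a hypothetical illustration; the field names and validation rules are invented, not FINRA's actual feed formats:

```python
# Ingest -> validate -> normalize/enrich, with rejects separated out so the
# downstream catalog only ever registers clean data.
REQUIRED = {"order_id", "symbol", "qty"}

def validate(record):
    # Minimal completeness check standing in for real validation rules.
    return REQUIRED.issubset(record)

def normalize(record):
    # Enrich/reformat step: canonicalize the symbol, type the quantity.
    return {**record, "symbol": record["symbol"].upper(), "qty": int(record["qty"])}

incoming = [
    {"order_id": "1", "symbol": "abc", "qty": "100"},
    {"order_id": "2", "symbol": "xyz"},              # missing qty: rejected
    {"order_id": "3", "symbol": "def", "qty": "50"},
]

valid, rejected = [], []
for rec in incoming:
    (valid if validate(rec) else rejected).append(rec)

normalized = [normalize(r) for r in valid]  # what gets registered downstream
print(len(normalized), len(rejected))  # 2 1
```

In the slide's architecture each stage would run as a Lambda function or EMR step, with the centralized catalog tracking every registered output.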
  • 21. Power of parallelization ETL Job1 Input Result ETL Job2 Input Result ETL Jobn Input Result Workloads run in parallel for workload isolation to meet SLAs
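The workload-isolation idea above can be illustrated with stdlib concurrency: each job owns its input and produces its own result, so one slow job cannot block the others. (FINRA achieves this with separate EMR clusters per workload; the executor here is only an analogy.)

```python
from concurrent.futures import ThreadPoolExecutor

def etl_job(batch):
    # Stand-in for a real transform; each invocation is fully independent.
    return sum(batch)

batches = [[1, 2, 3], [10, 20], [100]]  # one isolated input per job
with ThreadPoolExecutor() as pool:
    results = list(pool.map(etl_job, batches))
print(results)  # [6, 30, 100]
```

Because there is no shared mutable state between jobs, each workload can be sized and scheduled independently to meet its own SLA.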
  • 22. Processing scales to meet demand [charts: Daily Order Volume in billions, 11/1–11/29; EMR compute nodes on EC2 by hour of day, 2016-10-17 through 2016-10-24]
  • 23. Catalog for centralized data management http://finraos.github.io/herd Unified catalog • Schemas • Versions • Encryption type • Storage policies Lineage and Usage • Track publishers and consumers • Easily identify jobs and derived data sets Shared Metastore • Common definition of tables and partitions • Use with Spark, Presto, Hive, etc. • Faster instantiation of clusters
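A toy, in-memory sketch of what the unified catalog tracks per the slide: schema, version, storage location, and lineage. This is NOT herd's actual API; every name here is illustrative.

```python
# Each dataset name maps to a list of immutable versions; lineage records
# which datasets an output was derived from, enabling the usage tracking
# the slide describes.
catalog = {}

def register(name, schema, s3_prefix, parents=()):
    versions = catalog.setdefault(name, [])
    versions.append({
        "version": len(versions),
        "schema": schema,
        "location": s3_prefix,
        "lineage": list(parents),
    })
    return versions[-1]["version"]

register("orders_raw", {"order_id": "string", "qty": "int"},
         "s3://lake/orders/raw/")
v = register("orders_enriched",
             {"order_id": "string", "qty": "int", "venue": "string"},
             "s3://lake/orders/enriched/", parents=["orders_raw"])
print(v, catalog["orders_enriched"][0]["lineage"])  # 0 ['orders_raw']
```

The real system additionally shares a Hive metastore so that Spark, Presto, and Hive clusters all resolve the same table and partition definitions.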
  • 24. Catalog and the Data Lake ecosystem: analysts and data scientists explore via the Data Catalog UI; the Data Catalog (backed by a Hive Metastore) knows every object/file in object storage (S3); processing (validation, ETL, machine surveillance on Lambda and EMR) requests object info (optionally DDL) through custom handlers and registers its output; interactive analytics (EMR, Redshift Spectrum) gets DDL, queries the data, and stores results back
  • 25. Analytics – one-stop shop for data Data Analyst Data Scientist JDBC Client JDBC Client Table 1 Table 2 AuthN AuthZ Metastore Table N Logical “Database”
  • 26. Achieve interactive query speed with Data Lake
Test config: Presto 0.167.0.6t (Teradata) on EMR; data on S3 (external tables); cluster size: 60 worker nodes x r4.4xlarge. Key point: use ORC (or Parquet) for performant queries.
Query | Table size (rows) | Output size (rows) | ORC | TXT/BZ2
select count(*) from TABLE_1 where trade_date = cast('2016-08-09' as date) | 2,469,171,608 | 1 | 4s | 1m56s
select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1 | 2,469,171,608 | 12 | 3s | 1m51s
select col1, count(*) from TABLE_1 where col2 = cast('2016-08-09' as date) group by col1 order by col1 | 2,469,171,608 | 8,364 | 5s | 2m5s
select * from TABLE_1 where col2 = cast('2016-08-10' as date) and col3='I' and col4='CR' and col5 between 100000.0 and 103000.0 | 2,469,171,608 | 760 | 10s | 2m3s
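The reason ORC (or Parquet) is so much faster for these queries is column pruning: an aggregate over one column only has to read that column's bytes, whereas a row format reads every field of every row. A stdlib sketch contrasting the two layouts on toy data (column names are illustrative):

```python
# The same three records in row-oriented and column-oriented layouts.
rows = [
    {"trade_date": "2016-08-09", "side": "B", "px": 10.0},
    {"trade_date": "2016-08-09", "side": "S", "px": 10.5},
    {"trade_date": "2016-08-10", "side": "B", "px": 11.0},
]

# Columnar layout: one list per column, as ORC/Parquet store data on disk.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# The row scan touches every field of every row...
count_rowwise = sum(1 for r in rows if r["trade_date"] == "2016-08-09")
# ...while the columnar scan touches only the one column being filtered.
count_columnar = sum(1 for d in columns["trade_date"] if d == "2016-08-09")

assert count_rowwise == count_columnar == 2
```

On billions of rows, skipping unreferenced columns (plus compression and min/max statistics per stripe) is what turns the two-minute text scans in the table into single-digit seconds.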
  • 27. Grow the data store with no work [chart: Main Production Data Store (bucket on S3), size in PB, growing past 4 PB] • Data footprint grows seamlessly • All data accessible for interactive (or batch) query from the moment it is stored
  • 28. Or scale out with multiple clusters… User A (JDBC client), User B (JDBC client), and a JDBC app each hit their own cluster (Cluster A, B, C) through AuthN and a shared metastore over the logical “database” (Table 1, Table 2 … Table N) – still one copy of the data!
  • 29. Data needs for data science and ML • Allow discovery & exploration • Bring disparate sources of data together • Allow users to focus on the problem, not the infrastructure • Safeguard information with a high degree of security and least-privilege access
  • 30. A single way to access all of the data: before the Data Lake, each data scientist worked through data engineers against separate logical data repositories (1, 2 … N); with the Data Lake, a single logical data repository accelerates discovery through self-service
  • 31. Data science on the Data Lake Data Scientist JDBC Client Logical ‘Database’ EMR Cluster Still one copy of data! Spark Cluster DS-in-a-box AuthN Data Scientist Notebook Interface Data Scientist Catalog Notebook or Shell
  • 32. Universal Data Science Platform (UDSP) • Environment (EC2) for each data scientist • Simple provisioning interface • Right instance (memory or GPU) for the job • Access to all the data in the Data Lake • Shut off when not in use for savings • Secure (LDAP AuthN/Z + encryption)
  • 33. UDSP – Inventory – not just R • R 3.2.5, Python (2.7.12 and 3.4.3) • Packages • R: 300+, Python: 100+ • Tools for building packages • gcc, gfortran, make, java, maven, ant… • IDEs • Jupyter, RStudio Server • Deep learning • CUDA, CuDNN (if GPU present) • Theano, Caffe, Torch • TensorFlow
  • 34. Some business benefits with Data Lake  Market volume changes are no longer disruptive technology events  Regulatory analysts can now interactively analyze 1000x more market events (billions of rows vs. millions before)  Easily reprocess data if there are upstream data errors – finding capacity used to take weeks; now it can be done in a day or days  Querying order route detail went from tens of minutes to seconds  Quicker turnaround to provide data for oversight  Machine learning model development is easier
  • 35. Want to hear more? Feel free to contact me: john.hitchingham@finra.org