SlideShare a Scribd company logo
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ian Robinson, Specialist SA, Data and Analytics, EMEA
20 September 2017
Introduction to AWS Glue
AWS Glue automates
the undifferentiated heavy lifting of ETL
Automatically discover and categorize your data making it immediately searchable
and queryable across data sources
Generate code to clean, enrich, and reliably move data between various data
sources; you can also use their favorite tools to build ETL jobs
Run your jobs on a serverless, fully managed, scale-out environment. No compute
resources to provision or manage.
Discover
Develop
Deploy
AWS Glue: Components
Data Catalog
 Hive Metastore compatible with enhanced functionality
 Crawlers automatically extracts metadata and creates tables
 Integrated with Amazon Athena, Amazon Redshift Spectrum
Job Execution
 Run jobs on a serverless Spark platform
 Provides flexible scheduling
 Handles dependency resolution, monitoring and alerting
Job Authoring
 Auto-generates ETL code
 Build on open frameworks – Python and Spark
 Developer-centric – editing, debugging, sharing
Common Use Cases for AWS Glue

Recommended for you

Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech TalksTackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks

Learning Objectives: - Discover dark data that you are currently not analyzing. - Analyze dark data without moving it into your data warehouse. - Visualize the results of your dark data analytics.

data warehousingdata lakedark data
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나

발표영상 다시보기: https://youtu.be/eQjkwhyOOmI 대규모 데이터 레이크 구성 및 관리는 복잡하고 시간이 많이 걸리는 작업입니다. AWS Lake Formation은 수일만에 안전한 데이터 레이크를 구성할 수 있는 완전 관리 서비스입니다. 본 세션에서는 데이터 수집, 분류, 정리, 변환 및 보안을 위해 AWS Lake Formation을 통해 Amazon S3, EMR, Redshift 및 Athena와 같은 분석 도구를 쉽게 구성하는 방법을 알아봅니다. (2019년 11월 서울 리전 출시)

channyunboxinglake formation
Amazon EMR Masterclass
Amazon EMR MasterclassAmazon EMR Masterclass
Amazon EMR Masterclass

Amazon EMR enables fast processing of large structured or unstructured datasets, and in this presentation we'll show you how to setup an Amazon EMR job flow to analyse application logs, and perform Hive queries against it. We also review best practices around data file organisation on Amazon Simple Storage Service (S3), how clusters can be started from the AWS web console and command line, and how to monitor the status of a Map/Reduce job. Finally we take a look at Hadoop ecosystem tools you can use with Amazon EMR and the additional features of the service. See a recording of the webinar based on this presentation on YouTube here: Check out the rest of the Masterclass webinars for 2015 here: http://aws.amazon.com/campaigns/emea/masterclass/ See the Journey Through the Cloud webinar series here: http://aws.amazon.com/campaigns/emea/journey/

cloud computingcloudemeawebinars
Understand your data assets
Instantly query your data lake on Amazon S3
ETL data into your data warehouse
Build event-driven ETL pipelines

Recommended for you

Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWS

Today’s organisations require a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. Data Lake is a new and increasingly popular way to store all of your data, structured and unstructured, in one, centralised repository. Since data can be stored as-is, there is no need to convert it to a predefined schema and you no longer need to know what questions you want to ask of your data beforehand. In this webinar, you will discover how AWS gives you fast access to flexible and low-cost IT resources, so you can rapidly scale and build your data lake that can power any kind of analytics such as data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and internet-of-things processing regardless of volume, velocity and variety of data. Learning Objectives: • Discover how you can rapidly scale and build your data lake with AWS. • Explore the key pillars behind a successful data lake implementation. • Learn how to use the Amazon Simple Storage Service (S3) as the basis for your data lake. • Learn about the new AWS services recently launched, Amazon Athena and Amazon Redshift Spectrum, that help customers directly query that data lake.

database-webinars-2017
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS

The document discusses building a data lake on AWS. It describes various AWS services that can be used to ingest, store, transform, analyze and visualize data in the data lake. These services include Amazon S3 for storage, AWS Glue for ETL/data cataloging, AWS Lake Formation for governance, Amazon Athena/EMR for analytics and Amazon QuickSight for visualization. The document also covers data movement options from on-premises to the data lake and real-time streaming of data using services like Kinesis. Machine learning workloads can leverage Amazon SageMaker for training and deployment.

amazon web servicesawsaws cloud
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue

Organizations need to gain insight and knowledge from a growing number of Internet of Things (IoT), APIs, clickstreams, unstructured and log data sources. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. In this session, we introduce key ETL features of AWS Glue, cover common use cases ranging from scheduled nightly data warehouse loads to near real-time, event-driven ETL flows for your data lake. We discuss how to build scalable, efficient, and serverless ETL pipelines using AWS Glue. Additionally, Merck will share how they built an end-to-end ETL pipeline for their application release management system, and launched it in production in less than a week using AWS Glue.

aws re:invent 2017amazonanalytics & big data
Demo
Introduction to AWS Glue
AWS Glue Data Catalog
Glue data catalog
Manage table metadata through a Hive metastore API or Hive SQL.
Supported by tools like Hive, Presto, Spark etc.
We added a few extensions:
 Search over metadata for data discovery
 Connection info – JDBC URLs, credentials
 Classification for identifying and parsing files
 Versioning of table metadata as schemas evolve and other metadata are updated
Populate using Hive DDL, bulk import, or automatically through Crawlers.

Recommended for you

Getting Started with AWS Database Migration Service
Getting Started with AWS Database Migration ServiceGetting Started with AWS Database Migration Service
Getting Started with AWS Database Migration Service

Learn about the new AWS Database Migration Service, which helps you migrate databases with minimal downtime from on-premises and Amazon EC2 environments to Amazon RDS, Amazon Redshift, Amazon Aurora and EC2 databases. We discuss homogeneous (e.g. Oracle-to-Oracle, PostgreSQL-to-PostgreSQL, etc.) and heterogeneous (e.g. Oracle to Aurora, SQL Server to MariaDB) database migrations. We also talk about the new AWS Schema Conversion Tool that saves you development time when migrating your Oracle and SQL Server database schemas, including PL/SQL and T-SQL procedural code, to their MySQL, MariaDB and Aurora equivalents.

nick hertl2016 aws summit chicagoamazon web services
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS

This document discusses building a data lake on AWS. It describes using Amazon S3 for storage, Amazon Kinesis for streaming data, and AWS Lambda to populate metadata indexes in DynamoDB and search indexes. It covers using IAM for access control, AWS STS for temporary credentials, and API Gateway and Elastic Beanstalk for interfaces. The data lake provides a foundation for storing and analyzing structured, semi-structured, and unstructured data at scale from various sources in a cost-effective and secure manner.

aws cloudaws-reinventbig data & analytics
Introduction to Amazon Aurora
Introduction to Amazon AuroraIntroduction to Amazon Aurora
Introduction to Amazon Aurora

This document provides an introduction to Amazon Aurora, AWS's managed relational database service. It discusses how Aurora was built to provide the speed and availability of commercial databases at the simplicity and cost-effectiveness of open source databases. The document outlines key Aurora features like automatic scaling, continuous backups, replication across Availability Zones, and integration with other AWS services. Customer case studies show how Aurora provides better performance at lower costs than alternative database options. The document also covers migration options and how Aurora offers a simpler, more cost-effective database solution than on-premises or self-managed options.

cloud computingamazon aurorasteve-abraham
Data Catalog: Crawlers
Automatically discover new data, extracts schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless – only pay when crawler runs
Crawlers automatically build your Data Catalog and keep it in sync
AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single
categorized list that is searchable
Data Catalog: Table details
Table schema
Table properties
Data statistics
Nested fields
Data Catalog: Version control
List of table versionsCompare schema versions

Recommended for you

Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...

This document discusses big data analytics architectural patterns and best practices. It covers collecting and storing data from various sources, processing and analyzing data using tools like Amazon Redshift, Amazon Athena and Amazon EMR, and selecting the appropriate tools based on factors like data structure, access patterns, and data temperature. It also discusses stream/real-time analytics tools and machine learning approaches.

amazonawsreinvent2018analytics
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies

Tech talk on what Azure Databricks is, why you should learn it and how to get started. We'll use PySpark and talk about some real live examples from the trenches, including the pitfalls of leaving your clusters running accidentally and receiving a huge bill ;) After this you will hopefully switch to Spark-as-a-service and get rid of your HDInsight/Hadoop clusters. This is part 1 of an 8 part Data Science for Dummies series: Databricks for dummies Titanic survival prediction with Databricks + Python + Spark ML Titanic with Azure Machine Learning Studio Titanic with Databricks + Azure Machine Learning Service Titanic with Databricks + MLS + AutoML Titanic with Databricks + MLFlow Titanic with DataRobot Deployment, DevOps/MLops and Operationalization

databrickssparkapache-spark
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline

Many organizations have adopted or are in the process of adopting DevOps methodologies in their quest to accelerate the delivery of software capabilities, features, and functionalities to support their organizational objectives. By applying the same practices, DataOps aims to provide the same level of agility in delivering data and information to the organization. AWS Lake Formation, in coordination with other AWS Services, enables DevOps methodologies to be realized through the Data Supply Chain Pipeline.

dc-summit-2019
Data Catalog: Detecting partitions
file 1 file N… file 1 file N…
date=10 date=15…
month=Nov
S3 bucket hierarchy Table definition
Estimate schema similarity among files at each level to
handle semi-structured logs, schema evolution…
sim=.99 sim=.95
sim=.93
month
date
col 1
col 2
str
str
int
float
Column Type
Data Catalog: Automatic partition detection
Automatically register available partitions
Table
partitions
Job Authoring in AWS Glue
 Python code generated by AWS Glue
 Connect a notebook or IDE to AWS Glue
 Existing code brought into AWS Glue
Job Authoring Choices

Recommended for you

AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive

This document provides an overview of AWS Lake Formation and related services for building a secure data lake. It discusses how Lake Formation provides a centralized management layer for data ingestion, cleaning, security and access. It also describes how Lake Formation integrates with services like AWS Glue, Amazon S3 and ML transforms to simplify and automate many data lake tasks. Finally, it provides an example workflow for using Lake Formation to deduplicate data from various sources and grant secure access for analysis.

awsaws lake formationdata lake
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017

As data volumes grow and customers store more data on AWS, they often have valuable data that is not easily discoverable and available for analytics. The AWS Glue Data Catalog provides a central view of your data lake, making data readily available for analytics. We introduce key features of the AWS Glue Data Catalog and its use cases. Learn how crawlers can automatically discover your data, extract relevant metadata, and add it as table definitions to the AWS Glue Data Catalog. We will also explore the integration between AWS Glue Data Catalog and Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

amazonanalytics & big dataawsreinvent2017
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020

Jim Boriotti presents an overview and demo of Azure Synapse Analytics, an integrated data platform for business intelligence, artificial intelligence, and continuous intelligence. Azure Synapse Analytics includes Synapse SQL for querying with T-SQL, Synapse Spark for notebooks in Python, Scala, and .NET, and Synapse Pipelines for data workflows. The demo shows how Azure Synapse Analytics provides a unified environment for all data tasks through the Synapse Studio interface.

azure cloud analytics synapse data warehousing
1. Customize the mappings
2. Glue generates transformation graph and Python code
3. Connect your notebook to development endpoints to customize your code
Job authoring: Automatic code generation
 Human-readable, editable, and portable PySpark code
 Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data
 Customizable: Use native PySpark, import custom libraries, and/or leverage Glue’s libraries
 Collaborative: share code snippets via GitHub, reuse code across jobs
Job authoring: ETL code
Job Authoring: Glue Dynamic Frames
Dynamic frame schema
A C D [ ]
X Y
B1 B2
Like Spark’s Data Frames, but better for:
• Cleaning and (re)-structuring semi-structured
data sets, e.g. JSON, Avro, Apache logs ...
No upfront schema needed:
• Infers schema on-the-fly, enabling transformations
in a single pass
Easy to handle the unexpected:
• Tracks new fields, and inconsistent changing data
types with choices, e.g. integer or string
• Automatically mark and separate error records
Job Authoring: Glue transforms
ResolveChoice() B B B
project
B
cast
B
separate into cols
B B
Apply Mapping() A
X Y
A X Y
Adaptive and flexible
C

Recommended for you

AWS Service Catalog
AWS Service CatalogAWS Service Catalog
AWS Service Catalog

Running an IT department in a large organization is challenging. You need to provide users with access to the latest technology, while maintaining corporate standards and providing oversight to avoid runaway spending. In this session, you’ll hear how Lockheed Martin has used AWS Service Catalog to ensure compliance across the organization. You will also learn how 2nd Watch, an APN Premier Consulting Partner, leverages AWS Service Catalog to manage resources for customers and are now able to deploy quickly and standardize their workload management. We’ll also demo advanced functionality and how you can get started.

abhishek lalgarland m. garcia2015nysummit
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)

So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.

data lakehousedata meshdata fabric
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS Cloud

Today organizations find themselves in a data rich world with a growing need for increased agility and accessibility of all this data for analysis and deriving keen insights to drive strategic decisions. Creating a data lake helps you to manage all the disparate sources of data you are collecting (in its original format) and extract value. In this session, learn how to architect and implement a data lake in the AWS Cloud. Learn about best practices as we walk through architectural blueprints.

awsamazon web servicescloud computing
Job authoring: Relationalize() transform
Semi-structured schema Relational schema
FKA B B C.X C.Y
PK ValueOffset
A C D [ ]
X Y
B B
• Transforms and adds new columns, types, and tables on-the-fly
• Tracks keys and foreign keys across runs
• SQL on the relational schema is orders of magnitude faster than JSON processing
Job authoring: Glue transformations
 Prebuilt transformation: Click and
add to your job with simple
configuration
 Spigot writes sample data from
DynamicFrame to S3 in JSON format
 Expanding… more transformations
to come
Job authoring: Write your own scripts
Import custom libraries required by your code
Convert to a Spark Data Frame
for complex SQL-based ETL
Convert back to Glue Dynamic Frame
for semi-structured processing and
AWS Glue connectors
Job authoring: Developer endpoints
 Environment to iteratively develop and test ETL code.
 Connect your IDE or notebook (e.g. Zeppelin) to a Glue development endpoint.
 When you are satisfied with the results you can create an ETL job that runs your code.
Glue Spark environment
Remote
interpreter
Interpreter
server

Recommended for you

AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)

The document introduces Amazon Athena and AWS Glue. It summarizes that Amazon Athena allows users to interactively query data stored in Amazon S3 using standard SQL. It also summarizes that AWS Glue is a fully managed ETL service that automates data extraction, transformation and loading processes. Glue discovers how data is organized, crawls data sources to infer schemas, automatically generates ETL code and manages execution of data workflows.

awscloud2017
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS

This document discusses logging scenarios using DynamoDB and Elastic MapReduce. It covers collecting log data in real-time using tools like Fluentd and storing it in DynamoDB. It then describes using EMR to perform ETL processes on the data, extracting from DynamoDB, transforming the data across EC2 instances, and loading to S3 or DynamoDB. Finally, it discusses analyzing the data using Redshift for queries or CloudSearch for search capabilities.

awscloud computing
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...

Learning Objectives: - Understand how to build a serverless big data solution quickly and easily - Learn how to discover and prepare all your data for analytics - Learn how to query and visualize analytics on all your data to create actionable insights

serverlessbig dataanalytics
Job Authoring: Leveraging the community
No need to start from scratch.
Use Glue samples stored in Github to share, reuse,
contribute: https://github.com/awslabs/aws-glue-samples
• Migration scripts to import existing Hive Metastore data
into AWS Glue Data Catalog
• Examples of how to use Dynamic Frames and
Relationalize() transform
• Examples of how to use arbitrary PySpark code with
Glue’s Python ETL library
Download Glue’s Python ETL library to start developing code
in your IDE: https://github.com/awslabs/aws-glue-libs
Orchestration and Job
Scheduling
Job execution: Scheduling and monitoring
Compose jobs globally with event-
based dependencies
 Easy to reuse and leverage work across
organization boundaries
Multiple triggering mechanisms
 Schedule-based: e.g., time of day
 Event-based: e.g., job completion
 On-demand: e.g., AWS Lambda
 More coming soon: Data Catalog based
events, S3 notifications and Amazon
CloudWatch events
Logs and alerts are available in
Amazon CloudWatch
Marketing: Ad-spend by
customer segment
Event Based
Lambda Trigger
Sales: Revenue by
customer segment
Schedule
Data
based
Central: ROI by
customer
segment
Weekly
sales
Data
based
Job execution: Job bookmarks
For example, you get new files everyday
in your S3 bucket. By default, AWS Glue
keeps track of which files have been
successfully processed by the job to
prevent data duplication.
Option Behavior
Enable Pick up from where you left off
Disable
Ignore and process the entire dataset
every time
Pause
Temporarily disable advancing the
bookmark
Marketing: Ad-spend by customer segment
Data objects
Glue keeps track of data that has already
been processed by a previous run of an
ETL job. This persisted state information is
called a bookmark.

Recommended for you

Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...

From theory to implementation - follow the steps of implementing an end-to-end analytics solution illustrated with some best practices and examples in Azure Data Lake. During this full training day we will share the architecture patterns, tooling, learnings and tips and tricks for building such services on Azure Data Lake. We take you through some anti-patterns and best practices on data loading and organization, give you hands-on time and the ability to develop some of your own U-SQL scripts to process your data and discuss the pros and cons of files versus tables. This were the slides presented at the SQLBits 2018 Training Day on Feb 21, 2018.

azure data lakeu-sqlbig data
Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3

The document provides information about querying and analyzing data in Amazon S3 using various AWS services. It discusses: 1. Using Amazon EMR to process raw web logs delivered to S3 by Kinesis Firehose using Apache Spark. 2. Loading the processed data into Amazon Redshift for interactive querying using SQL. 3. Performing ad-hoc analysis on the data in S3 using serverless Athena without having to set up any infrastructure.

cloud computingaws pop-up loft san franciscoaws loft architecture week
Easy Analytics with AWS - AWS Summit Bahrain 2017
Easy Analytics with AWS - AWS Summit Bahrain 2017Easy Analytics with AWS - AWS Summit Bahrain 2017
Easy Analytics with AWS - AWS Summit Bahrain 2017

AWS has a large and growing portfolio of big data management and analytics services, designed to integrate into solution architectures to meet the needs of your business. In this session, we look at analytics through the eyes of a business intelligence analyst, a data scientist, and an application developer, to explore how to quickly leverage Amazon Redshift, Amazon QuickSight, RStudio, and Amazon Machine Learning to create powerful, yet straightforward, business solutions.

awsamazon web servicesaws cloud
Job execution: Serverless
 Auto-configure VPC and role-based access
 Customers can specify the capacity that
gets allocated to each job
 Automatically scale resources (on post-GA
roadmap)
 You pay only for the resources you
consume while consuming them
There is no need to provision, configure, or
manage servers
Customer VPC Customer VPC
Compute instances
Introduction to AWS Glue

More Related Content

What's hot

Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
Amazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
Amazon Web Services
 
Architecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSArchitecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWS
Amazon Web Services
 
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech TalksTackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Amazon Web Services
 
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
Amazon Web Services Korea
 
Amazon EMR Masterclass
Amazon EMR MasterclassAmazon EMR Masterclass
Amazon EMR Masterclass
Amazon Web Services
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWS
Amazon Web Services
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
Amazon Web Services
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
Amazon Web Services
 
Getting Started with AWS Database Migration Service
Getting Started with AWS Database Migration ServiceGetting Started with AWS Database Migration Service
Getting Started with AWS Database Migration Service
Amazon Web Services
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
Amazon Web Services
 
Introduction to Amazon Aurora
Introduction to Amazon AuroraIntroduction to Amazon Aurora
Introduction to Amazon Aurora
Amazon Web Services
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Amazon Web Services
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
Rodney Joyce
 
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
Amazon Web Services
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
Cobus Bernard
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
Amazon Web Services
 
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
Timothy McAliley
 
AWS Service Catalog
AWS Service CatalogAWS Service Catalog
AWS Service Catalog
Amazon Web Services
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 

What's hot (20)

Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Architecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSArchitecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWS
 
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech TalksTackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
 
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
 
Amazon EMR Masterclass
Amazon EMR MasterclassAmazon EMR Masterclass
Amazon EMR Masterclass
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWS
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
 
Getting Started with AWS Database Migration Service
Getting Started with AWS Database Migration ServiceGetting Started with AWS Database Migration Service
Getting Started with AWS Database Migration Service
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Introduction to Amazon Aurora
Introduction to Amazon AuroraIntroduction to Amazon Aurora
Introduction to Amazon Aurora
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
 
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
 
AWS Service Catalog
AWS Service CatalogAWS Service Catalog
AWS Service Catalog
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 

Similar to Introduction to AWS Glue

Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS Cloud
Amazon Web Services
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
Amazon Web Services Korea
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
Paolo latella
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Amazon Web Services
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Michael Rys
 
Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3
Amazon Web Services
 
Easy Analytics with AWS - AWS Summit Bahrain 2017
Easy Analytics with AWS - AWS Summit Bahrain 2017Easy Analytics with AWS - AWS Summit Bahrain 2017
Easy Analytics with AWS - AWS Summit Bahrain 2017
Amazon Web Services
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
Amazon Web Services
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
Amazon Web Services
 
AWS Certified Solutions Architect Professional Course S15-S18
AWS Certified Solutions Architect Professional Course S15-S18AWS Certified Solutions Architect Professional Course S15-S18
AWS Certified Solutions Architect Professional Course S15-S18
Neal Davis
 
Big data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The NetherlandsBig data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The Netherlands
Marek Kuczynski
 
Adding Search to Relational Databases
Adding Search to Relational DatabasesAdding Search to Relational Databases
Adding Search to Relational Databases
Amazon Web Services
 
AWS Big Data Landscape
AWS Big Data LandscapeAWS Big Data Landscape
AWS Big Data Landscape
Crishantha Nanayakkara
 
Amazon Kinesis Firehose - Pop-up Loft TLV 2017
Amazon Kinesis Firehose - Pop-up Loft TLV 2017Amazon Kinesis Firehose - Pop-up Loft TLV 2017
Amazon Kinesis Firehose - Pop-up Loft TLV 2017
Amazon Web Services
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
javier ramirez
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Amazon Web Services
 
What's New with Big Data Analytics
What's New with Big Data AnalyticsWhat's New with Big Data Analytics
What's New with Big Data Analytics
Amazon Web Services
 
BDA303 Serverless big data architectures: Design patterns and best practices
BDA303 Serverless big data architectures: Design patterns and best practicesBDA303 Serverless big data architectures: Design patterns and best practices
BDA303 Serverless big data architectures: Design patterns and best practices
Amazon Web Services
 
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAnnouncing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Amazon Web Services
 
Adding Search to Relational Databases
Adding Search to Relational DatabasesAdding Search to Relational Databases
Adding Search to Relational Databases
Amazon Web Services
 

Similar to Introduction to AWS Glue (20)

Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS Cloud
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
 
Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3
 
Easy Analytics with AWS - AWS Summit Bahrain 2017
Easy Analytics with AWS - AWS Summit Bahrain 2017Easy Analytics with AWS - AWS Summit Bahrain 2017
Easy Analytics with AWS - AWS Summit Bahrain 2017
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
 
AWS Certified Solutions Architect Professional Course S15-S18
AWS Certified Solutions Architect Professional Course S15-S18AWS Certified Solutions Architect Professional Course S15-S18
AWS Certified Solutions Architect Professional Course S15-S18
 
Big data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The NetherlandsBig data and serverless - AWS UG The Netherlands
Big data and serverless - AWS UG The Netherlands
 
Adding Search to Relational Databases
Adding Search to Relational DatabasesAdding Search to Relational Databases
Adding Search to Relational Databases
 
AWS Big Data Landscape
AWS Big Data LandscapeAWS Big Data Landscape
AWS Big Data Landscape
 
Amazon Kinesis Firehose - Pop-up Loft TLV 2017
Amazon Kinesis Firehose - Pop-up Loft TLV 2017Amazon Kinesis Firehose - Pop-up Loft TLV 2017
Amazon Kinesis Firehose - Pop-up Loft TLV 2017
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
What's New with Big Data Analytics
What's New with Big Data AnalyticsWhat's New with Big Data Analytics
What's New with Big Data Analytics
 
BDA303 Serverless big data architectures: Design patterns and best practices
BDA303 Serverless big data architectures: Design patterns and best practicesBDA303 Serverless big data architectures: Design patterns and best practices
BDA303 Serverless big data architectures: Design patterns and best practices
 
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAnnouncing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
 
Adding Search to Relational Databases
Adding Search to Relational DatabasesAdding Search to Relational Databases
Adding Search to Relational Databases
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
Amazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
Amazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
Amazon Web Services
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Amazon Web Services
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
Amazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
Amazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Amazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
Amazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Amazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Introduction to AWS Glue

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ian Robinson, Specialist SA, Data and Analytics, EMEA 20 September 2017 Introduction to AWS Glue
  • 2. AWS Glue automates the undifferentiated heavy lifting of ETL Automatically discover and categorize your data making it immediately searchable and queryable across data sources Generate code to clean, enrich, and reliably move data between various data sources; you can also use their favorite tools to build ETL jobs Run your jobs on a serverless, fully managed, scale-out environment. No compute resources to provision or manage. Discover Develop Deploy
  • 3. AWS Glue: Components Data Catalog  Hive Metastore compatible with enhanced functionality  Crawlers automatically extracts metadata and creates tables  Integrated with Amazon Athena, Amazon Redshift Spectrum Job Execution  Run jobs on a serverless Spark platform  Provides flexible scheduling  Handles dependency resolution, monitoring and alerting Job Authoring  Auto-generates ETL code  Build on open frameworks – Python and Spark  Developer-centric – editing, debugging, sharing
  • 4. Common Use Cases for AWS Glue
  • 6. Instantly query your data lake on Amazon S3
  • 7. ETL data into your data warehouse
  • 11. AWS Glue Data Catalog
  • 12. Glue data catalog Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools like Hive, Presto, Spark etc. We added a few extensions:  Search over metadata for data discovery  Connection info – JDBC URLs, credentials  Classification for identifying and parsing files  Versioning of table metadata as schemas evolve and other metadata are updated Populate using Hive DDL, bulk import, or automatically through Crawlers.
  • 13. Data Catalog: Crawlers Automatically discover new data, extracts schema definitions • Detect schema changes and version tables • Detect Hive style partitions on Amazon S3 Built-in classifiers for popular types; custom classifiers using Grok expressions Run ad hoc or on a schedule; serverless – only pay when crawler runs Crawlers automatically build your Data Catalog and keep it in sync
  • 14. AWS Glue Data Catalog Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single categorized list that is searchable
  • 15. Data Catalog: Table details Table schema Table properties Data statistics Nested fields
  • 16. Data Catalog: Version control List of table versionsCompare schema versions
  • 17. Data Catalog: Detecting partitions file 1 file N… file 1 file N… date=10 date=15… month=Nov S3 bucket hierarchy Table definition Estimate schema similarity among files at each level to handle semi-structured logs, schema evolution… sim=.99 sim=.95 sim=.93 month date col 1 col 2 str str int float Column Type
  • 18. Data Catalog: Automatic partition detection Automatically register available partitions Table partitions
  • 19. Job Authoring in AWS Glue
  • 20.  Python code generated by AWS Glue  Connect a notebook or IDE to AWS Glue  Existing code brought into AWS Glue Job Authoring Choices
  • 21. 1. Customize the mappings 2. Glue generates transformation graph and Python code 3. Connect your notebook to development endpoints to customize your code Job authoring: Automatic code generation
  • 22.  Human-readable, editable, and portable PySpark code  Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data  Customizable: Use native PySpark, import custom libraries, and/or leverage Glue’s libraries  Collaborative: share code snippets via GitHub, reuse code across jobs Job authoring: ETL code
  • 23. Job Authoring: Glue Dynamic Frames Dynamic frame schema A C D [ ] X Y B1 B2 Like Spark’s Data Frames, but better for: • Cleaning and (re)-structuring semi-structured data sets, e.g. JSON, Avro, Apache logs ... No upfront schema needed: • Infers schema on-the-fly, enabling transformations in a single pass Easy to handle the unexpected: • Tracks new fields, and inconsistent changing data types with choices, e.g. integer or string • Automatically mark and separate error records
  • 24. Job Authoring: Glue transforms ResolveChoice() B B B project B cast B separate into cols B B Apply Mapping() A X Y A X Y Adaptive and flexible C
  • 25. Job authoring: Relationalize() transform Semi-structured schema Relational schema FKA B B C.X C.Y PK ValueOffset A C D [ ] X Y B B • Transforms and adds new columns, types, and tables on-the-fly • Tracks keys and foreign keys across runs • SQL on the relational schema is orders of magnitude faster than JSON processing
  • 26. Job authoring: Glue transformations  Prebuilt transformation: Click and add to your job with simple configuration  Spigot writes sample data from DynamicFrame to S3 in JSON format  Expanding… more transformations to come
  • 27. Job authoring: Write your own scripts Import custom libraries required by your code Convert to a Spark Data Frame for complex SQL-based ETL Convert back to Glue Dynamic Frame for semi-structured processing and AWS Glue connectors
  • 28. Job authoring: Developer endpoints  Environment to iteratively develop and test ETL code.  Connect your IDE or notebook (e.g. Zeppelin) to a Glue development endpoint.  When you are satisfied with the results you can create an ETL job that runs your code. Glue Spark environment Remote interpreter Interpreter server
  • 29. Job Authoring: Leveraging the community No need to start from scratch. Use Glue samples stored in Github to share, reuse, contribute: https://github.com/awslabs/aws-glue-samples • Migration scripts to import existing Hive Metastore data into AWS Glue Data Catalog • Examples of how to use Dynamic Frames and Relationalize() transform • Examples of how to use arbitrary PySpark code with Glue’s Python ETL library Download Glue’s Python ETL library to start developing code in your IDE: https://github.com/awslabs/aws-glue-libs
  • 31. Job execution: Scheduling and monitoring Compose jobs globally with event- based dependencies  Easy to reuse and leverage work across organization boundaries Multiple triggering mechanisms  Schedule-based: e.g., time of day  Event-based: e.g., job completion  On-demand: e.g., AWS Lambda  More coming soon: Data Catalog based events, S3 notifications and Amazon CloudWatch events Logs and alerts are available in Amazon CloudWatch Marketing: Ad-spend by customer segment Event Based Lambda Trigger Sales: Revenue by customer segment Schedule Data based Central: ROI by customer segment Weekly sales Data based
  • 32. Job execution: Job bookmarks For example, you get new files everyday in your S3 bucket. By default, AWS Glue keeps track of which files have been successfully processed by the job to prevent data duplication. Option Behavior Enable Pick up from where you left off Disable Ignore and process the entire dataset every time Pause Temporarily disable advancing the bookmark Marketing: Ad-spend by customer segment Data objects Glue keeps track of data that has already been processed by a previous run of an ETL job. This persisted state information is called a bookmark.
  • 33. Job execution: Serverless  Auto-configure VPC and role-based access  Customers can specify the capacity that gets allocated to each job  Automatically scale resources (on post-GA roadmap)  You pay only for the resources you consume while consuming them There is no need to provision, configure, or manage servers Customer VPC Customer VPC Compute instances