In this session, we introduce AWS Glue, provide an overview of its components, and share how you can use AWS Glue to automate discovering your data, cataloging it, and preparing it for analysis.
Learning Objectives:
- Discover dark data that you are currently not analyzing.
- Analyze dark data without moving it into your data warehouse.
- Visualize the results of your dark data analytics.
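To make the discover-and-catalog flow concrete, here is a minimal boto3 sketch of creating and starting a Glue crawler over an S3 prefix; the role ARN, bucket path, crawler name, and database name are placeholder assumptions, not values from the session.

```python
import boto3

# A minimal sketch: create and start an AWS Glue crawler that discovers
# data under an S3 prefix and adds table definitions to the Data Catalog.
# Role ARN, bucket path, and database name below are placeholders.
glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="dark-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="dark_data_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/"}]},  # placeholder path
)

glue.start_crawler(Name="dark-data-crawler")
```

Once the crawler finishes, the discovered tables can be queried in place, without loading the data into a warehouse first.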
Watch the session recording: https://youtu.be/eQjkwhyOOmI Building and managing a large-scale data lake is a complex and time-consuming task. AWS Lake Formation is a fully managed service that lets you build a secure data lake in a matter of days. In this session, learn how AWS Lake Formation makes it easy to configure Amazon S3 and analytics tools such as EMR, Redshift, and Athena for data ingestion, classification, cleansing, transformation, and security. (Launched in the Seoul Region in November 2019.)
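As a rough sketch of the setup the session describes, the boto3 calls below register an S3 location with Lake Formation and grant an analyst role SELECT on a catalog table; every ARN, database, and table name is a placeholder assumption.

```python
import boto3

# A minimal sketch: register an S3 location with Lake Formation, then grant
# an analyst principal SELECT on a cataloged table. All names/ARNs are
# placeholders, not values from the session.
lf = boto3.client("lakeformation", region_name="ap-northeast-2")

lf.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake-bucket",  # placeholder bucket
    UseServiceLinkedRole=True,
)

lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/Analyst"},
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},
    Permissions=["SELECT"],  # analyst can then query via Athena, Redshift, or EMR
)
```

The point of the central grant is that the same permission applies whichever analytics tool the user queries from.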
Amazon EMR enables fast processing of large structured or unstructured datasets, and in this presentation we'll show you how to set up an Amazon EMR job flow to analyse application logs and perform Hive queries against them. We also review best practices around data file organisation on Amazon Simple Storage Service (S3), how clusters can be started from the AWS web console and command line, and how to monitor the status of a MapReduce job. Finally, we take a look at Hadoop ecosystem tools you can use with Amazon EMR and the additional features of the service. See a recording of the webinar based on this presentation on YouTube here: Check out the rest of the Masterclass webinars for 2015 here: http://aws.amazon.com/campaigns/emea/masterclass/ See the Journey Through the Cloud webinar series here: http://aws.amazon.com/campaigns/emea/journey/
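For the command-line angle, here is a minimal boto3 sketch of launching a job flow with Hive and running a Hive script from S3; the release label, instance sizes, roles, and S3 paths are placeholder assumptions rather than values from the webinar.

```python
import boto3

# A minimal sketch: launch an EMR cluster with Hive and run a Hive script
# (stored in S3) against application logs. Release label, instance types,
# roles, and paths are placeholders.
emr = boto3.client("emr", region_name="eu-west-1")

response = emr.run_job_flow(
    Name="log-analysis",
    ReleaseLabel="emr-5.30.0",  # assumed release
    Applications=[{"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
    },
    Steps=[{
        "Name": "Run Hive query over logs",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", "s3://my-bucket/scripts/logs.q"],  # placeholder script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster:", response["JobFlowId"])  # use this id to monitor step status
```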
Today's organisations require a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. A data lake is a new and increasingly popular way to store all of your data, structured and unstructured, in one centralised repository. Since data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know in advance what questions you want to ask of your data. In this webinar, you will discover how AWS gives you fast access to flexible and low-cost IT resources, so you can rapidly build and scale a data lake that can power any kind of analytics, such as data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and internet-of-things processing, regardless of the volume, velocity, and variety of your data. Learning Objectives:
• Discover how you can rapidly scale and build your data lake with AWS.
• Explore the key pillars behind a successful data lake implementation.
• Learn how to use the Amazon Simple Storage Service (S3) as the basis for your data lake.
• Learn about two recently launched AWS services, Amazon Athena and Amazon Redshift Spectrum, that help customers directly query the data lake.
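The "directly query the data lake" idea reduces to a single API call. Below is a minimal boto3 sketch of running an Athena query over data already sitting in S3; the database, table, SQL, and results bucket are all placeholder assumptions.

```python
import boto3

# A minimal sketch: query data in S3 directly with Amazon Athena, with no
# cluster to provision. Database, table, and output bucket are placeholders.
athena = boto3.client("athena", region_name="us-east-1")

resp = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS views "
                "FROM clickstream GROUP BY page ORDER BY views DESC LIMIT 10",
    QueryExecutionContext={"Database": "datalake_db"},       # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
print("Query execution id:", resp["QueryExecutionId"])
```

Redshift Spectrum follows the same schema-on-read idea, reading the same S3 data through an external schema instead of loading it first.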
The document discusses building a data lake on AWS. It describes various AWS services that can be used to ingest, store, transform, analyze and visualize data in the data lake. These services include Amazon S3 for storage, AWS Glue for ETL/data cataloging, AWS Lake Formation for governance, Amazon Athena/EMR for analytics and Amazon QuickSight for visualization. The document also covers data movement options from on-premises to the data lake and real-time streaming of data using services like Kinesis. Machine learning workloads can leverage Amazon SageMaker for training and deployment.
Organizations need to gain insight and knowledge from a growing number of data sources, including Internet of Things (IoT) devices, APIs, clickstreams, and unstructured and log data. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. In this session, we introduce key ETL features of AWS Glue and cover common use cases ranging from scheduled nightly data warehouse loads to near real-time, event-driven ETL flows for your data lake. We discuss how to build scalable, efficient, and serverless ETL pipelines using AWS Glue. Additionally, Merck will share how they built an end-to-end ETL pipeline for their application release management system and launched it in production in less than a week using AWS Glue.
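For a feel of what such a pipeline looks like in code, here is a minimal Glue ETL script sketch: read a cataloged source, remap fields, and write Parquet back to the lake. The database, table, field names, and target path are placeholder assumptions.

```python
# A minimal sketch of an AWS Glue ETL job script. Database, table, field
# names, and the target path are placeholders.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog (placeholder database/table).
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="events")

# Rename and cast fields on the way through.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("event_id", "string", "id", "string"),
              ("ts", "string", "event_time", "timestamp")])

# Write Parquet back to S3 for downstream analytics.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/events/"},
    format="parquet")

job.commit()
```

The same script serves both scheduled and event-driven use: a Glue trigger can fire it nightly or in response to upstream job completion.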
Learn about the new AWS Database Migration Service, which helps you migrate databases with minimal downtime from on-premises and Amazon EC2 environments to Amazon RDS, Amazon Redshift, Amazon Aurora, and EC2 databases. We discuss homogeneous (e.g., Oracle to Oracle, PostgreSQL to PostgreSQL) and heterogeneous (e.g., Oracle to Aurora, SQL Server to MariaDB) database migrations. We also talk about the new AWS Schema Conversion Tool, which saves you development time when migrating your Oracle and SQL Server database schemas, including PL/SQL and T-SQL procedural code, to their MySQL, MariaDB, and Aurora equivalents.
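The minimal-downtime pattern is a full load followed by change data capture. Here is a boto3 sketch of creating such a DMS task; the endpoint and instance ARNs and the schema name are placeholder assumptions, and the source and target endpoints are presumed to exist already.

```python
import boto3
import json

# A minimal sketch: a full-load-plus-CDC migration task in AWS DMS.
# All ARNs and the schema name are placeholders.
dms = boto3.client("dms", region_name="us-east-1")

table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-sales-schema",
        "object-locator": {"schema-name": "sales", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-aurora",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint/src",   # placeholder
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint/tgt",   # placeholder
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep/inst",  # placeholder
    MigrationType="full-load-and-cdc",  # bulk load, then replicate ongoing changes
    TableMappings=json.dumps(table_mappings),
)
```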
This document discusses building a data lake on AWS. It describes using Amazon S3 for storage, Amazon Kinesis for streaming data, and AWS Lambda to populate metadata indexes in DynamoDB and search indexes. It covers using IAM for access control, AWS STS for temporary credentials, and API Gateway and Elastic Beanstalk for interfaces. The data lake provides a foundation for storing and analyzing structured, semi-structured, and unstructured data at scale from various sources in a cost-effective and secure manner.
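The metadata-index pattern described above is straightforward to sketch: a Lambda function subscribed to S3 object-created events writes each object's metadata into a DynamoDB table. The table name and key schema below are placeholder assumptions.

```python
import boto3

# A minimal sketch of the pattern: a Lambda function triggered by S3
# object-created events that records object metadata in a DynamoDB index
# table. Table name and key schema ("object_key") are placeholders.
dynamodb = boto3.resource("dynamodb")
index_table = dynamodb.Table("DataLakeIndex")

def handler(event, context):
    for record in event["Records"]:
        obj = record["s3"]["object"]
        index_table.put_item(Item={
            "object_key": obj["key"],  # assumed partition key
            "bucket": record["s3"]["bucket"]["name"],
            "size_bytes": obj["size"],
            "event_time": record["eventTime"],
        })
    return {"indexed": len(event["Records"])}
```

Querying the index table is then much cheaper than listing S3 buckets when an application needs to find objects by attribute.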
This document provides an introduction to Amazon Aurora, AWS's managed relational database service. It discusses how Aurora was built to provide the speed and availability of commercial databases with the simplicity and cost-effectiveness of open-source databases. The document outlines key Aurora features like automatic scaling, continuous backups, replication across Availability Zones, and integration with other AWS services. Customer case studies show how Aurora provides better performance at lower cost than alternative database options. The document also covers migration options and how Aurora offers a simpler, more cost-effective database solution than on-premises or self-managed options.
This document discusses big data analytics architectural patterns and best practices. It covers collecting and storing data from various sources, processing and analyzing data using tools like Amazon Redshift, Amazon Athena and Amazon EMR, and selecting the appropriate tools based on factors like data structure, access patterns, and data temperature. It also discusses stream/real-time analytics tools and machine learning approaches.
Tech talk on what Azure Databricks is, why you should learn it, and how to get started. We'll use PySpark and talk about some real-life examples from the trenches, including the pitfalls of accidentally leaving your clusters running and receiving a huge bill ;) After this you will hopefully switch to Spark-as-a-service and get rid of your HDInsight/Hadoop clusters. This is part 1 of an 8-part Data Science for Dummies series:
- Databricks for dummies
- Titanic survival prediction with Databricks + Python + Spark ML
- Titanic with Azure Machine Learning Studio
- Titanic with Databricks + Azure Machine Learning Service
- Titanic with Databricks + MLS + AutoML
- Titanic with Databricks + MLFlow
- Titanic with DataRobot
- Deployment, DevOps/MLOps and Operationalization
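A minimal PySpark sketch of the kind of thing the talk demos in a Databricks notebook: load a CSV from mounted storage and aggregate it. The mount path and column names are placeholder assumptions, not the talk's actual demo data.

```python
# A minimal Databricks-notebook-style PySpark sketch. Mount path and column
# names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in a Databricks notebook

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/mnt/demo/titanic.csv"))  # placeholder mount path

(df.groupBy("Pclass")
   .agg(F.avg("Survived").alias("survival_rate"))
   .orderBy("Pclass")
   .show())

# And remember to let the cluster auto-terminate when you're done,
# or that huge bill will find you.
```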
Many organizations have adopted or are in the process of adopting DevOps methodologies in their quest to accelerate the delivery of software capabilities, features, and functionalities to support their organizational objectives. By applying the same practices, DataOps aims to provide the same level of agility in delivering data and information to the organization. AWS Lake Formation, in coordination with other AWS Services, enables DevOps methodologies to be realized through the Data Supply Chain Pipeline.
This document provides an overview of AWS Lake Formation and related services for building a secure data lake. It discusses how Lake Formation provides a centralized management layer for data ingestion, cleaning, security and access. It also describes how Lake Formation integrates with services like AWS Glue, Amazon S3 and ML transforms to simplify and automate many data lake tasks. Finally, it provides an example workflow for using Lake Formation to deduplicate data from various sources and grant secure access for analysis.
As data volumes grow and customers store more data on AWS, they often have valuable data that is not easily discoverable and available for analytics. The AWS Glue Data Catalog provides a central view of your data lake, making data readily available for analytics. We introduce key features of the AWS Glue Data Catalog and its use cases. Learn how crawlers can automatically discover your data, extract relevant metadata, and add it as table definitions to the AWS Glue Data Catalog. We also explore the integration between the AWS Glue Data Catalog and Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
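After a crawler run, you can inspect what it added. Here is a minimal boto3 sketch that pages through the catalog and prints each table's columns; the database name is a placeholder assumption.

```python
import boto3

# A minimal sketch: list the table definitions a crawler added to the Glue
# Data Catalog. The database name is a placeholder; these same tables are
# what Athena, EMR, and Redshift Spectrum query.
glue = boto3.client("glue", region_name="us-east-1")

paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="datalake_db"):  # placeholder database
    for table in page["TableList"]:
        cols = ", ".join(c["Name"] for c in table["StorageDescriptor"]["Columns"])
        print(f"{table['Name']}: {cols}")
```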
Jim Boriotti presents an overview and demo of Azure Synapse Analytics, an integrated data platform for business intelligence, artificial intelligence, and continuous intelligence. Azure Synapse Analytics includes Synapse SQL for querying with T-SQL, Synapse Spark for notebooks in Python, Scala, and .NET, and Synapse Pipelines for data workflows. The demo shows how Azure Synapse Analytics provides a unified environment for all data tasks through the Synapse Studio interface.
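As a small taste of the Synapse Spark side, the sketch below reads Parquet from a data lake path and exposes it to Spark SQL; the storage account, container, path, and column names are placeholder assumptions.

```python
# A minimal Synapse Spark notebook cell sketch. The abfss path and column
# names are placeholders; `spark` is predefined in a Synapse notebook session.
df = spark.read.parquet(
    "abfss://data@mydatalake.dfs.core.windows.net/sales/")  # placeholder path

df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
```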
Running an IT department in a large organization is challenging. You need to provide users with access to the latest technology, while maintaining corporate standards and providing oversight to avoid runaway spending. In this session, you'll hear how Lockheed Martin has used AWS Service Catalog to ensure compliance across the organization. You will also learn how 2nd Watch, an APN Premier Consulting Partner, leverages AWS Service Catalog to manage resources for customers, and is now able to deploy quickly and standardize its workload management. We'll also demo advanced functionality and show how you can get started.
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Today, organizations find themselves in a data-rich world, with a growing need to make all of this data agile and accessible for analysis so they can derive the insights that drive strategic decisions. Creating a data lake helps you manage all the disparate sources of data you are collecting (in their original format) and extract value from them. In this session, learn how to architect and implement a data lake in the AWS Cloud. Learn about best practices as we walk through architectural blueprints.
The document introduces Amazon Athena and AWS Glue. Amazon Athena lets users interactively query data stored in Amazon S3 using standard SQL, while AWS Glue is a fully managed ETL service that automates data extraction, transformation, and loading. Glue discovers how data is organized, crawls data sources to infer schemas, automatically generates ETL code, and manages the execution of data workflows.
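The "manages execution of data workflows" part typically comes down to triggers. Here is a minimal boto3 sketch of a scheduled trigger that runs an ETL job nightly; the trigger and job names are placeholder assumptions.

```python
import boto3

# A minimal sketch: a scheduled Glue trigger that runs an ETL job nightly.
# The trigger and job names are placeholders.
glue = boto3.client("glue", region_name="us-east-1")

glue.create_trigger(
    Name="nightly-etl-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",             # 02:00 UTC every day
    Actions=[{"JobName": "events-etl-job"}],  # placeholder job
    StartOnCreation=True,
)
```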
This document discusses logging scenarios using DynamoDB and Elastic MapReduce. It covers collecting log data in real time using tools like Fluentd and storing it in DynamoDB. It then describes using EMR to perform ETL on the data: extracting from DynamoDB, transforming the data across EC2 instances, and loading it into S3 or DynamoDB. Finally, it discusses analyzing the data using Redshift for queries or CloudSearch for search capabilities.
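On the ingestion side, each collected log event becomes a DynamoDB item, written the way a collector such as Fluentd would. The sketch below shows one such write; the table name and item attributes are placeholder assumptions.

```python
import time
import uuid
import boto3

# A minimal sketch of the ingestion side: write one log event to DynamoDB
# as a collector like Fluentd would. Table name and attributes are
# placeholders; EMR later extracts and transforms these items.
table = boto3.resource("dynamodb").Table("AccessLogs")  # placeholder table

table.put_item(Item={
    "log_id": str(uuid.uuid4()),   # assumed partition key
    "timestamp": int(time.time()),
    "host": "web-01",
    "status": 200,
    "path": "/index.html",
})
```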
Learning Objectives:
- Understand how to build a serverless big data solution quickly and easily
- Learn how to discover and prepare all your data for analytics
- Learn how to query and visualize analytics on all your data to create actionable insights
From theory to implementation: follow the steps of implementing an end-to-end analytics solution, illustrated with best practices and examples in Azure Data Lake. During this full training day we will share the architecture patterns, tooling, learnings, and tips and tricks for building such services on Azure Data Lake. We take you through some anti-patterns and best practices on data loading and organization, give you hands-on time and the chance to develop some of your own U-SQL scripts to process your data, and discuss the pros and cons of files versus tables. These are the slides presented at the SQLBits 2018 Training Day on Feb 21, 2018.
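U-SQL itself isn't reproduced here; as a rough analogue only, the PySpark sketch below mirrors the U-SQL EXTRACT, transform, OUTPUT pattern the training day covers (schema-on-read over raw files, a rollup, results written back to the lake). The paths and column names are placeholder assumptions.

```python
# A PySpark analogue of the U-SQL EXTRACT -> transform -> OUTPUT pattern.
# Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# EXTRACT: schema-on-read over raw delimited files in the lake.
events = (spark.read
          .option("header", "true")
          .csv("adl://mylake.azuredatalakestore.net/raw/events/"))  # placeholder

# Transform: the kind of rollup you would express in a U-SQL SELECT.
daily = events.groupBy("event_date").agg(F.count("*").alias("event_count"))

# OUTPUT: write the result set back to the lake.
daily.write.mode("overwrite").csv(
    "adl://mylake.azuredatalakestore.net/curated/daily_counts/")
```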
The document provides information about querying and analyzing data in Amazon S3 using various AWS services. It discusses:
1. Using Amazon EMR with Apache Spark to process raw web logs delivered to S3 by Kinesis Firehose.
2. Loading the processed data into Amazon Redshift for interactive querying with SQL.
3. Performing ad hoc analysis on the data in S3 using serverless Amazon Athena, without having to set up any infrastructure.
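Step 1 can be sketched in a few lines of Spark: parse raw log lines from the Firehose prefix and write them out as Parquet for Redshift COPY or direct Athena queries. The S3 paths and the common-log-format regexes are placeholder assumptions.

```python
# A minimal Spark sketch of step 1: parse raw web logs delivered by Kinesis
# Firehose to S3 and write them out as Parquet. Paths and regexes assume a
# common-log-format layout and are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("weblog-etl").getOrCreate()

raw = spark.read.text("s3://my-bucket/firehose/raw-logs/")  # placeholder path

# Extract a few fields from each log line (assumed format).
logs = raw.select(
    F.regexp_extract("value", r'^(\S+)', 1).alias("client_ip"),
    F.regexp_extract("value", r'"\S+ (\S+)', 1).alias("url"),
    F.regexp_extract("value", r'" (\d{3})', 1).cast("int").alias("status"),
)

# Write Parquet for Redshift COPY or direct Athena queries.
logs.write.mode("overwrite").parquet("s3://my-bucket/processed/weblogs/")
```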
AWS has a large and growing portfolio of big data management and analytics services, designed to integrate into solution architectures to meet the needs of your business. In this session, we look at analytics through the eyes of a business intelligence analyst, a data scientist, and an application developer, to explore how to quickly leverage Amazon Redshift, Amazon QuickSight, RStudio, and Amazon Machine Learning to create powerful, yet straightforward, business solutions.