Presentation by Austin Sun, who has earned this certification himself and is helping others do the same.
Adobe’s Unified Profile System is the heart of its Experience Platform: it ingests terabytes of data a day and is petabytes in size. As part of this massive growth we have faced multiple challenges in our Apache Spark deployment, which is used from ingestion to processing. We want to share some of the lessons we learned the hard way as we reached this scale, specifically with Structured Streaming. Know thy lag: while consuming from a Kafka topic that sees sporadic loads, it is very important to monitor consumer lag; it also makes you respect what a beast backpressure is. Other topics covered: reading data in a fan-out pattern using minPartitions to use Kafka efficiently; overload protection using maxOffsetsPerTrigger; more Apache Spark settings used to optimize throughput; micro-batching best practices; map() + forEach() vs. mapPartitions() + forEachPartition(); Spark speculation and its effects; and calculating streaming statistics, including windowing, the importance of the state store, RocksDB FTW, broadcast joins, custom aggregators, and off-heap counters using Redis pipelining.
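A minimal PySpark sketch of the Kafka source options called out above (minPartitions for fan-out, maxOffsetsPerTrigger for overload protection) and of the per-partition write pattern; broker addresses, topic name, and checkpoint path are placeholders, not Adobe's actual configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("profile-ingest-sketch").getOrCreate()

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "profile-events")
    # Fan-out: request more input partitions than the topic has, so a small
    # Kafka partition count does not cap Spark parallelism.
    .option("minPartitions", "64")
    # Overload protection: cap how many records each micro-batch pulls.
    .option("maxOffsetsPerTrigger", "500000")
    .load()
    .select(col("key").cast("string"), col("value").cast("string"))
)

def write_batch(batch_df, batch_id):
    # foreachPartition amortizes connection setup per partition instead of
    # per record (the mapPartitions / forEachPartition point above).
    def flush(rows):
        for row in rows:
            pass  # send to the downstream sink here
    batch_df.foreachPartition(flush)

query = (
    stream.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/checkpoints/profile-ingest")
    .start()
)
```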
Hadoop has become synonymous with Big Data. Oracle has released the latest addition to the Java EE stack: Batch Processing, JSR 352. Batch processing has been around for decades, and many Java frameworks are already available, such as Spring Batch. This talk provides a perspective on Hadoop and JSR 352, and on knowing when to use one, the other, or both together.
The document discusses rewriting a claims reimbursement system using Spark. It describes how Spark provides better performance, scalability and cost savings compared to the previous Oracle-based system. Key points include using Spark for ETL to load data into a Delta Lake data lake, implementing the business logic in a reusable Java library, and seeing significant increases in processing volumes and speeds compared to the prior system. Challenges and tips for adoption are also provided.
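A hedged sketch of the Oracle-to-Delta ETL path described above; the JDBC URL, table names, and output path are illustrative placeholders, and the actual claims business rules live in the team's reusable Java library, not here.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("claims-etl-sketch").getOrCreate()

# Extract from the legacy Oracle system over JDBC.
claims = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/CLAIMS")
    .option("dbtable", "CLAIMS.REIMBURSEMENTS")
    .option("user", "etl_user")
    .option("password", "***")
    .option("fetchsize", "10000")   # larger fetch size speeds up bulk extracts
    .load()
)

# Load into the Delta Lake data lake, partitioned for downstream queries.
(
    claims.write.format("delta")
    .mode("append")
    .partitionBy("claim_date")
    .save("/mnt/datalake/claims/reimbursements")
)
```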
This document discusses Devon Energy's efforts to modernize its data landscape by implementing a data hub architecture. The data hub consolidates various data sources and tools on cloud services like Snowflake, Databricks and Azure. This has improved agility, reduced costs, and allowed various teams to access and analyze data. Devon Energy is working to improve continuous integration/deployment, testing, and monitoring across its data engineering and analytics workflows on the data hub platform.
Business leads, executives, analysts, and data scientists rely on up-to-date information to make business decisions, adjust to the market, meet the needs of their customers, and run effective supply chain operations. Come hear how Asurion used Delta, Structured Streaming, Auto Loader, and SQL Analytics to improve production data latency from day-minus-one to near real time. Asurion's technical team will share battle-tested tips and tricks you only get at a certain scale. Asurion's data lake executes 4,000+ streaming jobs and hosts over 4,000 tables in its production data lake on AWS.
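A minimal Auto Loader sketch of the ingestion pattern named above (Databricks runtime only); bucket paths, table name, and the availableNow trigger (Spark 3.3+/recent DBR) are assumptions, not Asurion's actual setup.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream
    .format("cloudFiles")                       # Databricks Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")
    .load("s3://my-bucket/raw/events/")
)

query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
    .trigger(availableNow=True)                 # incremental, batch-style run
    .toTable("bronze.events")                   # lands in a Delta table
)
```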
Gojek, Indonesia’s first billion-dollar startup, has seen explosive growth in both users and data over the past three years.
Accelerating Data Science Initiatives with Databricks’ Rapid SQL Analytics and Privacera’s Centralized Data Access Governance. Databricks’ SQL Analytics helps data teams consolidate and simplify their data architectures. With SQL Analytics, data teams can perform BI and SQL workloads on the same multi-cloud lakehouse architecture, enabling data scientists to perform advanced analytics on unstructured and large-scale data. This session will explore how Privacera’s advanced security, privacy, and governance capabilities seamlessly integrate with Databricks’ unified SQL Analytics approach to provide single-pane visibility of data analytics from a centralized location. Attendees will learn how to: rapidly access data to run high-fidelity analytics; implement a fully secure solution that ensures productivity while controlling data access at fine-grained levels (row, column, and file); easily enable consistent access policies across all systems and applications; support true data transparency across enterprises; and comply with stringent industry and privacy regulations like GDPR, LGPD, HIPAA, CCPA, PCI DSS, RTBF, and more, with rich auditing and reporting.
Driverless AI can run on various cloud platforms and on-premises servers. It supports Linux environments with CUDA GPUs. The document provides step-by-step instructions for setting up Driverless AI on an IBM Power P9 system, including installing prerequisites, running experiments through a web interface, and automating training with Python. It also addresses common customer questions about installation, deployment, and productionizing Driverless AI models and pipelines.
This document discusses ETL practices and opportunities for improving data integration processes. It presents ELT and RIT approaches to extract, load, and transform data in Hadoop/MPP systems for better performance and scalability. While data modeling is still important, the document questions how to balance normalization with ease of querying for analytics. Integration is noted as key to bringing value from distributed data sources, and challenges of unique identifiers and cross-referencing data are discussed. The document also emphasizes best practices like profiling, prototyping, deploying to sandboxes before production, and ensuring tools for performance monitoring, problem detection and education are in place.
This document provides an introduction to Google BigQuery, a cloud-based data warehouse that allows users to interactively query and analyze massive datasets. It begins with background on big data and technologies like Hadoop, Hive, and Spark. It then explains the differences between row-based and column-based data stores, with BigQuery using a columnar approach. The rest of the document demonstrates BigQuery through an example query on public datasets and provides pricing and resource information.
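A small sketch of the kind of public-dataset query demonstrated in the document, here issued from Python; it assumes the google-cloud-bigquery client is installed and credentials are already configured.

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE state = 'TX'
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

# BigQuery's columnar storage means the query is billed by the bytes it scans,
# not by the number of rows returned.
for row in client.query(sql).result():
    print(row.name, row.total)
```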
An introduction to using R in Power BI via its various touch points, such as R script data sources, R transformations, custom R visuals, and the community gallery of R visualizations.
This document provides an overview of using Polybase for data virtualization in SQL Server. It discusses installing and configuring Polybase, connecting external data sources like Azure Blob Storage and SQL Server, using Polybase DMVs for monitoring and troubleshooting, and techniques for optimizing performance like predicate pushdown and creating statistics on external tables. The presentation aims to explain how Polybase can be leveraged to virtually access and query external data using T-SQL without needing to know the physical data locations or move the data.
This document provides an introduction to Azure SQL Data Warehouse. It discusses the architecture of ASDW including how it is built on Azure SQL Database and Analytics Platform System (APS). It covers various topics like database design, querying, data loading, tooling, and maintenance for ASDW. The goals are to understand the basic infrastructure, learn design/querying/migration methods, and investigate available tooling for automation and monitoring of ASDW.
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.
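A minimal sketch of what that ACID layer looks like in practice: every write goes through a transaction log, so readers always see a consistent snapshot. It assumes a Spark session already configured with the Delta Lake package (e.g., Databricks or delta-spark); the path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 1000).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Appends and overwrites are atomic; a failed job leaves no partial files visible.
spark.range(1000, 2000).withColumnRenamed("id", "event_id") \
    .write.format("delta").mode("append").save("/tmp/delta/events")

print(spark.read.format("delta").load("/tmp/delta/events").count())  # 2000
```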
This document discusses using MongoDB as an agile NoSQL database for big data applications. It describes MongoDB's schema-less design, horizontal scaling, and replication capabilities which make it a good fit for frequently changing agile projects. The document includes examples of using MongoDB for an e-commerce catalog with dynamic product data and reviews from multiple sources.
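A short pymongo sketch of the schema-less catalog idea described above; the connection string, database, and field names are illustrative, not taken from the document.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
catalog = client.shop.products

# Two products with different attribute sets live in the same collection --
# no ALTER TABLE needed when a new product type appears.
catalog.insert_many([
    {"sku": "TV-100", "type": "tv", "screen_inches": 55,
     "reviews": [{"source": "site", "rating": 5}]},
    {"sku": "BOOK-7", "type": "book", "author": "N. Example", "pages": 320,
     "reviews": [{"source": "partner", "rating": 4}]},
])

# Query across heterogeneous documents, e.g. anything with a 4+ star review.
for product in catalog.find({"reviews.rating": {"$gte": 4}}, {"sku": 1, "_id": 0}):
    print(product["sku"])
```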
Apache Spark has been an integral part of Stitch Fix’s compute infrastructure. Over the past five years, it has become our de facto standard for most ETL and heavy data processing needs and has expanded our capabilities in the Data Warehouse. Since all our writes to the Data Warehouse go through Apache Spark, we took advantage of that to add modules that supplement ETL writing. Config-driven and purposeful, these modules perform tasks on a Spark DataFrame destined for a Hive table, organized as a sequence of transformations applied to the DataFrame before it is written. One such module is journalizing, a process that maintains a non-duplicated historical record of mutable data associated with different parts of our business. Data quality, another module, is enabled on the fly: using Apache Spark we calculate metrics, and an adjacent service runs quality tests on the incoming data for a table. Finally, we cleanse data based on the provided configurations, then validate and write it into the warehouse. An internal versioning strategy in the Data Warehouse lets us distinguish new data from old data for a table. Applying these modules at write time means data is cleaned, validated, and tested before it enters the Data Warehouse, programmatically relieving us of most data problems. This talk focuses on ETL writing at Stitch Fix and describes the modules that help our Data Scientists on a daily basis.
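A hypothetical sketch of the "sequence of transformations before the write" idea; the function names, config keys, and table names below are invented for illustration and are not Stitch Fix's internal APIs.

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import lit

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def cleanse(df: DataFrame, config: dict) -> DataFrame:
    # e.g. drop rows missing required keys
    return df.dropna(subset=config["required_columns"])

def check_quality(df: DataFrame, config: dict) -> DataFrame:
    # fail fast if the incoming batch is implausibly small
    if df.count() < config["min_rows"]:
        raise ValueError("data quality check failed: too few rows")
    return df

def journalize(df: DataFrame, config: dict) -> DataFrame:
    # tag each row with a batch version so a history of mutable records is kept
    return df.withColumn("batch_version", lit(config["batch_version"]))

config = {"required_columns": ["client_id"], "min_rows": 1, "batch_version": 42}
df = spark.table("staging.shipments")

# Config-driven pipeline: each module transforms the DataFrame in turn,
# and only the final result is written to the destination Hive table.
for step in (cleanse, check_quality, journalize):
    df = step(df, config)

df.write.mode("append").saveAsTable("warehouse.shipments")
```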
This document summarizes Mark Kromer's presentation on using Azure Data Factory and Azure Databricks for ETL. It discusses using ADF for nightly data loads, slowly changing dimensions, and loading star schemas into data warehouses. It also covers using ADF for data science scenarios with data lakes. The presentation describes ADF mapping data flows for code-free data transformations at scale in the cloud without needing expertise in Spark, Scala, Python or Java. It highlights how mapping data flows allow users to focus on business logic and data transformations through an expression language and provides debugging and monitoring of data flows.
This document discusses techniques for optimizing Power BI performance. It recommends tracing queries using DAX Studio to identify slow queries and refresh times. Tracing tools like SQL Profiler and log files can provide insights into issues occurring in the data sources, Power BI layer, and across the network. Focusing on optimization by addressing wait times through a scientific process can help resolve long-term performance problems.
What is Data Warehousing? Who needs Data Warehousing? Why is a Data Warehouse required? Types of Systems: OLTP vs. OLAP. Maintenance of a Data Warehouse. The Data Warehousing Life Cycle.
The document describes BlueData's Big Data Lab Accelerator solution which provides software and professional services to deploy a multi-tenant Hadoop and Spark lab environment in two weeks for evaluation of Big Data tools. BlueData's EPIC software simplifies Big Data infrastructure deployment using containers and virtualization. The Accelerator solution includes deployment of the EPIC platform, setup of Hadoop and Spark clusters, data pipeline workshops and implementation of sample use cases to get started with Big Data.
This document summarizes a presentation on using SQL Server Integration Services (SSIS) with HDInsight. It introduces Tillmann Eitelberg and Oliver Engels, who are experts on SSIS and HDInsight. The agenda covers traditional ETL processes, challenges of big data, useful Apache Hadoop components for ETL, clarifying statements about Hadoop and ETL, using Hadoop in the ETL process, how SSIS is more than just an ETL tool, tools for working with HDInsight, getting started with Azure HDInsight, and using SSIS to load and transform data on HDInsight clusters.