This document provides an overview of building a modern cloud analytics solution on Microsoft Azure. It discusses the role of analytics, a brief history of cloud computing, and a data warehouse modernization project. Key challenges covered include the lack of notifications and logging, limited self-service BI, and the difficulty of integrating streaming data. The document proposes solutions to these challenges using Azure services like Data Factory, Kafka, Databricks, and SQL Data Warehouse. It also discusses alternative implementations using tools like Matillion ETL and Snowflake.
Past, present, and future of data mesh at Intuit. This deck describes a vision and strategy for improving data worker productivity through a Data Mesh approach to organizing data and holding data producers accountable. Delivered at the inaugural Data Mesh Learning meetup on 5/13/2021.
Many have dubbed the 2020s the decade of data; this is indeed an era of data zeitgeist. From code-centric software development 1.0, we are entering software development 2.0, a data-centric and data-driven approach where data plays a central theme in our everyday lives. As the volume and variety of data garnered from myriad sources continue to grow at an astronomical scale, and as cloud computing offers cheap compute and storage at scale, data platforms have to match in their ability to process, analyze, and visualize data at scale, with speed and ease. This involves paradigm shifts in how data is processed and stored and in the programming frameworks offered to developers who work with these platforms. In this talk, we will survey some emerging technologies that address the challenges of data at scale, how these tools help data scientists and machine learning developers with their data tasks, why they scale, and how they help future data scientists get started quickly. In particular, we will examine in detail two open-source tools: MLflow (for machine learning lifecycle development) and Delta Lake (for reliable storage of structured and unstructured data). Other emerging tools such as Koalas help data scientists do exploratory data analysis at scale in a language and framework they are familiar with, and we will also touch on emerging data + AI trends in 2021. You will understand the challenges of machine learning model development at scale, why you need reliable and scalable storage, and what other open-source tools are at your disposal for doing data science and machine learning at scale.
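Since the abstract highlights MLflow's role in the machine learning lifecycle, here is a minimal sketch of its tracking API with scikit-learn; the synthetic dataset, hyperparameter values, and metric are illustrative assumptions, not material from the talk.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real training set.
X, y = make_regression(n_samples=1_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5}  # illustrative values
    mlflow.log_params(params)

    model = RandomForestRegressor(**params).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))

    # Metrics and the serialized model are recorded against this run,
    # keeping experiments comparable and reproducible.
    mlflow.log_metric("mse", mse)
    mlflow.sklearn.log_model(model, "model")
```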
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (SMP) architectures
The document discusses streaming data pipelines and covers:
- The FLaNK stack, which is composed of Apache NiFi, Apache Kafka, Apache Flink, and Java.
- SQL Stream Builder, which allows developers, analysts, and data scientists to write streaming applications using standard SQL without writing Java or Scala code (see the sketch after this list).
- Apache Kafka, a distributed, partitioned, and replicated publish-subscribe messaging system.
- Apache Flink, a framework for distributed stream and batch data processing.
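SQL Stream Builder is Cloudera's product, but the underlying idea, continuous SQL over Kafka topics executed by Flink, can be sketched with Flink's own SQL API via PyFlink. The topic name, broker address, and schema below are assumptions for illustration only.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming TableEnvironment: Flink plans and runs the SQL continuously.
# (Running this requires the Flink Kafka connector jar on the classpath.)
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: a Kafka topic of JSON sensor readings (topic/broker/schema assumed).
t_env.execute_sql("""
    CREATE TABLE sensor_readings (
        sensor_id STRING,
        temperature DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'sensor-readings',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

# Continuous query: per-sensor average temperature over 1-minute windows.
result = t_env.sql_query("""
    SELECT sensor_id,
           TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
           AVG(temperature) AS avg_temp
    FROM sensor_readings
    GROUP BY sensor_id, TUMBLE(event_time, INTERVAL '1' MINUTE)
""")
result.execute().print()
```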
Delta has been powering many production pipelines at scale in the Data and AI space since it was introduced a few years ago. Built on open standards, Delta provides data reliability, improves storage and query performance for big data use cases (both batch and streaming), supports fast interactive queries for BI, and enables machine learning. Delta has matured over the past couple of years on both AWS and Azure and has become the de-facto standard for organizations building their Data and AI pipelines. In today's talk, we will explore building end-to-end pipelines on the Google Cloud Platform (GCP). Through presentation, code examples, and notebooks, we will build a Delta pipeline from ingest to consumption using our Delta Bronze-Silver-Gold architecture pattern and show examples of consuming the Delta files using the BigQuery connector.
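A minimal PySpark sketch of the Bronze-Silver-Gold flow described above; the GCS bucket, column names, and Spark configuration are illustrative assumptions, not the talk's actual notebooks.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes the Delta Lake package and GCS connector are configured.
spark = (
    SparkSession.builder.appName("delta-medallion")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Bronze: land raw JSON events as-is.
raw = spark.read.json("gs://example-bucket/raw/events/")
raw.write.format("delta").mode("append").save("gs://example-bucket/bronze/events")

# Silver: cleanse and deduplicate.
bronze = spark.read.format("delta").load("gs://example-bucket/bronze/events")
silver = bronze.dropDuplicates(["event_id"]).filter(F.col("event_ts").isNotNull())
silver.write.format("delta").mode("overwrite").save("gs://example-bucket/silver/events")

# Gold: business-level aggregate ready for consumption (e.g., via BigQuery).
gold = silver.groupBy("customer_id").agg(F.count("*").alias("event_count"))
gold.write.format("delta").mode("overwrite").save("gs://example-bucket/gold/customer_events")
```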
Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap with one another. In this talk I will cover the use cases of many of the Microsoft products that you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It's a complicated story that I will try to simplify, giving blunt opinions on when to use which products and the pros and cons of each.
This document discusses architecting a data lake. It begins by introducing the speaker and topic. It then defines a data lake as a repository that stores enterprise data in its raw format including structured, semi-structured, and unstructured data. The document outlines some key aspects to consider when architecting a data lake such as design, security, data movement, processing, and discovery. It provides an example design and discusses solutions from vendors like AWS, Azure, and GCP. Finally, it includes an example implementation using Azure services for an IoT project that predicts parts failures in trucks.
The document provides an overview of the Databricks platform, which offers a unified environment for data engineering, analytics, and AI. It describes how Databricks addresses the complexity of managing data across siloed systems by providing a single "data lakehouse" platform where all data and analytics workloads can be run. Key features highlighted include Delta Lake for ACID transactions on data lakes, Auto Loader for streaming data ingestion, notebooks for interactive coding, and governance tools to securely share and catalog data and models.
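For illustration, a minimal Auto Loader sketch as it might appear in a Databricks notebook (where `spark` is predefined); all paths are placeholders, not from the document.

```python
# Auto Loader incrementally discovers new files in a landing path.
stream = (
    spark.readStream.format("cloudFiles")            # Auto Loader source
    .option("cloudFiles.format", "json")             # format of landing files
    .option("cloudFiles.schemaLocation", "/mnt/example/_schemas/events")
    .load("/mnt/example/landing/events")
)

# Append newly arrived records into a Bronze Delta table, with checkpointing
# so the stream resumes exactly where it left off.
(
    stream.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/example/_checkpoints/events")
    .outputMode("append")
    .start("/mnt/example/bronze/events")
)
```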
The document discusses AWS Glue Data Catalog and Amazon Athena. It provides an overview of AWS Glue Data Catalog as a unified metadata repository across data sources. It then describes Amazon Athena as an interactive query service that allows users to analyze data stored in Amazon S3 using standard SQL. Various use cases are presented that demonstrate how customers can use AWS Glue Data Catalog and Amazon Athena together to build data lakes on AWS.
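As a hedged illustration of that combination, the sketch below submits a standard-SQL Athena query over a Glue Data Catalog table using boto3; the region, database, table, and S3 output location are assumptions.

```python
import time

import boto3

# Athena executes SQL over data in S3, using table definitions from the
# Glue Data Catalog. All names below are illustrative placeholders.
athena = boto3.client("athena", region_name="us-east-1")

query = athena.start_query_execution(
    QueryString="SELECT region, COUNT(*) AS orders FROM sales GROUP BY region",
    QueryExecutionContext={"Database": "example_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:  # first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])
```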
This is Part 4 of the GoldenGate series on Data Mesh, a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures and serverless, microservices-based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems. Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform provides these capabilities today. We will discuss the essential technical characteristics of a Data Mesh solution and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Parts 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe

Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)

Mr. Pollock is an expert technology leader for data platforms, big data, data integration, and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data, and Database Migrations. While at IBM, he was head of all Information Integration, Replication, and Governance products; previously Jeff was an independent architect for the US Defense Department, VP of Technology at Cerebra, and CTO of Modulant. He has been engineering artificial-intelligence-based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young's Center for Technology Enablement. Jeff is also the author of "Semantic Web for Dummies" and "Adaptive Information," a frequent keynote speaker at industry conferences, an author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley's Extension covering object-oriented systems, software development process, and enterprise architecture.
A talk presented by Max Schultze from Zalando and Arif Wider from ThoughtWorks at NDC Oslo 2020. Abstract: The Data Lake paradigm is often considered the scalable successor of the more curated Data Warehouse approach when it comes to democratization of data. However, many who went out to build a centralized Data Lake came out with a data swamp of unclear responsibilities, a lack of data ownership, and sub-par data availability. At Zalando - Europe's biggest online fashion retailer - we realised that accessibility and availability at scale can only be guaranteed when moving more responsibilities to those who pick up the data and have the respective domain knowledge - the data owners - while keeping only data governance and metadata information central. Such a decentralized and domain-focused approach has recently been coined a Data Mesh. The Data Mesh paradigm promotes the concept of Data Products which go beyond sharing of files and towards guarantees of quality and acknowledgement of data ownership. This talk will take you on a journey of how we went from a centralized Data Lake to embracing a distributed Data Mesh architecture, and will outline the ongoing efforts to make creation of data products as simple as applying a template.
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Data mesh is a decentralized approach to managing and accessing analytical data at scale. It distributes responsibility for data pipelines and quality to domain experts. The key principles are domain-centric ownership, treating data as a product, and using a common self-service infrastructure platform. Snowflake is well-suited for implementing a data mesh with its capabilities for sharing data and functions securely across accounts and clouds, with built-in governance and a data marketplace for discovery. A data mesh implemented on Snowflake's data cloud can support truly global and multi-cloud data sharing and management according to data mesh principles.
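To make the sharing mechanics concrete, here is a hedged sketch of provider-side Secure Data Sharing using the snowflake-connector-python package; the account, role, database, and object names are all placeholders, not from the document.

```python
import snowflake.connector

# A domain team publishes a curated table as a data product via a share.
# Credentials and names below are illustrative; use a secrets manager in practice.
conn = snowflake.connector.connect(
    account="provider_account",
    user="data_product_owner",
    password="...",
    role="ORDERS_DOMAIN_ADMIN",
)
cur = conn.cursor()

# Create the share and grant access to the published object
# (tables and secure views can be granted to a share).
cur.execute("CREATE SHARE IF NOT EXISTS orders_domain_share")
cur.execute("GRANT USAGE ON DATABASE orders_db TO SHARE orders_domain_share")
cur.execute("GRANT USAGE ON SCHEMA orders_db.published TO SHARE orders_domain_share")
cur.execute(
    "GRANT SELECT ON TABLE orders_db.published.daily_orders "
    "TO SHARE orders_domain_share"
)

# Make the data product available to a consuming account without copying data.
cur.execute("ALTER SHARE orders_domain_share ADD ACCOUNTS = consumer_account")
```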
- Azure Databricks provides a curated platform for data science and machine learning workloads using notebooks, data services, and machine learning tools.
- Only a small fraction of real-world machine learning systems is composed of the actual machine learning code, as vast surrounding infrastructure is required for data collection, feature extraction, model training, and deployment.
- Azure Databricks can be used across many industries for applications like customer analytics, financial modeling, healthcare analytics, industrial IoT, and cybersecurity threat detection through machine learning on structured and unstructured data.
Apache Spark is a fast and general engine for large-scale data processing. It was created at UC Berkeley and is now one of the dominant frameworks in big data. Spark can run programs over 100x faster than Hadoop MapReduce when data fits in memory, or more than 10x faster on disk. It supports Scala, Java, Python, and R. Databricks provides a Spark platform on Azure that is optimized for performance and integrates tightly with other Azure services. Key benefits of Databricks on Azure include security, ease of use, data access, high performance, and the ability to solve complex analytics problems.
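As a concrete illustration, a minimal PySpark sketch; the CSV file and column names are placeholders. Caching is what drives the in-memory speedup claim: iterative workloads reuse the cached dataset instead of re-reading it from disk on every pass, as MapReduce does.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-quickstart").getOrCreate()

# cache() keeps the DataFrame in cluster memory for reuse across actions.
df = spark.read.csv("events.csv", header=True, inferSchema=True).cache()

# A simple aggregation expressed through the DataFrame API.
(
    df.groupBy("country")
    .agg(F.count("*").alias("events"), F.avg("duration").alias("avg_duration"))
    .orderBy(F.desc("events"))
    .show(10)
)
```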
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh? In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry. The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems. This session is targeted for architects, decision-makers, data-engineers, and system designers.
Organizations that have vast amounts of data in legacy applications often experience difficulties delivering that data to business unit end-users. Register to learn how Eliza Corporation and Scholastic overcame this challenge by leveraging a Data Lake solution from NorthBay on AWS to optimize data analytics and provide greater visibility. AWS and NorthBay will give you an in-depth overview of how you can use a Data Lake in conjunction with your existing on-premises or cloud-based Data Warehouse. NorthBay helps organizations scale their ETL and data warehousing workloads using Amazon EMR and Amazon Redshift.

Join us to learn:
• Best practices for using a Data Lake in conjunction with your existing data warehouse
• The key aspects of introducing agile and scrum methodologies into an enterprise
• The most impactful cost-savings levers that are addressed via a cloud data warehouse migration

Who should attend: Heads of Analytics, Heads of BI, Analytics Managers, BI Teams, Senior Analysts
The document discusses the Common Data Model (CDM) and how to use it. It describes CDM as an open-source definition of standard business entities that provides a common data model that can be shared across applications. It outlines how CDM allows building applications faster by composing analytics, user experiences, and automation using integrated Microsoft services. It also discusses moving data into CDM using the Data Integrator and building applications with CDM using PowerApps, the CDS SDK, Microsoft Flow, and Power BI.
Speaker: Frederik Vandeputte
Download SQL Server 2012: http://www.microsoft.com/sqlserver/en/us/get-sql-server/try-it.aspx
This document discusses feature stores and their role in modern machine learning infrastructure. It begins with an introduction and agenda. It then covers challenges with modern data platforms and emerging architectural shifts towards things like data meshes and feature stores. The remainder discusses what a feature store is, reference architectures, and recommendations for adopting feature stores including leveraging existing AWS services for storage, catalog, query, and more.
Modern data warehouses need to be modernized to handle big data, integrate multiple data silos, reduce costs, and reduce time to market. A modern data warehouse blueprint includes a data lake to land and ingest structured, unstructured, external, social, machine, and streaming data alongside a traditional data warehouse. Key challenges for modernization include making data discoverable and usable for business users, rethinking ETL to allow for data blending, and enabling self-service BI over Hadoop. Common tactics for modernization include using a data lake as a landing zone, offloading infrequently accessed data to Hadoop, and exploring data in Hadoop to discover new insights.
So you got a handle on what Big Data is and how you can use it to find business value in your data. Now you need an understanding of the Microsoft products that can be used to create a Big Data solution. Microsoft has many pieces of the puzzle, and in this presentation I will show how they fit together. How does Microsoft enhance and add value to Big Data? From collecting data, transforming it, and storing it, to visualizing it, I will show you Microsoft's solutions for every step of the way.
Serverless SQL is an analytics platform that allows users to analyze data stored in object storage without having to manage infrastructure. Key features include seamless elasticity, pay-per-query consumption, and the ability to analyze data directly in object storage without having to move it. The platform includes serverless storage, data ingest, data transformation, analytics, and automation capabilities. It aims to create a sharing economy for analytics by giving various users, such as developers, data engineers, and analysts, flexible access to data and analytics.
Self-service BI empowers users to reach analytic outputs through data visualizations and reporting tools. Solution Architect and Cloud Solution Specialist, James McAuliffe, will be taking you through a journey of Azure's Modern Data Estate.
This document provides an overview of several Microsoft Azure cloud data and analytics services:
- Azure Data Factory is a data integration service that can move and transform data between cloud and on-premises data stores as part of scheduled or event-driven workflows.
- Azure SQL Data Warehouse is a cloud data warehouse that provides elastic scaling for large BI and analytics workloads. It can scale compute resources on demand.
- Azure Machine Learning enables building, training, and deploying machine learning models and creating APIs for predictive analytics.
- Power BI provides interactive reports, visualizations, and dashboards that can combine multiple datasets and be embedded in applications.
Bring your data to life with Power BI visualizations and insights! With the changing landscape of Power BI features, it is essential to get hold of the configuration and deployment practices within your data platform that keep you on par with compliance and security requirements. In this session we will move from the basics into advanced tricks across this landscape:
- How do you deploy Power BI?
- How do you implement configuration parameters and package BI features as part of an Office 365 rollout in your organisation?
- What are the newest features and enhancements in the Power BI landscape?
- How do you manage on-premises vs. cloud connectivity?
- How can you help and support the Power BI community?

Cloud computing has made it possible to get data to the end-user within a few clicks. We will review how to manage and connect on-premises data to cloud capabilities that take full advantage of data catalogue features while keeping data secure per Information Governance standards. Beyond the nuts and bolts, performance is another aspect every admin has to keep up with, so we will look into a few settings that maximize performance and optimize access to data. You will gain understanding of and insight into the number of tools available for your Business Intelligence needs, with a showcase of demos on where to begin and how to proceed in the BI world.

- D BI A Consulting, consulting@dbia.uk
This document summarizes a presentation about Microsoft Cloud BI capabilities using Windows Azure. The speaker, Andy Tegethoff, is a Microsoft BI architect with over 12 years of experience building BI solutions. The presentation covers key topics like cloud computing models, Cloud BI, and how Microsoft's Azure platform can be used to implement BI solutions in the public cloud or in hybrid cloud/private cloud environments. It provides examples of using Azure SQL Database, SQL Reporting, and HDInsight for big data, as well as running full SQL Server BI implementations on Azure virtual machines. Power BI, a new self-service BI tool from Microsoft, is also summarized. The document concludes by introducing Perficient, the company hosting the presentation.
By managing Data in Motion, Data at Rest, and Data in Use differently, modern Information Management Solutions are enabling a whole range of architecture and design patterns that allow enterprises to fully harness the value in data flowing through their systems. In this session we explored some of the patterns (e.g. operational data lakes, CQRS, microservices and containerisation) that enable CIOs, CDOs and senior architects to tame the data challenge, and start to use data as a cross-enterprise asset.
The document discusses IBM's cloud data services and analytics offerings. It introduces IBM Cloudant for NoSQL database services, IBM dashDB for a cloud data warehouse with built-in analytics, and how they can be used together. Use cases are provided showing how a payment processor leveraged Cloudant's geospatial capabilities, an investment firm used Cloudant and dashDB to enable real-time access to analytics, and a food distributor analyzed sales data from different business units stored in dashDB.
IBM's Big Data platform provides tools for managing and analyzing large volumes of structured, unstructured, and streaming data. It includes Hadoop for storage and processing, InfoSphere Streams for real-time streaming analytics, InfoSphere BigInsights for analytics on data at rest, and PureData System for Analytics (formerly Netezza) for high performance data warehousing. The platform enables businesses to gain insights from all available data to capitalize on information resources and make data-driven decisions.
Data lakes are providing immense value to organizations embracing data science. In this webinar, William will discuss the value of having broad, detailed, and seemingly obscure data available in cloud storage for purposes of expanding Data Science in the organization.
Today, data lakes are widely used and have become extremely affordable as data volumes have grown. However, they are only meant for storage and by themselves provide no direct value. With up to 80% of data stored in the data lake today, how do you unlock its value? The value lies in the compute engine that runs on top of the data lake. Join us for this webinar where Ahana co-founder and Chief Product Officer Dipti Borkar will discuss how to unlock the value of your data lake with the emerging Open Data Lake analytics architecture. Dipti will cover:
- Open Data Lake analytics: what it is and what use cases it supports
- Why companies are moving to an open data lake analytics approach
- Why the open source data lake query engine Presto is critical to this approach (see the sketch after this list)
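To illustrate the query-engine point, a minimal sketch using the presto-python-client package (prestodb); the coordinator host, catalog, schema, and table are placeholders, not from the webinar.

```python
import prestodb

# Connect to a Presto coordinator; the Hive catalog exposes tables defined
# over open file formats (e.g., Parquet) in object storage.
conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="analytics",
)
cur = conn.cursor()

# Standard SQL runs directly against data lake files, no loading step needed.
cur.execute("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```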