This document provides an overview and summary of the author's background and expertise. It states that the author has over 30 years of experience in IT working on many BI and data warehouse projects. It also lists that the author has experience as a developer, DBA, architect, and consultant. It provides certifications held and publications authored as well as noting previous recognition as an SQL Server MVP.
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
Many have dubbed the 2020s the decade of data. This is indeed an era of data zeitgeist.
From code-centric software development 1.0, we are entering software development 2.0, a data-centric and data-driven approach in which data plays a central role in our everyday lives.
As the volume and variety of data garnered from myriad data sources continue to grow at an astronomical scale, and as cloud computing offers cheap computing and data storage resources at scale, data platforms have to match them in their ability to process, analyze, and visualize data at scale, at speed, and with ease. This involves paradigm shifts in how data is processed and stored, and in the programming frameworks that give developers access to these data platforms.
In this talk, we will survey some emerging technologies that address the challenges of data at scale, how these tools help data scientists and machine learning developers with their data tasks, why they scale, and how they help future data scientists get started quickly.
In particular, we will examine in detail two open-source tools: MLflow (for machine learning life cycle development) and Delta Lake (for reliable storage of structured and unstructured data).
We will also look at other emerging tools such as Koalas, which helps data scientists do exploratory data analysis at scale in a language and framework they are familiar with, as well as emerging data + AI trends in 2021.
You will understand the challenges of machine learning model development at scale, why you need reliable and scalable storage, and what other open source tools are at your disposal to do data science and machine learning at scale.
This document is a training presentation on Databricks fundamentals and the data lakehouse concept by Dalibor Wijas from November 2022. It introduces Wijas and his experience. It then discusses what Databricks is, why it is needed, what a data lakehouse is, and how Databricks enables the data lakehouse concept using Apache Spark and Delta Lake. It also covers how Databricks supports data engineering and data warehousing, and the tools it offers for data ingestion, transformation, pipelines and more.
Doug Bateman, a principal data engineering instructor at Databricks, presented on how to build a Lakehouse architecture. He began by introducing himself and his background. He then discussed the goals of describing key Lakehouse features, explaining how Delta Lake enables it, and developing a sample Lakehouse using Databricks. The key aspects of a Lakehouse are that it supports diverse data types and workloads while allowing BI tools to be used directly on source data. Delta Lake provides reliability, consistency, and performance through its ACID transactions, automatic file consolidation, and integration with Spark. Bateman concluded with a demo of creating a Lakehouse.
Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap others. In this talk I will cover the use cases of many of the Microsoft products that you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It’s a complicated story that I will try to simplify, giving blunt opinions of when to use what products and the pros/cons of each.
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Azure DataBricks for Data Engineering by Eugene Polonichko
Dimko Zhluktenko
This document provides an overview of Azure Databricks, an Apache Spark-based analytics platform optimized for Microsoft Azure cloud services. It discusses key components of Azure Databricks including clusters, workspaces, notebooks, visualizations, jobs, alerts, and the Databricks File System. It also outlines how data engineers can leverage Azure Databricks for scenarios like running ETL pipelines, streaming analytics, and connecting business intelligence tools to query data.
This document discusses designing a modern data warehouse in Azure. It provides an overview of traditional vs. self-service data warehouses and their limitations. It also outlines challenges with current data warehouses around timeliness, flexibility, quality and findability. The document then discusses why organizations need a modern data warehouse based on criteria like customer experience, quality assurance and operational efficiency. It covers various approaches to ingesting, storing, preparing, modeling and serving data on Azure. Finally, it discusses architectures like the lambda architecture and common data models.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Modern Data Warehousing with the Microsoft Analytics Platform System
James Serra
The Microsoft Analytics Platform System (APS) is a turnkey appliance that provides a modern data warehouse with the ability to handle both relational and non-relational data. It uses a massively parallel processing (MPP) architecture with multiple CPUs running queries in parallel. The APS includes an integrated Hadoop distribution called HDInsight that allows users to query Hadoop data using T-SQL with PolyBase. This provides a single query interface and allows users to leverage existing SQL skills. The APS appliance is pre-configured with software and hardware optimized to deliver high performance at scale for data warehousing workloads.
Data platform modernization with Databricks.pptx
CalvinSim10
The document discusses modernizing a healthcare organization's data platform from version 1.0 to 2.0 using Azure Databricks. Version 1.0 used Azure HDInsight (HDI) which was challenging to scale and maintain. It presented performance issues and lacked integrations. Version 2.0 with Databricks will provide improved scalability, cost optimization, governance, and ease of use through features like Delta Lake, Unity Catalog, and collaborative notebooks. This will help address challenges faced by consumers, data engineers, and the client.
The new Microsoft Azure SQL Data Warehouse (SQL DW) is an elastic data warehouse-as-a-service and is a Massively Parallel Processing (MPP) solution for "big data" with true enterprise class features. The SQL DW service is built for data warehouse workloads from a few hundred gigabytes to petabytes of data, with truly unique features like disaggregated compute and storage that let customers scale the service to match their needs. In this presentation, we take an in-depth look at implementing a SQL DW, elastic scale (grow, shrink, and pause), and hybrid data clouds with Hadoop integration via PolyBase, allowing for a true SQL experience across structured and unstructured data.
This document discusses data mesh, a distributed data management approach for microservices. It outlines the challenges of implementing microservice architecture including data decoupling, sharing data across domains, and data consistency. It then introduces data mesh as a solution, describing how to build the necessary infrastructure using technologies like Kubernetes and YAML to quickly deploy data pipelines and provision data across services and applications in a distributed manner. The document provides examples of how data mesh can be used to improve legacy system integration, batch processing efficiency, multi-source data aggregation, and cross-cloud/environment integration.
This document provides an overview of using Azure Data Factory (ADF) for ETL workflows. It discusses the components of modern data engineering and how to design ETL processes in Azure, and gives an overview of ADF and its components. It also previews a demo on creating an ADF pipeline to copy data into Azure Synapse Analytics. The agenda includes discussions of data ingestion techniques in ADF and components of ADF like linked services, datasets, pipelines and triggers. It concludes with references, a Q&A section and a request for feedback.
Organizations are struggling to manually classify and inventory distributed and heterogeneous data assets in order to deliver value. However, the new Azure service for enterprises, Azure Synapse Analytics, is poised to help organizations fill the gap between data warehouses and data lakes.
Organizations are struggling to make sense of their data within antiquated data platforms. Snowflake, the data warehouse built for the cloud, can help.
Azure data analytics platform - A reference architecture
Rajesh Kumar
This document provides an overview of Azure data analytics architecture using the Lambda architecture pattern. It covers Azure data and services, including ingestion, storage, processing, analysis and interaction services. It provides a brief overview of the Lambda architecture including the batch layer for pre-computed views, speed layer for real-time views, and serving layer. It also discusses Azure data distribution, SQL Data Warehouse architecture and design best practices, and data modeling guidance.
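As a rough illustration of the three Lambda layers named in this summary, the following is a minimal plain-Python sketch, not code from the deck and not tied to any Azure service. The event list, the `BATCH_CUTOFF` value, and the function names `batch_view`, `speed_view`, and `serve` are all hypothetical names chosen for the example.

```python
from collections import Counter

# Hypothetical event log: (timestamp, page) pairs standing in for ingested data.
EVENTS = [
    (1, "home"), (2, "pricing"), (3, "home"),
    (4, "docs"), (5, "home"), (6, "pricing"),
]

# Assumed time of the last completed batch run (illustrative value).
BATCH_CUTOFF = 4


def batch_view(events, cutoff):
    """Batch layer: precompute counts over all events up to and including the cutoff."""
    return Counter(page for ts, page in events if ts <= cutoff)


def speed_view(events, cutoff):
    """Speed layer: count only the recent events the batch run has not yet covered."""
    return Counter(page for ts, page in events if ts > cutoff)


def serve(batch, speed, page):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch[page] + speed[page]


batch = batch_view(EVENTS, BATCH_CUTOFF)
speed = speed_view(EVENTS, BATCH_CUTOFF)
home_views = serve(batch, speed, "home")
```

In this sketch the serving layer simply adds the two views together, which works because page counts are additive; other metrics would need a different merge rule.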
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I'll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I'll discuss Microsoft's version of the data mesh.
Azure Databricks - An Introduction (by Kris Bock)
Daniel Toomey
Azure Databricks is a fast, easy to use, and collaborative Apache Spark-based analytics platform optimized for Azure. It allows for interactive collaboration through a unified workspace, enables sharing of insights through integration with Power BI, and provides native integration with other Azure services. It also offers enterprise-grade security through integration with Azure Active Directory and compliance features.
Intuit's Data Mesh - Data Mesh Learning Community meetup 5.13.2021
Tristan Baker
Past, present and future of data mesh at Intuit. This deck describes a vision and strategy for improving data worker productivity through a Data Mesh approach to organizing data and holding data producers accountable. Delivered at the inaugural Data Mesh Learning meetup on 5/13/2021.
Microsoft Data Platform - What's included
James Serra
This document provides an overview of a speaker and their upcoming presentation on Microsoft's data platform. The speaker is a 30-year IT veteran who has worked in various roles including BI architect, developer, and consultant. Their presentation will cover collecting and managing data, transforming and analyzing data, and visualizing and making decisions from data. It will also discuss Microsoft's various product offerings for data warehousing and big data solutions.
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
James Serra
It can be quite challenging keeping up with the frequent updates to the Microsoft products and understanding all their use cases and how all the products fit together. In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn't, in order for you to position, design and deliver the proper adoption use cases for each with your customers. We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob storage, and AAS as well as high-level concepts such as when to use a data lake. We will also review the most common reference architectures (“patterns”) witnessed in customer adoption.
Prague data management meetup 2018-03-27
Martin Bém
This document discusses different data types and data models. It begins by describing unstructured, semi-structured, and structured data. It then discusses relational and non-relational data models. The document notes that big data can include any of these data types and models. It provides an overview of Microsoft's data management and analytics platform and tools for working with structured, semi-structured, and unstructured data at varying scales. These include offerings like SQL Server, Azure SQL Database, Azure Data Lake Store, Azure Data Lake Analytics, HDInsight and Azure Data Warehouse.
So you've got a handle on what Big Data is and how you can use it to find business value in your data. Now you need an understanding of the Microsoft products that can be used to create a Big Data solution. Microsoft has many pieces of the puzzle and in this presentation I will show how they fit together. How does Microsoft enhance and add value to Big Data? From collecting data, transforming it, storing it, to visualizing it, I will show you Microsoft’s solutions for every step of the way.
Azure SQL Database Managed Instance is a new flavor of Azure SQL Database that is a game changer. It offers near-complete SQL Server compatibility and network isolation to easily lift and shift databases to Azure (you can literally back up an on-premises database and restore it into an Azure SQL Database Managed Instance). Think of it as an enhancement to Azure SQL Database that is built on the same PaaS infrastructure and maintains all its features (e.g. active geo-replication, high availability, automatic backups, database advisor, threat detection, intelligent insights, vulnerability assessment, etc.) but adds support for databases up to 35TB, VNET, SQL Agent, cross-database querying, replication, etc. So, you can migrate your databases from on-prem to Azure with very little migration effort, which is a big improvement over the current Singleton or Elastic Pool flavors, which can require substantial changes.
ENT305 Migrating Your Databases to AWS: Deep Dive on Amazon Relational Databa...
Amazon Web Services
Amazon RDS allows you to launch an optimally configured, secure and highly available database with just a few clicks. It provides cost-efficient and resizable capacity, automates time-consuming database administration tasks, and provides you with six familiar database engines to choose from: Amazon Aurora, Oracle, Microsoft SQL Server, PostgreSQL, MySQL and MariaDB. In this session, we will take a close look at the capabilities of Amazon RDS and explain how it works. We’ll also discuss the AWS Database Migration Service and AWS Schema Conversion Tool, which help you migrate databases and data warehouses with minimal downtime from on-premises and cloud environments to Amazon RDS and other Amazon services. Gain your freedom from expensive, proprietary databases while providing your applications with the fast performance, scalability, high availability, and compatibility they need.
AWS offers a wide variety of database services to fit your application's requirements. The database services are fully managed and can be deployed in minutes with just a few clicks. AWS services include Amazon Relational Database Service (Amazon RDS), which supports 6 common database engines; Amazon Aurora, a MySQL-compatible relational database with 5 times the performance; Amazon DynamoDB, a fast and flexible NoSQL database service; Amazon Redshift, a petabyte-scale data warehouse; and Amazon ElastiCache, an in-memory cache service compatible with Memcached and Redis. AWS also provides AWS Database Migration Service, a service that makes it easy and cost-effective to migrate databases to the AWS cloud.
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
Neel Mitra - Solutions Architect, AWS
Roger Dahlstrom - Solutions Architect, AWS
Modern DW Architecture
The document discusses modern data warehouse architectures using Azure cloud services like Azure Data Lake, Azure Databricks, and Azure Synapse. It covers storage options like ADLS Gen 1 and Gen 2 and data processing tools like Databricks and Synapse. It highlights how to optimize architectures for cost and performance using features like auto-scaling, shutdown, and lifecycle management policies. Finally, it provides a demo of a sample end-to-end data pipeline.
Choosing technologies for a big data solution in the cloud
James Serra
Has your company been building data warehouses for years using SQL Server? And are you now tasked with creating or moving your data warehouse to the cloud and modernizing it to support “Big Data”? What technologies and tools should you use? That is what this presentation will help you answer. First we will cover what questions to ask concerning data (type, size, frequency), reporting, performance needs, on-prem vs cloud, staff technology skills, OSS requirements, cost, and MDM needs. Then we will show you common big data architecture solutions and help you to answer questions such as: Where do I store the data? Should I use a data lake? Do I still need a cube? What about Hadoop/NoSQL? Do I need the power of MPP? Should I build a "logical data warehouse"? What is this lambda architecture? Can I use Hadoop for my DW? Finally, we’ll show some architectures of real-world customer big data solutions. Come to this session to get started down the path to making the proper technology choices in moving to the cloud.
2014.10.22 Building Azure Solutions with Office 365
Marco Parenzan
This document discusses building Azure solutions with Office 365. It provides an overview of Microsoft Azure services including compute, storage, networking and identity services. It also discusses Office 365 APIs for integrating with calendar, mail and contacts. Code samples are shown for accessing these APIs through REST calls and a library that abstracts away the REST requests. The document concludes with a demonstration of an application that integrates Office 365 and Azure services.
Antoine Genereux takes us on a detailed overview of the Database solutions available on the AWS Cloud, addressing the needs and requirements of customers at all levels. He also discusses Business Intelligence and Analytics solutions.
Azure Days 2019: Business Intelligence on Azure (Marco Amhof & Yves Mauron)
Trivadis
In this session we present a project in which we built a comprehensive BI system for and in the Azure cloud using Azure Blob Storage, Azure SQL, Azure Logic Apps, and Azure Analysis Services. We report on the challenges, how we solved them, and the learnings and best practices we took away.
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Level: Intermediate
Speakers:
Tony Nguyen - Senior Consultant, ProServe, AWS
Hannah Marlowe - Consultant - Federal, AWS
What is in a modern BI architecture? In this presentation, we explore PaaS, Azure Active Directory and Storage options including SQL Database and SQL Data Warehouse.
The cloud is all the rage. Does it live up to its hype? What are the benefits of the cloud? Join me as I discuss the reasons so many companies are moving to the cloud and demo how to get up and running with a VM (IaaS) and a database (PaaS) in Azure. See why the ability to scale easily, the speed with which you can create a VM, and the built-in redundancy are just some of the reasons that moving to the cloud is a “no brainer”. And if you have an on-prem datacenter, learn how to get out of the air-conditioning business!
Microsoft Fabric is the next version of Azure Data Factory, Azure Data Explorer, Azure Synapse Analytics, and Power BI. It brings all of these capabilities together into a single unified analytics platform that goes from the data lake to the business user in a SaaS-like environment. Therefore, the vision of Fabric is to be a one-stop shop for all the analytical needs of every enterprise and one platform for everyone from a citizen developer to a data engineer. Fabric will cover the complete spectrum of services including data movement, data lake, data engineering, data integration and data science, observational analytics, and business intelligence. With Fabric, there is no need to stitch together different services from multiple vendors. Instead, the customer enjoys an end-to-end, highly integrated, single offering that is easy to understand, onboard, create and operate.
This is a hugely important new product from Microsoft and I will simplify your understanding of it via a presentation and demo.
Agenda:
What is Microsoft Fabric?
Workspaces and capacities
OneLake
Lakehouse
Data Warehouse
ADF
Power BI / DirectLake
Resources
Data Warehousing Trends, Best Practices, and Future Outlook
James Serra
Over the last decade, the 3Vs of data (Volume, Velocity & Variety) have grown massively. The Big Data revolution has completely changed the way companies collect, analyze & store data. Advancements in cloud-based data warehousing technologies have empowered companies to fully leverage big data without heavy investments both in terms of time and resources. But that doesn’t mean building and managing a cloud data warehouse comes without challenges. From deciding on a service provider to the design architecture, deploying a data warehouse tailored to your business needs is a strenuous undertaking. Looking to deploy a data warehouse to scale your company’s data infrastructure or still on the fence? In this presentation you will gain insights into the current Data Warehousing trends, best practices, and future outlook. Learn how to build your data warehouse with the help of real-life use-cases and discussion on commonly faced challenges. In this session you will learn:
- Choosing the best solution - Data Lake vs. Data Warehouse vs. Data Mart
- Choosing the best Data Warehouse design methodologies: Data Vault vs. Kimball vs. Inmon
- Step by step approach to building an effective data warehouse architecture
- Common reasons for the failure of data warehouse implementations and how to avoid them
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
Power BI Overview, Deployment and Governance
James Serra
This document provides an overview of external sharing in Power BI using Azure Active Directory Business-to-Business (Azure B2B) collaboration. Azure B2B allows Power BI content to be securely distributed to guest users outside the organization while maintaining control over internal data. There are three main approaches for sharing - assigning Pro licenses manually, using guest's own licenses, or sharing to guests via Power BI Premium capacity. Azure B2B handles invitations, authentication, and governance policies to control external sharing. All guest actions are audited. Conditional access policies can also be enforced for guests.
Power BI has become a product with a ton of exciting features. This presentation will give an overview of some of them, including Power BI Desktop, Power BI service, what’s new, integration with other services, Power BI premium, and administration.
The breadth and depth of Azure products that fall under the AI and ML umbrella can be difficult to follow. In this presentation I’ll first define exactly what AI, ML, and deep learning are, and then go over the various Microsoft AI and ML products and their use cases.
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
James Serra
Discover, manage, deploy, monitor – rinse and repeat. In this session we show how Azure Machine Learning can be used to create the right AI model for your challenge and then easily customize it using your development tools while relying on Azure ML to optimize it to run in hardware accelerated environments for the cloud and the edge using FPGAs and Neural Network accelerators. We then show you how to deploy the model to highly scalable web services and nimble edge applications that Azure can manage and monitor for you. Finally, we illustrate how you can leverage the model telemetry to retrain and improve your content.
Power BI for Big Data and the New Look of Big Data Solutions
James Serra
New features in Power BI give it enterprise tools, but that does not mean it automatically creates an enterprise solution. In this talk we will cover these new features (composite models, aggregations tables, dataflow) as well as Azure Data Lake Store Gen2, and describe the use cases and products of an individual, departmental, and enterprise big data solution. We will also talk about why a data warehouse and cubes still should be part of an enterprise solution, and how a data lake should be organized.
In three years I went from a complete unknown to a popular blogger, speaker at PASS Summit, a SQL Server MVP, and then joined Microsoft. Along the way I saw my yearly income triple. Is it because I know some secret? Is it because I am a genius? No! It is just about laying out your career path, setting goals, and doing the work.
I'll cover tips I learned over my career on everything from interviewing to building your personal brand. I'll discuss perm positions, consulting, contracting, working for Microsoft or partners, hot fields, in-demand skills, social media, networking, presenting, blogging, salary negotiating, dealing with recruiters, certifications, speaking at major conferences, resume tips, and keys to a high-paying career.
Your first step to enhancing your career will be to attend this session! Let me be your career coach!
Is the traditional data warehouse dead?
James Serra
With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that? No! In the presentation I’ll discuss why you still need a relational data warehouse and how to use a data lake and an RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I’ll put it all together by showing common big data architectures.
Learning to present and becoming good at it
James Serra
Have you been thinking about presenting at a user group? Are you being asked to present at your work? Is learning to present one of the keys to advancing your career? Or do you just think it would be fun to present but you are too nervous to try it? Well, take the first step to becoming a presenter by attending this session and I will guide you through the process of learning to present and becoming good at it. It’s easier than you think! I am an introvert and was deathly afraid to speak in public. Now I love to present and it’s actually my main function in my job at Microsoft. I’ll share with you the journey that led me to speak at major conferences and the skills I learned along the way to become a good presenter and to get rid of the fear. You can do it!
Think of big data as all data, no matter what the volume, velocity, or variety. The simple truth is a traditional on-prem data warehouse will not handle big data. So what is Microsoft’s strategy for building a big data solution? And why is it best to have this solution in the cloud? That is what this presentation will cover. Be prepared to discover all the various Microsoft technologies and products from collecting data, transforming it, storing it, to visualizing it. My goal is to help you not only understand each product but understand how they all fit together, so you can be the hero who builds your company’s big data solution.
The document summarizes new features in SQL Server 2016 SP1, organized into three categories: performance enhancements, security improvements, and hybrid data capabilities. It highlights key features such as in-memory technologies for faster queries, Always Encrypted for data security, and PolyBase for querying relational and non-relational data. New editions like Express and Standard provide more built-in capabilities. The document also reviews SQL Server 2016 SP1 features by edition, showing advanced features are now more accessible across more editions.
DocumentDB is a powerful NoSQL solution. It provides elastic scale, high performance, global distribution, a flexible data model, and is fully managed. If you are looking for a scaled OLTP solution that is too much for SQL Server to handle (i.e. millions of transactions per second) and/or will be using JSON documents, DocumentDB is the answer.
First introduced with the Analytics Platform System (APS), PolyBase simplifies management and querying of both relational and non-relational data using T-SQL. It is now available in both Azure SQL Data Warehouse and SQL Server 2016. The major features of PolyBase include the ability to do ad-hoc queries on Hadoop data and the ability to import data from Hadoop and Azure blob storage to SQL Server for persistent storage. A major part of the presentation will be a demo on querying and creating data on HDFS (using Azure Blobs). Come see why PolyBase is the “glue” to creating federated data warehouse solutions where you can query data as it sits instead of having to move it all to one data platform.
Machine learning allows us to build predictive analytics solutions of tomorrow - these solutions allow us to better diagnose and treat patients, correctly recommend interesting books or movies, and even make the self-driving car a reality. Microsoft Azure Machine Learning (Azure ML) is a fully-managed Platform-as-a-Service (PaaS) for building these predictive analytics solutions. It is very easy to build solutions with it, helping to overcome the challenges most businesses have in deploying and using machine learning. In this presentation, we will take a look at how to create ML models with Azure ML Studio and deploy those models to production in minutes.
Introduction to Microsoft’s Hadoop solution (HDInsight)James Serra
Did you know Microsoft provides a Hadoop Platform-as-a-Service (PaaS)? It’s called Azure HDInsight and it deploys and provisions managed Apache Hadoop clusters in the cloud, providing a software framework designed to process, analyze, and report on big data with high reliability and availability. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution that includes many Hadoop components such as HBase, Spark, Storm, Pig, Hive, and Mahout. Join me in this presentation as I talk about what Hadoop is, why deploy to the cloud, and Microsoft’s solution.
HA/DR options with SQL Server in Azure and hybridJames Serra
What are all the high availability (HA) and disaster recovery (DR) options for SQL Server in a Azure VM (IaaS)? Which of these options can be used in a hybrid combination (Azure VM and on-prem)? I will cover features such as AlwaysOn AG, Failover cluster, Azure SQL Data Sync, Log Shipping, SQL Server data files in Azure, Mirroring, Azure Site Recovery, and Azure Backup.
Azure data platform overview
2. About Me
Microsoft, Big Data Evangelist
In IT for 30 years, worked on many BI and DW projects
Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
Been perm employee, contractor, consultant, business owner
Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions
Blog at JamesSerra.com
Former SQL Server MVP
Author of book “Reporting with Microsoft SQL Server 2012”
3. I tried to understand the Microsoft data platform on my own…
And felt like I was body slammed by Randy Savage:
Let’s prevent that from happening…
4. Data platform continuum
[Diagram: a continuum from on premises, through hybrid cloud, to off premises. One axis runs from higher administration to lower administration; the other runs from dedicated (higher cost) to shared (lower cost).]
9. Microsoft Big Data Portfolio
[Diagram: products arranged in four quadrants. Key: Relational / Non-relational, On-premises / Cloud. Scaling labels: Sequential, Scale Up, Scale Out + Across.]
Products shown: SQL Server 2017, SQL Server 2016 Fast Track, SQL Server Stretch, Analytics Platform System, Hadoop, Azure SQL Database, SQL Server in Azure VM, Azure SQL DW, Databricks, Cosmos DB, HDInsight
Outcomes shown: Business intelligence, Machine learning analytics, Insights
Microsoft has solutions covering and connecting all four quadrants – that’s why SQL Server is one of the most utilized databases in the world
10. VM hosted on Microsoft Azure Infrastructure (“IaaS”)
• From Microsoft images (gallery) or your own images (custom)
SQL 2008R2 / 2012 / 2014 / 2016 / 2017 Web / Standard / Enterprise
Images refreshed with latest version, SP, CU
• Windows Server 2008 R2 / 2012 R2 / 2016, Linux RHEL / Ubuntu
• Fast provisioning (~10 minutes).
• Accessible via RDP and PowerShell
• Full compatibility with SQL Server “Box” software
Pay per use
• Per minute (only when running)
• Cost depends on size and licensing
• EA customers can use existing SQL licenses (BYOL)
• Network: only outgoing (not incoming)
• Storage: only used (not allocated)
Elasticity
• From 1 core / 2 GB mem / 1 TB up to 128 cores / 3.5 TB mem / 256 TB
11. Azure SQL Database
A relational database-as-a-service (“PaaS”), fully managed by Microsoft.
For cloud-designed apps when near-zero administration and enterprise-grade capabilities are key.
Perfect for organizations looking to dramatically increase the DB:IT ratio.
12. Azure SQL Database Managed Instance
Managed Instance: Instance scoped programming model with high compatibility to on-premises databases. Best for modernization at scale with low cost and effort
Single: Standalone managed database best for predictable and stable workloads
Elastic pool: Shared resource model best for greater efficiency through multi-tenancy
13. Supports compatibility modes (SQL Server 2005+), Instance sizes up to 8TB
Security
• TDE
• SQL Audit
• Row level security
• Always Encrypted
14. Scalable / High performance / Reliable and available
Adapts on-demand to your workload's needs, auto-scaling up to 100TB per database.
16. More choices and full integration into Azure’s ecosystem and services
AZURE DATABASE SERVICES FOR MYSQL, POSTGRESQL, AND MARIADB
Managed community MySQL, PostgreSQL, and MariaDB
Scale in seconds with built-in high availability
Secure and compliant
Languages and frameworks of your choice
Industry-leading global reach
Easy Lift and Shift
Enterprise Ready
17. SMP vs MPP
SMP - Symmetric Multiprocessing
• Multiple CPUs used to complete individual processes simultaneously
• All CPUs share the same memory, disks, and network controllers (scale-up)
• SQL Server implementations traditionally have been SMP
• Mostly, the solution is housed on a shared storage
MPP - Massively Parallel Processing
• Uses many separate CPUs running in parallel to execute a single program
• Shared Nothing: Each CPU has its own memory and disk (scale-out)
• Segments communicate using high-speed network between nodes
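The shared-nothing, scale-out idea behind MPP can be sketched in a few lines of Python. This is only an illustration, not how any Microsoft product implements it: the node count, the `crc32` hash, and every function name below are assumptions made for the example. Each row is hashed to exactly one node, each node aggregates only its own slice, and the partial results are then merged.

```python
import zlib
from collections import defaultdict

NODE_COUNT = 4  # hypothetical cluster size, chosen for the example


def node_for(key: str, node_count: int = NODE_COUNT) -> int:
    """Pick the node that owns a row by hashing its distribution key."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def distribute(rows, node_count: int = NODE_COUNT):
    """Shared nothing: each node stores only the rows hashed to it."""
    nodes = [[] for _ in range(node_count)]
    for customer, amount in rows:
        nodes[node_for(customer, node_count)].append((customer, amount))
    return nodes


def local_totals(node_rows):
    """Work one node does on its own slice, without shared memory or disk."""
    totals = defaultdict(int)
    for customer, amount in node_rows:
        totals[customer] += amount
    return dict(totals)


def mpp_totals(rows, node_count: int = NODE_COUNT):
    """Run the per-node step for every node, then merge the partial results.

    The loop is sequential here; a real MPP engine runs the nodes in parallel.
    """
    merged = {}
    for node_rows in distribute(rows, node_count):
        # A given key always hashes to one node, so partial results never collide.
        merged.update(local_totals(node_rows))
    return merged


rows = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3), ("bob", 1)]
result = mpp_totals(rows)
```

Because all rows for one customer live on the same node, the merge step needs no cross-node coordination. That is the property that lets this design scale out by adding nodes.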
19. Azure SQL Data Warehouse
A relational data warehouse-as-a-service, fully managed by Microsoft.
Industry's first elastic cloud data warehouse with enterprise-grade capabilities.
Support your smallest to your largest data storage needs while handling queries up to 100x faster.
20. AZURE RELATIONAL DATABASE PLATFORM
Power BI, App Services, Data Factory, Analytics, ML, Cognitive, Bot…
Database Services Platform: SQL Database, SQL Data Warehouse, MySQL, PostgreSQL, MariaDB
Flexible: On-demand scaling, Resource governance
Trusted: HA/DR, Backup/Restore, Security, Audit, Isolation
Intelligent: Advisors, Tuning, Monitoring
Azure Compute
Azure Storage
Global Azure with 54 Regions
21. Azure Database Migration Service (DMS)
A seamless, end-to-end solution for moving on-premises SQL Server, Oracle, and other relational databases to the cloud.
Azure Database Migration Guide
https://datamigration.microsoft.com/
22. Relational and non-relational defined
Relational databases (RDBMS, SQL Databases)
• Example: Microsoft SQL Server, Oracle Database, IBM DB2
• Mostly used in large enterprise scenarios
• Analytical RDBMS (OLAP, MPP) solutions are SQL DW, Redshift, Teradata, Netezza
Non-relational databases (NoSQL databases)
• Example: Azure Cosmos DB, MongoDB, Cassandra
• Four categories: Key-value stores, Wide-column stores, Document stores and Graph stores
Hadoop: Made up of Hadoop Distributed File System (HDFS), YARN and MapReduce (Ideal for data lake)
OLTP vs OLAP/DW
SMP vs MPP
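To make the four NoSQL categories listed above more concrete, here is one made-up customer order shaped four ways using plain Python structures. The field names and values are invented for illustration, and real products differ in their details.

```python
import json

# Key-value store: an opaque value looked up by a single key.
key_value = {"order:1001": '{"customer": "alice", "total": 42}'}

# Document store: the value is a structured document whose fields can be queried.
document = {
    "_id": 1001,
    "customer": "alice",
    "items": [{"sku": "A1", "qty": 2}],
    "total": 42,
}

# Wide-column store: a row key maps to column families holding sparse columns.
wide_column = {
    "alice": {
        "orders": {"1001:total": 42},
        "profile": {"city": "Seattle"},
    }
}

# Graph store: nodes plus labeled edges between them.
graph = {
    "nodes": {"alice": {"type": "customer"}, "1001": {"type": "order"}},
    "edges": [("alice", "PLACED", "1001")],
}

# The key-value store must parse the value itself; the graph answers
# relationship questions by walking edges.
total_from_kv = json.loads(key_value["order:1001"])["total"]
orders_placed_by_alice = [
    dst for src, label, dst in graph["edges"] if src == "alice" and label == "PLACED"
]
```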
23. Azure Cosmos DB: A globally distributed, massively scalable, multi-model database service
Data models: Key-value, Column-family, Document, Graph
APIs: Table API, MongoDB API, Cassandra API
Turnkey global distribution
Elastic scale out of storage & throughput
Guaranteed low latency at the 99th percentile
Five well-defined consistency models
Comprehensive SLAs
24. Advanced Analytics
Sources: Social, LOB, Graph, IoT, Image, CRM
INGEST (Data orchestration and monitoring): Azure Data Factory, SSIS
STORE (Big data store): Azure Data Lake Storage Gen2, Blob Storage, Azure Data Lake Storage Gen1, SQL Server 2019 Big Data Cluster
PREP (Transform & Clean): Azure Databricks, Azure HDInsight, PolyBase & Stored Procedures, Power BI Dataflow, Azure Data Lake Analytics
MODEL & SERVE (& store) (Data warehouse): Azure SQL Data Warehouse, Azure Analysis Services, SQL Database (Single, MI, HyperScale), SQL Server in a VM, Cosmos DB, Power BI Aggregations
Outputs: AI, BI + Reporting
25. Blob Storage Data Lake Store
Azure Data Lake Storage Gen2
Large partner ecosystem
Global scale – All 50 regions
Durability options
Tiered - Hot/Cool/Archive
Cost Efficient
Built for Hadoop
Hierarchical namespace
ACLs, AAD and RBAC
Performance tuned for big data
Very high scale capacity and throughput
26. LRS
• Multiple replicas across a datacenter
• Protect against disk, node, rack failures
• Write is ack’d when all replicas are committed
• Superior to dual-parity RAID
• 11 9s of durability
• SLA: 99.9%
GRS
• Multiple replicas across each of 2 regions
• Protects against major regional disasters
• Asynchronous to secondary
• 16 9s of durability
• SLA: 99.9%
RA-GRS
• GRS + Read access to secondary
• Separate secondary endpoint
• RPO delay to secondary can be queried
• SLA: 99.99% (read), 99.9% (write)
ZRS
• Replicas across 3 Zones (Zone 1, Zone 2, Zone 3)
• Protect against disk, node, rack and zone failures
• Synchronous writes to all 3 zones
• 12 9s of durability
• Available in 8 regions
• SLA: 99.9%
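The SLA and durability figures above are easier to compare once converted into minutes and object counts. The sketch below does that arithmetic under two assumptions that the slide does not state: a 30-day month for the SLA, and "N 9s of durability" read as the probability that a single object survives one year. The function names are made up for the example.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of unavailability an SLA permits over a period (assumed 30 days)."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def expected_annual_losses(object_count: int, nines: int) -> float:
    """Expected objects lost per year, assuming per-object annual durability
    of 0.99...9 with `nines` nines (an interpretation, not a quoted definition)."""
    return object_count * 10.0 ** -nines


downtime_999 = allowed_downtime_minutes(99.9)    # about 43.2 minutes per 30 days
downtime_9999 = allowed_downtime_minutes(99.99)  # about 4.32 minutes per 30 days
lrs_losses = expected_annual_losses(1_000_000_000, 11)  # about 0.01 objects per year
```

Read this way, moving from a 99.9% to a 99.99% read SLA (RA-GRS) cuts the permitted downtime by a factor of ten, and a billion objects at 11 nines would lose about one object per century on average.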
27. Caching Layer (Avere tech)
Extra Hot Tier - Premium (SSD + NVME)
Hot Tier (HDD)
Cool Tier (HDD)
Cooler Tier
Archive Tier
Deep Storage Tier (Glass, DNA, etc.)
Analytics Engines (Hadoop, Spark, SCOPE …)
High Performance Compute
AI / ML
Legend: Current, Future
Edge: Azure File Sync, Azure Backup, Data Box, Data Box Edge, Azure Stack, Avere FXT
Access protocols: REST, HDFS, NFS, SMB, …
Automatic Lifecycle Management
28. File Sync
• Windows Srv <-> Azure
• Local caching
• With offline (Databox) can 'sync' remainder
Fuse
• Mount blobs as local FS
• Commit on write
• Linux
Site Replication
• On premise & cloud
• Windows, Linux
• Physical, virtual
• Hyper-V, VMWare
Network Acceleration
• Aspera
• Signiant
AZCopy
• Throughput +30%
• S3 to Azure Blobs
• Sync to cloud
• Hi Latency 10-100%
NetApp
• CloudSync
• SnapMirror
• SnapVault
Data Factory
• On premise & cloud sources
• Structured & unstructured
• Over 60 connectors
• UI design data flow
Partners
• Peer Global File Service
• Talon FAST
• Zerto
• …
Offline
• Data Box
• Data Box Heavy
• Data Box Disk
• Disk Import / Export
Fast Data Transfer
microsoft.com/en-us/garage/profiles/fast-data-transfer/
29. Data Box Heavy PREVIEW
• Capacity: 1 PB
• Weight 500+ lbs
• Secure, ruggedized appliance
• Preview September 2018
• Same service as Data Box, but targeted to petabyte-sized datasets.
Data Box Gateway PREVIEW
• Virtual device provisioned in your hypervisor
• Supports storage gateway, SMB, NFS, Azure blob, files
• Preview: September 2018
• Virtual network transfer appliance (VM), runs on your choice of hardware.
Data Box Edge PREVIEW
• Local Cache Capacity: ~12 TB
• Includes Data Box Gateway and Azure IoT Edge.
• Preview: September 2018
• Data Box Edge manages uploads to Azure and can pre-process data prior to upload.
Data Box
• Capacity: 100 TB
• Weight: ~50 lbs
• Secure, ruggedized appliance
• GA September 2018
• Data Box enables bulk migration to Azure when network isn’t an option.
Data Box Disk PREVIEW
• Capacity: 8TB ea.; 40TB/order
• Secure, ruggedized USB drives orderable in packs of 5 (up to 40TB).
• Currently in Preview
• Perfect for projects that require a smaller form factor, e.g., autonomous vehicles.
Offline Data Transfer: Order, Fill, Upload, Send, Return
Online Data Transfer:
• Network Data Transfer: Cloud to Edge, Edge to Cloud
• Edge Compute: Pre-processing, ML Inferencing
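A quick way to judge when the network "isn't an option" is to estimate how long an online copy would take. The sketch below is back-of-the-envelope arithmetic only. It assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a steady link, and the function name and `utilization` parameter are invented for the example.

```python
def transfer_days(size_tb: float, link_mbps: float, utilization: float = 1.0) -> float:
    """Estimated days to move `size_tb` terabytes over a link of `link_mbps`.

    `utilization` is the fraction of the link actually available (0 < u <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * utilization)
    return seconds / 86_400


full_data_box_over_1gbps = transfer_days(100, 1_000)   # roughly 9.3 days
full_data_box_over_100mbps = transfer_days(100, 100)   # roughly 93 days
```

Under these assumptions, a full 100 TB Data Box worth of data needs more than a week on a dedicated 1 Gbps link and about three months at 100 Mbps. That is the kind of gap that makes shipping a device attractive.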
30. Exactly what is a data lake?
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.
• Inexpensively store unlimited data
• Collect all data “just in case”
• Store data with no modeling – “Schema on read”
• Complements EDW
• Frees up expensive EDW resources
• Quick user access to data
• ETL Hadoop tools
• Easily scalable
• Place to move older data (archive)
• Place to backup data to
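"Schema on read" means the raw data is stored untouched and a schema is applied only when someone reads it. A minimal sketch of the idea, with made-up sensor records and a hypothetical `read_with_schema` helper:

```python
import json

# Raw lines exactly as they landed in the lake; nothing was validated on write.
RAW_LINES = [
    '{"device": "t-1", "temp_c": "21.5", "ts": "2018-09-01T10:00:00"}',
    '{"device": "t-2", "temp_c": "19.0", "ts": "2018-09-01T10:00:05", "fw": "1.2"}',
    "not json at all",
]

# The schema lives with the reader, not with the storage.
SCHEMA = {"device": str, "temp_c": float}


def read_with_schema(lines, schema):
    """Yield only the fields the reader asked for, cast to the requested types.

    Lines that cannot be parsed or do not fit the schema are skipped.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        try:
            yield {name: cast(record[name]) for name, cast in schema.items()}
        except (KeyError, TypeError, ValueError):
            continue


readings = list(read_with_schema(RAW_LINES, SCHEMA))
```

A different reader could apply a different schema to the same files, for example one that also pulls the `fw` field, without anything being rewritten in storage.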
32. Objectives
Plan the structure based on optimal data retrieval
Avoid a chaotic, unorganized data swamp
Common ways to organize the data:
• Data Retention Policy: Temporary data, Permanent data, Applicable period (ex: project lifetime), etc…
• Business Impact / Criticality: High (HBI), Medium (MBI), Low (LBI), etc…
• Confidential Classification: Public information, Internal use only, Supplier/partner confidential, Personally identifiable information (PII), Sensitive – financial, Sensitive – intellectual property, etc…
• Probability of Data Access: Recent/current data, Historical data, etc…
• Owner / Steward / SME
• Subject Area
• Security Boundaries: Department, Business unit, etc…
• Time Partitioning: Year/Month/Day/Hour/Minute
• Downstream App/Purpose
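As one illustration of combining these organizing schemes, the sketch below builds a folder path from a zone, a subject area, a source system, and a time partition. The zone and folder names (`raw`, `sales`, `pos`) and the `lake_path` helper are hypothetical. The slide does not prescribe any particular layout.

```python
import posixpath
from datetime import datetime


def lake_path(zone: str, subject_area: str, source: str,
              event_time: datetime, file_name: str) -> str:
    """Build a time-partitioned path: zone/subject/source/YYYY/MM/DD/HH/file."""
    return posixpath.join(
        zone,
        subject_area,
        source,
        f"{event_time:%Y}",
        f"{event_time:%m}",
        f"{event_time:%d}",
        f"{event_time:%H}",
        file_name,
    )


path = lake_path("raw", "sales", "pos", datetime(2018, 9, 1, 14, 30), "orders.json")
# path == "raw/sales/pos/2018/09/01/14/orders.json"
```

Putting the time partition last keeps everything for one subject area and source under a single folder, so access can be granted per folder, while readers that need only a date range can skip the other partitions.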
33. Data Warehouse
Serving, Security & Compliance
• Business people
• Low latency
• Complex joins
• Interactive ad-hoc query
• High number of users
• Additional security
• Large support for tools
• Dashboards
• Easily create reports (Self-service BI)
• Known questions
34. What is Azure Databricks?
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure
Best of Databricks Best of Microsoft
Designed in collaboration with the founders of Apache Spark
One-click set up; streamlined workflows
Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
35. Azure HDInsight
Hadoop and Spark as a Service on Azure
Fully-managed Hadoop and Spark for the cloud
100% Open Source Hortonworks data platform
Clusters up and running in minutes
Managed, monitored and supported by Microsoft with the industry’s best SLA
Familiar BI tools for analysis, or open source notebooks for interactive data science
63% lower TCO than deploying your own Hadoop on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
36. Hortonworks Data Platform (HDP) 3.0
Simply put, Hortonworks ties all the open source products together (20)
(under the covers of HDInsight 4.0 – public preview)
39. [Architecture diagram. Labels: Analytics, Custom apps, BI; SQL Server master instance; Compute pool (SQL Compute Nodes); Storage pool (Kubernetes pod with SQL Server, Spark, and HDFS Data Node); Data mart (SQL Data Nodes); IoT data; Directly read from HDFS; Persistent storage; Storage]
46. Q & A ?
James Serra, Big Data Evangelist
Email me at: jamesserra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted via the “Presentations” link on the top menu)
49. CLOUD DATA WAREHOUSE
INGEST / STORE / PREP & TRAIN / MODEL & SERVE
Sources: Logs (unstructured), Media (unstructured), Files (unstructured), Business/custom apps (structured)
Services: Azure Data Factory, Azure Data Lake Store Gen2, PolyBase, Azure SQL Data Warehouse, Azure Analysis Services, Power BI
Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs.
50. MODERN DATA WAREHOUSE
INGEST / STORE / PREP & TRAIN / MODEL & SERVE
Sources: Logs (unstructured), Media (unstructured), Files (unstructured), Business/custom apps (structured)
Services: Azure Data Factory, Azure Data Lake Store Gen2, Azure Databricks, PolyBase, Azure SQL Data Warehouse, Azure Analysis Services, Power BI
Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs.
51. ADVANCED ANALYTICS ON BIG DATA
INGEST / STORE / PREP & TRAIN / MODEL & SERVE
Sources: Business/custom apps (structured), Files (unstructured), Media (unstructured), Logs (unstructured)
Services: Azure Data Factory, Azure Data Lake Store Gen2, Azure Databricks, SparkR, PolyBase, Azure SQL Data Warehouse, Azure Analysis Services, Power BI, Cosmos DB
Real-time apps
Microsoft Azure also supports other Big Data services like Azure HDInsight, Azure Machine Learning to allow customers to tailor the above architecture to meet their unique needs.
52. REAL TIME ANALYTICS
INGEST / STORE / PREP & TRAIN / MODEL & SERVE
Sources: Sensors and IoT (unstructured), Files (unstructured), Media (unstructured), Logs (unstructured), Business/custom apps (structured)
Services: Apache Kafka for HDInsight, Azure Data Factory, Azure Data Lake Store Gen2, Azure Databricks, PolyBase, Azure SQL Data Warehouse, Azure Analysis Services, Power BI, Cosmos DB
Real-time apps
Microsoft Azure also supports other Big Data services like Azure IoT Hub, Azure Event Hubs, Azure Machine Learning to allow customers to tailor the above architecture to meet their unique needs.
53. DATA MART CONSOLIDATION
INGEST / STORE / MODEL & SERVE
Sources: RDBMS data marts, Hadoop
Services: Azure Data Factory, Azure Data Lake Store Gen2, PolyBase, Azure SQL Data Warehouse, Azure Analysis Services, Power BI
Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the architecture to meet their unique needs.
54. HUB & SPOKE ARCHITECTURE FOR BI
INGEST / STORE / PREP & TRAIN / MODEL & SERVE
Sources: Logs (unstructured), Media (unstructured), Files (unstructured), Business/custom apps (structured)
Services: Azure Data Factory, Azure Data Lake Store Gen2, Azure Databricks, PolyBase, Azure SQL Data Warehouse, Power BI
Data Marts: Multiple Azure SQL Database instances (SQL)
Data Cubes: Multiple Azure Analysis Services instances
Microsoft Azure supports other services like Azure HDInsight to allow customers a truly customized solution.
55. AUTO SCALING DATA WAREHOUSE
INGEST / STORE / PREP & TRAIN / MODEL & SERVE
Sources: Business/custom apps (structured), Logs (unstructured), Media (unstructured), Files (unstructured)
Services: Azure Data Factory, Azure Data Lake Store Gen2, Azure Databricks, PolyBase, Azure SQL Data Warehouse, Azure Functions (Auto-scaling), Azure Analysis Services, Power BI
Microsoft Azure supports other services like Azure HDInsight to allow customers a truly customized solution.
56. DATA WAREHOUSE MIGRATION
INGEST / STORE / PREP & TRAIN / MODEL & SERVE
Sources: Business/custom apps (structured), Business/custom apps, Logs (unstructured), Media (unstructured), Files (unstructured)
Services: Azure Data Factory, Azure Data Lake Store Gen2, Azure Databricks, PolyBase, Azure SQL Data Warehouse, Azure Analysis Services, Power BI
Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the architecture to meet their unique needs.
Editor's Notes
Data Platform Overview
Want to see a high-level overview of the products in the Microsoft data platform portfolio in Azure? I’ll cover products in the categories of OLTP, OLAP, data warehouse, storage, data transport, data prep, data lake, IaaS, PaaS, SMP/MPP, NoSQL, Hadoop, open source, reporting, machine learning, and AI. It’s a lot to digest but I’ll categorize the products and discuss their use cases to help you narrow down the best products for the solution you want to build.
http://www.ispot.tv/ad/7f64/directv-hang-gliding
Mention WWF, then WWE
What is Randy Savage’s nickname? Macho Man.
Who won the final match of the first WrestleMania? Hulk Hogan and Mr. T def. Roddy Piper and Paul Orndorff.
Who won the final match of WrestleMania III? Hulk Hogan (c) def. André the Giant for the WWF World Heavyweight Championship.
What was the name of the show starring "Rowdy" Roddy Piper? Piper’s Pit.
What was the nickname for announcer Gene Okerlund? Mean Gene.
Who did Hulk Hogan defeat in 1984 to become World Champion? The Iron Sheik.
Who did Andre the Giant eliminate to win the battle royal at WrestleMania 2? Bret "The Hitman" Hart.
What was the nickname of Jimmy Snuka? "Superfly".
One of the first things to understand in any discussion of Azure versus on-premises SQL Server databases is that you can use it all. Microsoft’s Data Platform leverages SQL Server technology and makes it available across physical on-premises machines, private cloud environments, third party hosted private cloud environments, and public cloud. This enables you to meet unique and diverse business needs through a combination of on-premises and cloud-hosted deployments, while using the same set of server products, development tools, and expertise across these environments.
As seen in the diagram, each offering can be characterized by the level of administration you have over the infrastructure (on the X axis), and by the degree of cost efficiency achieved by database level consolidation and automation (on the Y axis).
When designing an application, four basic options are available for hosting the SQL Server part of the application:
SQL Server on nonvirtualized physical machines
SQL Server in on-premises virtualized machines (private cloud)
SQL Server in Azure Virtual Machine (public cloud)
Azure SQL Database (public cloud)
Before the advent of the cloud: find someone to provision physical hardware, servers on the hardware, and applications on those physical servers. So many! Lots of steps, parts, counterproductive.
All good things begin with a story.
So before I talk to you about serverless, I wanted to discuss some history of how apps have been built through the years and how that has led us to the computing options we have today.
Buy your own hardware and software
Run it in your own data-centers
Deal with a multitude of questions
# of servers
Physical security
Hardware failure
Software patching
Backup
Essentially a lot of questions, not all of which might be directly related to running your business. For example, if you are running an insurance company, you want to spend your time growing and innovating in the insurance domain, not worrying about how to handle servers.
Along came the cloud, way faster than provisioning hardware. So much innovation and rapid application building. But still have questions about servers and scaling.
Along came the cloud, and provided some relief from some of those questions.
IaaS in the cloud allowed you to forget about managing all that physical hardware. Instead of hosting it yourself, you could now utilize servers hosted by someone else. While the worries around managing physical infrastructure are gone, there are a number of questions that still need to be thought about:
How many virtual machines do I need?
How many would I need in the future when my usage grows?
How do I handle updating the servers with the latest security patches?
How do I deploy my code and its various dependencies, such as application runtimes, to the server?
Who monitors my apps, and so on?
This means that you still haven’t reached that ideal state where you could focus purely on your business and its problems.
Trying to remove infrastructure. Don't care about patching the OS or managing instances of your VM. All you care about is your web app – still have chores / non-business questions.
Well, the cloud continued to help with that problem, and with the advent of PaaS there was even more simplification. PaaS allows you to just worry about your app, and let the cloud vendor take care of not only the physical hardware, but also the ongoing management of the operating environment, such as operating systems, virtual machines or containers, as well as the various application runtimes, whether it is the Java Runtime, .NET, Node.js, and so on.
You are not completely free from thinking about servers however, since you still have to worry about how to scale your app and accordingly specify the number and size of the various compute resources your apps would need at various times.
The server exists, but it's not yours. No questions about the server.
The latest in this series of evolution is Serverless – which further abstracts the underlying infrastructure from you, and allows you to focus only on a single question – how are you going to build your app to achieve the best results for your business, or whatever it is that you are trying to do with your apps.
Whether it is the code to delight your customers through a great experience, or the logic to communicate securely with a business partner, that is where most organizations want to spend their creative energies.
And serverless allows you to do just that, focus on your code and forget about the infrastructure.
This is why we believe it is the platform for the next generation of apps.
Now before we jump into explaining the benefits of serverless, let’s call out what it’s not. It’s not the total lack of servers; that would be a bit difficult. Instead, it’s a system where you don’t have to think about servers at all, so for you, they are invisible.
Our portfolio of products provides customers with the power to deploy the solution that suits their business needs.
Your choice of platform, whether on-premises, hybrid or private or public cloud, doesn’t limit you now or in the future. Migrating or expanding becomes an easy process and doesn’t require excessive downtime or introduce potential threats to your business success.
With Microsoft, you can seamlessly scale up to larger processing and storage capabilities, or scale out by adding additional servers in parallel arrangement.
T: SQL Server is a trusted market leader, and it’s the cornerstone of our data warehouse offering.
With digital transformation in mind, let’s focus on how SQL Database provides a low-cost, low-friction option for migrating your SQL Server data at scale to SQL Database, without having to re-architect your apps.
Introducing Azure SQL Database Managed Instance
SQL Database Managed Instance is an expansion of the existing SQL Database service designed to enable database lift-and-shift to a fully-managed PaaS, without re-designing the application. SQL Database Managed Instance provides high compatibility with the on-premises SQL Server programming model and out-of-box support for the large majority of SQL Server features and accompanying tools and services.
It’s important to note that Managed Instance isn’t a new service – it is a third deployment option within Azure SQL Database, sitting alongside single databases and elastic pools. As part of Azure SQL Database, Microsoft’s fully managed cloud database service, it inherits all its built-in PaaS features.
The Hyperscale service tier is a highly scalable performance tier that leverages the Azure architecture to scale out resources for SQL Database substantially beyond the limits available for the General Purpose and Business Critical service tiers.
Hyperscale provides the following additional capabilities:
Reliable and available
Because it’s part of SQL Database, it has multiple levels of redundancy and no single points of failure. It also provides a 99.99% availability SLA.
Scalable
Support for up to 100 TB of database size
Compute and storage scale independently
Nearly instantaneous database backups (based on file snapshots stored in Azure Blob storage) regardless of size with no IO impact on Compute
High performance
Fast database restores (based on file snapshots) in minutes rather than hours or days (not a size of data operation)
Higher overall performance due to higher log throughput and faster transaction commit times regardless of data volumes
Rapid Scale up - you can, in constant time, scale up your compute resources to accommodate heavy workloads when needed, and then scale the resources back down when not needed.
Rapid scale out - you can provision one or more read-only nodes for offloading your read workload and for use as hot-standbys
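To make the 99.99% availability SLA mentioned above more concrete, the short sketch below converts an availability percentage into a downtime budget. It assumes, purely for illustration, that availability is measured over a 365-day year; the function name and the measurement period are assumptions, and actual SLA terms may define the measurement window differently.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year (assumed window)


def allowed_downtime_minutes(sla_percent: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime, in minutes, that an availability SLA permits."""
    return period_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    # Compare a few common availability levels, including the 99.99% cited above.
    for sla in (99.9, 99.95, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Under that assumption, each extra "nine" cuts the downtime budget by roughly a factor of ten, which is why the difference between 99.95% and 99.99% matters for always-on applications.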
Azure Database for MySQL provides a fully managed, enterprise-ready community MySQL database as a service for app development and deployment. Being community MySQL allows you to easily lift and shift to the cloud and use the languages and frameworks of your choice. On top of that, you get built-in high availability and the capability to scale in seconds, helping you easily adjust to changes in customer demand. Additionally, you benefit from the unparalleled security and compliance, including Azure IP Advantage, as well as Azure’s industry-leading reach with more datacenters than any other cloud provider. All this with a flexible pricing model so you can choose resources for your workload with no hidden cost.
Languages and Frameworks of your choice
Being based on the community editions of MySQL, PostgreSQL and MariaDB means you can use the existing development languages, frameworks and tools you already use for your apps. In addition, Azure Database Services are deeply integrated with Azure Web Apps to provide a streamlined provisioning and management experience for common frameworks (like WordPress, Drupal, Joomla) and languages (PHP, Node.js, Ruby), giving you a best-in-class PaaS experience.
Scale in Seconds with built-in HA
Azure Database Services are built upon the same Service Fabric framework that has been powering SQL Database for years. Unlike a VM-based PaaS offering like AWS RDS, Azure Database Services do not have the overhead of a full VM stack (e.g., Linux OS + DB). Running in a secured container implementation (SQLPAL, a very lightweight SQL OS), Azure Database Services can provision a new server in seconds in the event that a primary server hangs or crashes, whereas in a traditional VM-based implementation the entire Linux (or Windows) OS stack has to bootstrap before the DB service loads. This means the entire failover can happen in as little as 30-45 seconds and, most importantly, WITHOUT the need for a replica. AWS RDS requires a Multi-AZ deployment in order to achieve a 99.95% SLA, which doubles your costs as you have 2 DB servers running at all times. With Azure Database Services, no replicas are needed, which means no additional cost or maintenance for the customer.
Additionally, this HA infrastructure enables Azure Database Services to scale performance on the fly. When a customer needs to scale up for workload spikes, by simply changing a slider in the portal, a new server is provisioned at a higher performance level and the previous server’s DNS name and storage are connected to the new instance. Scaling can take as little as 20 seconds, meaning customers can scale performance, up or down, with little or no downtime to the application.
Secure and Compliant
“Secure by default” is the standard for any Azure service, meaning elements such as SSL encryption between the database and application are turned on by default. Additionally, all data at-rest is encrypted by default in Azure storage using AES 256 bit encryption. And since Azure Database Services are using OSS database engines, the Azure IP Advantage means that customers do not have to worry about litigation using an OSS product in Azure. Microsoft provides indemnification for any OSS first party workload in Azure.
Industry-leading global reach
With more regions across the globe than any other public cloud provider, Azure offers the ability to have the most globally distributed MySQL, PostgreSQL or MariaDB-based application in the world.
SMP is one server where each CPU in the server shares the same memory, disk, and network controllers (scale-up). MPP means data is distributed among many independent servers running in parallel and is a shared-nothing architecture, where each server operates self-sufficiently and controls its own memory and disk (scale-out).
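The shared-nothing idea behind MPP can be illustrated with a small, single-process simulation. This is only a sketch under stated assumptions: the function names are hypothetical, `crc32` is an arbitrary stand-in for whatever hash a real MPP engine uses, and the "nodes" are plain Python lists rather than independent servers running in parallel.

```python
from collections import defaultdict
from zlib import crc32


def node_for_key(key: str, node_count: int) -> int:
    """Map a distribution key to a node index using a stable hash."""
    return crc32(key.encode("utf-8")) % node_count


def distribute(rows, key_field, node_count):
    """Place each row on exactly one node; no row is shared between nodes."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for_key(str(row[key_field]), node_count)].append(row)
    return nodes


def parallel_sum(nodes, value_field):
    """Each node aggregates only its own rows; a final step combines the partial results."""
    partials = [sum(row[value_field] for row in rows) for rows in nodes.values()]
    return sum(partials)


if __name__ == "__main__":
    sales = [{"customer": f"c{i}", "amount": i} for i in range(1000)]
    nodes = distribute(sales, "customer", node_count=4)
    for node_id in sorted(nodes):
        print(f"node {node_id} holds {len(nodes[node_id])} rows")
    print("total amount:", parallel_sum(nodes, "amount"))
```

In an SMP system the whole table would sit behind one set of shared memory and disk controllers; here each node only ever touches its own slice, which is what lets a scale-out system add capacity by adding nodes.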
Azure Cosmos DB offers the first globally distributed, multi-model database service for building planet scale apps. It’s been powering Microsoft’s internet-scale services for years, and now it’s ready to launch yours.
Only Azure Cosmos DB makes global distribution turn-key.
You can add Azure locations to your database anywhere across the world, at any time, with a single click. Cosmos DB will seamlessly replicate your data and make it highly available.
Cosmos DB allows you to scale throughput and storage elastically, and globally! You only pay for the throughput and storage you need – anywhere in the world, at any time.
Transfer options:
https://www.microsoft.com/en-us/garage/profiles/fast-data-transfer/
Azure data box
Also called bit bucket, staging area, landing zone or enterprise data hub (Cloudera)
http://www.jamesserra.com/archive/2014/05/hadoop-and-data-warehouses/
http://www.jamesserra.com/archive/2014/12/the-modern-data-warehouse/
http://adtmag.com/articles/2014/07/28/gartner-warns-on-data-lakes.aspx
http://intellyx.com/2015/01/30/make-sure-your-data-lake-is-both-just-in-case-and-just-in-time/
http://www.blue-granite.com/blog/bid/402596/Top-Five-Differences-between-Data-Lakes-and-Data-Warehouses
http://www.martinsights.com/?p=1088
http://data-informed.com/hadoop-vs-data-warehouse-comparing-apples-oranges/
http://www.martinsights.com/?p=1082
http://www.martinsights.com/?p=1094
http://www.martinsights.com/?p=1102
https://www.sqlchick.com/entries/2017/12/30/zones-in-a-data-lake
https://www.sqlchick.com/entries/2016/7/31/data-lake-use-cases-and-planning
Question: Do you see many companies building data lakes?
Raw: Raw events are stored for historical reference. Also called staging layer or landing area
Cleansed: Raw events are transformed (cleaned and mastered) into directly consumable data sets. The aim is to standardize the way files are stored in terms of encoding, format, data types and content (e.g. strings). Also called conformed layer
Application: Business logic is applied to the cleansed data to produce data ready to be consumed by applications (e.g. DW application, advanced analysis process, etc.). This layer goes by a lot of other names: workspace, trusted, gold, secure, production ready, governed, presentation
Sandbox: Optional layer to be used to “play” in. Also called exploration layer or data science workspace
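One way to picture the four zones above is as top-level folders in the data lake. The sketch below builds zone paths in Python; the folder names, the source/dataset hierarchy, and the year/month/day partitioning are all assumptions made for illustration, not a layout prescribed by these notes.

```python
from datetime import date
from pathlib import PurePosixPath

# Zone names taken from the layers described above; lowercase folder names are an assumption.
ZONES = ("raw", "cleansed", "application", "sandbox")


def zone_path(zone: str, source: str, dataset: str, day: date) -> PurePosixPath:
    """Build a hypothetical folder path: zone/source/dataset/yyyy/mm/dd."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone {zone!r}; expected one of {ZONES}")
    return PurePosixPath(zone, source, dataset, f"{day:%Y}", f"{day:%m}", f"{day:%d}")


if __name__ == "__main__":
    load_day = date(2019, 3, 17)
    # The same dataset moves through the zones as it is cleaned and business logic is applied.
    for zone in ("raw", "cleansed", "application"):
        print(zone_path(zone, "twitter", "tweets", load_day))
```

Keeping the zone as the first path segment makes it straightforward to apply different security and retention rules to each layer.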
Azure Databricks features –
Enhance your teams’ productivity
Get started quickly by launching your new Spark environment with one click.
Share your insights in powerful ways through rich integration with PowerBI.
Improve collaboration amongst your analytics team through a unified workspace.
Innovate faster with native integration with rest of Azure platform.
Build on the most compliant and trusted cloud
Simplify security and identity control with built-in integration with Active Directory.
Regulate access with fine-grained user permissions to Azure Databricks’ notebooks, clusters, jobs and data.
Build with confidence on the trusted cloud backed by unmatched support, compliance and SLAs.
Scale without limits
Operate at massive scale without limits globally.
Accelerate data processing with the fastest Spark engine.
Key points: Summarize key benefits for Azure Analysis Services
Azure Analysis Services helps you transform complex data from different data sources into a BI semantic model, so users in your organization can easily gain insights by connecting to the data models using tools like Excel, Power BI, and others to create reports and perform ad hoc data analysis
Talk Track
Transform complex data into rich BI semantic models: Azure Analysis Services helps you transform complex data into a single, business-user-friendly data model, making it easy for business users to understand and analyze data across different data sources.
Gain instant insights with in-memory cache using your preferred visualization tools: Not only can business users get insights from data easily using their preferred data visualization tool, whether it is Power BI, Excel or other major data visualization tools, but with the in-memory cache capabilities of Azure Analysis Services, users can gain insights over billions of rows of data at the speed of thought.
Proven Technology: Azure Analysis Services is based on the proven analytics engine in SQL Server 2016 Analysis Services, which has helped organizations turn complex data into a trusted, single source of truth for years. This means that BI professionals who are familiar with SQL Server Analysis Services tabular models can get started quickly and do not need to learn new tools or skills.
Analytics engine as-a-service (provision fast, scale faster): The same proven enterprise grade BI platform is now available as a fully managed service in Azure. With the power of the trusted Microsoft Cloud, you do not need to manage infrastructure on-premises and can benefit from the scalability of the cloud. Additionally you can use Azure Resource Manager to create and deploy an Azure Analysis Services instance within seconds, and use backup restore to quickly move your existing models to Azure Analysis Services and take advantage of the scale, flexibility and management benefits of the cloud. Scale up, scale down, or pause the service and pay only for what you use.
Azure Analysis Services is built for Hybrid BI - Organizations store data in the cloud and on-premises. Azure Analysis Services is built for hybrid data. Data can be accessed in the cloud, on-premises, or a combination of both, enabling a hybrid solution. So you do not have to move on-premises data to the cloud.
To summarize, Azure Analysis Services is simple to use: it is easy to get started, you can use your existing skills to create BI semantic models, and you can use your favorite data visualization tools to analyze your data.
Increase analytics and apps performance with scale out data marts
3/17/2019
Microsoft Azure supports other services like Azure HDInsight, Azure Data Lake, Azure IoT Hub, and Azure Event Hubs in various layers of the architecture above to allow customers a truly customized solution.
1) Copy source data into the Azure Data Lake Store (Twitter data example)
2) Massage/filter the data using Hadoop (or skip using Hadoop and use stored procedures in SQL DW/DB to massage data after step #5)
3) Pass data into Azure ML to build models using Hive query (or pass in directly from Azure Data Lake Store)
4) Azure ML feeds prediction results into the data warehouse
5) Non-relational data in Azure Data Lake Store copied to data warehouse in relational format (optionally use PolyBase with external tables to avoid copying data)
6) Power BI pulls data from data warehouse to build dashboards and reports
7) Azure Data Catalog captures metadata from Azure Data Lake Store and SQL DW/DB
8) Power BI and Excel can pull data from the Azure Data Lake Store via HDInsight
9) To support high concurrency if using SQL DW, or for easier end-user data layer, create an SSAS cube