This document discusses how SQL can be used in Lucidworks Fusion for various purposes like aggregating signals to compute relevance scores, ingesting and transforming data from various sources using Spark SQL, enabling self-service analytics through tools like Tableau and PowerBI, and running experiments to compare variants. It provides examples of using SQL for tasks like sessionization with window functions, joining multiple data sources, hiding complex logic in user-defined functions, and powering recommendations. The document recommends SQL in Fusion for tasks like analytics, data ingestion, machine learning, and experimentation.
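To make the sessionization idea concrete, here is a minimal sketch of splitting signals into sessions with SQL window functions. It is not Fusion's own implementation: it runs against an in-memory SQLite database rather than Spark SQL, and the table name `signals`, the columns `user_id` and `ts`, and the 30-minute inactivity gap are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical click signals: (user_id, event time in seconds).
SIGNALS = [
    ("u1", 0), ("u1", 60), ("u1", 4000), ("u1", 4100),
    ("u2", 10), ("u2", 5000),
]

SESSION_GAP_SECONDS = 1800  # assumed 30-minute inactivity threshold

# Step 1 (LAG): flag a row as the start of a new session when the gap to the
# previous event of the same user exceeds the threshold.
# Step 2 (running SUM): accumulate the flags to get a per-user session index.
SESSIONIZE_SQL = """
WITH flagged AS (
    SELECT user_id, ts,
           CASE
               WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) > :gap
               THEN 1 ELSE 0
           END AS new_session
    FROM signals
)
SELECT user_id, ts,
       SUM(new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_idx
FROM flagged
ORDER BY user_id, ts
"""


def sessionize(rows, gap=SESSION_GAP_SECONDS):
    """Return (user_id, ts, session_idx) tuples for the given signal rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE signals (user_id TEXT, ts INTEGER)")
    conn.executemany("INSERT INTO signals VALUES (?, ?)", rows)
    result = conn.execute(SESSIONIZE_SQL, {"gap": gap}).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in sessionize(SIGNALS):
        print(row)
```

The same `LAG` plus running `SUM` pattern should carry over to Spark SQL, since both are standard SQL window functions; only the connection code is specific to SQLite here.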
Modern DW Architecture - The document discusses modern data warehouse architectures using Azure cloud services like Azure Data Lake, Azure Databricks, and Azure Synapse. It covers storage options like ADLS Gen 1 and Gen 2 and data processing tools like Databricks and Synapse. It highlights how to optimize architectures for cost and performance using features like auto-scaling, shutdown, and lifecycle management policies. Finally, it provides a demo of a sample end-to-end data pipeline.
Power BI can be used either through Power BI Desktop or Power BI Embedded. Power BI Desktop is a free desktop application that allows connecting to various data sources and creating visual analytics. Power BI Embedded allows integrating Power BI visualizations into web and mobile applications. Reports in Power BI combine visuals and filters to analyze data, while dashboards combine multiple reports. Filters and slicers allow filtering the data in visuals. Authentication is handled through Azure Active Directory, while access is controlled using various token types.
The document discusses Azure Data Factory V2 data flows. It will provide an introduction to Azure Data Factory, discuss data flows, and have attendees build a simple data flow to demonstrate how they work. The speaker will introduce Azure Data Factory and data flows, explain concepts like pipelines, linked services, and data flows, and guide a hands-on demo where attendees build a data flow to join customer data to postal district data to add matching postal towns.
Azure Data Factory is one of the newer data services in Microsoft Azure and is part of the Cortana Analytics Suite, providing data orchestration and movement capabilities. This session will describe the key components of Azure Data Factory and take a look at how you create data transformation and movement activities using the online tooling. Additionally, the new tooling that shipped with the recently updated Azure SDK 2.8 will be shown in order to provide a quickstart for your cloud ETL projects.
When you model data you are making two decisions:
* The location where data will be stored
* How the data will be organized for ease of use
This slide deck shows the techniques and technologies needed to migrate a large, transactional SQL Server database to Azure, Azure SQL Database, and Azure SQL Database Managed Instance.
Microsoft Graph provides REST APIs and webhooks to access and connect Microsoft 365 and other organizational data at scale. It enables building custom applications and workflows that extend Microsoft 365 experiences. Data access through Microsoft Graph is designed with data privacy, security, and governance in mind, allowing administrators to control access to organizational data.
A quick introduction to Azure DocumentDB - a JSON document store that's Platform as a Service on Microsoft Azure
This document discusses challenges with distributed data analysis and Treasure Data's approach to addressing them. Some key points:
- Distributed data analysis faces challenges around network bandwidth, throughput, data consistency, and reliability.
- Treasure Data uses a columnar storage format based on MessagePack to more efficiently save bandwidth and storage space.
- They implement time index pushdown to enable reading only relevant data within a time range, reducing network usage.
- Automatic optimization of partitioning layout and repartitioning aims to balance partition file size, time ranges, and keys to maximize performance and throughput while minimizing memory pressure.
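The time index pushdown point can be illustrated with a small sketch. This is not Treasure Data's code: the `Partition` class, its `min_time`/`max_time` metadata fields, and the `scan_time_range` helper are hypothetical names, and the sketch only assumes that each partition file records the time range it covers so that non-overlapping files can be skipped without being read.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Partition:
    """A stored partition file with the time range it covers (assumed metadata)."""
    min_time: int
    max_time: int
    rows: List[Tuple[int, str]]  # (timestamp, payload)


def scan_time_range(partitions, start, end):
    """Return rows with start <= ts < end, plus how many partitions were read."""
    rows = []
    partitions_read = 0
    for p in partitions:
        # Pushdown: decide from metadata alone whether the file can be skipped.
        if p.max_time < start or p.min_time >= end:
            continue
        partitions_read += 1
        # Only overlapping partitions are actually read and filtered row by row.
        rows.extend(r for r in p.rows if start <= r[0] < end)
    return rows, partitions_read


# Hypothetical data: three partitions covering consecutive time ranges.
PARTITIONS = [
    Partition(0, 99, [(5, "a"), (50, "b")]),
    Partition(100, 199, [(120, "c"), (180, "d")]),
    Partition(200, 299, [(250, "e")]),
]

if __name__ == "__main__":
    print(scan_time_range(PARTITIONS, 110, 200))
```

In this sketch a query for the range 110 to 200 touches one of the three partitions; in a distributed store, the skipped partitions are the ones that never cross the network.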
The document discusses a company's migration from their in-house computation engine to Apache Spark. It describes five key issues encountered during the migration process: 1) difficulty adapting to Spark's low-level RDD API, 2) limitations of DataSource predicates, 3) incomplete Spark SQL functionality, 4) performance issues with round trips between Spark and other systems, and 5) OutOfMemory errors due to large result sizes. Lessons learned include being aware of new Spark features and data formats, and designing architectures and data structures to minimize data movement between systems.
In this session we'll look at the cloud choices available in Azure for SQL Server. Whether it's PaaS, IaaS or Managed Instance we'll look into the features provided, the major differences and the Pros and Cons of each solution and how to choose the best option available.
An introduction to using R in Power BI via the various touch points such as: R script data sources, R transformations, custom R visuals, and the community gallery of R visualizations
Marco Pozzan, Power BI consultant & trainer. Real-time usage scenarios in Power BI. This session introduces the theory behind the real-time dashboarding offered by Power BI. It then focuses on a practical case: a real-time dataset in hybrid mode used to build a control dashboard that supports write-back and lets the user perform what-if analysis.
This session covers building a modern data warehouse by migrating from a traditional DW platform to the cloud, using Amazon Redshift and the cloud ETL tool Matillion to provide self-service BI for the business audience. It covers the technical migration path from a DW with PL/SQL ETL to Amazon Redshift via Matillion ETL, with a detailed comparison of modern ETL tools. The talk also works backward through the process, i.e. starting from the business audience and the needs that drive changes in the old DW. Finally, it covers the idea of self-service BI, and the author shares a step-by-step plan for building an efficient self-service environment using the modern BI platform Tableau.
This document discusses machine learning data lineage using Delta Lake. It introduces Richard Zang and Denny Lee, then outlines the machine learning lifecycle and challenges of model management. It describes how MLflow Model Registry can track model versions, stages, and metadata. It also discusses how Delta Lake allows data to be processed continuously and incrementally in a data lake. Delta Lake uses a transaction log and file format to provide ACID transactions and allow optimistic concurrency control for conflicts.
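The interplay between a transaction log and optimistic concurrency control can be sketched as follows. This is a deliberately simplified model, not Delta Lake's actual protocol or API: `TransactionLog`, `try_commit`, and `commit_with_retry` are hypothetical names, and the sketch assumes that any commit based on a stale version is rejected, whereas a real system may allow concurrent commits that do not logically conflict.

```python
class CommitConflict(Exception):
    """Raised when a writer's snapshot is older than the latest committed version."""


class TransactionLog:
    """An ordered list of committed versions; list index = version number."""

    def __init__(self):
        self._entries = []

    @property
    def latest_version(self):
        return len(self._entries) - 1  # -1 means nothing committed yet

    def try_commit(self, read_version, actions):
        # Optimistic check: succeed only if nobody committed since we read.
        if read_version != self.latest_version:
            raise CommitConflict(
                f"read version {read_version}, log is at {self.latest_version}"
            )
        self._entries.append(list(actions))
        return self.latest_version


def commit_with_retry(log, make_actions, max_attempts=3):
    """Re-read the latest version and retry when a commit conflicts."""
    for _ in range(max_attempts):
        read_version = log.latest_version
        actions = make_actions(read_version)
        try:
            return log.try_commit(read_version, actions)
        except CommitConflict:
            continue  # another writer won the race; rebuild on the new snapshot
    raise CommitConflict("gave up after repeated conflicts")


if __name__ == "__main__":
    log = TransactionLog()
    print(log.try_commit(-1, ["add part-0.parquet"]))
    print(commit_with_retry(log, lambda v: [f"add file based on v{v}"]))
```

Writers do not take locks up front; they do their work against a snapshot and only discover a conflict at commit time, which is what makes the scheme optimistic.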
- The document discusses real-time options in Power BI including push, streaming, and PubNub data. It describes the characteristics of each option including refresh rates, visual capabilities, and advantages/limitations.
- A case study is presented on creating a dashboard to monitor warehouse workload in real-time using a hybrid dataset with data pushed from SQL Server and SAP HANA via REST APIs into Power BI. PowerApps is also suggested for creating mobile apps connected to the real-time data.
- Additional resources are provided on real-time streaming documentation, tutorials for IoT dashboards and connecting Azure Stream Analytics, and using PubNub streams in Power BI.
1) The document discusses integrating MongoDB, a NoSQL database, with Teradata, a data warehouse platform.
2) It provides 5 key things to know about the integration, including how Teradata can pull directly from sharded MongoDB clusters and push data back.
3) Use cases are presented where the operational data in MongoDB can provide context and analytics capabilities for applications, and the data warehouse can enrich the operational data.
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze the data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.