Shirshanka Das and Yael Garten describe how LinkedIn redesigned its data analytics ecosystem in the face of a significant product rewrite, covering the infrastructure changes that enable LinkedIn to roll out future product innovations with minimal downstream impact. Shirshanka and Yael explore the motivations and the building blocks for this reimagined data analytics ecosystem, the technical details of LinkedIn's new client-side tracking infrastructure, its unified reporting platform, and its data virtualization layer on top of Hadoop, and share lessons learned from data producers and consumers participating in this governance model. Along the way, they offer anecdotal evidence from the rollout that validated some of their decisions and is also shaping the future roadmap of these efforts.
In the session, we discussed the end-to-end working of Apache Airflow, focusing on the why, what, and how. It covers DAG creation and implementation, the Airflow architecture, and its pros and cons. It also walks through how a DAG is defined in a Python script to schedule a job and the steps required to build it, finishing with a working demo.
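To make the DAG-definition step concrete, here is a minimal sketch of an Airflow DAG written in Python; the DAG id, schedule, tasks, and callables are hypothetical examples, not taken from the session.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder extract step for illustration.
    print("extracting data...")


with DAG(
    dag_id="example_etl",                      # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",                # run once per day
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = BashOperator(task_id="load", bash_command="echo 'loading...'")

    # Task dependency: extract must finish before load starts.
    extract_task >> load_task
```

Placing a file like this in the scheduler's DAGs folder is what registers the job for scheduling.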
Cost-based Query Optimization in Apache Phoenix using Apache Calcite - Julian Hyde
This document summarizes a presentation on using Apache Calcite for cost-based query optimization in Apache Phoenix. Key points include:
- Phoenix is adding Calcite's query planning capabilities to improve performance and SQL compliance over its existing query optimizer.
- Calcite models queries as relational algebra expressions and uses rules, statistics, and a cost model to choose the most efficient execution plan.
- Examples show how Calcite rules like filter pushdown and exploiting sortedness can generate better plans than Phoenix's existing optimizer.
- Materialized views and interoperability with other Calcite data sources like Apache Drill are areas for future improvement beyond the initial Phoenix+Calcite integration.
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud - Noritaka Sekiyama
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
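As one illustration of the kind of tuning discussed, here is a hedged PySpark sketch that sets a few well-known S3A client properties; the values and bucket name are hypothetical, and the right settings depend on the Hadoop/S3A version and the workload.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-tuning-example")
    # Hypothetical values for illustration only.
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")        # more parallel S3 connections
    .config("spark.hadoop.fs.s3a.multipart.size", "134217728")      # 128 MB multipart upload parts
    .config("spark.hadoop.fs.s3a.fast.upload", "true")              # buffer uploads for throughput
    .config("spark.hadoop.fs.s3a.committer.name", "directory")      # S3A committer avoids rename-based commits
    .getOrCreate()
)

# Hypothetical bucket and prefix.
df = spark.read.parquet("s3a://my-bucket/events/")
df.printSchema()
```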
High Performance Data Lake with Apache Hudi and Alluxio at T3Go - Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
Trevor Zhang & Vino Yang (T3Go)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
This document provides an overview of Apache Flink internals. It begins with an introduction and recap of Flink programming concepts. It then discusses how Flink programs are compiled into execution plans and executed in a pipelined fashion, as opposed to being executed eagerly like regular code. The document outlines Flink's architecture including the optimizer, runtime environment, and data storage integrations. It also covers iterative processing and how Flink handles iterations both by unrolling loops and with native iterative datasets.
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C... - Khai Tran
This document discusses LinkedIn's transition from an offline metrics platform to a near real-time "nearline" architecture using Apache Calcite and Apache Samza. It overviews LinkedIn's metrics platform and needs, and then details how the new nearline architecture works by translating Pig jobs into optimized Samza jobs using Calcite's relational algebra and query planning. An example production use case for analyzing storylines on the LinkedIn platform is also presented. The nearline architecture allows metrics to be computed with latencies of 5-30 minutes rather than 3-6 hours previously.
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath... - Databricks
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood which makes all the magic possible. In this talk, I am going to dive deeper into how stateful processing works in Structured Streaming.
In particular, I’m going to discuss the following.
• Different stateful operations in Structured Streaming
• How state data is stored in a distributed, fault-tolerant manner using State Stores
• How you can write custom State Stores for saving state to external storage systems.
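For the implicitly stateful case (streaming aggregations), a minimal PySpark sketch looks like the following; the socket source, window size, and watermark are hypothetical choices for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stateful-agg-sketch").getOrCreate()

# Hypothetical socket source; any streaming source works the same way.
events = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
    .select(F.current_timestamp().alias("event_time"), F.col("value").alias("word"))
)

# A windowed count is an implicitly stateful operation: Spark keeps the running
# counts per window in the state store between micro-batches.
counts = (
    events
    .withWatermark("event_time", "10 minutes")          # bounds how long state is retained
    .groupBy(F.window("event_time", "5 minutes"), "word")
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```

The watermark is what lets Structured Streaming eventually drop old window state from the state store instead of keeping it forever.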
Evening out the uneven: dealing with skew in Flink - Flink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by Jun Qin & Karl Friedrich
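One common mitigation for key skew discussed in this space is two-stage aggregation with key salting: aggregate per salted key first, then merge per original key. The sketch below is plain Python, not Flink API, and only illustrates the key manipulation; the salt count is a hypothetical tuning knob.

```python
import random

NUM_SALTS = 8   # hypothetical fan-out factor for hot keys


def salt_key(key: str) -> str:
    """Spread a hot key across several sub-keys so the work is shared by more parallel tasks."""
    return f"{key}#{random.randrange(NUM_SALTS)}"


def unsalt_key(salted: str) -> str:
    """Recover the original key when merging the partial aggregates in the second stage."""
    return salted.rsplit("#", 1)[0]


# Stage 1: aggregate by salt_key(key) in parallel.
# Stage 2: re-key the partial results with unsalt_key() and combine them per original key.
```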
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features) - Kai Wähner
High level introduction to Confluent REST Proxy and Schema Registry (leveraging Apache Avro under the hood), two components of the Apache Kafka open source ecosystem. See the concepts, architecture and features.
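For a feel of the Schema Registry side, here is a sketch that registers an Avro value schema through the Schema Registry REST API from Python; the registry URL, subject name, and schema fields are hypothetical.

```python
import json

import requests

SCHEMA_REGISTRY = "http://localhost:8081"   # hypothetical Schema Registry address

value_schema = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "userId", "type": "string"},
        {"name": "page", "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
}

# Register the Avro schema under the subject used for the topic's record values.
resp = requests.post(
    f"{SCHEMA_REGISTRY}/subjects/pageviews-value/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(value_schema)}),
)
print(resp.json())   # e.g. {"id": 1} on success
```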
RedisConf17- Using Redis at scale @ Twitter - Redis Labs
The document discusses Nighthawk, Twitter's distributed caching system which uses Redis. It provides caching services at a massive scale of over 10 million queries per second and 10 terabytes of data across 3000 Redis nodes. The key aspects of Nighthawk's architecture that allow it to scale are its use of a client-oblivious proxy layer and cluster manager that can independently scale and rebalance partitions across Redis nodes. It also employs replication between data centers to provide high availability even in the event of node failures. Some challenges discussed are handling "hot keys" that get an unusually high volume of requests and more efficiently warming up replicas when nodes fail.
A brief introduction to Apache Kafka and its usage as a platform for streaming data. It introduces some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
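As an illustration of the Kafka Connect side (Kafka Streams itself is a Java library), here is a sketch that submits a simple file source connector to a Connect worker's REST API; the worker URL, file path, and topic are hypothetical.

```python
import requests

CONNECT_URL = "http://localhost:8083/connectors"   # hypothetical Connect worker address

connector = {
    "name": "file-source-example",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/input.txt",   # hypothetical input file to tail
        "topic": "lines",           # Kafka topic that receives each line as a record
    },
}

# Connect exposes a REST API for creating and managing connectors.
resp = requests.post(CONNECT_URL, json=connector)
print(resp.status_code, resp.json())
```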
This talk will break down merge in Delta Lake, what is actually happening under the hood, and then explain how you can optimize a merge. Some code snippets and sample configs will also be shared.
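A minimal PySpark sketch of a Delta Lake merge of the kind being discussed; the table paths and join key are hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-merge-sketch").getOrCreate()

updates = spark.read.parquet("/tmp/updates")              # hypothetical staging data
target = DeltaTable.forPath(spark, "/tmp/delta/events")   # hypothetical Delta table

(
    target.alias("t")
    .merge(updates.alias("u"), "t.event_id = u.event_id")  # the join condition drives file pruning
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

One common optimization is tightening the join condition, for example by adding partition predicates, so that fewer underlying files have to be read and rewritten.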
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by Robert Metzger
Exactly-Once Financial Data Processing at Scale with Flink and Pinot - Flink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end to end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power from Flink, Kafka, and Pinot together. The pipeline provides exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillion level rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by Xiang Zhang, Pratyush Sharma & Xiaoman Dong
Iceberg provides capabilities beyond traditional partitioning of data in Spark/Hive. It allows updating or deleting individual rows without rewriting partitions through mutable row operations (MOR). It also supports ACID transactions through versions, faster queries through statistics and sorting, and flexible schema changes. Iceberg manages metadata that traditional formats like Parquet do not, enabling these new capabilities. It is useful for workloads that require updating or filtering data at a granular record level, managing data history through versions, or frequent schema changes.
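A short sketch of the row-level operations and time travel described above, assuming a Spark session configured with the Iceberg runtime, its SQL extensions, and a catalog named demo; the table and snapshot id are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes Iceberg's Spark runtime and SQL extensions are on the classpath and
# a catalog named "demo" is configured.
spark = SparkSession.builder.appName("iceberg-row-ops-sketch").getOrCreate()

# Row-level operations that do not require rewriting whole partitions:
spark.sql("DELETE FROM demo.db.events WHERE user_id = 'u123'")
spark.sql("UPDATE demo.db.events SET status = 'expired' WHERE event_time < DATE '2020-01-01'")

# Time travel against an earlier table snapshot (hypothetical snapshot id):
old = (
    spark.read
    .option("snapshot-id", 5591047573907181056)
    .format("iceberg")
    .load("demo.db.events")
)
old.show()
```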
Building a Streaming Microservice Architecture: with Apache Spark Structured ... - Databricks
As we continue to push the boundaries of what is possible with respect to pipeline throughput and data serving tiers, new methodologies and techniques continue to emerge to handle larger and larger workloads.
In Spark SQL the physical plan provides the fundamental information about the execution of the query. The objective of this talk is to convey understanding and familiarity of query plans in Spark SQL, and use that knowledge to achieve better performance of Apache Spark queries. We will walk you through the most common operators you might find in the query plan and explain some relevant information that can be useful in order to understand some details about the execution. If you understand the query plan, you can look for the weak spot and try to rewrite the query to achieve a more optimal plan that leads to more efficient execution.
The main content of this talk is based on Spark source code but it will reflect some real-life queries that we run while processing data. We will show some examples of query plans and explain how to interpret them and what information can be taken from them. We will also describe what is happening under the hood when the plan is generated focusing mainly on the phase of physical planning. In general, in this talk we want to share what we have learned from both Spark source code and real-life queries that we run in our daily data processing.
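A small PySpark sketch of how to surface the physical plan for inspection; the DataFrames here are synthetic examples, not the real-life queries from the talk.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("explain-sketch").getOrCreate()

orders = spark.range(1_000_000).withColumn("customer_id", F.col("id") % 1000)
customers = spark.range(1000).withColumnRenamed("id", "customer_id")

joined = (
    orders.join(customers, "customer_id")
    .filter(F.col("id") > 500_000)
    .groupBy("customer_id")
    .count()
)

# "formatted" prints the physical plan with per-operator details (scans, exchanges, joins).
# Look for broadcast vs. sort-merge joins and Exchange (shuffle) nodes as tuning targets.
joined.explain(mode="formatted")
```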
Beautiful Monitoring With Grafana and InfluxDB - leesjensen
Query your data streams with the time series database InfluxDB and then visualize the results with stunning Grafana dashboards. Quick and easy to set up. Fully scalable to millions of metrics per second.
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012 - Shirshanka Das
LinkedIn built a change data capture pipeline called Databus to extract data changes from databases and publish them to downstream applications in a consistent and timely manner. Databus uses a pull model with logical clocks to simplify distributing changes across a network of relays and consumers. Key aspects of Databus include isolating sources from consumers, managing metadata and schemas, and partitioning streams of data changes across consumer groups.
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha... - Shirshanka Das
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #datasciencehappiness.
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu... - Shirshanka Das
Gobblin is a data integration framework that can handle both batch and streaming data. It provides a logical pipeline specification that is independent of the underlying execution model. Gobblin pipelines can run in both batch and streaming modes using the same system. This allows for cost-efficient batch processing as well as low-latency streaming. The document discusses Gobblin's pipeline specification, deployment options, and roadmap including adding more streaming capabilities and improving security.
Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat... - Shirshanka Das
Just when you think you have your Kafka and Hadoop clusters set up and humming and you’re well on your path to democratizing data, you realize that you now have a very different set of challenges to solve. You want to provide unfettered access to data to your data scientists, but at the same time, you need to preserve the privacy of your members, who have entrusted you with their data.
Shirshanka Das and Tushar Shanbhag outline the path LinkedIn has taken to protect member privacy in its scalable distributed data ecosystem built around Kafka and Hadoop.
They also discuss three foundational building blocks for scalable data management that can meet data compliance regulations: a centralized metadata system, a standardized data lifecycle management platform, and a unified data access layer. Some of these systems are open source and can be of use to companies that are in a similar situation. Along the way, they also look to the future—specifically, to the General Data Protection Regulation, which comes into effect in 2018—and outline LinkedIn’s plans for addressing those requirements.
But technology is just part of the solution. Shirshanka and Tushar also share the culture and process change they’ve seen happen at the company and the lessons they’ve learned about sustainable process and governance.
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at... - Yael Garten
2017 StrataHadoop SJC conference talk. https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56047
Description:
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #DataScienceHappiness.
Backstage 2019 - The Atlassian Journey with Amplitude - Itzik Feldman - Amplitude
This document outlines Atlassian's journey with the analytics platform Amplitude over several years:
- In 2012, Atlassian had 400+ staff but no product analysts or focus on data; initial attempts to use data lacked traction
- Mobile teams adopted Amplitude in 2014 but relied heavily on limited data teams
- In 2016, Amplitude was adopted company-wide to standardize on a unified platform
- Success factors included applying governance, helping users discover data faster, and growing Amplitude efficiently in the company through evangelism and iterative improvements.
The document discusses digital twins, which are dynamic digital representations of physical assets that allow companies to understand, predict, and optimize asset performance. Digital twins use asset data like sensor readings, events, and models to generate insights about an asset's current context, key performance indicators (KPIs), and future predictions. A digital twin platform is needed to manage digital twins at scale across edge, network and cloud environments and expose twin data and insights via APIs. This allows industrial applications to leverage digital twins without needing direct access to the underlying data and models.
Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning - Kai Wähner
Comparison of Data Preparation vs. Data Wrangling Programming Languages, Frameworks and Tools in Machine Learning / Deep Learning Projects.
A key task to create appropriate analytic models in machine learning or deep learning is the integration and preparation of data sets from various sources like files, databases, big data storages, sensors or social networks. This step can take up to 80% of the whole project.
This session compares different alternative techniques to prepare data, including extract-transform-load (ETL) batch processing (like Talend, Pentaho), streaming analytics ingestion (like Apache Storm, Flink, Apex, TIBCO StreamBase, IBM Streams, Software AG Apama), and data wrangling (DataWrangler, Trifacta) within visual analytics. Various options and their trade-offs are shown in live demos using different advanced analytics technologies and open source frameworks such as R, Python, Apache Hadoop, Spark, KNIME or RapidMiner. The session also discusses how this is related to visual analytics tools (like TIBCO Spotfire), and best practices for how the data scientist and business user should work together to build good analytic models.
Key takeaways for the audience:
- Learn various options for preparing data sets to build analytic models
- Understand the pros and cons and the targeted persona for each option
- See different technologies and open source frameworks for data preparation
- Understand the relation to visual analytics and streaming analytics, and how these concepts are actually leveraged to build the analytic model after data preparation
Video Recording / Screencast of this Slide Deck: https://youtu.be/2MR5UynQocs
Agile Testing Days 2017: Introducing Agile BI Sustainably - Exercises - Raphael Branger
"We now do Agile BI too" is often heard in today's BI community. But can you really "create" agility in Business Intelligence projects? This presentation shows that Agile BI doesn't necessarily start with the introduction of an iterative project approach. An organisation is well advised to first establish the necessary foundations in terms of organisation, business, and technology in order to become capable of an iterative, incremental project approach in the BI domain.
In this session you will learn which building blocks you need to consider and what a meaningful sequence for these building blocks is. Selected aspects such as test automation, BI-specific design patterns, and the Disciplined Agile framework will be explained in more practical detail.
Rapid growth is one of the most significant aspects of the economy in recent years. Startups, scaleups, and unicorns are all companies that grow dramatically year over year in business volume and headcount, scaling their IT systems along the way.
Companies that predate the digital-native era are looking at these players as potential (or actual) competitors and are organizing themselves to scale. But it is one thing to have a business structure born to scale, and quite another to scale a business that has been running for at least 20-30 years. Company culture, IT systems, and technologies have accumulated in layers over time and can become an obstacle in this race upward.
In this talk we will look at good practices, techniques, and models for scaling enterprise organizations both technically (and technologically) and organizationally. We will do so through concrete examples of real cases and by offering suggestions on how to overcome the difficulties encountered along the way.
We will talk about Cloud Native, migration from monoliths to microservices, API as a Product, enterprise organizations run in an open source style, and company culture.
Modern Thinking: How Big Data and Cognitive are changing marketing strategy
By: Ismael Yuste, Strategic Cloud Engineer, Google Cloud
Presentation: Introduction to Google's Big Data solutions
Accelerating Data Lakes and Streams with Real-time Analytics - Arcadia Data
As organizations modernize their data and analytics platforms, the data lake concept has gained momentum as a shared enterprise resource for supporting insights across multiple lines of business. The perception is that data lakes are vast, slow-moving bodies of data, but innovations like Apache Kafka for streaming-first architectures put real-time data flows at the forefront. Combining real-time alerts and fast-moving data with rich historical analysis lets you respond quickly to changing business conditions with powerful data lake analytics to make smarter decisions.
Join this complimentary webinar with industry experts from 451 Research and Arcadia Data who will discuss:
- Business requirements for combining real-time streaming and ad hoc visual analytics.
- Innovations in real-time analytics using tools like Confluent’s KSQL (see the sketch after this list).
- Machine-assisted visualization to guide business analysts to faster insights.
- Elevating user concurrency and analytic performance on data lakes.
- Applications in cybersecurity, regulatory compliance, and predictive maintenance on manufacturing equipment that all benefit from streaming visualizations.
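As a hypothetical illustration of the KSQL-style innovations mentioned in the list above, this sketch submits a stream definition to a KSQL server's REST endpoint from Python; the server URL, topic, and columns are made up.

```python
import requests

KSQL_URL = "http://localhost:8088/ksql"   # hypothetical KSQL/ksqlDB server address

statement = """
CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR, ts BIGINT)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');
"""

# The /ksql endpoint accepts DDL/DML statements as JSON.
resp = requests.post(
    KSQL_URL,
    headers={"Content-Type": "application/vnd.ksql.v1+json"},
    json={"ksql": statement, "streamsProperties": {}},
)
print(resp.status_code, resp.json())
```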
Applications need data, but the legacy approach of n-tiered application architecture doesn’t solve for today’s challenges. Developers aren’t empowered to build and iterate their code quickly without lengthy review processes from other teams. New data sources cannot be quickly adopted into application development cycles, and developers are not able to control their own requirements when it comes to data platforms.
Part of the challenge here is the existing relationship between two groups: developers and DBAs. Developers are trying to go faster, automating build/test/release cycles with CI/CD, and thrive on the autonomy provided by microservices architectures. DBAs are stewards of data protection, governance, and security. Both of these groups are critically important to running data platforms, but many organizations deal with high friction between these teams. As a result, applications get to market more slowly, and it takes longer for customers to see value.
What if we changed the orientation between developers and DBAs? What if developers consumed data products from data teams? In this session, Pivotal’s Dormain Drewitz and Solstice’s Mike Koleno will speak about:
- Product mindset and how balanced teams can reduce internal friction
- Creating data as a product to align with cloud-native application architectures, like microservices and serverless
- Getting started bringing lean principles into your data organization
- Balancing data usability with data protection, governance, and security
Presenters: Dormain Drewitz (Pivotal) & Mike Koleno (Solstice)
Big Data Paris - A Modern Enterprise Architecture - MongoDB
Since the 1980s, the volume of data produced and the risk associated with that data have literally exploded. 90% of the data that exists today was created in the last two years, and 80% of it is unstructured. With more users and a need for always-on availability, the risks are much higher.
What database criteria should a decision maker take into account when deploying innovative applications?
Dynatrace: Davis - Hololens - AI update - Cloud announcements - Self driving IT - Dynatrace
Dynatrace announced new features for their AI assistant Davis including notifications for Amazon Echo and Slack. They also discussed plans to further integrate Davis with Dynatrace search capabilities. The company announced a new Innovator Program providing hardware, workshops and early access to new features for 15 participants paying an annual $25k fee. Finally, they demonstrated a new integration with Microsoft HoloLens and discussed how Davis is built on Dynatrace APIs to provide multimodal interfaces for the future.
Real-time big data analytics based on product recommendations case study - deep.bi
We started as an ad network. The challenge was to recommend the best product (out of millions) to the right person in a given moment (thousands of users within a second). We have delivered 5 billion ad views over the last 24 months. To put that scale in context: if we served 1 ad per second, it would take 160 years to serve 5 billion ads.
So we needed a solution. SQL databases did not work. Popular NoSQL databases did not work. Standard data warehouse approaches (pre-aggregations, creating schemas) did not work either.
Rethinking all the problems posed by the huge data streams flowing to us every second, we built a complete solution based on open source technologies and fresh, smart ideas from our engineering team. It is called deep.bi and now we make it available to other companies.
deep.bi lets high-growth companies solve fast data problems by providing scalable, flexible and real-time data collection, enrichment and analytics.
It was built using:
- Node.js - API
- Kafka - collecting and distributing data
- Spark Streaming - ETL, data enrichments
- Druid - real-time analytics
- Cassandra - user events store
- Hadoop + Parquet + Spark - raw data store + ad-hoc queries
Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022 - StreamNative
There is an increasing need to unleash analytical capabilities directly to the end-users to democratize decision-making. User-Facing Analytics is a new frontier that will shape the products of tomorrow and push the limits of existing technology. It demands a solution that will scale to millions of users to provide fast, real-time insights. In this session, Xiang will talk about his journey to build Apache Pinot to tackle the analytics problem space with the architectural changes and technology inventions made over the past decade. He will also talk about how other big data companies such as LinkedIn, Uber, and Stripe power their user-facing analytical applications.
This talk is about data-driven transformation and its contribution to digital transformation. The first part shows the necessity of adopting the "software revolution" to adapt constantly to the customer’s environment. I then speak about "Exponential Information Systems", which are the foundation for data-driven ambitions: enterprise-wide flows, customer-time data freshness, future-proof unified semantics, etc.
The last part talks about exponential technologies, such as artificial intelligence and machine learning, to drive more value from data.
High-performance database technology for rock-solid IoT solutions - Clusterpoint
Clusterpoint is a privately held database software company founded in 2006 with 32 employees. Their product is a hybrid operational database, analytics, and search platform that provides secure, high-performance distributed data management at scale. It reduces total cost of ownership by 80% over traditional relational databases by providing blazing fast performance, unlimited scalability, and bulletproof transactions with instant text search and security. Clusterpoint also offers their database software as a cloud database as a service to instantly scale databases on demand.
Think of big data as all data, no matter what the volume, velocity, or variety. The simple truth is a traditional on-prem data warehouse will not handle big data. So what is Microsoft’s strategy for building a big data solution? And why is it best to have this solution in the cloud? That is what this presentation will cover. Be prepared to discover all the various Microsoft technologies and products from collecting data, transforming it, storing it, to visualizing it. My goal is to help you not only understand each product but understand how they all fit together, so you can be the hero who builds your company's big data solution.
Creating a Modern Data Architecture for Digital Transformation - MongoDB
By managing Data in Motion, Data at Rest, and Data in Use differently, modern Information Management Solutions are enabling a whole range of architecture and design patterns that allow enterprises to fully harness the value in data flowing through their systems. In this session we explored some of the patterns (e.g. operational data lakes, CQRS, microservices and containerisation) that enable CIOs, CDOs and senior architects to tame the data challenge, and start to use data as a cross-enterprise asset.
Applying linear regression and predictive analytics - MariaDB plc
In this session Alejandro Infanzon, Solutions Engineer, introduces the linear regression and statistical functions that debuted in MariaDB ColumnStore 1.2, and how you can use them to support powerful analytics. He explains how to perform even more powerful analytics by writing multi-parameter user-defined functions (UDFs), also new in MariaDB ColumnStore 1.2.
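As a hypothetical sketch of what such queries can look like, the snippet below runs SQL-standard-style regression aggregates (REGR_SLOPE, REGR_INTERCEPT, CORR) from Python; the connection details, table, and columns are made up, and the exact function set depends on the ColumnStore version in use.

```python
import pymysql  # any MySQL-compatible client can talk to MariaDB

# Hypothetical connection parameters.
conn = pymysql.connect(host="localhost", user="analytics", password="secret", database="sales")

# Hypothetical table of monthly ad spend vs. revenue; the regression aggregates
# fit revenue = slope * ad_spend + intercept per group.
query = """
    SELECT region,
           REGR_SLOPE(revenue, ad_spend)     AS slope,
           REGR_INTERCEPT(revenue, ad_spend) AS intercept,
           CORR(revenue, ad_spend)           AS correlation
    FROM monthly_results
    GROUP BY region
"""

with conn.cursor() as cur:
    cur.execute(query)
    for region, slope, intercept, correlation in cur.fetchall():
        print(region, slope, intercept, correlation)
```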
This presentation was delivered at Microsoft Ignite - The Tour in Singapore on 16th Jan 2019. The original video for this is available on YouTube here: https://www.youtube.com/watch?v=ZRsrwLi-deA
Similar to Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
How We Added Replication to QuestDB - JonTheBeach - javier ramirez
Building a database that can beat industry benchmarks is hard work, and we had to use every trick in the book to keep as close to the hardware as possible. In doing so, we initially decided QuestDB would scale only vertically, on a single instance.
A few years later, data replication —for horizontally scaling reads and for high availability— became one of the most demanded features, especially for enterprise and cloud environments. So, we rolled up our sleeves and made it happen.
Today, QuestDB supports an unbounded number of geographically distributed read-replicas without slowing down reads on the primary node, which can ingest data at over 4 million rows per second.
In this talk, I will tell you about the technical decisions we made, and their trade offs. You'll learn how we had to revamp the whole ingestion layer, and how we actually made the primary faster than before when we added multi-threaded Write Ahead Logs to deal with data replication. I'll also discuss how we are leveraging object storage as a central part of the process. And of course, I'll show you a live demo of high-performance multi-region replication in action.
Introducing Amazon Aurora Limitless Database, which lets you scale an Amazon Aurora cluster to millions of write transactions per second and manage petabytes of data, extending relational database workloads in Aurora beyond the limits of a single Aurora writer instance without creating custom application logic or managing multiple databases.
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
1. Architecting for change:
LinkedIn's new data ecosystem
Sept 28, 2016
Shirshanka Das, Principal Staff Engineer, LinkedIn
Yael Garten, Director of Data Science, LinkedIn
@shirshanka, @yaelgarten
7. Tracking data records user activity
InvitationClickEvent()
Scale facts:
~1,000 tracking event types
~double-digit TB per day
hundreds of metrics & data products
9. Tracking Data Lifecycle & Teams
Produce, Transport, Consume
Product or App teams:
PMs, Developers, TestEng
Infra teams:
Hadoop, Kafka, DWH, ...
Data teams:
Analytics, Relevance Engineers,...
user engagement
tracking data
metric scripts
production code
Member-facing data products
Business-facing decision making
14. Two options:
1. Keep the old tracking:
a. Cost: producers (try to) replicate it (write bad old code from scratch)
b. Save: consumers avoid migrating.
2. Evolve.
a. Cost: time on data modeling, and on consumer migration
b. Save: pays down data modeling tech debt
How much work would it be?
15. How much work would it be?
Two options:
1. Keep the old tracking:
a. Cost: producers (try to) replicate it (write bad old code from scratch)
b. Save: consumers avoid migrating.
2. Evolve.
a. Cost: time on data modeling, and on consumer migration
b. Save: pays down data modeling tech debt
Estimated cost for producers to attempt to replicate old tracking: 5000 days
Estimated cost to update consumers to new tracking with clean, committee-approved data models: 2000 days
#AnalyticsHappiness
16. The Task and Opportunity
Must do: So we will do the data modeling, and rewrite all the metrics to account for the changes happening upstream… but…
Extra credit points: How do we make sure that the cost is not this high the next time?
How do we handle evolution in a principled way?
39. We had been working on something that could help...
A Data Access Layer for LinkedIn
Abstract away underlying physical details to allow users to focus solely on the logical concerns
Logical Tables + Views
Logical FileSystem
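Dali itself is LinkedIn-internal, so the following is only a hypothetical sketch of the idea using an ordinary SQL view created from PySpark: consumers query the logical name while the physical table behind it can change. The table and field names are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logical-view-sketch").getOrCreate()

# Consumers query the logical view; only the view definition changes when the
# underlying physical table, format, or schema evolves.
spark.sql("""
    CREATE OR REPLACE VIEW page_view_event AS
    SELECT header.memberId       AS member_id,
           requestHeader.pageKey AS page_key,
           header.time           AS event_time
    FROM tracking.raw_page_view_event_v2   -- hypothetical physical table behind the view
""")

daily = spark.sql("SELECT page_key, count(*) AS views FROM page_view_event GROUP BY page_key")
daily.show()
```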
44. A Few Hard Problems
Versioning
Views and UDFs
Mapping to Hive metastore entities
Development lifecycle
Git as source of truth
Gradle for build
LinkedIn tooling integration for deployment
45. Early experiences with Dali views
How we executed
Lots of work to get the infra ready
Closed beta model
Tons of training and education (hand holding) for all
Governance body
Feedback from analysts is overwhelmingly positive:
+ Much simpler to share and standardize data cleansing code with peers
+ Provides effective insulation to scripts from upstream changes
- Harder to debug where problems are due to additional layer
46. State of the world today
~100 producer views
~200 consumer views
~30% of UMP metrics use Dali data sources
~80 unique tracking event data sources
ProfileViews
MessagesSent
Searches
InvitationsSent
ArticlesRead
JobApplications
...
47. What’s next for Dali?
Real-time Views on streaming data
Selective materialization
Hive is an implementation detail, not a long term bet
Open source
Data Quality Framework
49. Infrastructure enables, but culture really preserves
get_tracking_codes = foreach get_domain_rolled_up generate
    ..entry_domain_rollup,
    ( (tracking_code matches 'eml-ced.*' or tracking_code matches 'eml-b2_content_ecosystem_digest.*'
        or (referer is not null and (referer matches '.*touch.linkedin.com.*trk=eml-ced.*'
        or referer matches '.*touch.linkedin.com.*trk=eml-b2_content_ecosystem_digest.*')) ? 'Email - CED' :
      (tracking_code matches 'eml-.*' or (referer is not null and referer matches '.*touch.linkedin.com.*trk=eml-.*')
        or entry_domain_rollup == 'Email' ? 'Email - Other' :
      (tracking_code == 'hp-feed-article-title-hpm' and entry_domain_rollup == 'Linkedin' ? 'Homepage Pulse Module' :
      ((tracking_code matches 'hp-feed-.*' and entry_domain_rollup == 'Linkedin')
        or (std_user_interface matches '(phone app|tablet app|phone browser|tablet browser)' and tracking_code == 'v-feed')
        or (tracking_code == 'Organic Traffic' and entry_domain_rollup == 'Linkedin' and (referer == 'https://www.linkedin.com/nhome' or referer == 'http://www.linkedin.com/nhome')) ? 'Feed' :
      (tracking_code matches 'hb_ntf_MEGAPHONE_.*' and entry_domain_rollup == 'Linkedin' ? 'Desktop Notifications' :
      (tracking_code == 'm_sim2_native_reader_swipe_right' ? 'Push Notification' :
      (tracking_code == 'pulse_dexter_stream_scroll' and entry_domain_rollup == 'Linkedin' ? 'Pulse -
50. For a great data ecosystem that can handle change:
1. Standardize core data entities
2. Create clear maintainable contracts between data producers & consumers
3. Ensure dialogue between data producers & consumers
51. 1. Standardize core data entities
• Event types and names: Page, Action, Impression
• Framework level client side tracking: views, clicks, flows
• For all else (custom) - guide when to create a new Event or Dali view
Navigation
Page View
Control Interaction
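As a hypothetical sketch (not LinkedIn's actual schemas) of what standardizing event types can look like, here is an Avro-style record, expressed as a Python dict, with a shared envelope for Page, Action, and Impression events.

```python
# Hypothetical standardized event envelope shared by Page, Action and Impression events.
STANDARD_EVENT_ENVELOPE = {
    "type": "record",
    "name": "TrackingEvent",
    "fields": [
        {"name": "eventType", "type": {"type": "enum", "name": "EventType",
                                       "symbols": ["PAGE_VIEW", "ACTION", "IMPRESSION"]}},
        {"name": "pageKey",    "type": "string"},                                   # standardized page identifier
        {"name": "controlKey", "type": ["null", "string"], "default": None},        # only set for control interactions
        {"name": "memberId",   "type": "long"},
        {"name": "time",       "type": "long"},                                     # epoch millis
    ],
}
```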
52. 2. Create clear maintainable contracts
1. Tracking specification with monitoring: clear, visual, consistent contract
Need tooling to support culture shift
Tracking Specification Tool
2. Dali dataset specification with data quality rules
53. 3. Ensure dialogue between Producers & Consumers
• Awareness: Train about end-to-end data pipeline, data modeling
• Instill communication & collaborative ownership process between all: a step-by-step playbook for who & how to develop and own tracking
PM → Analyst → Engineer → All 3 → TestEng → Analyst
user engagement
tracking data
metric scripts
production code
Member-facing data products
Business-facing decision making
55. Our Learnings
Culture and Process
● Spend time to identify what needs culture & process, and what needs tools & tech
● Big changes can mean big opportunities
● Very hard to massively change things like data culture or data tech debt; never a good time to invest in “invisible” behind-the-scenes change
→ Make it non-invisible -- try to clarify or size out the cost of NOT doing it
→ needed strong leaders, and a village
Tech
● Must build tooling to support that culture change otherwise culture will revert
● Work hard to make any new layer as frictionless as possible
● Virtual views on Hadoop data can work at scale! (Dali views)
56. For a great data ecosystem that can handle change:
1. Standardize core data entities
2. Create clear maintainable contracts between data producers & consumers
3. Ensure dialogue between data producers & consumers
Design for change. Expect it. Embrace it.
57. Did we succeed? We just handled another huge change!
#AnalyticsHappiness