So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right granularity for Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop, and they explore Dali, a data abstraction layer that helps you process data seamlessly across both systems.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem: how to create maintainable data contracts between data producers and data consumers (such as data scientists and data analysts), and how to standardize data effectively in a growing organization so that standards enable, rather than slow down, innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, a data API that acts as a protective and enabling shield. Along the way, they discuss how to enable teams to be good data citizens in producing, consuming, and owning datasets, and offer an overview of LinkedIn's governance model: the tools, processes, and teams that ensure that its data ecosystem can handle change and sustain #DataScienceHappiness.
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn
1. Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn
Mar 16, 2017
Shirshanka Das, Principal Staff Engineer, LinkedIn
Yael Garten, Director of Data Science, LinkedIn
@shirshanka, @yaelgarten
2. The Pursuit of #DataScienceHappiness
@yaelgarten @shirshanka
5. Achieving Data Democracy
"… helping everybody to access and understand data… breaking down silos… providing access to data when and where it is needed at any given moment."
Collect, flow, and store as much data as you can. Provide efficient access to data in all its stages of evolution.
6. The forms of data (diagram)
Categories: Key-Value++, Message Bus, Fast-OLAP, Search, Graph, Crunchable
Systems pictured: Espresso, Venice, Pinot, Galene, Graph DB, Document DB, DynamoDB, Azure Blob / Data Lake Storage
7. The forms of data (diagram): the same systems, now grouped into In Motion vs. At Rest
8. The forms of data: scale
In Motion: O(10) clusters, ~1.7 trillion messages, ~450 TB
At Rest: O(10) clusters, ~10K machines, ~100 PB
10. Data integration: key requirements
Source and sink diversity; batch + streaming; data quality.
So, we built…
11. Simplifying data integration @LinkedIn with Gobblin (diagram: sources include SFTP, JDBC, REST, Azure Blob / Data Lake Storage)
Hundreds of TB per day, thousands of datasets, ~30 different source systems, 80%+ of data ingest.
Open source @ github.com/linkedin/gobblin. Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, NerdWallet, and many more; Apache incubation under way.
19. How does LinkedIn build data-driven products? (diagram: Data Scientist, PM, Designer, and Engineer collaborating)
"We should enable users to filter connection suggestions by company." "How much do people utilize existing filter capabilities?" "Let's see how users send connection invitations today."
21. Tracking data records user activity (powers metrics and data products), e.g., InvitationClickEvent().
Scale fact: ~1,000 tracking event types; hundreds of metrics & data products.
23. Tracking Data Lifecycle & Teams: Produce -> Transport -> Consume
Product or app teams (PMs, developers, TestEng) produce user-engagement tracking data from production code; infra teams (Hadoop, Kafka, DWH, …) transport it; data science teams (analytics, ML engineers, …) consume it through metric scripts, powering member-facing data products and business-facing decision making, for members and execs.
24. How do we calculate a metric: ProfileViews
Metric: ProfileViews = sum(PageViewEvent where pagekey = profile_page)
PageViewEvent Record 1:
{
  "memberId" : 12345,
  "time" : 1454745292951,
  "appName" : "LinkedIn",
  "pageKey" : "profile_page",
  "trackingInfo" : "Viewee=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL| ..."
}
PageViewEvent Record 101:
{
  "memberId" : 12345,
  "time" : 1454745292951,
  "appName" : "LinkedIn",
  "pageKey" : "new_profile_page",
  "trackingInfo" : "viewee_id=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL| ..."
}
When the new page ships, the metric logic must become "where pagekey = profile_page or new_profile_page", but the producing team forgot to notify the metric owners. Undesirable.
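To make the failure mode concrete, here is a minimal sketch (illustrative only, not LinkedIn's actual metric code) of the metric before and after the fix:

```python
# Minimal sketch of the ProfileViews metric over raw PageViewEvent
# records. Illustrative only, not LinkedIn's actual metric code.
events = [
    {"memberId": 12345, "pageKey": "profile_page"},
    {"memberId": 67890, "pageKey": "new_profile_page"},  # new mobile app
]

# Original logic: silently undercounts once new_profile_page ships.
profile_views_old = sum(1 for e in events if e["pageKey"] == "profile_page")

# Fixed logic: but the metric owner has to learn about the new pageKey first.
PROFILE_PAGE_KEYS = {"profile_page", "new_profile_page"}
profile_views = sum(1 for e in events if e["pageKey"] in PROFILE_PAGE_KEYS)

print(profile_views_old, profile_views)  # 1 2
```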
25. Metrics ecosystem at LinkedIn, 3 years ago: operational challenges for infra teams; diminished trust due to multiple sources of truth.
26. What was causing unhappiness?
1. No contracts: downstream scripts broke when upstream changed.
2. "Naming things is hard": different semantics and conventions across teams' events meant emailing around to figure out the correct and complete logic to use, which was inefficient and potentially wrong.
3. Discrepant metric logic: duplicated technology allowed duplicated logic, which allowed discrepant metric logic.
So how did we solve this?
27. Data modeling tip (addressing #1, no contracts): say no to fragile formats or schema-free. Invest in a mature serialization protocol like Avro, Protobuf, or Thrift for serializing your messages to your persistent stores: Kafka, Hadoop, DBs, etc. We chose Avro as our format.
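As an illustration of the tip, here is a minimal sketch of defining an Avro schema and round-tripping records with it, using the fastavro library; the schema is a hypothetical stand-in, not LinkedIn's actual PageViewEvent definition.

```python
# Minimal sketch: an Avro schema for a tracking event, written and read
# back with fastavro. The schema is hypothetical, not LinkedIn's actual
# PageViewEvent definition.
import io
from fastavro import parse_schema, writer, reader

page_view_schema = parse_schema({
    "type": "record",
    "name": "PageViewEvent",
    "namespace": "com.example.tracking",
    "fields": [
        {"name": "memberId", "type": "long"},
        {"name": "time", "type": "long"},
        {"name": "appName", "type": "string"},
        {"name": "pageKey", "type": "string"},
        # Optional field with a default, so the schema can evolve safely.
        {"name": "trackingInfo", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
writer(buf, page_view_schema, [
    {"memberId": 12345, "time": 1454745292951, "appName": "LinkedIn",
     "pageKey": "profile_page", "trackingInfo": None},
])

buf.seek(0)
for record in reader(buf):
    print(record["pageKey"])  # -> profile_page
```

The key property: every record on the wire is bound to an explicit, evolvable schema, which is what makes contracts between producers and consumers enforceable.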
28. Sometimes you need a committee (addressing #2, naming things is hard): the Data Model Review Committee (DMRC)
Leads from product and infra teams review each new data model, ensuring it follows our conventions, patterns, and best practices across the entire data lifecycle: who and what, plus evolution. Tooling codifies the conventions. "Always be reviewing."
29. Unified Metrics Platform (addressing #3, discrepant metric logic): a single source of truth for all business metrics at LinkedIn.
- A metrics processing platform as a service
- A metrics computation template
- A set of tools and processes to facilitate the metrics lifecycle
Lifecycle (diagram): the metric owner (1) iterates in a sandbox and (2) creates a metric definition; (3) the central team and relevant stakeholders review it; (4) it is checked in to the metric-definition code repo, from which the build-and-deploy system runs the core metrics jobs.
~5,000 metrics daily.
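The deck does not show UMP's template format, but a declarative metric definition in such a system might look roughly like this hypothetical sketch (every field name here is illustrative):

```python
# Hypothetical sketch of a declarative metric definition for a
# UMP-style platform. All field names are illustrative; the deck does
# not show UMP's actual template language.
PROFILE_VIEWS_METRIC = {
    "name": "ProfileViews",
    "owner": "profile-team@example.com",          # accountable metric owner
    "source": "PageViewEvent",                    # raw tracking data
    "filter": "pageKey IN ('profile_page', 'new_profile_page')",
    "aggregation": "count",
    "dimensions": ["appName"],                    # slice-able dimensions
    "grain": "daily",
}
```

Centralizing definitions like this is what removes discrepant metric logic: there is exactly one place where ProfileViews is defined, reviewed, and deployed.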
30. Unified Metrics Platform: pipeline (diagram)
Raw data on HDFS (tracking + database + other data) flows through the UMP harness, which applies the metrics logic with incremental aggregation, backfill, and auto-join. The aggregated data lands on HDFS and in Pinot, powering Raptor dashboards, experiment analysis, machine learning, anomaly detection, and ad hoc queries.
31. Tracking Platform: standardizing production (diagram)
Client-side tracking from the frontend and services flows through Kafka, with schema compatibility checks, time audit, and supporting tools along the way.
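One piece of that standardization, schema compatibility, can be illustrated with a minimal sketch; this is a toy backward-compatibility rule, not LinkedIn's (or Avro's) actual resolution algorithm:

```python
# Toy backward-compatibility check between two record schemas, in the
# spirit of the platform's "schema compatibility" gate. An illustration
# only, not LinkedIn's (or Avro's) actual algorithm.
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Old records remain decodable if every field added in the new
    schema carries a default value."""
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False  # a new required field breaks old producers
    return True

old = {"memberId": {"type": "long"}, "pageKey": {"type": "string"}}
new = {"memberId": {"type": "long"}, "pageKey": {"type": "string"},
       "appName": {"type": "string", "default": ""}}
assert backward_compatible(old, new)
```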
33. What was still causing unhappiness?
1. Old bad data sticks around (e.g., old mobile app versions).
2. No clear contract for data production: producers unaware of consumers' concerns.
3. Never a good time to pay down this tech debt.
We started from the bottom.
34. (the tracking-data lifecycle diagram from slide 23 again: product/app teams produce, infra teams transport, data science teams consume, for members and execs)
3. Never a good time to pay down this "data" debt: features are waiting to ship to members, and some of this work is invisible. But what is the cost of not doing it?
#victimsOfTheData -> #DataScienceHappiness, via proactively forging our own data destiny.
35. The Big Problem Opportunity in 2015
Launch a completely rewritten LinkedIn mobile app
37. There were two options:
1. Keep the old tracking. Cost: producers (try to) replicate it (write bad old code from scratch). Save: consumers avoid migrating.
2. Evolve. Cost: time on clean data modeling, and on consumer migration to new tracking events. Save: pays down data-modeling tech debt.
38.
1. Keep the old tracking:
a. Cost: producers (try to) replicate it (write bad old code from
scratch),
b. Save: consumers avoid migrating.
2. Evolve.
a. Cost: time on clean data modeling, and on consumer
migration to new tracking events,
b. Save: pays down data modeling tech debt
How much work would it be?
#DataScienceHappiness
40. 2. A clear contract did not exist for data production
Producers were unaware of consumers' needs and were "throwing data over the wall." Even with Avro, schema adherence != semantic equivalence. (diagram: user-engagement tracking data feeding metric scripts and production code, member-facing data products, and business-facing decision making)
#victimsOfTheData -> #DataScienceHappiness, via proactive joint requirements definition: data producers (PMs, app developers) and data consumers (data scientists) jointly own the artifact that feeds the data ecosystem (and the data scientists!).
41. 2a. Ensure dialogue between producers & consumers
• Awareness: train teams on the end-to-end data pipeline and data modeling.
• Instill communication and a collaborative-ownership process between all parties: a step-by-step playbook for who develops and owns tracking, and how.
42. 2b. Standardized core data entities
• Event types and names: Page, Action, Impression (Navigation, Page View, Control Interaction)
• Framework-level client-side tracking: views, clicks, flows
• For all else (custom): a guide for when to create a new event
A sketch of such a standardized event envelope follows.
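Here is a hypothetical sketch of what such a standardized envelope and framework-level helper might look like; all names are illustrative, not LinkedIn's actual event model.

```python
# Hypothetical sketch of a standardized client-side tracking envelope
# with a small, closed set of core event types (Page, Action,
# Impression), as the slide advocates. Names are illustrative only.
from dataclasses import dataclass
from enum import Enum
import time

class EventType(Enum):
    PAGE = "page"              # navigation / page view
    ACTION = "action"          # control interaction (click, swipe, ...)
    IMPRESSION = "impression"  # content rendered in the viewport

@dataclass
class TrackingEvent:
    event_type: EventType
    page_key: str
    member_id: int
    time_ms: int

def page_view(page_key: str, member_id: int) -> TrackingEvent:
    """Framework-level helper: apps emit standard events through helpers
    like this instead of hand-rolling per-team event shapes."""
    return TrackingEvent(EventType.PAGE, page_key, member_id,
                         int(time.time() * 1000))
```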
43. 2c. Created clear, maintainable data-production contracts
A tracking specification with monitoring and alerting for adherence: a clear, visual, consistent contract (pictured: the tracking specification tool). Tooling is needed to support the culture and process shift. "Always be tooling."
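A minimal sketch of what "monitoring and alerting for adherence" could look like, with a hypothetical spec format (not LinkedIn's actual tool):

```python
# Minimal sketch: checking live events against a tracking specification.
# The spec format is hypothetical, not LinkedIn's actual tool.
SPEC = {
    "PageViewEvent": {
        "required": {"memberId", "time", "appName", "pageKey"},
        "allowed_page_keys": {"profile_page", "new_profile_page"},
    },
}

def violations(event_name: str, event: dict) -> list[str]:
    spec = SPEC.get(event_name)
    if spec is None:
        return [f"unknown event type: {event_name}"]
    problems = [f"missing field: {f}" for f in spec["required"] - event.keys()]
    if event.get("pageKey") not in spec["allowed_page_keys"]:
        problems.append(f"unexpected pageKey: {event.get('pageKey')}")
    return problems  # a non-empty list would fire an alert

print(violations("PageViewEvent",
                 {"memberId": 12345, "time": 1, "appName": "LinkedIn"}))
```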
49. We built Dali to solve this: a data access layer for LinkedIn
It abstracts away the underlying physical details so users can focus solely on the logical concerns: logical tables + views, and a logical filesystem.
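Dali's API is not shown in the deck, but the idea of a logical view that hides physical layout can be illustrated with Spark SQL; this is an analogy under assumed paths and names, not Dali itself:

```python
# Illustration of the "logical view over physical data" idea using
# PySpark SQL. An analogy for what a Dali view provides, not Dali's
# actual API; paths and names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dali-style-view").getOrCreate()

# Physical detail: where and how the events are stored.
events = spark.read.parquet("hdfs:///data/tracking/PageViewEvent")

# Logical concern: consumers query a stable view, not the raw files.
events.createOrReplaceTempView("page_view_event")
spark.sql("""
    CREATE OR REPLACE TEMP VIEW profile_views AS
    SELECT memberId, time
    FROM page_view_event
    WHERE pageKey IN ('profile_page', 'new_profile_page')
""")

spark.sql("SELECT COUNT(*) FROM profile_views").show()
```

Consumers query profile_views; if the files move, the format changes, or the pageKey logic evolves, only the view definition changes.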
54. A few hard problems
Versioning of views and UDFs; mapping to Hive metastore entities; the development lifecycle: Git as source of truth, Gradle for build, LinkedIn tooling integration for deployment.
55. State of the world today
~300 views; pretty much all new UMP metrics use Dali data sources: ProfileViews, MessagesSent, Searches, InvitationsSent, ArticlesRead, JobApplications, … (data at rest, feeding the data processing frameworks)
56. Now brewing: Dali on Kafka
Can we take the same views and run them seamlessly on Kafka as well? Stream data arrives via standard streaming APIs: the Samza SystemConsumer and the Kafka Consumer.
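Continuing the analogy (again, not Dali's actual API): with Spark Structured Streaming, the same logical definition could be bound to a Kafka topic, assuming a hypothetical broker address and topic name:

```python
# Sketch of the "same logical view over the stream" idea with Spark
# Structured Streaming. An analogy for Dali-on-Kafka, not its actual
# API; broker address and topic name are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dali-on-kafka-sketch").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "PageViewEvent")
          .load())

# Once the Avro payload in `value` is deserialized into columns, the same
# profile_views view definition from the batch sketch could be applied
# here, giving batch and streaming consumers one logical definition.
stream.createOrReplaceTempView("page_view_event_stream")
```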
57. What's next for Dali?
Selective materialization; open source; Hive is an implementation detail, not a long-term bet.
58. Dali: When are we done dreaming? (diagram: Dali spanning data in motion and at rest, beneath the data processing frameworks)
62. Our journey towards #DataScienceHappiness
2010 ->: basic data infrastructure for data democracy (Kafka, Hadoop, Gobblin, Pinot)
2013 ->: platforms and process to standardize producing + consuming (Tracking, UMP, DMRC)
2015 ->: tech + process to sustain a healthy data ecosystem (Dali, dialogue)
2015 ->: evangelize investing in #DataScienceHappiness
63. The Pursuit of #DataScienceHappiness
@yaelgarten @shirshanka
Thank you! To be continued…