At LinkedIn, the Kafka infrastructure is run as a service: the Streaming team develops and deploys Kafka, but is not the producer or consumer of the data that flows through it. With multiple datacenters, and numerous applications sharing these clusters, we have developed an architecture with multiple pipelines and multiple tiers. Most days, this works out well, but it has led to many interesting problems. Over the years we have worked to develop a number of solutions, most of them open source, to make it possible for us to reliably handle over a trillion messages a day.
Kafka is a distributed publish-subscribe messaging system that allows both streaming and storage of data feeds. It is designed to be fast, scalable, durable, and fault-tolerant. Kafka maintains feeds of messages called topics that can be published to by producers and subscribed to by consumers. A Kafka cluster typically runs on multiple servers called brokers that store topics which may be partitioned and replicated for fault tolerance. Producers publish messages to topics which are distributed to consumers through consumer groups that balance load.
Apache Kafka's rise in popularity as a streaming platform has demanded a revisit of its traditional at-least-once message delivery semantics.
In this talk, we present the recent additions to Kafka to achieve exactly-once semantics (EoS) including support for idempotence and transactions in the Kafka clients. The main focus will be the specific semantics that Kafka distributed transactions enable and the underlying mechanics which allow them to scale efficiently.
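Since this abstract centers on idempotence and transactions in the Kafka clients, a minimal sketch of the transactional producer API may help ground the terms. The bootstrap address, topic names, and transactional.id below are illustrative assumptions, not details from the talk:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Idempotence deduplicates broker-side retries within a producer session.
        props.put("enable.idempotence", "true");
        // A transactional.id enables atomic writes spanning multiple partitions.
        props.put("transactional.id", "demo-txn-1"); // assumption: any stable unique id

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("orders", "k1", "created"));
                producer.send(new ProducerRecord<>("order-audit", "k1", "created"));
                producer.commitTransaction(); // both records become visible atomically
            } catch (Exception e) {
                // Note: fatal errors such as a fenced producer require closing instead.
                producer.abortTransaction(); // hides both records from read_committed consumers
                throw e;
            }
        }
    }
}
```

Consumers opt into these semantics by setting isolation.level to read_committed, so aborted transactions are never observed.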
Kafka is well known for high throughput ingestion. However, to get the best latency characteristics without compromising on throughput and durability, we need to tune Kafka. In this talk, we share our experiences to achieve the optimal combination of latency, throughput and durability for different scenarios.
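As a rough illustration of the trade-offs this talk covers, here are producer settings commonly adjusted for latency, throughput, and durability. The values are assumed starting points for experimentation, not the speakers' recommendations:

```java
import java.util.Properties;

// A sketch of producer settings that trade latency, throughput, and durability.
public class ProducerTuningSketch {
    static Properties lowLatency() {
        Properties p = new Properties();
        p.put("acks", "1");      // leader-only ack: lower latency, weaker durability
        p.put("linger.ms", "0"); // send immediately instead of waiting to batch
        return p;
    }

    static Properties highThroughput() {
        Properties p = new Properties();
        p.put("linger.ms", "20");         // wait briefly to build larger batches
        p.put("batch.size", "262144");    // 256 KB batches amortize request overhead
        p.put("compression.type", "lz4"); // fewer bytes on the wire per message
        return p;
    }

    static Properties highDurability() {
        Properties p = new Properties();
        p.put("acks", "all");                // wait for the full in-sync replica set
        p.put("enable.idempotence", "true"); // safe retries without duplicates
        return p;
    }
}
```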
A brief introduction to Apache Kafka, describing its usage as a platform for streaming data. It introduces some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
It covers a brief introduction to Apache Kafka Connect, giving insight into its benefits, use cases, and the motivation behind building Kafka Connect, along with a short discussion of its architecture.
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
-Topics, partitions and segments
-The commit log and streams
-Brokers and broker replication
-Producer basics
-Consumers, consumer groups and offsets
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
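To make the consumer-side concepts in the list above concrete, here is a minimal consumer-group sketch using the standard Java client; the broker address, group id, and topic name are assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerGroupSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption
        props.put("group.id", "fundamentals-demo");       // group members share partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");         // commit offsets explicitly below

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // assumption: topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
                consumer.commitSync(); // record the group's position per partition
            }
        }
    }
}
```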
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
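As a taste of the DSL mentioned above, here is a minimal word-count sketch against assumed topic names; it is illustrative, not code from the talk:

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo");    // assumption
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input"); // assumption: topic
        KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\s+")))
                .groupBy((key, word) -> word) // re-key by word so counting can partition
                .count();
        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Note that this runs as a plain Java application, which is exactly the light-weight library approach the talk contrasts with cluster-based frameworks.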
Kafka is an open-source message broker that provides high-throughput and low-latency data processing. It uses a distributed commit log to store messages in categories called topics. Processes that publish messages are producers, while processes that subscribe to topics are consumers. Consumers can belong to consumer groups for parallel processing. Kafka guarantees message ordering within a partition and, when configured appropriately, no lost messages. It uses Zookeeper for metadata and coordination.
Kafka Tutorial - introduction to the Kafka streaming platform
The document discusses Kafka, an open-source distributed event streaming platform. It provides an introduction to Kafka and describes how it is used by many large companies to process streaming data in real-time. Key aspects of Kafka explained include topics, partitions, producers, consumers, consumer groups, and how Kafka is able to achieve high performance through its architecture and design.
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming apps. It provides a unified, scalable, and durable platform for handling real-time data feeds. Kafka works by accepting streams of records from one or more producers and organizing them into topics. It allows both storing and forwarding of these streams to consumers. Producers write data to topics which are replicated across clusters for fault tolerance. Consumers can then read the data from the topics in the order it was produced. Major companies like LinkedIn, Yahoo, Twitter, and Netflix use Kafka for applications like metrics, logging, stream processing and more.
This document discusses using Apache Kafka as a data hub to capture changes from various data sources using change data capture (CDC). It outlines several common CDC patterns like using modification dates, database triggers, or log files to identify changes. It then discusses using Kafka Connect to integrate various data sources like MongoDB, PostgreSQL and replicate changes. The document provides examples of open source CDC connectors and concludes with suggestions for getting involved in the Apache Kafka community.
Hello Apache Kafka
An Introduction to Apache Kafka with Timothy Spann and Carolyn Duby Cloudera Principal engineers.
We also demo Flink SQL, SMM, SSB, Schema Registry, Apache Kafka, Apache NiFi and Public Cloud - AWS.
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Apache Kafka in conjunction with Apache Spark became the de facto standard for processing and analyzing data. Both frameworks are open, flexible, and scalable.
Unfortunately, the latter makes operations a challenge for many teams. Ideally, teams can use serverless SaaS offerings to focus on business logic. However, hybrid and multi-cloud scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden.
This session explores different architectures to build serverless Apache Kafka and Apache Spark multi-cloud architectures across regions and continents.
We start from the analytics perspective of a data lake and explore its relation to a fully integrated data streaming layer with Kafka to build a modern Data Lakehouse.
Real-world use cases show the joint value and explore the benefit of the "delta lake" integration.
Presented at Kafka Summit 2016
Operating out of multiple datacenters is a large part of most disaster recovery plans, but it brings extra complications to our data pipelines. Instead of having a straight path from front to back, it now has forks and dead ends and odd little use cases that don’t match up with a perfect view of the world. This talk will focus on how to best utilize Apache Kafka in this world, including basic architectures for multi-datacenter and multi-tier clusters. We will also touch on how to assure messages make it from producer to consumer, and how to monitor the entire ecosystem.
The document discusses Cruise Control, a tool for managing Apache Kafka clusters. It was created by LinkedIn to handle their large Kafka deployment consisting of over 2,000 brokers and 4 trillion messages per day. Cruise Control monitors broker loads, detects anomalies, and generates proposals to optimize resource usage and replica distributions. It supports operations like adding or removing brokers and performing automatic rebalances without downtime or data loss. The architecture includes components for load monitoring, analysis, execution and an API for administration.
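For a sense of how such a tool is typically driven, here is a hedged sketch of polling Cruise Control's REST state endpoint; the host, port, and exact path are assumptions based on the project's public documentation:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Polls Cruise Control's state endpoint and prints the response body.
public class CruiseControlStateCheck {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://cruise-control.example.com:9090/kafkacruisecontrol/state"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // reports monitor/analyzer/executor status
    }
}
```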
First presentation for Savi's sponsorship of the Washington DC Spark Interactive. Discusses tips and lessons learned using Spark Streaming (24x7) to ingest and analyze Industrial Internet of Things (IIoT) data as part of a Lambda Architecture
The document discusses modern software development tools and practices, including:
- Using Git for version control and GitHub for collaboration between developers.
- Tools like Jenkins, Trello, and Slack to enable continuous integration, project management, and team communication.
- Following architectural approaches like microservices and implementing infrastructure as code using tools from the HashiCorp stack like Vagrant, Consul, and Terraform.
- Achieving continuous delivery by integrating development and operations to reliably release software through an automated deployment process.
Symantec deployed an SDN using OpenStack with the following key aspects:
1. They created different "Classes of Service" including a development environment and a production environment to onboard teams and manage workloads.
2. They provided self-service user onboarding through Horizon with automatic network creation to hide complexities.
3. They offered load balancing as a service using HA Proxy with various optimizations to achieve high performance.
4. They attached baremetal servers to the overlay network by launching them in network namespaces.
5. They aimed for over 99.95% control plane availability using a distributed controller and Cassandra setup with automation and monitoring.
The evolving complexity of the data center is placing increased demand on the network and security teams to come up with inventive methods for enforcing security policies in these ever-changing environments. The goal of this session is to provide participants with an understanding of features and design recommendations for integrating security into the data center environment. This session will focus on recommendations for securing next-generation data center architectures. Areas of focus include security services integration, leveraging device virtualization, and considerations and recommendations for server virtualization. The target audience are security and data center administrators.
This document discusses Oracle Ravello Cloud and provides an overview, live demonstration, and summary. Oracle Ravello Cloud allows users to migrate VMware workloads to public clouds without modification by lifting and shifting the virtual machines. The live demonstration shows importing an existing Primavera environment into Oracle Ravello Cloud and publishing the virtual machines to the cloud with one click. The summary notes that Oracle Ravello Cloud solves issues like compatibility, lock-in, and labor costs by allowing lift and shift of workloads to the cloud in an agnostic manner.
SP Virtual Managed Services (VMS) for Intelligent WAN (IWAN)
Many organizations anticipate significant growth in WAN bandwidth and Public Cloud usage. Leveraging the Internet to provide extra WAN bandwidth and to offload Public Cloud traffic is compelling, however network reliability, application performance and security are the primary roadblocks. Cisco IWAN transport solution is the most full featured architecture to support the Software Defined Wide Area Network (SD-WAN) requirements that are emerging in standards bodies like the Open Networking User Group (ONUG) to address these issues. Many enterprises are looking for the benefits these technologies deliver, but without the costs associated with owning and operating those technologies. Here is where VMS for IWAN meets market need. Cisco VMS is a full featured management platform for both virtual and physical devices. This session will cover a full description of the VMS platform and how it can be used to deliver exceptional customer experience when supporting a managed offering of IWAN. The roles of Customer and Resource Facing Services will be covered, along with integration between the IWAN service and SP operations. This session will also cover the topic of how Virtual Network Functions (VNFs) can be placed optimally in the network from the CPE to SP datacenter, along with a demo of the end user and operator experience.
Adobe Ask the AEM Community Expert Session Oct 2016
Two large enterprise AEM implementations were presented and compared. Anshul Chhabra from Symantec presented their implementation handling 3.3 billion requests per month. Anil Kalbag from Cisco presented their implementation handling 375 million monthly page views. Both implementations utilized multiple data centers for high availability and disaster recovery. Key architecture decisions around virtual/physical infrastructure, storage, caching, and multi-tenancy were discussed and compared between the two organizations.
An Integrated Approach to Manage IT Network Traffic - An Overview
An integrated approach to network traffic management provides benefits over traditional point solutions. A single integrated solution can monitor all types of network traffic, support multiple protocols, provide insightful dashboards and reports, and analyze network behavior. This saves troubleshooting time, improves service levels, and maximizes return on investment by enabling better resource allocation and optimization.
The operations engineer is often seen as the hero, toiling away late nights on call to keep the systems running through failures of hardware and of code. While developers try as hard as possible to move quickly and break things, we stand as the voice of reason urging caution. We’re the only ones who truly understand the systems, but you’ll rarely find documentation because it’s just too complex and changeable to write down. When we’re doing our jobs well, we’re unappreciated because nobody understands how difficult it is. When things break, everyone thinks we’re doing our jobs badly. These are not the things we aspire to.
At LinkedIn, Site Reliability Engineers are one layer in a stack that starts with the way we manage our code and basic hardware, and is built with common systems for application management, monitoring, and alerting. Each layer has its own specialist engineers, focused on making their piece as resilient as it can be and building it to integrate with the rest of the stack. This lets Software Engineers concentrate on developing their applications, without having to spend time building systems to build, package, and distribute their code. SREs can dedicate their time to integrating applications with the stack, architecting and scaling deployments, as well as developing tools and documentation to make the job easier. When the inevitable failure happens, many experts come together to quickly identify and resolve the problem and improve the entire stack for everyone.
Description:
Presentation at the International Industry-Academia Workshop on Cloud Reliability and Resilience. 7-8 November 2016, Berlin, Germany.
Organized by EIT Digital and Huawei GRC, Germany.
Twitter: @CloudRR2016
Oracle Commerce as a Secure, Scalable Hybrid Cloud Service, webinar slides
Want to move your Oracle Commerce infrastructure to the cloud?
If you are running your commerce business on Oracle Commerce (formerly ATG) stack, and want to take it to the cloud, check out this webinar slide deck. Two industry innovators are joined by the leading retail analyst to discuss why now is the time for the large retailers to transition to the cloud and how they can help.
Kafka's basic terminology, its architecture, its protocol, and how it works.
Kafka at scale: its caveats, guarantees, and the use cases it serves.
How we use it @ZaprMediaLabs.
Kafka Tutorial - Introduction to Apache Kafka (Part 1) by Jean-Paul Azar
Why is Kafka so fast? Why is Kafka so popular? Why Kafka? This slide deck is a tutorial for the Kafka streaming platform. This slide deck covers Kafka Architecture with some small examples from the command line. Then we expand on this with a multi-server example to demonstrate failover of brokers as well as consumers. Then it goes through some simple Java client examples for a Kafka Producer and a Kafka Consumer. We have also expanded on the Kafka design section and added references. The tutorial covers Avro and the Schema Registry as well as advance Kafka Producers.
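In the spirit of the simple Java client examples the tutorial mentions, here is a minimal producer sketch with an acknowledgement callback; the broker address and topic are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("greetings", "key-1", "hello, kafka"); // assumed topic
            // The callback fires after the broker acknowledges (or rejects) the write.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("wrote to %s-%d at offset %d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
            producer.flush(); // ensure the buffered record is sent before exit
        }
    }
}
```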
OSMC 2022 | Current State of Icinga by Bernd Erk (NETWAYS)
This document provides an overview and update on the current state of Icinga, an open source monitoring solution. It discusses Icinga's goal of continuously improving its unified open source and enterprise monitoring capabilities. Key points include that Icinga is made for enterprises and offers features like scalability, high availability, and enterprise-grade support. The document highlights recent Icinga releases and upcoming work, community contributions, and how Icinga can be used to monitor infrastructure, offer automation, support cloud monitoring, and provide metrics, logs, and notifications.
Updates to Apache CloudStack and LINBIT SDS (ShapeBlue)
In this session, speakers Giles Sirett and Philipp Reisner shared insights into CloudStack and LINBIT. Giles detailed Apache CloudStack's scalability, multi-tenancy, and compatibility with various hypervisors. He also discussed CloudStack's integrated, easy-to-use nature, rapid time-to-value, and its active community. Following this, Giles delved into different use cases, such as IaaS/cloud provisioning, disaster recovery, and sovereign clouds, among others. CloudStack's features, including its support for Kubernetes clusters, its scalable architecture, and high availability, were also discussed.
Following this, Philipp highlighted the 4 key ways in which LINBIT can help an organisation: "Protecting Data, Always Keeping Your Services On, Shaping Your Destiny and Exceeding with Best Performance". Philipp also delved into the different reasons why LINBIT SDS is so fast, and what the next steps are for DRBD, LINSTOR and the LINSTOR Driver for CloudStack.
-----------------------------------------
On October 10th 2023, ShapeBlue, Ampere Computing and LINBIT held a joint virtual event – Building Next-Generation IaaS. The event explored how the synergy between ARM, Apache CloudStack and LINBIT’s storage solutions can achieve a formidable price-to-performance ratio. There were a total of 3 sessions held by speakers from all 3 organisations.
Serverless: Market Overview and Investment Opportunities (Underscore VC)
The document discusses serverless computing and key investment opportunities in the space. Serverless computing refers to cloud-based event-driven computing where the cloud provider manages the infrastructure. The document outlines the benefits of serverless computing including ease of scaling, reduced costs, and increased productivity. It also discusses some challenges around vendor lock-in, lack of control, and multitenancy. The document identifies serverless monitoring, security, and infrastructure as key investment areas. It provides overviews of serverless monitoring companies IOPipe and Dashbird which provide tools to monitor and debug serverless applications.
Unlocking the Power of IoT: A comprehensive approach to real-time insights (confluent)
In today's data-driven world, the Internet of Things (IoT) is revolutionizing industries and unlocking new possibilities. Join Data Reply, Confluent, and Imply as we unveil a comprehensive solution for IoT that harnesses the power of real-time insights.
Organizations are intrigued and excited by the ability to reduce costs, gain new insights and expand their data playground with Hadoop. However, when it comes time to design and execute their strategy, they face two fundamental challenges: “Where do I start?” followed by, “Now that I’ve started, how do I keep up?”
The ecosystem of Hadoop tools is constantly expanding to keep up as demands (real-time, self-service, etc.) and data growth (more sources, larger volumes) increase. Innovation is good, but added complexity, uncertainty and risk is not.
If you’re committed to realizing the benefits of Hadoop, but are taken aback by the complexities and pace of change in the Big Data landscape, watch this webcast to learn about:
Finding the right use case – Successful companies realize the fastest time to value and create a foundation for big data analytics by starting with familiar use cases such as offloading enterprise data warehouses and mainframes to Hadoop.
Exploring the landscape of Big Data tools -- Learn about common tools used in Hadoop implementations as illustrated by real-world use cases.
Shielding your organization from the complexities of Hadoop while staying current as Big Data technologies evolve – Solutions like Syncsort DMX-h allow users to visually design data transformations once and deploy them anywhere—across Hadoop MapReduce, Apache Spark, or whatever framework becomes popular next.
Some history, background and information about the building of software at LinkedIn. This presentation was delivered at the Gradle Summit 2013 so it has a Gradle focus, but covers many other types of tooling and integration.
MySQL High Availability Solutions - Feb 2015 webinar (Andrew Morgan)
How important is your data? Can you afford to lose it? What about just some of it? What would be the impact if you couldn’t access it for a minute, an hour, a day or a week?
Different applications can have very different requirements for High Availability. Some need 100% data reliability with 24x7x365 read & write access while many others are better served by a simpler approach with more modest HA ambitions.
MySQL has an array of High Availability solutions ranging from simple backups, through replication and shared storage clustering – all the way up to 99.999% available shared nothing, geographically replicated clusters. These solutions also have different ‘bonus’ features such as full InnoDB compatibility, in-memory real-time performance, linear scalability and SQL & NoSQL APIs.
The purpose of this presentation is to help you decide where your application sits in terms of HA requirements and discover which of the MySQL solutions best fit the bill. It will also cover what you need outside of the database to ensure High Availability – state of the art monitoring being a prime example.
Leading Without Managing: Becoming an SRE Technical Leader (Todd Palino)
Increasingly, technical organizations are developing career paths to build and recognize leaders outside of the traditional management roles. But what should an SRE who wants to be a leader be focusing on? Through the eyes of an engineer who reinvented his career in one of the largest SRE organizations, we will examine what technical leadership looks like, and how an individual can help guide the strategic path of a team, department, or company without taking on the role of a people manager. You'll pick up tactical work that you can start immediately to set yourself up for success, and some pointers to be able to identify the opportunities when they show up.
From Operations to Site Reliability in Five Easy Steps (Todd Palino)
Across industries, modern operations teams have noted the emergence of a new role: the Site Reliability Engineer (SRE): an IT craftsperson who fuses software engineering and operations best practices to enable highly reliable software systems. Once the domain of technology giants, this discipline is both applicable and important for any organization looking to differentiate itself in a world increasingly defined by software.
In this session, Todd Palino from LinkedIn explores how SRE evolves from Operations by taking the ‘lid-off’ SRE at LinkedIn. He’ll describe how by crafting automation, problem solving, and building a partnership with software engineering teams, companies can build a high-trust and inclusive team culture that is needed to drive continuous improvement — and importantly, have lots of fun doing it!
Code Yellow: Helping Operations Top-Heavy Teams the Smart Way (Todd Palino)
All engineering teams run into trouble from time to time. Alert fatigue, caused by technical debt or a failure to plan for growth, can quickly burn out SREs, overloading both development and operations with reactive work. Layer in the potential for communication problems between teams, and we can find ourselves in a place so troublesome we cannot easily see a path out. At times like this, our natural instinct as reliability engineers is to double down and fight through the issues. Often, however, we need to step back, assess the situation, and ask for help to put the team back on the road to success.
We will look at the process for Code Yellow, the term we use for this process of “righting the ship”, and discuss how to identify teams that are struggling. Through a look at three separate experiences, we will examine some of the root causes, what steps were taken, and how the engineering organization as a whole supports the process.
Monitoring services is easy, right? Set up a notification that goes out when a certain number increases past a certain threshold to let you know that there’s a problem. But if that’s the case, why are so many teams drowning in alerts and dreading their time on call? The reason is that we tend to monitor the wrong things: reactive alerts, metrics that we don’t completely understand how they impact our service, and capacity alerts. We look at our own view of the service and fail to consider that our customers have a different view.
Come learn to let go of what does not help, and explore how to monitor for what truly matters: what the customer sees. This starts with defining our agreements with our customers, continues through building applications intelligently and instrumenting all the things, and finishes with picking the right signals out of that instrumentation to generate alerts that are actionable, not ones that introduce confusion and noise. We will also touch on capacity planning, and how it should never wake you up. You’ll find it’s possible to assure that you meet your service level objectives while still maximizing your sleep level objectives.
URP? Excuse You! The Three Kafka Metrics You Need to Know (Todd Palino)
What do you really know about how to monitor a Kafka cluster for problems? Is your most reliable monitoring your users telling you there’s something broken? Are you capturing more metrics than the actual data being produced? Sure, we all know how to monitor disk and network, but when it comes to the state of the brokers, many of us are still unsure of which metrics we should be watching, and what their patterns mean for the state of the cluster. Kafka has hundreds of measurements, from the high-level numbers that are often meaningless to the per-partition metrics that stack up by the thousands as our data grows.
We will thoroughly explore three key monitoring concepts in the broker, that will leave you an expert in identifying problems with the least amount of pain:
Under-replicated Partitions: The mother of all metrics
Request Latencies: Why your users complain
Thread pool utilization: How could 80% be a problem?
We will also discuss the necessity of availability monitoring and how to use it to get a true picture of what your users see, before they come beating down your door!
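Under-replicated partitions can be read programmatically as well. Below is a sketch of querying the standard kafka.server ReplicaManager gauge over JMX; the broker hostname and JMX port are assumptions:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Reads the under-replicated partitions count from a broker's JMX endpoint.
public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1.example.com:9999/jmxrmi"); // assumed
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ObjectName urp = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
            Number value = (Number) conn.getAttribute(urp, "Value");
            // Anything above zero means at least one follower has fallen behind.
            System.out.println("UnderReplicatedPartitions = " + value);
        }
    }
}
```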
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng... (Todd Palino)
Across industries, modern operations teams have noted the emergence of a new role: the Site Reliability Engineer (SRE); a new IT craftsperson who fuses software engineering and operations best practices to enable highly reliable software systems. Once the domain of web-scale businesses, this discipline is both applicable and important for any organization looking to differentiate itself in a world increasingly defined by software.
In this session, Todd Palino from LinkedIn explores SRE from organizational, team and individual perspectives. He’ll describe how by crafting automation and problem solving, SRE can permeate across a technical organization – not only ensuring a massively high-performant and always available site, but used to inform optimum decision making - in everything from system procurement to application design, builds and deployment.
Todd will talk in depth about what constitutes the best in SRE in a DevOps world, using examples to examine the techniques needed to accelerate value and grow teams. Taking the ‘lid-off’ SRE at LinkedIn, join Todd as he describes how it started and continues to evolve, what goals are important, and how it’s instrumental in building a high-trust and inclusive team culture needed to drive continuous improvement -- and importantly, have lots of fun doing it!
This document discusses some of the challenges of running Kafka at scale based on LinkedIn's experience. It describes how multitenancy can cause problems when topics are automatically created without ownership. It also discusses issues with infrastructure like inefficient mirroring and a lack of auditing. Management was difficult due to the lack of tools for configuring topics across clusters and upgrading brokers. LinkedIn developed open source tools like Cruise Control and Burrow to help address some of these problems.
Apache Kafka lies at the heart of the largest data pipelines, handling trillions of messages and petabytes of data every day. Learn the right approach for getting the most out of Kafka from the experts at LinkedIn and Confluent. Todd Palino and Gwen Shapira demonstrate how to monitor, optimize, and troubleshoot performance of your data pipelines—from producer to consumer, development to production—as they explore some of the common problems that Kafka developers and administrators encounter when they take Apache Kafka from a proof of concept to production usage. Too often, systems are overprovisioned and underutilized and still have trouble meeting reasonable performance agreements.
Topics include:
- What latencies and throughputs you should expect from Kafka
- How to select hardware and size components
- What you should be monitoring
- Design patterns and antipatterns for client applications
- How to go about diagnosing performance bottlenecks
- Which configurations to examine and which ones to avoid
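Complementing the configuration item above, here is an illustrative sketch of consumer fetch settings that trade latency against throughput; the values are assumed starting points for experimentation, not the talk's recommendations:

```java
import java.util.Properties;

// Illustrative consumer fetch settings on the latency/throughput spectrum.
public class ConsumerTuningSketch {
    static Properties throughputOriented() {
        Properties p = new Properties();
        p.put("fetch.min.bytes", "65536");             // let the broker build 64 KB fetches
        p.put("fetch.max.wait.ms", "500");             // ...but wait at most 500 ms to fill
        p.put("max.poll.records", "2000");             // hand larger batches to the app
        p.put("max.partition.fetch.bytes", "2097152"); // 2 MB per partition per fetch
        return p;
    }

    static Properties latencyOriented() {
        Properties p = new Properties();
        p.put("fetch.min.bytes", "1");     // return as soon as any data is available
        p.put("fetch.max.wait.ms", "10");  // cap broker-side waiting
        return p;
    }
}
```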
This document discusses tuning Kafka for performance. It covers optimizing Zookeeper configurations like using SSDs; using RAID or JBOD for Kafka broker disks with testing showing XFS performs best; scaling Kafka clusters by considering disk capacity, network capacity, and partition counts; configuring topics for retention settings and partition balancing; and tuning Mirror Maker for network locality and producer/consumer settings.
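Topic retention, mentioned above, can also be adjusted programmatically. The following is a sketch using the Kafka AdminClient; the topic name, broker address, and three-day value are assumptions:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

// Sets a topic's retention.ms to three days via an incremental config change.
public class RetentionTuningSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "page-views");
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", String.valueOf(3L * 24 * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                    Collections.singletonMap(topic, Collections.singletonList(setRetention));
            admin.incrementalAlterConfigs(updates).all().get(); // block until applied
        }
    }
}
```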
Kafka is a publish/subscribe messaging system that, while young, forms a vital core for data flow inside many organizations, including LinkedIn. We will discuss Kafka from an Operations point of view, including the use cases for Kafka and the tools LinkedIn has been developing to improve the management of deployed clusters. We'll also talk about some of the challenges of managing a multi-tenant data service and how to avoid getting woken up at 3 AM.
NOTE: I highly recommend viewing the original PPT. It has copious speaker notes for each slide, and the animations will actually work properly.
Annual meetup of the Splunk community, where we discussed all the news presented at Splunk's annual conference, .conf24, held this past June in Las Vegas.
In this video, I go over the key points of the meetup, including:
- AI Assistant for use alongside SPL
- SPL2 for use in Data Pipelines
- Ingest Processor
- Enterprise Security 8.0 (the biggest update in its release history)
- Federated Analytics
- Integration with Cisco XDR and Cisco Talos
- And much more.
I am also leaving a few links to reports and other interesting content that can help clarify the products and features.
https://www.splunk.com/en_us/campaigns/the-hidden-costs-of-downtime.html
https://www.splunk.com/en_us/pdfs/gated/ebooks/building-a-leading-observability-practice.pdf
https://www.splunk.com/en_us/pdfs/gated/ebooks/building-a-modern-security-program.pdf
Our official Splunk user group:
https://usergroups.splunk.com/sao-paulo-splunk-user-group/
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large... (YanKing2)
Pre-trained Large Language Models (LLMs) have achieved remarkable successes in several domains. However, code-oriented LLMs are often heavy in computational complexity, which scales quadratically with the length of the input code sequence. To simplify the input program for an LLM, the state-of-the-art approach filters input code tokens based on the attention scores given by the LLM. However, the decision to simplify the input program should not rely on the attention patterns of an LLM, as these patterns are influenced by both the model architecture and the pre-training dataset. Since the model and dataset are part of the solution domain, not the problem domain where the input program belongs, the outcome may differ when the model is trained on a different dataset. We propose SlimCode, a model-agnostic code simplification solution for LLMs that depends on the nature of input code tokens. In an empirical study on LLMs including CodeBERT, CodeT5, and GPT-4 for two main tasks, code search and summarization, we found that 1) the reduction ratio of code has a roughly linear relation with the savings in training time, 2) the impact of categorized tokens on code simplification can vary significantly, 3) the impact of categorized tokens on code simplification is task-specific but model-agnostic, and 4) the above findings hold for the prompt-engineering and interactive in-context learning paradigms. This approach can reduce the cost of invoking GPT-4 by 24% per API query. Importantly, SlimCode simplifies the input code with its greedy strategy and can run up to 133 times faster than the state-of-the-art technique, with a significant improvement. This paper calls for a new direction in code-based, model-agnostic code simplification solutions to further empower LLMs.
A vernier caliper is a precision instrument used to measure dimensions with high accuracy. It can measure internal and external dimensions, as well as depths.
Here is a detailed description of its parts and how to use it.
Unblocking The Main Thread - Solving ANRs and Frozen Frames (Sinan KOZAK)
In the realm of Android development, the main thread is our stage, but too often it becomes a battleground where performance issues arise, leading to ANRs, frozen frames, and sluggish UIs. As we strive for excellence in user experience, understanding and optimizing the main thread becomes essential to prevent these common performance bottlenecks. We have strategies and best practices for keeping the main thread uncluttered. We'll examine the root causes of performance issues and techniques for monitoring and improving main thread health as well as app performance. In this talk, participants will walk away with practical knowledge on enhancing app performance by mastering the main thread. We'll share proven approaches to eliminate real-life ANRs and frozen frames to build apps that deliver a butter-smooth experience.
20CDE09- INFORMATION DESIGN
UNIT I INCEPTION OF INFORMATION DESIGN
Introduction and Definition
History of Information Design
Need of Information Design
Types of Information Design
Identifying audience
Defining the audience and their needs
Inclusivity and Visual impairment
Case study.
A brand new catalog for the 2024 edition of IWISS. We have enriched our product range and have more innovations in electrician tools, plumbing tools, wire rope tools and banding tools. Let's explore together!
OCS Training Institute is pleased to co-operate with a global provider of Rig Inspection/Audits, Commissioning, Compliance & Acceptance, as well as Engineering for Offshore Drilling Rigs, to deliver Drilling Rig Inspection Workshops (RIW) which teach the inspection & maintenance procedures required to ensure equipment integrity. Candidates learn to implement the relevant standards & understand industry requirements so that they can verify the condition of a rig's equipment & improve safety, thus reducing the number of accidents and protecting the asset.
Conservation of Taksar through Economic Regeneration (PriyankaKarn3)
This was our 9th-semester Design Studio project, introduced as the Conservation of Taksar Bazar, Bhojpur, an ancient city famous for Taksar, its coin making. Taksar Bazaar holds a settlement of about 300 years, a civilization of Newars who shifted from Patan, with huge socio-economic and cultural significance. But in the present scenario, Taksar Bazar has lost its charm and importance due to various reasons, such as migration, unemployment, the shift of economic activities to Bhojpur, and many more. The situation was so pitiful that when we went to make inventories, take the survey, and study the site, the people, and the context, we barely found any youth of our age! Many houses were vacant, and the earthquake had devastated and ruined the heritage buildings.
Conservation of those heritages, ancient marvels, and history was in dire need, so we proposed the Conservation of Taksar through economic regeneration, because the lack of economy was the main reason people were leaving the settlement and the reason for the overall decline.
Social media management system project report.pdf (Kamal Acharya)
The project "Social Media Platform in Object-Oriented Modeling" aims to design and model a robust and scalable social media platform using object-oriented modeling principles. In the age of digital communication, social media platforms have become indispensable for connecting people, sharing content, and fostering online communities. However, their complex nature requires meticulous planning and organization. This project addresses the challenge of creating a feature-rich and user-friendly social media platform by applying key object-oriented modeling concepts. It entails the identification and definition of essential objects such as "User," "Post," "Comment," and "Notification," each encapsulating specific attributes and behaviors. Relationships between these objects, such as friendships, content interactions, and notifications, are meticulously established. The project emphasizes encapsulation to maintain data integrity, inheritance for shared behaviors among objects, and polymorphism for flexible content handling. Use case diagrams depict user interactions, while sequence diagrams showcase the flow of interactions during critical scenarios. Class diagrams provide an overarching view of the system's architecture, including classes, attributes, and methods. By undertaking this project, we aim to create a modular, maintainable, and user-centric social media platform that adheres to best practices in object-oriented modeling. Such a platform will offer users a seamless and secure online social experience while facilitating future enhancements and adaptability to changing user needs.
So who am I, and why am I qualified to stand up here?
I am a member of the Data Infrastructure Streaming SRE team at LinkedIn. We’re responsible for Kafka and Zookeeper operations, as well as Samza and a couple iterations of our change capture systems.
SRE stands for Site Reliability Engineering. Many of you, like myself before I started in this role, may not be familiar with the title. SRE combines several roles that fit together into one Operations position
Foremost, we are administrators. We manage all of the systems in our area
We are also architects. We do capacity planning for our deployments, plan out our infrastructure in new datacenters, and make sure all the pieces fit together
And we are also developers. We identify tools we need, both to make our jobs easier and to keep our users happy, and we write and maintain them.
At the end of the day, our job is to keep the site running, always.
What are the things we are going to cover in this talk? I’m going to assume some basic knowledge of what Kafka is and how it works, so I won’t be covering the basics. I’ll start by describing the Kafka pipelines we have set up at LinkedIn in our multi-tenant environment. This will transition into the tier architecture that many of those pipelines use. But I’ll spend most of our time on the interesting problems that we’ve run into in running Kafka at such a large scale. We’ll wrap up talking about a couple of the things that we’re working on now, and hopefully have some time for Q&A
I won’t be going into too much detail on how Kafka works. If you do not have a basic understanding of Kafka itself, I suggest checking out some of the resources listed in the Reference slides at the end of this deck.
Here’s what a single Kafka cluster looks like at LinkedIn. I’ll get into some details on the TrackerProducer/TrackerConsumer components later, but they are internal libraries that wrap the open source Kafka producer and consumer components and integrate with our schema registry and our monitoring systems.
Every cluster has multiple Kafka brokers, storing their metadata in a Zookeeper ensemble. We have producers sending messages in, and consumers reading messages out. At present, our consumers talk to Zookeeper as well, and that works well for us. In LinkedIn’s environment, all of these components live in the same datacenter, on the same network.
What happens when you have two sites to deal with?
Now we iterate on the architecture. We add the concept of an aggregate Kafka cluster, which contains all of the messages from each of the primary datacenters’ local clusters. We also have a copy of this cluster in the secondary datacenter, C, for consumers there to access. We still have cross-datacenter traffic; that can’t be avoided if we need to move data around. But we have isolated it to one application, mirror maker, which we can monitor and ensure is working properly. This is a better situation than having each consumer worry about it for themselves.
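To make the data flow concrete, here is a minimal sketch of what mirror maker does at its core: consume from one cluster, produce to the other. This is not the actual mirror maker code, just an illustration using the standard Java clients (version 2.0 or later for the Duration-based poll); the hostnames, topic name, and group id are all hypothetical.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MiniMirror {
    public static void main(String[] args) {
        // Consume from the local cluster in the source datacenter (hypothetical host)
        Properties cProps = new Properties();
        cProps.put("bootstrap.servers", "kafka-local-a.example.com:9092");
        cProps.put("group.id", "mini-mirror");
        cProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        cProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        // Produce into the aggregate cluster (hypothetical host)
        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", "kafka-aggregate-a.example.com:9092");
        pProps.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        pProps.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(Collections.singletonList("page-views")); // hypothetical topic
            while (true) {
                // Copy every message to a topic of the same name on the target cluster
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    producer.send(new ProducerRecord<>(record.topic(), record.key(), record.value()));
                }
            }
        }
    }
}
```

The real mirror maker adds the pieces that make this safe to run in production: many consumer threads, batching, and committing consumer offsets only after the messages have actually been produced to the target cluster.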
We’ve definitely added complexity here, but it serves a purpose. By making the infrastructure a little more complex, we simplify the usage of Kafka for our customers. Producers know that if they send messages to their local cluster, the messages will show up in the appropriate places without additional work on their part. Consumers can select which view of the data they need, and have assurance that they will see everything that is produced. The intricacies of how the data gets moved around are left to people like me, who run the Kafka infrastructure itself.
We’ve chosen to keep all of our clients local to the clusters and use a tiered architecture due to several major concerns.
The primary concern is around the networking itself. Kafka enables multiple consumers to read the same topic, which means if we are reading remotely, we are copying messages over expensive inter-datacenter connections multiple times. We also have to handle problems like network partitioning in every client. Granted, you can have a partition even within a single datacenter, but it happens much more frequently when you are dealing with large distances. There’s also the concern of latency in connections – distance increases latency. Latency can cause interesting problems in client applications, and I like life to be boring.
There are also security concerns around talking across datacenters. If we keep all of our clients local, we do not have to worry about ACL problems between the clients and the brokers (and Zookeeper as well). We can also deal with the problem of encrypting data in transit much more easily. This is one problem we have not worried about as much, but it is becoming a big concern now.
The last concern is over resource usage. Everything at LinkedIn talks to Kafka, and a problem that takes out a production cluster is a major event. It could mean we have to shift traffic out of the datacenter until we resolve it, or it could result in inconsistent behavior in applications. Any application could overwhelm a cluster, but there are some, such as applications that run in Hadoop, that are more prone to this. By keeping those clients talking to a cluster that is separate from the front end, we mitigate resource contention.
More components means more places to poke and prod to get the most efficiency out of our system. With multiple tiers, most of this revolves around making sure everything is sized correctly.
This is as good a time as any for a little self-promotion. Many of the questions around how to set up and lay out Kafka clusters, including specific performance concerns and tuning, are covered in this fine book that I am co-authoring. You’ll also find a trove of information about client development, stream processing, and a variety of use cases for Kafka.
We currently have 4 chapters complete, and it’s available from O’Reilly under their early access program. We expect to have the book completed late this year, or early next, with chapters being released as soon as we can write them.
Many of us use Kafka for monitoring applications. At LinkedIn, every application and server writes metrics and logs to Kafka. We have central applications that read out these metrics and provide pretty graphs, thresholding, and other tools to make sure that everything is running properly within LinkedIn. Kafka itself is no exception, which leads to this…
As soon as I say “monitoring Kafka with Kafka”, we know this is not a good thing.
For the broker, what are the critical metrics that I’m keeping an eye on every day?
Bytes in, bytes out, and messages in are all critical metrics for us from a growth point of view. While we don’t alert on these, we do keep an eye on them because they help us to understand how the usage of the cluster is growing over time, and they let us plan for the next expansion. You may ask why I don’t have messages out on this list. It’s because there is no messages out metric. Kafka consumers read batches of messages, not single messages, and it’s not easy for Kafka to count messages on the outbound side. There’s a metric on the number of fetches, but it’s less interesting to me.
For partitions, we start with the number of partitions per broker, and the number of leader partitions per broker. As we know, a single broker is responsible for leadership of any given partition. In a healthy cluster, I want to make sure that each broker has approximately the same number of partitions, and that each broker is leading about 50% of those, because we have a replication factor of 2 for most things: every partition lives on two brokers, and only one of the two is the leader. We can also see this reflected in the bytes rates, because if the partitions are imbalanced, the bytes rates will be as well. That uneven load can cause a lot of problems.
More importantly, though, we monitor the number of under-replicated partitions that each broker is reporting. I’m going to get into this in much more detail in a few slides, but this is the number of partitions that the broker leads where at least one of the replicas has fallen behind. This is the single most important metric to monitor and alert on: it indicates a number of problems, and a single alert here will provide coverage of most Kafka issues.
Lastly, there are metrics on the thread pool usage, both network and request pools, as well as rate and time metrics on the different types of requests. These are all examples of metrics that are good to have but difficult to alert on. If you are able to establish a good baseline on some of the request time metrics, however, I do recommend alerting on them, as rising request times can indicate a problem that is building up, and you may be able to catch it before it turns into under-replicated partitions.
Buried in the middle there is the “max dirty percent” metric. This measures how much of the log that is eligible for compaction has not yet been compacted. Right now, this is the only way to monitor the health of log compaction within Kafka, which is critical for the consumer offsets topic at the very least. If the thread doing log compaction dies (which it can do frequently), the only way you will know is by this metric rising and staying high. Normal behavior is for the metric to spike up and immediately drop back down.
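All of these broker metrics are exposed over JMX, so they are straightforward to pull into whatever monitoring system you use. Below is a minimal sketch that reads a few of the metrics just discussed from a single broker. The host and JMX port are hypothetical, and the MBean names are the ones I believe recent brokers register, so verify them against your own deployment.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerHealthCheck {
    public static void main(String[] args) throws Exception {
        // Assumes the broker was started with remote JMX enabled on port 9999 (hypothetical)
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://kafka-broker1.example.com:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // Growth metric: a meter with Count plus one/five/fifteen-minute rates
            ObjectName bytesIn = new ObjectName(
                    "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec");
            System.out.println("Bytes in (1m rate): "
                    + mbsc.getAttribute(bytesIn, "OneMinuteRate"));

            // The single most important alerting metric: zero in a healthy cluster
            ObjectName urp = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
            System.out.println("Under-replicated partitions: "
                    + mbsc.getAttribute(urp, "Value"));

            // Log compaction health: should spike and drop, not stay high
            ObjectName dirty = new ObjectName(
                    "kafka.log:type=LogCleanerManager,name=max-dirty-percent");
            System.out.println("Max dirty percent: "
                    + mbsc.getAttribute(dirty, "Value"));
        }
    }
}
```

In practice you would feed these values into your graphing and alerting systems rather than printing them, and alert whenever the under-replicated partition count stays above zero for more than a brief window.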
There are a number of things that can be improved upon, both in the brokers and in the mirror maker, to make it easier to set up and manage multiple datacenters.
Another big problem is that we are using RAID and providing a single mount point to the Kafka brokers as the log dir. This is because of issues with the way JBOD is handled in the broker. Specifically, the brokers assign partitions to log dirs round robin, without taking current sizes into account. In addition, there are no administrative functions to move partitions from one directory to another. And if a single disk fails, the entire broker fails. If JBOD support were more robust, we could have replication factors of 3 or 4 without an increase in hardware cost, which would allow us to have “no data loss” configurations.
The big improvement to mirror maker is the creation of an identity mirror maker, which would keep message batches together in exactly the same partition from the source to the target cluster. This would completely eliminate the compression overhead in the mirror maker, making it much faster and more efficient. Of course, this requires keeping the partition counts in the two clusters in sync, and allowing the mirror maker to increase partition counts in a target cluster if needed.
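In terms of the mini-mirror sketch from earlier, identity mirroring amounts to a one-line change to the send call: instead of letting the producer choose a partition from the message key, each record is sent to the same partition number it was consumed from. Again, this is just an illustration, and it only works if the source and target topics have identical partition counts; the batch-level passthrough that actually avoids recompression would need deeper changes than this client-level sketch.

```java
// Replaces the send call in the earlier sketch: preserve the source partition
// explicitly (requires identical partition counts on source and target topics)
producer.send(new ProducerRecord<>(record.topic(), record.partition(),
        record.key(), record.value()));
```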
That leads into the idea of multi-cluster management. While there are a couple of people making some headway on this in the open source world, we still lack a solid interface for managing Kafka clusters as part of an overall infrastructure. This would include maintaining topic configurations across multiple clusters and easily configuring and visualizing the mirror maker links between them.
Another piece needed is better client monitoring overall. Burrow provides us with a good view of what the consumers are doing, but there’s nothing available yet for producer client monitoring. We, of course, have our internal audit system for this. And other companies have their own versions as well. It would be nice to have an open source solution that anyone can use for assuring that the producers are working properly.
We could also use better end-to-end monitoring of our Kafka clusters, so we can know that they are available. We have a lot of metrics that can track information about the individual components, but without a client view of the cluster, we don’t know if the cluster is actually available. We also have a hard time making sure that the entire pipeline is working properly. There’s not a lot available for this right now, but watch this space…
So how can you get more involved in the Kafka community?
The most obvious answer is to go to kafka.apache.org. From there you can:
Join the mailing lists, either on the development or the user side.
You’ll find people on the #apache-kafka channel on Freenode IRC if you have questions.
We also coordinate meetups for both Kafka and Samza in the Bay Area, which are streamed if you are not local.
You can also dive into the source repository, and work on and contribute your own tools back.
Kafka may be young, but it’s a critical piece of data infrastructure for many of us.