Comcast is one of the largest cable and telecommunications providers in the country, built on decades of mergers, acquisitions, and subscriber growth. The success of our company depends on keeping our customers happy and on how quickly we can pivot with changing trends and new technologies. Data abounds within our internal data centers and edge networks, as well as in private and public clouds across multiple vendors. Within such an environment, and given such challenges, how do we build AI, machine learning, and data science platforms so our company can respond to the market, predict our customers’ needs, and create new revenue-generating products that delight our customers? If you don’t happen to be our friends and colleagues at Google, Facebook, and Amazon, what technologies, strategies, and toolkits can you employ to bring together disparate data sets and quickly get them into the hands of your data scientists, and then into your own production systems for use by your customers and business partners? We’ll explore our journey and evolution, look at specific technologies and decisions that have gotten us to where we are today, and demo how our platform works. Speakers: Ray Harrison, Enterprise Architect, Comcast; Prashant Khanolkar, Principal Architect, Big Data, Comcast
DevOps, DevOps, DevOps: From Dev and Ops to DevOps, presented at FrOSCon in Sankt Augustin (Germany) on August 21st, 2011.
Deploying machine learning models from training to production requires companies to deal with the complexity of moving workloads through different pipelines and rewriting code from scratch.
This document discusses OpenLineage and the Marquez project for collecting metadata and data lineage information from data pipelines. It describes how OpenLineage defines a standard model and protocol for instrumentation to collect metadata on jobs, datasets, and runs in a consistent way. This metadata can then provide context on a dataset's source, schema, owners, usage, and changes. The document outlines how Marquez implements the OpenLineage standard by defining entities, relationships, and facets to store this metadata and enable use cases like data governance, discovery, and debugging. It also positions Marquez as a centralized but modular framework for integrating various data platforms and extensions like Datakin's lineage analysis tools.
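For a concrete flavor of the protocol, here is a minimal hedged sketch of posting an OpenLineage run event to a Marquez server from Python; the endpoint URL, spec version, namespace, and job/dataset names are illustrative assumptions, not details from the document.

```python
# Minimal sketch: emitting an OpenLineage run event to a Marquez server.
# The endpoint URL, spec version URL, namespace, and job/dataset names
# are illustrative assumptions.
import uuid
from datetime import datetime, timezone

import requests

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my-namespace", "name": "daily_orders_etl"},
    "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
    "outputs": [{"namespace": "warehouse", "name": "analytics.daily_orders"}],
    "producer": "https://example.com/my-pipeline/v1",
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
}

# Marquez exposes an OpenLineage-compatible ingestion endpoint.
resp = requests.post("http://localhost:5000/api/v1/lineage", json=event)
resp.raise_for_status()
```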
Join us for a live webinar on December 13th to learn why you can’t have effective DevOps without Value Stream Management. While DevOps provides capabilities that improve a business value stream through the implementation of culture, toolchains, orchestration, and automation, DevOps alone, without Value Stream Management, is not sufficient to realize business benefits. Don’t spend the time and money on DevOps alone and fail to reap the rewards for the business! Attend this webinar to hear Marc Hornbeek of Trace3 and Jeff Keyes of Plutora discuss how you can leverage all of the data from your DevOps toolchains to provide real-time analytics and codify the policies that must be orchestrated to realize the benefits of a business value stream.
Migrate any server workload to any target destination with the OpenText Migrate cloud migration platform. Learn about common migration challenges and how to choose the right cloud migration tool.
This document provides an overview of a presentation comparing Apache Flink and Apache Spark. The presentation aims to address marketing claims, confusing statements, and outdated information regarding Flink vs. Spark. It outlines key criteria for evaluating the two platforms, such as streaming capabilities, state management, and scalability, and then directly compares the two on some of these criteria, such as support for iterative processing and their streaming engines. The presenter hopes this evaluation framework will help others assess Flink and Spark for stream processing use cases.
Modern enterprises increasingly rely on software to keep the lights on and lay the foundations for long-term sustainable growth. Among many things, IT leaders are tasked with accelerating the time to value of their software delivery value streams. But when asked, “Do you know what is slowing your software delivery teams down?”, why do IT leaders typically not know the answer? Methodologies such as Agile and DevOps have been adopted to accelerate the time between build and deploy, yet the benefits are often only felt at a localized level (more sprints completed, a higher number of deployments, etc.) without a tangible link to business outcomes. Enter Value Stream Architecture. During this webinar, Senior Value Stream Architect Dan Feminella presents: - The business case for Value Stream Architecture - Why your organization needs it in order to scale Agile and DevOps - How to architect for end-to-end flow of business value, from customer request to delivery and back through the customer feedback loop
Understand the concept of DevOps by employing the DevOps Strategy Roadmap Lifecycle PowerPoint Presentation Slides Complete Deck. Describe how DevOps is different from traditional IT with these content-ready PPT themes. The slides also help to discuss DevOps use cases in the business, the roadmap, and its lifecycle. Explain the roles, responsibilities, and skills of DevOps engineers by utilizing this visually appealing slide deck. Demonstrate the DevOps roadmap for implementation in the organization with the help of a thoroughly researched PPT slideshow. Describe the characteristics of cloud computing, its benefits, and risks with the aid of this PPT layout. Utilize this easy-to-use DevOps transformation strategy PowerPoint slide deck to showcase the difference between cloud and traditional data centers. This ready-to-use PowerPoint layout also discusses the roadmap to integrate cloud computing in business. Highlight the usages of cloud computing and deployment models with the help of visually attention-grabbing DevOps implementation roadmap PowerPoint slides. https://bit.ly/3eFxYYr
Recently, Dr. Qingsong Zhang spoke at a Meetup about how Walmart is using DevOps. Within this slide deck, you'll learn about our DataOps, DevOps, and OneOps, an application lifecycle management (ALM) and open-source DevOps platform for the cloud developed by Walmart Labs. Feel free to follow us on Twitter: @one_ops! Contribute to OneOps: www.oneops.com
Speaker: Yupeng Fu, Staff Engineer, Uber. High availability and reliability are important requirements for Uber services, and those services must tolerate datacenter failures in a region and fail over to another region. In this talk, we will present the active-active Apache Kafka® deployment at Uber and how it facilitates disaster recovery across regions for Uber services. In particular, we will highlight the key components, including topic replication, topic aggregation, and offset sync, and then walk through several use cases of the disaster recovery strategy using active-active Kafka. Lastly, we will present several interesting challenges and the future work planned. Yupeng Fu is a staff engineer in Uber's Data Org leading the streaming data platform. Previously, he worked at Alluxio and Palantir, building distributed data analysis and storage platforms. Yupeng holds a B.S. and an M.S. from Tsinghua University and did his Ph.D. research on databases at UCSD.
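Uber's offset-sync and aggregation components are internal, but the consumer-side failover idea can be sketched generically. A hedged Python illustration using the confluent-kafka client follows; the broker addresses, topic, and group id are invented, and this does not reproduce Uber's actual implementation.

```python
# Illustrative sketch only: fail a consumer over from one regional Kafka
# cluster to another. Broker addresses, topic, and group id are invented;
# Uber's offset-sync and aggregation services are not shown.
from confluent_kafka import Consumer, KafkaException

REGIONS = ["kafka-us-east.example.com:9092", "kafka-us-west.example.com:9092"]

def consume_with_failover(topic: str, group_id: str) -> None:
    for bootstrap in REGIONS:  # try each regional cluster in order
        consumer = Consumer({
            "bootstrap.servers": bootstrap,
            "group.id": group_id,
            "auto.offset.reset": "earliest",
        })
        consumer.subscribe([topic])
        try:
            while True:
                msg = consumer.poll(timeout=1.0)
                if msg is None:
                    continue
                if msg.error():
                    raise KafkaException(msg.error())
                print(f"{bootstrap}: {msg.value()!r}")
        except KafkaException:
            print(f"lost {bootstrap}, failing over to the next region")
        finally:
            consumer.close()

consume_with_failover("rider-events", "trip-analytics")
```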
An introduction to Enterprise Architecture using a simpler, modernized, and realistic model (CSVLOD). Target audience: 1) tech leaders new to Enterprise Architecture; 2) Enterprise Architects; 3) CIO, CTO, CDO, EPMO, IT PMO.
Event streaming applications unlock new benefits by combining various data feeds. However, getting actionable insights in a timely fashion has remained a challenge, as the data has been siloed in disparate systems. ksqlDB solves this by providing an interactive SQL interface that can seamlessly combine and transform data from various sources. In this webinar, we will show how streaming queries of high-throughput NoSQL systems can derive insights from various push/pull queries via ksqlDB's User-Defined Functions, Aggregate Functions, and Table Functions.
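To give a flavor of the interactive SQL interface, here is a hedged Python sketch that submits a stream-creating statement to ksqlDB's REST API; the server URL, topic, and column names are assumptions for illustration.

```python
# Sketch: create a derived stream over a Kafka topic via ksqlDB's REST
# API. Server URL, topic, and column names are illustrative assumptions.
import requests

KSQLDB = "http://localhost:8088"

statement = """
  CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR)
    WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');
"""

resp = requests.post(
    f"{KSQLDB}/ksql",
    headers={"Content-Type": "application/vnd.ksql.v1+json"},
    json={"ksql": statement, "streamsProperties": {}},
)
resp.raise_for_status()
print(resp.json())
```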
Aljoscha Krettek is the PMC chair of Apache Flink and Apache Beam, and a co-founder of data Artisans. Apache Flink is an open-source platform for distributed stream and batch data processing. It allows for stateful computations over data streams, both in real time and historically. Flink supports batch and stream processing through APIs like DataSet and DataStream. data Artisans originated Flink and provides an application platform, powered by Flink and Kubernetes, for building stateful stream processing applications.
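For readers new to the DataStream API, a minimal word-count sketch in PyFlink follows; the input values are invented and this is only a bare-bones illustration of the API.

```python
# Minimal PyFlink DataStream example: count words in a small in-memory
# stream. Input values are invented for illustration.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

words = env.from_collection(["flink", "beam", "flink", "kafka", "flink"])

counts = (
    words
    .map(lambda w: (w, 1))                         # (word, 1) pairs
    .key_by(lambda pair: pair[0])                  # group by the word
    .reduce(lambda a, b: (a[0], a[1] + b[1]))      # running count per key
)

counts.print()
env.execute("word-count-sketch")
```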
This document provides an overview of developing Java streaming applications with Apache Storm. It discusses what Storm is, its conceptual model including tuples, streams, spouts, bolts and topologies. It demonstrates developing a word count topology with code examples. It also covers Storm's runtime architecture, additional features like reliability, and integrating Storm with technologies like Kafka and HBase.
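Storm topologies are typically written in Java; purely to illustrate the tuple flow of the classic word count (a spout emits sentences, a split bolt emits words, a count bolt keeps running totals), here is a plain-Python simulation of that dataflow. It deliberately does not use Storm's actual APIs.

```python
# Plain-Python simulation of the classic Storm word-count dataflow
# (spout -> split bolt -> count bolt). Illustrates the tuple flow only;
# it does not use Storm's Java or multilang APIs.
from collections import Counter

def sentence_spout():
    """Acts like a spout: emits a stream of sentence tuples."""
    for sentence in ["the cow jumped over the moon",
                     "the man went to the store"]:
        yield sentence

def split_bolt(sentences):
    """Acts like a bolt: splits each sentence tuple into word tuples."""
    for sentence in sentences:
        yield from sentence.split()

def count_bolt(words):
    """Acts like a stateful bolt: keeps a running count per word."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]

for word, count in count_bolt(split_bolt(sentence_spout())):
    print(word, count)
```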
Telegraf is the open source server agent used to collect metrics from your stacks, sensors, and systems. It is InfluxDB’s native data collector and supports more than 250 inputs and outputs. Learn how to send data from a variety of systems, apps, databases, and services in the appropriate format to InfluxDB, and discover tips and tricks on how to write your own plugins. Join this webinar as Jessica Ingrassellino and Samantha Wang dive into: - The types of Telegraf plugins (input, output, aggregator, and processor) - Specific plugins, including the Execd input plugins and the Starlark processor plugin - How to create your own Telegraf plugin
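The Execd input plugin mentioned above runs any external program that writes InfluxDB line protocol to stdout, which makes a small script the quickest way to prototype a custom input. A hedged sketch follows; the measurement, tag, and field names are invented, and it assumes Telegraf is configured with an [[inputs.execd]] section pointing at the script.

```python
#!/usr/bin/env python3
# Sketch of a Telegraf execd input plugin: an external program that
# emits InfluxDB line protocol on stdout. Measurement, tag, and field
# names are invented; assumes [[inputs.execd]] is configured to run it.
import random
import sys
import time

while True:
    value = random.uniform(0.0, 100.0)  # stand-in for a real reading
    # Line protocol: measurement,tag_key=tag_val field_key=field_val
    print(f"demo_sensor,host=example-host temperature={value:.2f}")
    sys.stdout.flush()  # Telegraf reads metrics as lines arrive
    time.sleep(10)
```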
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput. This talk provides a comprehensive overview of Kafka architecture and internal functions, including: -Topics, partitions and segments -The commit log and streams -Brokers and broker replication -Producer basics -Consumers, consumer groups and offsets This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
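To make the producer and consumer concepts concrete, here is a minimal hedged sketch using the confluent-kafka Python client; the broker address, topic, and group id are invented for illustration.

```python
# Minimal producer/consumer sketch with confluent-kafka. Broker address,
# topic, and group id are invented for illustration.
from confluent_kafka import Consumer, Producer

# Producer: messages with the same key land in the same partition,
# preserving per-key ordering in the commit log.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("page-views", key="user-42", value="/pricing")
producer.flush()

# Consumer: joins a consumer group; the group's committed offsets track
# how far each partition has been read.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "analytics",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["page-views"])
msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(msg.topic(), msg.partition(), msg.offset(), msg.value())
consumer.close()
```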
This document provides an overview and introduction to Apache Flink, a stream-based big data processing engine. It discusses the evolution of big data frameworks to platforms and the shortcomings of Spark's RDD abstraction for streaming workloads. The document then introduces Flink, covering its history, key differences from Spark like its use of streaming as the core abstraction, and examples of using Flink for batch and stream processing.
Comcast's Streaming Data platform comprises a variety of ingest, transformation, and storage services in the public cloud. Peer-reviewed Apache Avro schemas support end-to-end data governance. We have previously reported (DataWorks Summit 2017) on how we extended Atlas with custom entity and process types for discovery and lineage in the AWS public cloud. Custom lambda functions notify Atlas of the creation of new entities and new lineage links via asynchronous Kafka messaging. Recently we were presented with the challenge of providing integrated data discovery and lineage across our public cloud data sources and on-prem data sources, both Hadoop-based and traditional data warehouses and RDBMSs. Can Apache Atlas meet this challenge? A resounding yes! This talk will present our federated architecture, with Atlas providing SQL-like, free-text, and graph search across select metadata from all on-prem and public cloud data sources in our purview. Lightweight, custom connectors/bridges identify metadata/lineage changes in underlying sources and publish them to Atlas via the asynchronous API. A portal layer provides Atlas query access and a federation of UIs. Once data of interest is identified via Atlas queries, interfaces specific to underlying sources may be used for special-purpose metadata mining. While metadata repositories for data discovery and lineage abound, none of them have built-in connectors and listeners for the entire complement of data sources that Comcast and many other large enterprises use to support their business needs. In-house-built solutions typically underestimate the cost of development and maintenance and often suffer from architecture-by-accretion. Atlas' commitment to extensibility, its built-in provision of typed, free-text, and graph search, and its REST and asynchronous APIs position it uniquely in the build-vs-buy sweet spot.
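As a flavor of what such a lightweight connector might do, here is a hedged Python sketch that registers a new entity through Atlas's v2 REST API; the custom type name, attribute values, and credentials are invented, and the architecture described above publishes through the asynchronous Kafka API rather than this synchronous call.

```python
# Illustrative sketch of a lightweight connector registering a new
# entity with Apache Atlas via its v2 REST API. Type name, attribute
# values, and credentials are invented; the talk's architecture uses
# Atlas's asynchronous Kafka API instead.
import requests

ATLAS = "http://atlas.example.com:21000"

entity = {
    "entity": {
        "typeName": "aws_s3_object",  # hypothetical custom entity type
        "attributes": {
            "qualifiedName": "s3://example-bucket/events/2019/01/01@aws",
            "name": "events/2019/01/01",
        },
    }
}

resp = requests.post(
    f"{ATLAS}/api/atlas/v2/entity",
    json=entity,
    auth=("admin", "admin"),  # placeholder credentials
)
resp.raise_for_status()
```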
This document introduces several big data technologies that are less well known than traditional solutions like Hadoop and Spark. It discusses Apache Flink for stream processing, Apache Samza for processing real-time data from Kafka, Google Cloud Dataflow, which provides a managed service for batch and stream data processing, and StreamSets Data Collector for collecting and processing data in real time. It also covers machine learning technologies like TensorFlow for building dataflow graphs, and cognitive computing services from Microsoft. The document encourages readers to think beyond traditional stacks and learn from companies building pipelines at scale.
As part of this session, I will be giving an introduction to Data Engineering and Big Data. It covers up-to-date trends.
* Introduction to Data Engineering
* Role of Big Data in Data Engineering
* Key Skills related to Data Engineering
* Overview of Data Engineering Certifications
* Free Content and ITVersity Paid Resources
Don't worry if you miss the video - you can click on the link below to go through the video after the schedule. https://youtu.be/dj565kgP1Ss
* Upcoming Live Session - Overview of Big Data Certifications (Spark Based) - https://www.meetup.com/itversityin/events/271739702/
Relevant Playlists:
* Apache Spark using Python for Certifications - https://www.youtube.com/playlist?list=PLf0swTFhTI8rMmW7GZv1-z4iu_-TAv3bi
* Free Data Engineering Bootcamp - https://www.youtube.com/playlist?list=PLf0swTFhTI8pBe2Vr2neQV7shh9Rus8rl
* Join our Meetup group - https://www.meetup.com/itversityin/
* Enroll for our labs - https://labs.itversity.com/plans
* Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
* Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
* Lab and Content Support using Slack
In the new era of digitalization, there is an ever-growing need for design and production processes capable of increasing system quality and reducing risks and the chance of errors while, at the same time, reducing overall production costs. Nowadays, more and more systems design scenarios span a high number of domains. However, the underlying tool landscape is still dominated by closed ecosystems, leaving design data in separate silos. To deal effectively with novel, massively diverse yet interconnected engineering scenarios, while also considering industrial sustainability and the well-being of the future digital society, we have to propose new ways to look at the digital thread: supporting every phase of a digital engineering lifecycle while turning siloed multi-domain engineering data into a holistic, accessible, and globally analyzable digital thread.
The document discusses challenges in maintaining consistency, completeness, and correctness (3C) across disconnected engineering data silos. It proposes using links and transformations to connect models between systems engineering and electrical design tools. Validation rules can then check that connections and components are properly mapped between the silos. The IncQuery Validator was used to import a model from E3.GENESYS into its knowledge graph and generate a validation report checking for 3C issues. Tracking link management and validation results over time provides visibility into the progress of the "digital thread" across the engineering lifecycle.
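The mapping-validation idea generalizes beyond any single tool. As a hedged toy illustration (the data structures are invented and do not reflect the IncQuery Validator's API), a completeness check between two silos can be as simple as a set difference:

```python
# Toy illustration of a cross-silo completeness check: every connection
# in the systems-engineering model should map to a counterpart in the
# electrical-design model. Data structures are invented and do not
# reflect the IncQuery Validator's actual API.
sysml_connections = {"power_bus", "can_bus", "sensor_feed"}
electrical_nets = {"power_bus", "can_bus"}

unmapped = sysml_connections - electrical_nets
for connection in sorted(unmapped):
    print(f"3C violation: '{connection}' has no electrical counterpart")
```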
Splice Machine is an ANSI SQL relational database management system (RDBMS) on Apache Spark. It has proven low-latency transactional processing (OLTP) as well as analytical processing (OLAP) at petabyte scale. It uses Spark for all analytical computations and leverages HBase for persistence. This talk highlights a new Native Spark Datasource, which enables seamless data movement between Spark DataFrames and Splice Machine tables without serialization and deserialization. This Spark Datasource makes machine learning libraries such as MLlib native to the Splice RDBMS. Splice Machine has now integrated MLflow into its data platform, creating a flexible Data Science Workbench with an RDBMS at its core. The transactional capabilities of Splice Machine, integrated with the plethora of DataFrame-compatible libraries and MLflow's capabilities, manage a complete, real-time workflow of data-to-insights-to-action. In this presentation we will demonstrate Splice Machine's Data Science Workbench and how it leverages Spark and MLflow to create powerful, full-cycle machine learning capabilities on an integrated platform, from transactional updates to data wrangling, experimentation, and deployment, and back again.
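Splice Machine's workbench and Native Spark Datasource APIs are its own, but the MLflow side rests on the standard tracking API. A hedged sketch of the experiment-logging step follows; parameter and metric values are invented, and Splice-specific calls are not shown.

```python
# Sketch of the standard MLflow tracking calls a workbench like this
# builds on. Parameter and metric values are invented; Splice Machine's
# own datasource and deployment APIs are not shown.
import mlflow

mlflow.set_experiment("churn-model")

with mlflow.start_run():
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("learning_rate", 0.1)
    # ... train a model here ...
    mlflow.log_metric("auc", 0.87)
```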
This document provides an overview of big data and the Spark framework. It discusses the big data ecosystem, including file systems, data ingestion tools, batch and real-time data processing frameworks, visualization tools, and support technologies. It outlines common big data job roles and their associated skills. The document then focuses on Spark, describing its core functionality, modules like DataFrames and MLlib, and execution modes. It provides guidance on learning Spark, emphasizing programming skills and Spark APIs. A demo of Spark fundamentals on a big data lab is also proposed.
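As a taste of the proposed demo, here is a minimal hedged PySpark DataFrame example; the sample rows are invented.

```python
# Minimal PySpark DataFrame example: group and count a small in-memory
# dataset. Sample rows are invented for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-fundamentals-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", "click"), ("bob", "view"), ("alice", "view")],
    ["user", "event"],
)

df.groupBy("event").count().show()
spark.stop()
```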