Scylla Summit 2017: How Baidu Runs Scylla on a Petabyte-Level Big Data Platform

In this presentation, I'll discuss the benefits of running Scylla in our Big Data environment, which stores over 500TB of data, and of using Scylla as the indexing engine that replaces MongoDB and Cassandra in our log data analysis platform.

PRESENTATION TITLE ON ONE LINE
AND ON TWO LINES
First and last name
Position, company
How to analyze logs instantly

Log analysis system
• Source
  • Access Log
  • Syslog, etc.
• Volume
  • ~500TB per day
  • text/pb/gzip
• Goal
  • Build Index per day
  • Quick Search
• Value
  • APT
  • Security Analysis

Index building architecture

Three-level index
The 2nd-level index is about 8%-10% of the source log size, ~50TB
The 1st-level index and 3rd-level index are about 0.4% of the source log size, ~2TB
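
As a rough check on these figures, the ratios can be applied to the ~500TB daily log volume quoted on the log analysis slide. The sketch below is illustrative arithmetic only; the function name `index_size_tb` and the constant `DAILY_LOG_TB` are assumptions introduced here, not part of the talk.

```python
def index_size_tb(source_tb: float, ratio_percent: float) -> float:
    """Return the index size in TB for a given percentage of the source log size."""
    return source_tb * ratio_percent / 100.0


DAILY_LOG_TB = 500  # "~500TB per day" from the log analysis slide

second_level_low = index_size_tb(DAILY_LOG_TB, 8)    # lower bound of the 8%-10% range
second_level_high = index_size_tb(DAILY_LOG_TB, 10)  # upper bound of the 8%-10% range
first_and_third = index_size_tb(DAILY_LOG_TB, 0.4)   # the 0.4% figure from the slide

print(f"2nd-level index: {second_level_low:.0f}-{second_level_high:.0f} TB per day")
print(f"1st- and 3rd-level index: {first_and_third:.0f} TB per day")
```

The upper bound of the 2nd-level range matches the ~50TB on the slide, and 0.4% of 500TB gives the ~2TB figure.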

Search architecture

Log index storage
Storage System Expectation
✓ KV Storage
✓ High-Speed Read
✓ High-Speed Write
✓ Always Available
✓ Easy Maintenance

Why Scylla?

Separate R/W for indexing
[Diagram labels: Index building service, Bulk Write, offline, replication, online, Real-time, Analytics]

Perf of Scylla on HDD

Expect secondary index

The plan in the future
PLAN-1: Replace Redis with Scylla
Try to replace Redis, which is used as a cache service, with Scylla, given that Scylla could supply higher performance and is more convenient in terms of maintenance and scalability.
[Diagram labels: DC1, DC2, DC3]
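
One way a Redis-style cache could be modeled in Scylla is a simple key-value table with a default TTL, replicated across the three data centers shown on the slide. The sketch below only builds the CQL text and does not connect to a cluster. The keyspace name `cache`, the table name `kv`, the column names, the replication factor of 3 per data center, and the one-hour TTL are all hypothetical values chosen for illustration, not details from the talk.

```python
def cache_schema_cql(keyspace: str, replicas_per_dc: dict, ttl_seconds: int) -> list:
    """Build CQL statements for a key-value cache table with a default TTL.

    All names and values are illustrative assumptions, not taken from the talk.
    """
    # NetworkTopologyStrategy takes one replication factor per data center.
    dc_options = ", ".join(f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items())
    create_keyspace = (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {dc_options}}};"
    )
    # default_time_to_live expires rows automatically, similar to a cache TTL.
    create_table = (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.kv ("
        "cache_key text PRIMARY KEY, cache_value blob"
        f") WITH default_time_to_live = {ttl_seconds};"
    )
    return [create_keyspace, create_table]


# Hypothetical values: three replicas in each of DC1, DC2, DC3 and a 1-hour TTL.
statements = cache_schema_cql("cache", {"DC1": 3, "DC2": 3, "DC3": 3}, 3600)
for statement in statements:
    print(statement)
```

The generated statements could then be run with any CQL client; the data center names would have to match the names the cluster actually reports.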

The plan in the future
PLAN-2: Integrate Scylla into OLAP data warehouse
Combine Scylla with the OLAP data warehouse, because Scylla delivers excellent query throughput.
[Diagram labels: OLAP, Spark, SQL Query, CQL Query]

THANKS

What's hot

Scylla Summit 2017: A Toolbox for Understanding Scylla in the Field

ScyllaDB

In this talk, we will share useful tools and techniques that we are using in the field to understand Scylla clusters. Users will learn how to use those same tools to better understand their deployment. Some of the questions that will be answered are: - how to find out which queries are the slowest and why - how we go about understanding the impact of the data model in a node's performance - how to check which resources are the bottlenecks in the cluster

Scylla Summit 2017: Performance Evaluation of Scylla as a Database Backend fo...

ScyllaDB

JanusGraph, a highly scalable graph database solution, supports historically Cassandra and HBase as database backends. We decided to put Scylla in the mix, certainly searching for the best performing backend. We ran test scenarios that cover high volume reads and writes. In this talk, we will show you the performance results of Scylla vs others and also share our lessons learned during the performance evaluation.

If You Care About Performance, Use User Defined Types

ScyllaDB

Shlomi Livne, VP of R&D at ScyllaDB, presented on the performance benefits of using user-defined types (UDTs) in ScyllaDB. He explained that with traditional columns, each column has overhead and flexibility comes at a price. However, with frozen UDTs, the columns are treated as a single unit, sharing metadata and improving performance. Livne showed results of a test where UDTs with many fields outperformed traditional columns with the same number of fields. However, he noted that Scylla's row cache and Java driver performance need improvement for UDTs.

Scylla Summit 2017: Migrating to Scylla From Cassandra and Others With No Dow...

ScyllaDB

Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the W...

ScyllaDB

Scylla Summit 2017: SMF: The Fastest RPC in the West

ScyllaDB

Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

ScyllaDB

Scylla Summit 2017: Scylla on Samsung NVMe Z-SSDs

ScyllaDB

Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...

ScyllaDB

Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...

ScyllaDB

Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL

ScyllaDB

Millions of Regions in HBase: Size Matters

DataWorks Summit

Scylla Compaction Strategies

Nadav Har'El

MyRocks Deep Dive

Yoshinori Matsunobu

Achieving HBase Multi-Tenancy with RegionServer Groups and Favored Nodes

DataWorks Summit

Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

The Hive

This presentation describes the reasons why Facebook decided to build yet another key-value store, the vision and architecture of RocksDB and how it differs from other open source key-value stores. Dhruba describes some of the salient features in RocksDB that are needed for supporting embedded-storage deployments. He explains typical workloads that could be the primary use-cases for RocksDB. He also lays out the roadmap to make RocksDB the key-value store of choice for highly-multi-core processors and RAM-speed storage devices.

DAT402 - Deep Dive on Amazon Aurora PostgreSQL

Grant McAlister

What You Need to Know - Domain Name System (DNS)

Wes Morgan

Scaling HBase for Big Data

Salesforce Engineering

The Hive Think Tank: Rocking the Database World with RocksDB

The Hive

RocksDB is a new storage engine for MySQL that provides better storage efficiency than InnoDB. It achieves lower space amplification and write amplification than InnoDB through its use of compression and log-structured merge trees. While MyRocks (RocksDB integrated with MySQL) currently has some limitations like a lack of support for online DDL and spatial indexes, work is ongoing to address these limitations and integrate additional RocksDB features to fully support MySQL workloads. Testing at Facebook showed MyRocks uses less disk space and performs comparably to InnoDB for their queries.

What's hot (20)

Scylla Summit 2017: A Toolbox for Understanding Scylla in the Field

Scylla Summit 2017: Performance Evaluation of Scylla as a Database Backend fo...

If You Care About Performance, Use User Defined Types

Scylla Summit 2017: Migrating to Scylla From Cassandra and Others With No Dow...

Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the W...

Scylla Summit 2017: SMF: The Fastest RPC in the West

Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Scylla Summit 2017: Scylla on Samsung NVMe Z-SSDs

Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...

Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...

Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL

Millions of Regions in HBase: Size Matters

Scylla Compaction Strategies

MyRocks Deep Dive

Achieving HBase Multi-Tenancy with RegionServer Groups and Favored Nodes

Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook

DAT402 - Deep Dive on Amazon Aurora PostgreSQL

What You Need to Know - Domain Name System (DNS)

Scaling HBase for Big Data

The Hive Think Tank: Rocking the Database World with RocksDB

Viewers also liked

Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View

ScyllaDB

Are you a MySQL DBA or DevOps individual being asked to run Cassandra or Scylla? Feeling overwhelmed? In this talk, I will present Cassandra/Scylla operations in terms that directly relate to MySQL. I will show you comparisons between the Information Schema and the Cassandra/Scylla System keyspace(s). I will also talk about metrics available in MySQL versus Cassandra/Scylla and how to retrieve them. Finally, I will talk about how MySQL replication compares with Cassandra replication. Hopefully, when I am done you will be able to relate to Cassandra operations in a practical and useful way.

How to achieve no compromise performance and availability

ScyllaDB

Scylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing

ScyllaDB

This presentation discusses the "cold node problem" that occurs when a node restarts in a Cassandra cluster. When a node restarts, it loses its cached data and becomes a bottleneck. The presentation proposes a "heat weighted load balancing" solution where the cluster tracks each node's cache hit ratio and redistributes requests based on this ratio after a restart. Testing shows this solution significantly improves throughput after a node restart by distributing requests more evenly across nodes based on their "heat" or cache contents.

Scylla Summit 2016: Keynote - Big Data Goes Native

ScyllaDB

This document discusses Scylla, a new database that aims to improve upon existing databases. It notes several key differences in Scylla's architecture that allow it to be faster and more scalable than other databases, including its use of techniques like log-structured merge trees, lock-free design, and asynchronous programming. The document also outlines Scylla's value proposition as the fastest database with the best high availability and ease of management compared to other options.

Scylla Summit 2017: The Upcoming HPC Evolution

ScyllaDB

mParticle's Journey to Scylla from Cassandra

ScyllaDB

mParticle processes 50 billion monthly messages and needed a data store that provides full availability and performance. They previously used Cassandra but faced issues with high latency, complicated tuning, and backlogs of up to 20 hours. They tested Scylla and found it provided significantly lower latency and compaction backlogs with minimal tuning needed. Scylla also offered knowledgeable support. mParticle migrated their data from Cassandra to Scylla, which immediately kept up with their data loads with little to no backlog.

Scylla Summit 2017: Saving Thousands by Running Scylla on EC2 Spot Instances

ScyllaDB

Scylla Summit 2017: Scylla's Open Source Monitoring Solution

ScyllaDB

Scylla's monitoring capability has come a long way in the last year. We now have native support for Prometheus. Through scylla-grafana-monitoring, we have started providing default dashboards summarizing the most important aspects of Scylla for users. In this talk, I will cover what is currently available in our metrics, other non-standard metrics that are interesting but not available in our main dashboard, as well as our future plans for enhancement.

Scylla Summit 2017: How We Got to 1 Millisecond Latency in 99% Under Repair, ...

ScyllaDB

Glauber Costa, a Principal Architect at ScyllaDB, discusses techniques for achieving low latency database operations. He identifies three main sources of latency: speed mismatch between disk and CPU, lack of respect for task quotas, and imperfect isolation. Glauber describes how ScyllaDB addresses these issues through techniques like the I/O scheduler, CPU scheduler, task quotas, block detector, and controllers that regulate operations like memtable flushes. The goal is to make high percentile latencies low and bounded by treating them as bugs rather than nice-to-haves. ScyllaDB users can already benefit from these latency improvements in many situations, with more fixes coming in future releases.

Scylla Summit 2017: Welcome and Keynote - Nextgen NoSQL

ScyllaDB

How to Monitor and Size Workloads on AWS i3 instances

ScyllaDB

There is a new class of machines in town! Amazon recently unveiled i3, a new class of machines targeted at I/O-intensive workloads. Scylla will officially support i3, and previews are already available. Join our webinar to learn how to build a state-of-the-art database solution. Presenters Glauber Costa and Eyal Gutkind will cover how to: - Determine which workloads can benefit from i3 instances - Ensure Scylla fully leverages the great resources in the i3 family - Effectively navigate the Scylla monitoring system and identify bottlenecks You'll also see a live demonstration with a dashboard featuring an i3 cluster with different data models and workloads.

Scylla Summit 2017: From Elasticsearch to Scylla at Zenly

ScyllaDB

Viewers also liked (12)

Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View

How to achieve no compromise performance and availability

Scylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing

Scylla Summit 2016: Keynote - Big Data Goes Native

Scylla Summit 2017: The Upcoming HPC Evolution

mParticle's Journey to Scylla from Cassandra

Scylla Summit 2017: Saving Thousands by Running Scylla on EC2 Spot Instances

Scylla Summit 2017: Scylla's Open Source Monitoring Solution

Scylla Summit 2017: How We Got to 1 Millisecond Latency in 99% Under Repair, ...

Scylla Summit 2017: Welcome and Keynote - Nextgen NoSQL

How to Monitor and Size Workloads on AWS i3 instances

Scylla Summit 2017: From Elasticsearch to Scylla at Zenly

Similar to Scylla Summit 2017: How Baidu Runs Scylla on a Petabyte-Level Big Data Platform

Accelerating Analytics with EMR on your S3 Data Lake

Alluxio, Inc.

- Alluxio provides a data caching layer for analytics frameworks like Spark running on AWS EMR, addressing challenges of using S3 directly like inconsistent performance and expensive metadata operations. - It mounts S3 as a unified filesystem and caches frequently used data in memory across workers for faster queries while continuously syncing data to S3. - Alluxio's multi-tier storage enables data to be accessed locally from remote locations like S3 using intelligent policies to promote and demote data between memory, SSDs and disks.

Druid: Under the Covers (Virtual Meetup)

Imply

Apache Druid ingests and enables instant query on many billions of events in real-time. But how? In this talk, each of the components of an Apache Druid cluster is described – along with the data and query optimisations at its core – that unlock fresh, fast data for all. Bio: Peter Marshall (https://linkedin.com/in/amillionbytes/) leads outreach and engineering across Europe for Imply (http://imply.io/), a company founded by the original developers of Apache Druid. He has 20 years architecture experience in CRM, EDRM, ERP, EIP, Digital Services, Security, BI, Analytics, and MDM. He is TOGAF certified and has a BA (hons) degree in Theology and Computer Studies from the University of Birmingham in the United Kingdom.

Understanding Metadata: Why it's essential to your big data solution and how ...

Zaloni

This document discusses the importance of metadata for big data solutions and data lakes. It begins with introductions of the two speakers, Ben Sharma and Vikram Sreekanti. It then discusses how metadata allows you to track data in the data lake, improve change management and data visibility. The document presents considerations for metadata such as integration with enterprise solutions and automated registration. It provides examples of using metadata for data lineage, quality, and cataloging. Finally, it discusses using metadata across storage tiers for data lifecycle management and providing elastic compute resources.

Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...

Lace Lofranco

Data orchestration is the lifeblood of any successful data analytics solution. Take a deep dive into Azure Data Factory's data movement and transformation activities, particularly its integration with Azure's Big Data PaaS offerings such as HDInsight, SQL Data warehouse, Data Lake, and AzureML. Participants will learn how to design, build and manage big data orchestration pipelines using Azure Data Factory and how it stacks up against similar Big Data orchestration tools such as Apache Oozie. Video of presentation: https://channel9.msdn.com/Events/Ignite/Australia-2017/DA332

SAP BI with BO from LCC Infotech,Hyderabad

lccinfotech

This document provides an overview of course contents for SAP BI & BO training. It covers topics such as introduction to data warehousing, SAP BW architecture and data modeling, data loading and extraction in SAP BW, SAP BO tools including Universe Designer, Information Design Tool, Crystal Reports, Web Intelligence, and dashboards. The training will provide skills in areas such as data warehousing concepts, SAP BW data management, building reports and dashboards using SAP BO tools connected to SAP BW systems.

From limited Hadoop compute capacity to increased data scientist efficiency

Alluxio, Inc.

Alluxio Tech Talk Oct 17, 2019 Speaker: Alex Ma, Alluxio Want to leverage your existing investments in Hadoop with your data on-premise and still benefit from the elasticity of the cloud? Like other Hadoop users, you most likely experience very large and busy Hadoop clusters, particularly when it comes to compute capacity. Bursting HDFS data to the cloud can bring challenges – network latency impacts performance, copying data via DistCP means maintaining duplicate data, and you may have to make application changes to accomodate the use of S3. “Zero-copy” hybrid bursting with Alluxio keeps your data on-prem and syncs data to compute in the cloud so you can expand compute capacity, particularly for ephemeral Spark jobs.

Data Orchestration Platform for the Cloud

Alluxio, Inc.

This document discusses using a hybrid cloud approach with data orchestration to enable analytics workloads on data stored both on-premises and in the cloud. It outlines reasons for a hybrid approach including reducing time to production and leveraging cloud flexibility. It then describes alternatives like lift-and-shift or compute-driven approaches and their issues. Finally, it introduces a data orchestration platform that can cache and tier data intelligently while enabling analytics frameworks to access both on-premises and cloud-based data with low latency.

Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...

Amazon Web Services

Learn how to deploy a managed Presto environment to interactively query log data on AWS Organizations often need to quickly analyze large amounts of data, such as logs, generated from a wide variety of sources and formats. However, traditional approaches require a lot of time and effort designing complex data transformation and loading processes; and configuring data warehouses. Using AWS, you can start querying your datasets within minutes In this webinar you will learn how you can deploy a managed Presto environment in minutes to interactively query log data using plain ANSI SQL. Presto is a popular open source SQL engine for running interactive analytic queries against data sources of all sizes. We will talk about common use cases and best practices for running Presto on Amazon EMR. Learning Objectives: • Learn how to deploy a managed Presto environment running on Amazon EMR • Understand best practices for running Presto on Amazon EMR, including use of Amazon EC2 Spot instances • Learn how other customers are using Presto to analyze large data sets

Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2

Amazon Web Services

This document discusses building data warehouses and data lakes in the cloud using AWS services. It provides an overview of AWS databases, analytics, and machine learning services that can be used to store and analyze data at scale. These services allow customers to migrate existing data warehouses to the cloud, build new data warehouses and data lakes more cost effectively, and gain insights from their data more easily.

How to migrate from Alfresco Search Services to Alfresco SearchEnterprise

Angel Borroy López

Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS

Amazon Web Services LATAM

Data lakes allow organizations to store all types of data in a centralized repository at scale. AWS Lake Formation makes it easy to build secure data lakes by automatically registering and cleaning data, enforcing access permissions, and enabling analytics. Data stored in data lakes can be analyzed using services like Amazon Athena, Redshift, and EMR depending on the type of analysis and latency required.

Azure Data Lake Intro (SQLBits 2016)

Michael Rys

O365 Meetup Graz -Tome Tomovski - Beyond the limits of SharePoint

Thomas Gölles

The document discusses enterprise architecture for SharePoint with extreme measures. It describes designing a highly available and disaster recovery SharePoint farm across multiple servers. The presentation covers platform and application design, authentication, load balancing, and improvements like zero downtime patching. It emphasizes the importance of documentation for farm configuration, security, provisioning, and known issues for support. The presentation also discusses integrating SharePoint with Office 365 and Azure.

Migrating Your Databases to AWS – Tools and Services (Level 100)

Amazon Web Services

Scylla Summit 2017: How Baidu Runs Scylla on a Petabyte-Level Big Data Platform

1. How Baidu runs Scylla on a PB-level big data platform. Baidu Security R&D Department, Zhangmei Li & Jeff Liu

2. Agenda • About me • Introduction to Baidu Security • How Scylla is used in a log analysis system • What Scylla is used for • Future development with Scylla

3. About Baidu Security • Protect Baidu, Protect Partner, Protect People • Cloud WAF, Anti-DDoS, Device Fingerprint, DNS Hijacking Detection, Web Page Hijacking Detection, Simulator Detection, Threat Intelligence Service, System Vulnerabilities Scan Service, Others… • Big Data Platform • Data-Driven Security • Faster, more intelligent

4. Baidu Security big data platform (architecture diagram) • Layers: Management, Process, Storage, Collection • Components shown: Data Searching, Data Warehouse, Spark, MapReduce, Kafka, Hive, HDFS, FTP, Minos Client, DataProxy, Mola, ES, Storm, Graph Database Service, HBase, Cluster Monitor, Statistical Analysis System, Flume, Flink, OpenTSDB, Resource Management && Scheduler, Scylla, Metadata Management

5. How to analyze logs instantly

6. Log analysis system • Source: access logs, syslog, etc. • Volume: ~500TB per day, in text/pb/gzip formats • Goal: build index per day, quick search • Value: APT, security analysis
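As a rough sense of scale for the ~500TB per day figure on this slide, the sketch below converts the daily volume into an average ingest rate. It assumes decimal units (1 TB = 1000 GB) and a load spread evenly over 24 hours; neither assumption comes from the slides, and the function name is illustrative only.

```python
# Back-of-envelope: average ingest rate implied by ~500TB of logs per day.
# Assumptions (not from the slides): decimal units (1 TB = 1000 GB) and
# traffic spread evenly over 24 hours.

SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_gb_per_s(tb_per_day: float) -> float:
    """Convert a daily volume in TB into an average rate in GB/s."""
    return tb_per_day * 1000 / SECONDS_PER_DAY


if __name__ == "__main__":
    rate = average_rate_gb_per_s(500)
    print(f"~{rate:.1f} GB/s sustained on average")
```

Under these assumptions the pipeline has to absorb several gigabytes of log data per second on average, and real traffic is likely to peak above that.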

7. Index building architecture

8. Three-level index • The 2nd-level index is about 8%-10% of the source log size, ~50TB • The 1st-level and 3rd-level indexes are about 0.4% of the source log size, ~2TB
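The index sizes on this slide can be checked against the ~500TB per day source volume from slide 6. The sketch below is only that arithmetic; the percentages and the 500TB figure are from the slides, while the function and variable names are illustrative.

```python
# Check the index sizes quoted on the slide against ~500TB of source logs per day.
# The percentages and the 500TB figure come from the slides; the names used
# here are illustrative only.

SOURCE_TB_PER_DAY = 500.0


def index_size_tb(source_tb: float, fraction: float) -> float:
    """Size of an index that is `fraction` of the source log volume."""
    return source_tb * fraction


# 2nd-level index: 8%-10% of the source log size.
second_level_range = (
    index_size_tb(SOURCE_TB_PER_DAY, 0.08),
    index_size_tb(SOURCE_TB_PER_DAY, 0.10),
)
# 1st-level and 3rd-level index: the slide's 0.4% figure.
first_and_third = index_size_tb(SOURCE_TB_PER_DAY, 0.004)

if __name__ == "__main__":
    low, high = second_level_range
    print(f"2nd-level index: {low:.0f}-{high:.0f} TB per day")
    print(f"1st- and 3rd-level index: {first_and_third:.0f} TB per day")
```

The 8%-10% range works out to 40-50TB, so the slide's ~50TB corresponds to the upper end, and 0.4% of 500TB matches the quoted ~2TB.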

9. Search architecture

10. Log index storage: storage system expectations ✓ KV storage ✓ High-speed read ✓ High-speed write ✓ Always available ✓ Easy maintenance

11. Why Scylla?

12. Separate R/W for indexing (diagram labels: Index building service, Bulk Write, offline, replication, online, Real-time Analytics)

13. Perf of Scylla on HDD

14. Expect secondary index

15. The plan in the future. PLAN-1: Replace Redis with Scylla. Redis is currently used as a cache service; the plan is to try replacing it with Scylla, given that Scylla could deliver higher performance and is more convenient to maintain and scale. (Diagram labels: DC1, DC2, DC3)

16. The plan in the future. PLAN-2: Integrate Scylla into the OLAP data warehouse. Combine Scylla with the OLAP data warehouse because Scylla delivers excellent query throughput. (Diagram labels: OLAP, Spark, Spark SQL Query, CQL Query)

17. THANKS
