Testing a complex system like Scylla is a challenge on its own. There are many environments, workloads, and problems. Simple problems become increasingly worse at scale. In this talk, we will explore the testing method that we employ in our QA lab and our plans to make it even better in years to come.
This document discusses Apache Spark and Cassandra. It provides an overview of Cassandra as a shared-nothing, masterless, peer-to-peer database with great scaling. It then discusses how Spark can be used to analyze large amounts of data stored in Cassandra in parallel across a cluster. The Spark Cassandra connector allows Spark to create partitions that align with the token ranges in Cassandra, enabling efficient distributed queries across the cluster.
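As a rough illustration of aligning partitions with token ranges, the sketch below splits the full token ring into equal contiguous ranges and builds one range query per split. It assumes the Murmur3 partitioner's token bounds of -2^63 to 2^63-1. The names `split_token_ring` and `range_query` are hypothetical, and this is not the connector's actual algorithm, which also has to consider which nodes own each range.

```python
MIN_TOKEN = -2**63       # lowest Murmur3 token (assumed partitioner)
MAX_TOKEN = 2**63 - 1    # highest Murmur3 token


def split_token_ring(num_splits):
    """Split the token ring into num_splits contiguous (start, end] ranges."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN
    # Integer arithmetic keeps the boundaries exact for 64-bit tokens.
    bounds = [MIN_TOKEN + (total * i) // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(keyspace, table, partition_key, start, end):
    """Build the CQL text that one worker would run for its token range."""
    return (
        f"SELECT * FROM {keyspace}.{table} "
        f"WHERE token({partition_key}) > {start} AND token({partition_key}) <= {end}"
    )
```

Each range can then be handed to a separate worker, so that every worker scans a disjoint slice of the ring.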
Cassandra gives operations a lot of control over the system by forcing them to make many decisions about cluster topology changes that they would rather not make. Hecuba2 is a tool that helps to automate that. Hecuba2 has a library component and an agent component. The library provides an API for manipulating Cassandra topologies, and the agent runs on all Cassandra hosts and converges the existing topology to the generated topology. Hecuba2 is running in production at Spotify and has been remarkably bug-free since being rolled out. It supports creating a cluster, expanding a cluster, and replacing nodes. This talk will cover the design of Hecuba2 and how to deploy it. About the Speaker Radovan Zvoncek Backend Engineer, Spotify After graduating with a master's degree in distributed systems, I joined Spotify as a backend engineer. For the past three years I've been involved in Cassandra operations, as well as the cultivation of the Cassandra ecosystem at Spotify.
DataStax Enterprise clients, such as CQLSH or Hadoop and Spark based applications, can be precisely configured to achieve a desired behaviour. For a basic use case, we just run a dedicated DSE command and do not care about how all of those pieces are set up to work together, leveraging the goodness of DSE. However, understanding where and what we need to modify to achieve the expected change in the configuration is essential for using DSE efficiently. In this presentation we go through the basic and advanced settings for client applications, including security features and limitations or DSE patches introduced into integrated Spark. We show the new tools which significantly simplify the configuration of external DSE installations which are used just for accessing a DSE cluster in client mode. Finally, we conclude with hints for configuring the Spark driver from scratch in order to use it in a web application, when running the program through DSE scripts is not feasible. About the Speaker Jacek Lewandowski Software engineer, DataStax Jacek Lewandowski is a software engineer with 13 years of experience. Initially a full stack developer, he worked as a consultant and a trainer for different companies. In 2011 he started using Cassandra as an alternative to SQL in various applications. He is passionate about distributed algorithms, graphs and functional programming in Scala. He is also a part-time assistant professor popularizing the Cassandra database among students and researchers. He has been working on the DataStax Analytics team for over 2 years.
Worried that you aren't taking full advantage of your Spark and Cassandra integration? Well, worry no more! In this talk we'll take a deep dive into all of the available configuration options and see how they affect Cassandra and Spark performance. Concerned about throughput? Learn to adjust batching parameters and gain a boost in speed. Always running out of memory? We'll take a look at the various causes of OOM errors and how we can circumvent them. Want to take advantage of Cassandra's natural partitioning in Spark? Find out about the recent developments that let you perform shuffle-less joins on Cassandra-partitioned data! Come with your questions and problems and leave with answers and solutions! About the Speaker Russell Spitzer Software Engineer, DataStax Russell Spitzer received a Ph.D. in Bioinformatics before finding his deep passion for distributed software. He found the perfect outlet for this passion at DataStax, where he began on the Automation and Test Engineering team. He recently moved from finding bugs to making bugs as part of the Analytics team, where he works on integration between Cassandra and Spark as well as other tools.
This document discusses managing Apache Cassandra at scale. It provides an overview of Cassandra's history and evolution from Dynamo and BigTable. It also discusses Cassandra's data model and how it handles operations like reads, writes and updates in a distributed system without relying on read-modify-writes. The document also covers Cassandra best practices like using collections, lightweight transactions and time series data modeling to optimize for scalability.
With the addition of vnodes (Virtual Nodes), Cassandra users gained some streaming benefits when bootstrapping and decommissioning nodes. On the flip side, routing requests on larger clusters became a much more intensive workload for all nodes, which were then forced to act as coordinator nodes. By setting up a tier of proxy nodes, we were able to have our cluster of 50 nodes perform with a 300% improvement on average in a mixed workload environment. This is an explanation of what we did, how we did it, and why it works. About the Speaker Eric Lubow CTO, SimpleReach Eric Lubow is CTO of SimpleReach, where he builds highly-scalable distributed systems for processing analytics data. Eric is also a DataStax MVP for Cassandra, and co-author of Practical Cassandra. In his spare time, Eric is a skydiver, motorcycle rider, mixed martial artist, and dog dad.
We have been offering many internet services and smartphone applications for over 20 years in Japan, and Cassandra has been used by our services since 2010. In this presentation, I will explain some Cassandra issues and their solutions, and our next generation infrastructure for Cassandra. About the Speaker Satoshi Konno Technical Manager, Yahoo Japan Corporation Satoshi Konno is a software engineer with 20 years of experience. He has worked at Yahoo Japan as a programmer for 10 years and on their NoSQL team for the past 4 years, and he is currently enrolled in a computer science doctoral course studying distributed computing.
Have you ever wondered what is in all of those SSTable files and how it helps Cassandra find and manage your data? If you go to the DataStax website they will give you a high-level explanation of what is in each file. In this talk we will go much deeper, explaining each file and walking through a dump of its contents. We will also explore the differences between Cassandra 2.1 and 3.4. About the Speaker John Schulz Principal Consultant, The Pythian Group John has 40 years of experience working with data: data in files and in databases, from flat files through ISAM to relational databases and, most recently, NoSQL. For the last 15 years he has worked on a variety of open source technologies including MySQL, PostgreSQL, Cassandra, Riak, Hadoop and HBase. He has been working with Cassandra since 2010. For the last eighteen months he has been working for The Pythian Group to help their customers improve their existing databases and select new ones.
HighLoad++ 2017, Cape Town hall, November 8, 16:00. Abstract: http://www.highload.ru/2017/abstracts/3115.html During this session we will cover the latest developments in ProxySQL to support regular expressions (RE2 and PCRE) and how we can use this powerful technique in combination with ProxySQL's query rules to anonymize live data quickly and transparently. We will explain the mechanism and how to generate these rules quickly. We will show a live demo with all the challenges we got from the Community, and we will finish the session with an interactive brainstorm, testing queries from the audience.
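The regex-rewrite idea can be sketched outside ProxySQL. The Python example below uses the standard `re` module to replace a sensitive column in a SELECT list with a masking expression, which is roughly what a match pattern and a replace pattern in a query rule would do. The `MASK_RULES` table and the `mask_select_columns` function are hypothetical illustrations, not ProxySQL configuration, and the masking expression is only one possible choice.

```python
import re

# Hypothetical mapping from sensitive column name to a masking SQL expression.
MASK_RULES = {
    "email": "CONCAT(LEFT(email, 2), '***') AS email",
}

# Splits a query into "SELECT ", the column list, and everything from FROM on.
SELECT_PARTS = re.compile(r"(?is)^(\s*SELECT\s+)(.*?)(\s+FROM\s.*)$")


def mask_select_columns(sql):
    """Rewrite sensitive columns in the SELECT list; leave other queries as-is."""
    match = SELECT_PARTS.match(sql)
    if not match:
        return sql
    head, columns, tail = match.groups()
    for column, masked in MASK_RULES.items():
        # Word boundaries avoid touching names that merely contain the column name.
        columns = re.sub(rf"\b{re.escape(column)}\b", masked, columns)
    return head + columns + tail
```

Only the column list is rewritten, so a WHERE clause that filters on the same column keeps working against the real values while the client sees masked output.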
The document discusses mining the Automatic Workload Repository (AWR) in Oracle databases for capacity planning, visualization, and other real-world uses. It introduces Karl Arao as a speaker and discusses topics he will cover including AWR, diagnosing performance issues using AWR data, visualization of AWR data, capacity planning, and tools for working with AWR data like scripts and linear regression. References and resources on working with AWR are also provided.
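To make the linear regression step concrete, here is a minimal least-squares sketch in plain Python. It fits a straight line to pairs of workload and resource usage, then estimates the workload at which a resource limit would be reached. The function names are hypothetical, any input numbers are the caller's own samples rather than AWR output, and a real capacity model would need to check that the relationship is actually linear.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) samples of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def workload_at_limit(limit, slope, intercept):
    """Workload level at which the fitted line reaches the given limit."""
    if slope == 0:
        raise ValueError("flat trend: the limit is never reached")
    return (limit - intercept) / slope
```

For example, with workload samples on the x axis and CPU utilization on the y axis, `workload_at_limit(100, slope, intercept)` gives the projected workload at which utilization would hit 100 percent under the fitted trend.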
Presentation by Ben Bromhead, Co-Founder and CTO of Instaclustr, at the Cassandra Summit 2016 in San Jose. This presentation will show how to create truly elastic Cassandra deployments on AWS, allowing you to scale and shrink your large Cassandra deployments multiple times a day. Leveraging a combination of EBS-backed disks, JBOD, token pinning, and our previous work on bootstrapping from backups, you will be able to dramatically reduce costs per cluster by scaling to match your daily workloads.
This document summarizes challenges with large partitions in Cassandra and potential solutions. When a large partition is read, the key cache can cause garbage collection pressure as it stores the partition's index on the Java heap. Currently, the index is stored off-heap only if the partition exceeds a configurable size; otherwise, it is kept on-heap. Fully migrating the key cache off-heap is another potential solution but incurs serialization costs.
Deleting data from Cassandra has several challenges, and existing solutions (tombstones or TTLs) have limitations that make them unusable or untenable in certain circumstances. We'll explore the cases where existing deletion options fail or are inadequate, then describe a solution we developed which deletes data from Cassandra during standard or user-defined compaction, but without resorting to tombstones or TTLs. About the Speaker Eric Stevens Principal Architect, ProtectWise, Inc. Eric is the principal architect and day-one employee of ProtectWise, Inc., specializing in massive real-time processing and scalability problems. The team at ProtectWise processes, analyzes, optimizes, indexes, and stores billions of network packets each second. They look for threats in real time, but also store full fidelity network data (including PCAP), and when new security intelligence is received, automatically replay existing network history through that new intelligence.
This document discusses operations, consistency, and failover for multi-datacenter Apache Cassandra clusters. It describes how to configure replication strategies to distribute data across DCs, maintain consistency levels, and handle reads and writes between DCs. It also covers adding a new DC, removing a DC, running repairs across DCs, and designing for failover between DCs in the event of network partitions or DC outages.
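The interaction between per-datacenter replication factors and consistency levels comes down to simple arithmetic: a quorum is a majority of the replicas in scope. The sketch below computes how many replicas must respond for three common consistency levels. The function names and the example datacenter names are hypothetical; only the quorum rule itself (half the replication factor, rounded down, plus one) is Cassandra's.

```python
def quorum(replication_factor):
    """A quorum is a majority: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(consistency, replication, local_dc):
    """Replicas that must respond, given a {datacenter: RF} mapping."""
    if consistency == "LOCAL_QUORUM":
        # Majority of the replicas in the coordinator's own datacenter.
        return quorum(replication[local_dc])
    if consistency == "QUORUM":
        # Majority of all replicas across every datacenter.
        return quorum(sum(replication.values()))
    if consistency == "EACH_QUORUM":
        # A majority in every datacenter, added up.
        return sum(quorum(rf) for rf in replication.values())
    raise ValueError(f"unsupported consistency level: {consistency}")
```

With `{"dc1": 3, "dc2": 3}`, LOCAL_QUORUM needs 2 replicas in the local datacenter, while QUORUM and EACH_QUORUM each need 4 in total, which is why only LOCAL_QUORUM can still be satisfied when the remote datacenter is unreachable.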
Rapid Home Provisioning is a new feature in Oracle Grid Infrastructure 12c R2 that provides a simplified way to provision and patch Oracle software and databases. It uses a centralized management server and golden images stored on ACFS to deploy pre-packaged and patched Oracle homes to client nodes. Administrators can easily create working copies of golden images, deploy databases from the working copies, and seamlessly patch databases by moving them to a working copy based on a newer patched golden image with a single command.
Join us as we talk about the current state as well as the future of DSE Search. Nick Panahi will discuss high-level architecture while Ariel will dive deep into some of the integration. We'll talk about future features, improvements and enhancements as well as some of the challenges of our custom integration and what that means for scale and availability. About the Speakers Nick Panahi Sr. Product Manager, DSE Search, DataStax I am the product manager for DSE Search. Prior to product management, I was a solution architect for DataStax. Ariel Weisberg Software Engineer, DataStax Ariel is currently a Cassandra contributor and DataStax employee, and former lead architect for VoltDB. Ariel aspires to be or considers himself a shared-nothing database expert depending on the time of day and whether Benedict is in the room, and has a passion for things measured in nanoseconds. Ariel has presented at events like Strangeloop, PAX Dev, OpenSQL camp Boston, NYC MySQL Meetup, and Boston New Technology Group meetup.
Tanel Poder has been involved in a number of Exadata migration projects since its introduction, mostly in the area of performance assurance, troubleshooting and capacity planning. These slides, originally presented at UKOUG in 2010, cover some of the most interesting challenges, surprises and lessons learnt from planning and executing large Oracle database migrations to the Exadata v2 platform. This material is not just repeating the marketing material or Oracle's official whitepapers.
Zenly (recently acquired by Snap) makes a social map app. Their team has been running Scylla in production for the past eight months. Get an overview of the reasons they chose Scylla, its deployment on Google Cloud, and the performance they achieved, and learn about the few hiccups they hit along the way.
When working with streaming data, stateful operations are a common use case. If you would like to perform data de-duplication, calculate aggregations over event-time windows, or track user activity over sessions, you are performing a stateful operation. Apache Spark provides users with a high-level, simple-to-use DataFrame/Dataset API to work with both batch and streaming data. The funny thing about batch workloads is that people tend to run them over and over again. Structured Streaming allows users to run these same workloads, with the exact same business logic, in a streaming fashion, helping users answer questions at lower latencies. In this talk, we will focus on stateful operations with Structured Streaming, and we will demonstrate through live demos how NoSQL stores can be plugged in as a fault-tolerant state store to store intermediate state, as well as used as a streaming sink, where the output data can be stored indefinitely for downstream applications.
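The role of a state store in a stateful operation can be shown with a small, Spark-free sketch of de-duplication across micro-batches. The `InMemoryStateStore` class stands in for an external key-value store and its `get`/`put` interface is a hypothetical simplification, not the Structured Streaming state store API. A production version would also expire old keys, which is omitted here.

```python
class InMemoryStateStore:
    """Stand-in for an external, fault-tolerant key-value store (hypothetical)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


def deduplicate(batch, store):
    """Return the events in this micro-batch whose id has not been seen before.

    batch is a list of (event_id, event_time) pairs. The store persists the
    seen ids between calls, which is what makes the operation stateful.
    """
    fresh = []
    for event_id, event_time in batch:
        if store.get(event_id) is None:
            store.put(event_id, event_time)
            fresh.append((event_id, event_time))
    return fresh
```

Because the seen ids live in the store rather than in the function, a duplicate is dropped whether it arrives in the same micro-batch or in a later one.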
In my talk, I will present the different compaction strategies that Scylla provides, and demonstrate when it is appropriate and when it is inappropriate to use each one. I will then present a new compaction strategy that we designed based on lessons from the existing ones, picking their best features while avoiding their problems.
What happens to a request that reaches Scylla, and why should one care? Understanding how Scylla executes your queries can help you make better architectural decisions and also better understand the performance of your application. Are my rows too big? Should I make that other column a part of my partition key instead? This talk will cover the interaction between nodes, shards and the role of Scylla's internal components like memtables, cache and sstables. I will explain how different types of queries are executed and how to plan your queries for maximum performance.