Find out about breakthrough architectures for fast OLAP performance querying Cassandra data with Apache Spark, including a new open source project, FiloDB.
Replication. Partitioning. Relational databases. Bigtable. Dynamo. There is no one-size-fits-all approach to scaling your database, and the CAP theorem proved that there never will be. This talk will explain the advantages and limits of the approaches to scaling traditional relational databases, as well as the tradeoffs made by the designers of newer distributed systems like Cassandra. These slides are from Jonathan Ellis's OSCON 09 talk: http://en.oreilly.com/oscon2009/public/schedule/detail/7955
Apache Doris (incubating) is an MPP-based interactive SQL data warehousing for reporting and analysis. It is open-sourced by Baidu. Doris mainly integrates the technology of Google Mesa and Apache Impala. Unlike other popular SQL-on-Hadoop systems, Doris is designed to be a simple and single tightly coupled system, not depending on other systems. Doris not only provides high concurrent low latency point query performance, but also provides high throughput queries of ad-hoc analysis. Doris not only provides batch data loading, but also provides near real-time mini-batch data loading. Doris also provides high availability, reliability, fault tolerance, and scalability. The simplicity (of developing, deploying and using) and meeting many data serving requirements in single system are the main features of Doris.
ABSTRACT OF THE TALK Kubernetes has hit a home run for stateless workloads, but can it do the same for stateful services such as distributed databases? Before we can answer that question, we need to understand the challenges of running stateful workloads on, well anything. In this talk, we will first look at which stateful workloads, specifically databases, are ideal for running inside Kubernetes. Secondly, we will explore the various concerns around running databases in Kubernetes for production environments, such as: - The production-readiness of Kubernetes for stateful workloads in general - The pros and cons of the various deployment architectures - The failure characteristics of a distributed database inside containers In this session, we will demonstrate what Kubernetes brings to the table for stateful workloads and what database servers must provide to fit the Kubernetes model. This talk will also highlight some of the modern databases that take full advantage of Kubernetes and offer a peek into what’s possible if stateful services can meet Kubernetes halfway. We will go into the details of deployment choices, how the different cloud-vendor managed container offerings differ in what they offer, as well as compare performance and failure characteristics of a Kubernetes-based deployment with an equivalent VM-based deployment. BIO Amey is a VP of Data Engineering at Yugabyte with a deep passion for Data Analytics and Cloud-Native technologies. In his current role, he collaborates with Fortune 500 enterprises to architect their business applications with scalable microservices and geo-distributed, fault-tolerant data backend using YugabyteDB. Prior to joining Yugabyte, he spent 5 years at Pivotal as Platform Data Architect and has helped enterprise customers across multiple industry verticals to extend their analytical capabilities using Pivotal & OSS Big Data platforms. He is originally from Mumbai, India, and has a Master's degree in Computer Science from the University of Pennsylvania(UPenn), Philadelphia. Twitter: @ameybanarse LinkedIn: linkedin.com/in/ameybanarse/
This document discusses Oracle Multitenant 19c and pluggable databases. It begins with an introduction to the speaker and overview of pluggable databases. It then describes the traditional Oracle database architecture and the multitenant architecture in Oracle 19c. It discusses the different components of a container database including the root, seed PDB, and application containers. It also covers how to create pluggable databases from scratch, through cloning locally and remotely, relocating PDBs, and plugging in unplugged PDBs.
This document discusses using Apache Cassandra for business intelligence, reporting and analytics. It covers: - Data modeling and querying Cassandra data using CQL - Accessing Cassandra data through drivers, ODBC/JDBC, and analytics frameworks like Spark and Hadoop - Doing reporting, dashboards, and analytics on Cassandra data using CQL, Solr, Spark, and BI tools - Capabilities of DataStax Enterprise for integrated search, batch analytics, and real-time analytics on Cassandra - Example architectures that isolate workloads and handle hot vs cold data
This document summarizes a series of performance issues seen by the author in their work with Oracle Exadata systems. It describes random session hangs occurring across several minutes, with long transaction locks and I/O waits seen. Analysis of AWR reports and blocking trees revealed that many sessions were blocked waiting on I/O, though initial I/O metrics from the OS did not show issues. Further analysis using ASH activity breakdowns and OS tools like sar and vmstat found high apparent CPU usage in ASH that was not reflected in actual low CPU load on the system. This discrepancy was due to the way ASH attributes non-waiting time to CPU. The root cause remained unclear.
Detailed technical material about MyRocks -- RocksDB storage engine for MySQL -- https://github.com/facebook/mysql-5.6
This talk presents 15 different tips and tricks using tools to better troubleshoot and debug problems with Database , Oracle RAC and Oracle Clusterware , ASM and how to get the right pieces of data with the least of commands which today most people do manually. This session will cover tools from the Oracle Autonomous Health Framework (AHF) like Trace file Analyzer (TFA) to collect , organize and analyze log data , Exachk and orachk to perform mass best practices analysis and automation , Cluster Health Advisor to debug node evictions and calibrate the framework , OSWatcher and its analysis engine , oratop for pinpointing performance issues and many others to make one feel like a rockstar DBA.
Query compilation in Impala involves parsing the SQL, semantic analysis to validate the query, planning to generate an executable query plan, and finally executing the query. The query planner considers different join orders and strategies like broadcast joins and partitioned joins to minimize data transfer during query execution based on table and column statistics. The explain output provides details on how the query will be executed in a distributed fashion across nodes.
PostgreSQL is a very popular and feature-rich DBMS. At the same time, PostgreSQL has a set of annoying wicked problems, which haven't been resolved in decades. Miraculously, with just a small patch to PostgreSQL core extending this API, it appears possible to solve wicked PostgreSQL problems in a new engine made within an extension.
As cloud computing continues to gather speed, organizations with years’ worth of data stored on legacy on-premise technologies are facing issues with scale, speed, and complexity. Your customers and business partners are likely eager to get data from you, especially if you can make the process easy and secure. Challenges with performance are not uncommon and ongoing interventions are required just to “keep the lights on”. Discover how Snowflake empowers you to meet your analytics needs by unlocking the potential of your data. Agenda of Webinar : ~Understand Snowflake and its Architecture ~Quickly load data into Snowflake ~Leverage the latest in Snowflake’s unlimited performance and scale to make the data ready for analytics ~Deliver secure and governed access to all data – no more silos
The document discusses Oracle Data Guard, a disaster recovery solution for Oracle databases. It provides: 1) An overview of Data Guard, explaining that it maintains a physical or logical standby copy of the primary database to enable failover in the event of outages or disasters. 2) Details on the different types of standby databases - physical, logical, and snapshot - and how they are maintained through redo application or SQL application. 3) The various Data Guard configuration options like real-time apply, time delay, and role transitions such as switchover and failover.
Vladimir Rodionov (Hortonworks) Time-series applications (sensor data, application/system logging events, user interactions etc) present a new set of data storage challenges: very high velocity and very high volume of data. This talk will present the recent development in Apache HBase that make it a good fit for time-series applications.
This document discusses backup and recovery strategies for Oracle Exadata systems. It outlines the fundamental principles of backups including having multiple copies of data stored on different media with one copy offsite. It then describes the various backup options for Exadata, including using additional Exadata storage cells for the fastest backups, using a ZFS storage appliance for flexibility, or backing up to tape for economical long-term storage with removable offline copies. Key metrics like backup and restore speeds are provided for each option.
This document discusses using ClickHouse for experimentation and metrics at Spotify. It describes how Spotify built an experimentation platform using ClickHouse to provide teams interactive queries on granular metrics data with low latency. Key aspects include ingesting data from Google Cloud Storage to ClickHouse daily, defining metrics through a centralized catalog, and visualizing metrics and running queries using Superset connected to ClickHouse. The platform aims to reduce load on notebooks and BigQuery by serving common queries directly from ClickHouse.
Adapting and adopting SQL Plan Management (SPM) to achieve execution plan stability for sub-second queries on a high-rate OLTP mission-critical application
Several free tools are available to help with SQL tuning and performance diagnostics, including SQLd360, SQLT, and eDB360. SQLd360 and SQLT are good for diagnosing a single SQL statement, while eDB360 provides a 360-degree view of an entire Oracle database. Snapper and TUNAs360 can diagnose sessions and database activity. Standalone scripts like planx and sqlmon provide specialized diagnostics for individual cases. These free tools vary in size and capabilities, but all aim to help tune and diagnose SQL and database performance issues.
Apache Cassandra is rock-solid and widely deployed for OLTP and real-time applications, but it is typically not thought of as an OLAP database for analytical queries. This talk will show architectures and techniques for combining Apache Cassandra and Spark to yield a 10-1000x improvement in OLAP analytical performance. We will then introduce a new open-source project that combines the above performance improvements with the ease of use of Apache Cassandra, and compare it to implementations based on Hadoop and Parquet. First, the existing Cassandra Spark connector allows one to easily load data from Cassandra to Spark. We'll cover how to accelerate queries through different caching options in Spark, and the tradeoffs and limitations around performance, memory, and updating data in real time. We then dive into the use of columnar storage layout and efficient coding techniques that dramatically speed up I/O for OLAP use cases. Cassandra features like triggers and custom secondary indexes allow for easy data ingestion into columnar format. Next, we explore how to integrate this new storage with Spark SQL and its pluggable data storage API. Future developments will enable extreme analytical database performance, including smart caching of column projections, a columnar version of Spark's Catalyst execution planner, and how vectorization makes for fast cache- and GPU-friendly calculations - see Spark's Project Tungsten. FiloDB is a new open-source database using the above techniques to combine very fast Spark SQL analytical queries with the ease of use of Cassandra. We will briefly cover interesting use cases, such as: * Easy exactly-once ingestion from Kafka for streaming and IoT applications * Incremental computed columns and geospatial annotations. We'll discuss how FiloDB improves aggregations needed for choropleth maps over standard PostGIS solutions.
You want to ingest event, time-series, streaming data easily, yet have flexible, fast ad-hoc queries. Is this even possible? Yes! Find out how in this talk of combining Apache Cassandra and Apache Spark, using a new open-source database, FiloDB.
The document discusses using Apache Spark and Cassandra for online analytical processing (OLAP) of big data. It describes challenges with relational databases and OLAP cubes at large scales and how Spark can provide fast, distributed querying of data stored in Cassandra. The key points made are that Spark and Cassandra combine to provide horizontally scalable storage with Cassandra and fast, in-memory analytics with Spark; and that for optimal performance, data should be cached in Spark SQL tables for column-oriented querying and aggregation.
O'Reilly Webcast with Myself and Evan Chan on the new SNACK Stack (playoff of SMACK) with FIloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.
کارگاه کلان داده در سومین کنفرانس ملی محاسبات توزیعی و پردازش داده های بزرگ دانشگاه شهید مدنی آذربایجان تبریز اردیبهشت 96
This document provides an overview of Cassandra, a NoSQL database. It discusses that Cassandra is an open source, distributed database designed to handle large amounts of structured data across nodes. The document outlines Cassandra's architecture, which involves distributing data across peer nodes so that there is no single point of failure. It also discusses Cassandra's data model, including keyspaces, column families, and the use of the Cassandra Query Language to define schemas, insert and query data. In closing, the document notes that Cassandra is well-suited for applications that require scaling to handle large, variable workloads across data centers with high performance and availability.
A talk covering the best-of-breed platform consisting of Spark, Mesos, Akka, Cassandra and Kafka. SMACK is more of a toolbox of technologies to allow the building of resilient ingestion pipelines, offering a high degree of freedom in the selection of analysis and query possibilities and baked in support for flow-control. More and more customers are using this stack, which is rapidly becoming the new industry standard for Big Data solutions. Session can be seen here - in German - https://speakerdeck.com/stefan79/fast-data-smack-down
My presentation at recently concluded Apache Big Data Conference Europe about the Reliable Low Level Kafka Spark Consumer I developed and an use case of real time indexing to Apache Blur using this consumer
This document provides an overview of using Cassandra in web applications. It discusses why developers may consider using a NoSQL solution like Cassandra over traditional SQL databases. It then covers topics like Cassandra's architecture, data modeling, configuration options, APIs, development tools, and examples of companies using Cassandra in production systems. Key points emphasized are that Cassandra offers high performance but requires rewriting code and developing new processes and tools to support its flexible schema and data model.
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use-cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
Cassandra at Pollfish. Athens Cassandra Users MeetUp (Datastax)- 11 March 2015. http://www.meetup.com/Athens-Cassandra-Users/events/220922960/
Pollfish is a survey platform which provides access to millions of targeted users. Pollfish allows easy distribution and targeting of surveys through existing mobile apps. (https://www.pollfish.com/). At pollfish we use Cassandra for difference use cases, eg. for application data store to maximize write throughput when appropriate and for our analytics project to find insights in application generated data. As a medium to accomplish our success so far, we use the Datastax's DSE 4.6 environment which integrates Appache Cassadra, Spark and a hadoop compatible file system (CFS). We will discuss how we started, how the journey was and the impressions gained so far along with some tips learned the hard way. This is a result of joint work of an excellent team here at Pollfish.
This presentation explains how to get started with Apache Cassandra to provide a scale out, fault tolerant backend for inventory storage on OpenSimulator.
How would you build a database to support sustained ingestion of several hundreds of thousands rows per second while running near real-time queries on top? In this session I will go over some of the technical decisions and trade-offs we applied when building QuestDB, an open source time-series database developed mainly in JAVA, and how we can achieve over four million row writes per second on a single instance without blocking or slowing down the reads. There will be code and demos, of course. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
This document provides an overview of SK Telecom's use of big data analytics and Spark. Some key points: - SKT collects around 250 TB of data per day which is stored and analyzed using a Hadoop cluster of over 1400 nodes. - Spark is used for both batch and real-time processing due to its performance benefits over other frameworks. Two main use cases are described: real-time network analytics and a network enterprise data warehouse (DW) built on Spark SQL. - The network DW consolidates data from over 130 legacy databases to enable thorough analysis of the entire network. Spark SQL, dynamic resource allocation in YARN, and BI integration help meet requirements for timely processing and quick responses.
This document provides an overview of SK Telecom's use of big data analytics and Spark. Some key points: - SKT collects around 250 TB of data per day which is stored and analyzed using a Hadoop cluster of over 1400 nodes. - Spark is used for both batch and real-time processing due to its performance benefits over other frameworks. Two main use cases are described: real-time network analytics and a network enterprise data warehouse (DW) built on Spark SQL. - The network DW consolidates data from over 130 legacy databases to enable thorough analysis of the entire network. Spark SQL, dynamic resource allocation in YARN, and integration with BI tools help meet requirements for timely processing and quick
The document provides an introduction to NOSQL databases. It begins with basic concepts of databases and DBMS. It then discusses SQL and relational databases. The main part of the document defines NOSQL and explains why NOSQL databases were developed as an alternative to relational databases for handling large datasets. It provides examples of popular NOSQL databases like MongoDB, Cassandra, HBase, and CouchDB and describes their key features and use cases.
Presented to the Dublin Cassandra User Group by Niall Milton of DigBigData. This presentation is on Cassandra and its use with other technologies such as Storm, Spark, Hadoop, ElasticSearch and Redis. This presentation should act as a solid foundation to explore some of the mentioned technologies in more depth.
Apache Cassandra is a highly scalable, distributed database designed to handle large amounts of data across many servers with no single point of failure. It uses a peer-to-peer distributed system where data is replicated across multiple nodes for availability even if some nodes fail. Cassandra uses a column-oriented data model with dynamic schemas and supports fast writes and linear scalability.
Quick introduction to the moving parts inside Cassandra and essential commands and tasks for System Administrators.
Slides from my talk at MinneAnalytics 2024 - June 7, 2024 https://datatech2024.sched.com/event/1eO0m/time-state-analytics-a-new-paradigm Across many domains, we see a growing need for complex analytics to track precise metrics at Internet scale to detect issues, identify mitigations, and analyze patterns. Think about delays in airlines (Logistics), food delivery tracking (Apps), detect fraudulent transactions (Fintech), flagging computers for intrusion (Cybersecurity), device health (IoT), and many more. For instance, at Conviva, our customers want to analyze the buffering that users on some types of devices suffer, when using a specific CDN. We refer to such problems as Multidimensional Time-State Analytics. Time-State here refers to the stateful context-sensitive analysis over event streams needed to capture metrics of interest, in contrast to simple aggregations. Multidimensional refers to the need to run ad hoc queries to drill down into subpopulations of interest. Furthermore, we need both real-time streaming and offline retrospective analysis capabilities. In this talk, we will share our experiences to explain why state-of-art systems offer poor abstractions to tackle such workloads and why they suffer from poor cost-performance tradeoffs and significant complexity. We will also describe Conviva’s architectural and algorithmic efforts to tackle these challenges. We present early evidence on how raising the level of abstraction can reduce developer effort, bugs, and cloud costs by (up to) an order of magnitude, and offer a unified framework to support both streaming and retrospective analysis. We will also discuss how our ideas can be plugged into existing pipelines and how our new ``visual'' abstraction can democratize analytics across many domains and to non-programmers.
How we at Conviva ported a streaming data pipeline in months from Scala to Rust. What are the important human and technical factors in our port, and what did we learn?
Almost all applications have some kind of state. Some data processing apps and databases have huge amounts of state. How do we navigate a cloud-based world of containers where stateless and functions-as-a-service is all the rage? As a long-time architect, designer, and developer of very stateful apps (databases and data processing apps), I’d like to take you on a journey through the modern cloud world and Kubernetes, offering helpful design patterns, considerations, tips, and where things are going. How is Kubernetes shaking up stateful app design?
Slides for my talk at Monitorama PDX 2019. Histograms have the potential to give us tools to meet SLO/SLAs, quantile measurements, and very rich heatmap displays for debugging. Their promise has not been fulfilled by TSDB backends however. This talk talks about the concept of histograms as first class citizens in storage. What does accuracy mean for histograms? How can we store and compress rich histograms for evaluation and querying at massive scale? How can we fix some of the issues with histograms in Prometheus, such as proper aggregation, bucketing, avoiding clipping, etc.?
My keynote presentation about how we developed FiloDB, a distributed, Prometheus-compatible time series database, productionized it at Apple and scaled it out to handle a huge amount of operational data, based on the stack of Kafka, Cassandra, Scala/Akka.
Here is my talk at Scala by the Bay 2016, Building a High-Performance Database with Scala, Akka, and Spark. Covers integration of Akka and Spark, when to use actors and futures, back pressure, reactive monitoring with Kamon, and more.
700 Updatable Queries Per Second: Spark as a Real-Time Web Service. Find out how to use Apache Spark with FiloDb for low-latency queries - something you never thought possible with Spark. Scale it down, not just scale it up!
An overview of building scalable data pipelines and specific technologies including Apache #Kafka, #Spark, Spark Streaming, #Cassandra, and #FiloDB.
You won't find this in many places - an overview of deploying, configuring, and running Apache Spark, including Mesos vs YARN vs Standalone clustering modes, useful config tuning parameters, and other tips from years of using Spark in production. Also, learn about the Spark Job Server and how it can help your organization deploy Spark as a RESTful service, track Spark jobs, and enable fast queries (including SQL!) of cached RDDs.
Everyone in the Scala world is using or looking into using Akka for low-latency, scalable, distributed or concurrent systems. I'd like to share my story of developing and productionizing multiple Akka apps, including low-latency ingestion and real-time processing systems, and Spark-based applications. When does one use actors vs futures? Can we use Akka with, or in place of, Storm? How did we set up instrumentation and monitoring in production? How does one use VisualVM to debug Akka apps in production? What happens if the mailbox gets full? What is our Akka stack like? I will share best practices for building Akka and Scala apps, pitfalls and things we'd like to avoid, and a vision of where we would like to go for ideal Akka monitoring, instrumentation, and debugging facilities. Plus backpressure and at-least-once processing.
Socrata is a software company that provides an open data platform to enable governments to publish and share data with the public and developers in order to spur innovation; their platform allows users to find, explore, and analyze datasets through tools for visualization, analysis, and application building. The document discusses Socrata's architecture and technologies that power their open data platform and allow it to handle large volumes of data and queries in a scalable way.
How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets onCassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.
This document discusses Spark Job Server, an open source project that allows Spark jobs to be submitted and run via a REST API. It provides features like job monitoring, context sharing between jobs to reuse cached data, and asynchronous APIs. The document outlines motivations for the project, how to use it including submitting and monitoring jobs, and future plans like high availability and hot failover support.
This was a talk that Kelvin Chu and I just gave at the SF Bay Area Spark Meetup 5/14 at Palantir Technologies. We discussed the Spark Job Server (http://github.com/ooyala/spark-jobserver), its history, example workflows, architecture, and exciting future plans to provide HA spark job contexts. We also discussed the use case of the job server at Ooyala to facilitate fast query jobs using shared RDD and a shared job context, and how we integrate with Apache Cassandra.
This document discusses using Spark and Cassandra together for interactive analytics. It describes how Evan Chan uses both technologies at Ooyala to solve the problem of generating analytics from raw data in Cassandra in a flexible and fast way. It outlines their architecture of using Spark to generate materialized views from Cassandra data and then powering queries with those cached views for low latency queries.
"Real-time Analytics with Cassandra, Spark, and Shark", my talk at the Cassandra Summit 2013 conference in San Francisco
This was our 9th Sem Design Studio Project, introduced as Conservation of Taksar Bazar, Bhojpur, an ancient city famous for Taksar- Making Coins. Taksar Bazaar has a civilization of Newars shifted from Patan, with huge socio-economic and cultural significance having a settlement of about 300 years. But in the present scenario, Taksar Bazar has lost its charm and importance, due to various reasons like, migration, unemployment, shift of economic activities to Bhojpur and many more. The scenario was so pityful that when we went to make inventories, take survey and study the site, the people and the context, we barely found any youth of our age! Many houses were vacant, the earthquake devasted and ruined heritages. Conservation of those heritages, ancient marvels,a nd history was in dire need, so we proposed the Conservation of Taksar through economic regeneration because the lack of economy was the main reason for the people to leave the settlement and the reason for the overall declination.