This document summarizes a presentation about modeling data with Cassandra Query Language (CQL) using examples from a Twitter-like application called Twissandra. It introduces CQL as an alternative to Thrift for querying Cassandra and describes how to model users, followers, tweets, timelines and other social media data structures in Cassandra tables. The presentation emphasizes denormalizing data and using materialized views to optimize queries, and concludes by noting that applications can be built in various languages thanks to Cassandra drivers.
Cassandra By Example: Data Modelling with CQL3 (Eric Evans)
1. Cassandra By Example:
Data Modelling with CQL3
Berlin Buzzwords
June 4, 2013
Eric Evans
eevans@opennms.com
@jericevans
2. CQL is...
● Query language for Apache Cassandra
● Almost SQL (almost)
● An alternative query interface, now a first-class citizen
● More performant!
● Available since Cassandra 0.8.0 (almost 2 years!)
25. following
-- Users a user is following
CREATE TABLE following (
username text,
followed text,
PRIMARY KEY(username, followed)
);
26. following
-- Meg follows Stewie
INSERT INTO following (username, followed)
VALUES ('meg', 'stewie');

-- Get a list of who Meg follows
SELECT followed FROM following
WHERE username = 'meg';
29. followers
-- The users who follow username
CREATE TABLE followers (
username text,
following text,
PRIMARY KEY(username, following)
);
30. followers
-- Meg follows Stewie
INSERT INTO followers (username, following)
VALUES ('stewie', 'meg');

-- Get a list of who follows Stewie
SELECT following FROM followers
WHERE username = 'stewie';
31. redux: following / followers
-- @meg follows @stewie
BEGIN BATCH
  INSERT INTO following (username, followed)
  VALUES ('meg', 'stewie');
  INSERT INTO followers (username, following)
  VALUES ('stewie', 'meg');
APPLY BATCH;
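Why do both INSERTs travel together in one batch? The two tables are two views of the same fact, and they must stay in step. The idea can be sketched in plain Python, with hypothetical in-memory dicts standing in for the two denormalized tables (the names and structure here are ours, for illustration only):

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the two denormalized tables:
# each maps a partition key (username) to its set of clustered rows.
following = defaultdict(set)   # username -> users they follow
followers = defaultdict(set)   # username -> users who follow them

def follow(follower, followed):
    """Write the relationship to both views, as the BATCH does."""
    following[follower].add(followed)
    followers[followed].add(follower)

follow('meg', 'stewie')
assert following['meg'] == {'stewie'}
assert followers['stewie'] == {'meg'}
```

Either table alone answers only one of the two queries cheaply; writing both on every follow is the price of making both reads a single-partition lookup.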
34. tweets
-- Tweet storage (think: permalink)
CREATE TABLE tweets (
tweetid uuid PRIMARY KEY,
username text,
body text
);
35. tweets
-- Store a tweet
INSERT INTO tweets (
tweetid,
username,
body
) VALUES (
60780342-90fe-11e2-8823-0026c650d722,
'stewie',
'victory is mine!'
);
36. Query tweets by ... ?
● author, time descending
● followed authors, time descending
● date starting / date ending
38. userline
-- Materialized view of the tweets
-- created by user.
CREATE TABLE userline (
username text,
tweetid timeuuid,
body text,
PRIMARY KEY(username, tweetid)
);
39. Wait, WTF is a timeuuid?
● Aka "Type 1 UUID" (http://goo.gl/SWuCb)
● Counts 100-nanosecond units since Oct. 15, 1582
● Timestamp is the first 60 bits (sorts temporally!)
● Used like a timestamp, but:
○ more granular
○ globally unique
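These bullets can be checked directly in Python: the standard library's `uuid.uuid1()` produces exactly this kind of version-1 UUID, and its `.time` field is the 60-bit count of 100-nanosecond intervals since the Gregorian epoch. A minimal sketch (the helper name `timeuuid_to_datetime` is ours, not part of any library):

```python
import uuid
from datetime import datetime, timedelta

# The Gregorian calendar epoch used by version-1 (time-based) UUIDs.
GREGORIAN_EPOCH = datetime(1582, 10, 15)

def timeuuid_to_datetime(t1_uuid):
    """Recover the embedded timestamp from a version-1 UUID.

    .time is the 60-bit count of 100-nanosecond intervals since
    1582-10-15; dividing by 10 converts it to microseconds.
    """
    return GREGORIAN_EPOCH + timedelta(microseconds=t1_uuid.time // 10)

u = uuid.uuid1()
assert u.version == 1
print(timeuuid_to_datetime(u))  # roughly the current UTC time
```

Because the timestamp occupies the high bits, two timeuuids generated in sequence compare in time order, which is what makes `ORDER BY tweetid DESC` and `minTimeuuid()` range queries work.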
40. userline
-- Range of tweets for a user
SELECT
dateOf(tweetid), body
FROM
userline
WHERE
username = 'stewie' AND
tweetid > minTimeuuid('2013-03-01 12:10:09')
ORDER BY
tweetid DESC
LIMIT 40;
43. timeline
-- Materialized view of tweets from
-- the users that username follows.
CREATE TABLE timeline (
username text,
tweetid timeuuid,
posted_by text,
body text,
PRIMARY KEY(username, tweetid)
);
44. timeline
-- Range of tweets for a user
SELECT
dateOf(tweetid), posted_by, body
FROM
timeline
WHERE
username = 'stewie' AND
tweetid > minTimeuuid('2013-03-01 12:10:09')
ORDER BY
tweetid DESC
LIMIT 40;
45. most recent tweets for @meg
dateOf(tweetid)          | posted_by | body
--------------------------+-----------+-------------------
2013-03-19 14:43:15-0500 | stewie | victory is mine!
2013-03-19 13:23:25-0500 | meg | evolve intuit...
2013-03-19 13:23:25-0500 | meg | whiteboard bric...
2013-03-19 13:23:25-0500 | stewie | brand clic...
2013-03-19 13:23:25-0500 | brian | synergize gran...
2013-03-19 13:23:24-0500 | brian | expedite real-t...
2013-03-19 13:23:24-0500 | stewie | generate kil...
2013-03-19 13:23:24-0500 | stewie | grow B2B ...
2013-03-19 13:23:24-0500 | meg | generate intera...
...
46. redux: tweets
-- @stewie tweets
BEGIN BATCH
INSERT INTO tweets ...
INSERT INTO userline ...
INSERT INTO timeline ...
INSERT INTO timeline ...
INSERT INTO timeline ...
...
APPLY BATCH
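This is fan-out on write, sketched here in Python with hypothetical in-memory structures (all names are ours): one new tweet becomes one row in tweets, one in the author's userline, and one in the timeline of every follower.

```python
from collections import defaultdict
import uuid

tweets = {}                    # tweetid -> (author, body), the "permalink" table
userline = defaultdict(list)   # author -> [(tweetid, body)]
timeline = defaultdict(list)   # reader -> [(tweetid, author, body)]
followers = {'stewie': {'meg', 'brian'}}  # sample data: who follows stewie

def post_tweet(author, body):
    """Fan one tweet out to every materialized view, like the BATCH above."""
    tweetid = uuid.uuid1()     # timeuuid: globally unique and time-sortable
    tweets[tweetid] = (author, body)
    userline[author].append((tweetid, body))
    for reader in followers.get(author, ()):
        timeline[reader].append((tweetid, author, body))
    return tweetid

tid = post_tweet('stewie', 'victory is mine!')
assert tweets[tid] == ('stewie', 'victory is mine!')
assert len(timeline['meg']) == 1 and len(timeline['brian']) == 1
```

The write cost grows with the follower count, but every reader's home timeline becomes a single-partition slice: exactly the "think in terms of your queries" trade-off the deck advocates.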
47. In Conclusion:
● Think in terms of your queries, store that
● Don't fear duplication; Space is cheap to scale
● Go wide; Rows can have 2 billion columns!
● The only thing better than NoSQL, is MoSQL
● Python hater? Java ❤'r?
○ https://github.com/eevans/twissandra-j
● http://tinyurl.com/d0ntklik