This document summarizes Eric Evans' presentation on using Cassandra as the backend for Wikimedia's content API. It covers Wikimedia's mission of freely sharing knowledge, key metrics about Wikipedia, and its architecture. It then focuses on how Wikimedia uses Cassandra, including the data model, compression techniques, and ongoing work to optimize compaction strategies and reduce node density to improve performance.
Presented by Frank Wessels at a VMware meetup. This talk looked at the modern application stack, in which a cloud-native application is split into stateless and stateful containers.
WSO2 uses Kubernetes to provide multi-tenancy for its middleware platform. Kubernetes namespaces isolate each tenant's resources, while quotas control how much CPU and memory each tenant can use. Kubernetes also provides health monitoring, rolling updates, secret sharing between pods, and autoscaling that help reduce the complexity of WSO2's platform. WSO2's identity server integrates with Kubernetes to provide access management for tenants and users.
This document discusses solutions for preventing distributed denial-of-service (DDoS) attacks on game servers at different levels including DNS, network, and application levels. It recommends purchasing anti-DDoS services, using content delivery networks, web application firewalls, blacklisting abnormal IP addresses, and implementing packet marking and filtering techniques. The document also provides references to several commercial anti-DDoS service providers and their pricing.
Bizur is a consensus algorithm invented by Elastifile to address issues with log-based algorithms like Paxos. It optimizes for strongly consistent distributed key-value stores by dividing keys into independent buckets, each replicated and governed by leader election. Reads require a majority, and writes succeed with majority acknowledgement. It includes recovery mechanisms, such as ensuring buckets are up-to-date after leader changes, and can reconfigure membership or the number of buckets per shard dynamically through techniques like SMART migrations.
In Apache Cassandra Lunch #59: Functions in Cassandra, we discussed the functions that are usable inside of the Cassandra database. The live recording of Cassandra Lunch, which includes a more in-depth discussion and a demo, is embedded below in case you were not able to attend live.
Modern day Data Engineering requires creating reliable data pipelines, architecting distributed systems, designing data stores, and preparing data for other teams. We’ll describe a year in the life of a Data Engineer who is tasked with creating a streaming data pipeline and touch on the skills necessary to set one up using Apache Spark. Slides from the April 2019 meeting of the St. Louis Big Data IDEA meetup.
Hadoop is used extensively at TD, with 1000 daily active users running 65K Hive jobs, 180K Yarn apps, and scanning 20 trillion records from 500 petabytes stored in S3. The team runs 5 Hadoop clusters across 3 regions using a patched version of Hadoop 2.7.3 called PTD. They have improved the clusters to boot faster, be more ephemeral by storing data directly in S3, and made changes to reduce failures by enabling circuit breakers and disk quotas. The team is working on migrating to the latest Hive and simplifying configurations, as well as moving to auto-scaling and code deploy for faster operations.
A deep dive into the history of containers as well as an introduction to how they work under the covers. This includes a discussion around Control Groups and Process Namespaces, as well as touching on some underlying syscalls, such as Fork and Clone.
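The fork and clone syscalls the talk touches on can be illustrated directly from Python. A minimal sketch (Linux/macOS only): `fork()` gives the child a copy of the parent's address space, while `clone()` — which container runtimes use — additionally lets the child enter new namespaces (PID, mount, network, and so on); Python exposes `fork` directly, whereas `clone` with namespace flags needs `ctypes` or the `unshare` utility.

```python
import os

# fork() duplicates the current process; the child gets a copy of the
# parent's memory and resumes from the same point with pid == 0.
pid = os.fork()
if pid == 0:
    # Child process: in a container runtime this is where clone()'s
    # namespace flags would have given us an isolated view of the system.
    print(f"child: pid {os.getpid()}")
    os._exit(0)
else:
    # Parent process: wait for the child to finish.
    os.waitpid(pid, 0)
    print(f"parent: forked child {pid}")
```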
This document discusses sharding patterns and antipatterns for scalable databases. It covers selecting good shard keys like user IDs, routing types like using smart clients or proxies, and approaches for re-sharding like moving data instead of redistributing it. The key topics are sharding functions, routing, and re-sharding strategies to minimize disruption when updating shard configurations.
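A shard-key function of the kind the document recommends can be sketched in a few lines. This is a minimal illustration, not any particular database's implementation: a real deployment would put a routing layer (smart client or proxy) on top of a function like this.

```python
import hashlib

def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user ID to a shard with a stable hash.

    Using a hash (rather than, say, ID ranges) spreads hot users
    across shards; using the user ID as the key keeps all of one
    user's rows together.
    """
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# All of a user's rows land on one shard, so per-user queries
# never fan out across the cluster.
shard = shard_for("user:42", num_shards=8)
```

Note the re-sharding caveat from the talk: changing `num_shards` remaps almost every key, which is why moving whole shards is usually preferred over redistributing keys.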
This document discusses cloud object storage, describing features like multipart uploads, versioning, and lifecycles, and provides examples of using object storage for media and documents. Key aspects of object storage security are covered, including signatures, encryption, access control lists, and policies; disaster recovery options like geo-replication are also summarized. The conclusion emphasizes using object storage APIs to access advanced features, ensuring data safety, testing disaster recovery plans, and using Ceph for private cloud object storage.
ElastiCache is a caching service that uses Memcached. Memcached is an in-memory key-value store that provides no persistence or replication. It is fast and well suited to caching relatively small, static data. At a certain point, implementation knowledge is needed to ensure Memcached is behaving as expected: production issues can occur when objects do not fit cleanly into Memcached slabs, which allocate fixed-size chunks of memory. Monitoring tools like "stats slabs" help analyze slab allocation and object eviction patterns.
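The slab behavior behind those production issues can be made concrete with a small calculation. A sketch, using illustrative values close to Memcached's defaults (a base chunk of 96 bytes and a 1.25 growth factor) rather than any live server's configuration:

```python
def slab_chunk_sizes(base: int = 96, factor: float = 1.25, page: int = 1024 * 1024):
    """Approximate Memcached's slab classes: each class stores fixed-size
    chunks, and chunk sizes grow geometrically up to the page size.
    """
    sizes = []
    size = base
    while size <= page // 2:
        sizes.append(size)
        size = int(size * factor)
        size += -size % 8  # chunk sizes are rounded up to 8-byte multiples
    return sizes

sizes = slab_chunk_sizes()
# An object is stored in the smallest chunk that fits it; the gap between
# the object's size and the chunk size is wasted memory, and a slab class
# that fills up evicts its own objects even when other classes have room.
```

This is why "stats slabs" matters: it shows which classes your objects actually land in, and whether evictions are concentrated in one of them.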
Marble is a free digital globe and map application for KDE and Qt. It provides an interactive globe and map widget that can be used in various KDE applications. Marble has a small dataset, does not require hardware acceleration, and runs on Linux, Windows and Mac. It supports plugins and new map types can be added. Future plans include vector map tiles, routing support, and using Marble on mobile devices.
This document provides an overview of MapReduce in Python for analyzing text. It discusses setting up the environment, counting the words in Moby Dick as an example, the mapping, shuffling, and reducing steps of MapReduce, and limitations when processing very large texts. Requirements include a Unix-like system and Python. The example counts words by processing the input text with a mapper, sorting the output, and then reducing the counts with a reducer. Hadoop is also introduced as a MapReduce framework.
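The map, shuffle, and reduce steps described above can be sketched without Hadoop at all. A minimal word-count in plain Python, where `sorted()` stands in for the shuffle/sort phase (the example sentence is illustrative, not from the slides):

```python
from itertools import groupby
from operator import itemgetter

def mapper(text):
    """Emit (word, 1) pairs, like a streaming mapper reading stdin."""
    for word in text.lower().split():
        yield word.strip(".,;:!?\"'"), 1

def reducer(pairs):
    """Sum counts per word; assumes pairs arrive sorted by key,
    which is exactly what the shuffle/sort step guarantees."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

pairs = sorted(mapper("call me ishmael some years ago call me"))
counts = dict(reducer(pairs))
# counts["call"] == 2 and counts["me"] == 2
```

The limitation the document mentions follows directly: `sorted()` here holds every pair in memory, which is what Hadoop's external, distributed sort is there to avoid on very large texts.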
This document discusses the "Concierge Paradigm" for simplifying container management at scale. It proposes using two fundamental components - service discovery and process orchestration - to automate common container operations. This approach leverages small scripts to automatically register, discover, and take corrective actions on containers with minimal overhead. It has been optimized over many years and allows containers to "fly on autopilot" with drastically less management than traditional frameworks.
Git is a distributed version control system created by Linus Torvalds for Linux kernel development. It stores snapshots of files and uses checksums to track file versions. Commits contain a message, author, timestamp and reference to parent commits. Branches are pointers to commits that can be rewritten using rebase, cherry-pick or squash to clean up history. Good practices include writing descriptive commit messages and using rebase instead of merge for pull requests.
This document discusses memory related issues in Android applications. It explains that each app runs in a separate process with limited memory based on the device. If an app demands more memory than the limit, it will crash. Memory leaks and handling large bitmaps can also cause issues. Tools like logcat, MAT, and adb commands can help debug memory problems by analyzing heap dumps and tracking allocations over time.
This document summarizes a KubeVirt 101 workshop covering an introductory session and first set of labs on integrating virtual machines with Kubernetes, a short break followed by a second set of labs on more advanced KubeVirt features, and an open discussion on common KubeVirt use cases, troubleshooting, and staying engaged with the community. The workshop introduces KubeVirt as a Kubernetes addon for providing virtualization and explains how it uses CustomResourceDefinitions and controllers to integrate virtual machines and their lifecycles with Kubernetes. Hands-on labs demonstrate defining VMs, starting them, and using data volumes for importing disk images.
Among the resources offered by Wikimedia is an API providing low-latency access to full-history content, in many formats. Its results are often the product of computationally intensive transforms, and must be pre-generated and stored to meet latency expectations. Unsurprisingly, there are many challenges to providing low-latency access to such a large data-set in a demanding, globally distributed environment. This presentation covers the Wikimedia content API and its use of Apache Cassandra as storage for a diverse and growing set of use-cases. Trials, tribulations, and triumphs of both a development and operational nature will be discussed.
The Wikimedia Foundation is a non-profit and charitable organization driven by a vision of a world where every human can freely share in the sum of all knowledge. Each month Wikimedia sites serve over 18 billion page views to 500 million unique visitors around the world. Among the many resources offered by Wikimedia is a public-facing API that provides low-latency, programmatic access to full-history content and meta-data, in a variety of formats. Commonly, results from this system are the product of computationally intensive transformations, and must be pre-generated and persisted to meet latency expectations. Unsurprisingly, there are numerous challenges to providing low-latency storage of such a massive data-set in a demanding, globally distributed environment. This talk covers the Wikimedia Content API and its use of Apache Cassandra, a massively scalable distributed database, as storage for a diverse and growing set of use-cases. Trials, tribulations, and triumphs of both a development and operational nature are discussed.
Degetel DataStax webinar of October 15, 2015. From SQL to NoSQL: Why? What are the differences? How does it work?
Banking / Insurance webinar: take back control of your data.
Castle is an open-source project that provides an alternative to the lower layers of the storage stack -- RAID and POSIX filesystems -- for big data workloads, and distributed data stores such as Apache Cassandra. This presentation from Berlin Buzzwords 2012 provides a high-level overview of Castle and how it is used with Cassandra to improve performance and predictability.
This document discusses using Apache Cassandra to store and retrieve time series data more efficiently than the traditional RRDTool approach. It describes how Cassandra is well-suited for time series data due to its high write throughput, ability to store data sorted on disk, and partitioning and replication. The document also outlines a data model for storing time series metrics in Cassandra and discusses Newts, an open source time series data store built on Cassandra.
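The time-series data model described above hinges on the partition key: samples are bucketed by metric and time window so that each partition stays bounded, while rows within a partition are clustered by timestamp and read back sorted straight off disk. A minimal sketch of such a bucketing function — the names and the 24-hour window are illustrative, not Newts' actual schema:

```python
from datetime import datetime, timezone

def partition_key(metric: str, ts: datetime, width_hours: int = 24):
    """Bucket a sample into a (metric, time-window) partition.

    Bounding each partition to a fixed window keeps partitions from
    growing without limit; a range query for one metric and day then
    touches a single partition.
    """
    epoch_hours = int(ts.timestamp() // 3600)
    bucket = epoch_hours - epoch_hours % width_hours
    return (metric, bucket)

key = partition_key("node1:ifInOctets",
                    datetime(2014, 4, 7, 13, 30, tzinfo=timezone.utc))
```

Queries spanning several days simply enumerate the bucket values and issue one partition read per bucket.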
How to master the flow of IoT data with DataStax, Apache Cassandra, and Apache Spark? IoT breakfast session of November 19, 2015.
This document provides an overview and history of the Cassandra Query Language (CQL) and discusses changes between versions 1.0 and 2.0. It notes that CQL was introduced in Cassandra 0.8.0 to provide a more stable and user-friendly interface than the native Cassandra API. Major changes in CQL 2.0 included data type changes and additional functionality like named keys, counters, and timestamps. The document outlines the roadmap for future CQL features and lists several third-party driver projects supporting CQL connectivity.
Discover DataStax Enterprise with its integrations of Apache Spark and Apache Solr, as well as its new graph data model.
A presentation on the recent work to transition Cassandra from its naive one-token-per-node distribution to a proper virtual nodes implementation.
This document summarizes a presentation about modeling data with Cassandra Query Language (CQL) using examples from a Twitter-like application called Twissandra. It introduces CQL as an alternative to Thrift for querying Cassandra and describes how to model users, followers, tweets, timelines and other social media data structures in Cassandra tables. The presentation emphasizes denormalizing data and using materialized views to optimize queries, and concludes by noting that applications can be built in various languages thanks to Cassandra drivers.
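The denormalization pattern at the heart of Twissandra can be sketched in memory without Cassandra at all. A minimal illustration (the names `timelines`, `followers`, and `post_tweet` are invented for this sketch, not Twissandra's actual code): a tweet is written once per follower's timeline at write time, so reading a timeline is a single partition scan rather than a join.

```python
from collections import defaultdict

timelines = defaultdict(list)            # follower -> tweets, newest first
followers = {"alice": ["bob", "carol"]}  # who follows whom

def post_tweet(author, body):
    """Fan out the tweet to the author's own userline and to every
    follower's timeline -- work is done on write so reads stay cheap."""
    for reader in [author] + followers.get(author, []):
        timelines[reader].insert(0, (author, body))

post_tweet("alice", "hello world")
# timelines["bob"][0] == ("alice", "hello world")
```

In Cassandra this fan-out becomes one insert per (timeline, tweet) row, with the timeline owner as the partition key and the tweet time as the clustering column.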
The document discusses Cassandra's topology and how it is moving from a single token per node model to a virtual node model where each node is assigned multiple tokens. This improves load balancing and data distribution in the cluster. Specifically, it addresses problems with the single token approach like poor load distribution when nodes fail and inefficient data movement when adding or replacing nodes. The virtual node model with random token assignment provides better scaling properties as the number of nodes and data size increases.
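The load-balancing argument for virtual nodes can be demonstrated with a toy hash ring. A sketch under simplifying assumptions — random tokens in [0, 1) and MD5 as the key hash — rather than Cassandra's actual partitioner:

```python
import hashlib
import random
from bisect import bisect_right
from collections import Counter

def build_ring(nodes, vnodes_per_node):
    """Assign each node several random tokens (virtual nodes).

    With one token per node, a departing node dumps its whole range on
    one neighbour; with many tokens per node, that range is scattered
    across the cluster and rebalancing work is shared.
    """
    random.seed(0)  # deterministic for the sake of the example
    return sorted((random.random(), n)
                  for n in nodes for _ in range(vnodes_per_node))

def owner(ring, key):
    """A key belongs to the node owning the next token clockwise."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) / 16**32  # in [0, 1)
    i = bisect_right([token for token, _ in ring], h) % len(ring)
    return ring[i][1]

ring = build_ring(["a", "b", "c"], vnodes_per_node=64)
load = Counter(owner(ring, f"key{i}") for i in range(10_000))
# with 64 vnodes each, the three nodes hold roughly a third of the keys
```

Rerunning with `vnodes_per_node=1` shows the problem being fixed: single-token ownership shares are wildly uneven unless tokens are placed by hand.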
This document discusses CQL, the Cassandra Query Language. CQL is designed to be similar to SQL but with some differences to account for Cassandra's data model. The presentation provides an overview of CQL's syntax and capabilities, discusses why CQL was created to provide a more stable interface than Cassandra's native protocol, and analyzes CQL's performance compared to the native protocol. Future roadmap items for CQL are also presented, including prepared statements and custom transports. Available CQL drivers for languages like Java, Python, Ruby, and Node.js are also briefly mentioned.
OpenNMS User Conference Europe presentation on using Apache Cassandra and Newts for time-series data storage.
Whether it's statistics, weather forecasting, astronomy, finance, or network management, time series data plays a critical role in analytics and forecasting. Unfortunately, while many tools exist for time series storage and analysis, few are able to scale past memory limits, or provide rich query and analytics capabilities beyond what is necessary to produce simple plots; for those challenged by large volumes of data, there is much room for improvement. Apache Cassandra is a fully distributed second-generation database. Cassandra stores data in key-sorted order, making it ideal for time series, and its high throughput and linear scalability make it well suited to very large data sets. This talk will cover some of the requirements and challenges of large scale time series storage and analysis. Cassandra data and query modeling for this use-case will be discussed, and Newts, an open source Cassandra-based time series store under development at The OpenNMS Group, will be introduced.
The document discusses topology and partitioning in Cassandra distributed hash tables (DHTs). It describes issues with poor load distribution and data distribution in traditional DHT designs. It proposes using virtual nodes, where each physical node is assigned multiple tokens, to better distribute partitions and improve performance. Configuration options for Cassandra are presented that implement virtual nodes using a random token assignment strategy.
This document summarizes Spark, an open-source cluster computing framework that is 10-100x faster than Hadoop for interactive queries and stream processing. It discusses how Spark works and its Resilient Distributed Datasets (RDD) API. It then explains how Spark can be used with Cassandra for fast analytics, including reading and writing Cassandra data as RDDs and mapping rows to objects. Finally, it briefly covers the Shark SQL query engine on Spark.
This document discusses using Apache Cassandra to store and manage time series data in OpenNMS. It describes some limitations of the existing RRDTool-based data storage, such as high I/O requirements for updating and aggregating data. Cassandra is presented as an alternative that is optimized for write throughput, flexible data modeling, high availability, and ability to perform aggregations at read time rather than write time. The Newts project is introduced as a standalone time series data store built on Cassandra that aims to provide fast storage and retrieval of raw samples along with flexible aggregation capabilities.
Presented at Cassandra London (April 7, 2014); The challenges of time-series storage and analytics in OpenNMS, with an introduction to Newts, a new Cassandra-based time-series data store.