New Generation Oracle RAC 19c focuses on diagnosing Oracle RAC performance issues. The document discusses tools used by Oracle's RAC performance engineering team to instrument and measure key code areas between releases. It also covers how Oracle RAC provides high availability and scalability for workloads like traditional apps, new apps, IoT workloads, and more. Diagnosing performance requires understanding factors like private network latency and configuration.
DB Time, Average Active Sessions, and ASH Math - Oracle performance fundamentals (John Beresniewicz)
RMOUG 2020 abstract:
This session will cover core concepts for Oracle performance analysis first introduced in Oracle 10g and forming the backbone of many features in the Diagnostic and Tuning packs. The presentation will cover the theoretical basis and meaning of these concepts, as well as illustrate how they are fundamental to many user-facing features in both the database itself and Enterprise Manager.
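The ASH math behind these concepts can be sketched in a few lines: each sample of an active session counts as roughly one second of DB time, and Average Active Sessions is DB time divided by elapsed time. The sample counts below are hypothetical.

```python
# Sketch of the core ASH arithmetic: each ASH sample of an active session
# counts as ~1 second of DB time (the default sample interval).
# The sample data below is hypothetical.

SAMPLE_INTERVAL_SEC = 1  # ASH samples active sessions once per second

# Number of active sessions observed at each 1-second sample.
samples = [2, 3, 1, 4, 2, 3, 2, 3, 4, 1]

# DB time is approximated by counting sampled active sessions.
db_time_sec = sum(samples) * SAMPLE_INTERVAL_SEC

# Average Active Sessions = DB time / elapsed time.
elapsed_sec = len(samples) * SAMPLE_INTERVAL_SEC
avg_active_sessions = db_time_sec / elapsed_sec

print(db_time_sec)          # 25
print(avg_active_sessions)  # 2.5
```

The same arithmetic underpins the Enterprise Manager "Top Activity" charts: stacking sampled session counts by wait class yields the familiar load profile.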
MySQL InnoDB Cluster - A complete High Availability solution for MySQL (Olivier DASINI)
MySQL InnoDB Cluster provides a complete high availability solution for MySQL. It is built on MySQL Group Replication, which keeps multiple replicas of a database consistent through virtually synchronous replication and, in multi-primary mode, allows more than one read-write replica. MySQL InnoDB Cluster also includes MySQL Shell for setup, management, and orchestration of the cluster, and MySQL Router for intelligent connection routing. The result is a fault-tolerant, self-healing cluster.
Replication Troubleshooting in Classic vs. GTID (Mydbops)
This presentation will help you troubleshoot the most common MySQL replication issues, with a simple comparison of how they are solved under each replication method (classic vs. GTID).
This document provides instructions for using Filebeat, Logstash, Elasticsearch, and Kibana to monitor and visualize MySQL slow query logs. It describes installing and configuring each component on appropriate servers to ship MySQL slow logs from database servers to Logstash for processing, indexing to Elasticsearch for search and analysis, and visualization of slow query trends and details in Kibana dashboards and graphs.
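The heart of the Logstash processing step is extracting structured fields from the slow-log text. A minimal sketch of that parsing in plain Python (the log entry and field choices are illustrative, not an actual Logstash grok configuration):

```python
import re

# Hypothetical MySQL slow-log entry; real entries follow this header format.
entry = """# Time: 2023-01-15T10:12:01.000000Z
# User@Host: app[app] @ web1 [10.0.0.5]
# Query_time: 3.482360  Lock_time: 0.000120 Rows_sent: 10  Rows_examined: 250000
SELECT * FROM orders WHERE customer_id = 42;"""

# Grok-style extraction of the fields a Logstash filter would pull out
# before indexing the event into Elasticsearch.
pattern = re.compile(
    r"# Query_time: (?P<query_time>[\d.]+)\s+Lock_time: (?P<lock_time>[\d.]+)"
    r"\s*Rows_sent: (?P<rows_sent>\d+)\s+Rows_examined: (?P<rows_examined>\d+)"
)
m = pattern.search(entry)
doc = {
    "query_time": float(m.group("query_time")),
    "rows_examined": int(m.group("rows_examined")),
}
print(doc)  # {'query_time': 3.48236, 'rows_examined': 250000}
```

Once fields like `query_time` are numeric, Kibana can aggregate them into the slow-query trend graphs the document describes.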
Running Kafka as a Native Binary Using GraalVM with Ozan Günalp (Hosted by Confluent)
"During development and automated tests, it is common to create Kafka clusters from scratch and run workloads against those short-lived clusters. Starting a Kafka broker typically takes several seconds, and those seconds add up to precious time and resources.
How about spinning up a Kafka broker in less than 0.2 seconds with less memory overhead? In this session, we will talk about kafka-native, which leverages GraalVM native image for compiling Kafka broker to native executable using Quarkus framework. After going through some implementation details, we will focus on how it can be used in a Docker container with Testcontainers to speed up integration testing of Kafka applications. We will finally discuss some current caveats and future opportunities of a native-compiled Kafka for cloud-native production clusters."
Understanding Oracle RAC Internals Part 1 - Slides (Mohamed Farouk)
This document discusses Oracle RAC internals and architecture. It provides an overview of the Oracle RAC architecture including software deployment, processes, and resources. It also covers topics like VIPs, networks, listeners, and SCAN in Oracle RAC. Key aspects summarized include the typical Oracle RAC software stack, local and cluster resources, how VIPs and networks are configured, and the role and dependencies of listeners.
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI (Altinity Ltd)
Graham Mainwaring and Robert Hodges summarize management of ClickHouse on Kubernetes using the ClickHouse Kubernetes Operator and introduce a new UI for it. Presented at the 15 Dec '22 SF Bay Area ClickHouse Meetup.
Using LLVM to accelerate processing of data in Apache Arrow (DataWorks Summit)
Most query engines follow an interpreter-based approach where a SQL query is translated into a tree of relational algebra operations then fed through a conventional tuple-based iterator model to execute the query. We will explore the overhead associated with this approach and how the performance of query execution on columnar data can be improved using run-time code generation via LLVM.
Generally speaking, the best case for optimal query execution performance is a hand-written query plan that does exactly what is needed by the query for the exact same data types and format. Vectorized query processing models amortize the cost of function calls. However, research has shown that hand-written code for a given query plan has the potential to outperform the optimizations associated with a vectorized query processing model.
Over the last decade, the LLVM compiler framework has seen significant development. Furthermore, the database community has realized the potential of LLVM to boost query performance by implementing JIT query compilation frameworks. With LLVM, a SQL query is translated into a portable intermediary representation (IR) which is subsequently converted into machine code for the desired target architecture.
Dremio is built on top of Apache Arrow’s in-memory columnar vector format. The in-memory vectors map directly to the vector type in LLVM and that makes our job easier when writing the query processing algorithms in LLVM. We will talk about how Dremio implemented query processing logic in LLVM for some operators like FILTER and PROJECT. We will also discuss the performance benefits of LLVM-based vectorized query execution over other methods.
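The difference between interpreted tuple-at-a-time execution and the fused loops a JIT produces can be illustrated with a toy FILTER + PROJECT over columnar data. This is a sketch of the idea, not Dremio's actual LLVM-generated code:

```python
# Column-oriented data, as in Arrow: one Python list per column (hypothetical data).
price = [10.0, 25.0, 7.5, 40.0]
qty   = [3,    1,    10,  2]

def row_at_a_time(price, qty, threshold):
    # Interpreter-style: per-row dispatch through each operator of the plan.
    out = []
    for i in range(len(price)):
        if price[i] > threshold:           # FILTER operator
            out.append(price[i] * qty[i])  # PROJECT operator
    return out

def fused_vectorized(price, qty, threshold):
    # What JIT compilation aims for: one tight loop with FILTER and PROJECT
    # fused, amortizing dispatch overhead across the whole column.
    return [p * q for p, q in zip(price, qty) if p > threshold]

assert row_at_a_time(price, qty, 9.0) == fused_vectorized(price, qty, 9.0)
print(fused_vectorized(price, qty, 9.0))  # [30.0, 25.0, 80.0]
```

In a real engine the fused loop is emitted as LLVM IR and compiled to machine code, so the per-row function-call overhead of the interpreted version disappears entirely.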
Speaker
Siddharth Teotia, Dremio, Software Engineer
How to use 23c AHF AIOps to protect Oracle Databases 23c (Sandesh Rao)
Oracle's Autonomous Health Framework (AHF) provides capabilities for artificial intelligence for IT operations (AIOps) on Oracle Database and Exadata systems. AHF includes components for real-time monitoring, anomaly detection, root cause analysis, issue detection and resolution through machine learning models. It collects telemetry and diagnostic data from databases and operating systems and uses this for automated incident handling and to provide insights to customers and support.
Stop the Chaos! Get Real Oracle Performance by Query Tuning Part 1 (SolarWinds)
The document provides an overview and agenda for a presentation on optimizing Oracle database performance through query tuning. It discusses identifying performance issues, collecting wait event information, reviewing execution plans, and understanding how the Oracle optimizer works using features like adaptive plans and statistics gathering. The goal is to show attendees how to quickly find and focus on the queries most in need of tuning.
Materials for study session #29: "A Beginner's Introduction to PostgreSQL Recovery"
See also http://www.interdb.jp/pgsql (Coming soon!)
For beginners: an explanation of how PostgreSQL's WAL, CHECKPOINT, and online backup mechanisms work.
Once you've seen this, continue with → http://www.slideshare.net/satock/29shikumi-backup
Oracle Database Performance Tuning Advanced Features and Best Practices for DBAs (Zohar Elkayam)
Oracle Week 2017 slides.
Agenda:
Basics: How and What To Tune?
Using the Automatic Workload Repository (AWR)
Using AWR-Based Tools: ASH, ADDM
Real-Time Database Operation Monitoring (12c)
Identifying Problem SQL Statements
Using SQL Performance Analyzer
Tuning Memory (SGA and PGA)
Parallel Execution and Compression
Oracle Database 12c Performance New Features
This presentation covers MySQL data encryption at rest: how to encrypt all tablespaces and MySQL-related files for compliance. Unlike MySQL 5.7, the MySQL 8.0 releases also take care of encrypting the system tablespace and supporting tables.
Iceberg: A modern table format for big data (Strata NY 2018) by Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a table layout addressing the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project. It specifies a portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
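The snapshot mechanics behind the first three bullets can be sketched with a toy table (not Iceberg's real metadata classes): each snapshot is an immutable file list, and a commit atomically swaps a single current-snapshot pointer, so readers never observe a partial change.

```python
# Toy model of snapshot-based table metadata. In real Iceberg the snapshot
# list lives in metadata files; here it is just an in-memory list of tuples.

class Table:
    def __init__(self):
        self.snapshots = [tuple()]  # snapshot 0: empty table
        self.current = 0

    def commit(self, add=(), remove=()):
        files = [f for f in self.snapshots[self.current] if f not in remove]
        files.extend(add)
        self.snapshots.append(tuple(files))
        self.current = len(self.snapshots) - 1  # atomic pointer swap

    def scan(self, snapshot_id=None):
        # Snapshot isolation: a reader pins one snapshot and needs no
        # directory listing -- the file list is in the metadata itself.
        sid = self.current if snapshot_id is None else snapshot_id
        return self.snapshots[sid]

t = Table()
t.commit(add=["a.parquet", "b.parquet"])
reader_snapshot = t.current                         # reader pins snapshot 1
t.commit(add=["c.parquet"], remove=["a.parquet"])   # writer commits snapshot 2
print(t.scan(reader_snapshot))  # ('a.parquet', 'b.parquet')
print(t.scan())                 # ('b.parquet', 'c.parquet')
```

The pinned reader keeps seeing its snapshot even after the concurrent commit, which is exactly the lock-free read behavior the bullet list describes.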
Oracle RAC 19c: Best Practices and Secret Internals (Anil Nair)
This presentation covers best practices and new features in Oracle Real Application Clusters 19c, including upgrading to Oracle 19c. It discusses upgrading Oracle RAC to Linux 7 with minimal downtime using node draining and relocation techniques. Oracle 19c allows the Grid Infrastructure management repository to be upgraded and patching to complete faster using a new Oracle home. The presentation also covers new resource modeling for PDBs in Oracle 19c and improved Clusterware diagnostics.
Batch Processing at Scale with Flink & Iceberg (Flink Forward)
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by Andreas Hailu
Docker is all the rage these days. While one doesn't hear much about Solr on Docker, we're here to tell you not only that it can be done, but also share how it's done.
We'll quickly go over the basic Docker ideas - containers are lighter than VMs, they solve "but it worked on my laptop" issues - so we can dive into the specifics of running Solr on Docker.
We'll do a live demo showing you how to run Solr master-slave as well as SolrCloud using containers, how to manage CPU assignments, constrain memory, and use Docker data volumes when running Solr in containers. We will also show you how to create your own containers with custom configurations.
Finally, we'll address one of the core Solr questions - which deployment type should I use? We will demonstrate performance differences between the following deployment types:
- Single Solr instance running on a bare metal machine
- Multiple Solr instances running on a single bare metal machine
- Solr running in containers
- Solr running on a virtual machine
- Solr running on a virtual machine using a unikernel
For each deployment type we'll address how it impacts performance, operational flexibility and all other key pros and cons you ought to keep in mind.
This document summarizes techniques for optimizing Logstash and Rsyslog for high volume log ingestion into Elasticsearch. It discusses using Logstash and Rsyslog to ingest logs via TCP and JSON parsing, applying filters like grok and mutate, and outputting to Elasticsearch. It also covers Elasticsearch tuning including refresh rate, doc values, indexing performance, and using time-based indices on hot and cold nodes. Benchmark results show Logstash and Rsyslog can handle thousands of events per second with appropriate configuration.
Running High Performance & Fault-tolerant Elasticsearch Clusters on Docker (Sematext Group, Inc.)
This document discusses running Elasticsearch clusters on Docker containers. It describes how Docker containers are more lightweight than virtual machines and have less overhead. It provides examples of running official Elasticsearch Docker images and customizing configurations. It also covers best practices for networking, storage, constraints, and high availability when running Elasticsearch on Docker.
This document discusses key metrics to monitor for Node.js applications, including event loop latency, garbage collection cycles and time, process memory usage, HTTP request and error rates, and correlating metrics across worker processes. It provides examples of metric thresholds and issues that could be detected, such as high garbage collection times indicating a problem or an event loop blocking issue leading to high latency.
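Although the talk targets Node.js, the event-loop latency idea generalizes: schedule work for a known delay and measure how late it actually completes. A rough Python approximation (the interval and the blocking workload are illustrative):

```python
import time

# Sketch of event-loop latency measurement: ask for a fixed delay and see
# how much later it actually fires. Any blocking work inflates the lag,
# which is the signal the metric is designed to catch.

def measure_loop_lag(interval=0.01, blocking_work=None):
    start = time.monotonic()
    if blocking_work:
        blocking_work()        # simulates code that blocks the loop
    time.sleep(interval)       # stands in for the scheduled timer
    actual = time.monotonic() - start
    return actual - interval   # lag: how much later than expected

lag = measure_loop_lag(blocking_work=lambda: sum(range(1_000_000)))
print(lag > 0)  # True: the blocking work delayed the "timer"
```

In Node.js the same idea is usually implemented by comparing `setTimeout`'s requested delay against the observed one; a persistently high lag points at event-loop blocking.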
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...) (Sematext Group, Inc.)
This talk covers the basics of centralizing logs in Elasticsearch and all the strategies that make it scale with billions of documents in production. Topics include:
- Time-based indices and index templates to efficiently slice your data
- Different node tiers to de-couple reading from writing, heavy traffic from low traffic
- Tuning various Elasticsearch and OS settings to maximize throughput and search performance
- Configuring tools such as logstash and rsyslog to maximize throughput and minimize overhead
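The first strategy, time-based indices, boils down to routing each event to an index named after its date, so whole indices can be dropped or moved between tiers instead of deleting individual documents. A minimal sketch (the prefix and naming pattern are assumptions, though `logs-YYYY.MM.dd` is a common convention):

```python
from datetime import datetime, timezone

# Route each event to a daily index. Retention then becomes cheap:
# delete or relocate whole indices rather than purging documents.

def index_for(event_time, prefix="logs"):
    return f"{prefix}-{event_time:%Y.%m.%d}"

ts = datetime(2016, 5, 3, 14, 30, tzinfo=timezone.utc)
print(index_for(ts))  # logs-2016.05.03
```

An index template matching `logs-*` then applies the same mappings and settings to every daily index automatically.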
From Zero to Hero - Centralized Logging with Logstash & Elasticsearch (Sematext Group, Inc.)
Originally presented at DevOpsDays Warsaw 2014. How to set up centralized logging using either the ELK stack (Logstash, Elasticsearch, and Kibana) or Logsene.
Large Scale Log Analytics with Solr (from Lucene Revolution 2015) (Sematext Group, Inc.)
In this talk from Lucene/Solr Revolution 2015, Solr and centralized logging experts Radu Gheorghe and Rafal Kuć cover topics like: flow in Logstash, flow in rsyslog, parsing JSON, log shipping, Solr tuning, time-based collections and tiered clusters.
Radu Gheorghe gives an introduction to Solr, an open source search engine based on Apache Lucene. He discusses when Solr would be used, such as for product search, as well as when it may not be suitable, such as for sparse data. The presentation covers how Solr works with inverted indexes and scoring documents, as well as features like facets, streaming aggregations, master-slave and SolrCloud architectures. A demo is offered to illustrate Solr functionality.
This document discusses centralized logging and monitoring for Docker Swarm and Kubernetes orchestration platforms. It covers collecting container logs and metrics through agents, automatically tagging data with metadata, and visualizing logs and metrics alongside events through centralized log management and monitoring systems. An example monitoring setup is described for a Swarm cluster of 3000+ nodes running 60,000 containers.
Running High Performance and Fault Tolerant Elasticsearch Clusters on Docker (Sematext Group, Inc.)
Sematext engineer Rafal Kuc (@kucrafal) walks through the details of running high-performance, fault tolerant Elasticsearch clusters on Docker. Topics include: Containers vs. Virtual Machines, running the official Elasticsearch container, container constraints, good network practices, dealing with storage, data-only Docker volumes, scaling, time-based data, multiple tiers and tenants, indexing with and without routing, querying with and without routing, routing vs. no routing, and monitoring. Talk was delivered at DevOps Days Warsaw 2015.
An updated talk about how to use Solr for logs and other time-series data, like metrics and social media. In 2016, Solr, its ecosystem, and the operating systems it runs on have evolved quite a lot, so we can now show new techniques to scale and new knobs to tune.
We'll start by looking at how to scale SolrCloud through a hybrid approach using a combination of time- and size-based indices, and also how to divide the cluster in tiers in order to handle the potentially spiky load in real-time. Then, we'll look at tuning individual nodes. We'll cover everything from commits, buffers, merge policies and doc values to OS settings like disk scheduler, SSD caching, and huge pages.
Finally, we'll take a look at the pipeline of getting the logs to Solr and how to make it fast and reliable: where should buffers live, which protocols to use, where should the heavy processing be done (like parsing unstructured data), and which tools from the ecosystem can help.
This document compares the performance and scalability of Elasticsearch and Solr for two use cases: product search and log analytics. For product search, both products performed well at high query volumes, but Elasticsearch handled the larger video dataset faster. For logs, Elasticsearch performed better by using time-based indices across hot and cold nodes to isolate newer and older data. In general, configuration was found to impact performance more than differences between the products. Proper testing with one's own data is recommended before making conclusions.
In this presentation, we are going to discuss how elasticsearch handles the various operations like insert, update, delete. We would also cover what is an inverted index and how segment merging works.
This document discusses a game called "Monster Appetite" that was created to promote healthy eating habits in children. The game aims to teach children about nutrition in a fun way by having players feed their monster avatars over the course of a week, with the monster eating the highest number of calories being the winner. The game incorporates chance cards and competitive gameplay. It was initially tested and the creators plan to expand it with additional features like social networking and geo-location in the future.
Stealing the Best Ideas from DevOps: A Guide for Sysadmins without Developers (Tom Limoncelli)
DevOps is not a set of tools, nor is it just automating deployments. It is a set of principles that benefit anyone trying to improve a complex process. This talk will present the DevOps principles in terms that apply to all system administrators, and use case studies to explore their use in non-developer environments.
Thomas Limoncelli, StackOverflow.com, and Christina Hogan, AT&T
Presented at: Usenix LISA 2016
https://www.usenix.org/conference/lisa16/conference-program
The document discusses how humans see and understand data visualizations. It explains that visualizations are an encoding of data that leverages pre-attentive processing in the human visual system. Good visualizations optimize detection, assembly and estimation of patterns by exploiting position, length, angle and other basic visual properties. The document provides guidelines for effective visualization, such as using position on a common scale to encode the most important measurement and avoiding visual encodings that require decoding, like stacked bars.
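The "position on a common scale" guideline works because the encoding is a simple linear map from data space to screen space, which the eye can decode pre-attentively as a comparison of positions rather than mental arithmetic. A sketch of that map (the axis domain and pixel range are hypothetical):

```python
# The visual encoding behind a dot plot or bar chart: a linear map from
# the data domain to a pixel range on a shared axis.

def scale_linear(value, domain, rng):
    """Map value from data domain (d0, d1) to output range (r0, r1)."""
    d0, d1 = domain
    r0, r1 = rng
    return r0 + (value - d0) / (d1 - d0) * (r1 - r0)

# Hypothetical measurements mapped onto a 0-400px axis.
data = [10, 25, 40]
positions = [scale_linear(v, (0, 50), (0, 400)) for v in data]
print(positions)  # [80.0, 200.0, 320.0]
```

Stacked bars break this property: all but the bottom segment lose the common baseline, so the reader must decode lengths floating at arbitrary positions.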
Painless container management with Container Engine and Kubernetes (Jorrit Salverda)
This document discusses how Travix, an online travel agency, uses Google Container Engine and Kubernetes for painless container management. It describes Travix's journey from an on-premise only environment in early 2015 to now having 50 applications running in Kubernetes. Key benefits highlighted include fast and reliable deployments, auto-scaling during deployments, and less alerts and manual actions due to automatic restarts of misbehaving applications.
The document discusses a pipeline for continuous integration and deployment using MSBuild conventions. Key goals of the pipeline are to have a short feedback loop, enable rapid deployments, and optimize the process for build grids. The pipeline uses MSBuild conventions to configure builds that can compile code, run unit tests, analyze code quality, deploy to environments like UAT and production, and more with simple commands. Future enhancements proposed include automated rollbacks and cross-browser testing.
How to design a database that can ingest more than four million ... (javier ramirez)
In this session I will describe the technical decisions we made when developing QuestDB, an open source time-series database compatible with Postgres, and how we manage to write more than four million rows per second without blocking or slowing down queries.
I will talk about things like (zero) garbage collection, instruction vectorization using SIMD, rewriting instead of reusing to shave off microseconds, taking advantage of advances in processors, hard drives, and operating systems (for example, io_uring support), and the balance between user experience and performance when planning new features.
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ... (javier ramirez)
How would you build a database to support sustained ingestion of several hundreds of thousands rows per second while running near real-time queries on top?
In this session I will go over some of the technical decisions and trade-offs we applied when building QuestDB, an open source time-series database developed mainly in JAVA, and how we can achieve over four million row writes per second on a single instance without blocking or slowing down the reads. There will be code and demos, of course.
We will also review a history of some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
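One simple way to picture handling late and unordered data is to keep each partition sorted on insert (QuestDB's actual mechanism rewrites the tail of the affected partition; this toy version just binary-searches the slot for a late row):

```python
import bisect

# Toy time-series partition that stays sorted by timestamp even when rows
# arrive out of order, so reads never see unordered data.

class Partition:
    def __init__(self):
        self.timestamps = []
        self.values = []

    def insert(self, ts, value):
        # Late row: binary-search its slot instead of appending at the end.
        i = bisect.bisect_right(self.timestamps, ts)
        self.timestamps.insert(i, ts)
        self.values.insert(i, value)

p = Partition()
for ts, v in [(100, "a"), (300, "c"), (200, "b")]:  # 200 arrives late
    p.insert(ts, v)
print(p.timestamps)  # [100, 200, 300]
print(p.values)      # ['a', 'b', 'c']
```

The trade-off is the cost of the mid-list insert, which is why real engines batch late rows and merge them into the partition in one pass.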
The document discusses various technologies for building big data architectures, including NoSQL databases, distributed file systems, and data partitioning techniques. Key-value stores, document databases, and graph databases are introduced as alternatives to relational databases for large, unstructured data. It also covers approaches for scaling databases horizontally, such as sharding, replication, and partitioning data across multiple servers.
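Horizontal partitioning by key can be sketched as hash-based sharding: a stable hash of the key picks the server, so every client routes the same key to the same shard (the server names below are hypothetical):

```python
import hashlib

# Minimal hash-based sharding: the shard for a key is a stable hash of
# that key modulo the number of shards.

def shard_for(key, num_shards):
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

servers = ["db0", "db1", "db2", "db3"]
placement = {k: servers[shard_for(k, len(servers))]
             for k in ["user:1", "user:2", "order:99"]}
print(placement)  # same key always maps to the same server
```

The weakness of plain modulo hashing is that changing `num_shards` remaps almost every key, which is why systems that resize often use consistent hashing instead.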
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe... (Lucidworks)
The document summarizes key points from a presentation on optimizing Solr and log pipelines for time-series data. The presentation covered using time-based Solr collections that rotate based on size, tiering hot and cold clusters, tuning OS and Solr settings, parsing logs, buffering pipelines, and shipping logs using protocols like UDP, TCP, and Kafka. The overall conclusions were that tuning segments per tier and max merged segment size improved indexing throughput, and that simple, reliable pipelines like Filebeat to Kafka or rsyslog over UNIX sockets generally work best.
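The size- and time-based rotation decision behind "collections that rotate based on size" can be sketched as a simple check; the thresholds here are hypothetical, not values from the talk:

```python
# Rotate to a new collection when the current one grows past a size cap
# OR its age passes the rotation interval, whichever comes first.

MAX_SIZE_BYTES = 50 * 2**30   # 50 GiB cap (hypothetical)
MAX_AGE_SECONDS = 24 * 3600   # rotate at least daily (hypothetical)

def should_rotate(current_size, age_seconds):
    return current_size >= MAX_SIZE_BYTES or age_seconds >= MAX_AGE_SECONDS

print(should_rotate(10 * 2**30, 3600))       # False: small and young
print(should_rotate(60 * 2**30, 3600))       # True: over the size cap
print(should_rotate(10 * 2**30, 25 * 3600))  # True: over the age limit
```

Rotating on size as well as time keeps shards uniformly sized even when traffic is spiky, which is what makes the tiered hot/cold layout predictable.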
This document summarizes PostgreSQL features and development. It discusses how SQL allows users to access data more efficiently than other methods. It outlines scalability improvements in version 9.6 like parallel queries and replication. Community efforts for version 10 focus on further scalability, logical replication, and performance optimizations. The document suggests Tryton could benefit from PostgreSQL full text search, trigram indexes, and tools to analyze system performance.
Some vignettes and advice based on prior experience with Cassandra clusters in live environments. Includes some material from other operational slides.
Mapping Data Flows Perf Tuning April 2021 (Mark Kromer)
This document discusses optimizing performance for data flows in Azure Data Factory. It provides sample timing results for various scenarios and recommends settings to improve performance. Some best practices include using memory optimized Azure integration runtimes, maintaining current partitioning, scaling virtual cores, and optimizing transformations and sources/sinks. The document also covers monitoring flows to identify bottlenecks and global settings that affect performance.
Best Practices with PostgreSQL on Solaris (Jignesh Shah)
This document provides best practices for deploying PostgreSQL on Solaris, including:
- Using Solaris 10 or latest Solaris Express for support and features
- Separating PostgreSQL data files onto different file systems tuned for each type of IO
- Tuning Solaris parameters like maxphys, klustsize, and UFS buffer cache size
- Configuring PostgreSQL parameters like fdatasync, commit_delay, wal_buffers
- Monitoring key metrics like memory, CPU, and IO usage at the Solaris and PostgreSQL level
Log Analytics with ELK Stack describes optimizing an ELK stack implementation for a mobile gaming company to reduce costs and scale data ingestion. Key optimizations included moving to spot instances, separating logs into different indexes based on type and retention needs, tuning Elasticsearch and Logstash configurations, and implementing a hot-warm architecture across different EBS volume types. These changes reduced overall costs by an estimated 80% while maintaining high availability and scalability.
Zing Database is a distributed key-value database developed by Zing to handle their large volumes of data from feeds, user profiles, and comments in a highly available and scalable way. It uses a peer-to-peer architecture with consistent hashing for distributed addressing and data partitioning across multiple storage nodes, and supports features like caching, write-ahead logging, and replication for fault tolerance. The document discusses the architecture, distribution approach, and configuration options of the Zing Database system.
The document provides an overview of the Google Cloud Platform (GCP) Data Engineer certification exam, including the content breakdown and question format. It then details several big data technologies in the GCP ecosystem such as Apache Pig, Hive, Spark, and Beam. Finally, it covers various GCP storage options including Cloud Storage, Cloud SQL, Datastore, BigTable, and BigQuery, outlining their key features, performance characteristics, data models, and use cases.
Taking Splunk to the Next Level - Architecture Breakout Session (Splunk)
This document provides an overview of scaling a Splunk deployment from an initial use case to a larger enterprise deployment. It discusses growing use cases and data volume over time. The agenda covers use case mapping, simple scaling approaches, indexer and search head clustering, distributed management, and hybrid cloud deployments. Best practices are outlined for sizing storage, tuning indexers, and designing high availability into the forwarding, indexing, and search tiers. Clustering impacts on storage sizing and additional hosts are also addressed.
Amazon Aurora is a MySQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. Amazon Aurora is disruptive technology in the database space, bringing a new architectural model and distributed systems techniques to provide far higher performance, availability and durability than previously available using conventional monolithic database techniques. In this session, we will do a deep-dive into some of the key innovations behind Amazon Aurora, discuss best practices and configurations, and share early customer experience from the field.
This session will cover performance-related developments in Red Hat Gluster Storage 3 and share best practices for testing, sizing, configuration, and tuning.
Join us to learn about:
Current features in Red Hat Gluster Storage, including 3-way replication, JBOD support, and thin-provisioning.
Features that are in development, including network file system (NFS) support with Ganesha, erasure coding, and cache tiering.
New performance enhancements related to remote direct memory access (RDMA), small-file performance, FUSE caching, and solid state disk (SSD) readiness.
This is an exam cheat sheet that hopes to cover all key points for the GCP Data Engineer Certification Exam.
Let me know if there is any mistake and I will try to update it
A comprehensive introduction to NoSQL solutions inside the big data landscape. Graph store? Column store? Key-value store? Document store? Redis or Memcached? DynamoDB? MongoDB? HBase? Cloud or open source?
Amazon Aurora adds PostgreSQL compatibility to its cloud-optimized relational database. With PostgreSQL compatibility, customers can now choose to use Amazon's database with the performance and availability of commercial databases and the simplicity and cost-effectiveness of open source databases. Amazon Aurora provides high performance, durability, availability and automatic scaling capabilities for PostgreSQL workloads.
Configuring storage. The slides for this webinar cover how to configure storage for Aerospike, including a discussion of how Aerospike uses flash/SSDs and how to get the best performance out of them.
Find the full webinar with audio here - http://www.aerospike.com/webinars
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr..." (Yahoo Developer Network)
This document provides an overview of MapReduce programming and best practices for Apache Hadoop. It describes the key components of Hadoop including HDFS, MapReduce, and the data flow. It also discusses optimizations that can be made to MapReduce jobs, such as using combiners, compression, and speculation. Finally, it outlines some anti-patterns to avoid and tips for debugging MapReduce applications.
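The combiner optimization mentioned above can be sketched with word count: running the reduce logic locally on each mapper's output shrinks the data that must be shuffled across the network. A toy single-process version (split contents are hypothetical):

```python
from collections import Counter
from itertools import chain

# map -> combine -> reduce for word count. The combiner applies the reduce
# logic per mapper, so each word is shuffled once per split instead of
# once per occurrence.

def map_phase(line):
    return [(w, 1) for w in line.split()]

def combine(pairs):
    # Local, per-mapper aggregation -- same logic as the reducer.
    c = Counter()
    for word, n in pairs:
        c[word] += n
    return list(c.items())

splits = ["the quick fox", "the lazy dog the end"]
mapped = [combine(map_phase(s)) for s in splits]  # combiner runs per split
result = Counter()
for word, n in chain.from_iterable(mapped):       # shuffle + reduce phase
    result[word] += n
print(result["the"])  # 3
```

Combiners are only safe when the reduce function is associative and commutative, as summing counts is here.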
Similar to Elasticsearch for Logs & Metrics - a deep dive
This talk was given during Activate Conference 2019. Lucene has a lot of options for configuring similarity, and Solr inherits them. Similarity makes the base of your relevancy score: how similar is this document to the query? The default similarity (BM25) is a good start, but you may need to tweak it for your use-case. In this session, you will learn how BM25 works and how you may want to change its parameters. Then, we'll move to other similarity classes: DFR, DFI, IB and LM. You will learn the thinking behind them, how that thinking translates to the similarity score, and which parameters allow you to tweak how score evolves based on things like term frequency or document length. By the end, you’ll have a good understanding of which similarity options are likely to work well for your use-case. You'll know which tunables are available and whether you need to implement a custom similarity class. As an example, we’ll focus on E-commerce, where you often end up ignoring term frequency altogether.
Key Takeaways
1) What are the built-in Lucene/Solr similarities and what they do
2) Which similarity to use for which use-case
3) How to use a custom similarity class in Solr
Learn more about search relevance and similarity: sematext.com/blog/search-relevance-solr-elasticsearch-similarity
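The BM25 formula the session builds on can be sketched for a single term and document; k1 and b are the tunables discussed (k1 caps term-frequency saturation, b scales length normalization). This mirrors the shape of Lucene's default similarity, not its exact implementation:

```python
import math

# BM25 score for one term in one document.
#   tf: term frequency in the document, df: document frequency of the term,
#   num_docs: documents in the index, doc_len/avg_doc_len: field lengths.

def bm25(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# Term-frequency saturation: quadrupling tf far less than quadruples the score.
s1 = bm25(tf=1, df=10, num_docs=1000, doc_len=100, avg_doc_len=100)
s4 = bm25(tf=4, df=10, num_docs=1000, doc_len=100, avg_doc_len=100)
print(s4 / s1 < 4)  # True

# b=0 disables length normalization -- one way to approximate the
# "ignore term frequency and length" tweaks common in e-commerce.
long_doc = bm25(tf=1, df=10, num_docs=1000, doc_len=500, avg_doc_len=100, b=0)
short_doc = bm25(tf=1, df=10, num_docs=1000, doc_len=50, avg_doc_len=100, b=0)
print(long_doc == short_doc)  # True
```

Pushing k1 toward 0 flattens the tf curve entirely, which is the direction the talk suggests for catalogs where one mention of a term is as good as ten.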
This document discusses best practices for containerizing Java applications to avoid out of memory errors and performance issues. It covers choosing appropriate Java versions, garbage collector tuning, sizing heap memory correctly while leaving room for operating system caches, avoiding swapping, and monitoring applications to detect issues. Key recommendations include using the newest Java version possible, configuring the garbage collector appropriately for the workload, allocating all heap memory at startup, and monitoring memory usage to detect problems early.
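The heap-sizing advice can be sketched as a simple rule of thumb: give the heap only part of the container limit and leave headroom for non-heap memory (metaspace, thread stacks, direct buffers) and OS caches. The 75% fraction below is an assumption, not a fixed rule:

```python
# Suggest an -Xmx value from a container memory limit, reserving headroom
# for non-heap memory and the OS page cache.

def suggest_xmx_mb(container_limit_mb, heap_fraction=0.75):
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must leave non-heap headroom")
    return int(container_limit_mb * heap_fraction)

print(suggest_xmx_mb(4096))  # 3072 -> e.g. java -Xmx3072m ...
```

Setting -Xms equal to -Xmx then allocates the whole heap at startup, so an undersized container fails fast instead of being OOM-killed under load.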
This talk was given during Monitorama EU 2018.
Observability, like other ops practices, has hard and soft benefits. No logs - no root cause, that’s a hard benefit. A soft benefit is when we have more confidence in an observable system. Then we can be more productive in developing it. The trouble with soft benefits like confidence, is how to measure them. Does observability actually make us more productive? How about other activities, such as post-mortems? Why is alert fatigue so bad? Turns out, there are plenty of studies about the impact of such activities on our brain, our behavior, our productivity. In this session, we’ll explore what [neuro]science says about such practices so that:
We turn soft benefits into hard benefits
We can encourage a culture where we get the benefits and avoid the traps
Be prepared for surprises, as some “best practices” aren’t “best” at all.
The document discusses introducing log analysis to an organization. It covers log shipping architecture using file shippers, centralized buffers like Kafka and Redis, and storage and analysis using Elasticsearch, Kibana and Grafana. Specific topics covered include choosing the right shipper, buffer types, protocols, and optimizing Elasticsearch configuration, indices, and hardware for different node types like data, ingest and client nodes.
This talk was given during Lucene Revolution 2017.
They say optimize is bad for you, they say you shouldn't do it, they say it will invalidate operating system caches and make your system suffer. This is all true, but is it true in all cases?
In this presentation we will look closer on what optimize or better called force merge does to your Solr search engine. You will learn what segments are, how they are built and how they are used by Lucene and Solr for searching. We will discuss real-life performance implications regarding Solr collections that have many segments on a single node and compare that to the Solr where the number of segments is moderate and low. We will see what we can do to tune the merging process to trade off indexing performance for better query performance and what pitfalls are there waiting for us. Finally, at the end of the talk we will discuss possibilities of running force merge to avoid system disruption and still benefit from query performance boost that single segment index provides.
The document summarizes the good, bad, and ugly aspects of using Solr on Docker. The good is the orchestration and ability to dynamically allocate resources which can deliver on the promise of development, testing, and production environments being the same. The bad is that treating instances as cattle rather than pets requires good sizing, configuration, and scaling practices. The ugly is that the ecosystem is still young, leading to exciting bugs as Docker is still the future.
Sematext's DevOps Evangelist, Stefan Thies (@seti321), takes a Docker Logging tour through the different log collection options Docker users have, the pros and cons of each, specific and existing Docker logging solutions, tooling, the role of syslog, log shipping to ELK Stack, and more. Q&A session at end.
For the Docker users out there, Sematext's DevOps Evangelist, Stefan Thies, goes through a number of different Docker monitoring options, points out their pros and cons, and offers solutions for Docker monitoring. Webinar contains actionable content, diagrams and how-to steps.
This document discusses Sematext's monitoring and logging products and services. It introduces Sematext, which is headquartered in Brooklyn and has employees globally. It then discusses why performance monitoring, log searching, and anomaly alerting are needed capabilities (Why). The document proceeds to describe Sematext's SPM and Logsene products, which provide these capabilities using open source technologies like OpenTSDB, Elasticsearch, and Kafka. It covers how the SPM agent collects metrics and traces and how Logsene ingests and analyzes logs at scale.
The document discusses various Solr anti-patterns and best practices for optimizing Solr performance, including properly configuring request handlers, schema fields, thread pools, caching, indexing, and faceting. It provides examples of incorrect configurations that can cause issues and recommendations for improved configurations to avoid problems and optimize querying, indexing, and response times.
This document discusses tuning Solr for log search and analysis. It provides the results of baseline tests on Solr performance and capacity indexing 10 million logs. Various configuration changes are then tested, such as using time-based collections, DocValues, commit settings, and hardware optimizations. Using tools like Apache Flume to preprocess logs before indexing into Solr is also recommended for improved throughput. Overall, the document emphasizes that software and hardware optimizations can significantly improve Solr performance and capacity when indexing logs.
This document discusses search in big data and how Elasticsearch provides a solution. It addresses the challenges of fancy search features requiring distributed architecture to process large volumes of data across multiple servers. Elasticsearch implements a distributed search engine that allows real-time analytics on large, document-oriented data through its use of Lucene, JSON over HTTP, and sharding of data and queries across multiple nodes.
This document summarizes a presentation comparing Solr and Elasticsearch. It outlines the main topics covered, including documents, queries, mapping, indexing, aggregations, percolations, scaling, searches, and tools. Examples of specific features like bool queries, facets, nesting aggregations, and backups are demonstrated for both Solr and Elasticsearch. The presentation concludes by noting most projects work well with either system and to choose based on your use case.
This document summarizes the evolution of open source search tools from the early 1970s to present day. It discusses the transition from early tools like WAIS and Harvest in the 1990s to modern distributed search platforms like Elasticsearch. Key areas of advancement are highlighted, such as support for more languages through improved stemming and lemmatization, more sophisticated relevance algorithms, distributed architectures for scaling data and queries, faster indexing and real-time search, reduced memory footprints, and expanding capabilities beyond basic text search to include geospatial, classification, recommendation, key-value storage, analytics and more.
Elasticsearch and Solr for Logs + info on Rsyslog, Kibana, Logstash, and Apache Flume for log shipping logs. VIDEO at: http://blog.sematext.com/2014/02/26/video-and-presentation-indexing-and-searching-logs-with-elasticsearch-or-solr/
The document discusses Elasticsearch. It is a RESTful search and analytics engine. The document contains various URLs and JSON snippets relating to indexing and retrieving data from Elasticsearch. It shows examples of adding, updating, and retrieving documents from an index called "blog".
This document summarizes concepts and techniques for administering and monitoring SolrCloud, including: how SolrCloud distributes data across shards and replicas; how to start a local or distributed SolrCloud cluster; how to create, split, and reload collections using the Collections API; how to modify schemas dynamically using the Schema API; directory implementations and segment merging; configuring autocommits; caching in Solr; metrics to monitor such as indexing throughput, search latency, and JVM memory usage; and tools for monitoring Solr clusters like the Solr administration panel and JMX.
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsMydbops
This presentation, delivered at the Postgres Bangalore (PGBLR) Meetup-2 on June 29th, 2024, dives deep into connection pooling for PostgreSQL databases. Aakash M, a PostgreSQL Tech Lead at Mydbops, explores the challenges of managing numerous connections and explains how connection pooling optimizes performance and resource utilization.
Key Takeaways:
* Understand why connection pooling is essential for high-traffic applications
* Explore various connection poolers available for PostgreSQL, including pgbouncer
* Learn the configuration options and functionalities of pgbouncer
* Discover best practices for monitoring and troubleshooting connection pooling setups
* Gain insights into real-world use cases and considerations for production environments
This presentation is ideal for:
* Database administrators (DBAs)
* Developers working with PostgreSQL
* DevOps engineers
* Anyone interested in optimizing PostgreSQL performance
Contact info@mydbops.com for PostgreSQL Managed, Consulting and Remote DBA Services
Coordinate Systems in FME 101 - Webinar SlidesSafe Software
If you’ve ever had to analyze a map or GPS data, chances are you’ve encountered and even worked with coordinate systems. As historical data continually updates through GPS, understanding coordinate systems is increasingly crucial. However, not everyone knows why they exist or how to effectively use them for data-driven insights.
During this webinar, you’ll learn exactly what coordinate systems are and how you can use FME to maintain and transform your data’s coordinate systems in an easy-to-digest way, accurately representing the geographical space that it exists within. During this webinar, you will have the chance to:
- Enhance Your Understanding: Gain a clear overview of what coordinate systems are and their value
- Learn Practical Applications: Why we need datams and projections, plus units between coordinate systems
- Maximize with FME: Understand how FME handles coordinate systems, including a brief summary of the 3 main reprojectors
- Custom Coordinate Systems: Learn how to work with FME and coordinate systems beyond what is natively supported
- Look Ahead: Gain insights into where FME is headed with coordinate systems in the future
Don’t miss the opportunity to improve the value you receive from your coordinate system data, ultimately allowing you to streamline your data analysis and maximize your time. See you there!
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionBert Blevins
Cybersecurity is a major concern in today's connected digital world. Threats to organizations are constantly evolving and have the potential to compromise sensitive information, disrupt operations, and lead to significant financial losses. Traditional cybersecurity techniques often fall short against modern attackers. Therefore, advanced techniques for cyber security analysis and anomaly detection are essential for protecting digital assets. This blog explores these cutting-edge methods, providing a comprehensive overview of their application and importance.
Details of description part II: Describing images in practice - Tech Forum 2024BookNet Canada
This presentation explores the practical application of image description techniques. Familiar guidelines will be demonstrated in practice, and descriptions will be developed “live”! If you have learned a lot about the theory of image description techniques but want to feel more confident putting them into practice, this is the presentation for you. There will be useful, actionable information for everyone, whether you are working with authors, colleagues, alone, or leveraging AI as a collaborator.
Link to presentation recording and transcript: https://bnctechforum.ca/sessions/details-of-description-part-ii-describing-images-in-practice/
Presented by BookNet Canada on June 25, 2024, with support from the Department of Canadian Heritage.
Support en anglais diffusé lors de l'événement 100% IA organisé dans les locaux parisiens d'Iguane Solutions, le mardi 2 juillet 2024 :
- Présentation de notre plateforme IA plug and play : ses fonctionnalités avancées, telles que son interface utilisateur intuitive, son copilot puissant et des outils de monitoring performants.
- REX client : Cyril Janssens, CTO d’ easybourse, partage son expérience d’utilisation de notre plateforme IA plug & play.
An invited talk given by Mark Billinghurst on Research Directions for Cross Reality Interfaces. This was given on July 2nd 2024 as part of the 2024 Summer School on Cross Reality in Hagenberg, Austria (July 1st - 7th)
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Erasmo Purificato
Slide of the tutorial entitled "Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Emerging Trends" held at UMAP'24: 32nd ACM Conference on User Modeling, Adaptation and Personalization (July 1, 2024 | Cagliari, Italy)
YOUR RELIABLE WEB DESIGN & DEVELOPMENT TEAM — FOR LASTING SUCCESS
WPRiders is a web development company specialized in WordPress and WooCommerce websites and plugins for customers around the world. The company is headquartered in Bucharest, Romania, but our team members are located all over the world. Our customers are primarily from the US and Western Europe, but we have clients from Australia, Canada and other areas as well.
Some facts about WPRiders and why we are one of the best firms around:
More than 700 five-star reviews! You can check them here.
1500 WordPress projects delivered.
We respond 80% faster than other firms! Data provided by Freshdesk.
We’ve been in business since 2015.
We are located in 7 countries and have 22 team members.
With so many projects delivered, our team knows what works and what doesn’t when it comes to WordPress and WooCommerce.
Our team members are:
- highly experienced developers (employees & contractors with 5 -10+ years of experience),
- great designers with an eye for UX/UI with 10+ years of experience
- project managers with development background who speak both tech and non-tech
- QA specialists
- Conversion Rate Optimisation - CRO experts
They are all working together to provide you with the best possible service. We are passionate about WordPress, and we love creating custom solutions that help our clients achieve their goals.
At WPRiders, we are committed to building long-term relationships with our clients. We believe in accountability, in doing the right thing, as well as in transparency and open communication. You can read more about WPRiders on the About us page.
Choose our Linux Web Hosting for a seamless and successful online presencerajancomputerfbd
Our Linux Web Hosting plans offer unbeatable performance, security, and scalability, ensuring your website runs smoothly and efficiently.
Visit- https://onliveserver.com/linux-web-hosting/
Best Programming Language for Civil EngineersAwais Yaseen
The integration of programming into civil engineering is transforming the industry. We can design complex infrastructure projects and analyse large datasets. Imagine revolutionizing the way we build our cities and infrastructure, all by the power of coding. Programming skills are no longer just a bonus—they’re a game changer in this era.
Technology is revolutionizing civil engineering by integrating advanced tools and techniques. Programming allows for the automation of repetitive tasks, enhancing the accuracy of designs, simulations, and analyses. With the advent of artificial intelligence and machine learning, engineers can now predict structural behaviors under various conditions, optimize material usage, and improve project planning.
The DealBook is our annual overview of the Ukrainian tech investment industry. This edition comprehensively covers the full year 2023 and the first deals of 2024.
Kief Morris rethinks the infrastructure code delivery lifecycle, advocating for a shift towards composable infrastructure systems. We should shift to designing around deployable components rather than code modules, use more useful levels of abstraction, and drive design and deployment from applications rather than bottom-up, monolithic architecture and delivery.
7 Most Powerful Solar Storms in the History of Earth.pdfEnterprise Wired
Solar Storms (Geo Magnetic Storms) are the motion of accelerated charged particles in the solar environment with high velocities due to the coronal mass ejection (CME).
Comparison Table of DiskWarrior Alternatives.pdfAndrey Yasko
To help you choose the best DiskWarrior alternative, we've compiled a comparison table summarizing the features, pros, cons, and pricing of six alternatives.
4. Daily indices are a good start
Today’s index gets the indexing and most of the searches; older indices only serve searches
Indexing is faster in smaller indices
Cheap deletes (drop the whole index)
Search only the needed indices
“Static” indices can be cached
5. The Black Friday problem*
* for logs. Metrics usually don’t suffer from this
6. Typical indexing performance graph for one shard*
* throttled so search performance remains decent
At this point it’s better to index in a new shard
Typically 5-10GB, YMMV
11. Slicing data by time
For spiky ingestion, use size-based indices
Make sure you rotate before the performance drop
(test on one node to get that limit)
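Size-based rotation can be automated with the rollover API, which landed in Elasticsearch 5.0, i.e. newer than some setups this deck targets; here is a hedged sketch (index and alias names are illustrative, and newer versions also accept a "max_size" condition):

```bash
# Create the first index behind a write alias
curl -XPUT 'localhost:9200/logs-000001' -d '
{ "aliases": { "logs_write": {} } }'

# Call periodically (e.g. from cron): creates logs-000002 and switches the
# alias once a condition is met (thresholds illustrative)
curl -XPOST 'localhost:9200/logs_write/_rollover' -d '
{ "conditions": { "max_age": "7d", "max_docs": 50000000 } }'
```

This way you rotate before hitting the performance drop instead of on a fixed schedule.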
12. Multi tier architecture (aka hot/cold)
Diagram: client nodes → many data nodes → dedicated master nodes (plus ingest nodes)
We can optimize the data nodes layer
16. Multi tier architecture (aka hot/cold)
Diagram: es_hot_1 holds today’s index (logs_2016.11.11) - indexing, most searches
es_cold_1 and es_cold_2 hold the older indices (logs_2016.11.07 … logs_2016.11.10) - long running searches
Hot tier needs good CPU and the best possible IO (SSD, or RAID0 for spinning disks)
Cold tier needs heap, plus IO for backup/replication and stats
17. Hot - cold architecture summary
Cost optimization - different hardware for each tier
Performance - the above, plus fewer shards per node means less overhead
Isolation - long running searches don’t affect indexing
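One common way to implement the tiers - an assumption here, the deck doesn’t spell it out - is tag-based shard allocation filtering (5.x-era syntax, matching the `-E` style used later in the deck; the attribute name "box_type" is just a convention):

```bash
# Tag nodes at startup
bin/elasticsearch -Enode.attr.box_type=hot    # on hot-tier nodes
bin/elasticsearch -Enode.attr.box_type=cold   # on cold-tier nodes

# New indices are pinned to the hot tier...
curl -XPUT 'localhost:9200/logs_2016.11.11/_settings' -d '
{ "index.routing.allocation.require.box_type": "hot" }'

# ...and moved to the cold tier once they stop being written to
curl -XPUT 'localhost:9200/logs_2016.11.07/_settings' -d '
{ "index.routing.allocation.require.box_type": "cold" }'
```

Changing the setting on an existing index makes Elasticsearch relocate its shards to matching nodes in the background.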
18. Elasticsearch high availability & fault tolerance
Dedicated masters are a must
discovery.zen.minimum_master_nodes = N/2 + 1
Keep your indices balanced
an unbalanced cluster can lead to instability
Balanced primaries are also good
helps with backups, moving indices to the cold tier, etc.
total_shards_per_node is your friend
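A hedged sketch of those two safeguards (5.x-era zen discovery; the numbers are illustrative for a cluster with 3 master-eligible nodes):

```bash
# elasticsearch.yml on every node: quorum is N/2 + 1, here N = 3
#   discovery.zen.minimum_master_nodes: 2

# Cap how many shards of one index may land on a single node, so primaries
# and replicas stay spread out (value illustrative)
curl -XPUT 'localhost:9200/logs_2016.11.11/_settings' -d '
{ "index.routing.allocation.total_shards_per_node": 2 }'
```

Note that total_shards_per_node is a hard limit: set it too low and some shards may stay unassigned when nodes fail.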
19. Elasticsearch high availability & fault tolerance
When in AWS - spread nodes between availability zones
bin/elasticsearch -Enode.attr.zone=zoneA
cluster.routing.allocation.awareness.attributes: zone
We need headroom for spikes
leave at least 20-30% for indexing & search spikes
Large machines with many shards?
watch out for GC - many clusters have died because of it
consider running more, smaller ES instances instead
20. Which settings to tune
Merges → most indexing time
Refreshes → check refresh_interval
Flushes → normally OK with ES defaults
21. Relaxing the merge policy
Fewer merges ⇒ faster indexing/lower CPU while indexing
Slower searches, but:
- there’s more spare CPU
- aggregations aren’t as affected, and they are typically the bottleneck
especially for metrics
More open files (keep an eye on them!)
Increase index.merge.policy.segments_per_tier ⇒ more segments, fewer merges
Increase max_merge_at_once too, but not as much ⇒ reduced spikes
Reduce max_merged_segment ⇒ no more huge merges, but more small ones
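These three knobs are dynamic index settings, so a relaxed policy can be applied on the fly; a hedged sketch (values are illustrative, not recommendations - the defaults are 10, 10 and 5gb respectively):

```bash
# Relax the tiered merge policy for one index: tolerate more segments per
# tier, merge slightly more at once, and stop producing huge segments
curl -XPUT 'localhost:9200/logs_2016.11.11/_settings' -d '
{
  "index.merge.policy.segments_per_tier": 20,
  "index.merge.policy.max_merge_at_once": 15,
  "index.merge.policy.max_merged_segment": "2gb"
}'
```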
22. And even more settings
Refresh interval (index.refresh_interval)*
- 1s -> baseline indexing throughput
- 5s -> +25% to baseline throughput
- 30s -> +75% to baseline throughput
Higher indices.memory.index_buffer_size ⇒ higher throughput
Lower indices.queries.cache.size for high velocity data to free up heap
Omit norms (frequencies and positions, too?)
Don't store fields if _source is used
Don't store catch-all (i.e. _all) field - data copied from other fields
* https://sematext.com/blog/2013/07/08/elasticsearch-refresh-interval-vs-indexing-performance/
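For example (index name illustrative; the node-level settings go in elasticsearch.yml and need a restart, and the values shown are assumptions, not tested recommendations):

```bash
# Per-index, dynamic: trade refresh latency for indexing throughput
# (30s matches the +75% figure above)
curl -XPUT 'localhost:9200/logs_2016.11.11/_settings' -d '
{ "index.refresh_interval": "30s" }'

# elasticsearch.yml (node-level; defaults are 10% for both):
#   indices.memory.index_buffer_size: 30%
#   indices.queries.cache.size: 5%
```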
23. Let’s dive deeper into storage
No searches on a field, just aggregations ⇒ index=false
No sorting/aggregating on a field ⇒ doc_values=false
Doc values can be used for retrieving (see docvalue_fields), so:
● Logs: use doc values for retrieving, exclude them from _source*
● Metrics: short fields normally ⇒ disable _source, rely on doc values
Long retention for logs? For “old” indices:
● set index.codec=best_compression
● force merge to few segments
* though you’ll lose highlighting, update API, reindex API...
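A log-oriented index along those lines might look like this (a sketch in 5.x mapping syntax; the field names are illustrative assumptions):

```bash
curl -XPUT 'localhost:9200/logs_2016.11.11' -d '
{
  "settings": { "index.codec": "best_compression" },
  "mappings": {
    "log": {
      "properties": {
        "status_code": { "type": "integer", "index": false },
        "request_id":  { "type": "keyword", "doc_values": false },
        "message":     { "type": "text", "norms": false }
      }
    }
  }
}'

# Once an index goes "static", merge it down to few segments
curl -XPOST 'localhost:9200/logs_2016.11.07/_forcemerge?max_num_segments=1'
```

Here status_code is aggregated but never searched (index=false), request_id is searched but never sorted/aggregated (doc_values=false), and message is full-text with norms omitted.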
24. Metrics: working around sparse data
Ideally, you’d have one index per metric type (what you can fetch with one call)
Combining them into one (sparse) index will impact performance (see LUCENE-7253)
One doc per metric: you’ll pay with space
Nested documents: you’ll pay with heap (bitset used for joins) and query latency
25. What about the OS?
Say no to swap
Disk scheduler: CFQ for HDD, deadline for SSD
Mount options: noatime, nodiratime, data=writeback, nobarrier
because strict ordering is for the weak
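The OS tweaks above could be applied roughly like this (a sketch; device names are illustrative, and nobarrier/data=writeback trade crash safety for speed, so apply with care):

```bash
sudo swapoff -a                      # say no to swap
sudo sysctl -w vm.swappiness=1       # ...and discourage it after reboot, too

cat /sys/block/sda/queue/scheduler   # check the current disk scheduler
echo deadline | sudo tee /sys/block/sda/queue/scheduler   # SSD; cfq for HDD

# /etc/fstab entry for the data volume (ext4 shown):
#   /dev/sda1  /data  ext4  noatime,nodiratime,data=writeback,nobarrier  0 0
```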
26. And hardware?
Hot tier. Typical bottlenecks: CPU and IO throughput
indexing is CPU-intensive
flushes and merges write (and read) lots of data
Cold tier: Memory (heap) and IO latency
more data here ⇒ more indices&shards ⇒ more heap
⇒ searches hit more files
many stats calls are per shard ⇒ potentially choke IO when cluster is idle
Generally:
network storage needs to be really good (esp. for cold tier)
network needs to be low latency (pings, cluster state replication)
network throughput is needed for replication/backup
27. AWS specifics
c3 instances work, but there’s not enough local SSD ⇒ EBS gp2 SSD*
c4 + EBS give similar performance, but cheaper
i2s are good, but expensive
d2s are better value, but can’t deal with many shards (spinning disk latency)
m4 + gp2 EBS are a good balance
gp2, because PIOPS is expensive and spinning is slow
3 IOPS/GB, but caps at 160MB/s or 10K IOPS (of up to 256kb) per drive
performance isn’t guaranteed (for gp2) ⇒ one slow drive slows RAID0
Enhanced Networking (and EBS Optimized if applicable) are a must
* And use local SSD as cache (--cachemode writeback for async writing; mind the cache block size):
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/lvm_cache_volume_creation.html
30. The pipeline
Log shipper: read → buffer → deliver
Read: from files? sockets? the network?
Buffer - reason #1 to have a shipper: what if it fills up? how to buffer if $destination is down?
Processing: before or after the buffer? and how?
Deliver: to destinations other than Elasticsearch, too?
Overview of 6 log shippers: sematext.com/blog/2016/09/13/logstash-alternatives/
32. Where to do processing
Logstash (or Filebeat or…) → Buffer (Kafka/Redis) → Logstash → Elasticsearch
(processing happens in the Logstash stage after the buffer)
33. Where to do processing
Logstash → Buffer (Kafka/Redis) → Logstash → Elasticsearch + something else
(processing still happens after the buffer)
34. Where to do processing
Same pipeline, but with one consumer feeding several destinations the outputs need to be in sync
35. Where to do processing
Logstash → Kafka → Logstash → Elasticsearch (processing here)
Kafka → another Logstash, reading at its own offset → something else (processing here, too)
36. Where to do processing (syslog-ng, fluentd…)
One process does it all: processing can happen at the input stage or per output
input (processing here) → Elasticsearch + something else (processing here, too)
37. Where to do processing (rsyslogd…)
Processing can happen at the input stage or in any of the per-output queues
38. Zoom into processing
Ideally, log in JSON
Otherwise, parse - for performance and maintenance
(i.e. no need to update parsing rules)
Regex-based (e.g. grok)
Easy to build rules
Rules are flexible
Slow & O(n) on # of rules. Tricks:
- move matching patterns to the top of the list
- move broad patterns to the bottom
- skip patterns that include others that didn’t match
Grammar-based (e.g. liblognorm, PatternDB)
Faster: O(1) on # of rules
Shippers: Logagent and Logstash are regex-based; rsyslog (liblognorm) and syslog-ng (PatternDB) are grammar-based
References:
sematext.com/blog/2015/05/18/tuning-elasticsearch-indexing-pipeline-for-logs/
www.fernuni-hagen.de/imperia/md/content/rechnerarchitektur/rainer_gerhards.pdf
39. Back to buffers: check what happens when they fill up
Local files: when are they rotated/archived/deleted?
TCP: what happens when the connection breaks/times out?
UNIX sockets: what happens when the socket blocks writes?
UDP: network buffers should handle spiky load
check/increase net.core.rmem_max and net.core.rmem_default
Unlike UDP & TCP, local sockets (both DGRAM and STREAM) are reliable/blocking
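Checking and raising those kernel receive buffers could look like this (a sketch; the 8 MB value is an illustrative choice, not a recommendation from the talk):

```bash
# Current values
sysctl net.core.rmem_max net.core.rmem_default

# Raise them so spiky UDP log traffic isn't silently dropped
sudo sysctl -w net.core.rmem_max=8388608
sudo sysctl -w net.core.rmem_default=8388608

# To persist across reboots, add the same keys to /etc/sysctl.conf
```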
40. Let’s talk protocols now
UDP: cool for the app, but not reliable
TCP: more reliable, but not completely:
the app gets an ACK once the OS buffer receives the data
⇒ no retransmit if that buffer is lost*
Application-level ACKs (sender ⇐ receiver) may be needed
when failure/backpressure handling matters
* more at blog.gerhards.net/2008/05/why-you-cant-build-reliable-tcp.html
Protocol - example shippers:
HTTP - Logstash, rsyslog, syslog-ng, Fluentd, Logagent
RELP - rsyslog, Logstash
Beats - Filebeat, Logstash
Kafka - Fluentd, Filebeat, rsyslog, syslog-ng, Logstash
41. Wrapping up: where to log?
Is the log critical?
no ⇒ UDP. Increase network buffers on the destination so it can handle spiky traffic
yes ⇒ are you paying with RAM or with IO?
RAM ⇒ UNIX socket. Local shipper with memory buffers that can drop data if needed
IO ⇒ local files. Make sure rotation is in place or you’ll run out of disk!