Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. It is written in Java and uses a pluggable backend. Presto is fast due to code generation and runtime compilation techniques. It provides a library and framework for building distributed services and fast Java collections. Plugins allow Presto to connect to different data sources like Hive, Cassandra, MongoDB and more.
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
As we continue to push the boundaries of what is possible with respect to pipeline throughput and data serving tiers, new methodologies and techniques continue to emerge to handle larger and larger workloads.
2022-06-23 Apache Arrow and DataFusion: Changing the Game for implementing Da...
DataFusion is an extensible and embeddable query engine, written in Rust, used to create modern, fast, and efficient data pipelines, ETL processes, and database systems.
This presentation explains where it fits into the data ecosystem and how it helps you implement your system in Rust.
The document discusses various techniques for profiling CPU and memory performance in Rust programs, including:
- Using the flamegraph tool to profile CPU usage by sampling a running process and generating flame graphs.
- Integrating pprof profiling into Rust programs to expose profiles over HTTP similar to how it works in Go.
- Profiling heap usage by integrating jemalloc profiling and generating heap profiles on program exit.
- Some challenges with profiling asynchronous Rust programs due to the lack of backtraces.
The key takeaways are that crates like pprof-rs and techniques like jemalloc integration allow collecting CPU and memory profiles from Rust programs, but profiling asynchronous programs remains challenging because of the lack of backtraces.
A Practical Introduction to Handling Log Data in ClickHouse, by Robert Hodges...
This document discusses using ClickHouse to manage log data. It begins with an introduction to ClickHouse and its features. It then covers different ways to model log data in ClickHouse, including storing logs as JSON blobs or converting them to a tabular format. The document demonstrates using materialized views to ingest logs into ClickHouse tables in an efficient manner, extracting values from JSON and converting to columns. It shows how this approach allows flexible querying of log data while scaling to large volumes.
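The JSON-to-columns idea can be illustrated outside ClickHouse. The following is a minimal Python sketch, not ClickHouse syntax: the field names (`ts`, `level`, `msg`) and the `to_row` helper are hypothetical, chosen only to show how a raw JSON blob can be kept alongside a few extracted columns.

```python
import json

# Hypothetical column set; a real schema would depend on the log format.
COLUMNS = ("ts", "level", "msg")


def to_row(raw_line):
    """Parse one JSON log line and extract selected fields into columns.

    The original blob is kept in the last position so no information is
    lost; fields missing from the blob become None.
    """
    record = json.loads(raw_line)
    return tuple(record.get(name) for name in COLUMNS) + (raw_line,)


raw_logs = [
    '{"ts": "2023-01-01T00:00:00Z", "level": "INFO", "msg": "started"}',
    '{"ts": "2023-01-01T00:00:05Z", "level": "ERROR", "msg": "disk full", "host": "a1"}',
]

# Tabular form: filtering on a column no longer needs JSON parsing per query.
rows = [to_row(line) for line in raw_logs]
errors = [row for row in rows if row[1] == "ERROR"]
```

Extracting at ingest time moves the parsing cost to the write path, which is the trade-off the materialized-view approach described above relies on.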
Introducing KRaft: Kafka Without Zookeeper With Colin McCabe | Current 2022
Apache Kafka without ZooKeeper is now production ready! This talk is about how you can run without ZooKeeper, and why you should.
Oracle Database 19c builds upon key architectural, distributed data, and performance innovations established in the earlier Oracle Database 12c and 18c releases. Oracle 19c has many new features; this presentation covers the following areas:
- Automated Installation, Configuration and Patching
- AutoUpgrade and Database Utilities
Stop the Chaos! Get Real Oracle Performance by Query Tuning Part 1
The document provides an overview and agenda for a presentation on optimizing Oracle database performance through query tuning. It discusses identifying performance issues, collecting wait event information, reviewing execution plans, and understanding how the Oracle optimizer works using features like adaptive plans and statistics gathering. The goal is to show attendees how to quickly find and focus on the queries most in need of tuning.
Kafka Streams State Stores Being Persistent
This document discusses Kafka Streams state stores. It provides examples of using different types of windowing (tumbling, hopping, sliding, session) with state stores. It also covers configuring state store logging, caching, and retention policies. The document demonstrates how to define windowed state stores in Kafka Streams applications and discusses concepts like grace periods.
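To make the window types concrete, here is a small plain-Python sketch of how a record timestamp could map to tumbling and hopping windows. It does not use the Kafka Streams API; the function names are hypothetical, and the grace-period check is a simplified model whose exact boundary condition may differ from Kafka Streams.

```python
def tumbling_window(ts_ms, size_ms):
    """Return the single [start, end) window that contains ts_ms."""
    start = ts_ms - (ts_ms % size_ms)
    return (start, start + size_ms)


def hopping_windows(ts_ms, size_ms, advance_ms):
    """Return every [start, end) window containing ts_ms.

    Windows are aligned to multiples of advance_ms and overlap whenever
    advance_ms is smaller than size_ms.
    """
    windows = []
    start = ts_ms - (ts_ms % advance_ms)  # latest window start <= ts_ms
    while start > ts_ms - size_ms and start >= 0:
        windows.append((start, start + size_ms))
        start -= advance_ms
    return sorted(windows)


def window_is_open(window, stream_time_ms, grace_ms):
    """Simplified model: late records are accepted until end + grace."""
    return stream_time_ms < window[1] + grace_ms


one_window = tumbling_window(7000, 5000)
overlapping = hopping_windows(7000, 5000, 2000)
```

A tumbling window is the special case of a hopping window whose advance equals its size, so each record lands in exactly one window.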
Presentation at Strata Data Conference 2018, New York
The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker.
Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
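The failover flow described above can be sketched as a toy model. This is an illustrative simplification in Python, not Kafka's controller code: the `Partition` class and `on_broker_failure` function are hypothetical, and real controllers handle many more cases (for example, what happens to the in-sync set when its last member fails).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Partition:
    replicas: List[int]      # assigned broker ids, in preference order
    isr: Set[int]            # replicas currently in sync with the leader
    leader: Optional[int]    # None means the partition is offline


def on_broker_failure(partitions: Dict[str, Partition], failed: int) -> None:
    """Drop the failed broker from every in-sync set and re-elect leaders."""
    for partition in partitions.values():
        partition.isr.discard(failed)
        if partition.leader == failed:
            # Promote the first assigned replica that is still in sync.
            candidates = [b for b in partition.replicas if b in partition.isr]
            partition.leader = candidates[0] if candidates else None


partitions = {
    "t-0": Partition(replicas=[1, 2, 3], isr={1, 2, 3}, leader=1),
    "t-1": Partition(replicas=[2, 3, 1], isr={2}, leader=2),
}
on_broker_failure(partitions, failed=1)
```

Only in-sync replicas are eligible in this sketch, because promoting a lagging replica could lose acknowledged writes.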
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail and discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives. In this tutorial, the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Presto is an open-source distributed SQL query engine for interactive analytics. It uses a connector architecture to query data across different data sources and formats in the same query. Presto's query planning and execution involves scanning data sources, optimizing query plans, distributing queries across workers, and aggregating results. Understanding Presto's query plans helps optimize queries and troubleshoot performance issues.
Building a fully managed stream processing platform on Flink at scale for Lin...
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (internal document store) and feature management etc. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.
Understanding and Improving Code Generation
Code generation is integral to Spark’s physical execution engine. When implemented, the Spark engine creates optimized bytecode at runtime, improving performance compared to interpreted execution. Spark has taken the next step with whole-stage codegen, which collapses an entire query into a single function.
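The idea of collapsing several operators into one function can be shown with a toy example. The sketch below generates Python source rather than JVM bytecode, so it only mirrors the concept; `generate_fused` and the expression strings are hypothetical, and the strings are assumed to be trusted because they are passed to `exec`.

```python
def interpreted(rows, predicate, projection):
    """Operator-at-a-time execution: every row crosses two call boundaries."""
    filtered = (row for row in rows if predicate(row))
    return [projection(row) for row in filtered]


def generate_fused(filter_expr, project_expr):
    """Build one function in which filter and projection share a single loop."""
    source = (
        "def fused(rows):\n"
        "    out = []\n"
        "    for r in rows:\n"
        f"        if {filter_expr}:\n"
        f"            out.append({project_expr})\n"
        "    return out\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["fused"]


rows = [{"name": "ann", "age": 35}, {"name": "bob", "age": 25}]
fused = generate_fused("r['age'] > 30", "r['name'].upper()")
result = fused(rows)
```

Both paths return the same rows; the generated version simply removes the per-row indirection between operators, which is the effect whole-stage code generation aims for.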
InfluxDB IOx Tech Talks: The Impossible Dream: Easy-to-Use, Super Fast Softw...
The document discusses how an easy-to-use and fast database can have a complicated implementation for developers. It outlines four key areas: 1) Flexible writing schema requires schema merging at read time. 2) Fast reads prune non-covered data chunks through predicate push-down. 3) Loading duplicated data necessitates data deduplication and compaction operations. 4) Quick data deletion still needs data elimination at read time or in the background. The document provides examples to illustrate the tradeoffs between user and developer requirements.
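Two of the four areas, chunk pruning and read-time deduplication, can be sketched in a few lines. This is a simplified Python illustration, not InfluxDB IOx code: the `Chunk` layout, the `(time, tag)` key, and the last-write-wins policy are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    min_time: int
    max_time: int
    rows: List[Tuple[int, str, float]]  # (time, tag, value)


def prune(chunks, t_lo, t_hi):
    """Predicate push-down: skip chunks whose time range cannot match."""
    return [c for c in chunks if c.max_time >= t_lo and c.min_time <= t_hi]


def scan(chunks, t_lo, t_hi):
    """Read surviving chunks and deduplicate on (time, tag) at read time."""
    latest = {}
    for chunk in prune(chunks, t_lo, t_hi):
        for time, tag, value in chunk.rows:
            if t_lo <= time <= t_hi:
                latest[(time, tag)] = value  # later chunks overwrite earlier ones
    return sorted((time, tag, value) for (time, tag), value in latest.items())


chunks = [
    Chunk(0, 10, [(5, "a", 1.0)]),
    Chunk(20, 30, [(25, "a", 2.0)]),
    Chunk(0, 10, [(5, "a", 9.0), (7, "b", 3.0)]),  # reloads (5, "a")
]
result = scan(chunks, 0, 10)
```

The user sees a simple query over deduplicated data, while the engine carries the pruning and merging work, which is the trade-off the document describes.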
Evening out the uneven: dealing with skew in Flink
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by Jun Qin & Karl Friedrich
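As an illustration of one widely used remedy for key skew, the sketch below shows key salting with a two-phase aggregation. It is plain Python rather than Flink code, and it is not taken from the talk: the `#` separator, the function names, and the salt count are assumptions for the example.

```python
import random
from collections import Counter


def salted_key(key, num_salts, rng):
    """Append a random salt so one hot key spreads over several sub-keys."""
    return f"{key}#{rng.randrange(num_salts)}"


def two_phase_count(keys, num_salts, seed=0):
    """Phase 1 counts per salted key; phase 2 merges back per original key."""
    rng = random.Random(seed)
    partial = Counter(salted_key(key, num_salts, rng) for key in keys)
    final = Counter()
    for salted, count in partial.items():
        original = salted.rsplit("#", 1)[0]
        final[original] += count
    return partial, final


keys = ["hot"] * 1000 + ["cold"] * 10
partial, final = two_phase_count(keys, num_salts=4)
```

The first phase spreads the hot key over up to `num_salts` parallel tasks, and the second phase only has to merge a handful of partial counts per key.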
Presto generates Java bytecode at runtime to optimize query execution. Key query operations like filtering, projections, joins and aggregations are compiled into efficient Java methods using libraries like ASM and Fastutil. This bytecode generation improves performance by 30% through techniques like compiling row hashing for join lookups directly into machine instructions.
The document discusses the future of server-side JavaScript. It covers various Node.js frameworks and libraries that support both synchronous and asynchronous programming styles. Common Node aims to provide interoperability across platforms by implementing the synchronous CommonJS proposals using fibers. Examples demonstrate how Common Node allows for synchronous-style code while maintaining asynchronous behavior under the hood. Benchmarks show it has performance comparable to Node.js. The author advocates for toolkits over frameworks and continuing development of common standards and packages.
Node has captured the attention of early adopters by clearly differentiating itself as being asynchronous from the ground up while remaining accessible. Now that server-side JavaScript is at the cutting edge of the asynchronous, real-time web, it is in a much better position to establish itself as the go-to language for synchronous CRUD web apps as well, and to gain a stronger foothold on the server.
This talk covers the current state of server-side JavaScript beyond Node. It introduces Common Node, a synchronous CommonJS compatibility layer using node-fibers which bridges the gap between the different platforms. We look into Common Node's internals, compare its performance to that of other implementations such as RingoJS, and go through some ideal use cases.
DSLing your System For Scalability Testing Using Gatling - Dublin Scala User ...
The power of Gatling is the DSL it provides for writing meaningful and expressive tests. The presenters give an overview of the framework, describe their development environment and goals, and present their test results.
Source code available https://github.com/lawlessc/random-response-time
Introduction to Big Data Technology and Apache Hadoop
The document introduces concepts related to Big Data technology including volume, variety, and velocity of data. It discusses Hadoop architecture including HDFS, MapReduce, YARN, and the Hadoop ecosystem. Examples are provided of common Big Data problems and how they can be solved using Hadoop frameworks like Pig, Hive, and Ambari.
CouchDB Mobile - From Couch to 5K in 1 Hour
This document provides an overview of CouchDB, a NoSQL database that uses JSON documents with a flexible schema. It demonstrates CouchDB's features like replication, MapReduce, and filtering. The presentation then shows how to build a mobile running app called Couch25K that tracks locations using CouchDB and syncs data between phones and a server. Code examples are provided in Objective-C, Java, and JavaScript for creating databases, saving documents, querying, and syncing.
Leveraging Hadoop in your PostgreSQL Environment
This talk will begin with a discussion of the strengths of PostgreSQL and Hadoop. We will then lead into a high-level overview of Hadoop and its community of projects like Hive, Flume, and Sqoop. Finally, we will dig down into various use cases detailing how you can leverage Hadoop technologies for your PostgreSQL databases today. The use cases will range from using HDFS for simple database backups to using PostgreSQL and Foreign Data Wrappers to do low-latency analytics on your Big Data.
A Deep Dive into Query Execution Engine of Spark SQL
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. Relational queries are compiled into executable physical plans consisting of transformations and actions on RDDs, together with generated Java code. The code is compiled to Java bytecode, executed by the JVM, and optimized by the JIT compiler into native machine code at runtime. This talk takes a deep dive into the Spark SQL execution engine. It covers pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.
JavaScript is evolving with the addition of modules, platform consistency, and harmony features. Modules allow JavaScript code to be organized and avoid naming collisions. CommonJS and AMD module formats are used widely. Platform consistency is improved through polyfills that mimic future APIs for older browsers. Harmony brings language-level modules and features like destructuring assignment, default parameters, and promises to JavaScript. Traceur compiles Harmony code to existing JavaScript.
The document provides best practices for handling performance issues in an Odoo deployment. It recommends gathering deployment information, such as hardware specs, number of machines, and integration with web services. It also suggests monitoring tools to analyze system performance and important log details like CPU time, memory limits, and request processing times. The document further discusses optimizing PostgreSQL settings, using tools like pg_activity, pg_stat_statements, and pgbadger to analyze database queries and performance. It emphasizes reproducing issues, profiling code with tools like the Odoo profiler, and fixing problems in an iterative process.
This document provides an introduction and overview of a Node.js tutorial presented by Tom Hughes-Croucher. The tutorial covers topics such as building scalable server-side code with JavaScript using Node.js, debugging Node.js applications, using frameworks like Express.js, and best practices for deploying Node.js applications in production environments. The tutorial includes exercises for hands-on learning and demonstrates tools and techniques like Socket.io, clustering, error handling and using Redis with Node.js applications.
This talk was given at the Dutch PHP Conference 2011 and details the use of Comet (also known as reverse Ajax or Ajax push) technologies and the importance of WebSockets and server-sent events. More information is available at http://joind.in/3237.
The document provides an overview of RxJava and its advantages over traditional Java streams and callbacks. It discusses key RxJava concepts like Observables, Observers, and Subscriptions. It demonstrates how to create Observables, subscribe to them, and compose operations like filter, map, and zip. It shows how to leverage schedulers to control threading. The document also provides examples of using RxJava with HTTP requests and the Twitter API to asynchronously retrieve user profiles and tweets. It highlights scenarios where RxJava is useful, like handling asynchronous operations, and discusses some pitfalls like its learning curve and need to understand backpressure.
This document introduces Reactive Programming with RxJava and how it can be used to create non-blocking applications. It discusses the limitations of blocking code and how RxJava uses Observables and Subscribers to implement reactive and asynchronous operations. It provides examples of converting blocking servlets and HTTP calls to non-blocking using RxJava. While non-blocking code is not always faster, it allows asynchronous operations to utilize threads more efficiently.
MongoDB is the trusted document store we turn to when we have tough data store problems to solve. For this talk we are going to go a little bit off the path and explore what other roles we can fit MongoDB into. Others have discussed how to turn MongoDB’s capped collections into a publish/subscribe server. We stretch that a little further and turn MongoDB into a full-fledged broker with both publish/subscribe and queue semantics, and the ability to mix them. We will provide code and a running demo of the queue producers and consumers. Next we will turn to coordination services: we will explore the fundamental features and show how to implement them using MongoDB as the storage engine. Again we will show the code and demo the coordination of multiple applications.
This 15-minute presentation discusses non-blocking I/O, event loops, and Node.js. It builds on previous work by Ryan Dahl, explaining how threads can be expensive due to context switching and memory usage, and how Node.js uses an event-driven, non-blocking model to avoid these costs. Code examples demonstrate getting and printing a policy object, handling HTTP requests asynchronously without blocking additional connections, and using callbacks to chain asynchronous actions together.
The document discusses Concurrency-oriented Programming (COP) using Erlang. It explains how Erlang programs work using lightweight processes that communicate asynchronously via message passing. This allows for high performance, reliability, and scalability. It provides examples of stateless server processes and using CouchDB for schema-free document storage accessible via REST APIs. Ruby libraries for interacting with CouchDB are also mentioned.
This document summarizes ql.io, a domain specific language for consuming HTTP APIs. Ql.io allows API calls to be made with fewer lines of code and reduced data sizes compared to traditional HTTP requests. It handles parallelizing requests and joining responses implicitly. Ql.io also allows mapping HTTP resources to SQL-like queries, enabling sequential and parallel queries over multiple APIs with a simple syntax. It can be used as an HTTP gateway or from Node.js.
Apache Spark Streaming in K8s with ArgoCD & Spark Operator, by Databricks
Over the last year, we have been moving from a batch processing setup with Airflow on EC2 instances to a powerful and scalable setup using Airflow and Spark in K8s.
The growing need to keep up with technology changes, new community advances, and multidisciplinary teams forced us to design a solution that runs multiple Spark versions at the same time while avoiding duplicated infrastructure and simplifying deployment, maintenance, and development.
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Facebook, Airbnb, Netflix, Uber, Twitter, Bloomberg, and FINRA, Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments in the last few years.
Inspired by the increasingly complex SQL queries run by the Presto user community, engineers at Facebook and Starburst have recently focused on cost-based query optimization. In this talk we will present the initial design and implementation of the CBO, support for connector-provided statistics, estimating selectivity, and choosing efficient query plans. Then, our detailed experimental evaluation will illustrate the performance gains for several classes of queries achieved thanks to the optimizer. Finally, we will discuss our future work enhancing the initial CBO and present the general Presto roadmap for 2018 and beyond.
Speakers
Kamil Bajda-Pawlikowski, Starburst Data, CTO & Co-Founder
Martin Traverso
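To give a feel for selectivity estimation and plan choice, here is a small Python sketch based on textbook formulas (uniform value distribution, one over the number of distinct values for equality). It is not Presto's implementation: the `ColumnStats` fields, the function names, and the table sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    distinct_values: int
    low: float
    high: float


def equality_selectivity(stats):
    """Estimated fraction of rows matching `col = constant`."""
    return 1.0 / stats.distinct_values if stats.distinct_values else 1.0


def less_than_selectivity(stats, value):
    """Estimated fraction of rows matching `col < value`, assuming uniformity."""
    if stats.high == stats.low:
        return 1.0 if value > stats.low else 0.0
    fraction = (value - stats.low) / (stats.high - stats.low)
    return min(1.0, max(0.0, fraction))


def choose_build_side(left_rows, right_rows):
    """Build the join hash table on the side estimated to be smaller."""
    return "left" if left_rows < right_rows else "right"


# Hypothetical tables: 1,000,000 orders filtered by price < 25 and
# 50,000 customers filtered by an equality predicate on country.
orders_rows = 1_000_000 * less_than_selectivity(ColumnStats(1000, 0.0, 100.0), 25.0)
customers_rows = 50_000 * equality_selectivity(ColumnStats(50, 0.0, 0.0))
build_side = choose_build_side(orders_rows, customers_rows)
```

With incomplete connector statistics the estimates degrade, which is why a cost-based optimizer needs sensible fallbacks when a column has no distinct-value count or range.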
Kafka Streams State Stores Being Persistentconfluent
This document discusses Kafka Streams state stores. It provides examples of using different types of windowing (tumbling, hopping, sliding, session) with state stores. It also covers configuring state store logging, caching, and retention policies. The document demonstrates how to define windowed state stores in Kafka Streams applications and discusses concepts like grace periods.
Presentation at Strata Data Conference 2018, New York
The controller is the brain of Apache Kafka. A big part of what the controller does is to maintain the consistency of the replicas and determine which replica can be used to serve the clients, especially during individual broker failure.
Jun Rao outlines the main data flow in the controller—in particular, when a broker fails, how the controller automatically promotes another replica as the leader to serve the clients, and when a broker is started, how the controller resumes the replication pipeline in the restarted broker.
Jun then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives. In this tutorial, the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Presto is an open-source distributed SQL query engine for interactive analytics. It uses a connector architecture to query data across different data sources and formats in the same query. Presto's query planning and execution involves scanning data sources, optimizing query plans, distributing queries across workers, and aggregating results. Understanding Presto's query plans helps optimize queries and troubleshoot performance issues.
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (internal document store) and feature management etc. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.
Understanding and Improving Code GenerationDatabricks
Code generation is integral to Spark’s physical execution engine. When implemented, the Spark engine creates optimized bytecode at runtime improving performance when compared to interpreted execution. Spark has taken the next step with whole-stage codegen which collapses an entire query into a single function.
InfluxDB IOx Tech Talks: The Impossible Dream: Easy-to-Use, Super Fast Softw...InfluxData
The document discusses how an easy-to-use and fast database can have a complicated implementation for developers. It outlines four key areas: 1) Flexible writing schema requires schema merging at read time. 2) Fast reads prune non-covered data chunks through predicate push-down. 3) Loading duplicated data necessitates data deduplication and compaction operations. 4) Quick data deletion still needs data elimination at read time or in the background. The document provides examples to illustrate the tradeoffs between user and developer requirements.
Evening out the uneven: dealing with skew in FlinkFlink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
Presto generates Java bytecode at runtime to optimize query execution. Key query operations like filtering, projections, joins and aggregations are compiled into efficient Java methods using libraries like ASM and Fastutil. This bytecode generation improves performance by 30% through techniques like compiling row hashing for join lookups directly into machine instructions.
8. Who uses Presto - Airbnb
Airpal - a web-based, query execution tool
"Presto is amazing. It's an order of magnitude faster than Hive in most our use cases. It reads directly from HDFS, so unlike Redshift, there isn't a lot of ETL before you can use it. It just works."
- Christopher Gutierrez, Manager of Online Analytics, Airbnb
12. Presto is
•Fast !!! (10x faster than Hive)
•Even faster with new Presto ORC reader
•Written in Java with a pluggable backend
•Not SQL-like, but ANSI-SQL
•Code generation like LLVM
•Not only source is open, but open sourced (No private branch)
15. Presto is
CREATE TABLE mysql.hello.order_item AS
SELECT o.*, i.*
FROM hive.world.orders o -- TABLESAMPLE SYSTEM (10)
JOIN mongo.deview.lineitem i -- TABLESAMPLE BERNOULLI (40)
ON o.orderkey = i.orderkey
WHERE conditions..
16. Presto - Planner
[Diagram. Coordinator: SQL → Analyzer → Analysis → Logical Plan → Optimizer → Plan → Fragmenter → Plan Fragment → Distributed Query Scheduler → Stage Execution Plan. Workers: Local Execution Planner → TASK.]
17. Presto - Page
PAGE
- positionCount
VAR_WIDTH BLOCK
- nulls
- offsets
- values
FIXED_WIDTH BLOCK
- nulls
- values
Page / Block serialization
Page: positionCount | blockCount | serialized blocks
Fixed-width block: 11 | "FIXED_WIDTH" | positionCount | bit-encoded nullFlags | values length | values
Variable-width block: 14 | "VARIABLE_WIDTH" | positionCount | offsets[0] | offsets[1] | offsets[2…] | offsets[pos-1] | offsets[pos] | bit-encoded nullFlags | values
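The fixed-width layout above can be sketched in plain Java. This is only an illustration of the field order shown on the slide, not Presto's actual encoder: the class name FixedWidthBlockSketch, the 4-byte big-endian integers, the MSB-first bit packing of the null flags, and the choice of 8-byte long values are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class FixedWidthBlockSketch {
    // Hypothetical helper: writes name length, name, positionCount,
    // bit-encoded null flags, values length, then the values themselves.
    static byte[] serialize(long[] values, boolean[] isNull) {
        byte[] name = "FIXED_WIDTH".getBytes(StandardCharsets.US_ASCII); // 11 bytes

        // Pack one null flag per bit, most significant bit first (assumption).
        byte[] nullBits = new byte[(values.length + 7) / 8];
        for (int i = 0; i < values.length; i++) {
            if (isNull[i]) {
                nullBits[i / 8] |= (byte) (0x80 >>> (i % 8));
            }
        }

        int valuesLength = values.length * Long.BYTES;
        ByteBuffer out = ByteBuffer.allocate(
                Integer.BYTES + name.length + Integer.BYTES
                        + nullBits.length + Integer.BYTES + valuesLength);
        out.putInt(name.length)      // encoding name length
                .put(name)           // encoding name
                .putInt(values.length) // positionCount
                .put(nullBits)       // bit-encoded nullFlags
                .putInt(valuesLength); // values length in bytes
        for (long value : values) {
            out.putLong(value);      // values
        }
        return out.array();
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new long[] {1, 2, 3}, new boolean[] {false, true, false});
        System.out.println(bytes.length + " bytes");
    }
}
```

Writing the encoding name first lets a reader pick the right decoder before it touches the payload, which is presumably why both layouts on the slide start with a length-prefixed name.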
18. Presto - Cluster Memory Manager
[Diagram: the Coordinator polls each Worker with GET /v1/memory.]
@Config("query.max-memory") = 20G
@Config("query.max-memory-per-node") = 1G
@Config("resources.reserved-system-memory") = 40% of -Xmx
[Diagram: worker memory pools: System (reserved-system-memory), Reserved (max-memory-per-node), General. Labels: Task, Block All tasks.]
26. Airlift - Distributed service framework
•https://github.com/airlift/airlift
•Core of Presto communication
•HTTP
•Bootstrap
•Node discovery
•RESTful API
•Dependency Injection
•Configuration
•Utilities
@Path("/v2/event")
public class EventResource {
    @POST
    public Response createQuery(EventRequests events) {
        …
    }
}
public class CollectorMainModule implements ConfigurationAwareModule {
    @Override
    public synchronized void configure(Binder binder) {
        discoveryBinder(binder).bindHttpAnnouncement("collector");
        jsonCodecBinder(binder).bindJsonCodec(EventRequest.class);
        jaxrsBinder(binder).bind(EventResource.class);
    }
    public static void main(String[] args) {
        Bootstrap app = new Bootstrap(ImmutableList.of(
                new NodeModule(),
                new DiscoveryModule(),
                new HttpServerModule(),
                new JsonModule(), new JaxrsModule(true),
                new EventModule(),
                new CollectorMainModule()
        ));
        Injector injector = app.strictConfig().initialize();
        injector.getInstance(Announcer.class).start();
    }
}
ex) https://github.com/miniway/presto-event-collector
27. Fastutil - Fast Java collection
•FastUtil 6.6.0 turned out to be consistently fast.
•Koloboke is getting second in many tests.
•GS implementation is good enough, but is slower than FastUtil and Koloboke.
http://java-performance.info/hashmap-overview-jdk-fastutil-goldman-sachs-hppc-koloboke-trove-january-2015/
28. ASM - Bytecode manipulation
package pkg;
public interface SumInterface {
    long sum(long value);
}
public class MyClass implements SumInterface {
    private long result = 0L;
    public MyClass(long value) {
        result = value;
    }
    @Override
    public long sum(long value) {
        result += value;
        return result;
    }
}
// The same class built with ASM (opcodes statically imported from org.objectweb.asm.Opcodes)
ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_MAXS);
cw.visit(V1_7, ACC_PUBLIC,
        "pkg/MyClass", null,
        "java/lang/Object",
        new String[] { "pkg/SumInterface" });
cw.visitField(ACC_PRIVATE,
        "result", "J", null, 0L);
// constructor
MethodVisitor m = cw.visitMethod(ACC_PUBLIC,
        "<init>", "(J)V", null, null);
m.visitCode();
// call super()
m.visitVarInsn(ALOAD, 0); // this
m.visitMethodInsn(INVOKESPECIAL,
        "java/lang/Object", "<init>", "()V",
        false);
29. ASM - Bytecode manipulation (Cont.)
// this.result = value
m.visitVarInsn(ALOAD, 0); // this
m.visitVarInsn(LLOAD, 1); // value
m.visitFieldInsn(PUTFIELD,
        "pkg/MyClass", "result", "J");
m.visitInsn(RETURN);
m.visitMaxs(0, 0); // arguments are ignored and recomputed under COMPUTE_MAXS
m.visitEnd();
// public long sum(long value)
m = cw.visitMethod(ACC_PUBLIC, "sum", "(J)J",
        null, null);
m.visitCode();
m.visitVarInsn(ALOAD, 0); // this (receiver for PUTFIELD)
m.visitVarInsn(ALOAD, 0); // this (receiver for GETFIELD)
m.visitFieldInsn(GETFIELD,
        "pkg/MyClass", "result", "J");
m.visitVarInsn(LLOAD, 1); // value
// this.result + value
m.visitInsn(LADD);
m.visitFieldInsn(PUTFIELD,
        "pkg/MyClass", "result", "J");
// return this.result
m.visitVarInsn(ALOAD, 0); // this
m.visitFieldInsn(GETFIELD,
        "pkg/MyClass", "result", "J");
m.visitInsn(LRETURN);
m.visitMaxs(0, 0); // arguments are ignored and recomputed under COMPUTE_MAXS
m.visitEnd();
cw.visitEnd();
byte[] bytes = cw.toByteArray();
// defineClass is protected, so call it from a ClassLoader subclass
Class<?> clazz = defineClass("pkg.MyClass", bytes, 0, bytes.length);
30. Library - Misc.
•JDK 8u40 +
•Guice - Lightweight dependency injection
•Guava - Replacing Java8 Stream, Optional and Lambda
•ANTLR4 - Parser generator, SQL parser
•Jetty - HTTP Server and Client
•Jackson - JSON
•Jersey - RESTful API
32. Code Generation
•ASM
•Runtime generation of Java classes and methods based on the SQL
•Where
•Filter and Projection
•Join Lookup source
•Join Probe
•Order By
•Aggregation
33. Code Generation - Filter
SELECT * FROM lineitem
WHERE orderkey = 100 AND quantity = 200
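Logical Planner: AND( EQ(#0, 100), EQ(#1, 200) )
class AndOperator extends Operator {
    private Operator left = new EqualOperator(0, 100L);  // orderkey (#0) = 100
    private Operator right = new EqualOperator(1, 200L); // quantity (#1) = 200
    @Override
    public boolean evaluate(Cursor cur)
    {
        if (!left.evaluate(cur)) {
            return false;
        }
        return right.evaluate(cur);
    }
}
class EqualOperator extends Operator {
    private final int position; // input channel to read
    private final Object value; // constant to compare against
    EqualOperator(int position, Object value)
    {
        this.position = position;
        this.value = value;
    }
    @Override
    public boolean evaluate(Cursor cur)
    {
        return cur.getValue(position).equals(value);
    }
}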
34. Code Generation - Filter
// invoke MethodHandle( $operator$EQUAL(#0, 100) )
push cursor.getValue(#0)
push 100
$stack = invokeDynamic bootstrap(0) $operator$EQUAL
if (!$stack) { goto end; }
push cursor.getValue(#1)
push 200
$stack = invokeDynamic bootstrap(0) $operator$EQUAL
end:
return $stack
@ScalarOperator(EQUAL)
@SqlType(BOOLEAN)
public static boolean equal(@SqlType(BIGINT) long left,
        @SqlType(BIGINT) long right) {
    return left == right;
}
=> MethodHandle("$operator$EQUAL(long, long): boolean")
Local Execution Planner: AND( $op$EQ(#0, 100), $op$EQ(#1, 200) )
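The idea of turning the filter into a chain of method handles, rather than walking an operator tree for each row, can be tried with the JDK's java.lang.invoke API alone. The sketch below is an illustration under assumptions, not the bytecode Presto emits: FilterHandleSketch, getValue, and matches are hypothetical names, and a long[] stands in for the cursor.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class FilterHandleSketch {
    // Stand-in for the BIGINT equality operator shown on the slide.
    static boolean equal(long left, long right) {
        return left == right;
    }

    // Stand-in for cursor.getValue(#channel); a row is just a long[] here.
    static long getValue(long[] row, int channel) {
        return row[channel];
    }

    // Builds a handle of type (long[]) -> boolean computing row[channel] == constant.
    static MethodHandle columnEquals(int channel, long constant) throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodHandle equal = lookup.findStatic(FilterHandleSketch.class, "equal",
                MethodType.methodType(boolean.class, long.class, long.class));
        MethodHandle getValue = lookup.findStatic(FilterHandleSketch.class, "getValue",
                MethodType.methodType(long.class, long[].class, int.class));
        MethodHandle column = MethodHandles.insertArguments(getValue, 1, channel);    // (long[]) -> long
        MethodHandle equalsConst = MethodHandles.insertArguments(equal, 1, constant); // (long) -> boolean
        return MethodHandles.filterArguments(equalsConst, 0, column);
    }

    // Short-circuit AND: evaluate right only when left is true, else return false.
    static MethodHandle and(MethodHandle left, MethodHandle right) {
        MethodHandle alwaysFalse = MethodHandles.dropArguments(
                MethodHandles.constant(boolean.class, false), 0, long[].class);
        return MethodHandles.guardWithTest(left, right, alwaysFalse);
    }

    // WHERE orderkey = 100 AND quantity = 200, with orderkey at #0 and quantity at #1.
    static final MethodHandle FILTER = build();

    private static MethodHandle build() {
        try {
            return and(columnEquals(0, 100L), columnEquals(1, 200L));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean matches(long[] row) {
        try {
            return (boolean) FILTER.invokeExact(row);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(matches(new long[] {100, 200}));
        System.out.println(matches(new long[] {100, 7}));
    }
}
```

Because the constants and channel numbers are bound once when the handle is built, the per-row call carries no tree traversal and no boxing of the compared values.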
37. Code Generation - PageHash (Cont.)
// Generic version: loops over the hash channels for every row
long hashRow(int position, Block[] blocks) {
    int result = 0;
    for (int i = 0; i < hashChannels.size(); i++) {
        int hashChannel = hashChannels.get(i);
        Type type = types.get(hashChannel);
        result = result * 31 +
                type.hash(blocks[i], position);
    }
    return result;
}
// Compiled version: the loop is unrolled for the two hash channels of this query
long hashRow(int position, Block[] blocks) {
    int result = 0;
    result = result * 31 +
            type_colX.hash(blocks[0], position);
    result = result * 31 +
            type_colY.hash(blocks[1], position);
    return result;
}
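A quick way to see that the unrolled form computes the same hash as the loop is to run both side by side. This is a minimal sketch with assumed stand-ins: HashRowSketch and hashLong are hypothetical names, columns are plain long[] arrays rather than Presto blocks, and Long.hashCode replaces the type-specific hash.

```java
final class HashRowSketch {
    // Hypothetical stand-in for type.hash(block, position) on a BIGINT column.
    static long hashLong(long[] block, int position) {
        return Long.hashCode(block[position]);
    }

    // Generic version: consults the channel list on every call.
    static long hashRowGeneric(int[] hashChannels, long[][] blocks, int position) {
        long result = 0;
        for (int channel : hashChannels) {
            result = result * 31 + hashLong(blocks[channel], position);
        }
        return result;
    }

    // Shape of the generated code when the hash channels are known to be {0, 1}.
    static long hashRowUnrolled(long[][] blocks, int position) {
        long result = 0;
        result = result * 31 + hashLong(blocks[0], position);
        result = result * 31 + hashLong(blocks[1], position);
        return result;
    }

    public static void main(String[] args) {
        long[][] blocks = {{1, 2}, {3, 4}};
        int[] channels = {0, 1};
        for (int position = 0; position < 2; position++) {
            System.out.println(hashRowGeneric(channels, blocks, position)
                    + " " + hashRowUnrolled(blocks, position));
        }
    }
}
```

The unrolled method has no loop counter, no list lookups, and a fixed call target per column, which gives the JIT straight-line code that is easy to inline.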
38. Code Generation - PageHash (Cont.)
// Generic version
boolean equalsRow(
        int leftBlockIndex, int leftPosition,
        int rightPosition, Block[] rightBlocks) {
    for (int i = 0; i < hashChannels.size(); i++) {
        int hashChannel = hashChannels.get(i);
        Type type = types.get(hashChannel);
        Block leftBlock =
                channels.get(hashChannel)
                        .get(leftBlockIndex);
        if (!type.equalTo(leftBlock, leftPosition,
                rightBlocks[i], rightPosition)) {
            return false;
        }
    }
    return true;
}
// Compiled version: one comparison per hash channel, no loop
boolean equalsRow(
        int leftBlockIndex, int leftPosition,
        int rightPosition, Block[] rightBlocks) {
    Block leftBlock =
            channels_colX.get(leftBlockIndex);
    if (!type_colX.equalTo(leftBlock, leftPosition,
            rightBlocks[0], rightPosition)) {
        return false;
    }
    leftBlock =
            channels_colY.get(leftBlockIndex);
    if (!type_colY.equalTo(leftBlock, leftPosition,
            rightBlocks[1], rightPosition)) {
        return false;
    }
    return true;
}
39. Method Variable Binding
1. regexp_like(string, pattern) → boolean
2. regexp_like(string, cast(pattern as RegexType)) // OperatorType.CAST
3. regexp_like(string, new Regex(pattern))
4. MethodHandle handle = MethodHandles.insertArguments(regexpLike, 1, new Regex(pattern)) // regexpLike: handle to the function below
5. handle.invoke(string)
@ScalarOperator(OperatorType.CAST)
@SqlType("RegExp")
public static Regex castToRegexp(@SqlType(VARCHAR) Slice pattern) {
    return new Regex(pattern.getBytes(), 0, pattern.length());
}
@ScalarFunction
@SqlType(BOOLEAN)
public static boolean regexpLike(@SqlType(VARCHAR) Slice source,
        @SqlType("RegExp") Regex pattern) {
    Matcher m = pattern.matcher(source.getBytes());
    int offset = m.search(0, source.length());
    return offset != -1;
}
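The binding steps above can be reproduced with the JDK alone. In this sketch java.util.regex.Pattern stands in for the Regex class on the slide, and RegexBindingSketch, bind, and invoke are hypothetical names chosen for the example; the point is only that the constant pattern is compiled once and baked into the handle, so each row pays for matching but not for compilation.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.regex.Pattern;

final class RegexBindingSketch {
    // Stand-in for the regexpLike scalar function: takes an already compiled pattern.
    static boolean regexpLike(String source, Pattern pattern) {
        return pattern.matcher(source).find();
    }

    // Steps 2 to 4: compile the constant pattern once and bind it as argument 1.
    static MethodHandle bind(String pattern) {
        try {
            MethodHandle regexpLike = MethodHandles.lookup().findStatic(
                    RegexBindingSketch.class, "regexpLike",
                    MethodType.methodType(boolean.class, String.class, Pattern.class));
            // Resulting handle has type (String) -> boolean.
            return MethodHandles.insertArguments(regexpLike, 1, Pattern.compile(pattern));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Step 5: invoke the bound handle with only the per-row argument.
    static boolean invoke(MethodHandle handle, String source) {
        try {
            return (boolean) handle.invokeExact(source);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        MethodHandle handle = bind("^pre");
        System.out.println(invoke(handle, "presto"));
        System.out.println(invoke(handle, "apache presto"));
    }
}
```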
42. Plugin - Raptor
•Stores data on flash on the Presto machines in ORC format
•Metadata is stored in MySQL (Extendable)
•Near real-time loads (5 - 10mins)
•3TB / day, 80B rows/day , 5 secs query
•CREATE VIEW myview AS SELECT …
•DELETE FROM tab WHERE conditions…
•UPDATE (Future)
•Coarse grained Index : min / max value of all columns
•Compaction
•Backup Store (Extendable)
No more ?!
43. Plugin - How to write
•https://prestodb.io/docs/current/develop/spi-overview.html
•ConnectorFactory
•ConnectorMetadata
•ConnectorSplitManager
•ConnectorHandleResolver
•ConnectorRecordSetProvider (PageSourceProvider)
•ConnectorRecordSinkProvider (PageSinkProvider)
•Add new Type
•Add new Function (A.K.A UDF)
44. Plugin - MongoDB
•https://github.com/facebook/presto/pull/3337
•5 Non-business days
•Predicate Pushdown
•Add a Type (ObjectId)
•Add UDFs (objectid(), objectid(string))
public class MongoPlugin implements Plugin {
    @Override
    public <T> List<T> getServices(Class<T> type) {
        if (type == ConnectorFactory.class) {
            return ImmutableList.of(
                    new MongoConnectorFactory(…));
        } else if (type == Type.class) {
            return ImmutableList.of(OBJECT_ID);
        } else if (type == FunctionFactory.class) {
            return ImmutableList.of(
                    new MongoFunctionFactory(typeManager));
        }
        return ImmutableList.of();
    }
}
45. Plugin - MongoDB
class MongoConnectorFactory implements ConnectorFactory {
    @Override
    public Connector create(String connectorId) {
        Bootstrap app = new Bootstrap(new
                MongoClientModule());
        return app.initialize()
                .getInstance(MongoConnector.class);
    }
}
class MongoClientModule implements Module {
    @Override
    public void configure(Binder binder) {
        binder.bind(MongoConnector.class)
                .in(SINGLETON);
        …
        configBinder(binder)
                .bindConfig(MongoClientConfig.class);
    }
}
class MongoConnector implements Connector {
    @Inject
    public MongoConnector(
            MongoSession mongoSession,
            MongoMetadata metadata,
            MongoSplitManager splitManager,
            MongoPageSourceProvider
                    pageSourceProvider,
            MongoPageSinkProvider
                    pageSinkProvider,
            MongoHandleResolver
                    handleResolver) {
        …
    }
}
46. Plugin - MongoDB UDF
public class MongoFunctionFactory
        implements FunctionFactory {
    @Override
    public List<ParametricFunction> listFunctions()
    {
        return new FunctionListBuilder(typeManager)
                .scalar(ObjectIdFunctions.class)
                .getFunctions();
    }
}
public class ObjectIdType
        extends AbstractVariableWidthType {
    public static final ObjectIdType OBJECT_ID = new ObjectIdType();
    @JsonCreator
    public ObjectIdType() {
        super(parseTypeSignature("ObjectId"),
                Slice.class);
    }
}
public class ObjectIdFunctions {
    // objectid(): generates a new ObjectId
    @ScalarFunction("objectid")
    @SqlType("ObjectId")
    public static Slice ObjectId() {
        return Slices.wrappedBuffer(
                new ObjectId().toByteArray());
    }
    // objectid(string): parses an ObjectId from its string form
    @ScalarFunction("objectid")
    @SqlType("ObjectId")
    public static Slice ObjectId(@SqlType(VARCHAR) Slice value) {
        return Slices.wrappedBuffer(
                new ObjectId(value.toStringUtf8()).toByteArray());
    }
    @ScalarOperator(EQUAL)
    @SqlType(BOOLEAN)
    public static boolean equal(@SqlType("ObjectId") Slice left,
            @SqlType("ObjectId") Slice right) {
        return left.equals(right);
    }
}