Treasure Data is a data analytics service company that makes heavy use of Ruby in its platform and services. It uses Ruby for components like Fluentd (log collection), Embulk (data loading), scheduling, and its Rails-based API and console. Java and JRuby are also used for components involving Hadoop and Presto processing. The company's architecture includes collectors that ingest data, a PlazmaDB for storage, workers that process jobs on Hadoop and Presto clusters, and schedulers that queue and schedule those jobs using technologies like PerfectSched and PerfectQueue which are written in Ruby. Hive jobs are built programmatically using Ruby to generate configurations and submit the jobs to underlying Hadoop clusters.
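As a rough illustration of that last point, a Ruby layer might assemble a Hive query plus its job configuration before handing it off to the cluster. This is a hypothetical sketch only; HiveJobBuilder and its option names are illustrative, not Treasure Data's actual internal API.

```ruby
# Hypothetical sketch: assembling a Hive job in Ruby before submitting it to a
# Hadoop cluster. HiveJobBuilder and the option names are illustrative only.
require 'json'

class HiveJobBuilder
  def initialize(query, conf = {})
    @query = query
    @conf  = conf
  end

  # Produce a job description that a submitter process could pass to the cluster.
  def to_job
    {
      'type'  => 'hive',
      'query' => @query,
      'conf'  => {
        'mapreduce.job.queuename'   => 'default',
        'hive.exec.compress.output' => 'true'
      }.merge(@conf)
    }
  end
end

job = HiveJobBuilder.new(
  "SELECT COUNT(1) FROM events WHERE status = 'ok'",
  'mapreduce.job.queuename' => 'batch'
)
puts JSON.pretty_generate(job.to_job)
```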
This document discusses automating analytics pipelines and workflows using a workflow engine. It describes the challenges of managing workflows across multiple cloud services and database technologies. It then introduces a multi-cloud workflow engine called Digdag that can automate workflows, handle errors, enable parallel execution, support modularization and parameterization. Examples are given of using Digdag to define and run workflows across services like BigQuery, Treasure Data, Redshift, and Tableau. Key features of Digdag like loops, parameters, parallel tasks and pushing workflows to servers with Docker are also summarized.
This document discusses middleware in Ruby and provides examples of considerations when writing middleware:
- Middleware should be a long-running daemon process that is compatible across platforms and environments and handles various data formats and traffic volumes.
- Tests must be run on all supported platforms to ensure compatibility as thread and process scheduling differs between operating systems.
- Memory usage and object leaks must be carefully managed in long-running processes to avoid consuming resources over time.
- Performance of JSON parsing/generation should be benchmarked and the most optimized library used to avoid unnecessary CPU usage.
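A minimal benchmark in the spirit of that last point might compare the standard library's JSON with an optimized gem such as Oj (assuming the oj gem is installed); the point is measuring before choosing, not these particular libraries.

```ruby
require 'benchmark'
require 'json'
require 'oj'   # assumes the oj gem is installed: gem install oj

data = { 'user' => 'alice', 'tags' => %w[a b c], 'nested' => { 'n' => 42 } }
n = 100_000

Benchmark.bm(14) do |x|
  x.report('JSON.generate') { n.times { JSON.generate(data) } }
  x.report('Oj.dump')       { n.times { Oj.dump(data) } }
  json = JSON.generate(data)
  x.report('JSON.parse')    { n.times { JSON.parse(json) } }
  x.report('Oj.load')       { n.times { Oj.load(json) } }
end
```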
1) The document proposes making a key-value storage system (CDP KVS) 10 times more scalable to support real-time data delivery.
2) Three ideas are presented: using an alternative distributed KVS, implementing a storage hierarchy on the existing KVS, and shipping edit logs to indexed archives.
3) The storage hierarchy approach of partitioning, compressing, and writing data to DynamoDB in batches is selected as it improves write performance and reduces storage costs while remaining stateless.
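A sketch of the selected approach, assuming the aws-sdk-dynamodb gem and hypothetical table and attribute names: records are grouped by partition key, each group is compressed, and writes go out in DynamoDB's 25-item batches.

```ruby
require 'aws-sdk-dynamodb'
require 'json'
require 'zlib'
require 'stringio'

TABLE = 'cdp_kvs_archive' # hypothetical table name

def batch_write_compressed(records)
  client = Aws::DynamoDB::Client.new(region: 'us-east-1')

  items = records.group_by { |r| r[:partition_key] }.map do |key, rows|
    payload = Zlib::Deflate.deflate(JSON.generate(rows))   # compress each partition
    { put_request: { item: { 'partition_key' => key, 'payload' => StringIO.new(payload) } } }
  end

  # DynamoDB accepts at most 25 put/delete requests per BatchWriteItem call.
  items.each_slice(25) do |batch|
    resp = client.batch_write_item(request_items: { TABLE => batch })
    # A real implementation would retry resp.unprocessed_items here.
  end
end
```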
This document summarizes Tagomori Satoshi's presentation on handling "not so big data" at the YAPC::Asia 2014 conference. It discusses different types of data processing frameworks for various data sizes, from sub-gigabytes up to petabytes. It provides overviews of MapReduce, Spark, Tez, and stream processing frameworks. It also discusses what Hadoop is and how the Hadoop ecosystem has evolved to include these additional frameworks.
Digdag can automate large-scale data processing and handle errors. It provides constructs like operators, parameters, and task groups to organize workflows. Operators package tasks to run queries or process data. Parameters allow passing variables between tasks. Task groups modularize and organize workflows. Digdag supports error handling, monitoring, parallelization, versioning, and reproducing workflows across environments.
Fluentd is an open source data collector that allows flexible data collection, processing, and output. It supports streaming data from sources like logs and metrics to destinations like databases, search engines, and object stores. Fluentd's plugin-based architecture allows it to support a wide variety of use cases. Recent versions of Fluentd have added features like improved plugin APIs, nanosecond time resolution, and Windows support to make it more suitable for containerized environments and low-latency applications.
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in your job.
The talk will wrap up with Holden trying to get everyone to buy several copies of her new book, High Performance Spark.
This document provides tips for troubleshooting common issues with Sqoop. It discusses how to effectively provide debugging information when seeking help, addresses specific problems with MySQL connections and importing to Hive, Oracle case-sensitive errors and export failures, and recommends best practices like using separate tables for import and export and specifying options correctly.
Tomas Doran presented on their implementation of Logstash at TIM Group to process over 55 million messages per day. Their applications are all Java/Scala/Clojure and they developed their own library to send structured log events as JSON to Logstash using ZeroMQ for reliability. They index data in Elasticsearch and use it for metrics, alerts and dashboards but face challenges with data growth.
User defined partitioning is a new partitioning strategy in Treasure Data that allows users to specify which column to use for partitioning, in addition to the default "time" column. This provides more flexible partitioning that better fits customer data platform workloads. The user can define partitioning rules through Presto or Hive to improve query performance by enabling colocated joins and filtering data by the partitioning column.
Debugging PySpark: Spark Summit East talk by Holden Karau
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as a look to the future at the data property type accumulators that may be coming to Spark in future versions.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
An updated talk about how to use Solr for logs and other time-series data, like metrics and social media. In 2016, Solr, its ecosystem, and the operating systems it runs on have evolved quite a lot, so we can now show new techniques to scale and new knobs to tune.
We'll start by looking at how to scale SolrCloud through a hybrid approach using a combination of time- and size-based indices, and also how to divide the cluster in tiers in order to handle the potentially spiky load in real-time. Then, we'll look at tuning individual nodes. We'll cover everything from commits, buffers, merge policies and doc values to OS settings like disk scheduler, SSD caching, and huge pages.
Finally, we'll take a look at the pipeline of getting the logs to Solr and how to make it fast and reliable: where should buffers live, which protocols to use, where should the heavy processing be done (like parsing unstructured data), and which tools from the ecosystem can help.
Cloudera Morphlines is a new open source framework, recently added to the CDK, that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards.
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
The document discusses Azure Cache and Redis. It provides an overview of Redis, including its data types, transactions, pub/sub capabilities, scripting, and sharding/partitioning. It then discusses common patterns for using Redis, such as caching, counting likes on Facebook, getting the latest reviews, rate limiting, and autocompletion. The document emphasizes that Redis is very flexible and can be used for more than just caching, acting as a general datastore. It concludes by recommending a Redis reference book for further learning.
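As one example of those "more than a cache" patterns, here is a fixed-window rate limiter sketched with the Ruby redis gem; the key naming and limits are arbitrary assumptions.

```ruby
require 'redis'

redis = Redis.new  # assumes a Redis server on localhost:6379

# Allow at most `limit` calls per `window` seconds for a given user.
def allowed?(redis, user_id, limit: 100, window: 60)
  key   = "rate:#{user_id}:#{Time.now.to_i / window}"  # one counter key per time window
  count = redis.incr(key)
  redis.expire(key, window) if count == 1              # let old windows expire on their own
  count <= limit
end

puts allowed?(redis, 42)  # => true until the 101st call within the minute
```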
Using Solr to Search and Analyze Logs discusses how to use Solr and related tools like Logstash and Elasticsearch to index, search, and analyze log data. It covers sending logs to Solr using various methods like Logstash, Flume, and rsyslog plugins. It also discusses best practices for handling logs like production data in Solr, including using docValues, omitting norms/term positions, configuring caches, commits, and merges. The document concludes by discussing scaling options like SolrCloud and collections/aliases for partitioning logs over time.
Building a near real time search engine & analytics for logs using Solr
Presented by Rahul Jain, System Analyst (Software Engineer), IVY Comptech Pvt Ltd
Consolidating and indexing logs to search them in real time poses an array of challenges when you have hundreds of servers producing terabytes of logs every day. Since log events are mostly small, around 200 bytes to a few KB, they are harder to handle: the smaller each log event, the more documents there are to index. In this session, we will discuss the challenges we faced and the solutions developed to overcome them. The list of items that will be covered in the talk is as follows.
Methods to collect logs in real time.
How Lucene was tuned to achieve an indexing rate of 1 GB in 46 seconds
Tips and techniques incorporated/used to manage distributed index generation and search on multiple shards
How choosing a layer-based partition strategy helped us to bring down the search response times.
Log analysis and generation of analytics using Solr.
Design and architecture used to build the search platform.
Cloud deployments of Apache Hadoop are becoming more commonplace. Yet Hadoop and its applications don't integrate that well with cloud storage, something which starts right down at the file IO operations. This talk looks at how to make use of cloud object stores in Hadoop applications, including Hive and Spark. It will go from the foundational "what's an object store?" to the practical "what should I avoid" and the timely "what's new in Hadoop?", the latter covering the improved S3 support in Hadoop 2.8+. I'll explore the details of benchmarking and improving object store IO in Hive and Spark, showing what developers can do in order to gain performance improvements in their own code, and equally, what they must avoid. Finally, I'll look at ongoing work, especially "S3Guard" and what its fast and consistent file metadata operations promise.
How to create/improve OSS product and its community (revised)
1) The document discusses how to create and improve open source software (OSS) projects and their communities. It addresses questions around the purpose of the OSS, languages used, versioning, and community engagement.
2) Key recommendations for building community include using English, being open to contributions, demonstrating stability and maintenance, and having a pluggable architecture.
3) The document debates tradeoffs like clean code vs quick contributions, focused vs feature-rich software, and localized vs global development, and highlights the need to choose approaches given limitations. Overall it stresses continuous improvement over time.
This document summarizes Satoshi Tagomori's presentation on Treasure Data, a data analytics service company. It discusses Treasure Data's use of Ruby for various components of its platform including its logging (Fluentd), ETL (Embulk), scheduling (PerfectSched), and storage (PlazmaDB) technologies. The document also provides an overview of Treasure Data's architecture including how it collects, stores, processes, and visualizes customer data using open source tools integrated with services like Hadoop and Presto.
This document summarizes Yurie Yamane's presentation on using mruby to make robots. It discusses using mruby and TOPPERS on the LEGO EV3 to create an inverted pendulum self-balancing robot. It then covers creating a DIY self-balancing robot using a Raspberry Pi, gyro sensor, DC motors, and a motor driver. Code examples are provided for reading sensor values, controlling motors with PWM, and implementing a balancing algorithm using a balancer class. Source code links are included for further reference.
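The balancing idea comes down to a small control loop. The sketch below is a hypothetical PD-style controller; the class, method, and gain names are illustrative, not the code from the talk, and a real robot would read a gyro for the angle error and feed the returned value into the motors' PWM duty cycle.

```ruby
# Hypothetical PD controller for a self-balancing robot.
class Balancer
  def initialize(kp, kd)
    @kp = kp
    @kd = kd
    @prev_error = 0.0
  end

  # angle_error: deviation from upright (radians); dt: loop interval (seconds)
  def pwm_duty(angle_error, dt)
    derivative  = (angle_error - @prev_error) / dt
    @prev_error = angle_error
    duty = @kp * angle_error + @kd * derivative
    [[duty, 100.0].min, -100.0].max   # clamp to the PWM duty range
  end
end

balancer = Balancer.new(40.0, 0.8)
puts balancer.pwm_duty(0.05, 0.01)   # small forward lean -> small corrective duty
```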
Grifork is a fast propagative task runner: it runs defined tasks on the system in a way that resembles a tree's branching.
Give grifork a list of hosts and it builds a tree graph internally, then runs the tasks in a top-down way.
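A simplified Ruby sketch of that idea: split the host list into a tree and run the task level by level from the root downwards. Grifork itself propagates work in parallel along the branches; this serial version only shows the shape.

```ruby
# Build a tree (branching factor 2) from a flat host list, then run a task
# top-down. Serial and simplified; grifork runs branches concurrently.
def build_tree(hosts, branch = 2)
  root, *rest = hosts
  children = if rest.empty?
               []
             else
               slice = (rest.size / branch.to_f).ceil
               rest.each_slice(slice).map { |group| build_tree(group, branch) }
             end
  { host: root, children: children }
end

def run_top_down(node, task)
  task.call(node[:host])                                     # parent first
  node[:children].each { |child| run_top_down(child, task) } # then its branches
end

tree = build_tree(%w[web01 web02 web03 web04 web05 web06 web07])
run_top_down(tree, ->(host) { puts "run task on #{host}" })
```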
- "poloxy" is a proxy system that pools and proxies alerts to prevent monitoring systems from bursting alerts to recipients. It works by having monitoring systems send alerts to "poloxy" instead of recipients. "poloxy" then enqueues alerts into a queue and has a worker dequeue alerts every minute and deliver them to original recipients, merging duplicate alerts and preventing repeated alerts from being delivered too frequently. This allows customizing how alerts are merged and controlling the frequency of alerts received by each recipient.
Distributed Logging Architecture in Container Era
The document discusses distributed logging architecture in the container era. It covers: 1) the difficulties of logging with microservices and containers due to their ephemeral and distributed nature; 2) the need to redesign logging to push logs from containers to destinations quickly, without fixed addresses or mappings; 3) common patterns for distributed logging architectures, including source aggregation, destination aggregation, and scaling; and 4) a case study using Docker and Fluentd to implement source aggregation and scaling for logging. Open source solutions are important to keep the logging layer transparent, interoperable, and able to scale independently of applications and infrastructure.
The worst Ruby codes I’ve seen in my life - RubyKaigi 2015
Video presentation: https://www.youtube.com/watch?v=jLAFXQ1Av50
Most applications written in Ruby are great, but evil code applying WOP techniques also exists. There are many workarounds in several programming languages, but in Ruby, when it happens, the proportion is bigger. It's very easy to write Ruby code with collateral damage.
You will see a collection of bad Ruby code, with a description of how it affected its applications negatively and the solutions to fix and avoid it. Long classes, coupling, misapplication of OO, illegible code, tangled flows, naming issues, and other things you can imagine are examples of what you'll get.
This document summarizes a presentation about Meteor and React. It introduces Meteor as a full-stack platform for building web and mobile apps with JavaScript. It demonstrates storing messages in a Meteor collection and rendering them with React components. It also covers latency compensation, user accounts, and the future of Meteor integrating React, Redux, GraphQL and Socket.io on the backend.
The document discusses testing PHP applications using SimpleTest, Selenium IDE, and CakePHP. It provides an overview of these testing tools and frameworks and recommends them for testing PHP applications.
The document summarizes the new plugin API in Fluentd v0.14. Key points include:
- The v0.12 plugin API was fragmented and difficult to write tests for. The v0.14 API provides a unified architecture.
- The main plugin classes are Input, Filter, Output, and Buffer; plugins must subclass Fluent::Plugin::Base.
- The Output plugin supports both buffered and non-buffered processing. Buffering can be configured by tags, time, or custom fields.
- "Owned" plugins like Buffer are instantiated by primary plugins and can access owner resources. Storage is a new owned plugin for persistent storage.
- New test drivers emulate the plugin runtime and lifecycle, making it easier to write tests against the v0.14 API.
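A minimal buffered output plugin against the v0.14 API looks roughly like this; the plugin name and what it does with each record are placeholders.

```ruby
require 'fluent/plugin/output'

module Fluent
  module Plugin
    class SampleOutput < Output
      Fluent::Plugin.register_output('sample', self)

      config_param :prefix, :string, default: 'out'

      # Called with a buffer chunk when events are emitted through the buffer.
      def write(chunk)
        chunk.each do |time, record|
          # Placeholder: a real plugin would send the record to its destination.
          log.info "#{@prefix}: #{time} #{record.inspect}"
        end
      end
    end
  end
end
```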
Fighting API Compatibility On Fluentd Using "Black Magic"
The document discusses Fluentd's changes to its plugin API between versions 0.12 and 0.14. In 0.14, the API was overhauled to separate entry points from implementations and introduce a plugin base class to control data and control flow. A compatibility layer was added to allow most 0.12 plugins to work unmodified in 0.14 by handling calls to overridden methods. However, plugins that override certain methods like #emit may cause errors due to changes in how buffering works.
The document discusses testing practices for the Ruby programming language. It provides details on how to run various test suites that are part of the Ruby source code repository, including:
1. Running the "make test" command which runs sample tests, known bug tests, and tests defined in the test/ directory.
2. Running "make test-all" which runs core library and standard library tests under the test/ directory.
3. Running "make check" which builds encodings and extensions, runs all test tasks including test frameworks like Test::Unit and Minitest.
4. It also discusses strategies for merging test changes from external repositories like RubyGems and RDoc back into the Ruby source code.
Experiments in Sharing Java VM Technology with CRuby
IBM is developing a just-in-time (JIT) compiler called Testarossa based on its Open Runtime (OMR) toolkit. The goal is to integrate the JIT compiler into MRI Ruby to improve performance without changing how MRI works. So far, the JIT supports most opcodes and can run Rails applications, but performance gains are modest. IBM hopes to collaborate with the Ruby community to further optimize Ruby and help make it faster.
Open Source Software, Distributed Systems, Database as a Cloud Service (Satoshi Tagomori)
- Treasure Data is a database as a cloud service company that collects and stores customer data beyond the cloud [1].
- It uses open source software like Fluentd and MessagePack to easily integrate and collect data from customers [2]. It also uses open source distributed systems software like Hadoop and Presto to store, process and query large amounts of customer data [3].
- As a database service, it needs to share computer resources securely for many customers. It contributes to open source to build and maintain the distributed systems software that powers its cloud database service [4].
Presto is used to process 15 trillion rows per day for Treasure Data customers. Treasure Data developed tools to manage Presto performance and optimize queries. They collect Presto query logs to analyze performance bottlenecks and classify queries to set implicit service level objectives. Tools like Prestobase Proxy and Presto Stella storage optimizer were created to improve low-latency access and optimize storage partitioning. Workflows using Digdag and a new tabular data format called MessageFrame are being explored to split huge queries and support incremental processing.
ETL with SPARK - First Spark London meetup (Rafal Kwasny)
The document discusses how Spark can be used to supercharge ETL workflows by running them faster and with less code compared to traditional Hadoop approaches. It provides examples of using Spark for tasks like sessionization of user clickstream data. Best practices are covered like optimizing for JVM issues, avoiding full GC pauses, and tips for deployment on EC2. Future improvements to Spark like SQL support and Java 8 are also mentioned.
This document discusses building a social analytics tool using MongoDB from a developer's perspective. It covers using MongoDB for its schema-less data and ability to handle fast read-write operations. Key topics include using aggregation queries to gain insights from data by chaining queries together and filtering/manipulating results at each stage. JavaScript capabilities in MongoDB allow applying business logic directly to data. Examples demonstrate removing garbage data and stopwords. Indexes, current progress, and tips/tricks learned around cloning collections and removing vs dropping are also covered, with a demo planned.
The document discusses various techniques used to optimize Hive query execution and deployment in Treasure Data, including:
1) Running Hive queries through a custom QueryRunner that handles query planning, execution, and statistics reporting.
2) Using an in-memory metastore and schema-on-read from Treasure Data's columnar storage to manage schemas and tables.
3) Configuring jobs through HiveConf properties to control mappings, partitions, and storage handlers for efficient execution on Hadoop clusters.
This presentation will be useful to those who would like to get acquainted with Apache Spark architecture and top features, and see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases from one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.
It was presented at the Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
Apache Solr on Hadoop is enabling organizations to collect, process and search larger, more varied data. Apache Spark is making a large impact across the industry, changing the way we think about batch processing and replacing MapReduce in many cases. But how can production users easily migrate ingestion of HDFS data into Solr from MapReduce to Spark? How can they update and delete existing documents in Solr at scale? And how can they easily build flexible data ingestion pipelines? Cloudera Search Software Engineer Wolfgang Hoschek will present an architecture and solution to this problem. How was Apache Solr, Spark, Crunch, and Morphlines integrated to allow for scalable and flexible ingestion of HDFS data into Solr? What are the solved problems and what's still to come? Join us for an exciting discussion on this new technology.
Running Airflow Workflows as ETL Processes on Hadoop (clairvoyantllc)
While working with Hadoop, you'll eventually encounter the need to schedule and run workflows to perform various operations like ingesting data or performing ETL. There are a number of tools available to assist you with this type of requirement and one such tool that we at Clairvoyant have been looking to use is Apache Airflow. Apache Airflow is an Apache Incubator project that allows you to programmatically create workflows through a python script. This provides a flexible and effective way to design your workflows with little code and setup. In this talk, we will discuss Apache Airflow and how we at Clairvoyant have utilized it for ETL pipelines on Hadoop.
SF Big Analytics meetup: Hoodie From Uber (Chester Chen)
Even after a decade, the name "Hadoop" remains synonymous with "big data", even as new options for processing/querying (stream processing, in-memory analytics, interactive SQL) and storage services (S3/Google Cloud/Azure) have emerged & unlocked new possibilities. However, the overall data architecture has become more complex with more moving parts and specialized systems, leading to duplication of data and strain on usability. In this talk, we argue that by adding some missing blocks to the existing Hadoop stack, we are able to provide similar capabilities right on top of Hadoop, at reduced cost and increased efficiency, greatly simplifying the overall architecture in the process. We will discuss the need for incremental processing primitives on Hadoop, motivating them with some real world problems from Uber. We will then introduce "Hoodie", an open source Spark library built at Uber, to enable faster data for petabyte scale data analytics and solve these problems. We will deep dive into the design & implementation of the system and discuss the core concepts around timeline consistency, and the tradeoffs between ingest speed & query performance. We contrast Hoodie with similar systems in the space, discuss how it is deployed across the Hadoop ecosystem at Uber, and finally share the technical direction ahead for the project.
Speaker: VINOTH CHANDAR, Staff Software Engineer at Uber
Vinoth is the founding engineer/architect of the data team at Uber, as well as author of many data processing & querying systems at Uber, including "Hoodie". He has keen interest in unified architectures for data analytics and processing.
Previously, Vinoth was the lead on Linkedin’s Voldemort key value store and has also worked on Oracle Database replication engine, HPC, and stream processing.
Connector/J is a popular Java database connector for connecting to MySQL databases. It allows building Java applications that connect to MySQL and provides features for high availability access. The presentation discusses using Connector/J to connect to MySQL for basic queries, in Tomcat applications, and for high availability configurations like replication, multi-master replication, MySQL Fabric, and MySQL Cluster. It also covers monitoring connections using tools like MySQL Enterprise Monitor.
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ... (Chester Chen)
Building highly efficient data lakes using Apache Hudi (Incubating)
Even with the exponential growth in data volumes, ingesting/storing/managing big data remains unstandardized & inefficient. Data lakes are a common architectural pattern to organize big data and democratize access to the organization. In this talk, we will discuss different aspects of building honest data lake architectures, pinpointing technical challenges and areas of inefficiency. We will then re-architect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how upserts & incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages growth and sizes the files of the resulting data lake using purely open-source file formats, also providing for optimized query performance & file system listing. We will also provide hands-on tools and guides for trying this out on your own data lake.
Speaker: Vinoth Chandar (Uber)
Vinoth is Technical Lead at Uber Data Infrastructure Team
This is a slide deck that was used for our 11/19/15 Nike Tech Talk to give a detailed overview of the SnappyData technology vision. The slides were presented by Jags Ramnarayan, Co-Founder & CTO of SnappyData
Description of some of the elements that go into creating a PostgreSQL-as-a-Service for organizations with many teams and a diverse ecosystem of applications.
This document discusses connecting the issue tracking software Jira to Oracle Application Express (APEX) by utilizing Jira's REST web services and JSON formatting. It covers motivating the need to integrate the tools, an overview of Jira features, using REST and JSON to retrieve and parse Jira issue data, and demonstrations of consuming the web services in APEX including using collections to cache responses.
At Capital One, I built a small framework on top of Apache Cascading. We have found that the framework can significantly reduce development effort and enhance the maintainability of Cascading applications.
Rhebok, High Performance Rack Handler / Rubykaigi 2015 (Masahiro Nagano)
This document discusses Rhebok, a high performance Rack handler written in Ruby. Rhebok uses a prefork architecture for concurrency and achieves 1.5-2x better performance than Unicorn. It implements efficient network I/O using techniques like IO timeouts, TCP_NODELAY, and writev(). Rhebok also uses the ultra-fast PicoHTTPParser for HTTP request parsing. The document provides an overview of Rhebok, benchmarks showing its performance, and details on its internals and architecture.
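The prefork and TCP_NODELAY points can be illustrated with a tiny, hypothetical Ruby server (not Rhebok's actual code): fork a fixed set of workers that all accept on the same listening socket and disable Nagle's algorithm on each connection.

```ruby
require 'socket'

server  = TCPServer.new('127.0.0.1', 8080)
WORKERS = 4

WORKERS.times do
  fork do
    loop do
      sock = server.accept
      # Disable Nagle's algorithm so small responses are sent immediately.
      sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
      sock.readpartial(16_384)   # read (part of) the request
      sock.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok")
      sock.close
    end
  end
end

Process.waitall   # parent just waits on the preforked workers
```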
The document discusses asynchronous programming with Spring 4.X and relational database management systems in microservices architectures. It covers asynchronous vs synchronous programming, the C10K problem of handling 10,000 clients simultaneously and its solutions like load balancing, NoSQL databases, and event-driven programming. It provides examples of using Spring's @Async annotation, DeferredResult, and CompletableFuture for asynchronous programming. It also discusses challenges with databases being blocking I/O and solutions like avoiding blocking on database connections, using asynchronous data access with Spring, and transaction management across asynchronous calls.
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... (Databricks)
This talk is about sharing experience and lessons learned on setting up and running the Apache Spark service inside the database group at CERN. It covers the many aspects of this change with examples taken from use cases and projects at the CERN Hadoop, Spark, streaming and database services. The talk is aimed at developers, DBAs, service managers and members of the Spark community who are using and/or investigating "Big Data" solutions deployed alongside relational database processing systems. The talk highlights key aspects of Apache Spark that have fuelled its rapid adoption for CERN use cases and for the data processing community at large, including the fact that it provides easy to use APIs that unify, under one large umbrella, many different types of data processing workloads from ETL, to SQL reporting to ML.
Spark can also easily integrate a large variety of data sources, from file-based formats to relational databases and more. Notably, Spark can easily scale up data pipelines and workloads from laptops to large clusters of commodity hardware or on the cloud. The talk also addresses some key points about the adoption process and learning curve around Apache Spark and the related “Big Data” tools for a community of developers and DBAs at CERN with a background in relational database operations.
This document discusses techniques for building scalable websites with Perl, including:
1) Caching at various levels (page, partial page, and database caching) to improve performance and reduce load on application servers.
2) Using job queuing and worker processes to distribute processing-intensive tasks asynchronously instead of blocking web requests.
3) Leveraging caching and queueing libraries like Cache::FastMmap, Memcached, and Spread::Queue to implement caching and job queueing in Perl applications.
Ractor is a new experimental feature in Ruby 3.0 that allows Ruby code to run in parallel on CPUs. It manages objects per Ractor and can move objects between Ractors, making moved objects invisible to the original Ractor. It can share certain "shareable" objects like modules, classes, and frozen objects between Ractors. For web applications to fully utilize Ractors, an experimental application server called Right Speed was created that uses Rack and runs processing workers on Ractors. However, there are still problems to address like exceptions when closing connections and accessing non-shareable constants and instance variables across Ractors before Ractors can be ready for production use in web applications.
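A minimal example of the Ractor model described above (Ruby 3.0+, still experimental): each Ractor has its own objects, messages are copied or moved between them, and frozen objects are shareable.

```ruby
# Ruby 3.0+ (Ractor is experimental and prints a warning when first used).
r = Ractor.new do
  msg = Ractor.receive   # wait for a message from another Ractor
  msg.upcase
end

r.send("hello".freeze)   # frozen strings are shareable; unfrozen objects are copied
puts r.take              # => "HELLO"
```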
Good Things and Hard Things of SaaS Development/Operations (Satoshi Tagomori)
This document discusses the good and hard things about developing and operating a SaaS platform. It describes how the backend team at Treasure Data owns and manages various components of their distributed platform. It also discusses how they have modernized their deployment process from a periodic Chef-based approach to using CodeDeploy for more frequent deployments. This allows them to move faster by doing many small releases that minimize the number of affected components and customers.
The document is an invitation to a presentation about Maccro, a Ruby macro system that allows rewriting Ruby methods by registering rules. It summarizes Maccro's capabilities such as registering before/after matchers, applying rules to methods, matching method ASTs to rules, and rewriting method code by replacing placeholders. It also lists some limitations including only supporting Ruby 2.6+, not handling method call order dependencies, and not updating method visibility. The presentation aims to recruit more developers to Maccro to handle its remaining challenges.
The document discusses constants in Ruby programming. It notes that constant names start with capital letters and can be overwritten with warnings. It provides some constant name samples and encourages trying out different Ruby versions. It also discusses making Ruby scripts confusing by using Unicode characters and constant names that are difficult to understand for future maintainers.
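For reference, the overwrite-with-warning behavior mentioned above looks like this.

```ruby
ANSWER = 42
ANSWER = 54   # warning: already initialized constant ANSWER
puts ANSWER   # => 54 (the reassignment still takes effect)
```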
This document summarizes a presentation given by @joker1007 and @tagomoris on hijacking Ruby syntax. It discusses various Ruby features like bindings, tracepoints, and refinements that can be used to modify Ruby's behavior. It then demonstrates several "hacks" that leverage these features, including finalist, overrider, and abstriker which add method modifiers, and binding_ninja which allows implicitly passing a binding. Other hacks discussed are with_resources and deferral which add "with" and "defer" statements to Ruby. The presentation emphasizes how these modifications are implemented using hooks, bindings, tracepoints and other Ruby internals.
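As a small taste of the hooks these hacks build on, TracePoint lets plain Ruby code observe method calls at runtime; the gems mentioned above combine such hooks in far more elaborate ways.

```ruby
# Print every Ruby-level method call while the block runs.
tp = TracePoint.new(:call) do |t|
  puts "calling #{t.defined_class}##{t.method_id}"
end

def greet(name)
  "hi, #{name}"
end

tp.enable { greet('ruby') }   # => calling Object#greet
```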
Lock, Concurrency and Throughput of Exclusive Operations (Satoshi Tagomori)
1. The document discusses different patterns for implementing locks in a distributed key-value store to maximize concurrency and throughput of operations.
2. It describes a naive giant lock approach that locks the entire storage for any operation, resulting in poor concurrency and throughput.
3. Better approaches use metadata locks plus simple resource locks, and reference counting locks, to allow concurrent updates to different resources while minimizing critical sections.
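The contrast between the giant lock and finer-grained locks can be sketched with plain Mutexes; the storage is just a Hash here, so this shows the pattern rather than the actual KVS code.

```ruby
require 'thread'

STORE = {}

# 1) Giant lock: every update serializes on one mutex, even for unrelated keys.
GIANT_LOCK = Mutex.new
def update_giant(key, value)
  GIANT_LOCK.synchronize { STORE[key] = value }
end

# 2) Per-resource locks: only updates to the same key contend; the shared map of
#    locks is protected by a short metadata lock.
RESOURCE_LOCKS = {}
META_LOCK = Mutex.new

def update_fine_grained(key, value)
  lock = META_LOCK.synchronize { RESOURCE_LOCKS[key] ||= Mutex.new }
  lock.synchronize { STORE[key] = value }   # long work here no longer blocks other keys
end
```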
This document discusses Ruby's role in data processing. It outlines the typical steps of data processing - collect, summarize, analyze, visualize. It then provides examples of open-source Ruby tools that can be used for each step, such as Fluentd for collection, and libraries for numerical analysis, bioinformatics, and machine learning. Services that use Ruby for collection and processing are also mentioned, like Log Analytics and Stackdriver Logging. The document encourages continuing to improve Ruby tools to make data processing better and celebrates Ruby's 25th anniversary.
Bigdam is a planet-scale data ingestion pipeline designed for large-scale data ingestion. It addresses issues with the traditional pipeline such as PerfectQueue throughput limitations, latency in queries from event collectors, difficulty maintaining event collector code, and many small temporary and imported files. The redesigned pipeline includes Bigdam-Gateway for HTTP endpoints, Bigdam-Pool for distributed buffer storage, Bigdam-Scheduler to schedule import tasks, Bigdam-Queue as a high throughput queue, and Bigdam-Import for data conversion and import. Consistency is ensured through an at-least-once design, and deduplication is performed at the end of the pipeline for simplicity and reliability. Components are designed to scale out horizontally.
Technologies, Data Analytics Service and Enterprise Business (Satoshi Tagomori)
This document discusses technologies for data analytics services for enterprise businesses. It begins by defining enterprise businesses as those "not about IT" and data analytics services as providing insights into business metrics like customer reach, ad views, purchases, and more using data. It then outlines some key technologies needed for such services, including data management systems, distributed processing systems, queues and schedulers, tools for connecting systems, and methods for controlling jobs and workflows with retries to handle failures. Specific challenges around deadlines, idempotent operations, and replay-able workflows are also addressed.
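One concrete shape of the retry problem mentioned above: retries are only safe when the job is idempotent, which is often handled by keying each attempt on a deterministic identifier. A hedged sketch with illustrative names:

```ruby
# Retry a job with backoff; the job uses a deterministic key so that a retry
# after a partial failure overwrites the same output instead of duplicating it.
def run_with_retries(job_id, date, max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    output_key = "results/#{job_id}/#{date}"   # same key on every attempt => idempotent
    yield(output_key)
  rescue StandardError
    raise if attempts >= max_attempts
    sleep(2**attempts)                         # exponential backoff before retrying
    retry
  end
end

run_with_retries('daily_report', '2016-07-01') do |key|
  puts "writing #{key}"                        # placeholder for the real job body
end
```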
This document discusses using Ruby for distributed storage systems. It describes components like Bigdam, which is Treasure Data's new data ingestion pipeline. Bigdam uses microservices and a distributed key-value store called Bigdam-pool to buffer data. The document discusses designing and testing Bigdam using mocking, interfaces, and integration tests in Ruby. It also explores porting Bigdam-pool from Java to Ruby and investigating Ruby's suitability for tasks like asynchronous I/O, threading, and serialization/deserialization.
This document discusses features that could make Norikra, an open source stream processing software, even more "perfect". It describes how Norikra currently works and highlights areas for improvement, such as enabling queries to resume processing from historical batch query results, sharing operators between queries to reduce memory usage, and developing a true lambda architecture with a single query language for both streaming and batch processing. The document envisions a "perfect stream processing engine" with these enhanced capabilities.
Fluentd is an open source data collector that provides a unified logging layer between data sources and backend systems. It decouples these systems by collecting and processing logs and events in a flexible and scalable way. Fluentd uses plugins and buffers to make data collection reliable even in the case of errors or failures. It can forward data between Fluentd nodes for high availability and load balancing.
This document discusses the pros and cons of building an in-house data analytics platform versus using cloud-based services. It notes that in startups it is generally better not to build your own platform and instead use cloud services from AWS, Google, or Treasure Data. However, the options have expanded in recent years to include on-premise or cloud-based platforms from vendors like Cloudera, Hortonworks, or cloud services from various providers. The document does not make a definitive conclusion, but discusses factors to consider around distributed processing, data management, process management, platform management, visualization, and connecting different data sources.
This document describes a presentation about introducing black magic programming patterns in Ruby and their pragmatic uses. It provides an overview of Fluentd, including what it is, its versions, and the changes between versions 0.12 and 0.14. Specifically, it discusses how the plugin API was updated in version 0.14 to address problems with the version 0.12 API. It also explains how a compatibility layer was implemented to allow most existing 0.12 plugins to work without modification in 0.14.
The document discusses ISUCON5 and its benchmark tools. It describes how the benchmark tools were developed with two main requirements - high performance and integrity checks. It then provides details on the key features and implementation of the benchmark tools using Java and the Jetty client library to send requests and check responses in parallel. Distributed benchmarking using multiple nodes is also covered. Performance results from ISUCON5 are shared showing it could handle around 194,000 requests per minute using up to 30 nodes.
Data-Driven Development Era and Its Technologies (Satoshi Tagomori)
This document discusses data-driven development and the technologies used in the data analytics process. It covers topics like data collection, storage, processing, and visualization. The document advocates using managed cloud services for data and analytics to focus on data instead of managing infrastructure. Choosing technologies should be based on the type of data and problems to solve, not the other way around. Services like Google BigQuery, Amazon Redshift, and Treasure Data are recommended for their ease of use.
1. The document discusses the role of an engineer at a tech company, focusing on how engineers can lead both technical and business aspects.
2. It argues that tech-led businesses involve engineers finding customer needs, creating valuable products/services, providing them to customers, collecting feedback, and continuously improving.
3. The most important thing is for engineers to enjoy their work and find meaning in creating things that provide value both for customers and themselves.
6. Bulk Data Loader
High Throughput & Reliability
Embulk
Written in Java/JRuby
http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed
http://www.embulk.org/
22. Queue/Worker, Scheduler
• Treasure Data: multi-tenant data analytics service
• executes many jobs in shared clusters (queries,
imports, ...)
• CORE: queues-workers & schedulers
• Clusters have their own queues/schedulers... but it's not enough:
• resource limits for each price plan
• priority queues for job types
• and many others
24. PerfectSched
• Provides periodic/scheduled queries for customers
• it's like reliable "cron"
• Highly available distributed scheduler using RDBMS
• Written in CRuby
• At-least-once semantics
• PerfectSched enqueues jobs into PerfectQueue
25. Jobs in TD
                      LOST   Duplicated   Retried for Errors   Throughput   Execution time
DATA import/export    NG     NG           OK or NG             HIGH         SHORT (secs-mins)
QUERY                 NG     OK           OK                   LOW          SHORT (secs) or LONG (mins-hours)
28. Features
• Priorities for query types
• Resource limits per account
• Graceful restarts
• Queries may run for a long time (up to 1 day)
• New worker code should be loaded while jobs started with the older code keep running (see the sketch below)
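The graceful-restart requirement can be illustrated with a minimal sketch (hypothetical, not PerfectQueue's real worker code): the running worker traps SIGTERM, stops picking up new jobs, finishes the job in flight, and exits, while a supervisor starts a fresh process that loads the newly deployed code. The helpers `dequeue_job` and `run_one_job` are assumed stubs.

# Minimal sketch of a gracefully restartable worker loop (hypothetical).
class GracefulWorker
  def initialize
    @shutdown = false
    # The supervisor sends SIGTERM on deploy: stop taking new jobs,
    # finish the one in flight, then exit.
    Signal.trap("TERM") { @shutdown = true }
  end

  def run
    until @shutdown
      job = dequeue_job        # hypothetical: claim the next job from the queue
      if job
        run_one_job(job)       # may take seconds to hours; never killed mid-flight
      else
        sleep 1
      end
    end
    # Process exits here; the supervisor starts a fresh worker process,
    # which loads the newly deployed code from disk.
  end

  # Stubs so the sketch runs standalone; real workers talk to PerfectQueue.
  def dequeue_job; nil; end
  def run_one_job(job); end
end

# GracefulWorker.new.run   # loops until SIGTERM is received

This way, long-running jobs on the old code and new jobs on the new code coexist during a deploy, which is what "graceful restarts" means here.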
29. PerfectQueue
• Highly available distributed queue using RDBMS
• Enqueue by INSERT INTO
• Dequeue/Commit by UPDATE
• Using transactions (a minimal sketch follows this slide)
• Designed for flexible scheduling rather than raw scalability
• Workers do many things
• PlazmaDB operations (including importing data)
• Building job parameters
• Handling results of jobs + kicking other jobs
• Using Amazon RDS (MySQL) internally (+ Workers on EC2)
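As a rough illustration of the enqueue-by-INSERT / dequeue-by-UPDATE pattern above, here is a minimal sketch using the mysql2 gem. The table layout (a `jobs` table with `id`, `payload`, `owner`, `locked_until`) and the lock timeout are assumptions for illustration, not PerfectQueue's actual schema.

require "json"
require "mysql2"
require "securerandom"

client = Mysql2::Client.new(host: "127.0.0.1", username: "td",
                            password: "secret", database: "queue")

# Enqueue: one INSERT per job.
payload = client.escape({ type: "hive", query: "SELECT ..." }.to_json)
client.query("INSERT INTO jobs (payload, locked_until) VALUES ('#{payload}', 0)")

# Dequeue: claim one job by UPDATE inside a transaction. The row stays in
# the table until the worker acknowledges it, so a crashed worker's job
# becomes visible again after the lock expires (at-least-once behavior).
owner = SecureRandom.uuid
now   = Time.now.to_i
client.query("BEGIN")
client.query(<<~SQL)
  UPDATE jobs
     SET owner = '#{owner}', locked_until = #{now + 300}
   WHERE locked_until < #{now}
   ORDER BY id
   LIMIT 1
SQL
job = client.query("SELECT * FROM jobs WHERE owner = '#{owner}' LIMIT 1").first
client.query("COMMIT")

# ... run the job ...

# Commit (acknowledge) the job by removing its row.
client.query("DELETE FROM jobs WHERE id = #{job['id']}") if job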
30. Building Jobs/Parameters
• Parameters
• for job types, accounts, price plans and clusters
• to control performance/parallelism, permissions
and data types
• ex: Java properties
• Jobs
• to prepare for customers' queries
• to make queries safer/faster
• ex: Hive Queries (HiveQL, a variety of SQL); see the sketch below
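A hedged sketch of what "building jobs and parameters" can look like in Ruby: merge settings coming from the account, price plan, and cluster into per-job Java-properties-style parameters, and prepend the boilerplate statements to the customer's HiveQL. The `JobBuilder` class and property names are hypothetical, not Treasure Data's actual worker code.

# Hypothetical job builder: combines plan/cluster/account settings into
# Hive job parameters and wraps the customer's query with required setup.
class JobBuilder
  PLAN_DEFAULTS = {
    "basic"   => { "mapreduce.job.reduces" => 4,  "td.job.priority" => 0 },
    "premium" => { "mapreduce.job.reduces" => 32, "td.job.priority" => 2 },
  }

  def initialize(account:, plan:, cluster:)
    @account, @plan, @cluster = account, plan, cluster
  end

  # Java-properties-style parameters, later passed to the Hadoop cluster.
  def properties(overrides = {})
    PLAN_DEFAULTS.fetch(@plan)
      .merge("td.account.id" => @account, "td.cluster" => @cluster)
      .merge(overrides)
      .map { |k, v| "#{k}=#{v}" }
      .join("\n")
  end

  # Wrap the customer's HiveQL with the setup every job needs.
  def hiveql(query)
    ["ADD JAR 'td-hadoop-1.0.jar';", "USE `db_#{@account}`;", query].join("\n")
  end
end

builder = JobBuilder.new(account: 221, plan: "premium", cluster: "hadoop-1")
puts builder.properties("mapreduce.job.reduces" => 64)
puts builder.hiveql("SELECT COUNT(1) FROM tbl1;")

The real builders also pull permissions and data-type information from internal APIs and RDBMSs, which is why the generated queries grow as large as the example on the following slides.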
33. Example: Hive job (cont)
ADD JAR 'td-hadoop-1.0.jar';
CREATE DATABASE IF NOT EXISTS `db`;
USE `db`;
CREATE TABLE tagomoris (`v` MAP<STRING,STRING>, `time` INT)
STORED BY 'com.treasure_data.hadoop.hive.mapred.TDStorageHandler'
WITH SERDEPROPERTIES ('msgpack.columns.mapping'='*,time')
TBLPROPERTIES (
'td.storage.user'='221',
'td.storage.database'='dfc',
'td.storage.table'='users_20100604_080812_ce9203d0',
'td.storage.path'='221/dfc/users_20100604_080812_ce9203d0',
'td.table_id'='2',
'td.modifiable'='true',
'plazma.data_set.name'='221/dfc/users_20100604_080812_ce9203d0'
);
CREATE TABLE tbl1 (
`uid` INT,
`key` STRING,
`time` INT
)
STORED BY 'com.treasure_data.hadoop.hive.mapred.TDStorageHandler'
WITH SERDEPROPERTIES ('msgpack.columns.mapping'='uid,key,time')
TBLPROPERTIES (
'td.storage.user'='221',
'td.storage.database'='dfc',
34. ADD JAR 'td-hadoop-1.0.jar';
CREATE DATABASE IF NOT EXISTS `db`;
USE `db`;
CREATE TABLE tagomoris (`v` MAP<STRING,STRING>, `time` INT)
STORED BY 'com.treasure_data.hadoop.hive.mapred.TDStorageHandler'
WITH SERDEPROPERTIES ('msgpack.columns.mapping'='*,time')
TBLPROPERTIES (
'td.storage.user'='221',
'td.storage.database'='dfc',
'td.storage.table'='users_20100604_080812_ce9203d0',
'td.storage.path'='221/dfc/users_20100604_080812_ce9203d0',
'td.table_id'='2',
'td.modifiable'='true',
'plazma.data_set.name'='221/dfc/users_20100604_080812_ce9203d0'
);
CREATE TABLE tbl1 (
`uid` INT,
`key` STRING,
`time` INT
)
STORED BY 'com.treasure_data.hadoop.hive.mapred.TDStorageHandler'
WITH SERDEPROPERTIES ('msgpack.columns.mapping'='uid,key,time')
TBLPROPERTIES (
'td.storage.user'='221',
'td.storage.database'='dfc',
'td.storage.table'='contests_20100606_120720_96abe81a',
'td.storage.path'='221/dfc/contests_20100606_120720_96abe81a',
'td.table_id'='4',
'td.modifiable'='true',
'plazma.data_set.name'='221/dfc/contests_20100606_120720_96abe81a'
);
USE `db`;
35. USE `db`;
CREATE TEMPORARY FUNCTION MSGPACK_SERIALIZE AS
'com.treasure_data.hadoop.hive.udf.MessagePackSerialize';
CREATE TEMPORARY FUNCTION TD_TIME_RANGE AS
'com.treasure_data.hadoop.hive.udf.GenericUDFTimeRange';
CREATE TEMPORARY FUNCTION TD_TIME_ADD AS
'com.treasure_data.hadoop.hive.udf.UDFTimeAdd';
CREATE TEMPORARY FUNCTION TD_TIME_FORMAT AS
'com.treasure_data.hadoop.hive.udf.UDFTimeFormat';
CREATE TEMPORARY FUNCTION TD_TIME_PARSE AS
'com.treasure_data.hadoop.hive.udf.UDFTimeParse';
CREATE TEMPORARY FUNCTION TD_SCHEDULED_TIME AS
'com.treasure_data.hadoop.hive.udf.GenericUDFScheduledTime';
CREATE TEMPORARY FUNCTION TD_X_RANK AS
'com.treasure_data.hadoop.hive.udf.Rank';
CREATE TEMPORARY FUNCTION TD_FIRST AS
'com.treasure_data.hadoop.hive.udf.GenericUDAFFirst';
CREATE TEMPORARY FUNCTION TD_LAST AS
'com.treasure_data.hadoop.hive.udf.GenericUDAFLast';
CREATE TEMPORARY FUNCTION TD_SESSIONIZE AS
'com.treasure_data.hadoop.hive.udf.UDFSessionize';
CREATE TEMPORARY FUNCTION TD_PARSE_USER_AGENT AS
'com.treasure_data.hadoop.hive.udf.GenericUDFParseUserAgent';
CREATE TEMPORARY FUNCTION TD_HEX2NUM AS
'com.treasure_data.hadoop.hive.udf.UDFHex2num';
CREATE TEMPORARY FUNCTION TD_MD5 AS
'com.treasure_data.hadoop.hive.udf.UDFmd5';
CREATE TEMPORARY FUNCTION TD_RANK_SEQUENCE AS
'com.treasure_data.hadoop.hive.udf.UDFRankSequence';
CREATE TEMPORARY FUNCTION TD_STRING_EXPLODER AS
'com.treasure_data.hadoop.hive.udf.GenericUDTFStringExploder';
CREATE TEMPORARY FUNCTION TD_URL_DECODE AS
36. CREATE TEMPORARY FUNCTION TD_URL_DECODE AS
'com.treasure_data.hadoop.hive.udf.UDFUrlDecode';
CREATE TEMPORARY FUNCTION TD_DATE_TRUNC AS
'com.treasure_data.hadoop.hive.udf.UDFDateTrunc';
CREATE TEMPORARY FUNCTION TD_LAT_LONG_TO_COUNTRY AS
'com.treasure_data.hadoop.hive.udf.UDFLatLongToCountry';
CREATE TEMPORARY FUNCTION TD_SUBSTRING_INENCODING AS
'com.treasure_data.hadoop.hive.udf.GenericUDFSubstringInEncoding';
CREATE TEMPORARY FUNCTION TD_DIVIDE AS
'com.treasure_data.hadoop.hive.udf.GenericUDFDivide';
CREATE TEMPORARY FUNCTION TD_SUMIF AS
'com.treasure_data.hadoop.hive.udf.GenericUDAFSumIf';
CREATE TEMPORARY FUNCTION TD_AVGIF AS
'com.treasure_data.hadoop.hive.udf.GenericUDAFAvgIf';
CREATE TEMPORARY FUNCTION hivemall_version AS
'hivemall.HivemallVersionUDF';
CREATE TEMPORARY FUNCTION perceptron AS
'hivemall.classifier.PerceptronUDTF';
CREATE TEMPORARY FUNCTION train_perceptron AS
'hivemall.classifier.PerceptronUDTF';
CREATE TEMPORARY FUNCTION train_pa AS
'hivemall.classifier.PassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_pa1 AS
'hivemall.classifier.PassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_pa2 AS
'hivemall.classifier.PassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_cw AS
'hivemall.classifier.ConfidenceWeightedUDTF';
CREATE TEMPORARY FUNCTION train_arow AS
'hivemall.classifier.AROWClassifierUDTF';
CREATE TEMPORARY FUNCTION train_arowh AS
'hivemall.classifier.AROWClassifierUDTF';
37. CREATE TEMPORARY FUNCTION train_arowh AS
'hivemall.classifier.AROWClassifierUDTF';
CREATE TEMPORARY FUNCTION train_scw AS
'hivemall.classifier.SoftConfideceWeightedUDTF';
CREATE TEMPORARY FUNCTION train_scw2 AS
'hivemall.classifier.SoftConfideceWeightedUDTF';
CREATE TEMPORARY FUNCTION adagrad_rda AS
'hivemall.classifier.AdaGradRDAUDTF';
CREATE TEMPORARY FUNCTION train_adagrad_rda AS
'hivemall.classifier.AdaGradRDAUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_perceptron AS
'hivemall.classifier.multiclass.MulticlassPerceptronUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_pa AS
'hivemall.classifier.multiclass.MulticlassPassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_pa1 AS
'hivemall.classifier.multiclass.MulticlassPassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_pa2 AS
'hivemall.classifier.multiclass.MulticlassPassiveAggressiveUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_cw AS
'hivemall.classifier.multiclass.MulticlassConfidenceWeightedUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_arow AS
'hivemall.classifier.multiclass.MulticlassAROWClassifierUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_scw AS
'hivemall.classifier.multiclass.MulticlassSoftConfidenceWeightedUDTF';
CREATE TEMPORARY FUNCTION train_multiclass_scw2 AS
'hivemall.classifier.multiclass.MulticlassSoftConfidenceWeightedUDTF';
CREATE TEMPORARY FUNCTION cosine_similarity AS
'hivemall.knn.similarity.CosineSimilarityUDF';
CREATE TEMPORARY FUNCTION cosine_sim AS
'hivemall.knn.similarity.CosineSimilarityUDF';
CREATE TEMPORARY FUNCTION jaccard AS
'hivemall.knn.similarity.JaccardIndexUDF';
38. CREATE TEMPORARY FUNCTION jaccard AS
'hivemall.knn.similarity.JaccardIndexUDF';
CREATE TEMPORARY FUNCTION jaccard_similarity AS
'hivemall.knn.similarity.JaccardIndexUDF';
CREATE TEMPORARY FUNCTION angular_similarity AS
'hivemall.knn.similarity.AngularSimilarityUDF';
CREATE TEMPORARY FUNCTION euclid_similarity AS
'hivemall.knn.similarity.EuclidSimilarity';
CREATE TEMPORARY FUNCTION distance2similarity AS
'hivemall.knn.similarity.Distance2SimilarityUDF';
CREATE TEMPORARY FUNCTION hamming_distance AS
'hivemall.knn.distance.HammingDistanceUDF';
CREATE TEMPORARY FUNCTION popcnt AS 'hivemall.knn.distance.PopcountUDF';
CREATE TEMPORARY FUNCTION kld AS 'hivemall.knn.distance.KLDivergenceUDF';
CREATE TEMPORARY FUNCTION euclid_distance AS
'hivemall.knn.distance.EuclidDistanceUDF';
CREATE TEMPORARY FUNCTION cosine_distance AS
'hivemall.knn.distance.CosineDistanceUDF';
CREATE TEMPORARY FUNCTION angular_distance AS
'hivemall.knn.distance.AngularDistanceUDF';
CREATE TEMPORARY FUNCTION jaccard_distance AS
'hivemall.knn.distance.JaccardDistanceUDF';
CREATE TEMPORARY FUNCTION manhattan_distance AS
'hivemall.knn.distance.ManhattanDistanceUDF';
CREATE TEMPORARY FUNCTION minkowski_distance AS
'hivemall.knn.distance.MinkowskiDistanceUDF';
CREATE TEMPORARY FUNCTION minhashes AS 'hivemall.knn.lsh.MinHashesUDF';
CREATE TEMPORARY FUNCTION minhash AS 'hivemall.knn.lsh.MinHashUDTF';
CREATE TEMPORARY FUNCTION bbit_minhash AS
'hivemall.knn.lsh.bBitMinHashUDF';
CREATE TEMPORARY FUNCTION voted_avg AS
'hivemall.ensemble.bagging.VotedAvgUDAF';
39. CREATE TEMPORARY FUNCTION voted_avg AS
'hivemall.ensemble.bagging.VotedAvgUDAF';
CREATE TEMPORARY FUNCTION weight_voted_avg AS
'hivemall.ensemble.bagging.WeightVotedAvgUDAF';
CREATE TEMPORARY FUNCTION wvoted_avg AS
'hivemall.ensemble.bagging.WeightVotedAvgUDAF';
CREATE TEMPORARY FUNCTION max_label AS
'hivemall.ensemble.MaxValueLabelUDAF';
CREATE TEMPORARY FUNCTION maxrow AS 'hivemall.ensemble.MaxRowUDAF';
CREATE TEMPORARY FUNCTION argmin_kld AS
'hivemall.ensemble.ArgminKLDistanceUDAF';
CREATE TEMPORARY FUNCTION mhash AS
'hivemall.ftvec.hashing.MurmurHash3UDF';
CREATE TEMPORARY FUNCTION sha1 AS 'hivemall.ftvec.hashing.Sha1UDF';
CREATE TEMPORARY FUNCTION array_hash_values AS
'hivemall.ftvec.hashing.ArrayHashValuesUDF';
CREATE TEMPORARY FUNCTION prefixed_hash_values AS
'hivemall.ftvec.hashing.ArrayPrefixedHashValuesUDF';
CREATE TEMPORARY FUNCTION polynomial_features AS
'hivemall.ftvec.pairing.PolynomialFeaturesUDF';
CREATE TEMPORARY FUNCTION powered_features AS
'hivemall.ftvec.pairing.PoweredFeaturesUDF';
CREATE TEMPORARY FUNCTION rescale AS 'hivemall.ftvec.scaling.RescaleUDF';
CREATE TEMPORARY FUNCTION rescale_fv AS
'hivemall.ftvec.scaling.RescaleUDF';
CREATE TEMPORARY FUNCTION zscore AS 'hivemall.ftvec.scaling.ZScoreUDF';
CREATE TEMPORARY FUNCTION normalize AS
'hivemall.ftvec.scaling.L2NormalizationUDF';
CREATE TEMPORARY FUNCTION conv2dense AS
'hivemall.ftvec.conv.ConvertToDenseModelUDAF';
CREATE TEMPORARY FUNCTION to_dense_features AS
'hivemall.ftvec.conv.ToDenseFeaturesUDF';
40. CREATE TEMPORARY FUNCTION to_dense_features AS
'hivemall.ftvec.conv.ToDenseFeaturesUDF';
CREATE TEMPORARY FUNCTION to_dense AS
'hivemall.ftvec.conv.ToDenseFeaturesUDF';
CREATE TEMPORARY FUNCTION to_sparse_features AS
'hivemall.ftvec.conv.ToSparseFeaturesUDF';
CREATE TEMPORARY FUNCTION to_sparse AS
'hivemall.ftvec.conv.ToSparseFeaturesUDF';
CREATE TEMPORARY FUNCTION quantify AS
'hivemall.ftvec.conv.QuantifyColumnsUDTF';
CREATE TEMPORARY FUNCTION vectorize_features AS
'hivemall.ftvec.trans.VectorizeFeaturesUDF';
CREATE TEMPORARY FUNCTION categorical_features AS
'hivemall.ftvec.trans.CategoricalFeaturesUDF';
CREATE TEMPORARY FUNCTION indexed_features AS
'hivemall.ftvec.trans.IndexedFeatures';
CREATE TEMPORARY FUNCTION quantified_features AS
'hivemall.ftvec.trans.QuantifiedFeaturesUDTF';
CREATE TEMPORARY FUNCTION quantitative_features AS
'hivemall.ftvec.trans.QuantitativeFeaturesUDF';
CREATE TEMPORARY FUNCTION amplify AS
'hivemall.ftvec.amplify.AmplifierUDTF';
CREATE TEMPORARY FUNCTION rand_amplify AS
'hivemall.ftvec.amplify.RandomAmplifierUDTF';
CREATE TEMPORARY FUNCTION addBias AS 'hivemall.ftvec.AddBiasUDF';
CREATE TEMPORARY FUNCTION add_bias AS 'hivemall.ftvec.AddBiasUDF';
CREATE TEMPORARY FUNCTION sortByFeature AS
'hivemall.ftvec.SortByFeatureUDF';
CREATE TEMPORARY FUNCTION sort_by_feature AS
'hivemall.ftvec.SortByFeatureUDF';
CREATE TEMPORARY FUNCTION extract_feature AS
'hivemall.ftvec.ExtractFeatureUDF';
41. CREATE TEMPORARY FUNCTION extract_feature AS
'hivemall.ftvec.ExtractFeatureUDF';
CREATE TEMPORARY FUNCTION extract_weight AS
'hivemall.ftvec.ExtractWeightUDF';
CREATE TEMPORARY FUNCTION add_feature_index AS
'hivemall.ftvec.AddFeatureIndexUDF';
CREATE TEMPORARY FUNCTION feature AS 'hivemall.ftvec.FeatureUDF';
CREATE TEMPORARY FUNCTION feature_index AS
'hivemall.ftvec.FeatureIndexUDF';
CREATE TEMPORARY FUNCTION tf AS 'hivemall.ftvec.text.TermFrequencyUDAF';
CREATE TEMPORARY FUNCTION train_logregr AS
'hivemall.regression.LogressUDTF';
CREATE TEMPORARY FUNCTION train_pa1_regr AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION train_pa1a_regr AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION train_pa2_regr AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION train_pa2a_regr AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION train_arow_regr AS
'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION train_arowe_regr AS
'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION train_arowe2_regr AS
'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION train_adagrad_regr AS
'hivemall.regression.AdaGradUDTF';
CREATE TEMPORARY FUNCTION train_adadelta_regr AS
'hivemall.regression.AdaDeltaUDTF';
CREATE TEMPORARY FUNCTION train_adagrad AS
'hivemall.regression.AdaGradUDTF';
42. CREATE TEMPORARY FUNCTION train_adagrad AS
'hivemall.regression.AdaGradUDTF';
CREATE TEMPORARY FUNCTION train_adadelta AS
'hivemall.regression.AdaDeltaUDTF';
CREATE TEMPORARY FUNCTION logress AS 'hivemall.regression.LogressUDTF';
CREATE TEMPORARY FUNCTION pa1_regress AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION pa1a_regress AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION pa2_regress AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION pa2a_regress AS
'hivemall.regression.PassiveAggressiveRegressionUDTF';
CREATE TEMPORARY FUNCTION arow_regress AS
'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION arowe_regress AS
'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION arowe2_regress AS
'hivemall.regression.AROWRegressionUDTF';
CREATE TEMPORARY FUNCTION adagrad AS 'hivemall.regression.AdaGradUDTF';
CREATE TEMPORARY FUNCTION adadelta AS 'hivemall.regression.AdaDeltaUDTF';
CREATE TEMPORARY FUNCTION float_array AS
'hivemall.tools.array.AllocFloatArrayUDF';
CREATE TEMPORARY FUNCTION array_remove AS
'hivemall.tools.array.ArrayRemoveUDF';
CREATE TEMPORARY FUNCTION sort_and_uniq_array AS
'hivemall.tools.array.SortAndUniqArrayUDF';
CREATE TEMPORARY FUNCTION subarray_endwith AS
'hivemall.tools.array.SubarrayEndWithUDF';
CREATE TEMPORARY FUNCTION subarray_startwith AS
'hivemall.tools.array.SubarrayStartWithUDF';
CREATE TEMPORARY FUNCTION collect_all AS
43. CREATE TEMPORARY FUNCTION collect_all AS
'hivemall.tools.array.CollectAllUDAF';
CREATE TEMPORARY FUNCTION concat_array AS
'hivemall.tools.array.ConcatArrayUDF';
CREATE TEMPORARY FUNCTION subarray AS 'hivemall.tools.array.SubarrayUDF';
CREATE TEMPORARY FUNCTION array_avg AS
'hivemall.tools.array.ArrayAvgGenericUDAF';
CREATE TEMPORARY FUNCTION array_sum AS
'hivemall.tools.array.ArraySumUDAF';
CREATE TEMPORARY FUNCTION to_string_array AS
'hivemall.tools.array.ToStringArrayUDF';
CREATE TEMPORARY FUNCTION map_get_sum AS
'hivemall.tools.map.MapGetSumUDF';
CREATE TEMPORARY FUNCTION map_tail_n AS 'hivemall.tools.map.MapTailNUDF';
CREATE TEMPORARY FUNCTION to_map AS 'hivemall.tools.map.UDAFToMap';
CREATE TEMPORARY FUNCTION to_ordered_map AS
'hivemall.tools.map.UDAFToOrderedMap';
CREATE TEMPORARY FUNCTION sigmoid AS
'hivemall.tools.math.SigmoidGenericUDF';
CREATE TEMPORARY FUNCTION taskid AS 'hivemall.tools.mapred.TaskIdUDF';
CREATE TEMPORARY FUNCTION jobid AS 'hivemall.tools.mapred.JobIdUDF';
CREATE TEMPORARY FUNCTION rowid AS 'hivemall.tools.mapred.RowIdUDF';
CREATE TEMPORARY FUNCTION generate_series AS
'hivemall.tools.GenerateSeriesUDTF';
CREATE TEMPORARY FUNCTION convert_label AS
'hivemall.tools.ConvertLabelUDF';
CREATE TEMPORARY FUNCTION x_rank AS 'hivemall.tools.RankSequenceUDF';
CREATE TEMPORARY FUNCTION each_top_k AS 'hivemall.tools.EachTopKUDTF';
CREATE TEMPORARY FUNCTION tokenize AS 'hivemall.tools.text.TokenizeUDF';
CREATE TEMPORARY FUNCTION is_stopword AS
'hivemall.tools.text.StopwordUDF';
CREATE TEMPORARY FUNCTION split_words AS
44. CREATE TEMPORARY FUNCTION split_words AS
'hivemall.tools.text.SplitWordsUDF';
CREATE TEMPORARY FUNCTION normalize_unicode AS
'hivemall.tools.text.NormalizeUnicodeUDF';
CREATE TEMPORARY FUNCTION lr_datagen AS
'hivemall.dataset.LogisticRegressionDataGeneratorUDTF';
CREATE TEMPORARY FUNCTION f1score AS 'hivemall.evaluation.FMeasureUDAF';
CREATE TEMPORARY FUNCTION mae AS
'hivemall.evaluation.MeanAbsoluteErrorUDAF';
CREATE TEMPORARY FUNCTION mse AS
'hivemall.evaluation.MeanSquaredErrorUDAF';
CREATE TEMPORARY FUNCTION rmse AS
'hivemall.evaluation.RootMeanSquaredErrorUDAF';
CREATE TEMPORARY FUNCTION mf_predict AS 'hivemall.mf.MFPredictionUDF';
CREATE TEMPORARY FUNCTION train_mf_sgd AS
'hivemall.mf.MatrixFactorizationSGDUDTF';
CREATE TEMPORARY FUNCTION train_mf_adagrad AS
'hivemall.mf.MatrixFactorizationAdaGradUDTF';
CREATE TEMPORARY FUNCTION fm_predict AS
'hivemall.fm.FMPredictGenericUDAF';
CREATE TEMPORARY FUNCTION train_fm AS
'hivemall.fm.FactorizationMachineUDTF';
CREATE TEMPORARY FUNCTION train_randomforest_classifier AS
'hivemall.smile.classification.RandomForestClassifierUDTF';
CREATE TEMPORARY FUNCTION train_rf_classifier AS
'hivemall.smile.classification.RandomForestClassifierUDTF';
CREATE TEMPORARY FUNCTION train_randomforest_regr AS
'hivemall.smile.regression.RandomForestRegressionUDTF';
CREATE TEMPORARY FUNCTION train_rf_regr AS
'hivemall.smile.regression.RandomForestRegressionUDTF';
CREATE TEMPORARY FUNCTION tree_predict AS
'hivemall.smile.tools.TreePredictByStackMachineUDF';
45. CREATE TEMPORARY FUNCTION tree_predict AS
'hivemall.smile.tools.TreePredictByStackMachineUDF';
CREATE TEMPORARY FUNCTION vm_tree_predict AS
'hivemall.smile.tools.TreePredictByStackMachineUDF';
CREATE TEMPORARY FUNCTION rf_ensemble AS
'hivemall.smile.tools.RandomForestEnsembleUDAF';
CREATE TEMPORARY FUNCTION train_gradient_boosting_classifier AS
'hivemall.smile.classification.GradientTreeBoostingClassifierUDTF';
CREATE TEMPORARY FUNCTION guess_attribute_types AS
'hivemall.smile.tools.GuessAttributesUDF';
CREATE TEMPORARY FUNCTION tokenize_ja AS
'hivemall.nlp.tokenizer.KuromojiUDF';
CREATE TEMPORARY MACRO max2(x DOUBLE, y DOUBLE) if(x>y,x,y);
CREATE TEMPORARY MACRO min2(x DOUBLE, y DOUBLE) if(x<y,x,y);
CREATE TEMPORARY MACRO rand_gid(k INT) floor(rand()*k);
CREATE TEMPORARY MACRO rand_gid2(k INT, seed INT) floor(rand(seed)*k);
CREATE TEMPORARY MACRO idf(df_t DOUBLE, n_docs DOUBLE) log(10, n_docs /
max2(1,df_t)) + 1.0;
CREATE TEMPORARY MACRO tfidf(tf FLOAT, df_t DOUBLE, n_docs DOUBLE) tf *
(log(10, n_docs / max2(1,df_t)) + 1.0);
SELECT time, COUNT(1) AS cnt FROM tbl1
WHERE TD_TIME_RANGE(time, '2015-12-11', '2015-12-12', 'JST');
47. PQ written in Ruby
• Building jobs/parameters is so complex!
• using data from many configurations (YAML, JSON),
internal APIs and RDBMSs
• with many extension syntaxes/rules to tune performance,
override configurations for tests, ...
• Ruby empowers us to write such fat/complex worker code
• Testing!
• Unit tests using RSpec (see the sketch below)
• System tests (executing real queries/jobs) using RSpec
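For illustration, a minimal RSpec unit-test sketch against the hypothetical JobBuilder sketched earlier; the file name, class, and expectations are assumptions, not the real worker test suite.

require "rspec"
require_relative "job_builder"   # the hypothetical builder sketched earlier

RSpec.describe JobBuilder do
  subject(:builder) { JobBuilder.new(account: 221, plan: "basic", cluster: "hadoop-1") }

  it "applies plan defaults and per-job overrides to the properties" do
    props = builder.properties("mapreduce.job.reduces" => 8)
    expect(props).to include("mapreduce.job.reduces=8")
    expect(props).to include("td.account.id=221")
  end

  it "prepends the required setup statements to the customer's HiveQL" do
    expect(builder.hiveql("SELECT 1;")).to start_with("ADD JAR 'td-hadoop-1.0.jar';")
  end
end

Run with `rspec job_builder_spec.rb`; system tests follow the same structure but execute real queries against a test cluster.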
48. For further improvement about workers
• More performance for more customers and lower costs
• More scalability for many other kinds of jobs
• Better and well-controlled tests (indented here-documents! see the sketch below)
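"Indented here-documents" refers to Ruby's squiggly heredoc (`<<~`, available since Ruby 2.3), which strips the common leading indentation and keeps long embedded HiveQL readable inside worker code and tests. A minimal sketch:

# A squiggly heredoc keeps indentation in the source file but strips it
# from the resulting string, so embedded HiveQL stays readable.
def count_query(table)
  <<~SQL
    SELECT time, COUNT(1) AS cnt
    FROM #{table}
    WHERE TD_TIME_RANGE(time, '2015-12-11', '2015-12-12', 'JST')
  SQL
end

puts count_query("tbl1")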