Scala at Treasure Data
Taro L. Saito - GitHub:@xerial
Ph.D., Software Engineer at Treasure Data, Inc.
Treasure Data Tech Talk @ Tokyo, June 13, 2017
Why Scala?
• Scala is not an official programming language of Treasure Data
• I was the only engineer at TD who could write Scala
• 3 years ago
• Now all of my team members can write Scala
• Fact: Java experts can quickly learn Scala
Challenge: Increased Presto Usage at Treasure Data (2017)
Processing 15 Trillion Rows / Day (= 173 Million Rows / sec.)
150,000~ Queries / Day
1,500~ Users
• How do we improve the service by utilizing this massive amount of query logs?
Query Logs
Improve & Optimize
A Success Story: Using Scala in Genome Science
Scala Use Cases in TD
• Analyzing Query Engine Logs
• Data analytics workflows written in Scala
• For finding effective optimization approaches
• Prestobase
• Management Base of Presto
• Gateway to access Presto (Finagle + Presto)
• Monitoring + Runtime Analysis
• Spark Integration
• Accessing Treasure Data from Spark
Open-Source Scala Libraries Developed at TD
• Libraries that make Scala programming fun
• wvlet-log: Handy logging library
• Airframe: Dependency Injection Library
• Airframe Config: YAML-based configuration library (a module in Airframe)
• Heavy use of meta-programming via Scalamacros
• sbt plugins
• Data analytics
• sbt-sql
• Deployment
• sbt-pack
• sbt-sonatype
What is Scalamacros?
• Generates Scala code at compile-time
• Meta-programming (Writing a program that writes programs)
• Experimental in Scala 2.10, 2.11, and 2.12
• Scalamacros will no longer be experimental
• Productization within 2017
• Scala Macro author (@xeno-by), IntelliJ team, EPFL Ph.D student
• Support Scala 2.12 (and maybe Scala 2.11) and 3.0
• Announced at Scala Meetup at Twitter HQ, San Francisco
What is Scala 3.x?
• Scala 3.x
• Replaces the compiler with Dotty for faster compilation and better IDE integration
• Dotty: Compilers Are Databases (Martin Odersky, Scala’s creator)
• Because the compiler needs to answer …
• Q: What is the signature of method A.f at a given point in time?
• class A[T] { def f(x: T): T = … }
• Compiler itself, IDE (e.g., IntelliJ), etc.
• Need to know these temporal types (Denotation)
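One concrete reason the answer depends on the point in time: after the erasure phase, f(x: T): T becomes f(x: Object): Object. The sketch below uses only Java reflection to show both views of the same method. The wrapper object SignatureDemo is a name made up for this example, and this is not how Dotty represents denotations internally.

```scala
object SignatureDemo {
  class A[T] { def f(x: T): T = x }

  // After erasure, the JVM-level method takes and returns Object
  private val method = classOf[A[_]].getMethod("f", classOf[Object])

  // Erased signature, as seen by the JVM
  val erasedReturnType: String = method.getReturnType.getName

  // Generic signature, as recorded by the compiler in class-file metadata
  val genericReturnType: String = method.getGenericReturnType.getTypeName
}
```

Both values describe the same method f, so a tool asking for "the" signature also has to say which phase it means.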
Open-Source Scala Libraries in TD
Logging Library: Hard to Use
• Logging configuration is hard
• slf4j, log4j, logback-classic, etc.
• XML configuration, etc.
• Need to have redundant getLogger calls
embulk log configuration with logback-classic
Dependency Hell of slf4j
• slf4j (Simple Logging Facade for Java)
• The de facto standard logging library for Java
• scala-logging: slf4j wrapper for Scala
• Switches log outputs
• Using a binding library in classpath
• slf4j-nop (no output)
• slf4j-simple (console output)
• slf4j-log4j (output to log4j)
• Pitfall
• Cannot have multiple binders
• But must have 1 binder (!!!)
• de facto = many bad users
• e.g., Hadoop
• Doesn’t consider downstream users: includes slf4j-log4j as a direct dependency
• Need to exclude slf4j-log4j bindings from all Hadoop-related dependencies
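In an sbt build, the exclusion could look roughly like the fragment below. The Hadoop artifact and both version numbers are only illustrative placeholders; the binding artifact is published as slf4j-log4j12, which the slides abbreviate as slf4j-log4j.

```scala
// build.sbt (sketch). Artifact choice and versions are placeholders.
libraryDependencies ++= Seq(
  // Drop the log4j binding that comes in transitively with Hadoop
  ("org.apache.hadoop" % "hadoop-common" % "2.7.3")
    .exclude("org.slf4j", "slf4j-log4j12"),
  // Keep exactly one binding of your own choice on the classpath
  "org.slf4j" % "slf4j-simple" % "1.7.25"
)
```

The same exclusion has to be repeated for every dependency that pulls the binding in, which is the maintenance burden this slide complains about.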
• wvlet-log favors simplicity
• Use Scalamacros to simplify user codes
• Only need to extend LogSupport trait
• No getLogger call
• Using standard java.util.logging
• No other dependency required
• Features
• Show source code locations of logs
• Log format is configurable in the code (no XML or plugins!)
• Changing log levels with files or JMX
• Built-in log handlers
• log-rotate handler, async handler
• Works with Scala.js to show logs in the web browser console
wvlet-log: Logging code generation with Scalamacros
• Generate low-overhead logging code
• Quasiquote
• q"… Scala code"
• Just writing Scala code template in macros
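To make "low-overhead logging code" concrete, here is a minimal sketch of the shape of code such a macro could expand to: check the level first, build the message only if needed. It uses java.util.logging and a by-name parameter instead of a macro. SimpleLogSupport and QueryAnalyzer are hypothetical names for this example; this is not the wvlet-log implementation.

```scala
import java.util.logging.{Level, Logger}

// Hypothetical stand-in for a LogSupport-style trait (not wvlet-log itself)
trait SimpleLogSupport {
  // One logger per concrete class, created lazily; no getLogger call in user code
  protected lazy val logger: Logger = Logger.getLogger(getClass.getName)

  // By-name parameter: the message is evaluated only when the level is enabled
  protected def info(message: => Any): Unit =
    if (logger.isLoggable(Level.INFO)) logger.log(Level.INFO, message.toString)

  protected def debug(message: => Any): Unit =
    if (logger.isLoggable(Level.FINE)) logger.log(Level.FINE, message.toString)
}

class QueryAnalyzer extends SimpleLogSupport {
  // Counts how often the debug message was actually built
  var evaluated = 0

  def run(): Unit = {
    info("analysis started")
    debug { evaluated += 1; "expensive detail" }
  }
}
```

With the default java.util.logging level (INFO), the debug block is never evaluated. A macro can go one step further and inline the level check at the call site.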
• Airframe: Dependency Injection Library for Scala
• Best practices of building objects in Scala
• We needed Google Guice for Scala
• But there is no good alternative
• Guice, Dagger2, Scaldi, Macwire, etc.
• Using Google Guice in Scala
• PlayFramework
• Weird syntax
• Airframe uses Scalamacros to simplify DI in Scala
• Three-step DI in Scala
• Bind
• Design
• Build
• Built-in life cycle manager
• Session start/shutdown
• e.g., connection open/close
• Session
• Manage singletons and binding rules
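The design / build / session ideas above can be sketched with a tiny hand-rolled container. Everything here (MiniDesign, MiniSession, Connection, UserService, the JDBC URL) is hypothetical and only illustrates the pattern. Airframe's real API is different and relies on Scalamacros rather than an explicit registry.

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

// Hypothetical mini container for illustration only; not Airframe's API.
final class MiniDesign(rules: Map[String, MiniSession => Any]) {
  // Design: declare how a component of type A is built
  def add[A](factory: MiniSession => A)(implicit tag: ClassTag[A]): MiniDesign =
    new MiniDesign(rules + (tag.runtimeClass.getName -> factory))

  def newSession: MiniSession = new MiniSession(rules)
}

final class MiniSession(rules: Map[String, MiniSession => Any]) {
  private val singletons = mutable.Map.empty[String, Any]
  private var shutdownHooks = List.empty[() => Unit]

  // Build: create the component once per session, then reuse it
  def build[A](implicit tag: ClassTag[A]): A = {
    val key = tag.runtimeClass.getName
    val instance = singletons.get(key) match {
      case Some(existing) => existing
      case None =>
        val created = rules(key)(this)
        singletons.put(key, created)
        created
    }
    instance.asInstanceOf[A]
  }

  // Life cycle: hooks run in reverse registration order at shutdown
  def onShutdown(hook: () => Unit): Unit = shutdownHooks = hook :: shutdownHooks
  def shutdown(): Unit = shutdownHooks.foreach(hook => hook())
}

class Connection(val url: String) {
  var isOpen = true
  def close(): Unit = isOpen = false
}
class UserService(val connection: Connection)

object DiExample {
  // How to build objects lives in one place, separate from how they are used
  val design: MiniDesign = new MiniDesign(Map.empty)
    .add[Connection] { session =>
      val c = new Connection("jdbc:sqlite::memory:")
      session.onShutdown(() => c.close())
      c
    }
    .add[UserService](session => new UserService(session.build[Connection]))
}
```

A session built from DiExample.design hands out the same Connection to every component and closes it on shutdown, which is the open/close life cycle mentioned above.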
Clear Separation of Concerns
• Traditional Service Building:
• With Airframe:
• Clear separation of concerns:
• How to build objects (design)
• How to use objects (bind)
• Simplest DI pattern for Scala
How to build dependencies
Just use components!
Need to remember argument order
Airframe Internals (Advanced)
• Code generation with Scalamacros
• Passing a Session when building App and A
Customizing Prestobase Filters with Airframe
• Prestobase Proxy: Gateway to access Presto
• Adding TD specific binding
• Finagle filters -> Injecting TD Specific filters
VCR Record/Replay for Testing Presto
• Launching Presto requires a lot of memory (e.g., 2GB or more)
• Often crashes CI service containers (TravisCI, CircleCI, etc.)
• Recording Presto responses (prestobase-vcr)
• with sqlite-jdbc:
• DB file for each test suite
• Enabled small-memory footprint testing
• Can run many Presto tests in CI
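The record/replay idea can be sketched in a few lines. This is a simplified illustration, not prestobase-vcr itself: the "cassette" is an in-memory map standing in for the per-suite SQLite file, responses are plain strings, and all names (Vcr, VcrMode) are made up for the example.

```scala
import scala.collection.mutable

sealed trait VcrMode
case object Record extends VcrMode
case object Replay extends VcrMode

// Hypothetical sketch: cassette = recorded responses keyed by SQL text
final class Vcr(mode: VcrMode,
                cassette: mutable.Map[String, String],
                backend: String => String) {

  def query(sql: String): String = mode match {
    case Record =>
      // Call the real backend once and remember its answer
      val response = backend(sql)
      cassette.update(sql, response)
      response
    case Replay =>
      // Never touch the backend; fail loudly if nothing was recorded
      cassette.getOrElse(sql, throw new NoSuchElementException(s"No recording for: $sql"))
  }
}
```

In replay mode the test needs no running Presto at all, which is where the small memory footprint comes from.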
Airframe Config
• YAML is useful for configuring applications
• Embedding YAML configurations inside docker images
• Provide credentials separately
• password, API keys, instance specific param, etc.
• properties file, environment variables, etc.
• YAML + overrides + object mapping
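A minimal sketch of "YAML + overrides + object mapping", assuming the YAML file has already been parsed into key/value pairs. The ServerConfig class and the key names are hypothetical, and this is not the Airframe Config API.

```scala
// Hypothetical config class and keys; real YAML parsing is left out.
final case class ServerConfig(host: String, port: Int, apiKey: Option[String])

object ConfigExample {
  // Values as they might be read from a YAML file baked into the Docker image
  val fromYaml: Map[String, String] =
    Map("server.host" -> "localhost", "server.port" -> "8080")

  // Overrides supplied separately, e.g. from a properties file or environment variables
  def load(overrides: Map[String, String]): ServerConfig = {
    val merged = fromYaml ++ overrides // entries on the right win
    ServerConfig(
      host = merged("server.host"),
      port = merged("server.port").toInt,
      apiKey = merged.get("server.apiKey")
    )
  }
}
```

Credentials such as the API key never need to appear in the image: they arrive only through the overrides.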
Airframe Internal: Surface
• Surface: Object surface (shape) inspector library
• case class A(id:Int, name:String)
• surface.of[A]
• => Surface("A", Seq(Param("id", surface.of[Int]), Param("name", surface.of[String])))
• Extract object type parameters with Scala Runtime Reflection
• Scala generates this type information at compile time
• Used as Type Identifiers of Airframe and Airframe Config
• e.g., [A], [Seq[B]], [Map[Int, String]], [A @@ Tag], etc.
• Generating serializer/deserializer of Scala classes
• Surface => Serialize object parameters => Encoding in MessagePack.gz => Embulk
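Why compile-time type information matters here: at runtime the JVM erases Seq[Int] and Seq[String] to the same class, so they cannot serve as distinct type identifiers. The sketch below imitates the idea with an implicitly resolved typeclass. HasSurface and its instances are hypothetical and hand-written for this example; the real Surface library produces such descriptions for you.

```scala
final case class Param(name: String, surface: Surface)
final case class Surface(name: String, params: Seq[Param] = Seq.empty)

// Hypothetical typeclass: instances are selected by the compiler, so generic
// type arguments survive even though the JVM erases them at runtime.
trait HasSurface[T] { def surface: Surface }

object HasSurface {
  def of[T](implicit ev: HasSurface[T]): Surface = ev.surface

  private def instance[T](s: Surface): HasSurface[T] =
    new HasSurface[T] { val surface: Surface = s }

  implicit val intSurface: HasSurface[Int] = instance(Surface("Int"))
  implicit val stringSurface: HasSurface[String] = instance(Surface("String"))
  implicit def seqSurface[T](implicit element: HasSurface[T]): HasSurface[Seq[T]] =
    instance(Surface(s"Seq[${element.surface.name}]"))
}

case class A(id: Int, name: String)

object A {
  // Written by hand in this sketch
  implicit val surfaceOfA: HasSurface[A] = new HasSurface[A] {
    val surface: Surface =
      Surface("A", Seq(Param("id", HasSurface.of[Int]), Param("name", HasSurface.of[String])))
  }
}
```

HasSurface.of[Seq[Int]] and HasSurface.of[Seq[String]] yield different descriptions, so a serializer or a DI binding can be looked up per full type.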
• Access TD from Spark
• Binding components with Airframe
• IO Manager, Presto Client, etc.
• Passing Design through SparkContext
• Integration
• TD -> Spark Dataframe
• TD Presto Query -> DataFrame
Data Analytics with Scala
New Directions Explored By Presto
• Traditional Database Usage
• Required Database Administrator (DBA)
• DBA designs the schema and queries
• DBA tunes query performance
• After Presto
• Schema is designed by data providers
• 1st-party data (user’s customer data)
• 3rd party data sources
• Analysts or Marketers explore the data with Presto
• Don’t know the schema in advance
• Many Analytical SQL queries
Bridging Gaps Between SQL and Programming Language
• Traditional Approach
• O/R mapper: app developers design objects and schemas, then generate SQL
• New Approach: SQL First
• Need to manage various SQL results inside programming language
• But How?
An Instinct
• sbt-sql: sbt plugin for generating model classes from SQL files
• src/main/sql/presto/*.sql (Presto Queries)
• Using SQL as a function
• Read Presto SQL Results as Objects
• Enabled managing SQL queries in GitHub
• Type-safe data analysis
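To illustrate "SQL as a function" and "results as objects", here is a hand-written sketch of the kind of code one could imagine being generated from a .sql file. The query, the table and column names, the object name TopSymbols, and the row format are all hypothetical; this is not sbt-sql's actual output.

```scala
// Hand-written sketch; query, names, and row format are hypothetical.
object TopSymbols {
  // The SQL text, with its parameter exposed as a typed function argument
  def sql(minVolume: Long): String =
    s"SELECT symbol, volume FROM nasdaq WHERE volume >= $minVolume"

  // One result row as a typed object
  final case class Row(symbol: String, volume: Long)

  // Map an untyped result row (column values in SELECT order) to the model class
  def fromRow(columns: Seq[Any]): Row = columns match {
    case Seq(symbol: String, volume: Long) => Row(symbol, volume)
    case other => throw new IllegalArgumentException(s"Unexpected row: $other")
  }
}
```

Once rows are typed, a column that is renamed in the SQL file shows up as a compile error in the analysis code instead of a runtime failure.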
Scala at Production
• Do you need to install Scala?
• No. Only JDK is required
• sbt-pack
• Create Scala code packages for releasing
• At ./target/pack folder
• Folder structure:
• bin/ - launch scripts
• lib/ - Scala/Java libraries
• Makes it easier to create Docker images
• Also used for creating distributable packages of td-spark
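A minimal build fragment for sbt-pack might look like this. The program name and main class are placeholders, the plugin version is deliberately left open, and depending on the plugin version an additional enablePlugins(PackPlugin) line may be required.

```scala
// project/plugins.sbt (replace <version> with a released sbt-pack version)
addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "<version>")

// build.sbt: program name -> main class (both are placeholders)
packMain := Map("my-app" -> "example.Main")
```

Running sbt pack is then expected to create target/pack/bin/my-app and put all jars under target/pack/lib, so a Docker image only needs a JDK plus a copy of that folder.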
Deploying to Maven Central
• Necessary Steps
• Upload artifacts -> Close -> Release -> Drop
• Painful
• Need to log in to the Nexus Web UI
• Many manual steps
• Bintray?
• Uploading to Bintray -> Automatic sync to Maven Central
sbt-sonatype plugin
• Enable one-command release to Maven Central
• Using REST APIs of Sonatype NEXUS Repository Manager
• Developed during the 2015 New Year holiday
• Jan 5: Test Nexus REST API
• Jan 20: First release (just one day of effort)
• Released sbt-sonatype using sbt-sonatype
• 2,000+ projects are using sbt-sonatype
• Supporting sbt 0.13.x and 1.0.0
• And can be used for Java projects too
• Nexus to Maven Central sync is now fast
• Less than 10 minutes (June 2017)
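A rough sketch of the setup, assuming artifacts are signed and uploaded by a separate step (for example publishSigned from the sbt-pgp plugin). The profile name is a placeholder, the plugin version is left open, and task names can differ between plugin versions.

```scala
// project/plugins.sbt (replace <version> with a released sbt-sonatype version)
addSbtPlugin("org.xerial.sbt" % "sbt-sonatype" % "<version>")

// build.sbt: Sonatype staging profile, usually your organization (placeholder value)
sonatypeProfileName := "com.example"
```

After the upload, sbt sonatypeRelease is meant to drive the remaining close and release steps through the Nexus REST API, replacing the manual clicks in the Web UI.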
• TD is a heavy user of Scala
• Analytics pipelines
• Production services
• Many libraries helping development
• Airframe, wvlet-log
• sbt plugins
• For details about Presto analysis
• Join Presto Meetup on Thursday!
Presto Meetup Tokyo: June 15, 2017 (Thu)

