A view from the ivory tower
Participating in Apache as a member of academia
Michael Mior, Rochester Institute of Technology
Overview
1. Introduction
2. My path to Apache
3. The Apache Way and the Academic Way
4. Apache success stories
5. How to get involved
My Background
2009 Undergraduate degree in Computing Science
2011 Master's degree in Computer Science
Joined the startup Bunch
2013 Started a PhD in Computer Science
2016 Joined Apache Calcite
2018 Graduated the PhD program
Joined the Rochester Institute of Technology
My Research
● NoSQL database schema design and integration
● Connecting data sources of diverse formats
● Distributed data processing
● Open data analysis
● Data-driven schema discovery
My path to Apache
How I found Apache
● Searched for existing work on heterogeneous data processing
● Found Apache Calcite, a data processing framework
● Contributed an adapter for Apache Cassandra
● Also contributed to Apache Spark
Apache Calcite
● A dynamic data management framework
● Basis for query optimization in Hive and Drill
● Used by Alibaba, Uber, Tencent, …
● Connects to MongoDB, Cassandra, Spark, …
Apache Calcite History
2014 Optiq enters incubation
Optiq is renamed Calcite
2015 Calcite 1.0.0 released
Calcite graduates from incubator
2016 I join as a committer
2017 I join the PMC and start a term as PMC chair
2018 Apache Calcite members publish in ACM SIGMOD
My contributions
● Searched for existing work on heterogeneous data processing
● Found Apache Calcite, a data processing framework
● Contributed an adapter for Apache Cassandra
● Materialized view query rewriting with joins
● Minor improvements to Apache Spark
The Apache Way and the Academic Way
The Apache Way
● Earned Authority
● Community of Peers
● Open Communications
● Consensus Decision Making
● Responsible Oversight
Apache
Academia
Earned Authority
● Everyone can participate
● Influence is based on publicly earned merit
● The most productive labs have diversity
● Anyone can have a good idea
Community of Peers
● Individual participation, not organizations
● Roles and titles are equal
● Collaboration is common and expected
● Students are also junior colleagues
Open Communications
● All communication is publicly accessible
● Private decisions are disallowed
● Unfortunately, not all work is public
● But publication is critical
Consensus Decision Making
● Projects are overseen by volunteers
● All votes are equal regardless of position
● Shared governance is key
Responsible Oversight
● Projects are self-governing
● Commits are peer-reviewed
● Labs are typically fairly independent
● Published work is peer-reviewed
Independence
● The ASF is vendor-neutral
● No organization has special privilege
● True in aspiration, if not always in practice
Community Over Code
● Healthy community is high priority
● Good communities write good code
● Burnout is real, but the principles hold true
Success Stories
● Apache Flink (TU Berlin, HU Berlin, HPI)
● Apache Kylin (eBay R&D)
● Apache Mesos (UC Berkeley)
● Apache Pig (Yahoo Research)
● Apache Spark (UC Berkeley)
● Apache Stanbol (Salzburg Research)
Apache Spark Successes
● Apache Spark: A Unified Engine for Big Data Processing
1,336 citations
● MLlib: Machine Learning in Apache Spark
1,456 citations
● Spark SQL: Relational Data Processing in Spark
1,121 citations
● Hundreds of other papers
Apache Calcite Successes
● Apache Calcite: A Foundational Framework for Optimized Query Processing
Over Heterogeneous Data Sources, SIGMOD 2018.
47 citations
● One SQL to Rule Them All - an Efficient and Syntactically Idiomatic Approach
to Management of Streams and Tables, SIGMOD 2019.
10 citations
● Automated Reasoning of Database Queries
Shumo Chu, PhD thesis
Apache Calcite Successes
Out of the 39 test cases that use SQL features supported by Cosette,
Cosette is now able to formally prove that Calcite's rewrite in 33 of them
are correct. This includes a few fairly complicated ones, like
"testPushFilterPassAggThree". The good news is that we haven't found any
bugs so far :) … We have also used the test cases to improve Cosette.
Apache Calcite Successes
RLO: a reinforcement learning-based method for join optimization
Xinyi ZHANG et al.
We implement RLO in Apache Calcite and Postgres. Extensive experiments
demonstrate that Apache Calcite RLO is 10×–56× faster in finding the execution
plan and 80% faster in executing the plan than the state-of-the-art heuristics.
How to get involved
Why choose Apache? (for academics)
● Find people who may be interested in your problems
● Find problems that may be interesting to your people
● Discover interesting technology (maybe unpublished)
● Save time by building on existing platforms
How to get started (for academics)
● Find a project suited to your interests
● Look for ways to apply your expertise
● Write some code!
Why choose academia? (for Apache folks)
● Get more exposure (and potentially more committers)
● Find a different perspective on problems to be solved
● Meet people who may want to solve your problems
● Potentially discover new technologies
How to get started (with a problem)
● Contact researchers working on relevant problems
● Consider GSoC and other mentorship programs
● Many faculty would love to find good problems
How to get started (with a solution)
● Academic conferences are not limited to academics!
● Many conferences have an “industrial” track
● Perhaps find an academic partner to publish with
Challenges
● Writing code ready to commit is harder than writing working code
● Grad students find it difficult to make time for good code
● Many advisors don’t have time for code review
● Industry folks find it difficult to find time to write papers
What do I do now?
● Still a (somewhat) active member of the Calcite PMC
● I’m fortunate to still be able to write code regularly
● My students regularly work on code for Apache projects
Questions?
