A view from the ivory tower: Participating in Apache as a member of academia
- 1. A view from the ivory tower
Participating in Apache as a member of academia
Michael Mior, Rochester Institute of Technology
- 3. My Background
2009 Undergraduate degree in Computing Science
2011 Master's degree in Computer Science
Joined the startup Bunch
2013 Started a PhD in Computer Science
2016 Joined Apache Calcite
2018 Graduated the PhD program
Joined the Rochester Institute of Technology
- 4. My Research
● NoSQL database schema design and integration
● Connecting data sources of diverse formats
● Distributed data processing
● Open data analysis
● Data-driven schema discovery
- 6. How I found Apache
● Searched for existing work on heterogeneous data processing
● Found Apache Calcite, a data processing framework
● Contributed an adapter for Apache Cassandra
● Also contributed to Apache Spark
- 7. Apache Calcite
● A dynamic data management framework
● Basis for query optimization in Hive and Drill
● Used by Alibaba, Uber, Tencent, …
● Connects to MongoDB, Cassandra, Spark, …
- 8. Apache Calcite History
2014 Optiq enters incubation
Optiq is renamed Calcite
2015 Calcite 1.0.0 released
Calcite graduates from incubator
2016 I join as a committer
2017 I join the PMC and start a term as PMC chair
2018 Apache Calcite members publish in ACM SIGMOD
- 9. My contributions
● Searched for existing work on heterogeneous data processing
● Found Apache Calcite, a data processing framework
● Contributed an adapter for Apache Cassandra
● Materialized view query rewriting with joins
● Minor improvements to Apache Spark
- 13. The Apache Way
● Earned Authority
● Community of Peers
● Open Communications
● Consensus Decision Making
● Responsible Oversight
- 15. Earned Authority
● Everyone can participate
● Influence is based on publicly earned merit
● The most productive labs have diversity
● Anyone can have a good idea
- 16. Community of Peers
● Individual participation, not organizations
● Roles and titles are equal
● Collaboration is common and expected
● Students are also junior colleagues
- 17. Open Communications
● All communication is publicly accessible
● Private decisions are disallowed
● Unfortunately, not all work is public
● But publication is critical
- 18. Consensus Decision Making
● Projects are overseen by volunteers
● All votes are equal regardless of position
● Shared governance is key
- 20. Independence
● The ASF is vendor-neutral
● No organization has special privilege
● True in aspiration, if not always in practice
- 21. Community Over Code
● Healthy community is high priority
● Good communities write good code
● Burnout is real; the principles hold true
- 23. Success Stories
● Apache Flink (TU Berlin, HU Berlin, HPI)
● Apache Kylin (eBay R&D)
● Apache Mesos (UC Berkeley)
● Apache Pig (Yahoo Research)
● Apache Spark (UC Berkeley)
● Apache Stanbol (Salzburg Research)
- 24. Apache Spark Successes
● Apache Spark: A Unified Engine for Big Data Processing
1,336 citations
● MLlib: Machine Learning in Apache Spark
1,456 citations
● Spark SQL: Relational Data Processing in Spark
1,121 citations
● Hundreds of other papers
- 25. Apache Calcite Successes
● Apache Calcite: A Foundational Framework for Optimized Query Processing
Over Heterogeneous Data Sources, SIGMOD 2018.
47 citations
● One SQL to Rule Them All - an Efficient and Syntactically Idiomatic Approach
to Management of Streams and Tables, SIGMOD 2019.
10 citations
● Automated Reasoning of Database Queries
Shumo Chu, PhD thesis
- 26. Apache Calcite Successes
Out of the 39 test cases that use SQL features supported by Cosette,
Cosette is now able to formally prove that Calcite's rewrite in 33 of them
are correct. This includes a few fairly complicated ones, like
"testPushFilterPassAggThree". The good news is that we haven't found any
bugs so far :) … We have also used the test cases to improve Cosette.
- 27. Apache Calcite Successes
RLO: a reinforcement learning-based method for join optimization
Xinyi ZHANG et al.
We implement RLO in Apache Calcite and Postgres. Extensive experiments
demonstrate that Apache Calcite RLO is 10×–56× faster in finding the execution
plan and 80% faster in executing the plan than the state-of-the-art heuristics.
- 29. Why choose Apache? (for academics)
● Find people who may be interested in your problems
● Find problems that may be interesting to your people
● Discover interesting technology (maybe unpublished)
● Save time by building on existing platforms
- 30. How to get started (for academics)
● Find a project suited to your interests
● Look for ways to apply your expertise
● Write some code!
- 31. Why choose academia? (for Apache folks)
● Get more exposure (and potentially more committers)
● Find a different perspective on problems to be solved
● Meet people who may want to solve your problems
● Potentially discover new technologies
- 32. How to get started (with a problem)
● Contact researchers working on relevant problems
● Consider GSoC and other mentorship programs
● Many faculty would love to find good problems
- 33. How to get started (with a solution)
● Academic conferences are not limited to academics!
● Many conferences have an “industrial” track
● Perhaps find an academic partner to publish with
- 34. Challenges
● Writing code ready to commit is harder than writing working code
● Grad students find it difficult to make time for good code
● Many advisors don’t have time for code review
● Industry folks find it difficult to find time to write papers
- 35. What do I do now?
● Still a (somewhat) active member of the Calcite PMC
● I’m fortunate to still be able to write code regularly
● My students regularly work on code for Apache projects