What We Can Learn About Code Review From the Apache Superset Project

What We Can Learn About Code Review From the Apache Superset Project


About the author

Will Barrett is a Software Engineer, Technical Lead, and Engineering Manager from the San Francisco Bay Area with over 14 years of experience. Will is a member of the Project Management Committee for Apache Superset. He’s held staff software engineer and senior engineer roles at Change.org, Entelo, Sqwiggle and Preset.

Will is the author of On Learning to Program, a blog for new Software Engineers entering the industry. Will is also a certified reviewer on PullRequest where he’s caught hundreds of bugs and other critical issues for over 30 teams.


images/apache-superset-github-pull-request-code-review.jpg

This is the first post in an ongoing series exploring code review practices in mature open source projects. Be sure to also check out What We Can Learn About Code Review From the Apache Spark Project.

Disclaimer: the author is affiliated with the Apache Superset project as a Contributor and Project Management Committee member.

What is Apache Superset?

Apache Superset is a modern data exploration and visualization platform. Superset is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill sets to explore and visualize their data, from simple line charts to highly detailed geospatial charts.

- From the official Apache Superset website

Superset was started in 2015, created by Maxime Beauchamin while working at Airbnb. It entered incubation with the Apache Software Foundation in 2017, graduating to full Apache project status in November of 2020. Its 1.0 version was released in January of 2021. At time of publication the project has approximately 40,000 stars on GitHub and has been forked over 7,000 times. Its main branch has over 8,000 commits.

Superset is a web application with a Python back-end and a React-based front-end. It’s back-end leverages Flask and the Flask-AppBuilder framework. The front-end leverages Ant Design and both NVD3 and Apache ECharts for visualization.

Apache Superset’s Merge Rules

Like most Apache projects, only Contributors to the project are allowed to merge code into the main branch, though pull requests can be submitted from a fork by anyone with a GitHub account. Contributors are voted in as members of the project, generally after 6 months of sustained contribution consisting of multiple pull requests spaced out over time.

Merges require CI to pass as well as one approving review from any of the Contributors on the project. In addition, CODEOWNERS have been established for the database migrations directory, and for an Elastic Container Service configuration. This Elastic Container Service-related rule gives us the first hint that something interesting is going on…

Leveraging Ephemeral Environments to help with Review

The Preset organization was founded to build a business around Apache Superset. They have sponsored a system to allow ephemeral environments to be launched with the code in any PR. This can be done by any committer leveraging the /testenv up command in a comment on a PR.

Here is an example: https://github.com/apache/superset/pull/16411#issuecomment-904209549

images/launch-ephimeral-test-environments-in-a-pull-request.jpg

ECS then launches a Superset instance with the code from the PR.

Superset has a highly complex UI with a wide variety of visualizations and front-end systems. Providing the ability to launch any changeset in an ephemeral environment makes it easier to review intricate UI changes by clicking around and testing out the results of changes. Automated testing for complex UIs is challenging and expensive, so making it easy to play with changes in an ephemeral environment is a good way to help reduce the regression rate on the main branch.

Superset Improvement Proposals

Looking over the repository issue list, some issues jump out - their titles are prefixed with [SIP-XX] where XX is a number. An example would be [SIP-41] Proposal for Alignment of Backend Error Handling. These issues track the history of decision making in the project and act as a record of standards and directions taken by the project as a whole. The status of these Superset Improvement Proposals is tracked in a GitHub Project.

images/apache-superset-github-project-tracking.jpg

The SIPs are voted on for adoption following the Apache voting rules for project governance. The SIPs are frequently referenced in code review to help align code changes with decisions previously made in the project. As a practice, this is helpful for a distributed team - they allow for a new Engineer to gain context on previous decisions without reaching out to project maintainers and allow standards to be enforced on PRs without reopening the same issues for discussion repeatedly.

Enforced Pull Request Naming Convention: fix, feature, chore, etc.

images/uniform-naming-convension-for-pull-requests-apache-superset-github.jpg

Scanning Superset’s open PR list there is a remarkable uniformity in PR naming. Every PR is prefixed with one of fix:, feature:, chore:, refactor: or one of a few other prefixes. Looking deeper, we can see that the project actually fails the build in GitHub Actions when the PR title does not match this regular expression:

^(build|chore|ci|docs|feat|fix|perf|refactor|style|test|other)(\(.+\))?(\!)?:\s.+

This is a good example of automating specific, repeatable types of feedback and allowing computers to enforce simple rules. In this case, creating a defined format for PR titles makes the changelog very readable, and makes it easier to tell what the goal of a PR might be. One by-product of this enforcement is it reduces the urge of Engineers to include multiple unrelated changes in a single PR - a PR with a new feature, some fixes, and a refactor becomes hard to categorize so Engineers are incentivized to make more tightly-scoped and focused PRs.

The PR Template Emphasizes Documenting Changes

Here is the current PR template for the project:

### SUMMARY
<!--- Describe the change below, including rationale and design decisions -->

### BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
<!--- Skip this if not applicable -->

### TESTING INSTRUCTIONS
<!--- Required! What steps can be taken to manually verify the changes? -->

### ADDITIONAL INFORMATION
<!--- Check any relevant boxes with "x" -->
<!--- HINT: Include "Fixes #nnn" if you are fixing an existing issue -->
- [ ] Has associated issue:
- [ ] Required feature flags:
- [ ] Changes UI
- [ ] Includes DB Migration (follow approval process in [SIP-59](https://github.com/apache/superset/issues/13351))
  - [ ] Migration is atomic, supports rollback & is backwards-compatible
  - [ ] Confirm DB migration upgrade and downgrade tested
  - [ ] Runtime estimates and downtime expectations provided
- [ ] Introduces new feature or API
- [ ] Removes existing feature or API

The PR template has the clear goal of generating good documentation for the change, including screenshots or an animation of any UI change, as well as testing instructions, links to related issues, and special instructions for database migrations, which includes a link to a specific approval process required for database changes, presented as a Superset Improvement Proposal.

As a large project with a high velocity (currently there are over 100 open pull requests, and 44 have been merged in the last week) this documentation makes it easier for maintainers to keep track of the movement in the repository.

Requiring testing instructions helps support maintainers in building releases. As part of the Apache process, releases must be voted in. Maintainers are expected to test the releases to ensure they remain high-quality over time. When combined with ephemeral environments, Superset has made it easy for contributors to manually verify changes without downloading and launching branches or going over the code line by line to understand its goals and behavior.

Very Few Rejected PRs

Looking through the PR history on the project, I find that the vast majority of PRs have been merged. Some hang out for a long time and are eventually closed, as on any project. Some are closed as duplicating work, or because they’re going in an unprofitable direction - usually by the original author. There are very few PRs that are rejected by the project. Some PRs continue for quite a while as contributors work with the author to bring the changeset in line with expectations. This is evidence of a positive, aligned community.

No Ancient PRs? The Stale Bot.

Looking at some of the older PRs still on the repo, many of them have been tagged with the inactive label. GitHub’s Stale Bot is used by the project to automatically tag and close old issues. This helps keep the issue list manageable and saves effort on behalf of the community in managing old issues. If something is not touched for a long time, it will be labeled as stale and after a couple more weeks it will be closed if the author does nothing further related to the issue.


Find this useful? Be sure to check out What We Can Learn About Code Review From the Apache Spark Project.


About PullRequest

HackerOne PullRequest is a platform for code review, built for teams of all sizes. We have a network of expert engineers enhanced by AI, to help you ship secure code, faster.

Learn more about PullRequest

Will Barrett headshot
by Will Barrett

August 30, 2021