People who know me might have heard me mention GitHub’s workflow in the past and how I really dislike it. I think it’s time to for me to make a detailed post explaining the problems I have with it. Here we go.
To make clear the precise workflow I’m talking about, let’s walk through a hypothetical scenario.
Alice comes across a bug in a piece of software she uses. Being a good open-source citizen, she locates the GitHub repository and sees if she can find the bug.
First, she signs into GitHub (signing up if she doesn’t have an account already). Next, Alice clicks the “fork” button on the original repository. It doesn’t “fork” anything, it simply creates a full copy of the repository that’s under Alice’s ownership.
Next, she clones her copy of the repository locally. Being a talented programmer, she quickly finds the bug and comes up with a tentative solution. She creates a new branch, commits this solution, then pushes it to her “fork” on GitHub. Finally, she clicks back to the original repository and hits the “Create Pull Request” button and selects her branch to propose as a change. Yay! A pull request has now been created and will serve as the home of the code review for the lifetime of this fix.
Now, maintainer Bob comes along. He has some comments on the code, which he leaves on the pull request (PR). Alice sees the comments, think that they make sense, and applies the changes to her local branch, commits again, and pushes once more. Bob has a few more comments. This repeats a couple times before Bob is finally happy with the state of the code, and merges the pull request.
Alice, liking to leave things clean, goes on the GitHub website and deletes her “fork” of the repository.
Here I’ll try to highlight what I find wrong with the above.
Every minor revision is a commit, which generates a new line item on the web UI. Often, you can find PR’s where more often than not the commit messages are just some variant of “fixes” or “do more”. Here’s a somewhat-contrived example. In a real software project, half of those commits might not even be in a working state, which will interfere with bisecting if that ever needs to be done.
To be fair, this might be a wider Git culture issue, not limited to GitHub, but GitHub definitely encourages and popularized it.
Many software projects used to force people to squash their commits manually, but a few years ago GitHub partially remediated this by providing a “squash and merge” button which squashes all the commits into one then applies it against the target branch.
Unfortunately, this still has some problems. It’s now up to the maintainer to dredge through all the commit messages and then try to come up with a representative squash message, otherwise the default is to concatenate all the commit messages together. The contributor is the one that has the most context on the work and should be the one composing the message.
The root of this problem is that GitHub has no way of expressing updated versions of the same change. Sure, contributors can periodically squash then force push to the branch representing their PR in order to keep things clean, but then you lose visibility and knowledge of prior versions of the change, since force-pushing is a destructive operation.
This problem is closely related to the previous. GitHub provides zero support for PR’s that depend on other PR’s. I don’t think it’s a controversial opinion that every PR, the unit of code review, should represent one independent and freestanding idea.
However, GitHub makes it difficult for one to, e.g. submit a PR for feature A, then submit a second PR for feature B which relies on feature A. Because PR’s are publicly associated with a single, fixed branch, we can’t do this cleanly in GitHub’s current model.
There’s two ways to hack around it. The first is to submit a PR for feature A, then branch again and submit a second PR for feature B. However, feature B’s PR code review interface will then include all the changes from A as well, because the UI compares the upstream trunk to feature B’s branch.
The other hack is to stuff everything into one massive PR. Leading to monster PR’s that are a mess of commits and comments and are impossible to review. An example.
GitHub has actually begun sanctioning this hack, since there is now UI support for viewing individual commits of a single PR. This seems…non-ideal. Because the PR still needs to be merged or rejected atomically as one unit, so commenting on changes in the middle of it is not necessarily productive for the author.
The first item is self-explanatory. Yet another set of credentials to keep track of. If you were paying attention during the scenario set out above, you might have noticed all the steps required, most of which required clicking around on webpages:
- Sign in
- “Fork” the repository
- Clone your “fork”
- Branch, make changes and commit
- Push to “fork”
- Create PR
- (After merge) delete “fork”
That’s really heavy friction. Part of me suspects that this is in GitHub’s interest, to have a number of forks and active repos that they can point to for engagement. It’s social network behaviour in an environment that should not be one. Same goes for “stars”.
The shame is that all GitHub clones copy this workflow verbatim – Gitea, GitLab, all of them do this.
Patches. Changes should be proposed as sets of simple patch files that can be retrieved and applied to the working tree.
To solve commit spam, each patch is one commit, and should be a complete idea that could be committed immediately as one unit, if the maintainers approve. Contributors amend their commits regularly and resubmit v2, v3, … patchsets using Git’s native support for doing so. Each resubmission of the patchset thus forms a public record of the progress made on the work. All of the pointless minutiae such as typo fixing, style/linter nits, are unnecessary for the public eye and are squashed away.
Patches implicitly support stacking, since they are not bound tightly to a specific publicly visible branch. Contributors should make clear where the patch is intended to be applied, and what other patchsets their patchset depends on (if any).
Patches do not require signing into yet-another-Gitea-instance. They are simply text files, and can be emailed, uploaded to a personal website, or even sent directly over a chat channel, if small enough. Compare the following steps to the longer procedure above:
- Clone the upstream repository
- Create a branch, make changes, and commit
- Generate patches with
git format-patchand convey them to the maintainer
- Delete the branch after the patches are integrated
Not only is this procedure shorter, it’s also simpler and does not rely on any proprietary services.
Now, who works this way, and who provides these workflows? Many massive, important open source projects such as the Linux kernel, OpenBSD operating system, PostgreSQL, and git itself are developed in this manner – developers exchange patchsets over email. Email is a very blunt tool, however, and isn’t very smart. There exist services such as SourceHut and Phabricator (unfortunately no longer maintained) that provide prettier interfaces over a patch-based or patch-like workflow.
If you ever join a very large company (like Google or Facebook), you’ll find that their change review systems will be quite similar to a patch workflow as well.
Anyways, this might sound like an old-fashioned graybeard’s workflow to many newer programmers who grew up doing things the “GitHub” way, but I urge you to consider my points and think again about the problems I’ve mentioned. I grew up on the “GitHub way” myself and didn’t realize what a productivity handicap it was until I experienced the “patch workflow” for myself.
A website exists that provides a gentle introduction to configuring your setup for emailing patches around. Give it a shot!
Please let me know what you think! (Mail the public comments mailing list shown in the footer)
Until next time!