
Nick Lyon – LTER Network Office Data Scientist
Stevan Earl – Central Arizona-Phoenix LTER Information Manager

Conventional Commits: A better way to track changes with Git

Computational code is increasingly valued as a fundamental component of scientific work. More complex projects (e.g., synthesis science) tend to be especially dependent upon novel code to facilitate reproducible workflows. Regardless of scientific focus, projects that rely heavily on code typically need a system for managing that code and tracking changes to it as the project evolves. Version control systems (e.g., Git, SVN) are purpose-built for tracking changes to code files, and Git in particular is popular within the life sciences. Numerous online portals and companion software platforms (e.g., GitHub, GitLab, GitKraken) are built around Git workflows and offer project-management and communication tools that increase the benefit of using Git for version control. Here, we use “GitHub” to refer generically to this category of tool for the sake of brevity. While Git and related resources are powerful tools, their benefit to individuals and teams is greatest when best practices are adopted. One of the more important of these is writing good commit messages.

A commit is a save point in the project workflow that is initiated by the user. For instance, a user can edit and save a file countless times locally, but there will only be a new commit when the relevant steps are taken. This is a useful feature because it empowers the user to identify versions of the code that represent a substantive milestone, rather than automatically tracking every changed line as a completely separate edit. This reliance on the user means that commits can be made at the points in the project workflow that the user judges to be meaningful.

To help clarify the purpose of the changes captured by a commit, every commit is associated with a “commit message”. These typically describe the changes made to the project since the previous commit and/or details about the state of the project at that point. A commit message consists of a short (usually 50 or fewer characters), free-text, one-line summary or title. Additional, optional components of a commit message include a body for providing more detail and a footer for metadata, such as references to GitHub issues, other commits, or contributors. These optional components are often not used, and some popular programming applications (e.g., RStudio, Visual Studio Code) do not expose them as separate fields. As such, the summary is the salient component of a commit message and is what is typically displayed in a Git log. When editing a single file through GitHub’s web interface, the default commit message is “Update <file name>”, which is not particularly helpful when looking back at a set of code that was iteratively revised over the course of months or years. Similar to a good column name in a data table, a good commit message is typically one that balances conciseness against information density. Someone reading the message should be able to quickly identify the major change(s) captured by that commit without needing to wade through the minutiae of changes to the content.
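The anatomy described above can be sketched in a few lines of Python. This is only an illustration, not part of Git itself: the function names, the example message, and the “Refs: #12” footer are all hypothetical, and the 50-character limit is simply the rule of thumb mentioned above.

```python
from typing import Tuple


def split_commit_message(message: str) -> Tuple[str, str]:
    """Split a commit message into its one-line summary and any remaining text."""
    # By convention the first line is the summary and a blank line separates
    # it from the optional body and footer.
    summary, _, rest = message.strip().partition("\n")
    return summary.strip(), rest.strip()


def summary_is_concise(summary: str, limit: int = 50) -> bool:
    """Return True when the summary fits within the chosen character limit."""
    return len(summary) <= limit


# Hypothetical commit message with a summary, a body, and a footer.
message = """Fit mixed-effects models in stats script

Site is treated as a random effect so that repeated
measurements within a site are not assumed independent.

Refs: #12"""

summary, details = split_commit_message(message)
print(summary)
print(summary_is_concise(summary))
```

Only the summary is checked for length here because, as noted above, it is the part of the message that most tools display.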

The best practices for many facets of Git are well known, and introductory educational materials typically emphasize them to novice learners. Commit messages, however, often lack this standardization beyond the general guidance above about balancing brevity against information content, and even advanced users of version control often have no standardized approach to writing them. Conventional Commits (conventionalcommits.org) is therefore a relative rarity: it specifies a clear format for commit messages that anyone using a version control system can adopt.

Fundamentally, Conventional Commits relies on specifying a one-word “type” that clearly encapsulates the category of changes included in the commit (official types summarized in Table 1). The type is followed by a colon and then a more qualitative description at a finer level of detail that generally resembles a typical commit message. If the set of changes is a “breaking change” (i.e., the edits will require current users of the code to edit their workflows to avoid errors), Conventional Commits specifies that an exclamation mark be inserted immediately before the colon (e.g., “feat!: description”). Finally, if the commit pertains to a particular area of focus within the project, you can list this noun “scope” in parentheses directly after the type and before any exclamation mark (e.g., “fix(website): description”).
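To make the structure concrete, here is a minimal Python sketch that pulls the type, optional scope, breaking-change marker, and description out of a summary line. The regular expression and the names `HEADER`, `ParsedSummary`, and `parse_summary` are our own illustration rather than an official parser, and the pattern deliberately tolerates an optional space before the scope.

```python
import re
from typing import NamedTuple, Optional

# Pattern for "type(scope)!: description"; the scope and "!" are optional.
# Assumption: a single optional space before the scope is tolerated.
HEADER = re.compile(
    r"^(?P<type>[A-Za-z]+)"
    r"(?:\s?\((?P<scope>[^()]+)\))?"
    r"(?P<breaking>!)?"
    r":\s+(?P<description>.+)$"
)


class ParsedSummary(NamedTuple):
    commit_type: str
    scope: Optional[str]
    breaking: bool
    description: str


def parse_summary(summary: str) -> Optional[ParsedSummary]:
    """Return the parts of a summary line, or None if it does not fit the format."""
    match = HEADER.match(summary.strip())
    if match is None:
        return None
    return ParsedSummary(
        commit_type=match["type"].lower(),
        scope=match["scope"],
        breaking=match["breaking"] is not None,
        description=match["description"],
    )


print(parse_summary("fix(website): repair broken links"))
print(parse_summary("feat!: drop legacy input format"))
print(parse_summary("Update stats.R"))
```

A summary that does not follow the format (such as the last one) yields `None`, which makes it easy to spot messages that were written without a type.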

Table 1. Official commit “types” in the Conventional Commits framework

Type | Description
build | Changes that affect the build system or external dependencies (e.g., Quarto extensions, etc.)
ci | Changes to continuous integration (“CI”) files and scripts (e.g., GitHub Actions, etc.)
docs | Documentation only changes
feat | A new feature
fix | A bug fix
perf | A code change that affects performance
refactor | A code change that neither fixes a bug nor adds a feature. Used when the output of the code is unchanged but the way in which that output is reached has changed
style | Changes that do not affect the meaning of the code (e.g., white space, formatting, etc.)
test | Adding missing tests or correcting / adding to existing tests
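The types in Table 1 can also be kept as a small lookup, which is handy for checking whether a message starts with a recognized type. The sketch below is an assumption-laden illustration: `KNOWN_TYPES` and `describe_type` are hypothetical names, and the descriptions are shortened from the table.

```python
from typing import Optional

# Types from Table 1, with abbreviated descriptions.
KNOWN_TYPES = {
    "build": "Build system or external dependencies",
    "ci": "Continuous integration files and scripts",
    "docs": "Documentation only changes",
    "feat": "A new feature",
    "fix": "A bug fix",
    "perf": "A code change that affects performance",
    "refactor": "Neither fixes a bug nor adds a feature",
    "style": "Formatting that does not affect meaning",
    "test": "Adding or correcting tests",
}


def describe_type(summary: str) -> Optional[str]:
    """Return the description of the summary's type, or None if unrecognized."""
    # Everything before the first colon holds the type, scope, and "!" marker.
    prefix = summary.split(":", 1)[0]
    # Drop any "(scope)" and trailing "!" or space to isolate the type.
    commit_type = prefix.split("(", 1)[0].rstrip("! ").lower()
    return KNOWN_TYPES.get(commit_type)


print(describe_type("perf: cache model fits"))
print(describe_type("Update stats.R"))
```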

Conventional Commits suggests that any time a commit could reasonably have more than one type, it should be split into as many separate commits as needed so that each commit has only one clearly appropriate type. This makes identifying changes for a particular purpose easier because each commit contains only those changes that were motivated by a specific goal (rather than more complex/holistic sets of edits whose goals are less clear). An ancillary benefit of strictly following this ‘single type’ rule is that it pushes users to engage more with project management systems. For example, if a user knows they are working on a set of edits meant to improve the performance of their code (i.e., type will be “perf”) but during that process they have an idea for an entirely new set of analyses (i.e., type will be “feat”), they would be gently nudged towards recording that idea (as a GitHub issue, post-it note, email, etc.) so that they can act on it after they finish the batch of edits they were originally working on. This sort of explicit documentation of ideas and rationale is more reproducible and can make writing project metadata and/or a Methods section easier, even though it can feel like an extra step in the moment that those decisions are made.

It might be helpful here to consider some examples of different commit messages for the same (hypothetical) set of edits to project files. See below:

Default > “Update stats.R”

This commit message is quite common but gives essentially no information about the nature of the changes. Further, if users edit the same file repeatedly using this default message for each commit, it becomes progressively less informative as there is not any information or context that distinguishes the changes associated with one commit from another.

Custom > “Updating the stats script to try some new analyses”

This commit message is certainly an improvement over the previous one but it is still vague. Also, if every commit message included this level of detail it would be challenging to quickly scan through dozens or hundreds of commit messages on a particular project to get a sense for the major pivot points of a project or to search for a particular commit.

Conventional Commits > “feat(stats): mixed-effects models with site as random effect”

This message is in the Conventional Commits format and uses an optional scope to add context while still keeping the word count to a minimum. Granted, just the information after the colon would be a useful message in and of itself, but the inclusion of a type (and scope) makes searching across many commits much simpler. For instance, one could filter for all commits with the “stats” scope to identify every commit that pertained to the statistics script (and the type of each would provide high-level information on the kind of changes in that commit).
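The kind of filtering described here is straightforward once messages share a format. The following Python sketch assumes a hypothetical list of summary lines (`history`) and a helper we have called `filter_commits`; neither comes from an existing tool, and the pattern tolerates an optional space before the scope.

```python
import re
from typing import Iterable, List, Optional

# Matches the "type(scope)!:" prefix of a summary; scope and "!" are optional.
PREFIX = re.compile(r"^(?P<type>\w+)\s?(?:\((?P<scope>[^()]+)\))?!?:")


def filter_commits(
    summaries: Iterable[str],
    commit_type: Optional[str] = None,
    scope: Optional[str] = None,
) -> List[str]:
    """Keep summaries whose type and/or scope match the requested values."""
    selected = []
    for summary in summaries:
        match = PREFIX.match(summary)
        if match is None:
            continue  # not in the Conventional Commits format
        if commit_type is not None and match["type"] != commit_type:
            continue
        if scope is not None and match["scope"] != scope:
            continue
        selected.append(summary)
    return selected


# Hypothetical commit history for illustration only.
history = [
    "feat(stats): mixed-effects models with site as random effect",
    "fix(stats): correct degrees of freedom in summary table",
    "docs: describe data sources in README",
    "Update stats.R",
    "perf(wrangling): vectorize site-level summaries",
]

for line in filter_commits(history, scope="stats"):
    print(line)
```

In a real repository, the list of summaries could be obtained with `git log --format=%s`, which prints one summary line per commit.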

To summarize:

  • Commit messages often lack critical information that, if included, would increase the value of using version control
  • Conventional Commits clearly defines a simple and extensible structure for informative commit messages
  • The inclusion of discrete “type” categories makes broadly grouping commit messages intuitive and easy to revisit

Conventional Commits in the Wild

To supplement the above article text and example commit messages, users might consider checking out some example repositories that use Conventional Commits. Adoption is rarely absolute but even partially integrating the Conventional Commits’ framework greatly clarifies the history of changes made to a given repository. The LTER Scientific Computing (“SciComp”) team maintains a website created with Quarto and deployed via GitHub. That repository’s commits can be seen here: github.com/lter/scicomp/commits. The LTER Network’s new Synthesis Skills for Early Career Researchers (SSECR) course also makes partial use of this style of commit message; see here: github.com/lter/ssecr/commits.

Taking it to the Next Level

Commit messages can be expanded further to include direct references to project issues. GitHub has a feature that allows project participants to create issues that facilitate conversations specific to the repository. For example, a participant might create an issue to highlight a bug in the code, suggest an improvement to the documentation, or request a new feature. Every issue is assigned a number when it is created, and typing “#<number>” in a different issue will automatically convert the number into a link to the issue with that number. We can apply this same functionality to commits by including relevant issue numbers in our commit messages, which links each commit to the issues that motivated its changes. For example, if issue #3 outlines the team’s plan to add more comments into the existing scripts (to make them more transparent), the commit message might be “style: adding comments to all project scripts (see #3)”. A good example of this in practice comes from Michael Kennedy, who has a repository aimed at teaching intermediate Python coders how they might standardize the syntax and structure of their code to align with best practice. He makes extensive use of this shorthand for referencing GitHub issues, though he does not adhere to the Conventional Commits framework elsewhere. See Kennedy’s commit history here: github.com/mikeckennedy/write-pythonic-code-demos/commits.
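Because issue references follow a predictable “#<number>” pattern, they are also easy to extract from commit messages programmatically. The snippet below is a small sketch under our own assumptions: `ISSUE_REF` and `referenced_issues` are hypothetical names, and the pattern is a simplification rather than GitHub’s actual linking rules.

```python
import re
from typing import List

# A "#" followed by digits, not preceded by a word character or another "#".
ISSUE_REF = re.compile(r"(?<![\w#])#(\d+)\b")


def referenced_issues(message: str) -> List[int]:
    """Return the issue numbers mentioned in a commit message, in order."""
    return [int(number) for number in ISSUE_REF.findall(message)]


print(referenced_issues("style: adding comments to all project scripts (see #3)"))
```

Pairing this with the commit type would allow, for instance, listing every “fix” commit alongside the issue it addressed.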

Vocabulary

  • Version control: A combination of software and practices employed to track and manage digital files, often associated with, but not limited to, computer code.
  • Commit (adapted from https://git-scm.com/docs/gitglossary#def_branch)
    • As a noun: A single point in the Git history; the entire history of a project is represented as a set of interrelated commits. The word “commit” is often used by Git in the same places other revision control systems use the words “revision” or “version”.
    • As a verb: The action of storing a new snapshot of the project’s state in the Git history.
  • Commit message: A message associated with a commit that typically characterizes the changes made since the previous commit and/or the state of the project at that point.
  • Git log: A chronological history of commits in a repository that may also include information about branching, merging, or other actions that affect the repository.
  • Continuous Integration (CI): Automated code execution that is typically associated with a particular action in a web-based platform built on Git infrastructure, such as GitHub or GitLab.
  • Issue: A feature of web-based platforms built on Git infrastructure, such as GitHub and GitLab, that allows users to track tasks, bug fixes, or feature requests associated with a Git repository.

This article is part of DataBits, stories about data management, techniques, and tools. DataBits is curated by the LTER Information Managers. For more information and to contribute a DataBits article, reach out to the Network Office or Marina Frantz, current editor of DataBits.