Principles to live by

This article outlines engineering principles that I’ve found worth articulating to my engineering teams.  Some are strictly enforced while others are mere guidelines.  It’s really up to the stage of the team to decide how firmly to follow these, and each team may have additional principles or patterns of their own as well.

Version control

All changes must be made through version control.

Trunk-based development

The default git branch is named main (formerly master) and all feature branches must be branched from and regularly updated from it. Never push code directly to main, changes must be proposed via Pull Requests.

Exceptions can be made for large features where the components physically cannot be merged or released independently from each other.

main branch is always releasable

The main branch should never have code or functionality in it that is not safe to release. In the release-when-ready model we operate under, releases can and will occur at any point and they should not cause any overhead for the engineering teams.

Pull Requests should be as small as possible

Small changes are easier to develop, code review, test, release, and verify success.

One Pull Request per story

Product and Engineering should negotiate a story’s scope to be the smallest change that will add business value and allow for learning and iteration. The story should be able to stand on its own without required follow-up stories to be functional in production. Consider that priorities may shift and follow-up stories or phases may never occur.

Larger or more time-sensitive features should often be phased in with feature flags and released disabled (or “dark”). This allows the main functionality to be released early and any refactored code to be exercised ahead of the feature being released. Then when we’re ready to deploy the feature, it’s a single configuration change to enable it.

When multiple engineering teams are needed to deliver on a feature, the story should be a small vertical slice across the needed technologies. The engineering teams can spin off technical stories to meet this principle, with the overarching team’s work being associated to the main story. The main story should describe the overall problem and acceptance criteria. The downstream team would then create a tech-only story for their part of the story, and the primary team’s work would align with the main story. The downstream team would deliver their functionality to production first, and then once the main team’s changes are deployed, the overall story is then considered released.

Code reviews should occur on all Pull Requests

Code reviewers should be a mix of people within each team. Using different people within the team for code reviews will give people a fresh set of eyes on their code, helps developers get new types of feedback, and will make the whole team stronger at peer reviews. Also, while a Senior or above is required for PR approval, be sure to give a chance for people at all levels to participate in the code review process.

What we should look for during code reviews:

  • Code consistency
  • Adherence to established design patterns
  • Meaningful method and variable names
  • Clear and understandable code
  • Self-documenting unit test coverage

All testing should be performed on the feature branch

As main is always releasable we must have full confidence in any functionality that has been merged in, therefore QA/SDET should perform their manual and automated testing in each individual feature branch.

Never merge your own code

Engineers must always have someone else review and merge code changes. Typically QA/SDET performs the merge after their functional testing and approval, but configuration changes or non-production changes can be merged by other engineers.

Pull Request merges should squash commits

With squashed merges, each commit in main is a meaningful checkpoint of functionality that is releasable. This allows for easier triaging of issues and rollbacks.

The whole team is responsible for preventing stale PRs

Product, Engineering, and SDETs have a shared responsibility for preventing stale PRs.

  • Product should only prioritize stories that are needed in production quickly, and they should be shepherding the story throughout the sprint to ensure it’s released and meets the acceptance criteria.
  • Engineers should want to wrap up their stories as quickly as possible, negotiating scope to reduce the size of the changes, making them easier to review, finding reviewers, and following up when they seem stuck.
  • SDETs should juggle between new high-priority changes and PRs that are falling behind, pushing back on new changes as necessary to keep our cycle time from suffering.

Use feature flags to de-risk large features

Larger or more time-sensitive features should often be phased in with feature flags and released disabled (or “dark”). This allows the main functionality to be released early and any refactored code to be exercised ahead of the feature being released. Then when we’re ready to deploy the feature, it’s a single configuration change to enable it.

  • Switching a flag On or Off should take less effort than creating a merging a PR, otherwise it defeats the purpose. This should be as simple as clicking a button somewhere.
  • Feature flags are meant to be removed, we can assume that the flagged code will be the new norm going forward, so any tests/methods around the flag being off must be very easy to find and delete.
  • Don’t ever make any assumptions about the state of the flag, all tests must be agnostic and preserve / revert any flag status changes it makes.

Infrastructure-as-code

Infrastructure and configurations should be pushed and managed via source control whenever possible. This ensures that those changes are known, reviewed, and deployed in a transparent way. It also helps us avoid manually changed “snowflaked” servers that are extremely difficult to manage or rebuild.

Continuous Integration

We should automate as much of our processes as possible, run them per commit for fast feedback, and increase confidence in our feature development.

Code consistency should be automated

Unwavering consistency within a system leads to code where you can make more assumptions and predictions about its behavior and implementation that end up being accurate. [source]

We should follow community-sourced and maintained style guides whenever possible. There are generally tools we can integrate into CI that enforce those standards and allow exceptions from them. Exceptions should be rare team decisions to deviate for a specific purpose.

Other items we should look to enforce via code reviews if our tools don’t catch them:

  • Using spaces instead of tabs
  • Avoid additional spaces at end of lines
  • Include a new line character at end of all files
    • This can be enforced in most text editors by changing a default setting

We may need to programmatically address the above across all files in individual repos to set the baseline go-forward standard.

We should proactively seek out and integrate new tooling

New tools around different ways to analyze static code are introduced regularly and we should be open to trying new tools in our CI process to see if they add value for our teams.

Dependent libraries should be regularly updated

Code dependencies on external libraries need to be continually reviewed for new releases and updated appropriately. Being stuck on old versions of libraries is detrimental for several reasons, primarily missing out on new functionality, resolved bugs, and security enhancements that occur over time.

We should ideally automate this process as part of CI, but when that’s not possible, we should rotate engineering responsibility to manually check for updates at least monthly.

Unit, integration, and regression tests should run automatically

All of our test suites should run automatically during feature development to exercise our code changes at various levels:

  • Unit tests are written for each public method to exercise and self-document the intended functionality at the lowest level
  • Integration tests are used to test certain outcomes of specific API endpoints and/or interactions between systems
  • Regression tests mirror customer or agent usage of the application under test (AUT)

Strive to remove automated test flakiness

Sporadically failing tests reduce confidence in our overall automation suite and waste a lot of engineering time trying to determine if they were the cause of the test failure(s).

All automated processes must be passing before merging

We need to have confidence that main always has passing CI tools to ensure that any errors that arise during feature development are caused by those specific feature changes.

Continuous Delivery

Our applications should be able to support frequent and painless releases throughout the day.

Releases should be able to occur without downtime

We need to architect our ecosystem to allow for rolling deployments that do not require downtime. This will allow us to deploy during business hours instead under cover of darkness.

Release process should be fully automated

Releases should feel simple, easy, and occur with a high degree of confidence.

Releases to the staging environment should occur automatically upon merging PRs to main.

Releases to the production environment should occur with a single push-button approval either within a tool or via Slack.

Production support

Our systems should be built for easy monitoring to proactively find, triage, and resolve issues.

Have comprehensive instrumentation for all outcomes

Each interaction that a customer, agent, or API performs should be instrumented into a centralized system for monitoring and reporting (both from technical and business perspectives). It could be as simple as pass, fail, and error with relevant meta data, or if there are a finite number of outcomes, it should be more granular.

Strive to have zero “acceptable” errors

Errors that occur in production should be noticeable and meaningful. If there are “acceptable” or ignorable errors in production, it makes it harder to find and triage new issues that come up.