In the small to medium software teams that I’ve been on, our development cycle is inefficient and error-prone at a conceptual level. Code changes are fully tested only after becoming the source of truth for the team, causing rollbacks that corrupt git history. Too often, developers are unable to determine whether a piece of code will actually work in production before they ship it. I propose several practices to more fully test your code before deploying it to your customers.

Setting the Stage: Typical Development Cycles

Laptop  ->  CI     ->  Approval  ->  Staging  ->  Production
8 hrs       30 min     8 hrs         1 hr         Years

Length of time a branch exists at each phase of the development cycle

In this kind of workflow, the only production-like environment used to test software comes immediately before sending your code to your customers.

What, Specifically, Is Bad about Staging Right Before Production?

Problem 1: Prod is just different, man

There are often significant differences that emerge only in production-like infrastructure. I once spent days developing and testing delete functionality for DynamoDB, adding conditions to the delete to ensure atomicity when writes happen concurrently, only to discover that the mocking library for DynamoDB implemented conditions differently than the DynamoDB running in AWS. If I could have easily run this code against real DynamoDB, I would have discovered the issue much earlier. While this kind of issue doesn’t happen to every developer every day, it’s not uncommon among my peers, and it can double or triple the time required to complete even simple tasks.
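To make “conditions on the delete” concrete, here is a minimal sketch of the arguments a conditional delete passes to boto3’s Table.delete_item. The table layout (a pk partition key and a numeric version attribute bumped on every write) and the orders table name are hypothetical. DynamoDB evaluates the ConditionExpression on the server, which is exactly the behavior a mocking library has to reimplement and can get subtly wrong.

```python
"""Arguments for a conditional DynamoDB delete (sketch).

Assumes a hypothetical table with partition key "pk" and a numeric
"version" attribute that every write increments (optimistic concurrency).
"""


def conditional_delete_kwargs(pk: str, expected_version: int) -> dict:
    """Return keyword arguments for boto3's Table.delete_item.

    The delete succeeds only if the item still exists and still has the
    version we read. A concurrent write bumps "version", so the delete fails
    with ConditionalCheckFailedException instead of removing newer data.
    """
    return {
        "Key": {"pk": pk},
        # Attribute-name placeholders avoid clashes with DynamoDB reserved words.
        "ConditionExpression": "attribute_exists(#pk) AND #v = :expected",
        "ExpressionAttributeNames": {"#pk": "pk", "#v": "version"},
        "ExpressionAttributeValues": {":expected": expected_version},
    }


if __name__ == "__main__":
    kwargs = conditional_delete_kwargs("order#123", expected_version=7)
    print(kwargs["ConditionExpression"])
    # Against real DynamoDB (needs boto3 and AWS credentials):
    #   import boto3
    #   boto3.resource("dynamodb").Table("orders").delete_item(**kwargs)
```

Sending these same arguments to a real table, for example in a dedicated test account, is what would have surfaced the mismatch with the mock early.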

These kinds of problems aren’t limited to proprietary cloud systems, either. I’ve personally run into footguns with open source components such as Celery, Redis, Django’s ORM, and MongoDB, simply because those are the infrastructure-facing technologies I’ve worked with lately. In each case, the production infrastructure wasn’t properly mimicked anywhere, except maybe in staging.

Problem 2: Dibs on Staging

Waiting for access to staging is another problem. At one of my jobs, we manually tested each branch after merging. I remember rolling out a big change to add SSL support for a large new customer. This change had the potential to break all images, stylesheets, JavaScript, and basically everything else if I didn’t get it right. So I took over staging for a few hours to test it manually, which blocked the team from deploying. I finally got it right on the third try. On nearly every team I know of that has a staging environment, I hear about production deploys being blocked because various experimental changes need to be deployed to staging before they hit production.

Problem 3: Bugs Going Viral

In teams of up to a hundred engineers working in the same codebase, if wrong assumptions about infrastructure reach production, it’s entirely possible for a developer to pull down the authoritative branch (let’s call it “master”) during the window before the bad change is rolled back. Defective code like this can go viral: it spreads into every branch based on that version of master, and it can recur even after the rollback.

It’s actually even worse than that. If automated tests run against a staging environment only after merging to master, it’s entirely possible that an unwitting developer pulled in bad code that never even reached production. Later, she discovers that her branch fails tests for code unrelated to her changes. At one company I recently worked for, this problem became so widespread that the QA team built tools to toggle individual tests via a Google spreadsheet to keep developer velocity up. A few times, I spent 30 minutes to an hour diagnosing a bug unrelated to my own code, opened a PR to fix it, and discovered that another PR or two fixing the same thing had already been opened and reviewed (but perhaps not yet merged to master).

Fixing Software Development

For nearly all the issues I’ve identified above, some kind of test already existed, but it ran on staging too late. What if every change could land on a staging environment automatically before merging to master? What if you could manually poke at that environment at your leisure, without time pressure and without blocking other engineers’ releases? The result: fewer bugs and greater velocity.

Solution 1. Make a “Staging” Per Branch

Expense                                             Hourly    Annual
“Web Developer” in Kentucky                         $36.29    $75,487
m5d.24xlarge on-demand EC2 instance in us-west-1    $6.38     $13,023
  • The hourly wage is the annual salary divided by 52 weeks, then by 40 hours per week
  • The annual cost of the m5d.24xlarge assumes a 51-week year of 40 hours per week

Even this comparatively cheap labor is significantly more expensive (more than 5x) than a very large on-demand cloud instance. And if you hire experienced web developers in the Bay Area (or other coastal cities), labor costs could be two to three times those of our colleagues in Kentucky. Meanwhile, the EC2 instances we’d actually test on are likely much less expensive than this one (do you really need 384 GiB of RAM and 96 vCPUs for your production workloads?). Any work an engineer is doing that can be replaced by a cloud computing instance should be.
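A quick check of the arithmetic behind the table and the “more than 5x” claim. The $13,023 annual EC2 figure implies about $6.384 per hour, which rounds to the $6.38 shown; the week counts follow the table’s footnotes.

```python
# Reproduce the cost comparison from the table.
HOURS_PER_WEEK = 40

developer_annual = 75_487                                  # "Web Developer" in Kentucky
developer_hourly = developer_annual / 52 / HOURS_PER_WEEK  # 52 weeks of 40 hours

ec2_hourly = 6.384                                         # implied by the $13,023 annual figure
ec2_annual = ec2_hourly * 51 * HOURS_PER_WEEK              # 51 weeks of 40 hours

print(f"developer:    ${developer_hourly:.2f}/hr  ${developer_annual:,}/yr")
print(f"m5d.24xlarge: ${ec2_hourly:.2f}/hr  ${ec2_annual:,.0f}/yr")
print(f"an hour of labor costs {developer_hourly / ec2_hourly:.1f}x an hour of the instance")
```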

So, if all of our developers are sitting idle, waiting for a chance to test their work on a production-like environment, that looks like an ideal point to arbitrage the compute / engineer cost difference. If all of our engineers can spend less time waiting for tests to complete, and more time delivering features or fixing bugs, that means the company could be getting more value out of the workforce they already have without a significant outlay.

Objection! Our production infrastructure is hard to replicate

Congratulations! You just found an argument that gives you budget to do some real devops on your production infrastructure. Bonus: you’ll get really good at disaster recovery if every branch is a mini-rollout of your production infrastructure. I can’t unpack the details of how to do this in this blog post, but there are lots of great resources on it, starting with The Twelve-Factor App.
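One Twelve-Factor practice that matters a lot for per-branch environments is storing config in the environment rather than in code. Here is a minimal sketch, assuming hypothetical variable names (DATABASE_URL, REDIS_URL, ENVIRONMENT_NAME): the same build artifact can run as production or as any branch environment, and only the variables change.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Config:
    database_url: str
    redis_url: str
    environment_name: str


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Read deploy-specific settings from environment variables.

    Required settings fail loudly at startup instead of silently falling
    back to a default that might point at the wrong (e.g. production) database.
    The variable names are hypothetical.
    """
    missing = [key for key in ("DATABASE_URL", "REDIS_URL") if key not in env]
    if missing:
        raise RuntimeError(f"missing required config: {', '.join(missing)}")
    return Config(
        database_url=env["DATABASE_URL"],
        redis_url=env["REDIS_URL"],
        environment_name=env.get("ENVIRONMENT_NAME", "development"),
    )
```

A branch environment would then be started with something like DATABASE_URL=postgresql://dev-cluster/app_feature_x, while production gets its own values.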

Objection! How could an expensive EC2 instance possibly make any difference?

If you’re not already in the cloud, the same principle applies: you’d want to buy a set of hardware identical to your production infrastructure and configure it with the same automation you use for your on-premises servers. Oh, you don’t have automation for your on-premises hardware? See the previous objection.

Either way, the closer your development environment (and particularly your CI environment) is to production, the more effective the tests you already have will be. By deploying to and running ALL tests against a production-like environment, you won’t get bogged down by hardware- or instance-type-specific quirks. As we discussed earlier, if staging is the only production-like environment and all tests need to pass through staging, you’ve got a single point of failure and a bottleneck in your CI/CD pipeline. To solve both problems, eliminate the notion of a permanent environment called “staging” and create ephemeral environments for each branch under development.
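Every ephemeral environment needs a stable name derived from its branch, for things like a subdomain or a Kubernetes namespace. Here is one possible scheme, a sketch rather than a prescription: the env- prefix and the hash-suffix truncation are my own assumptions, chosen so names stay within the 63-character limit for DNS labels.

```python
import hashlib
import re

MAX_LABEL = 63  # DNS labels (and Kubernetes namespace names) max out at 63 characters


def environment_name(branch: str, prefix: str = "env") -> str:
    """Turn a git branch name into a stable, DNS-safe environment name.

    "feature/Add-SSL_support" -> "env-feature-add-ssl-support".
    Over-long names are truncated and suffixed with a short hash of the full
    branch name, so two long branches sharing a prefix stay distinct.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    name = f"{prefix}-{slug}" if slug else prefix
    if len(name) <= MAX_LABEL:
        return name
    digest = hashlib.sha1(branch.encode("utf-8")).hexdigest()[:8]
    # 54 chars + "-" + 8 hex chars = at most 63 characters.
    return f"{name[: MAX_LABEL - 9].rstrip('-')}-{digest}"
```

Because the name is a pure function of the branch, the automation that creates the environment on push and the automation that tears it down on branch deletion always agree on what to target.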

Objection! This still seems really expensive

There’s a lot you can do to cut the costs of this approach. Here are a few ideas:

  • Invest in autoscaling so that idle development environments scale to zero
  • When a branch is deleted, automation should clean up its ephemeral environment to reduce costs
  • If the whole company works in a few adjacent time zones, turn off all the machines overnight via automation. Bonus: employees will avoid burnout by going home at a “reasonable” hour
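The three cost-cutting rules can be combined into one scheduled cleanup job. This is a hypothetical decision function only: the Environment record, the one-hour idle threshold, and the 8:00 to 19:00 working window are assumptions, weekends are ignored for brevity, and actually scaling to zero (for example, setting a Kubernetes Deployment’s replicas or an Auto Scaling group’s desired capacity to 0) is left to your platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, Set, Tuple


@dataclass
class Environment:
    branch: str
    last_request: datetime  # last time anyone hit this environment


def plan_actions(
    envs: Iterable[Environment],
    live_branches: Set[str],
    now: datetime,
    idle_after: timedelta = timedelta(hours=1),
    work_hours: Tuple[int, int] = (8, 19),
) -> Dict[str, str]:
    """Decide what a scheduled cleanup job should do with each environment.

    Returns {branch: "delete" | "scale_to_zero" | "keep"}:
      - "delete": the git branch no longer exists
      - "scale_to_zero": outside working hours, or idle longer than idle_after
    `now` is assumed to be in the team's local time zone.
    """
    start, end = work_hours
    after_hours = not (start <= now.hour < end)
    plan = {}
    for env in envs:
        if env.branch not in live_branches:
            plan[env.branch] = "delete"
        elif after_hours or now - env.last_request > idle_after:
            plan[env.branch] = "scale_to_zero"
        else:
            plan[env.branch] = "keep"
    return plan


if __name__ == "__main__":
    now = datetime(2024, 3, 5, 14, 0)
    envs = [
        Environment("feature-a", now - timedelta(minutes=5)),
        Environment("feature-b", now - timedelta(hours=3)),
        Environment("old-branch", now),
    ]
    print(plan_actions(envs, {"feature-a", "feature-b"}, now))
    # {'feature-a': 'keep', 'feature-b': 'scale_to_zero', 'old-branch': 'delete'}
```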

These techniques, alone or in combination, should keep your costs minimal while providing strong guarantees about the code you ship to production.

Objection! What about databases, credentials to vendor integrations, etc?

This part depends heavily on your context, so I can’t provide the right answers for your case. However, there are strategies you can try.

For relational databases, you can spin up a database cluster that’s an exact mirror of production but is used only by development environments. Each branch can create its own separately named set of databases, and your automation can run migrations or restore a recent snapshot from production (it’s probably best to sanitize any sensitive data first!). For DynamoDB or similar proprietary NoSQL databases, you can create a separate AWS/GCP/Azure account and namespace your tables per branch.
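A minimal sketch of the sanitizing step and of per-branch table naming. The SENSITIVE_COLUMNS list, the salt, and the branch-prefix naming are hypothetical. Replacing values with a salted hash keeps joins and uniqueness constraints working, but it is pseudonymization rather than anonymization: low-entropy values like phone numbers can be brute-forced if the salt leaks, so random replacement may be the safer choice for some columns.

```python
import hashlib

# Columns treated as sensitive in this sketch; a real list comes from your schema.
SENSITIVE_COLUMNS = {"email", "phone", "full_name"}


def scrub_row(row: dict, salt: str = "dev-snapshot-salt") -> dict:
    """Return a copy of a production row that is safer to load into dev.

    Sensitive values become deterministic fakes derived from a salted hash,
    so the same input always maps to the same fake and joins and unique
    constraints keep working. The original row is not modified.
    """
    clean = dict(row)
    for column in SENSITIVE_COLUMNS.intersection(row):
        value = row[column]
        if value is None:
            continue  # keep NULLs as NULLs
        token = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()[:24]
        clean[column] = f"user-{token}@example.com" if column == "email" else f"redacted-{token}"
    return clean


def branch_table_name(base: str, env_name: str) -> str:
    """Namespace a DynamoDB table per branch: ("orders", "feature-x") -> "feature-x.orders".

    DynamoDB table names may contain letters, digits, "_", "-" and ".".
    """
    return f"{env_name}.{base}"
```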

For other third-party integrations, I’d recommend making a separate account and reusing credentials where possible. Ask your vendor how other customers solve this problem; if the vendor is smart, they’ve already found a way to solve it.

None of these techniques are new or particularly clever. They just take an investment of time.

Solution 2. No More Sneaky 3-ways

When your GitHub repository allows merge commits (technically, a “three-way merge”), it’s very possible that code from your branch won’t work after it’s merged into master. Very little is guaranteed about the logic or syntax of your code after a three-way merge, even when there are no conflicts. And once it’s merged to master, your bugs are going to go viral, even when tests that would catch them have already been written!
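Here is a hypothetical illustration of a merge that git completes without any conflict but that still breaks master. Each branch passed its own tests; only the combination is wrong. The merged file is simulated as a string so the example runs on its own.

```python
# Hypothetical history. Base version on master:
#     def greet(name): return f"Hello, {name}"
# Branch A (merged first) renames greet() to greet_user() and updates every
# existing caller. Branch B, tested against the old master, adds one new
# caller of greet(). The branches touch different lines, so git merges B
# without a conflict, producing this file on master:
MERGED_SOURCE = '''
def greet_user(name):
    return f"Hello, {name}"


def welcome_banner(name):  # added by branch B
    return greet(name).upper()
'''


def run_merged_code():
    namespace = {}
    exec(MERGED_SOURCE, namespace)  # the module loads fine...
    return namespace["welcome_banner"]("Ada")  # ...but fails as soon as it's called


if __name__ == "__main__":
    try:
        run_merged_code()
    except NameError as err:
        print(f"master is broken after a clean merge: {err}")
```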

Rebasing Is the Answer!

However, if your branch must be rebased onto master and can only be merged as a fast-forward, then you’re guaranteed that the code in your PR is exactly the code that lands in master. This means the test results will be the same for anything that depends purely on the code. This is huge! The key point here is not how to use git correctly, but how to guarantee a good starting point for other developers to build their own work upon.
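Why a fast-forward guarantees that what you tested is what lands can be shown with a toy commit graph: a fast-forward is possible only when master’s tip is already an ancestor of the branch tip. In a real repository, git merge-base --is-ancestor performs this check and git merge --ff-only refuses any merge that isn’t a fast-forward; the graph and commit names below are made up.

```python
from typing import Dict, List

# Toy commit graph: commit id -> list of parent ids (hypothetical history).
Graph = Dict[str, List[str]]


def is_ancestor(graph: Graph, ancestor: str, descendant: str) -> bool:
    """True if `ancestor` is reachable by following parents from `descendant`."""
    stack, seen = [descendant], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit not in seen:
            seen.add(commit)
            stack.extend(graph.get(commit, []))
    return False


def can_fast_forward(graph: Graph, master: str, branch: str) -> bool:
    """Fast-forward is possible only if master's tip is in the branch's history.

    Merging then just moves the master pointer to the branch tip, so master
    ends up on exactly the commit that was tested on the branch.
    """
    return is_ancestor(graph, master, branch)


if __name__ == "__main__":
    history = {
        "a": [],
        "b": ["a"],    # current master tip
        "c": ["a"],    # feature commit made before b landed
        "c2": ["b"],   # the same change after rebasing onto b
    }
    print(can_fast_forward(history, "b", "c"))   # False: would need a merge commit
    print(can_fast_forward(history, "b", "c2"))  # True: rebased, fast-forwardable
```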

Rebasing Isn’t Really The Answer

In fact, at a certain scale, simply doing rebases probably won’t work anymore. Developers won’t be able to rebase and finish their tests as fast as new code is merged to master and deployed to production. This leads to a vicious cycle of endlessly rebasing against the current HEAD of master, like racing up an escalator that’s going down.

OpenStack addresses this problem by continually testing combinations of branches together to find combinations that will work before they are merged to master. They built a custom tool called Zuul for this.

From chatting with friends at Facebook, I think they do something like queue a series of changes and put them into a deterministic order. By ordering the change requests, they effectively get the same guarantees as a rebase-based workflow.
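Here is a minimal sketch of the ordered-queue idea, not Facebook’s or Zuul’s actual implementation. Changes land one at a time in a fixed order, and each is tested on top of master plus every change accepted ahead of it. Systems like Zuul speed this up by testing queued changes speculatively in parallel, which this serial version omits. Changes are modeled as names in a list, and tests_pass is a hypothetical stand-in for a real CI run.

```python
from typing import Callable, List, Sequence


def run_merge_queue(
    master: List[str],
    queue: Sequence[str],
    tests_pass: Callable[[List[str]], bool],
) -> List[str]:
    """Land queued changes one at a time, in a fixed (deterministic) order.

    Each change is tested on top of master plus every change accepted ahead
    of it, so the history that lands is exactly the history that was tested.
    A failing change is rejected without blocking the changes behind it.
    """
    tip = list(master)
    for change in queue:
        candidate = tip + [change]
        if tests_pass(candidate):
            tip = candidate  # master fast-forwards to the tested candidate
        else:
            print(f"rejected {change!r}: tests failed on top of {tip}")
    return tip


if __name__ == "__main__":
    def fake_tests(history: List[str]) -> bool:
        # Hypothetical semantic conflict: the new caller breaks after the rename.
        return not ("rename-greet" in history and "new-greet-caller" in history)

    landed = run_merge_queue(["base"], ["rename-greet", "new-greet-caller", "docs"], fake_tests)
    print(landed)  # ['base', 'rename-greet', 'docs']
```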

Netflix seems to tackle this problem by breaking things into microservices, so that there aren’t huge codebases shared by thousands of engineers.

But, Seriously, Rebasing Kinda Is the Answer

So this idea is very powerful and much deeper than just “do rebases.” The real idea here is “test before you merge.” It just so happens that rebasing, with few enough engineers sharing a repository, tends to be an easy way to achieve this. However you eliminate three-way merges, these tactics make the tests you already run on branches more effective, because they run against exactly the code that will land on master.

Solution 3. Test Outside The Box

Unit and integration tests that I’ve written and seen typically run inside the same process as the application code under test. Mocking is used to speed up expensive operations and to create very specific situations that constrain tests. All of that is great for testing specific code paths efficiently, and it gets us something like 80% of the correctness guarantees for 20% of the work of, say, fuzzing. So this is really great stuff, and it’s awesome that we do it.
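For contrast with blackbox testing, here is a typical in-process “whitebox” test using Python’s unittest.mock. InventoryService and its store are hypothetical. The test is fast and precise, but it only proves the code agrees with the mock: it says nothing about how a real store behaves, and it can’t reveal that reserve’s read-then-write races under concurrent requests.

```python
import unittest
from unittest import mock


class InventoryService:
    """Hypothetical application code that depends on an external store."""

    def __init__(self, store):
        self.store = store

    def reserve(self, sku: str, qty: int) -> bool:
        # Read-then-write: not safe under concurrency, kept simple for the example.
        available = self.store.get_stock(sku)
        if available < qty:
            return False
        self.store.set_stock(sku, available - qty)
        return True


class ReserveTest(unittest.TestCase):
    def test_reserve_decrements_stock(self):
        store = mock.Mock()
        store.get_stock.return_value = 5  # no real database involved
        self.assertTrue(InventoryService(store).reserve("sku-1", 2))
        store.set_stock.assert_called_once_with("sku-1", 3)

    def test_reserve_rejects_oversell(self):
        store = mock.Mock()
        store.get_stock.return_value = 1
        self.assertFalse(InventoryService(store).reserve("sku-1", 2))
        store.set_stock.assert_not_called()
```

Saved as a module, this runs with python -m unittest.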

But it doesn’t really answer whether the application actually works for users. I like to get paid. I like it when users like the software I write and pay for it…you know, because it worked for them. Life is hard enough without glitchy software; why make people suffer more?

By buttressing our “whitebox” testing with tests that operate outside the application, as a separate process, we can detect serious regressions. Additionally, these tests are very portable. With a teensy bit of cleverness, you can write blackbox tests that run against a remote machine just as easily as against a separate process on your own machine.
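A minimal blackbox check, sketched with only the standard library. It talks to the running application over HTTP and knows nothing about its internals, so pointing it at a different base URL is all it takes to test a local process, a per-branch environment, or production. The /health endpoint, its {"status": "ok"} response, and the APP_BASE_URL variable mentioned below are assumptions.

```python
import json
import urllib.request


def check_health(base_url: str, timeout: float = 5.0) -> dict:
    """Blackbox check: exercise the running app over HTTP, as a client would.

    It has no access to the app's internals, so the same check works against
    a local process, a per-branch environment, or production; only base_url
    changes. The /health endpoint and its JSON shape are assumptions.
    """
    url = f"{base_url.rstrip('/')}/health"
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        status = resp.status
        body = json.loads(resp.read().decode("utf-8"))
    if status != 200:
        raise AssertionError(f"{url} returned HTTP {status}")
    if body.get("status") != "ok":
        raise AssertionError(f"{url} reported unhealthy: {body}")
    return body
```

In CI, a test runner could call check_health(os.environ["APP_BASE_URL"]), with APP_BASE_URL set to whichever environment the pipeline just deployed.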

Cons with Blackbox Testing

Upfront investment in blackbox testing can discourage managers. However, I find writing blackbox tests during a proof-of-concept phase to be a worthwhile effort. Because the tests are portable and separate from the application, I can throw away my prototype and keep the same tests. If I’ve written enough blackbox coverage of my application, I can be confident that any services using it can switch over to the new application flawlessly. I can even write the application in a different language from my blackbox tests!

Maintaining blackbox tests can be intimidating. My preferred methodology is to make the blackbox tests mandatory in CI and to encourage developers to add blackbox tests alongside new features. This does impact development velocity, but in many cases the existing tests can remain unchanged.

Pros with Blackbox Testing

You know those annoying, hard-to-test bits of code near your program’s entry point? Running blackbox tests with code coverage tools can make it easy to see that your main method and other wide swaths of your application have abundant test coverage.

Again, because blackbox tests can be written for portability, you can run them against production-like environments from your own machine or from CI and experiment with changes before you get to production…if only you had your own production-like environment…

Conclusion

Each of my three solutions works on its own. But there’s magic when you put them together: if code can only be fast-forward merged, and it’s tested on its development branch in its own staging environment, there’s no need to run the test suite again on master…until after you deploy to production! In my experience, this could easily cut 30% from your deployment time, time that was previously spent re-running tests.