I've come across some terrible code at work lately. It was written by me, a few months earlier. When I tried to work out why it was so bad, I realized that I hadn't separated my business logic from all the "glue code" that talks to other network services. Once I recognized this anti-pattern, I started seeing it in lots of places.

What Is This Terrible Code?

Here's something very similar to the actual code:

def autoscale():
    all_jobs = ORM.query_for_unfinished_jobs()
    total_work = 0
    for job in all_jobs:
        if job.is_valid():
            total_work += job.cost

    time_sla = lookup_time_sla()
    worker_speed = lookup_worker_speed()
    ideal_worker_count = round(total_work / (worker_speed * time_sla))

    min_workers, max_workers, actual_worker_count = AWS.describe_auto_scaling_group()

    if ideal_worker_count > max_workers:
        ideal_worker_count = max_workers

    if ideal_worker_count < min_workers:
        ideal_worker_count = min_workers

    if ideal_worker_count > actual_worker_count:
        PAAS.scale_up_to(ideal_worker_count)
    elif ideal_worker_count < actual_worker_count:
        PAAS.scale_down_to(ideal_worker_count)

The actual code has a few additional constraints and lots and lots of metric-spewing instrumentation so we can see what it's thinking... all of which makes the code that much harder to reason about and all the more important to test!

What Does This Code Do?

Essentially, it does some math to decide how many servers we need to complete our long-running batch jobs within a specific number of seconds (the "SLA"). Here's the core formula:

number of workers = combined time estimates for all jobs / (speed of one worker * the SLA)

Let's mull over how changing each individual variable in that formula affects the number of workers:

  • As the total amount of work increases (all else being constant), we'll need more workers
  • As the speed of the workers increases, we'll need fewer workers
  • If we want the work to get done faster (by decreasing the SLA), we'll need more workers

So far so good! Those outcomes all seem reasonable and intuitive. The gist of the formula feels right.
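To make the formula concrete, here's a quick back-of-the-envelope check in Python, using the same numbers the tests later in this post rely on:

total_work = 1200     # combined time estimates for all unfinished jobs
worker_speed = 2      # speed of one worker, as looked up elsewhere
time_sla = 300        # the SLA, in seconds

ideal_worker_count = round(total_work / (worker_speed * time_sla))
assert ideal_worker_count == 2  # 1200 / (2 * 300) = 2 workers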

Oh, the Edge Cases!

Buuuut -- there are a whole lot of additional concerns we need to think about to ensure this code is getting us to the correct outcomes. For one thing, some jobs might be invalid, so we'd want to exclude them from the calculation. For another, scaling up and scaling down work totally differently, because we want a graceful shutdown rather than interrupting a job in progress. Finally, we set boundaries on the min and max sizes of our pool of workers. If the ideal number of workers is more than the limit, we need to respect the limit. (In real life, there are at least three other constraints we need to adhere to as well.) If we get this formula or its edge cases wrong, we're either wasting money or making our customers frustrated. It's safe to say the stakes are high!

Testing This Is Terrible

So, how do I test the scale up and scale down behavior in various scenarios? There are no arguments to this function and no return value. However, you'll notice that there's basically one main mathy thing going on here: calculating the ideal_worker_count.

Let's consider the case of a cold start. I have no active workers, and suddenly a whole slew of jobs comes in at once. Let's check that (1) we scale up and (2) we scale up to the correct amount. Given the above implementation, it would look something like this:

from unittest.mock import Mock, patch

@patch.object(PAAS, 'scale_up_to')
@patch.object(PAAS, 'scale_down_to')
@patch('path.to.code.lookup_time_sla', return_value=300)
@patch('path.to.code.lookup_worker_speed', return_value=2)
@patch.object(AWS, 'describe_auto_scaling_group')
@patch.object(ORM, 'query_for_unfinished_jobs')
def test_scale_up(m_query_for_unfinished_jobs, m_describe_auto_scaling_group, m_lookup_worker_speed, m_lookup_time_sla, m_scale_down_to, m_scale_up_to):
    job = Mock(cost=120)
    job.is_valid.return_value = True
    # 10 valid jobs of cost 120 -> 1200 total work; 1200 / (2 * 300) = 2 ideal workers
    m_query_for_unfinished_jobs.return_value = [job for _ in range(10)]
    # (min_workers, max_workers, actual_worker_count)
    m_describe_auto_scaling_group.return_value = (0, 10, 0)

    autoscale()  # System under test

    m_scale_up_to.assert_called_once_with(2)

This test stinks. It's so counterintuitive that we literally wrote out comments with the math spelled out like grade-school math homework. While I'm quite happy to mock out lots of things in my unit tests like this, how each of the different mocks interacts to reach that result of 2 at the bottom is extremely non-obvious. Now imagine around 10 different tests that are variations of this code for the various scenarios and intuitions I've outlined for this piece of code. Good luck trying to debug this or extend it in the future!

There Must Be a Better Way

Let's step back a second and think about what we're really trying to do here in the test: We want to know that given some set of jobs and some set of workers, we arrive at the correct final size. It's almost like all the various bits of information should be arguments to a function, and the size of the group should be the return value.

Let's forget about where all the data comes from and focus on what a function would look like if it just did the math instead of finding all that data. Here goes:

def autoscale_math(min_workers, max_workers, total_work, worker_speed, time_sla):
    ideal_worker_count = round(total_work / (worker_speed * time_sla))

    if ideal_worker_count > max_workers:
        return max_workers

    if ideal_worker_count < min_workers:
        return min_workers

    return ideal_worker_count

This code almost feels too terse to deserve being its own separate function. But how do the tests look for it?

# Let's keep these bits constant for our test
MIN_WORKERS = 0
MAX_WORKERS = 10
WORKER_SPEED = 2
TIME_SLA = 300

def test_scale_up():
    total_work = 1200

    ideal_worker_count = autoscale_math(MIN_WORKERS, MAX_WORKERS, total_work, WORKER_SPEED, TIME_SLA)

    assert ideal_worker_count == 2

Already, this seems quite a bit clearer for testing as well as for the application code itself. But if we really want to cover a significant portion of the different cases, perhaps we could be even more expressive somehow.

What if we used test tables to lay out all the different test cases? Let's see how that would look in python:

MIN_WORKERS = 0
MAX_WORKERS = 10
WORKER_SPEED = 2
TIME_SLA = 300

def test_autoscale_cases():
    test_cases = [
        # total_work | ideal_worker_count | description
        (       1200,                   2,  'cold scale up'),
        (       3000,                   5,  'scale up partially'),
        (    1000000,                  10,  'only scale up to max'),
        (          0,                   0,  'scale down totally'),
        (        360,                   1,  'scale down partially'),
        (   -1000000,                   0,  'only scale down to min'),
    ]

    for (total_work, ideal_worker_count, description) in test_cases:
        assert autoscale_math(MIN_WORKERS, MAX_WORKERS, total_work,
                              WORKER_SPEED, TIME_SLA) == ideal_worker_count, description

This style of testing is considered idiomatic in golang and is also my favorite feature of Gherkin. I don't see a lot of python projects using this style, but it definitely works toward the overarching goal of most python syntax: readability.

The test tables allow me to think only about the specific case I want to test without thinking about the boilerplate to drive the test into that state. Combining that style of test with a very simple, pure function results in super easy test coverage for critical systems like this.
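If your project already uses pytest (an assumption on my part -- the table above doesn't need any framework at all), you can get a very similar table-driven effect from pytest.mark.parametrize, which reports each row as its own test case:

import pytest

@pytest.mark.parametrize(
    'total_work, ideal_worker_count',
    [
        (    1200,  2),   # cold scale up
        (    3000,  5),   # scale up partially
        ( 1000000, 10),   # only scale up to max
        (       0,  0),   # scale down totally
        (     360,  1),   # scale down partially
        (-1000000,  0),   # only scale down to min
    ],
)
def test_autoscale_math(total_work, ideal_worker_count):
    # Reuses autoscale_math and the constants defined above
    assert autoscale_math(MIN_WORKERS, MAX_WORKERS, total_work,
                          WORKER_SPEED, TIME_SLA) == ideal_worker_count

If you want the descriptions to show up in the test output, parametrize also accepts an ids argument for naming each row.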

Notice that my test table doesn't include coverage of the invalid jobs -- that's okay! I've managed to compress so much boilerplate by using this test table that I'll just add another one-off test for the invalid job edge case. Trying to shoe-horn that bit into the above table wouldn't really make sense to me, because I'd have to add a column to every row when invalid jobs really only affect one of the tests.

But What About All That Glue Code?

Right! So let's try breaking all the network calls out into their own function and see if that makes any sense:

def autoscale():
    time_sla = lookup_time_sla()
    worker_speed = lookup_worker_speed()
    min_workers, max_workers, current_workers = AWS.describe_auto_scaling_group()
    jobs = ORM.query_for_unfinished_jobs()
    total_work = sum(job.cost for job in jobs if job.is_valid())

    ideal_worker_count = autoscale_math(min_workers, max_workers, total_work, worker_speed, time_sla)

    if ideal_worker_count > current_workers:
        PAAS.scale_up_to(ideal_worker_count)
    elif ideal_worker_count < current_workers:
        PAAS.scale_down_to(ideal_worker_count)

This function has also become dramatically simpler. Looking at it, it's clear that we also need to think about the failure modes for these calls as well as the business logic. For instance, should we retry the PAAS.scale_up_to() call if it fails at first? Do we want a hardcoded fallback for lookup_worker_speed()? I'll leave those details as an exercise for the reader.
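Even after the refactor, the glue function deserves at least one test of its own -- which is also a natural home for the one-off invalid-job case promised earlier. Here's a minimal sketch, reusing the same mock setup as before (the 'path.to.code' patch targets are placeholders, just like in the earlier test):

@patch.object(PAAS, 'scale_up_to')
@patch.object(PAAS, 'scale_down_to')
@patch('path.to.code.lookup_time_sla', return_value=300)
@patch('path.to.code.lookup_worker_speed', return_value=2)
@patch.object(AWS, 'describe_auto_scaling_group')
@patch.object(ORM, 'query_for_unfinished_jobs')
def test_invalid_jobs_are_ignored(m_query_for_unfinished_jobs, m_describe_auto_scaling_group, m_lookup_worker_speed, m_lookup_time_sla, m_scale_down_to, m_scale_up_to):
    valid_job = Mock(cost=1200)
    valid_job.is_valid.return_value = True
    invalid_job = Mock(cost=1000000)
    invalid_job.is_valid.return_value = False
    m_query_for_unfinished_jobs.return_value = [valid_job, invalid_job]
    m_describe_auto_scaling_group.return_value = (0, 10, 0)  # (min_workers, max_workers, current_workers)

    autoscale()  # System under test

    # Only the valid job counts: 1200 / (2 * 300) = 2
    m_scale_up_to.assert_called_once_with(2)

It's still a mock-heavy test, but now it only has to cover the wiring and the filtering; all the interesting math lives in the table-driven tests above.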

When Separating Logic and IO Does Make Sense

The key feature that made separating the IO and logic work in this case is that we want all the data from the various network calls every time. As long as we always need to gather the same data (or doing that IO is cheap in your context), we don't have to be strategic about when to gather it. We just always get all of it.

When Separating Logic and IO Doesn't Make Sense

Sometimes, various network calls that feed your business logic can and should be skipped. In those cases, this refactoring technique only makes complex code harder to read. If we try to pull all of the IO out but still be strategic about which calls we make, we end up creating lots of little functions to evaluate lazily. If you strongly prefer very short, numerous 1-2 line functions over 100-200 line functions, maybe separating the IO and logic still makes sense. However, when I've tried this, my coworkers didn't thank me: there was just a lot more code to understand, and making the network calls lazy meant even more functions and more boilerplate to sift through. The end result was greater head scratching.
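To make that concrete, here's roughly the shape the "lazy" version takes -- a sketch of the approach, not code from the real system. Every input becomes a little zero-argument callable so the network call only happens if the logic actually needs it:

def autoscale_math_lazy(get_min_workers, get_max_workers, get_total_work,
                        get_worker_speed, get_time_sla):
    # Each argument is a callable wrapping a (possibly expensive) lookup,
    # so nothing is fetched until this line actually runs.
    ideal_worker_count = round(get_total_work() / (get_worker_speed() * get_time_sla()))

    if ideal_worker_count > get_max_workers():
        return get_max_workers()

    if ideal_worker_count < get_min_workers():
        return get_min_workers()

    return ideal_worker_count

Every test now has to build five little callables (or mocks) instead of passing five plain values, and the production code sprouts a wrapper function per lookup -- that's the extra boilerplate I'm complaining about.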

For these cases, there are a number of refactoring strategies you could try:

  • Look for dead code that can be totally removed
  • Simplify your data model to reduce the differences in these various scenarios
  • Use some kind of caching to make the data less expensive to look up (a minimal sketch follows this list)
  • Consider breaking the code apart into a series of smaller IO + logic units and chaining them together
  • Try using a Cucumber clone to write narrative, descriptive tests around the icky bits
  • Ask what component has too many responsibilities -- can you split one muddled abstraction into two narrower abstractions?
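For the caching suggestion above, here's a minimal sketch of one way to do it in Python -- a hypothetical helper with a time-to-live, where the key name and the 60-second TTL are just placeholders:

import time

_CACHE = {}

def cached_lookup(key, fetch, ttl_seconds=60):
    # Return the cached value for `key`, calling `fetch()` again only when
    # the cached copy is older than ttl_seconds.
    now = time.monotonic()
    hit = _CACHE.get(key)
    if hit is None or now - hit[0] > ttl_seconds:
        _CACHE[key] = (now, fetch())
    return _CACHE[key][1]

# e.g. worker_speed = cached_lookup('worker_speed', lookup_worker_speed)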

Sometimes, the nasty code refuses to yield to your gentle coaxing, and that's okay too.

Conclusion

Sometimes, separating your core logic out from IO can greatly enhance the readability and testability of your code. Try using test tables to significantly reduce the amount of boilerplate you need to write. Also, make sure that your code always fetches the same data before trying to isolate logic from network concerns.