My “on-prem” compute infrastructure at home consists of a laptop. This meets most of my daily needs, but is insufficient for certain tasks that require large amounts of memory or CPU, or that benefit from CUDA support (which my laptop lacks). The title focuses on machine learning, but that is only a fraction of the workloads I run. This document provides a solution to the more general problem of offloading personal/small-team compute jobs to AWS using the Netflix Metaflow framework and its native AWS integration.

Along with this document, I have made available a Terraform module that codifies this AWS compute setup for personal use, designed to be low-cost while maintaining flexibility.

Requirements

  • Low cost. My uses for scaled-up compute are sporadic, so I’d like to take advantage of elasticity and low-priced spare market compute where possible.
  • Parity between local and non-local development workflows. The development, experimentation, and diagnostic process should look and feel similar regardless of where the compute is performed.
  • Unobtrusive. I can accept reasonable constraints on the compute substrate, but nothing too specialized in terms of the types of workloads.
  • Low maintenance. I don’t mind an up-front setup cost, but maintenance burdens prevent me from utilizing my time most effectively.
  • Just me. I’m unconcerned about scaling this solution to larger teams. A given solution might make that possible, but it isn’t a requirement for this design.
  • Reproducibility. I’d like a simple way to save and load input datasets, final products, and anything else required to reproduce results and perform diagnostics and analysis after a job completes.

Why Metaflow?

Many of my projects are expressed in Python or MATLAB, and I also use both professionally. I have also used two excellent workflow tools in production environments for several years, both of which express workflow digraphs of arbitrary complexity directly in Python: Luigi and Apache Airflow. I have come to value the ergonomics and flexibility of these systems, and I gravitate toward solutions that look and feel similar.

Metaflow feels like a spiritual successor to Luigi. Both express workflows in pure Python and largely stay out of your way. The largest ergonomic differences from my perspective are:

  • Metaflow has a metadata system that allows for examining the results of any run, each of which is cataloged as an immutable dataset. A common use case is to open a Jupyter notebook, pull up the latest run via the Metaflow client API, and visualize some diagnostic data (a sketch of this follows the list).
  • Metaflow lets you persist basically anything serializable by assigning the object to a member variable, e.g. self.x = x. Storage can be configured as either local or AWS S3, and both look and feel exactly the same from the user’s perspective.
  • Metaflow lets you offload compute via a simple decorator on your function.
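
As a quick sketch of the first two points (assuming the HelloAWSFlow example shown below has already been run at least once):

from metaflow import Flow

# Look up the most recent run of the flow via the metadata system.
run = Flow("HelloAWSFlow").latest_run
print(run.id, run.finished)

# Any object persisted with `self.x = x` during the run is available as an
# artifact, whether it was stored on local disk or in AWS S3.
print(run.data.message)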

The last one is really powerful. Here is an example from the tutorial:

from metaflow import FlowSpec, step, batch, retry


class HelloAWSFlow(FlowSpec):
    @step
    def start(self):
        from metaflow import get_metadata

        print("HelloAWS is starting.")
        self.next(self.hello)

    @batch(cpu=1, memory=500)
    @retry
    @step
    def hello(self):
        """
        This step runs remotely on AWS Batch using 1 virtual CPU and 500MB of
        memory.
        """
        self.message = 'Hi from AWS!'
        print("Metaflow says: %s" % self.message)
        self.next(self.end)

    @step
    def end(self):
        """
        The 'end' step is a regular step, so it runs locally on the machine
        from which the flow is executed.
        """
        # Demonstrate object persistence across local and AWS Batch via AWS S3.
        print(self.message)
        print("HelloAWS is finished.")


if __name__ == '__main__':
    HelloAWSFlow()

For those familiar with Luigi, the flow class and its step definitions will likely look and feel very familiar.

In my case, running this simple example with my personal setup launched an m4.xlarge at a spot price of $0.06 per hour (on-demand was $0.20 per hour). The total cost for the job came to about $0.01, i.e., roughly ten minutes of instance time.

Architecture

Metaflow provides an example CloudFormation template. I manage my personal (and professional) infrastructure with Terraform, so I created a Terraform module to set up Metaflow on AWS according to my requirements. The architecture is shown in the following diagram:

Architecture diagram

Concept of Operations

Most of the time, I stay on the laptop. Whenever I reach a step in my processing pipeline that exceeds my local compute capabilities, I guesstimate the resources required to run the job and annotate the step with them. When the workflow runs and reaches that step, it starts a batch job in AWS and blocks until the job succeeds or fails. If it fails, I can make modifications and resume, allowing the processing to pick up where it left off.
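
For instance, a memory-hungry step might be annotated like this (the flow name, step name, and resource numbers here are hypothetical guesses, tuned per workload):

from metaflow import FlowSpec, step, batch, retry


class BigComputeFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.crunch)

    # Guesstimated resources: only this step runs remotely on AWS Batch;
    # every other step stays on the laptop.
    @batch(cpu=16, memory=64000)
    @retry
    @step
    def crunch(self):
        self.result = 42  # stand-in for the expensive computation
        self.next(self.end)

    @step
    def end(self):
        print(self.result)


if __name__ == '__main__':
    BigComputeFlow()

If the remote step fails, a resume invocation (e.g. python big_compute.py resume, assuming that file name) restarts the flow from the failed step rather than from the beginning.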

Dependencies are managed via Conda, which works locally and remotely in a mostly-transparent fashion. Remote use of Conda dependencies requires an additional annotation on each remote step.
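
Here is a minimal sketch of that annotation (the flow name and library pins are hypothetical):

from metaflow import FlowSpec, step, batch, conda, conda_base


@conda_base(python='3.8.10')  # pin the interpreter for every step of the flow
class CondaDemoFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.compute)

    # The extra @conda annotation resolves these libraries and ships them to
    # this step, whether it runs locally or in the remote Batch container.
    @conda(libraries={'numpy': '1.21.2'})
    @batch(cpu=1, memory=500)
    @step
    def compute(self):
        import numpy as np

        self.total = float(np.arange(10).sum())
        self.next(self.end)

    @step
    def end(self):
        print('total:', self.total)


if __name__ == '__main__':
    CondaDemoFlow()

The flow is then run with Conda environments enabled: python conda_demo.py --environment=conda run.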

Since my off-laptop workloads tend to be “bursty” and latency-tolerant, I can take advantage of AWS Spot as a compute option, and the native Metaflow integration with AWS Batch means I can request cheap compute on demand through the workflow itself. I only pay for AWS S3 storage and the compute time a job uses; AWS Batch auto-terminates any EC2 instances that are no longer used by any Batch job.

Features

Here is a summary of some noteworthy features of this design:

  • Saves money by using Spot instances.
  • Saves money by running the Metaflow service and database locally, instead of an always-on RDS instance and ECS service.
  • Creates a VPC with a public subnet in each availability zone, so that AWS Batch can schedule optimally within the Region.
  • Runs AWS Batch compute in a public subnet with a reasonable security group default (no ingress, permissive egress). This saves money by not requiring a NAT gateway.
  • Follows the principle of least privilege.
  • Uses a Managed compute environment for AWS Batch with instance type “optimal”, which allows AWS Batch to choose between the C, M, and R instance families.
  • Adds a few GPU instance types by default, in case there is a need for GPU workloads.
  • Configures AWS Batch with the SPOT_CAPACITY_OPTIMIZED allocation strategy, which lets AWS Batch select instance types that are large enough to meet the requirements of the jobs in the queue, with a preference for instance types that are less likely to be interrupted.

Conclusion

This setup has been working well for me, and it has been delightfully ergonomic. I no longer need to fret that my matrix multiplications will cause out-of-memory errors, or context-switch to set up an appropriate compute environment, including loading code, dependencies, and data.

Having S3 as my data store means I can start to play with large datasets and benefit from the excellent network performance between EC2 and S3. I can inspect results locally with a Jupyter notebook and resume failed jobs where they stopped. I don’t need to worry about provisioning compute; it is auto-provisioned for me when I need it, and can easily include dependencies as required by the workflow step.