"Build First, Refine Later" is a philosophy I have always lived by in my professional career into data engineering and analytics. Start delivering value to the business as quickly as possible and then keep evolving your data products based on the feedback received.
But when it comes to foundational decisions, such as choosing the right runtime environment or infrastructure for our data workloads, we should take a well-thought-out approach.
Otherwise, we may end up building something that is either too costly to maintain or doesn't scale with the growing data demands of the organization.
I am not implying that these areas don't evolve over time, but making informed choices at the outset can save significant time, effort, and resources in the long run.
In this article, we will discuss a pragmatic approach to choosing the optimal runtime for your data engineering workloads from among four options: Virtual Machines, Containers, Serverless, and Spark.
Understanding the Options
Before diving into the decision-making process, let's briefly understand each of these runtimes:
- Virtual Machines (VMs): VMs provide a full-fledged operating system environment. They are suitable for workloads that require specific OS-level configurations and stricter networking controls.
- Containers: Containers offer a lightweight alternative to VMs. They package applications and their dependencies together, ensuring consistency across different environments. Containers are ideal for dynamic data loads that need to scale quickly.
- Serverless: Serverless computing, like AWS Lambda or Azure Functions, abstracts away the underlying infrastructure, allowing developers to focus solely on code. It is cost-effective for event-driven workloads with short execution times, as you only pay for the compute time you consume.
- Spark: Apache Spark is a distributed computing framework designed for big data processing. It excels at handling large-scale data transformations and analytics tasks, making it suitable for complex data engineering workflows.
Making the Decision
Now, before we jump into the decision tree, it's important to note that the choice of runtime should be guided by the specific requirements of your data engineering workloads, such as:
- The nature of the workload (batch, real-time, event-driven)
- The expected execution time
- The scalability needs
Our decision tree will help you navigate through these considerations and arrive at the most suitable runtime for your data engineering tasks.
Typically, data engineering workloads can be categorized into three types: Batch / Scheduled, Real-time / Streaming, and Event-based. In this article, we will focus on "Batch / Scheduled" and "Event-based" data engineering workloads. "Real-time / Streaming" workloads deserve a separate discussion and will be covered in a future article. Also, this article doesn't cover SQL workloads that are typically executed in data warehouses or lakehouses, as they have their own set of runtime options and considerations.
Alright, here we go:
Let us cover each of these options in more detail:
First, let's cover the right side of the decision tree, which is for event-based workloads. In principle, event-based workloads are generally not resource-heavy or long-running; if they are, they should be moved to batch/scheduled mode, with business expectations managed accordingly.
Decision box 1: Serverless (Cloud Functions)
For event-based workloads that are expected to be quick (< 10 minutes, preferably less than 7) and that need no heavy libraries or OS-level control, serverless runtimes (Azure Functions, AWS Lambda, etc.) can be a good fit. Serverless is also the simplest, quickest to implement, and most cost-effective option, especially for sporadic workloads, as you only pay for the compute time you consume.
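To make this concrete, an event-driven serverless function usually boils down to a small, stateless handler. Here is a minimal sketch in the style of an AWS Lambda handler; the event shape and field names are hypothetical, chosen only for illustration:

```python
import json

def handler(event, context=None):
    """Minimal Lambda-style handler: parse an incoming event,
    apply a lightweight transformation, and return quickly."""
    records = event.get("records", [])
    # Keep the work short-running -- anything heavier risks
    # hitting the serverless execution-time limit.
    cleaned = [
        {"id": r["id"], "amount": round(float(r["amount"]), 2)}
        for r in records
        if r.get("amount") is not None
    ]
    return {"statusCode": 200, "body": json.dumps({"processed": len(cleaned)})}

# Example invocation with a sample event
result = handler({"records": [{"id": 1, "amount": "12.345"}, {"id": 2, "amount": None}]})
print(result["statusCode"])  # 200
```

The key property is statelessness: the platform can spin up and tear down instances freely, which is exactly what makes the pay-per-invocation model work.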
Decision box 2: Managed Container Services with Auto-Scale enabled
For event-based workloads with unpredictable data volume, Managed Container Services (like Kubernetes, AWS Fargate, Azure Container Apps, and similar) with auto-scale enabled are a great choice. They can automatically scale up or down based on the incoming load, ensuring that you only pay for what you use. I don't recommend serverless runtimes here (Azure Functions, AWS Lambda, etc.), because workloads with unpredictable load or data volume will hit execution-time limits in no time.
Another scenario where I would consider Managed Container Services with auto-scale enabled is when the workload is event-based and the data volume is predictable, but the source system does not expect a sub-second response (for example, processing files dropped in a storage account or messages arriving on a queue). In this case, we can allow some time for the containers to spin up before they process the data.
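The shape of the code inside each container replica is typically a simple worker loop. The sketch below uses the standard-library `queue.Queue` as a stand-in for a real message broker (a storage queue, SQS, Service Bus, etc.); in a managed container service, the autoscaler would add replicas of this loop as queue depth grows. The function names are illustrative, not from any real SDK:

```python
import queue

def process_message(msg: dict) -> dict:
    """Placeholder for the real per-message transformation."""
    return {"id": msg["id"], "status": "done"}

def drain(q: "queue.Queue") -> list:
    """Worker loop as it might run inside one container replica.
    With auto-scale enabled, the platform adds replicas when the
    queue backs up, so each replica stays this simple."""
    results = []
    while True:
        try:
            msg = q.get_nowait()  # a real worker would long-poll the broker
        except queue.Empty:
            break
        results.append(process_message(msg))
        q.task_done()
    return results

q = queue.Queue()
for i in range(3):
    q.put({"id": i})
print(len(drain(q)))  # 3
```

Because each message is processed independently, adding replicas scales throughput almost linearly, which is what makes the auto-scale model a good match for this workload shape.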
Decision box 3: Virtual Machines (always-on) with Containers
Now, this is for event-based workloads where the data volume to be processed is predictable and the source system expects a sub-second response (like API calls, user interactions, etc.). Here I would choose always-on Virtual Machines and run parallel containers on top of them. Perfect: no fuss, no frills, just good old VMs.
If Managed Container Services could achieve sub-second cold starts, we could consider them here too, but in my experience even the lightest of images take 7+ seconds to spin up a container, which is not acceptable for workloads that must respond in under a second. Keeping one pod warm at all times could also do the job, but it would still be far costlier than using VMs.
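The reason the always-on VM wins here is that the process is already warm when the request arrives, so nothing in the request path waits on a cold start. As a toy illustration, here is a tiny always-on HTTP service using only the Python standard library (endpoint and payload are made up for the demo):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class PingHandler(BaseHTTPRequestHandler):
    """Tiny always-on service: the process stays warm, so there is
    no container cold start in the request path."""
    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):  # silence per-request logging
        pass

def serve_once() -> dict:
    """Start the server on a free port, make one request, shut down."""
    server = HTTPServer(("127.0.0.1", 0), PingHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/") as resp:
            return json.loads(resp.read())
    finally:
        server.shutdown()

print(serve_once()["status"])  # ok
```

On a warm process, a round trip like this completes in milliseconds; a container cold start would add several seconds before the first byte could even be served.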
Rule of thumb: If your event-based workload can run within the limits of serverless runtimes, go for it. If not, and if the data volume is unpredictable, go for Managed Container Services with auto-scale enabled. If the data volume is predictable but sub-second response times are required, consider using VMs with containers on top.
Now, let us move to the left side of the decision tree, which is for batch/scheduled workloads.
Decision box 4: Spark (Managed or Self-Managed)
Well, this is a no-brainer. For batch/scheduled workloads that are expected to run for hours, process large volumes of data, and require complex transformations, Spark (either managed or self-managed) is the way to go. Spark's distributed computing capabilities make it ideal for handling big data processing tasks efficiently. You will benefit from Spark's ability to scale horizontally as the problem size grows over time, as well as from its rich ecosystem of libraries for machine learning, graph processing, and more. If the workload is not horizontally distributable, we can consider other options like VMs or containers, but for workloads that can be distributed across a cluster, Spark is the best choice.
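The "horizontally distributable" test is worth making concrete. A workload distributes well when it can be expressed as independent work on partitions of the data plus a cheap merge of partial results, which is Spark's map/reduce model in miniature. The sketch below uses only the standard library as a stand-in for a Spark job (it is not PySpark; in Spark the same shape would be a `groupBy`/aggregate over a partitioned DataFrame):

```python
from concurrent.futures import ThreadPoolExecutor

def transform_partition(rows: list) -> dict:
    """Per-partition aggregation: no shared state, so partitions
    could run on different executors/nodes."""
    total = sum(r["amount"] for r in rows)
    return {"count": len(rows), "total": total}

def merge(results: list) -> dict:
    """Combine partial results -- the 'reduce' side of the job."""
    return {
        "count": sum(r["count"] for r in results),
        "total": sum(r["total"] for r in results),
    }

rows = [{"amount": i} for i in range(100)]
partitions = [rows[i::4] for i in range(4)]  # 4 partitions, as a cluster would shard the data

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(transform_partition, partitions))

print(merge(partials))  # {'count': 100, 'total': 4950}
```

If your transformation cannot be decomposed this way (for example, it needs the entire dataset in memory at once, in order), a Spark cluster buys you little, and the scale-up VM option below is usually the better fit.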
Decision box 5: Virtual Machines (Scale-up)
For batch/scheduled workloads that are expected to run for hours and process large volumes of data but do not require complex transformations, Virtual Machines (scale-up) can be a good fit. This is especially true if the workload is not horizontally distributable and can be efficiently processed on a single machine with enough resources (CPU, memory, disk etc.). This option can be more cost-effective than Spark for workloads that do not require distributed computing capabilities, and it can also be simpler to set up and manage compared to a Spark cluster.
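The pattern that makes a single big machine viable is single-pass, streaming processing: read the data incrementally so memory stays bounded regardless of input size. A minimal sketch, assuming a simple `name,amount` line format invented for the example:

```python
def stream_totals(lines) -> dict:
    """Single-machine, single-pass aggregation: a scale-up VM works
    well when the job streams through the data with bounded memory,
    never materialising the whole dataset at once."""
    total = 0.0
    count = 0
    for line in lines:
        _, amount = line.rstrip("\n").split(",")
        total += float(amount)
        count += 1
    return {"count": count, "total": total}

# A generator stands in for reading a large file line by line
sample = (f"row{i},{i * 1.5}\n" for i in range(1000))
print(stream_totals(sample)["count"])  # 1000
```

In practice you would point this at an open file handle or a streamed download; as long as the pass is sequential, the only thing you need to scale is CPU, memory, and disk on that one machine.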
Decision box 6: Cloud Functions
This is exactly like decision box 1, but for batch/scheduled workloads. For batch/scheduled workloads that are expected to be quick (< 10 minutes, preferably less than 7) and that need no heavy libraries or OS-level control, serverless runtimes (Azure Functions, AWS Lambda, etc.) can be a good fit.
Decision box 7: Containers on VMs (consumption-based)
This is similar to decision box 3, but for batch/scheduled workloads you could use heavier machines (more CPU, memory, etc.) and run parallel containers on top of them. Instead of keeping the VMs always up, we can spin them up when the scheduled job is triggered and shut them down once the job is done. This way we save costs while still benefiting from the flexibility of containers for batch/scheduled workloads that need more resources than serverless runtimes can provide.
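The start/run/deallocate lifecycle is straightforward to orchestrate. Below is a sketch using the Azure CLI (`az vm start`, `az vm run-command invoke`, `az vm deallocate` are real commands; the resource group, VM name, and container image are hypothetical). It defaults to a dry run so the planned commands can be inspected without touching any cloud resources:

```python
import subprocess

RESOURCE_GROUP = "rg-batch-jobs"   # hypothetical resource group
VM_NAME = "vm-batch-worker"        # hypothetical VM name

def az_vm_command(action: str) -> list:
    """Build an Azure CLI command for the job VM. 'deallocate'
    (rather than just 'stop') releases the compute, so billing stops."""
    assert action in {"start", "deallocate"}
    return ["az", "vm", action, "--resource-group", RESOURCE_GROUP, "--name", VM_NAME]

def run_batch_job(dry_run: bool = True):
    """Start the VM, run the containerised job on it, then deallocate.
    With dry_run=True this only returns the planned commands."""
    steps = [
        az_vm_command("start"),
        ["az", "vm", "run-command", "invoke",
         "--resource-group", RESOURCE_GROUP, "--name", VM_NAME,
         "--command-id", "RunShellScript",
         "--scripts", "docker run --rm my-batch-image:latest"],  # hypothetical image
        az_vm_command("deallocate"),
    ]
    if dry_run:
        return steps
    for cmd in steps:
        subprocess.run(cmd, check=True)
    return steps

print(len(run_batch_job(dry_run=True)))  # 3
```

In a real pipeline, the same three steps would typically be driven by your scheduler (a cron job, Azure Data Factory, Airflow, and so on) rather than a hand-rolled script, but the cost logic is identical: pay for the heavy VM only while the job runs.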
Rule of thumb: If your batch/scheduled workload can run within the limits of serverless runtimes, go for it. If not, and if it can be distributed across a cluster, go for Spark. If it cannot be distributed but requires more resources than serverless runtimes can provide, consider using containers on VMs with a consumption-based approach. If it can be efficiently processed on a single machine with enough resources, then scale-up VMs can be a good fit.
Disclaimer: Although this article is written with years of experience in architecture and platform design, it may still not be exhaustive, and you may need to consider additional factors such as the skill set of your team, your existing infrastructure, and the specific requirements of your workloads. It should, however, serve as a good starting point for making informed decisions about the right runtimes for your data engineering workloads.
For any questions, suggestions or feedback, please feel free to reach out to me on LinkedIn Manish K. Narang or post your comments below. I would love to hear your thoughts and experiences on this topic!