
Agents are just workflows, really
It seems like everyone (including your author) is coalescing around AI agents as the next major platform shift. And like every platform shift before it – think cloud and mobile in particular – the rise of autonomous agents will require a complete transformation in how we think about infrastructure and the developer toolchain (which I’ve written about here).
One perhaps under-the-radar aspect of these discussions is the logistics of writing and running agent code itself. That's because most agents you see today are essentially toys; they tease at some (eventually) very useful task, but for now do little things like booking flights or writing a basic Postgres query. Once we start moving these things into serious production environments, though, developers are going to face a litany of familiar problems.
In this post I'm going to argue that agents are, in essence, just dynamic workflows. Writing and managing them will carry with it the same challenges that have plagued distributed systems over the past decade – things like retries, recovery, and long-running calls. And I believe that if you care about these things, Temporal should be your default execution layer for these complex agentic workflows.
The anatomy of an agent in production
To illustrate how autonomous agents are just distributed systems problems in a trenchcoat, let’s implement what appears to be the most popular of all the agent use cases today: a customer support bot. The first version of our little agent will be pretty tightly scoped. All it does is issue refunds (when appropriate).
Other than the obvious boilerplate, our agent boils down to one function that takes a natural language message from a customer talking to it:
```python
async def process_refund_request(customer_message):
```
What needs to happen to issue refunds effectively? A few things:
- We need to analyze the message from the customer, using an LLM, to extract the relevant details – like which order number it corresponds to.
- We need to query our production database (or Shopify, etc.) to pull the customer and their order/payment data.
- We need to decide if a refund is warranted and the nature of that refund (full, partial, store credit).
- If approved, we need to (1) issue the refund in Stripe, (2) update any existing ticket status in Zendesk, and then (3) send a confirmation email to the customer.
Even a function that’s quite simple on the surface – it just issues a refund – needs to communicate with 5 different systems and make 6 different sequential network calls to those systems. The code for such an agent might look like this (assuming your agents are written in Python for some inscrutable reason):
```python
async def process_refund_request(customer_message):

    # Analyze the customer request
    analysis = await openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Analyze this refund request: {customer_message}"}]
    )

    # Get customer data
    customer = await get_customer_from_db(analysis.customer_id)

    # Check payment history
    payments = await stripe.PaymentIntent.list(customer=customer.stripe_id)

    # Determine if refund should be approved
    decision = await openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Should we refund {customer.email}? Payment history: {payments}"}]
    )

    if "approve" in decision.choices[0].message.content.lower():
        # Process the refund
        refund = await stripe.Refund.create(payment_intent=payments.data[0].id)

        # Update ticket status
        await update_zendesk_ticket(customer.ticket_id, "refund_processed")

        # Send confirmation email
        await send_email(customer.email, "Your refund has been processed")

        return "Refund processed successfully"
```
The first thing you might notice about this code is that it's highly order-dependent. In most cases, each stage is a hard dependency for the next. We must process the customer message before we start talking to our database, and we must get the requisite customer information from the DB before we pass it to Stripe to get a payment history.
At the same time, some of these tasks can also be done in parallel…sort of. In the final phase of the function – actually issuing the refund – the Zendesk ticket can be updated at the same time as the confirmation email is sent. But neither of those two things can happen until the refund is actually done in Stripe.
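In plain asyncio, that final phase might look something like the sketch below – `finalize_refund` is a hypothetical helper, reusing the simplified Stripe/Zendesk/email calls from the code above:

```python
import asyncio

async def finalize_refund(customer, payment_intent_id: str):
    # The refund has to land first, but the Zendesk update and the confirmation
    # email are independent of each other and can run concurrently.
    refund = await stripe.Refund.create(payment_intent=payment_intent_id)
    await asyncio.gather(
        update_zendesk_ticket(customer.ticket_id, "refund_processed"),
        send_email(customer.email, "Your refund has been processed"),
    )
    return refund
```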
Obviously, this "agent" is just a workflow. It might involve some non-deterministic LLM calls, and it might get kicked off (also non-deterministically) via a chat interface. It might communicate semi-autonomously with other agents. An LLM itself might control the order and execution of the steps. But ultimately, it is just a workflow. And like other workflows of arbitrary complexity, we have 2 major rhyming questions to answer before we get this into production:
What happens when things go wrong?
Let me count the seemingly infinite ways that a poker game between 5 internal and external systems can go haywire. The customer doesn't exist in your database. The customer exists in your database, but doesn't exist in Stripe. Your process crashes in between refunding and emailing. Stripe is having a bad day and returns a 500 (just kidding Stripe, I know this would never happen).
When any, or all, of these things happen…do you retry? How many times? What if the refund did go through, but Stripe’s response got lost in the network?
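To make that last failure mode concrete: the usual defenses are bounded retries with exponential backoff plus an idempotency key, so a retried refund returns the original result instead of issuing a second one. A minimal sketch, reusing the simplified awaited Stripe call from the code above (`refund_with_retries` and `request_id` are hypothetical names):

```python
import asyncio
import stripe

async def refund_with_retries(payment_intent_id: str, request_id: str, max_attempts: int = 3):
    for attempt in range(max_attempts):
        try:
            # The idempotency key means that if the refund succeeded but the
            # response was lost in transit, the retry returns the original
            # refund rather than creating a duplicate.
            return await stripe.Refund.create(
                payment_intent=payment_intent_id,
                idempotency_key=f"refund-{request_id}",
            )
        except Exception:
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
```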
What happens when things go long?
Things might not technically go wrong, but they might take a really long time, too long to respond to your customer in a reasonable timeframe. Your database might be overloaded. Calls to OpenAI might be taking longer than usual, because your prompt accidentally generated a huge response. Or you might have a human being as part of this workflow, manually verifying refunds over a certain dollar threshold.
Working through the constraints of HTTP, you are going to run up against natural timeout issues. If you’re running your agent on something like Lambda, your max timeout is 15 minutes (and default is 3 seconds). How do you account for all of this?
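Even before you hit platform limits, you end up bolting a timeout onto every individual step – and then deciding what to do with a half-finished refund when one fires. A minimal sketch (the `call_with_timeout` helper is hypothetical):

```python
import asyncio

async def call_with_timeout(coro, seconds: float):
    # Give up on any single step that runs too long; the caller still has to
    # decide what "giving up" means partway through a refund.
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        raise RuntimeError(f"step exceeded {seconds}s – now what?")

# e.g. customer = await call_with_timeout(get_customer_from_db(customer_id), 10)
```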
And then there are some other questions, like:
- How do we know if it’s working?
- How do we handle state so we can keep progress?
- How do we scale it past one machine?
- …and so on and so forth.
You might naively just try to build this logic into your agent. Perhaps a key/value store to hold workflow state in memory, plus a check before each stage to make sure it hasn't already run. You might manually add error handling to each workflow stage. You might implement exponential backoff. All of a sudden your 20 lines of code are more like 200+.
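To make the trap concrete, the hand-rolled version usually starts with something like the sketch below and grows from there – the step names and the in-memory store are hypothetical, and in practice the dict becomes Redis or Postgres and every step sprouts its own retry logic:

```python
# A hand-rolled "workflow engine" in miniature: remember which steps completed
# so a crashed process doesn't re-issue a refund when it restarts.
completed_steps: dict[str, object] = {}

async def run_step(request_id: str, step_name: str, fn, *args):
    key = f"{request_id}:{step_name}"
    if key in completed_steps:        # already done on a previous attempt
        return completed_steps[key]
    result = await fn(*args)
    completed_steps[key] = result     # "checkpoint" the result
    return result

# e.g. customer = await run_step(request_id, "get_customer", get_customer_from_db, customer_id)
```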
These are not new problems! These are all classic distributed systems problems. And since they compound gradually, many engineers go at it the old-fashioned way. A wise person once said that it's practically a rite of passage for every backend engineer to accidentally build their own workflow engine. Speaking of which…
Why don’t you just use workflow engines, then?
You can, but it’s important to mention that empirically, most teams do not. They build their own workflow engines or frameworks, often falling prey to the “incremental trap” wherein each feature you need to add seems independent and reasonable in isolation. This held true for traditional workflows and we’re now seeing it play out for agentic ones too.
That said, there’s no shortage of workflow engines / orchestrators out there. We can develop a rudimentary definition of a workflow engine as something that provides:
- The ability to define a workflow as a series of sequential steps
- State persistence / passing data between steps
- Error handling
- Retries and exponential backoff
- Monitoring and observability
Most engines that meet these criteria were developed or refined during those crazy modern data stack times and designed for building data platforms. I'm talking here about Airflow, Dagster (an Amplify company), Luigi, and Prefect et al. Their fundamental abstraction is the DAG (directed acyclic graph), into which you shoehorn the code you actually want to write. Through years of battle testing on the gnarliest data pipelines the world had to offer (HDFS), these workflow engines with data platform origins have gotten pretty good at moving data around.
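To see what that shoehorning looks like in practice, here is a rough sketch of the refund flow expressed with Airflow's TaskFlow API – the DAG structure is real Airflow, but the task names and bodies are illustrative placeholders:

```python
import pendulum
from airflow.decorators import dag, task

@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def refund_request_dag():
    @task
    def analyze_request(customer_message: str) -> dict:
        ...  # call the LLM

    @task
    def get_customer_data(analysis: dict) -> dict:
        ...  # query the production database

    @task
    def decide_and_refund(customer: dict) -> str:
        ...  # Stripe, Zendesk, email

    # The business logic gets re-expressed as graph wiring between tasks.
    decide_and_refund(get_customer_data(analyze_request("I want a refund for order #1234")))

refund_request_dag()
```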
Then there are the AI pipeline workflow engines: products like MLflow have been around for years, built in an era before GenAI existed. And in addition to these, we've seen an explosion of workflow engines (or agent frameworks) that are supposed to be GenAI native, built specifically for autonomous agents running in production.
So yes, there are a lot of options to choose from. But if agents really are just workflows, there is only one thing that you should care about when it comes to your workflow orchestrator: reliability. Reliability is not just a feature, it is the entire reason agents need a workflow engine. Our two questions – what happens when things go wrong, and what happens when they go long – are the most important challenges stopping us from having agents that actually work. Outside of some commoditized boilerplate, choosing a workflow engine for agents is entirely about downside protection; it is about the process crashes, the timeouts, the hardware failures, and all of the other things that will inevitably not work as intended.
Building a workflow AI agent in Temporal
Before we dive in, a bit about why Temporal is not like other workflow engines.
Temporal was originally developed at Uber (as Cadence) as a "Durable Execution" platform – AKA making sure code executes no matter what. It was responsible for business-critical workflows like payment processing and driver matching, ensuring that they executed no matter what went wrong or how long things took. The technology behind Temporal powers more than 1,000 services at Uber, and today is used by countless large organizations like Stripe, Netflix, and NVIDIA for their most important workflows.
Unlike most workflow engines, Temporal is not organized around the abstraction of the DAG. Instead, you just write your code normally and annotate it with Temporal decorators. Behind the scenes, Temporal’s magic makes sure it runs…no matter what:
- The full running state of a Workflow is durable and fault tolerant by default
- Your business logic can be recovered, replayed, or paused from an arbitrary point
- Workflow activities can run for arbitrary lengths of time, and be automatically retried (forever)
For our agent code, this means that we are covered when things go wrong or long. If a request returns a 500, we can define custom retry policies in Temporal with parameters like maximum attempts, backoff coefficient, intervals, etc. And Temporal will never time out on long-running (think months) requests. Unless you want it to.
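Concretely, a retry policy in the Python SDK looks something like this (the values and the error type are illustrative):

```python
from datetime import timedelta
from temporalio.common import RetryPolicy

# Retry with exponential backoff, cap the interval, give up after 10 attempts,
# and skip retrying errors that will never succeed.
retry_policy = RetryPolicy(
    initial_interval=timedelta(seconds=1),
    backoff_coefficient=2.0,
    maximum_interval=timedelta(minutes=1),
    maximum_attempts=10,
    non_retryable_error_types=["CustomerNotFoundError"],
)
```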
The good news is, we don’t need to totally rewrite our code to make it work in Temporal. All we need to do is wrap our existing functions with Activity decorators (stages in Temporal are called Activities), and then assemble them into a Workflow:
```python
from temporalio import workflow, activity

@activity.defn
async def analyze_request(message: str) -> dict:
    return await openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"Analyze: {message}"}]
    )

@activity.defn
async def get_customer_data(customer_id: str) -> dict:
    return await get_customer_from_db(customer_id)

# and so on and so forth
```
You then define the Workflow itself. You can give each Activity a timeout and a custom retry policy, and decide which Activities can run in parallel or must run sequentially.
```python
import asyncio
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

# Temporal requires a timeout on each activity attempt
TIMEOUT = timedelta(minutes=2)

@workflow.defn
class RefundWorkflow:
    @workflow.run
    async def run(self, customer_message: str) -> str:
        # Each step is automatically retried, state is preserved
        analysis = await workflow.execute_activity(
            analyze_request,
            customer_message,
            start_to_close_timeout=TIMEOUT,
            retry_policy=RetryPolicy(maximum_attempts=3),
        )

        customer = await workflow.execute_activity(
            get_customer_data, analysis.customer_id, start_to_close_timeout=TIMEOUT)
        payments = await workflow.execute_activity(
            check_payment_history, customer.stripe_id, start_to_close_timeout=TIMEOUT)
        should_refund = await workflow.execute_activity(
            make_refund_decision, args=[customer, payments], start_to_close_timeout=TIMEOUT)

        if should_refund:
            # These can run in parallel
            refund, _, _ = await asyncio.gather(
                workflow.execute_activity(
                    process_refund, payments[0].id, start_to_close_timeout=TIMEOUT),
                workflow.execute_activity(
                    update_ticket, args=[customer.ticket_id, "refund_processed"],
                    start_to_close_timeout=TIMEOUT),
                workflow.execute_activity(
                    send_confirmation_email, args=[customer.email, "Refund processed"],
                    start_to_close_timeout=TIMEOUT),
            )
            return "Refund processed successfully"

        return "Refund not approved"
```
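To actually run it, you stand up a worker that hosts the Workflow and its Activities, then start the Workflow from your application. A minimal sketch, assuming the remaining Activities are defined like the ones above (the server address, task queue, and workflow ID are placeholders):

```python
import asyncio
from temporalio.client import Client
from temporalio.worker import Worker

async def main():
    client = await Client.connect("localhost:7233")

    # The worker polls the task queue and executes our Workflow and Activities.
    worker = Worker(
        client,
        task_queue="refund-task-queue",
        workflows=[RefundWorkflow],
        activities=[analyze_request, get_customer_data, check_payment_history,
                    make_refund_decision, process_refund, update_ticket,
                    send_confirmation_email],
    )
    async with worker:
        # Temporal persists the Workflow's state and retries Activities even if
        # this process dies mid-run.
        result = await client.execute_workflow(
            RefundWorkflow.run,
            "I'd like a refund for order #1234",
            id="refund-workflow-001",
            task_queue="refund-task-queue",
        )
        print(result)

if __name__ == "__main__":
    asyncio.run(main())
```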
One important thing to mention is that you don’t need to choose between Temporal and existing agent frameworks. Yesterday Temporal and OpenAI jointly announced a new native integration between the OpenAI Agents SDK and the Temporal Python SDK. The OpenAI Agents SDK lets developers quickly build agentic AI apps in a lightweight, easy-to-use package, while Temporal provides the orchestration to make those applications durable and resilient at scale.
Temporal is open source and free to deploy on your own infrastructure. There’s also Temporal Cloud, which you can try out for free for 90 days. You can check out their example of an AI agent running in Temporal here.