I'm trying to understand how different prompt managers handle errors when updating code that uses generative AI. As a software engineer working with an LLM, you'll likely need to move prompts out of the source code and into a CMS at some point. This allows others, like the product team or a dedicated prompt engineer, to work on them. Even in small projects, clients might want control over the prompts, which makes a prompt CMS necessary.
I reviewed several prompt managers and wrote about setting them up in another post. Now, I want to focus on how these tools handle errors when the code changes or prompts get updated. The first tool I looked at was PromptLayer.
This is a view of what you get in PromptLayer when managing an individual prompt in their registry. My app is a simple one: it has a form that sends its inputs into an OpenAI request and then creates ads based on the response.
Once you have set your prompt up in PromptLayer, you will need an API key from their settings page, and then you implement it in your codebase. Here's a minimal sketch of roughly what that looks like (the SDK method names follow my reading of their docs, and the form field names are my own, so treat it as illustrative rather than copy-paste ready):
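```typescript
import { PromptLayer } from "promptlayer";

const promptLayerClient = new PromptLayer({
  apiKey: process.env.PROMPTLAYER_API_KEY,
});

// The five form fields my app collects, all strings.
const inputVariables = {
  productName: "Solar kettle",
  audience: "Campers",
  tone: "Playful",
  keywords: "off-grid, fast boil",
  callToAction: "Pre-order now",
};

// Fetch the template from the registry and run it with the form inputs.
const response = await promptLayerClient.run({
  promptName: "ad-generator",
  inputVariables,
});
```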
This is a TypeScript implementation. They also have Python, LangChain, and REST API implementations.
Changing the shape of the payload ❌
The form inputs I send to PromptLayer are sent as an object with five keys, all strings. So I changed the shape of this payload to an array of the values to see what would happen. What I got was an error in my console saying "prompt not found". The fact that it threw an error is great, but it wasn't the most helpful error, since the prompt didn't change, only the shape of the payload.
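To make that concrete, this is the change I made (reusing the illustrative field names from earlier):

```typescript
// Keep the values, throw away the keys the template expects.
const payload = Object.values(inputVariables);
// ["Solar kettle", "Campers", "Playful", "off-grid, fast boil", "Pre-order now"]

// The cast sneaks the wrong shape past TypeScript, which would
// otherwise have caught this at compile time.
await promptLayerClient.run({
  promptName: "ad-generator",
  inputVariables: payload as any,
});
// Console: "prompt not found". An error, but a misleading one,
// since the prompt itself never changed.
```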
There were no errors in the prompt manager since the request was never sent. I would have preferred an error to show up in the prompt manager as well, so that, as a prompt engineer, I could tell my developer that someone had changed the shape of the payload coming in.
Sending too much information 🙈
Next, I tried adding an extra key to the payload to see if it would throw an error. The system worked fine. There was no awareness of the excess data now being passed to the request. I tried deleting a variable in the prompt manager and it had the same effect.
I think this is a dangerous case to ignore because it is going to happen so often. The development team updates the payload with new data but the prompt team forgets to consume it. Conversely, the prompt team deletes an input variable but the dev team doesn’t get the message and they continue to maintain everything needed to pass in the unused information. I would have preferred to see both the development and the prompting team being alerted to the excess data.
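Until a prompt manager offers this, a small client-side check can at least surface the drift. Here's a sketch; the expected variable list is hard-coded for brevity, though ideally it would come from the template itself:

```typescript
// Compare the payload's keys against the template's declared input
// variables and report drift in both directions.
function checkPayloadDrift(
  expectedVars: string[],
  payload: Record<string, unknown>,
): { extra: string[]; missing: string[] } {
  const sent = Object.keys(payload);
  return {
    extra: sent.filter((key) => !expectedVars.includes(key)), // dev added, prompt ignores
    missing: expectedVars.filter((key) => !sent.includes(key)), // prompt expects, dev dropped
  };
}

const drift = checkPayloadDrift(
  ["productName", "audience", "tone", "keywords", "callToAction"],
  { ...inputVariables, discountCode: "SUMMER10" }, // the extra key I added
);
if (drift.extra.length) console.warn("Unused payload keys:", drift.extra);
if (drift.missing.length) console.warn("Missing template inputs:", drift.missing);
```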
Not sending enough information 🤐
I deleted one of the keys in the payload and, surprisingly, there were no issues. The prompt just ran as normal, and did its best to make up for the missing context.
This is a particularly dangerous case because the quality of a completion will drop significantly if you start depriving it of critical information. The problem here is that the model will always try to close the gap for you. Deprive a GPT of enough information and sometimes it will tell you it needs more context, or it will ask for clarification. But in a case like this, where I'm sending in data from five different form fields, GPT will never complain about one missing bit.
The solution is simply to let you mark inputs as mandatory or optional. When I added an input variable to the prompt in PromptLayer, or changed the name of an input, it failed silently in the same way. There was no indication to the prompt team that the prompt was not getting all the information it needed.
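In the meantime, this is the kind of guard I'd bolt on client-side (my own safety net, not a PromptLayer feature), using zod to mark which inputs are mandatory and to reject unknown keys:

```typescript
import { z } from "zod";

// A dropped or renamed field now fails loudly before the request is sent.
const AdInputs = z
  .object({
    productName: z.string(),
    audience: z.string(),
    tone: z.string().optional(), // nice-to-have context can be optional
    keywords: z.string(),
    callToAction: z.string(),
  })
  .strict(); // unknown keys are an error too

// Throws with a readable message if a key is missing or misspelled.
const safeInputs = AdInputs.parse(inputVariables);
```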
The way that PromptLayer handles these types of issues is to make you confirm your changes with a commit message (followed by the option to run evals). This is great for figuring out who broke the system, but it's not great at helping someone understand how they're breaking the system while they're breaking it. This approach makes error handling entirely the user's responsibility, when it could be more of a shared responsibility.
Poor Rollback Game 🚫
Rolling back to a previous version of a prompt wasn't possible within PromptLayer; it had to be done in the source code by specifying the version of a prompt you want. I didn't understand this and instead tried to delete a version in PromptLayer, but ended up deleting the whole template, losing all of the changes I'd made to the prompt over the past week. That's my fault, though.
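For reference, the in-code rollback looks something like this (the exact call and parameter name are from my reading of their docs at the time, so verify against the current SDK):

```typescript
// Roll back by pinning the last known-good version in source code.
const template = await promptLayerClient.templates.get("ad-generator", {
  version: 3,
});
```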
In a situation where the prompting team knowingly publishes a version riddled with errors, not giving them the option to roll things back, and instead forcing the developers to make the change, kind of makes the whole point of a prompt manager redundant.
Other low-hanging fruit 🍒
Creating a new prompt to continue testing caused more setup errors. In this scenario, I forgot to configure the settings for the prompt before using it. This meant I had a prompt published, and being used in production, that didn't have an associated model or token count. This feels like a massive oversight since it's such a simple error to catch. I had to debug the issue from my developer console.
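A cheap defence on the consuming side is to refuse to run an unconfigured prompt. The metadata shape below is an assumption, so inspect what your template fetch actually returns:

```typescript
// Assumed shape: the fetched template exposes its model settings in metadata.
if (!template?.metadata?.model) {
  throw new Error(
    'Prompt "ad-generator" has no model configured in PromptLayer. Finish the setup before publishing.',
  );
}
```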
The last problem I had was setting the token count too low when consuming responses as JSON. The bottom of the dashboard screenshot (at the beginning of this post) is their debugger. It gives you a little response preview for each prompt. In the image below you can see that the JSON response just ends abruptly. This causes all kinds of nasty errors in the app, but there's no error in the debugger. The prompt team wouldn't even know they're sending back invalid JSON. This is one of those areas where I feel like the prompt manager could share a little of the responsibility when it comes to avoiding avoidable errors.
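On the app side you can at least catch the truncation yourself. With OpenAI's chat API, a completion cut short by the token limit reports a finish_reason of "length", so a check like this (model name and prompt are placeholders) flags the problem before JSON.parse blows up:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const completion = await openai.chat.completions.create({
  model: "gpt-4", // placeholder: use whatever the template specifies
  max_tokens: 100, // deliberately low, as in my test
  messages: [{ role: "user", content: "...the rendered prompt..." }],
});

const choice = completion.choices[0];
if (choice.finish_reason === "length") {
  throw new Error("Hit the token limit; the JSON response is likely truncated.");
}

let ads: unknown;
try {
  ads = JSON.parse(choice.message.content ?? "");
} catch {
  throw new Error("Model returned invalid JSON; check the prompt and token settings.");
}
```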
While PromptLayer offers a bunch of valuable features for managing prompts, it falls short in a few critical areas when handling errors. As a developer, if I'm going to be working with a client or a product team via a prompt manager, then I want to be relatively confident that obvious error detection is covered for things like payload changes, missing information, or excess data. Implementing more proactive error detection, clearer communication of issues, and user-friendly version control would significantly enhance the reliability and usability of the system.
This teardown was conducted with respect for PromptLayer's team. I hope it was useful to them, and I welcome any feedback or updates to the post.