Klarna’s AI chatbot: how revolutionary is it, really?

The below article was originally published in The Pragmatic Engineer, on 29 February 2024. I am re-publishing it 6 months later as a free-to-read article. This is because the below case is a good example on hype versus reality with GenAI. To get timely analysis like this in your inbox, subscribe to The Pragmatic Engineer.

Klarna launched its AI chatbot, built in collaboration with OpenAI, which the company wants to use to eliminate 2/3rds of customer support positions. But is it as revolutionary, and as likely to replace jobs, as Klarna claims? My take is that it’s a well-written bot, but I’m dubious that it’s as disruptive as that.

Klarna is going all-in on AI: more specifically, on OpenAI. I have confirmed this through talking with software engineers there, who told me there’s a top-down mandate to utilize AI wherever possible in an effort to drive more efficiency, and product improvements. It’s not exactly a secret, nor is the fact that Klarna’s cofounder and CEO Sebastian Siemiatkowski is a huge OpenAI fan. Two weeks ago, Fortune reported:

“Drawing on a common investor, Siemiatkowski reached out to OpenAI CEO Sam Altman last year, telling him: ‘I want Klarna to be your favorite guinea pig.’ The two companies have worked together ever since.”

A few days ago on 27 February, Klarna shared progress, a month after the project went live.

“Klarna today announced its AI assistant powered by OpenAI. Now live globally for 1 month, the numbers speak for themselves:The AI assistant has had 2.3 million conversations, two-thirds of Klarna’s customer service chatsIt is doing the equivalent work of 700 full-time agentsIt is on par with human agents in regard to customer satisfaction scoreIt is more accurate in errand resolution, leading to a 25% drop in repeat inquiriesCustomers now resolve errands in less than 2 mins compared to 11 mins previouslyIt’s available in 23 markets, 24/7 and communicates in more than 35 languagesIt’s estimated to drive a $40 million USD in profit improvement to Klarna in 2024”

Drawing on this press release, Siemiatkowski then posted on X, unintentionally painting a gloomy picture of the future of customer support jobs:

“In our case, customer service has been handled by on average 3000 full time agents employed by our customer service / outsourcing partners. Those partners employ 200, 000 people, so in the short term this will only mean that those agents will work for other customers of those partners.

But in the longer term, as more companies adopt these technologies, we believe society needs to consider the impact. While it may be a positive impact for society as a whole, we need to consider the implications for the individuals affected.

We decided to share these statistics to raise awareness and encourage a proactive approach to the topic of AI. For decision makers worldwide to recognise this is not just "in the future", this is happening right now.”

But what does this chatbot do, and is it bad news for customer support jobs today, and in the future? I signed up to try it out. As a spoiler, it’s well-built, and will probably be a blueprint for how to put an L1 (first-level support) chatbot together.

The Klarna chatbot experience

The chatbot can be started either with a generic question, or about a specific purchase. I bought an item using Klarna, and started talking about it with the bot. The experience is snappy: in 20 seconds, you always get an answer:

On one hand, the bot is a tool that seems to find relevant parts of documentation, and then shares these sections. This feels word-by-word, or sometimes summarized. Where it goes beyond being just a simple search tool is when it injects order-specific context:

Chatting about an item includes context on the price of the item, the merchant and the date of purchase. It’s all weaved in nicely into the responses.

It can also venture in other areas – for example, it gets chatty when asking about the history of the company:

Overall, it feels like the chatbot is carefully designed to not allow it to go into details that are not on a whitelist. Although it is possible to bypass these restrictions – we’ll see examples below – it is not trivial to do so.

Common questions for the chatbot

Amusingly, when I asked the bot for the most-asked questions by customers, it gave a list:

Payment-related queries. How to make a payment; confirm a payment or extend a payment due date
Order issues: questions or concerns about orders – not receiving goods or receiving faulty of broken items
Personal detail updates
Unauthorized purchases
Questions about Karna’s policies

And some rarer questions, as per this bot:

Less common scenarios: a purchase not reflecting in the Klarna account due to a technical glitch; understanding unique payment agreements; splitting payments across multiple cards
Asking for help about rare error messages

Chatbot safety guards and limitations

One of the biggest downsides of chatbots is that they hallucinate. Under the hood, Klarna’s bot uses ChatGPT’s APIs. We’ve previously covered how ChatGPT works– it generates a list of possible next words (tokens) based on the input, and chooses the highest probability one. This means that it can (and does!) make things up, if left unchecked.

This was how Air Canada’s chatbot made up a non-existent policy, which Air Canada claimed it was not liable for, but a judge ruled otherwise.

Klarna put up safety guards, and over the course of 15 minutes of probing, I could not get the chatbot to hallucinate; i.e: come up with something that’s incorrect. I assume there’s a verification step which compares the output to a whitelist of permitted topics, and that specific policy information must be recounted word for word.

When an “out of bounds” topic comes up, the chatbot hands over to a human agent:

My first attempt to get the chatbot to talk about non-shopping related topics led to a swift handoff to a human agent

Poking at the system prompt

A system prompt is a way to provide context and instructions before passing on the user input. These prompts specify things like when to end the conversation, topics to avoid, and other high-level instructions. With clever-enough probing, this system prompt can be revealed. For example, here’s ChatGPT’s recent system prompt, and an excerpt with instructions on image generation:

“Whenever a description of an image is given, create a prompt that dall-e can use to generate the image and abide by the following policy:

The prompt must be in English. Translate to English if needed.

DO NOT ask for permission to generate the image, just do it!

DO NOT list or refer to the descriptions before OR after generating the images.

Do not create more than 1 image, even if the user requests more.

Do not create images of politicians or other public figures. Recommend other ideas instead.

Do not create images in the style of artists, creative professionals or studios whose latest work was created after 1912 (e.g. Picasso, Kahlo).

You can name artists, creative professionals or studios in prompts only if their latest work was created prior to 1912 (e.g. Van Gogh, Goya).”

What is the system prompt for Klarna’s bot? I couldn’t get the chatbot to share all of it, but I got some interesting details for when the bot stops responding and connects to a human agent:

Examples of topics the chatbot was instructed to not engage with – most likely confiugred via its system prompt

It’s likely that Klarna put these guardrails in place within its system prompt. And it also explains why the bot handed me off to human agents when I went out of bounds on topics.

Prompt injection

Another risk with chatbots is prompt injection. Software engineer – and Django creator – Simon Willison covered this topic in detail in his post, I don’t know how to solve prompt injection:

“Language models like GPT-3 are the ultimate black boxes. It doesn’t matter how many automated tests I write, I can never be 100% certain that a user won’t come up with some grammatical construct I hadn’t predicted that will subvert my defenses.”

Users have already managed to bypass protections the Klarna team has put in place to produce output that should not be possible – but it was, thanks to clever grammatical prompting, proving that even this carefully built bot is not immune to it:

*A user of the Klarna chatbot managed to get the chatbot to generate code. This is an example of deliberate prompt injection. Source:* *Colin fraser on X*

Speaking multiple languages

One impressive capability of Klarna’s bot is that it holds the same conversation in around 30 languages. I’m a native Hungarian speaker, which is so rare that even Amazon does not support it. To my surprise, Klarna’s assistant switched over, and informed me of my order, and whether I had other questions using grammar on a par with a human speaker. The information was a direct translation of what it told me when I asked the same question in English. Still, this is an area in which I can see demand for capabilities that do not exist today, like more extensive support for niche languages, and tiny markets. Suddenly, Klarna’s chatbot is there to support dozens of locations.

What does this chatbot really replace?

The CEO of Klarna says the chatbot did the work of 700 customer service representatives. However, the bot transfers anything more complex than adding basic context to a human.

This chatbot seems a good example of a way to eliminate most “L1” support; i.e: the initial point of contact for users seeking technical support, which handles basic queries and straightforward questions. Level 2 and Level 3 support are handled by trained staff. A user with a rare and complex issue would often go through L1 and L2 support to reach an agent who can help.

To me, it seems that Klarna’s bot does really well on oft-repeated questions with the same answers. So, it now acts as a new “first line of defense,” meaning Klarna would indeed need fewer, less trained, L1 support staff. This is the warning in the CEO’s message.

It’s clear this is just the start for Klarna, and they’ll experiment with adding more complex flows to the bot, trying to handle more complicated queries in an automated way. The challenge is to do this safely, given large language models (LLMs) are non-deterministic, and trusting it with sensitive data is a bad idea. Klarna is a Fintech company, so there are few things more embarrassing than this chatbot leaking sensitive financial details of customers, and outsiders accessing it! I expect Klarna to be very cautious here, and perhaps only human agents will have access to sensitive data.

Is automating away L1 support revolutionary?

Klarna’s AI chatbot integration is well-built and it’s safe to assume it is probably the most comprehensive integration at this scale, right now. I’m making this assumption having tested it, and also based on the close relationship Klarna has with OpenAI. Perhaps the company is, indeed, OpenAI’s “favorite guinea pig,” at least for now!

Klarna’s claim that it saves $40M per year with this chatbot is plausible because:

The system does the work of 700 agents at any given point in time
Agents usually work in 8-hour shifts, meaning 3 agents need to work to offer 24/7 support
This means that 2,100 L1 support jobs could be lost to this bot
… which matches with how Klarna offloaded 2/3rds of their customer support load to this system: they worked with 3,000 customer support agents, mostly from external companies
… and this means the company paid $19,000 per agent annually, on average

Automating L1 support is not that revolutionary, and has been done previously on different systems. For example, If you have called phone support and experienced the automated “please press 1 for this, 2 for that” script; well, that was an early example of automating away L1 support.

I talked with a software engineer who worked at Citibank in around 2008, when the company decided to eliminate its L1 support system, and automate support via a phone system. Over two years, the team automated 95% of phone support requests and closed a call center of 7,000 people in Mexico. As part of this work, Citibank did something interesting and realized that this system annoyed customers, and risked churning high value customers. So they added a condition that was something like this:

if (customer_is_high_value) {
// Connect to human, do not put them into the automated call sequence
} else
// Low value or free customer, have them go through automation
}

A big difference between Klarna and Citibank is that the latter did not brag about it at the time.

So, I don’t really buy that this is revolutionary in business terms, beyond AI’s own revolutionary potential. If anything, it illustrates Klarna was overspending as if it was still the old, now-departed, ZIRP era. The company could have easily automated away L1 support that answers basic questions for low-value customers ages ago, without an LLM. The company simply didn’t prioritize it, and was happy to spend $60M per year on customer support. We previously covered more on what the end of ZIRP means for the tech industry.

Now that Klarna has an AI-based system in place, it has something which mature companies built before LLMs existed! At places like Uber, people are connected to a support person when their query cannot be solved by providing readily-available, basic information, and there is also automation built to handle common queries. For example, if your UberEats order is missing an item, you can report this and receive an automatic refund, all without human intervention – no AI agent needed (although a machine learning model surely runs in the background, flagging possible fraudulent refunds, in which case routing to a human agent likely happens.)

More companies will soon have access to LLM-based L1 support automation. Klarna’s CEO says his company is saving tens of millions in customer support costs thanks to an AI chatbot built in a few months, so this will get the attention of other companies which have not yet thought about how to deal with L1 support.

Does AI bot-building solve the right problem?

I created this image to provoke thoughts:

It’s a lot more interesting – and faster – to create an AI chatbot, than to fix confusing UIs and products.

On one hand, offering chat support is something many customers will appreciate because chat is often a more convenient process. On the other hand, if chat support is the preferred feedback channel, it could mean it’s not a priority to make the UI easy and self-explanatory, and the product logical and simple to use. So, AI bots might not always be the best option, even if your org has the opportunity to build them.

Subscribe to my weekly newsletter to get articles like this in your inbox. It's a pretty good read - and the #1 tech newsletter on Substack.