Stas Tushinskiy: Inside the AI Challenge of Making Interactive Voice Ads Actually Viable

Stas Tushinskiy is the CEO and Co-founder of Instreamatic.

Interactive voice advertising – where consumers can verbally respond to brands’ audio ads – has such a high ceiling that I can see only two ways the industry could screw it up. The first would be adopting low-end technology that fails to deliver natural-enough interactivity, providing experiences that are less like conversations and more like those IVR “Say ‘yes’ to continue” systems that pretty much all of us despise. The second would be ignoring or mismanaging the realities of how real people actually interact with voice systems, and the challenges and opportunities therein.

Make no mistake about it, the great strength contributing to interactive voice advertising’s promise is AI capable of understanding user intent and quickly putting that into natural and meaningful dialogues. With this nascent ad format, consumers with microphone-enabled devices – who are listening to streaming audio, podcasts, or other audio-based media – receive interactive ads that include verbal calls-to-action. Listeners are prompted to speak aloud and respond positively to accept the ad’s offer (e.g. downloading an app or receiving more detailed information on a product) or respond negatively to skip the ad and get back to their content.

Interactive voice ads can give new advantages to advertisers and ad publishers, in that they provide much more complete metrics on the number of listeners who are served each ad, convert successfully, or skip the ad, along with other data valuable to ad spend strategy. In a medium that cannot be “clicked” and therefore has not yet been touched by the digital ad revolution led by click-based metrics, interactive voice ads are a new way for brands to optimize ad campaigns based on real-time data, and for publishers to better monetize ad inventory. The potential of these ads has been acknowledged by both Pandora, the largest streaming service in the United States, and Spotify, each of which is currently developing interactive voice ad tests on its platform.

Let’s pull back the hood to reveal a couple of hard truths that are emblematic of the challenge here: a lot of us swear, and a lot of us aren’t really all that clear. In fact, the list of top ten replies to a voice ad is always full of swear words. It’s common for an interactive voice ad prompt to receive an emotional “F*&# no” as a response, or an enthusiastic “Oh f*&# yeah!” from a user naturally expressing excitement. At the same time, we often speak in confusing and contradictory ways that other humans may readily understand, but that make it impossible for a voice interface without capable AI to interpret user intent. We regularly say phrases like “No, I want it!” or “Yes, I’m not interested,” which fool simplistic systems that function by recognizing keywords into giving responses opposite to what the user wants.
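To make the keyword problem concrete, here is a minimal sketch of the kind of simplistic matcher described above. It is purely illustrative: the keyword lists, the function names, and the "accept"/"decline" labels are assumptions made for this example, not how any real voice ad product works. It simply returns the label of the first yes/no keyword it hears.

```python
import re

# Hypothetical keyword lists for a naive yes/no matcher (illustrative only).
POSITIVE_KEYWORDS = {"yes", "yeah", "sure", "ok"}
NEGATIVE_KEYWORDS = {"no", "nope", "skip"}


def keyword_intent(utterance: str) -> str:
    """Label an utterance by the first yes/no keyword it contains."""
    tokens = re.findall(r"[a-z']+", utterance.lower())
    for token in tokens:
        if token in POSITIVE_KEYWORDS:
            return "accept"
        if token in NEGATIVE_KEYWORDS:
            return "decline"
    return "unknown"


# Each pair is (what the listener said, what the listener actually meant).
EXAMPLES = [
    ("No, I want it!", "accept"),
    ("Yes, I'm not interested", "decline"),
]


def misread(examples):
    """Return the utterances whose keyword label contradicts the real intent."""
    return [text for text, meant in examples if keyword_intent(text) != meant]


if __name__ == "__main__":
    for text in misread(EXAMPLES):
        print(f"misread: {text!r} -> {keyword_intent(text)}")
```

Both example phrases end up in the misread list: the matcher latches onto the leading “No” or “Yes” and never weighs the rest of the sentence, which is exactly the gap that intent-level understanding is meant to close.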

The future success of voice ads depends upon users’ goodwill and the quality of the experiences delivered today. If publishers choose to embrace lesser technology that cannot properly understand user intent and provides substandard experiences, the entire ad format could falter out of the gate, sapping its iterative potential. Simple keyword recognition technology that can’t always tell a yes response from a no – or that requires clunky prescribed responses – could quickly sour public opinion about all interactive voice ad technology. These errors will only elicit more swearing than usual from users (which simple systems won’t be capable of offering knowing responses to). Mistake-prone versions of this technology will also fail publishers and advertisers – not only through the lack of compelling experiences, but by providing data full of false positives and negatives due to the poor interpretation of user responses.

In contrast, high-end interactive voice ad technology – powered by natural language understanding (NLU), AI, and deep learning mechanisms able to successfully intuit user intention and iteratively improve to serve users better with each interaction – allows audiences to use any words they wish in a conversational manner and be fully understood. This has the effect of making users feel like they’re actually speaking with someone, rather than talking to a dumb machine.

When it comes to addressing responses laced with F-bombs, sophisticated voice AI systems can read intent and voice tone to respond with “strong yes” and “strong no” responses as appropriate. This ability to offer a joke or respond with enthusiasm equal to the consumer’s emotions is important to brands: normal people having a conversation can handle these situations, so conversational AI must be able to as well. It may sound funny, but achieving the right engagement with users who swear at ads directly depends on this understanding.

Given that interactive voice ads are in their infancy, and given how much user perception matters, a shoddy product could take down this whole industry before it has a chance. That said, if publishers embrace capable AI technologies and users first experience interactive voice ads at their best, the industry is poised to be an unqualified success for publishers, advertisers and consumers alike.