IBM’s new Watson large speech model brings generative AI to the phone
Generative AI has entered our everyday lexicon with its impressive text and image generation capabilities and its promise to revolutionize how companies handle core business functions, so most people have probably heard of large language models (LLMs). Connecting with AI through a chat interface, or having it perform a specific task, is now a practical reality, and tremendous progress is being made in applying this technology to positively impact our everyday experiences as individuals and consumers.
But what about the world of voice? With so much attention focused on LLMs as a catalyst for better generative AI chat experiences, few people are talking about how the same technology can be applied to voice-based conversational experiences. Modern contact centers are still dominated by voice conversations (yes, interactive voice response (IVR) is still the standard). Enter large speech models (LSMs). The LLM has a cousin that offers many of the benefits and possibilities you would expect from generative AI, but this time customers can interact with an assistant over the phone.
Over the past few months, the IBM watsonx development team and IBM Research have been hard at work developing a new, state-of-the-art large speech model (LSM). Built on transducer technology, the LSM uses massive amounts of training data and model parameters to deliver highly accurate speech recognition. Designed specifically for customer care use cases such as self-service phone assistants and real-time call transcription, the LSM provides advanced, out-of-the-box transcription capabilities for a seamless customer experience.
We’re excited to announce new LSM deployments in English and Japanese, now available exclusively in closed beta to Watson Speech to Text and watsonx Assistant phone customers.
We could go on and on about how great these models are, but the bottom line is performance. Internal benchmarking shows that the new LSM is our most accurate speech model to date, outperforming OpenAI’s Whisper model on short-form English use cases. We compared the baseline performance of the English LSM against OpenAI’s Whisper model on five real-world customer use cases involving phone calls and found that the IBM LSM had a 42% lower word error rate (WER) than Whisper (see footnote 1 for the evaluation methodology).
The IBM LSM is also 5x smaller than the Whisper model (5x fewer parameters), which means it processes audio 10x faster when running on the same hardware. And because the LSM streams, it finishes processing as soon as the audio ends. Whisper, on the other hand, processes audio in fixed blocks (for example, 30-second windows). To give an example: when processing an audio file shorter than 30 seconds (say, 12 seconds), Whisper pads the remainder with silence and still spends the full 30-second window on it, while the IBM LSM is done as soon as the 12 seconds of audio have arrived.
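To make the block-versus-streaming difference concrete, here is a small, purely illustrative Python sketch. It is not IBM’s or OpenAI’s actual API; the decoder functions, the 30-second block size and the 0.1x real-time compute factor are assumptions used only to show why padding short clips to a fixed window costs extra compute.

```python
import math

# Illustrative only: a toy timing model contrasting fixed-block decoding
# (pad short audio to a full window) with streaming decoding (stop at end of audio).
BLOCK_SECONDS = 30      # assumed fixed window size, as in Whisper-style block processing
REALTIME_FACTOR = 0.1   # assumed compute cost: 0.1 s of processing per 1 s of audio


def block_decode_cost(audio_seconds: float) -> float:
    """Audio is padded with silence up to the next full block, so a 12-second
    clip is still decoded as a 30-second window."""
    padded = math.ceil(audio_seconds / BLOCK_SECONDS) * BLOCK_SECONDS
    return padded * REALTIME_FACTOR


def streaming_decode_cost(audio_seconds: float) -> float:
    """A streaming decoder processes chunks as they arrive and finishes when
    the audio ends, so only the real audio is decoded."""
    return audio_seconds * REALTIME_FACTOR


if __name__ == "__main__":
    for clip in (12, 30, 45):
        print(f"{clip:>3}s clip -> block: {block_decode_cost(clip):.1f}s of compute, "
              f"streaming: {streaming_decode_cost(clip):.1f}s of compute")
```

Under these assumed numbers, a 12-second clip costs 3.0 s of compute in block mode but only 1.2 s when streamed, which is the behavior described above.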
These tests show that the LSM is highly accurate on short-form audio. But there’s more. The LSM also demonstrated accuracy comparable to Whisper on long-form use cases (such as call analytics and call summarization), as shown in the chart below.
How can I get started with this model?
Apply for our closed beta program and our product management team will contact you to schedule a call. The IBM LSM is in closed beta, so some features are still in development (see footnote 2).
Register now to explore LSM.
1 Benchmarking methodology:
- Whisper model used for comparison: medium.en
- Evaluation language: US English
- Metric used for comparison: word error rate (WER), defined as the number of edit errors (substitutions, deletions and insertions) divided by the number of words in the reference (human) transcript; a minimal sketch of this computation appears after this list.
- Before scoring, all machine transcripts were normalized using the Whisper text normalizer to remove formatting differences that could cause inconsistencies in WER.
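For readers who want to see the metric itself, the following is a minimal, self-contained Python sketch of the WER computation described above (word-level edit distance divided by reference length). It is an illustration of the definition, not IBM’s benchmarking code, and the sample transcripts are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein (edit) distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1      # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,                   # deletion
                          d[i][j - 1] + 1,                   # insertion
                          d[i - 1][j - 1] + cost)            # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)


# Made-up transcripts: one substitution ("to" -> "too") and one deletion
# ("support") against a five-word reference gives a WER of 0.4.
print(word_error_rate("transfer me to billing support",
                      "transfer me too billing"))
```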
2 IBM’s statements regarding plans, direction and intent are subject to change or withdrawal without notice, at IBM’s sole discretion. Any information mentioned regarding potential future products is not a promise, commitment or legal obligation to provide any materials, code or functionality. The development, release and timing of future features are at IBM’s discretion.