Bigger isn’t always better: How hybrid AI patterns enable smaller language models
As large language models (LLMs) have entered the common vocabulary, people have discovered ways to use apps that access them. Modern AI tools can create, generate, summarize, translate, classify and even converse. Tools in the generative AI space learn from existing artifacts and then generate new content in response to prompts.
One area where there has not been much innovation is the far edge and constrained devices. We are seeing some versions of AI apps running locally on mobile devices with embedded language translation features, but we haven't reached the point where LLMs create value outside of cloud providers.
But there are smaller models that have the potential to revolutionize Gen AI capabilities on mobile devices. Let's examine these solutions from the perspective of a hybrid AI model.
The basics of LLMs
LLMs are a special class of AI model that supports this new paradigm. Natural language processing (NLP) makes this capability possible. To train an LLM, developers use massive amounts of data from various sources, including the internet. The billions of parameters they process are what make them so large.
While LLMs are knowledgeable about a wide range of topics, they are limited to the data they were trained on, so their knowledge is not always up to date or accurate. Because of their size, LLMs are typically hosted in the cloud, which requires beefy hardware deployments with lots of GPUs.
This means that businesses that want to mine information from their private or proprietary business data cannot use LLMs out of the box. To answer specific questions, generate summaries or create outlines, they must either augment a public LLM with their own data or build their own models. The pattern for appending your own data to an LLM at query time is called retrieval-augmented generation (RAG); it is the Gen AI design pattern for adding external data to the LLM, sketched below.
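To make the pattern concrete, here is a minimal sketch of RAG in Python. The `embed` and `generate` functions are hypothetical placeholders for whatever embedding model and LLM endpoint you actually use; the point is the retrieve-then-augment flow, not the specific implementation.

```python
# Minimal sketch of the RAG pattern: retrieve relevant enterprise documents,
# then pass them to the language model as grounding context.
# `embed` and `generate` are hypothetical stand-ins for a real embedding
# model and a real LLM endpoint.
import math

def embed(text: str) -> list[float]:
    # Hypothetical embedding: hash character trigrams into a small vector,
    # purely for illustration. Replace with a real embedding model.
    vec = [0.0] * 64
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % 64] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def generate(prompt: str) -> str:
    # Hypothetical call to an LLM (public cloud) or SLM (on premises).
    return f"[model answer grounded in a prompt of {len(prompt)} characters]"

# 1. Index proprietary documents that the public LLM has never seen.
documents = [
    "5G cell 42 shows congestion between 6pm and 9pm on weekdays.",
    "Customer churn in the northeast region rose 3% last quarter.",
]
index = [(doc, embed(doc)) for doc in documents]

# 2. Retrieve the most relevant documents for the user's question.
question = "When is cell 42 congested?"
q_vec = embed(question)
top = sorted(index, key=lambda d: cosine(q_vec, d[1]), reverse=True)[:1]

# 3. Augment the prompt with the retrieved context before generation.
context = "\n".join(doc for doc, _ in top)
print(generate(f"Context:\n{context}\n\nQuestion: {question}"))
```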
Is smaller better?
Companies operating in specialized domains, such as telecommunications providers, healthcare services and oil and gas companies, have a laser focus. While they can and do benefit from the common AI scenarios and use cases, they would be better served by smaller models.
For telcos, for example, some of the common use cases are AI assistants in contact centers, personalized offers in service delivery and AI-powered chatbots for an improved customer experience. Use cases that help telcos improve network performance, increase spectral efficiency in 5G networks or pinpoint specific bottlenecks in their networks are best addressed with the company's own data (rather than a public LLM).
That brings us to the notion that smaller is better. There are now small language models (SLMs) that are "smaller" in size compared to LLMs. SLMs typically have a few billion to tens of billions of parameters, while LLMs have hundreds of billions. More importantly, SLMs are trained on data relevant to a specific domain. They may not have broad contextual knowledge, but they perform very well in their chosen domain.
Because these models are small, they can be hosted in an enterprise data center instead of the cloud. SLMs can even run at scale on a single GPU chip, saving thousands of dollars in annual computing costs. That said, as chip design advances, the line between what can only run in the cloud and what can run in an enterprise data center keeps blurring.
Whether for reasons of cost, data privacy or data sovereignty, enterprises may want to run these SLMs in their own data centers. Most businesses do not like sending their data to the cloud. Another key reason is performance: Gen AI at the edge performs computation and inference as close to the data as possible, making it faster and more secure than going through a cloud provider.
It is worth noting that SLMs require less computational power and are ideal for deployment in resource-constrained environments and on mobile devices.
An on-premises example could be an IBM Cloud® Satellite location, which has a secure, high-speed connection to IBM Cloud hosting the LLMs. Telcos could host these SLMs at their base stations and offer this option to their customers as well. It is all a matter of optimizing GPU usage, and keeping inference close to the data cuts latency and the bandwidth spent moving it.
How small can you get?
Going back to the original question of whether these models can run on mobile devices: a mobile device might be a high-end phone, an automobile or even a robot. Device manufacturers have found that running LLMs on them demands significant bandwidth. Tiny LLMs are smaller-scale models that can run locally on mobile phones and medical devices.
Developers build these models using techniques such as low-rank adaptation (LoRA), which lets users fine-tune a model to their unique requirements while keeping the number of trainable parameters relatively low. In fact, there is a TinyLlama project on GitHub, and a rough illustration of the idea appears below.
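As a rough illustration of why LoRA keeps the trainable parameter count low, the NumPy sketch below freezes a weight matrix W and learns only a low-rank update B·A. The dimensions and initialization are illustrative assumptions, not values taken from any particular model.

```python
# Illustrative low-rank adaptation (LoRA): instead of updating the full weight
# matrix W during fine-tuning, learn two small matrices A and B whose product
# B @ A is added to the frozen W. Only A and B are trainable.
# This is a toy NumPy sketch, not a production fine-tuning setup.
import numpy as np

d_in, d_out, rank = 512, 512, 8           # rank is much smaller than d_in, d_out

W = np.random.randn(d_out, d_in) * 0.02   # frozen pretrained weight
A = np.random.randn(rank, d_in) * 0.01    # trainable, shape (rank, d_in)
B = np.zeros((d_out, rank))               # trainable, zero-initialized so W is unchanged at start

def forward(x: np.ndarray) -> np.ndarray:
    # Effective weight is W + B @ A; only A and B receive gradient updates.
    return (W + B @ A) @ x

x = np.random.randn(d_in)
y = forward(x)

full_params = W.size               # 262,144 parameters if fully fine-tuned
lora_params = A.size + B.size      # 8,192 trainable parameters with LoRA
print(f"trainable params: {lora_params} vs {full_params} ({lora_params / full_params:.1%})")
```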
Chip manufacturers are developing chips that can run a trimmed-down version of an LLM through techniques such as quantization and knowledge distillation. Systems-on-chip (SoCs) and neural processing units (NPUs) help edge devices run Gen AI tasks.
Some of these concepts are not yet in production, but solution architects should consider what is possible today. An SLM working in tandem and partnership with an LLM could be a viable solution. Enterprises can decide to use existing smaller, specialized AI models for their industry or create their own to provide a personalized customer experience.
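For context, knowledge distillation trains a compact "student" model to reproduce the output distribution of a larger "teacher" model. The toy NumPy sketch below shows the core distillation loss on made-up logits; it is meant only to illustrate the idea, not any specific vendor's pipeline.

```python
# Toy sketch of knowledge distillation: a small "student" model is trained to
# match the softened output distribution of a large "teacher" model.
# The logits here are made up; in practice they come from real models.
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([4.2, 1.1, 0.3, -0.5])   # from the large model
student_logits = np.array([2.0, 1.5, 0.1, -0.2])   # from the small model

T = 2.0  # temperature softens both distributions so secondary signals are preserved
p_teacher = softmax(teacher_logits, T)
p_student = softmax(student_logits, T)

# Distillation loss: KL divergence between teacher and student distributions.
# Minimizing this (alongside the usual hard-label loss) trains the compact student.
kl = float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))))
print(f"distillation loss (KL): {kl:.4f}")
```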
Is hybrid AI the answer?
While running an SLM on premises is practical and tiny LLMs are attractive on mobile edge devices, what happens when the model needs a larger corpus of data to respond to some prompts?
Hybrid cloud computing offers the best of both worlds. Could the same be true for AI models?
When a smaller model falls short, the hybrid AI model provides the option of reaching an LLM in the public cloud. It makes sense to enable such a setup: it lets enterprises keep their data secure on premises by using domain-specific SLMs, while still being able to tap LLMs in the public cloud when needed. As mobile devices with SoCs become more capable, this looks like a more efficient way to distribute generative AI workloads.
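One way to picture the routing logic is the sketch below: answer with the on-premises, domain-specific SLM when it is confident, and escalate to a public-cloud LLM otherwise. Both model calls and the confidence signal are hypothetical placeholders for your actual inference endpoints, assumed here only for illustration.

```python
# Sketch of a hybrid AI routing pattern: try the on-premises SLM first,
# fall back to a public-cloud LLM only when the SLM is not confident.
from dataclasses import dataclass

@dataclass
class ModelResponse:
    text: str
    confidence: float  # assumed to be exposed or estimated for the SLM

def query_local_slm(prompt: str) -> ModelResponse:
    # Placeholder: call the SLM hosted in the enterprise data center or at the edge.
    in_domain = "network" in prompt.lower()
    return ModelResponse(text="[SLM answer]", confidence=0.9 if in_domain else 0.3)

def query_cloud_llm(prompt: str) -> str:
    # Placeholder: call the general-purpose LLM hosted by a cloud provider.
    return "[LLM answer]"

def hybrid_answer(prompt: str, threshold: float = 0.7) -> str:
    local = query_local_slm(prompt)        # data stays on premises for this step
    if local.confidence >= threshold:
        return local.text                  # domain question: the SLM is sufficient
    return query_cloud_llm(prompt)         # broad question: escalate to the cloud LLM

print(hybrid_answer("Why is network cell 42 congested at night?"))  # handled locally
print(hybrid_answer("Summarize the history of the Roman Empire."))  # routed to the cloud
```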
IBM® recently announced the availability of the open-source Mistral AI model on its watsonx™ platform. This compact LLM requires fewer resources to run, yet it is just as effective and delivers better performance compared to traditional LLMs. IBM also released the Granite 7B model as part of its highly curated, trustworthy family of foundation models.
Our argument is that companies should focus on building small, domain-specific models from their internal enterprise data to differentiate their core competency and unlock insights from that data.
Bigger isn’t always better
Telecommunications providers are a prime example of companies that could benefit from adopting this hybrid AI model. They have a unique role, as they can be both consumers and suppliers. Similar scenarios apply to healthcare, oil rigs, logistics companies and other industries. Are telcos ready to make effective use of Gen AI? We know they have a lot of data, but do they have time-series models that fit that data?
When it comes to AI models, IBM has a multi-model strategy to accommodate each unique use case. Bigger is not always better, as specialized models outperform general-purpose models with lower infrastructure requirements.