Large brands cannot afford search and product discovery solutions that work well only in English. To succeed globally, multilingual search capabilities are a must: organizations need tools that deliver effective product discovery experiences for shoppers across diverse languages and regions.

For example, Coveo’s multilingual support empowers companies like Nespresso, Dow, and Philips to scale their ecommerce operations worldwide, ensuring relevant search experiences for a diverse and widespread shopper demographic.

Many search solutions are designed for and optimized in English. With over 7,000 languages spoken globally, that approach is far from sufficient, and companies that fail to adapt risk being outpaced by competitors that do.

Multilingual Search Challenges

Multilingual search poses some unique findability challenges. These issues can arise not only between different languages but even within the same language. On a large, international online store, shoppers from different countries may use entirely distinct words for the same products. For instance, U.K. shoppers might search for “trousers” while U.S. shoppers use “pants” to describe the same item. The same goes for “trainers” in the U.K. and “sneakers” in the U.S., illustrating how regional language variations impact product searches.

But the complexity doesn’t end there. Certain languages add further layers of difficulty, especially when it comes to stemming — the process of reducing words to their root form. This plays a crucial role in product search because it helps the search engine understand and match different variations of a word to the same underlying concept or product. Without stemming, a search engine might fail to connect a shopper’s query with relevant products that have slightly different word forms.

Imagine a user searching for “running shoes” on an ecommerce platform. Without stemming, the search engine might only return results that contain the exact phrase “running shoes” and ignore product listings that use slightly different word forms, such as “best shoes for runners” or “ideal for those who run daily”.
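
To make the “running shoes” example concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy written for this article, not a production algorithm and not how any particular search engine stems text; the suffix list and the helper names (stem, stems, shared_stems) are assumptions chosen for illustration.

```python
# Toy suffix-stripping stemmer: illustrative only, not a production algorithm.
SUFFIXES = ("ers", "ing", "er", "s")  # checked in this order
VOWELS = "aeiou"


def stem(word: str) -> str:
    """Reduce a word to a crude root by stripping one common English suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        root = word[: -len(suffix)]
        if word.endswith(suffix) and len(root) >= 3:
            # Undo consonant doubling: "runn" (from "running") becomes "run".
            if root[-1] == root[-2] and root[-1] not in VOWELS:
                root = root[:-1]
            return root
    return word


def stems(text: str) -> set[str]:
    """Stem every whitespace-separated token in a text."""
    return {stem(token) for token in text.split()}


def shared_stems(query: str, listing: str) -> set[str]:
    """Return the stems that a query and a product listing have in common."""
    return stems(query) & stems(listing)


print(shared_stems("running shoes", "best shoes for runners"))
print(shared_stems("running shoes", "ideal for those who run daily"))
```

With stemming, “running”, “runners”, and “run” all reduce to the same root, so both listings share at least one stem with the query even though neither contains the exact phrase.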

Stemming is relatively straightforward in English, whose simple morphology means that word forms change less with grammatical context. In languages with complex inflectional morphology, such as Polish and other Slavic languages, nouns and verbs change form according to elaborate grammatical rules, so stemming requires more sophisticated rules and search becomes more challenging.

Another Natural Language Processing (NLP) task that is especially important is decompounding. It is crucial for languages such as German or Finnish, where words are often combined into long, descriptive compounds. Breaking these words down into their components is essential for ensuring shoppers are accurately connected with the right products.

Understanding and addressing these linguistic nuances is key to optimizing multilingual search. Let’s dive deeper into decompounding and its relevance.

Understanding the Complexity of Compound Words

Languages like German frequently combine nouns to create compound words that encapsulate complex ideas within a single term. For example, “heimwerkerbedarf” translates to “DIY supplies,” combining “heimwerker” (do-it-yourselfer) and “bedarf” (supplies). Similarly, “esstisch” (dining table) merges “ess” (dining) and “tisch” (table). In an ecommerce context, this presents a significant challenge: how do you ensure that search queries like “tisch” (table) return relevant results when the database primarily contains compound terms?

In Germanic languages such as German, Swedish, or Dutch, compounding is a common way to form new words. Similarly, agglutinative languages like Finnish or Turkish require decompounding (the process of breaking down compound words into their individual components or smaller units) to accurately interpret complex search queries. For instance, a Finnish shopper might search for “kahvinkeittimen suodatinpussi,” meaning “coffee maker’s filter bag.” If the search engine fails to decompound this term, it might miss products relevant to either “coffee makers” or “filter bags,” reducing search accuracy and relevance.
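
As a rough illustration of decompounding, the sketch below splits a compound by looking for a sequence of known words in a lexicon. The lexicon, the minimum part length, and the function name decompound are assumptions made for this example. Real decompounders also have to handle linking elements and inflected forms (such as the genitive in the Finnish query above), which this sketch ignores.

```python
from typing import Optional

# Hypothetical mini-lexicon; a real system would use a large vocabulary.
LEXICON = frozenset({"heimwerker", "bedarf", "ess", "tisch", "hand", "tuch"})
MIN_PART_LEN = 3  # assumed lower bound to avoid splitting off tiny fragments


def decompound(word: str, lexicon: frozenset = LEXICON) -> list[str]:
    """Split a compound into lexicon words; return [word] if no full split exists."""
    word = word.lower()

    def split(rest: str) -> Optional[list[str]]:
        if rest in lexicon:
            return [rest]
        # Try the longest head first so that long lexicon entries win.
        for cut in range(len(rest) - MIN_PART_LEN, MIN_PART_LEN - 1, -1):
            head, tail = rest[:cut], rest[cut:]
            if head in lexicon:
                parts = split(tail)
                if parts is not None:
                    return [head] + parts
        return None

    return split(word) or [word]


print(decompound("heimwerkerbedarf"))
print(decompound("esstisch"))
```

A word that cannot be fully covered by lexicon entries is returned unchanged, which keeps the splitter from inventing components for unknown terms.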

Why Decompounding Matters: Precision & Recall

Precision
Precision measures how many of the results a search engine returns are actually relevant to the query; a high-precision engine returns only relevant results. Decompounding plays a crucial role in enhancing precision by ensuring that all parts of a compound word are considered during the search.

  • Example: If a user searches for “wandfarbeimer” (wall paint bucket), and the search engine correctly decompounds this into “wandfarbe” (wall paint) and “eimer” (bucket), it will avoid returning irrelevant results that only match “farbe” (paint) or “eimer” (bucket). Without decompounding, the search might return a broad set of results, including general paint or any type of bucket, rather than focusing on the specific product the user is interested in.

Recall
Recall measures how many of the relevant items in the catalog a search engine actually returns for a given query. Decompounding improves recall because it lets a query match compounds that contain the query term, not just the exact word.

  • Example: Consider a search for “schraubenzieher” (screwdriver). If the inventory includes “schraubenziehersatz” (screwdriver set), decompounding ensures that this related item is also retrieved. Without decompounding, the search engine might fail to include “schraubenziehersatz” in the results, missing an opportunity to present a relevant product to the user.
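
Precision and recall are usually computed as ratios: precision is the share of returned results that are relevant, and recall is the share of relevant items that were returned. The sketch below shows the arithmetic; the product IDs are hypothetical and chosen only to illustrate the calculation.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Compute (precision, recall) for one query from sets of item IDs."""
    hits = len(retrieved & relevant)  # relevant items that were actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical IDs: three items are relevant, the engine returned four.
relevant = {"screwdriver", "screwdriver-set", "screwdriver-bits"}
retrieved = {"screwdriver", "screwdriver-set", "paint-bucket", "hammer"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four returned items are relevant (precision 2/4) and two of the three relevant items were found (recall 2/3).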

The Problem with Keyword Search and Compound Words

Most search solutions rely on keyword search, which depends on exact or partial matches between query terms and indexed content. However, compound words can present significant challenges. For example, searching for “hand” or “book” separately may not return results that include the compound “handbook.” This mismatch can lead to missed opportunities and a frustrating user experience.

Can Vector Search Sidestep Decompounding Challenges?

It might seem like vector search, which uses word embeddings to capture the semantic meaning of terms, could bypass the need for decompounding. While vector search enhances relevance by understanding the broader context of words, it still faces challenges, especially with compound words.

If a compound term isn’t well represented in the training data, the vector model might struggle to capture its full meaning or match it with relevant product queries. Vector search can complement traditional methods by being more forgiving with compound words, since it places their vectors close to related terms, but it is not a perfect solution. For example, if the model hasn’t learned the specific semantics of a rare compound word, it may not perform as expected.

The Role of Pretrained Models

Using a pretrained model offers advantages because these models are typically trained on large, diverse corpora (large, structured collections of texts or spoken language used for studying and analyzing language patterns), helping them learn the semantics of many compound words. However, there are still limitations:

  • Domain-specific compounds: If compounds are highly domain-specific, a pretrained model might not have encountered them frequently enough during training, leading to suboptimal embeddings.
  • Language-specific compounding: In languages with highly productive compounding, a model pretrained on a general corpus might not capture the full range of possible compounds, necessitating additional preprocessing like decompounding.

How Coveo Solves the Decompounding Challenge

Multilingual search is a complex problem that demands sophisticated solutions. Coveo addresses this challenge head-on, particularly in the case of compounding, through its advanced semantic decompounder. This tool analyzes a customer’s query, breaking it down into its constituent parts, and matches these parts against all relevant product textual fields, including titles, categories, attributes, tags, and descriptions.

Coveo’s decompounding process is robust and operates at both index time and query time, ensuring comprehensive coverage of compound words. At index time, Coveo decompounds all words within a document, adding both the original compound and its individual components to the lexicon. For instance, when indexing a German document containing the word “handtuch” (hand towel), Coveo indexes not only the compound “handtuch” but also the separate components “hand” and “tuch.”

At query-time, Coveo performs a similar decompounding of the search terms. So, if a user searches for “handtuch grün” (green hand towel), Coveo can match documents containing “handtuch” and “grün” as well as those that include the components “hand” and “tuch” alongside “grün.” This dual-layer decompounding ensures that Coveo effectively retrieves the most relevant documents, even when exact compound matches are not present, significantly enhancing both search accuracy and relevance.
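
The general idea of decompounding at both index time and query time can be sketched with a small inverted index. This is an illustrative toy and not Coveo’s implementation: the lexicon, the sample documents, the two-part splitting rule, and the function names (split_compound, build_index, search) are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical lexicon and catalog, for illustration only.
LEXICON = {"hand", "tuch", "bade"}
DOCS = {
    1: "handtuch grün",
    2: "badetuch blau",
    3: "tuch grün",
    4: "tuch hand grün",
}


def split_compound(token: str) -> list[str]:
    """Return two lexicon components if the token splits cleanly, else []."""
    for cut in range(1, len(token)):
        head, tail = token[:cut], token[cut:]
        if head in LEXICON and tail in LEXICON:
            return [head, tail]
    return []


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Index time: store each token and, for compounds, its components."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for term in [token, *split_compound(token)]:
                index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> list[int]:
    """Query time: a token matches via itself or via all of its components."""
    result = None
    for token in query.lower().split():
        matches = set(index.get(token, set()))
        parts = split_compound(token)
        if parts:
            matches |= set.intersection(*(index.get(p, set()) for p in parts))
        result = matches if result is None else result & matches
    return sorted(result or [])


index = build_index(DOCS)
print(search(index, "handtuch grün"))
print(search(index, "tuch"))
```

In this toy catalog, the query “handtuch grün” matches the document containing the compound as well as the one containing “hand” and “tuch” separately, while the query “tuch” reaches documents that only contain compounds ending in “tuch”.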

Semantic Search

Beyond decompounding, Coveo also leverages fine-tuned semantic vector search to ensure smart, relevant inferences are drawn. Coveo’s semantic search model uses vector embeddings to retrieve items from your index based on their semantic similarity to the query. The model creates embeddings for specified enterprise content and references these embeddings at query time, enabling the search engine to understand and match the intent behind complex queries more effectively.
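
Retrieval by semantic similarity is commonly implemented by comparing embedding vectors with cosine similarity. The sketch below shows only that ranking step, under loud assumptions: the three-dimensional vectors are made up by hand rather than produced by any model, and the names ITEM_VECTORS, cosine, and rank are hypothetical. It does not describe how Coveo’s model builds or stores embeddings.

```python
import math

# Made-up 3-dimensional vectors standing in for real embeddings.
ITEM_VECTORS = {
    "handtuch": [0.9, 0.1, 0.0],
    "badetuch": [0.8, 0.3, 0.1],
    "schraubenzieher": [0.0, 0.2, 0.9],
}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vector: list[float], items: dict = ITEM_VECTORS) -> list[str]:
    """Order item names from most to least similar to the query vector."""
    return sorted(
        items,
        key=lambda name: cosine(query_vector, items[name]),
        reverse=True,
    )


# A hand-picked query vector that points in the "towel" direction of this toy space.
print(rank([1.0, 0.0, 0.0]))
```

Because ranking depends only on vector direction, an item can rank highly without sharing any keyword with the query, which is what lets semantic search complement keyword matching and decompounding.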
