
The Multilingual Analysis Stack We Built for Real-World Search
- Medcl
- Engineering
- May 23, 2026
Search quality is decided before ranking even starts.
If your analysis layer is weak, every downstream component pays for it: matching, highlighting, aggregations, typo tolerance, and query understanding. That is why we invested heavily in Pizza Analysis and turned it into one of the strongest parts of the platform.
Today, Pizza Analysis is not a single tokenizer with a few stopwords. It is a multilingual analysis stack designed for real production workloads.
What We Built
At a high level, Pizza Analysis provides:
- 39 analysis plugins
- 27 dedicated language plugin crates
- 70+ prebuilt analyzers
- 140+ token filters
- 27 tokenizers
- 13 input normalizers
And more importantly, it does this with a modular architecture teams can actually operate, test, and evolve.
Explore on GitHub
- Analysis meta-crate (all plugins): github.com/pizza-rs/analysis-all
- Auto language analyzer: github.com/pizza-rs/analysis-auto
- Synonym plugin (with hot-reload): github.com/pizza-rs/analysis-synonym
- Core analysis primitives: github.com/pizza-rs/analysis-core
Why This Matters
Many production datasets are multilingual, noisy, and domain-specific.
A single index often holds a mix of:
- product catalogs with transliterated brand names
- support tickets in multiple languages
- documents with CJK scripts and Latin fragments in the same field
- short, ambiguous user queries
- entity aliases and phrase-level synonym variants
A generic analysis chain cannot handle this well. You need language-aware pipelines and a controlled way to compose them.
The Architecture in Practice
Pizza Analysis follows a clear pipeline model:
- Character/input normalization
- Tokenization
- Token filtering (lowercase, stopwords, stemming, synonym graph, etc.)
- Analyzer composition for each language or use case
The model is simple, but the payoff is concrete: each stage has a single responsibility, which makes behavior deterministic, debugging easier, and iteration safer.
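To make the stage boundaries concrete, here is a minimal std-only sketch of the same four-stage shape. It is not the Pizza API: the function names (`normalize`, `tokenize`, `filter_tokens`, `analyze`) and the whitespace/alphanumeric rules are assumptions chosen purely for illustration.

```rust
/// Stage 1: character-level normalization (here: trim and collapse whitespace runs).
fn normalize(input: &str) -> String {
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Stage 2: tokenization (here: split on anything that is not alphanumeric).
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}

/// Stage 3: token filters applied in order (lowercase, then stopword removal).
fn filter_tokens(tokens: Vec<String>, stopwords: &[&str]) -> Vec<String> {
    tokens
        .into_iter()
        .map(|t| t.to_lowercase())
        .filter(|t| !stopwords.contains(&t.as_str()))
        .collect()
}

/// Stage 4: an "analyzer" is just the composition of the stages above.
fn analyze(input: &str, stopwords: &[&str]) -> Vec<String> {
    filter_tokens(tokenize(&normalize(input)), stopwords)
}
```

Because each stage is a plain function, each one can be tested on its own, which is the property the pipeline model is meant to provide.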
System Architecture Diagram
flowchart LR
A[Raw Text] --> B[Input Normalizers]
B --> C[Tokenizer]
C --> D[Token Filters]
D --> E[Index Terms]
subgraph Registry[AnalysisFactory]
R1[core]
R2[language crates]
R3[cjk/asian crates]
R4[synonym]
R5[auto]
end
Registry --> B
Registry --> C
Registry --> D
Runtime Routing Diagram (Auto Analyzer)
flowchart TD
T[Input Text] --> L[whatlang Detection]
L --> C{confidence >= threshold?}
C -- no --> S[standard analyzer]
C -- yes --> O{override exists?}
O -- yes --> A1[override analyzer]
O -- no --> A2[default language mapping]
A1 --> OUT[Final token stream]
A2 --> OUT
S --> OUT
Technical Deep Dive
Pizza Analysis is deliberately split into dedicated crates so each language family can evolve independently without destabilizing the rest of the stack.
Analyzer Composition Strategy
A typical high-quality analyzer chain is:
- script-aware normalization
- language-appropriate tokenization
- stopword filtering
- stem/lemmatize (when available)
- optional synonym graph expansion
For example, Indic analyzers use indic_normalization as a shared base, then add language-specific normalization and stemming where available.
CJK Strategy Selection
Instead of one CJK fallback for all scenarios, Pizza lets you pick a strategy per workload:
- ik for practical Chinese segmentation in many search workloads
- jieba for flexible Chinese segmentation behavior
- smartcn for dictionary/statistical segmentation style
- kuromoji for Japanese morphology-aware tokenization
- nori for Korean morphology-aware tokenization
This is essential for balancing recall vs precision depending on corpus shape and query length distribution.
Code Examples (Real Usage)
1. Register a full multilingual stack
use pizza_engine::analysis::AnalysisFactory;
let mut factory = AnalysisFactory::new();
// Register everything (core + language plugins + auto)
pizza_analysis_all::register_all(&mut factory);
// Pick analyzers by name
let english = factory.get_analyzer("english").unwrap();
let japanese = factory.get_analyzer("kuromoji").unwrap();
let auto = factory.get_analyzer("auto").unwrap();
2. Use the auto analyzer with confidence and overrides
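use pizza_analysis_auto::AutoTokenizer;
// `analyzers` (the per-language analyzer set) and `fallback` are built elsewhere;
// their construction is not shown in this snippet.
let mut auto_tokenizer = AutoTokenizer::new(analyzers, fallback);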
// Default is 0.3; make it stricter for production traffic
auto_tokenizer.set_confidence_threshold(0.5);
// Route Chinese detection to jieba instead of default ik
auto_tokenizer.set_override("cmn", "jieba");
// Route Japanese detection to cjk bigram for a specific use case
auto_tokenizer.set_override("jpn", "cjk");
3. Hot-reload synonyms without restart
use pizza_analysis_synonym::{SynonymFilter, SynonymParser};
let filter = SynonymFilter::empty();
let handle = filter.reload_handle();
// Initial load
let map_v1 = SynonymParser::new().parse("tv, television\nny => new york");
handle.reload(map_v1);
// Later: update from config/file watcher endpoint
let map_v2 = SynonymParser::new().parse("tv, television, smart tv\nny => new york city");
handle.reload(map_v2);
This is exactly the kind of runtime workflow relevance teams need when tuning live systems.
Deep Language Coverage
We support broad language coverage across dedicated analyzers and specialized segmentation engines.
Dedicated per-language analyzers
Pizza ships dedicated analyzers for major languages, including:
- English, French, German, Spanish, Italian, Portuguese, Dutch
- Russian, Greek, Norwegian, Swedish, Finnish, Hungarian, Turkish
- Arabic, Persian, Hindi, Bengali, Indonesian, Brazilian Portuguese
- Vietnamese, Thai
- Tamil, Telugu, Kannada, Malayalam
These analyzers are not cosmetic wrappers. They encode language-specific normalization, stopword strategy, and stemming behavior where available.
Representative language crates:
- Vietnamese: github.com/pizza-rs/analysis-vietnamese
- Thai: github.com/pizza-rs/analysis-thai
- Tamil: github.com/pizza-rs/analysis-tamil
- Telugu: github.com/pizza-rs/analysis-telugu
- Kannada: github.com/pizza-rs/analysis-kannada
- Malayalam: github.com/pizza-rs/analysis-malayalam
CJK and script-specific specialization
For East Asian text, quality depends on segmentation strategy. Pizza includes dedicated analyzers/tokenizers for:
- IK for Chinese segmentation
- Jieba for Chinese segmentation
- SmartCN for Chinese segmentation
- Kuromoji for Japanese morphological analysis
- Nori for Korean morphological analysis
- CJK bigram pipeline for fallback and mixed-script scenarios
This gives teams practical control instead of forcing one segmentation approach across all workloads.
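For readers unfamiliar with the bigram fallback, the idea can be sketched in a few lines of std-only Rust. This is an illustration of the general technique, not the Pizza implementation: the function name `cjk_bigrams` is hypothetical, and the sketch assumes the input is already a pure CJK run (script detection and mixed Latin handling are left out).

```rust
/// Emit overlapping two-character terms from a run of CJK text.
/// A single-character input is emitted as-is so it stays searchable.
fn cjk_bigrams(text: &str) -> Vec<String> {
    // Work on chars, not bytes: CJK characters are multi-byte in UTF-8.
    let chars: Vec<char> = text.chars().filter(|c| !c.is_whitespace()).collect();
    if chars.len() < 2 {
        return chars.iter().map(|c| c.to_string()).collect();
    }
    // Each window of two adjacent characters becomes one term.
    chars.windows(2).map(|w| w.iter().collect()).collect()
}
```

Bigrams need no dictionary, which is why they work as a fallback, but they emit roughly one term per character, so they tend to trade index size and precision for recall.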
Related CJK repos:
- IK: github.com/pizza-rs/analysis-ik
- Jieba: github.com/pizza-rs/analysis-jieba
- SmartCN: github.com/pizza-rs/analysis-smartcn
- Kuromoji: github.com/pizza-rs/analysis-kuromoji
- Nori: github.com/pizza-rs/analysis-nori
IK Performance Optimization (Technical Notes)
IK is one of the most performance-sensitive analyzers in real deployments because Chinese segmentation is often on hot query/index paths.
Our optimization focus for IK workloads is:
- dictionary traversal efficiency (reduce repeated branch work)
- token emission discipline (avoid unnecessary intermediate allocations)
- short-circuit logic for non-productive branches
- analyzer-chain ordering to minimize downstream filter cost
Practical tuning pattern:
- profile segmentation hotspots on real corpus distributions
- benchmark IK smart vs max-word by workload (index-time vs query-time)
- measure token count inflation and its impact on posting-list fanout
- tune analyzer chain to keep useful recall while controlling term explosion
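One way to put a number on token count inflation is to compare per-document token counts from two segmentation modes over the same corpus sample. The helper below is a hypothetical sketch: the name `inflation` and the idea of feeding it precomputed counts are assumptions, and the counts themselves would come from running your own analyzers.

```rust
/// Ratio of total tokens emitted by a fine-grained mode to those emitted by a
/// coarse mode over the same documents. Returns None when the coarse mode
/// emitted no tokens, since the ratio is undefined in that case.
fn inflation(coarse_counts: &[usize], fine_counts: &[usize]) -> Option<f64> {
    let coarse: usize = coarse_counts.iter().sum();
    let fine: usize = fine_counts.iter().sum();
    if coarse == 0 {
        None
    } else {
        Some(fine as f64 / coarse as f64)
    }
}
```

A ratio well above 1.0 at index time means more postings per document, which is the cost to weigh against the extra recall.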
Dedicated repo:
- IK plugin: github.com/pizza-rs/analysis-ik
Pinyin in Production Pipelines
Pinyin support is critical for Chinese name/entity retrieval and cross-input-method matching.
Typical usage patterns:
- Chinese characters -> full pinyin tokens for tolerant lookup
- first-letter pinyin for shorthand query behavior
- dual-field indexing (original Han + pinyin-normalized field)
Pinyin and Simplified/Traditional (ST) conversion are usually combined in multilingual Chinese deployments.
Dedicated repos:
- Pinyin plugin: github.com/pizza-rs/analysis-pinyin
- Simplified/Traditional conversion: github.com/pizza-rs/analysis-stconvert
CJK Execution Flow (IK + Pinyin + STConvert)
flowchart LR
Q[Chinese Input Text] --> N[Optional STConvert]
N --> IK[IK/Jieba/SmartCN Segmentation]
IK --> F[Lowercase + Stop + Optional SynonymGraph]
F --> IDX[Index Terms]
Q --> P[Pinyin Normalizer/Tokenizer]
P --> PF[Pinyin Tokens + First-Letter Tokens]
PF --> IDX
This dual-lane design is commonly used to preserve Han-character precision while adding pinyin recall paths for transliterated user input.
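The pinyin lane can be illustrated with a toy example. Everything here is an assumption for illustration: the four-character lookup table stands in for a real dictionary, tones are omitted, and `pinyin_of` and `pinyin_lane` are hypothetical names rather than functions from the pinyin plugin.

```rust
/// Toy lookup table; a real plugin would ship a full dictionary.
fn pinyin_of(c: char) -> Option<&'static str> {
    match c {
        '北' => Some("bei"),
        '京' => Some("jing"),
        '上' => Some("shang"),
        '海' => Some("hai"),
        _ => None,
    }
}

/// Produce the two outputs of the pinyin lane:
/// full pinyin tokens and the first-letter abbreviation.
/// Characters missing from the table are skipped.
fn pinyin_lane(text: &str) -> (Vec<String>, String) {
    let full: Vec<String> = text
        .chars()
        .filter_map(pinyin_of)
        .map(String::from)
        .collect();
    let initials: String = full.iter().filter_map(|p| p.chars().next()).collect();
    (full, initials)
}
```

Indexing these terms in a separate field next to the Han-character terms is what lets a query typed as "bj" or "beijing" reach a document containing 北京.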
Dedicated Repository Matrix
| Area | Repo |
|---|---|
| Meta integration | github.com/pizza-rs/analysis-all |
| Core analysis primitives | github.com/pizza-rs/analysis-core |
| Auto language routing | github.com/pizza-rs/analysis-auto |
| Synonym + graph + hot-reload | github.com/pizza-rs/analysis-synonym |
| Chinese IK | github.com/pizza-rs/analysis-ik |
| Chinese Jieba | github.com/pizza-rs/analysis-jieba |
| Chinese SmartCN | github.com/pizza-rs/analysis-smartcn |
| Japanese Kuromoji | github.com/pizza-rs/analysis-kuromoji |
| Korean Nori | github.com/pizza-rs/analysis-nori |
| Chinese Pinyin | github.com/pizza-rs/analysis-pinyin |
| Simplified/Traditional conversion | github.com/pizza-rs/analysis-stconvert |
| Vietnamese | github.com/pizza-rs/analysis-vietnamese |
| Thai | github.com/pizza-rs/analysis-thai |
| Tamil | github.com/pizza-rs/analysis-tamil |
| Telugu | github.com/pizza-rs/analysis-telugu |
| Kannada | github.com/pizza-rs/analysis-kannada |
| Malayalam | github.com/pizza-rs/analysis-malayalam |
Support Matrix
The table below summarizes what is available today in Pizza Analysis.
| Capability | Coverage | Notes |
|---|---|---|
| Plugin ecosystem | 39 plugins | Modular crates, feature-gated composition |
| Dedicated language crates | 27 | Per-language normalization/stopword/stemming pipelines |
| Prebuilt analyzers | 70+ | Ready-to-use analyzers for indexing/querying |
| Token filters | 140+ | Includes stemming, graph synonyms, script normalization |
| Tokenizers | 27 | Standard + specialized CJK/Asian tokenizers |
| Input normalizers | 13 | Character-level normalization stage |
| Auto language routing | Yes | whatlang-based detection + threshold + per-language overrides |
| Synonym hot-reload | Yes | Runtime map replacement without restart |
| CJK specialization | IK, Jieba, SmartCN, Kuromoji, Nori, CJK bigram | Multiple segmentation strategies by workload |
| Indic dedicated analyzers | Tamil, Telugu, Kannada, Malayalam, Hindi, Bengali | Includes script-specific normalization and stemming where available |
| Southeast Asian analyzers | Vietnamese, Thai, Indonesian | Includes dedicated Vietnamese and Thai pipelines |
| no_std compatibility | Yes | Designed for embedded/low-footprint deployments |
Auto Language Detection That Actually Helps
The auto analyzer detects the language of the input at runtime and routes it to the analyzer mapped to that language.
It supports:
- confidence threshold control
- override by detected language code
- dedicated mapping for CJK analyzers
- dedicated mappings for newly added Indic analyzers
This is critical when language is unknown or mixed at ingestion time. Instead of forcing index-level segmentation decisions too early, teams can adapt per field, per document, and even per use case.
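The routing decision itself is small enough to sketch. The `Router` type below is hypothetical and is not the `pizza_analysis_auto` API; it only mirrors the decision order described in this post (threshold check, then override, then default mapping). Sending unmapped languages to the fallback is an assumption of the sketch.

```rust
use std::collections::HashMap;

/// Hypothetical router; illustrates the decision order only.
struct Router {
    threshold: f64,
    fallback: String,
    defaults: HashMap<String, String>,
    overrides: HashMap<String, String>,
}

impl Router {
    fn new(threshold: f64, fallback: &str) -> Self {
        Router {
            threshold,
            fallback: fallback.to_string(),
            defaults: HashMap::new(),
            overrides: HashMap::new(),
        }
    }

    fn set_default(&mut self, lang: &str, analyzer: &str) {
        self.defaults.insert(lang.to_string(), analyzer.to_string());
    }

    fn set_override(&mut self, lang: &str, analyzer: &str) {
        self.overrides.insert(lang.to_string(), analyzer.to_string());
    }

    /// `lang` and `confidence` stand in for a language detector's output.
    fn route(&self, lang: &str, confidence: f64) -> &str {
        // Low-confidence detections never reach the language-specific maps.
        if confidence < self.threshold {
            return self.fallback.as_str();
        }
        // Overrides win over defaults; unmapped languages use the fallback.
        self.overrides
            .get(lang)
            .or_else(|| self.defaults.get(lang))
            .map(String::as_str)
            .unwrap_or(self.fallback.as_str())
    }
}
```

With a threshold of 0.5, a default of "ik" for "cmn", and an override of "jieba" for "cmn", `route("cmn", 0.9)` yields "jieba", while any detection below 0.5 yields the fallback.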
Production-Grade Synonyms with Hot Reload
Synonyms are where relevance engineering often becomes operationally painful. We focused on making this production-friendly.
Pizza supports:
- synonym and synonym_graph filters
- single-word and multi-word mappings
- expand and contract modes
- dynamic hot reload
Recent improvements enable lock-safe map swapping so synonym updates can be applied without disruptive restarts.
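A common way to get this kind of swap with only the standard library is to keep the current map behind an `Arc` inside an `RwLock`. The sketch below shows that general pattern under stated assumptions; it is not the synonym plugin's actual implementation, and `SynonymSlot`, `snapshot`, `reload`, and `expand` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type SynonymMap = HashMap<String, Vec<String>>;

/// Shared slot holding the current synonym map.
#[derive(Clone)]
struct SynonymSlot {
    current: Arc<RwLock<Arc<SynonymMap>>>,
}

impl SynonymSlot {
    fn new(map: SynonymMap) -> Self {
        Self { current: Arc::new(RwLock::new(Arc::new(map))) }
    }

    /// Readers take a cheap snapshot; the lock is held only to clone the Arc.
    fn snapshot(&self) -> Arc<SynonymMap> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Writers build the new map off to the side, then swap the pointer.
    fn reload(&self, map: SynonymMap) {
        *self.current.write().unwrap() = Arc::new(map);
    }
}

/// Expand one token using whichever snapshot the caller holds.
/// Tokens without an entry pass through unchanged.
fn expand(map: &SynonymMap, token: &str) -> Vec<String> {
    match map.get(token) {
        Some(synonyms) => synonyms.clone(),
        None => vec![token.to_string()],
    }
}
```

An in-flight request keeps using the snapshot it started with, so a reload never changes the map halfway through analyzing a single document; the old map is freed once the last snapshot is dropped.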
Why Pizza Analysis Is Different
The key difference is not one single feature. It is the combination:
- strong language coverage
- composable architecture
- specialized analyzers for hard scripts
- operational features for continuous relevance tuning
- modular packaging via plugin crates
This lets product and search teams ship relevance improvements continuously rather than in risky quarterly batches.
Real Workload Scenarios
Global ecommerce
- multilingual titles and attributes
- variant brand spellings and aliases
- language-dependent stemming and stopwords
Result: better recall without sacrificing precision.
Support and knowledge search
- mixed-language tickets and comments
- phrase-level synonym handling
- robust normalization for messy text
Result: better answer findability and lower query friction.
Mixed-script content platforms
- CJK text mixed with Latin and numeric symbols
- language unknown at write time
- high volume of user-generated text
Result: more stable relevance across regional traffic patterns.
The Engineering Principle Behind It
We treat analysis as first-class infrastructure, not preprocessing glue.
That means:
- explicit analyzer definitions
- clear registration and composition
- testable behavior at each stage
- safe runtime updates for relevance assets like synonyms
When this layer is solid, every ranking and retrieval improvement lands on stronger ground.
Closing
Search quality starts with language understanding. Pizza Analysis gives us a powerful, modular, and operationally practical foundation for that work.
If your workload is multilingual, noisy, and constantly evolving, this is exactly the layer that determines whether search feels average or exceptional.
Pizza Analysis is how we make it exceptional.
Pizza is being open-sourced now. Keep an eye on us and follow along.


