Elasticsearch lemmatization with Hebrew analyzer plugin

Hebrew is morphologically rich: Prefixes, inflections, and clitics make exact-token search brittle. This project provides an open-source Hebrew analyzer plugin for Elasticsearch 9.x that performs neural lemmatization in the analysis chain, using an embedded DictaBERT model executed in-process via ONNX Runtime with an INT8-quantized model.

Quick start

Download the relevant release or build and install (Linux build script generates Elasticsearch‑compatible zip):

./scripts/build_plugin_linux.sh

Install in Elasticsearch:

/path/to/elasticsearch/bin/elasticsearch-plugin install file:///path/to/heb-lemmas-embedded-plugin-<ES_VERSION>.zip

Test:

curl -k -X POST "https://localhost:9200/_analyze" \
  -H "Content-Type: application/json" \
  -u "elastic:<password>" \
  -d '{"tokenizer":"whitespace","filter":["heb_lemmas","heb_stopwords"],"text":"הילדים אוכלים את הבננות"}'

Why Hebrew search is different

Hebrew is morphologically rich: Prefixes, suffixes, inflection, and clitics all collapse into a single surface form. That makes naive tokenization insufficient. Without true lemmatization, search quality suffers; users miss relevant results due to simple variations in form. This project tackles that by embedding a Hebrew lemmatization model inside the analyzer itself, so every token passes through a neural model before indexing and querying.

Example

Users may search for the lemma “בית” (house), but documents might contain:

בית (a house)
בבית (in the house)
לבית (to the house)
בבתים (in houses)
לבתים (to houses)

Without lemmatization, these become different surface tokens; lemmatization normalizes them toward the same lemma (בית), improving recall:

Hebrew analyzer for Elasticsearch lemmatization

What this plugin does

Rather than relying on rule-based stemming, the analyzer runs a Hebrew lemmatization model as part of the Elasticsearch analysis chain and emits one normalized lemma per token. Because the model is neural, it can use local context within each analyzed segment to choose a lemma in ambiguous cases—while still producing stable tokens that work well for indexing and querying. The analyzer:

Runs a Hebrew lemmatization model inside Elasticsearch.
Produces better normalized tokens for Hebrew text.
Supports stopwords and standard analyzer pipelines.

The result: Fast, reliable lemmatization

This analyzer is optimized for real‑world throughput:

ONNX Runtime in‑process inference.
INT8-quantized model for lower latency and memory footprint.
Java Foreign Function Interface (FFI) for high‑performance native inference.

The result: fast, reliable lemmatization with predictable operational behavior.

To evaluate performance, we ran a benchmark in a Docker container (4 cores, 12 GB RAM) on 1 million large documents (5.7 GB of data) from the Hebrew Wikipedia dataset. You’ll find the results below:

Metric (search)	Task	Value	Unit
Min throughput	hebrew-query-search	409.75	ops/s
Mean throughput	hebrew-query-search	490.65	ops/s
Median throughput	hebrew-query-search	491.85	ops/s
Max throughput	hebrew-query-search	496.13	ops/s
50th percentile latency	hebrew-query-search	7.02242	ms
90th percentile latency	hebrew-query-search	10.7338	ms
99th percentile latency	hebrew-query-search	19.0406	ms
99.9th percentile latency	hebrew-query-search	27.165	ms
50th percentile service time	hebrew-query-search	7.02242	ms
90th percentile service time	hebrew-query-search	10.7338	ms
99th percentile service time	hebrew-query-search	19.0406	ms
99.9th percentile service time	hebrew-query-search	27.165	ms
Error rate	hebrew-query-search	0	%

Open source and Elastic‑ready

The plugin is fully open source and works on:

Elastic open‑source distributions.
Elastic Cloud.

You can build it yourself or download prebuilt releases and install it like any other plugin.

To upload the analyzer plugin to Elastic Cloud, navigate to the Extensions section within your Elastic Cloud console and proceed with the upload.

Open source and Elastic‑ready for Hebrew analyzer for Elasticsearch lemmatization

Credits

This project is a fork of the Korra ai Hebrew analysis plugin (MIT), which was implemented by Korra.ai with funding and guidance from the National NLP Program led by MAFAT and the Israel Innovation Authority.

This fork focuses on Elasticsearch 9.x compatibility and running lemmatization fully in-process via ONNX Runtime, using an INT8‑quantized model and bundled Hebrew stopwords. Lemmatization is powered by DictaBERT dicta-il/dictabert-lex (CC‑BY‑4.0).

Huge thanks to the Dicta team for making high-quality Hebrew natural language processing (NLP) models available to the community.