Integrating Hunspell into Your App: APIs, Libraries, and Tips
Hunspell is a powerful, open-source spell checker and morphological analyzer used in browsers (Firefox), office suites (LibreOffice), e-mail clients, and many other applications. It supports complex morphology, rich affixation rules, and Unicode, which makes it especially useful for languages with rich inflection or compound-word formation. This article walks through practical steps to integrate Hunspell into your application, covers available APIs and language bindings, shows how to handle dictionaries and affix files, offers performance and fallback tips, and highlights common pitfalls.
What Hunspell provides and when to use it
Hunspell offers:
- Dictionary-based spell checking with support for affix rules that expand base words into many forms.
- Morphological analysis useful for validating word forms or generating stems.
- Compound-word handling and personal dictionaries.
- Unicode and multi-language support.
Use Hunspell when your app needs robust, language-aware spell checking beyond simple word lists — especially for languages like German, Hungarian, Turkish, Czech, Finnish, and others with rich morphology.
Core components: .dic and .aff files
Hunspell relies on two main file types:
- The .dic file: a list of root words plus optional flags indicating applicable affix rules or compound flags.
- The .aff file: defines affix rules (prefixes/suffixes), flag encoding, morphological actions, character encoding, compound rules, and special options.
Key points:
- Character encoding declared in the .aff must match the .dic encoding (UTF-8 or legacy charsets).
- Affix rules can generate thousands of derived forms from a small root set, which reduces dictionary size but increases rule complexity.
- Personal/user dictionaries are simple lists (usually one word per line) and are merged at runtime.
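To make the two file formats concrete, here is a minimal sketch in Python. The sample contents are an illustrative miniature, not a real language pack, and the helper names `parse_dic` and `declared_encoding` are hypothetical. The parser handles only the simple `word/FLAGS` form and ignores escaped slashes and morphological fields.

```python
# Illustrative miniature dictionary pair (not taken from a real dictionary).
SAMPLE_AFF = """\
SET UTF-8
SFX D Y 2
SFX D 0 ed [^e]
SFX D 0 d e
"""

SAMPLE_DIC = """\
3
walk/D
bake/D
hello
"""


def parse_dic(text):
    """Return {root: flags} from .dic text, skipping the leading count line.

    Simplified: assumes no escaped slashes and no morphological fields.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entries = {}
    for line in lines[1:]:  # the first line holds the approximate entry count
        root, _, flags = line.partition("/")
        entries[root] = flags
    return entries


def declared_encoding(aff_text):
    """Return the encoding named on the SET line, or None if there is none."""
    for line in aff_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "SET":
            return parts[1]
    return None
```

In this sample, flag `D` attaches a suffix rule with two cases: append `ed` to roots that do not end in `e` (walk, walked) and append `d` to roots that do (bake, baked). Two roots plus one rule cover four surface forms.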
Integration approaches
There are three common approaches to integrating Hunspell:
- Native C/C++ integration
  - Use the Hunspell C++ API directly by linking libhunspell.
  - Pros: best performance, full control, access to latest features.
  - Cons: C++ complexity, cross-platform build challenges.
- Language bindings / wrappers
  - Many languages provide bindings: Python, Java, C#, Node.js, Go, Rust, PHP, Ruby, etc.
  - Pros: faster development, easier packaging for existing stacks.
  - Cons: sometimes feature gaps or outdated bindings.
- Service-based architecture
  - Run Hunspell as a microservice (e.g., REST/JSON) that your app calls.
  - Pros: language-agnostic, centralized dictionary management, easier scaling.
  - Cons: latency, additional infrastructure, stateful user dictionaries require design.
Choose based on your app’s performance needs, deployment constraints, and language ecosystem.
Native C/C++: basic usage example
When using the native API, the typical flow is:
- Initialize a Hunspell object with paths to the .aff and .dic files.
- Call spell() to check words, suggest() for suggestions, analyze() or stem() for morphology.
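- Add words to the runtime dictionary with add(). This affects only the in-memory instance: the Hunspell class has no method for saving added words, so persist user words in your own file and re-add them at startup.
Example (C++ conceptual snippet; adapt to your build system):
```cpp
#include <hunspell/hunspell.hxx>
#include <string>
#include <vector>

int main() {
    Hunspell hunspell("/path/to/en_US.aff", "/path/to/en_US.dic");
    // Pass std::string so the non-deprecated overloads are selected.
    bool correct = hunspell.spell(std::string("example"));  // true if accepted
    std::vector<std::string> suggestions = hunspell.suggest(std::string("exampel"));
    std::vector<std::string> stems = hunspell.stem(std::string("running"));
    hunspell.add(std::string("MyCustomWord"));  // in-memory only, not written to disk
}
```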
Build/linking notes:
- Install the Hunspell development package (for example, libhunspell-dev on Debian/Ubuntu) or build from source.
- Link against libhunspell and include proper include paths.
- Watch ABI compatibility between hunspell versions.
Language bindings and examples
Select bindings with active maintenance and feature parity for your target language.
- Python
  - Packages: hunspell (bindings), cyhunspell, and pyhunspell exist, but maintenance varies and their APIs differ.
  - Example (cyhunspell-style API; pyhunspell instead exposes a HunSpell class that takes the .dic and .aff paths):
```python
from hunspell import Hunspell

h = Hunspell('en_US')     # dictionary name, resolved by the binding
h.spell('example')        # True
h.suggest('exampel')      # ('example', ...)
h.add('MyCustomWord')     # in-memory only
```
  - Note: installation may require libhunspell and headers.
- Java
  - Options: hunspell-java wrappers or JNI/JNA bridges. Apache Lucene also includes a Java implementation of Hunspell in its analysis module (package org.apache.lucene.analysis.hunspell).
  - Example: use Lucene's Hunspell classes for integration with Lucene analyzers, or standalone JNI bindings for direct use.
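- Node.js
  - Packages: nodehun (Node.js native bindings to Hunspell).
  - Example (nodehun 3.x, where methods return promises; older 2.x releases used a different, callback-based API):
```javascript
const Nodehun = require('nodehun');
const fs = require('fs');

// The constructor takes the file contents as Buffers, not paths.
const aff = fs.readFileSync('en_US.aff');
const dic = fs.readFileSync('en_US.dic');
const hunspell = new Nodehun(aff, dic);

hunspell.spell('example').then((correct) => console.log(correct));
```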
- C#
  - Hunspell.Net or NHunspell provide .NET bindings and are commonly used in Windows/.NET apps.
- Go, Rust, PHP, Ruby, etc.
  - Most ecosystems have community bindings; verify maintenance and support for features like compound rules and encoding.
When choosing a binding, confirm:
- Support for suggest(), stem(), analyze(), and compound features you need.
- Ability to load custom/user dictionaries at runtime.
- Compatibility with your Hunspell dictionaries’ encoding and flags.
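The first item on this checklist can be probed at startup rather than discovered in production. The sketch below assumes a binding that exposes these operations as methods with the names used in this article; real bindings often name them differently, so the `REQUIRED_METHODS` tuple and the `MinimalBinding` stand-in are illustrative only.

```python
# Method names follow the article; adjust them to your binding's actual API.
REQUIRED_METHODS = ("spell", "suggest", "stem", "analyze", "add")


def missing_capabilities(checker, required=REQUIRED_METHODS):
    """Return the names in `required` that `checker` does not offer as callables."""
    return [
        name for name in required
        if not callable(getattr(checker, name, None))
    ]


class MinimalBinding:
    """Stand-in for a binding that only supports checking and suggesting."""

    def spell(self, word):
        return True

    def suggest(self, word):
        return []
```

Calling `missing_capabilities(MinimalBinding())` reports `stem`, `analyze`, and `add` as absent, which is a cheap signal that the binding cannot support morphology or user dictionaries.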
Running Hunspell as a microservice
For language-agnostic integration or centralized dictionary management:
- Build a small service that loads dictionaries and exposes endpoints: /spell, /suggest, /stem, /add-user-word.
- Keep user-dictionary state per user (store persisted files or database).
- Use batching for checking many words to reduce RPC overhead.
- Example endpoint design:
  - POST /spell { "words": ["word1", "word2"] } -> { "results": [true, false] }
  - POST /suggest { "word": "mispell", "limit": 5 } -> ["misspell", ...]
- Scale by running multiple instances behind a load balancer. Use caching for repeated suggestions.
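The endpoint design above can be sketched as two framework-independent handler functions. This is a sketch under stated assumptions: the handler names are hypothetical, `checker` is any object with `spell()` and `suggest()` methods, and `FakeChecker` is a stand-in so the example runs without Hunspell installed. Wiring the handlers into an HTTP framework is left out.

```python
def handle_spell(payload, checker):
    """Batch check: {"words": [...]} -> {"results": [bool, ...]} in the same order."""
    words = payload.get("words", [])
    return {"results": [bool(checker.spell(word)) for word in words]}


def handle_suggest(payload, checker):
    """Suggest: {"word": str, "limit": int} -> list of at most `limit` suggestions."""
    limit = int(payload.get("limit", 5))
    return list(checker.suggest(payload["word"]))[:limit]


class FakeChecker:
    """Stand-in for a loaded Hunspell instance (hypothetical data)."""

    KNOWN = {"word", "misspell"}
    FIXES = {"mispell": ["misspell", "dispel", "misspelt"]}

    def spell(self, word):
        return word in self.KNOWN

    def suggest(self, word):
        return self.FIXES.get(word, [])
```

Keeping the handlers free of HTTP details makes them easy to test and lets one loaded dictionary instance be shared across requests.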
Performance tips
- Reuse Hunspell instances — initialization is costly. Create a pool or singleton per process.
- Batch checks: call spell() on arrays rather than single words when bindings support it.
- Cache suggestions for common misspellings and for words whose suggestions are expensive to compute.
- For large text, tokenize first and only check tokens that are likely words (skip numbers, URLs, code fragments).
- Keep user dictionaries small and load them per session or merge selectively to avoid slowing lookups.
- Monitor memory use: affix rules can expand internally; test worst-case morphological expansion.
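Instance reuse and caching can be combined in a thin wrapper. The following is a minimal sketch, assuming a backend object with `spell()`, `suggest()`, and `add()` methods; `CachedSpeller` and `CountingBackend` are hypothetical names, and the backend here is a stand-in that counts calls instead of wrapping Hunspell.

```python
from functools import lru_cache


class CachedSpeller:
    """Wrap one long-lived checker and memoize its answers per word."""

    def __init__(self, backend, max_entries=10_000):
        self._backend = backend
        # Per-instance caches, so separate dictionaries never share results.
        self._spell = lru_cache(maxsize=max_entries)(backend.spell)
        self._suggest = lru_cache(maxsize=max_entries)(self._suggest_uncached)

    def _suggest_uncached(self, word):
        # Store an immutable tuple so callers cannot mutate cached results.
        return tuple(self._backend.suggest(word))

    def spell(self, word):
        return self._spell(word)

    def suggest(self, word):
        return list(self._suggest(word))

    def add(self, word):
        self._backend.add(word)
        # A new word can change earlier answers, so drop cached results.
        self._spell.cache_clear()
        self._suggest.cache_clear()


class CountingBackend:
    """Stand-in backend that records how often it is actually called."""

    def __init__(self):
        self.words = {"example"}
        self.calls = 0

    def spell(self, word):
        self.calls += 1
        return word in self.words

    def suggest(self, word):
        self.calls += 1
        return ["example"] if word == "exampel" else []

    def add(self, word):
        self.words.add(word)
```

Create one `CachedSpeller` per dictionary at process start and share it. Clearing the caches in `add()` matters: otherwise a word the user just learned would still be reported as misspelled.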
Handling encodings and localization
- Ensure your .aff declares the correct encoding (e.g., SET UTF-8). Use UTF-8 dictionaries when possible.
- Normalize input (NFC/NFD) consistently before spell-checking for languages with combining marks.
- For right-to-left languages or scripts with contextual forms, Hunspell handles word forms but UI rendering is outside its scope.
- Provide locale-aware tokenization (e.g., what counts as a word boundary differs by language).
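The normalization step can be illustrated with Python's standard `unicodedata` module. The helper name `normalize_token` is hypothetical, and the choice of NFC as the default is an assumption: use whichever form your dictionary files are written in.

```python
import unicodedata


def normalize_token(token, form="NFC"):
    """Return `token` in the given Unicode normalization form."""
    return unicodedata.normalize(form, token)


# "é" typed as a base letter plus a combining acute accent (two code points).
decomposed = "cafe\u0301"
# The same word with a precomposed "é" (one code point).
precomposed = "caf\u00e9"
```

The two strings render identically but compare unequal until normalized, so a dictionary lookup for one can miss an entry stored as the other.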
Building and packaging dictionaries
- Obtain high-quality dictionaries: LibreOffice, Mozilla, and OpenOffice community dictionaries are common sources.
- Test dictionaries with sample corpora to catch missing words or problematic affix rules.
- Customize by:
  - Adding domain-specific words to a separate user or project dictionary.
  - Editing affix rules carefully; incorrect rules can create false positives/negatives.
- When distributing with your app, keep dictionary updates separate so users can download updated .dic/.aff files without updating the whole app.
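Testing a dictionary against a sample corpus can be as simple as counting the words it rejects. The sketch below is one possible approach, not a standard tool: `unknown_words` is a hypothetical helper, `is_known` is any callable that returns True for accepted words (in practice, your binding's spell method), and the regular expression is a rough Unicode-letter tokenizer.

```python
import re
from collections import Counter

# Runs of Unicode letters, allowing internal apostrophes (e.g. "don't").
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def unknown_words(corpus_text, is_known, top=20):
    """Return the `top` most frequent words rejected by `is_known`, with counts."""
    counts = Counter(
        word for word in WORD_RE.findall(corpus_text) if not is_known(word)
    )
    return counts.most_common(top)
```

Frequent entries in the result are candidates for a project dictionary, while unexpected rejections of ordinary words point to missing roots or faulty affix rules.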
Suggestions and user experience
- Provide in-place suggestions (top N) and a “learn word” button that adds terms to the user dictionary.
- Offer “ignore once” and “ignore all” behaviors.
- Show suggestions with context (e.g., highlight differing letters) and provide keyboard shortcuts.
- Consider offering grammar or style suggestions via separate tools; Hunspell focuses on word-level correctness.
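The "learn word" and "ignore all" behaviors need a little state around the checker, because Hunspell's add() changes only the in-memory instance. The following is a minimal sketch with hypothetical names (`SessionWords`, `SetBackedChecker`); it assumes a checker with `spell()` and `add()` methods and a plain one-word-per-line user file managed by the application.

```python
from pathlib import Path


class SessionWords:
    """Track learned (persistent) and ignored (session-only) words."""

    def __init__(self, checker, user_dict_path):
        self._checker = checker
        self._path = Path(user_dict_path)
        self._ignored = set()
        # Re-add previously learned words, since the checker does not persist them.
        if self._path.exists():
            for line in self._path.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    self._checker.add(line.strip())

    def learn(self, word):
        """'Learn word': accept now and remember across sessions."""
        self._checker.add(word)
        with self._path.open("a", encoding="utf-8") as handle:
            handle.write(word + "\n")

    def ignore_all(self, word):
        """'Ignore all': accept for this session only."""
        self._ignored.add(word)

    def is_flagged(self, word):
        return word not in self._ignored and not self._checker.spell(word)


class SetBackedChecker:
    """Stand-in for a Hunspell binding, backed by a plain set."""

    def __init__(self, words):
        self._words = set(words)

    def spell(self, word):
        return word in self._words

    def add(self, word):
        self._words.add(word)
```

"Ignore once" needs no storage here: the UI simply dismisses that single occurrence and leaves the word flagged elsewhere.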
Common pitfalls and how to avoid them
- Mismatched encoding between .aff and .dic — always verify SET in .aff and file encoding.
- Relying only on spell-check — Hunspell won’t catch grammar or context-based errors.
- Loading huge personal dictionaries per request — persist and reuse per session.
- Using outdated language bindings — choose maintained libraries or implement a thin service layer around a native install.
- Not testing with real-world text (URLs, code, special tokens) — add tokenization rules to skip non-language tokens.
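The last pitfall is usually handled by a pre-filter that runs before any word reaches Hunspell. Below is a rough sketch of such a filter; the name `checkable_tokens` and the specific heuristics (dropping URLs and any chunk containing digits or code-like punctuation) are assumptions to tune for your content, not rules that Hunspell defines.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
# Runs of Unicode letters, allowing internal apostrophes and hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")
# Characters that suggest a chunk is code, a path, a handle, or markup.
CODE_CHARS = set("_/\\@#{}()[]<>=")


def checkable_tokens(text):
    """Return only the tokens that are worth sending to the spell checker."""
    text = URL_RE.sub(" ", text)
    tokens = []
    for chunk in text.split():
        if any(ch.isdigit() or ch in CODE_CHARS for ch in chunk):
            continue  # skip numbers, identifiers, paths, e-mail addresses
        tokens.extend(WORD_RE.findall(chunk))
    return tokens
```

Filtering whole whitespace-separated chunks is deliberately conservative: it may skip a real word glued to code, but it avoids flooding users with false alarms on identifiers.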
Troubleshooting checklist
- If suggestions are poor: verify affix flags, encoding, and whether affix rules are being interpreted by your binding.
- If initialization fails: check file paths, permissions, and that the .aff and .dic files come from the same dictionary release.
- If performance is slow: ensure you aren’t reloading dictionaries per check and profile for hotspots.
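Several of these checks can be automated before the dictionary is handed to Hunspell. The following pre-flight sketch is an illustration with a hypothetical function name (`preflight`); it verifies that both files exist, reads the SET line from the .aff, and tries to decode the .dic in that encoding. It relies on Python recognizing the encoding name, which will not hold for every name that appears in .aff files.

```python
from pathlib import Path


def preflight(aff_path, dic_path):
    """Return a list of human-readable problems; an empty list means none found."""
    aff, dic = Path(aff_path), Path(dic_path)
    problems = [f"missing file: {p}" for p in (aff, dic) if not p.is_file()]
    if problems:
        return problems

    raw = aff.read_bytes()
    if raw.startswith(b"\xef\xbb\xbf"):  # drop a UTF-8 byte order mark
        raw = raw[3:]
    encoding = None
    # latin-1 never fails to decode, and the SET line itself is plain ASCII.
    for line in raw.decode("latin-1").splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "SET":
            encoding = parts[1]
            break
    if encoding is None:
        return ["no SET line in .aff; encoding is unspecified"]

    try:
        dic.read_bytes().decode(encoding)
    except LookupError:
        problems.append(f"cannot verify: Python does not know encoding {encoding!r}")
    except UnicodeDecodeError as exc:
        problems.append(f".dic is not valid {encoding}: {exc.reason}")
    return problems
```

Running this once at deploy time or in CI catches the mismatched-encoding case before it shows up as poor suggestions or rejected accented words.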
Example integration plan (small web app)
- Choose integration style: Node.js with nodehun for a web app.
- Add libhunspell as a dependency on the server and install the en_US dictionary files.
- Load Hunspell once at server start; expose /spell and /suggest endpoints.
- Implement client-side tokenization and only send suspected misspellings.
- Cache suggestion results and implement user dictionary endpoints.
- Monitor usage and tune caching and worker pools.
Further resources
- Official Hunspell source and documentation for detailed affix syntax.
- Community dictionaries (LibreOffice/Mozilla) for ready-made language packs.
- Language-specific bindings’ repositories and README for installation notes.
Integrating Hunspell gives your app robust, language-aware spell checking with manageable dictionary sizes and strong support for complex languages. Choose the right integration approach (native, binding, or service), ensure proper encoding and affix handling, reuse instances for performance, and provide UX features like personal dictionaries and suggestion caching to deliver a smooth experience.