February 27, 2026

How to verify international addresses in any language

If your business onboards customers from more than one country, you've hit this problem: a customer uploads a proof of address document, and it's not in English. It might be a Bulgarian electricity bill in Cyrillic, an Arabic bank statement written right-to-left, or a Japanese government letter mixing kanji, katakana, and Latin characters on the same page. Your system needs to read that document, extract the address, and compare it to what the customer entered during signup, which is almost certainly in Latin script.

This is one of the hardest problems in automated KYC. It's not enough to run OCR and do a string comparison. You need transliteration, fuzzy matching, and an understanding of how addresses work in different countries. Here's how it all fits together.

Why OCR alone fails for international documents

Standard OCR engines do a reasonable job with Latin-script documents in good condition. But international proof of address documents introduce problems that break naive text extraction:

Even when OCR correctly extracts the text, you're left with a string in a script your system probably can't compare against your database. That's where the real problem begins.

The transliteration problem

Consider a concrete example. A Bulgarian customer signs up and enters their address as:

ul. Vitosha 15, Sofia 1000

They upload their electricity bill. The bill shows:

ул. Витоша 15, София 1000

These are the same address. But a character-level comparison finds no letters in common: only the digits, spaces, and punctuation line up, so the similarity score is far too low to count as a match. The Cyrillic "ул." is an abbreviation for "улица" (street), which transliterates to "ulitsa" and is commonly shortened to "ul." in Latin. "Витоша" transliterates to "Vitosha." "София" becomes "Sofia."
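To see the gap in numbers, here is a minimal sketch using Python's standard `difflib`. The variable names and the `letters_only` helper are illustrative, and `SequenceMatcher` stands in for whatever string comparison your system uses.

```python
from difflib import SequenceMatcher

entered = "ul. Vitosha 15, Sofia 1000"   # what the customer typed
on_bill = "ул. Витоша 15, София 1000"    # what the document shows

def letters_only(text: str) -> str:
    """Drop digits, spaces, and punctuation so only letters are compared."""
    return "".join(ch for ch in text if ch.isalpha())

# Comparing the raw strings: digits and punctuation still line up.
raw_score = SequenceMatcher(None, entered, on_bill).ratio()

# Comparing letters only: Latin and Cyrillic share no code points here.
letter_score = SequenceMatcher(
    None, letters_only(entered), letters_only(on_bill)
).ratio()

print(f"raw: {raw_score:.2f}, letters only: {letter_score:.2f}")
```

Whatever score the raw comparison produces comes entirely from "15", "1000", and the punctuation. The street and city names contribute nothing until the text is transliterated.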

This isn't a special case. It's the norm for any customer whose documents aren't in Latin script. A few more examples:

Transliteration is not translation. You're converting characters from one script to their phonetic equivalent in Latin script, not changing the language. But even this seemingly mechanical process has ambiguity: there are multiple valid transliteration standards for most scripts (BGN/PCGN, ISO 9, GOST for Cyrillic alone), and real-world usage often doesn't follow any standard consistently.
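A table-driven transliterator for Bulgarian can be sketched in a few lines. The mapping below is modeled on Bulgaria's Streamlined System as an assumption for illustration; the function name, the `final_ia` flag, and the table itself are hypothetical and should be checked against whichever standard you adopt.

```python
import re

# Illustrative table modeled on Bulgaria's Streamlined System.
# Treat it as an assumption, not an authoritative copy of the standard.
BG_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "ts", "ч": "ch",
    "ш": "sh", "щ": "sht", "ъ": "a", "ь": "y", "ю": "yu", "я": "ya",
}

def transliterate_bg(text: str, final_ia: bool = True) -> str:
    """Map Bulgarian Cyrillic letters to Latin; leave everything else as-is."""
    if final_ia:
        # Word-final "ия" is conventionally written "ia" (София -> Sofia).
        # Only the lowercase ending is handled in this sketch.
        text = re.sub(r"ия\b", "ia", text)
    out = []
    for ch in text:
        latin = BG_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)                  # digits, punctuation, Latin letters
        elif ch.isupper():
            out.append(latin.capitalize())  # "Ш" -> "Sh"
        else:
            out.append(latin)
    return "".join(out)

print(transliterate_bg("ул. Витоша 15, София 1000"))
print(transliterate_bg("София", final_ia=False))
```

The two calls show the ambiguity in miniature: with the word-final rule the city comes out as "Sofia", and without it as "Sofiya". Both spellings turn up in customer-entered data, which is why the comparison step downstream has to tolerate small differences.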

Address format differences by country

Even after transliteration, you need to handle the fact that address structures vary dramatically by country:

A verification system that expects "number, street, city, postcode" will fail on most of these. You need country-aware parsing or, better, a model that understands document structure rather than relying on regex patterns.

What cross-script matching means

Cross-script matching is the process of comparing two strings that are written in different scripts and determining whether they refer to the same thing. In practice, this means:

  1. Detect the script of the extracted text (Cyrillic, Arabic, Georgian, etc.)
  2. Transliterate the non-Latin text to Latin characters using an appropriate standard for that script
  3. Normalize both the transliterated text and the expected value by lowercasing, removing diacritics, and expanding or collapsing common abbreviations
  4. Fuzzy match the normalized strings to produce a similarity score

The transliteration step is where most of the difficulty lives. A single Cyrillic character can map to anywhere from one to four Latin characters depending on the language and the standard. The Russian "щ" transliterates to "shch" in BGN/PCGN but to the single character "ŝ" in ISO 9:1995 (the older ISO/R 9 used "šč"), while the same letter in Bulgarian is usually written "sht." Real documents use a mix of conventions, and customer-entered addresses use whatever the customer felt like typing.
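Steps 1 and 3 of the list above can be sketched with the Python standard library alone. The function names and the abbreviation table are hypothetical, and reading the script from the first word of each character's Unicode name is a heuristic, not a full script-detection algorithm.

```python
import re
import unicodedata
from collections import Counter

def detect_script(text: str) -> str:
    """Return the most common script among letters, e.g. 'CYRILLIC'.

    Heuristic: the first word of a character's Unicode name
    ('CYRILLIC SMALL LETTER U') is used as its script label.
    """
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"

# Illustrative abbreviation table; a real one would be per country.
ABBREVIATIONS = {"ulitsa": "ul", "street": "st", "boulevard": "blvd"}

def normalize(text: str) -> str:
    """Lowercase, strip diacritics and punctuation, collapse abbreviations."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"\w+", no_marks)
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

print(detect_script("ул. Витоша 15, София 1000"))
print(normalize("Ulitsa Vitosha 15, Sofia 1000"))
print(normalize("ul. Vitosha 15, Sofia 1000"))
```

Collapsing to the short form rather than expanding to the long one is an arbitrary choice here. What matters is that both sides of the comparison go through the same function, so "Ulitsa Vitosha" and "ul. Vitosha" end up identical before any fuzzy matching runs.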

Why fuzzy matching matters

Even with perfect transliteration, exact string matching won't work. Real-world addresses are messy in every language:

Fuzzy matching algorithms (like Levenshtein distance, Jaro-Winkler, or token-based similarity) handle these variations by producing a similarity score rather than a binary match. A good system lets you set thresholds: maybe 0.85 is good enough for the address, but you want 0.95 for the postcode since postcodes should be exact.
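As a rough illustration of per-field thresholds, here is a sketch built on plain Levenshtein distance, scaled to a 0 to 1 similarity. The `check` helper and the `THRESHOLDS` table are hypothetical names; the 0.85 and 0.95 values are the example figures from the paragraph above, not recommendations.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Scale edit distance to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

# Example thresholds taken from the text above.
THRESHOLDS = {"address": 0.85, "postcode": 0.95}

def check(field: str, expected: str, extracted: str) -> tuple[float, bool]:
    """Return the similarity score and whether it clears the field's threshold."""
    score = similarity(expected, extracted)
    return score, score >= THRESHOLDS[field]

print(check("address", "ul vitosha 15 sofia", "ul vitosha 15 sofiya"))
print(check("postcode", "1000", "1080"))
```

The address pair differs by a single inserted letter and clears 0.85 comfortably. The postcode pair also differs by a single character, but on a four-character string that drops the score to 0.75, well below 0.95. Short fields are unforgiving under this kind of measure, which is one reason to tune thresholds per field rather than globally.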

Real-world examples across scripts

Here's what cross-script address verification looks like in practice, showing the same address as it appears on a document and how it needs to match against a Latin-script database entry:

Arabic bank statement (Saudi Arabia)

Japanese government letter

Georgian utility bill

Each of these examples requires script detection, transliteration, normalization, and fuzzy matching to produce a reliable result. A system that can't handle these cases will either reject valid documents (frustrating customers) or require manual review (defeating the purpose of automation).

How AI document understanding solves this

Modern AI models, particularly large vision-language models, bring a fundamentally different approach to this problem compared to traditional OCR pipelines:

The key shift is from a pipeline approach (OCR → text extraction → parsing → matching) to an end-to-end approach where a single model reads the document and produces structured, normalized output. This eliminates many of the failure modes where errors compound through the pipeline.

How trusqo handles international documents

trusqo automatically detects the script and language of uploaded documents and performs cross-script matching without any configuration. When you submit a verification request, here's what happens:

  1. The document is analyzed by an AI model that extracts the customer name and address, regardless of script
  2. Non-Latin text is transliterated to Latin characters, and the original script version is preserved alongside
  3. The transliterated address is compared against the expected address using fuzzy matching that accounts for abbreviations, word order differences, and transliteration variations
  4. A similarity score and pass/fail verdict are returned for each check

This works the same whether the document is a German bank statement, an Arabic utility bill, or a Thai government letter. You don't need to specify the language upfront or use different endpoints for different regions.

Full API documentation and integration guides are available at trusqo.com/docs.