February 27, 2026
How to verify international addresses in any language
If your business onboards customers from more than one country, you've hit this problem: a customer uploads a proof of address document, and it's not in English. It might be a Bulgarian electricity bill in Cyrillic, an Arabic bank statement written right-to-left, or a Japanese government letter mixing kanji, katakana, and Latin characters on the same page. Your system needs to read that document, extract the address, and compare it to what the customer entered during signup, which is almost certainly in Latin script.
This is one of the hardest problems in automated KYC. It's not enough to run OCR and do a string comparison. You need transliteration, fuzzy matching, and an understanding of how addresses work in different countries. Here's how it all fits together.
Why OCR alone fails for international documents
Standard OCR engines do a reasonable job with Latin-script documents in good condition. But international proof of address documents introduce problems that break naive text extraction:
- Non-Latin scripts: Cyrillic (Russian, Bulgarian, Ukrainian), Arabic, Chinese, Georgian, Thai, Korean, Devanagari, and dozens of others each have their own character sets, ligatures, and rendering rules
- Right-to-left text: Arabic and Hebrew documents read right-to-left, but often contain embedded left-to-right numbers, dates, or Latin words. This bidirectional mixing confuses many OCR pipelines
- Mixed scripts on one document: a Georgian utility bill might have the customer's address in Georgian script, the company name in Latin, account numbers in Arabic numerals, and a footer in English. Each region of the document uses different character recognition rules
- Low-resource scripts: OCR models trained primarily on English and Chinese may produce poor results on Georgian, Khmer, or Amharic text, where training data is scarce
Even when OCR correctly extracts the text, you're left with a string in a script your system probably can't compare against your database. That's where the real problem begins.
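One way to see the mixed-script problem concretely is to count which scripts appear in a line of extracted text. The sketch below is only illustrative: it uses the first word of each character's Unicode name as a rough script label, which is a shortcut rather than a proper script-property lookup, and the sample line is a made-up OCR result.

```python
import unicodedata
from collections import Counter
from typing import Optional


def script_of(ch: str) -> Optional[str]:
    """Return a rough script label for a letter, or None for digits, spaces, punctuation."""
    if not ch.isalpha():
        return None
    # Unicode character names start with the script, e.g. "GEORGIAN LETTER TAN".
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def script_counts(text: str) -> Counter:
    """Count letters per script in a line of extracted text."""
    return Counter(s for s in map(script_of, text) if s is not None)


# Hypothetical OCR output: a Georgian city name followed by a Latin company name.
line = "თბილისი, Telasi JSC, 71"
counts = script_counts(line)
print(counts)           # letters per script
print(len(counts) > 1)  # True means the line mixes scripts
```

A line that reports more than one script is a candidate for per-region handling rather than a single recognition pass.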
The transliteration problem
Consider a concrete example. A Bulgarian customer signs up and enters their address as:
ul. Vitosha 15, Sofia 1000
They upload their electricity bill. The bill shows:
ул. Витоша 15, София 1000
These are the same address. But a character-level comparison finds almost nothing in common: only the digits, spaces, and punctuation match. The Cyrillic "ул." is an abbreviation for "улица" (street), which transliterates to "ulitsa" or is commonly shortened to "ul." in Latin. "Витоша" transliterates to "Vitosha." "София" becomes "Sofia."
This isn't a special case. It's the norm for any customer whose documents aren't in Latin script. A few more examples:
- Arabic: شارع الملك فهد ١٥، الرياض (King Fahd Road 15, Riyadh). The street name "الملك فهد" can transliterate as "Al Malik Fahd," "al-Malik Fahd," or "King Fahd" depending on the system
- Georgian: რუსთაველის გამზირი 24, თბილისი (Rustaveli Avenue 24, Tbilisi). "გამზირი" means "avenue," and "რუსთაველის" is the genitive of "რუსთაველი," which transliterates to "Rustaveli" or "Rust'aveli" depending on the system
- Japanese: 東京都渋谷区神宮前1-2-3 (1-2-3 Jingumae, Shibuya-ku, Tokyo). The order is the reverse of Western format, running from the largest unit to the smallest: prefecture, city or ward, district, block, building
Transliteration is not translation. You're converting characters from one script to their phonetic equivalent in Latin script, not changing the language. But even this seemingly mechanical process has ambiguity: there are multiple valid transliteration standards for most scripts (BGN/PCGN, ISO 9, GOST for Cyrillic alone), and real-world usage often doesn't follow any standard consistently.
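The Bulgarian example above can be sketched as a table-driven transliteration. This is a minimal illustration, not a complete implementation of any standard: the table is a simplified mapping loosely based on Bulgaria's streamlined romanization, the word-final "ия" rule is handled only for lowercase text, and the function name is made up for this example.

```python
import re

# Simplified Bulgarian Cyrillic -> Latin table (lowercase keys only).
# Assumption: loosely follows the streamlined system; other standards differ.
BG_TABLE = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ж": "zh",
    "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m", "н": "n",
    "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f",
    "х": "h", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "sht", "ъ": "a",
    "ь": "y", "ю": "yu", "я": "ya",
}


def transliterate_bg(text: str) -> str:
    """Transliterate Bulgarian Cyrillic to Latin, leaving other characters alone."""
    # Word-final "ия" is conventionally written "ia" (София -> Sofia),
    # so rewrite it to "иа" before the per-character pass.
    text = re.sub(r"ия\b", "иа", text)
    out = []
    for ch in text:
        latin = BG_TABLE.get(ch.lower())
        if latin is None:
            out.append(ch)                  # digits, punctuation, Latin letters
        elif ch.isupper():
            out.append(latin.capitalize())  # "Ш" -> "Sh", not "SH"
        else:
            out.append(latin)
    return "".join(out)


print(transliterate_bg("ул. Витоша 15, София 1000"))
# ul. Vitosha 15, Sofia 1000
```

Swapping in a different table (ISO 9, for instance) changes the output for several letters, which is why the result still has to go through fuzzy matching rather than an exact comparison.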
Address format differences by country
Even after transliteration, you need to handle the fact that address structures vary dramatically by country:
- Germany: street name comes before the house number ("Friedrichstraße 43"), the opposite of US convention ("43 Friedrich Street")
- Japan: addresses are block-based, not street-based. A Japanese address identifies a prefecture, city, ward, district, block number, and building number; there may be no street name at all
- United Kingdom: postcodes like "SW1A 1AA" encode geographic information and are critical for matching, but their alphanumeric, variable-length format breaks validators written for purely numeric postal codes
- South Korea: the country switched from a lot-number system (지번주소) to a road-name system (도로명주소) in the early 2010s, so older documents use a completely different address format than newer ones
- Middle East: many Gulf countries have less standardized addressing. A document might reference a district name, a building name, and a P.O. box rather than a street number
- Brazil: addresses include neighborhood (bairro) as a standard component, and abbreviations like "R." for "Rua" (street) are very common
A verification system that expects "number, street, city, postcode" will fail on most of these. You need country-aware parsing or, better, a model that understands document structure rather than relying on regex patterns.
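To make the country-aware parsing idea concrete, here is a small sketch that parses a street line with a per-country pattern. The patterns, the StreetLine type, and the parse_street_line function are all hypothetical and cover only the German and US examples above; real address formats have far more variants than two regular expressions can capture.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreetLine:
    street: str
    number: str


# Hypothetical per-country patterns, for illustration only.
PATTERNS = {
    # Germany: street name first, house number last ("Friedrichstraße 43").
    "DE": re.compile(r"^(?P<street>.+?)\s+(?P<number>\d+\s?[a-zA-Z]?)$"),
    # US: house number first, street name after ("43 Friedrich Street").
    "US": re.compile(r"^(?P<number>\d+[a-zA-Z]?)\s+(?P<street>.+)$"),
}


def parse_street_line(line: str, country: str) -> Optional[StreetLine]:
    """Split a street line into name and number, or return None if it doesn't fit."""
    pattern = PATTERNS.get(country)
    if pattern is None:
        return None
    match = pattern.match(line.strip())
    if match is None:
        return None
    return StreetLine(street=match.group("street"), number=match.group("number"))


print(parse_street_line("Friedrichstraße 43", "DE"))
print(parse_street_line("43 Friedrich Street", "US"))
print(parse_street_line("43 Friedrich Street", "DE"))  # None: wrong country rules
```

The last call shows the failure mode: the same line parsed under the wrong country's rules yields nothing, which is why the country has to be known or inferred before parsing.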
What cross-script matching means
Cross-script matching is the process of comparing two strings that are written in different scripts and determining whether they refer to the same thing. In practice, this means:
- Detect the script of the extracted text (Cyrillic, Arabic, Georgian, etc.)
- Transliterate the non-Latin text to Latin characters using an appropriate standard for that script
- Normalize both the transliterated text and the expected value: lowercase, remove diacritics, and expand or collapse common abbreviations
- Fuzzy match the normalized strings to produce a similarity score
The transliteration step is where most of the difficulty lives. A single Cyrillic character can map to anywhere from one to four Latin characters depending on the standard. The Russian "щ" transliterates to "shch" in BGN/PCGN, to "ŝ" in ISO 9:1995, and to "šč" in the older scientific transliteration. Real documents use a mix of conventions, and customer-entered addresses use whatever the customer felt like typing.
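The normalization step from the list above might look something like the following. Treat it as a sketch: the abbreviation table is a tiny hypothetical sample, the function is meant for text that is already in Latin script, and stripping diacritics does nothing to reconcile a digraph in one standard with a single accented letter in another.

```python
import re
import unicodedata

# Hypothetical abbreviation table; a real one would be per-country and much
# larger ("st" can also mean "saint", for example).
ABBREVIATIONS = {"ul": "ulitsa", "st": "street", "ave": "avenue"}


def normalize(text: str) -> str:
    """Lowercase, strip diacritics, unify digits, and expand abbreviations."""
    # Decompose accented letters, then drop the combining marks ("š" -> "s").
    # Note: this is only appropriate for Latin-script text.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Map non-ASCII decimal digits (e.g. Arabic-Indic "١٥") to ASCII digits.
    text = "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() and not ch.isascii() else ch
        for ch in text
    )
    # casefold() is a stronger lower(): it also turns "ß" into "ss".
    text = text.casefold()
    # Replace punctuation with spaces, then split into tokens.
    tokens = re.sub(r"[^\w\s]", " ", text).split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


print(normalize("Ul. Vitosha 15, Sofia 1000"))
# ulitsa vitosha 15 sofia 1000
print(normalize("Friedrichstraße 43"))
# friedrichstrasse 43
```

After this step, "Ul. Vitosha 15, Sofia 1000" and "ulitsa Vitosha 15 Sofia 1000" normalize to the same string, so the fuzzy matcher only has to absorb genuine spelling and transliteration differences.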
Why fuzzy matching matters
Even with perfect transliteration, exact string matching won't work. Real-world addresses are messy in every language:
- Abbreviations: "St" vs "Street," "ул." vs "улица," "р-н" vs "район" (district in Russian)
- Word order: "15 Vitosha Street" vs "Vitosha Street 15" vs "ul. Vitosha 15"
- Missing components: the customer might omit the apartment number, or the document might not show it
- Transliteration variations: "Tbilisi" vs "T'bilisi" vs "Tiflis," "Kyiv" vs "Kiev"
- Spelling differences: "Mohammed" vs "Muhammad" vs "Mohamed" when transliterating محمد
- Extra detail on documents: a utility bill might include a floor number, entrance number, or meter ID that the customer didn't provide
Fuzzy matching algorithms (like Levenshtein distance, Jaro-Winkler, or token-based similarity) handle these variations by producing a similarity score rather than a binary match. A good system lets you set thresholds: maybe 0.85 is good enough for the address, but you want 0.95 for the postcode since postcodes should be exact.
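As a rough illustration of scoring with per-field thresholds, the sketch below implements Levenshtein distance directly, turns it into a 0 to 1 similarity, and sorts tokens first so that word order doesn't count against a match. The function names and the check helper are assumptions made for this example; the 0.85 and 0.95 thresholds are the illustrative values mentioned above, not recommendations.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Scale edit distance to a score between 0.0 and 1.0."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))


def token_sort_similarity(a: str, b: str) -> float:
    """Compare after lowercasing and sorting tokens, so word order is ignored."""
    def key(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return similarity(key(a), key(b))


# Illustrative thresholds taken from the text above.
THRESHOLDS = {"address": 0.85, "postcode": 0.95}


def check(field: str, expected: str, found: str) -> bool:
    return token_sort_similarity(expected, found) >= THRESHOLDS[field]


print(token_sort_similarity("15 Vitosha Street", "Vitosha Street 15"))  # 1.0
print(check("address", "Vitosha Street 15", "Vitosha Stret 15"))        # True
print(check("postcode", "1000", "1008"))                                # False
```

A one-character typo in a 17-character address still clears 0.85, while a one-digit difference in a four-digit postcode scores 0.75 and fails the stricter threshold.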
Real-world examples across scripts
Here's what cross-script address verification looks like in practice, showing the same address as it appears on a document and how it needs to match against a Latin-script database entry:
Arabic bank statement (Saudi Arabia)
- On document: شارع العليا، حي العليا، الرياض 12211
- Transliterated: Shariʿ al-ʿUlayya, Hayy al-ʿUlayya, ar-Riyad 12211
- Customer entered: Olaya Street, Olaya District, Riyadh 12211
- Challenge: "العليا" can be transliterated as "al-Ulayya," "Olaya," or "Al Olaya." The common English name "Olaya Street" is actually an anglicized version, not a standard transliteration
Japanese government letter
- On document: 東京都新宿区西新宿二丁目8番1号
- Transliterated: Tokyo-to Shinjuku-ku Nishi-Shinjuku 2-chome 8-ban 1-go
- Customer entered: 2-8-1 Nishi-Shinjuku, Shinjuku-ku, Tokyo
- Challenge: the document writes the district number as a kanji numeral with its counter (二丁目) and the block and building numbers as Arabic numerals with theirs (8番1号), while the customer used the common hyphenated short form (2-8-1). The address components are also in reverse order
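The numeric part of the Japanese example can be reduced to the hyphenated short form with a small amount of code. This is a sketch under narrow assumptions: it handles only the 丁目 / 番 / 号 pattern, kanji numerals below 100, and ASCII digits (not full-width ones), and the function names are invented for illustration.

```python
import re
from typing import Optional

KANJI_DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
                "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}

NUM = r"([0-9〇一二三四五六七八九十]+)"
# district (chome), block (ban, optionally written 番地), building (go)
BLOCK = re.compile(NUM + "丁目" + NUM + "番(?:地)?" + NUM + "号")


def kanji_to_int(s: str) -> int:
    """Convert an ASCII or kanji numeral below 100 (8, 二, 十五, 二十三) to int."""
    if s.isascii():
        return int(s)
    if "十" in s:
        tens, _, ones = s.partition("十")
        return (KANJI_DIGITS[tens] if tens else 1) * 10 + (KANJI_DIGITS[ones] if ones else 0)
    # No 十: read the digits positionally (e.g. 二三 -> 23).
    return int("".join(str(KANJI_DIGITS[c]) for c in s))


def compact_block(address: str) -> Optional[str]:
    """Return the 'chome-ban-go' short form, or None if the pattern is absent."""
    match = BLOCK.search(address)
    if match is None:
        return None
    return "-".join(str(kanji_to_int(g)) for g in match.groups())


print(compact_block("東京都新宿区西新宿二丁目8番1号"))
# 2-8-1
```

Once the document side is reduced to "2-8-1", the remaining comparison is between place names, where the reversed component order can be absorbed by token-based matching.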
Georgian utility bill
- On document: თბილისი, ვაჟა-ფშაველას გამზ. 71
- Transliterated: Tbilisi, Vazha-Pshavelas gamz. 71
- Customer entered: 71 Vazha-Pshavela Ave, Tbilisi
- Challenge: "გამზ." is an abbreviation for "გამზირი" (avenue). The customer wrote "Ave" instead. "ვაჟა-ფშაველას" is the genitive form; the nominative is "ვაჟა-ფშაველა" which the customer used
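The Georgian example can be walked through in code as well. The sketch below assumes the Georgian national romanization system for the letter table, uses a hypothetical street-type table, and strips a trailing "s" as a deliberately crude stand-in for genitive handling; a production system would need real morphological rules or a model that handles inflection.

```python
import re

# Georgian letters -> Latin, following the national romanization system
# (assumption: other systems mark aspirated consonants instead of ejectives).
KA_TABLE = {
    "ა": "a", "ბ": "b", "გ": "g", "დ": "d", "ე": "e", "ვ": "v", "ზ": "z",
    "თ": "t", "ი": "i", "კ": "k'", "ლ": "l", "მ": "m", "ნ": "n", "ო": "o",
    "პ": "p'", "ჟ": "zh", "რ": "r", "ს": "s", "ტ": "t'", "უ": "u", "ფ": "p",
    "ქ": "k", "ღ": "gh", "ყ": "q'", "შ": "sh", "ჩ": "ch", "ც": "ts",
    "ძ": "dz", "წ": "ts'", "ჭ": "ch'", "ხ": "kh", "ჯ": "j", "ჰ": "h",
}

# Hypothetical table mapping street-type words and abbreviations to one form.
STREET_TYPES = {"gamz": "avenue", "gamziri": "avenue", "ave": "avenue"}


def transliterate_ka(text: str) -> str:
    return "".join(KA_TABLE.get(ch, ch) for ch in text)


def tokens(text: str, strip_genitive: bool = False) -> set:
    """Lowercase, split on spaces/hyphens/punctuation, and unify street types."""
    words = re.findall(r"[\w']+", text.lower().replace("-", " "))
    result = set()
    for word in words:
        word = STREET_TYPES.get(word, word)
        # Crude heuristic: treat a trailing "s" on longer words as the genitive.
        if strip_genitive and len(word) > 3 and word.endswith("s"):
            word = word[:-1]
        result.add(word)
    return result


on_document = transliterate_ka("თბილისი, ვაჟა-ფშაველას გამზ. 71")
print(on_document)  # tbilisi, vazha-pshavelas gamz. 71

doc_tokens = tokens(on_document, strip_genitive=True)
customer_tokens = tokens("71 Vazha-Pshavela Ave, Tbilisi")
print(doc_tokens == customer_tokens)  # True
```

Comparing token sets sidesteps the word-order difference, while the abbreviation table and the suffix heuristic cover the two mismatches described in the challenge.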
Each of these examples requires script detection, transliteration, normalization, and fuzzy matching to produce a reliable result. A system that can't handle these cases will either reject valid documents (frustrating customers) or require manual review (defeating the purpose of automation).
How AI document understanding solves this
Modern AI models, particularly large vision-language models, bring a fundamentally different approach to this problem compared to traditional OCR pipelines:
- They understand document structure: they can distinguish the customer's address from the utility company's headquarters address, even when both appear on the same page, because they understand layout, font size, labels, and context
- They handle mixed scripts natively: a model trained on multilingual data can read Georgian, Arabic, and Latin text on the same document without switching modes or pipelines
- They can transliterate in context: rather than doing character-by-character transliteration, the model understands that "ул. Витоша" is a street address and produces a sensible Latin equivalent
- They're robust to poor quality: phone photos of documents, faded text, stamps overlapping text, and watermarks are all cases that AI models tend to handle gracefully where traditional OCR fails
The key shift is from a pipeline approach (OCR → text extraction → parsing → matching) to an end-to-end approach where a single model reads the document and produces structured, normalized output. This eliminates many of the failure modes where errors compound through the pipeline.
How trusqo handles international documents
trusqo automatically detects the script and language of uploaded documents and performs cross-script matching without any configuration. When you submit a verification request, here's what happens:
- The document is analyzed by an AI model that extracts the customer name and address, regardless of script
- Non-Latin text is transliterated to Latin characters, and the original script version is preserved alongside
- The transliterated address is compared against the expected address using fuzzy matching that accounts for abbreviations, word order differences, and transliteration variations
- A similarity score and pass/fail verdict are returned for each check
This works the same whether the document is a German bank statement, an Arabic utility bill, or a Thai government letter. You don't need to specify the language upfront or use different endpoints for different regions.
Full API documentation and integration guides are available at trusqo.com/docs.