US-ASCII transliterations of Unicode text

Following on from this release, the this turned out to be easier than I thought – ported Text::Unidecode to PHP – code available here or track down the utf8_to_ascii package from the main page – released it seperately to keep with the original (Perl artistic) license while the rest of the stuff is under LGPL.

Now I first need to point out that what I’ve done is easy, compared to the amazing job Sean M. Burke has done with Text::Unidecode. You really need to read the docs to understand what it does and it’s limitations but, in short, it keeps a “database” of unicode characters and corresponding sensible US-ASCII equivalents. For example, a simple transformation would be “Zürich” to “Zuerich”, “ue” being a common replacement for “ü” in Germanic languages.

Really only came to understand how good a job Sean has done on passing this UTF-8 sampler through the PHP version – at a rough guess it did “something” for 85%+ of the non-ASCII characters in that document. Here’s some snippets to give you a feeling of before and after;

Before: *Sanskrit* /(standard transcription):/ k?ca? ?aknomyattum; nopahinasti m?m.
After:  *Sanskrit* /(standard transcription):/ kaca- saknomyattum; nopahinasti mam.

Before: *Greek*: ????? ?? ??? ???????? ?????? ????? ?? ???? ??????.
After: *Greek*: Mporo na phao spasmena gualia khoris na patho tipota.

Before: *Anglo-Saxon* /(Latin):/ Ic mæg glæs eotan ond hit ne hearmiað me.
After: *Anglo-Saxon* /(Latin):/ Ic maeg glaes eotan ond hit ne hearmiad me.

Before: *Soenderjysk*: Æ ka æe glass uhen at det go mæ naue.
After: *Soenderjysk*: AE ka aee glass uhen at det go mae naue.

Before: *Ukrainian*: ? ???? ???? ????, ? ???? ???? ?? ?????????.
After: *Ukrainian*: Ia mozhu yisti shklo, i vono mieni nie poshkodit'.

Before: *Farsi / Persian*: .?? ?? ????? ????? ????? ??? ???? ?????
After: *Farsi / Persian*: .mn my twnm bdwni Hss drd shyshh bkhwrm

Whether all of those actually make sense to a native speaker, I can’t say (feedback appreciated). I guess it depends partly on the language e.g. it’s easier to do with Greek than with Farsi. It should also be pointed out that the Text::Unidecode database (which I ported 1 to 1 – in fact there’s a script to automate it) isn’t entirely complete – for some characters and languages it has no data.

That said, if those languages are not relevant to your site, this can be a big help when you need ASCII not UTF-8. You might use this for filenames or critical “identifers” like a userid, for example, where you don’t want any risk of “phishing” or the overhead of processing UTF-8 characters. You might also consider it for search engine friendly URLs – although modern browsers largely support UTF-8 in URLs, phishing is again an issue and it may be smarter not to.

Anyway – the first PHP version “works” although no doubt it could get faster (although I doubt very much is will get as fast as the Perl version). This could also be readily ported to other languages like Python and Ruby.