When you use software that is still under development you can’t expect trouble-free operation for 100% of all possible use cases. Software has bugs. Some software has lots of bugs.
Bugs come in several varieties: plain nuisances, security-critical flaws, and those caused by unforgivable stupidity and arrogance.
Building and distributing an application for Debian that is not locale aware falls under the last category. I mean utf-8 has been an officially supported locale since what – Sarge?
I am currently working on a bash script that lets me use Twitter from the command line or from within a Vim buffer. A script that goes beyond the mindless curl one-liners that are splattered all over the blogosphere. Web 2.0 for the text pistols so to speak.
Twitter is run by an internationally inclined bunch of smart piglets and uses utf-8 encoding on everything it does. Great, I thought, I use Debian with a utf-8 locale, so Twitter’s encoding suits me fine.
My bash script does all sorts of neat little tricks, and pulling in a friends_timeline is one of the easier chores you can task it with. The script’s flow for a timeline looks like this:
- tell curl to get the timeline
- pipe said timeline to smjs, which parses the JSON and writes formatted text to stdout
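The second step can be sketched roughly as follows. This is a minimal illustration, not the actual script: the sample data, the function name formatTimeline and the field names text and user.screen_name are assumptions, and it presumes a JavaScript engine that provides JSON.parse.

```javascript
// Hypothetical sample standing in for what curl would deliver.
// The field names (text, user.screen_name) are assumed for illustration.
var sampleJson =
  '[{"text":"Hello from the command line","user":{"screen_name":"alice"}},' +
  '{"text":"Second status","user":{"screen_name":"bob"}}]';

// Turn a JSON timeline into one "name: text" line per status.
function formatTimeline(jsonText) {
  var statuses = JSON.parse(jsonText);
  var lines = [];
  for (var i = 0; i < statuses.length; i++) {
    lines.push(statuses[i].user.screen_name + ': ' + statuses[i].text);
  }
  return lines.join('\n');
}

var formatted = formatTimeline(sampleJson);
```

In the real pipeline the JSON would arrive on stdin and the formatted text would go to stdout; reading and writing are left out here because they depend on the shell in use.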
It worked beautifully first time round, except for one really ugly flaw. Because smjs is not locale aware, it returns garbage for all characters outside iso-8859-1.
When used in a terminal or console it returned text littered with those reverse-video question marks which are a sure sign of an encoding mismatch. When I pulled its output into a Vim buffer, which does transliteration on the fly, it gave me umlauts and other German oddities which are in iso-8859-1, but flopped on € or the curly quotes which all those twittering journalists seem to love so much.
Nobody is perfect. Before I snap at somebody I always check my own work first. Maybe I had made a mistake somewhere? Like passing a bad substring to the script or whatever.
So I changed the script to something that could be loaded into a webpage and fed it to Iceweasel from a local html file. Hey presto – even the weirdest outlandish characters were rendered in all their splendour. No reverse-video question marks or unwanted transliterations.
So, back to smjs in interactive mode.
    js> var x = String.fromCharCode(0x20AC);
    js> print(x);
    � < this is the Unicode replacement char, a sure sign that smjs sends garbage to the terminal
    js>
    js> var euro = '€';
    js> var x = euro.charCodeAt(0);
    js> print(x);
    226
    js> var x = String.fromCharCode(226);
    js> print(x);
    js>
Readline as installed on my system, and against which smjs is supposed to have been compiled, is definitely utf-8 clean. The € has charcode value \u20AC and I would never be able to enter it into my terminal if readline had utf-8 problems. SpiderMonkey, as the test in Iceweasel showed, obviously has no problems either.
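The 226 in that session is telling. The utf-8 encoding of \u20AC is the three bytes E2 82 AC, and 0xE2 is 226, so the result is consistent with smjs treating every input byte as a character of its own. That is my reading of the symptom, not something I have confirmed in the source. A small sketch of the arithmetic, with helper names (utf8Bytes, bytesAsChars) made up for the purpose:

```javascript
// Encode a single Unicode code point as UTF-8 bytes by hand.
function utf8Bytes(codePoint) {
  if (codePoint < 0x80) {
    return [codePoint];
  }
  if (codePoint < 0x800) {
    return [0xC0 | (codePoint >> 6), 0x80 | (codePoint & 0x3F)];
  }
  if (codePoint < 0x10000) {
    return [0xE0 | (codePoint >> 12),
            0x80 | ((codePoint >> 6) & 0x3F),
            0x80 | (codePoint & 0x3F)];
  }
  return [0xF0 | (codePoint >> 18),
          0x80 | ((codePoint >> 12) & 0x3F),
          0x80 | ((codePoint >> 6) & 0x3F),
          0x80 | (codePoint & 0x3F)];
}

// Assumed model of the suspected bug: each byte becomes one character.
function bytesAsChars(bytes) {
  var s = '';
  for (var i = 0; i < bytes.length; i++) {
    s += String.fromCharCode(bytes[i]);
  }
  return s;
}

var euroBytes = utf8Bytes(0x20AC);      // the three bytes E2 82 AC
var misread = bytesAsChars(euroBytes);  // three chars instead of one
var firstCode = misread.charCodeAt(0);  // 226, as in the session above
```

A locale-aware shell would hand the engine a string of length 1 with charcode 0x20AC. Under the assumed model it gets a string of length 3 whose first charcode is 226.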
So whoever cobbled smjs together for the Debian repos must have made a really bad blunder somewhere. Like believing that all the world works in a C environment …
Of course said oaf would have noticed his blunder had he tested smjs in a utf-8 locale.
I got the bash script to work in terminal and console, after a fashion, by piping the output from smjs through iconv, which at least got rid of all the ugly reverse-video question marks, but the result is still far from perfect. The output now looks a bit like an Abiword doc with “show invisibles” set to ON.
Could the maintainer(s) of smjs at Debian please wake up to the fact that we live in an interconnected world and the year is 2009? In this day and age locale awareness is not an optional feature – it is a must!