Locale awareness

When you use software that is still under development, you can’t expect trouble-free operation in every possible use case. Software has bugs. Some software has lots of bugs.

Bugs come in several varieties: plain nuisances, security-critical ones, and those caused by unforgivable stupidity and arrogance.

Building and distributing an application for Debian that is not locale aware falls into the last category. I mean, utf-8 has been an officially supported locale since what – Sarge?

I do a lot of javascripting and I hate having to build a test site just to give a quick hack the once-over. So I went and installed smjs from the Debian repos.

smjs is what one might loosely term a shell wrapper around Mozilla’s SpiderMonkey javascript engine. Depending on the parameters you pass it on startup, it will either run as an interactive process, which you can use as a sort of javascript REPL, or work as part of a pipe, where it reads from stdin and writes to stdout or a file. You can also pass it a file of javascript for execution.

I am currently working on a bash script that lets me use Twitter from the command line or from within a Vim buffer: a script that goes beyond the mindless curl one-liners that are splattered all over the blogosphere. Web 2.0 for the text pistols, so to speak.

Twitter offers four reply formats, namely JSON, XML, RSS and Atom. Of the four, only JSON is available for all methods of Twitter’s API, hence I chose it. Parsing JSON in javascript is a no-brainer, because a timeline reply evaluates to an array of objects over which a script can loop without much further ado.
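
To make the "array you can loop over" point concrete, here is a minimal sketch. The reply string is a made-up, trimmed-down stand-in for a real timeline, and the two fields it uses (user.screen_name and text) are picked for illustration only. It uses JSON.parse, which a 2009-era smjs may well lack; evaluating the text directly was the usual route back then.

```javascript
// Hypothetical, trimmed-down timeline reply; real replies carry many more fields.
var reply = '[' +
  '{"user": {"screen_name": "alice"}, "text": "first post"},' +
  '{"user": {"screen_name": "bob"}, "text": "second post"}' +
  ']';

// Turn the JSON text into a plain javascript array of objects.
var timeline = JSON.parse(reply);

// Loop over the array and build one formatted line per status.
var formatted = [];
for (var i = 0; i < timeline.length; i++) {
  var status = timeline[i];
  formatted.push('@' + status.user.screen_name + ': ' + status.text);
}

// smjs offers print(); fall back to console.log in other shells.
var out = (typeof print === 'function') ? print : console.log;
out(formatted.join('\n'));
```

Everything after the parse is ordinary array handling, which is the whole appeal of JSON over the XML-based formats.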

Twitter is run by an internationally inclined bunch of smart piglets and uses utf-8 encoding on everything it does. Great, I thought, I use Debian with a utf-8 locale, so Twitter’s encoding suits me fine.

My bash script does all sorts of neat little tricks, and pulling in a friends_timeline is one of the easier chores you can task it with. The script’s flow for a timeline looks like this:

  1. tell curl to get the timeline
  2. build a javascript from several snippets and the returned JSON
  3. pipe said script to smjs which parses the JSON and writes formatted text to stdout

It worked beautifully first time round, except for one really ugly flaw. Because smjs is not locale aware, it returns garbage for all characters outside iso-8859-1.

When used in a terminal or console it returned text littered with those reverse-video question marks which are a sure sign of an encoding mismatch. When I pulled its output into a Vim buffer, which does transliteration on the fly, it gave me umlauts and other German oddities which are in iso-8859-1, but flopped on € or the curly quotes which all those twittering journalists seem to love so much.

Nobody is perfect. Before I snap at somebody I always check my own work first. Maybe I had made a mistake somewhere? Like passing a bad substring to the script or whatever.

Well – SpiderMonkey is Mozilla’s javascript engine and runs in Iceweasel, the rebranded Firefox that ships with Debian. The ideal testbed.

So I changed the script to something that could be loaded into a webpage and fed it to Iceweasel from a local html file. Hey presto – even the weirdest outlandish characters were rendered in all their splendour. No reverse-video question marks or unwanted transliterations.

So, back to smjs in interactive mode.

js> var x = String.fromCharCode(0x20AC);
js> print(x);
�    <- this is the Unicode replacement char,
        a sure sign that smjs sends garbage to the terminal
js>

js> var euro = '€';
js> var x = euro.charCodeAt(0);
js> print(x);
 226
js> var x = String.fromCharCode(226);
js> print(x);

js>
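
The 226 is no accident. Here is a sketch of the arithmetic, assuming smjs hands the script raw utf-8 bytes one per character and writes back only the low byte of each character. That is an inference from the transcript, not something checked against the smjs source, and utf8Bytes is a helper made up for this illustration.

```javascript
// Hypothetical helper: the UTF-8 byte sequence for one Unicode code point.
function utf8Bytes(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  if (cp < 0x10000) {
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
}

var euroBytes = utf8Bytes(0x20AC); // [226, 130, 172], i.e. E2 82 AC
var lowByte = 0x20AC & 0xFF;       // 172 (0xAC), what remains if the high byte is dropped
```

The first byte, 226, is exactly what charCodeAt(0) reported for the pasted €. Under the same assumption, String.fromCharCode(0x20AC) would reach the terminal as the lone byte 0xAC, which is not valid utf-8 on its own, hence the replacement character.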

Readline, as installed on my system and against which smjs is supposed to have been compiled, is definitely utf-8 clean. The € has the code point U+20AC (\u20AC in javascript), and I would never be able to enter it into my terminal if readline had utf-8 problems. SpiderMonkey, as the test in Iceweasel showed, obviously has no problems either.

So whoever cobbled smjs together for the Debian repos must have made a really bad blunder somewhere. Like believing that all the world works in the C locale …

Of course said oaf would have noticed his blunder had he tested smjs in a utf-8 locale.

I got the bash script to work in terminal and console, after a fashion, by piping the output from smjs through iconv, which at least got rid of all the ugly reverse-video question marks, but the result is still far from perfect. The output now looks a bit like an Abiword doc with “show invisibles” set to ON.
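
Another possible workaround, untested against the smjs build discussed here, is to do the conversion inside the script itself. The sketch below assumes that smjs delivers each input byte as one character and writes only the low byte of each character on output; decodeInput and encodeOutput are names invented for this example. It leans on the old escape/unescape trick, which only needs functions that engines of that era already shipped.

```javascript
// Assumption: input arrives as one character per UTF-8 byte.
// escape() turns those into %XX sequences, decodeURIComponent() reads them as UTF-8.
function decodeInput(byteString) {
  return decodeURIComponent(escape(byteString));
}

// Assumption: output keeps only the low byte of each character.
// encodeURIComponent() emits UTF-8 as %XX, unescape() maps each %XX to one character.
function encodeOutput(text) {
  return unescape(encodeURIComponent(text));
}

// The euro sign as the script would see it under the assumption above:
// three characters, one per byte (E2 82 AC).
var asSeenByScript = '\u00E2\u0082\u00AC';
var decoded = decodeInput(asSeenByScript); // a single character, U+20AC
var encoded = encodeOutput('\u20AC');      // three characters again, ready for byte-wise output
```

If the assumption holds, wrapping every string in encodeOutput before it is printed would put well-formed utf-8 on stdout without a trip through iconv.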

Could the maintainer(s) of smjs at Debian please wake up to the fact that we live in an interconnected world and the year is 2009? In this day and age locale awareness is not an optional feature – it is a must!

About dozykraut

Proud member of Hillbilly's on Linux, promoting open source redneckism in remote parts of the Milky Way.