I18n - some techniques

This page mentions a few selected techniques. It's not a balanced overview, but rather, some techniques with which this author has a little familiarity and which readers might find useful, depending on their actual authoring situation.

Note that some of the techniques shown here are pure character-code conversion utilities, whereas others are more or less HTML-aware, and can convert to and from HTML "ampersand" notations - primarily &#number; numerical references (or, where appropriate, &entityname; references) - for situations where the encoding does not cover some of the characters which need to be represented.

The SP package

(This section was written quite some time back and may now be outdated in some details. Check the SP web site!)

The SP package is a powerful set of SGML tools, which come with terse documentation that's aimed at SGML specialists. So, for the average HTML hack like myself, it's somewhat opaque. However, it seems that one of its tools, spam, can be used (amongst other things) to convert a valid HTML document from one supported charset to another. This could be used to convert the author's "appropriate 8-bit charset plus &-notations" into "utf-8 plus &-notations". This is not the kind of thing that I need for everyday use, so I only played around with it a bit, but I did this successfully for a test document that was in iso-8859-7. Under Win95 in a DOS window, with appropriate settings of PATH and of the environment needed by SP (SGML_SEARCH_PATH, SGML_CATALOG_FILES), it went like this:

set SP_CHARSET_FIXED=1
set SP_ENCODING=iso-8859-7
spam -b utf-8 -p source.htm >dest.htm

where the -p option causes the DOCTYPE to be copied over to the output. I gather that the spent tool in the same package can be used in a similar way to address this issue.

When working on the page you'd of course want to work on the original - which you could preview using a conforming browser such as MSIE4 - and then re-run the conversion for publication on the WWW.

A disadvantage for Cyrillic users is that the only Cyrillic encoding that the package seems to support is the iso-8859-5 one, which I'm told is little-used in practice. On the other hand, Russian Apache supports on-the-fly recoding of documents according to the recipient's requirements, so someone seriously contemplating serving out a body of Cyrillic documents might want to consider this option anyway, in which case it's no big deal which of the supported codings (charset) you are creating.

XML/XHTML issues

An XHTML document normally begins with the XML declaration, e.g.:
<?xml version="1.0" encoding="ISO-8859-1"?>

However, when the provisions in XHTML/1.0 Appendix C are used for sending XHTML as text/html to older HTML browsers, some are liable to display this as if it were part of the body content, which is annoying.

My recommended solution in this area, which is also the W3C's recommendation in the cited Appendix C, would be to send the charset= attribute on the real HTTP header: there is then no need to specify it in the <?xml...?> thingy, and that can then be omitted without harm. (Don't confuse this with the meta http-equiv in HTML - as far as XHTML is concerned, the meta comes too late to do any good - its only purpose, if used here, is for compatibility with HTML user agents.)
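
By way of a sketch only: if the server happens to be Apache and per-directory configuration is permitted, the AddCharset directive (from mod_mime) can put the charset onto the real HTTP header; the charset and extension shown here are merely examples to adapt:

# illustrative only - adjust the charset and extension to suit:
AddCharset ISO-8859-1 .html

which causes the server to send a header of the form

Content-Type: text/html; charset=ISO-8859-1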

However, some authors insist that they're unable to achieve this desirable state of affairs in their HTTP server configuration. Fortunately, the XML declaration is optional if the encoding is UTF-8, and, as has already been remarked, us-ascii is a proper subset of utf-8. Therefore, if the document is coded in us-ascii (what I called the "conservative recommendation" in the i18n quick-start page), then the XML declaration can be omitted and the XHTML document can be compatible with older HTML browsers. Note, however, that this option is not feasible if you used 8-bit coded characters, rather than &-notations, in your source.
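
As a sketch of that conservative option, here is the start of such a document: XHTML/1.0 Transitional, coded entirely in us-ascii, with no XML declaration, and with a non-ascii character supplied as an &-notation (the title and content are placeholders):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<!-- the meta is only for compatibility with HTML user agents -->
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii" />
<title>Example</title>
</head>
<body>
<p>An em-dash, &#8212;, in pure us-ascii source.</p>
</body>
</html>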

Perl

Perl 5.8.0 has native support for Unicode, and for the most part this works well, though there is a certain "learning curve", and there seem to be occasional anomalies and surprises. Nevertheless, this is certainly the way to go, and anyone seriously intending to use Perl for this kind of activity should be using at least 5.8.0.

Consult the perldoc pages, in your own Perl 5.8 installation or at the web site, for perluniintro and perlunicode.

Perl users would also want to go to CPAN or their nearest mirror for related modules.
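
By way of a sketch (the file names and charsets are placeholders, not a recipe), the Encode support bundled with Perl 5.8 can do the same job as the stand-alone converters discussed below, using I/O layers:

#!/usr/bin/perl
# Sketch only: recode source.htm from iso-8859-7 into utf-8 as dest.htm.
# The file names and encoding names are examples, not fixed choices.
use strict;
use warnings;

open my $in,  '<:encoding(iso-8859-7)', 'source.htm' or die "source.htm: $!";
open my $out, '>:encoding(utf-8)',      'dest.htm'   or die "dest.htm: $!";
print {$out} $_ while <$in>;
close $out or die "dest.htm: $!";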

GNU iconv

See http://www.gnu.org/software/libiconv/
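
To illustrate (the charsets and file names are again only examples), converting the iso-8859-7 test document of the earlier example into utf-8 goes like this; iconv -l lists the codings which your installation supports:

# illustrative charsets and file names:
iconv -f ISO-8859-7 -t UTF-8 source.htm >dest.htm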

The recode program

Last spotted as Free recode; version 3.5 supports a number of the interesting codings discussed here.

Comparison:

GNU iconv is a program which converts a document from one character coding to another.

Recode is a more powerful (and more complex) code converter than iconv. It uses iconv internally, but also handles some other issues, including line terminations and HTML character entities.
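
For example (with placeholder names once more), recode joins the before and after codings with two dots and, given a file name, recodes the file in place; used as a filter, it leaves the original untouched:

# in place (the file name is an example):
recode ISO-8859-7..UTF-8 dest.htm
# or as a filter:
recode ISO-8859-7..UTF-8 <source.htm >dest.htm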

Mozilla Composer

Mozilla Composer can be used for converting between different document encodings, including utf-8, by loading the document into Composer as if for editing, and then changing the document encoding before saving it.

As a manual conversion technique, this works very well.

Do not confuse this with similar operations in the old Netscape 4 Composer, which supported such activities badly and could seriously corrupt the content of a document.

Other possibilities

Windows Notepad on NT, 2000 and XP is actually capable of performing simple character-coding conversions, primarily between UTF-8, Windows 2-byte Unicode (UCS-2), and the user's default 8-bit coding.

I've said that NS3 and 4 don't work in general. There are, however, some exceptions to that statement. One way is to use charset=utf-8, or at least to pretend to do so as described in the "conservative recommendation". If the only non-Latin-1 characters that you are trying to get are the typographical niceties - trademark, em-dash, matched quotes etc. - then you may find that you can get them on Netscape even with the 8-bit charset settings. Unfortunately, in earlier versions of the "Big Two" browsers, one implemented only the entity names and not the &#bignumber; representations, while the other implemented only the &#bignumber; representations and not the entity names (both forms are shown below).
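
To make that concrete, here are both notations for the typographical niceties just mentioned; on a browser which supports them, either column produces the same character:

trade mark sign      &trade;   or   &#8482;
em-dash              &mdash;   or   &#8212;
left double quote    &ldquo;   or   &#8220;
right double quote   &rdquo;   or   &#8221;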

Apart from the above suggestions, about the only other option that has such a range of browser support is to send a properly encoded utf-8 data stream: a little has already been said about that, but a wider coverage of the topic is beyond the scope of the current note.

I must stress once again, though, that even if one is using a browser that in principle supports this part of HTML 4.0, it may need to be installed in a certain way, or may require extra operating system resources, fonts etc., before it can actually perform the desired function. With Lynx, of course, you also need a terminal emulation environment that supports what is needed: Linux consoles are capable of supporting this method of working (utf-8 is standard in Red Hat 9, for example); also the PuTTY ssh client for Win32 can be configured to support utf-8, and works well, at least as far as display is concerned (I've not done any tests of forms input in utf-8 from this environment yet).

