HTML Internationalization:
Browser tests of 8-bit character codes

This is a historical document, not substantially revised since about 1998

This document contains the results of some detailed tests that were made in order to serve as a basis for the summaries that are included in my "Notes on Internationalization" briefing.

Disclaimer: The tests reported here were done only on the MS-Windows platform. Please don't for a moment think that I would advocate methods that were confined to one platform: we have far too many of those already. The WWW protocols were specifically designed to promote cross-platform interworking. The only explanation for the limited tests is the limits of my time and expertise, sorry. They can serve only as straw-polls of the general level of support for the specifications and standards involved. And, as ever, all statements in here are of course made to the best of my knowledge, but errors are possible and no responsibility can be accepted.

A Case Study: iso-8859-7 Greek

The test material was the table that I prepared earlier for testing browser compliance to iso-8859-1: it contains three columns (columns 6, 7, and 8) containing respectively the 8-bit character, the numerical character reference (&#number;) and the entity (&entityname;). Unlike the iso-8859-1 tests, however, the table was transmitted with an HTTP header that said:

Content-type: text/html;charset=iso-8859-7

According to the specifications (RFC2070 as well as the HTML4.0 recommendation), column 6 is required to be rendered according to the iso-8859-7 character code; but columns 7 and 8 are required to be rendered by reference to the Latin-1 definitions. It was a common mistake in early browser implementations to render columns 7 and/or 8 by reference to the iso-8859-7 character code, but this is emphatically wrong.
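
As a rough illustration of the rule just described (not part of the original tests), the following Python sketch decodes the 8-bit bytes according to the charset named in the HTTP header and only then resolves the character references, independently of that charset. The function names and the one-row sample are hypothetical. `html.unescape` follows present-day HTML rules, which for these Latin-1 values give the result that RFC 2070 requires.

```python
import html
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def decode_document(body: bytes, content_type: str) -> str:
    """Decode the 8-bit bytes per the charset, then resolve references."""
    charset = charset_from_content_type(content_type)
    text = body.decode(charset)   # "column 6": raw bytes follow the charset
    return html.unescape(text)    # "columns 7 and 8": independent of the charset


# One hypothetical table row: raw byte 0xE1, then &#225;, then &aacute;
row = b"\xe1 &#225; &aacute;"
rendered = decode_document(row, "text/html;charset=iso-8859-7")
print([f"U+{ord(ch):04X}" for ch in rendered if ch != " "])

# The common mistake: treating 225 as a position in iso-8859-7.
wrong = bytes([225]).decode("iso-8859-7")
print(f"U+{ord(wrong):04X}")
```

The raw byte comes out as Greek small alpha (U+03B1), while both references come out as Latin small a with acute (U+00E1). The "wrong" line shows what an early browser would have produced for the reference.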

The Alis Tango browser had been getting this right for a long time. As we will see, MSIE took longer to get it right, and Netscape4 became obsolete without ever getting it right, except for a few limited special cases.

Results

Note that these tests use the browser versions that happened to be at my disposal at the time. I've made no attempt to locate specific versions on the basis of their untypically good - or untypically bad - coverage; I have simply taken whatever was readily to hand. The images are rough-and-ready extracts snipped from screen shots, without any attempt to line them up or scale them equally: apologies for the rough edges.

Note about MSIE3: when one installs MSIE4 in Win95, the vendor states that it will override version 3. However, thanks to a tip on usenet I learned that the Win3.1 (16-bit) version of MSIE3 can be installed in Win95 alongside the Win95 (32-bit) version of MSIE4, and this indeed seems to be the case. I installed the newest version available at the time: 3.03. When I came to test it, I found that it was not responding to the advertised charsets in the HTTP headers, and needed the character set to be selected manually in the options before it displayed correctly. This experience differs from the earlier experiences with 32-bit MSIE 3.01 under Win95, as reported under koi8-r (note that MSIE3.01 required a "Pan-European Multi-Language Support" option installed). I do not know the reason for this apparent loss of functionality.

[Screen shots: Netscape Gold 3.03; MS IE 3.03 16-bit (see note above); Netscape 4.04; MS IE 4.71.1712.0]

Summary

Apart from minor discrepancies, all of the browser/versions tested here were able to display the 8-bit characters.

Netscape version 3 displayed both the numerical character references, and the named entities, wrongly. Netscape version 4 for the most part displayed "?" instead!

MS IE3.03 displayed both the numerical character references, and the named entities, correctly (this is an improvement from the IE3.01 situation described in the koi8-r report below). MS IE 4 was also correct in these respects. The alignment of the preformatted columns was somewhat erratic in 3.03.

Case study: Russian Cyrillic and the KOI8-R code

Note: I am using this example primarily as a means to illustrate how character codes other than ISO-8859-1 work (or should work) on the WWW. It's not my place to take sides in any argument about what character code ought to be used for writing Cyrillic: this isn't my area of study at all.

This contains material from an earlier case study, updated in December 1997. To the browsers previously tried, I have added the browser versions that also featured in the Greek tests reported above.

Note that many of the characters defined in ISO-8859-1 do not exist in koi8-r; and some of those that do exist (copyright, degree, superscript-2) are in different places. Also, the koi8-r code contains printable characters in the range 128-159 decimal (unlike the iso-8859-x series, where this area contains only control characters); however, I'm afraid that my tests didn't cover that part of the code, for the trivial reason that my test table doesn't include it. The koi8-r code includes a non-breaking space, but it's at position 154: position 160 contains one of the boxing characters. So, plenty of scope for confusion.
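
The positions mentioned above can be checked with a short Python sketch. It relies on Python's built-in `koi8-r` and `iso-8859-1` codecs, which I assume match the code charts used for the tests. The variable names are illustrative only.

```python
import unicodedata

# Characters that exist in both codes, but at different positions.
shared = {
    "copyright": "\u00a9",
    "degree": "\u00b0",
    "superscript-2": "\u00b2",
    "no-break space": "\u00a0",
}

# name -> (position in iso-8859-1, position in koi8-r)
positions = {
    name: (ch.encode("iso-8859-1")[0], ch.encode("koi8-r")[0])
    for name, ch in shared.items()
}
for name, (latin1_pos, koi8_pos) in positions.items():
    print(f"{name}: iso-8859-1 {latin1_pos}, koi8-r {koi8_pos}")

# Position 160 is the no-break space in iso-8859-1, but a boxing character in koi8-r.
char_at_160 = bytes([160]).decode("koi8-r")
print(unicodedata.name(char_at_160))

# In koi8-r the range 128-159 holds no control characters.
upper_range = bytes(range(128, 160)).decode("koi8-r")
no_controls = all(unicodedata.category(ch) != "Cc" for ch in upper_range)

# Many Latin-1 characters, such as e-acute, simply do not exist in koi8-r.
try:
    "\u00e9".encode("koi8-r")
    e_acute_available = True
except UnicodeEncodeError:
    e_acute_available = False
```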

A standards-compliant browser: Alis Tango

This screen shot shows the Alis Tango browser (2.5.1) displaying a (pre-formatted) document that has been sent out with charset=koi8-r. The test material is, apart from the charset parameter, the same ISO-8859-1 test table that I use for Latin-1 compliance, but of course the 8-bit characters (col.6) instead of following the ISO-8859-1 code now must follow the KOI8-R code. (At this point I used to cite the "Home of koi8-r", but that site now bombards one with advertising cookies, and refuses direct access to the code chart, so I've removed the link to it.)

[Alis Tango browser screen shot, 15k gif]

Note that columns 7 and 8 are still displaying the ISO-Latin-1 repertoire, as required by the HTML internationalization specifications.

A standards-compliant browser (as best it can): Lynx

The following applies at least to Lynx versions 2.6, 2.7 etc.

Lynx has the ability to process a document of the kind described above, provided it has access to a terminal emulation that implements the koi8-r code. For 8-bit characters, it sends the character directly to the terminal; for HTML entities, it uses the koi8-r equivalent, where there is one (copyright, degree, etc.) or a "7-bit approximation" such as "(R)" for registered trademark, "YEN" for yen, etc. The results can be seen later in an excerpt from a screen shot. The Lynx environment in which this is happening is as follows:

As can be seen below, column 6 - the 8-bit characters - now implements the koi8-r character code, just as with the Alis Tango browser. Columns 7 and 8, the numbered character references and named entities, still refer to their ISO-8859-1 assignments. Since Lynx has been told that its terminal display is set for KOI8-R, it cannot use many of the ISO-8859-1 characters that it needs for its repertoire. Examples of the approach used are as follows.

The Copyright sign, &#169; or &copy;
Lynx sends to the terminal the character (decimal)191, which is the copyright sign in koi8-r. Similar strategies are followed for the mid dot, the degree and the superscript-2 characters.
The broken vertical bar, &#166; or &brvbar;
There is no available broken bar in the terminal's character set, so Lynx uses an unbroken one as an approximation.
Fractions one-quarter, one-half, three-quarters
Lynx uses the "approximations" 1/4, 1/2 and 3/4.
And so on for the other ISO-8859-1 characters.
There is a configurable table in the Lynx source code, so that readers who needed a different approximation could tailor their own.
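
The strategy in the examples above can be sketched in Python as follows. This is not Lynx's actual code or its actual table: the table entries are taken from the examples in this note, the use of an ASCII "|" for the broken bar and of "?" as a last resort are my assumptions, and the function name is hypothetical.

```python
# Hypothetical approximation table, in the spirit of the configurable one in Lynx.
APPROXIMATIONS = {
    "\u00ae": "(R)",   # registered trademark
    "\u00a5": "YEN",   # yen
    "\u00a6": "|",     # broken bar -> unbroken bar (ASCII bar is an assumption)
    "\u00bc": "1/4",   # fraction one-quarter
    "\u00bd": "1/2",   # fraction one-half
    "\u00be": "3/4",   # fraction three-quarters
}


def to_terminal(text: str, display_charset: str) -> bytes:
    """Encode text for a terminal limited to display_charset.

    Characters the terminal code contains are sent at their position in
    that code; the rest fall back to a 7-bit approximation.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(display_charset)
        except UnicodeEncodeError:
            # "?" for characters missing from the table is an assumption.
            out += APPROXIMATIONS.get(ch, "?").encode("ascii")
    return bytes(out)


# Copyright exists in koi8-r (position 191); the others are approximated.
print(to_terminal("\u00a9 \u00bc \u00ae \u00a5", "koi8-r"))
```

With an iso-8859-1 display the same function would send the fractions directly, since that code contains them.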

The shortcomings of this approach are evident, but they are dictated by the fact that Lynx is displaying 8-bit characters to a terminal, whose repertoire is assumed to be restricted to the character code defined in its "display Character set" option. Perhaps one could say that its intentions are fully compliant with the protocol, but due to the limitations of its environment, its ability to fulfil those intentions is limited.

Recent versions of Lynx include support for a unicode-capable terminal emulation, and this was subsequently seen working very well with the 'putty' terminal emulator.

MS IE 3.03 16-bit

Bit of a mystery here. When set up in what seemed to be the obvious fashion, the display of the 8-bit characters was completely wrong. After playing around with the settings a bit, I hit on the following settings:

This then displayed correctly (provided that, as already noted in the Greek tests above, the user did manually select the View/Language option - this ought to happen automatically based on the HTTP header specifying charset=koi8-r). How odd. I've no idea whether this is some consequence of using a 16-bit (Win/3.x) version under Win95. I don't recall any similar problem with 3.01 (32-bit), although there was another problem with that, as you see in the table below.

Samples

This rough-and-ready table is exhibiting screenshot extracts from both the earlier and the later tests.

Lynx 2.6 or 2.7: terminal environment forces use of approximations for entities
Alis Tango 2.5.1: protocol-compliant
MS IE 3.01: &#number; wrong
MS IE 16bit 3.03: (see note above)
NS Navigator 3.01: &#number; and &name; wrong (3.03 was much the same)
NS Communicator 4.04
MS IE 4

[Lynx screen shot] [Alis Tango screen shot] [MSIE screen shot] [NSN screen shot]

Alis Tango displays all three columns correctly. Generally speaking, Lynx and NSN display the 8-bit characters correctly, and MSIE is correct, apart from 3.01's crude approximations used for the boxing characters. MS IE correctly displays the named entities, and Lynx does the best that it can ("7-bit approximations") within the limitations of its terminal environment. But MS IE 3.01 gets the numbered character references wrong, and NSN gets both the named entities and numbered character references wrong. NS displays the character at position (decimal)160 as a space, when in fact the 8-bit (koi8-r) character is one of the boxing characters, as can be seen from "column 6" in the Lynx, Tango and MS IE displays. And then there was the mystery about MSIE 3.03 (16bit) not responding to the HTTP header, and needing non-obvious manual configuration settings.

Straw Poll on i18n entities and character refs

Here are a few of the i18n characters that one might expect to have been implemented first. Let's try representing them as entity names, numerical character references (decimal) and as hexadecimal character references.

Entity names
ndash: &ndash;, mdash: &mdash;, trade: &trade;

Numerical character references
8211: &#8211;, 8212: &#8212;, 8482: &#8482;

Hexadecimal character references
x2013: &#x2013;, x2014: &#x2014;, x2122: &#x2122;

Results

Browser/version    Named entities   Decimal character references   Hex. character references
NS 4.05            No               Yes                            No, displayed ?xnumber;
MSIE 4.something   Yes              Yes                            No, displayed &#xnumber;
Opera 3.21         No               No, wrong display              No, displays &#xnumber;
Lynx 2.7.2         Yes*             Yes*                           Yes*

[*] Lynx was configured for an iso-8859-1 display, and hence used approximations: minus signs and "(TM)".
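
For reference, the three notations tried in this straw poll all name the same Unicode code points. The Python sketch below (illustrative only, using the standard `html.unescape`, which implements present-day HTML rules rather than any 1998 browser) resolves each form and compares the results.

```python
import html

notations = {
    "named": ["&ndash;", "&mdash;", "&trade;"],
    "decimal": ["&#8211;", "&#8212;", "&#8482;"],
    "hex": ["&#x2013;", "&#x2014;", "&#x2122;"],
}

# Resolve every reference to its character.
resolved = {
    kind: [html.unescape(ref) for ref in refs]
    for kind, refs in notations.items()
}

# Express each character as a code point, to keep the output plain ASCII.
code_points = {
    kind: [f"U+{ord(ch):04X}" for ch in chars]
    for kind, chars in resolved.items()
}
for kind, points in code_points.items():
    print(kind, points)
```

All three rows should list U+2013, U+2014 and U+2122, since 0x2013 is 8211 in decimal, and so on.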

Late addition: the "Euro" sign has been assigned the Unicode point U+20AC. We might expect to see it as the entity "euro" (&euro;), and as the numerical reference &#8364; (€). In fact, the latter representation works, both in MSIE4 and also (so long as the utf-8 trick is played) in NS 4, if the font supports it.

Conclusions

The performance of the popular browsers has been very disappointing, and was still not properly resolved at the date of this comment, around 1998. Lynx was still way ahead in principle! The protocols (HTTP and HTML) have been clear enough for several years already, even if they were only in draft; given access to appropriate, compatible fonts, the rendering of multi-alphabet documents is a solved problem. All that the browser makers needed to do was to apply the known principles. It would perhaps have been more useful for them to acknowledge that they do not support the option (as NS4 is sort-of doing when it displays "?").

And this note is only documenting their ability to browse straightforward HTML source! Their ability to browse ALT texts, to submit FORMs, etc. in these various codes was not investigated.

For widest browser coverage of an extended character repertoire, the &#bignumber; notation, using decimal character numbers, can still be recommended.
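
One way to produce that notation mechanically, sketched here in Python as an illustration (the helper name and sample text are hypothetical), is to let the standard `xmlcharrefreplace` error handler turn every non-ASCII character into a decimal character reference.

```python
import html


def to_decimal_refs(text: str) -> str:
    """Replace every non-ASCII character by a decimal &#number; reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Euro sign, en dash and trade mark, written as Unicode escapes.
sample = to_decimal_refs("Price: 10\u20ac \u2013 Brand\u2122")
print(sample)

# Resolving the references again gives back the original text.
round_trip = html.unescape(sample)
```

Note that this also converts Latin-1 characters such as accented letters, which is harmless, since their references keep the same meaning under every charset.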

