Character code coverage - browser report

For a really quick start, you could try my FAQ section.

Last extensively revised 1996 Apr 23

Lynx 2.6/2.7 MS DOS and Win'32 ports included (4/97)

Opera 2.1 added, various minor additions through end 1996

Editorial Preface added May 1997

This document was started several years earlier, to document the extent to which browsers were supporting the ISO-8859-1 8-bit code on which HTML2.0 was based.

Things have moved on quite a bit now, and, apart from a few notable exceptions, the currently available browsers are able to display the full range of characters, and to interpret all of the entity names listed in the HTML2.0 (and HTML3.2) specifications. So, to a considerable extent this has now become more of a historical record, helping authors to understand how their documents might be rendered by back-level or obsolete browsers that some readers are still using, than a documentation of active problems in current browser versions.

Apart from the problem that still affects all Mac-based browsers as they come out of the box, and the problems that can result from mis-match between the browser and the environment (font repertoire, terminal encoding etc.) in which it is executing, it's fair to say that all current browsers are capable of displaying the iso-8859-1 repertoire correctly.

A couple of the characters are supposed to perform logical functions, rather than to be rendered as such. The TAB character has been explicitly deprecated since at least HTML2.0, and is not investigated here at all. There is still some disagreement over the precise interpretation of the &nbsp; entity: all browsers treat it correctly as a no-break space, but a few otherwise treat it as collapsible, whereas most browsers treat it as non-collapsible. Hardly any browsers implement the logical function of the soft hyphen; they merely render it as some kind of dash.

So, "the action has moved on", but I decided to leave this document on the web, with no more than a few tidying-up changes, at least for its historical interest. Of more vital concern nowadays is how well the available browsers deal with other character codes, such as the other iso-8859-x sets or the de-facto character codes such as koi8-r, or with extended codings, specifically iso10646/unicode. But those issues are outside the topic of this present page.

Disclaimer

This is a working document that was produced for our own purposes and is made available with all usual disclaimers in the hope that it may be of interest. It is incomplete, and may well contain errors. Some products that are mentioned here are or may be proprietary software, commercial packages, trademarked etc. Any statements about products are intended for discussion and fair comment only. A variable mileage production.

Introduction

This report will only make sense to a reader who understands how (8-bit) characters are encoded according to WWW (i.e HTML and HTTP) standards. Please review some appropriate materials if necessary. It was assumed that display of the lower half of the 8-bit code table, i.e the US-ASCII 7-bit code, was not a problem and needed no investigation.

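Characters from the upper half of the 8-bit table can be offered to a browser by three methods: (1) as the actual 8-bit character, (2) as a numeric character reference of the form &#number;, and (3) as a named entity of the form &name;.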

In the first two contexts, the current HTTP/HTML standards (HTTP/1.0, HTML 2.0 and 3.2) define only one representation of an 8-bit character code on the net, and that is the ISO8859-1 code. No other codings are required by the current standards, and no other network encodings were studied in this review. The questions of local storage encodings on the various platforms, and of a browser opening a local file, have not been specifically investigated in this survey, as they are not directly relevant to the question of offering a document to a user over the net. For all that the client cares, the document could be stored in, say, EBCDIC on the server, provided that the character codes are correctly mapped into the ISO8859-1 code for transmission over the net. The browsers have merely been checked as if they were "black boxes", with no concern for what happens inside them.
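To make the three representations concrete, here is a minimal sketch in Python. The function name three_forms is hypothetical, and the entity names are taken from Python's standard html.entities table, which I am assuming agrees with the names used in this report for code points 160-255.

```python
from html.entities import codepoint2name

def three_forms(ch: str) -> dict:
    """Return the three ways one upper-half ISO-8859-1 character can appear in HTML."""
    code = ord(ch)
    if not 160 <= code <= 255:
        raise ValueError("expected a character from code points 160-255")
    return {
        "8-bit byte": ch.encode("iso-8859-1"),        # the raw octet sent over the net
        "numeric reference": f"&#{code};",            # the &#number; form
        "entity name": f"&{codepoint2name[code]};",   # the &name; form
    }

forms = three_forms("\u00e9")  # e-acute
for label, value in forms.items():
    print(label, value)
```

Only the first form puts a byte above 127 into the file; the other two are written entirely in 7-bit US-ASCII.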

Of course, a web author who decides to work on a platform that does not use iso-8859-1 as its storage encoding (for example, a Mac, or indeed MS-DOS or an IBM mainframe) had better make sure that they understand how to get their documents correctly transferred to other platforms, but that issue is outside of the scope of the present page. My ISO-8859-1 tutorial might be useful in that respect.
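As a rough illustration of what "correctly transferred" means, the sketch below re-encodes a document from a platform's storage code into ISO-8859-1. The function name to_wire is hypothetical; the codec names mac_roman, cp850 and cp037 (an EBCDIC variant) are the ones shipped with Python, chosen here only as examples of the platforms mentioned.

```python
def to_wire(stored: bytes, storage_encoding: str) -> bytes:
    """Re-encode a document from its local storage code into ISO-8859-1."""
    text = stored.decode(storage_encoding)   # interpret the bytes as the platform stored them
    return text.encode("iso-8859-1")         # produce the bytes a web client expects

sample = "caf\u00e9 \u00a35"  # contains e-acute and the pound sign
for enc in ("mac_roman", "cp850", "cp037"):
    stored = sample.encode(enc)
    # The stored bytes differ per platform, but the wire form is always the same.
    print(enc, stored.hex(), to_wire(stored, enc).hex())
```

A transfer that copies the stored bytes unchanged skips the decode step, and that is precisely how wrong glyphs end up on the server.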

The Browsers Studied

The table below contains a description of some browsers studied. Earlier versions get removed if they do not show anything that I considered to be of note; they may be retained if they seem to me to be of interest for some reason (server logs do record quite a proportion of out-of-date browser versions).

Method

A table had been prepared, containing the code points 160-255(decimal) of the ISO8859-1 character code, represented in the three different ways indicated in the introduction. The table exists on an X Windows-based server, which uses ISO8859-1 as its storage code, thereby avoiding any risk of misinterpreted character codes. The choice of entity names is explained in the associated character-code briefing: they are consistent with the SGML public ISO* entity sets.
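A test document of this kind could be generated along the following lines. This is only a sketch under my own assumptions: the PRE layout, the page title and the function name build_test_page are hypothetical and are not the layout of the table actually used, and the entity names come from Python's html.entities.

```python
from html.entities import codepoint2name

def build_test_page() -> bytes:
    """Build a page showing code points 160-255 in all three representations."""
    lines = []
    for code in range(160, 256):
        name = codepoint2name[code]
        # decimal, hex, raw 8-bit character, numeric reference, named entity, name as text
        lines.append(f"{code:3d} {code:02X}  {chr(code)}  &#{code};  &{name};  {name}")
    page = (
        "<HTML><HEAD><TITLE>ISO-8859-1 test table</TITLE></HEAD>\n"
        "<BODY><PRE>\n" + "\n".join(lines) + "\n</PRE></BODY></HTML>\n"
    )
    # Store the page in ISO-8859-1 so the raw characters are single 8-bit bytes.
    return page.encode("iso-8859-1")

page = build_test_page()
print(len(page), "bytes")
```

In a compliant browser the three middle columns of every row should show the same glyph; any row where they differ points at one of the anomalies described below.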

Once the table document had been created, it was viewed from the various browsers under test by opening it over the network using the http: method only. Where feasible, the displayed document was printed out for later reference, but it should be stressed (since the printed version didn't always correctly match the display) that the observations reported here were made on the screen display version.

Results

Character cell browsers
Character cell browsers such as Lynx (or the now practically obsolete CERN WWW) work in a terminal emulation environment. Although the browsers themselves are perfectly capable of rendering the iso-8859-1 repertoire correctly, this can only work if the terminal emulation environment is set up properly. There is no way that the browser can control or verify the correct behaviour of the terminal environment: it is entirely the responsibility of the user to ensure that the terminal emulation (or DOS console etc.) is set up for the correct behaviour, and that (where appropriate) the browser's Options settings correctly reflect the setup.

If you use one of these browsers, and your session fails to display the full repertoire of 8-bit characters documented here, then that is probably not the fault of your character cell browser, but a shortcoming of the terminal session that you are using. The assertion often made on usenet that "Lynx cannot display 8-bit characters" is simply NOT TRUE, no matter how often it is repeated. Note that on some terminal emulations, the no-break space character (or its representation &nbsp;) might be displayed as a rectangular block, rather than as ordinary "white" space.

SAMBA/MacWWW,
Obsolete, does not support 8-bit characters, not considered further in this review.

Mac-based browsers
All of the browsers that were tested for the Mac platform were in violation of the standard by displaying incorrect glyphs at a number of code points.

The reason for this problem, rather obviously, is that the Mac standard fonts do not contain all of the required characters: for the most part, the browser makers have mapped the unavailable ISO-8859-1 characters to the remaining unassigned code points in the Mac font. They are mostly consistent with each other because they mostly apply the "de-facto mapping" (see associated article for further discussion).

hex          decimal         description
A6           166             broken vertical bar
B2, B3, B9   178, 179, 185   superscript 2, 3, 1
BC, BD, BE   188, 189, 190   fractions 1/4, 1/2, 3/4
D0, DD, DE   208, 221, 222   upper-case Eth, Y-acute, Thorn
F0, FD, FE   240, 253, 254   lower-case eth, y-acute, thorn
D7           215             multiplication sign

The ISO8859-1 code points affected by the problem are shown in the table, and these browsers are noted in the table as "Mac" where they are otherwise compliant with the standard. It may be noted that Macs that are running MacX, or that are running an ISO-8859-1 terminal emulation, have no problem displaying the browser output correctly from a remote host.

Mac Mosaic character code bugs (long-standing)
Mac Mosaic displayed the entity &yuml; with a currency sign! It did, however, display the corresponding code point, &#255;, or the actual 8-bit character, correctly as y-diaeresis. On reviewing the situation, I found that the bug had been present as early as version 1.0.3, and, despite several emails to the developers, was still there in version 3.0b4. I also noticed that NCSA Mac Mosaic 2.0.1 was displaying &iacute; incorrectly as I-acute (upper-case), and on checking back with the older printouts, I found that this too had been the case since 1.0.3 at least, although I had failed to notice it before! By 3.0b4, the iacute bug had gone, but the yuml bug was still there.

In autumn 1996 I finally got a response from the Mac Mosaic folks at NCSA saying that they had other priorities, and didn't intend to do anything about character code bugs any time soon. (NCSA later froze Mosaic development, as you presumably know).

(Amusingly, NCSA have made a character code blunder in their own buglist, using an upper-case O-tilde in one place where they should have used an apostrophe! In the ISO8859-1 code, O-tilde is hex 'D5', and in the Mac code, hex 'D5' is the closing single quotation mark, which it seems they have mistaken for an apostrophe. Evidently they have attempted to use a Mac 8-bit character in their text, and have then transferred the file from the Mac to their - probably Unix-based - server system without taking account of the different storage code used on the two systems. An instructive and cautionary tale.)
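The mechanism of that blunder can be reproduced in a few lines. This is a sketch using Python's mac_roman and iso-8859-1 codecs as stand-ins for the two storage codes; the variable names are mine.

```python
typed_on_mac = "\u2019"                      # closing single quotation mark, used as an apostrophe
stored = typed_on_mac.encode("mac_roman")    # the byte as saved on the Mac
misread = stored.decode("iso-8859-1")        # the same byte read as ISO-8859-1 on the server

print(stored.hex())   # the byte value in hex
print(misread)        # the character a compliant browser will show
```

The byte itself never changes during the transfer; only its interpretation does, which is why the error is invisible to the author working on the Mac.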

Mac fonts
A further point to be made about Mac-based browsers is that some Mac fonts are incomplete. For pre-formatted text, for instance, the Monaco font is missing some of the characters needed, over and above the problem of the fourteen characters that do not exist in the Mac character code. This means that, of the fonts delivered with the original Mac operating system, the only preformatted font worth choosing is Courier.

I can only stress (and suggest that authors stress to their readers, where accurate reproduction of these characters may be an issue), that Mac-based browsers should be tested before being believed. Links to suitable test tables have already been provided.

Some versions of Mac Netscape
contain a Preferences option called "Default encoding" or "Document encoding". See my accompanying article for further discussions. (The release notes remind you that "appropriate" system fonts also have to be installed on your system. However, I see no explicit admission that the support for ISO8859-1 is not correct using the standard Mac fonts.)

At Mac Netscape 2.0b1, a change was noticed to the display of five of the fourteen characters that are usually displayed wrongly on Mac-based browsers. The superscript-1,2,3 characters were displayed as normal digits 1,2,3, the broken vertical bar as an unbroken vertical bar, and the multiplication sign looked like a lower case letter "x". This is clearly the start of an attempt to remedy the shortcomings of the usual Mac character code display (though there seems to be no mention of it in the respective release notes) and, as such, is to be welcomed in the interests of standards-compliance; but the results, so far, are not exactly impressive. At version 3, they had gone further, and kludged the fractions 1/4, 1/2 and 3/4 too. Mac Netscape 2.0 beta (I happened to notice this at 2.0b3) also displays some of the characters in the undefined range, 128-159, differently from other Mac-based browsers, but you should not be using these on the WWW anyway.

Disappearing nbsp on X based browsers
Several users of X based browsers have reported that the no-break space does not leave any space between words. It is my understanding that this is a bug in some X Windows fonts, in that the font assigns the no-break space a width of 0. For what it's worth, the no-break space works for me on the X-based browsers that I use: I constructed a test in which &nbsp;, &#160; and an ordinary space were on successive lines between vertical bars, and they lined up perfectly with each other, both in normal text (variable spaced font) and in PRE text (monospaced font).

Toby Speight suggests a work-around for emacs: include the following in the init file
(standard-display-european t)           ; tell emacs we have an ISO-8859-1 font
(aset standard-display-table 160 [32])  ; fix nbspace bug: display code 160 as an ordinary space

NCSA Mosaic (various platforms)
The NCSA Mosaic teams for Win, Mac and X do not seem to have liaised very closely with each other. NCSA WinMosaic 2.0 supported die, macron, degree, Cedilla, compatible with the HTML+ list; of these, NCSA MacMosaic 2.0.0B12 supported only die and degree. NCSA X Mosaic 2.6, by contrast, supported uml, hibar, deg, cedil, compatible with M.Ramsch's list and almost (apart from hibar) compatible with the ISO* entity sets.

WinWeb
A bug was observed with the version of WinWeb 1 tested, in that it sometimes failed to interpret valid &entity; sequences in running text. The problem was not seen when the sequences were in isolation, as in the test tables.

Cello 1.01a
A browser that was very popular in its day, but had not been updated since 1994, pre-dating some of the features of HTML2.0. Details now removed from this survey.

UdiWWW
was a free browser from Bernd Richter (then of rz.uni-ulm.de). The URL, http://www.uni-ulm.de/%7Erichter/udiwww/index.htm , no longer works, so the link has been removed. It continued to improve month by month over a period of time, but there was a gap before 1.2.000 was released in April 1996, then the project was frozen. Its coverage of entity names is very good.

Netscape (various versions, for Windows, Mac and X),
the first "commercial" browser included in this evaluation, over a considerable period of time seemed to feel no need for the pound sterling entity, nor indeed for the Japanese yen, which I found rather amusing. Coverage for the RFC1866 "proposed entities" finally appeared in the preview (ATLAS) of NN version 3, just in time to scrape into the HTML3.2 document.

Netscape (up to and including version 3) has a curious strategy of dealing with unknown entity names, e.g if presented with the HTML+ entity &degree; it interprets the &deg part as a degree sign (per the HTML2.0 entity &deg;), then displays the residue ree; as text.

Most other browsers, when presented with an unknown entity name, display the entire construct, including the opening ampersand and terminating semicolon, as text, which (at least for a reasonably HTML-aware reader) seems to me to be a good compromise. Netscape 4 (PR versions tested) seem to have adopted this strategy too, causing the usual confusion to people who omitted the terminating semicolon and were relying on the error fixup behaviour of NS's earlier versions.
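The two strategies can be contrasted with a small sketch. Both functions and the three-entry KNOWN table are hypothetical illustrations of the behaviour described above, not a reconstruction of any browser's actual code.

```python
import re

# Tiny illustrative subset of entity names; a real browser has the full list.
KNOWN = {"deg": "\u00b0", "copy": "\u00a9", "amp": "&"}

def prefix_fixup(text: str) -> str:
    """Expand the longest known leading part of each name, keep the rest as text."""
    def repl(match):
        name, semicolon = match.group(1), match.group(2)
        if name in KNOWN:
            return KNOWN[name]  # exact match; the semicolon, if any, is consumed
        for end in range(len(name) - 1, 0, -1):
            if name[:end] in KNOWN:
                return KNOWN[name[:end]] + name[end:] + semicolon
        return match.group(0)   # nothing recognised: leave untouched
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*)(;?)", repl, text)

def strict(text: str) -> str:
    """Expand only complete, known &name; constructs; show everything else as typed."""
    def repl(match):
        return KNOWN.get(match.group(1), match.group(0))
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)

print(prefix_fixup("&degree;"), strict("&degree;"))
print(prefix_fixup("20&deg C"), strict("20&deg C"))
```

The second line of output shows why authors who omitted the terminating semicolon were caught out: the prefix strategy silently repairs the omission, while the strict one shows the raw text.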

Lynx 2.7 versions for MS-DOS and for Win32.
See http://www.fdisk.com/doslynx/. These ported versions work well provided that CP850 is used. For the DOS version, this is a simple matter of following the standard instructions in the DOS manuals.

For the Win'32 version it's necessary to put the "Windows console" into CP850 (by default it's in CP437), and this is done by an obscure procedure that is accompanied by dire warnings about invalid filenames. On the Win'95 CD-ROM, look for a folder \other\changecp and execute the changecp.exe file therein. I accept no responsibility for consequences, but it worked well for me. The changecp software is also said to be downloadable from the Microsoft ftp server. If you are unwilling to apply this configuration change, your only recourse is to set the Lynx Options/ display-Charset to be "IBM PC code page" (it means CP437), but then, Lynx is forced to render some of the iso-8859-1 characters by means of approximations.
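The difference between the two code pages can be checked with a short sketch. It relies on Python's cp850 and cp437 codecs, which I am assuming match the DOS code pages closely enough for this purpose; the function name missing_from is hypothetical.

```python
def missing_from(codepage: str) -> list:
    """List the ISO-8859-1 characters 160-255 that the given code page cannot encode."""
    missing = []
    for code in range(160, 256):
        try:
            chr(code).encode(codepage)
        except UnicodeEncodeError:
            missing.append(chr(code))
    return missing

for cp in ("cp850", "cp437"):
    gaps = missing_from(cp)
    print(cp, len(gaps), "characters unavailable")
```

Every character reported as unavailable for CP437 is one that Lynx has to approximate when display-Charset is set to "IBM PC code page".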

DOSLynx V0.8 alpha (and various subsequent patches)
This original DOSLynx is obsolete. See previous section for a much better solution now.

MINUET
was not specifically a WWW browser - it seemed to be an established package to which WWW support was added in around Spring 1995 and I tested it soon thereafter. The product was located at ftp://minuet.micro.umn.edu/pub/minuet/ but isn't there any longer.

MINUET when presented with 8-bit characters was seen to display nonsense for the most part. Since it was being tested on a DOS that was configured to use the default code page CP437 (US English), I tried it again with the "international" code page CP850 instead. The browser then displayed all 8-bit characters correctly. Unfortunately, there was no support for the &#number; construction, and the support for the &name; construction was very patchy. Some of the entity names for ISO Latin-1 accented letters were not recognized, and some were displayed incorrectly. Of the remaining entity names, most were not recognized, and some of those that were recognized were displayed incorrectly. In conclusion, this browser is not recommended in the terms of this assessment.

arachne
The name "arachne" has been used a lot; in this case it was referring to the arachne browser version 1.07

This was a somewhat quirky graphical browser for MS-DOS, which is certainly worth a look for those with low-powered old PCs. Unfortunately, although it claimed to support HTML3.2 (which by implication should mean it would correctly support iso-8859-1), it didn't. Indeed, its "bugs" document stated that support for ISO Latin 1 and 2 is a "Feature to be implemented later (Sept 1997?)" (which meant that any claim about supporting HTML3.2 or even HTML2.0 was untenable). This browser had been excluded from the tests for that reason. Jul 1998: version 1.41 finally added iso-8859-1 support (its current embodiment - Dec.1999 - can be found at Arachne Labs, Prague).

emacs-w3
Complete coverage in e.g version 3.0.62. (Provided your X server hasn't got bugs in it - it seems as if one of the X servers I tried had mistakes in it, presumably a font problem.)

Earlier reports received had mentioned problems with the no-break space, but these seemed to be associated with the X font, that had a zero-width character at this position, a problem that has afflicted other X-based character cell browsers too.

Chimera
I haven't seen this one, but in Jan 1996, Abigail "from Mars" emailed me with the following report for Chimera 1.65.

The non-breaking space got displayed as a zero-width character, when a monospaced font was used. No problem with a variable width font. [We agree that the prime suspect here would be a font problem, similar to what has already been mentioned above.]

Of the named entities, the Latin-1 letters were all displayed correctly. But none of the additional entity names was honoured.

No problems in displaying 8-bit characters, nor &#number; representations.

Having described some specific anomalies in the various browsers, we now move to a tabulation of the extent to which the various browsers support characters passed using the &entity; mechanism: the entities can be divided into groups.

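For the purpose of this report, therefore, the entities have been broken into five groups, tabulated respectively in the last five columns of the table below: the ISO Latin-1 accented letters (ISO-L1), copyright and registered trademark (C/R), pound sterling (pound), no-break space (nbsp), and the remaining entity names (other).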

Please note that this report covers, unless otherwise stated, only the entity names that are listed in my main table (i.e those listed in RFC1866). Some variants and additional non-ISO-8859-1-repertoire entities are listed in my associated briefing materials, but unless I mention them explicitly, the results tabulated below should not be taken to imply anything about the browsers' ability to display them.


Browser version                         8-bit char  &#n;    ISO-L1  C/R  pound  nbsp  other

X Windows:
X Mosaic 2.6 released                   all         all     all     yes  yes    yes   almost all
X Netscape 1.1N                         all         all     all     yes  no     yes   no
X Netscape 2.0 released                 all         all     all     yes  no     yes   no
arena 0.97h                             all         all     all     yes  yes    yes   some
emacs-w3 (see text)                     all         all     all     yes  yes    [6]   all
Chimera 1.65 (see text)                 all         all     all     no   no     [6]   none

Macintosh:
Mac Mosaic 2.0.1                        Mac         Mac     Mac[4]  yes  yes    yes   few
Mac Mosaic 3.0b4                        Mac         Mac     Mac[4]  yes  yes    yes   few
Mac Netscape 1.1N                       Mac         Mac     Mac     yes  no     yes   no
NS Navigator 2.01 Mac                   Mac[5]      Mac[5]  Mac     yes  no     yes   no
NS Nav 3.0 Mac                          Mac[8]      Mac[8]  Mac     yes  yes    yes   many
MacWeb 1.0.0A3.2, TradeWave 1.1.1E      Mac         Mac     Mac     yes  yes    yes   one

MS Windows:
Win Mosaic 2.0 released (2.1 was same)  all         all     all     yes  yes    yes   most[1]
MS IE 2.0 4.40.516                      all         all     all     yes  no     yes   no
MS IE 3.0 4.70.1155                     all         all     all     yes  yes    yes   all
NS Atlas PR1 (= v.3) Win95              all         all     all     yes  yes    yes   all
NS Navigator Gold 2.01 Win95            all         all     all     yes  no     yes   no
Win N'sc' 32 1.22                       all         all     all     yes  no     yes   no
WinWeb 1.0A2.2                          all         all     all     no   no     no    no
UdiWWW 1.0.010                          all         all     all     yes  yes    yes   all+[2]
Opera 2.1 beta 3                        all         all     all     yes  yes    yes   all
Tango 2.5.1                             all         all     all     yes  yes    yes   all+[10]

MS DOS:
DOS/Win32 ports of Lynx 2.6/2.7 [9]     all         all     all     yes  yes    yes   all+[2]
Minuet 1.0beta18A                       all         failed  some    no   no     no    [3]

Character terminal (TELNET) based:
Lynx 2.4.1                              all         all     all     no   no     yes   no
Lynx 2-4-FM Dec'95                      all         all     all     yes  yes    yes   almost all
Lynx 2-5FM Jun'96, 2.6, 2.7 etc.        all         all     all     yes  yes    yes   all+[2]
CERN WWW 3.0                            all         all     all     no   no     no    no

[1] NCSA WinMosaic 2.0 supported the variant entity names laid down in HTML+, rather than those in the ISO* entity definitions. It was also missing the entity name &not;.
[2] Supports all the ISO* entity names for the ISO-8859-1 repertoire, as well as supporting some variant entity names from HTML+ and Hyper-G.
[3] See text. Needed to select MS DOS code page 850 as the default although not mentioned in documentation. Some named entities were displayed wrongly, and unrecognized entity names were displayed in an unacceptable way.
[4] But &yuml; was incorrectly rendered, and on 2.0.1 so was &iacute;, as mentioned in preceding text. By 3.0b4 the iacute bug had been fixed, but the yuml bug was still there.
[5] Five characters (out of the fourteen usually displayed wrongly on Macs) were displayed differently, see text.
[6] Font problem, see text.
[7] DOS CP850 selected as stated in README; y-diaeresis (ÿ) was missing from the display.
[8] Similar to [5], but additionally, the fractions (1/4, 1/2 and 3/4) had been kludged.
[9] When using DOS codepage 850, and Lynx Options configured to match.
[10] Tango covers additional entities such as emsp, ndash, mdash, trade, and their numerical equivalents in accordance with RFC2070.


So, as an author, what do you recommend me to do?

I think my best overall advice is that which I have given in my "Conclusions" below. However, there are a few details that it might be worth mentioning here. There are, as you see, apart from the set of characters that are well-covered by browsers, a set whose coverage is, to say the least, erratic (especially on Macs). If it is vital to you that you get those special characters displayed correctly, irrespective of the known shortcomings of browsers, then I offer you the following discussion.

1. The legalistic approach
You compose your document strictly according to the HTML2.0/3.2 specs (e.g as recommended in my "Conclusion"), and add a caveat to the reader stating that this document must be viewed with a fully compliant browser. You helpfully mention that Mac browsers generally aren't compliant!

2. A helpful approach
You compose your document according to HTML2, but you add a little "browser-check table", either in the document or linked to it, that lists the known-to-be-troublesome characters that you are using, composed in whichever way you decided to compose them, alongside a plain-text description of what the character should be.

Providing them with a pointer to my article, or to the resources which it references, might be useful to your Mac-based readers if they are unfamiliar with the issues.

3. The composite IMG/ALT approach
Here, you will provide small images (e.g GIFs) of the troublesome characters, and put them inline to your text - with all the disadvantages that this implies, i.e there is no way of matching the size of the images to the unknown size of the browser's fonts. Also, you need to provide some kind of ALT text for character cell browsers or for users who run their graphic browsers with image loading off.

Conclusions, and recommendations to authors

This survey did not cover any of the characters in the low half (7-bit US-ASCII) part of the table. I take it for granted that any browser worth using will honour the entities or character references that are required for representing those characters that would otherwise be interpreted as part of HTML constructs (less-than, ampersand, etc.).

Note also that where a browser (such as Netscape) allows the user to select a non-standard character code configuration, there is nothing that the author can do directly to ensure a correct display; the same is true for browsers such as Lynx, that work via terminal emulations that might be configured wrongly.

As has been demonstrated in the course of this survey, files containing 8-bit character data can be handled properly by web servers and browsers using the http transfer protocol. Nevertheless, for reasons that have been discussed in more detail elsewhere, the use of such files can lead to significant difficulties when exchanging files in other ways (FTP, email, diskette...) between systems that use different native character codes. Consequently, at the present state of computer systems I personally would discourage the use of such files, at any rate in an English-speaking environment (authors who are accustomed to typing in accented 8-bit characters in their own language will need to reach their own conclusions.) Instead, I am recommending that HTML files be stored using only characters from the 7-bit US-ASCII set, and where glyphs from the upper half of the ISO8859-1 table are required, they should be represented by one of the other techniques. You could consider using a utility that converts between 8-bit HTML and entity representation: I don't use one myself, but H.Churchyard's htmlchek package included a utility called entify which should do the trick, but the package is no longer actively worked on. Free recode also has an option for doing this conversion, and still seems to be supported.
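For readers who would rather see the idea than hunt for a utility, the conversion amounts to something like the sketch below. It is emphatically not the htmlchek entify program nor recode, whose options I have not reproduced; the function name and the prefer_names parameter are my own, and the entity names come from Python's html.entities.

```python
from html.entities import codepoint2name

def entify(text: str, prefer_names: bool = True) -> str:
    """Rewrite upper-half ISO-8859-1 characters as references, leaving US-ASCII alone."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                              # 7-bit US-ASCII passes through
        elif 160 <= code <= 255:
            if prefer_names:
                out.append(f"&{codepoint2name[code]};")  # mnemonic &name; form
            else:
                out.append(f"&#{code};")                 # numeric &#number; form
        else:
            # 128-159 and anything beyond 255 is outside the ISO-8859-1 repertoire
            raise ValueError(f"code point {code} is not a displayable ISO-8859-1 character")
    return "".join(out)

print(entify("caf\u00e9 \u00a9 1997"))
print(entify("\u00bd", prefer_names=False))
```

The result contains only 7-bit characters, so it survives FTP, email or diskette transfer between systems with different native codes.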

Out of consideration for Mac-based users, authors might be wise to avoid the fourteen characters that are affected by the "Mac" problem as described above if it's at all feasible to do so; or to include some kind of note alerting Mac users to a potential problem. At the risk of stating the obvious: on no account should authors put characters into their HTML files with the intention of displaying the non-compliant glyphs that are typically displayed by Mac-based browsers. This would be perverse, and causes confusion to readers who are using properly-compliant browsers.
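An author who wants to know whether a document touches any of the fourteen characters could check it mechanically. The sketch below hard-codes the code points from the table earlier in this report; the function name is hypothetical, and Python's html.unescape is used so that characters written as &#number; or &name; are caught as well as raw 8-bit ones.

```python
import html

# The fourteen ISO-8859-1 code points affected by the "Mac" problem (from the table above).
MAC_PROBLEM_CODES = {
    0xA6,               # broken vertical bar
    0xB2, 0xB3, 0xB9,   # superscript 2, 3, 1
    0xBC, 0xBD, 0xBE,   # fractions 1/4, 1/2, 3/4
    0xD0, 0xDD, 0xDE,   # upper-case Eth, Y-acute, Thorn
    0xF0, 0xFD, 0xFE,   # lower-case eth, y-acute, thorn
    0xD7,               # multiplication sign
}

def mac_problem_characters(html_text: str) -> list:
    """Return the affected characters used in the text, in code point order."""
    resolved = html.unescape(html_text)   # turn references into the characters they denote
    return sorted({ch for ch in resolved if ord(ch) in MAC_PROBLEM_CODES})

found = mac_problem_characters("3 &times; 2&frac12; caf&eacute;")
print(found)
```

An empty result means that no note to Mac users is needed on account of these characters.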

With the browser versions reported on here (this was not so with some now-obsolete browsers), the review has shown that equally good results are achieved with the &#number; representation as with including the actual 8-bit character in the file. This can therefore be recommended with a clear conscience for any character that is defined in the ISO8859 specification (and I repeat, this does NOT mean that you can use it for any value of #number that you choose - you may only use the values that ISO8859 assigns to displayable characters). But of course this representation has no mnemonic value - users should keep a table of the ISO8859-1 code handy if they propose to use this method.
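The restriction on permissible values of #number can be checked mechanically. In this sketch the function name is hypothetical, and I have taken "displayable" to mean the printable US-ASCII range 32-126 plus 160-255, which is my reading of the rule rather than a quotation from the specification.

```python
import re

def bad_numeric_references(html_text: str) -> list:
    """Return the &#number; references whose value is not a displayable ISO-8859-1 code."""
    bad = []
    for match in re.finditer(r"&#(\d+);", html_text):
        n = int(match.group(1))
        printable_ascii = 32 <= n <= 126
        upper_latin1 = 160 <= n <= 255
        if not (printable_ascii or upper_latin1):
            bad.append(match.group(0))
    return bad

suspects = bad_numeric_references("&#233; &#150; &#160; &#153;")
print(suspects)
```

Values in the range 128-159 are the usual offenders: they happen to produce dashes, quotes or a trademark sign on some platforms, but ISO8859-1 assigns them no displayable character.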

Of more mnemonic value, and fully available (subject to the "Mac" problem, that is to say excluding upper and lower case eth, thorn and y-acute) on the browsers under review, are the entity names of the accented letters of the ISO-Latin-1 repertoire. The use of these, i.e &entity;, by authors is highly recommended.

{Update Apr 1997} According to HTML3.2, which codified browser behaviour from the first half of 1996, you can now expect current browsers to support the full range of entity names that were "proposed" in the published HTML2.0 document. However, it is fully legal to represent them numerically, and, although browsers have supported the accented-letters entity names for a long time, and later also supported the copyright, registered trademark, and no-break space entity names, the remaining entity names that were "proposed" in the HTML2.0 spec (RFC1866) were not in fact supported by some popular browsers until much later. The numerical references have been supported by all browsers for much longer, so, if you do not want to exclude readers who are using older browser versions, I would still mildly recommend that you represent these characters in your HTML documents by means of their &#number; representation.

When I originally wrote this report, I included the following paragraph:

One word of caution about the use of such constructions in attribute values, such as in text string. Although the HTML2.0 spec makes it clear that entity representation in such an attribute value is intended to be evaluated, not all browsers do in fact honour this. The only workaround is to use the actual 8-bit character, but the disadvantages of that have already been set out above.

That was certainly true at the time I wrote it (maybe 1995), although the HTML2.0 specification had already been available for some time. Fortunately, browsers have moved on a little since, and as far as I know, this is no longer a problem with current browser versions (1997). However, I haven't conducted an extensive analysis myself, so I left that caveat in place and invite you to draw your own conclusions.
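What "evaluated" means for an attribute value can be seen with a modern parser. The sketch below uses Python's html.parser, which resolves references inside attribute values before handing them over; the IMG/ALT example and the class name are my own assumptions for illustration, not the construct originally quoted above.

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Collect (name, value) attribute pairs from every start tag."""
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.extend(attrs)   # values arrive with references already resolved

collector = AttrCollector()
collector.feed('<IMG SRC="cafe.gif" ALT="Caf&eacute; &#169; 1997">')
collector.close()
print(collector.seen)
```

A browser that did not honour this would show the ALT text with the ampersand sequences still visible, which is the failure the caveat was warning about.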

