Notes on Internationalization

Last major update: Dec 1997
There have been some updates since, but be aware that the references to browser versions are by now quite dated; and, although many references to detail of older browser shortcomings have now been removed, there are still some references to shortcomings in outgoing browser versions where it's debatable whether there's any need to worry about them nowadays.

The principles, however, are still valid and important.

Preface

Some kind of approachable document seemed to be needed to explain how HTML was meant to be extended to international character sets and codings, beyond the ISO-8859-1 that was used up to and including HTML3.2.

Dan Connolly asked me in July 1996 whether I could write something up, but I didn't really have the time, plus I knew that my knowledge of SGML and of Unicode terminology was rather shaky. But, discussions on usenet showed the need for some kind of briefing on this topic, so I decided to have a try. This covers 8-bit codings other than iso-8859-1 (which have, of course, been in use for quite some time, though unfortunately not always implemented in conformity with RFC2070), as well as covering iso-10646/Unicode. I do not specifically cover other techniques such as iso-2022-based codes etc., as I have very little experience of them myself.

This is quite a heavy document: you might be interested in the quickstart first.

Principles

Some of the terminology here may be a bit loose and informal. It's intended to explain, rather than to give people a rigorous theoretical background. For that, and for pointers to authoritative sources, see http://www.w3.org/International/ at the W3C. And in particular, Standards Track RFC2070, "Internationalization of the Hypertext Markup Language", available from your usual source of Internet RFCs or as rfc2070.html.

Documents that are limited to one non-Latin-1 repertoire have been in practical use for a considerable time, even if they were outside of what HTML/2.0 and /3.2 actually specified. HTML/2.0 (RFC1866) indicated how the specifications were meant to be extended, and these indications have been vindicated by subsequent developments (RFC2070 and HTML/4). However, you should be aware that there are also some bogus techniques out there, which give a visual impression of working on consenting browsers, but which are fundamentally unsound in terms of HTML protocol specs.

There are two important issues: "Document Character Set", on the one hand, and what I used to call "Data Transfer Coding", on the other hand. The latter is referred to in Dan Connolly's paper Character Set Considered Harmful by means of the term "character encoding scheme", and in RFC2070 it's referred to as "the external character encoding"; in HTML4 the term "character encoding" is also used for this concept, and there's a reasonably clear explanation of it in the text. Recent versions of Unicode have layered the concept of "Character Encoding" into the "Character Encoding Form" and the "Character Encoding Scheme", as explained in chapter 2 of the Unicode specification.

The bottom line is that what's significant for the HTTP protocol transfer between web server and client is what is now properly called the "Character Encoding Scheme", which XML declares by means of its "encoding" attribute, while MIME (and hence HTTP) still uses the older, and now quite confusing, term "charset".

The key point, as regards the WWW, is that what matters is the character coding scheme that's used on the data transmission from the server to the client: this might be the same as the one used natively for text storage on the server's own platform (and indeed this is now the situation in the overwhelming majority of cases), but the specifications have been formulated so that they can properly deal with the cases where it is not so (consider classic Macs, or consider EBCDIC-based machines, whose native internal character coding would be unacceptable for transmission to an arbitrary WWW client).

In the original HTML/HTTP versions, both the "Document Character Set" and the "character encoding" are defined by reference to the ISO-8859-1 specification. There was nothing wrong with doing that, within the terms of earlier versions of HTML, but it results in many people thinking that "Document Character Set" and "character encoding" are one and the same thing, which they are not; this causes much confusion. Until one has understood this key point, almost all of the important questions (and answers!) make no sense, which is why I've tried so hard to explain it (with the result that several readers have complained about the long-winded stuff in here: well, if I could explain it better then I would - suggestions are welcomed).

The term "Document Character Set" has a specific meaning in SGML terms; its significance for HTML is that it determines the numbers that are used in the &#number; numerical character reference representations. It has nothing to do with the character coding that is used for transferring a document - the document could be transferred in any compatible manner that the two ends of the transfer can agree on.

There is a vital, but most confusingly named, feature of the HTTP protocol: the charset parameter of the Content-type header. This parameter does not define the "Document Character Set" in HTML/SGML terms, it only defines the transmission coding (i.e the character encoding scheme which is used for the network transmission) of the document's characters, i.e it defines how the recipient should interpret the meaning of the data stream received (byte stream, octet stream, call it what you will). So, the HTTP protocol provides (via that confusingly named charset parameter) the mechanism for the client optionally to inform the server of which transmission encodings they can accept (the Accept-charset header), and for the sender to announce the transmission code of the document to the client.

Up to and including HTML3.2, the only charset (encoding) which a client agent is required to support is ISO-8859-1, and in that case the HTTP/1.0 recommendation was to omit the charset attribute altogether; the climate has changed since, and the HTTP/1.1 specification and other documents from the W3C now encourage the use of an explicit charset parameter even when it is iso-8859-1.

According to the original HTML internationalization (i18n) specification, RFC2070, the Document Character Set for HTML is ISO-10646 (effectively "Unicode"). The formal statement in RFC2070 is as follows:

The document character set, in the SGML sense, is the Universal Character Set (UCS) of ISO 10646:1993 [ISO-10646], as amended. Currently, this is code-by-code identical with the Unicode standard, version 1.1 [UNICODE].

The corresponding reference in the HTML4.0 Recommendation is Chap.5: HTML Document Representation, where it is stressed that the character set standard is subject to updating (the implication presumably being that these updates are automatically adopted into HTML). The Unicode Consortium has its own web site.

As previously mentioned: iso-10646 (Unicode) defines a "Character Set": it does not define a character encoding scheme, and the occasionally-seen HTTP headers which say charset=iso-10646 are nonsense. There are several different encoding schemes for Unicode - the current specification includes seven schemes, not counting some other schemes which are now obsolete but may still be found (specifically utf-7).
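The point that a character set is not an encoding can be illustrated concretely. A minimal sketch in Python (the choice of U+0416, a Cyrillic letter, is just an arbitrary example): the same abstract character serializes to quite different byte sequences under different Unicode encoding schemes, which is why "charset=iso-10646" names no usable encoding on its own.

```python
# One abstract character, several encoding schemes, several byte streams.
ch = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE

print(ch.encode("utf-8"))      # two bytes: b'\xd0\x96'
print(ch.encode("utf-16-be"))  # two bytes, big-endian: b'\x04\x16'
print(ch.encode("utf-16-le"))  # same two bytes, opposite order: b'\x16\x04'
print(ch.encode("utf-32-be"))  # four bytes: b'\x00\x00\x04\x16'
```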

This Document Character Set is not negotiable in HTML; it is set by the "SGML Declaration" for HTML, which, unlike the transmission character encoding (charset), is not the subject of any kind of announcement or negotiation between client and server.

The first 256 code positions of Unicode are identical with ISO-8859-1, so a document that uses only ISO-8859-1 in its Document Character Set is trivially just a special case of Unicode as the Document Character Set. But iso-8859-1 as a character coding is not directly compatible with any of the unicode character coding schemes. Until this distinction is clear in one's mind, all manner of confusion can arise.
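A short Python sketch makes the distinction explicit: the character e-acute occupies code position 233 in both ISO-8859-1 and Unicode (the character-set relationship is trivial), yet its byte representation differs between the iso-8859-1 coding and a Unicode coding such as utf-8.

```python
ch = "\u00e9"  # e-acute: code position 233 in ISO-8859-1 and in Unicode alike
assert ord(ch) == 233

# Same code position, different bytes on the wire:
print(ch.encode("iso-8859-1"))  # b'\xe9' -- one byte, equal to the code position
print(ch.encode("utf-8"))       # b'\xc3\xa9' -- two bytes: not the same stream
```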

Some important notes:

One browser that had been developed with attention to the standards was the Tango browser from Alis Technologies. (I used the 30-day trial of 2.5.1 for study, and that was many years ago now). However, MSIE4 seemed to have pretty well caught up; and NS4, although it still had some severe shortcomings in this area, could do rendering of content encoded in utf-8 well enough to be useful, although forms input in this situation was hopelessly broken. By the time of adding this note (2005), I'd say that Netscape 4.* versions are now best ignored if you need to supply such content; and particularly if you need to support forms input.

HTML has three ways of representing characters: encoded characters, SGML numeric character references of the form &#number;, and the &entityname; mechanism. But the entity names are defined by SGML statements such as e.g

<!ENTITY Epsilon CDATA "&#917;" -- greek capital letter epsilon, U0395 -->

i.e in terms of the character references already described, so this introduces no new fundamental principle, and for the purposes of our discussion here can be seen just as an additional convenience.
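As a quick check of this equivalence, Python's html module (used here purely as an illustration) resolves both the named and the numeric form against the same Unicode code point, mirroring the SGML entity definition quoted above:

```python
from html import unescape

# The named entity and the numeric reference denote the same code point, U+0395.
assert unescape("&Epsilon;") == unescape("&#917;") == "\u0395"
```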

A reference for entities, which may also be found to be a handy aide-memoire on a useful range of Unicode hex code values and corresponding HTML/SGML decimal character references, is Section 24 of the HTML4.0 Recommendation.

The Unicode Consortium also provides character code mappings and other useful materials.

Illustration

I find it illustrative, and it may be helpful to you also (please suspend your disbelief meantime), to consider the case where we intend to transfer an HTML document over a network in the EBCDIC code.

Then, all of the individual characters are represented by different bit-patterns: the letter "A" for example, which would be represented by hex'41' (decimal 65) in ASCII, is hex'C1' in EBCDIC. As long as the recipient understands that the transfer is being made in EBCDIC, the meaning of the data remains unchanged. Now, what about the HTML content? The HTML sequence &#65; represents an upper case A: what should we do when converting this HTML file into EBCDIC? And the answer is, we convert the individual characters: ampersand, hash, six, five, semicolon: into their EBCDIC equivalents, but the numerical value, sixty-five, remains unaffected.

The product of this conversion, therefore, is a document whose "Document Character Set" (in HTML/SGML terms) is still ISO-8859-1, but the document itself has been transmitted in the EBCDIC code.
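The round trip can be sketched in a few lines of Python, using the cp500 codec to stand in for "the EBCDIC code" of the illustration (an assumption: EBCDIC has many variants, and cp500 is just one of them):

```python
from html import unescape

markup = "&#65;"                       # numeric reference to "A" (code 65 in the
                                       # document character set)
wire = markup.encode("cp500")          # transmit the five characters in EBCDIC
assert wire != markup.encode("ascii")  # the bytes on the wire differ...

received = wire.decode("cp500")        # ...but decoding restores the markup,
assert received == markup              # and the number 65 was never touched:
assert unescape(received) == "A"
```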

HTML4.0 goes limp relative to RFC2070

The HTML4.0 recommendation seems to have given up hope that servers and server admins will actually send a proper charset attribute on the HTTP header, and goes into a long-winded discussion of the various ways in which a client agent might become aware of - or even have to guess at - the charset appropriate to the document that they are receiving. Oh well, as long as you understand the principles as they were set out in RFC2070 (and, I hope, demystified in the present briefing paper), then you'll be in a position to cope with this particular piece of hand-wringing.

A word of warning, however: according to the CERT security alert CA-2000-02, a document which is sent out without an explicit charset specification can represent an increased security hazard.

De-facto example: Russian Cyrillic and the KOI8-R code

Disclaimer: I am using this example as a means to illustrate how character codes other than ISO-8859-1 work (or should work) on the WWW. It's not my place to take sides about how to write Cyrillic: this isn't my area of study at all. See also Fingertip Software. I am using koi8-r here only as a practical example.

A description of KOI8-R may be found at the so-called "Home of KOI8-R" (unfortunately the URLs for this have subsequently become too unstable to include here - try a web search). Another useful resource, which includes additional practical advice on actually authoring Cyrillic documents, is provided by Paul Gorodyansky; but such details are outside of what I want to cover here, and also, it deals only with the use of a single non-Latin repertoire (i.e Cyrillic) whereas the present page is addressing the use of extended character repertoires.

Note that many of the characters defined in ISO-8859-1 do not exist in this code; and some of those that do exist (copyright, degree, superscript-2, division sign, the no-break space) are in different places. So there's plenty of scope for confusion.

You'll recall that in HTML 2.0 or 3.2, the ISO-8859-1 characters can be represented in three different ways: as a named entity &name;, as a numerical character reference &#number;, or as an 8-bit character.

Things start to get confusing when the 8-bit transmission code no longer comprises the ISO-8859-1 repertoire, and of course this is the case for KOI8-R, where the upper half is assigned to Cyrillic letters etc., and this is what we tackle next.

OK, so a document has been sent in the koi8-r coding. It means that the 8-bit characters (specifically, those in the upper half) have to be interpreted according to the koi8-r code; however, according to the standards, all of the named entities and numerical character references which it contains are still meant to be interpreted by reference to the Document Character Set, Unicode (or its subset, ISO-8859-1). Obviously, in general, this needs a repertoire of printable characters that exceeds 256. Whether that is feasible in a particular browser depends on its internal design. The important point to keep in mind at all times is that the meaning of an HTML document never can be changed by some incidental feature of the browser that receives it. Either the browser renders it usefully or it does not. But the meaning of the document remains invariant. This may seem obvious, as stated - but it's amazing the extent to which people refer to features of browser platforms (DOS codepages, the Mac character code, named fonts, etc.) as if these might be allowed to influence the meaning of the received documents... that way lies madness.
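The two distinct interpretation rules in that paragraph can be shown side by side; a minimal Python sketch (the specific byte 0xF0 is just an arbitrary example from the koi8-r upper half):

```python
from html import unescape

# An 8-bit character from the document body is interpreted via the *charset*:
koi8_byte = b"\xf0"
text = koi8_byte.decode("koi8_r")
assert text.encode("koi8_r") == koi8_byte  # round-trips through koi8-r

# ...whereas a character reference in the same document still means a position
# in the Document Character Set (Unicode), whatever the charset was:
assert unescape("&#233;") == "\u00e9"  # e-acute, not a koi8-r character
```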

Documents with and without explicit charset

Various communities of users have in the past got into the habit of sending out their documents with a Content-type: text/html (i.e without an explicit charset specification) on the assumption that the reader would have already set their browser to the expected value. This was always a technically-incorrect thing to do: documents should be sent out with a Content-type that specifies the correct charset. Earlier versions of HTML (2.0 and 3.2) were based on the default charset being iso-8859-1, whereas HTML4.0 explicitly states that there is no default charset: so under either specification, an explicit charset is required.

Unfortunately, some ancient browser/versions got upset when sent a Content-type header that included an explicit charset, and refused to display the document at all. So, users in these communities got accustomed to configuring those (now-obsolete) browsers to use a different font - one in which the characters were arranged according to their favourite encoding - so that the 8-bit characters would look right, and ignored the charset issue entirely. Nowadays, the supporters of this (unwise) procedure would tell the reader to use their browser's character-encoding override feature as a workaround. While this swerve might be understandable in relation to those older browsers, it has to be pointed out that this isn't (and actually never was) in accord with the published specifications: it's a pity that there were browsers that failed to implement the specs correctly, leaving us a legacy of non-conforming documents.

As if that wasn't bad enough, we now have the problem that increasing numbers of web pages are being provided by authors who have neither the necessary level of support from their provider, nor the expertise themselves, to know how to send out pages with a proper charset header. They get told to put a META HTTP-EQUIV into their HTML documents instead. In many situations this gives a good enough impression of working, but there are situations where it is actually counter-productive: for example, the Russian Apache server offers on-the-fly character code translation ("transcoding"), which results in a document that is correctly coded in a different charset; but the transcoder needs to filter out the original META HTTP-EQUIV specifying the original character coding, or it could cause confusion (e.g when the HTML is stored to file, or on older NN4.* versions which implemented the wrong priority between HTTP and META charset). Similar things would happen with recoding proxy servers (such things were, for example, used in Japan to mediate between the different character codings used in that area, although their current prevalence on the WWW is not known to me).

If browsers followed the specifications correctly, this would not be so bad, because the server's explicit HTTP header would override the now-incorrect one in the document's META; but unfortunately some browser implementers (notably Netscape) got this wrong, and so the browser will display wrongly in this case.

Additional complications arise if authors try to use meta http-equiv in conjunction with XHTML (see XHTML/1.0 Appendix C section 9 for the formal position). These complications can be avoided by using the HTTP Content-type header for specifying charset, just as was being recommended anyway.

Typical browser behaviour

Within the scope of this present article, the tests were meant only to illustrate the kinds of things that go wrong in some actual browsers, so as to supplement the theoretical account of what's supposed to happen. There's also a separate article with some detailed browser tests.

The tests cover the use of individual non-Latin-1 repertoires (the one defined by the encoding used, i.e the charset) in conjunction with Latin-1 (called out by using &entity; and &#number; references).

An early RFC2070-compliant browser: Alis Tango

This screen shot shows the Alis Tango browser displaying a (pre-formatted) document that has been sent out with charset=koi8-r. Note: koi8-r also assigns printable characters in the range 128-159 inclusive, but they were not included in this demonstration. It may be worth noting that this screen shot was made in late 1997, although I had seen this browser behaving correctly quite some time previously: Netscape were still releasing updates of their version-4 browser, which is fundamentally broken in this area, right into the year 2002.

The test material was, apart from the charset parameter, the same ISO-8859-1 test table that I was using for Latin-1 compliance, but of course the 8-bit characters (col.6 in this version of the table) must now follow the KOI8-R code instead of the ISO-8859-1 code.

[Alis Tango browser screen shot, 15k gif]

Note that columns 7 and 8 are still displaying the ISO-Latin-1 repertoire, as required by the HTML internationalization specifications.

Do other browsers get this right yet?

All too often the answer was "no", as you can see from the accompanying detailed tests. The commonest blunder, at the time this page was originally formulated, was to interpret the named entities and/or numbered character references by reference to the incoming transmission coding (that ineptly-named "charset" parameter), instead of to the Unicode assignments as prescribed in the standard. And of course, sadly, you'll find documents authored by people who, instead of avoiding the problematical constructions, have coded them wrongly in order to get the desired results out of the broken browsers - meaning, of course, that they get unwanted results from non-broken browsers. Fortunately, things are much better now (2005/2006) and these old mistakes are documented only because they can throw light on the kind of misconceptions which still afflict some newcomers to the field.

My own tests covered only the browsers' ability to display normal HTML markup; they did not assess ALT texts, forms submission etc. (for koi8-r those issues are addressed at the "koi8-r home page"). My results are summarised in the following table, as far as they are available. Those old versions of Lynx probably worked fine with Greek, but weren't tested: currently available versions of Lynx have excellent, comprehensive character code support, when properly configured in a suitable terminal/emulation environment.

For each browser, the three results are: 8-bit encoded characters / numeric character references &#number; / named entities. Where the koi8-r and Greek tests gave different results, both are given.

- Alis Tango (2.7.1): Yes / Yes / Yes
- Lynx 2.6/2.7: Yes (koi8-r; Greek don't know) / approximations (koi8-r; Greek don't know) / approximations (koi8-r; Greek don't know)
- MS IE 3.01 w/ Pan European kit: OK, some approximations (koi8-r; Greek don't know) / Wrong! (koi8-r; Greek don't know) / Yes (koi8-r; Greek don't know)
- MS IE 3.03 16bit: Yes / Yes / Yes
- MS IE 4 32bit: Yes / Yes / Yes
- NS Nav 3.01/3.03: Yes / Wrong! / Wrong!
- NS C 4.04: Cyrillic letters OK, many other chars wrong; Greek letters OK, some other chars wrong. For both numeric references and named entities: most characters are rendered as "?"; just a few (seemingly those that were present in the "charset" repertoire) are rendered correctly.

It seems little short of absurd that some graphical browsers that aim to support these features put up such a pathetic performance. The protocols (HTTP and HTML) had been clear enough for several years already, even if they were only in draft; given access to appropriate, compatible, fonts, the rendering of multi-alphabet documents is surely a solved problem.

Clarification: I should make it plain that the browser problems that I am describing here are only an issue if you hope to introduce into your 8-bit non-Latin-1 document also some entities or numerical character references from Latin-1 or from other repertoires. If, in fact, you are content to stay with your 8-bit non-Latin-1 repertoire and to make no use of anything else, then (so long as your document contains 8-bit characters, "real Russian characters" as Paul Gorodyansky called them in a Russian context), there isn't a problem.

Playground

This is now in a separate page.

Can I have it with Turkish?

On the German-language authoring group, the question was asked whether one can use German together with Turkish. The question could just as well be asked for Turkish with French or other W. European languages, as we will see (but not Icelandic).

The 8-bit ISO coding used for Turkish is iso-8859-9, and it is identical with iso-8859-1 except in six places: upper and lower case eth, thorn and y-acute. So, if a browser supports iso-8859-9 at all, then it's likely that it will display all the Latin-1 characters too, except those six. And so indeed I found in my tests of recent browsers (NS3.01, NS4.05, MSIE4). On the other hand, MSIE3.03 (16-bit) didn't support iso-8859-9 anyway. MSIE4 even represented the remaining six Latin-1 characters correctly, when they were presented as entities or numerical character references, as indeed a browser must for standards conformance; NS3.01 displayed them wrongly, and NS4.05 displayed the six as "?". But the conclusion seemed to be the expected one, that if the browser supported iso-8859-9 at all, then it could be used for documents containing German, French etc. with Turkish.
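The "six places" claim is easy to verify mechanically; a short Python sketch comparing the two codings byte by byte:

```python
# Enumerate the byte values where iso-8859-9 (Latin-5, Turkish) departs from
# iso-8859-1: the six positions where eth, thorn and y-acute (upper and lower
# case) give way to the Turkish letters.
diffs = [b for b in range(256)
         if bytes([b]).decode("iso-8859-9") != bytes([b]).decode("iso-8859-1")]

print(len(diffs))               # 6
print([hex(b) for b in diffs])  # ['0xd0', '0xdd', '0xde', '0xf0', '0xfd', '0xfe']
```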

Those pesky codes 128-159

In the iso-8859-* codes, characters 128-159 have been reserved for control functions; the original motivation seems to have been the risk that they would be transmitted inadvertently over a 7-bit path and could cause disastrous control functions (such as page eject, or putting the display device into graphics mode etc.) instead of just a single wrong display character. Whatever the original motivation, this principle has been firmly followed by the iso-* codes, including iso-10646/Unicode. It isn't necessarily followed in other encodings, though: examples of encodings where these codes represent printable characters include koi8-r, as well as the MS Windows codings.

MS Windows introduced a group of codings in which these code positions were used for printable characters, some of which are much in demand with certain authors: the trademark glyph, matched quotes and so forth. These are the encodings such as "code page" 1252. It would appear to be protocol-correct to offer documents in these encodings, with 8-bit characters in that range, as long as they are sent with an appropriate charset value and the recipient accepts this charset encoding. That is not at all the same thing as attempting to represent those characters by numeric character references such as &#153; as one so often sees. The meaning of the latter construct is undefined (N.B: not "illegal", but "undefined") in standard HTML: the protocol-correct representation of a trademark as a numeric character reference is in fact &#8482; as can be seen in the W3C reference already cited; and correspondingly for the matched quotes and such.
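Taking the trademark glyph as the concrete case, a minimal Python sketch of the distinction just drawn: byte 0x99 means the trademark sign only by virtue of the announced windows-1252 charset, whereas the numeric character reference must use the Unicode code position 8482 (hex 2122).

```python
from html import unescape

# The 8-bit character, legitimate *only* under an announced cp1252 charset:
assert b"\x99".decode("cp1252") == "\u2122"

# The protocol-correct numeric reference, valid under any charset:
assert unescape("&#8482;") == "\u2122"   # 8482 decimal == 0x2122
```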

In documents from Microsoft themselves, these codings are frequently referred to as "ANSI", but no-one seems to be able to adduce any formal basis for this usage: for example in one Microsoft document the Windows character sets (plural!) are termed "an assortment of Windows ANSI character sets", but without citing any official ANSI standard. At the Unicode site, the mappings for Windows codes are firmly listed under "VENDORS/MICSFT", and not under any American National standards body.

As Markus Kuhn is quoted as reporting:

(translated from the German:) When I was at Purdue University last year, I spent hours rummaging through the standards library there (which held all the current ANSI standards), and this standard would certainly have fallen into my hands. All the ANSI character-set standards are by now merely national editions of the corresponding ISO standards. It therefore makes little sense to speak of "the ANSI character set", as Microsoft does when they really mean their proprietary CP1252 character set (which is an extension of ISO 8859-1), or of "8-bit ASCII".

If anyone can support MS's attribution of these codes to ANSI, I would be happy to cite it here. The closest we have got so far, was the theory that ANSI had previously started drafting a standard for 8-bit coding, but had already fallen-in with the ISO standards before any national standard saw the light of publication.

Issue: charset value for MS Windows codepage 1252.

Several years back, consultation of the official IANA character set assignments document revealed the rather surprising oversight that codepage 1252 was missing from the list, although 1250, 1251, and 1253-1258 had been included since 1996. By analogy with the existing registrations, it appeared that the valid charset name for this code would be windows-1252.

I later received an email from Chris Wendt, Program Manager for IE at MS, saying that they were getting windows-1252 registered at IANA. After some further delay, the entry appeared, dated December 1999. Possible aliases such as cp1252 or cp-1252 have not been registered.

In practice, the mass-market browsers tend to behave as if their default charset were code page 1252 anyway, rather than iso-8859-1 as the specification calls for: but note that these characters will disappear entirely when displayed with iso-8859-1 fonts, which often happens by default with X Windows systems (at least as they are set up in Western locales). For a rather trenchant account of this problem, see the demoroniser.

Greek: iso-8859-7 versus Windows-1253

The relationship between iso-8859-7 and Windows-1253 is somewhat analogous to that between iso-8859-1 and Windows-1252, in that the Windows coding assigns additional displayable characters in the range 128-159 decimal, which the iso-8859-* codings reserve for control functions.

However, whereas the range 160-255 is the same in iso-8859-1 as in Windows-1252, there are a few differences between iso-8859-7 and Windows-1253 in this range.
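Those differences can be enumerated with a short Python sketch (an assumption here: some byte values are unassigned in one or other coding, so the comparison has to tolerate decoding failures rather than treat them as matches):

```python
def table(encoding):
    """Decode each byte 160-255 singly; None where the code is unassigned."""
    out = {}
    for b in range(160, 256):
        try:
            out[b] = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            out[b] = None
    return out

iso, win = table("iso8859_7"), table("cp1253")
diffs = [b for b in range(160, 256) if iso[b] != win[b]]
print([hex(b) for b in diffs])   # the "few differences" noted above
```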

Fontasmagorical Fantasies

There's a habit, learned apparently from platform-specific word-processing applications, of trying to get exotic characters displayed in HTML documents by using <FONT FACE=Symbol> and suchlike to select a different repertoire of displayable characters. (Here I cover the topic only briefly; later I wrote a more detailed page, "Using FONT FACE to extend repertoire?", about it.) As far as HTML is concerned, this is at entirely the wrong protocol level. The transmitted octets, the &name; entities, and the &#number; representations, have meaning that is defined by the HTTP and HTML protocol standards: that meaning could be displayed by cosmetically different fonts (controlled by a style sheet, or by FONT FACE for those who care for it), but to select a font that produces a quite different displayed character is entirely contrary to the intentions of HTML. Although it might appear to produce the effect intended by the author, in a limited range of viewing situations (yes, I'm well aware that these viewing situations are statistically very common, but I still say they represent a "limited range of viewing situations"), it can produce all kinds of deleterious consequences, including undiagnosed but incorrect display in other viewing situations, incorrect indexing by search robots, etc. etc. Who knows how a speaking machine is supposed to cope with this?

Properly, it's the job of a browser to recognize the meaning (e.g &#bignumber;) of HTML markup, and to make whatever font selection is needed internally for displaying that meaning to the reader. It should be no part of an HTML author's job to second-guess what fonts the reader might have at their disposal, and to interfere in the browser's selection of them (other than for cosmetic reasons).

A properly standards-compliant browser would do better to treat that construct e.g

<FONT FACE=Symbol>a</FONT>

by noticing that the "Symbol" font does not contain the character "a", and refusing to display it, or maybe choosing a cosmetically-similar font in which the character "a" is present, in order to ensure an uncorrupted display in HTML terms. But this is not at all what those platform-specific Gatesware tools are trying to lock you into. The result wouldn't, of course, be what the misguided author intended, but it would be what the specifications call for: authors who ask for the wrong thing shouldn't be too surprised when they occasionally get it, hmm?

Indeed, the HTML author should not need to know anything about the machinery that exists in a client platform for turning coded characters etc. via font resources into a screen display: the whole thing should be treated as a black box as far as the HTML author is concerned.

And the same principles apply, for sure, when writing style sheets that specify named fonts.

Beware of other sites that "support" different writing systems not by using a defined character encoding, but by using specialised fonts that appear to display the desired character in response to some normal (e.g Latin-1) character code. This technique commonly adopted in earlier times goes very much contrary to the principles of HTML and the WWW. [In a way it's a pity, as there are some really fun resources out there (see for example the Yamada Language Center's font archive at UOregon) based on that way of doing things: but, standards are standards!].

I've said this before, but I make no excuse for saying it again, because the topic just keeps coming up, over and over. People have got so accustomed from experience with earlier browsers to the idea that the browser just has to get a font that corresponds to the incoming character coding, plug it in, and the problem is solved. In the terms of HTML i18n (RFC2070/HTML4.0) this is a misconception. For example, when I'm using 16-bit MSIE3 to view a koi8-r document, it's necessary to configure its Cyrillic preferences to be a windows-1251 font, not a KOI8-R font. And when the browser is presented with Latin-1 entities or character references in this situation, the browser calmly goes and uses a Latin font for the purpose, just as the theory says it should.

To put it briefly, again, it is entirely a private matter for the browser implementer to decide how to render the HTML in terms of the resources at their disposal: for example, a selection of various cosmetically-compatible fonts. What the HTML means is defined by the interworking specifications (RFC2070, HTML4.0). The browser's job is to render the HTML in accordance with that specification. There's nothing that says a unicode character stream (utf-8, say) needs a unicode font (it could use several 8-bit fonts), nor on the other hand that an 8-bit stream needs an 8-bit font (it could use the appropriate characters from a unicode font).

Unicode, utf-8 encoding

I'm deliberately not covering the issue of data encodings in any detail. I'm recommending that, at least for authors like myself who write predominantly in a Latin alphabet and occasionally want to include "foreign" characters, it's more robust and portable for characters to be represented by &-representations (entities and numerical character references) in HTML. If you already work in a multibyte or unicode environment using suitable tools (for example Far-Eastern readers) then you presumably know what to do already; but for occasional snippets using non-Roman character sets, I suggest that the recommendations I give here should suffice - and that includes use for mathematics, in so far as that is feasible in standard HTML.

However, to circumvent a shortcoming in Netscape browsers, there can be advantages in advertising a content-type charset of utf-8, and it may be useful to understand briefly what this implies.

The theoretical part can be found in RFC2279. Every character outside of the 7-bit us-ascii repertoire is represented in a utf-8 datastream by a specific sequence of two, three or more bytes, all of which have the high bit set (note that in a utf-8 data stream, an individual byte with the high bit set has no meaning on its own, but only as part of the multibyte sequence to which it belongs).

For my own education I wrote a trivial perl script that generates utf-8 encoded output in response to &#number; representations in the input - and the result seems to be acceptable to utf-8-supporting browsers. Don't misunderstand me: this script isn't a piece of production software, I only wrote it to educate myself, and if you're interested, I might suggest you could do the same, in whatever language you favour. You can check the result by feeding it to a suitable browser and comparing the display with the original.

You could consider storing such a datastream, and serving it out via your web server (HTTPD) with an appropriate content-type charset attribute. This all works just fine on platforms such as unix or MS-Win: if you are working on a classic Mac (or worse, an EBCDIC-based system) or other kind of system whose native storage encoding of characters gets mapped into iso-8859-1 when served out through an HTTPD, then you're likely to get into difficulties handling this kind of thing. In any case, as I said, I'm not really recommending this to the casual author, but only exploring it briefly in order to illuminate what the standards mean.
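The job such a script does can be sketched in a few lines; here is a minimal illustration in Python (the original was in Perl, and this sketch is mine, not the author's script), expanding decimal and hex &#number; references and emitting utf-8 bytes:

```python
import re

def numeric_refs_to_utf8(html_text):
    """Replace &#number; (decimal) and &#xnumber; (hex) character
    references with the character itself, then encode the whole
    document as a utf-8 byte stream."""
    def expand(match):
        num = match.group(1)
        codepoint = int(num[1:], 16) if num[0] in "xX" else int(num)
        return chr(codepoint)
    expanded = re.sub(r"&#([0-9]+|[xX][0-9a-fA-F]+);", expand, html_text)
    return expanded.encode("utf-8")

# Greek small alpha (U+03B1) becomes the two-byte sequence 0xCE 0xB1:
print(numeric_refs_to_utf8("&#945;"))  # b'\xce\xb1'
```

Note that pure 7-bit input passes through unchanged, which is exactly the property the next paragraph relies on.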

It's an important and useful property of the utf-8 encoding that any stream that contains only 7-bit us-ascii characters is also a legal utf-8 datastream. You're already aware that a stream of 7-bit us-ascii characters is also a legal iso-8859-1 stream (or indeed iso-8859-anything). So, if you keep your HTML documents written entirely in 7-bit characters, using &-notation for anything else, then you can claim the datastream to be iso-8859-1 (the HTML default) or utf-8 whenever it suits you, without needing to tamper with the data itself in any way.
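This property is easy to verify in any Unicode-aware language; a quick Python check (the sample string is my own invention):

```python
# A document kept entirely in 7-bit US-ASCII, with &-notation for
# everything else, is byte-for-byte identical under all three labels.
s = 'Caf&eacute; costs &#8364;5'  # pure 7-bit ASCII source
assert s.encode("ascii") == s.encode("iso-8859-1") == s.encode("utf-8")
print("identical under us-ascii, iso-8859-1 and utf-8")
```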

Let's stress this point again.

Recommendations for information providers

The title says "information providers" rather than authors, since there's no fundamental reason that the author has to deliver the format that goes out onto the WWW. A correspondent asked me, for instance, "do you really want me to author my Russian documents by typing those &#number; representations for each letter?", and the answer is "no, I am only telling you that there are reasons why it may be useful for the server to send out documents in that form". The act of "publishing" an authored document via a web server could well involve some kind of transformation, from a format that's convenient for the author to produce, to a format that's robust for use on the WWW. And indeed that latter format could change with time, as the population of browser/versions evolves.

As you see above, several of the browsers that offer to support 8-bit character codes other than iso-8859-1 only seem to work reliably when offered 8-bit data; use of entities (apart from gt lt amp and quot, of course) would seem to be problematical for the poor wee things.

Authors certainly may use quite a range of 8-bit codings successfully, in full accord with HTML standards: that includes not only Latin repertoires other than Latin-1, which we aren't studying closely here, but also repertoires such as koi8-r, iso-ir-111, Greek, etc.: but, if they expect to get good browser coverage when using such non-Latin repertoires, then (at least at the time of writing this, when Netscape 4.* versions were still in widespread use), for the Latin-1 characters they would need to restrict themselves to the 7-bit un-accented letters of US-ASCII. There are no grounds for this in theory, but the shortcomings of some popular browsers (particularly Netscape 4.*) made it necessary if successful display was to be achieved. (Note added 2005: I'm not aware of any recently available browser which has this specific shortcoming: some simple browsers are just plain incapable of displaying non-Latin repertoires, and so it's pointless to discuss them in this context, but those which can display a wider character repertoire, as most of the popular browsers can, have no problem displaying a full Latin-1 character repertoire when the character encoding is something quite different.).

In theory, if you have a document predominantly in one non-Latin alphabet, let's say Cyrillic, and you want to include just a few characters from some other repertoire, let's say for example Greek or some mathematical symbols (or even just some Latin-1 accented letters), the HTML specifications would permit you to offer this document in an 8-bit encoding, say koi8-r, and to represent the Greek/math using &#bignumber; representations. And indeed this works fine in MSIE4, or any other browser that has been implemented according to the spec. But it doesn't work in practice with NS4.*: as long as the browser's "charset" setting is koi8-r, it fails to display the Greek/math characters, whereas setting the "charset" manually to utf-8 of course wrecks the interpretation of the 8-bit characters.

So this takes away a number of otherwise attractive options that the specifications would permit. There are a few exceptions, where the characters in question exist in both codes (copyright, for instance) and NS4 will render those successfully.

To sum up these observations: frankly, my impression is that authors starting on such a project today would do better not to mess around with these individual 8-bit language-oriented encodings, but to start right in with Unicode. I can vouch for the support for &#bignumber; representations in the popular browser versions in use at the time this was written (Netscape Communicator 4.x and MSIE 4.x) at least on the MS-Win platform; if you are in a position to generate it, then the support for utf-8-encoded data also seems to be good.

I have had good results by composing documents entirely in 7-bit US-ASCII, representing accented Latin-1 letters by their &entityname; representations, and other characters by &#number; references; for the benefit of Netscape (4) such documents need to be sent out with a charset of utf-8 although, according to the specifications, that should not be necessary. (Of course, for documents that were predominantly not in the Roman alphabet this would represent a massive bloat and would be much better avoided, if only the current browsers supported a better way.)
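A transformation into that 7-bit form can be automated. The sketch below (my own illustration, not the author's tooling) uses Python's table of HTML 4 entity names to emit &entityname; where one is defined, and a numeric reference otherwise:

```python
from html.entities import codepoint2name

def to_seven_bit(text):
    """Render any non-ASCII character as &name; where HTML 4 defines
    an entity name, otherwise as a numeric &#n; reference, leaving a
    pure 7-bit US-ASCII result."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])
        else:
            out.append("&#%d;" % cp)
    return "".join(out)

print(to_seven_bit("Café"))  # Caf&eacute;
```

The output can then be labelled iso-8859-1 or utf-8 as discussed above, without touching the bytes again.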

It might be worth bearing in mind that if you have a document already created in some specific form, it might well be feasible to convert it programmatically into a form that complies with WWW standards. For example, given a class of (quasi)-HTML documents that contained FONT FACE=Symbol representations, I wrote a rough-and-ready converter to turn them into &#bignumber; equivalents (this was a proof-of-concept script, not intended as a fully-functional production-quality piece of software, so please don't expect me to give you a copy of it and get it to work on your arbitrary documents!).
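The idea of such a converter can be sketched as follows. This is not the author's script but a minimal Python illustration of mine, with only a few entries of the Adobe Symbol-font mapping filled in (the real table covers the whole font):

```python
import re

# Partial, illustrative mapping from Symbol-font letter positions to
# Unicode code points; a real converter needs the complete table.
SYMBOL_TO_UNICODE = {
    "a": 0x03B1,  # alpha
    "b": 0x03B2,  # beta
    "g": 0x03B3,  # gamma
    "p": 0x03C0,  # pi
    "S": 0x03A3,  # capital sigma
}

def convert_symbol_runs(html_text):
    """Rewrite <FONT FACE="Symbol">...</FONT> runs as &#number;
    references to the characters the author actually meant."""
    def fix(match):
        return "".join("&#%d;" % SYMBOL_TO_UNICODE.get(ch, ord(ch))
                       for ch in match.group(1))
    pattern = re.compile(r'<FONT\s+FACE="Symbol">(.*?)</FONT>',
                         re.IGNORECASE | re.DOTALL)
    return pattern.sub(fix, html_text)

print(convert_symbol_runs('E = m<FONT FACE="Symbol">g</FONT>'))
# E = m&#947;
```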

My i18n work for rtftohtml also serves as an example of taking a source document composed in a way convenient to the author, and turning it into usable and valid HTML.

When sending out a document that is other than Latin-1 (iso-8859-1) you really ought to send that charset attribute, preferably on the HTTP content-type header or, if that's not possible, then on a META HTTP-EQUIV. The habit that some communities have, of sending out documents in their local charset without any charset attribute, may be the de facto custom within their community, but it's unacceptable for use on the WWW. Hint: the usual method nowadays to configure this in Apache is to use the AddCharset directive, which could be placed in appropriate .htaccess files for those who do not have access to the main configuration.
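For instance, a .htaccess fragment along these lines (the extensions are only examples) maps filename extensions to charsets via mod_mime:

```
# In .htaccess, or in the main Apache configuration:
AddCharset KOI8-R  .koi8
AddCharset UTF-8   .utf8

# A file named article.html.koi8 is then served with
# Content-Type: text/html; charset=KOI8-R
```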

But I do want to emphasise that my advice is offered in a situation where documents predominantly use a Latin alphabet, and the use of other characters is on a relatively small scale. If you are authoring in a situation that involves, and with software tools that permit the use of, other kinds of coding, then there may be better ways which you can use.

This note doesn't aim specifically to deal with typographical characters, but there's a useful page by Henry Churchyard.

