Some kind of approachable document seemed to be needed to explain how HTML was meant to be extended to international character sets and codings, beyond the ISO-8859-1 that was used up to and including HTML3.2.
Dan Connolly asked me in July 1996 whether I could write something up, but I didn't really have the time, plus I knew that my knowledge of SGML and of Unicode terminology was rather shaky. But, discussions on usenet showed the need for some kind of briefing on this topic, so I decided to have a try. This covers 8-bit codings other than iso-8859-1 (which have, of course, been in use for quite some time, though unfortunately not always implemented in conformity with RFC2070), as well as covering iso-10646/Unicode. I do not specifically cover other techniques such as iso-2022-based codes etc., as I have very little experience of them myself.
This is quite a heavy document: you might be interested in the quickstart first.
Some of the terminology here may be a bit loose and informal. It's intended to explain, rather than to give people a rigorous theoretical background. For that, and for pointers to authoritative sources, see http://www.w3.org/International/ at the W3C. And in particular, Standards Track RFC2070, "Internationalization of the Hypertext Markup Language", available from your usual source of Internet RFCs or as rfc2070.html.
Documents that are limited to one non-Latin-1 repertoire have been in practical use for a considerable time, even if they were outside of what HTML/2.0 and /3.2 actually specified. HTML/2.0 (RFC1866) indicated how the specifications were meant to be extended, and these indications have been vindicated by subsequent developments (RFC2070 and HTML/4). However, you should be aware that there are also some bogus techniques out there, which give a visual impression of working on consenting browsers, but which are fundamentally unsound in terms of HTML protocol specs.
There are two important issues: "Document Character Set", on the one hand, and what I used to call "Data Transfer Coding", on the other hand. The latter is referred to in Dan Connolly's paper Character Set Considered Harmful by means of the term "character encoding scheme", and in RFC2070 it's referred to as "the external character encoding"; in HTML4 the term "character encoding" is also used for this concept, and there's a reasonably clear explanation of it in the text. Recent versions of Unicode have layered the concept of "Character Encoding" into the "Character Encoding Form" and the "Character Encoding Scheme", as explained in chapter 2 of the Unicode specification.
The bottom line is that what's significant for the HTTP protocol transfer between web server and client is what is now properly called the "Character Encoding Scheme", which XML declares by means of its "encoding" attribute, while MIME (and hence HTTP) still uses the older, and now quite confusing, term "charset".
The key point, as regards the WWW, is that what matters is the character coding scheme that's used on the data transmission from the server to the client: this might be the same as the one used natively for text storage on the server's own platform (and indeed this is now the situation in the overwhelming majority of cases), but the specifications have been formulated so that they can properly deal with the cases where it is not so (consider classic Macs, or consider EBCDIC-based machines, whose native internal character coding would be unacceptable for transmission to an arbitrary WWW client).
In the original HTML/HTTP versions, both the "Document Character Set" and the "character encoding" are defined by reference to the ISO-8859-1 specification. There was nothing wrong with doing that, within the terms of earlier versions of HTML, but it results in many people thinking that "Document Character Set" and "character encoding" are one and the same thing, which they are not; this causes much confusion. Until one has understood this key point, almost all of the important questions (and answers!) make no sense, which is why I've tried so hard to explain it (with the result that several readers have complained about the long-winded stuff in here: well, if I could explain it better then I would - suggestions are welcomed).
The term "Document Character Set" has a specific meaning in SGML terms; its significance for HTML is that it determines the numbers that are used in the &#number; numerical character references. It has nothing to do with the character coding that is used for transferring a document - the document could be transferred in any compatible manner that the two ends of the transfer can agree on.
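As a small illustration (Python's codecs standing in here for whatever transfer coding the two ends agree on), a numerical character reference keeps its meaning no matter which coding carries it over the wire:

```python
import html

# The reference &#233; always means U+00E9 (e-acute), because the
# Document Character Set fixes the numbers; the transfer coding only
# decides which bit-patterns carry the characters "&", "#", "2", etc.
ref = "&#233;"
for coding in ("ascii", "utf-8", "cp037"):      # cp037 is an EBCDIC variant
    wire_bytes = ref.encode(coding)             # the octets differ per coding...
    received = wire_bytes.decode(coding)
    assert html.unescape(received) == "\u00e9"  # ...but the meaning does not
```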
There is a vital, but most confusingly named, feature of the HTTP protocol: the charset parameter of the Content-type header. This parameter does not define the "Document Character Set" in HTML/SGML terms, it only defines the transmission coding (i.e the character encoding scheme which is used for the network transmission) of the document's characters, i.e it defines how the recipient should interpret the meaning of the data stream received (byte stream, octet stream, call it what you will). So, the HTTP protocol provides (via that confusingly named charset parameter) the mechanism for the client optionally to inform the server of which transmission encodings they can accept (the Accept-charset header), and for the server to announce the transmission code of the document to the client.
Up to and including HTML3.2, the only charset which a client agent is required to support is ISO-8859-1, and in that case the HTTP/1.0 recommendation was to omit the charset attribute altogether; the climate has changed since, and the HTTP/1.1 specification and other documents from the W3C now encourage the use of an explicit charset parameter even when it is iso-8859-1.
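To make the division of labour concrete, here is a sketch (Python's email machinery standing in for a real HTTP client's header parsing; the header value and the sample octets are of course just an example) of how a recipient uses the charset parameter to interpret the received octet stream:

```python
from email.message import Message

# The charset parameter names the transmission coding, nothing more.
msg = Message()
msg["Content-Type"] = "text/html; charset=koi8-r"
charset = msg.get_content_charset()          # -> "koi8-r"

body_octets = b"\xf0\xd2\xc9\xd7\xc5\xd4"    # six octets off the wire
text = body_octets.decode(charset)           # interpret them as koi8-r
print(text)                                  # the Russian word "Привет"
```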
According to the original HTML internationalization (i18n) specification, RFC2070, the Document Character Set for HTML is ISO-10646 (effectively "Unicode"). The formal statement in RFC2070 is as follows
The document character set, in the SGML sense, is the Universal Character Set (UCS) of ISO 10646:1993 [ISO-10646], as amended. Currently, this is code-by-code identical with the Unicode standard, version 1.1 [UNICODE].
The corresponding reference in the HTML4.0 Recommendation is Chap.5: HTML Document Representation, where it is stressed that the character set standard is subject to updating (the implication presumably being that these updates are automatically adopted into HTML). The Unicode Consortium has its own web site.
As previously mentioned: iso-10646 (Unicode) defines a "Character Set": it does not define a character encoding scheme, and the occasionally-seen HTTP headers which say charset=iso-10646 are nonsense. There are several different encoding schemes for Unicode - the current specification includes seven schemes, not counting some other schemes which are now obsolete but may still be found.
This Document Character Set is not negotiable in HTML; it is set by the "SGML Declaration" for HTML, which, unlike the transmission character encoding (charset), is not the subject of any kind of announcement or negotiation between client and server.
The first 256 code positions of Unicode are identical with ISO-8859-1, so a document that uses only ISO-8859-1 in its Document Character Set is trivially just a special case of Unicode as the Document Character Set. But iso-8859-1 as a character coding is not directly compatible with any of the unicode character coding schemes. Until this distinction is clear in one's mind, all manner of confusion can arise.
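The distinction can be sketched in a few lines of Python (the codec names are Python's, used here purely for illustration): the code position is shared between ISO-8859-1 and Unicode, while the encoded octets are not.

```python
ch = "\u00e9"        # é: code position 233 in both ISO-8859-1 and Unicode
assert ord(ch) == 233

# Same character, same code position - but different encoded octets:
assert ch.encode("iso-8859-1") == b"\xe9"       # one octet
assert ch.encode("utf-8") == b"\xc3\xa9"        # two octets
assert ch.encode("utf-16-be") == b"\x00\xe9"    # two octets, different again
```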
Some important notes:
One document - one character encoding (the one advertised by the HTTP header or by a corresponding META HTTP-EQUIV in the HEAD of the HTML document): the WWW procedures don't make use of techniques for switching between encodings within a single document. However, you should be aware that some character encodings, for example those of the ISO-2022 type, contain mechanisms that switch between several repertoires within the framework of the one encoding, so the distinction is a rather subtle one when such encodings are used.
Neither the character code nor the Document Character Set have anything directly to do with the document Language. As far as WWW protocols are concerned, it's a co-incidence that Russian is written in Cyrillic, and the Greek language is written using Greek characters! An HTML document can contain several languages, which can (if desired) be marked up with appropriate language name attributes in HTML. Refer to the W3C documents for more about this, as it's somewhat tangential to the present topic. Even where a document contains several languages, it can only, as already noted, use a single character coding ("character encoding scheme"), as given by that now-misleadingly-named charset attribute.
In theory, a document could achieve a fuller repertoire, e.g including Cyrillic, Greek, Arabic etc. in the same document, either by using a multibyte encoding scheme (e.g one of the encoding schemes for Unicode characters), and/or by using named entities or numerical character references. But implementation of the named entities which were introduced by HTML4 was very slow, particularly in some popular browsers (Netscape 4.*, for example), and, even in browsers which were technically competent to display HTML4 properly, there was (and still is) no way to guarantee that the reader has installed their OS with the necessary multinational options (this has been a particular problem for MS Windows users in the USA).
Support for one of the non-Latin-1 8-bit character sets, such as iso-8859-7 Greek, or koi8-r the de-facto Russian Cyrillic, has been present in browsers for quite a bit longer, and may be used, but in a proportion of browser versions the Latin-1 repertoire then became inaccessible, as we will see in the browser tests: fortunately, the last browser of any great significance which had this shortcoming was the old Netscape Navigator, up to versions 4.* inclusive: if we can consider that to be obsolete, then this particular problem goes away. I'm not personally familiar with usage of ISO-2022-based encodings, but these too appear to have been available in browsers for some time, chiefly being used by languages such as Japanese or Korean: they might need some specialised support in the operating system too, e.g a Japanese version of the OS.
Another source of much confusion is to muddle up the meaning of the transmitted document with the technical means of rendering it (display character codes, font names, etc.) inside a particular browser. A little thought should show that these must be kept clearly separate, if there is to be any hope of representing an open, portable, text markup.
One browser that had been developed with attention to the standards was the Tango browser from Alis Technologies. (I used the 30-day trial of 2.5.1 for study, and that was many years ago now). However, MSIE4 seemed to have pretty well caught up; and NS4, although it still had some severe shortcomings in this area, could do rendering of content encoded in utf-8 well enough to be useful, although forms input in this situation was hopelessly broken. By the time of adding this note (2005), I'd say that Netscape 4.* versions are now best ignored if you need to supply such content; and particularly if you need to support forms input.
HTML has three ways of representing characters: encoded characters, SGML numeric character references of the form &#number;, and named character entities of the form &name;. But, the entity names are defined by SGML statements such as e.g

<!ENTITY Epsilon CDATA "&#917;" -- greek capital letter epsilon, U+0395 -->

i.e in terms of the character references already described, so this introduces no new fundamental principle, and for the purposes of our discussion here can be seen just as an additional convenience.
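That equivalence can be checked directly; here Python's HTML5 entity table stands in for the SGML declaration (a convenient approximation, since the name-to-number mapping for this entity is the same):

```python
import html

# Both spellings resolve to the same Unicode character, because the
# entity is defined in terms of the numeric character reference:
assert html.unescape("&Epsilon;") == "\u0395"   # GREEK CAPITAL LETTER EPSILON
assert html.unescape("&#917;") == "\u0395"      # 917 decimal == 0x0395
```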
A reference for entities, which may also be found to be a handy aide-memoire on a useful range of Unicode hex code values and corresponding HTML/SGML decimal character references, is Section 24 of the HTML4.0 Recommendation.
The Unicode Consortium also provides character code mappings and other useful materials.
I find it illustrative, and it may be helpful to you also (please suspend your disbelief meantime), to consider the case where we intend to transfer an HTML document over a network in the EBCDIC code.
Then, all of the individual characters are represented by
different bit-patterns: the letter "A" for example, which
would be represented by hex'41' (decimal 65) in ASCII, is hex'C1'
in EBCDIC. As long as
the recipient understands that the transfer is being made in
EBCDIC, the meaning of the data remains unchanged.
Now, what about the HTML content? The HTML sequence &#65; represents an upper case A: what should we do when converting this HTML file into EBCDIC? And the answer is, we convert the individual characters: ampersand, hash, six, five, semicolon: into their EBCDIC equivalents, but the numerical value, sixty-five, remains unaffected.
The product of this conversion, therefore, is a document whose "Document Character Set" (in HTML/SGML terms) is still ISO-8859-1, but the document itself has been transmitted in the EBCDIC code.
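The whole thought-experiment can be rehearsed with Python's cp037 codec (one EBCDIC variant, chosen here just for illustration):

```python
import html

ref = "&#65;"

# The letter A itself gets a different bit-pattern in EBCDIC...
assert "A".encode("ascii") == b"\x41"   # decimal 65
assert "A".encode("cp037") == b"\xc1"

# ...and so do the five characters of the reference, but the numerical
# value sixty-five is untouched by the transcoding:
ebcdic_octets = ref.encode("cp037")
assert ebcdic_octets != ref.encode("ascii")
assert html.unescape(ebcdic_octets.decode("cp037")) == "A"
```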
The HTML4.0 recommendation seems to have given up hope that servers and server admins will actually send a proper charset attribute on the HTTP header, and goes into a long-winded discussion of the various ways in which a client agent might become aware of - or even have to guess at - the charset appropriate to the document that they are receiving. Oh well, as long as you understand the principles as they were set out in RFC2070 (and, I hope, demystified in the present briefing paper), then you'll be in a position to cope with this particular piece of hand-wringing.
A word of warning, however: according to the CERT security alert CA-2000-02, a document which is sent out without an explicit charset specification can represent an increased security hazard.
Disclaimer: I am using this example as a means to illustrate how character codes other than ISO-8859-1 work (or should work) on the WWW. It's not my place to take sides about how to write Cyrillic: this isn't my area of study at all. See also Fingertip Software. I am using koi8-r here only as a practical example.
A description of KOI8-R may be found at the so-called "Home of KOI8-R" (unfortunately the URLs for this have subsequently become too unstable to include here - try a web search). Another useful resource, which includes additional practical advice on actually authoring Cyrillic documents, is provided by Paul Gorodyansky; but such details are outside of what I want to cover here, and also, it deals only with the use of a single non-Latin repertoire (i.e Cyrillic) whereas the present page is addressing the use of extended character repertoires.
Note that many of the characters defined in ISO-8859-1 do not exist in this code; and some of those that do exist (copyright, degree, superscript-2, division sign, the no-break space) are in different places. So there's plenty of scope for confusion.
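Python's codec tables make the relocation easy to demonstrate (copyright and degree signs shown here; the other shared characters behave similarly):

```python
# Two characters that exist in both codes, but at different code positions
# (so octets meant as koi8-r must never be read as if they were Latin-1):
assert "\u00a9".encode("iso-8859-1") == b"\xa9"   # copyright sign in Latin-1
assert "\u00a9".encode("koi8_r") == b"\xbf"       # ...elsewhere in koi8-r
assert "\u00b0".encode("iso-8859-1") == b"\xb0"   # degree sign
assert "\u00b0".encode("koi8_r") == b"\x9c"
```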
You'll recall that in HTML 2.0 or 3.2, the ISO-8859-1 characters can be represented in three different ways: as a named entity &name;, as a numerical character reference &#number;, or as an 8-bit character.
Things start to get confusing when the 8-bit transmission code no longer comprises the ISO-8859-1 repertoire, and of course this is the case for KOI8-R, where the upper half is assigned to Cyrillic letters etc., and this is what we tackle next.
OK, so a document has been sent in the koi8-r coding. It means that the 8-bit characters (specifically, those in the upper half) have to be interpreted according to the koi8-r code; however, according to the standards, all of the named entities and numerical character references which it contains are still meant to be interpreted by reference to the Document Character Set, Unicode (or its subset, ISO-8859-1). Obviously, in general, this needs a repertoire of printable characters that exceeds 256. Whether that is feasible in a particular browser depends on its internal design. The important point to keep in mind at all times is that the meaning of an HTML document never can be changed by some incidental feature of the browser that receives it. Either the browser renders it usefully or it does not. But the meaning of the document remains invariant. This may seem obvious, as stated - but it's amazing the extent to which people refer to features of browser platforms (DOS codepages, the Mac character code, named fonts, etc.) as if these might be allowed to influence the meaning of the received documents... that way lies madness.
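Sketched with Python (the sample text is of course arbitrary), the two rules apply independently to one and the same octet stream:

```python
import html

# A koi8-r document that also uses a Latin-1 numerical character
# reference: the 8-bit octets follow the koi8-r code, the reference
# follows Unicode - two separate rules, one byte stream.
octets = "\u0421\u043b\u043e\u0432\u043e caf".encode("koi8_r") + b"&#233;"
text = html.unescape(octets.decode("koi8-r"))
assert text.endswith("caf\u00e9")   # &#233; is é regardless of the charset
```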
Various communities of users have in the past got into the habit of sending out their documents as Content-type: text/html (i.e without an explicit charset specification) on the assumption that the reader would have already set their browser to the expected character encoding. This was always a technically-incorrect thing to do: documents should be sent out with a Content-type that specifies the correct charset. Earlier versions of HTML (2.0 and 3.2) were based on the default charset being iso-8859-1, whereas HTML4.0 explicitly states that there is no default charset: so under either specification, an explicit charset is required.
Unfortunately, some ancient browser/versions got upset when sent a Content-type header that included an explicit charset, and refused to display the document at all. So, users in these communities got accustomed to configuring those (now-obsolete) browsers to use a different font - one in which the characters were arranged according to their favourite encoding - so that the 8-bit characters would look right, and ignored the charset issue entirely. Nowadays, the supporters of this (unwise) procedure would tell the reader to use their browser's character-encoding override feature as a workaround. While this swerve might be understandable in relation to those older browsers, it has to be pointed out that this isn't (and actually never was) in accord with the published specifications: it's a pity that there were browsers that failed to implement the specs correctly, leaving us a legacy of non-conforming documents.
As if that wasn't bad enough, we now have the problem that increasing numbers of web pages are being provided by authors who have neither the necessary level of support from their provider, nor the expertise themselves, to know how to send out pages with a proper charset header. They get told to put a META HTTP-EQUIV into their HTML documents instead. Well, in many situations this gives a good enough impression of working, but there are situations where it is actually counter-productive: for example the Russian Apache server offers on-the-fly character code translation ("transcoding"), which results in a document that is correctly coded in a different charset, but the transcoder needs to filter-out the original META HTTP-EQUIV specifying the original character coding, or it could cause confusion (e.g when the HTML is stored to file, or on older NN4.* versions which implemented the wrong priority between HTTP and META charset). And similar things would happen with recoding proxy servers (such things were for example used in Japan, to mediate between the different character codings that are used in that area, although their current prevalence on the WWW is not known to me). Well, if browsers followed the specifications correctly, this would not be so bad, because the server's explicit HTTP header would override the now-incorrect one in the document's META, but unfortunately some browser implementers (notably Netscape) got this wrong, and so the browser will display wrongly in this case.
Additional complications arise if authors try to use http-equiv in conjunction with XHTML (see XHTML/1.0 Appendix C section 9 for the formal position). These complications can be avoided by using the real HTTP Content-type header for specifying the charset, just as was being recommended here all along.
Within the scope of this present article, the tests were meant only to illustrate the kinds of things that go wrong in some actual browsers, so as to supplement the theoretical account of what's supposed to happen. There's also a separate article with some detailed browser tests.
The tests cover the use of individual non-Latin-1 repertoires (the one defined by the encoding used, i.e the charset) in conjunction with Latin-1 (called out by using &entity; and &#number; references).
This screen shot shows the Alis Tango browser displaying a (pre-formatted) document that has been sent out with charset=koi8-r. Note: koi8-r also assigns printable characters in the range 128-159 inclusive, but they were not included in this demonstration. It may be worth noting that this screen shot was made in late 1997, although I had seen this browser behaving correctly quite some time previously: Netscape were still releasing updates of their version-4 browser, which is fundamentally broken in this area, right into the year 2002.
The test material was, apart from the charset parameter, the same ISO-8859-1 test table that I was using for Latin-1 compliance, but of course the 8-bit characters (col.6 in this version of the table) instead of following the ISO-8859-1 code now must follow the KOI8-R code.
Note that columns 7 and 8 are still displaying the ISO-Latin-1 repertoire, as required by the HTML internationalization specifications.
All too often the answer was "no", as you can see from the accompanying detailed tests. The commonest blunder, at the time this page was originally formulated, was to interpret the named entities and/or numbered character references by reference to the incoming transmission coding (that ineptly-named "charset" parameter), instead of to the Unicode assignments as prescribed in the standard. And of course, sadly, you'll find documents authored by people who, instead of avoiding the problematical constructions, have coded them wrongly in order to get the desired results out of the broken browsers - meaning, of course, that they get unwanted results from non-broken browsers. Fortunately, things are much better now (2005/2006) and these old mistakes are documented only because they can throw light on the kind of misconceptions which still afflict some newcomers to the field.
My own tests only covered the browsers' ability to display normal HTML markup; they did not assess forms submission etc. (for koi8-r those issues are addressed at the "koi8-r home page"). My results are summarised in the following table, as far as they are available.
Those old versions of
Lynx probably worked fine with Greek, but weren't tested: currently available
versions of Lynx have excellent, comprehensive character code support,
when properly configured in a suitable terminal/emulation environment.
| Browser | 8-bit encoded character | Numeric character reference | Named entity |
| Alis Tango (2.7.1) | Yes | Yes | Yes |
| Lynx 2.6/2.7 | Yes / don't know | approximations / don't know | approximations / don't know |
| MS IE 3.01 w/ Pan European kit | OK, some approximations / don't know | Wrong! / don't know | Yes / don't know |
| MS IE 3.03 16bit | Yes | Yes | Yes |
| MS IE 4 32bit | Yes | Yes | Yes |
| NS Nav 3.01/3.03 | Yes | Wrong! | Wrong! |
| NS C 4.04 | Cyrillic letters OK, many other chars wrong | Greek letters OK, some other chars wrong | Most characters are rendered as "?"; just a few (seemingly those that were present in the "charset" repertoire) are rendered correctly |
It seems little short of absurd that some graphical browsers that aim to support these features put up such a pathetic performance. The protocols (HTTP and HTML) had been clear enough for several years already, even if they were only in draft; given access to appropriate, compatible, fonts, the rendering of multi-alphabet documents is surely a solved problem.
Clarification: I should make it plain that the browser problems that I am describing here are only an issue if you hope to introduce into your 8-bit non-Latin-1 document also some entities or numerical character references from Latin-1 or from other repertoires. If, in fact, you are content to stay with your 8-bit non-Latin-1 repertoire and to make no use of anything else, then (so long as your document contains 8-bit characters, "real Russian characters" as Paul Gorodyansky called them in a Russian context), there isn't a problem.
This is now in a separate page.
On the German-language authoring group, the question was asked whether one can use German together with Turkish. Well, the question could as well be asked for Turkish with French or other W.European languages, as we will see (but not Icelandic).
The 8-bit ISO coding used for Turkish is iso-8859-9, and it is identical with iso-8859-1 except in six places: upper and lower case eth, thorn and y-acute. So, if a browser supports iso-8859-9 at all, then it's likely that it will display all the Latin-1 characters too, except those six. And so indeed I found in my tests of recent browsers (NS3.01, NS4.05, MSIE4). On the other hand, MSIE3.03 (16-bit) didn't support iso-8859-9 anyway. MSIE4 even represented the remaining six Latin-1 characters correctly, when they were presented as entities or numerical character references, as indeed a browser must for standards conformance; NS3.01 displayed them wrongly, and NS4.05 displayed the six as "?". But the conclusion seemed to be the expected one, that if the browser supported iso-8859-9 at all, then it could be used for documents containing German, French etc. with Turkish.
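The six differing positions can be enumerated directly from the codec tables; a Python sketch:

```python
# Enumerate the code positions where iso-8859-9 differs from iso-8859-1:
diffs = [b for b in range(256)
         if bytes([b]).decode("iso-8859-1") != bytes([b]).decode("iso-8859-9")]
assert diffs == [0xD0, 0xDD, 0xDE, 0xF0, 0xFD, 0xFE]   # just six positions

for b in diffs:
    print(hex(b), bytes([b]).decode("iso-8859-1"), "->",
          bytes([b]).decode("iso-8859-9"))
# eth, y-acute and thorn (upper and lower case) give way to Turkish letters
```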
In the iso-8859-* codes, characters 128-159 have been reserved for control functions; the original motivation seems to have been the risk that they would be transmitted inadvertently over a 7-bit path and could cause disastrous control functions (such as page eject, or putting the display device into graphics mode etc.) instead of just a single wrong display character. Whatever the original motivation, this principle has been firmly followed by the iso-* codes, including iso-10646/Unicode. It isn't necessarily followed in other encodings, though: examples of encodings where these codes represent printable characters include koi8-r, as well as the MS Windows codings.
MS Windows introduced a group of codings in which these code
positions were used for printable characters, some of which are
much in demand with certain authors: the trademark glyph, matched
quotes and so forth. These are the encodings such as "code page" 1252.
It would appear to be protocol-correct to offer documents in these
encodings, with 8-bit characters in that range, as long as they
are sent with an appropriate charset value and the recipient
accepts this charset encoding.
That is not at all the same thing as attempting to represent those
characters by numeric character references such as
&#153; as one so often sees.
The meaning of the latter construct is undefined
(N.B: not "illegal", but "undefined") in standard HTML:
the protocol-correct representation
of a trademark as a numeric character reference is in fact
&#8482; as can be seen in the W3C reference
already cited; and correspondingly for the matched quotes and such.
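The arithmetic behind those two references can be checked against Python's cp1252 codec:

```python
# The trademark sign sits at code position 153 in windows-1252...
assert b"\x99".decode("cp1252") == "\u2122"

# ...but in the Document Character Set (Unicode) its number is 8482,
# so &#8482; is the protocol-correct reference; 153 falls in the
# C1 control range, where no printable character is defined.
assert ord("\u2122") == 8482
assert 0x80 <= 153 <= 0x9F
```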
In documents from Microsoft themselves, these codings are frequently referred to as "ANSI", but no-one seems to be able to adduce any formal basis for this usage: for example in one Microsoft document the Windows character sets (plural!) are termed "an assortment of Windows ANSI character sets", but without citing any official ANSI standard. At the Unicode site, the mappings for Windows codes are firmly listed under "VENDORS/MICSFT", and not under any American National standards body.
As Markus Kuhn is quoted as reporting (translated from the German): when he was at Purdue University he spent hours digging through the standards library there (which held all the current ANSI standards), and this standard would certainly have come to hand if it existed. All the ANSI character set standards are by now merely national editions of the corresponding ISO standards. It therefore makes little sense to speak of "the ANSI character set", as Microsoft does when they really mean their proprietary CP1252 character set (an extension of ISO 8859-1), or of "8-bit ASCII".
If anyone can support MS's attribution of these codes to ANSI, I would be happy to cite it here. The closest we have got so far, was the theory that ANSI had previously started drafting a standard for 8-bit coding, but had already fallen-in with the ISO standards before any national standard saw the light of publication.
Issue: charset value for MS Windows codepage 1252.
Several years back, consultation of the official IANA character set assignments document revealed the rather surprising oversight that codepage 1252 was missing from the list, although 1250, 1251, and 1253-1258 had been included since 1996. By analogy with the existing registrations, it appeared that the valid charset name for this code would be windows-1252.
I later received an email from Chris Wendt, Program Manager for IE
at MS, saying that they were getting windows-1252
registered at IANA.
After some further delay, the entry appeared, dated December 1999.
Possible aliases such as
In practice, the mass-market browsers tend to behave as if their default charset were code page 1252 anyway, rather than iso-8859-1 as the specification calls for: but note that these characters will disappear entirely when displayed with iso-8859-1 fonts, which often happens by default with X Windows systems (at least as they are set up in Western locales). For a rather trenchant account of this problem, see the demoroniser.
The relationship between iso-8859-7 and Windows-1253 is somewhat analogous to that between iso-8859-1 and Windows-1252, in that the Windows coding assigns additional displayable characters in the range 128-159 decimal, which the iso-8859-* codings reserve for control functions.
However, whereas the range 160-255 is the same in iso-8859-1 as in Windows-1252, there are a few differences between iso-8859-7 and Windows-1253 in this range.
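A Python sketch of the comparison (skipping the handful of positions that one or other coding leaves undefined, which make the decode raise):

```python
# Compare the 160-255 range of the two Greek codings.
diffs = []
for b in range(0xA0, 0x100):
    try:
        iso = bytes([b]).decode("iso-8859-7")
        win = bytes([b]).decode("cp1253")
    except UnicodeDecodeError:
        continue          # position undefined in one of the codings
    if iso != win:
        diffs.append(b)

assert diffs    # unlike 8859-1 vs cp1252, this range is NOT identical
# The Greek letters proper do agree, e.g. capital alpha at 0xC1:
assert b"\xc1".decode("iso-8859-7") == b"\xc1".decode("cp1253") == "\u0391"
```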
There's a habit, learned apparently from
platform-specific word-processing applications, of
trying to get exotic characters displayed in HTML documents by using
<FONT FACE=Symbol> and suchlike to select a
different repertoire of displayable characters.
(Here I cover the topic only briefly; later I wrote a more detailed
page, "Using FONT FACE to extend
repertoire?", about it.)
As far as HTML is concerned, this is at entirely the wrong protocol
level. The transmitted octets, the &name; entities, and the
&#number; representations, have meaning that is defined
by the HTTP and HTML protocol standards: that meaning could be displayed
by cosmetically different fonts (controlled by a style sheet, or by
FONT FACE for those who care for it), but to select a font
that produces a quite different displayed character is entirely
contrary to the intentions of HTML.
Although it might appear to produce the effect intended
by the author, in a limited
range of viewing situations (yes, I'm well aware that these viewing
situations are statistically very common, but I still say they
represent a "limited range of viewing situations"),
it can produce all kinds of deleterious
consequences, including undiagnosed but incorrect display in other
viewing situations, incorrect indexing by search robots, etc. etc.
Who knows how a speaking machine is supposed to cope with this?
Properly, it's the job of a browser to recognize the meaning (e.g &#bignumber;) of HTML markup, and to make whatever font selection is needed internally for displaying that meaning to the reader. It should be no part of an HTML author's job to second-guess what fonts the reader might have at their disposal, and to interfere in the browser's selection of them (other than for cosmetic reasons).
A properly standards-compliant browser would do better to treat such a construct by noticing that the "Symbol" font does not contain the character "a", and either refusing to display it, or choosing a cosmetically-similar font in which the character "a" is present, in order to ensure an uncorrupted display in HTML terms. But this is not at all what those platform-specific Gatesware tools are trying to lock you into. The result wouldn't, of course, be what the misguided author intended, but it would be what the specifications call for: authors who ask for the wrong thing shouldn't be too surprised when they occasionally get it, hmm?
Indeed, the HTML author should not need to know anything about the machinery that exists in a client platform for turning coded characters etc. via font resources into a screen display: the whole thing should be treated as a black box as far as the HTML author is concerned.
And the same principles apply, for sure, when writing style sheets that specify named fonts.
Beware of other sites that "support" different writing systems not by using a defined character encoding, but by using specialised fonts that appear to display the desired character in response to some normal (e.g. Latin-1) character code. This technique, commonly adopted in earlier times, runs very much contrary to the principles of HTML and the WWW. [In a way it's a pity, as there are some really fun resources out there (see for example the Yamada Language Center's font archive at UOregon) based on that way of doing things: but, standards are standards!]
I've said this before, but I make no excuse for saying it again, because the topic just keeps coming up, over and over. Experience with earlier browsers has left people so accustomed to the idea that the browser just has to get a font that corresponds to the incoming character coding, plug it in, and the problem is solved. In the terms of HTML i18n (RFC2070/HTML4.0) this is a misconception. For example, when I'm using 16-bit MSIE3 to view a koi8-r document, it's necessary to configure its Cyrillic preferences to be a windows-1251 font, not a KOI8-R font. And when the browser is presented with Latin-1 entities or character references in this situation, it calmly goes and uses a Latin font for the purpose, just as the theory says it should.
To put it briefly, again, it is entirely a private matter for the browser implementer to decide how to render the HTML in terms of the resources at their disposal: for example, a selection of various cosmetically-compatible fonts. What the HTML means is defined by the interworking specifications (RFC2070, HTML4.0). The browser's job is to render the HTML in accordance with that specification. There's nothing that says a unicode character stream (utf-8, say) needs a unicode font (it could use several 8-bit fonts), nor on the other hand that an 8-bit stream needs an 8-bit font (it could use the appropriate characters from a unicode font).
I'm deliberately not covering the issue of data encodings in any
detail. I'm recommending that, at least for authors like myself who
write predominantly in a Latin alphabet and occasionally want to
include "foreign" characters, it's more robust and portable for
characters to be represented by &-notation
(entities and numerical character references) in HTML.
If you already work in a multibyte or unicode environment using
suitable tools (for example Far-Eastern readers) then you presumably
know what to do already; but for occasional snippets using non-Roman
character sets, I suggest that the
recommendations I give here should suffice - and that includes
use for mathematics, in so far as that is feasible in standard HTML.
However, to circumvent a shortcoming in Netscape browsers, there can be advantages in advertising a content-type charset of utf-8, and it may be useful to understand briefly what this implies.
The theoretical part can be found in RFC2279.
Every character outside of the 7-bit us-ascii repertoire is
represented in a utf-8 datastream by a specific sequence of
two, three or more bytes, all of which have the high bit set
(note that in a utf-8 data stream,
an individual byte with the high bit set has no
meaning on its own, but only as part of the multibyte
sequence to which it belongs).
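As a quick illustration of the byte structure (a Python sketch of my own, not anything from the RFC itself):

```python
# Two examples of the multibyte sequences described above: a two-byte
# and a three-byte character.
for ch in ('\u0416', '\u20ac'):   # Cyrillic ZHE, and the euro sign
    seq = ch.encode('utf-8')
    print('U+%04X -> %s' % (ord(ch), ' '.join('%02X' % b for b in seq)))
    # every byte of such a sequence has the high bit set
    assert all(b & 0x80 for b in seq)
```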
For my own education I wrote a trivial perl script that generates
utf-8 encoded output in response to
&#number; representations in the input - and
the result seems to be acceptable to utf-8-supporting browsers.
Don't misunderstand me: this script isn't a piece of
production software, I only
wrote it to educate myself, and if you're interested, I might suggest
you could do the same, in whatever language you favour.
You can check the result by feeding it to a suitable browser
and comparing the display with the original.
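The same exercise is easy in other languages too. Here is a toy Python equivalent of the idea (my own sketch, and like the perl original it's for education, not production use):

```python
import re

def ncr_to_utf8(text):
    """Replace each &#number; reference with the UTF-8 encoding of
    that character, returning the resulting byte stream."""
    def repl(m):
        return chr(int(m.group(1)))   # the actual Unicode character
    # Substitute the references, then encode the whole string as UTF-8.
    return re.sub(r'&#(\d+);', repl, text).encode('utf-8')

print(ncr_to_utf8('caf&#233;'))
```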
You could consider storing such a datastream, and serving it out
via your web server (HTTPD) with an appropriate content-type charset attribute.
This all works just fine on platforms such as unix or MS-Win:
if you are working on a classic Mac (or worse, an EBCDIC-based system)
or other kind of system whose native storage encoding
of characters gets mapped
into iso-8859-1 when served out through an HTTPD, then you're
likely to get into difficulties handling this kind of thing.
In any case, as I said, I'm not really recommending this to the
casual author, but only exploring it briefly in order to
illuminate what the standards mean.
It's an important and useful property of the utf-8 encoding that any stream that contains only 7-bit us-ascii characters is also a legal utf-8 datastream. You're already aware that a stream of 7-bit us-ascii characters is also a legal iso-8859-1 stream (or indeed iso-8859-anything). So, if you keep your HTML documents written entirely in 7-bit characters, using &-notation for anything else, then you can claim the datastream to be iso-8859-1 (the HTML default) or utf-8 whenever it suits you, without needing to tamper with the data itself in any way.
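This dual validity is easy to demonstrate (a Python sketch of my own):

```python
# A document kept entirely in 7-bit characters, with &-notation
# for anything outside us-ascii...
doc = b'<P>Caf&#233; au lait</P>\n'
assert all(b < 0x80 for b in doc)      # genuinely 7-bit throughout
# ...decodes to exactly the same characters whichever label it is
# served under:
assert doc.decode('iso-8859-1') == doc.decode('utf-8') == doc.decode('ascii')
print('identical under all three labels')
```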
Let's stress this point again. To include characters beyond us-ascii, you can either keep the data itself 7-bit and use &-representations in HTML (as I already recommended), or transform the data into a genuine utf-8-encoded datastream (a topic that I haven't covered in any detail here).
The title says "information providers" rather than
authors, since there's no fundamental reason that the
author has to deliver the format that goes out onto the WWW.
A correspondent asked me, for instance, "do you really
want me to author my Russian documents by typing those
&#number; representations for each
letter?", and the answer is "no, I am only telling you that there
are reasons why it may be useful for the server to send out
documents in that form".
The act of "publishing" an authored document via a
web server could well involve some kind of
transformation, from a format that's convenient for
the author to produce, to a format that's robust
for use on the WWW.
And indeed that latter format could change with time,
as the population of browser/versions evolves.
As you see above, several of
the browsers that offer to support 8-bit character codes other
than iso-8859-1 only seem to work reliably
when offered 8-bit data; use of entities (apart from the basic
&gt;, &lt; and &amp;, of course)
would seem to be problematical for the poor wee things.
Authors certainly may use quite a range of 8-bit codings successfully, in full accord with HTML standards: that includes not only Latin repertoires other than Latin-1 (which we aren't studying closely here), but also repertoires such as koi8-r, iso-ir-111, Greek, etc. But if they expected good browser coverage when using such non-Latin repertoires, then (at least at the time of writing this, when Netscape 4.* versions were still in widespread use), for the Latin-1 characters they needed to restrict themselves to the 7-bit un-accented letters of US-ASCII. There are no grounds for this in theory, but the shortcomings of some popular browsers (particularly Netscape 4.*) made it necessary if successful display was to be achieved. (Note added 2005: I'm not aware of any recently available browser which has this specific shortcoming. Some simple browsers are just plain incapable of displaying non-Latin repertoires, so it's pointless to discuss them in this context; but those which can display a wider character repertoire, as most of the popular browsers can, have no problem displaying the full Latin-1 repertoire when the character encoding is something quite different.)
In theory, if you have a document predominantly
in one non-Latin alphabet, let's say Cyrillic, and you want to include
just a few characters from some other repertoire, let's say for
example Greek or some mathematical symbols (or even
just some Latin-1 accented letters), the HTML specifications
would permit you to offer this document in an 8-bit encoding,
say koi8-r, and to represent the Greek/math characters by &#number; references.
And indeed this works fine in MSIE4, or any other browser that has
been implemented according to the spec.
But it doesn't work in NS4, because as long as the
browser's "charset" setting is koi8-r, it fails to display the
Greek/math characters, whereas, if you attempt to set the "charset"
manually to utf-8, it of course wrecks the interpretation of the 8-bit
characters. So this doesn't work in practice with NS4.*.
So this takes away a number of otherwise attractive options that the specifications would permit. There are a few exceptions, where the characters in question exist in both codes (copyright, for instance) and NS4 will render those successfully.
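For browsers that do implement the specifications, producing such a mixed document can even be automated. Here's a small Python sketch of my own (Python being my illustrative choice, and the sample text invented for the purpose): it keeps the Cyrillic as koi8-r bytes while falling back to &#number; references for anything the coding cannot hold.

```python
text = '\u041f\u0440\u0438\u0432\u0435\u0442, \u03b1 \u0438 \u03b2'
# ("Privet, alpha i beta": mostly Cyrillic, with two Greek letters mixed in.)
# Characters absent from koi8-r are replaced by &#number; references,
# giving exactly the mixed form that the specifications permit:
body = text.encode('koi8-r', errors='xmlcharrefreplace')
print(body)
```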
To sum up these observations, then,
it is, frankly, my impression that authors starting on such a
project today would do better not to mess around with these
individual 8-bit language-oriented encodings,
but to start right in to Unicode.
I can vouch for the support for &#number;
representations in the popular browser versions in use at the time this
was written (Netscape Communicator 4.x and MSIE 4.x), at least on the MS-Win platform;
if you are in a position to generate it, then the support for
utf-8-encoded data also seems to be good.
I have had good results
by composing documents entirely in 7-bit US-ASCII, representing
accented Latin-1 letters by their &name;
representations, and other characters by &#number;
references; for the benefit of Netscape (4) such documents need
to be sent out with a charset of utf-8 although,
according to the specifications, that should not be necessary.
(Of course, for documents that were predominantly not in the
Roman alphabet this would represent a massive bloat and would
be much better avoided, if only the current browsers supported
a better way.)
It might be worth bearing in mind that if you have a document
already created in some specific form, it might well be
feasible to convert it programmatically into a form that complies
with WWW standards. For example, given a class of
(quasi)-HTML documents that contained
FONT FACE=Symbol representations, I wrote a rough-and-ready converter
to turn them into the corresponding &#number; references
(this was a proof-of-concept script, not intended as a fully-functioned
production-quality piece of software, so please don't expect me to give
you a copy of it and get it to work on your arbitrary documents!).
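The idea is easy enough to reconstruct. Here's a deliberately tiny Python sketch of my own (the three-entry mapping is hypothetical and far short of the full 256-entry Symbol table a real converter would need):

```python
import re

# A deliberately partial mapping from Symbol-font byte positions to
# Unicode code points: in the Symbol font, 'a' displays as alpha, etc.
SYMBOL_TO_UNICODE = {'a': 0x03B1, 'b': 0x03B2, 'g': 0x03B3}

def convert_symbol_runs(html):
    """Turn <FONT FACE=Symbol>...</FONT> runs into &#number; references,
    leaving unmapped characters as references to themselves."""
    def repl(m):
        return ''.join('&#%d;' % SYMBOL_TO_UNICODE.get(c, ord(c))
                       for c in m.group(1))
    return re.sub(r'<FONT FACE=Symbol>(.*?)</FONT>', repl, html)

print(convert_symbol_runs('The angle <FONT FACE=Symbol>a</FONT> is small'))
```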
My i18n work for rtftohtml also serves as an example of taking a source document composed in a way convenient to the author, and turning it into usable and valid HTML.
When sending out a document that is other than Latin-1 (iso-8859-1)
you really ought to send that
charset attribute, preferably on
the HTTP content-type header or, if that's not possible, then on
a META HTTP-EQUIV.
The habit that some communities have, of sending out documents in
their local charset without any charset attribute, may be the de facto
custom within their community, but it's unacceptable for use on the
WWW. Hint: the usual method nowadays to configure this in
Apache is to use the
AddCharset directive, which could
be placed in appropriate
.htaccess files for those who
do not have access to the main configuration.
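For example, a minimal .htaccess along these lines would do it (the file extensions here are just an illustrative convention of mine, not anything mandated):

```apache
# Label documents explicitly by file extension, so the HTTP
# content-type header carries the proper charset attribute:
AddCharset KOI8-R .koi8
AddCharset UTF-8  .utf8
```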
But I do want to emphasise that my advice is offered in a situation where documents predominantly use a Latin alphabet, and the use of other characters is on a relatively small scale. If you are authoring in a situation that involves, and with software tools that permit the use of, other kinds of coding, then there may be better ways which you can use.
This note doesn't aim specifically to deal with typographical characters, but there's a useful page by Henry Churchyard.
Original materials © Copyright 1994 - 2006 by A.J.Flavell