Checklist for HTML character encoding

With a postlude about CSS

This page presents a number of character repertoire scenarios and makes recommendations for optimising accessibility across older browsers and versions. The emphasis here is on clear recommendations - which can be rendered on an appropriate range of browsers if they have been properly configured - rather than on explaining too many exceptions and special cases: other supporting material in this area should be helpful in better understanding the choices. Handling of forms input isn't covered here (some incomplete notes on forms submission are available separately).

Important: this web page concentrates on the form of the document as it will be sent out as text/html from a web (HTTP) server, and does not address how to author it in the first place nor how to get it onto the server. Those details are too many and varied (and OS-dependent) to be dealt with adequately here; whereas what is sent from the server to the client is clearly defined and platform-independent (if it were not, then it would be a failure in WWW terms!), and that is what we are concentrating on here.

Compatibility for older browsers is only really relevant for the content-type of text/html: that is to say, either "HTML proper", or compatibility-mode XHTML/1.0 ("appendix C"). Those who want to use overtly XML-based content types will inevitably be incompatible with older browsers (and quite a few current ones, indeed), and so, if utf-8 is appropriate, then just go ahead and use it. There's a note about writing xhtml/1.0 for compatibility.

Terminology

As I've commented elsewhere in this area, the terminology used in relation to the representation of characters in HTML and the WWW often causes confusion to those who gained experience in character handling in a different field, e.g word-processing. This is not the place for a full tutorial: I've tried to keep the checklist understandable even without a deep knowledge of the topic, but I do encourage readers to develop a familiarity with the HTML character model to avoid unnecessary confusion.

"8-bit Character Repertoire" refers here to a repertoire of no more than 256 characters that is supported by one of the various 8-bit character codes. Examples would be the 8-bit codes defined by ISO (iso-8859-n for various values of n) or by others (localised encodings such as Thai encoding TIS-620, or vendor-defined encodings such as Windows-1250, macRoman...).

An 8-bit repertoire represents the largest repertoire that many early browsers could support at any one time, at least by means of specification-conforming techniques (and remember, an HTML document has to be entirely in one encoding, it's not possible to change coding partway through a single document). Many of these older browsers could support different "8-bit character repertoires", with the browser automatically switching in response to the incoming character encoding (that MIME charset= attribute) of each individual document.
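As a small illustration (a Python sketch added here, assuming only that Python's codec names match the MIME charset names), the very same octet stands for quite different characters depending on which 8-bit code the document is advertised with:

```python
# The same octet denotes different characters under different 8-bit codes.
octet = b"\xe1"

assert octet.decode("iso-8859-1") == "\u00e1"  # a with acute (Latin-1)
assert octet.decode("iso-8859-7") == "\u03b1"  # Greek small alpha
assert octet.decode("tis-620") == "\u0e41"     # Thai character sara ae
```

This is exactly why the advertised charset matters: the octets alone do not determine the characters.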

Some more-recent browser versions, even if not offering full support for Unicode, could nevertheless deal with a wider document repertoire than just one 8-bit encoding: see for example scenario 5.

"advertise as" refers to the specified "character encoding" (HTML terminology) with which the document is to be sent out from the HTTP server. This should be defined by the "charset=" parameter (correct MIME terminology, but now rather misleading in an HTML context) specified on the HTTP Content-type: header. The specifications also allow for this to be specified via meta http-equiv within the HTML document, but this is less satisfactory both on theoretical grounds, and on some practical considerations, as is discussed in more detail elsewhere.

"Coded character" refers to the character itself, expressed in the advertised character encoding: i.e a single octet (byte) if this is an 8-bit coding, or an appropriate sequence of octets if this is a multibyte coding. This is in distinction to a character expressed by one of HTML's &-representations: character entity of the form &entity;, or numerical character reference of the form &#number; decimal (widely supported) or &#xhhhh; hexadecimal (somewhat less widely supported), remembering that these numbers in &#-notation refer to the character's position in iso-10646/Unicode, irrespective of the character encoding (charset) used.

Theoretically, the three different kinds of character representation - the coded character, the numerical character reference, and (where available) the named character entity - are fully equivalent as far as HTML is concerned; the point of this checklist is the practical issues that favour the choice of one representation rather than another in various actual situations (the "scenarios") presented below.

I'm not attempting to cover characters of the Chinese, Japanese, Korean (CJK) kinds, as these are outside my area of expertise.

Scenario 1: Latin-1
  Recommendation: Use &entityname; notation (widely supported and more mnemonic), or &#number; "Numerical Character References". Advertise as iso-8859-1.
  Notes: Particularly recommended for those working cross-platform (e.g Macs) without the relevant expertise in handling 8-bit coded text. See also the WDG's advice.

Scenario 2: Latin-1
  Recommendation: Alternative to scenario 1: use 8-bit coded characters, advertised as iso-8859-1.
  Notes: Contrary to rather widespread superstition, 8-bit coded characters are entirely legal on the WWW (see Note A).

Scenario 3: Latin-1 with Windows typographical extras (matched quotes, em-dash etc.)
  Recommendation: Not recommended, see Note B. If extended character coverage is being used anyway, then use the methods of scenario 6 (or 7).

Scenario 3a: Windows Latin-1 repertoire (including euro)
  Recommendation: Proprietary and not really recommended, but admittedly rather widely supported, even by relatively old browser versions which cannot handle scenario 6: use 8-bit characters, coded with charset=windows-1252.
  Notes: Alternatives: Note B. The more forward-looking approach is to follow the methods of scenario 6 or 7. A composite of Latin-1 with one other 8-bit repertoire (e.g Latin-2) could be done as in scenario 5; see also the discussion in Note B. &euro; is now rather widely supported, and might be used in scenario 1 or 2 if desired.

Scenario 4: One 8-bit repertoire
  Recommendation: Choose an 8-bit encoding appropriate to the desired repertoire (preferably an ISO code, e.g iso-8859-7 for Greek, or one that is widely used in its native habitat, e.g TIS-620 for Thai). Use 8-bit characters, advertised accordingly.
  Notes: This form of document is accessible to a very wide range of browser versions, even old ones, although it might require additional setup or fonts to take advantage of the browser's capability. Some issues are explored by J.Korpela. For HTML use, avoid iso-8859-15.

Scenario 5: One 8-bit non-Latin-1 repertoire, together with Latin-1
  Recommendation: Code the non-Latin-1 characters as in scenario 4, i.e as 8-bit coded characters, advertising the document with the appropriate encoding ("charset"). Express the Latin-1 characters as in scenario 1, i.e as &entity; references.
  Notes: This form of document is entirely valid and accessible to any client which conforms to RFC2070, but many characters fail on Netscape 4.* versions: see Note F. See scenario 8 for a possible workaround.

Scenario 6: More than one 8-bit repertoire, but predominantly Latin text
  Recommendation: Code everything using only us-ascii characters (i.e 7-bit), expressing all other characters, even Latin-1, by means of &-notation. For Latin-1 characters, &entityname; is recommended, whereas for non-Latin-1 characters, &#bignumber; (Unicode values) is preferred, even where an HTML4 entity is defined, since browser support is more widespread. Advertise the document as utf-8.
  Notes: This of course needs a browser version which supports enough of HTML4/RFC2070 to understand what's needed. It therefore shuts out some old browser versions which could have coped with scenario 4 or 5. The browser might need some extra setup to enable this capability, e.g extra font(s) and settings. Refer also to scenario 8, and see Note C.

Scenario 7: More than one 8-bit repertoire, not limited to predominantly Latin text
  Recommendation: Use actual utf-8 coded characters, advertised as such.
  Notes: Just like scenario 6, this is an entirely valid form in which to send out documents, and is acceptable to any RFC2070-conforming browser as well as to Netscape 4.* versions. Browser coverage for the two forms seems rather similar. The expected difficulties are not in the browsers, but in authors (mis)handling this unfamiliar data format.

Scenario 8: Compromise solution for scenario 5, using the techniques of scenario 6 or 7 for browsers which support them
  Recommendation: Make the document in two different forms (or generate them as required "on the fly"). Use server negotiation (based on the client's Accept-charset: header) to send the utf-8 version to those clients which indicate ability to accept utf-8 (this includes Netscape 4.* versions, which are otherwise defective in this regard), while sending the "scenario 5" version to any other clients.
  Notes: See Note D.

Commentary

Versatility:

All of the schemes recommended here use valid techniques according to published specifications and can (subject, of course, to the limitations of each scheme) be programmatically converted from one form to another. Thus it isn't essential that your authoring tool produces the precise form recommended. There are numerous ways of doing such a conversion in an HTML-aware fashion, including simple command-line utilities and pipelines, depending on your preferences - some of which could be used for on-the-fly conversion in the server, if you wish. For those who prefer a point-and-click solution prior to uploading to the server, your HTML can be loaded into Mozilla Composer (or one of its derived authoring packages such as Nvu) and then saved with encoding conversion: characters in the content will be converted between coded characters and &-notation as appropriate for the newly-specified character encoding. To be specific, the Composer/Nvu menu item for this is File > Save And Change Character Encoding.
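By way of illustration (a Python sketch, not a recommendation of any particular tool): Python's codecs can perform exactly this kind of conversion, since the 'xmlcharrefreplace' error handler turns any character that does not fit the target encoding into a &#number; reference.

```python
# Convert between coded characters and &-notation programmatically.
text = "caf\u00e9 \u2014 5 \u20ac"  # e-acute, em-dash, euro sign

ascii_form = text.encode("ascii", "xmlcharrefreplace")
latin1_form = text.encode("iso-8859-1", "xmlcharrefreplace")

assert ascii_form == b"caf&#233; &#8212; 5 &#8364;"
assert latin1_form == b"caf\xe9 &#8212; 5 &#8364;"  # e-acute fits Latin-1; the rest do not
```

Note that this simple approach converts text content only; a fully HTML-aware converter must also leave markup and existing &-references untouched.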

A word of warning: even though the browser versions discussed here are technically competent to do what is being asked of them, it's not certain that a particular browser installation will do it properly: the user might need to supply some fonts supporting a specialised repertoire (e.g Thai, Armenian...) or install optional rendering features (e.g right-to-left text, Indic script support...). It might be helpful to supply a little test-case page, with a screen-shot image for them to compare, and some notes on any special actions they'd need to take to set up their browser for this situation.

Search engines and other non-rendering HTML clients

What matters is not only the accessibility of your documents to users' browsers, but also their visibility to search engines. A.Prilop cautioned that search engines had been slow to support indexing of utf-8-encoded content - some earlier problems with AltaVista search seem to have been fixed, but for best results across search engines it might still be advisable to offer appropriate 8-bit encoding(s) as alternatives to a utf-8 version, along the lines shown in scenario 8. The relevance of this is fading with time, however (2005).

Note that even those search engines which support utf-8 may have no support yet for utf-16 or utf-32 encodings: in WWW situations where a unicode character encoding is desired, then we definitely recommend the use of utf-8. As for utf-7, it is now considered obsolete by the Unicode consortium, and there seems to be no justification for using it in a WWW context (HTTP is a guaranteed 8-bit protocol, after all), quite apart from dubious search-engine support.

Use of Latin-1 character entities (i.e in the form &name;), in preference to other representations of these characters, can be beneficial for locating Latin-1 strings in documents of any encoding; but of course this doesn't help when the characters to be located are not in the Latin-1 repertoire.

When we come to the non-Latin-1 character entities of HTML4, on the other hand, there's a dilemma. There seems no doubt that the &#bignumber; format is more widely implemented than the &entityname; form, if only because of Netscape 4.* versions. On the other hand, a browser which does not implement &entityname; is likely to display something reasonably intuitive (i.e the uninterpreted source code), whereas one that doesn't implement &#bignumber; is likely to display incomprehensible garbage. So it's hard to give general advice about which form to prefer: it depends on the context, and on your priorities for the fallback behaviour in old browsers (recent ones are not a problem).

Combining marks

Observations indicate that "combining marks" (the Unicode General Category values Mn, Me and Mc) are not as well supported by browsers as are precomposed letters. Support in search engines for combining marks also seems to be poor; it is demonstrably better for precomposed letters.

The advice therefore is to use precomposed accented letters wherever they exist, in preference to base letters plus combining mark(s), because they work better with current browsers and fonts, and with search engines. This is certainly true for Latin, Greek and Cyrillic, at least.
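A quick Python sketch of the distinction (illustrative only): NFC normalisation turns a base letter plus combining mark into the precomposed form recommended here.

```python
import unicodedata

decomposed = "e\u0301"  # base letter e + combining acute accent (two code points)
precomposed = unicodedata.normalize("NFC", decomposed)

assert precomposed == "\u00e9" and len(precomposed) == 1  # single precomposed e-acute
assert unicodedata.category("\u0301") == "Mn"  # a combining (non-spacing) mark
```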


Footnotes

[A] 8-bit coded characters

Contrary to rather widespread superstition, 8-bit coded characters are entirely legal on the WWW: indeed, if you are working outside of the Latin-1 repertoire, and want to be accessible also to older browsers, you have little choice (scenario 4). However, documents containing 8-bit coded characters are less robust against mishandling during authoring and publishing to the WWW, by cross-platform transfers without due attention to 8-bit encoding issues, and when browsing files locally, or via ftp://-type URLs, where no character-encoding information is passed as part of the protocol.

Nevertheless, I would recommend that you design for browsing via a properly-used and -configured HTTP server, and not let your decisions be slanted by these more-local issues. It's your responsibility to research the server uploading facilities which are available to you (there are far too many to even start to discuss them here), and to work out how to use them to get your chosen encoding(s) onto the server so that they will conform to WWW interworking standards when accessed by your readers.

[B] Windows Repertoire

The "Windows Latin-1" repertoire (i.e the repertoire of characters in the Windows-1252 coding) covers the complete ISO Latin-1 repertoire, plus additional characters: typographic niceties (em-dash, en-dash, matched quotes etc.), and characters from the Latin-9 repertoire (Z-hacek, S-hacek, euro currency character, etc.: see J.Korpela's Latin-9 overview).

Latin-9 is the repertoire of the iso-8859-15 code. There seems in fact to be no point in using iso-8859-15 in HTML: by the time that browsers were supporting iso-8859-15, they were also supporting sufficient of the techniques needed for scenario 6 or 7 to be able to use one of those more-versatile approaches. iso-8859-15 has its advantages for plain-text email, but for HTML it seems best avoided.

This note recommends that you not use a proprietary Windows code (specifically, Windows-1252 for Latin-1) merely for the purpose of getting those typographic niceties (scenario 3). However, there could be situations where the use of a wider range of Windows characters is required, that is also covered by some older browser versions, and it's certainly arguable (though I'm personally opposed to it, and any justification fades with time) that the use of Windows-1252 code is preferable to risking the use of Unicode; this option has now been noted as scenario 3a.

The windows-1252 code, albeit a proprietary character code, is otherwise standards-conforming. Compare this with valid HTML4 techniques which could be used under scenarios 6 or 7 without any criticism of principle, but would limit the accessibility of the document to browsers which support that part of HTML4, which might not be such a good idea if there is no other requirement for an extended character set.

The use of MS-Windows characters can have adverse practical effects too, as is set down in trenchant terms at the demoronizer site. There's also an informative article by J.Korpela.

A widespread "non-standard" method uses undefined numerical character references &#number; in the range 128 to 159 decimal, which the published HTML declarations mark as being unused. These are unacceptable in standard HTML, and the W3C Validator is now complaining about them. I can only admit that this misuse is, statistically, widely supported - because, statistically, many people use MS software, and some other implementers felt the pressure to implement this misuse too, no matter what the specs say. But for wide coverage across many browsers and versions, it really should be avoided.
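A Python sketch makes the correspondence explicit (illustrative only): the Windows-1252 octet 0x93 is the left double quotation mark, whose Unicode position is 8220, so the correct reference is &#8220; and not the bogus &#147;.

```python
octet = b"\x93"  # falls in the 128-159 range which HTML's declarations leave unused
ch = octet.decode("windows-1252")

assert ch == "\u201c"   # left double quotation mark
assert ord(ch) == 8220  # correct reference: &#8220; - never &#147;
```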

Windows-1252 is registered at IANA: its use as an 8-bit character code seems less common in practice than the non-standard &#number; values just mentioned. Although IANA registration means that the code is legal as a MIME specification on the Internet, it does not follow that WWW clients are mandated to accept it. So, again we have the situation that although statistically widespread due to the wide usage and strong influence of this vendor, you still get wider browser coverage by not challenging browsers with this vendor-defined character code. If you decide (against my better advice!) to do this, then there is some advantage in nevertheless still writing the Latin-1 characters by means of their &entityname; representation in HTML, as these could still be understood even by those few clients that don't support the Windows-1252 code specification.

An entirely plausible valid approach would be to represent the Windows typographic characters by their HTML4 character entity names, such as &mdash;, &lsquo;, &trade; and so on (— ‘ and ™ etc.). These have in fact been around for a while, and are understood even by a number of older browsers that do not support utf-8 and would not be able to understand the corresponding &#bignumber; representations. Sadly, this approach was sabotaged by Netscape 4 failing to implement these entity names; and, if you don't care about NN4, there are better ways to represent such data anyway.

Coverage for &euro; seems rather better than for the other HTML4 character entities: see J.Korpela's page on the euro. I'm suggesting that if the euro is the only additional character required, then &euro; used in scenario 1 or 2 is acceptable, and preferable to trying to use iso-8859-15 in an HTML context. If legal accuracy is essential, then the only possible recommendation is to use the EUR banking code.

Another valid approach is to advertise charset=iso-8859-1 but to include the MS-Windows characters by means of their &#bignumber; Unicode references. This works well on Win-NN4 versions, but may cause problems with older browsers on some other platforms.

So, on balance, I'm recommending to avoid these characters unless your document is already requiring a wide character repertoire such as in scenarios 6 or 7, in which case you could assume that any browser that can cope with the needed repertoire will also be able to cope with the Windows typographic characters - expressed, of course, in a correct HTML4 representation.

[Updated 2005]
WebTV (the product as offered soon after its takeover by MS; see version 2.8 of the WebTV Viewer) evidently ignored the encoding (charset) and treated all data as Windows-1252, minus a few characters (the euro is missing, as are Z-hacek and z-hacek, in this version, tried in 2003).

This WebTV fails to recognise the Windows characters' Unicode references such as &#8220;, which is a nasty fault in a browser claiming to support HTML4. I would rate this product as incapable of rendering the i18n aspects of HTML4, and would give it no further consideration when authoring in such contexts.

Later, MS offered a new product, "MSN TV 2", said to be based on MSIE; as yet I've found no corresponding developer's viewer to investigate its support for other character encodings and repertoires, but an email correspondent tells me that it has quite good support, and he included some digital photographs from the TV screen of the browser successfully displaying Cyrillic.

[C] The "conservative" approach to i18n

This technique (as also explained in the Quickstart page) relies on the fact that the UTF-8 encoding of Unicode has been deliberately designed so that US-ASCII is a proper subset of it. What we are doing here is to formulate our page using only the characters of US-ASCII, but pretend that it is UTF-8 (which, in a sense, it is) in order to fool Netscape 4.* into its Unicode mood. This is a perfectly valid option of HTML4 (albeit a rather bulky one if large numbers of characters have to be represented in &#bignumber; terms), and thus will also be acceptable to any other client agents which support this part of RFC2070 and HTML4. Again, please refer to the Quickstart etc. in this area for further discussion of this option.
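The underlying fact can be checked mechanically (a Python sketch, illustrative only): a byte stream containing only us-ascii characters is, octet for octet, also valid utf-8.

```python
ascii_page = "<p>Dvo&#345;&#225;k</p>"  # Czech text written in pure us-ascii with NCRs
data = ascii_page.encode("ascii")

# A pure-ASCII byte stream decodes identically as ascii and as utf-8:
assert data.decode("utf-8") == data.decode("ascii")
assert data == ascii_page.encode("utf-8")
```

So advertising such a document as utf-8 is not a lie: it genuinely is utf-8, merely one that happens to use only the 7-bit subset.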

I'm recommending use of the &#bignumber; representation, even where the HTML4 specifications lay down an &entityname; representation, since these entity names (aside from the Latin-1 characters, where &name; is preferred, and a small subset of others) are not as widely supported as one would hope (again, Netscape 4.* is a major offender in this regard).

But do keep in mind that these browsers, although technically capable of what is being suggested here, will only work when configured to use suitable fonts. There may also be additional problems with X-based versions (e.g Linux) of Netscape 4.*, whose support for Unicode is quite incomplete.

[D] Accept-charset negotiation

Of course, the composite approach of scenario 8 doesn't help to make a "scenario 5" document accessible to older browsers that don't support either technique, but there's not much we can do for those (short of fiddling around with in-line images) if the material necessarily calls for this kind of character repertoire.

Since there's no way of determining for sure whether the user setup is satisfactory or not, I'd have to conclude that an author is within their rights to send utf-8 format to any client which includes utf-8 in its Accept-charset header: beyond that, it's the user's responsibility to ensure that their browser is properly configured to do what it claims. Attempts to negotiate what one sends from a server according to the user-agent string, as opposed to using Accept-* header(s) or some other test of actual client capability, are not only fundamentally flawed on theoretical grounds: virtually all of the practical attempts that I have seen have had major loopholes in their implementation. I really can't advise going that way.
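As a sketch of what such negotiation might look like on the server side (a hypothetical Python helper, not taken from any particular server package; it deliberately ignores q-values and other refinements of the real header syntax):

```python
def accepts_utf8(accept_charset: str) -> bool:
    """Hypothetical helper: does an Accept-charset header admit utf-8?
    A crude token match; real negotiation should honour q-values too."""
    tokens = [t.split(";")[0].strip().lower() for t in accept_charset.split(",")]
    return "utf-8" in tokens or "*" in tokens

# Send the utf-8 variant only to clients which claim to accept it:
assert accepts_utf8("iso-8859-1, utf-8;q=0.7, *;q=0.1")
assert not accepts_utf8("iso-8859-2, iso-8859-1;q=0.9")
```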

Clients which don't indicate support for UTF-8 would then get sent the document variant coded as described in scenario 5. This has reasonably good coverage (for example Win IE3 and several older minority browsers that had been tested - see the old browser tests report in this area).

It's perfectly fine to offer both variants explicitly, if you want to give users the option or if you don't want to tangle with server-side negotiation. Myself, I'd prefer not to trouble users with technical details, but sometimes it's hard to avoid it.

Although forms input isn't explicitly covered here, it's worth mentioning that Netscape 4.* versions indicate utf-8 capability in their Accept-charset header, and can indeed handle utf-8 pretty well as far as rendering is concerned (with some limitations on unix/linux-based versions), but cannot deal with it for forms submission.

[F] Scenario 5: problems with Netscape 4

Netscape versions 4.* are fundamentally broken under this scenario, although there is a subset of possibilities that works. Basically, if the characters called for by &entity; or &#number; references are not available in the repertoire that is implied by the specified character code (possibly augmented by the MS-Windows characters in the range 128 to 159 decimal), then Netscape 4.* refuses to display them. In short, with this wretched browser family, the & notations are only supported where they theoretically aren't needed, and they fail precisely when you do need them.

With basically Latin scripts such as Latin-2, Baltic area, Turkish etc. you can still use the majority of accented characters that you might need in Latin-1: it's recommended, as usual due to the shortcomings of various pieces of software, to represent the Latin-1 characters by their character entity names, even though in theory the 8-bit coded character would be entirely equivalent.

[R] RFC2070

RFC2070 is the original specification which codified the character representation model upon which HTML4 and later are based, including XML and XHTML. It is also explained quite well in section 5 of the HTML4.01 specification, as well as in much more detail in a recent W3C TR, Character Model for the World Wide Web.

At least a working familiarity with this character representation model is essential for anyone working with a rich character repertoire on the web. A word of warning: it's been my experience that many folks who come to the web from other application areas, such as word processing, with the confident belief that they already understand this topic, often turn out to be hopelessly confused about the HTML character model.

[X] Compatible XHTML/1.0 (Appendix C)

If you want to write Appendix-C compatible XHTML/1.0, then, according to the specifications, you have basically two choices for advertising the character encoding (charset): the charset parameter on the real HTTP Content-type: header, or an <?xml...?> declaration at the start of the document (although Appendix C itself advises avoiding the XML declaration where compatibility with older HTML browsers is the aim).

The selection of an appropriate charset can be done in just the same way as set out above: the only difference, in this regard, between HTML proper, and compatibility-XHTML/1.0, is the mechanism for advertising that charset to the client.

The most common mistake seems to be to supply only a meta http-equiv, but this comes too late for XML, which has already decided on the basis of other evidence (the omission of the <?xml...> declaration and the absence of a BOM) that the encoding has to be utf-8. If the meta then attempts to set an incompatible charset, for example iso-8859-1, then the result is problematical.
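For illustration (a hypothetical fragment, not from the original text): in a document served with no charset on the HTTP header, no BOM, and no <?xml...?> declaration, an XML processor has already committed itself to utf-8 before it ever reaches a line such as

```html
<!-- Hypothetical fragment: by the time the parser reads this element,
     the encoding decision (utf-8) has already been made, so an
     incompatible charset here can only cause trouble. -->
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
```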

Postlude about CSS

CSS is also a text format (Content-type: text/css) and should be delivered with a proper specification of character encoding (charset). The principles are much the same as for HTML, but many of the details are different.

Reference reading for this context would be CSS style sheet representation in the CSS2.1 (draft) specification.

In CSS, for most purposes, us-ascii is entirely sufficient to represent the operative parts of the stylesheet. On the few occasions where it is necessary to refer to characters which cannot be represented in us-ascii - for example a character string to be inserted by :before or :after pseudo-elements - they can be represented by the backslash notation shown in the immediately following section of the specification.
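For example (an illustrative fragment; the selectors are hypothetical): matched double quotes can be inserted from a purely us-ascii stylesheet like this.

```css
/* Illustrative only: insert typographic quotes from a us-ascii stylesheet.
   \201C and \201D are the Unicode positions of the left and right
   double quotation marks, written as CSS hex escapes. */
q:before { content: "\201C"; }
q:after  { content: "\201D"; }
```

A space written after a hex escape merely terminates the escape; it does not become part of the string.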

However, this isn't the whole story: in addition to the operative parts of the CSS, there are likely to be comments, and users will want to write these comments in their own language and writing system.

As we know from the specification:

User agents must support at least the UTF-8 encoding.

And in practice they will also support, at least, iso-8859-1.

There is a good reason for explicitly specifying the encoding. To take one real example: a browser had somehow got the idea that a CSS stylesheet was encoded in utf-8, whereas it was in fact in iso-8859-1. In the entire stylesheet there was just one offending instance: an umlaut in a comment written in German. This was enough for the whole document to be ruled out as invalid utf-8, and the browser ignored the entire stylesheet - which, according to the specifications for utf-8 encoding, it is quite entitled to do. It came as a bit of a surprise that something inside a comment could render the entire stylesheet invalid. So we should understand how to communicate the character encoding from the server to the browser.

How to inform the client of the character encoding

Just as in HTML, we can specify the encoding by means of the MIME attribute charset= on the real HTTP header. This follows the principles of the W3C note, Setting the HTTP charset parameter, even though the note doesn't actually mention CSS. This setting takes priority if it is present.

Analogously to HTML, there also exists the possibility of defining the character encoding in the stylesheet itself, although the syntax (the @charset-rule) is completely different, and very tightly defined.
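A minimal example of the rule (illustrative; the encoding name is whatever your stylesheet actually uses):

```css
@charset "iso-8859-1";
/* Nothing at all may precede the @charset rule, and its syntax is fixed:
   lowercase keyword, double-quoted encoding name, terminating semicolon. */
```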

Under some circumstances, the encoding (if it is one of the Unicode character encoding schemes) can be self-defining by means of the BOM (Byte Order Mark). Details are in the CSS specification. However, relying on the BOM can cause problems, e.g the W3C CSS Validator does not support it.

Any recommendations?

Yes, a few:

 - Keep the operative parts of your stylesheets to us-ascii wherever feasible, using the backslash notation for any other characters.
 - Remember that comments must also conform to the advertised encoding, even though their content is otherwise ignored.
 - Advertise the encoding explicitly, preferably via the charset parameter on the real HTTP header (with the @charset rule as an alternative), rather than relying on the BOM or on defaults.

