ISO-8859 briefing and resources

This document started out as a brief introduction to the ISO-8859-1 character code, with pointers to a number of sources of additional information about ISO-8859-1 specifically and about ISO-8859 codes in general. As time has gone by, it has developed into a fairly large briefing or tutorial, with sections on a number of specific topics. HTML authors who want a quick start could consult my FAQ section.

The responsibility for the selection of topics and the statements made about them is purely my own, and all information is offered with the usual disclaimers. However, I do try to indicate sources of more-authoritative information for the various assertions that I make.

Jargon

There are some jargon terms needed in what follows. Here I explain the jargon terms in the sense in which I try to use them (follow the link for further discussion of terminology). Here we usually disregard the distinction between a "displayed character" and a "glyph": in general this is wrong, but HTML makes no use of composite characters built up from several "glyphs", so for this purpose the distinction between glyphs and displayed characters is unnecessary. A character (in the general sense) is however an abstract concept, independent of its visual appearance (the letter "A" is still the letter "A" even when presented in very different fonts; whereas the first letters of the Greek and Cyrillic alphabets, although maybe looking visually indistinguishable from the Roman "A", are still considered to be distinctly different characters). A displayable character (or "glyph") is discussed by giving it a name, for example little-e-acute or pound sterling. A collection of characters, without reference to the way in which they are assigned to character code values, is known collectively as a "repertoire". A character code contains a certain number of "code points", 256 of them for an 8-bit code, denoted by numerical values expressed in decimal, hexadecimal or octal according to convention. When we have assigned a repertoire of characters to code points then we have what is called a "character code" or, loosely, a "code".

The selection of a font or style takes place quite separately from the character code mechanisms that we are considering here. Although a glyph, for example little-a-grave, might look cosmetically different in different fonts, in italics, etc., they all are instances of little-a-grave, and considered to be the same glyph, and represented by the same character code point.

Several different character codes feature in the discussion below. Except for EBCDIC, they are all extensions of the 7-bit US-ASCII code, and therefore they coincide with US-ASCII and with each other in the lower half, code points 0-127 (decimal). In the upper half they differ, both in the repertoire of glyphs which they represent, and in the assignment of glyphs to code points. The main body of the note does not consider national variants of 7-bit ASCII (as laid down in the old standard ISO646), but there is a digression for those who would like to know more. Nor do we consider the use of the 8th bit as a parity bit: this is irrelevant to and incompatible with our discussion.
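The agreement in the lower half and the disagreement in the upper half can be checked mechanically. The following is only a sketch: it uses Python's bundled codec tables (the names "cp437", "cp850" and "mac_roman" are Python's, and I am assuming they are fair stand-ins for the platform codes discussed in this briefing).

```python
# Compare how several ASCII-based 8-bit codes decode the same byte values.
# The codec names are those of Python's standard library (an assumption of
# this sketch, not something laid down by the standards discussed here).
CODECS = ["latin-1", "cp850", "cp437", "mac_roman"]


def decode_byte(value: int, codec: str) -> str:
    """Decode a single byte value using the named codec."""
    return bytes([value]).decode(codec)


# Lower half (0-127): every one of these codes coincides with US-ASCII.
lower_half_agrees = all(
    decode_byte(value, codec) == chr(value)
    for value in range(128)
    for codec in CODECS
)

# Upper half: one and the same code point means different things.
upper_example = {codec: decode_byte(0xE9, codec) for codec in CODECS}

print(lower_half_agrees)
print(upper_example)
```

Code point 233 (hex E9) is little-e-acute only in ISO-8859-1; the other tables put different characters there.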

Codes, HTML and HTTP

HTML and HTTP protocols make frequent reference to ISO Latin-1 and the character code ISO-8859-1. As I understand it, strictly speaking, the term "ISO Latin-1" refers to a specific repertoire of "glyphs" without reference to a particular encoding. There is, however, only one ISO standard arrangement of these Latin-1 glyphs to code points, and that is the ISO-8859-1 character code. But there are other (non-ISO) encodings of the Latin-1 repertoire, for example the IBM PC code page CP850, and the extended EBCDIC code originally called CECP1047 (now usually referred to as CP1047, CP-1047 or "network EBCDIC"). This note will say little about EBCDIC, assuming that those who are using EBCDIC-based platforms have already come to terms with the idea of their storage encoding having to be mapped to and from US-ASCII for the network.

So, when people refer to "the Latin-1 code" or "the ISO Latin-1 code", it might be assumed that they are referring to the "ISO-8859-1 code"; however, there is the possibility that they are referring to CP850 (in the Microsoft manuals this is called the "Multilingual Latin-1 code"), or to some other code that represents the ISO Latin-1 repertoire of characters even though the code itself is not an ISO code.

The ISO-8859 FAQ said vaguely that the codepage CP819 is "supposedly fully ISO-8859-1 compliant"; Netscape release notes also mention treating CP819 as a MIME Charset synonym for ISO-8859-1. See further discussion.

The HTTP specification mandates the use of the code ISO-8859-1 as the default character code that is passed over the network. The HTML specification is also formulated in terms of the ISO-8859-1 code, and an HTML document that is transmitted using the HTTP protocol is by default in the ISO-8859-1 code (at least, this was true prior to the HTML4.0 spec).

The MIME protocol, that is used by HTTP and by MIME mail, contains a clearly defined mechanism for explicitly defining a character encoding, but, at the level of the HTTP1.0/HTML2.0 specifications, browsers are not actually required to support any code other than the default i.e ISO-8859-1. I am writing this briefing and the related materials (except where specifically stated) entirely in terms of ISO-8859-1. I do not mean this as any kind of insult to those for whom ISO-Latin-1 is not the natural repertoire, I assure you; on the one hand the W3C had been working for a long time on an internationalization (i18n) draft, which lays out how browsers ought to support an extended range of characters, but on the other hand, when they made a practical attempt to document the common features of popular browsers as at some point in 1996, they weren't able to include any such extended characters - sad, but realistic. Subsequent developments included RFC2070, and then the i18n part of the HTML4.0 specification, although coverage in the popular browsers still leaves something to be desired (as of 1999).
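To illustrate the MIME mechanism mentioned above, here is a hedged sketch of how a receiver might read the charset parameter from a Content-Type value and fall back to ISO-8859-1 when none is announced. The helper name charset_of is my own invention; the parsing is done by Python's standard email.message module, which is merely one convenient way to do it.

```python
from email.message import Message

# The default that the HTTP/HTML specifications of the period assume.
DEFAULT_CHARSET = "iso-8859-1"


def charset_of(content_type: str) -> str:
    """Return the charset announced in a Content-Type value, or the default.

    charset_of is a hypothetical helper written for this illustration.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset, or None.
    return msg.get_content_charset() or DEFAULT_CHARSET


print(charset_of("text/html"))
print(charset_of("text/html; charset=ISO-8859-2"))
```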

In the rest of this briefing I occasionally refer to "native" 8-bit character codes: by this I am referring to the character storage codes that are used on certain platforms, e.g DOS Codepages such as CP437 or CP850, the Mac proprietary storage code (see Inside Mac for documentation), the EBCDIC code used on IBM mainframes, and so forth. I am not referring to non-Latin encodings such as Korean, Japanese, Hebrew...

As far as authors of HTML are concerned, character coding is an issue for them in two contexts: (1) where authors create files that actually contain characters from the upper half of the 8-bit code table, and (2) where they refer to such characters by their &#number; representation. If authors confine their use of characters to the low half of the 8-bit table (i.e the area defined by the US-ASCII 7-bit code), and represent any characters from the upper half by their &entity; or &#number; representation, then point (1) is not an issue, and furthermore, when transferring files between platforms by various means - Internet FTP, email, diskette etc. - there is no need to worry which particular 8-bit code is native to the sending and receiving platforms. For these reasons, this is an approach that is much to be recommended. Where a file has been composed in another form (for example, by typing in accented characters using a non-English-language keyboard), it might be wise to use one of the utility programs that convert to an &-representation of the characters in question.
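A conversion utility of the kind just mentioned can be sketched in a few lines of Python. The two function names are hypothetical; the sketch assumes the text has already been read into a Python string (i.e. decoded correctly from whatever the native storage code was), and it deliberately does not touch the syntax characters < > & and ".

```python
from html.entities import codepoint2name


def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character by its &#number; representation."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_entity_refs(text: str) -> str:
    """Replace non-ASCII characters by &entity; where a name is known,
    falling back to &#number; otherwise."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                      # US-ASCII passes through
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")
        else:
            out.append(f"&#{cp};")
    return "".join(out)


print(to_numeric_refs("café"))
print(to_entity_refs("café"))
```

The result contains only US-ASCII characters, so it survives transfer between platforms whatever their native 8-bit codes may be.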

What happens in practice is that the & representations are not interpreted by the web server, but are passed as they are (i.e a string of US-ASCII characters) to the browser for interpretation by the browser.

The standard requires that the &#number; representation be interpreted by reference to the code points in the ISO-8859-1 table, and not according to the native storage code of the platform on which the browser is executing. Implementations will probably achieve this by mapping (translating) the character into the platform's native storage code and offering it to the normal display routines. Another approach that is possible in theory is to define ISO-8859-1 as a private code within the browser, and to use private font tables (this approach tends to lead to unpleasant consequences elsewhere, though). Caution, in practice some (mostly older) browser versions don't behave in the way that is intended by the standard.
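The first approach (interpret the number by reference to ISO-8859-1, then map into the native storage code) might look roughly like this. It is a sketch only: resolve_numeric_refs is a hypothetical name, only the displayable upper-half range 160-255 is handled, and CP850 stands in for "the platform's native code".

```python
import re


def resolve_numeric_refs(markup: str) -> str:
    """Replace &#number; (160-255) by the ISO-8859-1 character it denotes."""
    def repl(match: re.Match) -> str:
        n = int(match.group(1))
        if 160 <= n <= 255:
            # Interpret the number as an ISO-8859-1 code point,
            # never as a code point of the platform's own code.
            return bytes([n]).decode("latin-1")
        return match.group(0)   # outside the range: leave untouched

    return re.sub(r"&#(\d+);", repl, markup)


text = resolve_numeric_refs("caf&#233;")
# Map the character into a native storage code for display (CP850 here).
native = resolve_numeric_refs("&#233;").encode("cp850")

print(text, native)
```

Note that 233 is the ISO-8859-1 code point, while the byte handed to a CP850 display routine is 130 (hex 82).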

As was remarked above, if any codes from the upper half of the code table are placed onto the network, the standard requires that they be expressed in the ISO-8859-1 code. If, therefore, we have a document that does contain such characters, on a platform whose native storage code is different from ISO-8859-1, then the platform's Web server will have to map (=translate) these characters into ISO-8859-1 in order to place the document onto the network using the HTTP protocol. Let me stress, though, that the server is certainly not expected to look inside the HTML document for &#number; representations and make any change to those: the standard requires those to be composed in terms of ISO-8859-1 code values irrespective of the character code that is being used for storing the HTML document.

Character codes in other Internet protocols

We may recall that in the traditional Internet ("ARPA") FTP, there is no explicit provision for defining, negotiating or mapping between different variants of storage code. Basically there will be a choice between "binary" or "text" (a.k.a "ascii") modes, without a very clear idea whether the "text" mode involves any mapping between character storage codes. Some implementations take care to perform code mapping (e.g Fetch for the Mac, but be sure to have the "ISO translation" option turned on in the Preferences), while others content themselves with adjusting the newline convention and leave the character codes unchanged (often found with DOS implementations). This therefore becomes a problem when trying to exchange 8-bit HTML files using Internet FTP between machines whose native storage code is not the same (e.g between a PC using e.g CP850 code, and a Mac using its own native code), and reinforces the advice to make use of & encodings in HTML. Unless of course you are confident that your chosen FTP implementation does the right thing. See the work of André Pirard that is referenced below. If your FTP software does not do the work for you, the GNU recode utility can perform conversions before or after a file is transferred between dissimilar platforms. I have seen several different DOS2UNIX/UNIX2DOS utilities, some of which merely adjust the newline convention whereas others also map between ISO-8859-1 on the unix side and CP850 (perhaps) on the DOS side: if you plan to use such a utility, make sure that your version does the right thing for you.
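The difference between the two kinds of "text mode" conversion can be made concrete. In this sketch both function names are hypothetical, and CP850 is assumed as the DOS-side code; your own files may of course use something else.

```python
def convert_newlines_only(data: bytes) -> bytes:
    """Adjust the newline convention but leave character codes unchanged."""
    return data.replace(b"\r\n", b"\n")


def convert_dos_to_unix(data: bytes, dos_codec: str = "cp850") -> bytes:
    """Adjust newlines AND map the DOS code to ISO-8859-1."""
    text = data.decode(dos_codec)
    return text.replace("\r\n", "\n").encode("latin-1")


dos_data = "côte\r\n".encode("cp850")

print(convert_newlines_only(dos_data))   # o-circumflex still in CP850
print(convert_dos_to_unix(dos_data))     # o-circumflex now in ISO-8859-1
```

Only the second function produces a file that a reader on the unix side will see correctly.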

The MIME mail protocol does include facilities for announcing the specific encoding in use - but typical implementations of MIME mail agents (e.g PINE) do not necessarily have any facilities for resolving such discrepancies; they merely alert the user to the fact that the incoming file uses an encoding that is different from the local one.

Dealing with platforms whose own (native) code is not ISO-8859-1

Returning now to the question of what happens when a file that contains "native-encoded" 8-bit characters is offered via a Web (http) server, it is clear from the Web standards that the native storage code must be mapped (=translated) into the ISO-8859-1 code that is mandated for network transmission (or into one of the transmission encodings on which sender and receiver can agree, see the HTTP Accept-charset header). Mapping is feasible for e.g CP850; it is also feasible for Mac code, with the exception of fourteen code points for which the ISO Latin-1 glyphs are not available in standard Mac character code. Conversely, when the 8-bit characters arrive at the browser, they must be interpreted according to the ISO-8859-1 code, which will probably be achieved by mapping them into the browser's native storage code. ISO-8859-1 is, by the way, used as standard by X Windows; the code points of ISO-8859-1 that are assigned to displayable glyphs are identical with those same code points in the standard MS Windows code (see further comment in next paragraph). It is not, on the other hand, the code that is used by MS DOS (which leads to considerable confusion if you try to create documents containing 8-bit characters by using a mixture of DOS and MS Windows software!), nor, as has already been mentioned, by the Mac.

It is essential to bear in mind that, in addition to the range (decimal 0-31 and 127) that ASCII allocates to control characters, ISO-8859-1 does not assign displayable characters to code points in the range (decimal)128-159. Some platforms (e.g MS Windows) that are otherwise ISO-8859-1 conformant, might use these codes to represent additional character encodings, but they cannot and should not be relied on for communicating information on the World Wide Web - they could display as anything, or nothing, on other platforms or browsers.

The IBM PC code called "Multilingual Latin-1", CP850, has already been mentioned. This code also includes the ISO Latin-1 repertoire of glyphs (as well as some additional glyphs that are not in the ISO Latin-1 repertoire). But the characters are not in the same places in the two codes and, furthermore, CP850 assigns characters throughout the upper half of the code table whereas ISO-8859-1 keeps thirty-two code points undefined. It follows, therefore, that anything represented in ISO-8859-1 can be translated into CP850, but there are characters in CP850 that do not correspond to any defined character in ISO-8859-1. It is conventional to translate these characters into the "undefined" range, decimal 128-159 of ISO-8859-1, on the understanding that their meaning is undefined as far as the standard is concerned. An attempt to display such characters could result in the equipment displaying anything, or nothing, without being in violation of the standard (in principle it could even result in executing some spurious control function, although I'm not aware of this happening in practice to any extent).
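The asymmetry between the two codes can be demonstrated as follows. This is a sketch that relies on Python's own "cp850" table, which I am assuming matches the code page as described here.

```python
# All 128 characters in the upper half of CP850, as Python's codec sees them.
cp850_upper = [bytes([b]).decode("cp850") for b in range(128, 256)]


def in_latin1(ch: str) -> bool:
    """True if the character has a code point in ISO-8859-1."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# CP850 characters with no counterpart in ISO-8859-1 (box drawing etc.).
unmappable = [ch for ch in cp850_upper if not in_latin1(ch)]

# Every displayable ISO-8859-1 character (160-255) survives a trip via CP850.
round_trips = all(
    chr(b).encode("cp850").decode("cp850") == chr(b) for b in range(160, 256)
)

print(len(unmappable), "".join(unmappable))
print(round_trips)
```

The leftover characters are exactly those for which a translator has to fall back on some convention such as the "undefined" range.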

The Macintosh uses a code that mostly covers the ISO Latin-1 repertoire, although some glyphs are missing, and includes some other glyphs. Again the assignment of glyphs to code points is not the same as in ISO-8859-1, and a code conversion is required.

Previously, the EBCDIC code used by IBM mainframes was also an issue, and mention of this will be found in the materials referenced, but with the move away from mainframes this will concern us normal mortals less and less.

There is a paper by André Pirard of Univ. of Liège in Belgium, referred to in a usenet posting which I quote comprehensively below. Let me stress that the recommendations for translating characters that fall outside the ISO Latin-1 repertoire are not part of any formal standard: they are, however, a de facto "standard" that is followed by a lot of fine Internet software for the Mac, such as Fetch, some usenet newsreaders, most browsers, etc. He must translate the codes in the "undefined" region to preserve the integrity of the file (see his text for further explanation) but you are not entitled to use these characters in your Web documents - if you do, then users on different platforms are going to see different glyphs, or none.

Especially I want to warn readers not to take seriously the misguided efforts of some web authors who have manufactured a document containing all possible character codes, without comment, and have invited readers to display them on their own browsers. Without a text describing what should be displayed at each code point, the confused reader might assume that what they are seeing is the same as would be displayed at the corresponding point on any other browser. It is not, and they are being misled. Only the code points that are assigned to displayable characters in the ISO-8859-1 code are expected to have this property (and even there, some browsers are in violation of the HTML2.0 standard, so without a text description alongside every relevant code point, such a table is worse than useless, no matter how well-intentioned its author might have been). An alternative way of presenting a character code unambiguously is to present it as an image; the only problem with that approach is to make sure that people do not confuse similar-looking glyphs with each other, e.g mistaking an apostrophe for an acute accent, a German sharp-s for a Greek beta, or a degree sign for a superscript zero. Presenting both an image and a description would be the ideal.

In this section I have tried to cover concisely the principles that are involved in dealing with such non-ISO-8859-1 platforms, and have given brief notes on how they work out in practice. The Mac is of sufficient importance to justify a separate article.

Entity names in HTML - some remarks

The entity names for the accented letters have been clearly defined and used in HTML from the early days, and need no special treatment here. The same goes for the low-half characters (< > & and ") that have to be "entified" because they play a role in the syntax of HTML.
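For the four syntax characters, a ready-made routine exists in Python's standard library; I show it here purely as an illustration of what "entifying" means, not as a recommendation of any particular tool.

```python
import html

# html.escape() replaces &, < and >; with quote=True it also replaces
# the quotation mark (and, incidentally, the apostrophe).
escaped = html.escape('<A HREF="x">R&D</A>', quote=True)

print(escaped)
```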

The HTML2.0 specification (RFC1866) contains at the end (section 14) a section entitled "Proposed Entities": this list includes the already well-known ISO Latin-1 accented letters etc., but also introduces a proposal for additional entity names. These were certainly not in general use as of HTML2.0, and so, presumably, were intended for future implementation, and some browser developers have indeed progressively added support for them, while others seem to have made progress at a snail's pace if at all.

I am assured that the policy of the HTML developers is to use the SGML entity names, as laid down in the files ISO* in, for example, the directory ftp://ftp.ucc.ie/pub/sgml/ (Dan Connolly gave me an alternative pointer to a server in Norway - but the information content should be the same!). The names that are relevant to this discussion are contained in ISOnum and ISOdia (you will see that those entity sets define also many characters that are not included in the ISO-8859-1 code, and that therefore aren't properly usable in HTML according to current standards). HTML provides no mechanism for using floating accents, so that the glyphs for umlaut/diaeresis, cedilla, and macron are of rather little benefit: however, they do have entity names, and for completeness they will be kept in the discussion.

In the archives at the W3C may be found a draft called HTML+, that pre-dated the now-expired HTML3.0 draft. Both drafts are now considered obsolete (although they make very interesting reading!), but a few browsers support some of the entity names that are peculiar to HTML+, so it is still mentioned here. It's worth noting that the text of the (uncompleted and now expired) HTML3.0 draft used the entity names "endash" and "emdash", but the associated DTD contained "ndash" and "mdash" - presumably this discrepancy would have been resolved if the draft had ever been finished.

In this section, I am only discussing the names for characters of the ISO-8859-1 code definition. The Trade Mark (TM) glyph is not defined in this code, and I discuss it separately in another section of this briefing.

A further version of the HTML entity names list can be found in Martin Ramsch's table at http://www.ramsch.org/martin/uni/fmi-hp/iso8859-1.html. However, this folds in some material relating to Hyper-G Text Format, which is not the same as HTML. The characters over which there seems to be disagreement are the following ones (see RFC1866 section 14).

    HTML+       RFC1866     ISOnum/ISOdia     Description
    -----       -------     -------------     -----------
    die         uml         die, uml          diaeresis/umlaut
    macron      macr        macr              macron, overbar
    degree      deg         deg               degree
    Cedilla     cedil       cedil             cedilla

Ramsch's table agrees with the ISO list, except that it designates the macron as &hibar;, a name that I haven't seen elsewhere but which was, I found, supported by X Mosaic, and adds the alternative names brkbar for brvbar and Dstrok for ETH. I propose to ignore these last two as being Hyper-G specials, but in view of having found support for hibar I have retained it in my survey.
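One way to see which of the disputed names a given piece of software accepts is simply to look them up in its entity table. The sketch below consults the table shipped with Python (html.entities.name2codepoint), which I understand to reflect the HTML 4 entity set; it is offered as one data point, not as an arbiter between the lists above.

```python
from html.entities import name2codepoint

# Names from the table above, plus the Hyper-G/X Mosaic variants.
candidates = ["die", "uml", "macron", "macr", "hibar",
              "degree", "deg", "Cedilla", "cedil", "brkbar", "brvbar"]

# Map each name to its code point, or None if the name is not known.
known = {name: name2codepoint.get(name) for name in candidates}

for name, codepoint in known.items():
    print(f"&{name};", codepoint)
```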

Test cases for these entity names can be found in the preface to my character code test tables so that they could be included in my tests for browser coverage.

The saga of the disappearing quot

Readers were astonished to see that the final version of the HTML3.2 was missing the definition of the &quot; entity in its DTD, in spite of the general intention that HTML3.2 would be compatible with HTML2.0.

On investigation, attention was drawn to an item in the www-html mail archive, in which Christopher R. Maden states that in the relevant SGML documents, the quot entity is identified with the apostrophe (ASCII 39, x27), not with the quotation mark (ASCII 34, x22). Dan Connolly, on the other hand, states that the omission from HTML3.2 was a mistake.

I have no further details of the progress of discussions subsequent to that time, but I note that the &quot; has re-appeared in HTML4.0 just like it was in HTML2.0, as can be seen in the relevant section of the HTML4.0 recommendation.

Tell me about these unassigned code points

The code points 0-31 and 127 are assigned to control characters in US-ASCII, not to displayable glyphs, and the ISO-8859-1 code continues this tradition, as well as declaring the range 128-159 inclusive to be reserved for unspecified control functions: historically, this was intended to protect against 7-bit data paths that would lose the top bit and risk performing some unexpected control function, such as clearing the display! That may be less of a problem nowadays, but the rule is still there. As you might have noticed, some platforms (e.g PC, Mac) nevertheless use some of the code points in these ranges for displayable characters. But any attempt to use these code points in HTML for the WWW is going to produce undefined results on different browsers and platforms.
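The ranges just described can be listed by asking for the general category of each code point. This sketch uses Python's unicodedata module, on the assumption (which I believe to be safe) that Unicode classifies the first 256 code points in the same way as ISO-8859-1 does.

```python
import unicodedata

# Code points of ISO-8859-1 that are control characters (category "Cc")
# rather than displayable glyphs.
control_points = [
    n for n in range(256)
    if unicodedata.category(bytes([n]).decode("latin-1")) == "Cc"
]

print(control_points)
```

The list consists of 0-31 together with 127-159, which are precisely the code points to avoid in HTML.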

It's no big deal to generate a table containing all possible code values, or all possible values of &#n;, and to display them on various browsers and platforms, and to compare the results. Don't just compare several different browsers on the same platform: that is no better than buying several different English newspapers in order to get a better idea of the news in Texas! Regular X-based fonts will display a blank, or nothing at all, in response to the unassigned codes. MS-Windows based browsers will normally display some well-defined glyphs in the relevant positions; Mac-based browsers will likely display something too, which might be different. However, Mac-based users should take care to compare the results from later versions of Netscape (2, 3 etc.) with earlier Netscapes such as Mac 1.12: it is evident that Netscape are redefining the rules as they go along.

The only advice that I can possibly give you is to steer very clear of these undefined code points. The whole intention of HTML was to represent content in a portable, platform-independent fashion. There is no way that you can guarantee to get the correct results on the reader's screen. That's what standards are for: the evidence of Netscape stealthily redefining the rules (I saw no mention in their release notes that they were shifting the undefined Mac character codes around) just makes it that much more important to stick to the standards, and not get tempted by non-standard features of a commercial browser, if you want to get a message reliably out to WWW readers. If you cannot be confident of browser coverage of the construct that you'd like to use (e.g &trade; for the trademark glyph) then you would be better advised to use a substitute that has good browser coverage.

MS-DOS Code Page 819???

In the versions of MS DOS that I had tried, there was no provision for selecting codepage 819. I finally tracked down a couple of concrete references to CP819, and after following the indications in there, found something called isocp101.zip (at a place from which it later disappeared). It can (Mar.2001) be found at the Kostis Netzwerkberatung freeware page in isocp101.zip.

When un-zip-ed, this file produces some interesting information and some software, including a file ISOLATIN.CPI that can be used to support CP819 for output in MS-DOS (assuming you have EGA or VGA; the CGA display is stated to support only its own hardware code page). There is also some mention of keyboard support. Disclaimer: the above refers entirely to material that I found on the net. The nearest thing to an authoritative source is the file Doc\Isocp.txt contained in the above-cited ZIP archive (however, Mr. Kostis tells me in email that his 1993 address seen in that file is no longer valid).

I have not tried using DOS for any significant period with CP819 selected, so I have no personal experience of how it works out: in response to earlier versions of this page I received several emails that comment favourably on this method of working in DOS, and in Jan 1997 I got a more detailed email from Portugal about successful use of this method.

I don't believe there are any subtle differences from iso-8859-1: the IANA character set registrations list CP819 along with IBM819 as a synonym of iso-8859-1, so this seems to be just another codification of the same character coding.
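As a small cross-check of the synonym claim, one can ask a codec registry whether these names resolve to the same thing. The sketch below uses Python's registry, which is of course only one implementation's view and not the IANA registry itself.

```python
import codecs

# Resolve several names and record the canonical codec name for each.
names = {
    alias: codecs.lookup(alias).name
    for alias in ("iso-8859-1", "cp819", "ibm819", "latin-1")
}

print(names)
```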

MS Windows

This is a common source of confusion, partly because there is so little explanation provided in ordinary MS Windows documents, and what is provided seems rather misleading. My explanation here refers, I guess, only to MS Windows when used in a Latin-1 locale.

For its internal text encoding, MS Windows works in terms of its own character code: this code is identical with ISO-8859-1 at the code points that are assigned to displayable characters by ISO-8859-1, but in addition it assigns displayable characters to some of the code points that ISO-8859-1 explicitly leaves undefined.
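This relationship can be checked in a few lines. The sketch assumes that Python's "cp1252" codec is a reasonable representation of the MS Windows Latin-1 code; note that it reflects later revisions of that code, so the exact set of extra characters may differ from what a given Windows version of the period offered.

```python
from typing import Optional


def decode_or_none(value: int, codec: str) -> Optional[str]:
    """Decode one byte, or return None if the codec leaves it unassigned."""
    try:
        return bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return None


# At the displayable code points of ISO-8859-1 the two codes are identical.
same_above_160 = all(
    decode_or_none(v, "cp1252") == decode_or_none(v, "latin-1")
    for v in range(160, 256)
)

# In 128-159, the Windows code assigns displayable characters to some points.
extras = {}
for v in range(128, 160):
    ch = decode_or_none(v, "cp1252")
    if ch is not None:
        extras[v] = ch

print(same_above_160)
print(extras)
```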

The chief cause of confusion, I guess, is the notorious DOS convention for typing ALT/nnn to get the character whose code position is (decimal) nnn. In MS Windows, this convention has been extended (although I didn't find this explained anywhere in the normal user manuals or help information):

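ALT/nnn, without a leading zero, is taken as a code position in the DOS code page, whereas ALT/0mmm, with a leading zero, is taken as a code position in the MS Windows code (both numbers in decimal). For the ISO Latin-1 repertoire, the MS Windows code coincides with ISO-8859-1; therefore, you can type in ALT/0mmm for values of 160-255 and get the characters listed in the ISO-8859-1 code tables (for the range 32-126 there's no distinction, of course).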

You can easily verify that MS Windows code is incompatible with MS DOS code in this respect, if you type in some accented characters using an MS Windows application, say Notepad, and then view the resulting file in MS DOS; or conversely if you type them in using a DOS application such as (DOS) EDIT and then view the result in MS Windows.

Let us take one example: o-circumflex. In DOS CP 437 or 850, the code point for o-circumflex is 147 (decimal). So, in DOS you type this in as ALT/147, and that is what is stored. However, when you use ALT/147 in an MS Windows application, the MS Windows (ISO-8859-1) encoding of o-circumflex is actually stored into the file, i.e the character code 244 decimal. Basically this is very simple in principle, but can lead to much confusion in practice, and you can play around a little if you want, typing ALT/nnn codes into a file one way and viewing the file the other way, and driving yourself demented trying to follow what is happening by use of a CP850 (or 437) DOS code table on the one hand, and an ISO-8859-1 (or MS Windows) code table on the other. Have fun!
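The o-circumflex example can be followed without driving oneself demented by letting a program consult the code tables. This sketch again leans on Python's codec tables as stand-ins for the DOS and Windows codes.

```python
dos_value = 147                                    # what you type as ALT/147

# In DOS code pages 850 and 437, code point 147 is o-circumflex.
char_cp850 = bytes([dos_value]).decode("cp850")
char_cp437 = bytes([dos_value]).decode("cp437")

# The same character in the MS Windows (ISO-8859-1) encoding.
windows_value = char_cp850.encode("latin-1")[0]

# Viewing the files "the other way round" gives something else entirely.
dos_byte_seen_in_windows = bytes([dos_value]).decode("cp1252")
windows_byte_seen_in_dos = bytes([windows_value]).decode("cp850")

print(char_cp850, char_cp437, windows_value)
print(dos_byte_seen_in_windows, windows_byte_seen_in_dos)
```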

Anyhow, the long and short of this is that you can perfectly well deal with 8-bit accented letters in MS Windows if you wish (subject, of course, to the caveats mentioned elsewhere in this briefing), as long as you only handle the file with MS Windows, and not mix it with DOS.

The above would no longer be a problem if everyone operated DOS in the code page 819 discussed above. Whether that would be practical, I can't say - see the discussion there.

TM - They seek it here, they seek it there

More than all other special characters added together, people post questions on the usenet WWW groups asking how to represent the TM (trademark) glyph. The answer, up to and including HTML 3.2, is that you cannot. This glyph is not included in the ISO-8859-1 character code, and there is no defined way to represent it. When the question is asked on usenet, it usually brings at least one response that states a value of n for which &#n; displays the TM glyph on that informant's browser: but readers who have taken on board my explanation so far will realise that this is of no use, since it will display something different, or nothing at all, on some other browsers. The value of n is (or was - see discussion above about Netscape quietly re-arranging their Mac version) different according to whether the informant uses a Mac-based or an MS Windows-based browser, and X-based browsers do not display this glyph at all in their normal fonts. Surely, for a glyph like this that is used principally for legal reasons, there can be no excuse for sloppy usage that has no guarantee of displaying anything to some readers.
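Why the informants disagree about the value of n becomes obvious when the TM glyph is looked up in each platform's native code. A sketch, with Python's "cp1252" and "mac_roman" codecs assumed to stand for the MS Windows and Mac codes respectively:

```python
TM = "\u2122"   # the trademark character

# Code point of the TM glyph in each code, or None where it has none.
native_codes = {}
for codec in ("latin-1", "cp1252", "mac_roman"):
    try:
        native_codes[codec] = TM.encode(codec)[0]
    except UnicodeEncodeError:
        native_codes[codec] = None

print(native_codes)
```

ISO-8859-1 has no code point for the glyph at all, and the two platform codes put it in different places, so no single &#n; in the range below 256 can be right everywhere.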

There are a number of kludges that you might consider using at this time. Bearing in mind that by now a good range of browsers already honour the SUP tags, you might even enclose "(TM)" in those, like this: <SUP>(TM)</SUP>, and here is what your current browser does with that: (TM). A browser that did not understand the SUP tags would simply display the "(TM)" in its current font. Nest the (TM) also in <SMALL>...</SMALL> if you wish.

The recommended entity name &trade; has been supported by certain browsers for some time already, but others still lack that support. Here is what your present browser displays in response to this entity: ™. The standards-compliant way to code this as a numerical character reference is by its Unicode value, 8482, and here is what your current browser does in response to that: ™. I cannot recommend that you use either of those representations at this time, especially for a mark like this that has legal significance: MSIE 3 supported only &trade; and not &#8482;, while NS 3 supported only &#8482; and not &trade;. Update Jan'98: both MSIE4 and NS4 support &#8482;, although NS4 still does not support &trade;.
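For comparison, here is how a present-day library resolves the two representations. The sketch uses html.unescape from Python's standard library, merely as an example of software that supports both forms.

```python
import html

from_entity = html.unescape("&trade;")     # the named entity
from_number = html.unescape("&#8482;")     # the numerical character reference

# 8482 decimal is 2122 hexadecimal, the Unicode value of the TM character.
print(from_entity, from_number, hex(ord(from_number)))
```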

The no-break space &nbsp;

The intention of the no-break space in HTML, as stated in the materials at W3C, is to act as a sticky space, joining two tokens together in such a way that they are separated by white space but will not get split across lines. The 1995 HTML3.0 draft (now expired) called for it to be treated otherwise as a normal space, which must mean that outside of preformatted text, multiple occurrences would be collapsed to a single space. However, that draft has expired, and its contents cannot be regarded as anything more than of historical interest; for a while I was unable to locate it on the W3C site, but the HTML3.0 draft can now be found again, along with its section dealing with the no-break space.

It is often claimed on usenet that one or more no-break space(s) can be used as a space filler, and indeed this can be seen in many HTML documents on the WWW. Some advocates suggest &nbsp; - others &#160;, while others advocate alternating one or other of those with ordinary spaces; it is indeed an observed fact that many browsers in use in 1996-7 were producing that effect, and this became pretty much universal. (The HTML4.0 specification subsequently codified the no-break space as not being a white space character, from which we may deduce that it would not be eligible for compression under the white space rules; but it explicitly does not go into any further detail about its treatment.)
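The observed behaviour (ordinary white space collapses, no-break spaces do not) can be imitated as follows. This is a sketch under my own assumptions: collapse_white_space is a hypothetical function, and it treats only space, tab, form feed and line breaks as collapsible, which is my reading of the white space rules rather than a quotation from them.

```python
import html
import re

# Characters treated as collapsible white space in this sketch.
HTML_WHITE_SPACE = re.compile(r"[ \t\n\r\f]+")


def collapse_white_space(text: str) -> str:
    """Collapse runs of ordinary white space; leave no-break spaces alone."""
    return HTML_WHITE_SPACE.sub(" ", text)


# &nbsp; and &#160; denote the same character, code point 160.
sample = html.unescape("one&nbsp;&#160;two   three")
collapsed = collapse_white_space(sample)

print(repr(sample))
print(repr(collapsed))
```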

In constructions like the much-touted <P>&nbsp;&nbsp;...foo, (or the corresponding thing with &#160;), the no-break space is not joining two words together as envisaged by the spec.

Some authors insist on two spaces between sentences (e.g. after a full stop - US "period" - question mark or exclamation point), because their style rules demand it. They therefore desperately seek some kludge for inserting such an additional space, even in some cases going so far as to embed a small blank image. It would be my own personal contention that this is a browser issue, since any imposition of style rules ought to be according to the reader's locale, not the author's. Admittedly, there is no unambiguous definition of a sentence end in HTML: there are situations where a full stop, question mark or exclamation point, followed by a single space, might appear within a sentence, and an expansion to two spaces would not be desired by the author or the reader. I leave this point for others to debate.

To those authors who desire the first line of every paragraph to be indented, on the other hand, I say very definitely that this is a browser (or style sheet) issue. The start of a new paragraph is clearly defined in HTML, and the question of how to present a paragraph is a browser issue, purely and simply. Readers might also wish to have such presentation details under their control, rather than being imposed by the author. Those authors who struggle to indent each new paragraph by means of &nbsp; tricks and/or blank images are, in my view, quite misguided. And readers who care enough to have configured their personal style sheet to indent paragraphs are not likely to be amused when they find they get a double dose of indenting thanks to the author's kludges with transparent GIFs or &nbsp;.

The soft-hyphen, &shy;

RFC2070 (from your usual supplier of RFCs, or at IETF) had this to say about the soft-hyphen:
NOTE -- the SOFT HYPHEN character (U+00AD) needs special attention from user-agent implementers. It is present in many character sets (including the whole ISO 8859 series and, of course, ISO 10646), and can always be included by means of the reference &shy;. Its semantics are different from the plain HYPHEN: it indicates a point in a word where a line break is allowed. If the line is indeed broken there, a hyphen must be displayed at the end of the first line. If not, the character is not dispalyed at all. In operations like searching and sorting, it must always be ignored.
(reproduced from the RFC complete with typo ;-)

The HTML4.0 specification says something similar, except that it implies that browsers are not mandated to support this character, which is a pity as there seems to be no viable alternative for achieving this useful result.

The meaning of this when the soft-hyphen occurs inside of a word is unambiguous; but hardly any browsers actually implement this yet (an honourable exception: Lynx). What to do with a soft-hyphen that occurs in isolation is unclear: a browser might still display a hyphen in this situation - or maybe only display it when it comes at the end of a line (recent versions of Lynx seem to be like that). Best to ignore what is displayed on this line of my test tables, since there is no formal specification of what a soft-hyphen should do in that kind of context.
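The RFC's requirement that the soft-hyphen be ignored in searching and sorting can be sketched as follows. This is a minimal Python illustration under the assumption that "ignoring" simply means removing U+00AD before comparing; the helper names strip_soft_hyphens and contains are invented for the example.

```python
SHY = "\u00ad"  # SOFT HYPHEN, the character that &shy; refers to

def strip_soft_hyphens(text):
    """Remove every soft hyphen so that it plays no part in comparisons."""
    return text.replace(SHY, "")

def contains(haystack, needle):
    """Substring search that ignores soft hyphens on both sides."""
    return strip_soft_hyphens(needle) in strip_soft_hyphens(haystack)

words = ["cone", "co" + SHY + "de"]

# A naive sort compares the raw code points, so the soft hyphen (173)
# pushes "co-de" after "cone"; sorting on the stripped form does not.
naive = sorted(words)
ignoring_shy = sorted(words, key=strip_soft_hyphens)

print([strip_soft_hyphens(w) for w in naive])
print([strip_soft_hyphens(w) for w in ignoring_shy])
print(contains("an inter" + SHY + "national code", "international"))
```

The display side of the matter (showing a hyphen only where the line is actually broken) is the part that browsers have to implement, and is not modelled here.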

Pointers to further resources

Tables of the various ISO-8859-n character codes (of course they have to be made available as images, since there's no guarantee you have a font to display them) seem hard to find, but one excellent resource is provided by Roman Czyborra.

(There used to be a usenet FAQ on ISO-8859-1, and a fascinating array of resources on character codes and internationalisation by its author; these are still referenced from other usenet FAQs, but sadly the document itself had disappeared by Dec.1999.)

The Kermit team has done much work on documenting the usage of character codes: a visit to their Web Pages is well worth while, and materials can be found in their archive at Columbia. The "texts about ISO-8859-1" by A.Pirard (following quoted article) set out the design criteria well, and should be read for a clearer understanding of the issues involved.

(From a Usenet article by J.P.Kuypers, quoted without asking permission)

Q: Is there a standard site where one can find the latest versions of these tables?

A:
There was no site where one can find the latest versions of the Macintosh<->ISO 8859-1 translation tables. But now, after your mail, I put the tables on an ftp server as a Macintosh .sea archive.

BTW, you can find there the André Pirard's texts about ISO 8859-1, other codes and several computer "languages". These texts about communication programming for international characters, can be found, too, in an ftp server at Columbia, USA or an ftp server at Univ of Liege in Belgium.

The software using these tables is not all using taBL resources. I.e. FTPd and Talk have the taBL, but not Anarchie. With resources, the developer can allow the user/manager to put an other one. Without resource, the translation is "hard-coded".

I prefer taBL resources. For software as Eudora and Telnet, it's a must because people may need several translations according to the environment, i.e. for Telnet, the code of the connected computer. For software able to transfer files/texts only, the choice is very small and Macintosh<->ISO 8859-1 is the best standard to use, IMHO.

Jean-Pierre

(End of quote.)

The code mapping that's documented in A.Pirard's materials has become the de-facto standard in Mac-based Internet software, such as Fetch, usenet newsreaders, most WWW browsers, etc.

The Kermit team had also documented the Mac problem, and indicated a solution based on the same principles, but they designed a different code mapping that has not, in the event, gained general acceptance.

An email from Terry Jones recommends the GNU recode program for a very versatile range of character conversion options, including 8-bit Latin-1 to HTML entity encoding, as well as converting between different character codes. I have to confess I had not been previously aware of this program, but having looked at the manual for it, I have no hesitation in accepting his recommendation. Beware, though, that this program can, by default, replace your input file in-place, and with some combinations of options the change would be irreversible!
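To give an idea of what "8-bit Latin-1 to HTML entity encoding" involves, here is a small Python sketch of that kind of conversion. It is not the recode program and makes no claim to reproduce its behaviour or options; the function name latin1_to_entities is invented, and the entity names come from Python's standard html.entities table.

```python
from html.entities import codepoint2name

def latin1_to_entities(raw):
    """Turn ISO-8859-1 bytes into 7-bit text, using entity names for 160-255.

    ASCII bytes are passed through untouched, on the assumption that the
    input is already HTML markup. Bytes 128-159 are control codes in
    ISO-8859-1, not displayable characters, so they are rejected here.
    """
    out = []
    for byte in raw:
        if byte < 128:
            out.append(chr(byte))
        elif byte < 160:
            raise ValueError("byte %d is not a displayable ISO-8859-1 character" % byte)
        elif byte in codepoint2name:
            out.append("&" + codepoint2name[byte] + ";")
        else:
            # Fallback, in case a code point has no name in the table.
            out.append("&#%d;" % byte)
    return "".join(out)

sample = "caf\u00e9 \u00a9".encode("iso-8859-1")
print(latin1_to_entities(sample))
```

Unlike the in-place behaviour warned about above, this sketch returns a new string and leaves its input alone.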

And what of the future?

HTML, at the 2.0 and 3.2 levels, was limited by its choice of the iso-8859-1 character code, both as the default code and as the only code which browsers were required to support. How to extend it to other codes was already drafted, at least by the time of the HTML3.2 recommendation, and various methods are in actual use among "consenting communities of users". Some of them are in accordance with the draft standards, and will deploy well on the WWW (e.g. the charset attribute on the content-type header), whereas others (e.g. sending out koi8-r without any kind of declaration), although in de facto use in their target community, would be unacceptable on the WWW in general.
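The difference between declaring the code on the content-type header and sending it undeclared can be sketched in Python. This is only an illustration, assuming a recipient that falls back to iso-8859-1 when no charset is given; the function decode_body and the sample text are invented for the example, and the header parsing is done with the standard email.message module rather than any real HTTP client.

```python
from email.message import Message

def decode_body(content_type, body):
    """Decode a response body according to its Content-Type header value.

    Falls back to iso-8859-1 when no charset parameter is present
    (the default assumed for this sketch).
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or "iso-8859-1"
    return body.decode(charset)

russian = "\u043f\u0440\u0438\u0432\u0435\u0442"  # a Russian word, for illustration
wire = russian.encode("koi8-r")                   # the bytes as sent

declared = decode_body("text/html; charset=koi8-r", wire)
undeclared = decode_body("text/html", wire)

print(declared == russian)    # the declaration lets the recipient recover the text
print(undeclared == russian)  # without it, the bytes are read as Latin-1 letters
```

The undeclared case does not fail with an error: every byte is a valid iso-8859-1 character, so the recipient simply displays the wrong letters.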

From the point of view of HTML (indeed of SGML), every document has a "document character set", which in the case of HTML2.0 happens to be the ISO-8859-1 (8-bit) part of the much bigger ISO-10646 code. Furthermore, when the document is transmitted over the network, it is transmitted by using an encoding which, in HTTP/1.0 etc, is ISO-8859-1. The result is that we tend to confuse the "document character set" with the "transmission encoding". However, it's not too difficult to see that these are not the same thing. Let's consider an HTML document that's stored on an EBCDIC-based mainframe, in order to make the illustration as obvious as possible. In a document that contains the following HTML: &#169; (i.e. the numbered character reference representing the copyright sign), the individual characters: ampersand, hash, one, six, nine, semicolon, each have to be translated into EBCDIC; however, the number (169 decimal) remains as 169 decimal, referring to the copyright-sign code point of the ISO-8859-1 document code: it does not get changed into a different number equal to whichever code point this sign occupies in EBCDIC. So, we now have a document whose "document character set" is still ISO-8859-1 but whose "storage encoding" is EBCDIC. Similar situations arise when HTML is stored on a platform whose native code is the Mac code, or say CP850 (DOS).
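The EBCDIC illustration can be tried out in Python, which happens to ship codecs for some EBCDIC code pages. The sketch below assumes cp037 as a representative EBCDIC variant (real mainframes use various code pages), and uses html.unescape to stand in for the part of a browser that resolves character references.

```python
import html

markup = "&#169;"  # numbered character reference for the copyright sign

# Storing the markup on an EBCDIC system changes every byte...
ebcdic = markup.encode("cp037")
ascii_bytes = markup.encode("ascii")
print(ebcdic != ascii_bytes)

# ...but translating it back yields the same six characters, and the
# number they spell out is still 169 in the document character set.
restored = ebcdic.decode("cp037")
char = html.unescape(restored)
print(restored == markup, ord(char))

# The copyright sign itself sits at other positions in EBCDIC (cp037)
# and in CP850, yet the reference does not change to match them.
print("\u00a9".encode("cp037"), "\u00a9".encode("cp850"))
```

In other words, the storage encoding only affects how the characters "&", "#", "1", "6", "9" and ";" are stored; what the reference means is fixed by the document character set.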

When we move to a situation involving more than one transmission encoding, the issue becomes more complex. The same ISO-10646 data can be transmitted in several different transmission encodings (UCS-2, UCS-4, UTF-8). Then we have to understand how this relates to the other encodings that exist today (CP850, EBCDIC, KOI8-R etc.).
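As a small illustration of one ISO-10646 character travelling in several transmission encodings, here is a Python sketch using the copyright sign again. Python has no codec named UCS-2 or UCS-4, so the sketch assumes that big-endian UTF-16 and UTF-32 are acceptable stand-ins; for a character in this range they produce the same bytes.

```python
text = "\u00a9"  # COPYRIGHT SIGN, code point 169

utf8 = text.encode("utf-8")       # two bytes
ucs2 = text.encode("utf-16-be")   # stand-in for UCS-2: two bytes
ucs4 = text.encode("utf-32-be")   # stand-in for UCS-4: four bytes

for name, data in (("UTF-8", utf8), ("UCS-2", ucs2), ("UCS-4", ucs4)):
    print(name, data.hex())

# Whatever the bytes on the wire, decoding gives back the same character.
same = (utf8.decode("utf-8") == ucs2.decode("utf-16-be")
        == ucs4.decode("utf-32-be") == text)
print(same)
```

The code point stays 169 throughout; only the byte sequences differ.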

Let me offer you a pointer to the relevant area at W3C, with particular reference to the sub-heading of Character Sets.

I think it is fair to add that a browser, such as Netscape, that offers the user the ability to configure it to a default character code other than ISO-8859-1 would then no longer be compliant with the HTML2.0 standard, since it would then no longer display a default (i.e. iso-8859-1) document correctly. Such changes ought to be under control of the author/server and not under control of the user configuration (although, obviously, they cannot work properly unless the user has taken care to make the necessary resources available to the browser - browsers typically don't come with all of this set up as standard).

For the Latin-2 (Central European) situation there's an interesting resource by P.Peterlin.

If you want to find out about Unicode, try at http://www.unicode.org/. A search at one of the WWW search engines seemed quite productive, returning among other things some interesting papers and discussions from the WWW Working Groups.



Original materials © Copyright 1994 - 2006 A.J.Flavell