The responsibility for the selection of topics and the statements made about them is purely my own, and all information is offered with the usual disclaimers. However, I do try to indicate sources of more-authoritative information for the various assertions that I make.
The selection of a font or style takes place quite separately from the character code mechanisms that we are considering here. Although a glyph, for example little-a-grave, might look cosmetically different in different fonts, in italics, etc., they all are instances of little-a-grave, and considered to be the same glyph, and represented by the same character code point.
Several different character codes feature in the discussion below. (Except for EBCDIC) they are all extensions of the 7-bit US-ASCII code, and therefore they coincide with US-ASCII and with each other in the lower half, code points 0-127 (decimal). In the upper half they differ, both in the repertoire of glyphs which they represent, and in the assignment of glyphs to code points. The main body of the note does not consider national variants of 7-bit ASCII (as laid down in the old standard, ISO646), but there is a digression for those who would like to know more. Nor do we consider the use of the 8th bit as a parity bit, which is irrelevant to, and incompatible with, our discussion.
So, when people refer to "the Latin-1 code" or "the ISO Latin-1 code", it might be assumed that they are referring to the "ISO-8859-1 code"; however, there is the possibility that they are referring to CP850 (in the Microsoft manuals this is called the "Multilingual Latin-1 code"), or to some other code that represents the ISO Latin-1 repertoire of characters even though the code itself is not an ISO code.
The ISO-8859 FAQ said, somewhat vaguely, that the codepage CP819 is "supposedly fully ISO-8859-1 compliant"; Netscape release notes also mention treating CP819 as a MIME Charset synonym for ISO-8859-1. See further discussion.
The HTTP specification mandates the use of the code ISO-8859-1 as the default character code that is passed over the network. The HTML specification is also formulated in terms of the ISO-8859-1 code, and an HTML document that is transmitted using the HTTP protocol is by default in the ISO-8859-1 code (at least, this was true prior to the HTML4.0 spec).
The MIME protocol, that is used by HTTP and by MIME mail, contains a clearly defined mechanism for explicitly defining a character encoding, but, at the level of the HTTP1.0/HTML2.0 specifications, browsers are not actually required to support any code other than the default i.e ISO-8859-1. I am writing this briefing and the related materials (except where specifically stated) entirely in terms of ISO-8859-1. I do not mean this as any kind of insult to those for whom ISO-Latin-1 is not the natural repertoire, I assure you; on the one hand the W3C had been working for a long time on an internationalization (i18n) draft, which lays out how browsers ought to support an extended range of characters, but on the other hand, when they made a practical attempt to document the common features of popular browsers as at some point in 1996, they weren't able to include any such extended characters - sad, but realistic. Subsequent developments included RFC2070, and then the i18n part of the HTML4.0 specification, although coverage in the popular browsers still leaves something to be desired (as of 1999).
In the rest of this briefing I occasionally refer to "native" 8-bit character codes: by this I am referring to the character storage codes that are used on certain platforms, e.g DOS Codepages such as CP437 or CP850, the Mac proprietary storage code (see Inside Mac for documentation), the EBCDIC code used on IBM mainframes, and so forth. I am not referring to non-Latin encodings such as Korean, Japanese, Hebrew...
As far as authors of HTML are concerned, character coding is an issue for them in two contexts: (1) where authors create files that actually contain characters from the upper half of the 8-bit code table, and (2) where they refer to such characters by their &-representations (entity names or &#number; references). If authors confine their use of characters to the low half of the 8-bit table (i.e the area defined by the US-ASCII 7-bit code), and represent any characters from the upper half by their &-representations, then point (1) is not an issue; furthermore, when transferring files between platforms by various means - Internet FTP, email, diskette etc. - there is no need to worry which particular 8-bit code is native to the sending and receiving platforms. For these reasons, this is an approach that is much to be recommended.
Where a file has been composed in another form
(for example, by typing in accented characters using a
keyboard), it might be wise to use one of the utility programs
that convert to an &-representation of the characters in question.
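A hedged sketch of what such a utility program does (the function name is mine, not that of any particular program): Python's codec machinery can produce the &-representation directly.

```python
# Sketch of a conversion utility: replace every character outside
# US-ASCII by its &#number; reference. Since ISO-8859-1 coincides
# with the first 256 Unicode code points, the numbers come out right.
def to_ascii_refs(text: str) -> str:
    # xmlcharrefreplace substitutes &#NNN; for anything not in US-ASCII
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(to_ascii_refs("voil\u00e0"))   # voil&#224;
```

The result is a pure US-ASCII file that survives transfer between dissimilar platforms unchanged.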
What happens in practice is that the & representations are not interpreted by the web server, but are passed as they are (i.e a string of US-ASCII characters) to the browser for interpretation by the browser.
The HTML specification requires that the &#number; representation be interpreted by reference to the code points in the ISO-8859-1 table, and not according to the native storage code of the platform on which the browser is executing. Implementations will probably achieve this by mapping (translating) the character into the platform's native storage code and offering it to the normal display routines. Another approach that is possible in theory is to define ISO-8859-1 as a private code within the browser, and to use private font tables (this approach tends to lead to unpleasant consequences elsewhere, though). Caution: in practice some (mostly older) browser versions don't behave in the way that is intended by the standard.
As was remarked above, if any codes from the upper half of the code table are placed onto the network, the standard requires that they be expressed in the ISO-8859-1 code. If, therefore, we have a document that does contain such characters, on a platform whose native storage code is different from ISO-8859-1, then the platform's Web server will have to map (=translate) these characters into ISO-8859-1 in order to place the document onto the network using the HTTP protocol. Let me stress, though, that the server is certainly not expected to look inside the HTML document for &#number; representations and make any change to those: the standard requires those to be composed in terms of ISO-8859-1 code values irrespective of the character code that is being used for storing the HTML document.
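As an illustrative sketch (not any actual server's code), that mapping can be expressed with Python's codecs; note how the &#number; references, being plain US-ASCII strings, pass through such a byte-level mapping untouched.

```python
# Sketch: map a document from the platform's native storage code
# (here assumed, for illustration, to be CP850) into ISO-8859-1
# before placing it onto the network.
def serve_as_latin1(raw: bytes, native: str = "cp850") -> bytes:
    # &-references are US-ASCII, identical in both codes, so this
    # mapping leaves them exactly as they are - no inspection needed.
    return raw.decode(native).encode("latin-1", errors="replace")

stored = "Caf\u00e9 &#169; 1996".encode("cp850")
print(serve_as_latin1(stored))   # b'Caf\xe9 &#169; 1996'
```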
The recode utility can perform conversions before or after a file is transferred between dissimilar platforms. I have seen several different DOS2UNIX/UNIX2DOS utilities, some of which merely adjust the newline convention, whereas others also map between ISO-8859-1 on the unix side and CP850 (perhaps) on the DOS side: if you plan to use such a utility, make sure that your version does the right thing for you.
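A minimal sketch of what the more thorough kind of utility does, assuming CP850 on the DOS side:

```python
# Sketch of a DOS->unix conversion that handles both jobs:
# the newline convention (CRLF -> LF) and the character code
# (CP850 -> ISO-8859-1). Lesser utilities do only the first step.
def dos2unix(data: bytes, codepage: str = "cp850") -> bytes:
    text = data.decode(codepage).replace("\r\n", "\n")
    return text.encode("latin-1", errors="replace")

print(dos2unix(b"fa\x87ade\r\n"))   # b'fa\xe7ade\n' (CP850 135 -> ISO 231)
```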
The MIME mail protocol does include facilities for announcing the specific encoding in use - but typical implementations of MIME mail agents (e.g PINE) do not necessarily have any facilities for resolving such discrepancies; they merely alert the user to the fact that the incoming file uses an encoding that is different from the local one.
It is essential to bear in mind that, in addition to the range (decimal 0-31 and 127) that ASCII allocates to control characters, ISO-8859-1 does not assign displayable characters to code points in the range (decimal) 128-159. Some platforms (e.g MS Windows) that are otherwise ISO-8859-1 conformant might use these code points to represent additional displayable characters, but they cannot and should not be relied on for communicating information on the World Wide Web - they could display as anything, or nothing, on other platforms or browsers.
The IBM PC code called "Multilingual Latin-1", CP850, has already been mentioned. This code also includes the ISO Latin-1 repertoire of glyphs (as well as some additional glyphs that are not in the ISO Latin-1 repertoire). But the characters are not in the same places in the two codes; and, furthermore, CP850 assigns characters throughout the upper half of the code table whereas ISO-8859-1 keeps thirty-two code points undefined. It follows, therefore, that anything represented in ISO-8859-1 can be translated into CP850, but there are characters in CP850 that do not correspond to any defined character in ISO-8859-1. It is conventional to translate these characters into the "undefined" range, decimal 128-159 of ISO-8859-1, on the understanding that their meaning is undefined as far as the standard is concerned. An attempt to display such characters could result in the equipment displaying anything, or nothing, without being in violation of the standard (in principle it could even result in executing some spurious control function, although I'm not aware of this happening in practice to any extent.)
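The asymmetry is easy to demonstrate, as a sketch using Python's built-in codec tables:

```python
# Every character of the ISO Latin-1 printable repertoire has a CP850
# code point, but not vice versa: CP850's box-drawing characters, for
# example, have no ISO-8859-1 representation at all.
box = b"\xcd".decode("cp850")        # a CP850 double-line box character
try:
    box.encode("latin-1")
except UnicodeEncodeError:
    print("no ISO-8859-1 code point for", ascii(box))
```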
The Macintosh uses a code that mostly covers the ISO Latin-1 repertoire, although some glyphs are missing, and includes some other glyphs. Again the assignment of glyphs to code points is not the same as in ISO-8859-1, and a code conversion is required.
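Again Python's codec tables illustrate the point (a sketch, not Mac software):

```python
# The same glyph occupies different code points in the two codes...
print("\u00e9".encode("mac_roman"))   # b'\x8e'  (e-acute on the Mac)
print("\u00e9".encode("latin-1"))     # b'\xe9'  (e-acute in ISO-8859-1)
# ...and some Mac glyphs lie outside the ISO Latin-1 repertoire entirely:
print(b"\xb9".decode("mac_roman"))    # a Greek pi, with no Latin-1 code point
```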
Previously, the EBCDIC code used by IBM mainframes was also an issue, and mention of this will be found in the materials referenced, but with the move away from mainframes this will concern us normal mortals less and less.
There is a paper by André Pirard of Univ. of Liège in Belgium, referred to in a usenet posting which I quote comprehensively below. Let me stress that the recommendations for translating characters that fall outside the ISO Latin-1 repertoire are not part of any formal standard: they are, however, a de facto "standard" that is followed by a lot of fine Internet software for the Mac, such as Fetch, some usenet newsreaders, most browsers, etc. He must translate the codes in the "undefined" region to preserve the integrity of the file (see his text for further explanation) but you are not entitled to use these characters in your Web documents - if you do, then users on different platforms are going to see different glyphs, or none.
Especially I want to warn readers not to take seriously the misguided efforts of some web authors who have manufactured a document containing all possible character codes, without comment, and have invited readers to display them on their own browsers. Without a text describing what should be displayed at each code point, the confused reader might assume that what they are seeing is the same as would be displayed at the corresponding point on any other browser. It is not, and they are being misled. Only the code points that are assigned to displayable characters in the ISO-8859-1 code are expected to have this property (and even there, some browsers are in violation of the HTML2.0 standard, so without a text description alongside every relevant code point, such a table is worse than useless, no matter how well-intentioned its author might have been). An alternative way of presenting a character code unambiguously is to present it as an image; the only problem with that approach is to make sure that people do not confuse similar-looking glyphs with each other, e.g mistaking an apostrophe for an acute accent, a German sharp-s for a Greek beta, or a degree sign for a superscript zero. Presenting both an image and a description would be the ideal.
In this section I have tried to cover concisely the principles that are involved in dealing with such non-ISO-8859-1 platforms, and have given brief notes on how they work out in practice. The Mac is of sufficient importance to justify a separate article.
The HTML2.0 specification (RFC1866) contains at the end (section 14) a section entitled "Proposed Entities": this list includes the already well-known ISO Latin-1 accented letters etc., but also introduces a proposal for additional entity names. These were certainly not in general use as of HTML2.0, and so, presumably, were intended for future implementation, and some browser developers have indeed progressively added support for them, while others seem to have made progress at a snail's pace if at all.
I am assured that the policy of the HTML developers is to use the SGML entity names as laid down by ISO (Dan Connolly gave me an alternative pointer - a server in Norway - but the information content should be the same!). The names that are relevant to this discussion are contained in the entity sets ISOnum and ISOdia (you will see that those entity sets define also many characters that are not included in the ISO-8859-1 code, and that therefore aren't properly usable in HTML according to current standards).
HTML provides no mechanism
for using floating accents, so that the glyphs for
umlaut/diaeresis, cedilla, and macron are of rather little
benefit: however, they do have entity names, and for
completeness they will be kept in the discussion.
In the archives at the W3C may be found a draft called HTML+, that pre-dated the now-expired HTML3.0 draft. Both drafts are now considered obsolete (although they make very interesting reading!), but a few browsers support some of the entity names that are peculiar to HTML+, so it is still mentioned here. It's worth noting that the text of the (uncompleted and now expired) HTML3.0 draft used the entity names "endash" and "emdash", but the associated DTD contained "ndash" and "mdash" - presumably this discrepancy would have been resolved if the draft had ever been finished.
In this section, I am only discussing the names for characters of the ISO-8859-1 code definition. The Trade Mark (TM) glyph is not defined in this code, and I discuss it separately in another section of this briefing.
A further version of the HTML entity names list can be found in Martin Ramsch's table at http://www.ramsch.org/martin/uni/fmi-hp/iso8859-1.html. However, this folds in some material relating to Hyper-G Text Format, which is not the same as HTML. The characters over which there seems to be disagreement are the following ones (see RFC1866 section 14).
HTML+     RFC1866   ISOnum/ISOdia   Description
-----     -------   -------------   -----------
die       uml       die, uml        diaeresis/umlaut
macron    macr      macr            macron, overbar
degree    deg       deg             degree
Cedilla   cedil     cedil           cedilla

Ramsch's table agrees with the ISO list, except for designating the macron as &hibar;, a name that I haven't seen elsewhere but was, I found, supported by X Mosaic, and for adding a couple of alternative names, among them ETH. I propose to ignore those as being Hyper-G specials, but in view of having found support for hibar I have retained it in my survey.
Test cases for these entity names can be found in the preface to my character code test tables so that they could be included in my tests for browser coverage.
The HTML3.2 DTD, surprisingly, omitted the "quot" entity, in spite of the general intention that HTML3.2 would be compatible with HTML2.0.
On investigation, attention was drawn to an item
in the www-html mail archive, in which
Christopher R. Maden states that in the relevant
SGML documents, the
quot entity is identified with the
apostrophe (ASCII 39, x27), not with the quotation mark (ASCII 34, x22).
Dan Connolly, on the other hand, states that
the omission from HTML3.2 was a mistake.
I have no further details of the progress of discussions subsequent to that time, but I note that the quot entity re-appeared in HTML4.0 just as it was in HTML2.0, as can be seen in the relevant section of the HTML4.0 recommendation.
It's no big deal to generate a table containing all possible code values, or all possible values of &#n;, to display them on various browsers and platforms, and to compare the results. Don't just compare several different browsers on the same platform: that is no better than buying several different English newspapers in order to get a better idea of the news in Texas!
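If you do generate such a table, annotate every code point with a description of the expected glyph, so that readers can tell whether what they see is correct; a sketch:

```python
import unicodedata

# Generate an annotated test line for each &#n; value; the Unicode
# names match the ISO-8859-1 glyphs for values up to 255.
for n in (169, 223, 233):
    print(f"&#{n}; should display: {unicodedata.name(chr(n))}")
```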
Regular X-based fonts will display a blank, or nothing
at all, in response to the unassigned codes. MS-Windows based
browsers will normally display some well-defined glyphs in the relevant
positions; Mac-based browsers will likely display something too, which
might be different.
However, Mac-based users should take care to compare the
results from later versions of Netscape (2, 3 etc.)
with earlier Netscapes such
as Mac 1.12: it is evident that Netscape are redefining the rules as
they go along.
The only advice that I can possibly give you is to steer
clear of these undefined code points.
The whole intention of HTML was to represent content in a portable, platform-independent fashion. There is no way that you can guarantee to get the correct results on the reader's screen.
That's what standards are for: the evidence of Netscape
stealthily redefining the rules (I saw no mention in their
release notes that they were shifting the undefined Mac
character codes around) just makes it that much more
important to stick to the standards, and not get tempted
by non-standard features of a commercial browser, if you want
to get a message reliably out to WWW readers.
If you cannot be confident of browser coverage of the construct that you'd like to use (e.g &trade; for the trademark glyph) then you would be better advised to use a substitute that has good browser coverage.
When un-zip-ed, this file produces some interesting information and some software, including a file ISOLATIN.CPI that can be used to support CP819 for output in MS-DOS (assuming you have EGA or VGA; the CGA display is stated to support only its own hardware code page). There is also some mention of keyboard support. Disclaimer: the above refers entirely to material that I found on the net. The nearest thing to an authoritative source is the file Doc\Isocp.txt contained in the above-cited ZIP archive (however, Mr. Kostis tells me in email that his 1993 address seen in that file is no longer valid).
I have not tried using DOS for any significant period with CP819 selected, so I have no personal experience of how it works out: in response to earlier versions of this page I received several emails that comment favourably on this method of working in DOS, and in Jan 1997 I got a more detailed email from Portugal about successful use of this method.
I don't believe there are any subtle differences from iso-8859-1: the IANA character set registrations list CP819 along with IBM819 as a synonym of iso-8859-1, so this seems to be just another codification of the same character coding.
For its internal text encoding, MS Windows works in terms of its own character code: this code is identical with ISO-8859-1 at the code points that are assigned to displayable characters by ISO-8859-1, but in addition it assigns displayable characters to some of the code points that ISO-8859-1 explicitly leaves undefined.
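The difference is confined to the range 128-159; a Python sketch ("cp1252" being Python's name for the MS Windows Latin-1 code):

```python
raw = bytes([0x93])                  # a code point ISO-8859-1 leaves undefined
print(raw.decode("cp1252"))          # a left double quotation mark on Windows
print(ascii(raw.decode("latin-1")))  # '\x93', an unassigned C1 control code
```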
The chief cause of confusion, I guess, is the notorious DOS convention for typing ALT/nnn to get the character whose code position is (decimal) nnn. In MS Windows, this convention has been extended (although I didn't find this explained anywhere in the normal user manuals or help information).
You can easily verify that MS Windows code is incompatible with MS DOS code in this respect, if you type in some accented characters using an MS Windows application, say Notepad, and then view the resulting file in MS DOS; or conversely if you type them in using a DOS application such as (DOS) EDIT and then view the result in MS Windows.
Let us take one example: o-circumflex. In DOS CP 437 or 850, the code point for o-circumflex is 147 (decimal). So, in DOS you type this in as ALT/147, and that is what is stored. However, when you use ALT/147 in an MS Windows application, the MS Windows (ISO-8859-1) encoding of o-circumflex is actually stored into the file, i.e the character code 244 decimal. Basically this is very simple in principle, but can lead to much confusion in practice, and you can play around a little if you want, typing ALT/nnn codes into a file one way and viewing the file the other way, and driving yourself demented trying to follow what is happening by use of a CP850 (or 437) DOS code table on the one hand, and an ISO-8859-1 (or MS Windows) code table on the other. Have fun!
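The o-circumflex example can be checked against Python's codec tables:

```python
# ALT/147 under DOS stores byte 147, which CP850 assigns to o-circumflex;
# the same keystrokes under MS Windows store the ISO-8859-1 value, 244.
assert "\u00f4".encode("cp850")   == bytes([147])   # DOS storage
assert "\u00f4".encode("latin-1") == bytes([244])   # MS Windows storage
print("both checks pass")
```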
Anyhow, the long and short of this is that you can perfectly well deal with 8-bit accented letters in MS Windows if you wish (subject, of course, to the caveats mentioned elsewhere in this briefing), as long as you only handle the file with MS Windows, and not mix it with DOS.
The above would no longer be a problem if everyone operated DOS in the code page 819 discussed above. Whether that would be practical, I can't say - see the discussion there.
Correspondents sometimes report that some particular &#n; displays the TM glyph on that informant's browser: but readers who have taken on board my explanation so far will realise that this is of no use, since it will display something different, or nothing at all, on some other browsers. The value of n is (or was - see discussion above about Netscape quietly re-arranging their Mac version) different according to whether the informant uses a Mac-based or an MS Windows-based browser, and X-based browsers do not display this glyph at all in their normal fonts. Surely, for a glyph like this that is used principally for legal reasons, there can be no excuse for sloppy usage that has no guarantee of displaying anything to some readers.
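The platform-dependence is easy to confirm, as a sketch using Python's names for the respective platform codes:

```python
print("\u2122".encode("cp1252"))     # b'\x99': code 153 on MS Windows
print("\u2122".encode("mac_roman"))  # b'\xaa': code 170 on the Mac
try:
    "\u2122".encode("latin-1")       # no such code point in ISO-8859-1
except UnicodeEncodeError:
    print("not representable in ISO-8859-1")
```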
There are a number of kludges
that you might consider using at this time.
Bearing in mind that by now a good range of browsers already honour the SUP tags, you might even enclose "(TM)" in those; here is what your current browser does with that: (TM). A browser that did not understand the SUP tags would simply display the "(TM)" in its current font. Nest the (TM) in additional markup if you wish.
The recommended entity name, &trade;, was supported by certain browsers for some time already, but others still lack that support. Here is what your present browser displays in response to this entity: ™. The standards-compliant way to code this as a numerical character reference is by its Unicode value, 8482, i.e &#8482;, and here is what your current browser does in response to that: ™. I cannot recommend that you use either of those representations at this time, especially for a mark like this that has legal significance: MSIE 3 supported only &trade;, while NS 3 supported only &#8482; and not &trade;. Update Jan'98: both MSIE4 and NS4 support &#8482;, although NS4 still does not support &trade;.
It is often claimed on usenet that one or more no-break space(s)
can be used as a space filler, and
indeed this can be seen in many HTML documents on the WWW.
Some advocates suggest strings of &nbsp;, while others advocate alternating no-break spaces with ordinary spaces;
it is indeed an observed fact that many browsers in use
in 1996-7 were producing that effect, and this
became pretty much universal.
(The HTML4.0 specification subsequently codified the no-break
space as not being a white space character, from
which we may deduce that it would not be eligible for compression
under the white space rules; but it explicitly does not go into
any further detail about its treatment.)
In constructions like the much-touted string of &nbsp;s (or the corresponding thing with &#160;), the no-break space is not joining two words together as envisaged by the spec.
Some authors say that they demand two spaces between sentences (e.g after a full stop - US "period" - question mark or exclamation point), because their style rules demand it. They therefore desperately seek some kludge for inserting such an additional space, even in some cases going so far as to imbed a small blank image. It would be my own personal contention that this is a browser issue, since any imposition of style rules ought to be according to the reader's locale, not the author's. Admittedly, there is no unambiguous definition of a sentence end in HTML: there are situations where a full stop, question mark or exclamation point, followed by a single space, might appear within a sentence, and an expansion to two spaces would not be desired by the author or the reader. I leave this point for others to debate.
To those authors who desire the first line of every paragraph to
be indented, on the other hand, I say very definitely that this
is a browser (or style sheet) issue.
The start of a new paragraph is clearly
defined in HTML, and the question of how to present a paragraph
is a browser issue, purely and simply.
Readers might also wish to have such presentation details under
their control, rather than being imposed
by the author.
Those authors who struggle to indent each new paragraph by means of no-break-space tricks and/or blank images are, in my view, quite misguided. And a reader who cares enough to have configured their personal style sheet to indent paragraphs is not likely to be amused when they find they get a double-dose of indenting thanks to the author's kludges with transparent GIFs or strings of no-break spaces.
NOTE -- the SOFT HYPHEN character (U+00AD) needs special attention from user-agent implementers. It is present in many character sets (including the whole ISO 8859 series and, of course, ISO 10646), and can always be included by means of the reference &#173;. Its semantics are different from the plain HYPHEN: it indicates a point in a word where a line break is allowed. If the line is indeed broken there, a hyphen must be displayed at the end of the first line. If not, the character is not dispalyed at all. In operations like searching and sorting, it must always be ignored. (reproduced from the RFC complete with typo ;-)
The HTML4.0 specification says something similar, except that it implies that browsers are not mandated to support this character, which is a pity as there seems to be no viable alternative for achieving this useful result.
The meaning of this when the soft-hyphen occurs inside a word is unambiguous; but hardly any browsers actually implement this yet (an honourable exception: Lynx). What to do with a soft-hyphen that occurs in isolation is unclear: a browser might still display a hyphen in this situation - or maybe only display it when it comes at the end of a line (recent versions of Lynx seem to be like that). Best to ignore what is displayed on this line of my test tables, since there is no formal specification of what a soft-hyphen should do in that kind of context.
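The RFC's rule for searching and sorting amounts to stripping the character out before comparison; a minimal sketch:

```python
SHY = "\u00ad"                      # SOFT HYPHEN, i.e &#173;
word = "brief" + SHY + "ing"        # a break opportunity inside a word
# per the RFC, searching and sorting must ignore the soft hyphen:
print("search key:", word.replace(SHY, ""))   # search key: briefing
```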
(There used to be a usenet FAQ on ISO-8859-1, and a fascinating array of resources on character codes and internationalisation by its author; these are still referenced from other usenet FAQs, but sadly the document itself had disappeared by Dec.1999.)
The Kermit team has done much work on documenting the usage of character codes: a visit to their Web Pages is well worth while, and materials can be found in their archive at Columbia. The "texts about ISO-8859-1" by A.Pirard (following quoted article) set out the design criteria well, and should be read for a clearer understanding of the issues involved.
(End of quote.)
Q: Is there a standard site where one can find the latest versions of these tables?
There was no site where one can find the latest versions of the Macintosh<->ISO 8859-1 translation tables. But now, after your mail, I put the tables on an ftp server as a Macintosh .sea archive.
BTW, you can find there the André Pirard's texts about ISO 8859-1, other codes and several computer "languages". These texts about communication programming for international characters, can be found, too, in an ftp server at Columbia, USA or an ftp server at Univ of Liege in Belgium.
The software using these tables is not all using taBL resources. I.e. FTPd and Talk have the taBL, but not Anarchie. With resources, the developer can allow the user/manager to put an other one. Without resource, the translation is "hard-coded".
I prefer taBL resources. For software as Eudora and Telnet, it's a must because people may need several translations according to the environment, i.e. for Telnet, the code of the connected computer. For software able to transfer files/texts only, the choice is very small and Macintosh<->ISO 8859-1 is the best standard to use, IMHO.
The code mapping that's documented in A.Pirard's materials has become the de-facto standard in Mac-based Internet software, such as Fetch, usenet newsreaders, most WWW browsers, etc.
The Kermit team had also documented the Mac problem, and indicated a solution based on the same principles, but they designed a different code mapping that has not, in the event, gained general acceptance.
An email from Terry Jones recommends the GNU recode program for a very versatile range of character conversion options, including 8-bit Latin-1 to HTML entity encoding, as well as converting between different character codes. I have to confess I had not been previously aware of this program, but having looked at the manual for it, I have no hesitation in accepting his recommendation. Beware, though, that this program can, by default, replace your input file in-place, and with some combinations of options the change would be irreversible!
From the point of view of HTML (indeed of SGML), every document has a "document character set", which in the case of HTML2.0 happens to be the ISO-8859-1 (8-bit) part of the much bigger ISO-10646 code. Furthermore, when the document is transmitted over the network, it is transmitted by using an encoding which, in HTTP/1.0 etc, is ISO-8859-1. The result is that we tend to confuse the "document character set" with the "transmission encoding".
However, it's not too difficult to see that these are not
the same thing. Let's consider an HTML document that's stored
on an EBCDIC-based mainframe, in order to make the illustration
as obvious as possible. In a document that contains the
© (i.e the numbered
character reference representing the copyright
sign), the individual characters: ampersand, hash, one, six, nine,
semicolon, each have to be translated into EBCDIC; however, the
number (169 decimal) remains as 169 decimal, referring to the
copyright-sign code point of the ISO-8859-1 document code: it
does not get changed into a different
number equal to whichever code point this sign occupies in EBCDIC.
So, we now have a document whose "document character set" is
still ISO-8859-1 but whose "storage encoding" is EBCDIC. Similar
situations arise when HTML is stored on a platform whose native
code is the Mac code, or say CP850 (DOS).
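The EBCDIC illustration can be verified with Python's cp500 (an EBCDIC variant) codec:

```python
ref = "&#169;"                       # copyright sign of the document code
stored = ref.encode("cp500")         # each *character* mapped to EBCDIC...
assert stored != ref.encode("ascii") # ...so the bytes certainly changed,
assert stored.decode("cp500") == ref # but the number 169 survives as text
print(stored)
```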
When we move to a situation involving more than one transmission encoding, the issue becomes more complex. The same ISO-10646 data can be transmitted in several different transmission encodings (UCS-2, UCS-4, UTF-8). Then we have to understand how this relates to the other encodings that exist today (CP850, EBCDIC, KOI8-R etc.).
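One character, several transmission encodings; a Python sketch (UCS-2 is shown via the big-endian UTF-16 codec):

```python
c = "\u00a9"                     # the copyright sign, code point 169
print(c.encode("latin-1"))       # b'\xa9'      : one byte in ISO-8859-1
print(c.encode("utf-8"))         # b'\xc2\xa9'  : two bytes in UTF-8
print(c.encode("utf-16-be"))     # b'\x00\xa9'  : two bytes, UCS-2 style
```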
Let me offer you a pointer to the relevant area at W3C, with particular reference to the sub-heading of Character Sets.
I think it is fair to add that a browser, such as Netscape, that offers the user the ability to configure it to a default character code other than ISO-8859-1 would then no longer be compliant with the HTML2.0 standard, since it would then no longer display a default (i.e iso-8859-1) document correctly. Such changes ought to be under control of the author/server and not under control of the user configuration (although, obviously, they cannot work properly unless the user has taken care to make the necessary resources available to the browser - browsers typically don't come with all of this set up as standard).
For the Latin-2 (Central European) situation there's an interesting resource by P.Peterlin.
If you want to find out about Unicode, try at http://www.unicode.org/. A search at one of the WWW search engines seemed quite productive, returning among other things some interesting papers and discussions from the WWW Working Groups.
Original materials © Copyright 1994 - 2006 A.J.Flavell