ISO-8859-1 and the (traditional) Mac platform

Introduction

[Note to explain the context of this page:] Since around 1997-8 my contact with Macs has pretty much faded away, and I'm unlikely to expand coverage to HTML4.0/i18n issues on the Mac. Although this page should now be treated more as a piece of history, it still seems to contain some matters of interest regarding the relationship of macRoman to the iso-8859-1 and Windows-1252 codings. It might also be of historical interest in illustrating some of the early work-arounds, but these relate only to browsers based on 8-bit character codings and repertoires (in some cases calling for the use of custom fonts) and are almost entirely inappropriate for use with modern browsers supporting HTML4/i18n.

The main briefing document considered the general principles of dealing with WWW documents on platforms whose native storage code is not ISO-8859-1. Apart from Macs, the commonly-used platforms either support ISO-8859-1 directly, or (for the purposes of standard HTML) are completely compatible with it (Windows uses the area 128-159 decimal for displayable characters, but these code points aren't used in standard HTML: see the note below).

   hex          decimal          description
   A6           166              broken vertical bar
   B2, B3, B9   178, 179, 185    superscript 2, 3, 1
   BC, BD, BE   188, 189, 190    fractions 1/4, 1/2, 3/4
   D0, DD, DE   208, 221, 222    ETH, Y-acute, THORN
   F0, FD, FE   240, 253, 254    eth, y-acute, thorn
   D7           215              multiplication sign

The Mac, however, seems to call for an article of its own. The fourteen problematical characters are given in the table above (the soft-hyphen gets mentioned later). Let me stress that this article was written in the era of HTML2.0/HTML3.2, and only considers the iso-8859-1 code.

Theory

As was reviewed in the main briefing, there are two ways in principle for dealing with the problem:

  1. Establish a mapping between ISO-8859-1 (the coding generally used on the net in the Latin-1 areas), and the native storage code on the Mac,
  2. Make the browser work in ISO-8859-1, i.e use it as a private storage code, incompatible with the Mac's own native storage code.

In the first method, what one does is effectively to make all the characters that exist in both character codes "take their partners", then for the remaining characters (there are, as we know, fourteen) to choose "unwanted" characters from the other code and pair those up as substitutes. Finally, all the remaining code points (unassigned etc.) have to be paired off one by one in order to preserve code integrity (reversibility, in other words).
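To make the pairing-off idea concrete, here is a minimal sketch in Python (which of course post-dates this article). It assumes Python's built-in "mac_roman" codec as a stand-in for the Mac code, and it pairs the left-over code points in plain numerical order, which is an arbitrary choice of mine and is NOT the table that Pirard documented. The names build_reversible_table, ISO_TO_MAC and MAC_TO_ISO are made up for the illustration.

```python
def build_reversible_table(native="mac_roman"):
    """Return a 256-byte table: table[iso_byte] -> native byte.

    Assumption: Python's mac_roman codec stands in for the Mac code.
    The pairing of left-over code points is arbitrary, not Pirard's.
    """
    table = [None] * 256
    used = set()
    # Step 1: characters that exist in both codes "take their partners".
    for iso in range(256):
        try:
            mac = bytes([iso]).decode("latin-1").encode(native)[0]
        except UnicodeEncodeError:
            continue  # no partner in the native code
        table[iso] = mac
        used.add(mac)
    # Step 2: pair off every remaining code point, one by one,
    # so that the mapping stays reversible.
    spare = (b for b in range(256) if b not in used)
    for iso in range(256):
        if table[iso] is None:
            table[iso] = next(spare)
    return bytes(table)


ISO_TO_MAC = build_reversible_table()
# Invert the permutation to get the mapping in the other direction.
MAC_TO_ISO = bytes(ISO_TO_MAC.index(b) for b in range(256))

# Round trip: translating there and back again loses nothing.
every_byte = bytes(range(256))
round_trip = every_byte.translate(ISO_TO_MAC).translate(MAC_TO_ISO)
```

The point of the second step is the "integrity" property: because every one of the 256 code points has exactly one partner, any byte string can be translated one way and back again without loss, whether or not its characters can actually be displayed.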

All of this had already been done in the work that's documented in Pirard's 1992 article - which had become the "de facto standard" for Internet purposes around that time. In order to actually implement this scheme, a simple early approach was to create a variant font that is, for the most part, the same as a regular Mac font, having the characters positioned in the way that's defined by the Mac code, but with the specific fourteen character positions swapped out in favour of their ISO-Latin1 "partners" (substitutes). A suitable (monospaced) font became available, see later. With the desire, however, to use a wider range of font appearances in browser renderings, this approach could only have become practicable if a sufficiently wide range of fonts had become available that supported this approach; and that was not the case. Instead, as we will see, the Big Two made the problem worse before it started to get better.

For the second method, what one would be doing is to store ISO-8859-1 characters as-is, and then constrain the Mac to display them in the browser. In theory this would be feasible, but it has a lot of practical implications, and, as we will see, even those browser/versions that seem to support this approach haven't really made it work consistently. In order to implement this solution, what's needed is a font that's laid out according to the ISO-8859-1 code itself: such a (monospaced) Mac font was available on the 'net, called ProFontISOLatin1. A current URL cannot be found, but it should be in the info-mac FTP archives (search for pro-font-22.hqx). Here again, however, this simple approach can cover at best a very limited range of fonts.

There appeared also a ProFontWindows font in the ProFont 2.2 distribution, which for the purposes of standard HTML should give the same results. You might find it a preferable choice so that you can use it for handling mail, etc. that has been composed using these Windows characters. It might also be useful for those non-standard HTML files that attempt to represent curly-quotes etc. using the Windows code points or the corresponding (undefined in standard HTML) numerical character references.

Please let me stress again that these notes explored ways of working on earlier, relatively simple-minded browser versions. While illustrating some important principles, it's clear that none of these approaches is really feasible for a full coverage of fonts and typography in current browsers. A much more thorough-going internal redesign of the browser is needed, but that was beyond the scope of these notes.

"MacRoman" coding

The term "MacRoman" (or sometimes "macRoman") came into use to refer to the Mac's own storage code in the Latin-1 region. This is the code that is listed amongst the mappings at the Unicode FTP site as the Mac OS Roman character set. It is the Mac's own native storage code; it is not the code that is defined in Pirard's document mentioned below. In other words, MacRoman is deficient in the fourteen characters that are under consideration here.
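For anyone who wants to check the deficiency for themselves, the following small Python sketch tries each of the fourteen code points against Python's built-in "mac_roman" codec. That codec follows a mapping table published by Apple, which I am assuming is close enough to the code discussed here for this purpose (later revisions of the table may differ in the odd position). The helper name in_mac_roman is made up for the illustration.

```python
# The fourteen ISO-8859-1 code points from the table in the Introduction.
FOURTEEN = [0xA6, 0xB2, 0xB3, 0xB9, 0xBC, 0xBD, 0xBE,
            0xD0, 0xDD, 0xDE, 0xF0, 0xFD, 0xFE, 0xD7]


def in_mac_roman(code_point):
    """True if the ISO-8859-1 character at code_point can be encoded
    with Python's mac_roman codec (assumed stand-in for MacRoman)."""
    try:
        chr(code_point).encode("mac_roman")
        return True
    except UnicodeEncodeError:
        return False


# All fourteen fail to encode ...
missing = [cp for cp in FOURTEEN if not in_mac_roman(cp)]
# ... whereas an ordinary accented letter such as e-acute (E9 hex) is
# present, although at a different code point (8E hex) in the Mac code.
e_acute_on_mac = chr(0xE9).encode("mac_roman")
```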

Occasionally, one sees WWW documents sent out with HTTP headers (or with META HTTP-EQUIV markup) stating that they have charset=macRoman; this is unwise and impolite on the WWW, for at least two reasons: (a) most non-Mac clients do not implement this coding, so the display is unpredictable, and (b) "macRoman" is not one of the charsets that is registered in the IANA charset assignments document. For the latter reason, charset=x-macroman is sometimes used: but that doesn't help in the case of browsers which don't support it anyway (under either name).

Browser method 1: code mapping

The main briefing cited the 1992 paper by André Pirard, which is available from an ftp server at Univ of Liege in Belgium. In what follows, I assume that readers have made themselves familiar with this article, at least as far as its general principles and the part that deals specifically with the Mac are concerned.

(The Kermit team had also analyzed the problem around the same time, and come to the same conclusion in principle - and independently devised their own code equivalence table and associated font, which had not however, in the event, received wide acceptance.) The table documented by Pirard was seen at work in quite a lot of Internet software for the Mac, and that included browsers such as NCSA MacMosaic, MacWeb, and earliest versions of Netscape (but see comments below about subsequent releases of Netscape). For FTP transfers to and from the Mac, the program "Fetch" offered support for the character code mapping procedure defined by Pirard, if you turn on its "Translate ISO characters" option in the preferences.

[Just to try to clarify the historical position: I think it's fair to say that Pirard's document made good sense at the time it was formulated, and offered useful functionality for quite a number of years thereafter, as long as systems were based on 8-bit character architectures. Later, a number of problems arose, which could have been avoided if those implementing new kinds of cross-system character compatibility (i.e chiefly Unicode) had taken a little more care with interfacing to these older arrangements from the earlier 8-bit mappings. See also this for background.]

Pirard's table also sets the ISO soft-hyphen character (&shy;, code point 173 decimal, AD hex) as partner to the Mac "syllable hyphen" character. HTML4.0, like RFC2070, sets out its intended semantic for the soft-hyphen, but declares support for it to be optional, and the popular browsers do not in fact support it. That makes it (irrespective of whether you think this usage would be good or not) essentially worthless in authoring HTML for the WWW, and so this Mac-specific note does not consider the soft-hyphen issue further.

If we had a font that contained the iso-8859-1 repertoire of characters, but had them arranged according to Pirard's table, then we would have a font to implement this scheme.

Such (monospaced) fonts were produced by William I Johnston: see CourierWeb, as well as ProFontWWW from ProFont v.2.2 that was cited earlier and should be in the info-mac archive.

When such a font is selected, browsers such as NCSA Mac Mosaic, MacWeb, and earlier versions of Netscape are seen to display the character test tables correctly, including the dreaded fourteen characters, and this is the case for the 8-bit characters, for the &#number; representation, and (as far as the entity names are implemented in each browser) also for the &entityname; representation. This is with the default "document encoding" selected, where applicable.
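The three representations referred to here are meant to denote one and the same character. As a hedged illustration, the following Python fragment uses the standard library's html.unescape to resolve the two ampersand notations and compares them with the 8-bit form, taking THORN (222 decimal, DE hex) as the example. This only shows what a correct result looks like; it says nothing about how any of the browsers mentioned actually went about it.

```python
import html

# The same character, THORN, written in the three ways used in the test tables.
eight_bit = b"\xde".decode("latin-1")   # 8-bit character, code 222 (DE hex)
numeric = html.unescape("&#222;")       # numerical character reference
named = html.unescape("&THORN;")        # named entity

all_agree = (eight_bit == numeric == named)
```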

Unfortunately I'm not aware of a proportional font that does the same job completely. Cathy Ball provides resources for students of Old English which include a font called "Times Old English version 2". This font is like Times, except for the six letters (upper and lower case eth, thorn, and y-acute), but this still leaves eight of the fourteen positions unadapted from the original Mac situation. She added some notes on the mis-handling of these characters in later versions of the Big Two browsers. (The other fonts referenced there are great fun, but none of them really address the issue we are dealing with here).

One thing that is worth mentioning is that when using this approach, you can copy/paste from the browser window to other Mac windows, e.g into word processors or email. If none of the fourteen characters are involved, then the material will be perfectly correct, even with standard Mac fonts. If any of those fourteen characters are used, of course, the pasted material will only be displayed correctly if you use an appropriately adapted font in the word-processor, just as in the browser.

Notes about Netscape Navigator

Earliest versions (0.9, 1) of Netscape behaved similarly to NCSA Mosaic and MacWeb as described above. In later versions (2, 3) of Netscape, various half-hearted kludges appeared in the Mac implementation, although I found no reference to them in their release notes. In Version 2, I saw that the following changes had occurred:

It should be clearly understood that when these kludged characters are pasted into another document, they do not come over as their original selves, but as the kludged equivalents that Netscape have provided. In other words, "integrity" (in the sense used by Pirard) has been lost, since these characters can no longer be distinguished from the genuine unbroken bar, 1, 2, 3, etc. characters.

Moving now to Mac Netscape version 3, we find, in addition to the kludges mentioned above:

I haven't examined Netscape 4 in detail, but it's reported that when the test table is displayed, there are still various anomalies:

So, it seems to me that these kludges have effectively taken away function, rather than enhancing it. Neither before nor after the change does the browser comply with the HTML2.0 specification, yet they offer no caveats with their software, nor suggest any workable solution to the problem, and the kludges have prevented the solution described here from working properly. Admittedly, the problem would be an insoluble one if the browser was determined to use only the repertoire of the standard Mac fonts, since these characters are missing from those fonts.

By the way, if you are authoring HTML documents on the Mac in the way that's described in this section, and you want to preview the document by loading the disk file into Netscape (2 or 3, at least), then you will need (at least in the Mac NS version that I tried) to manually select "Document encoding: MacRoman" in order to see the correct results: then, all three columns of my test table (8-bit chars, numerical references, and named entities) were displayed and printed out correctly when the relevant font configuration was set to e.g "Western: ProFontWWW", and was as correct as could be expected (i.e excepting those eight or fourteen characters respectively) when the font was set to Times OE or to one of the normal Mac fonts. Remember to change the default Document Encoding back again when viewing documents from the web: as you can see, there is a fundamental design fault in the browser that it fails to switch automatically between the external WWW coding (iso-8859-1) and the internal Mac coding (MacRoman) and has to be helped along by the user.

B.t.w.2: Mac NS's "Save-As: source" does not translate html documents into Mac code nor convert their newline representation into Mac conventions, so it would be inappropriate to edit such documents further on the Mac with an ordinary text editor. Look for an editor that is aware of and can cope with this situation, and/or use HTML's ampersand-notations in preference to 8-bit characters.
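As a sketch of the second suggestion, the following Python function takes the bytes of an HTML source file saved in iso-8859-1, turns every 8-bit character into its numerical character reference, and converts the newlines to the classic Mac convention (a bare carriage return). The function name is made up for the illustration, and it assumes that the input really is iso-8859-1: any bytes in the range 128-159 would come out as references that are undefined in standard HTML.

```python
def iso_source_to_mac_safe(raw):
    """Convert ISO-8859-1 HTML source (bytes) to pure ASCII with
    numerical character references and Mac-style (CR) newlines.

    Assumption: raw really is ISO-8859-1; nothing is checked.
    """
    text = raw.decode("latin-1")
    # Normalise CRLF and LF line ends to a bare CR.
    text = text.replace("\r\n", "\n").replace("\n", "\r")
    # Every non-ASCII character becomes &#number; notation.
    return text.encode("ascii", errors="xmlcharrefreplace")


converted = iso_source_to_mac_safe(b"caf\xe9\r\n\xde\n")
```

Since the result contains only 7-bit characters, it reads the same in the Mac code as in iso-8859-1, and so it can be edited with an ordinary Mac text editor without any risk to the accented letters.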

MS IE for the Mac (version 2)

In respect of "the fourteen" code points, MS IE 2 seemed to conform with no known standard; it even managed to get wrong some of the code points that all other browsers display correctly.

From reports that I've received, it seems that MS IE can be coaxed into displaying the full ISO Latin-1 repertoire by following the "MacRoman" procedure described below as "method 2". I'm sorry that due to logistical problems I wasn't able to try that myself - I would assume that it represents the same challenge in terms of interworking with other Mac-based applications (copy/paste etc.) that is described there for the other browsers.

MS IE for the Mac (version 4.0)

I received a report (Jan'98) from a reputable informant that MSIE4.0 displays the "missing" Latin-1 characters correctly, presumably getting them from somewhere other than the currently selected font. In discussion with Andreas Prilop and Gerald A. Edgar (Jun'98) it seems that the characters are coming from VT100 and Symbol fonts that get installed with the browser.

So, this is an example of the "mixed fonts" approach described under method 3 below. Actually, from the screen shots the extra characters didn't seem to harmonise too well with the contributor's choice of normal font, but that's life. And note that problems can be expected when copy/pasting such material from a browser window into a text editor or other such application.

It was remarked that someone who took a copy of the application, expecting it to work without "installing" it, would not get the additional fonts installed for them. In this situation the browser has some kind of fallback display using two- and three-character strings.

The fractions (1/4, 1/2 and 3/4) were displayed as three-character strings, even when a font was present that contained them as single characters.

It's still the case that if one selects one of those Mac fonts that aren't fully populated in the other places (e.g Geneva) then it displays missing-character boxes in those positions: selecting "Cairo" was also reported to be a mess. (As I say, I no longer have a suitable Mac. Thanks for the input, folks.)

Misc.

By the way, these fonts, ProFontWWW and CourierWeb, work just great with NCSA TELNET etc. when you select iso-8859-1 coding in the terminal emulation; then for example if you execute Lynx on the remote host, and set its option charset to iso-8859-1, the whole thing displays correctly and prints correctly. Genial.

There was a project to write a native MacLynx, but the last available test version didn't support 8-bit characters at all, and the project seems to have stalled.

Browser method 2: working in ISO-8859-1

Mac Netscape (versions up to 3, at least)

Netscape offers an option (on the "Options/Document encoding" menu) to select a document coding of MacRoman. One would do this in conjunction with selecting a font that was laid out according to the ISO-8859-1 code itself, e.g the ProFontISOLatin1 font.

If this is done, then the rendering of 8-bit characters is 100% correct. Unfortunately, the rendering of the &#number; representations and the named entities was then extensively wrong.

And, when one pastes the resulting display into a normal window, say a word processor (with a normal font, or even with ProFontWWW selected), then the &#number; representations and (in so far as they are implemented) the named entities are displayed correctly in the word processor, whereas the 8-bit characters are wrong. Conversely, if one selects ProFontISOLatin1 in the word processor, then the 8-bit characters are correct and the other representations are wrong.

This doesn't seem to me to be a viable way to proceed: could I stress that the problems affect even quite inoffensive documents, that don't try to use any of the fourteen problematical characters at all - even the use of an accented letter could make the document non-portable and needing the use of this special font.

Clearly, copy/paste is merely transferring the stored character values, without any regard for what they "mean" in the currently selected font. So, this technique would have serious implications which make it impractical for widespread use. HTML authoring guides (at least in the Western European region) recommend the use of entity names or numbers in preference to 8-bit characters, and this is normally good advice, but exactly these kinds of documents would then fail to render correctly.
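The effect on an "inoffensive" document is easy to reproduce. In this Python sketch the word "café" is stored as iso-8859-1 bytes, as the browser would hold it under this method, and the same bytes are then read as if they were in the Mac code, which is what a normal Mac font does with pasted material. Python's "mac_roman" codec is assumed as a stand-in for the Mac's native code.

```python
# Bytes as held by a browser that works in ISO-8859-1 internally.
iso_bytes = "caf\u00e9".encode("latin-1")   # e-acute is E9 hex

# The very same bytes, interpreted by a normal Mac font after pasting.
misread = iso_bytes.decode("mac_roman")     # E9 hex is E-grave on the Mac
```

The e-acute comes out as a capital E-grave, even though e-acute exists perfectly well in the Mac repertoire: nothing was missing, the bytes were simply never translated.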

I'll also mention that when I tried to print out the test table from Netscape, I got completely different results printed than what I was seeing on screen. Now, this could be my fault with setting up the fonts, I'm not sure. However, if it can happen to me, it could happen to anyone else. Is there a Mac expert in the house???

That discussion related to how a Mac-based reader could read a normal WWW document properly. This article has already commented adversely on the practice of some Mac-based authors offering documents on the WWW whose coding was advertised as MacRoman. Even if the page was about some Mac-specific topic, there seems no guarantee that every interested reader would be using a Mac to read the page.

MacWeb 1.1.1E

MacWeb (1.1.1E) was also able to be set up to behave similarly: taking the menu File- Preferences- Format- Character Translation and changing from the default (ISO->Mac) to None, and taking the menu Edit- Styles- Element and changing (e.g) the preformatted style to ProFontISOLatin1 (and then reloading the displayed document), things looked correct on-screen not only for the 8-bit characters but also for the &#number; representations, as well as for those &entityname; representations that were supported by this browser. So, that seemed to be better thought-through than Netscape's had been. But again, when an attempt was made to print the document, it came out comprehensively wrong: all three columns came out similar to what Netscape 3 had been printing in the first column (the 8-bit characters).

With this browser it wasn't possible to conduct the copy/paste test, since it didn't seem to support selecting text in the browser's display window.

[MS IE 2 not yet tested - technical and logistical problems - but from reports received, this is a viable method to at least display the ISO Latin-1 repertoire, whereas the first method didn't work due to MS's failure to follow the de-facto Mac/ISO8859-1 translation table.]

Note on Windows encodings

Gil Hurlbut writes:

"Windows handling of some of the undefined ISO-8859-1 codes ($80 to $9F) can present problems for the Mac. The ProFontWindows font has been provided by co-authors Steve Gilardi and Carl Osterwald which uses the Windows character set. The font is compatible with System 7 and Mac OS 8. It is available through Info-Mac and its mirrors as part of ProFont Distribution."

Actually, the iso-8859-1 code points are not so much "undefined": they are assigned to control functions (that admittedly aren't really used in practice) and aren't available for displayable characters. Use of a Windows 8-bit character code is not actually forbidden in HTML, but the client isn't required to accept/support it: HTML2.0 only called for iso-8859-1 to be supported, all other encodings were optional. The issue of sending out an HTML file in Windows encoding is discussed in detail at my pages on character sets in HTML; the important point here is that to be legal, such documents must advertise themselves with the appropriate "charset" parameter on their content-type, and must express the Windows characters as 8-bit character codes. The often-seen usage of &#153; etc. (intended to produce the Windows trademark sign and the like) is technically "undefined" in SGML/HTML. I haven't investigated to what extent any of these (mis)usages actually appear to work if you use the ProFontWindows font, sorry.
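A quick way to see whether a file depends on the Windows-specific area is simply to look for byte values in the range 128-159. The following Python sketch does that, and also shows what those bytes are intended to mean when read with the Windows code page (Python's "cp1252" codec). The function name and the sample bytes are invented for the illustration.

```python
def windows_only_bytes(data):
    """Return the distinct byte values in the range 128-159 (80-9F hex)
    found in data; in ISO-8859-1 these are control codes, not characters."""
    return sorted({b for b in data if 0x80 <= b <= 0x9F})


# Invented sample: trademark sign and curly quotes in Windows coding.
sample = b"Acme\x99 \x93quoted\x94"

found = windows_only_bytes(sample)      # which Windows-only bytes occur
as_windows = sample.decode("cp1252")    # what the author intended to show
```

If the list comes back non-empty, the document cannot honestly be sent out as iso-8859-1; it would need to be advertised with the Windows charset, or the characters replaced by something standard.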

I thank Gil for calling attention to this resource, and have commented on it above, but I must add the caveat that authors should not in any way take this as encouragement from me to write non-standard HTML, OK?

Browser method 3 - mixed fonts

It's possible to think of a solution in which all except "the fourteen" characters would be rendered in the normal way using one of the normal Mac fonts, but when one of the fourteen characters was required, it would switch to a different font - compatible for style and size - (that had been either installed in the normal way, or built in to the browser) to render it. This is, for instance, how various equation editors etc. have worked. Think of dingbats and symbol fonts etc. for an analogy.
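The heart of such a scheme would be to split the text into runs according to which font can render each character. The following is only a toy sketch in Python of that one step, assuming the fourteen code points from the Introduction are the only ones needing the second font; the font names "MainFont" and "FallbackFont" and the function font_runs are hypothetical, and a real browser would of course have a great deal more to do.

```python
from itertools import groupby

# The fourteen characters that the normal Mac fonts lack (ISO-8859-1 codes).
FOURTEEN = {chr(cp) for cp in (0xA6, 0xB2, 0xB3, 0xB9, 0xBC, 0xBD, 0xBE,
                               0xD0, 0xDD, 0xDE, 0xF0, 0xFD, 0xFE, 0xD7)}


def font_runs(text, main_font="MainFont", fallback_font="FallbackFont"):
    """Split text into (font name, run) pairs, switching to the fallback
    font only for the fourteen characters. Font names are hypothetical."""
    runs = []
    for needs_fallback, chars in groupby(text, key=lambda ch: ch in FOURTEEN):
        font = fallback_font if needs_fallback else main_font
        runs.append((font, "".join(chars)))
    return runs


# "1<one-half> cups" splits into three runs: main, fallback, main.
runs = font_runs("1\u00bd cups")
```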

This kind of behaviour would have to be properly engineered into the browser; it's not something that could be fixed-up by swapping-out fonts as described in the first two methods. As such it needs a deep understanding of character handling in applications. But I am not a browser designer and this article isn't really about how to design a browser. For HTML4.0/RFC2070 compliance a much more thorough-going redesign is required anyway - this article was written specifically to address iso-8859-1 issues, and so I'll have to draw the line here.


Caveat on alternative fonts

You can find some fascinating Mac and PC font resources at "Yamada's Babel server".

However, you should be keenly aware that these fonts are laid out in ways that may be quite different from what browsers would need. These fonts are mostly based on the idea of a subcommunity of users who would all be using the same font, so that the same bit-patterns display in the same way for all the users, without reference to any standards for an interchange code. Without corresponding software support in your browser, the use of one of these fonts could make matters worse rather than better if you are trying to get support for, say, Latin-2 or KOI8-R.

