I18n - Browsers and fonts

This page includes some personal experiences (almost entirely related to Win32 platforms) with browsers and fonts.

I'd definitely advise a visit to Alan Wood's Fonts page, which is excellent, and also follow some of his links to additional resources. The MS 'Font Properties Extension' is instructive, but for closer study I'd recommend a font utility such as ListFont or SIL ViewGlyph.

User perceptions

There has been frequent discussion on Usenet in which one or other participant asserts that the popular browser which they use is incapable of displaying a wide character repertoire. As a general statement, this is simply not true: most such cases turn out to be either a misunderstanding of how characters are meant to work on the WWW, or a lack of appropriate fonts or of the browser configuration needed for using them. If you still don't believe that such browsers can do it, take a look at a screenshot made with Netscape 4.08 from around 1998, when properly set up: yes, NN4.08 showing Greek together with several Cyrillic alphabets.

Things have certainly got better in the last few years (this written in 2006), as OSes have come with a wider character repertoire, and browsers have come configured to take advantage. However, there are still plenty of folks running older systems, so I would suggest that authors who plan to use these features of HTML should probably put in a little effort to educate their readers in order to avoid disappointment. In particular, it should be noted that many versions of MS Windows OSes have installed by default with only a limited character repertoire; indeed, multilingual support may be an installation option that you have to request specifically. The MS "web fonts pack", which was distributed chiefly to remedy this shortcoming in Win'95, is now past history (see further discussion below), but remaining Win98 users should make sure they haven't omitted multilanguage support from their OS installation.

Furthermore, there are specific problems (as detailed later in this page) with versions of MSIE up to and including IE6; even if IE7 improves the behaviour (as recent reports suggest that it will), it appears that IE7 will not be available for OS versions before XP SP2, so there will be plenty of folks stuck with older versions for a while yet.

Apparently, the effective character repertoire in Windows OSes can be dependent on which language options have been enabled; and the effects of that can be far from obvious. I found that installing the Japanese language option in Windows 2000 Pro resulted in a whole range of additional characters becoming evident, including characters with no obvious relevance to Japanese, for example additional "Misc Symbols" from the U+26xx range of Unicode.

These details differ between different versions of Windows, it seems; but, in general, users could be recommended to enable any multinational support and/or a selection of language options offered by their OS installation (see Control Panel> Regional Options in Win2000 for example), if they hope to see a wide repertoire of characters - especially in MSIE, but also in third-party browsers such as the Mozilla family or Opera (since these browsers use the fonts that have been installed in the OS).

MS Font Properties Extension

The MS Font Properties Extension is available from the MS free font tools page and was last spotted as Version 2.1, which is now quite old and does not cover some of the newer Unicode character sub-ranges, but is worth getting anyway.

However, it's somewhat short on features, so I'd recommend also getting a feature-rich font browser such as the SIL ViewGlyph utility, or one of those listed on Alan Wood's pages.

Font configuration per browser

Disclaimer: I'm not a browser developer myself, only a user of such systems: I don't understand the internals of font management in Win32 OSes (the main topic here), even less on X, and hardly at all on Mac. So these observations will be a wee bit superficial. But I believe they are nevertheless useful in a practical sense.

Let me stress that, in addition to having properly-composed HTML source with its correct character encoding (charset=) information, two conditions must be fulfilled for proper character presentation: the browser must have access to appropriate font(s), and must have the correct configuration to use them. It may be that, as installed, it's all ready to do that, and requires no further attention. But if it doesn't seem to be working properly, it could be that either of these preconditions is failing. Until you fix the right one, fiddling with the other one might only make matters worse.

References to the "Netscape" browser in here are intended to apply to Netscape Communicator versions 4.* except where otherwise stated. References to "Mozilla" may generally be assumed to apply also to Netscape 6 and later, as well as to other Mozilla-family browsers such as Firefox. Netscape 4 can be considered practically obsolete by now (2005) and will probably be removed at the next major revision of this page, but it's still retained for now.

Netscape 4.* versions

This applies as long as you intend your documents to be compatible with the remaining users of Netscape 4.* versions. The key issue, in the context of this note, is that for getting coverage beyond a single 8-bit repertoire per document, Netscape4 needs to be coaxed into believing that it is working with Unicode, and to be given a comprehensive Unicode font for the purpose: it cannot piece together different writing systems from several fonts as the newer browsers can do. The usual ways of getting Netscape4 into its unicode mood (as discussed elsewhere in this area) are either to actually use utf-8 coding, or to use &-notations in a document that is coded in us-ascii, and then to pretend that the document is coded in utf-8 (which it is, in a degenerate sense, since us-ascii is a proper subset of utf-8). Just a routine warning that forms submission from such a page by NN4.* can be disastrous.
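The "degenerate utf-8" trick can be checked mechanically. The following is only a minimal sketch in Python (my choice of language for illustration, and the function name is hypothetical): it writes every non-ASCII character as a &#number; notation, confirms that the resulting bytes read identically as us-ascii and as utf-8, and confirms that resolving the notations gives back the original text.

```python
import html


def to_ascii_with_ncrs(text: str) -> bytes:
    """Encode text as us-ascii, writing each non-ASCII character as &#number;."""
    return text.encode("ascii", errors="xmlcharrefreplace")


sample = "Ελληνικά and Русский"
payload = to_ascii_with_ncrs(sample)

# Every byte is below 0x80, so the payload is valid us-ascii ...
assert all(b < 0x80 for b in payload)
# ... and decoding those same bytes as utf-8 gives exactly the same string.
assert payload.decode("utf-8") == payload.decode("ascii")
# Resolving the &#number; notations recovers the original characters.
assert html.unescape(payload.decode("ascii")) == sample

print(payload.decode("ascii"))
```

A document body produced this way can be advertised as either charset=us-ascii or charset=utf-8 without changing a single byte, which is what makes the pretence harmless.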

Suitable Unicode fonts are shown below.

Win MSIE (5, 5.5, 6)

MSIE uses fonts differently, and better, than Netscape4.*, although not so well as do Mozilla and Opera (to name but two). For different 'language scripts' (MSIE's terminology), IE can pick groups of characters out of different fonts, within the same web page. Given adequately populated fonts, this allows IE to handle correctly those pages prepared in an 8-bit (or even 7-bit, i.e. us-ascii) coding which contain characters (&-notations) from several different language groups in the same page. So MSIE supports this aspect of HTML4 correctly, in a way that Netscape4.* was architecturally incapable of doing.

Configuration of the default fonts in Win IE is via the dialog: Tools> Internet Options> General> Fonts - note however that the option for insisting on the browser's configured fonts rather than those specified by the incoming document is in the "Accessibility" dialog, not in the "Fonts" dialog.

For 8-bit character codings, unlike the defective Netscape4.*, IE sees no problem with rendering characters that are not in that 8-bit repertoire, when the HTML calls for them by means of &- notations.

The rest of this section deals predominantly with observations of IE's behaviour when the coding implies Unicode, i.e typically utf-8.

In its default configuration, MSIE can only display characters which are in Unicode "Plane 0" (the BMP or Basic Multilingual Plane), which is a hifalutin' way of saying those characters whose Unicode numbers do not exceed U+ffff. For NT-family OSes (NT4, Win/2000, Win/XP) there is optional support from MS, described by Tex Texin, i18nguy.com for those who wish it. He also describes a manual workaround which needs no extra fiddling with registry entries or loading extra software, but which only works for characters expressed in &#number; notation - not for characters encoded in utf-8. See the cited URL for details.
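To see which characters in a given text fall outside Plane 0, something along these lines can be used. It is a small sketch only; the function names and the sample string are my own illustrative choices, not taken from any of the cited pages.

```python
def unicode_plane(ch: str) -> int:
    """Return the Unicode plane number of a single character (0 is the BMP)."""
    return ord(ch) >> 16


def outside_bmp(text: str) -> list:
    """List, as &#number; notations, the characters whose code points exceed U+FFFF."""
    return [f"&#{ord(ch)};" for ch in text if ord(ch) > 0xFFFF]


# Latin A, Cyrillic Zhe (U+0416), a Misc Symbol (U+2603), a musical symbol (U+1D11E)
sample = "A\u0416\u2603\U0001D11E"

for ch in sample:
    print(f"U+{ord(ch):04X} is in plane {unicode_plane(ch)}")

print(outside_bmp(sample))
```

Only the last character of the sample is reported as outside the BMP; the &#number; form that the second function produces is the notation to which the manual workaround mentioned above applies.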

When configuring default fonts for language groups, MSIE evidently knows which of your installed fonts are suitable for each language group, and in this way restricts your selection in the font configuration dialog: however, as we will see, this isn't quite the whole story. The following details were first derived in detail with WinMSIE5.5 on Win/NT4, but other versions of IE and different MS-Windows OSes seem to be consistent with this too.

The actual selection of which font to use for displaying particular character ranges is more complex than this, and hasn't been understood in every detail. Following some observations made by James Kass and reported in somewhat worrying terms, I started to look at this area by at least studying how the default font selection works, relative to the configuration options provided. However, on closer study, I concluded that Kass's recommendation to use "User Defined" encoding for viewing the above-cited page is more in the nature of a diagnostic technique, and would not really be suitable for general web page authoring.

In fact, when I view the samples at the foot of his test page in Win MSIE 5.5 or 6 with the View Encoding set to utf-8, most of the texts display convincingly (albeit a few characters are missing here and there). When I set it to User-defined, on the other hand, more than half of the characters are missing: the browser is trying to use fonts that I haven't got, and the user-defined coding prevents it from finding a substitute font. For any normal authored web page, the former behaviour seems to be the desirable one - hence, utf-8; whereas for diagnosis, he understandably prefers the latter result - hence, user-defined coding for diagnostics.

So, I should stress that I was working with the character encoding advertised as utf-8, the same as I would be using for other browsers - especially Netscape-4.*.

Language scripts in MSIE
Folder   Lang. Script
3        Latin
5        Greek
6        Cyrillic
7        Armenian
8        Hebrew
9        Arabic
10       Devanagari
11       Bengali
.....
24       Japanese
25       Chin. trad.
26       Chin. simpl.
.....
28       Canad. syllab.
29       Cherokee
.....
31       Braille
.....
35       Syriac
.....
40       User Defined

The first curious observation was that if a distinctive font (e.g Arial Black) was configured for "Latin" scripts, then it was evident that this font was also being used as the actual default for some other language scripts, even though it had not been configured for them. A bit of detective work with the Registry Editor showed that at the registry path
HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\International\Scripts
there was a set of folders labelled with small numbers, and in these folders were found the font names which had been configured for the various "language scripts". In fact, there were just 37 such folders, and the font selection dialog has a language scripts menu with exactly 37 entries: clearly this is no coincidence. By changing the font selections and hunting around with Regedit, it was possible to identify which folder was which, and a selection of them are tabulated, in numerical order, in the table above. Note that these numbers follow a different ordering and grouping from those in Unicode itself, as well as being unrelated to the (alphabetical) ordering in which the scripts appear in the configuration pull-down menu.
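For anyone who would rather not click through Regedit, the same folders can be listed with a short read-only script. This is a sketch only, assuming a Windows machine on which the registry path above exists; it uses Python's standard winreg module, the helper names are my own, and it reports whatever value names it finds in each folder rather than assuming what they are called.

```python
import sys

SCRIPTS_PATH = r"Software\Microsoft\Internet Explorer\International\Scripts"


def numeric_order(names):
    """Sort folder names such as '3', '10', '5' by number; any non-numeric names go last."""
    return sorted(names, key=lambda n: (not n.isdigit(), int(n) if n.isdigit() else 0, n))


def read_script_fonts():
    """Return {folder name: {value name: data}} from the Scripts key (read-only)."""
    import winreg  # standard library, but only present on Windows

    result = {}
    with winreg.OpenKey(winreg.HKEY_CURRENT_USER, SCRIPTS_PATH) as scripts:
        subkey_count = winreg.QueryInfoKey(scripts)[0]
        for i in range(subkey_count):
            name = winreg.EnumKey(scripts, i)
            with winreg.OpenKey(scripts, name) as folder:
                value_count = winreg.QueryInfoKey(folder)[1]
                values = {}
                for j in range(value_count):
                    value_name, data, _type = winreg.EnumValue(folder, j)
                    values[value_name] = data
                result[name] = values
    return result


if sys.platform == "win32":
    try:
        fonts = read_script_fonts()
    except OSError:
        print("Scripts key not found on this machine")
    else:
        for name in numeric_order(fonts):
            print(name, fonts[name])
```

Changing one font in the IE dialog and running the script again should show which numbered folder belongs to which language script, which is the same detective work as described above.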

As I said: if one configures the browser to use a distinctive font, say Arial Black, for the "Latin-based" group, then the browser is found to be using this font also for Greek and Cyrillic, which are indeed the other writing systems which it covers - irrespective of what is configured for Greek and Cyrillic. If, on the other hand, Latin is configured to use a font which does not cover any other language scripts (in Win/2K this might be for example "Albertus Extra Bold") then the font configuration selection(s) for Greek and/or Cyrillic come into play.

So: configuring the Latin font seemed to propagate to the font selection for all the language scripts which that chosen font covered. Such a procedure could indeed produce more-harmonious-looking results, since it attempts to minimise font changes; but on the other hand it could have the consequence of an incompletely-populated font (e.g Lucida Sans Unicode) being picked in preference for some particular language group, as a consequence of it having been selected for "Latin", even though a better-populated font for that language group was available (say, Arial Unicode MS or Code2000) and had actually been configured for that group. This results in IE displaying missing-character placeholders when one attempts to display characters that are not present in the browser-chosen font, even though the configured (but ignored) font for that language group does in fact contain them! There's an example of this behaviour reported for Urdu below.

On the other hand, if the text were Devanagari, which Lucida Sans Unicode doesn't cover, then IE would know better, and would look for another font. And indeed if I try to configure IE's font setting for Devanagari, the only option it gives me (on the particular machine tried) was Arial Unicode MS.
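The behaviour observed so far can be summarised as a toy model. This is emphatically not IE's actual algorithm, which I have no knowledge of: it is just a sketch of the rule "the Latin font wins for every script it covers", with a hypothetical function name and with script coverage simplified to whole-script sets (the set given for Lucida Sans Unicode in particular is a simplifying assumption).

```python
# Script coverage per font, simplified to whole scripts (illustrative assumption).
COVERAGE = {
    "Arial Black": {"Latin", "Greek", "Cyrillic"},
    "Albertus Extra Bold": {"Latin"},
    "Lucida Sans Unicode": {"Latin", "Greek", "Cyrillic"},
    "Arial Unicode MS": {"Latin", "Greek", "Cyrillic", "Devanagari"},
}


def observed_ie_font(script, configured, coverage=COVERAGE):
    """Model of the observed default: the Latin font is used for every script it
    covers; only otherwise does the font configured for that script come into play."""
    latin_font = configured.get("Latin")
    if latin_font is not None and script in coverage.get(latin_font, set()):
        return latin_font
    return configured.get(script)


configured = {
    "Latin": "Arial Black",
    "Greek": "Arial Unicode MS",
    "Devanagari": "Arial Unicode MS",
}
print(observed_ie_font("Greek", configured))       # Arial Black, not the configured Greek font

configured["Latin"] = "Albertus Extra Bold"
print(observed_ie_font("Greek", configured))       # Arial Unicode MS

configured["Latin"] = "Lucida Sans Unicode"
print(observed_ie_font("Devanagari", configured))  # Arial Unicode MS
```

The missing-character problem arises in the first branch: the model, like the browser as observed, never checks whether the Latin font contains every individual character of the script it nominally covers.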

Similar considerations apply to author-specified font names, but I have only scratched the surface of how MSIE handles this in detail. Curiously, the discussion at the foot of Kass's page complains that font selection isn't working when utf-8 is used. I can't really speak for CJK scripts, as this is not my field, but that certainly wasn't the impression I got from the writing systems which I was trying. And it seems to me that trying to use user-defined coding in an attempt to circumvent IE's selection of the "wrong" font, is far more fraught with pitfalls in a general WWW context. More about this topic area under the "Monospaced fonts" subheading below.

The above observations all relate to a browser installed for the British locale - so it's not surprising that the browser treats the Latin script as its first choice; I guess it may be that the font search is modified if the browser is installed for some other locale...(?) It's also intriguing that the folders 1, 2 and 4 are missing, and one naturally wonders what the gaps might be intended to hold...?

I don't think there is any great cause to get worried about what is reported above. For most non-specialist i18n practical purposes, it seems to me that the results will be at least viable. If you configure a Unicode font with an extensive coverage (such as Arial Unicode MS or Code2000) for all the language groups of interest to you, then IE is capable of displaying correctly (although those choices are admittedly less than ideal on cosmetic grounds). Don't forget that an inappropriate choice of font for the Latin-based selection can have knock-on effects on other language groups, as already noted. Complicated choices of font, whether in the browser configuration or by the author, seem liable to make things worse, maybe even dramatically worse, than leaving well alone (there is an option on the browser's "Accessibility" menu for disabling document-specified font choices if found necessary).

Effect of character encodings

However, the behaviour described above does not in general apply when a document is sent out with other encodings. For example, a document sent out with iso-8859-5 would cause IE to use the "Cyrillic" font selection, even for display of Latin characters contained in the document. And similarly for other encodings associated with a specific writing system (IE does not appear to take account of HTML language attributes in making such a decision, unlike Mozilla).

Win IE installation

It's been noted that some Windows OSes installed by default with versions of the usual fonts (Arial, Verdana etc.) which contain only a restricted repertoire of characters. This was notoriously the case for the US version of Win98, for example: in order to get a better coverage, one should install the "Multilanguage support" option. The font file names are the same, but the difference is evident in their file sizes (for Win98, Arial should be above 300kByte, whereas the restricted-repertoire font was only around a third of that). Later versions of MS Windows OSes seem to come with progressively more comprehensive fonts under the same names as before. Normally their version number changes, but I haven't done a close study of that: interested readers might want to review the history of older versions of their favourite fonts at the MS Typography Fonts and products page.

Note by the way (just in case it wasn't evident) that installing additional fonts, or enhanced versions of existing fonts, is an OS-wide operation: it makes them available not only to MSIE but also to any other browsers which you use.

As already noted under "User perceptions", the effective character repertoire in Windows OSes can depend on which language options have been enabled, and the effects of that can be far from obvious: installing the Japanese language option in Windows 2000 Pro made a whole range of additional characters available which had no obvious relationship to Japanese, for example quite a number of additional "Misc Symbols" from the U+26xx range of Unicode. A similar effect was seen with Windows XP.

Win/IE: Tooltip Fonts (and window title too)

It's been noticed that IE's tooltips may exhibit only a limited character repertoire (possibly differing from one version of the OS to another). For example, a title= attribute containing the trademark character (™) displayed only a rectangular placeholder on the tooltip.

A chance remark on a Usenet group brought me to a page by Jukka Korpela about setting the size of the tooltip font: the recipe given there (Windows Control-Panel> Display Properties> Appearance> Item: ToolTip) can also be used to change not only the size but also the selected font (which by default in earlier Windowses seemed to be "MS Sans Serif", a rather impoverished font even by Windows western-default standards!), and by selecting a richer-repertoire font I got much better results with the tooltips. In Windows/XP the configuration is found in ...Appearance> Advanced> Item: ToolTip. This is far from obvious, and few users will ever become aware of it I guess, so as web authors we would do well not to over-tax their browsers; but as web readers we might get some small advantage from it in configuring our own browsers, and dropping hints for other interested readers.

And I made the corresponding change for the window title bar, which has also had a positive effect when viewing pages which have i18n content in their <title> element.

Mozilla (also Firefox, SeaMonkey)

Generally speaking, Mozilla makes good use of the available fonts: even if the desired character isn't in the appropriate default font, Mozilla seems to hunt it down in some other font. Occasionally this can produce incongruous results when the fonts look very different, but at least the character is found from somewhere, rather than being displayed with a useless empty frame or place-holder.

Font chosen
charset=        lang=           Font
utf-8           (none)          Western
utf-8           he (Hebrew)     Hebrew
utf-8           yi (Yiddish)    Western
iso-8859-8-i    (none)          Hebrew
iso-8859-8-i    he (Hebrew)     Hebrew
iso-8859-8-i    yi (Yiddish)    Western

The browser font preferences are found on the menu Edit> Preferences> Appearance> Fonts; the upper pulldown offers a number of what appear to be language scripts (Western, Japanese, Cyrillic etc.) but also has entries labelled "User Defined" and "Unicode", which seem as if they would relate to character coding or font arrangements rather than to languages or writing systems.

Some interactions have been observed between the character coding, the HTML lang attribute, and the choice of default font. A.Prilop reported, and I confirmed on several OSes and a certain range of Mozilla versions, a choice of different fonts for samples containing both Hebrew and Western content (all coded in ASCII, and using &#number; representations as necessary), depending not only on which character coding was advertised from the server but also which lang attribute was used on the element. The results which we obtained at the time, as we interpreted them, are tabulated: the browser applied these font choices to both the Western and Hebrew character samples which were offered. It should be noted that the (distinctive) fonts which were configured under all of the relevant headings (Western, Hebrew and Unicode) all contained both Western and Hebrew characters in their font repertoire.

We later broadened the scope of the study: I looked at Greek and Cyrillic, while he looked at Arabic. Of course this only represents observations at one point in time; later the Mozilla folks started handling Yiddish in the same way as Hebrew. But the principles are the same, even if the details change as additional options are supported.

Basically it seems that Mozilla takes its clues about which default preference to use, not only from the character encoding, but also from any lang= attributes on the HTML markup. For example, if the character coding is iso-8859-5 or if the language is lang="ru" (Russian) then it takes its font preference from the Cyrillic selection. It's important to understand that it does this not only for the Cyrillic characters in the text, but also for any other characters in the text so long as they are also in that font's repertoire.

And correspondingly for iso-8859-7 or lang="el" (Greek).

Thus we can have Western and Greek characters rendered using the Cyrillic font preference (because the document was coded in iso-8859-5 and/or marked as language Russian); conversely the Western and Cyrillic characters might be rendered using the Greek font preference (because the document was coded in iso-8859-7 and/or marked as language Greek). In situations where the character coding (e.g iso-8859-7 Greek) and the language attribute (e.g ru=Russian) contradict each other, then Mozilla gives priority to the language attribute (i.e in that particular example, text would be rendered using the Cyrillic font preference).

The same effects were seen in Hebrew and in Arabic, as I noted above, and any other languages which Mozilla has been coded to recognise.

If the coding is utf-8 (or other Unicode coding) then the coding itself is language-neutral, but Mozilla still can use the language attribute to select a font preference (i.e it doesn't necessarily use its "Unicode" font preference for unicode character codings, in fact this font rarely seems to put in an appearance in practice, if one has well-populated fonts configured for the language script preferences).

When Chinese, Japanese, or Korean is encoded in a Unicode coding scheme, without explicit language information, the browser has to make some choice, and I'm told that Mozilla chooses Japanese by default. One should specify the language explicitly, for this and for other good reasons.

So far, so good: but Mozilla only recognises a certain number of languages in this respect (this number may change with version, of course), and behaves differently if the language attribute is set to something it doesn't explicitly support. This is why we didn't at first understand what we were seeing with Persian (Farsi) and Yiddish. Here, Mozilla (in the version we reviewed) used either the Western font (for Western characters?), or the Unicode font preference (for all the rest?), even though the character coding would imply something else (Arabic or Hebrew respectively, in the cases in point).

I'm not saying there's anything wrong with this: it doesn't violate any specifications as far as I know; but it's a bit confusing as to what is going on.
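To make the pattern easier to follow, here is the observed Mozilla behaviour written out as a small lookup. It is a sketch of what we saw at the time, not a description of Mozilla's code: the function name is hypothetical, only the codings and languages mentioned on this page are included, and unrecognised languages are simply mapped to "Western" as in the tabulated Yiddish results (ignoring the possible use of the "Unicode" preference for non-Western characters).

```python
# Font preference implied by a recognised lang= value (languages named in the text).
LANG_GROUP = {"ru": "Cyrillic", "el": "Greek", "he": "Hebrew"}

# Font preference implied by the advertised character coding.
CHARSET_GROUP = {
    "iso-8859-5": "Cyrillic",
    "iso-8859-7": "Greek",
    "iso-8859-8-i": "Hebrew",
}


def observed_mozilla_group(charset, lang=None):
    """Return the font-preference heading that the tested Mozilla versions appeared to use."""
    if lang is not None:
        # The language attribute has priority over the coding. A language that
        # Mozilla did not recognise (Yiddish, at the time) ended up as "Western".
        return LANG_GROUP.get(lang.lower(), "Western")
    # No language given: fall back on the coding; utf-8 itself is language-neutral.
    return CHARSET_GROUP.get(charset.lower(), "Western")


for charset in ("utf-8", "iso-8859-8-i"):
    for lang in (None, "he", "yi"):
        print(charset, lang, observed_mozilla_group(charset, lang))

# Contradictory coding and language: the language attribute wins.
print(observed_mozilla_group("iso-8859-7", "ru"))   # Cyrillic
```

The loop reproduces the six rows of the "Font chosen" table, and the last line reproduces the iso-8859-7 versus Russian example.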

Opera

Unicode coverage was introduced with version 6.

A quick check with version 7.01 suggested that it had no problems with displaying Unicode characters, even those in Plane 1 (e.g the musical symbols in U+1D1xx).

Typefaces versus Fonts

A.Prilop rebukes me for not drawing a proper distinction between typefaces and fonts. For example, in Windows for the typeface Arial there are four fonts: Arial, Arial Italic, Arial Bold and Arial Bold Italic; and the browser will choose the appropriate one in order to get best results. Arial Unicode MS, on the other hand, only has the one font, and the italic and bold versions have to be derived from it, with results that can be suboptimal.

In the information shown by the Font Properties Extension, the "Names" tab draws a distinction between the "Font Name" and the "Font Family Name". Often, these are the same, but if we take for example Arial Bold Italic (or the other Arial-typeface fonts) then the Font Family is shown as "Arial". With "Arial Black", or "Arial Unicode MS", on the other hand, in each case there is only the one font, and the font family name is the same as the font name.

The point is well taken; but since the exact rules for matching a CSS font family specification to the available fonts are (for good reason) not exactly laid down in the CSS spec, and also are not well understood by me, it's difficult to be sure that I'm always using the terms correctly. Nevertheless, I hope this presentation is found useful in its context.

Typeface character repertoire discrepancies

The distinction between font and typeface can indeed be significant in regard to character repertoire: cases have been seen where a particular character which was present in the regular style was missing in one or more of the bold, italic, and/or bold italic styles. This resulted in certain browsers failing to display the character(s) in question in bold, italic etc. styles, even though they displayed the character(s) successfully in the regular style.

Browsers affected at the time of the tests included not only MSIE but also Opera, whereas Mozilla (and presumably other browsers based on the same base code) were not fooled by this problem: they must have found the glyph from somewhere else.

In a particular example reported on a German-speaking usenet group, the person reporting the problem had a version of the Arial font which contained the euro character, €, in the font's normal style, but this was missing in the bold italic style, resulting in the display of a missing-character glyph. Recall that the name of a font (family) does not necessarily say anything about its character repertoire: many different versions of a font, in this case Arial, may exist, with differing - in some cases very widely differing - character repertoires.

There's really nothing that a web author can do to positively counteract these possible discrepancies at the user side. If we assume that the normal style is likely to be the one with the richest character repertoire, the best that one could do would be to avoid trying to display unusual characters with anything other than the normal style. As I've said elsewhere: when a challenging character repertoire is needed, there can be advantages in not trying to set a specific named font, leaving the well-informed reader to configure their browser to use the best font that they have available (something that's not known to the document author, of course).

WGL4

The WGL4 repertoire is a subset of characters, defined by MS as the core repertoire which they were aiming to support for their range of pan-European fonts. Page authors might consider it useful to confine their character usage to this repertoire, in order to avoid the risk of occasional oddball characters failing to be rendered in their pages - particularly when being rendered with IE6 (as we know, browsers such as Mozilla or Opera have ways of locating oddball characters from other fonts, lowering the probability of a failure). See also Alan Wood's page on WGL4.

Fonts and Security - an MS Windows issue in 2003

In 2003 I read that MS had recognised that defective fonts can be a cause of OS instability and even a security hazard, and had taken steps to better enforce the conformance of fonts that are to be used in Windows: see the discussion on comp.fonts around 20 Nov 2003. This may of course have consequences for some third-party fonts, but it appears there are utilities which can be used to repair such problems (this is not my field, so I can do no more than to drop hints here).

Available Unicode fonts for Win32

One can research the names of fonts which MS provides with various products, at their page Microsoft Typography - fonts and products, and get some idea of their characteristics. Unfortunately these pages seem to studiously avoid giving any real information about the character repertoire which is covered by each font, and, as I've noted elsewhere, I've seen fonts with the same name but whose file size differed by a factor of 3, the one having a distinctly crippled character repertoire, apparently for USA users, compared to the one intended for "multinational" use. So, all that I can say is "take care".

I'm aware of the following fonts which it may be useful to mention, available under various terms (but be sure to visit Alan Wood's Unicode resource pages cited above, to learn more than I know). I am not a lawyer and this is not legal advice, but please check the licensing terms of each download and ensure that they cover your intended usage. Font foundries put a great deal of work into their products, and it would not be surprising if they were to pursue any discovered abuses of their copyright etc.

Bitstream Cyberbit version 2.0

This is available for download from the Netscape FTP site: Netscape recommended it for use with Netscape 4.*. The free license permits non-commercial use.

The Cyberbit font represents an installed font file size a little over 12MBytes. Access is rather slow, especially on a machine that is low-powered or short on memory. If you don't need CJK coverage, then you might just install the Cyberbase download, which is much smaller - see the relevant ReadMe section for details. This is a TrueType serif font, and the typeface contains only one font, denoted "Roman" (this term refers to the font cosmetics, it's got nothing to do with its writing-system repertoire!).

Arial Unicode MS

This is supplied with certain MS products such as MS Office. MS no longer make it available for downloading from their web site.

It is, of course, a sans-serif font. It contains a massive character repertoire, covering the whole Unicode 2.0 specification, considerably wider than Cyberbit, and weighs in as an approx 23MByte font file. However, in spite of its size it appears to work faster than the Cyberbit font. It gives a clean appearance for normal text, but, as already mentioned, italic and bold styles have to be derived, since the typeface contains only a regular font.

This font is not particularly necessary for MSIE users, since, as already stated, MSIE can pick and choose different character groups from different fonts (with some limitations in the mechanism, relative to what Mozilla does - see description elsewhere on this page). However, its use could be recommended with MSIE if you want a good coverage of mathematical operators and symbols.

Lucida Sans Unicode

This is a rather small font, as "i18n" fonts go, covering a reasonable number of glyphs (1776 in version 2.00). The typeface contains only a Regular font. It comes as standard with Windows/NT4 and later, and with Windows/98 (though it did not come with Windows/95).

Code2000

...and other fonts offered by James Kass

Palatino Linotype

Delivered with e.g Windows 2000, and includes a good coverage of writing systems for the general European area, including Cyrillic, polytonic Greek, etc. The typeface includes Regular, Italic, Bold, and Bold Italic fonts. If you, as a user, have this font (and if you're content with reading serif fonts in your browser), it would probably make a good choice for the browser configuration for Latin, Greek and Cyrillic writing systems. As an author, however, I would still recommend that in an "i18n" situation it's better not to try to force a font selection on your readers.

MS Web Fonts pack

MS used to offer a "Web Core Fonts pack", but have now withdrawn it from their web site: it is no longer relevant to any Windows version currently supported by MS. However, corefonts at SourceForge noted that the EULA permitted redistribution in unamended form, and proceeded to offer their copy for download. These fonts replace some of the fonts originally distributed with earlier MS OS installations with fonts of the same name but having a wider character repertoire. (But don't install them on newer MS OSes which already have better versions of these fonts!)

The Corefonts web page concentrates on installing the Core fonts into Linux versions. In fact, modern Linux distributions have some excellent non-MS i18n fonts, perfectly capable of rendering challenging i18n content. As far as I can see, the only motive nowadays for installing these Core fonts into Linux would be for compatibility with some web page authors who will insist on proposing MS-only fonts for display of their web pages. But browsers usually implement some competent fallback (maybe guided by browser configuration) when none of the fonts proposed by the author is available, so this should not be a show-stopper, even in the absence of the "Corefonts".

These are not full-blown Unicode fonts like the multi-megabyte monsters mentioned above, but for most Western-based users I would expect that they will be found quite adequate, with a reasonable repertoire of math operators, Greek and other non-CJK writing systems.

Monospaced fonts

Here the situation seemed to be less satisfactory (and especially in MSIE). On being asked to investigate problems with i18n rendering in a <textarea>, I came to the conclusion that the reported problems were also present in any monospaced context, such as <tt> or <pre>. The character repertoire of the monospace fonts provided by MS seemed to be rather incomplete, relative to the rather extensive proportional fonts available.

An attempt to hunt down suitable monospace fonts available free on the web produced a number of hits, but all of them proved to be unsatisfactory for one reason or another: for those which were otherwise suitable, it was found impossible to select them in MSIE, as they were evidently not marked as monospaced in their font data.

To be specific (links here are to the TrueType reference manual at the Apple developer web site), there are two places where a TrueType font needs to be marked to designate it as monospaced: the "OS/2 table" and the "post table". If this is done consistently, then indeed MSIE offers the font as a configurable "plain text" (i.e. monospaced) font in its Fonts dialog.
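As a rough illustration of those two markings, here is a minimal Python sketch that reads them from the raw bytes of a font's "post" and "OS/2" tables. It rests on assumptions not stated above: that the "post" table holds isFixedPitch as a 32-bit value at byte offset 12, and that the "OS/2" marking in question is the PANOSE bProportion byte (the fourth of ten PANOSE bytes starting at offset 32), where the value 9 means monospaced for Latin text faces. The function names are invented for this example, the tables shown are synthetic, and locating the tables inside a real font file is left out.

```python
import struct

# Assumed field positions (see the lead-in); not taken from any particular tool.
POST_IS_FIXED_PITCH_OFFSET = 12  # follows version, italicAngle, underline position/thickness
OS2_PANOSE_OFFSET = 32           # start of the 10-byte PANOSE classification
PANOSE_PROPORTION_INDEX = 3      # bProportion is the fourth PANOSE byte
PANOSE_MONOSPACED = 9            # "monospaced" for Latin text faces


def post_is_fixed_pitch(post_table):
    """True if the 'post' table's isFixedPitch field is non-zero."""
    (flag,) = struct.unpack_from(">I", post_table, POST_IS_FIXED_PITCH_OFFSET)
    return flag != 0


def os2_is_monospaced(os2_table):
    """True if the PANOSE bProportion byte in the 'OS/2' table says monospaced."""
    panose = os2_table[OS2_PANOSE_OFFSET:OS2_PANOSE_OFFSET + 10]
    return len(panose) == 10 and panose[PANOSE_PROPORTION_INDEX] == PANOSE_MONOSPACED


def consistently_monospaced(post_table, os2_table):
    """True only if both tables carry the monospace marking."""
    return post_is_fixed_pitch(post_table) and os2_is_monospaced(os2_table)


# Synthetic 'post' header: version 3.0, italicAngle 0, underline values, isFixedPitch = 1.
post = struct.pack(">iihhI", 0x00030000, 0, -100, 50, 1)

# Synthetic 'OS/2' table, zero-filled up to the end of PANOSE: not marked as monospaced.
os2_unmarked = bytes(OS2_PANOSE_OFFSET + 10)

# The same table with bProportion set to "monospaced".
os2_marked = bytearray(os2_unmarked)
os2_marked[OS2_PANOSE_OFFSET + PANOSE_PROPORTION_INDEX] = PANOSE_MONOSPACED

print(consistently_monospaced(post, os2_unmarked))  # False: only one table is marked
print(consistently_monospaced(post, os2_marked))    # True: both tables agree
```

A font marked in only one of the two places is the kind of inconsistency that would be reported as "not monospaced" by this check.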

For help in understanding the behaviour of monospaced fonts, I also added an extra column in my unicode test charts!

I located some pointers to Unicode monospaced fonts on a page about fonts supporting Greek, and downloaded what was offered as a monospace font, by George Williams. Unfortunately it too was missing the monospace indicators: MSIE behaved as if it were a proportional font. MSIE also refused to select it for any "language script" other than User-defined, and the MS "font properties" tool confirmed that the language-script information was missing from the font. (However, see "later developments" below.)

In what follows, please note that references to "Monospace" refer to George Williams's font family called "Monospace" (fonts called "Monospace Roman", "Monospace Oblique" and "Monospace Bold"), and not to the CSS-defined "generic" font monospace.

Subsequently, in MSIE5.5 I tried applying a user stylesheet in which some markups which imply monospace fonts (tt etc.) were styled using {font-family: "Monospace" !important;} but this didn't produce the desired effect as long as the encoding was set to utf-8; when, however, the View>encoding was set to User-defined, the "Monospace" font sprang into view. The deduction would seem to be that this was because the font was lacking its Unicode/charset data, and seems to be an echo of the kind of behaviour reported by James Kass, and quoted earlier in the present web page.

In relevant Windows OSes, it appears that one can override the limited choice offered in the browser dialog by setting the default font directly using Regedit: however, this is surely not the sort of thing that a web page author could be comfortably recommending to readers whose familiarity with Windows is unproven!

My original informant later reported that he had been successful by using CSS to select fonts for the <textarea>, and I was able to confirm this also: although MSIE5.5 had refused to select proportional fonts as "plain text" fonts via the browser configuration dialogue, it seemed quite willing to select, for example, "Arial Unicode MS" for the <textarea> by means of CSS, even though it is not a fixed-pitch font: my informant found the results acceptable for his immediate purposes.

Later developments:

"Monospace" by George Williams: A.Prilop reported that he had successfully produced a version of George Williams' "Monospace" font that includes the correct "language scripts" data, and monospace indicators. Contact was made with the original author, who said that the "Bibliofile" web site is no longer maintained, and referred us to the PfaEdit site (which is now FontForge).

Correspondence showed that he (G.W) did not object on principle to alternative builds of his font being made available by third parties (subject to his original terms). A.Prilop suggested that I should make the rebuilt TTF file available.

The author states on the Monospace font web page that he is "no longer extending this font", and directs readers to the Freefont Project. As reviewed in 2005, this project indeed includes a monospace font family, named FreeMono, with regular, bold, oblique and bold oblique faces. Unfortunately this font does not seem to be marked as a fixed pitch font: just as with the earlier problems reported above, MSIE will not select it as its default "plain text" (monospaced) font. "SIL ViewGlyph" reports amongst the font's properties "Is FixedPitch: No", which confirmed the suspicion.

At time of writing (June 2005) this had turned into a stand-off with FontForge. According to the TrueType specification, a font (even a monospaced font) is required to contain certain zero-width glyphs (e.g. NUL, BS etc.), but FontForge refuses to mark a font as monospaced unless all of its glyphs are of identical width. When this issue was pointed out to FontForge, a correction was refused. Consequently it seems that a correctly-built monospaced font (with the mandatory zero-width glyphs) will never be marked as monospaced by FontForge, making the feature rather useless. Such fonts would have to be marked as monospaced by the use of some additional font editing tool before they could be satisfactorily used in the contexts under discussion here. This issue is evidently impacting the FreeMono font (built with FontForge), just as it would for any other such font.
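The disagreement can be pictured with a small Python sketch. This is not FontForge's code, merely an illustration of the two possible rules: a strict test that demands identical widths for every glyph, and a lenient one that sets aside zero-width glyphs before comparing. The function name and the sample widths are invented for the example.

```python
def is_monospaced(advance_widths, ignore_zero_width=False):
    """Return True if all glyph advance widths considered are identical.

    With ignore_zero_width=True, zero-width glyphs (such as the mandatory
    NUL and BS glyphs mentioned above) are left out of the comparison.
    """
    widths = set(advance_widths)
    if ignore_zero_width:
        widths.discard(0)
    return len(widths) == 1


# Invented widths: two zero-width glyphs plus three ordinary 600-unit glyphs.
sample_widths = [0, 0, 600, 600, 600]

print(is_monospaced(sample_widths))                          # False under the strict rule
print(is_monospaced(sample_widths, ignore_zero_width=True))  # True under the lenient rule
```

Under the strict rule, any font that includes the zero-width glyphs fails the test, which is the outcome described above.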

Everson Mono: There is what looks to be an excellent monospace font, Everson Mono, available as shareware (individual licence €25 at Aug 2004). It aims to cover a wide range of non-Han glyphs (although at time of writing it says its Arabic support is not ready). The font evidently contains the proper support for Unicode, code pages, and the monospace indicator flags, as MSIE is happy to select it when appropriate, unlike the problems described earlier with other fonts.

Monospace contexts didn't seem to present such a problem to Mozilla or to Opera 6, anyway: although the chosen glyphs for some of the unusual characters did look a bit out of style with the other characters, at least the browser had found some way of displaying what was wanted, even for those characters which weren't available in the configured default font.

Urdu notes

Urdu is written right-to-left, basically in Arabic script, but using some additional characters (which are included in the repertoire of Windows codepage 1256). The particular issue discussed here is unrelated to text-direction. A.Prilop reports in relation to Windows 2000 for example:

Only two typefaces (Tahoma and Arial Unicode MS) include the special Urdu letters, even though other fonts claim to cover cp1256 (Andalus, Arial, Arabic Transparent...). To display such characters, Netscape 7 (Mozilla) takes first the typeface specified under "Edit > Preferences > Fonts for Unicode". Failing this, it takes "Arial Unicode MS" as a last resort.

Up to this point, the behaviour isn't specific to Urdu: Mozilla-based browsers recognise that the character is missing and pick it from another font, whereas MSIE just displays the "missing glyph" indication. Andreas wonders whether there might be some special cases where, after all, results could be better if authors suggested an appropriate typeface (here: Tahoma) from CSS. However, when I viewed his test page on Win/NT4 using Tahoma "just as he'd intended", I got a load of "missing glyph" indicators, in spite of the same font name: presumably the NT4 version of Tahoma lacks those characters. So, I'd suggest user education might be a better line to take (and should work on other people's sites too, as long as those authors weren't doing something counterproductive themselves).

One Arabic-script-specific feature of Mozilla-based browsers noted by Andreas, however, was that they picked only the "isolated" glyph form for these missing characters, and did not use the initial, medial and final forms when they should.

"But I already have a (Symbol, Dingbats, Webdings...) font"

(writes the occasional reader of this page). Yes, sorry, but those fonts are constructed quite differently. To support these fonts in accordance with the HTML character model, the browser and/or display system would need to know which particular Unicode character was found at each position of these fonts. But if you examine the fonts with the Font Properties Extension, it reports for the "Supported Unicode Ranges" a blank display, and for the "Supported Code Pages" the mysterious remark "Symbol Character Set". Well, I would not call a "Symbol Character Set" a "Code Page" in the first place, so this is already something odd: and the information shown is the same for the fonts Symbol, Webdings, and WingDings, despite the fact that these fonts contain totally different characters.

How (not) to mis-use these fonts in web pages is touched on in the next section, with a link to more detailed discussion. Where the characters in question have proper Unicode code points, they should be referenced by those, according to the HTML character model (and the browser will pick them from some font which has them there).
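To make that concrete: rather than writing the letter "a" and asking for it to be displayed in the Symbol font, the author puts the real character (Greek small alpha, U+03B1) into the document, either directly in a suitable encoding or as a numeric character reference. The following is a minimal Python sketch of the latter, using the standard "xmlcharrefreplace" error handler; the function name is invented for this example.

```python
def to_numeric_references(text):
    """Replace each non-ASCII character by a decimal HTML numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Greek alpha, less-than-or-equal sign, Greek beta.
print(to_numeric_references("α ≤ β"))  # &#945; &#8804; &#946;
```

The resulting markup identifies the characters themselves, so the browser is free to take each glyph from whichever installed font contains it.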

Where the characters in question don't have regular Unicode code points, one common approach in Symbol-type fonts is to assign them code points in the "Private Use Area" (PUA). What's wrong with this, for the purposes of HTML+CSS, is that the stylesheet is supposed to influence only presentation, rather than content; but the PUA character positions are used for completely different glyphs in different fonts, and if the reader does not have the precise font intended by the author, the reader will have no idea which glyph was intended. So this approach is fundamentally unsound in HTML/CSS terms.
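An author who wants to check whether some text depends on PUA code points could do so along the following lines. This is only a sketch in Python, relying on the standard unicodedata module, which reports the general category "Co" for private-use characters; the function name and the sample string are invented for the example.

```python
import unicodedata


def private_use_characters(text):
    """List the private-use code points found in text, in U+XXXX notation."""
    return [
        "U+%04X" % ord(ch)
        for ch in text
        if unicodedata.category(ch) == "Co"  # "Co" = Other, private use
    ]


# One ordinary letter, one PUA code point, one more ordinary letter.
print(private_use_characters("a\uf0e0b"))  # ['U+F0E0']
print(private_use_characters("αβγ"))       # []
```

Any code point reported this way has no agreed meaning outside the particular font it was written for.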

So in conclusion these fonts were fine for traditional not-very-portable word-processing formats, but this is not the way in which HTML extends its character repertoire: that is done by a mechanism that is in principle much more portable, namely Unicode. And, by now, that is substantially working, on the WWW: it would be quite wrong to try to go back to an older and less portable way of working, for the sake of a few odd glyphs.

Author-specified fonts, FONT FACE and CSS

The purpose of giving authors a facility to specify fonts in CSS (and in the presentational aspects of HTML3.2, now deprecated) is to suggest a cosmetically-different presentation. It is not to influence the character repertoire as such.

As we can see, different fonts offer different character repertoire coverage, and in ways that are not obvious from the name of the font family: even fonts with identical names may have character repertoires which differ, in some cases by just a few characters, in other cases by whole ranges of characters, as between one version of the font and another, or as between one platform and another. Thus there may be unfortunate interactions between an author's understandable wish to suggest a font (cosmetics), and the ability of a particular browser to display the desired character (repertoire). And there are of course also cross-platform issues to worry about, in any truly WWW context. (See also some pathological examples[1].)

In discussions, it is frequently suggested that - as a typical user would not know how to set their default font for best i18n results - it would be better for the author to set it for them. And indeed there's no disputing that for some fraction of users, the results might be better: but, and this for me is the killer, in every case that this idea has been examined, it has turned out that the idea would have significant negative consequences for some other fraction of users. What's more, if users can be given a gentle nudge in the right direction (assuming that the content is sufficiently challenging for this to seem appropriate), then any improvements they can make to their own browser setup as regards its i18n behaviour are likely to bring benefits on any properly-made web site, which, surely, is doing them a better service than any page-specific - or site-specific - fix could do.

With MSIE, it can even be harmful to specify one of the CSS "generic" font families, such as serif or sans-serif. While it is true that the extreme consequences sometimes observed in IE4 now seem to have been corrected, there are still problems to be seen in i18n terms with IE6: to take a specific case, the reader had selected Tahoma (a "sans-serif" font which covers the Arabic/Urdu repertoire well), but the author had specified sans-serif in CSS, resulting in IE6 switching from Tahoma to Arial - a font which was missing several of the needed characters.

To put it bluntly, if you are trying to use a wide character repertoire in a WWW context, then if you, as author, try to force a particular font to be used by every reader, whether by CSS or by FONT FACE, you will likely do more harm than good. It may be that you have the best of intentions, and you could indeed help some proportion of readers whose browsers are not set up optimally, but you also risk causing real harm to some other proportion of readers. Conversely, readers who are having problems with displaying what is otherwise a properly-made i18n document on their browsers, provided of course that those browsers have been set up well for the writing systems in question, might be advised to try telling the browser to ignore document-specified font selection (note that in IE, this option is found on the Accessibility... sub-menu, not on the Fonts... sub-menu).

Other Reading

There's a related W3C tutorial: Using language information in (X)HTML and CSS. This shows, amongst other things, how to propose different fonts to be used for different languages within a document. Don't confuse this, however, with the use of different fonts for different "language script" groups of characters as discussed above. The effects clearly do overlap in some sense, but they are not the same thing. CSS2.1 does not make provision for specifying that different language script character groups should be rendered with different fonts. The @font-face specification for embedded fonts, in CSS2.0 (and targeted to be reintroduced in CSS3), includes a unicode-range specification, but embedded fonts are beyond the scope of the present page. (For MSIE you might want to see my page I18n and MS WEFT.)


[1] Some pathological examples

To take a particularly extreme case, A.Prilop called my attention to Euro fonts being offered by Adobe. These fonts display the Euro currency sign at every position in the font!

Another example of this kind of thing, and again not the sort of thing you should be doing in HTML, is the set of ROT13 fonts mentioned in the PINE FAQ. But I digress... As I said above, the more usual problem (once authors have been persuaded not to use custom fonts in a misguided attempt to extend the character repertoire) is that there are significant differences in the character repertoire covered by different fonts, in ways that are not apparent from the font name: indeed, fonts of the same name may cover different character repertoires on different platforms, and in different versions of the same font.

