I18n - text direction

This is not a complete tutorial: rather, it is an examination of a number of topics that have come up in discussions related to text direction, with links to other relevant resources.

Text directionality in Unicode

As you obviously know, some writing systems (scripts) run from right to left, for example Arabic or Hebrew. Unicode recognises this by assigning an inherent directionality, whether left-to-right (ltr) or right-to-left (rtl), to characters which are evidently a part of a directional script. Other characters however have neutral directionality, and their behaviour is meant to be dependent on their context in relation to other characters.
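The directional class which Unicode assigns to a character can be inspected programmatically. As a minimal sketch (the use of Python's standard unicodedata module is my own choice for illustration, and the sample characters are arbitrary), the following prints the class codes for a few characters: L, R and AL are the strongly directional classes, while WS and ON are among the neutral ones.

```python
import unicodedata

# Sample characters chosen for illustration only.
samples = {
    "Latin a": "a",
    "Hebrew alef": "\u05d0",
    "Arabic qaf": "\u0642",
    "space": " ",
    "greater-than": ">",
}

def bidi_class(ch):
    """Return the Unicode bidirectional class code of a single character."""
    return unicodedata.bidirectional(ch)

for label, ch in samples.items():
    print(f"{label}: {bidi_class(ch)}")
```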

The formal position is set out in the Unicode standards, last spotted in Unicode Standard Annex 9, "The Bidirectional Algorithm". See Unicode's Writing Direction FAQ for some issues.

It should be noted that Unicode defines some special characters: LEFT-TO-RIGHT EMBEDDING, RIGHT-TO-LEFT EMBEDDING, POP DIRECTIONAL FORMATTING, whose purpose is to influence directionality in Unicode texts (for example, these could be effective in plain-text). The question of their use in HTML is mentioned later.
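For reference, the three characters named above sit at U+202A, U+202B and U+202C. A small sketch, again assuming Python's unicodedata module purely for illustration, looks up their names and bidirectional classes:

```python
import unicodedata

# Code points of the three formatting characters named in the text.
LRE = "\u202a"
RLE = "\u202b"
PDF = "\u202c"

def describe(ch):
    """Return (code point, Unicode name, bidi class) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))

for ch in (LRE, RLE, PDF):
    print(*describe(ch))
```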

Content, or Presentation?

In web page design, the directionality (ltr or rtl) is considered to be a property of the content itself, rather than mere presentation, and thus the issue is primarily a matter for HTML rather than for a stylesheet. Read about this in detail in, for example, Section 8.2 of the HTML4.01 specification.

CSS (here we are referencing a CSS/2.1 draft) does have some properties relating to text direction, but, as the specification emphasises, these properties would not normally be used in an author stylesheet for HTML rendering: indeed, some of the HTML rendering rules are not capable of being expressed in CSS, and an attempt to apply CSS rules in an author stylesheet might make it impossible for the browser to render HTML correctly. Rather, such rules might be applied internally within the design of a browser ("chrome"), or used for some special purposes (e.g for viewing source markup). Such issues are beyond the scope of the present note.

(Aside - we may note that the Chinese etc. practice of sometimes writing characters in vertical columns is regarded by Unicode rather as an issue of presentation than as an inherent property of Chinese characters. However, as CJK writing is not really my field, I won't be pursuing that topic here, sorry.)

HTML constructs related to BIDI

Normally, a body of text which is in a given writing system is supposed to be rendered in the appropriate directionality. HTML features are meant to be needed only to deal with ambiguous situations, e.g where strings of rtl and neutral characters need to be included within ltr text or vice versa, to resolve ambiguities in the handling of the neutral characters. HTML offers constructs related to BIDI, notably the dir= attribute and the bdo element, both of which are discussed further below.

The HTML4.01 specification points out the potential for mayhem if these HTML facilities are combined with Unicode's own embedded direction-changing characters. It not unreasonably recommends that one or the other be used, and mixtures avoided. It generally favours the HTML-based methods. The Unicode TR20 goes further than that, generally rating embedding controls etc. as "not suitable for use with markup", and recommending the use of appropriate (x)HTML markup for this purpose.

So much for (X)HTML markup of text which forms body content, which is of course the usual situation in HTML. Where this approach breaks down, however, is when mixed-direction text needs to appear as the value of an attribute, e.g the alt= attribute of an img, the title= attribute, and so on. As mentioned below, mixed-direction text in these contexts tends not to be so well supported by browsers anyway, so restraint is recommended in the use of these features. But if you need to influence the rendering of mixed-direction texts in attribute values, you can't use markup to do it, so your only option would be the Unicode embedding characters. There are some notes on this in A.Prilop's examples page.
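To make the attribute case concrete, here is a hedged sketch of wrapping an alt= value in an RLE ... PDF pair. The helper name img_with_rtl_alt, the file name and the sample text are all hypothetical, and whether a given browser honours the controls in this context is exactly the uncertain part.

```python
import html

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def img_with_rtl_alt(src, alt_text):
    """Build an img tag whose alt value is wrapped in RLE ... PDF (hypothetical helper)."""
    alt = RLE + alt_text + PDF
    # html.escape only touches &, <, > and quotes; the control characters pass through.
    return f'<img src="{html.escape(src)}" alt="{html.escape(alt)}">'

# Hebrew word followed by a Latin gloss: a small mixed-direction attribute value.
tag = img_with_rtl_alt("logo.png", "\u05e9\u05dc\u05d5\u05dd (shalom)")
print(tag)
```

Markup cannot be nested inside an attribute value, which is why the sketch falls back on the control characters rather than on dir= or bdo.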

Font issues

A separate page has already cautioned against the risks, some of which are not obvious, of specifying fonts in an i18n context. This is true irrespective of whether the specification is done with CSS, or with legacy HTML (font tags).

In the past, the users of writing systems which were poorly supported by browsers have devised various workarounds for their purposes, including 8-bit character codings which were not supported by browsers, and custom fonts laid-out according to those unsupported character codings (i.e using the so-called "user defined" character coding which is offered by some browser implementations). You may then find these techniques heavily promoted on web sites dealing with these poorly-supported writing systems. In the context of directionality we might think for example of Persian (Farsi), written with a character repertoire like Arabic with a few additions, but presented for preference with a somewhat different appearance (same Unicode characters, but cosmetically different font). We might also think of Yiddish, which has a few characters additional to the normal Hebrew repertoire: those characters may be missing in normal Hebrew-supporting fonts.

The message of this part of the page is to caution you that these older workarounds are not helpful when using HTML4-based i18n techniques, and can in fact be seriously harmful. The sneaky part is that everything might look fine on your own browser, with your choice of installed fonts; it might even be confirmed by running a different browser (on the same operating system, with access to the same fonts): but if you try to force the same font selection on your readers, the results might be terrible. Many of the writing systems which in earlier times were rated as poorly-supported in browsers, are nowadays quite well supported, using bona fide HTML4 techniques, and you can see them in use for example at the Google search service, Google in your language.

Arabic scripts: letter forms

A letter in Arabic scripts can have up to four different forms: the isolated form, the initial form (at the right hand end of a word, of course), the medial form, and the final form. Rendering systems which properly support Arabic writing are expected to select the appropriate glyph according to the context: the character code stays the same. (You are not supposed to use the Unicode "presentation forms" to control this in normal usage.)
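The relationship between a letter and its "presentation forms" can be seen from the Unicode compatibility mappings. As a sketch (using the letter BEH as an arbitrary example, and Python's unicodedata module as an assumed tool), each of the four presentation-form code points normalises back to the one ordinary character that an author is supposed to use:

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH: the character that belongs in the text

# Presentation forms of BEH: isolated, final, initial, medial.
presentation_forms = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]

def base_letter(ch):
    """Map a presentation form to its ordinary letter via NFKC normalisation."""
    return unicodedata.normalize("NFKC", ch)

for ch in presentation_forms:
    print(unicodedata.name(ch), "->", unicodedata.name(base_letter(ch)))
```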

Generally speaking, if a browser and font support Arabic, then this works; but some of the other languages which use Arabic-family scripts have a few language-specific letters which might not be so well supported, e.g the font might only contain the "isolated" form of the letter. If you want to get a complete picture of a font's coverage of Arabic scripts, therefore, it's not enough to look at a presentation of isolated letters, as is usual in code table charts: you need to see the letters in the other word positions in which those letters can occur.

In Persian (for example) there are situations where the behaviour needs to be steered by the use of ZWJ and ZWNJ characters; see also the test cases from A.Prilop.
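As an illustration of the ZWNJ case, the Persian verb form usually transliterated "mi-khaham" is written with a ZERO WIDTH NON-JOINER after the prefix, so that the prefix keeps its final shape instead of joining to the next letter. The sketch below is my own example rather than one taken from the test cases mentioned; it simply builds the string from code points.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

# Prefix "mi" (meem, Farsi yeh) and stem "khaham" (khah, waw, alef, heh, meem).
prefix = "\u0645\u06cc"
stem = "\u062e\u0648\u0627\u0647\u0645"

# The ZWNJ is invisible but stops the yeh from joining to the khah.
word = prefix + ZWNJ + stem

print(unicodedata.name(ZWNJ))
print(unicodedata.name(ZWJ))
print(len(prefix + stem), len(word))
```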

At U.Texas there's a tutorial for use of MS Word in Persian/Farsi (a pity about the web pages' defective construction, relying on a browser bug to implement CSS font selection of non-standard fonts such as Wingdings, which a properly-behaved HTML4 browser doesn't do). Their tutorial covers the usage of ZWJ and ZWNJ characters.

Browser support

Both Mac and MS Windows OSes contain support for BiDi, but it typically isn't installed by default: you need to select the relevant installer option, and this would need to be done by both authors and readers.

Some modern browser/versions make a reasonable job of supporting rtl and mixed directionality, but there are still occasional anomalies in those: others make no attempt to support it at all, while there are some with partial support. In some cases you need to install optional OS features in order to get the support. I haven't done any kind of detailed survey, so I won't try to offer random examples here. But I'm asked to add a comment about Opera: rtl support had been introduced by version 7.23 (I've dropped details of bugs in older versions now).

In theory, the Unicode BIDI rules are supposed to take care of everything, and there should be no need to mess with the direction. What is clear however is that it doesn't do any harm to specify appropriate dir= attributes, and it can often help, in the face of actual browser implementations. The recommendation to authors who are involved in authoring rtl or mixed-direction content is to make liberal use of appropriate dir= attributes. However, this might not automatically set right alignment in some browsers: see the remarks below under Other effects.
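For authors who generate their pages programmatically, one way of applying dir= liberally is sketched below. The rule used to choose the direction (take it from the first strongly directional character) is a simplifying assumption of mine, not something laid down by the HTML specification, and the helper names guess_dir and paragraph are hypothetical.

```python
import html
import unicodedata

def guess_dir(text):
    """Return 'rtl' if the first strong character is R or AL, otherwise 'ltr'."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):
            return "rtl"
        if cls == "L":
            return "ltr"
    return "ltr"  # no strong character found: fall back to ltr

def paragraph(text):
    """Emit a p element carrying an explicit dir= attribute."""
    return f'<p dir="{guess_dir(text)}">{html.escape(text)}</p>'

print(paragraph("\u05e9\u05dc\u05d5\u05dd"))
print(paragraph("Hello"))
```

An author who knows the language of each paragraph would of course set the attribute directly rather than guess.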

Circumstances where it might be appropriate to use the bdo element are discussed in the HTML4 specification.

Even in browser setups where support for BiDi in normal content rendering is OK, you may still encounter oddities in other situations, such as for instance:

In these or similar situations, there can be problems not only with limited character repertoire, but also left-to-right rendering even of characters which ought to be rendered right-to-left. In my own experience, for example, in Win/NT4 the window titles and popups displayed Arabic or Hebrew text from left to right (i.e wrongly); and Win/2000 Pro also did the same by default, but by visiting the control panel for "Regional Settings", and enabling the support for Arabic and for Hebrew, this problem was resolved in the latter (I'm advised that in Win/NT4 or Win/9x versions, getting this right required a regional edition of the OS: I pass this on with the usual disclaimers.) As for the fonts used in these situations, may I refer you to the notes in my browsers/fonts page.

Source code representation

In the samples linked below, the source code is generally representing the rtl characters by means of &#number; notation, for the convenience of the authors.

In practice, authors of production web pages may very well wish to represent their rtl text as coded characters, for example in utf-8 encoding (or in the 8-bit encoding appropriate to their writing system). From an HTML point of view, of course, either representation is entirely valid and can be expected to be rendered correctly by supporting browsers.

However, a word of caution: mixed-direction content, interspersed with markup, is beyond the capabilities of some HTML source code editors. If you find yourself in this situation, it may be worth noting that coded characters can be programmatically converted to the corresponding &#number; notation. There are purpose-designed filters which can do that; or, for example, Mozilla Composer (and its sibling Nvu) have an option to "Save And Change Character Encoding" - if you change the encoding to iso-8859-1, then all characters outside of the Latin-1 repertoire, including of course all the rtl characters, will be converted to their corresponding &#number; notation.

Once the document has been edited, the updated version could, if you wish, be reconverted for publication to your preferred character encoding, by a similar procedure.
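The conversion in both directions can be sketched in a few lines. This assumes Python and its standard "xmlcharrefreplace" error handler, which is one possible filter among many; the function names are hypothetical. Encoding to iso-8859-1 leaves the Latin-1 repertoire alone and turns everything else into decimal &#number; notation, in the same way as the Composer option described above.

```python
import re

def to_numeric_references(text):
    """Replace every character outside Latin-1 by its &#number; reference."""
    return text.encode("iso-8859-1", "xmlcharrefreplace").decode("iso-8859-1")

def from_numeric_references(text):
    """Turn decimal &#number; references back into coded characters.

    Named references such as &lt; and &amp; are deliberately left untouched,
    since converting those would damage the markup.
    """
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

source = "caf\u00e9 \u05e9\u05dc\u05d5\u05dd &lt;"
converted = to_numeric_references(source)
print(converted)
print(from_numeric_references(converted) == source)
```

One caveat with this sketch: the reverse step would also expand a reference such as &#60; that the author had written deliberately, so a production filter would want to skip the markup-significant characters.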

For one example of difficulties with source code display, see Mozilla Bug 322945.

Other effects

This portion chiefly contributed by A.Prilop (who has subsequently made his own sample page, cited below).

<html dir="rtl"> or <body dir="rtl"> may place the scrollbar on the left side in some browsers and may have other visual effects (layout of images etc.). They affect the column order of tables in probably all browsers that know the dir attribute.

Arabic and Hebrew text as such is usually not right-aligned by the browser. We have already recommended liberal use of dir="rtl" on the p etc. element (or on body for a general effect), for this and other reasons, and in most tested browsers this did put the browser into right-alignment mode. However, in at least one browser tested, this didn't happen, so, if you want to cover all bases, it would be advisable to specify right-alignment in the stylesheet.

The belief, apparently widely-held, that whitespace problems can be cured by using no-break spaces instead, isn't really true: both have neutral directionality. There is, however, a difference between them in regard to numbers: the no-break space has the property of being a "common number separator", which influences the result when it's used to space-out strings of digits.
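The difference between the two kinds of space shows up directly in their Unicode bidirectional classes. A quick check, assuming Python's unicodedata module:

```python
import unicodedata

SPACE = " "
NBSP = "\u00a0"  # NO-BREAK SPACE

# WS is "whitespace"; CS is "common number separator".
space_class = unicodedata.bidirectional(SPACE)
nbsp_class = unicodedata.bidirectional(NBSP)

print("space:", space_class)
print("no-break space:", nbsp_class)
```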

One must be careful with the paired ASCII characters <> () [] {}
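One reason for care, offered here as my reading rather than as the original point of the remark, is that these paired characters are neutral in direction and carry the Unicode "mirrored" property, so the glyph that is displayed depends on the direction resolved for the surrounding text. The sketch below, assuming Python's unicodedata module, lists both properties:

```python
import unicodedata

PAIRED = "<>()[]{}"

def paired_properties(ch):
    """Return (bidi class, mirrored flag) for one character."""
    return (unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch)))

for ch in PAIRED:
    print(ch, *paired_properties(ch))
```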

It's also worth noting that stylesheet problems can be exacerbated by mixed-direction contents: in particular, any :first-letter CSS styles can show up in surprising places.

Suppose you want to include the etymology

   kerátio(n) > qīrāt. > carat > karat
   Greek        Arabic           Russian

into an English or an Arabic text. In an Arabic text it should read

karat < carat < qīrāt. < kerátio(n)

dir=ltr
κεράτιο(ν) > قيراط > carat > карат

dir=rtl
κεράτιο(ν) > قيراط > carat > карат
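In source terms the two samples above contain exactly the same characters in the same stored order; only the dir= attribute differs, and the reordering is left to the browser. A sketch of generating both variants follows (the helper name with_dir is hypothetical, and the p element is just one possible container):

```python
import html

ETYMOLOGY = "κεράτιο(ν) > قيراط > carat > карат"

def with_dir(text, direction):
    """Wrap the text in a p element with the given dir= value, escaping the > signs."""
    return f'<p dir="{direction}">{html.escape(text)}</p>'

ltr_version = with_dir(ETYMOLOGY, "ltr")
rtl_version = with_dir(ETYMOLOGY, "rtl")
print(ltr_version)
print(rtl_version)
```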

More samples

A.Prilop's bidirectional examples.

Digression - RFC1556

RFC1556, from 1993, adopted some conventions which had previously been developed by ECMA for handling right-to-left writing systems, and codified them for Internet use. The explicit and implicit bidirectionality conventions are denoted by appending the relevant suffix, -e or -i, as part of the charset= attribute specifying the character coding (iso-8859-6, iso-8859-8 as the case may be). In the absence of such a suffix, RFC1556 assumes the visual directionality convention, in which the right-to-left character strings are coded "back to front" (this assumption is generally incorrect for iso-8859-6 Arabic, however).

It should be noted that the RFC1556 specification is aimed at plain text (text/plain in MIME terms), where there is no possibility of inserting markup at a higher layer of protocol; thus the only available machinery is the use of control sequences at the character stream level.

This kind of approach is inappropriate for use in HTML, where the necessary control is applied for preference at the HTML markup level. This aspect is discussed in the HTML4.01 specification at the end of section 8.2.4, although that concludes with what appears to be a somewhat incoherent recommendation on correct practice for specifying charset= for HTML documents.

When charset=utf-8 is used, then RFC1556 is not directly relevant: the applicable rules are those of Unicode (and of HTML4.01 of course). However, if an 8-bit character coding is specified, then we ought to make some kind of sense of what the HTML4.01 spec says about RFC1556. The use of the suffix -e denoting explicit (RFC1556) directionality information is ruled out, and this much is clear. For Arabic, it can evidently be assumed that iso-8859-6 refers to implicit (as opposed to visual) directionality. As for iso-8859-8 Hebrew, I would take my cue from Nir Dagan (cited below), who writes:

In HTML the character encoding does not assign directionality in any way. Directionality is assigned to characters by Unicode's bi-directional algorithm, and additional HTML markup. Thus, when writing a "visual" document one must not rely on its charset labeling and must override directionality explicitly as HTML 4 requires.

What Nir Dagan's comments mean is that in the HTML context, a charset specification of iso-8859-8 is not adequate to carry its traditional implication of visual directionality: in the event that visual directionality is intended, then the content needs to be consistently marked-up with <bdo> in order to make the visual directionality explicit at the HTML layer. However, the use of the visual technique should now be considered obsolete, in favour of using the implicit directionality of the characters, assisted with dir= attributes, as described elsewhere on this page. The latter practice corresponds to the so-called implicit directionality convention, and, according to the HTML4.01 spec, calls for charset=iso-8859-8-i to be specified, even though the HTML markup alone should be sufficient to distinguish between the two conventions, and the distinction between charset=iso-8859-8-i (implicit) and charset=iso-8859-8 (visual) appears to be superfluous in the HTML context.
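The difference between the two conventions can be sketched for the simplest possible case, a single run of Hebrew letters on one line. Everything here is a simplification on my part: real visual-order text also has to deal with line breaks, digits and embedded Latin text, and the function names are hypothetical.

```python
def logical_to_visual(run):
    """Reverse one pure-Hebrew run into left-to-right display order (simplification)."""
    return run[::-1]

def visual_markup(visual_text):
    """Visual convention: override the characters' own direction with bdo."""
    return f'<bdo dir="ltr">{visual_text}</bdo>'

def logical_markup(logical_text):
    """Implicit convention: keep reading order and let the browser reorder."""
    return f'<span dir="rtl">{logical_text}</span>'

logical = "\u05e9\u05dc\u05d5\u05dd"  # shin, lamed, vav, final mem, in reading order
visual = logical_to_visual(logical)

print(logical_markup(logical))
print(visual_markup(visual))
```

Both lines should display identically in a browser which supports BIDI, but only the first keeps the characters in the order in which they are read, which is what search engines and other processing tools need.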

A.Prilop comments: With dir="rtl" it is no longer necessary to use the encoding ISO-8859-8-i instead of ISO-8859-8. Only Google recognizes ISO-8859-8-i. Alltheweb and Altavista mistake such pages as ISO-8859-1-encoded.

Based on that and on Nir Dagan's comments, I would conclude that the HTML specification's advice to include the -i suffix on this charset is probably now better disregarded.

Unicode consortium Technical Report #20: "Unicode in XML and other Markup Languages".

W3C FAQ on script direction and languages.

Right-to-left text in Markup Languages from i18nguy (Tex Texin), who evidently understands far more about these matters than I do.

Nir Dagan's Hebrew on the Web, especially Standards for Hebrew on the Web.

An introduction to writing Arabic on the Mac by Knut S. Vikør.

Several rtl writing systems can be seen in practical use at Google's search pages, although occasional oddities may be spotted; and, as already noted elsewhere, Win MSIE is not very good at finding characters (e.g for Urdu) if they are missing from the configured font for this language script. Mozilla, on the other hand, finds the characters from another font, even though the styles may be a poor match, and sometimes only the isolated form of a particular Arabic-script letter may be available.

These Google pages appear to be delivered with different encodings for different browsers, though it's not known whether this is based on Accept-charset negotiation or on some kind of browser sniffing.

Readers of German could find this usenet posting (archived at Google groups) of interest.

Some old workarounds which by now really must be rated as obsolete are examined in my page Using FONT FACE to extend repertoire?. If anyone tells you that in order to be able to read some non-"Roman" script, you need to download a specialised font, then it could be an indication that they are talking about one of these obsolete workarounds. Some of those superseded techniques fail entirely when used with a specification-conforming HTML4 browser; others do still give a visual impression of working, in spite of being unsound at the fundamental level (leading to problems with search engines etc.). So be on your guard if you are offered such advice. (Don't confuse this with properly-made Unicode fonts for the writing system in question, which are fine - but downloading some particular font recommended by the page author should not be essential for viewing a web page, if you already have one or more fonts which cover the writing system).

