Unicode test material

In this area you will find rough-and-ready test tables, derived programmatically from the Unicode database, which is downloadable via the Unicode web site.

The version used here is 4.1.0, dated 2005-03-30 at the Unicode site.

The Unicode site shows the latest Unicode version, which may be newer than the one used to prepare these pages.

In a little more detail, the pages are produced as follows. The index page is based on a simple manual portioning of the official Unicode Blocks.txt file for the relevant version level, in order to get manageably sized web pages. The individual pages are generated programmatically from the UnicodeData.txt file for the respective Unicode version.
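As a rough illustration of the kind of processing involved, here is a minimal Python sketch that reads one line in the UnicodeData.txt format. It is not the program actually used for these pages; the names UcdRecord and parse_ucd_line are hypothetical, and only the semicolon-separated field layout (field 0 code point, field 1 name, field 2 General Category, field 5 Decomposition Mapping) is taken from the file format itself.

```python
from typing import NamedTuple


class UcdRecord(NamedTuple):
    """The few UnicodeData.txt fields that these tables rely on (hypothetical type)."""
    code_point: int
    name: str
    general_category: str
    decomposition: str


def parse_ucd_line(line: str) -> UcdRecord:
    # Fields are separated by semicolons; field 0 is the code point in hex,
    # field 1 the character name, field 2 the General Category and
    # field 5 the Decomposition Mapping (empty if there is none).
    fields = line.rstrip("\n").split(";")
    return UcdRecord(int(fields[0], 16), fields[1], fields[2], fields[5])


# A sample line in the UnicodeData.txt format.
SAMPLE = ("00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;"
          "LATIN SMALL LETTER E ACUTE;;00C9;;00C9")
record = parse_ucd_line(SAMPLE)
print(record)
```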

This presentation includes only the characters which are included in UnicodeData.txt: it does not include characters from the massive UniHan.txt file, i.e. unified CJK characters. The location of these character blocks in the Unicode structure is shown, but without the actual characters.

Notes to this presentation

These test tables are coded with us-ascii characters, but are sent out with charset=utf-8. The specimen characters are exhibited by using &#number; notations, using decimal numbers. For increased clarity, the specimen characters are enclosed in <big> markup and styled for an enlarged display; but the page can still be enlarged further by using your browser's text zoom (or equivalent) facility when desired.
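The notation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name specimen_cell is hypothetical, and the real pages may attach further styling to the markup.

```python
def specimen_cell(code_point: int) -> str:
    # Write the character as a decimal numeric character reference,
    # wrapped in <big> markup, so the page source stays pure us-ascii.
    return f"<big>&#{code_point};</big>"


html = specimen_cell(0x00E9)
print(html)

# Pure us-ascii text has the same bytes in utf-8, which is why such a page
# can be sent out labelled charset=utf-8 without any conversion.
same_bytes = html.encode("ascii") == html.encode("utf-8")
```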

Arrangement into pages

The repertoire is broken into separate pages according to the leading hex digits of the U+xxxxxx representation, and the pages are broken down into smaller tables in order to minimise browser and printer problems. The index page is also generated programmatically, but the annotations are supplied from a manual list which, as I mentioned, is derived by ad hoc portioning of the official Blocks.txt file.
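The paging rule can be sketched as follows. This assumes that the page name is the code point written in hex (at least four digits) with the last two digits dropped, which matches page names such as "0A" and "0F" but is otherwise a guess, particularly for code points beyond U+FFFF. The function name page_key is hypothetical.

```python
from collections import defaultdict


def page_key(code_point: int) -> str:
    # Assumption: drop the last two hex digits of the U+xxxx form,
    # so that U+0A05 falls on page "0A" and U+0F40 on page "0F".
    return f"{code_point:04X}"[:-2]


# Group a few sample code points into pages.
pages = defaultdict(list)
for cp in (0x0A05, 0x0A85, 0x0F40, 0x1D400):
    pages[page_key(cp)].append(cp)

print(dict(pages))
```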

General Category

This column contains primarily the General Category Values from the Unicode data (UCD) file. Another point of interest to the author is whether the character has any "Decomposition Mapping" data in the UCD file: in the interests of compactness, the presence of such data is shown by an asterisk, "*", in this column, and the decomposition data is included in a title= attribute: a number of browsers will show this data if the cursor is hovered over the area.
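A minimal sketch of such a column entry is shown below, using Python's standard unicodedata module. Note the assumptions: that module reflects whichever Unicode version ships with the Python interpreter, not necessarily 4.1.0; the `<td>` markup and the function name category_cell are hypothetical and not taken from the actual pages.

```python
import unicodedata
from html import escape


def category_cell(code_point: int) -> str:
    ch = chr(code_point)
    category = unicodedata.category(ch)            # e.g. "Ll", "Mn"
    decomposition = unicodedata.decomposition(ch)  # empty string if none
    if decomposition:
        # Mark the presence of decomposition data with an asterisk and
        # carry the data itself in a title= attribute (escaped, since
        # compatibility mappings contain tags such as <noBreak>).
        return f'<td title="{escape(decomposition)}">{category}*</td>'
    return f"<td>{category}</td>"


print(category_cell(0x0041))
print(category_cell(0x00E9))
print(category_cell(0x00A0))
```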

Monospaced fonts

Following the observation of limited repertoire of some monospaced fonts, particularly in MSIE, I provided an additional column in which the &#number; character was repeated, but this time within <tt> markup.

Combining Marks

Three kinds of combining marks are presented in a specific way in the tables. My early attempts to show these marks in combination with white space or with no-break space were often unsatisfactory: some combining marks only really work when combined with an appropriate base character. The current approach is as follows. On each page, a compromise base character is selected for use with any and all combining marks on that page. The chosen character is displayed at the foot of the page. This is done for simplicity, and works reasonably well in most cases, although for example on page "0A" (Gurmukhi and Gujarati) this results in a Gurmukhi base character being displayed with Gujarati vowels; and a correspondent also points out that in Tibetan (page "0F"), some combining characters are only applicable to numbers. Other such anomalies are doubtless present, but this seems to me to be a useful compromise in the circumstances: feel free to refer to specialised resources for individual writing systems, which the pages here are by no means intended to replace.

The combining marks can be recognised by their General Category (Mn, Me or Mc). If a coloured display is available and CSS enabled, combining marks are shown on a coloured background as an additional reminder of the presence of the base character. The colour code is shown at the foot of those pages.
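The treatment of combining marks described above might be sketched like this. It is an illustration only: the function names are hypothetical, the base character passed in is whatever the caller picks for the page, and Python's unicodedata module reflects the Unicode version bundled with the interpreter rather than 4.1.0.

```python
import unicodedata

# General Categories of the three kinds of combining marks.
COMBINING_CATEGORIES = {"Mn", "Me", "Mc"}


def is_combining_mark(code_point: int) -> bool:
    return unicodedata.category(chr(code_point)) in COMBINING_CATEGORIES


def specimen(code_point: int, base_code_point: int) -> str:
    # Combining marks are shown after the page's compromise base character;
    # every other character is shown on its own.
    mark = f"&#{code_point};"
    if is_combining_mark(code_point):
        return f"&#{base_code_point};{mark}"
    return mark


# Hypothetical example: Latin small "a" as the base character for the page.
print(specimen(0x0301, 0x0061))
print(specimen(0x0062, 0x0061))
```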

On the basis of general observation, I recommend using precomposed characters whenever they are available, rather than a base character with combining mark(s), as the support in browsers and fonts for precomposed characters seems to be noticeably better.
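One way to obtain precomposed characters where they exist is Unicode normalization form NFC, sketched here with Python's standard unicodedata module. This is a swapped-in technique for illustration and not something the pages themselves do; the variable names are hypothetical.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: two code points.
decomposed = "e\u0301"

# NFC composes the pair into the single precomposed character U+00E9.
precomposed = unicodedata.normalize("NFC", decomposed)

# Where no precomposed character exists, NFC leaves the sequence alone.
no_precomposed_form = unicodedata.normalize("NFC", "q\u0301")

print(len(decomposed), len(precomposed), len(no_precomposed_form))
```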

Arabic

If the code point is an Arabic letter (as shown by the relevant indications in the Unicode database itself), then (irrespective of whether this particular letter really does exist in all four forms) the page will contain the isolated letter, alongside a triplet (initial, medial and final) of the same letter joined together with ـ (U+0640 ARABIC TATWEEL).
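The following sketch shows one reading of this arrangement: the letter on its own, plus the same letter three times with U+0640 between the copies, so that the first copy takes its initial form, the second its medial form and the third its final form. Both that reading and the function name arabic_specimens are assumptions for illustration.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL, the joining stroke


def arabic_specimens(code_point: int) -> tuple[str, str]:
    letter = chr(code_point)
    # Isolated letter, then letter-tatweel-letter-tatweel-letter.
    triplet = TATWEEL.join([letter] * 3)
    return letter, triplet


# Example: U+0628 ARABIC LETTER BEH.
isolated, triplet = arabic_specimens(0x0628)
print(isolated, triplet)
```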

The above display technique for Arabic letters seems to be compatible with quite a range of recent browsers. An alternative approach is to use the zero-width joiner character alongside the character in question (on one, or other, or both sides) to get the browser to present the initial, final or medial forms respectively, but this doesn't work on some earlier browser/versions (e.g. Mozilla 1.0).
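The zero-width joiner alternative can be sketched as follows; the function name zwj_forms is hypothetical. A joiner after the letter requests the initial form, a joiner before it the final form, and joiners on both sides the medial form. Whether the forms are actually rendered depends on the browser and font.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER


def zwj_forms(code_point: int) -> dict[str, str]:
    letter = chr(code_point)
    return {
        "isolated": letter,
        "initial": letter + ZWJ,       # joins to what follows
        "medial": ZWJ + letter + ZWJ,  # joins on both sides
        "final": ZWJ + letter,         # joins to what precedes
    }


# Example: U+0628 ARABIC LETTER BEH.
forms = zwj_forms(0x0628)
print({name: [f"U+{ord(c):04X}" for c in text] for name, text in forms.items()})
```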

MS WGL4

Although not directly a Unicode issue, the tables now show which characters are included in MS's Typography Specifications as their WGL4(.*) character repertoire. This might be informative in relation to which characters stand a fair chance of being rendered on recent Microsoft OSes.

The data are taken from the OpenType Specification - WGL4.0 Character Set.

Other ways of viewing Unicode data

For a different way of looking at this data, as well as a useful character search facility and other relevant resources, I highly recommend Alan Wood's Unicode Resources.

The Unicode Charts site offers images of the characters, subject to the caveats and usage limitations stated there.

The information which I provide here is offered to the best of my knowledge, but comes with all disclaimers. It is for the reader to determine whether it is useful for their purposes.