Internationalisation of messages

Some personal ramblings

These notes follow up an article that I posted to Usenet. The problem is that software authors are inclined to build the message texts into their software, leading to difficulties if someone wants to run the application in a foreign language environment. Not only does the source code have to be edited and recompiled in order to get the messages translated, but in many cases the idiom of the language demands that the parameters be inserted into the messages in a different sequence. In a construction like the printf call, this kind of rewriting is error-prone unless it has been designed in from the start.

As I noted in my posting (which must have been in 1995 or earlier, and describes something from several years before that; we phased out our IBM mainframe in 1994), IBM had introduced a technique into their VM/CMS operating system that made it straightforward for them to convert their messages into any language, at least any based on the Roman alphabet. Even better, anyone with access to the source of the messages was able to translate them into any desired language and install them, either globally for all users or temporarily for one invocation, by a simple command. This technique seems to me an ingenious solution to the problem, and applicable in many different situations. I try here to present the salient points of the technique, leaving out as much irrelevant detail as possible.

Some background

It had always been IBM's practice to construct messages in two parts: a unique message code, and a text. The original purpose of this had been to enable messages to be looked up in the appropriate manuals, but, as we will see, it prepared the way for internationalisation.

A message could range from a simple prompt, e.g.

DMSXMD537I Input:

through a simple error message with one parameter, e.g.

HCPCFS003E Invalid option - foo

to a complex explanation with several parameters.

I merely remark at this point that the first three letters of the message code identified a major software component, e.g. an operating system or a vendor package, while the next three letters identified a subcomponent of that package. The digits were an error number within the package, and the suffix indicated whether the message was Information, Warning, Error, Severe/System error etc.
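As a rough illustration, the structure just described could be picked apart like this in Python. This is only a sketch: the names MessageCode, parse_message_code and SEVERITY_NAMES are hypothetical, and the table of suffix letters is my assumption (only I and E appear in the examples so far), not a statement of what IBM actually defined.

```python
import re
from typing import NamedTuple

# Assumed mapping of suffix letters to severities; only "I" and "E"
# are attested by the example messages, the rest is guesswork.
SEVERITY_NAMES = {
    "I": "Information",
    "W": "Warning",
    "E": "Error",
    "S": "Severe/System error",
}


class MessageCode(NamedTuple):
    component: str     # first three letters: major software component
    subcomponent: str  # next three letters: subcomponent of that package
    number: int        # message number within the package
    severity: str      # one-letter suffix


_CODE_RE = re.compile(r"([A-Z]{3})([A-Z]{3})(\d+)([A-Z])")


def parse_message_code(code: str) -> MessageCode:
    """Split a code such as 'HCPCFS003E' into its four parts."""
    match = _CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a well-formed message code: {code!r}")
    component, subcomponent, number, severity = match.groups()
    return MessageCode(component, subcomponent, int(number), severity)


parsed = parse_message_code("HCPCFS003E")
print(parsed.component, parsed.subcomponent, parsed.number,
      SEVERITY_NAMES.get(parsed.severity, "unknown"))
```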

The operating system contained functions for handling these messages, i.e. for issuing a message, supplying values to be inserted into the parameters, and displaying the texts either with the message codes (the usual choice for experts) or without (the usual choice for normal users). There was every opportunity for a vendor package to take advantage of this scheme, by assigning its own message codes and calling the relevant system routines.

I emphasise that I'm merely explaining how it was. It's clear that the techniques described here could be applied in various different ways, in different contexts and scenarios. In a situation where the operating system did not itself support such facilities, an application would have to support them for itself. I am describing the technique as I was familiar with it in IBM VM software. It could well be that it is used by other vendors; indeed, I suspect that VAX/VMS has an analogous way of coding messages and may use a similar technique internally. But I am not sufficiently familiar with VAX/VMS internals to know whether an individual user could install a private language in the way it was supported on VM. I don't doubt that people familiar with the internals of various other operating systems will recognise similarities. Fine, then I have made my point.

The Technique

Traditionally, IBM used to build the error message texts into their software components, the same way I had known it in conventional programming since the early days. The only benefit of the message codes was, as I say, the ability to document uniquely where every message was issued, and to find it in a directory of error messages. (Don't you just love those complex software suites that display "Bad argument" on the screen, without giving a clue which system component issued the message, nor where it was looking for input, nor what kind of input it expected???).

However, in response to massive customer demand for native language messages, IBM introduced a different approach, fairly late in the life of the VM operating system.

The texts of the messages were removed from the software, and the system call merely requested that error message IBMABC123E be issued with parameter1 = value1, parameter2 = value2, etc. The error message texts themselves were supplied in separate files, written in plain text, and software was made available for activating a given message file, effectively "compiling" these messages into system-compatible form to facilitate efficient lookup of messages at run-time.
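To make the idea of "compiling" a plain-text message file concrete, here is a minimal sketch in Python. The file layout is entirely my assumption (one message per line, code first, then the text, with lines starting with an asterisk treated as comments); it is not the format IBM used, and compile_messages is a hypothetical name.

```python
from typing import Dict

# A made-up message source file: "<code> <text>" per line.
SAMPLE_SOURCE = """\
* Messages for the hypothetical XES package
DMSXMD537I Input:
HCPCFS003E Invalid option - %1

XESDSK100E Error %1 writing file %2 to directory %3
"""


def compile_messages(source: str) -> Dict[str, str]:
    """Turn message source text into a dictionary for fast lookup by code."""
    catalog: Dict[str, str] = {}
    for line in source.splitlines():
        line = line.strip()
        if not line or line.startswith("*"):
            continue  # skip blank lines and (assumed) comment lines
        code, _, text = line.partition(" ")
        catalog[code] = text
    return catalog


catalog = compile_messages(SAMPLE_SOURCE)
print(catalog["HCPCFS003E"])
```

Installing a different language then amounts to compiling a different source file and looking messages up in the resulting table; the program that issues the messages does not change.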

An important feature of the procedure is that the messages contained markups that showed which variable parameter was to be inserted at which place in the message. In this way, a message that in one language would read:

XESDSK100E Error %1 writing file %2 to directory %3

might appear in a different language in the form

XESDSK100E Writing file %2 to directory %3 caused error %1

if the idiom of that language called for it.
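The reordering shown in these two versions of the message can be sketched in a few lines of Python. The function names format_message and issue are hypothetical, the catalogs are ordinary dictionaries standing in for installed message files, and the parameter values are invented for the illustration.

```python
import re
from typing import Dict, Sequence

_PLACEHOLDER = re.compile(r"%(\d+)")


def format_message(template: str, params: Sequence[object]) -> str:
    """Replace each %N marker with the N-th parameter (counting from 1)."""
    def substitute(match: "re.Match[str]") -> str:
        index = int(match.group(1))
        if 1 <= index <= len(params):
            return str(params[index - 1])
        return match.group(0)  # leave unknown markers untouched
    return _PLACEHOLDER.sub(substitute, template)


def issue(catalog: Dict[str, str], code: str, *params: object,
          show_code: bool = True) -> str:
    """Look up a message by code and fill in its parameters."""
    text = format_message(catalog[code], params)
    return f"{code} {text}" if show_code else text


language_a = {"XESDSK100E": "Error %1 writing file %2 to directory %3"}
language_b = {"XESDSK100E": "Writing file %2 to directory %3 caused error %1"}

# The caller is identical in both cases; only the catalog differs.
print(issue(language_a, "XESDSK100E", 28, "notes.txt", "/tmp"))
print(issue(language_b, "XESDSK100E", 28, "notes.txt", "/tmp", show_code=False))
```

The point of numbering the markers is that the calling code always supplies the parameters in the same order, and the translator alone decides where each one lands in the sentence.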

Summary

The salient points of the scheme would therefore seem to be

ToDo-s

I personally have no idea how this extends to non-Roman messages. Do users of Greek, Hebrew, Russian, Japanese etc. expect their filenames and suchlike parameters to be displayed in Roman characters or in their own script?

There is also a problem with prompting a user for input. If the original message says:

AJFQRY001A Please respond 'Yes', 'No' or 'Quit'

and the translated version says

AJFQRY001A Bitte, 'Ja', 'Nein' oder 'Abbruch' eingeben

what is the software to make of the user's input? Presumably in a situation like this, some extra kind of markup is needed so that the application software can realise not only which input strings are acceptable, but also what meaning to attribute to them when they are accepted.
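One possible shape for such extra markup, purely as a sketch and not something I know any system to have implemented, is to let each translated prompt carry a table that maps the locally acceptable answers onto fixed internal meanings. The names PROMPTS and interpret, the language tags, and the internal tokens YES, NO and QUIT are all hypothetical.

```python
from typing import Optional

# Each language supplies the prompt text plus the answers it accepts,
# mapped onto meanings that the application itself understands.
PROMPTS = {
    "en": {
        "text": "Please respond 'Yes', 'No' or 'Quit'",
        "responses": {"yes": "YES", "no": "NO", "quit": "QUIT"},
    },
    "de": {
        "text": "Bitte, 'Ja', 'Nein' oder 'Abbruch' eingeben",
        "responses": {"ja": "YES", "nein": "NO", "abbruch": "QUIT"},
    },
}


def interpret(language: str, user_input: str) -> Optional[str]:
    """Return the internal meaning of a reply, or None if it is not acceptable."""
    responses = PROMPTS[language]["responses"]
    return responses.get(user_input.strip().casefold())


print(interpret("de", "Nein"))
print(interpret("de", "yes"))
```

The application would then test only for the internal tokens, and the translator would be responsible for keeping the prompt text and the response table in step.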