The Netscape charset Burp

major update May 2002; minor update April 2005

Preamble, and warning

The "Netscape burp" is only one aspect of a more general, and potentially security-relevant, issue concerning character coding in HTTP transfers. An article at the W3C web site, "i18n: HTTP - charset", urges authors to set a proper charset attribute on their HTTP headers (i.e. not merely in their "meta http-equiv"). According to CERT security alert CA-2000-02, it is a potential security exposure to send out HTML and other text-type documents without an explicit character coding (charset=) specified.

In practice there are two ways of specifying the character coding for HTML files sent out over HTTP: (1) the Content-type HTTP header, and (2) a META HTTP-EQUIV element within the document. (There are other issues in the case of XML/XHTML documents; this is not the place to go into those details.)

My recommendation would be to specify the character coding on the HTTP header whenever possible. Many information providers consider that specifying it via META is easier for them, and has some advantages when viewing files locally or using FTP; however, there are quite a number of theoretical and practical reasons for preferring the real HTTP header in a WWW context; not all of those reasons are set out here.

Summary of specific issue

Various versions of the Netscape 4.* browser have a tendency to "burp" when an HTML document contains a META HTTP-EQUIV that specifies a charset value for the document. We point to another resource on the topic, and offer a possible solution.

A fix was included in NS4.5PR2 and the subsequent release, but it seems there are some situations where the effect is still observed in that browser. But, as I say, there are more fundamental considerations of principle: even if/when the Netscape-4 problem is considered to be ancient history, there's still the advice of CA-2000-02 to take into account.

This page also deals with some other oddities of META HTTP-EQUIV handling in Netscape version(s).

The Burp

It's been noticed for quite a while now that some kinds of HTML document cause Netscape 4.* to "burp": it gives an impression of starting to display the document, and then briefly stops, and then starts over again. Investigating server statistics shows that it may re-load the whole document from the server. Sometimes on a form submission, NS even puts up a dialogue asking whether it should re-post the submission to the server (which is very disturbing if, in fact, the form was supposed to be placing an order, or doing something else that should not be arbitrarily repeated).

After some study and discussion, people concluded that all of the documents that were involved in this effect contained a

<META HTTP-EQUIV="Content-type" CONTENT="text/html;charset=something">

However, the burp doesn't always occur, even with pages that contain this item. After some discussion on usenet, Sander Tekelenburg created a web page to report his investigations of the burp. He concluded that the burp did not occur if the document was already in the browser's cache. Aside from that, the burp was observed if there was anything ahead of the META HTTP-EQUIV, such as an HTML comment or, importantly, an SGML DOCTYPE declaration. So, the only way to be sure of avoiding the problem if you have this kind of META HTTP-EQUIV in the document is to put it right at the top.

Well, as Sander pointed out, it's technically mandatory to have a DOCTYPE, and its absence causes problems for on-line validation etc.; the DOCTYPE must of course come before the HEAD. But, according to the HTML4.0 recommendation, it is also mandatory to specify the document's charset, in at least one of the available ways: the HTML4.0 recommendation goes so far as to forbid client agents from assuming a default charset (even if browser designers tend to disregard that mandate), and the alert CA-2000-02 cited above also gives a motivation for authors to define this attribute.

One solution

There are two ways that are generally available for specifying a charset: a META HTTP-EQUIV in the HEAD, or a real HTTP header on the network transaction. Many people appear to be convinced that the only one of these which is actually available to them is the META HTTP-EQUIV, supposing that the other is not accessible to them on the server that they use.

Well, of course I can't guarantee any particular case, but I can report that numerous people who have tried the following recipe, on the server that they use, have found to their surprise (and in some cases, to the surprise of their server admin!) that it works. Certainly, this is defined to work on Apache and NCSA, although it's possible for a server admin to enable or disable whether AddType directives are honoured in the .htaccess file. Well, if you don't try it, you'll never know.

"Why make do with an ersatz HTTP-EQUIV, when you could have a real HTTP header?"

Recipe

In the .htaccess file of the relevant (or higher) subdirectory on the web server, place an entry such as the following:

    AddType  text/html;charset=iso-8859-1  html

This specifies that for file extensions of html, the document will be sent out with a modified Content-type header, with the charset specified as shown. This can be extended to other charset values if you use different filename extensions according to the desired charset value.
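
As a minimal sketch of that extension, assuming the hypothetical filename extensions latin2 and utf8 (illustrative names chosen here, not taken from the recipe above or from any standard), the .htaccess file might contain one entry per charset:

    # Hypothetical extensions; choose whatever naming suits your site
    AddType  text/html;charset=iso-8859-1  html
    AddType  text/html;charset=iso-8859-2  latin2
    AddType  text/html;charset=utf-8       utf8

Under these assumptions a file named page.latin2 would be sent out as text/html with charset=iso-8859-2, while page.html would continue to be sent with charset=iso-8859-1.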

This form of the directive works even for antique versions of Apache version 1. Nowadays it's more customary to set the content-type and "charset" separately, as mentioned below.

You'll find examples of the above technique used in my charset Playground.

You may notice that the example given in the HTML recommendation has a space between the semicolon and the "charset", but this space isn't mandatory.

If in any doubt on server configuration issues such as this, don't hesitate to consult the excellent Apache server documentation (which should also be bundled with whichever version of Apache you are using). In current Apache versions you can control the Content-type, e.g. text/html, separately from the charset (with AddType and AddCharset directives respectively) if you prefer.
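
As a sketch of that separate form, assuming an Apache version whose mod_mime provides the AddCharset directive and a server admin who permits these directives in .htaccess, the earlier recipe could be written as two entries:

    # Media type and character coding set by separate directives
    AddType     text/html   html
    AddCharset  ISO-8859-1  html

The intent is the same Content-type header as the single-line form; whether a given server honours it is something to verify by inspecting the actual response headers.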

Commented-out META...charset
(a new report of an old bug)

In May 2002, A.Prilop called my attention to a long-standing misbehaviour in Netscape versions up to and including 4.*, of which I had been unaware. According to his report, the following construct

<!-- <META HTTP-EQUIV="Content-type" CONTENT="text/html; charset=something"> -->

resulted, in spite of the META element supposedly being commented-out, in Netscape using the enclosed character coding in rendering the page. Reportedly this bug has been present since version 2.0. I looked into this myself in some recent versions (4.7x) as well as an older version (3.01) of Netscape, and found that the problem was even more curious. As A.P had reported, indeed the page was rendered according to the commented-out META; confusingly, however, the View->Page Info menu showed the document coding as "Unknown" (on version 4.*; "(default)" on version 3.01), as if the commented-out element had been correctly ignored. Nevertheless, it was most definitely the case that this character coding was being used to determine the page rendering (there was no other possible cause for the 8-bit characters in the test documents to be rendered as they were being).

The conclusion from this was that a META...charset cannot be successfully commented-out by enclosing the whole element in HTML comment markers. What is, on the other hand, successful is to replace the pointy brackets of the element itself with the HTML comment delimiters, so that no complete META element remains inside the comment:

<!-- META HTTP-EQUIV="Content-type" CONTENT="text/html; charset=something" -->

This was tested and proven to work.

Continuing on this theme of broken HTML parsing in Netscape versions up to and including 4.*, A.P reported having once found a web page containing the following construct:

<meta charset="ISO-8859-2">

and being surprised to find that it actually "worked" in Netscape. And again, when this incorrect construct was commented-out in this way:

<!-- <meta charset="ISO-8859-2"> -->

Netscape again used the commented-out character coding for rendering the 8-bit characters, in spite of pretending in its response to View->Page Info that the character coding of the page was unknown.

