User-defined character encoding (charset) to extend repertoire?

In a Usenet discussion, Andreas Prilop took a critical look, from both a practical and a theoretical point of view, at the options for dealing with back-level browser compatibility and with non-standard font repertoires, and described a diplomatic compromise. Before we look at the details, however, I should note that he stresses that this is not to be seen as advocating or promoting this usage: what he actually advocates is deployment of the HTML4 technique, whereas this is just a diplomatic temporary fix. This, as you can tell from the other material in this area, is also my own view. Anyhow, here is his suggestion.

In order to make use of this compromise, the relevant characters should be included as 8-bit characters, and the document should not be advertised as being in any particular charset. From a theoretical point of view, it would be preferable to advertise it as charset=x-user-defined, or maybe even an invented value following the charset=x-* pattern, but this has some negative consequences in practice, so it may be appropriate to send the document out with no charset specification. In effect, the author is making no particular claims as to the 8-bit coding that they are using. By using actual 8-bit characters, they are not misusing the Latin-1 entities (&name;) nor their numerical character references (&#number; for values below 256) to represent extraneous characters. Of course, the usual caveats about careful handling of 8-bit codes during cross-platform uploading, etc., apply here just as elsewhere.
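To make the distinction concrete, here is a minimal Python sketch (my own illustration, not part of the original discussion). It decodes one and the same 8-bit value under two charsets that a reader's browser might happen to assume, and compares that with a numerical character reference, whose meaning is fixed. The byte value 0xE9 and the helper name interpretations are arbitrary choices made for the example.

```python
import html

# One 8-bit character exactly as it would be stored in the file.
RAW = bytes([0xE9])

def interpretations(raw: bytes) -> dict:
    """Decode the same bytes under charsets a browser might assume."""
    return {name: raw.decode(name) for name in ("latin-1", "mac_roman")}

views = interpretations(RAW)
for name, text in views.items():
    # The byte is identical; only the assumed charset changes the result.
    print(f"{name:10} -> U+{ord(text):04X}")

# A numerical character reference, by contrast, always names one character.
ncr = html.unescape("&#233;")
print(f"&#233;     -> U+{ord(ncr):04X}")
```

The raw byte is only a number until some charset is assumed for it, which is why leaving the charset unadvertised makes no false claim, whereas &#233; would assert a specific Latin-1 character.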

This is simple enough to explain when the entire document is using one non-standard repertoire: the reader could be invited to simply install the relevant non-standard font and to configure their browser to use that font for the "user-defined" encoding. So far, so good. The "user-defined" encoding in such a case is whatever your non-standard font arrangement implements.

Less palatable to an "HTML purist" is the idea that some portions of a document might be surrounded with FONT FACE= markup that specifies a non-standard font, while the rest of the document is treated as normal, let's say iso-8859-1, coding (although the charset doesn't actually say so). At the risk of sounding monotonous, let me remind you that this is not HTML's way of extending the character repertoire. From the point of view of character coding theory, we now have a document in which different codings are being used in different portions of the document, but without a code-switching mechanism that works at the character code level. Instead, the code-switching points are indicated by higher-level markup, namely the FONT tag in HTML, which is unsatisfactory from an architectural point of view. Nevertheless, it does give a visual impression of working, on a reasonable range of available browser versions, so it's understandable that some authors wish to use it, at least as a temporary compromise.
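The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SYMBOL_EXCERPT is a hypothetical three-entry excerpt of a Symbol-style font layout that I have filled in for the example, and the function names font_trick and html4_references are invented here.

```python
# Hypothetical excerpt of a Symbol-style font layout:
# byte value in the document -> character the author intends.
SYMBOL_EXCERPT = {0x61: "\u03b1", 0x62: "\u03b2", 0xA5: "\u221e"}

def font_trick(raw: bytes) -> str:
    """Markup-level switching: the bytes are passed through as they stand,
    and only the FONT tag hints that they mean something else."""
    return '<FONT FACE="Symbol">' + raw.decode("latin-1") + "</FONT>"

def html4_references(raw: bytes, table: dict) -> str:
    """HTML4 technique: name each intended character explicitly."""
    return "".join(f"&#{ord(table[b])};" for b in raw)

sample = bytes([0x61, 0xA5])
print(font_trick(sample))
print(html4_references(sample, SYMBOL_EXCERPT))
```

With the first form, a browser that ignores FONT or lacks the font falls back to whatever the bytes mean in its default coding. The second form carries the intended characters in the document itself, independent of any font.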

Interestingly, 8-bit characters are also used for this purpose in the tth (TeX-to-HTML) converter by Ian Hutchinson, which relies on the Symbol font trick: the author discusses the extent to which it is supported in various browsers/platforms and offers some browser workarounds for getting the desired effect in a few additional situations. It appears that Mac users stand a chance of getting the desired effect if they pretend that the character coding is macRoman (which gives us yet another reason not to advertise such documents as being in iso-8859-1). In some versions of Netscape, users can only select their own charset if the page provider has not imposed one, which is a practical reason for not advertising x-user-defined. Hutchinson remarks that Latin-1 characters (represented by &-notation) may still be displayed correctly. He also cautions against some WWW authoring packages that will forcibly convert the 8-bit characters to &-notations against the author's intentions (and A. Prilop counsels users of Netscape Composer to prevent this by setting its coding option to User-defined).
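To see why that forced conversion matters, consider this rough Python sketch of what such an authoring package effectively does. It is a mimic written for illustration, not the behaviour of any particular product: force_entities is a hypothetical name, and it simply assumes the bytes are Latin-1 and rewrites each 8-bit character as the corresponding named entity.

```python
from html.entities import codepoint2name

def force_entities(raw: bytes) -> str:
    """Mimic an editor that assumes Latin-1 and rewrites 8-bit
    characters as &-notations (illustrative only)."""
    out = []
    for b in raw:
        if b >= 0x80 and b in codepoint2name:
            # The byte now claims to be a specific Latin-1 character.
            out.append(f"&{codepoint2name[b]};")
        else:
            out.append(chr(b))
    return "".join(out)

# Byte 0xA5 was meant as a position in the author's own font layout.
print(force_entities(bytes([0x78, 0xA5])))
```

After the rewrite the document no longer contains an uncommitted 8-bit value: it positively asserts the Latin-1 yen sign, which is exactly the misuse of entities that the compromise was trying to avoid.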

Having cited that document by Ian Hutchinson, I might just comment on the way in which his document continues to plead for his method as correct HTML, while the fact that some recent browser versions and other platforms don't co-operate is presented as a fault in the browser. (However, his recent versions have gone a little way towards conceding that not everyone shares his view of the situation, and he does now offer support for generating Unicode instead.) I would take a different standpoint: that he has made an understandable compromise for the purposes of what he wanted to achieve in the short term, but that the documented problems are a classic demonstration of the way in which RFC2070 really did address the platform portability issue correctly, whereas his FONT FACE kludge leads to all kinds of difficulties in the longer term. I don't for a moment want to decry the amount of effort and ingenuity that has gone into his program, but that is quite a different topic.

And you can see that this approach actually falls into place with the Hebrew trick that was described in the previous section. But there is a difference, which Windows users can investigate with the most-instructive MS 'Font Properties Extension'. Substitute fonts that are intended to implement a user-defined character coding will have no information in the "CharSet/Unicode" section of the properties display (or worse, might cheat by claiming to contain Latin character ranges when in fact they contain something quite different). This can trick even properly-implemented applications into displaying the "wrong" character (i.e. the one that the document author wanted, which is "wrong" in terms of the HTML specification). Microsoft's Symbol font, however, tells the truth in this section: that it supports the Symbol character repertoire. Applications can use this information to protect themselves against being fooled (whether they actually do so is up to the implementer).

I'm really not trying to promote the technique here, either, but, given that some people are determined to use it anyway, it's interesting to note that some of the objections of principle can arguably be defused by using 8-bit characters, and taking care not to advertise a misleading charset.


Last changed Sunday, 23-Oct-2005 12:25:40 BST