charset)
to extend repertoire?
In a usenet discussion, Andreas Prilop took a critical look, from both a practical and a theoretical point of view, at the options for dealing with back-level browser compatibility and with non-standard font repertoires, and described a diplomatic compromise. Before we look at the details, however, note that he stresses that this should not be seen as advocating or promoting this usage: what he actually advocates is deployment of the HTML4 technique, whereas this is just a diplomatic temporary fix. This, as you can tell from the other material in this area, is also my own view. Anyhow, here is his suggestion.
In order to make use of this compromise,
the relevant characters should be included as 8-bit
characters, and the document not advertised as being in
any particular charset.
From a theoretical point of view, it would be preferable to
advertise it as charset=x-user-defined,
or maybe even an invented value following the charset=x-*
pattern, but this has some negative consequences in practice,
so it may be appropriate to send the document out with no
charset specification.
In effect, the author is making no particular claims as to the
8-bit coding that they are using.
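As a rough illustration (a Python sketch that is not part of the original discussion; the function name and the invented value x-my-font-coding are hypothetical), the three options differ only in the charset parameter of the HTTP Content-Type header:

```python
from typing import Optional


def content_type(charset: Optional[str] = None) -> str:
    """Build a Content-Type value for an HTML document.

    With charset=None the header makes no claim about the 8-bit coding.
    """
    base = "text/html"
    return base if charset is None else f"{base}; charset={charset}"


# No claim at all: the compromise described in the text.
no_claim = content_type()

# Theoretically preferable, but with practical drawbacks.
user_defined = content_type("x-user-defined")

# An invented private value following the x-* pattern (hypothetical name).
invented = content_type("x-my-font-coding")

for header in (no_claim, user_defined, invented):
    print("Content-Type:", header)
```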
By using actual 8-bit characters, they are not misusing the Latin-1
entities (&name;) nor
their numerical character references (&#number;
for values below 256) to represent extraneous characters.
Of course, the usual caveats about careful handling of 8-bit
codes during cross-platform uploading, etc, apply here just as
elsewhere.
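To make the distinction concrete, here is a minimal Python sketch (not part of the original discussion; the byte value 0xE1 is an arbitrary choice for illustration). It shows that a raw 8-bit code has no fixed meaning until some coding is assumed, whereas a numerical character reference or a Latin-1 entity always denotes one specific character.

```python
import html

# One 8-bit code, as it might sit in the document body (arbitrary example value).
raw = bytes([0xE1])

# The same byte means different things under different assumed codings.
as_latin1 = raw.decode("iso-8859-1")   # U+00E1, small a with acute
as_greek = raw.decode("iso-8859-7")    # U+03B1, Greek small alpha
as_macroman = raw.decode("mac_roman")  # something else again

# A numerical reference or a Latin-1 entity is pinned to one character,
# whatever charset or font happens to be in play.
from_number = html.unescape("&#225;")
from_entity = html.unescape("&aacute;")

readings = [
    ("iso-8859-1", as_latin1),
    ("iso-8859-7", as_greek),
    ("mac_roman", as_macroman),
    ("&#225;", from_number),
    ("&aacute;", from_entity),
]
for label, ch in readings:
    print(f"{label:12} U+{ord(ch):04X}")
```

This is why writing &#225; while intending some non-Latin-1 glyph misuses the reference, whereas the bare byte, sent without a charset claim, asserts nothing that could be false.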
This is simple enough to explain when the entire document is using one non-standard repertoire: the user could be invited to simply install the relevant non-standard font and to configure their browser to use that font for the "user-defined" encoding. So far, so good. The "user defined" encoding in such a case is whatever your non-standard font arrangement implements.
Less palatable to an "HTML purist"
is the idea that some portions of a document
might be surrounded with FONT FACE= markup that
specifies a non-standard font, while the rest of the document
is treated as normal, let's say iso-8859-1, coding (although the
charset doesn't actually say so).
At the risk of sounding monotonous, let me remind you that
this is not HTML's way of extending
the character repertoire.
From the point of view of character coding theory, we now have
a document in which different codings are being used in different
portions of the document, but without using a code-switching
mechanism that works at the character code level - instead,
the code switching points are indicated by a higher-level
markup, namely the FONT tag in HTML: this is
unsatisfactory from an architectural point of view.
Nevertheless, it does give a visual impression of working, on
a reasonable range of available browser versions, so it's
understandable that some authors wish to use it, at least as a
temporary compromise.
Interestingly, 8-bit characters are also used for this purpose
in the tth (TeX-to-HTML) converter by Ian Hutchinson,
which relies on the
Symbol font trick: the author discusses the extent
to which it is supported in various browsers/platforms and
offers some browser workarounds for getting the desired effect in a few
additional situations.
It appears that Mac users stand a chance of getting the desired
effect if they pretend that the character coding is
macRoman (which gives us yet another reason to not
advertise such documents as being in iso-8859-1).
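One plausible reading of why the claimed coding matters on a Mac (this mechanism is my assumption, not something stated in the cited material) is that a browser may convert characters from the advertised coding into the platform's native coding before handing them to the font. The Python sketch below, with a hypothetical helper name, shows how such a conversion would move a byte to a different font slot, while a document treated as macRoman would pass through unchanged.

```python
def font_slot_after_transcoding(byte_value: int, claimed: str) -> int:
    """Return the slot of a MacRoman-coded font that a byte would land on
    if it were first interpreted in `claimed` and then converted to MacRoman.

    Raises UnicodeEncodeError for characters that MacRoman lacks.
    """
    char = bytes([byte_value]).decode(claimed)
    return char.encode("mac_roman")[0]


wanted_slot = 0xE9  # the font position the page writer intended (example value)

# Treated as iso-8859-1: 0xE9 is e-acute, which lives at 0x8E in MacRoman.
via_latin1 = font_slot_after_transcoding(wanted_slot, "iso-8859-1")

# Treated as macRoman: no conversion is needed, so the slot is preserved.
via_macroman = font_slot_after_transcoding(wanted_slot, "mac_roman")

print(hex(via_latin1))    # 0x8e
print(hex(via_macroman))  # 0xe9
```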
In some versions of Netscape, users can only
select their own charset if the page provider has not imposed one,
which is a practical reason for not advertising
x-user-defined.
He remarks that Latin-1 characters (represented by
&-notation) may still be displayed correctly.
The author cautions against some WWW authoring packages that will
forcibly convert the 8-bit characters to &-notations
against the page writer's intentions (and A. Prilop counsels users of
Netscape Composer to prevent this by setting its coding
option to User-defined).
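The effect of such a conversion can be sketched in Python. This is only an imitation of the behaviour described, under the assumption that the tool treats its input as iso-8859-1; the function name and the sample markup are hypothetical and do not reproduce any particular package.

```python
def naive_reference_convert(raw: bytes) -> str:
    """Imitate an authoring tool that assumes iso-8859-1 input and rewrites
    every non-ASCII character as a numerical character reference."""
    text = raw.decode("iso-8859-1")
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# An 8-bit code meant to select a glyph from a non-standard font.
source = b'<font face="Symbol">\xe1</font>'

converted = naive_reference_convert(source)
print(converted)  # <font face="Symbol">&#225;</font>
```

After the conversion the document no longer contains an uncommitted 8-bit code: &#225; explicitly means the Latin-1 character a-acute, so the markup now makes exactly the false claim that the 8-bit approach was meant to avoid.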
Having cited that document by Ian Hutchinson, I might just
comment on the way in which his document continues to plead
for his method as correct HTML, while the fact that some recent
browser versions and other platforms don't co-operate is presented
as a fault in the browser.
(However, his recent versions have gone a little way towards conceding
that not everyone shares his view of the situation, and he does now
offer support for generating Unicode instead.)
I would take a different standpoint: that he has made an understandable
compromise for the purposes of what he wanted to achieve in the
short term, but that the documented problems are a
classic demonstration of the way in which RFC2070 really did
address the platform portability issue correctly, whereas his
FONT FACE kludge leads to all kinds of difficulties
in the longer term.
I don't for a moment want to decry the amount of effort and
ingenuity that has gone into his program, but that is quite a
different topic.
And you can see that this approach actually falls into place with the Hebrew trick that was described in the previous section. But there is a difference, which Windows users can investigate with the most instructive MS 'Font Properties Extension'. Substitute fonts that are intended to implement a user-defined character coding will have no information in the "CharSet/Unicode" section of the properties display (or worse, might cheat by claiming to contain Latin character ranges when in fact they contain something quite different). This can trick even properly-implemented applications into displaying the "wrong" character (i.e. the one that the document author wanted, which is "wrong" in terms of the HTML specification). Microsoft's Symbol font, however, tells the truth in this section: that it supports the Symbol character repertoire; applications can use this information to protect themselves against being fooled (whether they actually do so would be decided by the implementer).
I'm really not trying to promote the technique
here, either, but, given that some people are determined to use it
anyway, it's interesting to note that some of the objections
of principle can arguably be defused by using 8-bit characters,
and taking care not to advertise a misleading charset.




Last changed Sunday, 23-Oct-2005 12:25:40 BST
Original materials © Copyright 1994 - 2005 by A.J.Flavell & Glasgow University