This page presents a number of character repertoire scenarios and makes recommendations for optimising accessibility on older browser versions. The emphasis here is on clear recommendations - which can be rendered on an appropriate range of browsers if they have been properly configured - rather than on explaining too many exceptions and special cases: other supporting material in this area should be helpful in better understanding the choices. Handling of forms input isn't covered here (some incomplete notes on forms submission are available separately).
Important: this web page concentrates on the form of the document as it will be sent out as text/html from a web (HTTP) server, and does not address how to author it in the first place nor how to get it onto the server. Those details are too many and varied (and OS-dependent) to deal with them adequately here; whereas what is sent from the server to the client is clearly-defined and platform-independent (if it was not, then it would be a failure in WWW terms!), and that is what we are concentrating on here.
Compatibility for older browsers is only really relevant for the content-type of text/html: that is to say, either "HTML proper", or compatibility-mode XHTML/1.0 ("appendix C"). Those who want to use overtly XML-based content types will inevitably be incompatible with older browsers (and quite a few current ones, indeed), and so, if utf-8 is appropriate, then just go ahead and use it. There's a note about writing xhtml/1.0 for compatibility.
As I've commented elsewhere in this area, the terminology used in relation to the representation of characters in HTML and the WWW often causes confusion to those who gained experience in character handling in a different field, e.g word-processing. This is not the place for a full tutorial: I've tried to keep the checklist understandable even without a deep knowledge of the topic, but I do encourage readers to develop a familiarity with the HTML character model to avoid unnecessary confusion.
"8-bit Character Repertoire" refers here to a repertoire of no more than 256 characters that is supported by one of the various 8-bit character codes. Examples would be the 8-bit codes defined by ISO (iso-8859-n for various values of n) or by others (localised encodings such as Thai encoding TIS-620, or vendor-defined encodings such as Windows-1250, macRoman...).
An 8-bit repertoire represents the largest repertoire that many early browsers could support at any one time, at least by means of specification-conforming techniques (and remember, an HTML document has to be entirely in one encoding: it's not possible to change coding partway through a single document). Many of these older browsers could support different "8-bit character repertoires", with the browser automatically switching in response to the incoming character encoding (the MIME "charset" attribute) of each individual document.
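To see why the per-document switching matters, here is a small Python sketch (using only the standard codecs) showing that one and the same 8-bit coded character denotes quite different characters under different advertised encodings:

```python
# The same octet means different things under different 8-bit charsets -
# the browser must switch its interpretation per document, based on the
# incoming character encoding.
octet = b"\xe4"
assert octet.decode("iso-8859-1") == "\u00e4"   # a-umlaut in Latin-1
assert octet.decode("iso-8859-7") == "\u03b4"   # Greek small delta in iso-8859-7
print(octet.decode("iso-8859-1"), octet.decode("iso-8859-7"))
```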
Some more-recent browser versions, even if not offering full support for Unicode, nevertheless could deal with a wider document repertoire than just one 8-bit encoding, as in scenario 5 for example.
"advertise as" refers to the specified "character encoding" (HTML terminology) with which the document is to be sent out from the HTTP server. This should be defined by the "charset" parameter (correct MIME terminology, but now rather misleading in an HTML context) specified on the HTTP Content-type header. The specifications also allow for this to be specified via meta http-equiv within the HTML document, but this is less satisfactory both on theoretical grounds and on some practical considerations, as is discussed in more detail elsewhere.
"Coded character" refers to the character itself, expressed in the advertised character encoding: i.e a single octet (byte) if this is an 8-bit coding, or an appropriate sequence of octets if this is a multibyte coding. This is in distinction to a character expressed by one of the &-representations: a character entity of the form &entityname;, or a numerical character reference of the form &#number;, in decimal (widely supported) or hexadecimal (somewhat less widely supported), remembering that the numbers in &#-notation refer to the character's position in iso-10646/Unicode, irrespective of the advertised character encoding (charset).
Theoretically, the three different kinds of character representation - the coded character, the numerical character reference, and (where available) the named character entity - are fully equivalent as far as HTML is concerned; the point of this checklist is the practical issues that favour the choice of one representation rather than another in various actual situations (the "scenarios") presented below.
I'm not attempting to cover characters of the Chinese, Japanese, Korean (CJK) kinds, as these are outside my area of expertise.
Scenario 1 — Latin-1. Recommendation: use us-ascii coded characters, representing the non-ASCII Latin-1 characters by &-notation. Notes: particularly recommended for those working cross-platform (e.g Macs) without the relevant expertise in handling 8-bit coded text. See also the WDG's advice.
Scenario 2 — Latin-1. Recommendation: alternative to scenario 1: use 8-bit coded characters, advertised as iso-8859-1. Notes: contrary to rather widespread superstition, 8-bit coded characters are entirely legal on the WWW (see Note A).
Scenario 3 — Latin-1 with Windows typographical extras (matched quotes, em-dash etc.). Recommendation: not recommended, see Note B. Notes: if extended character coverage is being used anyway, then use the methods of scenario 6 (or 7).
Scenario 3a — Windows-1252 repertoire. Recommendation: proprietary and not really recommended, but admittedly rather widely supported, even by relatively old browser versions which cannot handle scenario 6: use 8-bit characters, coded and advertised as windows-1252. Alternatives: Note B. Notes: the more forward-looking approach is to follow the methods of scenario 6 or 7. A composite of Latin-1 with one other 8-bit repertoire (e.g Latin-2) could be done as in scenario 5. See also discussion in Note B.
Scenario 4 — One 8-bit repertoire. Recommendation: choose an 8-bit encoding appropriate to the desired repertoire (preferably an ISO code, e.g iso-8859-7 for Greek, or one that is widely used in its native habitat, e.g TIS-620 for Thai); use 8-bit characters, advertised accordingly. Notes: this form of document is accessible to a very wide range of browser versions, even old ones, although it might require additional setup or fonts to take advantage of the browser's capability. For HTML use, avoid iso-8859-15.
Scenario 5 — One 8-bit non-Latin-1 repertoire, together with Latin-1. Recommendation: code the non-Latin-1 characters as in scenario 4, as 8-bit coded characters, advertising the document with the appropriate encoding (charset), and represent the Latin-1 characters by their &entityname; representations. Notes: this form of document is entirely valid and accessible to any client which conforms to RFC2070, but many characters fail on Netscape 4.* versions. See Note F. See scenario 8 for a possible workaround.
Scenario 6 — More than one 8-bit repertoire, but predominantly Latin text. Recommendation: code everything using only us-ascii characters (i.e 7-bit), expressing all other characters, even Latin-1, by means of &-notation (for Latin-1 characters, the &entityname; form is preferred); advertise the document as utf-8. Notes: this of course needs a browser version which supports enough of HTML4/RFC2070 to understand what's needed. It therefore shuts out some old browser versions which could have coped with scenario 4 or 5. The browser might need some extra setup to enable this capability, e.g extra font(s) and settings. Refer also to scenario 8. See Note C.
Scenario 7 — More than one 8-bit repertoire, not limited to predominantly Latin text. Recommendation: use actual utf-8 coded characters, advertised as utf-8. Notes: just like scenario 6, this is an entirely valid form to send out documents, and is acceptable to any RFC2070-conforming browser as well as to Netscape 4.* versions. Browser coverage for the two forms seems rather similar. The expected difficulties are not in the browsers, but in authors (mis)handling this unfamiliar data format.
Scenario 8 — Compromise solution for scenario 5, using techniques of scenario 6 or 7 for browsers which support them. Recommendation: make the document in two different forms (or generate them as required "on the fly"), and use server negotiation (based on the client's Accept-charset header) to choose which form to send.
All of the schemes recommended here utilise valid techniques according to published specifications and can (subject of course to the limitations of each scheme) be programmatically converted from one form to another. Thus it isn't essential that your authoring tool produces the precise form recommended. There are numerous ways of doing such a conversion in an HTML-aware fashion, including simple command-line utilities and pipelines depending on your preferences - some of which could be used for on-the-fly conversion in the server, if you wish.
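As one concrete illustration of such a programmatic conversion (a Python sketch with made-up sample data; note it is deliberately not HTML-aware, so a real tool would need to leave markup and existing &-notations alone):

```python
# Convert an 8-bit coded fragment into the pure us-ascii, &#number;
# form of scenario 6. Python's "xmlcharrefreplace" error handler
# emits decimal numerical character references for every character
# outside the target repertoire.
raw = b"Libert\xe9 \x96 \xe9galit\xe9"             # windows-1252 octets
text = raw.decode("windows-1252")                  # octets -> characters
ascii_form = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_form)   # Libert&#233; &#8211; &#233;galit&#233;
```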
For those who prefer a point-and-click solution prior to uploading to the server, it may be mentioned that your HTML could be loaded into Mozilla Composer (or one of its derived authoring packages such as Nvu), and then saved with encoding conversion: characters in the content will be converted between coded characters and &-representations as appropriate for the newly-specified character encoding. To be specific, the Composer/Nvu menu item for this is File > Save And Change Character Encoding.
A word of warning: even though the browser versions discussed here are technically competent to do what is being asked of them, it's not certain that a particular browser installation will do it properly: the user might need to supply some fonts supporting a specialised repertoire (e.g Thai, Armenian...) or install optional rendering features (e.g right-to-left text, Indic script support...). It might be helpful to supply a little test-case page, with a screen-shot image for them to compare, and some notes on any special actions they'd need to take to set up their browser for this situation.
Points of interest are not only the accessibility of your documents to users' browsers, but also to search engines. A.Prilop cautioned that search engines had been slow to support indexing of utf-8-encoded content - some earlier problems with AltaVista search seem to have been fixed, but for best results across search engines it might still be advisable to offer appropriate 8-bit encoding(s) as alternatives to a utf-8 version, along the lines shown in scenario 8. The relevance of this is fading with time, however (2005).
Note that even those search engines which support utf-8 may have no support yet for utf-16 or utf-32 encodings: in WWW situations where a unicode character encoding is desired, then we definitely recommend the use of utf-8. As for utf-7, it is now considered obsolete by the Unicode consortium, and there seems to be no justification for using it in a WWW context (HTTP is a guaranteed 8-bit protocol, after all), quite apart from dubious search-engine support.
Use of Latin-1 character entities (i.e in the form
&name;), in preference to other
representations of these characters, can be beneficial
as far as locating Latin-1 strings in any encoding, but of course
this doesn't help when the characters to be located are not
in the Latin-1 repertoire.
When we come to the non-Latin-1 character entities of HTML4, on the other hand, there's a dilemma. There seems no doubt that the &#bignumber; format is more widely implemented than the &entityname; form, if only because of Netscape 4.* versions. On the other hand, a browser which does not implement &entityname; is likely to display something reasonably intuitive (i.e the uninterpreted source code), whereas one that doesn't implement &#bignumber; is likely to display incomprehensible garbage. So it's hard to give general advice about which form to prefer: it depends on the context, and on your priorities for the fallback behaviour in old browsers (recent ones are not a problem in this respect).
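The theoretical equivalence of the forms is easily demonstrated; a small sketch using Python's html module:

```python
# The named entity and the decimal and hexadecimal numerical character
# references all denote the same iso-10646/Unicode character; only
# browser support (and fallback behaviour) differs.
import html

assert html.unescape("&mdash;") == "\u2014"    # &entityname; form
assert html.unescape("&#8212;") == "\u2014"    # decimal NCR
assert html.unescape("&#x2014;") == "\u2014"   # hexadecimal NCR
print("all three forms denote U+2014 EM DASH")
```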
Observations indicate that "combining marks" (the Unicode General Category values Mn, Me and Mc) are not as well supported by browsers as are precomposed letters. Also, support in search engines for combining marks seems to be poor: support is demonstrably better for precomposed letters.
The advice therefore is to use precomposed accented letters wherever they exist, in preference to base letters plus combining mark(s), because they work better with current browsers and fonts, and with search engines. This is certainly true for Latin, Greek and Cyrillic, at least.
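The relationship between the two forms can be sketched with Python's unicodedata module: the sequences are canonically equivalent, and NFC normalisation yields the precomposed letter recommended here.

```python
# A base letter plus combining mark is canonically equivalent to the
# precomposed letter; NFC normalisation gives the precomposed form.
import unicodedata

decomposed = "e\u0301"                            # 'e' + COMBINING ACUTE ACCENT
precomposed = unicodedata.normalize("NFC", decomposed)
assert precomposed == "\u00e9"                    # LATIN SMALL LETTER E WITH ACUTE
assert len(decomposed) == 2 and len(precomposed) == 1
```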
Contrary to rather widespread superstition, 8-bit coded characters are entirely legal on the WWW: indeed, if you are working outside of the Latin-1 repertoire, and want to be accessible also to older browsers, you have little choice (scenario 4). However, documents containing 8-bit coded characters are less robust against mishandling during authoring and publishing to the WWW, by cross-platform transfers without due attention to 8-bit encoding issues, and when browsing files locally via file: URLs, where no character-encoding information is passed as part of the protocol.
Nevertheless, I would recommend that you design for browsing via a properly-used and -configured HTTP server, and not let your decisions be slanted by these more-local issues. It's your responsibility to research the server uploading facilities which are available to you (there are far too many to even start to discuss them here), and to work out how to use them to get your chosen encoding(s) onto the server so that they will conform to WWW interworking standards when accessed by your readers.
The "Windows Latin-1" repertoire (i.e the repertoire of characters in the Windows-1252 coding) covers the complete ISO Latin-1 repertoire, plus additional characters: typographic niceties (em-dash, en-dash, matched quotes etc.), and characters from the Latin-9 repertoire (Z-hacek, S-hacek, euro currency character, etc.: see J.Korpela's Latin-9 overview).
Latin-9 is the repertoire of the iso-8859-15 code. There seems in fact to be no point in using iso-8859-15 in HTML: by the time that browsers were supporting iso-8859-15, they were also supporting sufficient of the techniques needed for scenario 6 or 7 to be able to use one of those more-versatile approaches. iso-8859-15 has its advantages for plain-text email, but for HTML it seems best avoided.
This note recommends that you not use a proprietary Windows code (specifically, Windows-1252 for Latin-1) merely for the purpose of getting those typographic niceties (scenario 3). However, there could be situations where the use of a wider range of Windows characters is required, that is also covered by some older browser versions, and it's certainly arguable (though I'm personally opposed to it, and any justification fades with time) that the use of Windows-1252 code is preferable to risking the use of Unicode; this option has now been noted as scenario 3a.
The windows-1252 code, albeit a proprietary character code, is otherwise standards-conforming. Compare this with valid HTML4 techniques which could be used under scenarios 6 or 7 without any criticism of principle, but would limit the accessibility of the document to browsers which support that part of HTML4, which might not be such a good idea if there is no other requirement for an extended character set.
The use of MS-Windows characters can have adverse practical effects too, as is set down in trenchant terms at the demoronizer site. There's also an informative article by J.Korpela.
A widespread "non-standard" method uses undefined
numerical character references
&#number; in the
range 128 to 159 decimal, which the published HTML declarations
mark as being unused.
These are unacceptable in standard HTML, and the Validator is now complaining about them.
I can only admit that this misuse is, statistically, widely
supported - because, statistically, many people use
MS software, and some other implementers felt the pressure to
implement this misuse too, no matter what the specs say.
But for wide coverage across many browsers and versions,
it really should be avoided.
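The nature of the misuse can be made concrete with a Python sketch: the number in such a reference is in effect being read as a windows-1252 byte value rather than as a Unicode code point, and decoding that byte recovers the character the author meant (and hence the valid reference).

```python
# Why &#147; is a misuse: 147 is not an HTML character, but lenient
# browsers read it as the windows-1252 byte 0x93. Decoding that byte
# as windows-1252 recovers the intended character and its valid NCR.
meant = bytes([147]).decode("windows-1252")
assert meant == "\u201c"                          # LEFT DOUBLE QUOTATION MARK
print(f"the valid reference is &#{ord(meant)};")  # &#8220;
```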
Windows-1252 is registered at IANA:
its use as an 8-bit character code seems less common in practice
than the non-standard
&#number; values just mentioned.
Well, although IANA registration means that the code is
legal as a MIME specification on the Internet, it isn't
the case that WWW clients are mandated to accept it.
So, again we have the situation that although statistically widespread
due to the wide usage and strong influence of this vendor,
you still get wider browser coverage by not challenging
browsers with this vendor-defined character code.
If you decide (against my better advice!) to do this,
then there is some advantage in
nevertheless still writing the Latin-1 characters by means of
&entityname; representation in HTML, as
these could still be understood even by those few clients that don't
support the Windows-1252 code specification.
An entirely plausible valid approach would be to represent the Windows typographic characters by their HTML4 character entity names, such as &mdash;, &lsquo;, &trade; and so on (— ‘ and ™ etc.). These have in fact been around for a while, and are understood even by a number of older browsers that do not support utf-8 and would not be able to understand the corresponding &#bignumber; references.
Sadly, this approach was sabotaged by Netscape 4 failing to
implement these entity names; and, if you don't care about NN4,
there are better ways to represent such data anyway.
Support for &euro; seems rather better than for the other HTML4 character entities: see J.Korpela's page on the euro. I'm suggesting that if the euro is the only additional character required, then &euro; used in scenarios 1 or 2 is acceptable, and preferable to trying to use iso-8859-15 in an HTML context. If legal accuracy is essential then the only possible recommendation would be to use the EUR banking code.
Another valid approach is to advertise the document as iso-8859-1, but to include the MS-Windows characters by means of their &#bignumber; Unicode references. This works well on Win-NN4 versions, but may cause problems with older browsers on some other platforms.
So, on balance, I'm recommending to avoid these characters unless your document is already requiring a wide character repertoire such as in scenarios 6 or 7, in which case you could assume that any browser that can cope with the needed repertoire will also be able to cope with the Windows typographic characters - expressed, of course, in a correct HTML4 representation.
WebTV (the product as offered soon after its takeover by MS, see version 2.8 of WebTV Viewer) evidently treated all encodings as if they were Windows-1252, minus a few characters (the euro character is missing, as are Z-hacek and z-hacek, in this version, tried in 2003). This WebTV fails to recognise the Windows characters' Unicode references such as &#8220;, which is a nasty fault in a browser claiming to support HTML4. I would rate this product as incapable of rendering the i18n aspects of HTML4, and would give it no further consideration when I am authoring in such contexts.
Later, MS offered a new product, "MSN TV 2", said to be based on MSIE; as yet I've found no corresponding developer's viewer to investigate its support for other character encodings and repertoires, but an email correspondent tells me that it has quite good support, and included some digital photographs from the TV screen of the browser successfully displaying Cyrillic.
This technique (as also explained in the Quickstart
page) relies on the fact that the UTF-8 encoding of Unicode has
been deliberately designed so that US-ASCII is a proper subset of it.
What we are doing here is to formulate our page using only the
characters of US-ASCII, but pretend that it is
UTF-8 (which, in a sense, it is) in order to fool Netscape 4.*
into its Unicode mood.
This is a perfectly valid option of HTML4 (albeit a rather bulky one
if large numbers of characters have to be represented in
&#bignumber; terms), and thus will also be
acceptable to any other client agents which support this part of
RFC2070 and HTML4.
Again, please refer to the Quickstart etc. in
this area for further discussion of this option.
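The design property being relied on can be shown in two lines of Python: a document containing only 7-bit characters is simultaneously valid us-ascii and valid utf-8, which is what makes the scenario 6 labelling honest.

```python
# UTF-8 was deliberately designed so that us-ascii is a proper subset
# of it: any pure 7-bit octet stream decodes identically either way.
doc = b"<p>&#945; is Greek alpha</p>"       # us-ascii octets only
assert doc.decode("us-ascii") == doc.decode("utf-8")
assert max(doc) < 0x80                      # every octet is 7-bit
```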
I'm recommending use of the &#bignumber; representation, even where the HTML4 specifications lay down an &entityname; representation, since these entity names (aside from the Latin-1 characters, where &entityname; is preferred, and a small subset of others) are not as widely supported as one would hope (again, Netscape 4.* is a major offender in this regard).
But do keep in mind that these browsers, although technically capable of what is being suggested here, will only work when configured to use suitable fonts. There may also be additional problems with X-based versions (e.g Linux) of Netscape 4.*, whose support for Unicode is quite incomplete.
Of course, the composite approach of scenario 8 doesn't help to make a "scenario 5" document accessible to older browsers that don't support either technique, but there's not much we can do for those (short of fiddling around with in-line images) if the material necessarily calls for this kind of character repertoire.
Since there's no way of determining for sure whether the user setup is satisfactory or not, I'd have to conclude that an author is within their rights to send utf-8 format to any client which includes utf-8 in its Accept-charset header: beyond that, it's the user's responsibility to ensure that their browser is properly configured to do what it claims.
Attempts to negotiate what one sends from a server according to the user-agent string, as opposed to using Accept-charset header(s) or some other test of actual client capability, are not only fundamentally flawed on theoretical grounds, but virtually all of the practical attempts that I have seen have had major loopholes in their implementation: I really can't advise going that way.
Clients which don't indicate support for UTF-8 would then get sent the document variant coded as described in scenario 5. This has reasonably good coverage (for example Win IE3 and several older minority browsers that had been tested - see the old browser tests report in this area).
It's perfectly fine to offer both variants explicitly, if you want to give users the option or if you don't want to tangle with server-side negotiation. Myself, I'd prefer not to trouble users with technical details, but sometimes it's hard to avoid it.
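A minimal sketch of the scenario 8 decision, in Python (the filenames are hypothetical, and a production version should also honour q-values, where q=0 excludes a charset):

```python
# Send the utf-8 variant only to clients whose Accept-charset header
# claims utf-8 support; otherwise fall back to the 8-bit-coded variant.
def choose_variant(accept_charset: str) -> str:
    accepted = [item.split(";")[0].strip().lower()
                for item in accept_charset.split(",")]
    return "page-utf8.html" if "utf-8" in accepted else "page-8bit.html"

assert choose_variant("iso-8859-5, utf-8;q=0.7, *;q=0.1") == "page-utf8.html"
assert choose_variant("iso-8859-5") == "page-8bit.html"
```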
Although forms input isn't explicitly covered here, it's worth mentioning that although Netscape 4.* versions indicate utf-8-capability in their Accept-charset, and can indeed handle it pretty well as far as rendering is concerned (some limitations on unix/linux-based versions), the fact is that they can't deal with it for forms submission.
Netscape versions 4.* are fundamentally broken under this
scenario, although there is a subset of possibilities that works.
Basically, if the characters called for by
&#number; references are not available in the
repertoire that is implied by the specified character code (possibly
augmented by the MS-Windows characters in the range 128 to 159 decimal),
then Netscape 4.* refuses to display them.
In short, with this wretched browser family, &-notations are only supported where they theoretically aren't needed, and they fail precisely when you do need them.
With basically Latin scripts such as Latin-2, Baltic area, Turkish etc. you can still use the majority of accented characters that you might need in Latin-1: it's recommended, as usual due to the shortcomings of various pieces of software, to represent the Latin-1 characters by their character entity names, even though in theory the 8-bit coded character would be entirely equivalent.
RFC2070 is the original specification which codified the character representation model upon which HTML4 and later are based, including XML and XHTML. It is also explained quite well in section 5 of the HTML4.01 specification, as well as in much more detail in a recent W3C TR, Character Model for the World Wide Web.
At least a working familiarity with this character representation model is essential for anyone working with a rich character repertoire on the web. A word of warning: it's been my experience that many folks who come to the web from other application areas, such as word processing, with the confident belief that they already understand this topic, often turn out to be hopelessly confused about the HTML character model.
If you want to write Appendix-C compatible XHTML/1.0, then you have basically two choices for advertising the character encoding (charset) according to the specifications:

Supply the character encoding on the HTTP Content-type header (the real header, not a meta http-equiv). You may then omit the <?xml...> XML declaration and the meta http-equiv content-type specification. You may include one or both of those if you wish - of course they will need to be consistent with what the real HTTP header specifies.

If the character encoding is not specified on the HTTP header, then the compatibility guidelines of Appendix C call for the character encoding to be specified on both the <?xml...> XML declaration and the meta http-equiv Content-type specification.
However, this is not a good idea in practice, as the
<?xml...> XML declaration can cause problems in
some non-XHTML-aware browsers (most notably MSIE, which goes into
its "quirks" mode in response).
If you really, really cannot specify the charset on the HTTP header (which is definitely my preferred solution), then you could use utf-8 encoding: omit the <?xml...> declaration, leaving XML to take utf-8 as its default, and confirm it for HTML by specifying utf-8 on the meta http-equiv. This option is not set out as such in Appendix C, but does seem to be compatible with the general principles. Note that if the meta http-equiv specifies a different charset than the one which was determined by the XML rules, then the results are problematical.
The selection of an appropriate
charset can be done
in just the same way as set out above: the only difference,
in this regard, between
HTML proper, and compatibility-XHTML/1.0, is the
mechanism for advertising that charset to the client.
The most common mistake seems to be to supply only a meta http-equiv, but this comes too late for XML, which has already decided on the basis of other evidence (the omission of the <?xml...> declaration and the absence of a BOM) that the encoding has to be utf-8. If the meta then attempts to set an incompatible charset, the result is problematical.
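The "too late" point can be sketched as code (a simplification of the XML detection rules, assuming only the BOM and declaration cases discussed here): the encoding is fixed before any element, and hence any meta, is ever read.

```python
# An XML processor decides the encoding from the BOM or the <?xml ...?>
# declaration; with neither present, utf-8 is the default. Nothing in
# the element content (such as a meta http-equiv) can change that.
def xml_encoding(doc: bytes) -> str:
    if doc.startswith(b"\xef\xbb\xbf"):
        return "utf-8"                          # utf-8 BOM
    if doc[:2] in (b"\xfe\xff", b"\xff\xfe"):
        return "utf-16"                         # utf-16 BOM
    if doc.startswith(b"<?xml"):
        prolog = doc.split(b"?>", 1)[0].decode("ascii", "replace")
        if 'encoding="' in prolog:
            return prolog.split('encoding="')[1].split('"')[0].lower()
    return "utf-8"                              # the XML default

assert xml_encoding(b'<?xml version="1.0" encoding="iso-8859-2"?><x/>') == "iso-8859-2"
assert xml_encoding(b'<html>...</html>') == "utf-8"   # meta would come too late
```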
Last changed Wednesday, 01-Feb-2006 01:11:36 GMT
Original materials © Copyright 1994 - 2006 by A.J.Flavell & Glasgow University