i18n: HTML Character set issues beyond HTML3.2

"i18n"? - the word "internationalisation" (or "internationalization" as some prefer to spell it) starts with the letter "i", ends with the letter "n" and has eighteen (in numbers: 18) letters in between. In the interests of brevity and to avoid arguments about "s" or "z", the term "i18n" (i-eighteen-n) has come into widespread use.

Road map to this area

There are also some tutorial-ish notes on FORM submission and i18n.

I18n and MS WEFT, an MSIE-specific technique for offering embedded fonts, which can be applied without harming web-compatible browsers.

Details of resources here and elsewhere

Using HTML in a single non-Latin-1 locale had been working for a considerable time already, and you can find appropriate resources on the WWW that cover one or other of those locales. In various places I am including pointers to some of the resources that happen to be known to me, but here I am concentrating on the use of an extended character repertoire: using, at the very least, one non-Latin-1 repertoire together with the full repertoire of Latin-1.

Formal stuff

Materials offered here

Background and supplementary materials

A brief mention here of iso-8859-15, a worked-over version of Latin-1 that is officially called Latin-9 but had been nicknamed Latin-0. One of its proponents, Alain LaBonté, who enjoys word games, mentioned that it can be pronounced "Latin Zeuro", but he preferred "Latin 9, c'est un latin tout neuf". Misha Wolf's comment on it was:

We have just the place for ISO 8859-15 here in London. It is called the Science Museum and is full of charming historical relics, like Babbage's difference engine, used by Ada Lovelace (I think that was her family name).

What a relief that we now have Unicode and won't have to implement this amusing piece of history.

Browser support was originally better for utf-8 than it was for iso-8859-15, and I think it's fair to say that there is no point in using iso-8859-15 for encoding HTML documents, although it has found fairly wide user acceptance, in the European area, for use in Usenet (8-bit plain text) postings.
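
To see concretely how little iso-8859-15 changes relative to Latin-1, here is a rough Python sketch that decodes every byte value under both codings and lists the positions where they disagree. It assumes the codec names "iso-8859-1" and "iso-8859-15" as shipped with Python's standard library; the variable names are merely illustrative.

```python
# Compare ISO-8859-1 (Latin-1) with ISO-8859-15 (Latin-9), byte by byte.
differences = []
for byte in range(0x100):
    raw = bytes([byte])
    latin1 = raw.decode("iso-8859-1")
    latin9 = raw.decode("iso-8859-15")
    if latin1 != latin9:
        differences.append((byte, latin1, latin9))

# Print code points rather than glyphs, so the output itself stays plain ASCII.
for byte, latin1, latin9 in differences:
    print(f"0x{byte:02X}: U+{ord(latin1):04X} -> U+{ord(latin9):04X}")
```

Only eight positions differ, among them 0xA4, which is the general currency sign in Latin-1 and the euro sign in iso-8859-15.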

For proper support of Celtic languages such as Welsh, a different 8-bit coding would be needed. Refer to Michael Everson's page on Celtic fonts.

Markus Kuhn offers a UTF-8 and Unicode FAQ for Unix/Linux which includes an explanation of the utf-8 transformation format. An official place to read about it is RFC2279, available from your usual source of RFCs or a copy at faqs.org.
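
As a hedged illustration of what the utf-8 transformation format does, the following Python sketch encodes a single code point by hand and checks the result against Python's built-in encoder. The function name utf8_encode is my own invention for this example, and the sketch stops at U+10FFFF (the present-day limit of Unicode) rather than covering the longer sequences that RFC2279 also described. It makes no attempt to reject surrogate code points.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 (illustrative; surrogates are not rejected)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:
        # One byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point beyond U+10FFFF")


# Cross-check against the built-in encoder for a few sample characters.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")

print(utf8_encode(0x20AC).hex())  # the euro sign
```

Note that the plain us-ascii characters come out as themselves in a single byte, which is much of the reason utf-8 coexists so comfortably with older software.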

The W3C's own i18n wizards presented a Tutorial: Weaving the multilingual Web at the 1999 International Unicode Conference.

There's an excellent and comprehensive article on character sets and MIME by Jukka Korpela, which covers the present topic and more. Also a page that focusses on the HTML issues.

Jukka Korpela called attention to a fascinating resource page which accesses a Database on letters, languages, character sets etc.

Alan Wood's Unicode Resources - Multilingual support on the Web are very good.

A.Prilop has a Multilingual Macintosh Resources page.

Some notes on Baltic codings which summarise a discussion with A.Prilop and others in Sept. 2000.

I made some rough-and-ready Unicode test tables based on the Unicode database.

FONT FACE techniques discussed - and criticised.

I created code mapping tables to enable an earlier version of the rtftohtml program (which later became LOGICTRAN) to generate HTML4 by using &#bignumber; representations. The author incorporated this work into rtftohtml in version 3.8 and later. The original rtftohtml extras were made available here and include a test table for this part of the repertoire.
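
The &#bignumber; idea itself is easily demonstrated. This is only a sketch in Python, not a reconstruction of what my mapping tables or rtftohtml actually did: it replaces every character outside us-ascii by its decimal numeric character reference, using the standard "xmlcharrefreplace" error handler. The function name to_numeric_refs is made up for the example.

```python
import html


def to_numeric_refs(text: str) -> str:
    """Replace each non-ASCII character by a decimal reference such as &#8364;."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "caf\u00e9 \u20ac"          # "café €"
encoded = to_numeric_refs(sample)
print(encoded)

# Round trip: html.unescape turns the references back into characters.
assert html.unescape(encoded) == sample
```

The result is pure us-ascii, so it survives whatever charset the document is finally served with. Be aware that this sketch leaves markup-significant characters such as & and < untouched; those would need escaping separately.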

My original iso-8859 materials are there.

Nir Dagan's notes on Hebrew.

MSIE can unilaterally decide that your page is in utf-7 if you don't specify charset explicitly. [cited page is in German]
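
The remedy is to leave the browser nothing to guess. As a minimal sketch of one way to do that, assuming a Python WSGI setup (the application name app and the sample page are hypothetical), the charset is stated explicitly on the HTTP Content-Type header and repeated in a META element:

```python
def app(environ, start_response):
    """Hypothetical WSGI application that declares its charset explicitly."""
    page = (
        "<!DOCTYPE html>"
        '<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8">'
        "<title>Test</title></head>"
        "<body><p>Gr\u00fc\u00dfe</p></body></html>"
    )
    body = page.encode("utf-8")  # the bytes must match the declared charset
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

The header is the part that matters most, since it reaches the browser before any of the document does; the META element is a fallback for when the page is read from a file rather than from a server.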


Perl, Unicode and i18N FAQ looks to be a treasure house of information and links.

FRAMEs warning

It has been observed that some versions/platforms of the popular browsers will misbehave if different charset codes are used in different frames and/or framesets of a given page. I haven't investigated this in detail myself, but I thought I should mention it.

I make no secret of my dislike for frames; but the problem is there, whether you like them or not.

Last changed Thursday, 02-Feb-2006 17:21:55 GMT