Let's add Cyrillic character encoding support

From: Andrzej Novosiolov <anovos@rs-ukraine.kiev.ua>
Date: Tue, 22 Feb 2000 11:15:55 +0200
Message-Id: <200002220916.LGG49453@zirafe.carrier.kiev.ua>
To: html-tidy@w3.org
For now, Tidy can be used to clean up Cyrillic pages in -raw mode. But
there is a way to preserve Cyrillic letters and still benefit from converting
character codes 127-159 to the corresponding entities.

Five different Cyrillic encodings are used in web pages:

koi8-r - Cyrillic letter codes are in range 160-255
windows-1251 - in range 160-255
iso-8859-5 - in range 160-255
cp866 (or ibm866) - in range 128-175, 224-255
x-mac-cyrillic - in range 128-159, 224-255

The most widely used encodings are koi8-r and windows-1251; the other three
are extremely rare.

So I propose adding two more encodings:

CYRWIN for windows-1251
CYRKOI for koi8-r and iso-8859-5

and still use RAW for cp866 and x-mac-cyrillic.
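
To illustrate the proposed mapping, here is a minimal sketch of how a charset
name could be turned into one of the encoding constants. The helper name
EncodingForCharset and the numeric values of the constants are assumptions
made for this example only; they are not taken from Tidy's html.h or config.c.

```c
#include <string.h>

/* Hypothetical values; the real constants in html.h may be numbered differently. */
enum { RAW, ASCII, LATIN1, UTF8, ISO2022, CYRWIN, CYRKOI };

/*
 * Hypothetical helper: map a lowercase charset name to an encoding constant.
 * The comparison is an exact, case-sensitive match to keep the sketch short.
 */
int EncodingForCharset(const char *charset)
{
    if (strcmp(charset, "windows-1251") == 0)
        return CYRWIN;

    if (strcmp(charset, "koi8-r") == 0 || strcmp(charset, "iso-8859-5") == 0)
        return CYRKOI;

    /* cp866, x-mac-cyrillic, and (in this sketch) any other name stay RAW */
    return RAW;
}
```

With this sketch, EncodingForCharset("koi8-r") and
EncodingForCharset("iso-8859-5") both give CYRKOI, while
EncodingForCharset("cp866") falls back to RAW.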

The source changes are minimal (plus two more constants in html.h and
two more cases in the switch statements in tidy.c and config.c):

tidy.c, line 373:
-----------------
if (in->encoding == RAW || in->encoding == ISO2022 || in->encoding == CYRKOI)
  /* don't convert koi8-r and iso-8859-5 to Unicode */

pprint.c, line 407:
-------------------
if (MakeClean && CharEncoding != CYRKOI) /* CYRWIN may be cleaned up too */

pprint.c, line 483:
-------------------
if ((c < ' ' && c != '\t') || (c > 126 && c < 160) || (c > 255)
  || ((CharEncoding == ASCII) && (c > 126)))
  /* let's convert character codes 127-159 to entities for CYRWIN and CYRKOI */
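
The effect of the pprint.c condition above can be tried in isolation with the
following sketch. The function name NeedsEntity and the constant values are
assumptions for illustration; only the condition itself is copied from the
patch.

```c
#include <stdbool.h>

/* Hypothetical values; the real constants in html.h may be numbered differently. */
enum { RAW, ASCII, LATIN1, UTF8, ISO2022, CYRWIN, CYRKOI };

/*
 * Hypothetical wrapper around the proposed test: returns true when
 * character code c would be written out as an entity.
 */
bool NeedsEntity(unsigned int c, int encoding)
{
    return (c < ' ' && c != '\t')           /* control characters other than tab */
        || (c > 126 && c < 160)             /* codes 127-159 */
        || (c > 255)                        /* anything beyond a single byte */
        || (encoding == ASCII && c > 126);  /* ASCII output escapes all high codes */
}
```

Under CYRWIN or CYRKOI, a code such as 192 is left alone, a code such as 147
is turned into an entity, and under ASCII both are.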
Received on Tuesday, 22 February 2000 04:16:34 GMT
