Let's add Cyrillic character encoding support

For now, Tidy can be used to clean up Cyrillic pages in -raw mode. But
there is a way to preserve Cyrillic letters and still benefit from
converting character codes 127-159 to the corresponding entities.

There are 5 different Cyrillic encodings which are used in web pages:

koi8-r - Cyrillic letter codes are in range 160-255
windows-1251 - in range 160-255
iso-8859-5 - in range 160-255
cp866 (or ibm866) - in range 128-175, 224-255
x-mac-cyrillic - in range 128-159, 224-255
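
To make the ranges above easier to compare, here is a minimal sketch that
puts them in a table and checks which encodings have letters at codes
127-159. The struct and function names are my own for illustration and are
not part of Tidy; the ranges are the ones listed above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper types for illustration, not part of Tidy. */
struct byte_range { int lo, hi; };

struct cyr_encoding {
    const char *name;
    struct byte_range ranges[2];
    int nranges;
};

/* Letter ranges exactly as listed in this message. */
static const struct cyr_encoding cyr_encodings[] = {
    { "koi8-r",         { {160, 255}, {0, 0} },     1 },
    { "windows-1251",   { {160, 255}, {0, 0} },     1 },
    { "iso-8859-5",     { {160, 255}, {0, 0} },     1 },
    { "cp866",          { {128, 175}, {224, 255} }, 2 },
    { "x-mac-cyrillic", { {128, 159}, {224, 255} }, 2 },
};

/* Returns 1 if the named encoding has letters somewhere in 127-159,
   0 if it does not, and -1 if the name is not in the table. */
int letters_below_160(const char *name)
{
    size_t i;
    int j;

    for (i = 0; i < sizeof cyr_encodings / sizeof cyr_encodings[0]; i++) {
        if (strcmp(cyr_encodings[i].name, name) != 0)
            continue;
        for (j = 0; j < cyr_encodings[i].nranges; j++) {
            const struct byte_range *r = &cyr_encodings[i].ranges[j];
            if (r->lo <= 159 && r->hi >= 127)
                return 1;
        }
        return 0;
    }
    return -1;
}
```

With this table, only cp866 and x-mac-cyrillic report letters below 160,
so those are the two where converting 127-159 to entities would damage
the text.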

The most widely used encodings are koi8-r and windows-1251; the other
three are extremely rare.

So I propose to add 2 more encodings:

CYRWIN for windows-1251
CYRKOI for koi8-r and iso-8859-5

and still use RAW for cp866 and x-mac-cyrillic.
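
The mapping proposed above could be sketched as a lookup from charset name
to Tidy encoding constant. This is only an illustration: the function and
helper names are hypothetical, the numeric values of the constants are
placeholders, and falling back to RAW for unrecognised names is my
assumption.

```c
#include <ctype.h>

/* Placeholder values for illustration; the real ones would live in html.h. */
#define RAW     0
#define CYRWIN  6   /* proposed: windows-1251 */
#define CYRKOI  7   /* proposed: koi8-r and iso-8859-5 */

/* Case-insensitive comparison of two charset names (hypothetical helper). */
static int same_name(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Maps a charset name to the encoding constant proposed in this message. */
int encoding_for_charset(const char *charset)
{
    if (same_name(charset, "windows-1251"))
        return CYRWIN;
    if (same_name(charset, "koi8-r") || same_name(charset, "iso-8859-5"))
        return CYRKOI;
    /* cp866, ibm866 and x-mac-cyrillic stay in raw mode; so does
       anything unrecognised (assumption). */
    return RAW;
}
```

For example, encoding_for_charset("KOI8-R") gives CYRKOI, while
encoding_for_charset("cp866") gives RAW.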

The source changes are minimal (plus two more constants in html.h and
handling of two more switches in tidy.c and config.c):

tidy.c, line 373:
-----------------
if (in->encoding == RAW || in->encoding == ISO2022 || in->encoding == CYRKOI)
  /* don't convert koi8-r and iso-8859-5 to Unicode */

pprint.c, line 407:
-------------------
if (MakeClean && CharEncoding != CYRKOI) /* CYRWIN may be cleaned up too */

pprint.c, line 483:
-------------------
if ((c < ' ' && c != '\t') || (c > 126 && c < 160) || (c > 255)
  || ((CharEncoding == ASCII) && (c > 126)))
  /* convert character codes 127-159 to entities for CYRWIN and CYRKOI */
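
To check how the condition at pprint.c line 483 behaves for the new
encodings, it can be pulled out into a standalone predicate. This is a
sketch for experimenting, not the actual pprint.c code: the function name
is hypothetical and the constant values are placeholders.

```c
/* Placeholder values for illustration; the real ones would live in html.h. */
#define ASCII   1
#define CYRWIN  6
#define CYRKOI  7

/* Mirrors the condition quoted above: returns 1 if character code c
   would be written as an entity for the given output encoding. */
int needs_entity(unsigned int c, int encoding)
{
    if (c < ' ' && c != '\t')
        return 1;               /* control characters except tab */
    if (c > 126 && c < 160)
        return 1;               /* codes 127-159 */
    if (c > 255)
        return 1;               /* beyond a single byte */
    if (encoding == ASCII && c > 126)
        return 1;               /* ASCII output keeps nothing above 126 */
    return 0;
}
```

With CYRWIN or CYRKOI, codes 160-255 (the Cyrillic letters) pass through
unchanged, while 127-159 become entities.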

Received on Friday, 24 March 2000 13:12:54 UTC