Re: Automatic Filter for Accented Characters ?

Jose Fernando Tepedino Martins <jftm@di.ufpe.br> in <9506182110.AA20081@di.ufpe.br>:
>   I have now a lot of documents with accented characters, and I love 
>them, because they make the text much more readable than using any other
>codification. I've been wondering if I really should run a transformation
>program for each HTML file I write before putting it in the WWW Server
>public area. What I mean is: 
>
>-> Is there a way to associate a filter to the WWW Server in order 
>   that it reads the HTML documents with 8-bit characters (with accented 
>   characters) and send a 7-bit (as &eacute;...) equivalent codification?

Here is something that is not quite what you asked for but is
hopefully useful nevertheless: a Perl filter that converts 8-bit
characters to entities. It also converts the other way round.
Please excuse the crude interface and documentation; I hacked
this together just for internal use.

Best regards
Rainer Klute

  Dipl.-Inform. Rainer Klute        NADS - Advertising on nets
  NADS GmbH
  Emil-Figge-Str. 80                Tel.: +49 231 9742570
  D-44227 Dortmund                  Fax:  +49 231 9742571

            <http://www.nads.de/~klute/>




#!/usr/local/bin/perl

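# Reads the named files (or standard input) and writes the converted
# text to standard output.
#
# Options (only the first letter after + or - is checked):
#
# +iso:        convert to ISO 8859-1
# -iso:        convert to entity representation (default)
# +characters: convert characters outside ASCII
# -characters: don't convert characters outside ASCII (default)
# +html:       convert HTML special characters &, ", <, >
# -html:       don't convert HTML special characters &, ", <, > (default)
#
# With the defaults nothing is converted; give +characters and/or +html.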

$characters = "no";
$html = "no";
$iso = "no";
for ($i = 0; $i <= $#ARGV; $i++)
{
    if ($ARGV[$i] =~ /^([+-])(.*)/)
    {
	$c = $1;
	splice (@ARGV, $i, 1);
	$i--;
	$_ = $2;
	{
	    /^c/ && ($characters = ($c eq "+" ? "yes" : "no"), last);
	    /^h/ && ($html       = ($c eq "+" ? "yes" : "no"), last);
	    /^i/ && ($iso        = ($c eq "+" ? "yes" : "no"), last);
	}
    };
};

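# '&' must be the first pair when encoding: otherwise the '&' of
# entities produced by earlier substitutions would itself be escaped.
@html = ('&', '&amp;',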
	 '<', '&lt;',
	 '>', '&gt;',
	 '"', '&quot;');

@characters = ('á', '&aacute;',
	       'Á', '&Aacute;',
	       'â', '&acirc;',
	       'Â', '&Acirc;',
	       'à', '&agrave;',
	       'À', '&Agrave;',
	       'å', '&aring;',
	       'Å', '&Aring;',
	       'ã', '&atilde;',
	       'Ã', '&Atilde;',
	       'ä', '&auml;',
	       'Ä', '&Auml;',
	       'æ', '&aelig;',
	       'Æ', '&AElig;',
	       'ç', '&ccedil;',
	       'Ç', '&Ccedil;',
	       'ð', '&eth;',
	       'Ð', '&ETH;',
	       'é', '&eacute;',
	       'É', '&Eacute;',
	       'ê', '&ecirc;',
	       'Ê', '&Ecirc;',
	       'è', '&egrave;',
	       'È', '&Egrave;',
	       'ë', '&euml;',
	       'Ë', '&Euml;',
	       'í', '&iacute;',
	       'Í', '&Iacute;',
	       'î', '&icirc;',
	       'Î', '&Icirc;',
	       'ì', '&igrave;',
	       'Ì', '&Igrave;',
	       'ï', '&iuml;',
	       'Ï', '&Iuml;',
	       'ñ', '&ntilde;',
	       'Ñ', '&Ntilde;',
	       'ó', '&oacute;',
	       'Ó', '&Oacute;',
	       'ô', '&ocirc;',
	       'Ô', '&Ocirc;',
	       'ò', '&ograve;',
	       'Ò', '&Ograve;',
	       'ø', '&oslash;',
	       'Ø', '&Oslash;',
	       'õ', '&otilde;',
	       'Õ', '&Otilde;',
	       'ö', '&ouml;',
	       'Ö', '&Ouml;',
	       'ß', '&szlig;',
	       'þ', '&thorn;',
	       'Þ', '&THORN;',
	       'ú', '&uacute;',
	       'Ú', '&Uacute;',
	       'û', '&ucirc;',
	       'Û', '&Ucirc;',
	       'ù', '&ugrave;',
	       'Ù', '&Ugrave;',
	       'ü', '&uuml;',
	       'Ü', '&Uuml;',
	       'ý', '&yacute;',
	       'Ý', '&Yacute;',
	       'ÿ', '&yuml;');

while (<>)
{

    if ($iso eq "no")
    {
	if ($html eq "yes")
	{
	    for ($i = 0; $i <= $#html; $i += 2)
	    {
		s/$html[$i]/$html[$i+1]/eg;
	    };
	};
	
	if ($characters eq "yes")
	{
	    for ($i = 0; $i <= $#characters; $i += 2)
	    {
		s/$characters[$i]/$characters[$i+1]/eg;
	    };
	};
    }
    else
    {
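	# Decode in the opposite order of encoding: accented characters
	# first, then the HTML special characters, with &amp; last.
	# Otherwise "&amp;eacute;" would wrongly end up as an accented e.
	if ($characters eq "yes")
	{
	    # "<=" rather than "<", so that the last pair is not skipped.
	    for ($i = 1; $i <= $#characters; $i += 2)
	    {
		s/$characters[$i]/$characters[$i-1]/eg;
	    };
	};
	
	if ($html eq "yes")
	{
	    # Walk the table backwards so that &amp; is handled last.
	    for ($i = $#html; $i >= 1; $i -= 2)
	    {
		s/$html[$i]/$html[$i-1]/eg;
	    };
	};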
    }

    print;
}
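
The script above runs one substitution per table entry on every line,
so the order of the entries matters. As a rough sketch only (not part
of the original script), the same table could be kept in a hash and
applied in a single pass per line, which makes the result independent
of the table order. The names %entity_for, %char_for,
encode_entities_sketch and decode_entities_sketch are made up for
illustration, the table is abbreviated, and the accented characters are
assumed to be ISO 8859-1 bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Abbreviated table (character => entity); hypothetical names throughout.
# Accented characters are written as ISO 8859-1 byte escapes.
my %entity_for = (
    '&'    => '&amp;',
    '<'    => '&lt;',
    '>'    => '&gt;',
    '"'    => '&quot;',
    "\xE9" => '&eacute;',   # e with acute accent
    "\xFC" => '&uuml;',     # u with diaeresis
);

# Inverse table (entity => character) for the opposite direction.
my %char_for = reverse %entity_for;

# One alternation per direction, built from the table keys.
my $char_re   = join '|', map { quotemeta($_) } keys %entity_for;
my $entity_re = join '|', map { quotemeta($_) } keys %char_for;

# Replace every listed character by its entity in one pass.
# Text that was just inserted is not scanned again, so the '&'
# of a freshly produced entity is never escaped a second time.
sub encode_entities_sketch {
    my ($text) = @_;
    $text =~ s/($char_re)/$entity_for{$1}/g;
    return $text;
}

# Replace every listed entity by its character in one pass.
sub decode_entities_sketch {
    my ($text) = @_;
    $text =~ s/($entity_re)/$char_for{$1}/g;
    return $text;
}

print encode_entities_sketch("caf\xE9 & <b>"), "\n";  # caf&eacute; &amp; &lt;b&gt;
print decode_entities_sketch("&amp;lt;"), "\n";       # &lt;
```

Because each direction is a single pass, "&amp;lt;" decodes to "&lt;"
and not to "<", without any special ordering of the table.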

Received on Monday, 19 June 1995 12:05:36 UTC