Re: Non-us-ascii characters in Tidy Beta version 15-Jan-2004 from Harold Baughan [RockSolidSite.com] on 2004-02-18 (html-tidy@w3.org from January to March 2004)

From: Harold Baughan [RockSolidSite.com] <hbaughan@rocksolidsite.com>
Date: Wed, 18 Feb 2004 09:41:26 -0500
To: <html-tidy@w3.org>
Message-ID: <002101c3f62d$78b51380$c92e4b43@e3f3y7>
Hello Bjoern,

Re-examination is now complete.  I think I've isolated the problem...

Text was copied from an HTML 4.01 file via Notepad and pasted into an XHTML
1.1 file.  Adjustments were made, Tidy was run, and the result was validated
with the W3C on-line validator.  The file validated properly as XHTML 1.1
under this declaration...

<?xml version="1.1" encoding="us-ascii"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

Then I ran the Beautify function using Tidy Beta.  Now the on-line validator
responded...

> Sorry, I am unable to validate this document because on lines 92,
> 99-101, 103, 106-107, 129-130, 138-139, 142, 146-147, 149, 154-157,
> 161, 163-165, 169, 175-176, 182, 188 it contained one or more bytes
> that I cannot interpret as us-ascii (in other words, the bytes found
> are not valid values in the specified Character Encoding). Please
> check both the content of the file and the character encoding indication.

The first occurrence appeared in this snippet.

89            </p>
90          </td>
91        </tr>
92      </table>
93    </div>

For this, the Frhed editor shows

            </p>
          </td>
        </tr>
      </table><bh:a0>
    </div>

So, there might be a *couple* of things going on, here.  Last night I might
have imported an "a0" when there were two spaces after punctuation, the
first one being an &nbsp; .  That would be my fault.  However, *this* one is
definitely coming from the Tidy beautify function.

Note that when the validator responds with more than one line number
(99-101) it is indicating that the last character on the first line is
<bh:a0>, then there are space characters <bh:20> up to the first character
of the next line.
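
For anyone without a hex editor, the same positions can be listed with a
short script.  This is only a sketch in Python, not anything taken from Tidy
or the validator; the function name find_non_ascii and the sample bytes are
made up for illustration, and it assumes the file is read as raw bytes.

```python
def find_non_ascii(data: bytes):
    """Return (line_number, column, byte_value) for every byte above 0x7F."""
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        # Iterating over bytes yields integers, so each can be compared directly.
        for col, byte in enumerate(line, start=1):
            if byte > 0x7F:
                hits.append((line_no, col, byte))
    return hits


# Hypothetical sample: a stray 0xA0 after </table>, as in the snippet above.
sample = b"      </table>\xa0\n    </div>\n"
for line_no, col, byte in find_non_ascii(sample):
    print(f"line {line_no}, column {col}: 0x{byte:02x}")
# prints: line 1, column 15: 0xa0
```

For a real file, the bytes would come from open(path, "rb").read(), and the
line numbers should then match the ones the validator reported.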

I hope this helps you to repair the problem.

Question... should I change from us-ascii to another character set in the
meantime?  Which one?  Thanks.

A curious item... I looked at the file with Netscape 7.0.  At every <bh:a0>
a question mark showed up in a black diamond.  So, at least the problem
points are now visible!  Why is this curious?  ...Because I seldom have
anything nice to say about Netscape, and this time it was actually useful.
:-)
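
The black diamond is presumably the Unicode replacement character (U+FFFD),
which a browser substitutes for bytes it cannot decode in the declared
encoding.  The sketch below reproduces that effect in Python; the byte string
is a made-up stand-in for the end of line 92, and this is an illustration of
the general mechanism, not of how Netscape itself is implemented.

```python
# Hypothetical bytes: the end of line 92 with the stray 0xA0 after </table>.
raw = b"</table>\xa0"

# Strict us-ascii decoding rejects the byte outright, much as the validator did.
try:
    raw.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding substitutes U+FFFD, the replacement character, instead.
shown = raw.decode("ascii", errors="replace")
print(strict_ok, hex(ord(shown[-1])))
# prints: False 0xfffd
```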

I am not on the distribution list, so the only way I can receive a response
is directly.  Good luck with this one.

Cordially,

Harold Baughan

^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^
Baughan & Company, email: hbaughan@rocksolidsite.com

- - - - - -

----- Original Message ----- 
From: "Bjoern Hoehrmann" <derhoermi@gmx.net>
To: "Harold Baughan [RockSolidSite.com]" <hbaughan@rocksolidsite.com>
Cc: <html-tidy@w3.org>
Sent: Tuesday, February 17, 2004 5:25 PM
Subject: Re: Non-us-ascii characters in Tidy Beta version 15-Jan-2004


> * Harold Baughan [RockSolidSite.com] wrote:
> >On Feb. 9 I asked several questions re: Tidy Beta version 15-Jan-2004, and
> >received some good help in operation.  However, one error keeps getting
> >introduced.  Note... the version in use is a plug-in to Chami's HTML-Kit
> >Version 1.0, Build 292, on Win-98.
>
> It'd be best if you try the command line application to reproduce the
> problem with an ideally simple test case. If you are able to reproduce
> it, send a message to the list or file a bug report on the sf.net site.
> I cannot fix bugs I cannot reproduce.
>
> >After a beautify function, a non-visible, non-us-ascii character is being
> >added somewhere in several strings.  It seems to be happening on a line
> >which includes <br /> by itself, or a space after </a> at the end of a line,
> >or when there are two spaces between a period and the beginning of the next
> >sentence (such as in "...word.  Word...")
>
> By default, Tidy does not generate non-ascii output unless there are
> non-ascii characters inside constructs where it cannot use character
> references (comments, for example); in this case the characters come
> out garbled. So there must be some configuration option active,
> -latin1 for example. If the character is not visible it is most likely
> U+00A0 (&nbsp;). Tidy would insert them e.g. if there is a <nobr>
> element in the source document.
>
> >Does anyone know how to make whatever this character visible so that it can
> >be edited out?  And, of course, it should be looked into for the next build.
>
> A hex editor would probably work best, if you don't have one yet, there
> are lots freely available. Certain viewer applications might also help,
> e.g. the Total Commander file manager supports hex view. You could also
> put the file online and I'll have a look.
>
Received on Wednesday, 18 February 2004 09:45:00 UTC