Re: Producing an XML report from the validator

From: Christian Smith <csmith@barebones.com>
Date: Mon, 18 Jun 2001 11:56:16 -0400
To: Terje Bless <link@tss.no>
cc: Martin Duerst <duerst@w3.org>, Nick Kew <nick@webthing.com>, Esmond Walshe <esmond.walshe@eeng.dcu.ie>, www-validator@w3.org
Message-ID: <20010618115616-b01010704-719d57da-0910-010c@204.107.232.107>
On Monday, June 18, 2001 at 5:39 PM, link@tss.no (Terje Bless) wrote:

> On 18.06.01 at 10:38, Christian Smith <csmith@barebones.com> wrote:
> 
> >On Monday, June 18, 2001 at 1:39 AM, link@tss.no (Terje Bless) wrote:
> >
> >>Right. We can't guarantee a lossless roundtrip to ISO 10646[0] so every
> >>offset -- be it byte or character -- would need to be in terms of the
> >>UTF-8 encoded ISO 10646 version of the file. How would that affect your
> >>Error Browser Chris?
> >
> >Dealing with byte offsets would be nearly impossible. As long as I get
> >back a character offset everything should be fine.
> 
> Yeah, but if your file is in MacRoman, will you be able to make use of a
> character offset obtained _after_ we've converted it into some encoding of
> UNICODE?

I see no reason why not. Character X of the document should be the same
regardless of whether the file is in 7-bit ASCII or encoded as Unicode. The
byte offset will be different but not the character offset.
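
[A minimal sketch of the point above, not taken from the validator or the
Error Browser: the sample text and variable names are assumptions chosen for
illustration. It encodes the same string as MacRoman and as UTF-8 and
compares where a given character lands in each.]

```python
# Hypothetical sample document; "é" is one byte in MacRoman, two in UTF-8.
text = "caf\u00e9 <b>bold"

mac_bytes = text.encode("mac_roman")
utf8_bytes = text.encode("utf-8")

# Character offset of "<", counted in characters of the decoded text.
char_offset = text.index("<")

# Byte offsets of the same "<" in each encoded form.
mac_byte_offset = mac_bytes.index(b"<")
utf8_byte_offset = utf8_bytes.index(b"<")

# Decoding either byte stream gives back the same character offset.
assert mac_bytes.decode("mac_roman").index("<") == char_offset
assert utf8_bytes.decode("utf-8").index("<") == char_offset

print(char_offset, mac_byte_offset, utf8_byte_offset)  # 5 5 6
```

[The character offset matches in both encodings here because each MacRoman
byte maps to exactly one code point; only the UTF-8 byte offset moves.]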

> The problem I'm envisioning is where there are several possible UNICODE
> forms of a given MacRoman (or ISO-whatever, or ...) character. For
> instance, some combination symbols can be expressed either as their own
> unique code point, or as a set of equivalent combination characters. For
> instance, "LATIN CAPITAL LETTER A WITH RING ABOVE" ("Å") can be encoded as
> U+00C5, or decomposed as U+0041 U+030A; "LATIN CAPITAL LETTER A" followed
> by "COMBINING RING ABOVE". Which one you get is dependent on whether you
> use Normalization Form C (use precomposed chars) or Normalization Form D
> (decompose) or whatever random variant made sense at the time. :-|

And would this cause you to report a different character offset or would
you report the same character offset regardless? In either case I'm not
overly concerned at this point. The worst that happens is the insertion
point may not be set quite right in some cases. C'est la vie.
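
[A small illustration of the question being discussed, under stated
assumptions: the sample text and the helper offset_in_nfc are hypothetical,
and nothing here reflects what the validator actually reports. It uses
Python's standard unicodedata module to show that the character offset of a
later character can differ between Normalization Form C and Form D.]

```python
import unicodedata

# Hypothetical sample: precomposed U+00C5 followed by some markup.
text = "\u00c5<p>"

nfc = unicodedata.normalize("NFC", text)  # keeps U+00C5 as one code point
nfd = unicodedata.normalize("NFD", text)  # U+0041 followed by U+030A

nfc_offset = nfc.index("<")
nfd_offset = nfd.index("<")
print(nfc_offset, nfd_offset)  # 1 2


def offset_in_nfc(nfd_text: str, offset: int) -> int:
    """Rough mapping of an offset in NFD text to the NFC form.

    Hypothetical helper: it normalizes the prefix before the offset and
    counts the result. It assumes the offset does not fall between a base
    letter and its combining mark.
    """
    return len(unicodedata.normalize("NFC", nfd_text[:offset]))


print(offset_in_nfc(nfd, nfd_offset))  # 1
```

[So a consumer that counts characters in one form while the reporter counts
in the other would be off by one for each decomposed character before the
target, which is the "insertion point not set quite right" case.]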


-- 
Christian Smith  |  csmith@barebones.com  |  http://web.barebones.com

He who dies with the most friends... Is still dead!
Received on Monday, 18 June 2001 11:56:20 GMT
