Re: Producing an XML report from the validator from Martin Duerst on 2001-06-19 (www-validator@w3.org from June 2001)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 19 Jun 2001 12:06:18 +0900
To: Terje Bless <link@tss.no>, Christian Smith <csmith@barebones.com>
Cc: Nick Kew <nick@webthing.com>, Esmond Walshe <esmond.walshe@eeng.dcu.ie>, www-validator@w3.org
Message-Id: <4.2.0.58.J.20010619114714.02c6e930@sh.w3.mag.keio.ac.jp>

At 18:20 01/06/18 +0200, Terje Bless wrote:
>On 18.06.01 at 11:56, Christian Smith <csmith@barebones.com> wrote:

> >>some combination symbols can be expressed either as their own unique code
> >>point, or as a set of equivalent combination characters.
> >
> >And would this cause you to report a different character offset or would
> >you report the same character offset regardless?
>
>There may be something in the UNICODE or ISO spec that specifies how to
>count characters when faced with this problem, but I'm not familiar enough
>with it to know either way (Martin?). Our offset would be the sum of the
>lengths of all previous lines, plus the character offset on the current
>line that SP reports. Both the line lengths and the offset in the current
>line would be dependent on how our chosen implementation does the counts.

An overview can be found at http://www.w3.org/TR/charmod/#sec-Indexing.
The summary is: It depends :-(.
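
To make the "it depends" concrete, here is a minimal Python sketch (mine, not
anything the validator or SP does; the helper name counts() is made up for
illustration). It counts the same visible character in three different units
and in both normalization forms:

```python
import unicodedata

def counts(text):
    """Return the length of text in three common counting units."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }

# U+00E9 (e with acute) as one precomposed code point ...
nfc = unicodedata.normalize("NFC", "\u00e9")
# ... and as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
nfd = unicodedata.normalize("NFD", "\u00e9")

print(counts(nfc))  # {'code_points': 1, 'utf16_units': 1, 'utf8_bytes': 2}
print(counts(nfd))  # {'code_points': 2, 'utf16_units': 2, 'utf8_bytes': 3}
```

So an offset only means something once both sides agree on the counting unit
and on the normalization form.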

In any case, I think it's better to stay with (line, character)
because this is much more stable than counting from the beginning
of the document.
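
A rough Python sketch of why (line, character) is more stable, assuming
offsets are counted in code points and lines end in "\n". The functions
locate() and absolute_offset() and the sample document are invented for the
illustration and are not part of the validator:

```python
import unicodedata

def locate(text, needle):
    """Return (line, column) of the first needle; line 1-based, column 0-based."""
    for number, line in enumerate(text.split("\n"), start=1):
        column = line.find(needle)
        if column != -1:
            return number, column
    raise ValueError("needle not found")

def absolute_offset(text, line, column):
    """Sum the lengths of all previous lines (plus newline) and add the column."""
    previous = text.split("\n")[: line - 1]
    return sum(len(p) + 1 for p in previous) + column

# 100 lines with two accented letters each, then a line with one accent and a tag.
doc_nfc = "\u00e9t\u00e9\n" * 100 + "na\u00efve <b>\n"
doc_nfd = unicodedata.normalize("NFD", doc_nfc)

for doc in (doc_nfc, doc_nfd):
    line, column = locate(doc, "<")
    print(line, column, absolute_offset(doc, line, column))
# NFC: 101 6 406
# NFD: 101 7 607
```

The column differs by one between the two forms, because only the accents on
the current line matter. The absolute offset differs by 201, because every
accent in the document so far is counted.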

> >In either case I'm not overly concerned at this point. The worst that
> >happens is the insertion point may not be set quite right in some cases.
> >C'est le vie.

[le vie -> la vie]

>Given that all the common accented characters from German (äöüß), Spanish,
>French (éèç etc.), and the Scandinavian languages (æøåäö etc.) can
>potentially be one or two characters, the probability of a significant
>error increases proportionally with the length of the text. You could end
>up several hundred characters off even in medium sized documents with
>NFD. Given use of NFC, the error should probably be within a few characters
>and always less than the "real" offset.

Yes, if we count from the start of the document. But as Karl Uve
has said, NFC should be used. NFC has the nice property that most
data is already in NFC, or that it gets to NFC in a straightforward
way if converted one-to-one from a legacy (i.e. non-Unicode) encoding.
windows-1258 (Vietnamese) is an exception, which is why it's currently
excluded in /htdocs/config/charset.cfg (see
http://dev.w3.org/cvsweb/validator/htdocs/config/charset.cfg?rev=1.1).
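
As a small illustration in Python (a sketch only; is_nfc() is a made-up
helper, and the validator itself does not work this way): text decoded
one-to-one from ISO-8859-1 comes out in NFC, while windows-1258 carries its
tone marks as separate combining characters, so the decoded text does not.

```python
import unicodedata

def is_nfc(text):
    """True if text is unchanged by normalization to NFC."""
    return unicodedata.normalize("NFC", text) == text

# ISO-8859-1 has a single byte for e-acute, which maps to precomposed U+00E9.
from_latin1 = "caf\u00e9".encode("iso-8859-1").decode("iso-8859-1")

# windows-1258 encodes 'a' plus a separate combining acute accent (U+0301).
from_cp1258 = "a\u0301".encode("cp1258").decode("cp1258")

print(is_nfc(from_latin1))  # True
print(is_nfc(from_cp1258))  # False
print(len(from_cp1258), len(unicodedata.normalize("NFC", from_cp1258)))  # 2 1
```

Text from windows-1258 would therefore need an explicit normalization step
before its character counts are comparable with NFC data.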

>Anyways, you may be right that the problem is academic at this point. We
>can probably figure out a solution once we get to a point that we actually
>report these offsets somewhere. :-)

Yes, I wouldn't worry too much about it at this point.

Regards,    Martin.

Received on Tuesday, 19 June 2001 01:42:33 UTC