Re: Producing an XML report from the validator

On 18.06.01 at 11:56, Christian Smith <csmith@barebones.com> wrote:

>Character [offset] X of the document should be the same regardless of
>whether the file is in 7bit ascii or encoded as Unicode.

Unless I'm missing something, yes, but with the caveat below.


>>some combination symbols can be expressed either as their own unique code
>>point, or as a set of equivalent combination characters.
>
>And would this cause you to report a different character offset or would
>you report the same character offset regardless?

There may be something in the Unicode or ISO spec that specifies how to
count characters when faced with this problem, but I'm not familiar enough
with it to know either way (Martin?). Our offset would be the sum of the
lengths of all previous lines, plus the character offset on the current line
that SP reports. Both the line lengths and the offset in the current line
would be dependent on how our chosen implementation does the counts.
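
To make the arithmetic concrete, here is a minimal Python sketch of that
sum. The function name and the conventions (1-based line number, 0-based
column, line terminators counted as part of each line) are assumptions for
illustration only, not what SP or the validator actually does:

```python
def absolute_offset(text, line, column):
    """Hypothetical helper: turn a (line, column) pair into a document offset.

    Assumes `line` is 1-based, `column` is 0-based, and that offsets are
    counted in code points, with each line terminator counted as part of
    its line.
    """
    lines = text.splitlines(keepends=True)
    # Sum of the lengths of all previous lines, plus the offset on this line.
    return sum(len(l) for l in lines[:line - 1]) + column


doc = "ab\ncde\nf"
offset = absolute_offset(doc, 2, 1)  # points at the "d" on the second line
```

Counting in bytes or UTF-16 units instead of code points would change both
terms of the sum, which is the dependency described above.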


>In either case I'm not overly concerned at this point. The worst that
>happens is the insertion point may not be set quite right in some cases.
>C'est la vie.

Given that all the common accented characters from German ("ö"), Spanish,
French (éèá etc.), and the Scandinavian languages (æøåöä etc.) can
potentially be one or two characters, the probability of a significant
error increases proportionally with the length of the text. You could end
up several hundred characters off even in medium-sized documents with
NFD. Given use of NFC, the error should probably be within a few characters
and always less than the "real" offset.
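
As a rough illustration of how the drift accumulates, the Python sketch
below compares code point counts of the same text in NFC and NFD using the
standard unicodedata module. The helper name and the sample sentence are
made up for the example:

```python
import unicodedata


def nfd_drift(text):
    """Hypothetical helper: how many more code points `text` has in NFD than in NFC."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return len(nfd) - len(nfc)


# One short German line with three accented letters (o, u, a with umlaut),
# written with escapes so the source encoding does not matter.
line = "H\u00f6he \u00fcber dem Ger\u00e4t\n"

# Each accented letter becomes base letter + combining mark in NFD,
# so the drift grows with every such letter that precedes the offset.
drift_one_line = nfd_drift(line)
drift_ten_lines = nfd_drift(line * 10)
```

The drift is one extra code point per decomposable letter before the offset,
so it scales with document length as described above.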


Anyways, you may be right that the problem is academic at this point. We
can probably figure out a solution once we get to a point that we actually
report these offsets somewhere. :-)

Received on Monday, 18 June 2001 12:21:28 UTC