[Fwd: Re: Bytes, character encodings and characters]

-------- Original Message --------
Subject: Re: Bytes, character encodings and characters
Date: Thu, 04 May 2006 09:34:39 +0200
From: Johannes Koch <johannes.koch@fit.fraunhofer.de>
To: Shadi Abou-Zahra <shadi@w3.org>
References: <4458D046.1040003@fit.fraunhofer.de> 
<4458E233.6010105@w3.org> <44591743.5070901@fit.fraunhofer.de> 
<44591F97.7020709@w3.org>

Shadi Abou-Zahra wrote:
> Johannes Koch wrote:
> 
>>> and count the byteOffset correctly to publish a clean and valid report.
>>
>> If you want to use earl:byteOffset together with earl:textContent an 
>> EARL reading tool will need the character encoding that you used to 
>> create the byte sequence which forms the base for your counting.
> 
> Typo, I *did* mean charOffset!

Ah, ok.

>> If there is a byte sequence in the resource that cannot be transformed 
>> into a proper character sequence using the chosen character encoding, 
>> use base64Content for the snippet. Which character sequence would you 
>> like to put into a textContent snippet?
> 
> So this means we need to say something along the lines of "if the 
> original encoding in the Web content can not be represented in the 
> encoding of the EARL report, then base64Content needs to be used". To 
> use the same example, because you could not display UTF-16 in ASCII, you 
> should record the snippet in base64. Correct?

1. The problem lies in the resource itself: its bytes cannot be properly
transformed into characters with the chosen character encoding. That was
a use case Nick mentioned last week, I think. If you
want to record this error (improper byte sequence for character encoding
xxxxx), you will need the base64Content with byteOffset. You cannot
create a textContent with charOffset because you cannot transform the
bytes into characters. At least not the problematic ones. Of course you
could create a textContent with the characters up to the problematic
point. But then you could not create a charOffset pointing to a
character position in the textContent, because the problematic point is
not in the textContent.

2. You could encode an EARL report with whatever character encoding you
want. But when using US-ASCII you would need character references for
all characters above U+007F.
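
The two points above can be sketched in Python. This is only a minimal
illustration under stated assumptions, not an EARL implementation: the
function names are hypothetical, the dictionary keys merely borrow the
property names used in this thread (textContent, base64Content,
byteOffset), and byteOffset is assumed here to mean the position of the
first problematic byte, counted from the start of the byte sequence.

```python
import base64


def first_invalid_byte_offset(data: bytes, encoding: str):
    """Return the offset of the first byte that cannot be decoded
    with the given encoding, or None if the whole sequence decodes."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the index of the first undecodable byte.
        return err.start
    return None


def make_snippet(data: bytes, encoding: str) -> dict:
    """Hypothetical helper: choose between a text snippet and a
    base64 snippet, as described in point 1."""
    offset = first_invalid_byte_offset(data, encoding)
    if offset is None:
        # Every byte maps to a character, so a text snippet is possible.
        return {"textContent": data.decode(encoding)}
    # The bytes cannot all be turned into characters: keep the raw
    # bytes (base64-encoded) and point at the problem with a byte offset.
    return {
        "base64Content": base64.b64encode(data).decode("ascii"),
        "byteOffset": offset,
    }


def to_ascii_report(text: str) -> bytes:
    """Point 2: serialize text as US-ASCII, replacing every character
    above U+007F with a numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace")
```

For example, make_snippet(b"abc\xffdef", "utf-8") takes the base64 branch
because the byte 0xFF can never occur in UTF-8, while
to_ascii_report("café") writes the "é" as the character reference &#233;.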
-- 
Johannes Koch - Competence Center BIKA
Fraunhofer Institute for Applied Information Technology (FIT.LIFE)
Schloss Birlinghoven, D-53757 Sankt Augustin, Germany
Phone: +49-2241-142628



Received on Tuesday, 9 May 2006 21:36:42 UTC