[Fwd: Re: Bytes, character encodings and characters]

Sorry, this one went only to Shadi.

-------- Original Message --------
Subject: Re: Bytes, character encodings and characters
Date: Wed, 03 May 2006 22:49:07 +0200
From: Johannes Koch <johannes.koch@fit.fraunhofer.de>
To: Shadi Abou-Zahra <shadi@w3.org>
References: <4458D046.1040003@fit.fraunhofer.de> <4458E233.6010105@w3.org>

Shadi Abou-Zahra wrote:
> I believe we are on the same page but, just for the sake of
> completeness, here is a possible issue:
> 
> Say CE1 is UTF-16, and CE2 ASCII.

This is only possible if the EARL report contains only characters in the
Unicode range up to U+007F, because US-ASCII is limited to these.
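
As a rough illustration of that limit, here is a minimal Python sketch.
The function name fits_us_ascii is made up for this example and is not
part of EARL or of any tool discussed here.

```python
def fits_us_ascii(text: str) -> bool:
    """Return True if every character lies in U+0000..U+007F.

    Hypothetical helper: it just tries the US-ASCII encoder and
    reports whether the whole character sequence could be encoded.
    """
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # At least one character is above U+007F.
        return False
```

A string such as "EARL" passes, while one containing "ö" (U+00F6) does
not, so it could not be written out as US-ASCII at all.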

> You translate double-byte UTF-16 
> characters into single-byte ASCII characters

I translate characters (no matter where they came from) into a byte
sequence using US-ASCII ...

> and count the byteOffset 
> correctly to publish a clean and valid report.

If you want to use earl:byteOffset together with earl:textContent, an
EARL reading tool will need to know the character encoding that you used
to create the byte sequence on which your counting is based.

But, as I said last week, this mixing of levels doesn't make sense to me.

Use charOffset together with textContent.
Use byteOffset together with base64Content.
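
To show why a byte offset is meaningless without the encoding, here is a
small Python sketch. The function name offsets and the sample text are
assumptions made for this example only; "utf-16-le" is used so that no
byte order mark is added to the count.

```python
def offsets(text: str, needle: str, encoding: str) -> tuple:
    """Return (char_offset, byte_offset) of needle within text.

    Hypothetical helper: the byte offset is the length of the prefix
    before the match once that prefix is encoded with `encoding`.
    """
    char_offset = text.index(needle)
    byte_offset = len(text[:char_offset].encode(encoding))
    return char_offset, byte_offset


sample = "na\u00efve test"  # contains U+00EF, which is outside US-ASCII
for enc in ("latin-1", "utf-8", "utf-16-le"):
    print(enc, offsets(sample, "test", enc))
```

The character offset stays the same for every encoding, while the byte
offset changes with each one. That is the reason for pairing charOffset
with textContent and byteOffset with base64Content.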

If there is a byte sequence in the resource that cannot be transformed
into a proper character sequence using the chosen character encoding,
use base64Content for the snippet. After all, which character sequence
would you put into a textContent snippet in that case?
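
A minimal Python sketch of that fallback, assuming a made-up function
name (make_snippet) and a simple (kind, value) pair as the result;
neither is defined by EARL:

```python
import base64


def make_snippet(raw: bytes, encoding: str) -> tuple:
    """Pick a snippet representation for a byte sequence.

    Hypothetical helper: returns ("textContent", str) when the bytes
    decode cleanly with `encoding`, otherwise ("base64Content", str)
    holding the Base64 form of the untouched bytes.
    """
    try:
        return ("textContent", raw.decode(encoding))
    except UnicodeDecodeError:
        # No proper character sequence exists, so keep the bytes as bytes.
        return ("base64Content", base64.b64encode(raw).decode("ascii"))
```

With this, b"abc" read as UTF-8 yields a text snippet, while a byte such
as 0xFF, which is never valid in UTF-8, ends up in the Base64 form.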
-- 
Johannes Koch - Competence Center BIKA
Fraunhofer Institute for Applied Information Technology (FIT.LIFE)
Schloss Birlinghoven, D-53757 Sankt Augustin, Germany
Phone: +49-2241-142628



Received on Tuesday, 9 May 2006 21:34:55 UTC