Re: Bytes, character encodings and characters

Hi Johannes,

Thanks for taking this discussion to the mailing list for more reflection. I believe we are on the same page, but for the sake of completeness, here is a possible issue:

Say CE1 is UTF-16 and CE2 is ASCII. You translate double-byte UTF-16 characters into single-byte ASCII characters and count the byteOffset correctly to publish a clean and valid report. However, the reader/processor may need to know that the textContent was originally UTF-16 in order to decode the "UTF-16 in ASCII" and reassemble the original content (for example, to display it to the end user). Or am I overlooking how you want to translate UTF-16 characters into ASCII ones without going down to the byte level?
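
To make the question concrete, here is a minimal Python sketch of the CE1 = UTF-16 / CE2 = ASCII case. It is only an illustration under assumptions: the sample string, the variable names, and the use of numeric character references as an escape are mine, not anything EARL prescribes.

```python
# Resource bytes as a server might deliver them, with CE1 = UTF-16
# (little-endian, no byte order mark). "caf\u00e9" is "cafe" with e-acute.
resource_bytes = "caf\u00e9".encode("utf-16-le")

# Bytes -> characters using CE1. From here on there are only characters.
chars = resource_bytes.decode("utf-16-le")

# Characters -> bytes using CE2. Plain ASCII has no mapping for e-acute,
# so a strict encode fails instead of producing "UTF-16 in ASCII".
try:
    chars.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# One possible workaround (an assumption, not part of EARL): escape the
# unmappable characters, here as XML numeric character references.
escaped = chars.encode("ascii", errors="xmlcharrefreplace")

# A CE2 that covers every character needs no escaping, and the reader
# decodes with CE2 alone, without knowing that CE1 was UTF-16.
report_bytes = chars.encode("utf-8")
restored = report_bytes.decode("utf-8")

# The same character sits at different byte offsets in the two byte streams.
char_index = chars.index("\u00e9")
offset_in_resource = len(chars[:char_index].encode("utf-16-le"))
offset_in_report = len(chars[:char_index].encode("utf-8"))
```

In this sketch the reader never needs CE1 to recover the characters, but a byte offset counted against the original resource (6 here) is not the same number as one counted against the re-encoded text (3 here), so it matters which byte stream an offset refers to.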

Regards,
  Shadi


Johannes Koch wrote:
> 
> Hi group,
> 
> I didn't want to be rude, but I really could not see a problem with the 
> textContent property. So I will try to clarify my opinion.
> 
> 1. I make a request for a resource I want to check.
> 2. I get a response containing a sequence of bytes and, if it is a text 
> resource, hopefully a character encoding (CE) via some metadata 
> (Content-Type header in HTTP). Otherwise I use a default CE.
> 3. I use CE1 (specified or default) to transform the sequence of 
> bytes into a sequence of characters. From now on, I'm on the character 
> level, no bytes around anymore.
> 4. I extract a text snippet from the resource characters.
> 5. I create an EARL report containing the snippet. I'm still on the 
> character level.
> 6. I want to store the EARL report on the file system or send it over 
> the network. Therefore I have to transform the EARL report characters 
> into a sequence of bytes using a character encoding CE2. CE2 is not 
> required to be the same as CE1. However, CE2 should contain mappings for 
> all characters in the EARL report.
> 
> Of course, there is a step #0 prior to #1:
> 0: An author creates a text document by writing characters, then storing 
> the document on the web server file system by transforming the 
> characters into a sequence of bytes using a character encoding CE0.
> 
> It may also happen that the document is created by merging different 
> sources on the byte level instead of the character level. So there's a 
> problem when source1 uses a different CE than source2. Transforming the 
> merged bytes into a sequence of characters will not give the proper 
> result. But this is a problem with resource document creation, not with 
> EARL report creation.
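
Steps 2 to 6 of the quoted message, and the byte-level merge problem described at its end, could be sketched in Python roughly as follows. This is a hedged illustration: the function name, its parameters, the ISO-8859-1 default, and the sample strings are all hypothetical, and the "report" is reduced to the bare snippet rather than real EARL markup.

```python
def snippet_as_report_bytes(body, declared_ce=None, default_ce="iso-8859-1",
                            start=0, end=None, ce2="utf-8"):
    """Hypothetical helper walking through steps 2-6 of the quoted message."""
    ce1 = declared_ce or default_ce   # step 2: CE from metadata, else a default
    chars = body.decode(ce1)          # step 3: bytes -> characters using CE1
    snippet = chars[start:end]        # step 4: extract a snippet (character level)
    report = snippet                  # step 5: placeholder for the EARL report text
    return report.encode(ce2)         # step 6: characters -> bytes using CE2


# "Gr\u00fc\u00dfe" contains u-umlaut and sharp s.
word = "Gr\u00fc\u00dfe"
out = snippet_as_report_bytes(word.encode("utf-16"), "utf-16", start=2, end=4)

# Byte-level merge of two sources that use different encodings.
source1 = word.encode("utf-8")
source2 = word.encode("iso-8859-1")
merged_bytes = source1 + b" " + source2

# No single CE decodes the merged bytes properly:
try:
    merged_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False                   # the ISO-8859-1 half is invalid UTF-8
garbled = merged_bytes.decode("iso-8859-1")   # decodes, but the first half is wrong

# Merging on the character level gives the intended text.
merged_chars = source1.decode("utf-8") + " " + source2.decode("iso-8859-1")
```

Here the checker only ever sees `merged_bytes`, so whichever CE1 it picks, part of the text comes out wrong; that damage already exists in the resource before any report is created.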

-- 
Shadi Abou-Zahra     Web Accessibility Specialist for Europe | 
Chair & Staff Contact for the Evaluation and Repair Tools WG | 
World Wide Web Consortium (W3C)           http://www.w3.org/ | 
Web Accessibility Initiative (WAI),   http://www.w3.org/WAI/ | 
WAI-TIES Project,                http://www.w3.org/WAI/TIES/ | 
Evaluation and Repair Tools WG,    http://www.w3.org/WAI/ER/ | 
2004, Route des Lucioles - 06560,  Sophia-Antipolis - France | 
Voice: +33(0)4 92 38 50 64          Fax: +33(0)4 92 38 78 22 | 

Received on Wednesday, 3 May 2006 17:02:56 UTC