Re: Unconference topic suggestion: Conformance checker tests from Henri Sivonen on 2007-11-02 (public-html@w3.org from November 2007)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Fri, 2 Nov 2007 13:57:23 +0200
To: L.David Baron <dbaron@dbaron.org>
Cc: HTML WG <public-html@w3.org>
Message-Id: <49125ABC-1004-479D-93AA-E876BB5D8A07@iki.fi>

On Nov 1, 2007, at 17:13, L. David Baron wrote:

> On Wednesday 2007-10-31 21:37 +0200, Henri Sivonen wrote:
>>  * To avoid issues with counting column positions in UTF-16
>> code units as opposed to Unicode characters, test cases should use
>> the Basic Latin range when a given error can be elicited with Basic
>> Latin only.
>
> This one makes me a little uneasy, since tests often catch bugs that
> aren't the bugs the test author was trying to catch.  This means
> having variety in tests is good, since it means the tests will catch
> more coincidental bugs and more bugs caused by feature interactions.

Yes, finding coincidental bugs is good. My point was over-cautiously  
formulated.

What I wanted to say was this:
Counting source locations by UTF-16 code units is a relatively deeply  
entrenched practice. Even though it isn't quite OK theoretically, it  
works so well that I think a test suite should not require counting  
by UTF-32 code units. Converting to UTF-32 locations *correctly* has  
little practical value compared to the software complexity it would  
entail. OTOH, counting by UTF-16 code units is admittedly  
theoretically quirky, so implementations should be allowed to count  
UTF-32 code units if that suits them better. When a test case isn't  
specifically testing counting UTF-16 code units vs. counting UTF-32  
code units, it shouldn't coincidentally poke this issue.
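
To illustrate the difference, here is a minimal Python sketch. It is not
from the original message; the function names and the zero-based column
convention are assumptions made for the example. It computes the column
of a character both ways and shows that the two counts agree for Basic
Latin input but diverge once a character outside the Basic Multilingual
Plane precedes the reported position.

```python
def column_utf16(line: str, index: int) -> int:
    """Zero-based column of line[index], counted in UTF-16 code units.

    Assumption for this sketch: `index` is a code point index into `line`.
    """
    # Each UTF-16 code unit is two bytes, so halve the encoded length.
    return len(line[:index].encode("utf-16-le")) // 2


def column_utf32(line: str, index: int) -> int:
    """Zero-based column of line[index], counted in UTF-32 code units.

    A Python str is indexed by code point, so the index is the count.
    """
    return len(line[:index])


basic_latin = "ab<"
with_astral = "a\U0001D11Eb<"  # U+1D11E needs a surrogate pair in UTF-16

# Both counting schemes agree when only Basic Latin precedes the "<".
print(column_utf16(basic_latin, 2), column_utf32(basic_latin, 2))

# They disagree once a non-BMP character precedes the "<".
print(column_utf16(with_astral, 3), column_utf32(with_astral, 3))
```

A test case restricted to Basic Latin therefore yields the same expected
column under either scheme, so it does not favor one implementation
choice over the other.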

I put Basic Latin there to dodge the issue of "column" suggesting  
counting graphemes.

As a related matter, designating the source location of a line break
is thorny. The common practice (for performance reasons) is to count
the position as column zero on the line after the break. However, I
found that for implementing source code highlighting, it was better
to take a perf hit and designate the position as the last non-break
position plus one on the line that the line break terminates. An
implementation-independent test suite should probably refrain from
poking this issue when it claims to be testing something else,
although this may be harder than steering clear of the UTF-16 vs.
UTF-32 issue.
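
The two conventions for locating a line break can be sketched as
follows. This is a hypothetical illustration, not code from any actual
checker: the `locate` function, its parameters, the 1-based line and
column numbering, and the restriction to "\n" as the only line break
are all assumptions made for the example.

```python
from typing import Tuple


def locate(text: str, offset: int, break_on_next_line: bool) -> Tuple[int, int]:
    """Return (line, column) of text[offset].

    Lines are 1-based and the first character of a line is column 1.
    Only "\n" is treated as a line break (an assumption of this sketch).

    break_on_next_line=True:  the break is column 0 of the following line.
    break_on_next_line=False: the break is the last non-break column plus
                              one on the line that it terminates.
    """
    line, col = 1, 0
    pending_break = False  # a break was seen but the line not yet advanced
    for i, ch in enumerate(text):
        if pending_break:
            line, col, pending_break = line + 1, 0, False
        if ch == "\n":
            if break_on_next_line:
                line, col = line + 1, 0
            else:
                col += 1
                pending_break = True
        else:
            col += 1
        if i == offset:
            return (line, col)
    raise IndexError("offset is outside the text")


sample = "ab\ncd"
print(locate(sample, 2, True))   # location of the break, first convention
print(locate(sample, 2, False))  # location of the break, second convention
print(locate(sample, 3, True), locate(sample, 3, False))  # the "c" after it
```

Under these assumptions only the location of the break itself differs
between the two conventions; every other character gets the same line
and column. A test whose expected error location falls on a line break
would therefore depend on this implementation choice.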

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Friday, 2 November 2007 11:57:48 UTC