- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Fri, 2 Nov 2007 13:57:23 +0200
- To: L.David Baron <dbaron@dbaron.org>
- Cc: HTML WG <public-html@w3.org>
On Nov 1, 2007, at 17:13, L. David Baron wrote:

> On Wednesday 2007-10-31 21:37 +0200, Henri Sivonen wrote:
>> * To avoid issues with counting column positions in UTF-16
>> code units as opposed to Unicode characters, test cases should use
>> the Basic Latin range when a given error can be elicited with Basic
>> Latin only.
>
> This one makes me a little uneasy, since tests often catch bugs that
> aren't the bugs the test author was trying to catch. This means
> having variety in tests is good, since it means the tests will catch
> more coincidental bugs and more bugs caused by feature interactions.

Yes, finding coincidental bugs is good. My point was over-cautiously
formulated. What I wanted to say was this:

Counting source locations by UTF-16 code units is a relatively deeply
entrenched practice. Even though it isn't quite OK theoretically, it
works so well that I think a test suite should not require counting by
UTF-32 code units. Converting to UTF-32 locations *correctly* has
little practical value compared to the software complexity it would
entail.

OTOH, counting by UTF-16 code units is admittedly theoretically
quirky, so implementations should be allowed to count UTF-32 code
units if that suits them better.

When a test case isn't specifically testing counting UTF-16 code units
vs. counting UTF-32 code units, it shouldn't coincidentally poke this
issue. I put Basic Latin there to dodge the issue of "column"
suggesting counting graphemes.

As a related issue, designating the source location of a line break is
a thorny issue. The common practice (for performance reasons) is to
count the position as column zero on the line after the break.
However, I found that for implementing source code highlighting, it
was better to take a perf hit and designate the position as last
non-break position plus one on the line that the line break
terminates.
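
To make the two counting questions above concrete, here is a minimal
Python sketch. It is only an illustration, not code from any actual
checker: the function names are hypothetical, columns of ordinary
characters are assumed to be 1-based, and "\n" is assumed to be the
only line break.

```python
def column_utf16(line: str, index: int) -> int:
    """1-based column of line[index], counted in UTF-16 code units."""
    # Each UTF-16 code unit is two bytes; astral characters take two units.
    return len(line[:index].encode("utf-16-le")) // 2 + 1


def column_code_points(line: str, index: int) -> int:
    """1-based column of line[index], counted in code points (UTF-32 units)."""
    # Python strings are indexed by code point, so this is just the index.
    return index + 1


def line_break_position(text: str, break_index: int, convention: str):
    """(line, column) reported for the "\n" at text[break_index].

    convention "next_line_zero": column zero on the line after the break.
    convention "end_of_line": last non-break position plus one on the
    line that the break terminates.
    """
    if text[break_index] != "\n":
        raise ValueError("break_index must point at a line break")
    line = text.count("\n", 0, break_index) + 1          # 1-based line number
    line_start = text.rfind("\n", 0, break_index) + 1    # index of line's first char
    if convention == "next_line_zero":
        return (line + 1, 0)
    if convention == "end_of_line":
        return (line, break_index - line_start + 1)
    raise ValueError("unknown convention: " + convention)
```

With Basic Latin only, as in "ab<", both column functions agree on the
column of "<". With a character outside the BMP before it, as in
"a\U0001D11E<", they differ by one. Likewise, the break in "ab\ncd" is
reported on line 2 under the first convention and on line 1 under the
second, so a test that expects one of these coincidentally pokes the
issue.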
An implementation-independent test suite should probably refrain from
poking this issue when it claims to be testing something else,
although this may be harder than steering clear of the UTF-16 vs.
UTF-32 issue.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Friday, 2 November 2007 11:57:48 UTC