Re: [whatwg] Case-sensitivity of CSS type selectors in HTML from Roger Hågensen on 2015-05-08 (public-whatwg-archive@w3.org from May 2015)

From: Roger Hågensen <rh_whatwg@skuldwyrm.no>
Date: Fri, 08 May 2015 17:56:24 +0200
To: whatwg@lists.whatwg.org
Message-ID: <554CDCA8.90204@skuldwyrm.no>

On 2015-05-07 15:59, Boris Zbarsky wrote:
> On 5/7/15 7:16 AM, Rune Lillesveen wrote:
>> This adds an implementation complexity to type selector matching.
>> What's the rationale for matching the selector case-sensitively in the
>> svg case?
>
> The idea is to allow the selector match to be done case-sensitively in
> all cases so it can be done as equality comparison on interned string
> representations instead of needing expensive case-insensitive matching
> on hot paths in the style system.

(Note! This is veering a little off topic.)

One way to cheapen the computational cost is to do only partial 
case-insensitive matching.

If (character >= $0041) And (character <= $005A)
     character = (character | $0020)
EndIf

Basically, if the character is 'A' to 'Z' then the 6th bit ($20) is set, 
turning 'A' to 'Z' into 'a' to 'z'. This works for 7-bit ASCII, Latin-1, 
and Unicode (UTF-8, for example). No need for table lookups; it can all 
be done in CPU registers.
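For illustration, the check above could be written in C roughly as 
follows. This is only a sketch: the names ascii_lower and 
ascii_lower_buf are made up for this example and are not taken from any 
browser engine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fold one byte to lower case only if it is
 * ASCII 'A'..'Z' (0x41..0x5A). Every other value is returned as-is. */
static unsigned char ascii_lower(unsigned char c)
{
    if (c >= 0x41 && c <= 0x5A)
        c |= 0x20;              /* set the 0x20 bit: 'A' becomes 'a' */
    return c;
}

/* Hypothetical helper: apply the same fold to a whole buffer in place. */
static void ascii_lower_buf(unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        buf[i] = ascii_lower(buf[i]);
}
```

Bytes that belong to a multi-byte UTF-8 sequence are all $80 or above, 
so they pass through untouched. The same is true for Latin-1 letters 
such as 'Å', which means those are simply not case-folded.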

Other commonly used characters like '0' to '9' or '_' have no 
lower/upper case. And more language-specific characters are not ideal 
for such use anyway (people of mixed nationalities would have issues 
typing those characters).

So there is no need to do full case-insensitive matching. Just do a 
partial "to lower case" normalization of 'A' to 'Z' and then do a 
simple binary comparison.
In optimized C or ASM this should perform really well compared to 
calling a Unicode function to normalize and lowercase the text.
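A minimal C sketch of "normalize, then compare" might look like this. 
The function names normalize_ascii and names_match are hypothetical, 
and the sketch assumes the normalization is done once (for example when 
a name is parsed) so that the later comparison is a plain memcmp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: fold only ASCII 'A'..'Z' to lower case, in place. */
static void normalize_ascii(char *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c >= 'A' && c <= 'Z')
            s[i] = (char)(c | 0x20);
    }
}

/* Hypothetical: byte-for-byte comparison of two already normalized names. */
static bool names_match(const char *a, size_t alen,
                        const char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}
```

With this, a name written as "ClipPath" would be stored as "clippath" 
after normalize_ascii, and matching it against another normalized name 
needs no case handling at all.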

This would mean restricting to 'A' to 'Z', 'a' to 'z', '0' to '9', and 
'_', but all tags/elements/properties/whatever that I can recall seeing 
only ever use those characters.
I certainly won't complain if I can't use the letter 'å' in the code, 
then again I never use "weird characters" in code in the first place.
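As a rough sketch, a check for that restricted character set could look 
like the following in C. The name is_restricted_name is made up for 
this example, and the set it accepts is exactly the one listed above, 
nothing more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: true if every byte is 'A'..'Z', 'a'..'z', '0'..'9' or '_'.
 * An empty name (len == 0) is reported as true by this sketch. */
static bool is_restricted_name(const char *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        bool ok = (c >= 'A' && c <= 'Z') ||
                  (c >= 'a' && c <= 'z') ||
                  (c >= '0' && c <= '9') ||
                  c == '_';
        if (!ok)
            return false;
    }
    return true;
}
```

A name containing 'å' encoded as UTF-8 would be rejected, since both of 
its bytes fall outside the accepted ranges.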

How does it look in the wild? If only A to Z is used in xx% of cases 
then restricting to that character range would allow very quick 
lowercasing and thus allow use of fast binary matching.

-- 
Roger Hågensen, Freelancer, http://skuldwyrm.no/


Received on Friday, 8 May 2015 15:56:53 UTC