Re: several messages about New Vocabularies in text/html from Henri Sivonen on 2008-04-02 (public-html@w3.org from April 2008)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 2 Apr 2008 11:47:31 +0300
To: Ian Hickson <ian@hixie.ch>
Cc: Sam Ruby <rubys@us.ibm.com>, Neil Soiffer <Neils@dessci.com>, public-html@w3.org, www-math@w3.org
Message-Id: <73A29D0D-558F-4DA2-9C1D-D2A306A8C89C@iki.fi>

On Apr 2, 2008, at 04:57, Ian Hickson wrote:
> On Tue, 1 Apr 2008, Neil Soiffer wrote:
>>
>> I meant that content MathML doesn't need to be directly supported.
>> However, it should be accepted as part of <annotation-xml>, where it is
>> easily ignored.
>
> HTML5 today has about 110 elements. Presentational MathML has about 30.
> Content MathML has about 140.
>
> _Doubling_ the number of elements allowed in text/html just so that all
> those elements can be ignored seems like a fundamentally bad idea. (It
> also more than doubles the number of elements that the parser has to know
> about.)
[...]
> Should we really be dedicating _half of the language's vocabulary_ to such
> a small use case?

Devil's advocate mode on:

Doubling the number of elements indeed seems like a really bad idea on  
the face of it, especially if the browser isn't doing anything with  
those elements but passing them to the clipboard for export.

However, not having to do anything with those elements means that  
outside the parser, the per-element implementation cost is zero within  
the browser. For <msqrt>, you have to implement non-trivial glyph  
stretching in rendering. For <root>, you don't need to implement  
anything!

Now, within the parser, what is really needed is an efficient token
interning function. That is, for each known element, there should be an
object that has three fields: interned local name, interned namespace
URI and a magic enumeration/integer representing a tree builder token
treatment class that the tree builder can switch on. That is,
when it is time to emit a token, the value of the current name
buffer would be used to locate the corresponding interned token
object. Presumably, Content MathML wouldn't even need additional magic
enumeration values but could use two magic values that would already
be needed for SVG and Presentation MathML: GENERIC_CONTAINER and
GENERIC_VOID.
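
To make the three-field object concrete, here is a minimal Python sketch. Only GENERIC_CONTAINER and GENERIC_VOID come from the text above; the names ElementName, Treatment and intern_element, the dictionary lookup, and the treatment class assigned to each sample element are assumptions made for illustration. The sketch also ignores the case where one local name exists in more than one namespace.

```python
import sys
from dataclasses import dataclass
from enum import Enum, auto


class Treatment(Enum):
    """Tree builder treatment classes (only these two are named in the text)."""
    GENERIC_CONTAINER = auto()
    GENERIC_VOID = auto()


MATHML_NS = sys.intern("http://www.w3.org/1998/Math/MathML")


@dataclass(frozen=True)
class ElementName:
    """One shared object per known element: the three fields from the text."""
    local_name: str       # interned local name
    namespace_uri: str    # interned namespace URI
    treatment: Treatment  # value the tree builder switches on


# Hypothetical table of known elements; the classifications are illustrative.
_KNOWN = {}
for _local, _treatment in [
    ("msqrt", Treatment.GENERIC_CONTAINER),  # Presentation MathML
    ("apply", Treatment.GENERIC_CONTAINER),  # Content MathML
    ("root", Treatment.GENERIC_VOID),        # Content MathML
]:
    _local = sys.intern(_local)
    _KNOWN[_local] = ElementName(_local, MATHML_NS, _treatment)


def intern_element(name_buffer):
    """Map the tokenizer's current name buffer to the shared token object.

    Returns None for unknown names; a real parser would need a fallback.
    """
    return _KNOWN.get("".join(name_buffer))
```

Because every occurrence of a known tag resolves to the same object, the tree builder can branch on the treatment field without comparing strings again.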

The most naive but still not totally silly implementation for the  
token interning function would be an array sorted by local name and  
doing a binary search on that array. With binary search, doubling the  
number of known elements adds only one string comparison per tag!
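
The "one more comparison" claim can be checked with a short Python sketch. The element names below are synthetic placeholders, and the counts 140 and 280 are taken from the rough figures quoted above (about 110 HTML5 plus 30 Presentation MathML elements, then another 140 or so for Content MathML). The functions binary_search and worst_case_probes are hypothetical helpers, and each probe stands for one three-way string comparison.

```python
def binary_search(sorted_names, name):
    """Return (index, probes); index is -1 if name is absent.

    Each probe corresponds to one three-way string comparison.
    """
    lo, hi = 0, len(sorted_names) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        candidate = sorted_names[mid]
        if candidate == name:
            return mid, probes
        if candidate < name:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes


def worst_case_probes(element_count):
    """Largest probe count needed to find any of element_count known names."""
    # Synthetic, zero-padded names so that sorted order is well defined.
    names = sorted(f"e{i:03d}" for i in range(element_count))
    return max(binary_search(names, n)[1] for n in names)


before = worst_case_probes(140)  # HTML5 + Presentation MathML (approx.)
after = worst_case_probes(280)   # ... plus Content MathML (approx.)
```

With these figures the worst case goes from 8 probes to 9, so doubling the table costs a single extra comparison per lookup.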

P.S. I'm not sure if a sorted array with binary search even makes the  
most efficient interning function here. I've observed that within  
HTML5, for a given element name length the last couple of characters  
in an element name are enough to prune the number of possible element  
candidates down to one. Therefore, a possible (generated) interning  
function which would use fewer virtual method invocations but would  
increase the code size would first do a switch on the name length,  
then do a switch on the last character and then on the second-last  
until there's one candidate and then inspect the remaining characters  
for a match against the single candidate. But I haven't really  
analyzed if this approach would beat binary search.
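
For comparison, the length-then-trailing-characters idea could look roughly like the following Python sketch. A generated implementation would presumably be nested switch statements; nested dictionaries stand in for them here. The function names build_table and lookup and the sample element list are assumptions, and the sketch makes no claim about which approach is faster.

```python
def _build_node(candidates, depth):
    """Split same-length candidates on the depth-th character from the end."""
    if len(candidates) == 1:
        return candidates[0]  # leaf: the single remaining candidate
    branches = {}
    for name in candidates:
        branches.setdefault(name[-depth], []).append(name)
    return {ch: _build_node(group, depth + 1) for ch, group in branches.items()}


def build_table(names):
    """First level: switch on name length; deeper levels: trailing characters."""
    by_length = {}
    for name in set(names):
        by_length.setdefault(len(name), []).append(name)
    return {length: _build_node(group, 1) for length, group in by_length.items()}


def lookup(table, name):
    """Return the known element name equal to `name`, or None."""
    node = table.get(len(name))
    depth = 1
    while isinstance(node, dict):
        node = node.get(name[-depth])  # last char, then second-last, ...
        depth += 1
    # One candidate (or none) remains. For simplicity the whole name is
    # compared; a generated version would check only the unexamined characters.
    return node if node == name else None


# Illustrative subset of element names, not a complete list.
TABLE = build_table(["a", "p", "br", "hr", "td", "th", "tr",
                     "div", "img", "pre", "math", "mfrac", "mroot", "msqrt"])
```

For example, lookup(TABLE, "mroot") switches on length 5, then on "t", then on "o", and only then compares against the single surviving candidate.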

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Wednesday, 2 April 2008 08:48:26 UTC