Re: Minting new FPIs and the entity problem in browsers

(removed WAI-PF - they wanted out of this discussion ;-)

Sorry for the delay in responding, I got distracted with child rearing.

Simon Pieters wrote:
>
> On Wed, 30 Jul 2008 19:32:29 +0200, Shane McCarron <shane@aptest.com> 
> wrote:
>>> You're causing grief for browser vendors. See 
>>> http://annevankesteren.nl/2007/12/xml-entities
>> The XHTML Family supports a well defined collection of entities.  You 
>> can 1) dereference the DTD from the DOCTYPE declaration to learn them,
>
> No, we can't really do this. Doing so would be a massive distributed 
> denial of service on w3.org and make w3.org a single point of failure. 
> See http://hsivonen.iki.fi/no-dtd/

Well.... not really.  I mean, you could pre-cache well known ones, cache 
new ones as you encounter them, and fall back to normal XML entity 
processing rules if the DTD were not retrievable...   However, you could 
not just cache based upon FPI - it would have to be the combination of 
FPI and SYSTEM id.  Otherwise someone could do a pretty neat 
man-in-the-middle attack, masquerading their weird version of 
XHTML-whatever as the real one.
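
To make the cache key concrete, here is a minimal sketch in Python.  
The class name and the sample entity table are made up for 
illustration; the only point is that the lookup key is the (FPI, SYSTEM 
id) pair, never the FPI alone.

```python
from typing import Dict, Optional, Tuple

# Hypothetical types for this sketch, not taken from any browser.
DtdKey = Tuple[str, str]          # (FPI, SYSTEM id)
EntityTable = Dict[str, str]      # entity name -> replacement text


class DtdEntityCache:
    """Caches entity tables keyed on FPI *and* SYSTEM id together."""

    def __init__(self) -> None:
        self._entries: Dict[DtdKey, EntityTable] = {}

    def store(self, fpi: str, system_id: str, entities: EntityTable) -> None:
        # Copy the table so later changes by the caller do not leak in.
        self._entries[(fpi, system_id)] = dict(entities)

    def lookup(self, fpi: str, system_id: str) -> Optional[EntityTable]:
        # A familiar FPI paired with an unfamiliar SYSTEM id is a miss.
        return self._entries.get((fpi, system_id))


cache = DtdEntityCache()
cache.store(
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
    {"nbsp": "\u00a0", "copy": "\u00a9"},  # tiny sample, not the full set
)
```

A document that reuses the well known FPI but points its SYSTEM id 
somewhere else simply misses the cache, so it cannot borrow the cached 
entity set.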

>> or 2) you can say "that's a well known FPI, I know what that means" or
>
> Yeah, that's what we're doing now, but a browser can't know about ones 
> that are minted after the browser shipped...

Of course.  But it could learn them.  Or *you* could learn them and 
notify your installed base as part of what would effectively be similar 
to an anti-virus "signature update".  Periodically update the collection 
of known, supported FPIs and their entity collections.  Assuming it is 
only the entity collections you care about.  That surprises me, but 
whatever.

>> 3) you can say "That FPI matches the pattern for XHTML Family FPIs, 
>> which is well known, and I know what that means".
>
> Could you elaborate on how this would work, exactly? See if the FPI 
> starts with "-//W3C//DTD XHTML" and, if so, feeding all the entity 
> declarations?

Well, I think that would be legitimate.  All XHTML Family Markup 
Languages include the same base collection of defined entities.  
However, that is not actually mandated.  If it would make the browser 
vendors' lives easier, I think the XHTML 2 Working Group would be open 
to requiring that those entity sets be a part of all XHTML Family 
languages.
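
A rough Python sketch of the prefix check described above.  The prefix 
is the one from your question.  Using Python's html.entities table as 
the "base collection" is an assumption made for the sake of the 
example: it holds the HTML 4 names, which stand in here for the XHTML 
entity sets.

```python
from html.entities import name2codepoint
from typing import Dict, Optional

# Prefix taken from the question above; the function name is hypothetical.
XHTML_FAMILY_PREFIX = "-//W3C//DTD XHTML"


def entities_for_fpi(fpi: str) -> Optional[Dict[str, str]]:
    """Return the base entity table if the FPI looks like XHTML Family."""
    if not fpi.startswith(XHTML_FAMILY_PREFIX):
        return None  # not recognized by pattern; caller must do something else
    # Assumption: the HTML 4 entity names stand in for the XHTML sets.
    return {name: chr(codepoint) for name, codepoint in name2codepoint.items()}


known = entities_for_fpi("-//W3C//DTD XHTML+RDFa 1.0//EN")
unknown = entities_for_fpi("-//OASIS//DTD DocBook XML V4.5//EN")
```

Note this only guesses at what the DTD declares.  It is safe only if 
every family member really does include the base sets, which, as noted, 
is not mandated today.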

>> Finally, you could do 1) when 2) or 3) was not true, but then learn 
>> the FPI and treat it as 2) from then on.  Or, you could just do what 
>> XHTML M12N says you should do in this case, which is in production 6 
>> of clause 3.5 of XHTML M12N.
>
> Do you have a pointer?

Of course.  See 
http://www.w3.org/TR/xhtml-modularization/conformance.html#s_conform_user_agent 
- clause 6 clearly says how a user agent should behave when it doesn't 
know an entity.  This is consistent with the requirements for a 
non-validating XML processor as defined in the XML spec section 4.4.3.  
The XML parser tells the "application", the user agent, that an entity 
reference was not expanded.  The user agent is then free to do 
something.... a la what is recommended in the M12N spec.
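
As an illustration only, here is a Python sketch of that "do 
something" step.  It assumes the recommended fallback is to process an 
unknown reference as its literal characters, ampersand through 
semicolon; check the conformance text before relying on this.  It works 
on a plain string with a regular expression, so it is not a real XML 
parser, and the function name is made up.

```python
import re
from typing import Dict

# The five entities every XML processor must know.
XML_PREDEFINED = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}

# Simplified pattern for a named entity reference (ASCII names only).
_ENTITY_REF = re.compile(r"&([A-Za-z_:][A-Za-z0-9_.:-]*);")


def expand_entities(text: str, declared: Dict[str, str]) -> str:
    """Expand known references; leave unknown ones as literal text."""
    known = {**XML_PREDEFINED, **declared}

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        # Unknown entity: keep "&name;" exactly as written, no fatal error.
        return known.get(name, match.group(0))

    return _ENTITY_REF.sub(replace, text)


result = expand_entities("a &lt; b &nbsp;&bogus;", {"nbsp": "\u00a0"})
```

The reader sees the raw "&bogus;" in the output, which is ugly but a 
lot friendlier than refusing to show the document at all.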

>> There are lots of solutions to the XML Entity problem.
>
> The one I like best is to add all HTML and MathML entities as 
> predefined ones in XML so they can be used with any doctype or no 
> doctype at all.

I honestly don't think that is very forward looking.  It's a fine 
default, but if I provide a user agent with a DOCTYPE and a SYSTEM 
identifier, and it doesn't know what that is, it should try to load it 
and use it.  IMHO.
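
Roughly, in Python, the flow would be: use what you already know, 
otherwise try to load the DTD, remember the result, and fall back to 
the XML defaults if the load fails.  Everything here is hypothetical: 
the function names, the fetch callback, and the idea that a fetch 
returns a ready-made entity table.

```python
from typing import Callable, Dict, Optional, Tuple

EntityTable = Dict[str, str]
DtdKey = Tuple[str, str]  # (FPI, SYSTEM id)
# Hypothetical callback: returns the DTD's entities, or None on failure.
Fetcher = Callable[[str], Optional[EntityTable]]


def resolve_entities(
    fpi: str,
    system_id: str,
    known: Dict[DtdKey, EntityTable],
    fetch: Fetcher,
) -> EntityTable:
    """Return the entity table to use for a document's DOCTYPE."""
    key = (fpi, system_id)
    if key in known:
        return known[key]          # already known, no network access
    fetched = fetch(system_id)     # unknown DOCTYPE: try to load it
    if fetched is None:
        return {}                  # not retrievable: XML defaults only
    known[key] = fetched           # learn it for next time
    return fetched


calls = []


def fake_fetch(system_id: str) -> Optional[EntityTable]:
    # Stand-in for a real download, so the sketch runs offline.
    calls.append(system_id)
    return {"nbsp": "\u00a0"}


known_dtds: Dict[DtdKey, EntityTable] = {}
first = resolve_entities("-//X//DTD Y//EN", "http://example.invalid/y.dtd",
                         known_dtds, fake_fetch)
second = resolve_entities("-//X//DTD Y//EN", "http://example.invalid/y.dtd",
                          known_dtds, fake_fetch)
```

The second call is answered from the learned table, so each new 
DOCTYPE costs one fetch per installation rather than one per page load.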

>> But that problem is not really relevant to the question of whether 
>> FPIs have meaning and whether creating new ones is problematic.
>
> Well for browsers in XML, the only meaning FPIs have is whether or not 
> to load a bunch of entity declarations.

For some browsers I suppose that is true.  Some user agents and 
processors do validation on the fly (WAP gateways, for example).  They 
pay attention to these. 

>> Let me put this another way.  If there were no DOCTYPE - no 
>> declaration of any type about what version of what markup language a 
>> document was written in, what would a browser do?
>
> The same as it would if there was a doctype, except don't load in any 
> entity declarations...
>
>
>> I mean, I know what it would do today if it were served as 
>> text/html... use some broken tag soup parser that has been around for 
>> ages and try to guess what I meant.  Reverse engineer a DOM tree that 
>> is probably right, but maybe not.  'cause it guessed.  Makes me 
>> crazy.  Makes most people crazy.  It's the best reason for the HTML5 
>> spec.  Lock down the broken behavior so it makes people predictably 
>> crazy.
>>
>> But what if it were XML and served as application/xhtml+xml?  What 
>> would a user agent do then?  Presumably it would follow the rules as 
>> set forth in XML for parsing, the XML DOM rules for DOM generation, 
>> and for elements in the XHTML namespace, it would look to XHTML M12N 
>> for behavior. Probably with some arcane knowledge based upon 
>> historical practice, because that's just how programmers work.  But 
>> in general it would follow the rules for behavioral requirements for 
>> XHTML...  And that's exactly what it should do.
>>
>> If there were a DOCTYPE declaration, what would it do differently?
>
> If the FPI is unknown, the browser could opt to not be fatal when 
> encountering an undeclared entity reference (like Opera does). If the 
> FPI is known it would load in a bunch of entity declarations. And 
> that's all.

Actually, I think it MUST opt to not be fatal unless the XML says 
'standalone="yes"' but I am not 100% certain.

>> If that DOCTYPE adheres to the naming requirements in M12N and 
>> matches the pattern for XHTML Family, then it should do the same thing.
>
> Are there conformance criteria in M12N (or elsewhere) that UAs should 
> match against a pattern for "XHTML Family"?

Yes - see the reference above.

>> If it has some inbuilt knowledge about some XHTML family document 
>> types and wants to do something special for them, I suppose that's 
>> okay too.
>
> Such as loading in entity declarations?

Yes, that is certainly permitted.  Or knowing the content model ahead of 
time, permissible datatypes for elements.... whatever.  If you are 
caching FPI-related information, you should cache whatever your user 
agent needs to grok the elements and attributes that markup language uses.

>> But the default rules can and should apply in all cases for all 
>> unknown family members.
>
> How does one know if an unknown FPI is an XHTML family member or not?

See above.

>> That's what the recommendations say,
>
> Where?

Again, see above.

>> and I am pretty sure we meant it when we wrote them.
>>
>> What's the alternative?  Have inbuilt knowledge about a handful of 
>> predefined markup languages based upon FPIs, and then fail to process 
>> new members of the XHTML Family?
>
> This is not about support for the language -- that's dispatched on 
> namespaces -- but merely about entities.
>
> Mozilla, Opera and WebKit have knowledge about a handful of FPIs that 
> will load in the HTML or HTML+MathML entity declarations. If you use a 
> different FPI, only the 5 XML entities can be used -- others will 
> fail, either gracefully (Opera) or fatally (Mozilla/WebKit).

Well - a fatal failure is a bug.  XML doesn't permit that as I read it.  
And switching on namespaces is not really safe - namespaces are 
overloaded.  But regardless, if that is what people do...

>> That is certainly a violation of the spirit of the requirements for 
>> user agent conformance in M12N.
>
> Well, loading in entity declarations based on pattern matching on the 
> FPI and hoping that the referenced DTD contained those same 
> declarations is certainly a violation of the XML spec, AFAICT.

Not if there is a superseding standard that permits it.  But your point 
is well taken.  You should really be loading them, not guessing.


>> What a user agent that claims to support XHTML must do is say "oh, 
>> this is XHTML family. I know the rules for that" and just deal with 
>> it. FWIW, the XHTML2 Working Group "mints" new FPIs all the time.
>
> Well so long as they aren't used on the Web we're not really affected. 
> :-) But it would be nice if new FPIs weren't minted, at least until we 
> change XML to get more default entities.

Understood.  But the truth is that we are crafting new variations of 
XHTML all the time.  Moreover, that was the whole point of XHTML M12N.  
Perhaps what we should be doing is telling people to not use named 
entities at all.  That would sort of solve your problem.  Or fix your 
implementations so they work in a way consistent with the standard.  I 
mean..... for whom is this easier?  The millions of web content authors 
who want to use entities in their markup, or the 5 or 6 user agent 
development groups who need to make their agents understand them?  Easy 
choice.


-- 
Shane P. McCarron                          Phone: +1 763 786-8160 x120
Managing Director                            Fax: +1 763 786-8180
ApTest Minnesota                            Inet: shane@aptest.com

Received on Friday, 1 August 2008 22:06:38 UTC