Re: suggest validator prefer URI to FPI

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dominique Hazaël-Massieux <dom@w3.org> wrote:

>>* Dan Connolly wrote:
>>>I'm interested to know if others find the arguments in
>>>http://www.w3.org/TR/webarch/#uri-benefits
>>>persuasive or not; i.e. whether they agree with me that the markup
>>>validation service should prefer URIs to FPIs.
>>
>>Two rather unrelated questions. The section you cite discusses good
>>practise for "web agents" when providing resources, not whether XML
>>processors should prefer system identifiers over public identifiers
>>when resolving external entities.
>
>I think DanC's point was that since URIs are preferred to FPIs in the
>Web Architecture,

I think that's reading a bit much into the WebArch document — what little I
read of it does not seem to support that statement — but I'll take your word
for it.


> the Markup Validator should always include them in the
>validation step, the point being not that it should always dereference
>them, but rather that, since (HTTP) URIs can be dereferenced, processed
>by various tools, etc, "validating" their usage in the Doctype sounds
>like a useful feedback to the document creators.
>
>Taking a practical example: - if the FPI and the System ID differs, it's
>probably a good idea to tell the user - 

Yes, without question; which is why it's been on our feature «wishlist» ever
since Karl brought it up a couple of years ago (IIRC). Unfortunately this is
not easily implemented, so it has stayed on the «wishlist»[0], and will likely
stay there for the foreseeable future.

Patches gratefully accepted!
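
To make concrete what such a warning would compare, here is a minimal sketch
in Python. It is not the validator's code and says nothing about the hard
part, which is wiring this into the validator itself. The table KNOWN_SYSIDS,
the regular expression, and the function name are all hypothetical, and the
regex only handles the simple PUBLIC form of a doctype declaration (no
internal subset, no comments).

```python
import re
from typing import Optional

# Hypothetical table: FPI -> the SysID its specification publishes for it.
KNOWN_SYSIDS = {
    "-//W3C//DTD HTML 4.01//EN":
        "http://www.w3.org/TR/html4/strict.dtd",
    "-//W3C//DTD XHTML 1.0 Strict//EN":
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
}

# Simplified pattern: <!DOCTYPE name PUBLIC "fpi" ["sysid"]>
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+\S+\s+PUBLIC\s+(["\'])(?P<fpi>.*?)\1'
    r'(?:\s+(["\'])(?P<sysid>.*?)\3)?\s*>',
    re.IGNORECASE | re.DOTALL,
)

def sysid_mismatch_warning(doctype: str) -> Optional[str]:
    """Return a warning if the SysID is not the one known for the FPI."""
    m = DOCTYPE_RE.search(doctype)
    if m is None:
        return None                      # not a PUBLIC doctype we understand
    fpi, sysid = m.group("fpi"), m.group("sysid")
    expected = KNOWN_SYSIDS.get(fpi)
    if sysid is None or expected is None or sysid == expected:
        return None                      # nothing to compare, or they agree
    return f'SysID "{sysid}" does not match FPI "{fpi}" (expected "{expected}")'
```

A doctype that pairs the HTML 4.01 Strict FPI with the loose.dtd SysID would
produce a warning; one with no SysID at all, or with an FPI that is not in the
table, would produce none.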

But this is _not_ what I read Dan's suggestion to be. Even after several
readings of what he wrote, my understanding of it is that he wants the SysID
to be preferred over a PubID even when both are present, regardless of whether
they differ, and with no particular mention of wanting a user warning when
they differ.

As best I can tell this is just another effort to impose arbitrary preferences
expressed in WebArch on the Markup Validator; since, at least as far as I was
able to interpret Dan's message, his suggestion was not attempting to solve
any actual technical problem.

You'll note that the document cited as an example has been revised no less
than three times since first referenced here, and the SysID has still not been
corrected. It hardly seems as if this was a particularly pressing problem
confounded by the current Markup Validator behaviour.


>if the FPI and the System ID differs, it's probably a good idea to use the
>System ID to check the document instead of the FPI, since that's what an
>agent that wouldn't know the FPI would do
>
>What would be the drawbacks in terms of user experiences/implementations
>against this approach?

First of all, let me note again that I disagree vehemently that a URI is a
superior identifier in general, and yet more so in the specific case of an
entity reference.

But leaving aside «WebArch» for a moment[1], let's go look at web history and
implementation (instead of «Architecture»).


The earlier specifications for HTML have used only an FPI in their examples,
and some of the newer specifications have used both but with a bogus SysID.

Also, in the SGML world the SysID is just a random blob of data — it could be
"Blarghl!" and still be perfectly sensible, and Valid, SGML — while a (Formal)
Public Identifier has actual structure, hierarchy, and registration procedures
(which the W3C has ignored for a decade, but that's a different beef).
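
As an illustration of that structure, the following Python sketch splits an
FPI into its parts: the registration indicator ("+" or "-", absent for
ISO-owned identifiers), the owner, the public text class, the description, and
the language. The names FPI and parse_fpi are hypothetical, and this is a
simplification rather than a full ISO 8879 parser (it ignores, for example,
the optional display version field).

```python
from typing import NamedTuple, Optional

class FPI(NamedTuple):
    registered: Optional[bool]  # True for "+", False for "-", None if ISO-owned
    owner: str                  # e.g. "W3C"
    text_class: str             # e.g. "DTD", "ENTITIES"
    description: str            # e.g. "HTML 4.01"
    language: str               # e.g. "EN"

def parse_fpi(fpi: str) -> FPI:
    """Split a Formal Public Identifier on its "//" field separators."""
    parts = fpi.split("//")
    if parts[0] == "+":
        registered, rest = True, parts[1:]
    elif parts[0] == "-":
        registered, rest = False, parts[1:]
    else:
        registered, rest = None, parts   # e.g. "ISO 8879:1986//ENTITIES ...//EN"
    if len(rest) < 3:
        raise ValueError(f"not a formal public identifier: {fpi!r}")
    owner, text_id, language = rest[0], rest[1], rest[2]
    # The text identifier is "<class> <description>", e.g. "DTD HTML 4.01".
    text_class, _, description = text_id.partition(" ")
    return FPI(registered, owner, text_class, description, language)
```

Feeding it "-//W3C//DTD HTML 4.01//EN" gives an unregistered owner "W3C",
class "DTD", description "HTML 4.01" and language "EN", whereas "Blarghl!"
raises ValueError: it is a perfectly good SysID but has none of this
structure.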

The majority of pages out there will have an FPI, but only a subset will
actually have a SysID; and the provenance of that SysID, iff included, is
questionable. IOW our failure scenario here is an increase in pages that will
provide bogus results from the Validator with that change.

One of the reasons for this is one of pure human interaction; the FPI is
actually legible to human beings, while a «URI» (the retrofitted assumption
imposed on the opaque SysID) is only parseable with effort, and frequently
misparsed even after a good faith effort.


It is a pity that ISO 8879 doesn't clearly specify the precedence between
these[2] when both are present, as differing PUBLIC and SYSTEM references are
clearly nonsense. My distinct impression of their intent (with which I
concur, obviously) is that the FPI is the preferred method in document
instances intended for general distribution.

It would be insane to assign preference to an opaque string, with semantics
only defined within what the IETF would term a «cooperating subset» (i.e.
implementation dependent) of systems, over a globally unique, system agnostic
identifier with well established namespace management and registration
procedures.

XML's retrofitting of URIs on the SysID merely alleviates that; it doesn't
contradict it.


In any case, the Markup Validator currently prefers the FPI to a SysID — iff
both are present — because this better matches actual published documents and
causes zero problems. It has the secondary effects of keeping our catalog
files somewhat slimmer without sacrificing caching (positive), and of not
detecting the case when an author has provided an «incorrect» SysID for the
FPI used when both are present (negative).
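
The precedence described above can be sketched roughly as follows, in Python.
This is an illustration only, not the validator's actual code: the CATALOG
table, its local path, and the function resolve_dtd are hypothetical, and
falling back to the SysID when the FPI is not in the catalog is an assumption
made for the sake of the example.

```python
from typing import Dict, Optional

# Hypothetical catalog: FPI -> locally cached copy of the DTD.
CATALOG: Dict[str, str] = {
    "-//W3C//DTD HTML 4.01//EN": "html4/strict.dtd",
}

def resolve_dtd(fpi: Optional[str], sysid: Optional[str],
                catalog: Dict[str, str] = CATALOG) -> Optional[str]:
    """Pick the DTD to validate against, preferring the FPI over the SysID."""
    if fpi is not None and fpi in catalog:
        # The SysID is never consulted here, so an «incorrect» SysID
        # paired with a known FPI goes unnoticed.
        return catalog[fpi]
    # Assumption: with no usable FPI, fall back to whatever SysID was given.
    return sysid
```

Note how a known FPI paired with a bogus SysID still resolves to the cached
DTD; that is both the benefit (published documents validate as their authors
expect) and the drawback (the mismatch is silently ignored) described above.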

If this part of WebArch, in its current state, gets more widely adopted in
specifications, and deployed on the Web, then it's likely the Markup Validator
would switch too to better reflect what is actually out there (despite my
opinions on the advisability of that approach).

But as it stands, I see no benefit to this change other than that of helping
(self-)fulfill the WebArch prophecy (which might be considered a laudable
goal in itself, of course), and that is simply not a persuasive argument for
me.



[0] — Actually, there is a small chance that some otherwise unrelated changes
      may enable us to bolt on a minimally useful and acceptable warning
      facility for this, but it'd be clunky and I'm not sure it's
      implementable yet.

      Did I mention that patches would be more than welcome? :-)


[1] — I find the language in XML 1.0.3 that supports this position much
      more persuasive than WebArch in any case.


[2] — But note that Goldfarb writes:

        «For obvious reasons, the use of the formal public identifier
         feature is highly recommended.»

      And in the footnote attached to that:

        «Formal public identifiers would probably have been mandatory,
         except they were introduced relatively late in the development
         cycle of ISO 8879.»

- -- 
"Fly it until the last piece stops moving..."

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0.3

iQA/AwUBQRJcxaPyPrIkdfXsEQJQnwCghSy65WXJ7gg7UYCj6JCW+1m0GJQAn32s
VapmSyEBzcVeOQFUCPYAR1fi
=F0zZ
-----END PGP SIGNATURE-----

Received on Thursday, 5 August 2004 12:35:14 UTC