Re: ACTION-156: Review of http://www.w3.org/2001/tag/doc/selfDescribingDocuments-2008-05-12.html from noah_mendelsohn@us.ibm.com on 2008-09-03 (www-tag@w3.org from September 2008)

From: <noah_mendelsohn@us.ibm.com>
Date: Tue, 2 Sep 2008 21:27:01 -0400
To: "Williams, Stuart (HP Labs, Bristol)" <skw@hp.com>
Cc: "www-tag@w3.org" <www-tag@w3.org>
Message-ID: <OF746F10D2.F33DF365-ON852574B9.0001D3BA-852574B9.0007C853@lotus.com>
Stuart Williams writes:

> At long last I have managed to take a review pass over
> http://www.w3.org/2001/tag/doc/selfDescribingDocuments-2008-05-12.html.
> 
> Broadly I think that the document reads well and is in a pretty mature 
> state

Thank you!

> however I do have a few comments below.

OK.  Individual responses below.  I have put a quick and dirty revision up 
on the Web at 
http://www.w3.org/2001/tag/doc/selfDescribingDocuments-2008-08-22.html.  I 
say quick and dirty because the revision date in the text is 2 Sept, but 
the date in the file name is still the August date of my last editor's 
copy, etc.  When I say below that something is fixed or "DONE", you should 
be able to check it in this draft.  After we close on getting the 
revisions done, I will publish a new and more stable copy for review at 
the face to face under a different URI.  I'd love to do that by sometime 
next week, which should give TAG members enough reading time to support a 
decision on publication in KC.

> Introduction: Bulleted list: ~4th item.
> 
> It might be worth mentioning revived use of the Link: http header as a
> means to associate metadata with a resource (and indeed the use of 
> <link> elements and/or http-equiv to induce http headers in a 
> response in HTML)

Changed to read:
"For integration with the Semantic Web, self-describing representations 
should convey RDF triples, either directly in the representation, by 
linking to the triples (perhaps using <link> elements in HTML or the link: 
header in HTTP), or by linking to transformations using technologies such 
as GRDDL. "

I think the http-equiv is one step too far for an introduction.  I'm not 
trying to be complete, just suggestive.

--

> 2 The Web's Standard Retrieval Algorithm: 1st para (editorial)
> 
> Suggest changing:
> 
>         "Indeed there is a standard algorithm that a user agent can 
> employ to obtain and interpret the representation..."
> 
> to
>         "Indeed there is a standard algorithm that a user agent can 
> employ to attempt to obtain and interpret the representation..."
> 
> Rationale: there is no certainty that application of the algorithm 
> on a particular occasion will in fact obtain a representation or 
> enable its interpretation by the particular client (the latter may 
> still require a small matter of programming).

DONE though I think the original was clear enough and shorter; this isn't 
a specification, it's a finding trying to give people a sense of the 
issues and of good practice.

> It would be really helpful if the diagram were of a size that would 
> display/print conveniently.

I infer you jumped to the diagram in the appendix.  With help from Norm, 
the sizing should be fixed.

--
> Section 2 (editorial)
> 
> "When he clicks it, his browser:
> 
> - from the <code>http:</code> at the beginning of the URI, 
> determines that the http scheme has been used - "
> 
> Suggest reversing these clauses:
> 
> ie.
> "When he clicks it, his browser:
> 
> - determines that the http scheme has been used from the <code>http:
> </code> at the beginning of the URI  - "

I'm not happy with that because it parses ambiguously on first reading. In 
addition to the intended reading, you can try to scan as "the scheme has 
been used from the http:".  Not changed.

--

> Section 2 (substantive)
> 
> " - this tells the browser that a representation retrieved using the 
> HTTP protocol is authoritative "
> 
> I don't think that the http: at the start of an HTTP URI does that. 
> A 200 response accompanying a representation does either with 
> respect to the request URI/host: combination in the corresponding 
> HTTP request or wrt the URI given in a Content-Location: header 
> accompanying the response, or wrt both. [all modulo a level of 
> trust in the proxy and caching infrastructure not to mis-represent 
> the intent of the origin server and of course these days modulo DNS 
> cache poisoning attacks].

Again, I'm a little concerned that if we try to cross all the T's and dot 
all the I's, as we would in the HTTP RFC itself, we will just cut into the 
readability of the document without substantially improving its accuracy 
or impact.  I honestly think that, in context, the intention is clear. 

More to your point:  I do think the http: at the beginning plays a 
significant role in establishing that the retrieved representation, if 
any, is authoritative.  In contrast, if I have a URI using the https 
scheme, and if I make what is generally the mistake of attempting to 
retrieve a representation using ordinary (no SSL) HTTP, then a 
representation that comes back is not authoritative, even if the server is 
willing to provide one.   As I understand it, part of the contract for a 
URI employing the https scheme is that responses are considered 
authoritative only if HTTPS is used. 
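
The rule argued in the paragraph above can be sketched in Python. This is only an illustration of the argument as made in this message, not a restatement of RFC 2616: the function name, the example URIs, and the table of cases are assumptions introduced here, and any combination the message does not address is deliberately left undecided.

```python
from urllib.parse import urlsplit

# Only the three cases discussed above; every other combination is left
# undecided (None) rather than guessed at.
STATED_CASES = {
    ("http", "http"): True,     # http URI retrieved over plain HTTP
    ("https", "https"): True,   # https URI retrieved over HTTPS
    ("https", "http"): False,   # https URI retrieved over plain HTTP
}

def is_authoritative(uri, protocol_used):
    """Return True/False for the stated cases, None when the message is silent."""
    scheme = urlsplit(uri).scheme.lower()
    return STATED_CASES.get((scheme, protocol_used.lower()))

print(is_authoritative("http://example.com/weather", "http"))
print(is_authoritative("https://example.com/weather", "http"))
```

The point of the sketch is that the decision starts from the scheme at the front of the URI, before any response headers are examined.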

In any case, I'd rather not get into the complications you list.  I 
certainly don't see a need to go into Content-location, for example, since 
the paragraph above indicates that we're talking about a "typical path" 
through an HTTP retrieval, not an exhaustive exploration of the options. 
This example is intended to get people thinking about "follow your nose" 
in the context of a typical retrieval.  If there are errors in it, we 
should fix them, but I'd rather avoid trying to restate all of RFC 2616 
here.

--

> Section 2 (substantive, minor)
> 
> " - looks up DNS name [DNS] example.com..."
> 
> Alternatively may look up the DNS name of a configured proxy. The 
> important point here being that the TCP connection in general may 
> terminate in a different 'place' than that suggested by inspection of 
> the URI.

Again, this is advertised as an illustration of a "typical path" through 
the standard retrieval algorithm of the Web.  Proxies are indeed one 
possible complication, but do we need to mention them in our first 
introduction to "follow your nose"? 

--

> Section 2 (substantive)
> 
> "Neither Bob nor his browser has any advance knowledge of the nature
> of the resource."
> 
> This usage of the word nature recurs and IMO is a little vague. I 
> think that you are really talking about the media type of the 
> representation in all cases rather than say the nature of a weather 
> report as being a weather report, or news article as a news article

No!  We're potentially talking about all of that.  In fact, the more 
powerful the self-description technology you use, the more you're likely 
to be able to discover by following your nose.  Yes, at the very least HTTP 
almost ensures that you will discover the media type, but often you can do 
much more.  The media type might be application/xml, but the 
namespace-qualified root element might tell you quite reliably that you're 
looking at a resume or a work of music or an inventory report.  Similarly 
with RDF.  So, the power of a Self-describing Web is that in many cases 
you can indeed follow a link, with no a priori knowledge of the nature of 
the resource, and discover as you put it the "nature of a weather report 
as being a weather report".
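
A minimal Python sketch of that two-step discovery, under stated assumptions: the namespace URIs, element names, and the lookup table below are hypothetical and are not taken from the draft finding. It shows a generic media type such as application/xml telling you little, while the namespace-qualified root element identifies the kind of document.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: (namespace URI, root element name) -> kind of document.
KNOWN_ROOTS = {
    ("http://example.org/ns/resume", "resume"): "resume",
    ("http://example.org/ns/inventory", "report"): "inventory report",
}

def describe(media_type, body):
    """Return the most specific description available for a representation."""
    if media_type != "application/xml":
        # Nothing further to follow in this sketch; report the media type.
        return media_type
    root = ET.fromstring(body)
    # ElementTree spells a qualified name as "{namespace-uri}localname".
    if root.tag.startswith("{"):
        namespace, local = root.tag[1:].split("}", 1)
    else:
        namespace, local = "", root.tag
    # Fall back to the media type when the root element is not recognized.
    return KNOWN_ROOTS.get((namespace, local), media_type)

doc = '<resume xmlns="http://example.org/ns/resume"><name>Bob Smith</name></resume>'
print(describe("application/xml", doc))
```

An unrecognized root element falls back to the bare media type, which corresponds to the "at the very least" case described above.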

> - neither of which is particularly evident in the media-type when 
> both are served up as HTML pages. Speaking of lack of prior 
> knowledge of the nature of the resource gives an allusion to 
> something way more sophisticated than lack of awareness/expectation
> about the media-type of a response that is not borne out by the 
> example in the narrative.

You're certainly right that we have not established in this particular 
example that a machine could automatically discover the semantics conveyed 
by an HTML page, but in this example there is a human user.  I believe 
that, from a commonsense point of view, we have established that Bob can 
click on pretty much any link and his browser (if it's a good one) will 
show him a page such as a weather report that he as a human has a good 
shot at recognizing, or in the case of an image/jpeg a picture, or failing 
that the browser will reliably say to Bob:  I don't know what to do with 
this one.  In short, I believe that readers will understand that Bob can 
click an arbitrary link and, in practice, realize that what he's got is a 
weather report.

Note that later parts of the document do discuss the need for more 
application-specific content standards in the case where machines or 
software are supposed to extract semantics automatically.

--

> 3 Widely deployed standards and formats: 3rd para (substantive)
> 
> In the example I would only take the position that "...there are no 
> outright violations of Web architecture..." in the case where the 
> media-type has been properly registered 

Well, the example media type I used was image/x-fancyrawphotoformat, and 
media types starting with x- are experimental and in fact cannot be 
registered.  Their use is certainly discouraged, but I don't think that 
using a media type like this is a violation of Web architecture.  In fact, 
the whole point of this example is that use of such a media type is bad 
practice, but not "an outright violation of the architecture".  You 
haven't convinced me that isn't true.

> (and preferably documented 
> (openly?)). I think that it would be worth mentioning media-type 
> registration because the follow-your-nose chain breaks in case where
> this has not been done.

The very next sentences say: 

"No existing Web user agents recognize the image/x-fancyrawphotoformat 
media type, search engine spiders are unlikely to extract useful 
information from pictures in that format, and so on. Unlike Susan's, which 
can be viewed by almost anyone, Mary's photos are at best useful to a few 
people who have the proprietary software needed to decode them. "

So you're right: nobody's advertising follow your nose here; it's a 
counter example.  Use media types that nobody knows (nose?) about, or that 
aren't documented, and the follow your nose story loses a lot of its 
value.  I'm not yet convinced that the story would be improved by changing 
it.

--

4.2 URIs based Extensibility (anal)

"...and in many cases each markup tag or data value used, is identified by 
a URI."

> Absent SCUDs is that really the case? Maybe you are referring to the 
> occurrence of a tag in a document marked with an ID such that the base 
> URI of the document extended by a fragment ID corresponding to the 
> ID value could be taken (via relevant media type spec) as naming 
> that occurrence of the use of the tag. Anyway I would quibble that 
> it's not clear what you intended to say, and if for example you were
> trying to say that for example the html root element of an XHTML 
> document has an associated identifying URI I'd struggle to know what
> it was - though I would willingly concede that it has a URI based 
> identifier in the form of an extended name (modulo elements, 
> attributes, substitution groups... being distinct naming partitions).

Our own Namespaces Document finding says (
http://www.w3.org/2001/tag/doc/nsDocuments/#div.fragid) "For many 
applications of namespaces, it's valuable not only to be able to point to 
the namespace as a whole, but also to be able to point to terms within 
that namespace."   I think we all know of cases, such as the Atom use 
case discussed in the self-desc. draft, in which each data value is a 
URI.  It seems to me that the statement you quote is pretty well 
justified.

--

> 4.2 and subsection (General)
> 
> Feels like there ought to be a few GPNs here capturing partial 
> conclusions.

I'm open to suggestions, but I'd rather not hold up publication if we 
can't come up with any.

--

> 4.2.2 Microformats: (Question of information)
> 
> "Unlike... . The hCard profile specifies a value for the profile 
> attribute..."
> 
> Is this particular idiom for the use of the profile attribute 
> actually grounded in an HTML specification?

I believe that Dan, among others, has led me to believe that the answer is 
yes.  I don't consider myself an expert on those aspects of HTML.

> Some of them? all of them that define the attribute?

Some of what?  Do you mean some of the microformats?  That doesn't make 
sense because a few sentences later I say: 
"Unfortunately, few microformats have such profiles, and even when 
profiles are available, evidence suggests that they are not universally 
applied. "  So, I'm afraid I'm misunderstanding your phrases "some of 
them" and "all of them".

> I believe that the profile attribute was and maybe still is under 
> threat in HTML5.

Yes.  I think there has also recently been consideration of some 
mechanisms that are similar in spirit but different in detail.

Are you suggesting that we should change the draft finding, and if so, what 
would you suggest?  I think the approach we've been taking is to discuss 
the pros and cons of having this facility on the merits.  If HTML5 decides 
eventually to ship without a facility that we have deemed valuable, then 
either we're wrong or they've missed an opportunity.  We can always revise 
the finding were that to happen, to warn people that our good advice 
cannot in fact be followed.

--

> 4.2.3 Self-describing XML documents (editorial) 3rd para:
> 
> Mentions the TAG nsDocument-8 finding which has matured beyond the 
> state described in this document.

Fixed.  The text now reads:

"The TAG Finding "Associating Resources with Namespaces" 
[NamespaceDocuments], recommends the use of [RDDL] as a preferred means of 
documenting namespaces."

The bibliography has also been updated to point to the published finding.

--

> 5 RDF and the Self-Describing Semantic-Web: 2nd para:
> 
> "Indeed RDF Schema and OWL Ontology technologies together offer a 
> standard, machine-processable means of describing particular uses of 
> RDF"
> 
> Hmmmm.... well they provide the means to describe 
> entailments/inferences that can be drawn from a collection of RDF 
> statements and to detect when a collection of RDF statements is 
> inconsistent with respect to the axioms of a Schema/Ontology (and 
> indeed when class defns within an ontology are inconsistent). So... 
> in a very specialised way, I agree, but read as written I think that
> "...machine processable means of describing use of RDF" suggests a 
> much broader capability.

How about, "offer a machine-processable means of extracting information 
from particular uses of RDF"?  I'm open to better suggestions.  No change 
made for the moment.

--
> Section 5 3rd para: (anal)
> 
> "... to obtain RDF triples that represent or describe the 
> referenced resource."
> 
> This is potentially deep in the heart of httpRange territory (or 
> not) depending on how closely one is reading.
> 
> Given a URI u (say for the planet mars) it is not ok by Web 
> architecture to provide a direct 200 response and a descriptive 
> representation of Mars. However it is ok to redirect to a 
> descriptive resource whose representation contains a description of 
> the resource referenced by u.
> 
> You probably didn't intend the 'or' in the quoted fragment to be 
> read that closely.

Yes, I agree with your analysis, but at the end you seem to waffle on 
whether you are asking for a change or making an observation. Suggestions?

--
> Section 5 RDF source fragment (editorial)
> 
> RDF/XML is pretty ugly to read compared to N3 which conveys a much 
> clearer impression of the corresponding RDF graph:
> 
> @prefix employeeData:  <http://example.org/EmployeeInformation#> .
> @prefix rdf:           <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
> 
> <http://example.org/Employees#BobSmith>
>       a                  employeeData:employee ;
>       employeeData:email <mailto:BobSmith@example.org> ;
>       employeeData:name "Bob Smith" .
> 
> Unless it is really important to use RDF/XML to make the point I'd 
> suggest replacing with the N3 above.


Let's see what others think.  I certainly take your point.  The reason I'm 
a bit hesitant is that I'm among the many readers who already know XML 
very well, and RDF just a little.  Keeping in mind that the point is not 
to rigorously teach RDF, but to give one a sense of how the retrieved 
ontology might teach you that email could be sent, the XML is easier to 
get through for readers like me.  Unless you know N3, that free-floating 
"a" on the line under <http: > is very confusing.  On the other hand, 
readers who come from a Semantic Web world will have no trouble at all 
with the N3, and many other readers will guess right if they stare hard 
enough, so there's certainly merit to your suggestion.
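
For readers in the same position, here is a rough Python sketch of the three triples that the N3 above stands for. In N3 the bare "a" is shorthand for the rdf:type predicate. The variable names and the objects() helper are illustrative assumptions, not part of any RDF toolkit.

```python
# Full URI that the N3 keyword "a" abbreviates.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Expansion of the employeeData: prefix declared in the N3.
EMP = "http://example.org/EmployeeInformation#"
BOB = "http://example.org/Employees#BobSmith"

# Each tuple is one (subject, predicate, object) statement.
triples = [
    (BOB, RDF_TYPE, EMP + "employee"),
    (BOB, EMP + "email", "mailto:BobSmith@example.org"),
    (BOB, EMP + "name", "Bob Smith"),
]

def objects(subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects(BOB, EMP + "email"))
```

Whichever syntax the finding uses, RDF/XML or N3, the graph being conveyed is this same set of three statements.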

Bottom line: I'd like to hear from other TAG members on this one.

Thank you again for the very careful reading and the thoughtful comments. 
What's your feeling about the likelihood that we can resolve these issues 
in time to publish at the F2F?  Thank you.

Noah

--



--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------
Received on Wednesday, 3 September 2008 01:26:18 UTC