Draft Review of the ITS 2.0 draft document

As recorded as an action (well, it was not actually recorded on the call, because the tracker got confused by several ivan-s :-), I reviewed the ITS 2.0 document, as requested by the ITS WG via Felix Sasaki[1]. The section that is relevant for this Working Group is the mapping to an external ontology, called NIF[2]. Actually, the details of that ontology are not relevant for this Working Group either; the issue is how to map the attributes set on the textual content of an HTML (or XML) document into RDF.

To take the example from the document:

<html><body><h2 translate="yes">Welcome to <span 
  its-ta-ident-ref="http://dbpedia.org/resource/Dublin" its-within-text="yes"
  translate="no">Dublin</span> in 
  <b translate="no" its-within-text="yes">Ireland</b>!</h2></body></html>

the goal is to produce a set of RDF statements of the form:

<URI_TO_IDENTIFY_A_TEXT_PORTION>
   nif:property1 value1;
   nif:property2 value2;
   nif:prop <URI_TO_IDENTIFY_A_TEXT_POSITION>
   ...

The really interesting question is how to define the two URI-s <URI_TO_IDENTIFY_A_TEXT_PORTION> and <URI_TO_IDENTIFY_A_TEXT_POSITION>, where, say, the first should somehow refer to "Welcome to Dublin in Ireland!" and the other should tell the world that this text is within the <h2> element of the file.

The current mapping uses the following two URI-s:

<http://example.com/exampledoc.html#char=0,29>
<http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1])>
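To make the first URI concrete, here is a minimal sketch of where the "0,29" could come from, assuming the offsets count characters in the concatenated text content of the example. The markup is the example above joined onto one line (so that no line-break whitespace ends up in the text), and the names TextCollector and text_of are only illustrative, not part of ITS or NIF.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the character data of a document, ignoring all markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_of(markup):
    """Return the concatenated text content of the given markup."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any buffered character data
    return "".join(parser.chunks)


DOC = "http://example.com/exampledoc.html"

# The example from the ITS draft, joined onto a single line (assumption).
MARKUP = (
    '<html><body><h2 translate="yes">Welcome to <span '
    'its-ta-ident-ref="http://dbpedia.org/resource/Dublin" its-within-text="yes" '
    'translate="no">Dublin</span> in '
    '<b translate="no" its-within-text="yes">Ireland</b>!</h2></body></html>'
)

text = text_of(MARKUP)

# URI for the text portion: character range over the extracted text.
portion_uri = f"{DOC}#char=0,{len(text)}"

# URI for the text position: the element that contains the text.
position_uri = f"{DOC}#xpath(/html/body[1]/h2[1])"

print(text)
print(portion_uri)
print(position_uri)
```

With the markup spread over several lines, as in the draft, the whitespace between the inline elements would have to be normalised first for the count to come out the same.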

Although it is quite obvious what these are for, I sense a problem with them. We may end up in a rathole, but...

- We refer to IRI-s in our concept document: RFC3987
- IRI-s map to URI-s: RFC3987
- What RFC3986, on which RFC3987 relies for fragments, says is:

"The fragment's format and resolution is therefore dependent on the media type [RFC2046] of a potentially retrieved representation, even though such a retrieval is only performed if the URI is dereferenced.  If no such representation exists, then the semantics of the fragment are considered unknown and are effectively unconstrained."

The way I read this is that if I want to have a proper URI, where I expect the media type to be BLA, then the fragment ID should somehow be defined for BLA. Although RDF regards IRI-s as opaque, I would still feel uneasy doing otherwise.

Looking at the URI-s above:

- The 'char' fragment is defined by RFC 5147, but is defined for text/plain only. ITS talks about XML and HTML, i.e., about resources whose media types are definitely _not_ text/plain.
- The xpath fragment id is fine for XML. But it is not defined for text/html and, knowing how XML is frowned upon by the HTML WG, I do not expect that to ever change.
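The two points above amount to a simple lookup, which can be sketched as follows. The table only encodes what is said in this mail (char for text/plain, xpath for XML); using application/xml to stand for "XML" is my assumption, and the function names are purely illustrative.

```python
# Fragment schemes and the media types they are defined for,
# as claimed in the text above (not an authoritative registry).
FRAGMENT_SCHEMES = {
    "char": {"text/plain"},        # RFC 5147
    "xpath": {"application/xml"},  # "fine for XML"; exact media type is an assumption
}


def scheme_of(fragment):
    """Return the name of the fragment scheme, or None if it is not in the table."""
    for name in FRAGMENT_SCHEMES:
        if fragment.startswith(name + "=") or fragment.startswith(name + "("):
            return name
    return None


def fragment_defined_for(uri, media_type):
    """True if the URI's fragment scheme is defined for the given media type."""
    _, _, fragment = uri.partition("#")
    name = scheme_of(fragment)
    return name is not None and media_type in FRAGMENT_SCHEMES[name]


char_uri = "http://example.com/exampledoc.html#char=0,29"
xpath_uri = "http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1])"

for uri in (char_uri, xpath_uri):
    for media_type in ("text/plain", "application/xml", "text/html"):
        print(uri, media_type, fragment_defined_for(uri, media_type))
```

For an HTML document, neither of the two URI-s passes the check, which is exactly my concern.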

In view of this, I do not feel comfortable with the choice of the mapping. The URI-s are not dereferenceable, nor are they correct...

That being said, I may be too picky and we could let this go, also considering the fact that this section is _not_ normative in ITS.

I had some discussion with Felix and also with Sebastian Hellmann, who is the author of NIF; one proposal I had was to use a URI of the form

http://www.w3.org/its?resource=http://example.com/exampledoc.html&char=0,29

which, if a simple service is provided, can return some simple information, and is OK as a URI. I think that would be acceptable to them. But again, this WG may decide that I am just way too pedantic...
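For illustration, here is a rough sketch of how such a URI could be built and taken apart again. The service at http://www.w3.org/its is only the proposal above; nothing is deployed there. Percent-encoding the embedded URI is my addition (the proposal shows it unencoded), and the helper names are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint from the proposal above; a hypothetical service, not a deployed one.
SERVICE = "http://www.w3.org/its"


def its_uri(resource, start, end):
    """Build a query-style URI naming a character range of a resource."""
    # urlencode percent-encodes the embedded URI and the comma (assumption:
    # the proposal itself shows the unencoded form).
    query = urlencode({"resource": resource, "char": f"{start},{end}"})
    return f"{SERVICE}?{query}"


def parse_its_uri(uri):
    """Recover (resource, start, end) from a URI built by its_uri."""
    params = parse_qs(urlsplit(uri).query)
    start, end = (int(n) for n in params["char"][0].split(","))
    return params["resource"][0], start, end


uri = its_uri("http://example.com/exampledoc.html", 0, 29)
print(uri)
print(parse_its_uri(uri))
```

The point is that everything sits in the query part, so no fragment identifier, and hence no media type dependency, is involved.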

Ivan

P.S. It is of course possible to radically change the mapping with some blank nodes in the middle to avoid the issue...

[1] http://lists.w3.org/Archives/Public/public-rdf-wg/2013Aug/0000.html
[2] http://www.w3.org/TR/2013/WD-its20-20130820/#conversion-to-nif

----
Ivan Herman, W3C 
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
FOAF: http://www.ivan-herman.net/foaf.rdf

Received on Friday, 23 August 2013 12:57:29 UTC