The IPTC's need for reification in XHTML 2 [was RE: DC in XHTML2] from Mark Birbeck on 2005-06-10 (www-html@w3.org from June 2005)

From: Mark Birbeck <mark.birbeck@x-port.net>
Date: Fri, 10 Jun 2005 15:34:44 +0100
To: "'Al Gilman'" <Alfred.S.Gilman@IEEE.org>
Cc: <www-rdf-interest@w3.org>, <semantic-web@w3.org>, <www-html@w3.org>, <iptc-metadata@yahoogroups.com>, "'Misha Wolf'" <Misha.Wolf@reuters.com>, <dc-general@jiscmail.ac.uk>
Message-ID: <249E3A28-00F9-4FA6-BEBD-5A416D03B477@S009>
Hi Al,

In response to my point:
> >Misha and the IPTC's interest though is in putting subject codes in
> >the form of QNames into the object part of the statement, not the
> >predicate part.

you said:
> That is not how I read the question. As I read the question, 
> that is not what they want to do; rather that is *where they 
> found a roadblock in the path to a solution that they were 
> going down.* I don't think you did the "if that's not the 
> answer, what's the question?" step to back off to the real 
> requirement.

I see no need to be as rude as you were, but if you're unable to suppress
it, why not at least make sure you are correct?

Anyway...enough moaning...



THE PROBLEM
Misha clarified his initial issue in a follow-up email, like this:

> Indeed, XHTML2 lets us define any element or string to be, say,
> "dc:creator" or "dc:subject".  What it does *not* do is let us
> specify that the value of dc:subject is, say, "foo:bar", where
> "foo" identifies a vocabulary and "bar" a term in that vocabulary.

That seems pretty clear to me.

His initial raising of the question was attached to your email, and went
like this:

> >> >The News Architecture Working Party of the International Press
> >> >Telecommunications Council (IPTC) is investigating the use of
> >> >XHTML2 for expressing DC and other metadata.  A major problem for
> >> >us is the lack of support in the current XHTML2 draft (as in
> >> >RDF/XML) for the use of QNames to express terms in controlled
> >> >vocabularies (aka values of properties).
> >> >
> >> >At the moment, the XHTML2 @content attribute takes PCDATA and the
> >> >@href attribute takes IRIs.  There is no attribute available for
> >> >QNames.
> >> >
> >> >We want to be able to use, eg, <dc:subject> with a QName as a
> >> >value (ie the object of the RDF statement).  The reasons include
> >> >legibility and compactness.  News items (and news headlines) often
> >> >carry numerous subject codes, hence the need for compactness.

That also seems pretty clear. Anyway, far from 'not taking a step back',
I've been thinking about nothing else since Misha and Laurent Le Meur
explained their requirements to me in London on Sunday (and explained them
again to both myself and Steven on Tuesday, during Steven's presentation to
their AGM). As I said at the meeting, just when we think we have the
metadata story wrapped up, along comes another requirement! But as I've
stressed in these discussions before, the HTML WG is fully committed to
solving such issues as far as we can (at least without turning the syntax
into something ridiculous).


So, on to the issue itself. You said:

> What I believe they want is a compact notation by which they 
> can associate content blocks with well-known (much re-used) 
> subject categories.

Well, that it should be compact is pretty obvious. But if we take your
advice and 'take a step back', you'll find that the much bigger issue is
that they want to make statements about statements, *not* statements about
subject codes. That isn't how the question was posed, but that is the
stumbling block for what they want to do, as I'll try to show.



COMPACTNESS
Let's go through the issue, beginning with Misha's summary, above:

> News items (and news headlines) often carry numerous
> subject codes, hence the need for compactness.

Their primary concern is the sheer quantity of subject codes that they want
to have for a document. Since the news organisations may sometimes just send
a simple headline, they are worried that the amount of metadata transmitted
will dwarf the article itself.

The reason that this can happen is that the IPTC have a requirement that a
document should contain pretty much all of the metadata that it would need.
They don't want a document to have just a small number of codes, and then at
some point in the process go off to another database to look-up some
inferred values--they want as much information as possible to be in the
document there and then. (They may of course use inference databases to do
the initial population.)

The main motivation for this is to reduce network bandwidth as well as speed
up any processing that will take place on that document as it moves through
the system. It also means that you only have to send one package to a
consumer of your news. Anyway, whatever the motivation, this certainly means
that the syntax for expressing the subject codes needs to be succinct.



META-METADATA
However, if it was just a matter of the quantity of subject codes then
things might be tolerable. But the problem is greatly compounded by the fact
that each subject code might have information about who added it to the
document. As a document moves through the system, different codes may be
added on the way, by different organisations and individuals--they want to
record the who, the when and the why.

But what they are adding is metadata about the editing process--not the
"subject categories", as you imply--and this leads into good old
reification! I'll go through their examples to show what the *real* problem
with their mark-up is.

Laurent and Misha had the following 'fantasy mark-up':

  <iptc:contributor val="afp:llm">Laurent Le Meur</iptc:contributor>
  <iptc:subject
   val="srs:15000000"
   assignee="afp:llm"
   date="2005-06-06"
   xml:lang="fr">Sport</iptc:subject>

To even begin to accommodate this we've already agreed that we need QNames,
somewhere, somehow. In the following examples I'll use the '['/']' form that
I referred to in another post, but whether it's a set of new QName
attributes, or some 'adorned' syntax doesn't matter...let's assume for now
that we have somehow solved the issue.

In addition to this tweak to XHTML 2, we need a further enhancement which is
to allow attributes in other namespaces to serve as predicates.

So, using the strawman proposals for QNames and allowing additional
predicates, we now have:

  <link rel="[iptc:contributor]" href="[afp:llm]">Laurent Le Meur</link>
  <link rel="[iptc:subject]" href="[srs:15000000]"
   iptc:assignee="[afp:llm]"
   iptc:date="'2005-06-06'"
   xml:lang="fr">Sport</link>

Note a couple of additional things:

 * the date is quoted to show that it is a string literal;
 * the @rel and @rev values should now have '['/']' for
   consistency.
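
As a rough sketch of how a processor might tell the three kinds of attribute
value apart under the strawman '['/']' syntax (bracketed QName, quoted string
literal, plain IRI), something like the following Python would do. The
namespace IRIs bound to iptc, afp and srs are placeholders invented for the
example, as are the names PREFIXES and parse_value; none of them come from the
XHTML 2 draft.

```python
import re

# Hypothetical prefix bindings, for illustration only.
PREFIXES = {
    "iptc": "http://example.org/iptc#",
    "afp": "http://example.org/afp/",
    "srs": "http://example.org/srs/",
}

# Matches the strawman form "[prefix:local]".
_BRACKETED = re.compile(r"^\[([^:\[\]]+):([^\[\]]+)\]$")


def parse_value(value, prefixes=PREFIXES):
    """Classify an attribute value as ('iri', ...) or ('literal', ...)."""
    match = _BRACKETED.match(value)
    if match:
        prefix, local = match.groups()
        # An unbound prefix raises KeyError here.
        return ("iri", prefixes[prefix] + local)
    if len(value) >= 2 and value[0] == value[-1] == "'":
        # Quoted values are string literals, as with the date above.
        return ("literal", value[1:-1])
    # Anything else is treated as an ordinary IRI, as @href takes today.
    return ("iri", value)


print(parse_value("[afp:llm]"))
print(parse_value("'2005-06-06'"))
```

Whether the brackets survive or give way to dedicated QName attributes, the
point is only that the expansion itself is mechanical.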



FINALLY...THE 'REAL' PROBLEM
However, the 'real' problem with this for the IPTC is that in XHTML 2
syntax, all of the predicates in this example are about the *document*, and
we want them to be about the assignment of the subject code.

Note the wording there--it's not about the subject code, but about the
*assignment* of the subject code. So it's not a simple matter of making
statements that have the subject code as the 'subject':

  <meta about="[srs:15000000]"
   iptc:assignee="[afp:llm]"
   iptc:date="'2005-06-06'"
   xml:lang="fr" />

since these are *not* statements about its *assignment*. (Hopefully now you
can see why I wanted to think about it a bit more before I made any
suggestions.)

So the 'real' problem is ultimately to do with having a means to refer to
statements. In the current draft you can nest statements in such a way that
they refer to their parent statement. It might look like this (assuming the
new additions to XHTML 2, described above):

  <link rel="[iptc:contributor]" href="[afp:llm]">Laurent Le Meur</link>
  <link id="a" rel="[iptc:subject]" href="[srs:15000000]">
    <meta iptc:assignee="[afp:llm]"
          dc:date="'2005-06-06'"
          xml:lang="fr">Sport</meta>
  </link>

The nested statements refer to the containing statement, not the document,
giving you this:

  <> iptc:contributor afp:llm .
  <> iptc:subject srs:15000000 .
  <#a> iptc:assignee afp:llm .
  <#a> dc:date "2005-06-06" .

('srs:15000000' won't actually work as a format, since the local part of a
QName cannot start with a digit, but ignore that for now.)
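
To make the subject rule concrete, here is a minimal Python sketch of the
nesting behaviour described above: a statement on a top-level element is about
the document (<>), while statements nested inside an element that carries both
@rel and @id are about <#id>. It assumes the strawman bracket syntax and
namespaced predicate attributes; the iptc namespace IRI is a placeholder, the
helper names are made up, and a nested statement whose parent has no @id is
not handled at all.

```python
import xml.etree.ElementTree as ET

# Maps namespace IRIs back to prefixes for output; the iptc IRI is a placeholder.
NS = {
    "http://example.org/iptc#": "iptc",
    "http://purl.org/dc/elements/1.1/": "dc",
}

MARKUP = """<head xmlns:iptc="http://example.org/iptc#"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <link id="a" rel="[iptc:subject]" href="[srs:15000000]">
    <meta iptc:assignee="[afp:llm]" dc:date="'2005-06-06'"/>
  </link>
</head>"""


def unwrap(value):
    """Turn '[q:name]' into q:name and a 'quoted' value into a "literal"."""
    if value.startswith("[") and value.endswith("]"):
        return value[1:-1]
    if value.startswith("'") and value.endswith("'"):
        return '"' + value[1:-1] + '"'
    return value


def triples(element, subject="<>"):
    """Collect (subject, predicate, object) tuples from nested mark-up."""
    out = []
    if "rel" in element.attrib and "href" in element.attrib:
        out.append((subject, unwrap(element.get("rel")), unwrap(element.get("href"))))
    for name, value in element.attrib.items():
        # ElementTree spells namespaced attributes as "{iri}local".
        if name.startswith("{"):
            iri, local = name[1:].split("}", 1)
            if iri in NS:
                out.append((subject, NS[iri] + ":" + local, unwrap(value)))
    # Children of an identified statement describe that statement;
    # any other element passes the current subject straight through.
    child_subject = subject
    if "rel" in element.attrib and element.get("id"):
        child_subject = f"<#{element.get('id')}>"
    for child in element:
        out.extend(triples(child, child_subject))
    return out


for s, p, o in triples(ET.fromstring(MARKUP)):
    print(s, p, o, ".")
```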

However, you still don't know what #a refers to, and this is where we almost
certainly need to introduce reification. We probably need the presence of an
@id to 'explode' the statement (similar to rdf:ID on a property element in
RDF/XML), as follows:

  <> iptc:contributor afp:llm .
  <> iptc:subject srs:15000000 .

  <#a> rdf:type rdf:Statement .
  <#a> rdf:subject <> .
  <#a> rdf:predicate iptc:subject .
  <#a> rdf:object srs:15000000 .

  <#a> iptc:assignee afp:llm .
  <#a> dc:date "2005-06-06" .
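
The 'explode' step itself is small. As a sketch only (plain tuples rather than
a real RDF library, with the function names reify and to_n3 made up for the
purpose, and the predicate taken from the @rel value in the mark-up), it
amounts to emitting the four standard rdf:Statement triples for whichever
statement carries the @id:

```python
def reify(node, triple):
    """Return the four RDF reification triples that describe `triple` as `node`."""
    subject, predicate, obj = triple
    return [
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", subject),
        (node, "rdf:predicate", predicate),
        (node, "rdf:object", obj),
    ]


def to_n3(triples):
    """Serialise tuples in the simple one-statement-per-line form used above."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# The statement that carries id="a" in the mark-up.
statement = ("<>", "iptc:subject", "srs:15000000")

graph = [statement]
graph += reify("<#a>", statement)
# The nested metadata now has something well-defined to hang off.
graph += [
    ("<#a>", "iptc:assignee", "afp:llm"),
    ("<#a>", "dc:date", '"2005-06-06"'),
]

print(to_n3(graph))
```

The original statement stays in the graph alongside its reification, so a
consumer that doesn't care who assigned the code can ignore the extra triples.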

I'll leave it there, since this may have frightened the life out of most
people on the HTML list, and perhaps we can continue this thread over on the
RDF lists. (HTML people won't miss anything, since whether we add a
sprinkling of reification or not has consequences for the RDF processing
side, and not the mark-up.)


> So I claim that creating a short QName notation that 
> (sufficiently formally and authoritatively) expands to "has 
> 'subject' (per Dublin
> Core) of 'sports' (per IPTC)" is solving precisely the 
> problem that they face.

Hopefully now you can see why your 'claim' is wrong...that was the
easy-peasy problem to solve.


> And if we were to make 'role' M-ary 
> rather than unary, this would flow neatly in there.

Having multiple @role values is certainly desirable, but it doesn't solve
their problem either.

I didn't follow most of the remaining parts of your email, so forgive me for
not responding to them (I think they concerned mapping types with rdf:type).
I have, however, some comments on the final part about @class, which I'll
try to get down at some point, in a separate email.

Regards,

Mark


Mark Birbeck
CEO
x-port.net Ltd.

e: Mark.Birbeck@x-port.net
t: +44 (0) 20 7689 9232
w: http://www.formsPlayer.com/
b: http://internet-apps.blogspot.com/

Download our XForms processor from
http://www.formsPlayer.com/
Received on Friday, 10 June 2005 15:40:51 UTC