How are RDFa clients expected to handle 301 Moved Permanently? from Christoph LANGE on 2013-10-10 (public-rdfa@w3.org from October 2013)

From: Christoph LANGE <math.semantic.web@gmail.com>
Date: Thu, 10 Oct 2013 16:54:00 +0100
To: public-rdfa@w3.org
Message-ID: <5256CD98.7080603@gmail.com>

Dear RDFa community,

I am writing in the role of technical editor of the CEUR-WS.org open
access publishing service (http://ceur-ws.org/), which many of you have
used before.

We provide a tool that allows proceedings editors to include RDFa
annotations in their tables of contents
(https://github.com/clange/ceur-make).  FYI: roughly 1 in 6 proceedings
volumes has been using RDFa recently.

We may now be running into a problem after changing the
official URLs of our volume pages from, e.g.,
http://ceur-ws.org/Vol-994/ to http://ceur-ws.org/Vol-994, i.e.
dropping the trailing slash.  In short, RDFa requested from
http://ceur-ws.org/Vol-994 contains broken URIs in outgoing links, as
RDFa clients don't seem to follow the "HTTP 301 Moved Permanently",
which points from the slash-less URL to the slashed URL (which still
exists, as our server-side directory layout hasn't changed).  And I'm
wondering whether that's something we should expect an RDFa client to
do, or whether we need to fix our RDFa instead.

Our rationale for dropping the trailing slash was the following:

1. While at the moment all papers inside our volumes are PDF files, e.g.
http://ceur-ws.org/Vol-994/paper-01.pdf, we are thinking about other
content types (see
http://ceurws.wordpress.com/2013/09/25/is-a-paper-just-a-pdf-file/), in
particular directories containing accompanying data such as original
research data, and the main entry point to such a paper could then be
another HTML page in a subdirectory.

2. As the user (here we mean a human using a browser) should not be
responsible for knowing whether a paper, or a volume, is a file or a
directory, we thought we'd use slash-less URLs throughout, and then let
the server tell the browser (and thus the user) when some resource
actually is a directory.

(Do these considerations make sense?)

This behaviour is implemented as follows (irrelevant headers stripped):

$ wget -O /dev/null -S http://ceur-ws.org/Vol-1010
--2013-10-10 16:33:57--  http://ceur-ws.org/Vol-1010
Resolving ceur-ws.org... 137.226.34.227
Connecting to ceur-ws.org|137.226.34.227|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 301 Moved Permanently
  Location: http://ceur-ws.org/Vol-1010/
Location: http://ceur-ws.org/Vol-1010/ [following]
--2013-10-10 16:33:57--  http://ceur-ws.org/Vol-1010/
Reusing existing connection to ceur-ws.org:80.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK

But now RDFa clients don't seem to respect this redirect.  Please try
for yourself with http://www.w3.org/2012/pyRdfa/ and
http://linkeddata.uriburner.com/.  These are the two freely accessible
RDFa extractors I could think of, and I think they are based on
different implementations.  (Am I right?)
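
For illustration, here is a minimal Python sketch of what respecting the
redirect would mean for relative URIs. It assumes the client treats the URL
of the final 200 response as the document's base URI. The helper name
effective_base and its list-of-Location-values input are hypothetical; a
real client would read the Location headers from the HTTP responses.

```python
from urllib.parse import urljoin


def effective_base(requested_url, locations):
    """Return the URL of the final response after following redirects.

    `locations` holds the Location header values seen, in order
    (hypothetical input for this sketch). A Location value may itself
    be relative, so each one is resolved against the URL that produced it.
    """
    base = requested_url
    for location in locations:
        base = urljoin(base, location)
    return base


# The 301 from the wget trace above: slash-less URL -> slashed URL.
base = effective_base(
    "http://ceur-ws.org/Vol-1010",
    ["http://ceur-ws.org/Vol-1010/"],
)
paper = urljoin(base, "paper-01.pdf")

print(base)   # http://ceur-ws.org/Vol-1010/
print(paper)  # http://ceur-ws.org/Vol-1010/paper-01.pdf
```

Under that assumption, the relative link ends up below the volume directory
even though the slash-less URL was the one requested.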

When you enter a slashed URI, e.g. http://ceur-ws.org/Vol-1010/, you get
correct RDFa, in particular outgoing links to, e.g.,
http://ceur-ws.org/Vol-1010/paper-01.pdf.  When you enter the same URI
without a slash, the relative URIs that point from index.html to the
papers, as in <ol rel="dcterms:hasPart"><li about="paper-01.pdf">,
resolve to, e.g., http://ceur-ws.org/paper-01.pdf.
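
The difference can be reproduced with standard relative-reference
resolution, independently of any RDFa tool. The following is a small
sketch using Python's urllib.parse.urljoin; the variable names are
illustrative only.

```python
from urllib.parse import urljoin

relative = "paper-01.pdf"  # the value of about="..." in index.html

# Base without trailing slash: the last path segment ("Vol-1010")
# is replaced by the relative reference.
slashless = urljoin("http://ceur-ws.org/Vol-1010", relative)

# Base with trailing slash: the reference is resolved inside the directory.
slashed = urljoin("http://ceur-ws.org/Vol-1010/", relative)

print(slashless)  # http://ceur-ws.org/paper-01.pdf
print(slashed)    # http://ceur-ws.org/Vol-1010/paper-01.pdf
```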

Now I have the following questions:

Are these RDFa clients broken?

If they are not broken, what is broken on our side, and how can we fix it?

Is it acceptable that RDFa retrieved from a slash-less URL is broken,
whereas RDFa from the slashed URL works?

Is it OK to say that the "canonical URL" of something should be
slash-less, whereas the "semantic identifier" of the same thing (if
that's what we mean by its RDFa URI) should have a slash?  Or should
both be the same?  (Note: I am well aware of the difference between
information resources and non-information resources, but IMHO this
difference doesn't apply here, as we publish online proceedings.
http://ceur-ws.org/Vol-1010 _is_ the workshop volume, which has editors
and contains papers; it is not just a page that describes the workshop
volume.)

Is there an acceptable way of indicating in my RDFa that the slashed
version of the URL is to be preferred?  It would be easy for us to put
an explicit about="http://ceur-ws.org/Vol-1010/" into all index.html
files.  But this would still leave relative about="..." links broken
when RDFa is requested from the slash-less URL, as these are resolved
against the then slash-less base URI of the document.

Or do we ultimately have to make all outgoing RDFa links more explicit,
e.g. by using about="/Vol-1010/paper-01.pdf"?  That wouldn't be much of
a problem, as the RDFa is generated by a script anyway, but it would
once more make the script's output less readable.
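
As a quick check of this last option, the sketch below (Python, using
urllib.parse.urljoin; names are illustrative) resolves the path-absolute
reference against both forms of the volume URL. It only demonstrates
generic URI resolution, not the behaviour of any particular RDFa client.

```python
from urllib.parse import urljoin

reference = "/Vol-1010/paper-01.pdf"  # path-absolute, as in the about="..." above

bases = ("http://ceur-ws.org/Vol-1010", "http://ceur-ws.org/Vol-1010/")

# A reference starting with "/" replaces the whole path of the base,
# so the trailing slash of the base no longer matters.
resolved = [urljoin(base, reference) for base in bases]

for base, target in zip(bases, resolved):
    print(base, "->", target)
```

Both bases yield http://ceur-ws.org/Vol-1010/paper-01.pdf, at the cost of
repeating the volume number in every link.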

Cheers, and many thanks in advance for your advice,

Christoph

-- 
Christoph Lange, School of Computer Science, University of Birmingham
http://cs.bham.ac.uk/~langec/, Skype duke4701

→ Mathematics in Computer Science Special Issue on “Enabling Domain
  Experts to use Formalised Reasoning”; submission until 31 October.
  http://cs.bham.ac.uk/research/projects/formare/pubs/mcs-doform/
Received on Thursday, 10 October 2013 15:54:17 UTC