Re: Fwd: Re: Document fragment vocabulary from Sebastian Hellmann on 2011-08-29 (uri@w3.org from August 2011)

From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
Date: Mon, 29 Aug 2011 15:43:29 +0200
To: Erik Wilde <dret@berkeley.edu>
CC: uri@w3.org, Michael Hausenblas <michael.hausenblas@deri.org>
Message-ID: <4E5B9781.5030409@informatik.uni-leipzig.de>
Hi Erik and Michael,
sorry for the delay, I was sick last week.
In this email I have tried to state my problems more clearly, and I hope
we are getting closer to the core now.

Am 23.08.2011 23:46, schrieb Erik Wilde:
> hello sebastian.
>
> On 2011-08-16 09:22 , Sebastian Hellmann wrote:
>> What is your suggestion then, what we should be doing? We consider
>> addressing fragments of text documents in general, with CSV and XML and
>> XHTML being specialisations. We might just add an additional
>> "type=RFC5147" to the fragment and then add several other types
>> ourselves: a stable one, one for morpho-syntax, etc.
>
> i am not quite sure how you could see CSV and XML and XHTML as 
> specialization of plain text. they do have different metamodels (at 
> least plain text and CSV and *ML) and thus need pretty different 
> approaches when it comes to fragment identification. i think the 
> problem you're having may be a well-known ugliness in web architecture: 
> fragment identifiers are specific for the media type, but URIs are 
> (often) not. this is just a design defect of the web, and there's no 
> easy way around it. sometimes people try to engineer around it 
> somehow, but as soon as you're starting to think about 
> decentralization and redirections, things typically fall apart. all 
> sorts of things have been proposed over the years to fix this defect, 
> but it's a hard problem to solve in the general case and without 
> breaking backwards compatibility.

The basic problem seems to be the definition of what plain text is. I
guess you are talking about the media type, while I am talking about
plain text in general. My definition would be a bit broader, such as:
"Plain text is basically anything that makes sense to open in a text
editor", or negatively "not a binary format", or simply "a character
sequence". CSV and *ML then impose certain rules upon the plain text and
require certain patterns of characters.
The easiest way to show that they are specialisations is that
http://www.w3.org/DesignIssues/LinkedData.html#range=14406-144018 would
theoretically work and point to the fragment "Semantic Web" based on the
HTML source. The problem here is again that plain text according to RFC
2046/3676 should not contain any markup or other things. But this is
not an intuitive definition, and I was not aware that this
separation is made here.
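
To make the idea concrete, here is a minimal sketch in Python of how a
"range=START-END" fragment could be resolved against the raw character
sequence of any text-like document, markup included. The function names
and the sample HTML string are made up for illustration, and I assume
half-open, zero-based character offsets; none of this is taken from a
spec.

```python
import re


def parse_range(fragment):
    """Parse 'range=START-END' into a (start, end) tuple of offsets."""
    match = re.fullmatch(r"range=(\d+)-(\d+)", fragment)
    if match is None:
        raise ValueError(f"not a range fragment: {fragment!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        raise ValueError("start must not exceed end")
    return start, end


def resolve_range(text, fragment):
    """Return the characters addressed by the fragment.

    Assumption: offsets are zero-based and half-open, i.e. they count
    positions between characters, like Python slices.
    """
    start, end = parse_range(fragment)
    return text[start:end]


# Hypothetical HTML source; the markup is treated as plain characters.
source = "<p>The <b>Semantic Web</b> is a web of data.</p>"
start = source.index("Semantic Web")
fragment = f"range={start}-{start + len('Semantic Web')}"
print(fragment, "->", resolve_range(source, fragment))
```

The point of the sketch is only that the resolver never looks at the
media type: it works on the character sequence, whatever rules CSV or
*ML impose on top of it.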


The main use case I have is for Natural Language Processing. Application
A creates an annotation in RDF (e.g. part-of-speech tags):
@base <http://www.w3.org/DesignIssues/LinkedData.html#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sso: <http://nlp2rdf.lod2.eu/schema/sso/> .
<range=14406-144014> sso:posTag "JJ" ;
    rdf:type <http://purl.oclc.org/olia/olia.owl#Adjective> .

A second application B can now read and understand this RDF, because
its format is defined by NIF.
Such understanding is reached because 1. common ontologies such as
olia.owl#Adjective are used and 2. the URIs <range=14406-144014>
have well-defined semantics.

When it comes to RDF, however, the semantics of URIs suddenly seem to be
different:
1. Is <range=14406-144014> the same as
<range=14406-144014;md5=43tz8sfel8jilfeu8sfejkl>?
";md5=43tz8sfel8jilfeu8sfejkl" is just an integrity check, so if the
check succeeds, an application can assume that both URIs have the
same meaning and point to "Semantic".
2. Furthermore, using "?",
<http://www.w3.org/DesignIssues/LinkedData.html?range=14406-144014>
might refer to the same thing. It is an RDF subject, and the reference
can be arbitrarily defined.
3. As the DesignIssues web site is HTML, we could also use XLink or
XPointer to mean the same thing.
4. If the annotations are expensive to calculate, it would be nice if
they could stay valid as long as possible, which calls for more robust
identifiers such as:
<contextlength=4&length=8&text=Semantic&md5=438jil89sfdkljise79>
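
As a rough illustration of point 4, the following Python sketch builds
such a context-based identifier and resolves it again after the document
has changed. The parameter names follow the example URI above, but how
the md5 value is computed is my assumption here (a hash over left
context + text + right context), and the function names and sample
sentences are invented for the example.

```python
import hashlib
from urllib.parse import parse_qs, urlencode


def _digest(left, target, right):
    # Assumption: the md5 covers the left context, the text itself and
    # the right context, concatenated and encoded as UTF-8.
    return hashlib.md5((left + target + right).encode("utf-8")).hexdigest()


def make_context_fragment(text, start, end, context_length=4):
    """Build 'contextlength=..&length=..&text=..&md5=..' for text[start:end]."""
    left = text[max(0, start - context_length):start]
    right = text[end:end + context_length]
    target = text[start:end]
    return urlencode({
        "contextlength": context_length,
        "length": end - start,
        "text": target,
        "md5": _digest(left, target, right),
    })


def resolve_context_fragment(text, fragment):
    """Return all (start, end) ranges in text that satisfy the fragment."""
    params = {key: values[0] for key, values in parse_qs(fragment).items()}
    context_length = int(params["contextlength"])
    length = int(params["length"])
    target = params["text"]
    matches = []
    pos = text.find(target)
    while pos != -1:
        left = text[max(0, pos - context_length):pos]
        right = text[pos + length:pos + length + context_length]
        if _digest(left, target, right) == params["md5"]:
            matches.append((pos, pos + length))
        pos = text.find(target, pos + 1)
    return matches


old = "The Semantic Web is a web of data."
new = "Note: The Semantic Web is a web of data."
fragment = make_context_fragment(old, 4, 12)
print(resolve_context_fragment(old, fragment))
print(resolve_context_fragment(new, fragment))
```

The identifier survives the insertion of "Note: " because it does not
depend on absolute offsets, but it stops resolving as soon as the
immediate context of the annotated text is edited.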

All URI variants could be defined as equal using owl:sameAs. As NLP is
sometimes a hacky business, some of the modelling should not be fixed;
e.g. applications could also define that ranges that differ by just one
character are still the same, or allow other fuzziness.
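
A small Python sketch of such an application-defined equivalence, under
the assumption that two ranges count as "the same" when both their start
and end offsets differ by at most a configurable tolerance. The helper
names and the example base URI are hypothetical.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def ranges_match(a, b, tolerance=1):
    """True if both offsets of the two (start, end) ranges are close enough."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


def same_as_triples(base, ranges, tolerance=1):
    """Emit one N-Triples owl:sameAs statement per matching pair of ranges."""
    triples = []
    for i, a in enumerate(ranges):
        for b in ranges[i + 1:]:
            if ranges_match(a, b, tolerance):
                triples.append(
                    f"<{base}#range={a[0]}-{a[1]}> <{OWL_SAME_AS}> "
                    f"<{base}#range={b[0]}-{b[1]}> ."
                )
    return triples


# Hypothetical document URI and ranges produced by three different tools.
for triple in same_as_triples("http://example.org/doc.html",
                              [(10, 18), (10, 19), (40, 48)]):
    print(triple)
```

One caveat with this kind of fuzziness: the tolerance check is not
transitive (10-18 matches 11-19, which matches 12-20, but 10-18 does not
match 12-20), whereas owl:sameAs is, so a reasoner may infer more
equalities than the application intended.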

So for RDF it does not really matter what exactly is used. But for the
Web it seems to matter. So I am currently looking for common ground here,
as it would be nice to have compatibility. E.g. applications
implementing NIF might be able to understand your annotations using RFC
5147 and also more specialised URIs like the ones for CSV and LiveUrls
[1]. Furthermore, the chance increases that annotations produced by NIF
tools can be highlighted in a browser by default.

I am a little worried that fragment ids are so tightly bound to media
types, especially since you could easily reuse them, i.e. the plain-text
RFC 5147 scheme for CSV and *ML.
It seems difficult to find a good definition of what the URIs used in the
RDF actually denote. It might also not be possible to make this coherent
with fragment id semantics as defined by W3C. What do you think?

>> - Do you know of any benchmarking of the different URI approaches w.r.t.
>> robustness, uniqueness, etc? I'm currently doing an evaluation so
>> please tell me, if I should include anything. I might include your
>> CSV-Frag Ids, but I would need some data that is changing (although I
>> could simulate it)
>
> i don't think you can make benchmarking without being very specific 
> about the scenario and use cases. which means you would need to have a 
> sample dataset of resources changing over time that would reflect the 
> scenario you are interested in, and then you could start comparing 
> approaches. without that, benchmarking would be pointless.

Let's say you apply a spell checker to Wikipedia pages. You find a page
with 10 spelling mistakes. If the first one is
<http://en.wikipedia.org/wiki/Fragment_identifier#range=102,105> <is>
"potion" ; <shouldBe> "portion" .
and the last one is
<http://en.wikipedia.org/wiki/Fragment_identifier#range=992,997> <is>
"exlamation" ; <shouldBe> "exclamation" .

then the URI scheme is poorly chosen: each correction changes the length
of the text and shifts all later offsets, unless you either edit the
page backwards or fix the mistakes one at a time.
But what would be the best URI scheme for this use case?
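
The offset problem in the spell-checker example can be reproduced in a
few lines of Python. This is only a sketch with an invented sample
sentence and a hypothetical apply_fixes helper; the offsets are computed
against the original text, as they would be in the annotations.

```python
def apply_fixes(text, fixes, back_to_front):
    """Apply (start, end, replacement) fixes whose offsets refer to the
    ORIGINAL text, either from the last fix to the first or vice versa."""
    ordered = sorted(fixes, key=lambda fix: fix[0], reverse=back_to_front)
    for start, end, replacement in ordered:
        text = text[:start] + replacement + text[end:]
    return text


# Invented sample page with two of the mistakes from the example.
page = "A potion of the text ends with an exlamation mark."
fixes = []
for wrong, right in [("potion", "portion"), ("exlamation", "exclamation")]:
    start = page.index(wrong)
    fixes.append((start, start + len(wrong), right))

# Editing backwards keeps the earlier offsets valid.
good = apply_fixes(page, fixes, back_to_front=True)
# Editing front to back shifts the later offset by one character.
bad = apply_fixes(page, fixes, back_to_front=False)
print(good)
print(bad)
```

Editing backwards works because a change at a later position never moves
the characters in front of it; any scheme based on absolute offsets
needs this ordering or a recomputation of the annotations after each
edit.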

I can understand your use case of annotating log files, but I guess it 
would be nice to be able to annotate Wikipedia pages.
This is what I would benchmark as it might produce a best practice in 
Web Annotation.
As I said, I would also try to benchmark the CSV URIs if you have a CSV 
corpus that I could use.

All the best and thanks for all your answers,
Sebastian

[1] http://liveurls.mozdev.org/index.html

-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
Received on Monday, 29 August 2011 13:44:12 UTC