Re: Fwd: Re: Document fragment vocabulary from Sebastian Hellmann on 2011-08-16 (uri@w3.org from August 2011)

From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
Date: Wed, 17 Aug 2011 01:22:13 +0900
To: Erik Wilde <dret@berkeley.edu>
CC: uri@w3.org, Michael Hausenblas <michael.hausenblas@deri.org>
Message-ID: <4E4A9935.8060301@informatik.uni-leipzig.de>

Hello Erik,

On 16.08.2011 20:38, Erik Wilde wrote:
> hello.
>
> On 2011-08-16 10:36 , Sebastian Hellmann wrote:
>> RFC5147 provides integrity checks, but there is no proposal that
>> produces robust fragment IDs. e.g. something that works on the context
>> and not on line or position. A change in the document on position 0
>> might render all fragment ids obsolete. E.g. "#range=(574,585)" would
>> not be valid any more, if one character was inserted at the beginning of
>> the document, changing the index.
>
> being one of the authors of this RFC, i'd like to point out that the 
> initial ideas were quite a bit more complicated and included features 
> similar to what you are looking for. however, during the process of 
> getting community support, it became clear that the preference of most 
> people was to have simpler and easier to implement fragment identifier 
> features. this does make them more brittle, but things on the web can 
> break, and even a more complicated feature set would only have made 
> them less likely to break. in the end, i think it was good that the 
> final RFC ended up being simple and easy to understand and implement, 
> but it definitely may not be enough for your use cases.

Ease of implementation is only one aspect, and I can understand that it
was one of the major criteria for the community, as it seems to be an
easy common denominator. The format we are creating for LOD2 is aimed at
a Natural Language Processing developer community. I doubt that they
would be scared by a more complex URI pattern; rather, they would
embrace any advantages it offers, for example when a tool annotates a
web page and the frag-IDs either stay robust or can be corrected
automatically. The different patterns will be implemented for several
dozen NLP tools over the project lifetime of LOD2.
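
To make the difference between position-based and context-based
addressing concrete, here is a minimal sketch in Python. It is only an
illustration under assumptions: the function names (select_range,
make_context_anchor, relocate) and the context width of 10 characters
are hypothetical, and this is neither the RFC5147 syntax nor the
context-hash recipe from the NIF page.

```python
# Illustrative sketch only: names and the context width are assumptions,
# not part of RFC 5147 or of the NIF context-hash recipe.

def select_range(text, start, end):
    """Position-based selection: plain character offsets."""
    return text[start:end]


def make_context_anchor(text, start, end, width=10):
    """Store the selected string together with its surrounding context."""
    before = text[max(0, start - width):start]
    after = text[end:end + width]
    return (before, text[start:end], after)


def relocate(text, anchor):
    """Find the anchored string again in a (possibly changed) document.

    Returns the new (start, end) offsets, or None if the context is gone.
    """
    before, target, after = anchor
    pos = text.find(before + target + after)
    if pos == -1:
        return None
    new_start = pos + len(before)
    return (new_start, new_start + len(target))


original = "The quick brown fox jumps over the lazy dog."
anchor = make_context_anchor(original, 16, 19)   # selects "fox"

# One character inserted at position 0 shifts every offset by one.
changed = "X" + original

stale = select_range(changed, 16, 19)            # no longer "fox"
fixed = relocate(changed, anchor)                # corrected offsets
```

After the insertion, the old offsets select the wrong characters, while
the context-based anchor can be used to correct the offsets
automatically, as long as the surrounding text itself is unchanged.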

What would you suggest we do, then? We are considering how to address
fragments of text documents in general, with CSV, XML and XHTML being
specialisations. We might just add an additional "type=RFC5147" to the
fragment and then add several other types ourselves: a stable one, one
for morpho-syntax, etc.

I still have the following questions:
- Do you know of any systems that implement RFC5147?
- What was your original use case for designing the frag-ids?
- Can you point me to a site where the less brittle version you
mentioned is discussed? Or could you give an example? My proposal for
this is here:
http://aksw.org/Projects/NIF#context-hash-nif-uri-recipe
- Do you know of any benchmarking of the different URI approaches w.r.t.
robustness, uniqueness, etc.? I'm currently doing an evaluation, so
please tell me if I should include anything. I might include your
CSV-Frag Ids, but I would need some data that is changing (although I
could simulate it).
- What does "proposed standard" mean? Does it mean that the RFC is not a
standard, but only "proposed"?

Thanks for your answers,
Sebastian


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org

Received on Tuesday, 16 August 2011 16:23:11 UTC