Fwd: Re: Document fragment vocabulary

Dear all,
a topic came up on public-lod@w3.org that might be better discussed here.
I will summarize most of it briefly; the details can be found in the 
forwarded message below.

Currently, I am working on an RDF interchange format for Natural 
Language Processing (NLP), called NIF [2] (slides [1]), which is part of 
LOD2 [3].
It relies heavily on URI fragment IDs that address substrings of a 
text/plain document.
Although RFC 5147 [4] exists, it does not cover some requirements of the 
NLP use case.

RFC 5147 provides integrity checks, but there is no proposal that 
produces robust fragment IDs, i.e. something that works on the context 
and not on line or position. A change in the document at position 0 
might render all fragment IDs obsolete. E.g. "#range=(574,585)" would 
no longer address the intended text if one character was inserted at the 
beginning of the document, shifting the index.
The RFC was already extended for CSV [5], but I would go even further and 
allow more extensions, and then collect them all in a structured format 
such as an RDF/OWL vocabulary. We have already done this for our cases [6].
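
To make the robustness problem with position-based IDs concrete, here is
a minimal sketch. It uses a simplified character range (start inclusive,
end exclusive), not the exact RFC 5147 syntax, and the sample sentence
and the helper name resolve_range are made up for illustration.

```python
def resolve_range(text: str, start: int, end: int) -> str:
    """Return the substring addressed by a character range.

    Simplified model: start is inclusive, end is exclusive.
    """
    return text[start:end]


# Hypothetical sample document (not taken from the mail).
original = "Linked Data builds on the Semantic Web stack."
start = original.index("Semantic Web")
end = start + len("Semantic Web")

# Against the unchanged document the range addresses the intended string.
intended = resolve_range(original, start, end)

# Insert a single character at position 0: every later offset shifts by one,
# so the old range now addresses different characters.
edited = "X" + original
shifted = resolve_range(edited, start, end)

# A lookup that works on the content itself still finds the string.
relocated = edited.find("Semantic Web")

print(repr(intended), repr(shifted), relocated)
```

The stored range silently points at the wrong characters after the edit,
while a content-based lookup only has to cope with the new position.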

For our purposes, we defined two fragment recipes, shown here annotating 
the third occurrence of "Semantic Web":
http://www.w3.org/DesignIssues/LinkedData.html#offset__Semantic+Web_14406-14418
http://www.w3.org/DesignIssues/LinkedData.html#hash_md5_4__12_Semantic+Web_abeb272fe2deadd2cd486c4cea6cddf1
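
As a rough sketch of how IDs shaped like the two recipes above could be
built: the field layout is copied from the example URIs, but what exactly
goes into the MD5 digest is not stated in this mail. The sketch ASSUMES
it is the left context, the string in parentheses, and the right context,
so it will not reproduce the digest shown above. The function names and
the sample text are hypothetical.

```python
import hashlib
from urllib.parse import quote_plus


def offset_fragment(text: str, begin: int, end: int) -> str:
    """Offset recipe: human-readable string plus begin-end positions."""
    anchor = text[begin:end]
    return f"offset__{quote_plus(anchor)}_{begin}-{end}"


def hash_fragment(text: str, begin: int, end: int, context_len: int = 4) -> str:
    """Hash recipe: context length, string length, string, MD5 digest."""
    anchor = text[begin:end]
    left = text[max(0, begin - context_len):begin]
    right = text[end:end + context_len]
    # ASSUMPTION: the digest input is not specified in the mail; this
    # sketch hashes "<left>(<string>)<right>".
    message = f"{left}({anchor}){right}"
    digest = hashlib.md5(message.encode("utf-8")).hexdigest()
    return f"hash_md5_{context_len}__{len(anchor)}_{quote_plus(anchor)}_{digest}"


# Hypothetical sample document.
doc = "The Semantic Web is a web of data."
begin = doc.index("Semantic Web")
end = begin + len("Semantic Web")

offset_id = offset_fragment(doc, begin, end)
hash_id = hash_fragment(doc, begin, end)

# After inserting one character at position 0 the offsets change, but the
# hash-based ID stays the same because the local context is unchanged.
shifted_doc = "X" + doc
offset_id_after_edit = offset_fragment(shifted_doc, begin + 1, end + 1)
hash_id_after_edit = hash_fragment(shifted_doc, begin + 1, end + 1)

print(offset_id)
print(hash_id == hash_id_after_edit)
```

Under these assumptions the hash-based ID survives edits elsewhere in the
document, at the cost of breaking when the string or its immediate
context changes.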

I'm quite unsure how to proceed now: use our own fragment recipes, write 
another RFC, or try to generalize the approach with the help of a vocabulary.
RFC 5147 would then need to be extended by a "type=RFC5147", 
"type=offset" or "type=hash" parameter, and you would be able to look up 
what "RFC5147", "offset" or "hash" means. Could you give us some 
suggestions, as we do not want to invent the 15th competing standard [7]?

Another problem we have is that the fragment ID is not sent to the 
server. Has this ever played a practical role up to now? For Linked Data 
it can be cumbersome: let's say you have a 200 MB text file with an 
average of 3 annotations per line (200,000 lines, 600,000 triples).
Somebody attached an annotation to line 20000:
<http://example.com/text.txt#line=20000> my:comment "Please remove this 
line. It is so negative!" .
When making a request with an RDF/XML Accept header, you would always 
need to retrieve all annotations for all lines.
Then, after transferring the 600,000 triples, the client would throw 
away all triples except the ones for this line.
Currently, we do not care whether "?nif=", "/" or "#" is used and 
leave this up to the implementer.
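
The effect can be sketched as follows. The example is scaled down by a
factor of 100 (2,000 lines, 6,000 triples, annotation on line 200), and
fake_server_response is a hypothetical stand-in for the document a server
would return; only urldefrag is a real library call.

```python
from urllib.parse import urldefrag

# Scaled-down version of the scenario: the annotated URI carries a fragment.
annotated = "http://example.com/text.txt#line=200"

# The fragment is split off on the client; only request_url goes over the wire.
request_url, fragment = urldefrag(annotated)


def fake_server_response(base: str, lines: int, per_line: int = 3):
    """Hypothetical stand-in for the RDF returned for the whole document.

    Yields (subject, predicate, object) tuples for every line, because
    the server never learns which fragment the client is interested in.
    """
    return [
        (f"{base}#line={n}", "my:comment", f"annotation {k} on line {n}")
        for n in range(1, lines + 1)
        for k in range(1, per_line + 1)
    ]


# The client receives everything ...
triples = fake_server_response(request_url, lines=2000)

# ... and has to filter locally for the subject it actually asked about.
wanted = [t for t in triples if t[0] == annotated]

print(request_url, fragment, len(triples), len(wanted))
```

A "?nif=" or "/" style URI would keep the selector in the part of the URI
that is sent, so the filtering could happen on the server instead.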

This summary has become quite long, and even more aspects are mentioned 
below. I hope this is not too confusing.
All the best,
Sebastian


[1] http://www.slideshare.net/kurzum/nif-nlp-interchange-format
[2] http://aksw.org/Projects/NIF
[3] http://lod2.eu
[4] http://tools.ietf.org/html/rfc5147
[5] http://tools.ietf.org/html/draft-hausenblas-csv-fragment
[6] http://nlp2rdf.lod2.eu/schema/string/
[7] http://xkcd.com/927/

-------- Original Message --------
Subject: 	Re: Document fragment vocabulary
Resent-Date: 	Tue, 16 Aug 2011 06:15:14 +0000
Resent-From: 	public-lod@w3.org
Date: 	Tue, 16 Aug 2011 15:09:21 +0900
From: 	Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
To: 	Michael Hausenblas <michael.hausenblas@deri.org>
CC: 	Michael Martin <martin@informatik.uni-leipzig.de>, public-lod 
<public-lod@w3.org>, Alexander Dutton <alexander.dutton@oucs.ox.ac.uk>



On 16.08.2011 14:12, Michael Hausenblas wrote:
>>  It is not really LinkedData friendly.
>
>  Why?
>
It does not scale for large documents. Let's say you have a 200 MB text
file, with average 3 annotations per line (200,000 lines, 600,000 triples ).
Somebody attached an annotation on line 20000:

<http://example.com/text.txt#line=20000>   my:comment "Please remove this line. It is so negative!" .

When making a request with an RDF/XML Accept header, you would always
need to retrieve all annotations for all lines.
Then, after transferring the 200 MB, the client would throw away all
triples but the ones for this line.

>>  @Michael: is there some standardisation regarding URIs for text
>>  going on?
>
>  As you've rightly identified, an RFC already exists. What would this
>  new standardisation activity be chartered for?
>
>  As an aside, this reminds me a bit of http://xkcd.com/927/
Hm, actually you created an extra standard yourself for CSV, because the
approach by Wilde and Dürst did not cover your use case.
It does not fully cover mine either. Potentially, there are a lot of
text-based formats, so there should be a way to extend the pattern somehow.
>>  The approach by Wilde and Dürst[1] seems to lack stability.
>  I don't know what you mean by this. Lack of take-up, yes. Stability,
>  what's that?
Wilde and Dürst provide integrity checks, but there is no proposal that
produces robust fragment IDs, i.e. something that works on the context
and not on line or position. A change in the document at position 0
might render all fragment IDs obsolete. E.g. "#range=(574,585)" would
no longer address the intended text if one character was inserted at the
beginning of the document.

>>  Do you think we could do such standardisation for document fragments
>>  and text fragments within the Media Fragments Group[3] ?
>  No. Disclaimer: I'm a MF WG member. Look at our charter [1] ...
>
Ok, thanks for clarifying that.
>
>  Maybe this thread should slowly be moved over to uri@w3.org [2]?
>
The # part not being sent to the server might be interesting for this
list as it is a linked data problem. Also I think we should create an
OWL Vocabulary to describe, document and standardize different fragment
identifiers, as Alexander has started. But we should only do it with the
w3c. Otherwise it will truly become "competing standard 15" .
The ontology could also just be descriptive, reflecting the RFCs.
Should we cross-post? Alternatively I could just start another thread there.
Sebastian

>
>  Cheers,
>      Michael
>
>  [1] http://www.w3.org/2008/01/media-fragments-wg.html
>  [2] http://lists.w3.org/Archives/Public/uri/
>  -- 
>  Dr. Michael Hausenblas, Research Fellow
>  LiDRC - Linked Data Research Centre
>  DERI - Digital Enterprise Research Institute
>  NUIG - National University of Ireland, Galway
>  Ireland, Europe
>  Tel. +353 91 495730
>  http://linkeddata.deri.ie/
>  http://sw-app.org/about.html
>
>  On 16 Aug 2011, at 05:40, Sebastian Hellmann wrote:
>
>>  Hi Michael and Alex,
>>  sorry for answering so late; I was on holiday in France.
>>  I looked at the three provided resources [1,2,3] and there are still
>>  some comments and questions I have.
>>
>>  1. The part after the # is actually not sent to the server. Are there
>>  any solutions for this? It is not really LinkedData friendly.
>>  Compare
>>  http://linkedgeodata.org/triplify/near/51.033333,13.733333/1000/class/Amenity
>>  (Currently not working, but it gives all points within a 1000m radius)
>>
>>  The client would be required to calculate the subset of triples from
>>  the resource, that are addressed.
>>
>>  2. [1] is quite basic and they are basically using position and
>>  lines. I made a qualitative comparison of different fragment id
>>  approaches for text in [4] slide 7.
>>  I was wondering if anybody has researched such properties of URI
>>  fragments. Currently, I am benchmarking stability of these uris using
>>  Wikipedia changes.
>>  Has such work been done before?
>>
>>  3. @Alex: In my opinion, your proposed fragment ontology can  only be
>>  used to provide documentation for different fragments.
>>  I would rather propose to just use one triple:
>>  <http://www.w3.org/DesignIssues/LinkedData.html#offset__14406-14418>
>>  a <http://nlp2rdf.lod2.eu/schema/string/OffsetBasedString>
>>  The ontology I made for Strings might be generalized for formats
>>  other than text based [5]
>>  One triple is much shorter. As you can see I also tried to encode the
>>  type of fragment right into the fragment "offset", although a
>>  notation like "type=offset"  might be better.
>>
>>  4.  @Michael: is there some standardisation regarding URIs for text
>>  going on?
>>  I heard there would be a Language Technology W3C group. The approach
>>  by Wilde and Dürst[1] seems to lack stability.
>>  Do you think we could do such standardisation for document fragments
>>  and text fragments within the Media Fragments Group[3] ?
>>  I really thought the liveUrl project was quite good, but it seems
>>  dead[6].
>>
>>
>>  In LOD2[7] and NIF[8] we will need some fragment identifiers to
>>  standardize NLP tools for the LOD2 stack.
>>  It would be great to reuse stuff instead of starting from scratch. I
>>  had to extend [1] for example, because it did not produce stable uris
>>  and also it did not contain the type of algorithm used to produce the
>>  URI.
>>
>>  All the best,
>>  Sebastian
>>
>>
>>  [1] http://tools.ietf.org/html/rfc5147
>>  [2] http://tools.ietf.org/html/draft-hausenblas-csv-fragment
>>  [3] http://www.w3.org/TR/media-frags/
>>  [4] http://www.slideshare.net/kurzum/nif-nlp-interchange-format
>>  [5] http://nlp2rdf.lod2.eu/schema/string/
>>  [6] http://liveurls.mozdev.org/index.html
>>  [7] http://lod2.eu
>>  [8] http://aksw.org/Projects/NIF
>>
>>  On 04.08.2011 22:37, Michael Hausenblas wrote:
>>>
>>>
>>>  Alex,
>>>
>>>>  Has something already done this? Is it even (mostly?) sane?
>>>
>>>  Sane yes, IMO. Done, sort of, see:
>>>
>>>  + URI Fragment Identifiers for the text/plain [1]
>>>  + URI Fragment Identifiers for the text/csv [2]
>>>
>>>  Cheers,
>>>      Michael
>>>
>>>  [1] http://tools.ietf.org/html/rfc5147
>>>  [2] http://tools.ietf.org/html/draft-hausenblas-csv-fragment
>>>
>>>  -- 
>>>  Dr. Michael Hausenblas, Research Fellow
>>>  LiDRC - Linked Data Research Centre
>>>  DERI - Digital Enterprise Research Institute
>>>  NUIG - National University of Ireland, Galway
>>>  Ireland, Europe
>>>  Tel. +353 91 495730
>>>  http://linkeddata.deri.ie/
>>>  http://sw-app.org/about.html
>>>
>>>  On 4 Aug 2011, at 14:22, Alexander Dutton wrote:
>>>
>>>>
>>>>  -----BEGIN PGP SIGNED MESSAGE-----
>>>>  Hash: SHA1
>>>>
>>>>  Hi all,
>>>>
>>>>  Say I have an XML document, <http://example.org/something.xml>, and I
>>>>  want to talk about some part of it in RDF. As this is XML, being
>>>>  able to point into it using XPath sounds ideal, leading to
>>>>  something like:
>>>>
>>>>  <#fragment>  a fragment:Fragment ;
>>>>   fragment:within <http://example.org/something.xml>  ;
>>>>   fragment:locator "/some/path[1]"^^fragment:xpath .
>>>>
>>>>  (For now we can ignore whether we wanted a nodeset or a single node,
>>>>  and how to handle XML namespaces.)
>>>>
>>>>  More generally, we might want other ways of locating fragments
>>>>  (probably with a datatype for each):
>>>>
>>>>  * character offsets / ranges
>>>>  * byte offsets / ranges
>>>>  * line numbers / ranges
>>>>  * some sub-rectangle of an image
>>>>  * XML node IDs
>>>>  * page ranges of a paginated document
>>>>
>>>>  Some of these will be IMT-specific and may need some more thinking
>>>>  about, but the idea is there.
>>>>
>>>>
>>>>  Has something already done this? Is it even (mostly?) sane?
>>>>
>>>>
>>>>  Yours,
>>>>
>>>>  Alex
>>>>
>>>>
>>>>  NB. Our actual use-case is having pointers into an NLM XML file
>>>>  (embodying a journal article) so we can hook up our in-text reference
>>>>  pointer¹ URIs to the original XML elements (<xref/>s) they were
>>>>  generated from. This will allow us to work out the context of each
>>>>  citation for use in further analysis of the relationship between the
>>>>  citing and cited articles.
>>>>
>>>>  ¹ See
>>>>  <http://opencitations.wordpress.com/2011/07/01/nomenclature-for-citations-and-references/>
>>>>
>>>>  for an explanation of the terminology.
>>>>
>>>>  - -- 
>>>>  Alexander Dutton
>>>>  Developer, data.ox.ac.uk, InfoDev, Oxford University Computing
>>>>  Services
>>>>            Open Citations Project, Department of Zoology, University
>>>>  of Oxford
>>>>  -----BEGIN PGP SIGNATURE-----
>>>>  Version: GnuPG v1.4.11 (GNU/Linux)
>>>>  Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/
>>>>
>>>>  iEYEARECAAYFAk46nS4ACgkQS0pRIabRbjDVZQCdGblvoMgNqEietlE5EwAkPJY8
>>>>  pikAn2KApM0HjcXj6TZegA+Dek/DJIQX
>>>>  =UcCr
>>>>  -----END PGP SIGNATURE-----
>>>>
>>>>
>>>
>>>
>>
>>
>>  -- 
>>  Dipl. Inf. Sebastian Hellmann
>>  Department of Computer Science, University of Leipzig
>>  Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
>>  Research Group: http://aksw.org
>
>


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org

Received on Tuesday, 16 August 2011 08:37:28 UTC