Re: Document fragment vocabulary from Sebastian Hellmann on 2011-08-16 (public-lod@w3.org from August 2011)

From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
Date: Tue, 16 Aug 2011 17:40:25 +0900
To: Michael Hausenblas <michael.hausenblas@deri.org>
CC: Michael Martin <martin@informatik.uni-leipzig.de>, public-lod <public-lod@w3.org>, Alexander Dutton <alexander.dutton@oucs.ox.ac.uk>
Message-ID: <4E4A2CF9.1040005@informatik.uni-leipzig.de>
I just forwarded this mail and my questions to uri@w3.org without 
cross-posting, as some parts are not really of interest to the linked data list.
http://lists.w3.org/Archives/Public/uri/2011Aug/

Regards,
Sebastian


On 16.08.2011 15:09, Sebastian Hellmann wrote:
> On 16.08.2011 14:12, Michael Hausenblas wrote:
>>> It is not really LinkedData friendly.
>>
>> Why?
>>
> It does not scale for large documents. Let's say you have a 200 MB 
> text file with, on average, 3 annotations per line (200,000 lines, 
> 600,000 triples).
> Suppose somebody attached an annotation to line 20000:
>
> <http://example.com/text.txt#line=20000>  my:comment "Please remove 
> this line. It is so negative!" .
>
> When making a request with an RDF/XML Accept header, you would always 
> need to retrieve all annotations for all lines.
> Then, after transferring the 200 MB, the client would throw away all 
> triples but that one.
>
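
A minimal sketch of why the server cannot help here, assuming a Python client and the example URI from above: the fragment is split off on the client side before the request is built, so only the part before the '#' is ever requested.

```python
from urllib.parse import urldefrag

# Example annotation URI taken from the message above.
uri = "http://example.com/text.txt#line=20000"

# The client separates the fragment from the rest of the URI;
# only request_url is sent to the server.
request_url, fragment = urldefrag(uri)

print(request_url)  # http://example.com/text.txt
print(fragment)     # line=20000
```

Any filtering by `line=20000` therefore has to happen on the client, after the whole representation has been transferred.
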
>>> @Michael: is there any standardisation of URIs for text going on?
>>
>> As you've rightly identified, an RFC already exists. What would this 
>> new standardisation activity be chartered for?
>>
>> As an aside, this reminds me a bit of http://xkcd.com/927/
> Hm, actually you created an extra standard yourself for CSV, because 
> the approach by Wilde and Dürst did not cover your use case.
> It does not fully cover mine either. Potentially, there are a lot 
> of text-based formats, so there should be a way to extend the pattern 
> somehow.
>>> The approach by Wilde and Dürst[1] seems to lack stability.
>> I don't know what you mean by this. Lack of take-up, yes. Stability, 
>> what's that?
> Wilde and Dürst provide integrity checks, but there is no proposal 
> that produces robust fragment IDs, i.e. something that works on the 
> context and not on line or position. A change in the document at 
> position 0 might render all fragment IDs obsolete. For example, 
> "#range=(574,585)" would no longer be valid if one character were 
> inserted at the beginning of the document.
>
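
The instability can be illustrated with a small sketch. The sample text, the `select` helper and the context-based lookup are assumptions made for illustration only; they are not the scheme of RFC 5147 or of any proposal mentioned in this thread.

```python
def select(text, start, end):
    """Return the characters between two offsets (end exclusive)."""
    return text[start:end]

original = "Linked Data uses URIs to name things."
target = "uses URIs"

# Offsets recorded against the original document.
start = original.index(target)
end = start + len(target)
print(select(original, start, end))  # uses URIs

# One character inserted at position 0 shifts everything after it.
edited = "X" + original
print(repr(select(edited, start, end)))  # ' uses URI'

# A context-based lookup (here simply searching for the string)
# still finds the intended span in the edited document.
new_start = edited.index(target)
print(select(edited, new_start, new_start + len(target)))  # uses URIs
```

Searching for the literal string is only the simplest possible form of a context-based identifier; it breaks as soon as the target occurs more than once or is itself edited.
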
>>> Do you think we could do such standardisation for document fragments 
>>> and text fragments within the Media Fragments Group[3] ?
>> No. Disclaimer: I'm a MF WG member. Look at our charter [1] ...
>>
> Ok, thanks for clarifying that.
>>
>> Maybe this thread should slowly be moved over to uri@w3.org [2]?
>>
> The # part not being sent to the server might be interesting for this 
> list, as it is a linked data problem. Also, I think we should create an 
> OWL vocabulary to describe, document and standardize different 
> fragment identifiers, as Alexander has started. But we should only do 
> it with the W3C. Otherwise it will truly become "competing standard 15".
> The ontology could also just be descriptive, reflecting the RFCs.
> Should we cross-post? Alternatively, I could just start another thread 
> there.
> Sebastian
>
>>
>> Cheers,
>>     Michael
>>
>> [1] http://www.w3.org/2008/01/media-fragments-wg.html
>> [2] http://lists.w3.org/Archives/Public/uri/
>> -- 
>> Dr. Michael Hausenblas, Research Fellow
>> LiDRC - Linked Data Research Centre
>> DERI - Digital Enterprise Research Institute
>> NUIG - National University of Ireland, Galway
>> Ireland, Europe
>> Tel. +353 91 495730
>> http://linkeddata.deri.ie/
>> http://sw-app.org/about.html
>>
>> On 16 Aug 2011, at 05:40, Sebastian Hellmann wrote:
>>
>>> Hi Michael and Alex,
>>> sorry for answering so late; I was on holiday in France.
>>> I looked at the three provided resources [1,2,3] and there are still 
>>> some comments and questions I have.
>>>
>>> 1. The part after the # is actually not sent to the server. Are 
>>> there any solutions for this? It is not really LinkedData friendly.
>>> Compare 
>>> http://linkedgeodata.org/triplify/near/51.033333,13.733333/1000/class/Amenity
>>> (Currently not working, but it gives all points within a 1000m radius)
>>>
>>> The client would be required to calculate the addressed subset of 
>>> triples from the resource.
>>>
>>> 2. [1] is quite basic: it essentially uses positions and 
>>> lines. I made a qualitative comparison of different fragment ID 
>>> approaches for text in [4], slide 7.
>>> I was wondering if anybody has researched such properties of URI 
>>> fragments. Currently, I am benchmarking the stability of these URIs 
>>> using Wikipedia changes.
>>> Has such work been done before?
>>>
>>> 3. @Alex: In my opinion, your proposed fragment ontology can only 
>>> be used to provide documentation for different fragments.
>>> I would rather propose to just use one triple:
>>> <http://www.w3.org/DesignIssues/LinkedData.html#offset__14406-14418> 
>>> a <http://nlp2rdf.lod2.eu/schema/string/OffsetBasedString>
>>> The ontology I made for strings might be generalized to formats 
>>> other than text-based ones [5].
>>> One triple is much shorter. As you can see, I also tried to encode 
>>> the type of fragment right into the fragment ("offset"), although a 
>>> notation like "type=offset" might be better.
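
As a rough sketch of how a client might read such a fragment, assuming the `type__begin-end` layout seen in the example URI above (the pattern and the function name `parse_offset_fragment` are hypothetical, not part of any published specification):

```python
import re
from urllib.parse import urldefrag

# Assumed layout: <type>__<begin>-<end>, as in "offset__14406-14418".
FRAGMENT_PATTERN = re.compile(r"^(?P<kind>[a-z]+)__(?P<begin>\d+)-(?P<end>\d+)$")

def parse_offset_fragment(uri):
    """Return (kind, begin, end) or None if the fragment does not match."""
    _, fragment = urldefrag(uri)
    match = FRAGMENT_PATTERN.match(fragment)
    if match is None:
        return None
    return match["kind"], int(match["begin"]), int(match["end"])

uri = "http://www.w3.org/DesignIssues/LinkedData.html#offset__14406-14418"
print(parse_offset_fragment(uri))  # ('offset', 14406, 14418)
```

Because the type is part of the fragment, a client can tell which addressing scheme was used without dereferencing any additional description.
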
>>>
>>> 4. @Michael: is there any standardisation of URIs for text going on?
>>> I heard there would be a Language Technology W3C group. The approach 
>>> by Wilde and Dürst[1] seems to lack stability.
>>> Do you think we could do such standardisation for document fragments 
>>> and text fragments within the Media Fragments Group[3] ?
>>> I really thought the liveUrl project was quite good, but it seems 
>>> dead[6].
>>>
>>>
>>> In LOD2[7] and NIF[8] we will need some fragment identifiers to 
>>> standardize NLP tools for the LOD2 stack.
>>> It would be great to reuse stuff instead of starting from scratch. I 
>>> had to extend [1], for example, because it did not produce stable 
>>> URIs and did not contain the type of algorithm used to 
>>> produce the URI.
>>>
>>> All the best,
>>> Sebastian
>>>
>>>
>>> [1] http://tools.ietf.org/html/rfc5147
>>> [2] http://tools.ietf.org/html/draft-hausenblas-csv-fragment
>>> [3] http://www.w3.org/TR/media-frags/
>>> [4] http://www.slideshare.net/kurzum/nif-nlp-interchange-format
>>> [5] http://nlp2rdf.lod2.eu/schema/string/
>>> [6] http://liveurls.mozdev.org/index.html
>>> [7] http://lod2.eu
>>> [8] http://aksw.org/Projects/NIF
>>>
>>> On 04.08.2011 22:37, Michael Hausenblas wrote:
>>>>
>>>>
>>>> Alex,
>>>>
>>>>> Has something already done this? Is it even (mostly?) sane?
>>>>
>>>> Sane yes, IMO. Done, sort of, see:
>>>>
>>>> + URI Fragment Identifiers for the text/plain [1]
>>>> + URI Fragment Identifiers for the text/csv [2]
>>>>
>>>> Cheers,
>>>>     Michael
>>>>
>>>> [1] http://tools.ietf.org/html/rfc5147
>>>> [2] http://tools.ietf.org/html/draft-hausenblas-csv-fragment
>>>>
>>>> -- 
>>>> Dr. Michael Hausenblas, Research Fellow
>>>> LiDRC - Linked Data Research Centre
>>>> DERI - Digital Enterprise Research Institute
>>>> NUIG - National University of Ireland, Galway
>>>> Ireland, Europe
>>>> Tel. +353 91 495730
>>>> http://linkeddata.deri.ie/
>>>> http://sw-app.org/about.html
>>>>
>>>> On 4 Aug 2011, at 14:22, Alexander Dutton wrote:
>>>>
>>>>>
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA1
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Say I have an XML document, <http://example.org/something.xml>, and I
>>>>> want to talk about some part of it in RDF. As this is XML, being
>>>>> able to point into it using XPath sounds ideal, leading to 
>>>>> something like:
>>>>>
>>>>> <#fragment> a fragment:Fragment ;
>>>>>  fragment:within <http://example.org/something.xml> ;
>>>>>  fragment:locator "/some/path[1]"^^fragment:xpath .
>>>>>
>>>>> (For now we can ignore whether we wanted a nodeset or a single node,
>>>>> and how to handle XML namespaces.)
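
A small sketch of resolving such a locator on the client, using Python's standard library. The sample document is invented for illustration. ElementTree supports only a subset of XPath and evaluates paths relative to an element, so the absolute locator "/some/path[1]" is written here as "./path[1]" relative to the root element <some>.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for <http://example.org/something.xml>.
doc = ET.fromstring("<some><path>first</path><path>second</path></some>")

# Select the first <path> child of the root, mirroring "/some/path[1]".
matches = doc.findall("./path[1]")

print([el.text for el in matches])  # ['first']
```
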
>>>>>
>>>>> More generally, we might want other ways of locating fragments
>>>>> (probably with a datatype for each):
>>>>>
>>>>> * character offsets / ranges
>>>>> * byte offsets / ranges
>>>>> * line numbers / ranges
>>>>> * some sub-rectangle of an image
>>>>> * XML node IDs
>>>>> * page ranges of a paginated document
>>>>>
>>>>> Some of these will be IMT-specific and may need some more thinking
>>>>> about, but the idea is there.
>>>>>
>>>>>
>>>>> Has something already done this? Is it even (mostly?) sane?
>>>>>
>>>>>
>>>>> Yours,
>>>>>
>>>>> Alex
>>>>>
>>>>>
>>>>> NB. Our actual use-case is having pointers into an NLM XML file
>>>>> (embodying a journal article) so we can hook up our in-text reference
>>>>> pointer¹ URIs to the original XML elements (<xref/>s) they were
>>>>> generated from. This will allow us to work out the context of each
>>>>> citation for use in further analysis of the relationship between the
>>>>> citing and cited articles.
>>>>>
>>>>> ¹ See
>>>>> <http://opencitations.wordpress.com/2011/07/01/nomenclature-for-citations-and-references/> 
>>>>>
>>>>> for an explanation of the terminology.
>>>>>
>>>>> - -- Alexander Dutton
>>>>> Developer, data.ox.ac.uk, InfoDev, Oxford University Computing 
>>>>> Services
>>>>>           Open Citations Project, Department of Zoology, University
>>>>> of Oxford
>>>>> -----BEGIN PGP SIGNATURE-----
>>>>> Version: GnuPG v1.4.11 (GNU/Linux)
>>>>> Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/
>>>>>
>>>>> iEYEARECAAYFAk46nS4ACgkQS0pRIabRbjDVZQCdGblvoMgNqEietlE5EwAkPJY8
>>>>> pikAn2KApM0HjcXj6TZegA+Dek/DJIQX
>>>>> =UcCr
>>>>> -----END PGP SIGNATURE-----
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>> -- 
>>> Dipl. Inf. Sebastian Hellmann
>>> Department of Computer Science, University of Leipzig
>>> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
>>> Research Group: http://aksw.org
>>
>>
>
>


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
Received on Tuesday, 16 August 2011 08:46:05 UTC