- From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
- Date: Tue, 16 Aug 2011 17:40:25 +0900
- To: Michael Hausenblas <michael.hausenblas@deri.org>
- CC: Michael Martin <martin@informatik.uni-leipzig.de>, public-lod <public-lod@w3.org>, Alexander Dutton <alexander.dutton@oucs.ox.ac.uk>
I just forwarded this mail and my questions to uri@w3.org without cross
posting, as some parts are really not interesting for the linked data list.
http://lists.w3.org/Archives/Public/uri/2011Aug/

Regards,
Sebastian

On 16.08.2011 15:09, Sebastian Hellmann wrote:
> On 16.08.2011 14:12, Michael Hausenblas wrote:
>>> It is not really LinkedData friendly.
>>
>> Why?
>>
> It does not scale for large documents. Let's say you have a 200 MB
> text file, with on average 3 annotations per line (200,000 lines,
> 600,000 triples).
> Somebody attached an annotation on line 20000:
>
> <http://example.com/text.txt#line=20000> my:comment "Please remove
> this line. It is so negative!" .
>
> When making a query with an RDF/XML Accept header, you would always
> need to retrieve all annotations for all lines.
> Then after transferring the 200 MB, the client would throw away all
> other triples but the one.
>
>>> @Michael: is there some standardisation regarding URIs for text
>>> going on?
>>
>> As you've rightly identified, an RFC already exists. What would this
>> new standardisation activity be chartered for?
>>
>> As an aside, this reminds me a bit of http://xkcd.com/927/
> Hm, actually you created an extra standard yourself for csv, because
> the approach by Wilde and Dürst did not cover your use case.
> It does not cover mine 100% either. Potentially, there are a lot
> of text based formats. So there should be a way to extend the pattern
> somehow.
>>> The approach by Wilde and Dürst[1] seems to lack stability.
>> I don't know what you mean by this. Lack of take-up, yes. Stability,
>> what's that?
> Wilde and Dürst provide integrity checks, but there is no proposal
> that produces robust fragment IDs, e.g. something that works on the
> context and not on line or position. A change in the document at
> position 0 might render all fragment ids obsolete. E.g.
> "#range=(574,585)" would not be valid, if one character was inserted
> at the beginning of the document.
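To make the stability point concrete, here is a minimal sketch in Python. It is not the RFC 5147 scheme and not the NIF scheme; `by_offset` and `by_context` are hypothetical helpers, and the sample sentence is made up. It only illustrates that a position-based range drifts after a one-character insert at position 0, while a locator that carries some left context can be re-anchored.

```python
from typing import Optional, Tuple


def by_offset(text: str, start: int, end: int) -> str:
    """Resolve a position-based range (zero-based, end exclusive)."""
    return text[start:end]


def by_context(text: str, left: str, target: str) -> Optional[Tuple[int, int]]:
    """Hypothetical context-based locator: find `target` preceded by `left`.

    Returns the current (start, end) of the target, or None if the
    context no longer occurs in the text.
    """
    pos = text.find(left + target)
    if pos == -1:
        return None
    start = pos + len(left)
    return start, start + len(target)


# Made-up sample document and annotation target.
original = "The weather is nice. This line is so negative! The end."
target = "so negative"
start = original.index(target)
end = start + len(target)
left = original[max(0, start - 8):start]  # a few characters of left context

# One character inserted at the beginning of the document.
edited = "X" + original

print(repr(by_offset(original, start, end)))  # the annotated string
print(repr(by_offset(edited, start, end)))    # same offsets, shifted text
print(by_context(edited, left, target))       # re-anchored offsets
```

The context-based variant has its own failure modes (the context may be edited too, or may occur more than once), so this is an illustration of the trade-off rather than a proposal.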
>
>>> Do you think we could do such standardisation for document fragments
>>> and text fragments within the Media Fragments Group[3] ?
>> No. Disclaimer: I'm a MF WG member. Look at our charter [1] ...
>>
> Ok, thanks for clarifying that.
>>
>> Maybe this thread should slowly be moved over to uri@w3.org [2]?
>>
> The # part not being sent to the server might be interesting for this
> list as it is a linked data problem. Also I think we should create an
> OWL Vocabulary to describe, document and standardize different
> fragment identifiers, as Alexander has started. But we should only do
> it with the W3C. Otherwise it will truly become "competing standard 15".
> The ontology could also just be descriptive, reflecting the RFCs.
> Should we cross-post? Alternatively I could just start another thread
> there.
> Sebastian
>
>>
>> Cheers,
>> Michael
>>
>> [1] http://www.w3.org/2008/01/media-fragments-wg.html
>> [2] http://lists.w3.org/Archives/Public/uri/
>> --
>> Dr. Michael Hausenblas, Research Fellow
>> LiDRC - Linked Data Research Centre
>> DERI - Digital Enterprise Research Institute
>> NUIG - National University of Ireland, Galway
>> Ireland, Europe
>> Tel. +353 91 495730
>> http://linkeddata.deri.ie/
>> http://sw-app.org/about.html
>>
>> On 16 Aug 2011, at 05:40, Sebastian Hellmann wrote:
>>
>>> Hi Michael and Alex,
>>> sorry to answer so late, I was on holiday in France.
>>> I looked at the three provided resources [1,2,3] and there are still
>>> some comments and questions I have.
>>>
>>> 1. The part after the # is actually not sent to the server. Are
>>> there any solutions for this? It is not really LinkedData friendly.
>>> Compare
>>> http://linkedgeodata.org/triplify/near/51.033333,13.733333/1000/class/Amenity
>>> (Currently not working, but it gives all points within a 1000m radius)
>>>
>>> The client would be required to calculate the subset of triples of
>>> the resource that are addressed.
>>>
>>> 2. [1] is quite basic and they are basically using position and
>>> lines. I made a qualitative comparison of different fragment id
>>> approaches for text in [4] slide 7.
>>> I was wondering if anybody has researched such properties of URI
>>> fragments. Currently, I am benchmarking the stability of these URIs
>>> using Wikipedia changes.
>>> Has such work been done before?
>>>
>>> 3. @Alex: In my opinion, your proposed fragment ontology can only
>>> be used to provide documentation for different fragments.
>>> I would rather propose to just use one triple:
>>> <http://www.w3.org/DesignIssues/LinkedData.html#offset__14406-14418>
>>> a <http://nlp2rdf.lod2.eu/schema/string/OffsetBasedString>
>>> The ontology I made for Strings might be generalized for formats
>>> other than text based [5]
>>> One triple is much shorter. As you can see I also tried to encode
>>> the type of fragment right into the fragment "offset", although a
>>> notation like "type=offset" might be better.
>>>
>>> 4. @Michael: is there some standardisation regarding URIs for
>>> text going on?
>>> I heard there would be a Language Technology W3C group. The approach
>>> by Wilde and Dürst[1] seems to lack stability.
>>> Do you think we could do such standardisation for document fragments
>>> and text fragments within the Media Fragments Group[3] ?
>>> I really thought the liveUrl project was quite good, but it seems
>>> dead[6].
>>>
>>>
>>> In LOD2[7] and NIF[8] we will need some fragment identifiers to
>>> standardize NLP tools for the LOD2 stack.
>>> It would be great to reuse stuff instead of starting from scratch. I
>>> had to extend [1] for example, because it did not produce stable
>>> URIs and also it did not contain the type of algorithm used to
>>> produce the URI.
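As a rough illustration of points 1 and 3, the following Python sketch splits the example URI into the part a client would actually request and the fragment it has to interpret itself, then reads the type and offsets out of the fragment. The `type__begin-end` layout is only inferred from the single example URI in the message, and `parse_typed_fragment` and `FRAGMENT_PATTERN` are hypothetical names, not part of any published scheme.

```python
import re
from urllib.parse import urldefrag

# Assumed layout "<type>__<begin>-<end>", inferred from the one example URI.
FRAGMENT_PATTERN = re.compile(r"^(?P<kind>[A-Za-z]+)__(?P<begin>\d+)-(?P<end>\d+)$")


def parse_typed_fragment(uri: str):
    """Split a URI into (document URI, fragment type, begin, end).

    Only the document URI would be sent in an HTTP request; the fragment
    stays on the client, which has to resolve it locally.
    """
    document, fragment = urldefrag(uri)
    match = FRAGMENT_PATTERN.match(fragment)
    if match is None:
        raise ValueError(f"unrecognised fragment: {fragment!r}")
    return document, match["kind"], int(match["begin"]), int(match["end"])


uri = "http://www.w3.org/DesignIssues/LinkedData.html#offset__14406-14418"
document, kind, begin, end = parse_typed_fragment(uri)
print(document)          # what the server gets to see
print(kind, begin, end)  # what the client must evaluate itself
```

Carrying the type inside the fragment lets a client pick the right resolution algorithm without an extra lookup, which seems to be the motivation for encoding "offset" in the URI.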
>>>
>>> All the best,
>>> Sebastian
>>>
>>>
>>> [1] http://tools.ietf.org/html/rfc5147
>>> [2] http://tools.ietf.org/html/draft-hausenblas-csv-fragment
>>> [3] http://www.w3.org/TR/media-frags/
>>> [4] http://www.slideshare.net/kurzum/nif-nlp-interchange-format
>>> [5] http://nlp2rdf.lod2.eu/schema/string/
>>> [6] http://liveurls.mozdev.org/index.html
>>> [7] http://lod2.eu
>>> [8] http://aksw.org/Projects/NIF
>>>
>>> On 04.08.2011 22:37, Michael Hausenblas wrote:
>>>>
>>>>
>>>> Alex,
>>>>
>>>>> Has something already done this? Is it even (mostly?) sane?
>>>>
>>>> Sane yes, IMO. Done, sort of, see:
>>>>
>>>> + URI Fragment Identifiers for the text/plain [1]
>>>> + URI Fragment Identifiers for the text/csv [2]
>>>>
>>>> Cheers,
>>>> Michael
>>>>
>>>> [1] http://tools.ietf.org/html/rfc5147
>>>> [2] http://tools.ietf.org/html/draft-hausenblas-csv-fragment
>>>>
>>>> --
>>>> Dr. Michael Hausenblas, Research Fellow
>>>> LiDRC - Linked Data Research Centre
>>>> DERI - Digital Enterprise Research Institute
>>>> NUIG - National University of Ireland, Galway
>>>> Ireland, Europe
>>>> Tel. +353 91 495730
>>>> http://linkeddata.deri.ie/
>>>> http://sw-app.org/about.html
>>>>
>>>> On 4 Aug 2011, at 14:22, Alexander Dutton wrote:
>>>>
>>>>>
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA1
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Say I have an XML document, <http://example.org/something.xml>, and I
>>>>> want to talk about some part of it in RDF. As this is XML, being
>>>>> able to point into it using XPath sounds ideal, leading to
>>>>> something like:
>>>>>
>>>>> <#fragment> a fragment:Fragment ;
>>>>>     fragment:within <http://example.org/something.xml> ;
>>>>>     fragment:locator "/some/path[1]"^^fragment:xpath .
>>>>>
>>>>> (For now we can ignore whether we wanted a nodeset or a single node,
>>>>> and how to handle XML namespaces.)
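For illustration, a client could resolve such an XPath locator against the fetched document roughly as follows. This is a sketch using Python's standard `xml.etree.ElementTree`, which implements only a subset of XPath; the sample XML and the function name `resolve_xpath_locator` are made up for this example, and namespaces and the nodeset question are ignored, as in the message.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the content of <http://example.org/something.xml>.
SAMPLE = "<some><path>first</path><path>second</path></some>"


def resolve_xpath_locator(xml_text: str, locator: str):
    """Return the elements selected by an absolute, simple XPath locator.

    ElementTree evaluates paths relative to an element only, so the
    document root is placed under a synthetic parent and "/a/b" is
    rewritten to "./a/b".
    """
    if not locator.startswith("/"):
        raise ValueError("expected an absolute locator such as /some/path[1]")
    root = ET.fromstring(xml_text)
    holder = ET.Element("holder")  # synthetic parent, not part of the data
    holder.append(root)
    return holder.findall("." + locator)


for element in resolve_xpath_locator(SAMPLE, "/some/path[1]"):
    print(element.tag, element.text)
```

A full XPath 1.0 engine would be needed for anything beyond simple child steps and position predicates.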
>>>>>
>>>>> More generally, we might want other ways of locating fragments
>>>>> (probably with a datatype for each):
>>>>>
>>>>> * character offsets / ranges
>>>>> * byte offsets / ranges
>>>>> * line numbers / ranges
>>>>> * some sub-rectangle of an image
>>>>> * XML node IDs
>>>>> * page ranges of a paginated document
>>>>>
>>>>> Some of these will be IMT-specific and may need some more thinking
>>>>> about, but the idea is there.
>>>>>
>>>>>
>>>>> Has something already done this? Is it even (mostly?) sane?
>>>>>
>>>>>
>>>>> Yours,
>>>>>
>>>>> Alex
>>>>>
>>>>>
>>>>> NB. Our actual use-case is having pointers into an NLM XML file
>>>>> (embodying a journal article) so we can hook up our in-text reference
>>>>> pointer¹ URIs to the original XML elements (<xref/>s) they were
>>>>> generated from. This will allow us to work out the context of each
>>>>> citation for use in further analysis of the relationship between the
>>>>> citing and cited articles.
>>>>>
>>>>> ¹ See
>>>>> <http://opencitations.wordpress.com/2011/07/01/nomenclature-for-citations-and-references/>
>>>>> for an explanation of the terminology.
>>>>>
>>>>> - -- Alexander Dutton
>>>>> Developer, data.ox.ac.uk, InfoDev, Oxford University Computing
>>>>> Services
>>>>> Open Citations Project, Department of Zoology, University
>>>>> of Oxford
>>>>> -----BEGIN PGP SIGNATURE-----
>>>>> Version: GnuPG v1.4.11 (GNU/Linux)
>>>>> Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/
>>>>>
>>>>> iEYEARECAAYFAk46nS4ACgkQS0pRIabRbjDVZQCdGblvoMgNqEietlE5EwAkPJY8
>>>>> pikAn2KApM0HjcXj6TZegA+Dek/DJIQX
>>>>> =UcCr
>>>>> -----END PGP SIGNATURE-----
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Dipl. Inf. Sebastian Hellmann
>>> Department of Computer Science, University of Leipzig
>>> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
>>> Research Group: http://aksw.org
>>
>>
>
>

--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
Received on Tuesday, 16 August 2011 08:46:05 UTC