- From: Phil Ritchie <philr@vistatec.ie>
- Date: Tue, 19 Mar 2013 15:56:45 +0000
- To: Felix Sasaki <fsasaki@w3.org>
- Cc: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
- Message-ID: <OFFCC96CAF.9E585940-ON80257B33.0056933A-80257B33.00579828@vistatec.ie>
Felix, All,

A question: does the id of an enclosing <script /> element need to be the same as the id of the ITS element it encloses? e.g.

<script type="application/its+xml" id="lq0">
  <its:locQualityIssues xmlns:its="http://www.w3.org/2005/11/its" xml:id="lq0">
    <its:locQualityIssue locQualityIssueType="non-conformance" locQualityIssueSeverity="75.7961783439491"></its:locQualityIssue>
  </its:locQualityIssues>
</script>

I suspect not. That being the case, I'm not convinced that having the script-enclosed metadata point to the spans saves a significant amount of serialized footprint.

Phil.

From: Felix Sasaki <fsasaki@w3.org>
To: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
Date: 25/02/2013 18:02
Subject: Standoff experiment plus observations

Hi all,

Christian, Marcis and Tadej know this (apologies for the repetition) - but I thought others might be interested too. I played a bit with the NERD API http://nerd.eurecom.fr/documentation#nerdapi

1) I generated ITS "tan" via 4 annotation engines that can be accessed through the API: dbpedia spotlight, extractiv, lupedia, yahoo.

2 a) I also created a *non* ITS "tan" standoff version, see multiple-ann-with-id-plus-standoff.html . It relies on ID attributes, and the standoff annotations point to the IDs. This is the approach that we had discussed a while ago on the mailing list.

2 b) The file multiple-ann-with-standoff-refs-script.html uses our current localization quality issue and provenance standoff approach, that is: pointing from the content to annotations, here via an artificial x-t-ref attribute.

From 2), I learned various things:

- Making sure that standoff works requires a known workflow, "known" esp. with regard to white space handling. Otherwise the multiple annotation engines create multiple character offsets. So from this, having a recommendation to leave standoff processing to NIF makes a lot of sense.
- The non ITS standoff representation (see multiple-ann-with-id-plus-standoff.html) has the merit that a human consumer who doesn't know anything about NIF et al. (= somebody in an XML based localization workflow or looking at an HTML document) can look into the annotations and choose: Hover over the green spans of text, e.g. over "St Peter" as part of " held in St Peter's Basilica. ". The annotation from extractiv holds a more specific "its-class-ref" than the one from dbpedia spotlight. But only dbpedia spotlight holds an "its-ident-ref". So a human user consuming these annotations gets the most value by combining them.

- Developing applications based on the output of multiple engines is pretty straightforward for non NLP / NIF people if you have the output represented in an easy to digest format (JSON, XML, ...). I won't argue for standardizing that format and creating ITS "tan" standoff (we had that discussion). I'm mentioning this just because the merit of the annotations in the long term might grow if Web developers face a low barrier for widespread app development.

- A thought I had during today's discussion of the XLIFF mapping: having the external standoff pointing to IDs might be a way to solve the XLIFF representation issue of "mrk": here the issue is again (it seems) that you want to apply multiple annotations to the same span of text (the content of "mrk") - but you can't, since the "type" attribute can be used only once. Externalizing the annotations solves that problem.

- During the discussion of multiple annotations a while ago we also touched upon the "direction" of the standoff: from outside to IDs (see multiple-ann-with-id-plus-standoff.html and 2a), or from the document to the standoff (current loc quality issue / provenance, see 2b) above). Pointing from the document (= 2b) has the drawback in HTML that you need a separate "script" element for each target - whereas in the case of 2a) you only need one script element.
  So for 2a) in total there are 58 elements, and 2b) has 101 elements.

FYI: with the above observations I won't push for anything - just sharing my experience to see what others think.

Best,

Felix

[attachment "multiple-ann-with-id-plus-standoff.html" deleted by Phil Ritchie/VISTATEC]
[attachment "multiple-ann-with-standoff-refs-script.html" deleted by Phil Ritchie/VISTATEC]
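
The observation above about white space handling and character offsets can be illustrated with a minimal sketch. It is not taken from the NERD API or from any of the engines mentioned: the sample sentence, the helper names normalize_whitespace and char_offsets, and the normalization rule (collapse runs of white space into one space) are all assumptions made for illustration. It only shows that the same mention ends up with different offsets depending on how white space was handled before annotation.

```python
import re
from typing import Tuple


def normalize_whitespace(text: str) -> str:
    """Collapse runs of white space into single spaces and trim both ends.

    This is one plausible normalization; real engines may do something else.
    """
    return re.sub(r"\s+", " ", text).strip()


def char_offsets(text: str, mention: str) -> Tuple[int, int]:
    """Return (start, end) character offsets of the first occurrence of mention."""
    start = text.index(mention)
    return start, start + len(mention)


# Hypothetical source text with the indentation and line breaks typical of HTML.
raw = "  The mass was\n    held in St Peter's Basilica.  "
normalized = normalize_whitespace(raw)

# The same mention, located in the raw text and in the normalized text.
raw_span = char_offsets(raw, "St Peter")
norm_span = char_offsets(normalized, "St Peter")

print("raw offsets:       ", raw_span)
print("normalized offsets:", norm_span)
```

Two engines that disagree on this step would report different offsets for "St Peter", which is why an offset-based standoff only works when the white space handling in the workflow is known in advance. Pointing at IDs, as in 2a), sidesteps the problem.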
Received on Tuesday, 19 March 2013 15:57:21 UTC