- From: Phil Ritchie <philr@vistatec.ie>
- Date: Tue, 19 Mar 2013 15:56:45 +0000
- To: Felix Sasaki <fsasaki@w3.org>
- Cc: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
- Message-ID: <OFFCC96CAF.9E585940-ON80257B33.0056933A-80257B33.00579828@vistatec.ie>
Felix, All,
A question: does the id of an enclosing <script /> element need to be the
same as the xml:id of the ITS element it encloses? E.g.
<script type="application/its+xml" id="lq0">
<its:locQualityIssues xmlns:its="http://www.w3.org/2005/11/its"
xml:id="lq0">
<its:locQualityIssue locQualityIssueType="non-conformance"
locQualityIssueSeverity="75.7961783439491"></its:locQualityIssue>
</its:locQualityIssues>
</script>
I suspect not.
That being the case, I'm not convinced that having the script-enclosed
metadata point to the spans saves a significant amount of serialized
footprint.
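For anyone who wants to check their own files, here is a minimal sketch (not
part of any ITS tooling) that compares the id of each
<script type="application/its+xml"> element with the xml:id of the standoff
element it wraps. The class and function names are my own, and it only
reports the two values; it makes no claim about whether they must match.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predeclared XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


class ItsScriptCollector(HTMLParser):
    """Collect (script id, raw XML body) for each application/its+xml script."""

    def __init__(self):
        super().__init__()
        self.found = []
        self._in_its = False
        self._script_id = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "script" and attributes.get("type") == "application/its+xml":
            self._in_its = True
            self._script_id = attributes.get("id")
            self._chunks = []

    def handle_data(self, data):
        # Script content arrives as raw text, possibly in several chunks.
        if self._in_its:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_its:
            self.found.append((self._script_id, "".join(self._chunks)))
            self._in_its = False


def id_pairs(html):
    """Return [(script id, xml:id of the enclosed root element), ...]."""
    collector = ItsScriptCollector()
    collector.feed(html)
    collector.close()
    pairs = []
    for script_id, body in collector.found:
        root = ET.fromstring(body.strip())
        pairs.append((script_id, root.get(XML_ID)))
    return pairs


SAMPLE = """
<script type="application/its+xml" id="lq0">
<its:locQualityIssues xmlns:its="http://www.w3.org/2005/11/its"
xml:id="lq0">
<its:locQualityIssue locQualityIssueType="non-conformance"
locQualityIssueSeverity="75.7961783439491"></its:locQualityIssue>
</its:locQualityIssues>
</script>
"""

for script_id, standoff_id in id_pairs(SAMPLE):
    print(script_id, standoff_id, script_id == standoff_id)
```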
Phil.
From: Felix Sasaki <fsasaki@w3.org>
To: "public-multilingualweb-lt@w3.org"
<public-multilingualweb-lt@w3.org>,
Date: 25/02/2013 18:02
Subject: Standoff experiment plus observations
Hi all,
Christian, Marcis and Tadej know this (apologies for the repetition) - but
I thought others might be interested too.
I played a bit with the NERD API
http://nerd.eurecom.fr/documentation#nerdapi
1) I generated ITS "tan" via 4 annotation engines that can be accessed
through the API: dbpedia spotlight, extractiv, lupedia, yahoo.
2 a) I also created a *non* ITS "tan" standoff version, see
multiple-ann-with-id-plus-standoff.html . It relies on ID attributes, and
the standoff annotations point to the IDs. This is the approach that we
had discussed a while ago on the mailing list.
2 b) The file multiple-ann-with-standoff-refs-script.html uses our current
localization quality issue and provenance standoff approach, that is:
pointing from the content to annotations, here via an artificial x-t-ref
attribute.
From 2), I learned various things:
- Making sure that standoff works requires a known workflow, "known" esp.
with regards to white space handling. Otherwise the multiple annotation
engines produce differing character offsets for the same text. So from
this, having a recommendation to leave standoff processing to NIF makes a
lot of sense.
- The non ITS standoff representation (see
multiple-ann-with-id-plus-standoff.html) has the merit that a human
consumer who doesn't know anything about NIF et al. (= somebody in an XML
based localization workflow or looking at an HTML document) can look into
the annotations and choose: Hover over the green spans of text, e.g. over
"St Peter" as part of " held in St Peter's Basilica. ". The annotation
from extractiv holds a more specific "its-class-ref" than the one from
dbpedia spotlight. But only dbpedia spotlight holds an "its-ident-ref". So
a human user consuming these annotations gets the most value by combining
them.
- Developing applications based on the output of multiple engines is
pretty straightforward for non NLP / NIF people if you have the output
represented in an easy to digest format (JSON, XML, ...). I won't argue
for standardizing that format and creating ITS "tan" standoff (we had that
discussion). I'm mentioning this just because the merit of the annotations
in the long term might grow if Web developers face a low barrier for
widespread app development.
- A thought I had during today's discussion of the XLIFF mapping: having
the external standoff pointing to IDs might be a way to solve the XLIFF
representation issue of "mrk": here the issue is again (it seems) that you
want to apply multiple annotations to the same span of text (the content
of "mrk") - but you can't since the "type" attribute can be only used
once. Externalizing the annotations solves that problem.
- During the discussion of multiple annotations a while ago we also
touched upon the "direction" of the standoff: from outside to IDs (see
multiple-ann-with-id-plus-standoff.html and 2a), or from the document to
the standoff (current loc quality issue / provenance, see 2b) above).
Pointing from the document (= 2b) has the drawback in HTML that you need a
separate "script" element for each target - whereas in the case of 2a) you
only need one script element. So for 2a) in total there are 58 elements,
and 2b) has 101 elements.
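To make the "outside to IDs" direction of 2a) more concrete, here is a rough
sketch of how an application might combine standoff annotations from several
engines once they are available as plain records. The record layout, the
"target" and "engine" keys, the ID "s1" and the example.org URIs are all
invented for illustration; they are not taken from the attached files or from
the NERD API output.

```python
from collections import defaultdict

# Hypothetical standoff records: each one points at the ID of a span in the
# document. All values below are placeholders, not real engine output.
STANDOFF = [
    {
        "target": "s1",
        "engine": "dbpedia-spotlight",
        "its-class-ref": "http://example.org/class/Place",
        "its-ident-ref": "http://example.org/entity/St_Peters_Basilica",
    },
    {
        "target": "s1",
        "engine": "extractiv",
        "its-class-ref": "http://example.org/class/Church",
    },
]


def group_by_target(records):
    """Group annotation values by target ID, then attribute, then engine."""
    grouped = defaultdict(dict)
    for record in records:
        for key, value in record.items():
            if key in ("target", "engine"):
                continue
            per_engine = grouped[record["target"]].setdefault(key, {})
            per_engine[record["engine"]] = value
    return dict(grouped)


combined = group_by_target(STANDOFF)
for attribute, per_engine in combined["s1"].items():
    print(attribute, per_engine)
```

Because the records only refer to IDs, any number of them can target the same
span, which is the property that would also sidestep the single "type"
attribute on "mrk".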
FYI: with the above observations I won't push for anything - just sharing
my experience to see what others think.
Best,
Felix [attachment "multiple-ann-with-id-plus-standoff.html" deleted by
Phil Ritchie/VISTATEC] [attachment
"multiple-ann-with-standoff-refs-script.html" deleted by Phil
Ritchie/VISTATEC]
Received on Tuesday, 19 March 2013 15:57:21 UTC