Re: Markup for quality from Felix Sasaki on 2013-06-10 (public-i18n-its-ig@w3.org from June 2013)

From: Felix Sasaki <fsasaki@w3.org>
Date: Mon, 10 Jun 2013 20:05:01 +0200
To: joerg@bioloom.de
CC: Jirka Kosek <jirka@kosek.cz>, Arle Lommel <arle.lommel@dfki.de>, public-i18n-its-ig@w3.org, kim_harris@textform.com, Hans Uszkoreit <uszkoreit@dfki.de>, Aljoscha Burchardt <aljoscha.burchardt@dfki.de>
Message-ID: <51B6154D.5030906@w3.org>

Hi Arle, all,

Arle, thanks a lot for discussing one issue I mentioned at
http://lists.w3.org/Archives/Public/public-i18n-its-ig/2013Jun/0000.html

What are your thoughts about the other issues 1) and 3)?

With respect to 1) and also the markup issue 2), it would be great to
have the spec draft for MQM available. Is that possible?

Now, about the overlapping markup: as Jirka mentioned, each solution has
drawbacks. Below are two proposals that try to live with these.

1) Using NIF: You are correct that the NIF solution has the drawback of
fixed character offsets. In an XSLT stylesheet that converts markup to
NIF (here only for "text analysis", but I can easily create one for LQI)
http://www.w3.org/People/fsasaki/its20-general-processor/tools/its-ta-2-nif.xsl
I have resolved this by having a white space stripping step
http://www.w3.org/People/fsasaki/its20-general-processor/tools/stripping.xsl
that is also included in the backconversion from NIF
http://www.w3.org/People/fsasaki/its20-general-processor/tools/nif-2-its-ta.xsl
so all your editing environment needs to check is whether the textual
content of the annotated string has changed. I assume that during a
quality check step the non-white-space content must not be changed
anyway, so such a check is easy to do.
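
As an illustration of such a check, here is a minimal Python sketch. It
is not a port of stripping.xsl; the function names and the sample
strings are assumptions made for this example. It compares the text
content of a fragment before and after annotation, ignoring all white
space and all markup:

```python
import xml.etree.ElementTree as ET


def non_whitespace_text(xml_fragment: str) -> str:
    """Return the text content of an XML fragment with all white space removed."""
    root = ET.fromstring(xml_fragment)
    # itertext() yields every text node in document order; split() drops white space.
    return "".join("".join(root.itertext()).split())


def text_unchanged(before: str, after: str) -> bool:
    """True if both fragments carry the same non-white-space characters."""
    return non_whitespace_text(before) == non_whitespace_text(after)


# Hypothetical sample data based on the sentence discussed in this thread.
original = "<p>Fifteen <em>relays is</em> involved in the operation.</p>"
annotated = (
    '<p>Fifteen <span its-loc-quality-issue-type="grammar"><em>relays\n'
    "is</em></span> involved in the operation.</p>"
)
edited = "<p>Fifteen <em>relays are</em> involved in the operation.</p>"

print(text_unchanged(original, annotated))  # markup and line breaks differ, text does not
print(text_unchanged(original, edited))     # "is" was replaced by "are"
```

If the check fails, offsets stored outside the document can no longer be
trusted and the annotation would have to be redone for that string.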


2) Using a copy of the same content for overlapping annotations
Another solution is to copy the strings that have issues so that users 
can annotate overlapping issues in parallel. This is what we did in the 
class in Potsdam: having an XLIFF file with several "alt-trans"
elements. Each of these contains the same translation, but different LQI
markup - see the attachment. So your example would look like this:

1st annotation
<p>Fifteen <span its-loc-quality-issue-type="..." ...><em>relays 
is</em></span> involved in the operation.</p>
2nd annotation
<p>Fifteen <em>relays <span itsx-mqm-issue-type="agreement"  
...>is</span></em> involved in the operation.</p>

Of course this has the drawback that you need to copy the original text.
But, like NIF, it has the advantage that you can make use of the
document tree structure (see below) and, unlike NIF, you can stay
in the markup world.
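
To show how the parallel copies can be combined again, here is a small
Python sketch. The issue type values "markup" and "grammar" and all
function names are assumptions for illustration (the attribute values
are elided in the example above), and the second copy uses
its-loc-quality-issue-type for simplicity. It reads each copy with a
normal XML parser and reports every annotated span as offsets over the
white-space-free text, so overlapping issues end up in one list:

```python
import xml.etree.ElementTree as ET

ISSUE_ATTR = "its-loc-quality-issue-type"


def squeeze(text):
    """Drop all white space; None (no text) counts as empty."""
    return "".join((text or "").split())


def issue_ranges(xml_fragment):
    """Return (start, end, issue type) offsets over the white-space-free text."""
    ranges = []
    pos = 0

    def walk(elem):
        nonlocal pos
        start = pos
        pos += len(squeeze(elem.text))
        for child in elem:
            walk(child)
            pos += len(squeeze(child.tail))  # text after the child belongs to elem
        if elem.get(ISSUE_ATTR):
            ranges.append((start, pos, elem.get(ISSUE_ATTR)))

    walk(ET.fromstring(xml_fragment))
    return ranges


# Two copies of the same sentence, each with one (hypothetical) annotation.
copy1 = ('<p>Fifteen <span its-loc-quality-issue-type="markup"><em>relays '
         'is</em></span> involved in the operation.</p>')
copy2 = ('<p>Fifteen <em>relays <span its-loc-quality-issue-type="grammar">'
         'is</span></em> involved in the operation.</p>')

merged = sorted(issue_ranges(copy1) + issue_ranges(copy2))
print(merged)
```

Each copy stays a well-formed tree, so the usual XML tooling applies to
it, and the overlap only becomes visible in the merged offset list.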

3) milestone proposal
Your original solution and also the processing instruction proposal fall 
under the category of "milestones", see
http://conferences.idealliance.org/extreme/html/2004/Witt01/EML2004Witt01.html#t2-1-1
"milestone elements: empty elements that mark the boundaries between
elements in a non-nesting structure"
The problem is that with this solution you lose the document tree
structure and have the following problems:

a) As Jirka said, no document structure restriction can be stated: the
related start and end "id" attributes can appear anywhere in the
document. How do you want to validate the annotations?
b) Processing the milestones: in approaches 1) and 2) it is easy to
do an analysis like this: "give me all HTML 'a' elements that have an
issue". You can write an XPath expression like
//h:a[@its-loc-quality-issue-type] and you are done. With your approach
that is not feasible: you cannot easily check "is there a milestone in
'a'?", since there is no guarantee that both the start and the end
milestone are inside 'a'. Such checks are doable, but the performance
will suffer a lot.
c) Given b), you will need a special purpose annotation tool to work
with the regions. A general XML editor can handle solutions 1) (in
combination with XSLT) and 2). For solution 3) you will need special
purpose tools - do you expect widespread adoption and good
interoperability between these?
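
The difference described in b) can be sketched in Python as follows.
The sample documents, the link targets and the helper names are
assumptions for illustration only; the milestone element names follow
the example quoted below. With nested markup one path expression is
enough, while with milestones the code has to walk the whole document
in order and keep track of which issues are currently open:

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
NS = {"h": XHTML}
MILESTONES = {"mqm-startIssue", "mqm-endIssue"}

# Nested markup: the annotation sits on the element itself.
nested = ET.fromstring(
    f'<p xmlns="{XHTML}">See <a href="#r" its-loc-quality-issue-type="grammar">'
    'relays is</a> and <a href="#o">operation</a>.</p>'
)
hits = nested.findall(".//h:a[@its-loc-quality-issue-type]", NS)
print([a.get("href") for a in hits])

# Milestones: start and end markers may sit anywhere in the document.
milestones = ET.fromstring(
    f'<p xmlns="{XHTML}">See <mqm-startIssue id="1"/><a href="#r">relays</a> '
    'is<mqm-endIssue id="1"/> and <a href="#o">operation</a>.</p>'
)


def local(elem):
    """Element name without the namespace part."""
    return elem.tag.split("}")[-1]


def links_touching_issues(root):
    """Return href of every 'a' that starts inside, or contains part of, an issue."""
    open_issues = set()
    found = []
    for elem in root.iter():  # document order
        name = local(elem)
        if name == "mqm-startIssue":
            open_issues.add(elem.get("id"))
        elif name == "mqm-endIssue":
            open_issues.discard(elem.get("id"))
        elif name == "a":
            inner = any(local(d) in MILESTONES for d in elem.iter() if d is not elem)
            if open_issues or inner:
                found.append(elem.get("href"))
    return found


print(links_touching_issues(milestones))
```

The milestone version has to visit every node and, for each link, its
descendants as well, which is the kind of extra work that becomes
noticeable on large documents.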

The "processing of milestones" argument, including performance, would be
the killer argument for me. I have seen a few tools implementing
milestones and similar solutions, similar in the sense of breaking up
the document structure. None of these could deal with large data sets,
and none of them found much adoption (this was in the area of
linguistic data sets, btw). Since both 1) and 2) rely on standard XML
processing, you won't run into such issues.


Best,

Felix


On 10.06.13 13:38, Jörg Schütz wrote:
> +1 for PIs. We have always used PIs in language checking applications 
> (spelling mistakes, grammar and style errors), and they worked very well.
>
> Cheers -- Jörg
>
> On June 10, 2013 at 12:03 (CEST), Jirka Kosek wrote:
>> On 10.6.2013 11:26, Arle Lommel wrote:
>>
>>> We need a way to mark up overlapping spans. For example, if you have 
>>> the following HTML5 segment:
>>
>> Haha, overlapping markup. There are several common ways how to handle
>> overlapping markup. But if you want to stick to XML syntax and
>> datamodel, no solution will be completely perfect and elegant.
>>
>>> The mapping from MQM to ITS 2.0 is clear here, but we need a way to 
>>> mark up the overlapping spans. So far we have internally used 
>>> something like this:
>>>
>>> <p>Fifteen <mqm-startIssue type="markup, misplaced" id="1" 
>>> /><em>relays <mqm-startIssue type="agreement" id ="2" 
>>> />is</em><mqm-endIssue id="1" /> involved<mqm-endIssue id="2" /> in
>>> the operation.</p>
>>>
>>> We want a good path to interoperability with ITS. So we need a way 
>>> to put the following information in the document on overlapping 
>>> spans using local markup:
>>>
>>> its-loc-quality-issue-type="grammar" itsx-mqm-issue-type="agreement" 
>>> its-loc-quality-comment="should be &quot;relays are&quot;" (etc…)
>>>
>>> Any suggestions for how to handle this use case? We want to make it 
>>> as easy as possible to use MQM and ITS together, where MQM provides 
>>> mechanisms for greater granularity while still retaining 
>>> compatibility with ITS and ITS provides a way to share MQM data at a 
>>> common granularity with other systems.
>>
>> Your example above means that elements like <mqm-startIssue/> and
>> <mqm-endIssue/> have to be allowed anywhere in document which requires
>> change in schema. For this reason it's very common to use processing
>> instructions instead -- they are opaque to schema validation and no
>> schema change is necessary, for example:
>>
>> <p>Fifteen <?mqm-startIssue type="markup, misplaced" id="1" ?><em>relays
>> <?mqm-startIssue type="agreement" id ="2" ?>is</em><?mqm-endIssue id="1"
>> ?> involved<?mqm-endIssue id="2" ?> in the operation.</p>
>>
>>                     Jirka
>>
>
Attachments

text/xml attachment: multiple-annotations.xml
Received on Monday, 10 June 2013 18:05:38 UTC