Re: [ISSUE 34] Quality error category proposal from Arle Lommel on 2012-07-18 (public-multilingualweb-lt@w3.org from July 2012)

From: Arle Lommel <arle.lommel@dfki.de>
Date: Wed, 18 Jul 2012 12:17:38 +0200
To: Yves Savourel <ysavourel@enlaso.com>
Cc: <public-multilingualweb-lt@w3.org>
Message-Id: <A08AF3A9-FCE7-4FA0-837A-81E567D8BFF5@dfki.de>

Yves, have you become a vampire? We certainly didn't expect a response at 03:31 your time!

On Jul 18, 2012, at 11:31, Yves Savourel wrote:

>>> - when is "later"? after the summer or for ITS 3.0?
>> 
>> We're actually starting work on this topic in another DFKI 
>> project now, but I do not anticipate seeing anything 
>> suitable until later this year at the earliest 
>> (certainly later than September).
> 
> My question is: would the final definition of the additional info be in ITS 2.0?
> I'm assuming yes.

Depends on what "additional info" means. If you mean the definitions of errors and so forth, then no, that would be out of scope for ITS 2.0. If you mean the structure of the machine-readable info, then still probably not, as that would also be out of scope. However, I could see that being defined as a standard in OASIS or elsewhere (in fact, I think it would need to happen).

>> I think the URI pointer will be vital, even if it 
>> doesn't take the form described, because we will 
>> still need a way to point to the machine-readable
>> definition of the error, however it will be defined.
> 
> If the info the URI is pointing to is to be part of ITS 2.0, then I have no problem with the model. Actually it may even make the XLIFF 2.0 mapping a lot easier.
> 
> My concern was that if the additional info were not part of ITS 2.0, then having just a very simplistic set of information for quality-error would not be very usable.

In a discussion with Phil today we came to much the same conclusion and he and I have tentatively restored most of the original proposal to deal with the needed complexity.


>>> type/code of error,
>> 
>> This is actually what I see the URL as supplying. Just a 
>> code by itself does not support interoperability. However,
>> if we point to machine-readable information (and define 
>> what that information looks like, which obviously won't 
>> happen in the next week or so), then we can work towards 
>> interoperability. Would that be OK for you, or are 
>> you thinking of (a) a defined picklist or (b) simple native 
>> codes (like the ones you supplied me a while back)? I'm 
>> really hoping that your answer is that you don't want A or B.
> 
> If the goal is to define the full set of information for ITS 2.0, then I have no problem doing it step by step. I just think it shouldn't be left until after ITS 2.0.
> 
> As for the value of the type/code of the issue:
> 
> It seems we keep running into this pattern of needing a main finite list of values for interoperability and at the same time a way to optionally provide user-defined values.
> 
> The category + sub-category model we have talked about several times may work here as well. Actually it would probably work very well.
> 
> The first part of the composite value (the so-called category) would be a pre-defined, finite ITS list. Something like: inline-code, whitespace, grammar, terminology, spelling, date-format, number-format, etc. Any tool can likely decide which of these broad values the specific issue belongs to.
> 
> Then they can, if they want to, supplement this with their more specialized type. That value would be composed of some authority identifier and the actual value, using a QName-like format for example.
> 
> So used together we would have something like:
> 
> issueType="whitespace/enlaso:MISSING_LEADINGWS"
> 
> issueType="inline-code/enlaso:EXTRA_CODE"
> 
> or
> 
> issueType="whitespace" + issueSubType="enlaso:MISSING_LEADINGWS"
> 
> issueType="inline-code" + issueSubType="enlaso:EXTRA_CODE"
> 
> The actual notation using a single attribute or two is secondary. The idea is that the main category is mandatory if the sub-category is used, so tools can always fall back to the broad type of issue.

I think that we'll need something like this. So is your suggestion that the “top level” (issueType in your last examples) be defined in ITS 2.0 as a set of fairly granular categories and issueSubType be left open? I'm a bit worried still about the idea of a static list of issueType values, but if ITS 2.0 could somehow point to a normative and living database that is updated more frequently than ITS 2.0, it might work. (My worry is that if we embed even these coarse values in ITS 2.0, the day after it is finalized someone will come along with something we didn't think of and we then have to rev the spec itself over something simple, whereas if we point to an external database, then the spec is constant and we simply declare a new value in the database.)
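
To make the fallback idea concrete, here is a rough sketch of how a tool might read the single-attribute form of the composite value. Nothing here is defined anywhere yet: the function name, the error handling, and the contents of KNOWN_CATEGORIES (copied from the example list above) are all assumptions for illustration only.

```python
# Hypothetical sketch only: neither this category list nor the function name
# comes from ITS 2.0; the values are the examples quoted in this thread.
KNOWN_CATEGORIES = {
    "inline-code", "whitespace", "grammar", "terminology",
    "spelling", "date-format", "number-format",
}


def parse_issue_type(value):
    """Split 'category/authority:CODE' into (category, authority, code).

    The sub-category part is optional, but the category is always required,
    so a tool that does not know the authority can still use the broad type.
    """
    category, _, sub = value.partition("/")
    if category not in KNOWN_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    if not sub:
        # No sub-category supplied: only the broad category is available.
        return category, None, None
    authority, sep, code = sub.partition(":")
    if not sep or not authority or not code:
        raise ValueError(f"malformed sub-category: {sub!r}")
    return category, authority, code


print(parse_issue_type("whitespace/enlaso:MISSING_LEADINGWS"))
print(parse_issue_type("inline-code"))
```

A consumer that does not recognize the "enlaso" authority would simply ignore the last two items and treat the issue as a generic whitespace problem. If the category list lived in an external database rather than in the spec, KNOWN_CATEGORIES would be loaded from there instead of being hard-coded.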


>>> and a flag indicating if the given issue is active or not.
>> 
>> Good suggestion. What values would you suggest? I think there 
>> is more to it than whether it is active or not. For instance,
>> a reviewer might catch an error and flag it in the file 
>> (making it active). It then goes back to the translator, 
>> who cannot resolve it or needs confirmation about the 
>> proposed resolution, in which case it is still active but 
>> you would treat it differently than in the first case. 
> 
> The simpler the better. Something like enabled="yes|no" would do fine IMO. It just says this issue is currently disabled/enabled; that's all users care about, as far as my experience goes. It's mostly used to flag false positives as the same user re-runs the check after fixing a set of problems.

It would be good to get feedback from Phil on this. If your suggestion meets the needs, then I'm all for simpler :-)


> I would add one more piece of information: an attribute to store a possible suggested replacement text. Quite a few issues can be fixed automatically, or with a simple human validation. That attribute would hold the content to substitute for the content selected for the annotation.

Makes sense from a usability standpoint.
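
As a rough illustration of how the enabled flag and the replacement text could work together in a tool, here is a minimal sketch. The Issue record, its start/end offsets, and the apply_suggestions helper are all hypothetical; nothing in the proposal defines them, and overlapping spans are simply assumed not to occur.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Issue:
    start: int                          # start offset of the annotated span (hypothetical)
    end: int                            # end offset, exclusive (hypothetical)
    enabled: bool = True                # the proposed yes/no flag
    replacement: Optional[str] = None   # the proposed suggested replacement text


def apply_suggestions(text, issues):
    """Apply replacements for enabled issues; spans are assumed not to overlap."""
    # Work from right to left so earlier offsets stay valid after each substitution.
    for issue in sorted(issues, key=lambda i: i.start, reverse=True):
        if issue.enabled and issue.replacement is not None:
            text = text[:issue.start] + issue.replacement + text[issue.end:]
    return text


issues = [
    Issue(start=0, end=3, replacement="the"),
    # Flagged as a false positive by the user, so it is skipped.
    Issue(start=7, end=9, replacement=" ", enabled=False),
]
print(apply_suggestions("teh cat  sat", issues))
```

An issue disabled as a false positive keeps its suggestion in the file but is not applied, which matches the simple yes/no behaviour Yves describes.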


Looking at all this, however, I think that the ITS 2.0 quality model looks like quite the list: its-error, its-errorseverity, its-errorinfo, its-erroragent, its-errorreplacementtext, its-errorname + the ones for declaring the profile.

Because ITS doesn't provide a way to define a hierarchical model in its syntax, each bit actually has to be treated as its own independent data category (at least from the schema point of view)… While we can constrain it via prose, what we don't have is a way to simplify this within ITS 2.0. This is where I would like some guidance from Felix: I suspect this is one reason he wanted to move to use more indirection, since it would let us create one data category in ITS 2.0 (simplifying implementation requirements) and then work out all this other stuff outside ITS 2.0. And I can't fault him for wanting to do that. As it's headed, quality will be half of ITS 2.0.

But let's talk amongst ourselves with Felix when he is back to see what we can come up with. I'm convinced by you and Phil that the "dumb" (single) pointer proposal has its problems, but adding six elemental data categories to handle one task in ITS 2.0 also seems like overkill.

-Arle
Received on Wednesday, 18 July 2012 10:18:11 UTC