RE: idValue requirement updated from Yves Savourel on 2012-05-02 (public-multilingualweb-lt@w3.org from May 2012)

From: Yves Savourel <ysavourel@enlaso.com>
Date: Wed, 2 May 2012 04:51:16 -0600
To: "'David Lewis'" <dave.lewis@cs.tcd.ie>, <public-multilingualweb-lt@w3.org>
Message-ID: <assp.046975e1fe.assp.0469e145d4.002b01cd2851$803632b0$80a29810$@com>
Hi Dave,

 

I may be wrong, but it looks to me that an ID value that can be maintained over the course of several modifications of the source content it identifies cannot be auto-generated with certainty. The six cascading methods you described come close to achieving that but, as far as I can tell, can’t guarantee it.

 

Such an ID probably has to exist in the source document and be maintained by it.

 

In addition, you are talking about a segment ID rather than an ID on some existing unit of the source document. This means the possible addition of markup in the middle of plain-text paragraphs. All this would work if we had some extra layer of identification on top of the original content, something partially similar to xml:tm. But I’m doubtful it’s realistic for most formats.

 

Maybe the best the idValue data category could do is convey information about existing ways in the original document to get a unique and maintainable ID for a given node. In other words, #1 and #2 of your cascading rules. That is already something important.

 

Maybe exploring ways to build additional identification information is worthwhile, though. Being able to create some “companion IDs” using the content around a given node would certainly be useful for some localization tasks like perfect matching. I could imagine a use for some rule that would generate “context values” that could be stored in TM repositories and re-used across different tools.
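
To make the “context value” idea a bit more concrete, here is a minimal sketch, not a proposal for the data category itself. It assumes the content has already been split into an ordered list of segment strings, and it derives a value from a segment plus its immediate neighbours. The function name, the choice of SHA-256, and the one-segment window are all hypothetical choices for illustration.

```python
import hashlib

def context_value(segments, i):
    """Hash segment i together with its previous and next segments.

    Hypothetical helper: the one-segment window and the hash function
    are assumptions for illustration only.
    """
    prev_text = segments[i - 1] if i > 0 else ""
    next_text = segments[i + 1] if i + 1 < len(segments) else ""
    # Join with a separator that is unlikely to occur in the text itself.
    joined = "\x1f".join((prev_text, segments[i], next_text))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]

# The same string "OK" in two different contexts gets two different values,
# while an unchanged context yields the same value again.
dialog_a = ["Save changes?", "OK", "Cancel"]
dialog_b = ["Delete file?", "OK", "Cancel"]
value_a = context_value(dialog_a, 1)
value_b = context_value(dialog_b, 1)
```

A value like this changes as soon as a neighbour is edited, so it complements an ID maintained in the source rather than replacing it.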

 

Cheers,

-yves

 

 

From: David Lewis [mailto:dave.lewis@cs.tcd.ie] 
Sent: Tuesday, May 01, 2012 2:54 PM
To: public-multilingualweb-lt@w3.org
Subject: Re: idValue requirement updated

 

Hi Yves,
I think you are right about me thinking more along the lines of XLIFF id rather than XLIFF resname, but perhaps not exactly in the way you characterise it. 

I am thinking in terms of an id that can be used to track the progress of a specific segment in the content documents (let's park the use of multi-segment translation units in XLIFF for the moment) against the corresponding XLIFF id. However, I'm specifically concerned with the round-trip use cases where the document may pass from a CMS to an XLIFF cycle and back again several times. The use cases I see for this are driven by the need for more continuous translation, pipelined at the granularity of the segment rather than the document, rather than one-off hand-overs of documents between processes. Possible use cases might be:

1) A document is having its source revised and is being translated at the same time. Readiness of different elements is signalled in the document using the readiness/processTrigger data category, which is monitored by an LSP which provides updates of segments to be translated based on these flags and distributes translations using XLIFF. Consistent mapping between all segments and XLIFF translation unit ids is required to ensure that new, modified and deleted trans-units are correctly updated and kept in sequence.

2) Translations from one LSP may be undergoing monolingual review through direct access to the target on the CMS, while selected bilingual translation review is being conducted in parallel by another LSP. Feedback from both reviews may need to be routed back to the translating LSP, so document element-to-XLIFF mappings would need to be reliably maintained for the two sets of XLIFF ids operated by two different LSPs.
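
As a rough illustration of why use case 1 depends on a consistent mapping, the sketch below assumes each version of a document can be reduced to a dictionary from a persistent segment id to its source text. The function name and the data layout are hypothetical; real content would come from the CMS and the XLIFF files.

```python
def diff_segments(old, new):
    """Classify segment ids as new, modified or deleted between two versions.

    old, new: dicts mapping a persistent segment id to its source text
    (an assumed layout, for illustration only).
    """
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    modified = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"new": added, "modified": modified, "deleted": deleted}

version_1 = {"s1": "Welcome.", "s2": "Click Start.", "s3": "Goodbye."}
version_2 = {"s1": "Welcome.", "s2": "Press Start.", "s4": "See the manual."}
changes = diff_segments(version_1, version_2)
```

If the ids were regenerated on every pass, every segment would show up as both deleted and new, and the modified set would always be empty.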

In these sorts of use cases, where there is ongoing round-tripping between the CMS and TMS/XLIFF, the need for consistent mapping between the source document on the CMS and the versions LSPs have may soften the assumption that clients won't be willing to add additional elements to the document on the CMS. One can imagine that any augmented versions of the content documents would live on a 'staging' CMS while they are subject to preparation, translation and review, but prior to publication.

So, this implies a need for an id that is indeed relevant just to the localization process, but that nevertheless needs to support a persistent mapping between CMS element and trans-unit ID, potentially over several CMS-TMS roundtrips. The difference from resname, as I understand it, is that resname is optional and in a sense best effort: if you can't map a trans-unit back to a particular element in the source, you can still try to translate the string, you just lose some contextual info. So it doesn't have the requirement to comprehensively maintain a mapping between all trans-units and source content elements in the way I think the above use cases require.

Hope that explains the requirement I had in mind a bit more clearly.

Finally, I'm not sure that in any of these cases we are talking about an explicit id data category, are we?

Would the implementation in fact be rules for generating and maintaining the mapping between source elements and XLIFF ids? Very speculatively, these could be expressed as some cascading rules for using: 1st) existing ids if present; 2nd) combo rules of ID and element names as in your updated text; 3rd) if allowed, new ids in existing elements; 4th) if allowed, new elements with specific ids; 5th) some sort of external hashing pointer (e.g. http://nlp2rdf.org/nif-1-0#toc-nif-recipe-context-hash-based-uris); 6th) some sort of character count-based pointer (e.g. http://nlp2rdf.org/nif-1-0#toc-nif-recipe-offset-based-uris). It would be a ruleset applicable to the document that we would need to record.
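
To show the shape such a cascade might take, here is a minimal sketch covering only rules 1, 3 and 5, applied to XML parsed with Python's standard xml.etree.ElementTree. The function name, the "loc-" and "hash-" prefixes, and the plain SHA-1 of the element text are all assumptions for illustration; in particular the hash is a simplified stand-in and not the NIF recipe linked above.

```python
import hashlib
import xml.etree.ElementTree as ET

def resolve_id(elem, index, allow_new_ids=False):
    """Return (rule_name, id) for elem using a partial cascade.

    Only rules 1, 3 and 5 are sketched; the names and id formats
    are hypothetical.
    """
    # Rule 1: reuse an existing id if the element has one.
    existing = elem.get("id")
    if existing:
        return ("existing-id", existing)
    # Rule 3: if allowed, write a new id into the existing element.
    if allow_new_ids:
        new_id = f"loc-{index}"
        elem.set("id", new_id)
        return ("new-id", new_id)
    # Rule 5: fall back to an external pointer based on a content hash.
    text = "".join(elem.itertext()).strip()
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
    return ("content-hash", f"hash-{digest}")

doc = ET.fromstring('<body><p id="intro">Hello</p><p>World</p></body>')
results = [resolve_id(p, i) for i, p in enumerate(doc.iter("p"))]
```

The rule name returned with each id is the kind of information that would need to be recorded with the document, so that a later pass applies the same rule to the same element.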

cheers,
Dave
 



On 01/05/2012 15:20, Yves Savourel wrote: 

I guess what we need to clarify is what the requirements are for the ID value we are discussing.
 
To me it should be:
- unique at least within the document
- the value should be the same in new versions of the document
 
That's because the type of tasks I would use it for are tasks across versions of the same document.
 
But Dave, you are maybe thinking of something different: how to get an ID valid for a given document during its localization cycle. In other words, a value that doesn't need to survive after the document is done.
 
In other words you are thinking XLIFF 'id' and I'm thinking XLIFF 'resname'.
 
Cheers,
-ys
Received on Wednesday, 2 May 2012 10:51:44 UTC