Re: Practical question: elements or attributes? from David Lewis on 2012-04-18 (public-multilingualweb-lt@w3.org from April 2012)

From: David Lewis <dave.lewis@cs.tcd.ie>
Date: Wed, 18 Apr 2012 20:29:51 +0100
To: Arle Lommel <arle.lommel@gmail.com>
CC: Multilingual Web LT Public List <public-multilingualweb-lt@w3.org>
Message-ID: <4F8F162F.3010700@cs.tcd.ie>
Hi Arle,
Thanks, these examples help illuminate the issues, especially how we 
need such a structured attribute naming approach if we are not using 
namespaces, and the issue of how to define co-dependence between 
attributes, which is a major conformance issue.
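
To make the co-dependence point concrete, here is a minimal sketch (Python, 
standard library only) of how a processor might group nested names such as 
its-qa-note by data category and then check a rule of the form "if any 
attribute of this category is present, these fields must be present too". 
The rule table, the function names and the choice of mandatory fields are 
all assumptions made for illustration; none of them comes from an ITS or 
MLW-LT draft.

```python
from collections import defaultdict

# Hypothetical rule table: fields that become mandatory as soon as any
# attribute of the category appears on an element. Illustrative only.
REQUIRED_FIELDS = {
    "qa": {"type", "severity"},
    "processTrigger": {"type", "sourceLang"},
}


def group_by_category(attrs, prefix="its-"):
    """Split names like 'its-qa-note' into {'qa': {'note': value}}."""
    grouped = defaultdict(dict)
    for name, value in attrs.items():
        if not name.startswith(prefix):
            continue  # not one of ours, e.g. 'class' or 'translate'
        category, _, field = name[len(prefix):].partition("-")
        # Single-level names such as 'its-domain' get an empty field name.
        grouped[category][field] = value
    return dict(grouped)


def missing_fields(attrs):
    """Return {category: sorted list of missing mandatory fields}."""
    problems = {}
    for category, fields in group_by_category(attrs).items():
        missing = REQUIRED_FIELDS.get(category, set()) - set(fields)
        if missing:
            problems[category] = sorted(missing)
    return problems
```

Under these assumed rules, an element carrying its-qa-type and its-qa-note 
but no its-qa-severity would be reported as {'qa': ['severity']}. The same 
grouping step also gives a tool a way to pick out only the attributes 
relevant to its own task.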

Your examples also highlight a natural outcome of the expansion of scope 
from internationalisation in ITS1.0 to the whole localisation process 
and to CMS-based localisation in MLW-LT.

In internationalisation, the authors both introduce the ITS attributes 
and have authority to add spans and divs to the document at the same time. 
They therefore have scope to optimise the document structure to 
avoid adding too many such elements just to support ITS attributes (ITS 
has a requirement to limit impact on the document).

As we go through the localisation process in MLW-LT, each segment, and 
potentially each term, becomes subject to new MLW-LT attributes, 
since localisation needs to track every segment, and each may have 
a different status that is of interest. This means that every segment 
potentially needs a span added just to support MLW-LT attributes. This 
will easily lead to the sort of situation Arle exemplifies for _every_ 
segment.

This takes us to where OAXAL (and LISA) went with xml:tm 
(http://www.ttt.org/oscarstandards/xml-tm/xml-tm.html). This suffers 
from loading the document with a lot of markup just for localisation 
purposes, which is not relevant for the later stages of 
content management. Hence my previous points on including support for 
recording such mark-up in a stand-off fashion once the document is past 
the stage where we need the immediacy of attributes being embedded in 
the document. If we are to strip out such spans/divs that were 
introduced purely for housing MLW-LT attributes, we need an attribute to 
mark them as such so the span/div itself can be stripped cleanly.
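
As a rough illustration of that last step, the sketch below (Python, 
standard library only, assuming well-formed XML input) moves the its-* 
attributes of marked wrappers into a list of stand-off records and then 
removes the wrapper elements while keeping their content in place. The 
marker name its-wrapper and the record layout are hypothetical; a real 
stand-off record would also need some anchor back into the document, such 
as a fragment identifier, which is left out here.

```python
import xml.etree.ElementTree as ET

MARKER = "its-wrapper"  # hypothetical marker attribute, not from any draft


def _append_text(parent, index, text):
    """Attach text just before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def _unwrap(parent, child):
    """Replace `child` with its own content, keeping text order intact."""
    index = list(parent).index(child)
    grandchildren = list(child)
    parent.remove(child)
    _append_text(parent, index, child.text)
    for offset, grandchild in enumerate(grandchildren):
        parent.insert(index + offset, grandchild)
    _append_text(parent, index + len(grandchildren), child.tail)


def strip_wrappers(elem, records=None):
    """Unwrap marked descendants of `elem`; return stand-off records.

    Works depth first, so inner wrappers are recorded before outer ones.
    """
    if records is None:
        records = []
    for child in list(elem):
        strip_wrappers(child, records)
        if child.get(MARKER) is not None:
            records.append({
                "text": "".join(child.itertext()),
                "attrs": {k: v for k, v in child.attrib.items()
                          if k.startswith("its-") and k != MARKER},
            })
            _unwrap(elem, child)
    return records
```

Spans that carry no marker, for example ones the author added for styling, 
are left untouched, which is the reason for having the marker at all.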

Moritz, the CMS document lifecycle view is key here, so I'd be interested 
in your take on the potential overhead localisation mark-up would add to 
document management.

Regards,
Dave

On 18/04/2012 09:30, Arle Lommel wrote:
> OK, accepting that we are using attributes now, that starts to dictate 
> some naming conventions we will have to use that we would not worry 
> about with elements. For example, if two data categories presently 
> include note attributes and we are using attributes, the two kinds of 
> notes will need different names so that we know what they apply to. 
> One way to do so is to extend the prefixing names that Felix 
> suggested, as shown in this example, where the names are nested:
>
>     The <span
>     its-qa-type="syntactic error"
>     its-qa-ruleSet="SAE J2450"
>     its-qa-severity="major"
>     its-qa-note="bad grammar"
>     its-qa-agent="ABCReview"
>     its-somethingElse-note="Sample of a note from another section of ITS"
>     >verbs agrees</span> with the subject
>
>
> In this case I prefixed everything with *its* for our broad area and 
> *qa* for quality with *somethingElse* for some other part of the spec, 
> but both allowing a “note”. Sound reasonable? If so, it means I will 
> probably need to redo some examples to match, but it shouldn't be hard.
>
> But I see some issue with using attributes applied to existing HTML5 
> constructs, like *span*, in this fashion.
>
>  1. Some of the data categories are rather complex (take
>     processTrigger for example) and have rules about some components
>     being mandatory and some being optional. I've not done this with a
>     schema before, and perhaps it can be done, but how do we parse and
>     enforce data structures when we no longer have a unique element to
>     enforce those structures on (since the data categories are now
>     attributes that can apply to many kinds of elements)? I.e., if we
>     have 20 constructs that can apply to a *span*, each with their own
>     data model, how do we modify the data model for *span* itself to
>     support validating all of these different data models? Does the
>     new version of XML Schema address ways to enforce co-occurrence
>     restrictions for attributes of an element (e.g., if you have /foo/
>     as an attribute you *must* also have /bar/, but you can have /bar/
>     without having /foo/)?
>  2. Since we will need to support multiple data categories on
>     elements, and we would rename using the nested convention outlined
>     above, does that mean that something like the following would be
>     considered valid?
>
>
>     <div
>         its-localeSpecificContent="fr"
>         its-preserveSpace="yes"
>         translate="yes"
>         its-approvalStatus="yes"
>         its-legalStatus="yes"
>         its-processTrigger-type="contentL10N"
>         its-processTrigger-contentType="text/html"
>         its-processTrigger-contentTypeVersion="5"
>         its-processTrigger-sourceLang="en"
>         its-processTrigger-pivotLang="zh-Hant"
>         its-processTrigger-targetLants="fr,de,wo"
>         its-processTrigger-dateRequest="20120620T023456"
>         its-processTrigger-dateDelivery="20120625T023456"
>         its-processTrigger-priority="1"
>         its-processTrigger-contentResultSource="yes"
>         its-processTrigger-contentResultTarget="multilingual"
>         its-proofeadingState="yes"
>         its-revisionState="revised"
>         its-domain="fruit cultivation"
>         its-formatType="subtitles"
>         its-genre="academic"
>         its-purpose="education"
>         its-register="formal"
>         its-translatorQualification="expert in translation of agricultural documents"
>         its-author="Bob Johnson"
>         its-contentLicensingTerms="open"
>         its-revisionAgent="human"
>         its-sourceLanguage="hu"
>         its-translationAgent="social"
>         its-qualityProfile-name="SAE J2450"
>         its-qualityProfile-uri="http://www.nowhere.com"
>         its-qualityProfile-pass="pass"
>         its-qualityProfile-score="98%"
>         its-qualityProfile-agent="Bil Smith"
>         its-confidentiality="nonconfidential"
>         its-context="informativeMessage"
>         its-languageResource-type="termbase"
>         its-languageResource-location="http://www.nowhere.com/terms.tbx"
>         its-languageResource-format="TBX"
>         its-languageResource-id="tbx01"
>         its-languageResource-description="the company’s terminology on the web"
>         its-specialRequirements="max-length:2000 chars"
>     ><p>Sed ut perspiciatis unde omnis iste natus error sit
>     voluptatem accusantium doloremque laudantium, totam rem aperiam,
>     eaque ipsa quae ab illo inventore veritatis et quasi architecto
>     beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia
>     voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur
>     magni dolores eos qui ratione voluptatem sequi nesciunt.</p><img
>     src="picture.jpg" alt="some text" /><p>Neque porro quisquam est,
>     qui dolorem ipsum quia dolor sit amet, consectetur, adipisci
>     velit, sed quia non numquam eius modi tempora incidunt ut labore
>     et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima
>     veniam, quis nostrum exercitationem ullam corporis suscipit
>     laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem
>     vel eum iure reprehenderit qui in ea voluptate velit esse quam
>     nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo
>     voluptas nulla pariatur?</p></div>
>
>
> I do realize we are intending all this primarily for machine 
> processing, not human readability, but I really wish there were some 
> better way to handle this kind of situation that allowed for easier 
> selection of the relevant attributes for any given task. If we are 
> using attributes I don't think we can avoid it, especially as our goal 
> is to not mess up the structures of the documents things get applied 
> to, so we cannot add extra divs to separate the kinds of data, but I 
> don't find that sort of thing very satisfying…
>
> I think Dave's responses help clean up the kind of mess I describe 
> above in general, but when you start dealing with exceptions, things 
> can get ugly.
>
> -Arle
>
> Sic scripsit David Lewis in Apr 18, 2012 ad 02:27 :
>
>> Guys,
>> For these multi-attribute data categories some other alternatives to 
>> new elements are:
>>
>> 1) for data categories likely to apply largely to the whole document, 
>> e.g. the process flow definitions, it may be possible to include a 
>> dedicated element in the global ITS rules part of the document 
>> header, with a rule binding it to the relevant document elements, and 
>> then a simpler local attribute indicating exceptions where the rule 
>> bound element does not apply
>>
>> 2) record the composite data categories in an external data 
>> structure, e.g. as a CMS database attribute associated with the 
>> document. This would be relevant for QA or other provenance data 
>> which may need to be linked to the target document after it is 
>> published, e.g. for selecting parallel text for MT training, but 
>> without incurring the overhead of being stored in the document. The 
>> document itself would just need an attribute providing an index to 
>> the data structure, e.g. a DB key or, for linked data, a URL
>>
>> 3) a variation of 2, where we want to further minimise the impact of 
>> the binding to external data categories on the published document, is 
>> for the external data to retain a link to the document (rather than 
>> the other way around as in 2). This external index could be to 
>> existing fragment identifiers in the document (perhaps addressed via 
>> the id value requirement) or using some fragment URL scheme, e.g. the 
>> NIF URL recipes at http://nlp2rdf.org/nif-1-0
>>
>> It seems that different schemes may be appropriate to different data 
>> categories, or may even be more attractive in different business 
>> settings for the same data category. My feeling is we need further 
>> input from the downstream (i.e. after translation) CMS side of things 
>> to help guide this.
>>
>> cheers,
>> Dave
>>
>>
>> On 17/04/2012 10:31, Arle Lommel wrote:
>>> Thanks Felix,
>>>
>>> That's basically what I meant by a "bundle of attributes", but you 
>>> are right that it doesn't look nice.
>>>
>>> I guess this question doesn't need to be resolved immediately, but 
>>> for the sake of consistency, I will take the "ugly" approach in the 
>>> examples I draft (unless they are already done the other way) and we 
>>> can discuss in Dublin. I will also use the "its-" prefix as you show 
>>> below.
>>>
>>> Best,
>>>
>>> Arle
>>>
>>> Sic scripsit Felix Sasaki in Apr 17, 2012 ad 11:25 :
>>>
>>>> Hi Arle, all (still on vacation, just lurking),
>>>>
>>>> it doesn't look nice, but you can mimic elements with attributes. 
>>>> E.g., instead of
>>>>
>>>> <myElem someAttr="">...</myElem>
>>>> have in HTML5
>>>> <span its-myElem its-someAttr="">
>>>>
>>>> Felix
>>>
>>
>
Received on Wednesday, 18 April 2012 19:30:26 UTC