Re: Practical question: elements or attributes? from David Lewis on 2012-04-18 (public-multilingualweb-lt@w3.org from April 2012)

From: David Lewis <dave.lewis@cs.tcd.ie>
Date: Wed, 18 Apr 2012 14:47:05 +0100
To: Arle Lommel <arle.lommel@gmail.com>
CC: Multilingual Web LT Public List <public-multilingualweb-lt@w3.org>
Message-ID: <4F8EC5D9.7010405@cs.tcd.ie>
Hi Arle,
Thanks, these examples help illuminate the issues, especially how we 
need a structured attribute naming approach if we are not using 
namespaces, and the issue of how to define co-dependence between 
attributes, which is a major conformance issue.
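
To make the co-dependence point concrete, here is a minimal Python 
sketch of the kind of check a processor would need once prefixed names 
are in use. It groups its-<category>-<field> attributes by data category 
and reports required fields that are missing. The REQUIRED_FIELDS table 
and the function names are assumptions for illustration only; nothing in 
ITS currently says which qa fields are mandatory.

```python
from collections import defaultdict

# Hypothetical co-occurrence rules: category -> fields that must be present
# whenever any attribute of that category appears on an element.
REQUIRED_FIELDS = {
    "qa": {"type", "severity"},
}


def group_by_category(attrs):
    """Group its-<category>-<field> attributes by their data category."""
    groups = defaultdict(dict)
    for name, value in attrs.items():
        parts = name.split("-", 2)
        # Only nested names (its-qa-note) are grouped; flat names such as
        # its-domain have no field part and are skipped.
        if len(parts) == 3 and parts[0] == "its":
            groups[parts[1]][parts[2]] = value
    return dict(groups)


def missing_fields(attrs):
    """Return {category: sorted missing required fields} for one element."""
    problems = {}
    for category, fields in group_by_category(attrs).items():
        missing = REQUIRED_FIELDS.get(category, set()) - set(fields)
        if missing:
            problems[category] = sorted(missing)
    return problems


span_attrs = {
    "its-qa-type": "syntactic error",
    "its-qa-note": "bad grammar",
    "its-somethingElse-note": "Sample of a note from another section",
}
print(missing_fields(span_attrs))  # {'qa': ['severity']}
```

The check lives outside the schema here, which sidesteps the question of 
whether a schema language can express it, at the cost of every 
implementation needing the same rule table.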

The examples also, I think, highlight a natural outcome of the expansion 
of scope from internationalisation in ITS to the whole localisation 
process and to CMS-based localisation.

In internationalisation the authors are introducing the tags and also 
have authority to add spans and divs to the document at the same time. 
They can therefore optimise this to avoid adding too many such elements 
just to support ITS attributes. As we go through the localisation 
process, each segment, and potentially each term, becomes the subject of 
MLW-LT attributes. Since localisation needs to track every segment, and 
each may have a different status that is of interest, every segment 
potentially needs a span added just to support our attributes. This will 
easily lead to the sort of situation Arle exemplifies for _every_ 
segment. This takes us to where OAXAL (and LISA) went with xml:tm. That 
approach suffers from loading the document with a lot of markup that 
exists just for localisation but is not relevant to the later stages of 
content management. Hence my previous points on support for recording 
such markup in a stand-off fashion once we are past the stage where we 
need the immediacy of it being embedded in the document. If we are to 
strip out such spans/divs that were introduced purely for housing MLW-LT 
attributes, we need an attribute to mark them as such so the span/div 
itself can be stripped cleanly.
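
As a rough illustration of that stripping step, the sketch below uses 
Python's standard html.parser to re-serialise markup while dropping any 
span or div that carries a marker attribute, keeping the content inside 
it. The attribute name its-wrapper is hypothetical (no such name has been 
agreed), and the class and function names are likewise made up for the 
example. It is a sketch under those assumptions, not a proposal for how a 
CMS should do this.

```python
from html.parser import HTMLParser

MARKER = "its-wrapper"  # hypothetical marker attribute, not defined by ITS


class WrapperStripper(HTMLParser):
    """Re-serialise markup, dropping span/div tags that carry MARKER."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        # One stack per tag name: True means "this open tag was dropped".
        self.open_wrappers = {"span": [], "div": []}

    def handle_starttag(self, tag, attrs):
        if tag in self.open_wrappers:
            is_wrapper = any(name == MARKER for name, _ in attrs)
            self.open_wrappers[tag].append(is_wrapper)
            if is_wrapper:
                return  # drop the tag itself, keep its content
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> pass through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        stack = self.open_wrappers.get(tag)
        if stack and stack.pop():
            return  # matching start tag was dropped, so drop this too
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def strip_wrappers(markup):
    """Return markup with all marked span/div wrappers removed."""
    parser = WrapperStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Spans and divs that the author added for other reasons carry no marker 
and so survive untouched, which is the reason for wanting an explicit 
marker rather than guessing from the presence of its-* attributes.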

Moritz, the CMS platform and document lifecycle perspective would be 
really helpful here - what are your thoughts?

Regards,
Dave


On 18/04/2012 09:30, Arle Lommel wrote:
> OK, accepting that we are using attributes now, that starts to dictate 
> some naming conventions we will have to use that we would not worry 
> about with elements. For example, if two data categories presently 
> include note attributes and we are using attributes, the two kinds of 
> notes will need different names so that we know what they apply to. 
> One way to do so is to extend the prefixing names that Felix 
> suggested, as shown in this example, where the names are nested:
>
>     The *<span*
>     *its-qa-type="syntactic error"*
>     *its-qa-ruleSet="SAE J2450"*
>     *its-qa-severity="major"*
>     *its-qa-note="bad grammar"*
>     *its-qa-agent="ABCReview"*
>     *its-somethingElse-note="Sample of a note from another section of
>     ITS"*
>     *>*verbs agrees*</span>* with the subject
>
>
> In this case I prefixed everything with *its* for our broad area and 
> *qa* for quality with *somethingElse* for some other part of the spec, 
> but both allowing a “note”. Sound reasonable? If so, it means I will 
> probably need to redo some examples to match, but it shouldn't be hard.
>
> But I see some issues with using attributes applied to existing HTML5 
> constructs, like *span*, in this fashion.
>
>  1. Some of the data categories are rather complex (take
>     processTrigger for example) and have rules about some components
>     being mandatory and some being optional. I've not done this with a
>     schema before, and perhaps it can be done, but how do we parse and
>     enforce data structures when we no longer have a unique element to
>     enforce those structures on (since the data categories are now
>     attributes that can apply to many kinds of elements)? I.e., if we
>     have 20 constructs that can apply to a *span*, each with their own
>     data model, how do we modify the data model for *span* itself to
>     support validating all of these different data models? Does the
>     new version of XML Schema address ways to enforce co-occurrence
>     restrictions for attributes of an element (e.g., if you have /foo/
>     as an attribute you *must* also have /bar/, but you can have /bar/
>     without having /foo/)?
>  2. Since we will need to support multiple data categories on
>     elements, and we would rename using the nested convention outlined
>     above, does that mean that something like the following would be
>     considered valid?
>
>
>     *<div*
>
>         *its-localeSpecificContent="fr"*
>
>         *its-preserveSpace="yes"*
>
>         *translate="yes"*
>
>         *its-approvalStatus="yes"*
>
>         *its-legalStatus="yes"*
>
>         *its-processTrigger-type="contentL10N"*
>
>         *its-processTrigger-contentType="text/html"*
>
>         *its-processTrigger-contentTypeVersion="5"*
>
>         *its-processTrigger-sourceLang="en"*
>
>         *its-processTrigger-pivotLang="zh-Hant"*
>
>         *its-processTrigger-targetLangs="fr,de,wo"*
>
>         *its-processTrigger-dateRequest="20120620T023456"*
>
>         *its-processTrigger-dateDelivery="20120625T023456"*
>
>         *its-processTrigger-priority="1"*
>
>         *its-processTrigger-contentResultSource="yes"*
>
>         *its-processTrigger-contentResultTarget="multilingual"*
>
>         *its-proofreadingState="yes"*
>
>         *its-revisionState="revised"*
>
>         *its-domain="fruit cultivation"*
>
>         *its-formatType="subtitles"*
>
>         *its-genre="academic"*
>
>         *its-purpose="education"*
>
>         *its-register="formal"*
>
>         *its-translatorQualification="expert in translation of
>         agricultural documents"*
>
>         *its-author="Bob Johnson"*
>
>         *its-contentLicensingTerms="open"*
>
>         *its-revisionAgent="human"*
>
>         *its-sourceLanguage="hu"*
>
>         *its-translationAgent="social"*
>
>         *its-qualityProfile-name="SAE J2450"*
>
>         *its-qualityProfile-uri="http://www.nowhere.com"*
>
>         *its-qualityProfile-pass="pass"*
>
>         *its-qualityProfile-score="98%"*
>
>         *its-qualityProfile-agent="Bil Smith"*
>
>         *its-confidentiality="nonconfidential"*
>
>         *its-context="informativeMessage"*
>
>         *its-languageResource-type="termbase"*
>
>         *its-languageResource-location="http://www.nowhere.com/terms.tbx"*
>
>         *its-languageResource-format="TBX"*
>
>         *its-languageResource-id="tbx01"*
>
>         *its-languageResource-description="the company’s terminology
>         on the web"*
>
>         *its-specialRequirements="max-length:2000 chars"*
>
>     *><p>*Sed ut perspiciatis unde omnis iste natus error sit
>     voluptatem accusantium doloremque laudantium, totam rem aperiam,
>     eaque ipsa quae ab illo inventore veritatis et quasi architecto
>     beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia
>     voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur
>     magni dolores eos qui ratione voluptatem sequi nesciunt.*</p><img
>     src="picture.jpg" alt="some text" /><p>*Neque porro quisquam est,
>     qui dolorem ipsum quia dolor sit amet, consectetur, adipisci
>     velit, sed quia non numquam eius modi tempora incidunt ut labore
>     et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima
>     veniam, quis nostrum exercitationem ullam corporis suscipit
>     laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem
>     vel eum iure reprehenderit qui in ea voluptate velit esse quam
>     nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo
>     voluptas nulla pariatur?*</p></div>*
>
>
> I do realize we are intending all this primarily for machine 
> processing, not human readability, but I really wish there were some 
> better way to handle this kind of situation that allowed for easier 
> selection of the relevant attributes for any given task. If we are 
> using attributes I don't think we can avoid it, especially as our goal 
> is to not mess up the structures of the documents things get applied 
> to, so we cannot add extra divs to separate the kinds of data, but I 
> don't find that sort of thing very satisfying…
>
> I think Dave's responses help clean up the kind of mess I describe 
> above in general, but when you start dealing with exceptions, things 
> can get ugly.
>
> -Arle
>
> Sic scripsit David Lewis in Apr 18, 2012 ad 02:27 :
>
>> Guys,
>> For these multi-attribute data categories some other alternatives to 
>> new elements are:
>>
>> 1) for data categories likely to apply largely to the whole document, 
>> e.g. the process flow definitions, it may be possible to include a 
>> dedicated element in the global ITS rules part of the document 
>> header, with a rule binding it to the relevant document elements, and 
>> then a simpler local attribute indicating exceptions where the rule 
>> bound element does not apply
>>
>> 2) record the composite data categories in an external data 
>> structure, e.g. as a CMS database attribute associated with the 
>> document. This would be relevant for QA or other provenance data 
>> which may need to be linked to the target document after it is 
>> published, e.g. for selecting parallel text for MT training, but 
>> without incurring the overhead of being stored in the document. The 
>> document itself would just need an attribute providing an index to 
>> the data structure, e.g. a DB key or, for linked data, a URL
>>
>> 3) a variation of 2, where we want to further minimise the impact of 
>> the binding to external data categories on the published document, is 
>> for the external data to retain a link to the document (rather than 
>> the other way around as in 2). This external index could be to 
>> existing fragment identifiers in the document (perhaps addressed via 
>> the id value requirement) or using some fragment URL scheme, e.g. the 
>> NIF URL recipes at http://nlp2rdf.org/nif-1-0
>>
>> It seems that different schemes may be appropriate to different data 
>> categories, or may even be more attractive in different business 
>> settings for the same data category. My feeling is we need further 
>> input from the downstream (i.e. after translation) CMS side of things 
>> to help guide this.
>>
>> cheers,
>> Dave
>>
>>
>> On 17/04/2012 10:31, Arle Lommel wrote:
>>> Thanks Felix,
>>>
>>> That's basically what I meant by a "bundle of attributes", but you 
>>> are right that it doesn't look nice.
>>>
>>> I guess this question doesn't need to be resolved immediately, but 
>>> for the sake of consistency, I will take the "ugly" approach in the 
>>> examples I draft (unless they are already done the other way) and we 
>>> can discuss in Dublin. I will also use the "its-" prefix as you show 
>>> below.
>>>
>>> Best,
>>>
>>> Arle
>>>
>>> Sic scripsit Felix Sasaki in Apr 17, 2012 ad 11:25 :
>>>
>>>> Hi Arle, all (still on vacation, just lurking),
>>>>
>>>> it doesn't look nice, but you can mimic elements with attributes. 
>>>> E.g., instead of
>>>>
>>>> <myElem someAttr="">...</myElem>
>>>> have in HTML5
>>>> <span its-myElem its-someAttr="">
>>>>
>>>> Felix
>>>
>>
>
Received on Wednesday, 18 April 2012 13:48:06 UTC