Re: Practical question: elements or attributes?

OK, accepting that we are using attributes now, that starts to dictate some naming conventions that we would not have to worry about with elements. For example, if two data categories each include a note attribute, the two kinds of notes will need different names so that we know which data category each one applies to. One way to do this is to extend the prefixed names that Felix suggested, as shown in this example, where the names are nested:

The <span
	its-qa-type="syntactic error"
	its-qa-ruleSet="SAE J2450"
	its-qa-severity="major"
	its-qa-note="bad grammar"
	its-qa-agent="ABCReview"
	its-somethingElse-note="Sample of a note from another section of ITS"
>verbs agrees</span> with the subject

In this case I prefixed everything with its for our broad area, then qa for quality and somethingElse for some other part of the spec, with both allowing a “note”. Sound reasonable? If so, it means I will probably need to redo some examples to match, but it shouldn't be hard.

But I see some issues with using attributes applied to existing HTML5 constructs, like span, in this fashion.

Some of the data categories are rather complex (take processTrigger for example) and have rules about some components being mandatory and some being optional. I've not done this with a schema before, and perhaps it can be done, but how do we parse and enforce data structures when we no longer have a unique element to enforce those structures on (since the data categories are now sets of attributes that can apply to many kinds of elements)? I.e., if we have 20 constructs that can apply to a span, each with its own data model, how do we modify the data model for span itself to support validating all of these different data models? Does the new version of XML Schema address ways to enforce co-occurrence restrictions for attributes of an element (e.g., if you have foo as an attribute you must also have bar, but you can have bar without having foo)?
Since we will need to support multiple data categories on elements, and we would rename using the nested convention outlined above, does that mean that something like the following would be considered valid?

<div
its-localeSpecificContent="fr"
its-preserveSpace="yes"
translate="yes"
its-approvalStatus="yes"
its-legalStatus="yes"
its-processTrigger-type="contentL10N"
its-processTrigger-contentType="text/html"
its-processTrigger-contentTypeVersion="5"
its-processTrigger-sourceLang="en"
its-processTrigger-pivotLang="zh-Hant"
its-processTrigger-targetLangs="fr,de,wo"
its-processTrigger-dateRequest="20120620T023456"
its-processTrigger-dateDelivery="20120625T023456"
its-processTrigger-priority="1"
its-processTrigger-contentResultSource="yes"
its-processTrigger-contentResultTarget="multilingual"
its-proofreadingState="yes"
its-revisionState="revised"
its-domain="fruit cultivation"
its-formatType="subtitles"
its-genre="academic"
its-purpose="education"
its-register="formal"
its-translatorQualification="expert in translation of agricultural documents"
its-author="Bob Johnson"
its-contentLicensingTerms="open"
its-revisionAgent="human"
its-sourceLanguage="hu"
its-translationAgent="social"
its-qualityProfile-name="SAE J2450"
its-qualityProfile-uri="http://www.nowhere.com"
its-qualityProfile-pass="pass"
its-qualityProfile-score="98%"
its-qualityProfile-agent="Bil Smith"
its-confidentiality="nonconfidential"
its-context="informativeMessage"
its-languageResource-type="termbase"
its-languageResource-location="http://www.nowhere.com/terms.tbx"
its-languageResource-format="TBX"
its-languageResource-id="tbx01"
its-languageResource-description="the company’s terminology on the web"	
its-specialRequirements="max-length:2000 chars"
><p>Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt.</p><img src="picture.jpg" alt="some text" /><p>Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?</p></div>
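
To make the co-occurrence question concrete: if a schema cannot express the rule, one fallback is a small procedural check run over the markup. The following is only a sketch under that assumption, using Python's standard html.parser. The REQUIRES table and the names CooccurrenceChecker and check are hypothetical, and the single rule in it is the foo/bar case from above, not anything taken from the ITS drafts.

```python
from html.parser import HTMLParser

# Hypothetical rule table: if the key attribute is present, each listed
# attribute must be present too (the reverse is not required).
REQUIRES = {"foo": {"bar"}}


class CooccurrenceChecker(HTMLParser):
    """Collects a message for every start tag that violates the rules."""

    def __init__(self, rules):
        super().__init__()
        # html.parser lowercases attribute names, so normalize the rules too.
        self.rules = {k.lower(): {n.lower() for n in v} for k, v in rules.items()}
        self.errors = []

    def handle_starttag(self, tag, attrs):
        present = {name for name, _ in attrs}
        for trigger, needed in self.rules.items():
            if trigger in present:
                for missing in sorted(needed - present):
                    self.errors.append(f"<{tag}>: {trigger} requires {missing}")


def check(markup, rules=REQUIRES):
    """Return a list of violation messages for the given markup."""
    checker = CooccurrenceChecker(rules)
    checker.feed(markup)
    checker.close()
    return checker.errors
```

Calling check('<span foo="1">x</span>') reports one violation, while the same span with only bar, or with both foo and bar, reports none. Note that the parser lowercases attribute names, so camel-case names such as its-qa-ruleSet would be compared as its-qa-ruleset.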

I do realize we are intending all this primarily for machine processing, not human readability, but I really wish there were some better way to handle this kind of situation that allowed for easier selection of the relevant attributes for any given task. If we are using attributes I don't think we can avoid it, especially as our goal is to not mess up the structures of the documents things get applied to, so we cannot add extra divs to separate the kinds of data, but I don't find that sort of thing very satisfying…
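
One thing the nested names do allow is mechanical grouping, so a tool can pull out just the data category it cares about. Here is a rough sketch of that idea in Python, assuming the its-category-field convention shown in my examples. The names split_its_name, ItsCollector and its_categories are made up for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


def split_its_name(name):
    """Split 'its-qa-note' into ('qa', 'note'); un-nested names get field None."""
    parts = name.split("-", 2)
    if parts[0] != "its" or len(parts) < 2 or not parts[1]:
        return None  # not an its-* attribute
    field = parts[2] if len(parts) == 3 else None
    return parts[1], field


class ItsCollector(HTMLParser):
    """Groups its-* attributes by data category for each start tag."""

    def __init__(self):
        super().__init__()
        self.elements = []  # (tag, {category: {field: value}}) pairs

    def handle_starttag(self, tag, attrs):
        grouped = defaultdict(dict)
        for name, value in attrs:
            parsed = split_its_name(name)
            if parsed:
                category, field = parsed
                grouped[category][field] = value
        if grouped:
            self.elements.append((tag, dict(grouped)))


def its_categories(markup):
    """Return the grouped its-* attributes of every element in the markup."""
    collector = ItsCollector()
    collector.feed(markup)
    collector.close()
    return collector.elements
```

A QA tool would then read only the "qa" entry for each element and ignore the rest. Because html.parser lowercases attribute names, a category written as somethingElse comes back as somethingelse.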

I think Dave's responses help clean up the kind of mess I describe above in general, but when you start dealing with exceptions, things can get ugly.

-Arle

David Lewis wrote on Apr 18, 2012 at 02:27:

> Guys,
> For these multi-attribute data categories some other alternatives to new elements are:
> 
> 1) for data categories likely to apply largely to the whole document, e.g. the process flow definitions, it may be possible to include a dedicated element in the global ITS rules part of the document header, with a rule binding it to the relevant document elements, and then a simpler local attribute indicating exceptions where the rule bound element does not apply
> 
> 2) record the composite data categories in an external data structure, e.g. as a CMS database attribute associated with the document. This would be relevant for QA or other provenance data which may need to be linked to the target document after it is published, e.g. for selecting parallel text for MT training, but without incurring the overhead of being stored in the document. The document itself would just need an attribute providing an index to the data structure, e.g. a DB key or, for linked data, a URL
> 
> 3) a variation of 2, where we want to further minimise the impact of the binding to external data categories on the published document, is for the external data to retain a link to the document (rather than the other way around as in 2). This external index could be to existing fragment identifiers in the document (perhaps addressed via the id value requirement) or using some fragment URL scheme, e.g. the NIF URL recipes at http://nlp2rdf.org/nif-1-0
> 
> It seems that different schemes may be appropriate to different data categories, or may even be more attractive in different business settings for the same data category. My feeling is we need further input from the downstream (i.e. after translation) CMS side of things to help guide this.
> 
> cheers,
> Dave
> 
> 
> On 17/04/2012 10:31, Arle Lommel wrote:
>> 
>> Thanks Felix,
>> 
>> That's basically what I meant by a "bundle of attributes", but you are right that it doesn't look nice.
>> 
>> I guess this question doesn't need to be resolved immediately, but for the sake of consistency, I will take the "ugly" approach in the examples I draft (unless they are already done the other way) and we can discuss in Dublin. I will also use the "its-" prefix as you show below.
>> 
>> Best,
>> 
>> Arle
>> 
>> Felix Sasaki wrote on Apr 17, 2012 at 11:25:
>> 
>>> Hi Arle, all (still on vacation, just lurking),
>>> 
>>> it doesn't look nice, but you can mimic elements with attributes. E.g., instead of
>>> 
>>> <myElem someAttr="">...</myElem>
>>> have in HTML5
>>> <span its-myElem its-someAttr="">
>>> 
>>> Felix
>> 
> 

Received on Wednesday, 18 April 2012 08:31:29 UTC