RE: [DPUB-ANNOTATION-UC] 2.1.3 general observation about language identification [I18N-ISSUE-458] from Phillips, Addison on 2015-10-11 (public-annotation@w3.org from October 2015)

From: Phillips, Addison <addison@lab126.com>
Date: Sun, 11 Oct 2015 19:29:08 +0000
To: Leonard Rosenthol <lrosenth@adobe.com>, Robert Sanderson <azaroth42@gmail.com>
CC: "public-digipub@w3.org" <public-digipub@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, Web Annotation <public-annotation@w3.org>
Message-ID: <63a5e210961f40f58a07b27439be6e88@EX13D08UWB003.ant.amazon.com>
Hi Leonard,

The document I mentioned is in a very early state: not everything is documented that should be. We certainly are looking for contributions and gaps!

The issue of language identification for attributes has a long and inglorious history. Two possible solutions have been suggested over the years.


1. Use the @lang of the element. That is, in <thing lang="zh-Hans-CN" attr="some text here">, the language of @attr is taken from @lang. This is problematic, since <thing> might contain text in a different language.

2. Use an additional attribute. That is, <thing lang="zh" attrlang="en" attr="some English">, where @attrlang indicates the language of @attr.
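
As a rough illustration of how a consumer might resolve the two options above, here is a minimal Python sketch. The "<attribute name> + lang" naming rule and the attribute_language helper are assumptions made for illustration only; the attrlang name comes from the example above, not from any specification.

```python
import xml.etree.ElementTree as ET


def attribute_language(element, attr_name):
    """Return the language to assume for the attribute `attr_name`.

    Option 2: prefer a companion attribute named `<attr_name>lang`
    (hypothetical naming rule). Option 1: otherwise fall back to the
    element's own `lang`, which may be wrong if the attribute text is
    in a different language from the element content.
    """
    return element.get(attr_name + "lang", element.get("lang"))


# Option 2: the attribute declares its own language.
tagged = ET.fromstring('<thing lang="zh" attrlang="en" attr="some English"/>')
print(attribute_language(tagged, "attr"))

# Option 1: only the element-level language is available.
untagged = ET.fromstring('<thing lang="zh-Hans-CN" attr="some text here"/>')
print(attribute_language(untagged, "attr"))
```

Note that the fallback silently assumes the attribute shares the element's language, which is exactly the weakness of option 1.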

But the real best practice is not to put natural language text into attributes in the first place, and most new document formats avoid doing so for precisely this reason. Note that base direction has the same problem as language, so more than one piece of metadata would need to be added for each attribute.

The Unicode language tag characters were never a good idea and they are, indeed, deprecated. For document formats, the usual practice is to use metadata fields with BCP 47 language tags. Storing the language inside the string itself presents many problems: implementations must add, remove, and manage the tags, the tags modify the contents of the string, and so forth.
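
To make the cost of in-string tagging concrete, here is a minimal Python sketch of what a consumer has to do before it can use a string tagged this way. It assumes the tag is a leading U+E0001 LANGUAGE TAG followed by tag characters in the range U+E0020 to U+E007F; the split_language_tag helper is hypothetical and not part of any library.

```python
LANGUAGE_TAG = "\U000E0001"          # U+E0001 LANGUAGE TAG (deprecated)
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007F  # tag characters mirroring ASCII
TAG_OFFSET = 0xE0000                 # tag character = ASCII code + offset


def split_language_tag(s):
    """Separate a leading in-string language tag from the visible text.

    Returns (language, text). `language` is None when the string does
    not start with U+E0001.
    """
    if not s.startswith(LANGUAGE_TAG):
        return None, s
    i = len(LANGUAGE_TAG)
    tag = []
    while i < len(s) and TAG_FIRST <= ord(s[i]) <= TAG_LAST:
        tag.append(chr(ord(s[i]) - TAG_OFFSET))
        i += 1
    return "".join(tag), s[i:]


# Build a tagged string for "hello" marked as English.
tagged = LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in "en") + "hello"

lang, text = split_language_tag(tagged)
# The tagged string is longer than the text a user sees, so length,
# comparison, and indexing all behave differently until it is stripped.
print(lang, text, len(tagged), len(text))
```

Every component that touches the string needs logic like this, whereas a separate metadata field leaves the string contents untouched.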

Addison

From: Leonard Rosenthol [mailto:lrosenth@adobe.com]
Sent: Sunday, October 11, 2015 11:41 AM
To: Phillips, Addison; Robert Sanderson
Cc: public-digipub@w3.org; public-i18n-core@w3.org; Web Annotation
Subject: Re: [DPUB-ANNOTATION-UC] 2.1.3 general observation about language identification [I18N-ISSUE-458]

Addison, maybe I am missing it, but I don’t see anything in your document that addresses language specification on attributes. Identification for an entire “document”, or for a well-defined block of that “document”, is well understood. But for systems where an element/block has attributes that cannot, themselves, have attributes, how would you suggest that this be done? It would seem similar to the “inline” problem in your 3.3, but I don’t see any actual resolution or suggestion there or in the provided link.

There used to be a way to do this with “escapes” in Unicode, but that has been deprecated. Is there a replacement?

Leonard

From: "Phillips, Addison"
Date: Sunday, October 11, 2015 at 1:46 PM
To: Robert Sanderson
Cc: "public-digipub@w3.org", "public-i18n-core@w3.org", Web Annotation
Subject: RE: [DPUB-ANNOTATION-UC] 2.1.3 general observation about language identification [I18N-ISSUE-458]
Resent-From: <public-digipub@w3.org>
Resent-Date: Sunday, October 11, 2015 at 1:47 PM

Hi Rob,

Thanks for the reply.

Note that references to language tags should refer to BCP 47 rather than to (one of) the underlying RFCs such as 5646. BCP 47 is a stable reference.

I don’t agree that language and direction metadata should be lumped with other types of metadata. The best practice for natural language text fields is to provide for language and base direction metadata in the document structure at the field level, which is why our comment calls it out. That’s because each field may be in a different language or have a different base direction. Note that this is consistent with e.g. HTML5 and with a number of ebook formats.
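
A minimal Python sketch of what field-level metadata could look like follows. The TextField type, its field names, and the effective_lang helper are assumptions for illustration only; they are not taken from the Web Annotation data model or any other specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextField:
    """A natural language string carrying its own metadata."""
    value: str
    lang: Optional[str] = None       # BCP 47 language tag, e.g. "ar" or "en"
    direction: Optional[str] = None  # base direction: "ltr", "rtl", or None


def effective_lang(field, document_lang):
    """Prefer the field's own language; fall back to the document default."""
    return field.lang or document_lang


# Two fields of one record in different languages and directions
# (hypothetical example data).
title = TextField("Annotation title", lang="en", direction="ltr")
comment = TextField("تعليق", lang="ar", direction="rtl")
untagged = TextField("no metadata here")

for f in (title, comment, untagged):
    print(effective_lang(f, "fr"), f.direction)
```

Because each field carries its own language and base direction, the document-level language serves only as a default and never overrides a field that declares otherwise.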

The language may also be a separate field at the document level, but this comment isn’t about the kinds of metadata (including language) that you list in your reply.

(Note that the I18N WG is publishing a FPWD of “best practices for specification authors” this coming week that details some of the items above and may be helpful to your WG. See http://w3c.github.io/bp-i18n-specdev/#lang_resource)

Thanks,

Addison

From: Robert Sanderson [mailto:azaroth42@gmail.com]
Sent: Sunday, October 11, 2015 2:24 AM
To: Phillips, Addison
Cc: public-digipub@w3.org; public-i18n-core@w3.org; Web Annotation
Subject: Re: [DPUB-ANNOTATION-UC] 2.1.3 general observation about language identification [I18N-ISSUE-458]


Thanks again for the comments.

I agree that language metadata is important and that the set of use cases does not specifically include any metadata about the body or target resources.  That was somewhat intentional, so as to avoid trying to list out all of the possible descriptive features for resources, such as creator, creation time, language, file format, license or other rights statements, intended audience and so forth.  These are listed for the annotation itself, as the primary resource of interest.

In the Web Annotation data model, the language is explicitly included along with format and general class of the resource [1].  In the upcoming WD (next week [2]), we also add creator and creation time.  For language we refer to RFC 5646 as the value of the language property.  Is that sufficient to cover the requirements?

Many thanks,

Rob

[1] http://www.w3.org/TR/annotation-model/#body-and-target-metadata

[2] http://azaroth42.github.io/web-annotation/model/wd/index-renamed.html




On Sat, Oct 10, 2015 at 1:19 PM, Phillips, Addison <addison@lab126.com> wrote:
[1] 2.1.3 general observation about language identification
Description:
    http://www.w3.org/TR/dpub-annotation-uc/


    2.1.3 general observation about language identification

    Tags and annotations generally use natural language tokens (such as words). While Unicode allows text to be stored, passed, and processed without regard for the specific language, it is the case that strings can benefit from language metadata for character shaping, spell-checking, font selection and more. In addition, directionality information is usually desired.




--
Rob Sanderson
Information Standards Advocate
Digital Library Systems and Services
Stanford, CA 94305
Received on Sunday, 11 October 2015 19:29:43 UTC