Re: [DPUB-ANNOTATION-UC] 2.1.3 general observation about language identification [I18N-ISSUE-458] from Ivan Herman on 2015-10-15 (public-annotation@w3.org from October 2015)

From: Ivan Herman <ivan@w3.org>
Date: Thu, 15 Oct 2015 09:46:05 +0200
To: Felix Sasaki <fsasaki@w3.org>, "Phillips, Addison" <addison@lab126.com>
Cc: Robert Sanderson <azaroth42@gmail.com>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, W3C Public Annotation List <public-annotation@w3.org>
Message-Id: <F87B149D-F1CC-4AFD-8C82-95410243C55C@w3.org>
I agree with your characterization of the model, Felix.
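
A minimal sketch of that characterization, assuming Python and reusing the JSON example quoted further down in this thread. The annotation is stand-off: it only refers to its body and target by URI, and language metadata sits on each resource. The helper names `referenced_ids` and `resource_language` are hypothetical and not part of the Annotation model.

```python
# Annotation taken from the example quoted later in this thread.
annotation = {
    "@id": "http://example.org/anno1",
    "@type": "oa:Annotation",
    "body": {
        "@id": "http://example.org/body1",
        "@type": "dctypes:Sound",
        "format": "audio/mpeg",
        "language": "en",
    },
    "target": {
        "@id": "http://example.org/target1",
        "@type": "dctypes:Image",
        "format": "image/jpeg",
    },
}


def referenced_ids(anno):
    """Return the URIs the annotation points at (hypothetical helper).

    The annotation is stored independently of these resources and
    only carries references to them.
    """
    return {role: anno[role]["@id"] for role in ("body", "target") if role in anno}


def resource_language(anno, role):
    """Return the language tag recorded on the body or target, or None."""
    return anno.get(role, {}).get("language")


if __name__ == "__main__":
    print(referenced_ids(annotation))
    print(resource_language(annotation, "body"))
    print(resource_language(annotation, "target"))
```

Here the sound body carries a language tag while the image target carries none, which matches the example in the thread.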

[Admin] Note that I have removed the DPUB IG mailing list from the header. This thread started as a comment on the Digital Publishing IG's Note on Annotation Use Cases. I believe the issues raised there were adequately addressed by Rob, the editor of that Note. (B.t.w., the DPUB IG is not planning to issue a new release of that document in the coming months, so all those comments are recorded as issues, and possibly as changes in the editor's draft, but no further actions are planned.) However, several discussions spawned from the original mails concern annotations in general, or the Annotation model (like this mail). Those are not developed by the DPUB Interest Group, i.e., these discussions should not involve that IG.

Thanks

Ivan


> On 15 Oct 2015, at 08:01, Felix Sasaki <fsasaki@w3.org> wrote:
> 
> Hi Addison,
> 
> I am still a bit confused about the terminology. I understand what you mean by "document structure", but I don’t see how that term is used in the annotation specification. It talks about annotations having a body and a target. Neither of these is a document in your sense. The target can *refer* to a document, and use a URI to identify the document or parts of it - see the selector section
> http://www.w3.org/TR/annotation-model/#selectors
> and e.g. "Fig. 18 Text Position Selector".
> 
> You wrote
>> document describing the annotation model (which may include or refer externally to the annotation as non-textual values)
> 
> From my understanding, this case of "refer externally to the annotation" does not happen - it is the other way round: the annotation refers to the document (or to images or other types of resources). To use a different terminology: the annotation is stand-off, that is, stored independently of the document structure.
> 
> The body of an annotation can have language information associated with it, see the JSON example below. But again, the annotation spec does not talk about documents being a type of annotation body, but about simple textual body, embedded textual body etc., see section 3.2.1 ff
> http://www.w3.org/TR/annotation-model/#simple-textual-body
> 
> So I am still not sure how the term "document structure" relates to these terms.
> 
> Best,
> 
> Felix
> 
>> On 14 Oct 2015, at 21:30, Phillips, Addison <addison@lab126.com> wrote:
>> 
>> When I say “document structure”, I mean the document describing the annotation model (which may include or refer externally to the annotation as non-textual values). I was very careful in my previous message to use the term “natural language text field” so that this didn’t become confused with other metadata usages (which might include things like the language preferences of the target audience, for example).
>> 
>> If a document format provides a natural language text field, then that field needs to provide direction and language metadata for that field. Ideally, a spanning element (such as HTML’s <span>) would allow markup within the field as well.
>> 
>> Non-textual or external annotations or data are also important. A sound or video file (a “resource”), for example, is not a natural language text field. In that case, language information about the resource might be interesting, but that’s not the same thing as document-defined natural language text. I wouldn’t argue against making such metadata available, of course. But I think that is a separate set of requirements from the one we’re discussing here.
>> 
>> The “annotation model” will take the form of a “document” with some sort of format. Where the model can contain natural language text, that text needs to have direction and language. This is different from “creation time” or other application specific values and may be different from (for example) target audience language metadata (similar to the old Content-Language metadata field), which DPUB might define separately from any specific field in the document.
>> 
>> That make sense?
>> 
>> Addison
>> 
>> From: Felix Sasaki [mailto:fsasaki@w3.org]
>> Sent: Wednesday, October 14, 2015 11:46 AM
>> To: Phillips, Addison
>> Cc: Robert Sanderson; public-digipub@w3.org; public-i18n-core@w3.org; Web Annotation
>> Subject: Re: [DPUB-ANNOTATION-UC] 2.1.3 general observation about language identification [I18N-ISSUE-458]
>> 
>> Hi Addison and all,
>> 
>> apologies to Addison and my i18n WG colleagues for not following this closely earlier, and a question below.
>> 
>> On 11 Oct 2015, at 19:46, Phillips, Addison <addison@lab126.com> wrote:
>> 
>> Hi Rob,
>> 
>> Thanks for the reply.
>> 
>> Note that references to language tags should refer to BCP 47 rather than to (one of) the underlying RFCs such as 5646. BCP 47 is a stable reference.
>> 
>> I don’t agree that language and direction metadata should be lumped with other types of metadata. The best practice for natural language text fields is to provide for language and base direction metadata in the document structure
>> 
>> 
>> The annotation model is not necessarily to be applied to documents, see below an example where the target of the annotation is an image
>> 
>> {
>>   "@id": "http://example.org/anno1",
>>   "@type": "oa:Annotation",
>>   "body": {
>>     "@id": "http://example.org/body1",
>>     "@type": "dctypes:Sound",
>>     "format": "audio/mpeg",
>>     "language": "en"
>>   },
>>   "target": {
>>     "@id": "http://example.org/target1",
>>     "@type": "dctypes:Image",
>>     "format": "image/jpeg"
>>   }
>> }
>> 
>> So I am wondering what you refer to when saying "provide for language and base direction metadata in the *document* structure". Happy to discuss this also tomorrow during the i18n WG call.
>> 
>> Best,
>> 
>> Felix
>> 
>> 
>> 
>> at the field level, which is why our comment calls it out. That’s because each field may be in a different language or have a different base direction. Note that this is consistent with e.g. HTML5 and with a number of ebook formats.
>> 
>> The language may also be a separate field at the document level and this comment isn’t about the kinds of metadata (including language) as you list in your reply.
>> 
>> (Note that the I18N WG is publishing a FPWD of “best practices for specification authors” this coming week that details some of the items above and may be helpful to your WG. See http://w3c.github.io/bp-i18n-specdev/#lang_resource)
>> 
>> Thanks,
>> 
>> Addison
>> 
>> From: Robert Sanderson [mailto:azaroth42@gmail.com]
>> Sent: Sunday, October 11, 2015 2:24 AM
>> To: Phillips, Addison
>> Cc: public-digipub@w3.org; public-i18n-core@w3.org; Web Annotation
>> Subject: Re: [DPUB-ANNOTATION-UC] 2.1.3 general observation about language identification [I18N-ISSUE-458]
>> 
>> 
>> Thanks again for the comments.
>> 
>> I agree that language metadata is important and that the set of use cases does not specifically include any metadata about the body or target resources.  That was somewhat intentional, so as to avoid trying to list out all of the possible descriptive features for resources, such as creator, creation time, language, file format, license or other rights statements, intended audience and so forth.  These are listed for the annotation itself, as the primary resource of interest.
>> 
>> In the Web Annotation data model, the language is explicitly included along with format and general class of the resource [1].  In the upcoming WD (next week [2]), we also add creator and creation time.  For language we refer to RFC 5646 as the value of the language property.  Is that sufficient to cover the requirements?
>> 
>> Many thanks,
>> 
>> Rob
>> 
>> [1] http://www.w3.org/TR/annotation-model/#body-and-target-metadata
>> [2] http://azaroth42.github.io/web-annotation/model/wd/index-renamed.html
>> 
>> 
>> 
>> On Sat, Oct 10, 2015 at 1:19 PM, Phillips, Addison <addison@lab126.com> wrote:
>> [1] 2.1.3 general observation about language identification
>> Description:
>>     http://www.w3.org/TR/dpub-annotation-uc/
>> 
>>     2.1.3 general observation about language identification
>> 
>>     Tags and annotations generally use natural language tokens (such as words). While Unicode allows text to be stored, passed, and processed without regard for the specific language, it is the case that strings can benefit from language metadata for character shaping, spell-checking, font selection and more. In addition, directionality information is usually desired.
>> 
>> 
>> 
>> 
>> --
>> Rob Sanderson
>> Information Standards Advocate
>> Digital Library Systems and Services
>> Stanford, CA 94305
> 


----
Ivan Herman, W3C
Digital Publishing Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
ORCID ID: http://orcid.org/0000-0003-0782-2704
Received on Thursday, 15 October 2015 07:46:19 UTC