[Bug 19050] New: Microdata: Language handling from bugzilla@jessica.w3.org on 2012-09-25 (public-html@w3.org from September 2012)

From: <bugzilla@jessica.w3.org>
Date: Tue, 25 Sep 2012 22:00:24 +0000
To: public-html@w3.org
Message-ID: <bug-19050-2495@http.www.w3.org/Bugs/Public/>
https://www.w3.org/Bugs/Public/show_bug.cgi?id=19050

           Summary: Microdata: Language handling
           Product: HTML WG
           Version: unspecified
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML5 spec
        AssignedTo: dave.null@w3.org
        ReportedBy: contributor@whatwg.org
         QAContact: public-html-bugzilla@w3.org
                CC: ian@hixie.ch, hsivonen@iki.fi, mike@w3.org,
                    public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org, philipj@opera.com,
                    cmhjones@gmail.com


This was cloned from bug 14470 as part of operation LATER convergence.
Originally filed: 2011-10-14 19:21:00 +0000
Original reporter: Jeni Tennison <jeni@jenitennison.com>

================================================================================
 #0   Jeni Tennison                                   2011-10-14 19:21:01 +0000 
--------------------------------------------------------------------------------
It is not clear how microdata handles languages. Language is not mentioned as
part of the microdata data model [1]. It is not exposed within microdata JSON
[2]. It is not used in the algorithm for creating vCard [3] or iCalendar [4],
where it should be used to provide a value for the LANGUAGE property [5][6].

There is a list of examples of multi-lingual content on the web at [7]. Other
examples are the EUR-Lex site, where information about items of European
legislation is available in multiple languages [8], and legislation.gov.uk,
where Welsh and English titles for the same item of legislation are listed
together [9].

Microdata will be unusable for multi-lingual content if it doesn't preserve the
language of textual values. The spec should make it clear whether language
should be preserved by consumers, ignored, or if this is implementation
dependent. Regardless, the vCard and iCalendar conversions in the spec should
take account of language.

[1]
http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#the-microdata-model
[2]
http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#json
[3]
http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#conversion-to-vcard
[4]
http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#conversion-to-icalendar
[5] http://tools.ietf.org/html/rfc6350#section-5.1
[6] http://tools.ietf.org/html/rfc5545#section-3.2.10
[7] http://microformats.org/wiki/multilingual-examples
[8]
http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31994Y0702(01):FR:NOT
[9] http://www.legislation.gov.uk/wsi
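
As a rough illustration of what "taking account of language" in the vCard
conversion could look like, the sketch below formats one content line with the
LANGUAGE parameter described in [5]. The function name vcard_text_line and the
simplified escaping are assumptions made for this example; this is not the
conversion algorithm in [3], and line folding is not handled.

```python
def vcard_text_line(name, value, language=None):
    """Format one vCard content line, optionally with a LANGUAGE parameter.

    Hypothetical helper for illustration only; not part of any spec.
    """
    # Simplified text escaping: backslash, comma, semicolon and newline.
    escaped = (
        value.replace("\\", "\\\\")
        .replace(",", "\\,")
        .replace(";", "\\;")
        .replace("\n", "\\n")
    )
    # Only emit the parameter when a language is actually known.
    params = f";LANGUAGE={language}" if language else ""
    return f"{name.upper()}{params}:{escaped}"


print(vcard_text_line("role", "hoca", language="tr"))  # ROLE;LANGUAGE=tr:hoca
```

A converter would need some source for the language argument, which is exactly
the open question here: the microdata model itself does not supply one.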
================================================================================
 #1   Ian 'Hixie' Hickson                             2011-10-18 22:41:06 +0000 
--------------------------------------------------------------------------------
It's entirely up to the vocabulary to specify a property to carry the language.
Microdata is just a group of name-value pairs.
================================================================================
 #2   Jeni Tennison                                   2011-10-19 15:27:16 +0000 
--------------------------------------------------------------------------------
(In reply to comment #1)
> It's entirely up to the vocabulary to specify a property to carry the language.
> Microdata is just a group of name-value pairs.

Your (now dropped) mapping of microdata to RDF [1] did take into account
language from the element when generating RDF (step 6.1.4.). Does that mean
that it is OK for a consumer to take language into account when processing
microdata, despite it not being part of the microdata data model, or was that
an error in that mapping?

Having some text in the spec that clarifies the interaction of microdata and
HTML language would be really useful to avoid publisher and consumer confusion.

[1]
http://www.w3.org/TR/2011/WD-microdata-20110525/#generate-the-triples-for-an-item
================================================================================
 #3   Ian 'Hixie' Hickson                             2011-10-25 02:58:01 +0000 
--------------------------------------------------------------------------------
The RDF mapping was not a microdata to RDF mapping, it was an HTML to RDF
mapping, and so it did much more than just expose the microdata model. It was
also, IMHO, a rather misguided idea.

I don't understand what is unclear here. It seems crystal clear that the
microdata model doesn't have language, just like it doesn't list prices for
each property, or data types, or the phase of the moon when the property was
set: if it had anything to do with a language, it would be mentioned, and it is
not.
================================================================================
 #4   Jeni Tennison                                   2011-10-26 09:00:40 +0000 
--------------------------------------------------------------------------------
The issue is that the natural/obvious method of getting information about the
language of a property value is for an application to use the lang DOM property
of the relevant property element. Without a clear indication that doing so is
non-conformant, the assumption will be that the HTML language can be used by
applications that interpret microdata and map to other formats because even
though it's not part of the microdata data model, language is information that
is accessible from the DOM.

It is also not clear to microdata vocabulary creators that they must provide
properties/types to indicate the language of a property's value if they want to
capture that information. Illustrating the use of other languages in one of the
example vocabularies would be one way of making this clearer.
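
To make the "natural/obvious method" concrete, here is a minimal sketch of a
consumer that pairs each property value with the language inherited from the
nearest lang attribute. It uses Python's html.parser rather than a real DOM,
and the class name PropertyLanguages, the sample markup and the film titles are
assumptions for illustration. It also assumes well-formed markup with explicit
end tags, and it only handles text-valued properties and <meta content>.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class PropertyLanguages(HTMLParser):
    """Collect (property name, value, inherited language) triples."""

    def __init__(self):
        super().__init__()
        self.langs = [""]      # effective language of each open element
        self.open_props = []   # (depth, names, lang, text parts) per open property
        self.found = []        # resulting (name, value, lang) triples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An explicit lang attribute wins; otherwise inherit from the parent.
        lang = (attrs["lang"] or "") if "lang" in attrs else self.langs[-1]
        names = (attrs.get("itemprop") or "").split()
        if tag in VOID:
            # Void elements have no end tag; only <meta content> is handled.
            if tag == "meta":
                for name in names:
                    self.found.append((name, attrs.get("content") or "", lang))
            return
        self.langs.append(lang)
        if names:
            self.open_props.append((len(self.langs), names, lang, []))

    def handle_endtag(self, tag):
        if tag in VOID or len(self.langs) == 1:
            return
        # Close the innermost property if this end tag matches its depth.
        if self.open_props and self.open_props[-1][0] == len(self.langs):
            _, names, lang, parts = self.open_props.pop()
            for name in names:
                self.found.append((name, "".join(parts), lang))
        self.langs.pop()

    def handle_data(self, data):
        # Text belongs to every property element that is currently open.
        for entry in self.open_props:
            entry[3].append(data)


doc = """
<div itemscope lang="en">
 <span itemprop="name">The Adventures of Tintin</span>
 <span itemprop="name" lang="fr">Les Aventures de Tintin</span>
</div>
"""
parser = PropertyLanguages()
parser.feed(doc)
parser.close()
print(parser.found)
```

Nothing in the microdata model licenses the third element of each triple; the
sketch simply shows how easily a consumer can pick it up from the markup.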
================================================================================
 #5   Henri Sivonen                                   2011-10-27 08:52:47 +0000 
--------------------------------------------------------------------------------
(In reply to comment #1)
> It's entirely up to the vocabulary to specify a property to carry the language.
> Microdata is just a group of name-value pairs.

It seems very inconvenient to have to specify a vocabulary-specific language
markup mechanism instead of using the language markup mechanism from the HTML
layer.

Unfortunately, language info from the HTML layer doesn't map nicely to JSON. Is
that the reason why language isn't part of the data model? If not, what is the
reason?
================================================================================
 #6   Ian 'Hixie' Hickson                             2011-11-03 16:00:27 +0000 
--------------------------------------------------------------------------------
I agree that it's inconvenient. The JSON issue isn't the reason, though it is
certainly a factor.

The original reason is simply that none of the use cases indicated a need for
this. It's still not entirely clear to me what use cases exist. Certainly
multilingual content exists, but what are people intending to do with it in a
microdata context that requires the labeling to persist?
================================================================================
 #7   Jeni Tennison                                   2011-11-05 07:17:00 +0000 
--------------------------------------------------------------------------------
A use case is that a search engine wants to bring together reviews and other
information about films into film-centric pages. It gathers that information
about that film from all over the web and wants to present people with reviews
in their preferred language(s). This requires it to preserve information about
the language of the reviews.

Also in this case, the film might have different titles in different languages;
the search engine would be able to link together the information provided in
different languages about the same film using pages in which there were
multiple translations of the title (see e.g. [1]).

A perhaps more esoteric use case: translation services such as Google Translate
might look for examples where the same information about an item was given in
different languages as potential sources for improving its translation
services.

[1]
http://fr.wikipedia.org/wiki/Les_Aventures_de_Tintin_:_Le_Secret_de_La_Licorne
================================================================================
 #8   Ian 'Hixie' Hickson                             2011-11-11 20:01:18 +0000 
--------------------------------------------------------------------------------
(In reply to comment #7)
> A use case is that a search engine wants to bring together reviews and other
> information about films into film-centric pages. It gathers that information
> about that film from all over the web and wants to present people with reviews
> in their preferred language(s). This requires it to preserve information about
> the language of the reviews.

(I assume you mean aggregator, not search engine.)

The above can be solved today; you just need to include the language
information in the microdata:

   <p itemscope itemtype="http://example.com/movie/review">
    <span itemprop=text> bla bla bla </span>
    <meta itemprop=language content="en">
   </p>

It's redundant with lang="", but lang="" doesn't have the same coarseness as
microdata. Consider:

   <p itemscope itemtype="http://example.com/movie/review" lang="en">
    <span itemprop=text>
     <span lang="de">bla</span>
     <span lang="fr">bla</span>
    </span>
   </p>

What language would you associate with the "text" property?
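
One way to see the mismatch is to list the language of each run of text in that
markup. The sketch below does this with Python's html.parser; the class name
LanguageRuns is hypothetical, and the code assumes well-formed markup without
void elements.

```python
from html.parser import HTMLParser


class LanguageRuns(HTMLParser):
    """Record each non-whitespace text run with its inherited language."""

    def __init__(self):
        super().__init__()
        self.langs = [""]   # effective language of each open element
        self.runs = []      # (text, lang) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An explicit lang attribute wins; otherwise inherit from the parent.
        inherited = self.langs[-1]
        self.langs.append((attrs["lang"] or "") if "lang" in attrs else inherited)

    def handle_endtag(self, tag):
        if len(self.langs) > 1:
            self.langs.pop()

    def handle_data(self, data):
        if data.strip():
            self.runs.append((data.strip(), self.langs[-1]))


markup = """
<p itemscope itemtype="http://example.com/movie/review" lang="en">
 <span itemprop=text>
  <span lang="de">bla</span>
  <span lang="fr">bla</span>
 </span>
</p>
"""
runs_parser = LanguageRuns()
runs_parser.feed(markup)
runs_parser.close()
print(runs_parser.runs)  # [('bla', 'de'), ('bla', 'fr')]
```

The property element itself inherits "en", yet none of its non-whitespace text
is in that language, so no single tag describes the value.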

Also, note that microdata isn't currently intended for handling cases where
entire blobs of HTML content are aggregated. For example, it would completely
fail with something like:

  <div itemprop=adcopy>
   <style scoped> em { color: purple } </style>    
   This product costs <s>$500</s> just $100!
   You should get <em>this</em> version, not any version.
  </div>

The microdata extraction would get:

   "adcopy": [ "\n    em { color: purple }     \n   This product costs $500
just $100!\n   You should get this version, not any version.\n  \n" ]

...which isn't at all what was intended.
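
The flattening can be reproduced with a few lines of Python. This is only a
sketch of textContent-style extraction using html.parser, not the microdata
extraction algorithm itself; the names TextContent, text_content and AD_COPY
are made up for the example.

```python
from html.parser import HTMLParser


class TextContent(HTMLParser):
    """Concatenate every text node, ignoring all element boundaries."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Style sheet text and struck-out text arrive here like any other text.
        self.parts.append(data)


def text_content(markup):
    parser = TextContent()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


AD_COPY = """<div itemprop=adcopy>
 <style scoped> em { color: purple } </style>
 This product costs <s>$500</s> just $100!
 You should get <em>this</em> version, not any version.
</div>"""

flattened = text_content(AD_COPY)
print(repr(flattened))
```

The style rule and the struck-out price both survive as plain text, while the
emphasis and the strike-through are lost.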


> A perhaps more esoteric use case: translation services such as Google Translate
> might look for examples where the same information about an item was given in
> different languages as potential sources for improving its translation
> services.

Such a tool would presumably want intra-text language annotations, not just
coarse language annotations.


I think if we're to address the use cases presented, we need to add more than
just lang="" support; we need to add subtree support (which would give us
language support for free). I don't think it makes sense to make such a radical
addition so early in the technology's development. We should wait to see how
people are using it, first.
================================================================================
 #9   Ian 'Hixie' Hickson                             2011-12-07 00:15:39 +0000 
--------------------------------------------------------------------------------
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
satisfied with this response, please change the state of this bug to CLOSED. If
you have additional information and would like the editor to reconsider, please
reopen this bug. If you would like to escalate the issue to the full HTML
Working Group, please add the TrackerRequest keyword to this bug, and suggest
title and text for the tracker issue; or you may create a tracker issue
yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Partially Accepted
Change Description: none yet
Rationale: I have marked this LATER so that we can look at this again once
browsers have caught up with what we've specified so far, per the last
paragraph of comment 8.
================================================================================

Received on Tuesday, 25 September 2012 22:01:27 UTC