Re: [Ltru] RE: For review: Tagging text with no language

From: Mark Davis <mark.davis@icu-project.org> · Date: Mon, 21 May 2007 14:23:26 -0700

I agree with Karen on those points. Some other comments on the text
(outdented)

Question

How do I mark up HTML or XML content for language when I don't know the
language, or the content is non-linguistic?

 Background

You should always use attributes to identify the human language of the text,
when known, on the highest possible element of documents in HTML or a format
based on XML, so that applications such as voice browsers, style sheets, and
the like can process that text.

I disagree. There are a great many times when you don't need to tag the
language. An XML document that is used in AJAX may have no need whatsoever
for a language. Moreover, in our experience, the language tags are so often
set wrong on web pages that we have to ignore them completely. We only
really trust the language tags that are attached internally.

Probably better to just phrase this as "When you want to identify the
language of the text, here is how to do it."

In XML-based formats you would usually use the xml:lang attribute, and in
XHTML/HTML the lang and/or xml:lang attributes. (See Declaring Language in
XHTML and HTML <http://www.w3.org/International/tutorials/language-decl/> for
details about language tagging in HTML.)

You can override that initial language setting for a part of the document
that is in a different language, e.g. some French quotation in an English
document, by using the same attribute(s) around the relevant bit of text.
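
As a rough illustration of that override, here is a minimal Python sketch
that works out which language is in effect for each element by inheriting
from the nearest ancestor that declares xml:lang. The sample document, the
element names, and the function name effective_langs are all made up for
this example; only the standard library's xml.etree.ElementTree is used.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined "xml:" prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_langs(elem, inherited=None):
    """Yield (tag, language) pairs; xml:lang on an element overrides its ancestors."""
    lang = elem.get(XML_LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from effective_langs(child, lang)


# Hypothetical English document containing a French quotation.
doc = ET.fromstring(
    '<doc xml:lang="en"><p>She said <q xml:lang="fr">bonjour</q> and left.</p></doc>'
)
result = list(effective_langs(doc))
print(result)
```

The q element reports fr while doc and p keep the en declared at the top.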

Suppose you have some text that is not in any language, such as type
samples, part numbers, illustrations of binary data, etc. How would you say
that this was in no language in particular? Or how about a situation where
you extracted the text from a database and it came with no linguistic
information?
 Answer

There are two parts to the above question.
 When the text is non-linguistic <http://www.w3.org/International/questions/qa-no-language#nonlinguistic>

Use the subtag zxx when the text is *known to be* not in any language.

This would apply for text such as type samples, part numbers, illustrations
of binary data, etc. The definition of zxx in the Language Subtag Registry
is 'no linguistic content'.

I wouldn't include part numbers. I would include binary data, if what that
binary data represented had no linguistic content.

For example:

<p>Here is a list of part numbers:
<span xml:lang="zxx" lang="zxx">9RUI34 8XOS12 3TYY85</span>.</p>

  When the language is undetermined <http://www.w3.org/International/questions/qa-no-language#undetermined>

If the XML format you are using supports it, use
xml:lang="" <http://www.w3.org/TR/REC-xml/#sec-lang-tag>,
otherwise use the subtag und.

However, according to the people who define language subtags, you should
only tag text as undetermined if you can't just leave it as is. In practice,
this means that markup described in the previous paragraph should only be
used where the format you are using requires it, or where undetermined text
is embedded in some content that has already been labeled for language in
some way.

These values indicate that we cannot determine, for one reason or another,
what the appropriate language information is, or whether the text is
non-linguistic.

und does not mean "cannot determine". It means "has not determined". Maybe I
could have determined the language, maybe not. That isn't the meaning of the
tag.

For example, xml:lang="" might be used if text is to be included into a
document and the text comes from a database that doesn't provide language
information and you can't be reasonably sure what the language is.

Again "can't be reasonably" is phrasing that doesn't belong here.

The effect would be to prevent any language information declared higher up
the hierarchy of elements in the document from applying to the included
text.
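
A minimal Python sketch of that effect, assuming a format whose schema
permits an empty xml:lang: the resolver below treats an empty value as
"stop inheriting", so the included text ends up with no language at all.
The document, the record and title element names, and the function name
resolve_lang are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def resolve_lang(elem, inherited=None):
    """Yield (tag, language); an empty xml:lang clears any inherited language."""
    declared = elem.get(XML_LANG)
    if declared is None:
        lang = inherited   # nothing declared here: inherit from the parent
    elif declared == "":
        lang = None        # xml:lang="": no language information applies
    else:
        lang = declared    # an explicit language tag overrides the parent
    yield elem.tag, lang
    for child in elem:
        yield from resolve_lang(child, lang)


# Hypothetical English document that embeds a record with unknown language.
doc = ET.fromstring(
    '<doc xml:lang="en"><p>Intro</p>'
    '<record xml:lang=""><title>from database</title></record></doc>'
)
resolved = list(resolve_lang(doc))
print(resolved)
```

Without the empty attribute on record, both record and title would have
been reported as en.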

 Implications for XHTML/HTML

Note that xml:lang="" only works if defined in
the XML schema that describes the format of your document. It is not
appropriate for XHTML because the XHTML DTDs define xml:lang in such a way
that an empty string value for the xml:lang attribute is disallowed. (The
xml:lang attribute takes NMTOKEN values in the schema, so they cannot be
empty.)

You cannot leave the lang attribute empty in HTML, either.

For XHTML and HTML, then, you should use und if you need to express the
undetermined nature of some text embedded in a document. Note, again, that on
the very rare occasion when the whole document is in an undetermined language
it is better to just not declare the default language of the document.
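
The choices in the draft as quoted here can be summed up in a small Python
sketch. It only covers the case where a value is actually needed, and the
function name tag_for and its parameters are hypothetical, not part of any
library or specification.

```python
def tag_for(language=None, linguistic=True, allows_empty=False):
    """Pick a language attribute value following the quoted draft's advice.

    language     -- a known language tag such as "fr", or None if undetermined
    linguistic   -- False when the text is known to have no linguistic content
    allows_empty -- True when the format's schema permits xml:lang=""
    """
    if not linguistic:
        return "zxx"          # known to be non-linguistic
    if language is not None:
        return language       # language is known: use its tag
    # Undetermined: prefer the empty value where allowed, else fall back to und.
    return "" if allows_empty else "und"


print(tag_for(linguistic=False))
print(tag_for("fr"))
print(repr(tag_for(allows_empty=True)))
print(tag_for())
```

For XHTML and HTML, allows_empty would stay False, which gives und.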
  By the way

This is a summary of a discussion in a thread
<http://lists.w3.org/Archives/Public/www-international/2005JulSep/0163.html>
on www-international@w3.org, and a later reprise
<http://lists.w3.org/Archives/Public/www-international/2007JanMar/0123.html>
of those ideas to which several people contributed.

Martin Dürst points out
<http://lists.w3.org/Archives/Public/www-international/2007JanMar/0136.html>
that you can redefine the XHTML/HTML format within the document to create an
HTML/XHTML page that validates while using lang="" or xml:lang="". This is
not recommended for widespread use, however, because such a document is no
longer strictly conforming in the sense of XHTML 1.0.
 Mark

On 5/21/07, Karen_Broome@spe.sony.com <Karen_Broome@spe.sony.com> wrote:
>
>
> Sorry for piping up late on this issue....
>
> I still question the practical applications of the "no linguistic content"
> semantic. I thought we had agreed that the most appropriate use of the "zxx"
> tag was to indicate that the association of a language with a piece of
> content is not applicable.  So if I'm classifying an instrumental musical
> work using a standard library cataloging system that is also used for
> lyrical works, I might indicate that the recording is "zxx"; a silent film
> might have a "zxx" audio track. This use of the zxx tag is not indicated in
> the text on the page. Should it be?
>
> Second, I don't think the part number example on the page is useful if the
> intention is to code pages "so that applications such as voice browsers ...
> can process that text." If we think about what behavior would be expected by
> a screen reader upon encountering a "zxx" tag, I would expect that it would
> ignore the text inside the tag -- just as it should with, say, binary junk.
> But clearly anyone trying to make sense of the content shown on this page
> would need to "read" those part numbers as well.  The same is true for
> programming code snippets that appear in technical tutorials. This is where
> I think there is a distinction between "non-applicable" and "non-linguistic"
> that is being ignored.
>
> What purpose would your <span> tag in the example serve? While this may do
> the right thing for the spellchecker, this is not the right thing to do for
> a screen reader.
>
> I have always argued that "no linguistic content" is not appropriate for
> code or part numbers and I think recent examples show why I continue to
> think this is a problematic usage and that the "zxx" semantic should be "not
> applicable."
>
> Regards,
>
> Karen Broome
> Metadata Systems Designer
> Sony Pictures Entertainment
> 310.244.4384
>
> www-international-request@w3.org wrote on 05/21/2007 11:51:35 AM:
>
> >
> > Najib and Martin,
> >
> > Thanks for your comments.  I had another go at the document.
> > http://www.w3.org/International/questions/qa-no-language
> >
> > RI
> >
> >
> > ============
> > Richard Ishida
> > Internationalization Lead
> > W3C (World Wide Web Consortium)
> >
> > http://www.w3.org/People/Ishida/
> > http://www.w3.org/International/
> > http://people.w3.org/rishida/blog/
> > http://www.flickr.com/photos/ishida/
> _______________________________________________
> Ltru mailing list
> Ltru@ietf.org
> https://www1.ietf.org/mailman/listinfo/ltru
>
>

-- 
Mark