- From: Mark Davis <mark.davis@icu-project.org>
- Date: Mon, 21 May 2007 14:23:26 -0700
- To: "Karen_Broome@spe.sony.com" <Karen_Broome@spe.sony.com>
- Cc: "Richard Ishida" <ishida@w3.org>, "Najib Tounsi" <ntounsi@emi.ac.ma>, www-international@w3.org, "LTRU Working Group" <ltru@ietf.org>
- Message-ID: <30b660a20705211423w5160232ie1ce3e53ec2b636b@mail.gmail.com>
I agree with Karen on those points. Some other comments on the text
(outdented)

    Question

    How do I mark up HTML or XML content for language when I don't know
    the language, or the content is non-linguistic?

    [Skip to the answer]
    <http://www.w3.org/International/questions/qa-no-language#answer>

    Background

    You should always use attributes to identify the human language of
    the text, when known, on the highest possible element of documents
    in HTML or a format based on XML, so that applications such as voice
    browsers, style sheets, and the like can process that text.

I disagree. There are a great many times when you don't need to tag the
language. An XML document that is used in AJAX may have no need
whatsoever for a language. Moreover, in our experience, the language
tags are so often set wrong on web pages that we have to ignore them
completely. We only really trust the language tags that are attached
internally.

Probably better to just phrase this as "When you want to identify the
language of the text, here is how to do it."

    In XML-based formats you would usually use the xml:lang attribute,
    and in XHTML/HTML the lang and/or xml:lang attributes. (See
    Declaring Language in XHTML and HTML
    <http://www.w3.org/International/tutorials/language-decl/> for
    details about language tagging in HTML.) You can override that
    initial language setting for a part of the document that is in a
    different language, e.g. some French quotation in an English
    document, by using the same attribute(s) around the relevant bit of
    text.

    Suppose you have some text that is not in any language, such as type
    samples, part numbers, illustrations of binary data, etc. How would
    you say that this was in no language in particular? Or how about a
    situation where you extracted the text from a database and it came
    with no linguistic information?

    Answer

    There are two parts to the above question.

    When the text is non-linguistic
    <http://www.w3.org/International/questions/qa-no-language#nonlinguistic>

    Use the subtag zxx when the text is *known to be* not in any
    language. This would apply for text such as type samples, part
    numbers, illustrations of binary data, etc. The definition of zxx in
    the Language Subtag Registry is 'no linguistic content'.

I wouldn't include part numbers. I would include binary data, if what
that binary data represented had no linguistic content.

    For example:

    <p>Here is a list of part numbers: <span xml:lang="zxx"
    lang="zxx">9RUI34 8XOS12 3TYY85</span>.</p>

    When the language is undetermined
    <http://www.w3.org/International/questions/qa-no-language#undetermined>

    If the XML format you are using supports it, use xml:lang=""
    <http://www.w3.org/TR/REC-xml/#sec-lang-tag>, otherwise use the
    subtag und.

    However, according to the people who define language subtags, you
    should only tag text as undetermined if you can't just leave it as
    is. In practice, this means that markup described in the previous
    paragraph should only be used where the format you are using
    requires it, or where undetermined text is embedded in some content
    that has already been labeled for language in some way.

    These values indicate that we cannot determine, for one reason or
    another, what the appropriate language information is, or whether
    the text is non-linguistic.

und does not mean "cannot determine". It means "has not determined".
Maybe I could have determined the language, maybe not. That isn't the
meaning of the tag.

    For example, xml:lang="" might be used if text is to be included
    into a document and the text comes from a database that doesn't
    provide language information and you can't be reasonably sure what
    the language is.

Again "can't be reasonably" is phrasing that doesn't belong here.

    The effect would be to prevent any language information declared
    higher up the hierarchy of elements in the document from applying to
    the included text.
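
To make the inheritance behavior concrete, here is a minimal sketch,
assuming Python's standard xml.etree.ElementTree. The helper name
effective_langs and the sample document are hypothetical, not taken from
the draft. It resolves the language that applies to each element and
treats xml:lang="" as cancelling whatever was declared higher up:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_langs(element, inherited=None):
    """Yield (element, language) pairs for element and its descendants.

    language is None when no language information applies.
    (Hypothetical helper, written for illustration only.)
    """
    # Use the element's own xml:lang if present, else the inherited value.
    lang = element.get(XML_LANG, inherited)
    # An empty xml:lang overrides the enclosing declaration without
    # specifying another language.
    if lang == "":
        lang = None
    yield element, lang
    for child in element:
        yield from effective_langs(child, lang)


# Hypothetical sample document; the ids only exist to label the results.
doc = ET.fromstring(
    '<doc xml:lang="en">'
    '<p id="a">Plain English text.</p>'
    '<p id="b" xml:lang="fr">Une citation.</p>'
    '<p id="c" xml:lang="">Text pulled from a database.</p>'
    '<p id="d" xml:lang="und">Text pulled from a database.</p>'
    '<p id="e" xml:lang="zxx">01101001 11010010</p>'
    '</doc>'
)

result = {
    el.get("id"): lang
    for el, lang in effective_langs(doc)
    if el.get("id") is not None
}
print(result)
```

With this sample, the paragraph carrying xml:lang="" ends up with no
language at all, whereas the one tagged und keeps an explicit
"undetermined" value, so a consumer can tell the two cases apart.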

    Implications for XHTML/HTML

    Note that xml:lang="" only works if defined in the XML schema that
    describes the format of your document. It is not appropriate for
    XHTML because the XHTML DTDs define xml:lang in such a way that an
    empty string value for the xml:lang attribute is disallowed. (The
    xml:lang attribute takes NMTOKEN values in the schema, so they
    cannot be empty.) You cannot leave the lang attribute empty in HTML,
    either.

    For XHTML and HTML, then, you should use und if you need to express
    the undefined nature of some text embedded in a document.

    Note, again, that on the very rare occasion when the whole document
    is in an undefined language it is better to just not declare the
    default language of the document.

    By the way

    This is a summary of a discussion in a thread
    <http://lists.w3.org/Archives/Public/www-international/2005JulSep/0163.html>
    on www-international@w3.org, and a later reprise
    <http://lists.w3.org/Archives/Public/www-international/2007JanMar/0123.html>
    of those ideas to which several people contributed.

    Martin Dürst points out
    <http://lists.w3.org/Archives/Public/www-international/2007JanMar/0136.html>
    that you can redefine the XHTML/HTML format within the document to
    create an HTML/XHTML page that validates while using lang="" or
    xml:lang="". This is not recommended for widespread use, however,
    because such a document is no longer strictly conforming in the
    sense of XHTML 1.0.

Mark

On 5/21/07, Karen_Broome@spe.sony.com <Karen_Broome@spe.sony.com> wrote:
>
>
> Sorry for piping up late on this issue....
>
> I still question the practical applications of the "no linguistic content"
> semantic. I thought we had agreed that the most appropriate use of the "zxx"
> tag was to indicate that the association of a language with a piece of
> content is not applicable.
> So if I'm classifying an instrumental musical
> work using a standard library cataloging system that is also used for
> lyrical works, I might indicate that the recording is "zxx"; a silent film
> might have a "zxx" audio track. This use of the zxx tag is not indicated in
> the text on the page. Should it be?
>
> Second, I don't think the part number example on the page is useful if the
> intention is to code pages "so that applications such as voice browsers ...
> can process that text." If we think about what behavior would be expected by
> a screen reader upon encountering a "zxx" tag, I would expect that it would
> ignore the text inside the tag -- just as it should with, say, binary junk.
> But clearly anyone trying to make sense of the content shown on this page
> would need to "read" those part numbers as well. The same is true for
> programming code snippets that appear in technical tutorials. This is where
> I think there is a distinction between "non-applicable" and "non-linguistic"
> that is being ignored.
>
> What purpose would your <span> tag in the example serve? While this may do
> the right thing for the spellchecker, this is not the right thing to do for
> a screen reader.
>
> I have always argued that "no linguistic content" is not appropriate for
> code or part numbers and I think recent examples show why I continue to
> think this is a problematic usage and that the "zxx" semantic should be "not
> applicable."
>
> Regards,
>
> Karen Broome
> Metadata Systems Designer
> Sony Pictures Entertainment
> 310.244.4384
>
> www-international-request@w3.org wrote on 05/21/2007 11:51:35 AM:
>
> >
> > Najib and Martin,
> >
> > Thanks for your comments. I had another go at the document.
> > http://www.w3.org/International/questions/qa-no-language
> >
> > RI
> >
> >
> > ============
> > Richard Ishida
> > Internationalization Lead
> > W3C (World Wide Web Consortium)
> >
> > http://www.w3.org/People/Ishida/
> > http://www.w3.org/International/
> > http://people.w3.org/rishida/blog/
> > http://www.flickr.com/photos/ishida/
> _______________________________________________
> Ltru mailing list
> Ltru@ietf.org
> https://www1.ietf.org/mailman/listinfo/ltru
>
>

--
Mark
Received on Monday, 21 May 2007 21:23:37 UTC