
Re: Language X within scope of language Y

From: Mark Davis <mark.davis@jtcsv.com>
Date: Thu, 20 Jan 2005 08:14:12 -0800
Message-ID: <13a901c4ff0b$1632af00$6501a8c0@sanjose.ibm.com>
To: <www-international@w3.org>, "David Clarke" <w3c@dragonthoughts.co.uk>

That is a good point. And in theory it would be appropriate for every
instance of "the" meaning the English word to be tagged as English on
sites like http://www.lemonde.fr/. (There is one instance on that page right
now, although that might change by the time you read this, since the page
updates frequently.)

In practice, however, nobody is going to take the time to do that, so it is
really a moot point. If items like this end up being a significant issue for
search engines, they will build in heuristics to try to detect them.
If they don't end up being a significant issue, the search engines won't
bother. I simply think that the best you can hope for in practice is that a
document gets tagged with the predominant language in the document;
fine-grained tagging may happen, but cannot be relied upon in general -- and will
simply not be done in the vast majority of cases.


----- Original Message ----- 
From: "David Clarke" <w3c@dragonthoughts.co.uk>
To: <www-international@w3.org>
Sent: Thursday, January 20, 2005 00:18
Subject: Re: Language X within scope of language Y

I think there is an area of importance that has been missed in this

When indexing content, as for a search engine, common words are dropped
out. Thus for indexing something in English, the word "the" would not be
indexed as it appears in almost every English sentence.

However, the word "the" in French carries the meaning of tea; this
suggests it should not be omitted from an index. Were it to be embedded
within a predominantly English text, embedded language marking would
make a significant difference to processing.

Similarly, an English "the" embedded in a French text should be omitted
from the index, but the indexer would need the clue to the embedded

  (Note: the French "the" should have an accent ("thé"), but accents are
regularly omitted in computer-based content.)
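
To make the indexing point concrete, here is a minimal sketch in Python of
stop-word filtering driven by per-span language tags. Everything in it is an
assumption for illustration: the tiny STOP_WORDS lists, the index_terms
function, and the (lang, text) span format are hypothetical and are not taken
from any real search engine or indexing library.

```python
# Hypothetical, deliberately tiny stop-word lists keyed by language tag.
# A real indexer would use far larger lists.
STOP_WORDS = {
    "en": {"the", "a", "of", "is", "in"},
    "fr": {"le", "la", "de", "du", "un", "est"},
}


def index_terms(spans):
    """Return the (lang, word) pairs worth indexing.

    `spans` is a list of (lang, text) pairs, i.e. text that has already
    been split according to its language markup.
    """
    terms = []
    for lang, text in spans:
        # Unknown languages get no stop-word filtering at all.
        stop = STOP_WORDS.get(lang, set())
        for word in text.lower().split():
            word = word.strip('.,;:!?"()')
            if word and word not in stop:
                terms.append((lang, word))
    return terms


# English sentence with the embedded French word tagged as French.
tagged = [("en", "She ordered a"), ("fr", "the"), ("en", "in Paris")]

# The same sentence carrying only a document-level English tag.
untagged = [("en", "She ordered a the in Paris")]

print(index_terms(tagged))
print(index_terms(untagged))
```

With the embedded tag, the French "the" is kept in the index; with only the
document-level tag, it is dropped as an English stop word. The reverse case
(an English "the" inside French text) works the same way with the tags
swapped.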

In message <129701c4fe94$75b7a100$6501a8c0@sanjose.ibm.com>, Mark Davis
<mark.davis@jtcsv.com> writes
>I agree. Big gray area. And in practice, I suspect that 99.428571% of the
>time, even if someone *could* annotate a document to indicate that an
>embedded word, sentence, or phrase is French instead of English, they
>So one certainly couldn't depend on it happening in arbitrary documents,
>even if the capability is there (eg in HTML or XML).
>On the other hand, within a closed environment, such as a linguistics
>research project, such capabilities might be used, and then depended on.
>----- Original Message -----
>From: "Peter Constable" <petercon@microsoft.com>
>To: <www-rdf-interest@w3.org>; <www-international@w3.org>;
>Sent: Wednesday, January 19, 2005 17:26
>Subject: RE: Language X within scope of language Y
>> From: Mark Davis [mailto:mark.davis@jtcsv.com]
>> Also, because words get adopted over time, and become "more and more"
>> considered a natural part of the language.
>I almost mentioned lexical borrowings as an issue. To my knowledge (though
>language contact is not an area of expertise for me) linguists have not
>established agreed-upon criteria by which to decide that a lexical borrowing
>has become fully incorporated into another language. The process is
>certainly a gradual one.
>So, for instance, most English speakers would have no clue that
>"conversation" came into English from French. Most are probably aware that
>"faux pas" comes from French, but would not be conscious of that each time
>they use it. (Another example that's probably further along in
>internalization would be American usage of "foyer" in which the
>pronunciation is fully Anglicised (Canadians and Brits say /foije/).) I'd
>guess that most times an English speaker uses "nom de plume" or "bête noire"
>they are conscious of the French origin, though some might feel this has
>been adopted into their working vocabulary. And I'd guess any time an English
>speaker used "adieu", "une bonne idée" or several other expressions they might
>use in an English conversation that they'd perceive themselves as having
>switched temporarily to French.
>So, at what point do you tag xml:lang="en" versus xml:lang="fr"? There are
>no well-defined answers to this one.
>Peter Constable
>Ietf-languages mailing list

David Clarke
Received on Thursday, 20 January 2005 16:14:20 UTC
