Re: What are Semantics? (Was: Serving generic XML)

From: Elliotte Rusty Harold <elharo@metalab.unc.edu>
Date: Mon, 19 Aug 2002 18:12:30 -0400
Message-Id: <p04330111b9871c02f7ef@[]>
To: www-tag@w3.org, www-style@w3.org

At 2:46 PM -0700 8/19/02, Tantek Çelik wrote:

>You actually expect a UA to parse the English tag name "headline" and then
>conclude it is a header, and then make similar conclusions for all other
>valid XML tag names?

Actually no. I expect it to look at the layout of the page and notice 
certain characteristics that strongly suggest certain things are 
headlines. I expect this will probably be done using some form of 
adaptive algorithm, rather than the deterministic ones we're 
accustomed to.
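
To make "adaptive" concrete, here is a minimal sketch in Python of a perceptron that learns from labeled examples which text runs are headlines. Everything in it is an assumption made for illustration: the two features (font size relative to body text, and a bold flag), the tiny training set, and the names `TRAINING`, `predict`, and `train`. It is not a description of any existing user agent.

```python
# Hypothetical training data: each sample is
# ((font size relative to body text, 1.0 if bold else 0.0), label),
# where label 1 means "headline" and 0 means "not a headline".
TRAINING = [
    ((2.0, 1.0), 1),  # large and bold
    ((1.5, 1.0), 1),  # moderately large and bold
    ((1.8, 0.0), 1),  # large but not bold
    ((1.0, 0.0), 0),  # ordinary body text
    ((1.0, 1.0), 0),  # bold emphasis at body size
    ((0.8, 0.0), 0),  # small print
]


def predict(weights, bias, features):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else 0


def train(samples, epochs=1000, rate=0.1):
    """Classic perceptron rule: nudge the weights after each mistake."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in samples:
            error = label - predict(weights, bias, features)
            if error:
                mistakes += 1
                weights = [w + rate * error * x for w, x in zip(weights, features)]
                bias += rate * error
        if mistakes == 0:  # every sample classified correctly; stop early
            break
    return weights, bias


weights, bias = train(TRAINING)
```

Retraining on a different corpus only means swapping out `TRAINING`; the code itself does not change, which is the appeal of the adaptive approach over hand-written rules.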

>This is because unambiguously parsing English and assigning meaning to
>English words is a solved problem right?
>Please do some homework on the state of AI and Natural Language Processing
>before making such ridiculous assertions.

This isn't just natural language processing, though. The mere fact 
that something is bigger and bold is a huge clue, and not the only 
one either.
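
As a rough illustration of the "bigger and bold" clue, the following Python sketch flags text runs that are bold and noticeably larger than the typical run on the page. The `TextRun` type, the `headline_candidates` function, and the 1.3 size ratio are all hypothetical choices made for this example; a real system would presumably combine many more clues.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextRun:
    """A piece of rendered text with the style it is displayed in (hypothetical type)."""
    text: str
    font_size: float  # in points
    bold: bool


def headline_candidates(runs, size_ratio=1.3):
    """Return the text of runs that are bold and clearly larger than body text.

    Body size is estimated as the median font size on the page; the
    size_ratio threshold is an arbitrary assumption, not a measured value.
    """
    if not runs:
        return []
    body_size = median(run.font_size for run in runs)
    return [
        run.text
        for run in runs
        if run.bold and run.font_size >= size_ratio * body_size
    ]


page = [
    TextRun("Local Team Wins", 24.0, True),
    TextRun("By A. Reporter", 10.0, False),
    TextRun("First paragraph of the story.", 12.0, False),
    TextRun("Second paragraph of the story.", 12.0, False),
    TextRun("Note:", 12.0, True),
    TextRun("Third paragraph of the story.", 12.0, False),
]

candidates = headline_candidates(page)
```

Here only "Local Team Wins" qualifies: "Note:" is bold but body-sized, so it is treated as inline emphasis rather than a headline.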

>And never mind the fact that 90%+ folks in the world don't speak English.
>Add "i18n" reading to your homework as well.

To the extent that other cultures use different visual metaphors, 
you'd need to rerun the adaptive algorithms on native-language 
sources. Though perhaps you could just use the language itself as one 
of the clues of what else was significant.

>This is because computer vision is a solved problem right?  Again, more AI
>reading would help here, as I don't think you understand where the state of
>the art is, nor how far it has to go.

Actually, it's much easier than that. You don't need computer vision 
because the information is already in the computer. It's even easier 
than OCR. The computer can rely on far more accurate information about 
what it's displaying on its screen. This isn't an easy 
problem by any means, but it's not nearly as hard as a lot of people 
think it is, and I strongly suspect it's much easier than changing 
the behavior of millions of web publishers who are hardwired by 
evolution to like WYSIWYG.

| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
|          XML in a  Nutshell, 2nd Edition (O'Reilly, 2002)          |
|              http://www.cafeconleche.org/books/xian2/              |
|  http://www.amazon.com/exec/obidos/ISBN%3D0596002920/cafeaulaitA/  |
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
|  Read Cafe con Leche for XML News: http://www.cafeconleche.org/    |
Received on Monday, 19 August 2002 18:20:53 UTC