W3C home > Mailing lists > Public > w3c-rdfcore-wg@w3.org > August 2003

Re: RDF Use Case: scraping metadata from the web

From: Martin Duerst <duerst@w3.org>
Date: Tue, 05 Aug 2003 09:34:21 -0400
Message-Id: <4.2.0.58.J.20030804152117.0496dbe0@localhost>
To: Brian McBride <bwm@hplb.hpl.hp.com>
Cc: rdf core <w3c-rdfcore-wg@w3.org>, i18n <w3c-i18n-ig@w3.org>

Hello Brian,

At 15:16 03/08/04 +0100, Brian McBride wrote:

>I'm still at the point of looking for a use case to demonstrate that 
>markup integrity is a real problem.

For some people, it is important. For others, it may not be important.
For the RDF Core WG, the graph is obviously very important, and the
triples. If somebody created a new language to serialize RDF, and
this new language would mess up graphs, I guess you would not be
happy. If this currently happened with RDF/XML, or if some XML
group changed XML so that it could happen, I guess you would not
be happy.

So I guess you should be able to understand that other people will
not be happy at all if their markup is arbitrarily changed. It's
not necessarily the people in the I18N WG who are most concerned
with markup integrity (although I think we actually are). But
assume some third party wants to use RDF to scrape metadata
from XML documents, and this third party is concerned about
markup integrity, either because s/he is just convinced that
markup is crucial, or because of concerns for various round
trip scenarios.

After all, any user can scrape plain text literals (with
language information), put them into RDF, and get them back
unchanged. Do you (the RDF Core WG) or we (the I18N WG) have
a detailled use case for this? Or do we all just agree, even
without ever really talking about it, that it would be a very
bad idea if plain literals suddenly got changed, e.g. if RDF
suddenly upper-cased all plain literals?

So let's assume that people in the XML community are concerned
in a similar way about markup integrity as people in the RDF
community are concerned about triple and graph integrity.

So a person who is concerned about markup integrity does some
scraping or something similar. They are faced with the following
alternatives:
1) Preserve the markup, ignore the language information
2) Change the markup, squeeze in an additional element to
    attach language information.
4) Put the language information somewhere else

3) does not work because then language information is lost
for purposes such as glyph disambiguation and text-to-speech.
So the user is faced with the question: Do I preserve markup,
or do I preserve language information?

Seen from an I18N viewpoint, if we get to this point, we already
have lost. From our experience, we know that the users unfortunately
in most cases will just take the easy way out, even if they don't
explicitly weight the alternatives. That means choosing 1), and
thus loosing language information, the wrong thing from an i18n
perspective.

The fact that the users not only have to change the markup,
but that they have to think about how to change it (which element
to use) and that this may depend on circumstances (e.g. <div>
vs. <span> in a very simple HTML case), which may significantly
complicate the extraction logic, doesn't at all help pushing
people towards conserving language information. Bad for i18n.

There is a fourth alternative that users may take in some case,
which is to strip all the markup so that they can maybe use
some language info (or maybe not). Of course, loosing markup
is also bad for i18n.


>You suggested that your issue has to do with multiple users doing the same 
>thing differently and I asked you to refine the use case we have been 
>discussing to better illustrate your issue.
>
>I don't see how this use case illustrates a problem with markup integrity; 
>rather it assumes that problem.

Yes, to some extent, we have to assume it as a problem because
we know that others see it as a problem.


Hope this helps.     Regards,    Martin.
Received on Tuesday, 5 August 2003 09:49:12 EDT

This archive was generated by hypermail pre-2.1.9 : Wednesday, 3 September 2003 09:59:33 EDT