Re: RDF Use Case: scraping metadata from the web

From: Patrick Stickler <patrick.stickler@nokia.com>
Date: Tue, 5 Aug 2003 17:24:03 +0300
Message-ID: <000801c35b5d$3861de50$f89216ac@NOE.Nokia.com>
To: "Brian McBride" <bwm@hplb.hpl.hp.com>, "ext Martin Duerst" <duerst@w3.org>
Cc: "rdf core" <w3c-rdfcore-wg@w3.org>, "i18n" <w3c-i18n-ig@w3.org>
[sorry, Brian, for jumping in here, but...]


I appreciate the position you present in the post below, but
I must stress the point that the problem you present is a
general problem relating to working with XML fragments, no
matter what the context, and *not* a problem with RDF, nor
a problem for RDF to fix.

By saying this, I do not mean to suggest that the problem is
not important to solve. It is. But not by RDF, and while we
have bent over backwards to try to figure out some way to
lessen the problem insofar as RDF is concerned, we have not
come up with any solution that, all things considered, is
better than what is now on the table and reflected in the
latest editors' drafts.

Anytime an XML user wishes to deal with anything smaller than
a complete XML instance, they will encounter these sorts of
issues. RDF is not creating this problem.
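
To illustrate the kind of issue that arises with fragments in general,
here is a minimal Python sketch. The sample document and the helper
lang_in_scope are hypothetical and not taken from any spec. It shows
an element that inherits xml:lang from an ancestor losing that language
context as soon as it is serialized on its own, with no RDF involved.

```python
import xml.etree.ElementTree as ET

# Expanded name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical source document: the language is declared on the root only.
DOC = '<doc xml:lang="fr"><title>Le <em>chat</em> noir</title></doc>'


def lang_in_scope(root, target):
    """Return the xml:lang value that applies to target, or None."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        lang = node.get(XML_LANG)
        if lang is not None:
            return lang
        node = parents.get(node)
    return None


root = ET.fromstring(DOC)
title = root.find("title")

# Serializing the element alone yields a fragment without its ancestors,
# so the inherited language declaration is not carried along.
fragment = ET.tostring(title, encoding="unicode")
reparsed = ET.fromstring(fragment)

print(lang_in_scope(root, title))         # language inside the full document
print(fragment)                           # the extracted fragment
print(lang_in_scope(reparsed, reparsed))  # language after extraction
```

Inside the full document the title is in French; the extracted fragment
carries no language at all, whatever language later encapsulates it.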

If RDF were to provide one solution, then that would simply
be inconsistent with another solution provided for some other
context. You seem to be big on having consistent treatment,
so it puzzles me that you would seek so specialized a solution
by RDF specifically.

You appear to be asking us to make RDF inferior for SW purposes
in order to address this problem, just a little bit, insofar as 
RDF alone is concerned, for the sake of some indeterminable
number of XML users.

Let's not try to treat the symptoms rather than find a cure.

Let the XML folks tell XML users how to deal with this 
problem in a *general* way when dealing with XML fragments 
regardless of the language of encapsulation.

E.g., have someone dust off the XML Fragment Interchange [1]
spec, make sure it does the right things, and then tell 
folks to use it *everywhere* they deal with XML fragments, including
with RDF.

RDF is not going to be able to solve this general XML problem.
Certainly not at this point, given the fact that we should have 
been finished with all this stuff well over a *year* ago!

Can we *please* stop spinning our wheels on this and move on?

Thank you.



[1] http://www.w3.org/TR/xml-fragment

----- Original Message ----- 
  From: ext Martin Duerst 
  To: Brian McBride 
  Cc: rdf core ; i18n 
  Sent: 05 August, 2003 16:34
  Subject: Re: RDF Use Case: scraping metadata from the web

  Hello Brian,

  At 15:16 03/08/04 +0100, Brian McBride wrote:

  >I'm still at the point of looking for a use case to demonstrate that 
  >markup integrity is a real problem.

  For some people, it is important. For others, it may not be important.
  For the RDF Core WG, the graph is obviously very important, and the
  triples. If somebody created a new language to serialize RDF, and
  this new language would mess up graphs, I guess you would not be
  happy. If this currently happened with RDF/XML, or if some XML
  group changed XML so that it could happen, I guess you would not
  be happy.

  So I guess you should be able to understand that other people will
  not be happy at all if their markup is arbitrarily changed. It's
  not necessarily the people in the I18N WG who are most concerned
  with markup integrity (although I think we actually are). But
  assume some third party wants to use RDF to scrape metadata
  from XML documents, and this third party is concerned about
  markup integrity, either because s/he is just convinced that
  markup is crucial, or because of concerns for various round
  trip scenarios.

  After all, any user can scrape plain text literals (with
  language information), put them into RDF, and get them back
  unchanged. Do you (the RDF Core WG) or we (the I18N WG) have
  a detailed use case for this? Or do we all just agree, even
  without ever really talking about it, that it would be a very
  bad idea if plain literals suddenly got changed, e.g. if RDF
  suddenly upper-cased all plain literals?

  So let's assume that people in the XML community are concerned
  in a similar way about markup integrity as people in the RDF
  community are concerned about triple and graph integrity.

  So a person who is concerned about markup integrity does some
  scraping or something similar. They are faced with the following
  choices:
  1) Preserve the markup, ignore the language information
  2) Change the markup, squeeze in an additional element to
      attach language information.
  3) Put the language information somewhere else

  3) does not work because then language information is lost
  for purposes such as glyph disambiguation and text-to-speech.
  So the user is faced with the question: Do I preserve markup,
  or do I preserve language information?
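
  The three choices listed above can be sketched in Python as follows.
  This is only an illustration: the function names, the dictionary
  layout, the "lang_property" key, and the default wrapper element are
  all hypothetical and are not defined by RDF or any other spec.

```python
from xml.sax.saxutils import quoteattr


def preserve_markup(fragment: str, lang: str) -> dict:
    """First choice: keep the markup untouched and ignore the language."""
    return {"literal": fragment}


def wrap_with_lang(fragment: str, lang: str, wrapper: str = "span") -> dict:
    """Second choice: add an extra element that carries xml:lang.

    The scraper has to pick the wrapper element itself, and the stored
    markup is no longer identical to what was scraped.
    """
    wrapped = f"<{wrapper} xml:lang={quoteattr(lang)}>{fragment}</{wrapper}>"
    return {"literal": wrapped}


def lang_elsewhere(fragment: str, lang: str) -> dict:
    """Third choice: keep the markup and record the language separately."""
    return {"literal": fragment, "lang_property": lang}


scraped = "Le <em>chat</em> noir"  # hypothetical scraped markup
for choice in (preserve_markup, wrap_with_lang, lang_elsewhere):
    print(choice.__name__, choice(scraped, "fr"))
```

  Only the second choice keeps the language attached to the literal
  itself, and it does so by changing the markup.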

  Seen from an I18N viewpoint, if we get to this point, we already
  have lost. From our experience, we know that the users unfortunately
  in most cases will just take the easy way out, even if they don't
  explicitly weigh the alternatives. That means choosing 1), and
  thus losing language information, the wrong thing from an i18n
  viewpoint.

  The fact that the users not only have to change the markup,
  but that they have to think about how to change it (which element
  to use) and that this may depend on circumstances (e.g. <div>
  vs. <span> in a very simple HTML case), which may significantly
  complicate the extraction logic, doesn't at all help pushing
  people towards conserving language information. Bad for i18n.

  There is a fourth alternative that users may take in some cases,
  which is to strip all the markup so that they can maybe use
  some language info (or maybe not). Of course, losing markup
  is also bad for i18n.

  >You suggested that your issue has to do with multiple users doing the same 
  >thing differently and I asked you to refine the use case we have been 
  >discussing to better illustrate your issue.
  >I don't see how this use case illustrates a problem with markup integrity; 
  >rather it assumes that problem.

  Yes, to some extent, we have to assume it as a problem because
  we know that others see it as a problem.

  Hope this helps.     Regards,    Martin.
Received on Tuesday, 5 August 2003 10:24:06 UTC
