Re: Action Item: WCL URI matching

Carlos Iglesias wrote:

> - LESS THAN A PAGE
> 
> A snippet of code:
> 
> e.g. A claim about the following snippet
> 
> ...
> <h3>Snippet section</h3>
> <p class="foo-paragraph">Some dummy test and an <img class="beauty-image" alt="Beautiful image" /></p>
> ...
> 
> which can be found at http://www.example.org/foo.xhtml
> 
> --> Apparently this is NOT COVERED in any of the WCL URI matching requirements, but as discussed before within the group, it may not be necessary since we have a snippet pointer.

For XML resources, this could be done with an XPointer in the URI fragment.
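
As a rough illustration only: the sketch below builds such a URI for the quoted snippet and resolves the pointer against a local copy of the markup. The helper names (make_pointer_uri, split_pointer_uri) and the wrapping <div> are made up for this example. It also assumes the xpointer() scheme, whose expression is evaluated here with the XPath subset in Python's ElementTree, which is why the path is written as ".//p[...]" rather than "//p[...]". Real XHTML is namespaced, so a real pointer would probably need an xmlns() part as well.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, unquote, urldefrag

PAGE_URI = "http://www.example.org/foo.xhtml"
# Relative form, because ElementTree rejects absolute paths on elements.
XPATH = ".//p[@class='foo-paragraph']"

# Local, well-formed stand-in for the snippet (the <div> wrapper is assumed).
SNIPPET = """<div>
<h3>Snippet section</h3>
<p class="foo-paragraph">Some dummy test and an
<img class="beauty-image" alt="Beautiful image" /></p>
</div>"""


def make_pointer_uri(page_uri, xpath):
    """Put an xpointer() expression into the fragment of page_uri."""
    # Percent-encode characters such as [ ] and ' that are not allowed
    # unescaped in a URI fragment.
    return f"{page_uri}#xpointer({quote(xpath, safe='/@=')})"


def split_pointer_uri(uri):
    """Return (page URI, decoded expression) for an xpointer() fragment."""
    page, fragment = urldefrag(uri)
    prefix = "xpointer("
    if not (fragment.startswith(prefix) and fragment.endswith(")")):
        raise ValueError("fragment is not an xpointer() pointer")
    return page, unquote(fragment[len(prefix):-1])


if __name__ == "__main__":
    uri = make_pointer_uri(PAGE_URI, XPATH)
    page, xpath = split_pointer_uri(uri)
    for element in ET.fromstring(SNIPPET).findall(xpath):
        print(page, element.tag, element.attrib)
```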

> - GROUPS OF PAGES
> 
>   * A domain (All the resources within the specified domain)
> 
> e.g. A claim about the http://www.helloworld.net/ domain

What is the domain here? www.helloworld.net? helloworld.net? net?

> --> It is COVERED by 4 [Match a (sub-)domain and only those sub-domain patterns given in a list.]
> 
> 
>   * A group of subdomains (All the resources within the specified domain
> and all related subdomains)
> 
> e.g. A claim about http://www.helloworldagain.net/ domain and all its subdomains (e.g. http://www2.helloworldagain.net/ and http://www3.helloworldagain.net/)

www2.helloworldagain.net and www3.helloworldagain.net are not subdomains 
of www.helloworldagain.net. However they are subdomains of 
helloworldagain.net.
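
The distinction can be checked mechanically by comparing DNS labels from the right. This is a minimal sketch; the function names are mine and it treats host names as plain dot-separated labels, nothing more.

```python
def labels(host):
    """Split a host name into lower-case DNS labels."""
    return host.lower().rstrip(".").split(".")


def is_subdomain(host, parent):
    """True if host is a proper subdomain of parent.

    host must have more labels than parent, and its rightmost labels
    must be exactly the labels of parent.
    """
    h, p = labels(host), labels(parent)
    return len(h) > len(p) and h[-len(p):] == p


if __name__ == "__main__":
    for parent in ("www.helloworldagain.net", "helloworldagain.net"):
        for host in ("www2.helloworldagain.net", "www3.helloworldagain.net"):
            print(host, "is a subdomain of", parent, "->",
                  is_subdomain(host, parent))
```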

> --> It is COVERED by 3 [Match a (sub-)domain and all sub-domains, except for those sub-domain patterns given by a list.]
> 
> A potential issue at this point is that the XG has decided [4] to adopt RDF-CL [5], in which subdomains of a given host are always in scope [6], but as noted in the group minutes, they will make whatever changes are needed for RDF-CL to meet their requirements.

Do you mean this is an issue with exclusions?
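
To make the question concrete, here is how I read requirement 3 as quoted above, as a small sketch. The function names are made up, and the exclusion "patterns" are simplified to plain domain names, which is my assumption and not something the requirement says. With an empty exclusion list, every subdomain is in scope, which is the behaviour described for RDF-CL in the quoted text.

```python
from urllib.parse import urlsplit


def host_labels(host):
    """Split a host name into lower-case DNS labels."""
    return host.lower().rstrip(".").split(".")


def within(host, domain):
    """True if host equals domain or is one of its subdomains."""
    h, d = host_labels(host), host_labels(domain)
    return len(h) >= len(d) and h[-len(d):] == d


def in_scope(uri, domain, excluded=()):
    """Requirement 3, simplified: domain and all subdomains, minus a list.

    excluded holds plain domain names here (an assumption); each one
    removes itself and everything below it from the scope.
    """
    host = urlsplit(uri).hostname or ""
    if not within(host, domain):
        return False
    return not any(within(host, ex) for ex in excluded)


if __name__ == "__main__":
    skip = ("www3.helloworldagain.net",)
    for uri in ("http://www2.helloworldagain.net/",
                "http://www3.helloworldagain.net/",
                "http://www.example.org/foo.xhtml"):
        print(uri, in_scope(uri, "helloworldagain.net", skip))
```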

> Additionally, the XG has requirements on scheme, port, query and fragment patterns, but as CarlosV noted in the past, there are other options frequently used by crawler tools (e.g. path depth limits) that are not covered by the current requirements.

AFAIR, there should be a way to compress statements in an EARL report, 
so that not every resource/web unit has to be listed explicitly. Such 
compression is most likely lossy. I very much doubt that we can make it 
lossless, because there are too many parameters. And what would be the 
benefit? To find out whether a specific resource is part of the subject 
of a compressed statement, would we first have to run a crawler with 
all the specified parameters? Hmm
-- 
Johannes Koch - Competence Center BIKA
Fraunhofer Institute for Applied Information Technology (FIT.LIFE)
Schloss Birlinghoven, D-53757 Sankt Augustin, Germany
Phone: +49-2241-142628

Received on Wednesday, 19 July 2006 08:36:35 UTC