RE: Action Item: WCL URI matching

 

Hi Johannes,

> > - LESS THAN A PAGE
> > 
> > A snippet of code:
> > 
> > e.g. A claim about the following snippet
> > 
> > ...
> > <h3>Snippet section</h3>
> > <p class="foo-paragraph">Some dummy text and an <img
> > class="beauty-image" alt="Beautiful image" /></p> ...
> > 
> > which can be found at http://www.example.org/foo.xhtml
> > 
> > --> Apparently this is NOT COVERED by any of the WCL URI matching
> > requirements, but as discussed before within the group, it may not be
> > necessary since we have a snippet pointer.
> 
> For XML resources this could be done using xpointer in the 
> URI fragment.


Unfortunately, the model must also be applicable to HTML and possibly
other non-XML resources, where XPointer is not available.


> > - GROUPS OF PAGES
> > 
> >   * A domain (All the resources within the specified domain)
> > 
> > e.g. A claim about the http://www.helloworld.net/ domain
> 
> What is the domain here? www.helloworld.net? helloworld.net? net?


If I make a claim about the www.helloworld.net domain, that is the domain.
If I make a claim about the helloworld.net domain, that is the domain.
If I make a claim about the .net domain, that is the domain.
In other words, the domain is whichever name the claim states explicitly.
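
To illustrate why the claim has to name the domain explicitly, here is a
minimal sketch in Python. The function name candidate_domains is mine and
is not part of any WCL or RDF-CL vocabulary; it only lists every domain
name that could be meant when a claim is attached to a URI.

```python
from urllib.parse import urlsplit

def candidate_domains(uri):
    """Return every domain that contains the URI's host, most specific first."""
    host = urlsplit(uri).hostname or ""
    labels = host.rstrip(".").lower().split(".") if host else []
    # Drop one leading label at a time: a.b.c -> a.b.c, b.c, c
    return [".".join(labels[i:]) for i in range(len(labels))]

print(candidate_domains("http://www.helloworld.net/"))
# ['www.helloworld.net', 'helloworld.net', 'net']
```

Any of the three names is a legitimate subject for a claim, so the URI
alone cannot tell us which one was intended.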

 
> > --> It is COVERED by 4 [Match a (sub-)domain and only those sub-domain
> > --> patterns given in a list.]
> > 
> > 
> >   * A group of subdomains (All the resources within the specified 
> > domain and all related subdomains)
> > 
> > e.g. A claim about the http://www.helloworldagain.net/ domain and all
> > its subdomains (e.g. http://www2.helloworldagain.net/ and
> > http://www3.helloworldagain.net/)
> 
> www2.helloworldagain.net and www3.helloworldagain.net are not 
> subdomains of www.helloworldagain.net. However they are 
> subdomains of helloworldagain.net.


Yes, my fault.

Let's say then: a claim about the helloworldagain.net domain and all its
subdomains (e.g. www.helloworldagain.net, www2.helloworldagain.net and
www3.helloworldagain.net).
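
As a rough sketch of what "a domain and all its subdomains" means in
practice (Python, with a hypothetical helper name in_domain; this is my
own illustration, not something defined by the XG), the comparison has to
be made on whole labels rather than on raw string suffixes:

```python
def in_domain(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    host = host.lower().rstrip(".")
    domain = domain.lower().strip(".")
    # Require a dot before the suffix so that "nothelloworldagain.net"
    # is not mistaken for a subdomain of "helloworldagain.net".
    return host == domain or host.endswith("." + domain)

# www2 is a subdomain of helloworldagain.net ...
print(in_domain("www2.helloworldagain.net", "helloworldagain.net"))      # True
# ... but not of www.helloworldagain.net, as you pointed out.
print(in_domain("www2.helloworldagain.net", "www.helloworldagain.net"))  # False
```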

 
> > --> It is COVERED by 3 [Match a (sub-)domain and all sub-domains, 
> > --> except for those sub-domain patterns given by a list.]
> > 
> > A potential issue at this point is that the XG has decided [4] to adopt
> > RDF-CL [5], in which subdomains of a given host are always in scope [6],
> > but as noted in the group minutes, they will carry out whatever changes
> > are needed to make RDF-CL meet their requirements.
> 
> You mean, this is an issue with exclusions?


I mean that this could be an issue with sub-domain exclusions: if
subdomains are always in scope, there is no way to leave one out.
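
To make the exclusion case concrete, here is a minimal sketch in Python.
The name in_scope and the use of shell-style wildcards for the "sub-domain
patterns" are my assumptions; neither the WCL requirements nor RDF-CL
prescribe this pattern syntax.

```python
from fnmatch import fnmatchcase

def in_scope(host, domain, excluded_patterns=()):
    """Match a domain and all its subdomains, except the excluded patterns."""
    host = host.lower().rstrip(".")
    domain = domain.lower().strip(".")
    # First check membership in the domain, on whole labels.
    if host != domain and not host.endswith("." + domain):
        return False
    # Then drop any host that matches an exclusion pattern (assumed syntax).
    return not any(fnmatchcase(host, p.lower()) for p in excluded_patterns)

excluded = ["www3.*"]
print(in_scope("www2.helloworldagain.net", "helloworldagain.net", excluded))  # True
print(in_scope("www3.helloworldagain.net", "helloworldagain.net", excluded))  # False
```

With an empty exclusion list this behaves like the "subdomains always in
scope" rule, which is why that rule alone cannot express requirement 3.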


> > Additionally, the XG has requirements on scheme, port, query and
> > fragment patterns, but as CarlosV noted in the past, there are other
> > options frequently used by crawler tools (e.g. path depth limits) that
> > are not covered by the current requirements.
> 
> AFAIR, there should be a way to compress statements in an 
> EARL report, so that not every resource/web unit has to be 
> listed explicitly. This compression is most likely not 
> lossless. I very much doubt that we can create a lossless 
> compression. There are too many parameters. And what would be 
> the benefit? If we wanted to know whether a specific resource 
> is part of the subject of the compressed statement, we would 
> have to run a crawler with all the specified parameters first? Hmm


AFAIR, this has to do with semantics (logical groups of resources) and
not only with compression.


Regards,
CI.

Received on Wednesday, 19 July 2006 11:20:26 UTC