Action Item: WCL URI matching

Hi everybody,

In response to the action item recorded at [1], here is a comparison between the potential EARL and WCL requirements on URI matching. Your feedback is welcome.

The WCL URI matching requirements [2] relate to those registered at [3] as follows:

- A SINGLE PAGE

Just a single page:

e.g. A claim about http://www.example.org/index.html, or
http://www.anotherexample.org/foo.htm

--> It is COVERED by 7 [Match the (sub-)path specified and only those sub-paths patterns given in a list.]


- LESS THAN A PAGE

A snippet of code:

e.g. A claim about the following snippet

...
<h3>Snippet section</h3>
<p class="foo-paragraph">Some dummy test and an <img class="beauty-image" alt="Beautiful image" /></p>
...

which can be found at http://www.example.org/foo.xhtml

--> Apparently this is NOT COVERED by any of the WCL URI matching requirements but, as discussed before within the group, it may not be necessary since we have a snippet pointer.


- GROUPS OF PAGES

  * A domain (All the resources within the specified domain)

e.g. A claim about the http://www.helloworld.net/ domain

--> It is COVERED by 4 [Match a (sub-)domain and only those sub-domain patterns given in a list.]


  * A group of subdomains (All the resources within the specified domain
and all related subdomains)

e.g. A claim about the http://www.helloworldagain.net/ domain and all its subdomains (e.g. http://www2.helloworldagain.net/ and http://www3.helloworldagain.net/)

--> It is COVERED by 3 [Match a (sub-)domain and all sub-domains, except for those sub-domain patterns given by a list.]

A potential issue at this point is that the XG has decided [4] to adopt RDF-CL [5], in which subdomains of a given host are always in scope [6]. However, as noted in the group minutes, they will carry out whatever changes are needed to make RDF-CL meet their requirements.
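
The two host-based cases above (a single domain, and a domain plus its subdomains with an optional exclusion list) can be sketched as follows. This is only an illustration in Python, not WCL or RDF-CL syntax. The function name `host_in_scope` and its parameters are hypothetical, and it assumes the group is anchored on the registrable name (helloworldagain.net) so that www2 and www3 count as subdomains.

```python
from urllib.parse import urlsplit

def host_in_scope(uri, domain, include_subdomains=True, excluded_hosts=()):
    """Hypothetical helper: is the host of `uri` inside `domain`?

    include_subdomains=False approximates "only this host";
    excluded_hosts approximates the exception list of requirement 3.
    """
    host = (urlsplit(uri).hostname or "").lower()
    domain = domain.lower()
    if host in {h.lower() for h in excluded_hosts}:
        return False
    if host == domain:
        return True
    # The leading dot keeps "nothelloworldagain.net" out of scope.
    return include_subdomains and host.endswith("." + domain)
```

With `include_subdomains=True` and no exclusions the behaviour resembles the RDF-CL default described in [6], as far as this sketch goes.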

  * A directory (All the resources within the specified directory)

e.g. A claim about http://www.byebyeworld.net/internal/*.*

--> It is COVERED by 6 [Match the (sub-)path specified and all sub-paths, except those sub-path patterns given in a list.]
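
A minimal sketch of the directory case, assuming that "directory" means a leftmost match on the path component. The helper `path_in_scope` and the "/internal/drafts/" exception used in the comment are hypothetical and are not taken from the WCL draft.

```python
from urllib.parse import urlsplit

def path_in_scope(uri, host, directory, excluded_prefixes=()):
    """Hypothetical helper: is `uri` on `host` and below `directory`?

    excluded_prefixes lists sub-paths to leave out, e.g. "/internal/drafts/".
    """
    parts = urlsplit(uri)
    if (parts.hostname or "").lower() != host.lower():
        return False
    path = parts.path or "/"
    # Force a trailing slash so "/internal" does not match "/internally/...".
    prefix = directory if directory.endswith("/") else directory + "/"
    if not path.startswith(prefix):
        return False
    return not any(path.startswith(p) for p in excluded_prefixes)
```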


  * Self-defined (User-defined groups of pages)

e.g. all the URLs that match a specific regular expression

--> It is not clear whether this requirement is covered or not. On one hand, the Non-Requirements section explicitly says that "It is not required that patterns allow matching to any conceivable pattern. For example it is not required to be able to match all URIs where the character "b" is the second character of the host component." On the other hand, the Outline design section says "The matching a pattern consists of an optionally case sensitive leftmost, rightmost or exact [or regex ???] match of the normalised text of the relevant component from the candidate URI."
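
To make the difference concrete, here is a rough Python sketch of the four match modes mentioned in the Outline design quote. The function `component_matches` is hypothetical, and the "regex" mode is an assumption, since the draft itself marks it with "???".

```python
import re

def component_matches(text, pattern, mode="exact", case_sensitive=False):
    """Hypothetical matcher for one already-normalised URI component."""
    if mode == "regex":
        flags = 0 if case_sensitive else re.IGNORECASE
        return re.search(pattern, text, flags) is not None
    if not case_sensitive:
        text, pattern = text.lower(), pattern.lower()
    if mode == "exact":
        return text == pattern
    if mode == "leftmost":
        return text.startswith(pattern)
    if mode == "rightmost":
        return text.endswith(pattern)
    raise ValueError("unknown mode: " + mode)
```

In this sketch the Non-Requirements example ("b" as the second character of the host) needs the regex mode, e.g. component_matches(host, r"^.b", "regex"). None of the leftmost, rightmost or exact modes can express it with a single pattern.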


- EXCLUSIONS OR EXCEPTIONS IN A PREVIOUSLY DEFINED GROUP

e.g. A claim about the http://www.anotherone.com/ domain with the exception of http://www.anotherone.com/private/*.*


- And, of course, WHATEVER COMBINATION OF ALL THE ABOVE.

e.g. A claim about http://www.anotherexample.org/directory/*.*, and http://www.thelastone.com/ with the exception of http://www.anotherexample.org/directory/one.html and http://www.thelastone.com/two.html

Apparently this requirement is COVERED since they state "It MUST be possible to build patterns that meet the following requirements and to build patterns out of arbitrary combinations of those patterns, except where this would result in a pattern that is not self consistent." and the 15th requirement says "All patterns must be available in negative as well as affirmative forms." which allows the use of exclusions.
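
The combination example above can be sketched with small composable predicates, where the negative form is simply "and not". All the helper names below (`on_host`, `in_directory`, `exactly`, `any_of`, `but_not`) are hypothetical and only illustrate the idea. They are not part of WCL or RDF-CL, and `exactly` compares strings literally without any URI normalisation.

```python
from urllib.parse import urlsplit

def on_host(host):
    """Predicate: the URI's host equals `host` (case-insensitive)."""
    return lambda uri: (urlsplit(uri).hostname or "").lower() == host.lower()

def in_directory(host, directory):
    """Predicate: the URI is on `host` and its path starts with `directory`."""
    def test(uri):
        parts = urlsplit(uri)
        return ((parts.hostname or "").lower() == host.lower()
                and parts.path.startswith(directory))
    return test

def exactly(*uris):
    """Predicate: the URI is one of the listed strings (no normalisation)."""
    listed = set(uris)
    return lambda uri: uri in listed

def any_of(*predicates):
    """Affirmative combination: at least one predicate matches."""
    return lambda uri: any(p(uri) for p in predicates)

def but_not(included, excluded):
    """Negative form: matches `included` unless `excluded` also matches."""
    return lambda uri: included(uri) and not excluded(uri)

# The combined claim scope from the example above.
scope = but_not(
    any_of(in_directory("www.anotherexample.org", "/directory/"),
           on_host("www.thelastone.com")),
    exactly("http://www.anotherexample.org/directory/one.html",
            "http://www.thelastone.com/two.html"),
)
```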


Additionally, the XG has requirements on scheme, port, query and fragment patterns but, as CarlosV noted in the past, there are other options frequently used by crawler tools (e.g. path depth limits) that are not covered by the current requirements.
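
As an illustration of the kind of option meant here, the following Python sketch applies a path depth limit. It assumes one possible reading of "depth", namely the number of non-empty path segments. Crawler tools may define depth differently (for instance as link hops from a start page), and the names `path_depth` and `within_depth` are hypothetical.

```python
from urllib.parse import urlsplit

def path_depth(uri):
    """Count the non-empty segments in the path component of `uri`."""
    path = urlsplit(uri).path
    return len([segment for segment in path.split("/") if segment])

def within_depth(uri, max_depth):
    """True when the path of `uri` is at most `max_depth` segments deep."""
    return path_depth(uri) <= max_depth
```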


[1] - [http://www.w3.org/2006/07/05-er-minutes.html#action04]
[2] - [http://www.w3.org/2005/Incubator/wcl/matching.html]
[3] - [http://lists.w3.org/Archives/Public/public-wai-ert/2006Apr/0010]
[4] - [http://www.w3.org/2005/Incubator/wcl/#Minutes]
[5] - [http://www.w3.org/2004/12/q/doc/content-labels-schema.htm]
[6] - [http://www.w3.org/2004/12/q/doc/content-labels-schema.htm#hostRestriction]


Regards,

CI.

 
--------------------------------------

Carlos Iglesias

CTIC Foundation
Science and Technology Park of Gijón
33203 - Gijón, Asturias, Spain 

phone: +34 984291212
fax: +34 984390612
email: carlos.iglesias@fundacionctic.org
URL: http://www.fundacionctic.org

Received on Tuesday, 18 July 2006 16:10:12 UTC