Re: "Pave The Cowpaths" Design Principle

On May 16, 2007, at 10:14 AM, Gervase Markham wrote:

> Maciej Stachowiak wrote:
>> I think it is impossible to quantify how much weight, but I do  
>> think there are some people who, at least before joining this  
>> group, would have said that no weight should be given to any  
>> practice that is not explicitly required by the last version of HTML.
>
> Fair enough.
>
>>> This isn't a Pave the Cowpaths thing, because (as far as I know)  
>>> software which reads the web using semantic information doesn't  
>>> currently pay attention to class names. (Or perhaps someone has  
>>> counter-examples?) So no-one is using this "cowpath".
>> The principle is about author practices as found in existing  
>> content, not about what tools currently do with that content.  
>> However, contrary to your assertion, many microformats-based tools  
>> extract semantic information from class names in web pages.
>
> But, as someone else has also pointed out, not from the ones I've  
> seen thrown around as examples of potential standardisation (such  
> as "copyright"),

True, I think the biggest problem with this one is that it's unclear  
what kind of tool would use the data. Hypothetical uses I can imagine:

1) A UA feature to check the copyright notice (using whatever the  
spec provides to detect copyright, if anything) and license (using  
rel="license") so you can look at them before copying content from  
the page.

2) A statistical analysis of which people and organizations hold  
copyright on how much of the information on the web.

Both somewhat dubious.

> and - in the case of microformats - they use a defined hierarchy,  
> not just one floating class name, so the pattern is much less  
> likely to occur accidentally.

It's true that many of the complex microformats have a root class  
name, and multiple included structural elements identified by  
class="" or rel="" values. However, there are many trivial  
microformats based solely on a single rel value, such as  
rel="nofollow". (The rel-nofollow microformat is adopted directly  
into HTML5, I believe without controversy - people don't seem to  
worry about rel as much as class.)

Also, I know of at least one microformat based on a single element  
bearing a single class name, the "geo" microformat. To be fair, it  
does place an additional constraint on the element: to be a  
conforming instance of the microformat, its text content must have  
the format of a pair of geographical coordinates.

If we wanted to give a way to mark up a page's or section's copyright  
that would catch many current instances with very low false positive  
rate, I'd propose something like the following. To detect the  
copyright notices for a section:

1) Find all elements with class="copyright" that are in the section  
but not in any contained section.
2) For each such element:
     2.a) compute the textContent and strip leading whitespace
     2.b) compare the first 9 characters of the processed content to  
the string "copyright" ignoring case.
     2.c) If there is a match, the element is a copyright notice for  
the section.
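
As a rough illustration only, the steps above could be sketched in Python roughly as follows. This is not a proposed implementation: it assumes well-formed markup parsed with xml.etree.ElementTree, treats class="copyright" as "the class attribute contains the token copyright", and uses a hypothetical SECTION_TAGS set to decide what counts as a contained section, since none of that is settled here.

```python
import xml.etree.ElementTree as ET

# Assumption: which tags open a new section is not specified in the message.
SECTION_TAGS = {"section", "article", "aside", "nav"}


def candidates(section):
    """Step 1: yield class="copyright" elements in `section`,
    skipping anything that sits inside a contained section."""
    for child in section:
        if child.tag in SECTION_TAGS:
            continue  # belongs to a contained section, not this one
        if "copyright" in child.get("class", "").split():
            yield child
        yield from candidates(child)


def copyright_notices(section):
    """Step 2: keep candidates whose text starts with "copyright"."""
    notices = []
    for el in candidates(section):
        text = "".join(el.itertext()).lstrip()  # 2.a: textContent, stripped
        if text[:9].lower() == "copyright":     # 2.b: first 9 chars, no case
            notices.append(el)                  # 2.c: it is a notice
    return notices


# Made-up sample markup, for illustration only.
doc = ET.fromstring(
    '<section>'
    '<p class="copyright">  Copyright 2007 Example Corp.</p>'
    '<p class="copyright">All rights reserved.</p>'
    '<section><p class="copyright">Copyright 2006 Someone Else</p></section>'
    '</section>'
)
found = copyright_notices(doc)
```

With the sample markup, only the first paragraph should be reported for the outer section: the second fails the text check, and the third belongs to the nested section.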

Optionally, the set of elements to be considered in step 1 could be  
restricted, though I'd make it more permissive than the current draft  
does.

(Note: this does not have an internationalization issue because by  
international treaty copyrights must begin with the word "Copyright";  
based on this we could be even more picky and require matching the  
case insensitive regexp  
"^copyright[ \t\n\r]+((\(C\)|©)[ \t\n\r]+)?[0-9]+".)
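
For what it's worth, the stricter regexp check in the note above could be tried out with a short Python sketch like the one below. The helper name looks_like_notice is made up for illustration, and stripping leading whitespace first simply mirrors step 2.a.

```python
import re

# The regexp from the note, compiled case-insensitively.
COPYRIGHT_RE = re.compile(
    r"^copyright[ \t\n\r]+((\(C\)|©)[ \t\n\r]+)?[0-9]+",
    re.IGNORECASE,
)


def looks_like_notice(text):
    """Hypothetical helper: True if the text, after stripping leading
    whitespace, starts with "Copyright", an optional (C) or ©, and a year."""
    return COPYRIGHT_RE.match(text.lstrip()) is not None


samples = [
    "Copyright © 2007 Example Corp.",
    "  copyright (c) 2007 Example Corp.",
    "Copyright 2007",
    "Copyright Example Corp. 2007",
    "Copyrighted material",
]
results = [looks_like_notice(s) for s in samples]
```

The first three samples match. The last two do not, because the word "copyright" must be followed by whitespace and then either the optional symbol or the digits.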

This might get you an element that has copyright info followed by  
other information, but that doesn't seem like a huge problem for  
intended consumers.

However, I would want to know what the anticipated consumers are to  
decide if something like this would be worthwhile.

Regards,
Maciej

Received on Wednesday, 16 May 2007 21:57:19 UTC