- From: Maciej Stachowiak <mjs@apple.com>
- Date: Wed, 16 May 2007 13:28:01 -0700
- To: Gervase Markham <gerv@mozilla.org>
- Cc: public-html@w3.org, www-html@w3.org
On May 16, 2007, at 10:14 AM, Gervase Markham wrote:
> Maciej Stachowiak wrote:
>> I think it is impossible to quantify how much weight, but I do
>> think there are some people who, at least before joining this
>> group, would have said that no weight should be given to any
>> practice that is not explicitly required by the last version of HTML.
>
> Fair enough.
>
>>> This isn't a Pave the Cowpaths thing, because (as far as I know)
>>> software which reads the web using semantic information doesn't
>>> currently pay attention to class names. (Or perhaps someone has
>>> counter-examples?) So no-one is using this "cowpath".
>> The principle is about author practices as found in existing
>> content, not about what tools currently do with that content.
>> However, contrary to your assertion, many microformats-based tools
>> extract semantic information from class names in web pages.
>
> But, as someone else has also pointed out, not from the ones I've
> seen thrown around as examples of potential standardisation (such
> as "copyright"),
True. I think the biggest problem with this one is that it's unclear
what kind of tool would use the data. Hypothetical uses I can imagine:
1) A UA feature to check the copyright notice (using whatever the
spec provides to detect copyright, if anything) and license (using
rel="license") so you can look at them before copying content from
the page.
2) A statistical analysis of which people and organizations hold
copyright on how much of the information on the web.
Both somewhat dubious.
> and - in the case of microformats - they use a defined hierarchy,
> not just one floating class name, so the pattern is much less
> likely to occur accidentally.
It's true that many of the complex microformats have a root class
name, and multiple included structural elements identified by
class="" or rel="" values. However, there are many trivial
microformats based solely on a single rel value, such as
rel="nofollow". (The rel-nofollow microformat is adopted directly
into HTML5, I believe without controversy - people don't seem to
worry about rel as much as class.)
Also, I know of at least one microformat based on a single element
bearing a single class name, the "geo" microformat. To be fair, a
conforming instance must also satisfy a constraint on the element's
text content: it must have the format of a pair of geographical
coordinates.
If we wanted to give a way to mark up a page's or section's copyright
that would catch many current instances with very low false positive
rate, I'd propose something like the following. To detect the
copyright notices for a section:
1) Find all elements with class="copyright" that are in the section
but not in any contained section.
2) For each such element:
2.a) Compute the textContent and strip leading whitespace.
2.b) Compare the first 9 characters of the processed content to
the string "copyright", ignoring case.
2.c) If there is a match, the element is a copyright notice for
the section.
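As a minimal sketch of the steps above (not taken from any spec), the
detection could look like this in Python. It assumes a section is an
element literally named "section", that class="copyright" means the
class attribute contains the token "copyright", and that the markup is
well-formed enough for an XML parser. The function name
copyright_notices and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET


def copyright_notices(section):
    """Return the elements in `section` that pass steps 1 and 2."""
    found = []

    def walk(node):
        for child in node:
            # Step 1: do not look inside contained sections
            # (assumption: a section is a <section> element).
            if child.tag == "section":
                continue
            classes = child.get("class", "").split()
            if "copyright" in classes:
                # Step 2.a: textContent with leading whitespace stripped.
                text = "".join(child.itertext()).lstrip()
                # Step 2.b: case-insensitive compare of the first 9 characters.
                if text[:9].lower() == "copyright":
                    # Step 2.c: this element is a notice for the section.
                    found.append(child)
            walk(child)

    walk(section)
    return found


# Hypothetical sample markup, for illustration only.
doc = ET.fromstring(
    "<section>"
    '<p class="copyright"> Copyright 2007 Example Corp.</p>'
    '<p class="copyright">All rights reserved.</p>'
    '<section><p class="copyright">Copyright 2006 Someone Else</p></section>'
    "</section>"
)
notices = copyright_notices(doc)
```

Here only the first paragraph should qualify: the second does not start
with "copyright", and the third belongs to a contained section.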
Optionally, the set of elements to be considered in step 1 could be
restricted, though I'd make it more permissive than the current draft
does.
(Note: this does not have an internationalization issue because by
international treaty copyrights must begin with the word "Copyright";
based on this we could be even more picky and require matching the
case-insensitive regexp
"^copyright[ \t\n\r]+((\(C\)|©)[ \t\n\r]+)?[0-9]+".)
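For illustration, the stricter check could be sketched in Python as
follows. The pattern is the one quoted above. Applying it after
stripping leading whitespace (as in step 2.a) is an assumption, and
the names NOTICE_RE and looks_like_notice are hypothetical.

```python
import re

# The regexp from the note above, compiled case-insensitively.
NOTICE_RE = re.compile(
    r"^copyright[ \t\n\r]+((\(C\)|©)[ \t\n\r]+)?[0-9]+",
    re.IGNORECASE,
)


def looks_like_notice(text_content):
    """True if the text starts like 'Copyright [(C)|©] <digits>'."""
    # Assumption: leading whitespace is stripped first, as in step 2.a.
    return NOTICE_RE.match(text_content.lstrip()) is not None


samples = [
    "Copyright (c) 2007 Example Corp.",
    "Copyright © 2007 Example Corp.",
    "copyright 2007",
    "Copyright Example Corp. 2007",
]
results = [looks_like_notice(s) for s in samples]
```

Note that the last sample is rejected: the stricter pattern requires
the year (or at least a run of digits) to follow the word "Copyright"
and the optional symbol directly, so it would miss notices that put
the holder's name first.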
This might get you an element that has copyright info followed by
other information, but that doesn't seem like a huge problem for
intended consumers.
However, I would want to know what the anticipated consumers are to
decide if something like this would be worthwhile.
Regards,
Maciej
Received on Wednesday, 16 May 2007 21:56:57 UTC