- From: Maciej Stachowiak <mjs@apple.com>
- Date: Wed, 16 May 2007 13:28:01 -0700
- To: Gervase Markham <gerv@mozilla.org>
- Cc: public-html@w3.org, www-html@w3.org
On May 16, 2007, at 10:14 AM, Gervase Markham wrote:

> Maciej Stachowiak wrote:
>> I think it is impossible to quantify how much weight, but I do
>> think there are some people who, at least before joining this
>> group, would have said that no weight should be given to any
>> practice that is not explicitly required by the last version of HTML.
>
> Fair enough.
>
>>> This isn't a Pave the Cowpaths thing, because (as far as I know)
>>> software which reads the web using semantic information doesn't
>>> currently pay attention to class names. (Or perhaps someone has
>>> counter-examples?) So no-one is using this "cowpath".
>> The principle is about author practices as found in existing
>> content, not about what tools currently do with that content.
>> However, contrary to your assertion, many microformats-based tools
>> extract semantic information from class names in web pages.
>
> But, as someone else has also pointed out, not from the ones I've
> seen thrown around as examples of potential standardisation (such
> as "copyright"),

True, I think the biggest problem with this one is that it's unclear
what kind of tool would use the data. Hypothetical uses I can imagine:

1) A UA feature to check the copyright notice (using whatever the spec
provides to detect copyright, if anything) and license (using
rel="license") so you can look at them before copying content from the
page.

2) A statistical analysis of what people and organizations hold
copyright on how much information on the web.

Both somewhat dubious.

> and - in the case of microformats - they use a defined hierarchy,
> not just one floating class name, so the pattern is much less
> likely to occur accidentally.

It's true that many of the complex microformats have a root class
name, and multiple included structural elements identified by class=""
or rel="" values. However, there are many trivial microformats based
solely on a single rel value, such as rel="nofollow".
(The rel-nofollow microformat is adopted directly into HTML5, I believe
without controversy - people don't seem to worry about rel as much as
class.)

Also, I know of at least one microformat based on a single element
bearing a single class name, the "geo" microformat. To be fair, it does
have an additional constraint on the text content of the element to be
a conforming instance of the microformat, namely that it must have the
format of a pair of geographical coordinates.

If we wanted to give a way to mark up a page's or section's copyright
that would catch many current instances with very low false positive
rate, I'd propose something like the following.

To detect the copyright notices for a section:

1) Find all elements with class="copyright" that are in the section
but not in any contained section.
2) For each such element:
2.a) compute the textContent and strip leading whitespace
2.b) compare the first 9 characters of the processed content to the
string "copyright" ignoring case.
2.c) If there is a match, the element is a copyright notice for the
section.

Optionally, the set of elements to be considered in step 1 could be
restricted, though I'd make it more permissive than the current draft
does.

(Note: this does not have an internationalization issue because by
international treaty copyrights must begin with the word "Copyright";
based on this we could be even more picky and require matching the
case insensitive regexp "^copyright[ \t\n\r]+((\(C\)|©)[ \t\n\r]+)?
[0-9]+".)

This might get you an element that has copyright info followed by
other information, but that doesn't seem like a huge problem for
intended consumers.

However, I would want to know what the anticipated consumers are to
decide if something like this would be worthwhile.

Regards,
Maciej
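
The detection steps proposed in the message above could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of the proposal: the function name find_copyright_notices and the strict flag are hypothetical, markup is parsed as well-formed XML with Python's standard xml.etree.ElementTree, only <section> elements are treated as contained sections, and class="copyright" is taken to mean that "copyright" is one of the whitespace-separated class tokens. The optional stricter pattern is written without the space before the year digits, on the assumption that the space in the quoted regexp is a line-wrap artifact.

```python
import re
import xml.etree.ElementTree as ET

# Optional stricter check from the message (assumption: the space before
# "[0-9]+" in the quoted regexp is a line-wrap artifact and is omitted).
STRICT_NOTICE = re.compile(
    r"^copyright[ \t\n\r]+((\(C\)|©)[ \t\n\r]+)?[0-9]+", re.IGNORECASE
)


def find_copyright_notices(section, strict=False):
    """Return the elements that count as copyright notices for `section`."""
    notices = []

    def walk(element):
        for child in element:
            # Step 1: skip anything inside a contained section
            # (assumption: only <section> elements delimit sections).
            if child.tag == "section":
                continue
            # Step 1: candidate elements carry the class token "copyright".
            if "copyright" in child.get("class", "").split():
                # Step 2.a: textContent with leading whitespace stripped.
                text = "".join(child.itertext()).lstrip()
                if strict:
                    matched = STRICT_NOTICE.match(text) is not None
                else:
                    # Step 2.b: compare the first 9 characters, ignoring case.
                    matched = text[:9].lower() == "copyright"
                # Step 2.c: a match makes the element a notice.
                if matched:
                    notices.append(child)
            walk(child)

    walk(section)
    return notices


if __name__ == "__main__":
    sample = ET.fromstring(
        '<section><p class="copyright"> Copyright 2007 Example</p></section>'
    )
    for notice in find_copyright_notices(sample):
        print("".join(notice.itertext()).strip())
```

As the message notes, an element accepted this way may carry other text after the copyright information; the sketch returns the whole element and leaves any further trimming to the consumer.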
Received on Wednesday, 16 May 2007 21:57:19 UTC