- From: James Graham <jg307@cam.ac.uk>
- Date: Thu, 28 Dec 2006 00:58:28 +0000
Mike Schinkel wrote:
> Matthew Paul Thomas wrote:
>
>> On Dec 22, 2006, at 3:23 AM, Benjamin Hawkes-Lewis wrote:
>>
>>> Henri Sivonen wrote:
>>> ...
>>>
>>>> Also, it seems to me that the usefulness of non-heuristic machine
>>>> consumption of semantic roles of things like dialogs, names of
>>>> vessels, biological taxonomical names, quotations, etc. has been
>>>> vastly exaggerated.
>>>>
>>> I'm not entirely sure what "non-heuristic machine consumption" is,
>>>
>> An example of non-heuristic machine consumption is where
>> Google Glossary thinks: "In an HTML 3.2 or earlier document
>> containing the code '<dl><dt>foo<dt> <dd>bar</dd></dl>',
>> 'bar' is a definition of 'foo'". (It probably thinks the same
>> about HTML 4 documents, too, which is applying a small
>> "ignore that nonsense about dialogues" heuristic.)
>>
>> An example of heuristic machine consumption is where Google Glossary
>> thinks: "In an HTML document containing the code
>> '<p><b>foo:</b> bar</p>', 'bar' is probably a definition of
>> 'foo', especially if the page has several consecutive
>> paragraphs with that structure and different bold text."
>>
>> Non-heuristic machine consumption fails when semantic
>> elements are abused, and becomes impractical when elements have
>> multiple popular meanings (examples of the latter include
>> <dl> in HTML 4, and <p> in HTML 5). Heuristic machine
>> consumption fails occasionally by the very nature of
>> heuristics (examples currently include
>> <http://www.google.com/search?q=define:author> and
>> <http://www.google.com/search?q=define:editor>.)
>>
>
> The origin of this thread was my request for adding attributes to all
> elements to support microformat-like semantic markup. Based on the context
> of your reply, it seems you are agreeing with Matthew Raymond in his
> assertion that using microformat-like semantic markup is A Bad Thing(tm). Am
> I understanding your position correctly? (If I'm not, please forgive me.)
>

Actually, IMHO mpt's point is far broader and consequently more
important than the confines of the original thread. The point, as I
understand it, is that machine analysis of "semantic" markup fails if
the markup construct is (ab)used in so many different ways that the
interpretation of any particular fragment is no longer unambiguous.
This is a sort of "heat[1] death" of the original semantics; as the
use of an element becomes increasingly disordered (i.e. higher
entropy), it becomes impossible to extract any useful information
from the use of that element. This is critical in the proper design
of semantic markup languages, because one wishes to stave off the
heat death as long as possible so that, as far as possible, UAs can
perform useful functions based on the information in the markup
(e.g. render it to a medium for which the content was not explicitly
designed).
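(To make that failure mode concrete, here is a minimal sketch, using
Python's standard html.parser and two invented sample documents, of a
non-heuristic consumer in mpt's sense: it trusts that every <dt> names
a term and every <dd> defines it. The class name and the samples are
mine for illustration, not anything Google Glossary actually does.)

from html.parser import HTMLParser

# Non-heuristic consumption: trust the markup literally.
# Every <dt> is taken to name a term and the following <dd> to define it.
class GlossaryExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_dt = False
        self.in_dd = False
        self.term = None
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self.in_dt = True
        elif tag == "dd":
            self.in_dd = True

    def handle_endtag(self, tag):
        if tag == "dt":
            self.in_dt = False
        elif tag == "dd":
            self.in_dd = False

    def handle_data(self, data):
        if self.in_dt:
            self.term = data.strip()
        elif self.in_dd and self.term is not None:
            self.pairs.append((self.term, data.strip()))

# Two invented documents that use identical markup for different purposes.
definitions = "<dl><dt>entropy</dt><dd>a measure of disorder</dd></dl>"
dialogue = "<dl><dt>Alice</dt><dd>I never said it was sensible.</dd></dl>"

for source in (definitions, dialogue):
    extractor = GlossaryExtractor()
    extractor.feed(source)
    print(extractor.pairs)

# The first run yields a genuine definition; the second claims that
# "I never said it was sensible." is the definition of "Alice". Once <dl>
# has acquired both meanings, nothing in the markup lets the consumer
# tell the two uses apart.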
Obviously I don't know how to achieve this, but there are a few things
to consider:

* Have enough elements. If there are obvious holes that people can't
  fill with existing elements used properly, they will reuse existing
  elements in new ways, so increasing their entropy.

* Don't have too many elements. If there are too many elements, people
  won't understand them all and will reuse existing elements in the
  "wrong" way, so increasing their entropy.

* Make the semantics of elements well defined. Start the elements in a
  "low entropy", i.e. highly ordered, state. Make it obvious how the
  element is intended to be used (and restrict the valid uses to ones
  that can be discriminated by machine) so that fewer people
  accidentally abuse it.

* Have some "high entropy" elements. This is the counterintuitive one.
  The goal, remember, is to extract as much information as possible
  from the semantically well-defined elements. However, in many
  situations there will not be a relevant element to use, the
  publishing setup will not be optimized for selecting the correct
  semantic element (think WYSIWYG editors), or the author will not be
  sufficiently familiar with the language semantics to make a
  well-informed choice about the right element to use. In this case,
  providing (and encouraging the use of!) a set of high-entropy
  "bit-bucket" elements that are semantically meaningless is very
  beneficial, because they prevent the entropy increase associated
  with the abuse of the semantic elements. The increasing misuse of
  <em> as a "more semantic" <i> is an example of what happens when
  this policy is not followed.

* Allow easy extensions. Having an extension mechanism for those who
  need more functionality is one way to stop the abuse of existing
  elements. This has to be sufficiently easy to use that it can be
  widely adopted, but powerful enough that it can replicate all the
  semantic features of the host language. (A rough sketch of such an
  extension in use follows this list.)
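(As a rough illustration of that last point, assuming the extension
takes the microformat-like shape the original thread asked about, a
class token usable on any element, a consumer can key on the token
instead of on a repurposed <i> or <em>. The "vessel" token, the sample
document and the class name below are invented for the example.)

from html.parser import HTMLParser

# A consumer keyed to an agreed extension token rather than to a
# repurposed built-in element. Authors who need "name of a vessel"
# semantics can say so explicitly, so the built-in elements keep
# their low-entropy meanings.
class TokenExtractor(HTMLParser):
    def __init__(self, token):
        super().__init__()
        self.token = token
        self.capturing = False
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.token in classes:
            self.capturing = True
            self.matches.append("")

    def handle_data(self, data):
        if self.capturing:
            self.matches[-1] += data

    def handle_endtag(self, tag):
        # Good enough for this flat example; a real consumer would
        # track nesting before deciding the marked element has closed.
        self.capturing = False

doc = '<p>The <span class="vessel">Cutty Sark</span> is a clipper ship.</p>'
extractor = TokenExtractor("vessel")
extractor.feed(doc)
print(extractor.matches)  # ['Cutty Sark']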
This post was brought to you by the society for dodgy physical
analogies concocted in the middle of the night.

[1] Or, if you like, "Entropy death". Of course, this has nothing to
do with real physical entropy, but a lot to do with the common
association between the second law of thermodynamics and the concept
of disorder.

Received on Wednesday, 27 December 2006 16:58:28 UTC