- From: Manu Sporny <msporny@digitalbazaar.com>
- Date: Tue, 26 Aug 2008 22:50:25 -0400
- To: Ian Hickson <ian@hixie.ch>
- CC: WHAT-WG <whatwg@whatwg.org>, "www-archive@w3.org" <www-archive@w3.org>
Hi Ian, The second part of the replies to your questions regarding RDFa are below. Note that the list of technical merits I was affording RDFa was not meant to be exhaustive and I won't be adding to them in the body of this particular e-mail. There are additional ones, however, and we can get a more complete list of RDFa features to this mailing list in time. Ian Hickson wrote: >> If one understands web semantics to be an important part of the web's >> future, the question then becomes, why RDFa? Why not Microformats? >> >> While there are a number of technical merits that speak in favor of RDFa >> over Microformats (fully qualified vocabulary terms > > Why is this better? I am assuming that you are questioning why "fully qualified vocabulary terms" are better than "non-qualified vocabulary terms", so I will answer based on that assumption. There are several possibilities when specifying vocabulary terms that the Microformats and RDFa communities have explored. Strictly non-prefixed terms, emulated/pseudo-namespace qualified terms, and fully qualified URLs. Strictly Non-prefixed vocabulary terms -------------------------------------- The non-prefixed terms approach is what Microformats do 90% of the time. For example, the vocabulary term "photo" can be used to describe a static image for a particular resource. For example: <img class="photo" src="/images/sunset-at-sea.jpg" /> The Microformats community has taken this approach because it is the easiest on web page authors. Having to memorize long strings of pseudo-namespaced vocabulary terms like "com.yourcompany.yourproject.package.term" is more difficult than learning one simple word whose semantics does not change from vocabulary to vocabulary. While the benefit is a lowered barrier to entry, the drawback is a guaranteed collision of vocabulary terms as the number of vocabularies increase. How long would it take to have somebody else from another community come along and re-use the term "photo" for a completely different purpose? The answer, as it relates to the Microformats community, was "not that long". When we started developing the hAudio Microformat, the first thing we wanted to do was use the term "title" to refer to the "title of the audio recording". It seemed to be the most straight-forward vocabulary term that expressed what we wanted it to describe. We checked several dictionaries and the English definition for "title" was in-line with what we intended. However, it conflicted with one of the 8 other compound Microformats - hCard. It turns out "title" was defined as "job title" and so could not be re-used by any other Microformat, to mean anything other than "job title", no exceptions. From now on "title" will mean "job title" for the rest of time. If anyone were to use "title" in their semantic vocabulary outside of the Microformats community, the term would eventually become semantically meaningless. What happens when both vocabularies are used on the same page? There are examples of this being an issue on the Microformats site[5]. When you use non-prefixed vocabulary terms, the chances that there will be a vocabulary term conflict between two communities rises exponentially with relation to an increase in the number of total vocabularies. This approach is not scalable and was never meant to be scalable. Emulated-namespace/Pseudo-namespace vocabulary terms ---------------------------------------------------- Emulated-namespace/Pseudo-namespace (EN/PN) vocabulary terms have been mentioned on this list during the various RDFa discussions over the past week. Microformats have resorted to emulated namespaces when forced to do so, for example, the hAtom uF "entry-title" vocabulary term[1]. Emulated namespaces can take the form of a set of words separated by a namespace qualifier, for example: <img class="hcard.photo" src="/images/sunset-at-sea.jpg" /> <img class="org.microformats.hcard.photo" src="/images/sunset-at-sea.jpg" /> <img class="DC.title" src="/images/sunset-at-sea.jpg" /> Each one of these is an example of an emulated/pseudo-namespace. This approach has been rejected by the Microformats community because it is believed that namespaces are more difficult for webpage authors to learn and are thus best if avoided due to the limited scope of Microformats. The approach was rejected by the RDFa community because it doesn't follow core W3C TAG practices and instead invents a new method of specifying resources on the web that are not dereference-able. In short - it re-invents the URI wheel unnecessarily. Fully Qualified URLs -------------------- The fully-qualified URL approach is the one that has been adopted by the RDFa community. Fully qualified vocabulary terms specified using URLs look like the following (take special note of the @typeof and @property attributes): <div about="#thunder" typeof="http://purl.org/media/video#Movie"> <b property="http://purl.org/dc/terms/title">Tropic Thunder</b> </div> While this is the most verbose method of vocabulary term expression, it has a number of benefits that the other two methods do not provide: 1. You are guaranteed to not have any sort of vocabulary term collision. 2. It re-uses the concept of a URL, a method of namespace expression familiar to all Web users (web page authors, developers, and most importantly, regular folks using the web). 3. The link is directly dereference-able, meaning that one can put any vocabulary term expression into a web browser and get the definition of the term. For example, go to the following vocabulary term that defines the concept of a "Movie" in the -> Video RDF vocabulary: http://purl.org/media/video#Movie >> prefix short-hand via CURIEs > > This is definitely not better. I don't know where you're coming from since you haven't elaborated on that statement nor given a link to a document explaining your thought process. Since you haven't done so, all I can do is shoot in the dark as to what your issue with CURIEs might be... Let me first start with why we have this URL short-hand (aka: CURIEs) in the first place. It is a feature that helps web authors and others that are writing this stuff by hand to refer to long URLs in an easy way. This means that the following: <div about="#thunder" typeof="http://purl.org/media/video#Movie"> <b property="http://purl.org/dc/terms/title">Tropic Thunder</b> </div> can be written like so, when using CURIEs: <div about="#thunder" typeof="video:Movie"> <b property="dcterms:title">Tropic Thunder</b> </div> all one must do to enable CURIEs, is to define the prefixes at any DOM element that is higher in the tree, like so: <div xmlns:video="http://purl.org/media/video#" xmlns:dcterms="http://purl.org/dc/terms/" ... I can already hear the screams of protest on this list, as I understand this to be the one of the most evil things that you can do in the WHATWG group. :) We have been discussing an alternate way of expressing prefixes, like so: <div prefix="video=http://purl.org/media/video#" The @prefix attribute above would take a space-separated list of prefixes as CDATA, which could address one of the issues that the HTML5 community has with the CURIE proposal. However, I believe that we are far from discussing this at the present time - the WHATWG would have to acknowledge that web semantics is a problem they are interested in addressing with HTML5. Perhaps you could outline the reasons that the HTML5 community is so allergic to the concept of URL-short-hand using prefix mapping in HTML documents? I ask out of curiosity and because I have never heard the whole story from the WHATWG's perspective. >> accessibility-friendly > > How is not reusing HTML semantics better than using them? Could you explain further and include examples, please? I don't understand how RDFa's approach is "not reusing HTML semantics". > With the > exception of the now-resolve <time> issue, it seems like Microformats has > the better accessibility story. I was indirectly addressing the problem of semantically embedding machine-readable data-types that are not human friendly. Data such as dates, times, currency codes, weights, distances and most other types of ISO-codes that are not meant to be human readable, but necessary for proper semantic expression of units of measure. The example that I have from the Microformats community is the mis-use of the <abbr> element[2] in order to stuff dates and times into the hCalendar, hAtom and hReview specifications as well as durations into the hAudio specification[3]. This approach has resulted in a rejection[4] of Microformats on several BBC web properties. The addition of <time> only fixes things for Microformats in HTML5 - it does nothing for the expression of these machine-readable data-types in HTML4, XHTML1.1, and XHTML2. HTML5 is certainly not going to include a <currency> or <weight> or <force> element for mark-up of those data types for accessibility-friendly browsers. I am asserting that RDFa can solve this issue for all HTML communities without causing accessibility concerns, like the <abbr> design pattern issue that is still unsolved in the Microformats community. This semantics and accessibility issue extends outside of the HTML5 community, and I believe that it has been addressed for the most part in the HTML4, XHTML1.1 and XHTML2 communities through the adoption of RDFa. This is not to say that it is the only solution - just that it is the only solution that I am aware of that addresses the accessibility concerns when embedding semantic data in web pages using legacy elements such as <abbr>. >> unified processing rules, etc) > > Microformats could certainly benefit from a more consistent parsing model, > but that can be obtained without going to RDFa. Yes, I agree. If the parsing model were the only thing that needed to be fixed, then we wouldn't need RDFa. However, there is more that RDFa addresses. If you had a fully consistent parsing model for Microformats, you would still have the issue of erroneous scoping in Microformats: http://microformats.org/wiki/accepted-limitations-of-microformats#Microformat_Scoping_Issue Let us assume that the scoping issue is fix-able as well, and so the only major issue that remains, then, is the strictly non-prefixed vocabulary term issue, and if you address that - you basically have RDFa. >> [...] this issue really boils down to one of centralized innovation vs. >> distributed innovation. > > I don't see what the syntax has anything to do with whether the formats > are developed centrally or not. Then I have failed to explain how the syntax of the language affects the vocabulary development process. Let me try again: When a language choses to not use namespaces, the onus of the vocabulary term development falls onto a central authority that must manage those vocabulary terms in order to ensure that a collision does not occur. Imagine the Java programming language without namespaces, all class names would have to be globally unique, as in - if you created a popular class called "Parser" and released it to the world, nobody else in the entire world would be able to create a class named "Parser" for fear that it would conflict with your "Parser" class. However, if you add a namespace to that class, such as "org.whatwg.Parser" - it becomes possible to create class names in a decentralized environment. Taking this one step further, due to the reasons stated earlier in this e-mail, using arbitrary namespace qualifiers in HTML is unnecessary as we already have a readily available namespace qualifier on the web - the URL. > Nothing is stopping anyone from creating > another Microformats-like organisation that does the same thing without > going through the Microformats.org process. There could be millions of > them, in fact. So long as they pick names that are suitably unique (e.g. > URIs, or Java-like identifiers), or so long as they don't promote the use > of their formats outside of their own site, I don't see a problem. > > In fact, this is happening every day, with each author making up his own > class values for use on this own site. Yes, agreed... but internal website vocabularies are not the issue here - external website vocabularies are one of the issues that RDFa addresses. One of the questions we were tasked with answering was "How do you create vocabulary terms that can be easily inter-mixed with vocabulary terms from another online community on any HTML page contained on any site on the Web?" The key here is that we DO want people to promote the use of their vocabularies and formats outside of their own site, and we want them to do so without having to deal with any central authority. >> The Microformats community, and all communities like it, require a group >> of people to come together, collaborate and create a standard vocabulary >> to express ALL semantics. > > Well, for any one person to do anything useful with the data on the Web, > they have to have a core vocabulary (or set of vocabularies) that they > understand. So a set of standard vocabularies to express all the semantics > that that one person is interested in is needed, yes. > > It doesn't have to be _all_ semantics, however. I might want to have a > format for annotating Stargate analysis Web pages that me and my friends > write, but so long as me and my friends agree on it it doesn't have to > involve anyone else. Sure, and that's great and that problem has already been addressed to some degree. That is not the issue, however. The issue is "How do you use that vocabulary outside of your friend group (for instance, between all of the Stargate fan communities on the Web) and ensure that the vocabulary terms do not conflict with any other vocabulary on the Web?" >> A somewhat strained analogy would be bringing in representatives from >> all of the cultures of the world and having them agree on a universal >> vocabulary. > > That's pretty much exactly what Unicode did. Or what we're doing with > HTML. That doesn't seem untennable, it seems quite reasonable. No, what Unicode did was bring in representatives from all of the cultures of the world and it had them agree on a universal /character set/ - not a universal /vocabulary/. The difference is that no mainstream culture had to choose to not have their character set represented. Unicode can encode a majority of the characters used in all languages around the world, nobody had to make any hard decisions on which character set to leave behind. My point was that they didn't bring representatives in from around the world and tell them to decide on a single language - English, Mandarin, French, Sanskrit, or Esperanto. Pick one. Those are the types of decisions you are forced to make when you have centralized vocabulary development. You have centralized vocabulary development if your semantic expression language does not have the capability of namespacing vocabulary terms. Does that train of thought clarify the point I was attempting to make? >> In short, RDFa addresses the problem of a lack of a standardized >> semantics expression mechanism in HTML family languages. RDFa not only >> enables the use cases described in the videos listed above, but all use >> cases that struggle with enabling web browsers and web spiders to >> understand the context of the current page. > > I'm not convinced the problem you describe can be solved in the manner you > describe. It seems to rely on getting authors to do something that they > have shown themselves incapable of doing over and over since the Web > started. It seems like a much better solution would be to get computers to > understand what humans are doing already. I agree with your last two points, but not the first. Yes, some authors have shown themselves incapable of marking up certain types of metadata in HTML pages since the dawn of the Web. Yes, a better solution would be to get computers to understand what humans are doing already. Tools and education are vital, but not necessary, in addressing the laziness issue. RDFa is a tool that we can use to address the issue of machine learning. I won't go into the whole machine learning thing again other than to state that RDFa and machine-based learning are not mutually exclusive. They are mutually beneficial. As for not being convinced, I believe that I am addressing your concerns in a logical manner and am informing this community as to why some of the suggestions that have been made over the past week by members of WHATWG are not the proper approach to the problems that RDFa is addressing. I trust that you will continue to address the issues I am raising in the same logical manner and explain why WHATWG doesn't consider semantics to be an issue that needs to be addressed. It is unfortunate that it takes so much time to convey over a decade of experience and work performed by members of the Semantic Web Deployment workgroup as well as those involved in the RDFa Task Force and Microformats community. I trust that all on this list will continue to be receptive to what those that have been involved with these issues have to say as we continue to answer questions posed by members of WHATWG. > Even if we ignore that, it doesn't seem like the above discussion would > lead one to a set of requirements that would lead one to design a language > like RDFa. You will have to clarify this statement and give some examples, Ian. I am sure that there are holes in my explanation, however, we should be careful not to prematurely discount RDFa. It went through a very rigorous process involving many different people from many different communities that settled on one method, out of hundreds of permutations, for semantic expression in HTML. RDFa is going to be published as a W3C Recommendation in the coming months (it just got through the Candidate Recommendation phase and will be transitioning to the Proposed Recommendation phase very shortly). I need to know your thoughts about this topic in a bit more depth - why doesn't it seem like the above discussion would lead one to a set of requirements that would lead to a semantic expression mechanism like RDFa? > Thanks for the explanation, by the way. This is by far the most useful > explanation of RDFa that I have ever seen. Good, I'm glad it helped as there is much that people do not understand about RDFa. Thanks for listening thus far, and I'm committing time over the next two weeks to answer any questions that this community might have as a result of our work in the Microformats and RDFa communities. RDFa has changed a great deal in the past two years and what most people know of RDF and RDFa predates 2005. For those that are interested in all of the gory details, there is an RDFa Primer which is a pretty quick read, available here: http://www.w3.org/TR/xhtml-rdfa-primer/ The full RDFa specification, which is not a quick read, is available here: http://www.w3.org/TR/rdfa-syntax/ -- manu [1]http://microformats.org/wiki/namespaces-inconsistency-issue [2]http://microformats.org/wiki/datetime-design-pattern [3]http://microformats.org/wiki/haudio#Published [4]http://www.bbc.co.uk/blogs/radiolabs/2008/06/removing_microformats_from_bbc.shtml [5]http://microformats.org/wiki/accepted-limitations-of-microformats#Microformat_Scoping_Issue -- Manu Sporny President/CEO - Digital Bazaar, Inc. blog: Bitmunk 3.0 Website Launches http://blog.digitalbazaar.com/2008/07/03/bitmunk-3-website-launches
Received on Wednesday, 27 August 2008 02:51:05 UTC