- From: Robert Burns <rob@robburns.com>
- Date: Tue, 21 Aug 2007 14:45:30 -0500
- To: Leif Halvard Silli <lhs@malform.no>
- Cc: Julian Reschke <julian.reschke@gmx.de>, Karl Dubost <karl@w3.org>, Dan Connolly <connolly@w3.org>, "public-html@w3.org WG" <public-html@w3.org>, Sam Ruby <rubys@us.ibm.com>
Hi Leif,

I think I'm seeing what you're saying now. I think you're looking to solve a more immediate problem than I was thinking about. I'll explain below.

On Aug 21, 2007, at 10:57 AM, Leif Halvard Silli wrote:

> 2007-08-21 16:26:19 +0200 Robert Burns:
>> On Aug 21, 2007, at 8:56 AM, Leif Halvard Silli wrote:
>>> The main thing that I agree very strongly with Karl on is that the offline and online "gap" should be bridged, and that this can happen through setting up clear/strict recommendations for which extensions to use - which all sides (authors, authoring software, browsers, servers) should pay attention to. This bridging should include official language and charset extensions, taking example from Apache, which already offers its own such extensions, and has done so for a very long time already.
>
>> For character encodings I think things are somewhat a mess. Most authors are not that aware of character encodings. To me it's really the type of thing authors should not have to worry about (if it had been handled in a sane way from the start).
>
> Who said that I thought the authors should worry about them - any more than he cares whether his application uses .htm, .html or anything else? Or any more than he cares about how the META tag for encoding specification is written (which, btw, is very hard to remember how to write)?

I was thinking more that authors should never have to think about encodings at all (with encodings instead being handled by text processing applications), though that may be a solution in a different, parallel universe. In other words, if the first one or two bytes of every text file included a registered byte sequence that mapped to a specific character encoding, then every text editor would simply set those bytes according to whichever encoding it was serializing the file to. It's just the kind of thing that no one would need to think about (other than text processing application developers). The one exception is if a server wanted to make multiple encodings of the same document available. However, I don't think this is very common, even though it's fairly easy to set up on Apache. Instead we just have a few encodings that everyone supports (and UAs are expected to support many encodings too, which reduces the need for negotiation).

Of course we don't live in that universe, and it's difficult for me to imagine how to get there. So much legacy software would not know what to do with those leading bytes. Users would see the leading bytes as control characters and delete them. And so on. However, the UTF encodings do get us very close to that with their byte order marks (BOMs). I would love to see us simply recommend (as in SHOULD) UTF-8 or UTF-16 (with authoritative BOMs) for all HTML5 documents.

> The author should not need to care whether his or her authoring application adds the charset extension, adds a META element with charset information, or does it some other way.

Well, many authors (of all skill sets) use text editors and interact with the serialization much more closely than that. The charset value is unlike any other text an author composes in an HTML document. Changing the value of an attribute or typing a different element changes the meaning of a document. However, typing a different value for the charset does not change the meaning of the document; it just makes it wrong.
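To make the BOM approach concrete, here is a rough sketch of how a text processing application could treat the leading bytes as the authoritative encoding. (Illustrative Python only; the BOM table and the caller-supplied fallback are my own assumptions, not anything from a spec.)

```python
# Sketch: treat a leading byte-order mark as the authoritative encoding
# declaration, so neither authors nor servers have to manage it by hand.
# Only the UTF family is covered; anything else falls back to a default
# chosen by the caller (the default shown here is purely illustrative).

BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),   # must be checked before utf-16-le
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def sniff_encoding(path, default="windows-1252"):
    """Return the encoding implied by the file's leading bytes, if any."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding
    return default  # no BOM: fall back to internal metadata or a default
```

An editor would do the reverse on save: write the BOM for whatever encoding it serializes to, and the author never has to touch it.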
File types are a bit different than that. I can change a filename extension from .html to .xhtml and it's not necessarily wrong (it just means I want a different processing of the document), though at other times it is just wrong (from .jpeg to .html). Again, encodings are very different in that the author changes something that the application should probably handle more opaquely (like through an invisible byte sequence).

> However, the sad thing is that **if** the author and his application use a charset extension, then, in an offline milieu, the browsers are likely to not make any sense of the charset extension.

This is where I think you're onto something. It would be good for browsers to respect those extensions when opening local files, especially since we don't have the parallel-universe approach I'd like. However, it's not clear the filename extension would be all that much faster than looking inside the file for the internal metadata.

>> Finally, for languages, it's useful for servers to have metadata about language at their disposal to quickly deliver to clients.
>
> These extensions are useful for authors too. It is very practical to discern different variants of the same file/content based upon the file extension. For authors, having to look into the file is cumbersome.

Any author is free to use extensions in any way they want (as long as they don't get in the way of the final filename extension used for file typing). There's nothing we need to say in the draft or elsewhere to enable that practice, is there?

>> However, I like the way HTML handles that already through the i18n language features. Apache can even be configured to sniff inside the files as they're added to the server to gather this data for quick indexing later.
>
> The problem which .html and .xhtml reveal is that servers put more weight on the file extension than on what is written inside the file.

See, this is where I think there's an important difference. Filename extensions are the way we (authors) set the file type for files. Certainly we can set it incorrectly, but unlike encodings, there can be multiple compatible file type treatments for the same file. In other words, the same file can be treated as .xhtml, .html, .txt or even as a raw stream of bytes. Setting the type that way is how the author indicates how to treat the file. No amount of intelligence can sniff inside the file to find out how the *author/user* wants that file treated. Once again, it makes no sense to treat an 8859-11 file as an 8859-2 file: the file is either one or the other. Likewise for language: it's either Russian or English (at least its main language is one or the other, though I guess a mixed and balanced document might be switchable).

> Besides, one of the purposes of language extensions is content negotiation. Well, if Apache can do that without language extensions, then fine, that's an extra feature (which even fewer people know about).

So would this be a recommendation for editing and conversion UAs? In other words, they would output filename extensions for encoding and language so as to be content-negotiation ready?

>> So all of these pieces of metadata each have their own place I think. The
>
> .HTML is also metadata.

Yes, agreed. And its place is in the file extension. Also, as you point out, Apache has popularized the practice of using extensions for other purposes too.
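Just to spell out what those chained extensions carry, here is a rough sketch of reading type, charset and language hints out of a name like file.html.utf8.en. (Illustrative Python only; the tiny mapping tables are my own simplification of what Apache's AddType/AddCharset/AddLanguage directives do, not Apache's actual algorithm.)

```python
# Sketch: interpret Apache-style chained filename extensions such as
# "file.html.utf8.en". The mapping tables are deliberately small and
# only illustrative; a real server reads them from its configuration.

TYPES = {"html": "text/html", "xhtml": "application/xhtml+xml", "txt": "text/plain"}
CHARSETS = {"utf8": "UTF-8", "latin1": "ISO-8859-1"}
LANGUAGES = {"en": "en", "no": "no", "ru": "ru"}

def metadata_from_name(filename):
    """Collect whatever metadata the trailing extensions happen to encode."""
    meta = {}
    for ext in filename.lower().split(".")[1:]:
        if ext in TYPES and "type" not in meta:
            meta["type"] = TYPES[ext]
        elif ext in CHARSETS:
            meta["charset"] = CHARSETS[ext]
        elif ext in LANGUAGES:
            meta["language"] = LANGUAGES[ext]
    return meta

print(metadata_from_name("file.html.utf8.en"))
# {'type': 'text/html', 'charset': 'UTF-8', 'language': 'en'}
```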
However, there are two very different situations: 1) a UA opening a local file, and 2) a server receiving a request for a file from a remote UA. It's easy for the local UA to simply check the internal metadata on the file, whereas the server gets a request for a file and then needs to quickly find the right file to deliver to the remote client. In the first case the UA is just opening a file. In the second case the server has to find the right file. That, to me, is the reason for having the metadata handy (whether in chained filename extensions, in SQL databases, or in filesystem attributes or extended attributes).

>> safest thing is to keep the authoritative data inside the file itself, and then extract it and index it in filesystem metadata or elsewhere for quick retrieval. Many filesystems (and WebDAV too) support extended filesystem
>
> That extraction process is not the simple solution that Karl asked for. I want to save the file and test immediately. And not wait for Spotlight or a big fast computer.

Again, the local situation is different enough from the server situation that it's really handled by the internal metadata. Or are you just saying you want editing UAs to assist authors in outputting this extracted metadata as filename extensions?

> Besides, even Mac OS X comes with Apache. And the reason why I, on <MyOwnMac.local>, get Apache's default index.html page in Norwegian instead of English is precisely because the installed version of Apache has implemented filename-extension-based language negotiation.

So are you suggesting that filesystems and file browsers change so that all files with the same content in different languages get presented as a single file? Then 'file.html' would really point to two files: 'file.html.utf8.ru' and 'file.html.utf8.en'. So when double-clicking on (or typing 'open' in the terminal for) 'file.html', the filesystem would select the right one based on the user's stated language preference? Is that what we're talking about? Would users also have their own encoding preferences (again, this seems like something most users wouldn't care about)? That's why I was saying that each kind of metadata has its place. They each have different relations to the file, the user's preferences and the operating environment. I'm just trying to understand how we would leverage what Apache does for local files.

>> attributes. Some tools have started to store this information there. Systems like Apple's Spotlight extract authoritative metadata from files and store it in a SQLite database for indexing (but also make use of filesystem attributes and extended attributes alongside the SQL). To me those approaches represent best practice. Filenames (and their extensions) can be too easily and inadvertently changed, losing that metadata. The best thing to do is keep it inside the file (with the exception of file type, which has now had a long tradition of filename extension mapping).
>
> I am interested in capitalizing on what we already have. And I do not see these file extension problems that you see. Besides, you can put things both inside the file and in the file name. That is very safe, if the content is lost - which can also happen.

I'm still not clear what purpose the filename extension metadata would serve. When it's already there (because of an Apache installation), it could be used, but how?
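One possible answer for the local case, sketched very roughly: a file browser or 'open' command could mimic Apache's negotiation when the requested name doesn't exist but language-suffixed variants do. (Illustrative Python only; the glob pattern, the preference list and the fallback are my assumptions, not what Apache or any file browser actually does.)

```python
# Sketch of local, Apache-style language negotiation for a double-clicked
# file: given "file.html" and the user's language preferences, pick the
# best sibling variant such as "file.html.utf8.en" or "file.html.utf8.ru".
from pathlib import Path

def pick_variant(requested, preferred_languages):
    """Return the variant file that best matches the user's languages."""
    requested = Path(requested)
    if requested.exists():
        return requested  # an exact file always wins
    candidates = list(requested.parent.glob(requested.name + ".*"))
    for lang in preferred_languages:            # most-preferred first
        for candidate in candidates:
            if candidate.suffix.lstrip(".") == lang:
                return candidate
    return candidates[0] if candidates else None  # arbitrary fallback

# A user whose preference list is ['no', 'en'] opening 'file.html' would
# get 'file.html.utf8.no' if it exists, otherwise 'file.html.utf8.en'.
```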
Also, one of the points I tried to raise in my earlier response relates to modern filesystems. We're long past the filename extension disputes that flared up on Mac OS over the years. The internet is really what brought filename extensions for file typing to Mac OS (it was already happening before Mac OS X, largely due to the web and the internet in general). However, we live in a very different time now. In the early '90s, Mac OS was really one of the only widely used systems whose filesystem supported file type attributes. Today nearly every widely used filesystem either already supports or soon will support extended filesystem attributes. This means we have better places to store extracted or otherwise determined metadata for files.

Also, with XML-RPC and the like, the transport of metadata can be handled fairly easily. This means we can store and transport all sorts of file metadata without overloading the filename extension. As Sander hinted, this also means that file type settings and other metadata attributes can be localized (e.g., storing standard IANA Latin-script encoding types while presenting them to users as fully localized names). However, I guess I'm getting way off-topic here.

Take care,
Rob
Received on Tuesday, 21 August 2007 19:45:48 UTC