mdimporter for XHTML files on Mac OS X

On Mac OS X, Spotlight does not index XHTML files the same way it indexes HTML files.
But there is a solution.

# Before being XHTML aware.

Here is an example of an XHTML file and what Spotlight knows about it. The information is very basic: nothing related to the content of the file.


→ mdls bnf.xhtml 

kMDItemContentCreationDate     = 2011-10-01 11:47:27 +0000
kMDItemContentModificationDate = 2013-01-07 00:56:56 +0000
kMDItemContentType             = "public.xhtml"
kMDItemContentTypeTree         = (
    "public.xhtml",
    "public.xml",
    "public.text",
    "public.data",
    "public.item",
    "public.content"
)
kMDItemDateAdded               = 2011-10-01 11:47:27 +0000
kMDItemDisplayName             = "bnf.xhtml"
kMDItemFSContentChangeDate     = 2013-01-07 00:56:56 +0000
kMDItemFSCreationDate          = 2011-10-01 11:47:27 +0000
kMDItemFSCreatorCode           = ""
kMDItemFSFinderFlags           = 0
kMDItemFSHasCustomIcon         = 0
kMDItemFSInvisible             = 0
kMDItemFSIsExtensionHidden     = 0
kMDItemFSIsStationery          = 0
kMDItemFSLabel                 = 0
kMDItemFSName                  = "bnf.xhtml"
kMDItemFSNodeCount             = 6447
kMDItemFSOwnerGroupID          = 502
kMDItemFSOwnerUserID           = 502
kMDItemFSSize                  = 6447
kMDItemFSTypeCode              = ""
kMDItemKind                    = "HTML"
kMDItemLogicalSize             = 6447
kMDItemPhysicalSize            = 8192


# MODIFYING the mdimporter.

* Go to /System/Library/Spotlight
* Find RichText.mdimporter
* Right-click on it and choose "Show Package Contents".
* Inside the Contents folder, open the Info.plist file with your text editor (TextMate, Sublime Text, etc.),
  or use something along the lines of
  sudo subl /System/Library/Spotlight/RichText.mdimporter/Contents/Info.plist
* You will see something like:

    <array>
    <string>public.rtf</string>
    <string>public.html</string>
    <string>public.xml</string>
    <string>public.plain-text</string>
    <string>com.apple.traditional-mac-plain-text</string>
    <string>com.apple.rtfd</string>
    <string>com.apple.webarchive</string>
    <string>org.oasis-open.opendocument.text</string>
    <string>org.openxmlformats.wordprocessingml.document</string>
    </array>

* Edit it to add <string>public.xhtml</string>:

    <array>
    <string>public.rtf</string>
    <string>public.html</string>
    <string>public.xhtml</string>
    <string>public.xml</string>
    <string>public.plain-text</string>
    <string>com.apple.traditional-mac-plain-text</string>
    <string>com.apple.rtfd</string>
    <string>com.apple.webarchive</string>
    <string>org.oasis-open.opendocument.text</string>
    <string>org.openxmlformats.wordprocessingml.document</string>
    </array>

* Save it
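
Since this is a system file, it is probably wise to keep a copy and to check the result after saving. A minimal sketch, assuming the plist is stored as XML text (as in the excerpt above). The helper names lists_xhtml and backup_plist and the backup location are illustrative choices, not part of any tool:

```shell
# Path of the file edited in the steps above.
PLIST="/System/Library/Spotlight/RichText.mdimporter/Contents/Info.plist"

# Hypothetical helper: succeeds when the given plist declares public.xhtml.
lists_xhtml() {
    grep -qF '<string>public.xhtml</string>' "$1"
}

# Hypothetical helper: copy the plist somewhere safe before editing it.
backup_plist() {
    cp "$1" "$2"
}

# Typical use on the Mac (left commented out, since it touches a system file):
#   backup_plist "$PLIST" "$HOME/Info.plist.orig"
#   ... edit and save ...
#   plutil -lint "$PLIST"     # checks that the plist is still well-formed
#   lists_xhtml "$PLIST" && echo "public.xhtml is declared"
```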

# REINDEXING

To reindex a file, you can just use mdimport:
→ mdimport bnf.xhtml 
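
The command above handles a single file. To do the same for a whole directory, a small loop around mdimport is probably enough. A sketch, where reindex_xhtml is a hypothetical helper name and file names are assumed not to contain newlines:

```shell
# Hypothetical helper: run mdimport on every .xhtml file below a directory.
# $1 = directory to walk (defaults to the current directory).
# Assumes file names do not contain newlines.
reindex_xhtml() {
    dir="${1:-.}"
    find "$dir" -type f -name '*.xhtml' | while IFS= read -r file; do
        mdimport "$file"
    done
}

# Example: reindex_xhtml "$HOME/Documents"
```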


# LET'S look again at the data.

→ mdls bnf.xhtml 

kMDItemContentCreationDate     = 2011-10-01 11:47:27 +0000
kMDItemContentModificationDate = 2013-01-07 00:56:56 +0000
kMDItemContentType             = "public.xhtml"
kMDItemContentTypeTree         = (
    "public.xhtml",
    "public.xml",
    "public.text",
    "public.data",
    "public.item",
    "public.content"
)
kMDItemDateAdded               = 2011-10-01 11:47:27 +0000
kMDItemDisplayName             = "bnf.xhtml"
kMDItemFSContentChangeDate     = 2013-01-07 00:56:56 +0000
kMDItemFSCreationDate          = 2011-10-01 11:47:27 +0000
kMDItemFSCreatorCode           = ""
kMDItemFSFinderFlags           = 0
kMDItemFSHasCustomIcon         = 0
kMDItemFSInvisible             = 0
kMDItemFSIsExtensionHidden     = 0
kMDItemFSIsStationery          = 0
kMDItemFSLabel                 = 0
kMDItemFSName                  = "bnf.xhtml"
kMDItemFSNodeCount             = 6447
kMDItemFSOwnerGroupID          = 502
kMDItemFSOwnerUserID           = 502
kMDItemFSSize                  = 6447
kMDItemFSTypeCode              = ""
kMDItemKeywords                = (
    livre,
    "bibliothe\U0300que",
    lutte,
    carnet
)
kMDItemKind                    = "HTML"
kMDItemLogicalSize             = 6447
kMDItemPhysicalSize            = 8192
kMDItemTitle                   = "Numérisation des livres de la BNF - Carnets de La Grange"

So we can now see that the metadata include the title and the keywords, and so the file becomes searchable on them.
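
To check only the attributes that matter here, mdls can be restricted with its -name option. A small sketch, where content_metadata is a hypothetical helper name:

```shell
# Hypothetical helper: show only the title and keywords stored by Spotlight.
# $1 = file to inspect.
content_metadata() {
    mdls -name kMDItemTitle -name kMDItemKeywords "$1"
}

# Example: content_metadata bnf.xhtml
```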

# SEARCHING

The file is now findable from the Spotlight box at the top right, and also from the command line. For example:


→ mdfind "kMDItemTitle=='*livres de la BNF*'"

/long/path/to/file/bnf.xhtml

It worked!
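
The same kind of query should work on the keywords, and mdfind can be limited to a folder with -onlyin. A sketch, with find_by_keyword as a hypothetical helper name. The keyword is pasted into the query as-is, so it should not contain quotes:

```shell
# Hypothetical helper: list XHTML files under a folder that carry a keyword.
# $1 = keyword, $2 = folder to search (defaults to the current directory).
find_by_keyword() {
    mdfind -onlyin "${2:-.}" \
        "kMDItemContentType == 'public.xhtml' && kMDItemKeywords == '$1'"
}

# Example: find_by_keyword livre "$HOME/Documents"
```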


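PS: an interesting note about kMDItemKeywords and encoding: in the output above, mdls prints the keyword "bibliothèque" as "bibliothe\U0300que", that is, with the accent as a separate combining character (U+0300) after the "e".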

-- 
Karl Dubost
http://www.la-grange.net/karl/

Received on Thursday, 20 June 2013 20:58:07 UTC