Regarding I4i's patent 5,787,449 and Microsoft's "Custom XML"

[UPDATE: Microsoft were recently granted patent no 7,571,169, which appears to be fully capable of covering every aspect of XML document representation. To be bitten by another company's ridiculous software patent so soon after receiving one of their own is the epitome of irony.]

This lawsuit has been discussed on blogs and their comment trails all over the net, but nobody seems to have actually tried to figure out the technical background of the lawsuit, like what the patent actually covers or what it is that Microsoft have done.

The I4i patent from 1998 presents an alternative data representation, with the aim of strictly separating content and structure. However, the patent fails to satisfactorily define the concepts of "content" and "structure", and in fact it seems that what they really want to achieve is a separation of content from presentation.

The distinction between content and presentation is intuitively very simple, although it's typically hard to maintain the distinction in real life when working with documents.

On the other hand, generalizing a distinction between content and structure is impossible. The content of a document is defined by the actual words, the order between them, the separation of text segments using punctuation, separation of paragraphs and sections, including identifying section headings etc, all structural elements at different levels. If you remove any of these, you change the meaning of the document. If you can no longer distinguish between a headline and a paragraph, then you have certainly lost a lot of information!

The method that I4i's patent 5,787,449 outlines is very simple, and (like most algorithm patents) could easily be thought up by any programmer who wanted to achieve the same thing. Basically, they explain how they scan through a marked-up document (of any kind, really) and remove all the "metacodes", as they call them. Everything that is removed from the main document is instead placed in a "map", which records the original position of each code. The result is a plain text file and a map with the markup, and you can make more of these maps if you need to use the document with alternative structures for different purposes. The patent also describes how to do the opposite: from a selected map and a text stream, generate a new marked-up document.
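To show how little is involved, here is a minimal sketch of that kind of split and merge. It is my own illustration and not the patent's actual procedure: it assumes that metacodes look like angle-bracket tags, and the names split_markup and merge_markup are made up for the example.

```python
import re

# Assumption for this sketch: a "metacode" is anything of the form <...>.
TAG = re.compile(r"<[^>]+>")

def split_markup(marked_up):
    """Return (raw_text, metacode_map).

    Each map entry is (offset into raw_text, metacode), in document order.
    """
    raw_parts = []
    metacode_map = []
    raw_len = 0
    last = 0
    for match in TAG.finditer(marked_up):
        chunk = marked_up[last:match.start()]
        raw_parts.append(chunk)
        raw_len += len(chunk)
        metacode_map.append((raw_len, match.group()))
        last = match.end()
    raw_parts.append(marked_up[last:])
    return "".join(raw_parts), metacode_map

def merge_markup(raw_text, metacode_map):
    """Rebuild a marked-up document from raw text and one metacode map."""
    out = []
    last = 0
    for offset, code in metacode_map:
        out.append(raw_text[last:offset])
        out.append(code)
        last = offset
    out.append(raw_text[last:])
    return "".join(out)

doc = "<h1>Title</h1><p>Body text</p>"
raw, codes = split_markup(doc)
restored = merge_markup(raw, codes)
```

Here raw holds only the words, codes holds the tags with their offsets, and restored is identical to doc. A second map over the same raw text would give a second, alternative marked-up version.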

The patent does not describe any magical method by which a program can in any way interpret structural markup of which it has no prior knowledge.

Microsoft's "Custom XML" seems to refer to two completely different mechanisms.

One of them is the ability to embed custom XML data as resource files in a "data store" inside a document, by adding them to the .docx ZIP file. This is mainly used programmatically to embed existing structured data in the document and connect it to "Content Controls", using XPath expressions to select the data to display.
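The data store idea can be imitated with nothing but a ZIP library and an XML parser. The following is only a rough sketch and does not use any Microsoft API: the part name customXml/item1.xml, the invoice data and the helper names are assumptions chosen for the example, and ElementTree supports only a limited subset of XPath.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumed part name for this sketch; a real package also carries
# relationship and content-type parts, which are left out here.
PART_NAME = "customXml/item1.xml"

def build_package(xml_data):
    """Store one XML data part in an in-memory ZIP package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(PART_NAME, xml_data)
    return buf.getvalue()

def select_text(package_bytes, xpath):
    """Select a value from the stored part, the way a bound control would."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        root = ET.fromstring(zf.read(PART_NAME))
    node = root.find(xpath)
    return None if node is None else node.text

package = build_package(
    "<invoice><customer><name>Ada</name></customer><total>42</total></invoice>"
)
customer = select_text(package, "./customer/name")
total = select_text(package, "./total")
```

Note that the selection is made by path through the XML structure of the data part. Nothing here addresses the data by character offset.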

The other "Custom XML" concept refers to the ability to add custom XML markup to a Word document, by attaching a custom schema to the document and then marking up structured data in the document body according to your schema. In the document file, this information is denoted by <w:customXml> tags with a w:element attribute defining the custom tag in question. This information can then be used together with custom XSL transformations to produce structured output from the data tagged in the document.
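As an illustration of how such inline tags can be read back, here is a small sketch. The sample paragraph is hand-written and simplified (real documents carry more attributes and wrapping elements), and the function name custom_elements is made up for the example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Simplified, hand-written sample: one paragraph where the word "Report"
# is wrapped in a custom "title" element.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:customXml w:element="title">'
    "<w:r><w:t>Report</w:t></w:r>"
    "</w:customXml>"
    "<w:r><w:t> for 2009</w:t></w:r>"
    "</w:p>"
)

def custom_elements(fragment):
    """Return (custom tag name, enclosed text) for each w:customXml element."""
    root = ET.fromstring(fragment)
    found = []
    for node in root.iter(f"{{{W}}}customXml"):
        name = node.get(f"{{{W}}}element")
        text = "".join(t.text or "" for t in node.iter(f"{{{W}}}t"))
        found.append((name, text))
    return found

tagged = custom_elements(SAMPLE)
```

The custom tag sits in the document body, wrapped around the text it describes, so the markup is embedded in the content and not kept in a separate map.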

Now, the patent owned by I4i does not claim to cover the idea of custom text transformation. It very clearly claims an invention whose purpose is to split marked-up text into raw text plus a tag map. In fact, several of the paragraphs in the Statement section are very clear on the fact that the "invention" does not use "embedded codes for interpretation of the content". There is no evidence at all that Microsoft ever does anything like what the patent describes; what they do can easily be achieved by standard XSL transforms.

One piece of evidence that these are two different things is that Word refuses to tag data in the document in a way that would break the XML structure. If you mark just part of a headline plus some of the adjoining body text and add a custom XML tag, the resulting element will only cover the part of the headline that you marked, and nothing of the body text. This shows that they never intended to store the custom tags separately from the rest of the document.
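The restriction follows from how embedded tags work: elements must nest, whereas a separate map of positions could hold any ranges it likes. A small sketch of the nesting check, assuming each element is reduced to a half-open (start, end) character range; the function name and the numbers are made up for the example.

```python
def is_well_nested(ranges):
    """True if the (start, end) ranges can all be expressed as nested elements."""
    open_ends = []  # end offsets of the elements that are currently open
    # Process outer ranges before inner ones that start at the same offset.
    for start, end in sorted(ranges, key=lambda r: (r[0], -r[1])):
        while open_ends and open_ends[-1] <= start:
            open_ends.pop()  # that element closed before this one starts
        if open_ends and end > open_ends[-1]:
            return False  # would cross the boundary of an open element
        open_ends.append(end)
    return True

headline = (0, 10)
body = (10, 30)
crossing_tag = (5, 20)   # part of the headline plus some body text
clipped_tag = (5, 10)    # the same selection, cut off at the headline's end

crossing_ok = is_well_nested([headline, body, crossing_tag])
clipped_ok = is_well_nested([headline, body, clipped_tag])
```

The crossing selection cannot be written as nested elements, while the clipped one can, which matches the clipping behaviour described above.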

The reason why I4i have pursued this lawsuit is of course that they are marketing a Word-based XML authoring tool, something that is more or less unnecessary now that you can do much the same thing with plain Word. Tough.

I'm certainly no fan of Microsoft's, but in this case it's hard to see why a lawsuit like this should not be dismissed out of hand as frivolous.

I wonder why Microsoft insists on fighting the lawsuit on the grounds that the original patent is invalid. It's clearly valid, it's just irrelevant - not only to Microsoft, Word and its users, but to the world in general.

References: The Microsoft Blog (includes the patent, the claim and the injunction) and several sections of the documentation for Office 2007 on MSDN. As an aside, here's a great guide for programmatic generation of Word documents including custom XML data parts. Here's another technical page discussing the XML data store concept, and here is another interesting blog post discussing the lawsuit.

Comments

It's a sad case of how process patents are misinterpreted in the courts. As broad as this patent is, one could say that the content itself could be anything, from image/music files to GIS map data. For example, could it not be said that an iTunes song list, which contains song location, name, order, last played, title, statistics etc., is the same as holding a separate tag/map file that tells iTunes how to organize the content (in this case music) into a playlist? There are millions of examples of tag/map files that are separate from the actual content that they represent but tell how to format or otherwise display and present the content. I would have to agree with Microsoft that this patent is frivolous, meaning there is no new and innovative invention given to the industry. It's a common organizational practice used not only in software and database development, but in how applications store and reference "content".

It is probably trivial to find thousands of solutions based on some kind of map/data distinction that could be loosely compared to the I4i patent. The very specific solution that the patent clearly described, however, seems to have no place in the real world.

The problem they wanted to solve, that of presentational markup cluttering up the content-bearing text, is a bit of a non-issue nowadays, since we use CSS to get rid of the presentation and leave just the indispensable semantic markup in the text, where it belongs. Besides, all search engine indexing done by Google and others depends on semantic markup to identify headlines etc. in order to classify hits. Oh, and in case anyone wonders, CSS doesn't use absolute addresses into the data; it uses classes and IDs to map presentational information onto the data.

Still, I don't know enough about US patent legislation to be able to guess how far the courts generalize from the detailed, specific algorithm that a patent describes.

In any case, I would be very surprised if Microsoft's "Custom XML" contains any solution that could even be compared to the one described in patent 5,787,449. In my opinion, even if the patent should be deemed valid, MS most likely haven't infringed on it.

Microsoft infringed on it. A view backed with some more evidence is here: http://milan.kupcevic.net/custom-xml-microsoft-office-word-data-store-i4...

The biggest obstacle the software industry faces today is the patenting system. Overly broad patent descriptions cover too many obvious and simple solutions. About 70 software patents are issued every day. No programmer is able to keep track of everything that gets patented. On the other hand, patenting procedures last too long compared to the average software product life cycle. Moreover, patents are supposed to protect small inventors from big and powerful players, and yet patenting is too expensive for small companies and small inventors.

Software patents should be illegal, as they are illegal in Europe.

Milan Kupcevic doesn't give any references for his statement. Sure, you can include a raw file in the package (it's just a ZIP file), but when I looked at the so-called Custom XML in Office 2007, it was never meant to be anything other than XML data, and the methods I found for extracting data from the file and inserting it in the document were entirely based on the data being in XML format.

I saw no way to create a "metacode map" for the raw data in the Custom XML files and have Word extract stuff from the files based on the map. I could have missed it, of course, but I did spend a lot of time looking.

Please note that the fact that you can read the file programmatically, and thus can create your own "metacode map" system with your own code, only means that *you* are infringing on i4i's patent, not Microsoft. Otherwise it would mean the end of data storage on computers. :)

I agree with you that software patents should be illegal, mainly because it's impossible to avoid issues like this, no matter how competent or careful the patent examiners are.

As you can see, Milan's article contains links pointing to the references, relevant Microsoft literature articles and to Brian Jones's blog discussing the issues. Please note, Brian Jones is the actual "custom XML in the Office XML format" chief architect at Microsoft.