Regarding I4i's patent 5,787,449 and Microsoft's "Custom XML"
[UPDATE: Microsoft were recently granted patent no. 7,571,169, which appears to be fully capable of covering every aspect of XML document representation. To be bitten by another company's ridiculous software patent so soon after receiving one of their own is the epitome of irony.]
This lawsuit has been discussed on blogs and their comment trails all over the net, but nobody seems to have actually tried to figure out the technical background of the lawsuit, like what the patent actually covers or what it is that Microsoft have done.
The I4i patent from 1998 presents an alternative data representation, with the aim of strictly separating content and structure. However, the patent fails to satisfactorily define the concepts of "content" and "structure", and in fact it seems that what they really want to achieve is a separation of content from presentation.
The distinction between content and presentation is intuitively very simple, although it's typically hard to maintain the distinction in real life when working with documents.
On the other hand, generalizing a distinction between content and structure is impossible. The content of a document is defined by the actual words, the order between them, the separation of text segments using punctuation, separation of paragraphs and sections, including identifying section headings etc, all structural elements at different levels. If you remove any of these, you change the meaning of the document. If you can no longer distinguish between a headline and a paragraph, then you have certainly lost a lot of information!
The method that I4i's patent 5,787,449 outlines is very simple, and (like most algorithm patents) could easily be thought up by any programmer who wanted to achieve the same thing. Basically, they explain how they scan through a marked-up document (of any kind, really) and remove all the "metacodes", as they call them. Everything that is removed from the main document is instead placed in a "map", which records the original position of each code. The result is a plain text file and a map with the markup - and you can make more of these maps if you need to use the document with alternative structures for different purposes. The patent also describes how to do the opposite: from a selected map and a text stream, generate a new marked-up document.
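To show how little there is to it, here is a minimal sketch of that split and merge as I read it. It is not taken from the patent: the function names, the use of angle-bracket tags as "metacodes", and the map layout (a list of offset/code pairs) are all my own assumptions for illustration.

```python
import re

# Assumption: a "metacode" is anything that looks like an angle-bracket tag.
TAG = re.compile(r"<[^<>]+>")


def split_markup(marked_up):
    """Split marked-up text into (plain_text, code_map).

    code_map is a list of (offset_in_plain_text, code) pairs, in document order.
    """
    parts = []
    code_map = []
    offset = 0  # length of the plain text collected so far
    last = 0    # end of the previous metacode in the marked-up input
    for match in TAG.finditer(marked_up):
        chunk = marked_up[last:match.start()]
        parts.append(chunk)
        offset += len(chunk)
        code_map.append((offset, match.group()))
        last = match.end()
    parts.append(marked_up[last:])
    return "".join(parts), code_map


def merge_markup(plain_text, code_map):
    """Rebuild a marked-up document from plain text and a map of codes."""
    out = []
    last = 0
    for offset, code in code_map:
        out.append(plain_text[last:offset])
        out.append(code)
        last = offset
    out.append(plain_text[last:])
    return "".join(out)


doc = "<h1>Title</h1><p>Some <b>bold</b> text.</p>"
text, code_map = split_markup(doc)
print(text)
print(code_map)
print(merge_markup(text, code_map) == doc)
```

Applying a different map to the same text, for example [(0, "<title>"), (5, "</title>")], gives the "alternative structure" idea: the text stream stays untouched and only the map changes.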
The patent does not describe any magical method by which a program can in any way interpret structural markup of which it has no prior knowledge.
Microsoft's "Custom XML" seems to refer to two completely different mechanisms.
One of them is the ability to embed custom XML data as resource files in a "data store" inside a document, by adding them to the .docx ZIP file. This is mainly used programmatically to embed existing structured data in the document and connect it to "Content Controls", using XPath expressions to select the data to display.
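The data store idea can be sketched roughly as follows. This is only an illustration: the ZIP built here is not a valid .docx (a real package needs content types and relationship parts), the invoice data and the helper names are made up, and the part name customXml/item1.xml is simply the conventional one. Python's ElementTree supports only a limited XPath subset, so the path is written relative to the root element rather than as the absolute XPath a content control would carry.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumption: conventional part name for a custom XML data part.
CUSTOM_PART = "customXml/item1.xml"

# Hypothetical business data to embed in the package.
DATA = "<invoice><customer>Example Ltd</customer><total>42.00</total></invoice>"


def build_package():
    """Build an in-memory ZIP holding a stub document and one custom XML part."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as package:
        package.writestr("word/document.xml", "<document/>")  # stub, not real WordprocessingML
        package.writestr(CUSTOM_PART, DATA)
    return buf.getvalue()


def lookup(package_bytes, path):
    """Select a value from the custom XML part, roughly as a data binding would."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read(CUSTOM_PART))
    node = root.find(path)  # path is relative to the root element
    return None if node is None else node.text


package_bytes = build_package()
print(lookup(package_bytes, "./customer"))
print(lookup(package_bytes, "./total"))
```

Note that the embedded data lives in its own file and is selected by path, not by character position in the document text.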
The other "Custom XML" concept refers to the ability to add custom XML markup to a Word document, by attaching a custom schema to the document and then marking up structured data in the document body according to your schema. In the document file, this information is denoted by <w:customXml> tags with a w:element attribute naming the custom tag in question. This information can then be used together with custom XSL transformations to produce structured output from the data tagged in the document.
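As a rough illustration of what that markup looks like and how trivially it can be read back, here is a sketch that pulls the tagged data out of a hand-written fragment. The fragment, the urn:example:invoice schema URI, and the function name are made up for the example; a real document.xml contains a great deal more.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, heavily simplified document body with two custom-tagged paragraphs.
FRAGMENT = f"""
<w:body xmlns:w="{W}">
  <w:customXml w:uri="urn:example:invoice" w:element="customer">
    <w:p><w:r><w:t>Example Ltd</w:t></w:r></w:p>
  </w:customXml>
  <w:p><w:r><w:t>Untagged paragraph.</w:t></w:r></w:p>
  <w:customXml w:uri="urn:example:invoice" w:element="total">
    <w:p><w:r><w:t>42.00</w:t></w:r></w:p>
  </w:customXml>
</w:body>
""".strip()


def tagged_data(xml_text):
    """Return (custom_element_name, text) pairs for every w:customXml element."""
    root = ET.fromstring(xml_text)
    result = []
    for node in root.iter(f"{{{W}}}customXml"):
        name = node.get(f"{{{W}}}element")
        # Concatenate the text runs found inside the tagged region.
        text = "".join(t.text or "" for t in node.iter(f"{{{W}}}t"))
        result.append((name, text))
    return result


print(tagged_data(FRAGMENT))
```

The custom tags sit inline, wrapped around the content they describe, which is exactly the kind of embedded code the patent says its method avoids.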
Now, the patent owned by I4i does not claim to cover the idea of custom text transformation. It very clearly claims an invention that can be summed up as splitting marked-up text into raw text plus a tag map. In fact, several of the paragraphs in the Statement section are very clear on the fact that the "invention" does not use "embedded codes for interpretation of the content". There is no evidence at all that Microsoft ever does anything like what the patent describes - what they do can easily be achieved with standard XSL transforms.
One piece of evidence that these are two different things is that Word refuses to tag data in the document in a way that would break the XML structure. If you select just part of a headline plus some of the adjoining body text and add a custom XML tag, the resulting element will only cover the part of the headline that you selected, and none of the body text. This shows that they never intended to store the custom tags separately from the rest of the document.
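The difference can be made concrete with a small sketch. A separate map of character ranges has no trouble describing a tag that starts in the headline and ends in the body text, but such a range cannot be written as well-formed inline XML. The checker below is my own illustration, with hypothetical range data; it has nothing to do with how Word actually implements this.

```python
def properly_nested(ranges):
    """Return True if the (start, end, name) ranges can be written as nested XML.

    Offsets are character positions in the plain text; end is exclusive.
    """
    open_ends = []  # end offsets of the currently open ranges, innermost last
    # Visit outer ranges before inner ones that start at the same offset.
    for start, end, _name in sorted(ranges, key=lambda r: (r[0], -r[1])):
        # Close every range that has ended before this one starts.
        while open_ends and open_ends[-1] <= start:
            open_ends.pop()
        # A range that outlives its enclosing range crosses a boundary.
        if open_ends and end > open_ends[-1]:
            return False
        open_ends.append(end)
    return True


# Plain text "HeadlineBody text": headline is [0, 8), body is [8, 17).
structure = [(0, 8, "headline"), (8, 17, "body")]

crossing = structure + [(4, 12, "custom")]  # half the headline plus some body text
clipped = structure + [(4, 8, "custom")]    # what Word keeps: the headline part only

print(properly_nested(crossing))
print(properly_nested(clipped))
```

A map-based representation of the kind the patent describes could store the crossing range as is; inline markup has to clip it, which matches the behaviour Word shows.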
The reason why I4i have pursued this lawsuit is of course that they are marketing a Word-based XML authoring tool, something that is more or less unnecessary now that you can do much the same thing with plain Word. Tough.
I'm certainly no fan of Microsoft's, but in this case it's hard to see why a lawsuit like this should not be dismissed out of hand as frivolous.
I wonder why Microsoft insists on fighting the lawsuit on the grounds that the original patent is invalid. It's clearly valid, it's just irrelevant - not only to Microsoft, Word and its users, but to the world in general.
References: The Microsoft Blog (includes the patent, the claim and the injunction) and several sections of the documentation for Office 2007 on MSDN. As an aside, here's a great guide to programmatic generation of Word documents including custom XML data parts. Here's another technical page discussing the XML data store concept, and here is another interesting blog post discussing the lawsuit.