Reviewing XML Options

Richard Pipe, our Digital Publishing Specialist, has been actively engaged in discussions on the value of XHTML vs. XML DTDs. His position is that the big three DTDs (DocBook, TEI and NLM) are money sinks. Other digital publishers see the point, but his belief goes further: publishers in general should not use DB/TEI/NLM.

These XML options have a place and application, but they are not a suitable option for general publisher content. The 'Big Three', which are analysed below, are quoted as being suitable for "any" XML strategy, which provides some context to the analysis.

National Library of Medicine - NLM

"The National Center for Biotechnology Information (NCBI) of the national Library of medicine (NLM) created the Journal Archiving and Interchange Tag Suite with the intent of providing a common format in which publishers and archives can exchange journal content. The Suite provides a set of XML schema modules that define elements and attributes for describing the textual and graphical content of journal articles as well as some non-article materials such as letters, editorials, and book and product reviews". - National Library of Medicine

NLM is an XML strategy for journal articles. Journal articles cover most subjects, so the tag suite is comprehensive. There are four variants, for backlist, authoring, publishing and archiving. NLM was first released in 2003, making it a relatively new XML strategy on the digital publishing scene. The current version is 3.0, with the latest update in 2008. There is a major effort in hand to rework the schemas to allow customisation, which is the inevitable end of any committee-based XML. Academic publishers should evaluate it closely; others should ignore any advice to use it.

Text Encoding Initiative - TEI

"The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. Since 1994, TEI Guidelines have been widely used by libraries, museums, publishers and individual scholars to present texts for online research, teaching, and preservation". - Text Encoding Initiative

Humanities, social sciences and linguistics are a rather broad brush. TEI started as SGML and moved to XML as TEI Lite. There are around 70 members, mostly universities and government organisations. For national government or university organisations wanting to take an academic approach to XML, this is the appropriate strategy to use; otherwise, ignore any advice to use it.

DocBook - DB

"DocBook is a schema (available in several languages including RELAX NG, SGML and XML DTDs, and W3C XML Schema) maintained by the DocBook Technical Committee of OASIS. It is particularly well suited to books and papers about computer hardware and software (though it is by no means limited to these applications).

Because it is a large and robust schema, and because its main structures correspond to the general notion of what constitutes a "book", DocBook has been adopted by a large and growing community of authors writing books of all kinds. DocBook is supported "out of the box" by a number of commercial tools, and there is rapidly expanding support for it in a number of free software environments. These features have combined to make DocBook a generally easy to understand, widely useful and very popular schema. Dozens of organisations are using DocBook for millions of pages of documentation, in various print and online formats, worldwide". - DocBook

This statement reads more like a sales pitch. DocBook's origins are in computer books, and a very large subset of its elements is specific to computer code. If highly structured control of linear technical books without significant content variation is desired, DocBook is an appropriate choice; otherwise, ignore any advice to use it.

Summary of the "Big Three"

The common theme throughout these sites is one of complication and complexity. Because the vocabularies are large and hard to learn and use, each site offers multiple "resources" to help users get started quickly. Additionally, each strategy provides a basic version. To a publisher, this information means:

  1. All of the "Big Three" are specialist DTDs/Schemas, or specialist DTDs trying to become general with updates and customisation modules
  2. Difficult to learn, therefore difficult to check and test
  3. Can all be extended at a cost
  4. By implication of extensibility, each has tagging limitations
  5. Claim to be able to be processed to XHTML (Internet) and PDF (print)
  6. All are very old and have their roots in SGML/SML, in the days when strategies were being formulated and, with the exception of NLM, long before the Internet was born
  7. High costs to get started, use and maintain

The following statement from TEI applies to all of the aforementioned XML strategies to some extent, though less so to NLM: "...there is no one correct way to encode any given text...". This is a particular issue for the future use of any XML, and an identifiable failure of DocBook and TEI in particular. That statement alone destroys the concept of interchange and predictable future value.

That statement is reinforced by the following information on "future proofing", extracted from the DocBook site:

"Whether you're just getting started with DocBook, or curating a collection of tens of thousands of DocBook documents, one question that you have to consider is "how stable is DocBook?" Will the documents that you write today still be useful tomorrow, or next year, or in the next century?

This question may seem particularly pertinent if you're in the process of converting a collection of DocBook 4.x documents to DocBook V5.0 because we introduced a number of backward-incompatible changes in V5.0.

The DocBook Technical Committee understands that the community benefits from the long-term stability of the DocBook family of schemas. We also understand that DocBook must continue to adapt and change in order to remain relevant in a changing world.

All changes, and especially changes that are backward incompatible (changes that make a currently valid document no longer valid under a new version of the schema), have a cost associated with them. The technical committee must balance those costs against the need to remain responsive to the community's desire to see DocBook grow to cover the new use cases that inevitably arise in documentation.

With that in mind, the DocBook Technical Committee has adopted the following policy on backward-incompatible changes. This policy spells out when backward-incompatible changes can occur and how much notice the technical committee must provide before adopting a schema that is backward incompatible with the current release.

This policy allows DocBook to continue to change and adapt while simultaneously guaranteeing that existing users will have sufficient advance notice to develop reasonable migration plans".

If planning to use DocBook or any of the others, carefully consider the cost of migration. Another thing to consider is that these are not complete strategies. With real publisher content it is easy to run an XSL transformation or other process across the content in an ETL (Extract, Transform, Load) operation. It is a very different matter to know that thousands of documents, books, articles, or whatever content is being considered, have been processed correctly.
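As an illustration of the gap between running a transformation and knowing that it worked, the following is a minimal sketch of a batch check. It assumes each source document and its transformed output are available as strings, and it only tests two things: that the output is well formed, and that no character content was lost or altered. The function names and the sample batch are hypothetical and are not taken from any of the tools discussed here.

```python
import xml.etree.ElementTree as ET


def text_of(xml_string):
    """Return the character content of a document with all whitespace removed.

    ET.fromstring raises ET.ParseError if the document is not well formed.
    """
    root = ET.fromstring(xml_string)
    return "".join("".join(root.itertext()).split())


def check_batch(pairs):
    """Check (name, source_xml, output_xml) triples; return (name, problem) pairs."""
    problems = []
    for name, source_xml, output_xml in pairs:
        try:
            if text_of(source_xml) != text_of(output_xml):
                problems.append((name, "text content changed"))
        except ET.ParseError as err:
            problems.append((name, f"not well formed: {err}"))
    return problems


# Hypothetical sample batch: one good conversion, one that lost text,
# and one whose output is not well formed.
batch = [
    ("ch01",
     "<chapter><title>One</title><para>Hello world</para></chapter>",
     "<div><h1>One</h1><p>Hello world</p></div>"),
    ("ch02",
     "<chapter><title>Two</title><para>Lost text</para></chapter>",
     "<div><h1>Two</h1><p></p></div>"),
    ("ch03",
     "<chapter><title>Three</title></chapter>",
     "<div><h1>Three</h1>"),
]

problems = check_batch(batch)
for name, problem in problems:
    print(name, problem)
```

A check like this catches only gross failures. It says nothing about whether the tagging applied to the content is appropriate, which is the harder problem.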

If planning to use TEI, content is probably going into a database or website, so it probably does not require any real future-proof value. It should be borne in mind, though, that TEI is primarily designed for machine reading, not for reuse, extraction and future proof-reading.

High quality XML for publisher content is ultimately about strictly controlled tagging patterns that are used consistently and that encapsulate fluidity for presentation. It is this that generates current and future value. None of the above XML strategies directly introduces standards for the correctness of tagging patterns on content. They are technical, and somewhat abstract, definitions of the element and attribute vocabularies they support. Extending the tagging model is expensive, and the result usually fits wide genres of content poorly and lacks the explicit detail that individual publisher content needs for real reuse.

XHTML Power

Missing from the simplicity of XML and the complex XML strategies created is the core underlying nature of content. All of the aforementioned DTDs lack clear definitions in their "list of elements", with much of the grammar opaque. It is difficult to tell how, or more importantly where, an element is to be used from its grammar. Due to this, these XML strategies start with a massive learning cost. It is possible to get the outsource crowd to do the tagging, but it is difficult to know whether it is applied appropriately and correctly if your deliverable is an XML file and an ePub. It is also unrealistic to check every XML file delivered.

Each element in HTML falls into zero or more categories that group elements with similar characteristics. The categories are:

  1. Metadata content
  2. Flow content
  3. Sectioning content
  4. Heading content
  5. Phrasing content
  6. Embedded content
  7. Interactive content

This categorisation is missing from most XML strategies, except by reference to the DTD, where it is not an explicit content strategy. The categories define the primary purpose of each element and its relation to all others. This XML strategy is extremely well known and understood across the world.
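To make "zero or more categories" concrete, here is a small sketch that records the categories of a handful of HTML elements as plain data. It is an excerpt only: it is limited to the seven categories listed above, it leaves out other categories the HTML specification defines, and it ignores conditional cases where an attribute changes an element's categories. The helper name elements_in is hypothetical.

```python
# Categories for a few HTML elements, restricted to the seven categories
# listed in this article. Excerpt for illustration, not a full table.
CATEGORIES = {
    "title":   {"metadata"},
    "p":       {"flow"},
    "section": {"flow", "sectioning"},
    "h1":      {"flow", "heading"},
    "em":      {"flow", "phrasing"},
    "img":     {"flow", "phrasing", "embedded"},
    "button":  {"flow", "phrasing", "interactive"},
    "li":      set(),  # no category: only valid inside a list element
}


def elements_in(category):
    """Return the element names from the excerpt that belong to a category."""
    return sorted(name for name, cats in CATEGORIES.items() if category in cats)


print(elements_in("phrasing"))
print(elements_in("sectioning"))
```

An element such as li belongs to no category at all, which is why the wording is "zero or more": its use is defined entirely by the elements that may contain it.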

Conclusion

This analysis is not a criticism of these DTDs for what they are; they are all technical tours de force. The notion behind this analysis was to increase awareness of their relevance and applicability for the production of the full gamut of publisher digital content, and the future possible uses of that content. DB/TEI/NLM will only deliver in narrow publisher genres where the content is for specific objectives. The main purpose of this discussion is to bring XHTML forward as a real, valuable digital content strategy that fits a wider range of content than specialist XML strategies are able to.

XML consultants always fall back to "The Big Three". Understandably, they do not want to reinvent a custom XML vocabulary, which is a difficult and expensive undertaking. However, these three, for all their qualities and features, fall far short of the requirements of digital content strategies. An approximate or close fit just does not meet digital content requirements.

XHTML stands patiently waiting to be used, but it is ignored because of the "Internet Web Page taint". It is dismissed instantly as unusable with statements such as the following:

  1. It is not a good archive format
  2. There is no future value
  3. It is about structure, not purpose
  4. While I understand it can be used I would still recommend DB/TEI/NLM.

Each of these statements is wrong and exhibits a serious lack of knowledge of what is, and will be, expected of general publisher content going forward. XHTML is an incredibly powerful, internationally supported, highly controlled XML content schema. The core structure must be valid and the tagging patterns must be well formed, but they can be highly amorphous, which is just what is needed for future value reuse.
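Because XHTML is XML, a publisher's own tagging patterns can be checked with ordinary XML tooling. The following is a minimal sketch that assumes a hypothetical house rule, "every section must begin with a heading", and reports the sections that break it. The rule, the function name and the sample document are illustrative assumptions, not part of any XHTML standard.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
HEADINGS = {XHTML + f"h{level}" for level in range(1, 7)}


def sections_missing_heading(xhtml_string):
    """Return the ids of section elements whose first child is not h1-h6."""
    root = ET.fromstring(xhtml_string)  # raises ParseError if not well formed
    missing = []
    for section in root.iter(XHTML + "section"):
        children = list(section)
        if not children or children[0].tag not in HEADINGS:
            missing.append(section.get("id", "(no id)"))
    return missing


# Hypothetical sample document with one conforming and one failing section.
doc = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample</title></head>
<body>
<section id="intro"><h1>Introduction</h1><p>Text.</p></section>
<section id="notes"><p>No heading here.</p></section>
</body>
</html>"""

print(sections_missing_heading(doc))
```

The same approach extends to any pattern a publisher chooses to enforce, which is where the control over tagging comes from: the vocabulary stays small and familiar, and the rules are the publisher's own.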


About Us

Axis12 specialise in building, hosting and supporting high traffic, content heavy web applications for both the public and private sector that help them achieve their Digital First aspirations. We recently implemented Cross Platform Publisher for the National Health Service (NHS) in the United Kingdom, which has transformed the way online reporting and publishing is carried out.