Avoid XML First

Getting the structure of digital content correct is core to any real digital content strategy. To start this discussion in the right tone: publishers should avoid XML and implement an XHTML strategy with a controlled CSS selector vocabulary.

Why Such a Strong Anti-XML Pitch?

XML doesn't work: it is not maintainable, it is not sustainable, it is not extensible, it is not flexible or agile, it is expensive, it never delivers what it promises and it is not ready for the type of future content it is now facing. These are generalisations, but just having vast amounts of XML-tagged content is not a real 2013 digital content strategy.

In the same way that HTML5 has eliminated Flash and redefined what the Internet is becoming, it redefines publisher content management strategies. XML consultants generally overlook this, as the following quote from the O'Reilly article makes apparent:

"Over the past year and a half, O'Reilly has sponsored the DocBook project's development of open source XSL stylesheets for transforming DocBook XML content to EPUB 3, which we've used to update our own toolchain to produce EPUB 3 output. With the release of iBooks 3.0 in late 2012, a critical mass of O'Reilly's readers had devices that supported EPUB 3 content. We felt it was time to upgrade our content to EPUB 3 to provide people using 3.0-compliant platforms the best quality reading experience."

Whilst it took them a year, Cross Platform Publisher was creating ePub3 from five-year-old stored content in December 2011, just 15 days after the spec was released. No XSL stylesheet was needed to transform DocBook to XHTML. All the special handling described in the article was simply implicit in the content. It only has to be difficult if XML is used first.

Cross Platform Publishing XHTML

The pro-XML argument boils down to semantics. We define seven properties of equal importance:

  1. Structure. In FX structure is king. This is the core value which defines the content stack and conditional grouping. It is best understood as the core accessibility value of the content when no styling or processing is applied. The core structure elements are XHTML5 elements. There is no reinvention or extension of the HTML, and the available HTML elements must be used with thought. Not all web-oriented HTML elements and attributes apply to high future value digital content. A title page, chapter, title, paragraph and list are structural components.
  2. Semantics. Semantic names should only be used if the value is explicit and associated with a structural element. Semantics are always applied as qualifiers to structure. Using HTML as the base ensures it is not possible to create a tagging pattern where structure has to be implied from semantics. This approach prevents digital content "death by semantics".
  3. Styling. Styling is the layout, decoration or prettifying with CSS for increased understanding, user engagement, custom presentation or branding. It is important to understand that CSS is a very powerful tool that works with the XHTML on multiple dimensions. Styling is just one of those dimensions.
  4. Presentation. In FX terms presentation means the "format" context from which a content presentation instance will be delivered, e.g. PDF, e-books, other formats, static sites, fixed and flow layouts, interactivity, CDP/ACO and remixed content. To the extent possible, FX tagged content must always be available for any presentation context. Where content or tagging is created for a single presentation context or a set of them, the tagging must be explicit and obvious for those presentation contexts to both humans and processors through controlled grammars.
  5. Behaviour. Modern digital content is often required to exhibit behavioural characteristics in various contexts. FX must allow and enable required behaviour without reducing, siloing or hiding core content value. This includes the ability to be used with CSS modifiers such as transforms, transitions and animation; JavaScript-assisted interactivity; and flow/fixed/variable layout for a range of digital content reading devices, platforms and reuse environments.
  6. Processing. Digital content of worth requires processing for many purposes. Defined machine processing instructions must be consistent to keep current and future processing costs to a minimum. FX should be created to ensure tagging patterns provide the clarity needed for explicit processing to achieve any required result, i.e. processing is never a "leave it until later" option. FX must implicitly state what is to be processed and how, without ambiguity. Ideally every FX tag is a processing target.
  7. Metadata. Metadata in the FX context is data about content. This can include descriptive, fixity, provenance, rights, third-party vocabularies and processing instructions. Comprehensive and correct metadata is required for all formats, but is essential to allow content to be processed and used correctly in multiple advanced delivery contexts such as SCORM, web pages, extraction and remixing environments, and even simple ePub generation. FX states metadata is more important than semantic tagging for the correct use and reuse of content, although there are structures (such as references) that can be tagged for metadata extraction directly from semantic selectors. However, this is inevitably more costly than providing straightforward metadata constructions.
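
As a minimal sketch of what "semantics as qualifiers to structure" and a controlled CSS selector vocabulary could look like in practice, the Python below parses a small XHTML fragment and reports any class value that is not in an agreed list for its structural element. The vocabulary, the class names and the `check_vocabulary` helper are all hypothetical and invented for illustration; they are not the actual FX grammar.

```python
import xml.etree.ElementTree as ET

# Hypothetical controlled vocabulary: each structural XHTML element maps to
# the class qualifiers that are allowed on it. Not the real FX vocabulary.
ALLOWED = {
    "section": {"title-page", "chapter"},
    "h1": {"chapter-title"},
    "p": {"first", "note"},
    "ol": set(),
}

# Illustrative content: structure comes from the element, semantics from the class.
SAMPLE = """<section xmlns="http://www.w3.org/1999/xhtml" class="chapter">
  <h1 class="chapter-title">Avoid XML First</h1>
  <p class="first">Structure comes from the element.</p>
  <p class="sidebar-ish">This class is not in the vocabulary.</p>
</section>"""


def check_vocabulary(xhtml, allowed=ALLOWED):
    """Return (element, class) pairs whose class is not an allowed qualifier."""
    root = ET.fromstring(xhtml)  # XHTML is well-formed XML, so a plain XML parser works
    problems = []
    for el in root.iter():
        tag = el.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
        for cls in el.get("class", "").split():
            if cls not in allowed.get(tag, set()):
                problems.append((tag, cls))
    return problems


print(check_vocabulary(SAMPLE))  # [('p', 'sidebar-ish')]
```

Because every class hangs off a known structural element, the same selectors can serve as styling hooks and as processing targets, and a check like this can run before any format is generated.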

With the exception of NLM (and a lot of that is tagged ineffectively), there is not an XML system out there that delivers the goods for any publisher of any content whatsoever. DocBook is seen as having a large vocabulary (which is relatively weak and missing details) but it does not get content structure even close to correct.

However, the problem of digital content ownership and production is bigger than inadequate XML strategies.

XML Inadequacies

There are less effective digital content strategies available, such as trying to produce multiple formats in desktop environments such as InDesign, Sigil and Calibre, or tools of that ilk. These can eventually produce an ePub format but they require a massive effort.

If there is a sudden need for an ePub3, the same amount of effort is required again, and if you need complex notes, indexes, references and image positioning, it just cannot be done sensibly with this method.

Adobe keeps getting it wrong in this regard. You only have to read the IDML document to see why they can never get it right without a total refactoring of the core software (which is XML). From a digital content production perspective, tools such as InDesign (great for PDF) and Apple's iBooks Publish are not appropriate for digital content: they are uneconomical and inefficient. It is incredibly expensive to create that content in the first instance, and it cannot be reused by publishers to make more money.

Back to the plot

The XML advocates, those who think DocBook, TEI, NLM, or some other custom XML is a content strategy, have got digital publishing all wrong.

The O'Reilly article highlights the issues of an XML repository in DocBook. If your "XML consultant" is recommending any of these approaches, chances are they are going to cost you, the publisher, a lot of money without much in the way of deliverables.

Yesterday's XML solutions do not address the business dynamics required by publishers today. What do you as a publisher need from your digital content? Consider this list:

  1. Make money as fast as possible.
  2. Make money from as many channels in the digital content e-retail diaspora as possible.
  3. Adapt instantly to changes in the content delivery landscape.
  4. Turn the same content you made your ePub2 formats with a year ago into ePub3, with advanced changes and modifications to exploit new features.
  5. Address reflowable and fixed layout instantly, even on the same digital content if necessary.
  6. Create hardcover, large-print, paperback and mass-market print PDF editions from the same digital content source.
  7. Instantly handle font sub-setting, obfuscation, SMIL processing and rich-media optimisation without having to do any work.
  8. Make new and original products that match today's emerging readers.
  9. Create your own forward-looking content engagement experiences for the tablets, desktops and devices of tomorrow.
  10. Constantly and consistently reduce costs, increase profits and improve reader engagement experiences.


If you are writing and publishing a novel, a text editor is fine. If you are a hobbyist production person who enjoys the incompatibilities between Kindle, iBooks, Nook, Kobo and the rest, that's fine too. However, if you have to deliver education, academic, trade, non-fiction, self-help, travel, cooking, magazines and just about every other type of content out there on a schedule and budget, the print and webpage origin tool-set doesn't cut it.

All the format and channel delivery problems are addressed very easily if you start with your content in the right format. In 2013 that means XHTML5.

About Us

Axis12 specialise in building, hosting and supporting high traffic, content heavy web applications for both the public and private sector that help them achieve their Digital First aspirations. We recently implemented Cross Platform Publisher for the National Health Service (NHS) in the United Kingdom, which has transformed the way online reporting and publishing is carried out.