Getting Started With The Word2DITA Transform

How to set up an Ant task to convert a Word document to DITA using the Open Toolkit

The Word2DITA transform is packaged as an Open Toolkit plugin (see Generating DITA from Documents (Word-to-DITA Transformation Framework)). This makes it easy to apply the transform to Word documents using an existing Open Toolkit installation. You can also apply the transform directly to the document.xml file within a DOCX package, e.g., using OxygenXML. The Word2DITA transform is not dependent on any Open Toolkit preprocessing or Toolkit-provided components.

To convert one of the sample files, follow these steps:
  1. Deploy the DITA for Publishers Toolkit plugins to your Open Toolkit (see Installing the Toolkit Plugins).
  2. Create a file named build.properties in your home directory (e.g., /Users/ekimber) and in that file put a line like this:
    dita-ot-dir=c:/DITA-OT

    Replace c:/DITA-OT with the location of the Toolkit on your machine.

  3. From the DITA for Publishers sample data (provided as a separate Zip file in the main installation), copy the word2dita/all_defaults directory to a convenient place (e.g., to /Users/ekimber/w2d_transforms/all_defaults).
  4. If necessary, run the startcmd file for your operating system. This opens a DITA-OT shell, which makes sure that the libraries required for the transformation are available.

    You do not need to run the startcmd file if you already have a DITA-OT shell open (that is, a command-prompt or terminal window that was started by the startcmd file).

  5. In the DITA-OT shell, change to the directory with the Word document.
    cd ~/w2d_transforms/all_defaults/word

    Replace ~/w2d_transforms/all_defaults with the location of the sample files on your machine.

  6. In the DITA-OT shell, run the dita-ot-run-word2dita.xml build file with Ant:
    ant -f dita-ot-run-word2dita.xml

    The build should complete and write its output to all_defaults/dita: a map and a set of topics, one topic for each Heading 1, Heading 2, Heading 3, and Heading 4 paragraph in the Word document.
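
Steps 2 and 6 above can be sketched as two small shell helpers. The function names are hypothetical, and the output check assumes that the generated map and topics use the .ditamap and .dita extensions, which may differ depending on the mapping in use:

```shell
# Sketch only: helper names and example paths are hypothetical.

# Step 2: write build.properties into the given directory (normally $HOME).
# Refuses to touch an existing file so that other settings are not lost.
write_build_properties() {
  props_dir=$1   # directory that should hold build.properties
  ot_dir=$2      # location of your Open Toolkit installation
  if [ -e "$props_dir/build.properties" ]; then
    echo "build.properties already exists in $props_dir; edit it by hand" >&2
    return 1
  fi
  printf 'dita-ot-dir=%s\n' "$ot_dir" > "$props_dir/build.properties"
}

# After step 6: count the maps and topics found under the output directory.
# ASSUMPTION: maps end in .ditamap and topics end in .dita.
summarize_output() {
  out_dir=$1
  maps=$(find "$out_dir" -type f -name '*.ditamap' | wc -l | tr -d '[:space:]')
  topics=$(find "$out_dir" -type f -name '*.dita' | wc -l | tr -d '[:space:]')
  echo "maps=$maps topics=$topics"
}

# Example session (paths are placeholders):
#   write_build_properties "$HOME" /path/to/DITA-OT
#   cd ~/w2d_transforms/all_defaults/word
#   ant -f dita-ot-run-word2dita.xml
#   summarize_output ../dita
```

A topic count of zero after a successful build usually means the check is looking in the wrong directory or for the wrong extension, so treat the summary as a quick sanity check rather than a validation step.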

To convert one of your own Word documents, follow these steps:
  1. Copy the word2dita/all_defaults directory again and store the copy under a new name (e.g., in /Users/ekimber/w2d_transforms/my_transform). This will be the starting point for your Word-to-DITA transformation configuration.
  2. Replace the Word document in that directory with the Word document you want to convert.
  3. Edit your Word document and make sure that the first paragraph in the document is styled with the "Title" paragraph style (you may need to add a new paragraph). This paragraph defines the map title and signals the generation of the root output map. The default style-to-tag mapping requires this setup.
  4. Edit the file dita-ot-run-word2dita.xml in that directory and change this line:
    <property name="args.input" 
      location="${myAntFile.dir}/word2dita_single_doc_to_map_and_topics_01.docx"
    />
    so that it reflects the filename of your Word document. For example:
    <property name="args.input" 
      location="${myAntFile.dir}/my_document.docx"
    />
  5. In a DITA-OT shell, change to the directory with your Word document.
    cd ~/w2d_transforms/my_transform/word

    Replace ~/w2d_transforms/my_transform/word with the location of your Word document on your machine.

  6. In a DITA-OT shell, run the dita-ot-run-word2dita.xml build file with Ant:
    ant -f dita-ot-run-word2dita.xml

    The build should complete and produce output (e.g., in my_transform/dita).
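
Steps 1 and 2 of this procedure can be sketched as a shell helper. The function name is hypothetical, and the sketch assumes the layout implied by the steps above, in which the Word document and the Ant build file sit together in the word subdirectory:

```shell
# Hypothetical helper: clone the sample configuration and swap in your document.
# ASSUMPTION: the Word file and dita-ot-run-word2dita.xml live in <dir>/word.
new_transform() {
  sample_dir=$1   # e.g. ~/w2d_transforms/all_defaults
  new_dir=$2      # e.g. ~/w2d_transforms/my_transform
  my_docx=$3      # the Word document you want to convert
  if [ -e "$new_dir" ]; then
    echo "refusing to overwrite $new_dir" >&2
    return 1
  fi
  # Step 1: copy the sample configuration under a new name.
  cp -R "$sample_dir" "$new_dir" || return 1
  # Step 2: remove the sample document and copy yours in its place.
  rm -f "$new_dir/word/word2dita_single_doc_to_map_and_topics_01.docx"
  cp "$my_docx" "$new_dir/word/"
}

# Example call (paths are placeholders):
#   new_transform ~/w2d_transforms/all_defaults ~/w2d_transforms/my_transform ~/my_document.docx
```

You still need to point args.input at the new filename as described in step 4. Ant gives properties set with -D on the command line precedence over property elements in a build file, so running ant -f dita-ot-run-word2dita.xml -Dargs.input="$PWD/my_document.docx" may work as an alternative to editing the file; this has not been verified against the sample build file.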

The Toolkit log will include messages from the Word-to-DITA transform, which will report unmapped paragraph and character styles. By default, any unmapped paragraph style is mapped to <p>, so you will usually get valid, if not ideal, output from the default mapping.
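
To review those messages after a run, you can write the build log to a file and filter it. The sketch below uses a hypothetical helper name and assumes that the relevant messages contain the word "unmapped"; the exact wording is not documented here, so check your own log and adjust the pattern:

```shell
# Hypothetical helper: list distinct log lines that mention unmapped styles,
# most frequent first.
# ASSUMPTION: the messages contain the word "unmapped" (any letter case).
unmapped_style_messages() {
  grep -i 'unmapped' "$1" | sort | uniq -c | sort -rn
}

# Example (Ant's -logfile option writes the build log to a file):
#   ant -f dita-ot-run-word2dita.xml -logfile w2d.log
#   unmapped_style_messages w2d.log
```

Each distinct message appears once with a count in front of it, which gives you a rough list of the styles to add to your style-to-tag mapping.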

Once you have verified that the transform produces output, you are ready to start configuring the style-to-tag mapping to reflect your specific requirements and Word documents. You will likely find that you also need to refine how your documents are styled so that they map most effectively.

Finally, remember that the Word-to-DITA process is a work in progress and there is always room for improvement. Please report bugs and submit feature requests through the DITA for Publishers issue tracker on GitHub (https://github.com/dita4publishers/d4p-word2dita/issues).

Remember, too, that the Word-to-DITA framework is open source, which means I welcome fixes and enhancements from the community.