Creating a Kindle Book using Microsoft Word (Quick Note)

Amazon provides a command-line tool called “kindlegen” that can build a Kindle book from a Microsoft Word document saved as HTML. There is also a previewer so you can test your book before publishing. In a nutshell, you keep your Word styles relatively simple, save as “Web Page, Filtered” (this strips out most of the Microsoft-specific markup), and run kindlegen on the HTML file it saves. Pretty easy.

The only problem is that this approach does not generate a “logical table of contents” for the Kindle viewer app to use. This post presents a quick and dirty PHP script to generate that table of contents. I will leave it to someone else to clean it up if they want to. It did the job for me, so I am moving on.

This script assumes you inserted a Microsoft Word table of contents into the document. If you look at the HTML file saved by Word, you will see that it uses “<p class=MsoToc1>” for level 1 table of contents entries, “<p class=MsoToc2>” for level 2, and so on. The PHP script uses these classes to build the logical table of contents structure.
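
To make that concrete, here is a minimal sketch of how those class names can be mapped to levels. The sample HTML string and the helper name tocLevel are made up for illustration; real Word output carries more attributes and may wrap the link in extra spans.

```php
<?php
// Hypothetical helper: map a Word TOC paragraph class to its level (1-3).
// Returns 0 for any paragraph that is not a TOC entry.
function tocLevel(string $class): int {
  if (preg_match('/^MsoToc([1-3])$/', $class, $m)) {
    return (int) $m[1];
  }
  return 0;
}

// Made-up sample in the style of Word's "Web Page, Filtered" output.
$sample = '<html><body>'
  . '<p class=MsoToc1><a href="#_Toc1">Chapter 1</a></p>'
  . '<p class=MsoToc2><a href="#_Toc2">Section 1.1</a></p>'
  . '<p class=MsoNormal>Body text.</p>'
  . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($sample);

// Collect [level, href, title] for every TOC paragraph, in document order.
$entries = [];
foreach ($doc->getElementsByTagName('p') as $p) {
  $level = tocLevel($p->getAttribute('class'));
  if ($level === 0) continue;  // ordinary paragraph, not a TOC entry
  $a = $p->getElementsByTagName('a')->item(0);
  if ($a === null) continue;   // TOC paragraph without a link
  $entries[] = [$level, $a->getAttribute('href'), trim($a->textContent)];
}

// Print the entries indented by level.
foreach ($entries as [$level, $href, $title]) {
  echo str_repeat('  ', $level - 1), $title, ' -> ', $href, "\n";
}
```

The level and the link target are all the later steps need: the level decides nesting, and the href becomes the navigation target.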

To create a logical table of contents, you need to create two files in addition to the HTML file: an NCX file and an OPF file. You will find online documentation that talks about splitting your document into multiple files. I found two posts on this particularly useful. The NCX file holds the logical table of contents, and the OPF file points to the NCX and HTML files. You pass the OPF file to kindlegen instead of the HTML file to create a Kindle book with a logical table of contents in it.

For example, let's say you saved your Word document as “BOOK.htm”. To use the following script you must create a file “BOOK.htm.metadata” and put the following contents in it. (This is copied into the OPF file.) Adjust the values appropriately for your book.

<dc:title>Volume 2: Theme Web Page Assets</dc:title>
<dc:creator>Alan Kent</dc:creator>
<dc:description>This volume focuses on CSS, HTML, and JavaScript page assets for themes.</dc:description>
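
Because the metadata file is copied into the OPF file as-is, characters such as & or < in a title have to be written as XML entities. One way to avoid hand-editing is to generate the file. The sketch below assumes a hypothetical helper named buildMetadata and reuses the example values above; it is not part of the original script.

```php
<?php
// Hypothetical helper: build the Dublin Core lines for BOOK.htm.metadata,
// escaping &, < and > so the result stays valid inside the OPF file.
function buildMetadata(string $title, string $creator, string $description): string {
  $fields = ['title' => $title, 'creator' => $creator, 'description' => $description];
  $out = '';
  foreach ($fields as $name => $value) {
    $out .= "<dc:$name>" . htmlspecialchars($value, ENT_NOQUOTES) . "</dc:$name>\n";
  }
  return $out;
}

$metadata = buildMetadata(
  'Volume 2: Theme Web Page Assets',
  'Alan Kent',
  'This volume focuses on CSS, HTML, and JavaScript page assets for themes.'
);

// Write the file the script expects to find next to BOOK.htm.
file_put_contents('BOOK.htm.metadata', $metadata);
```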

Then run the PHP script below (I hard-coded the filename of the HTML file into the script to get it going quickly, so you will need to adjust the filename to your book). This will create “BOOK.htm.ncx” and “BOOK.htm.opf”. Run kindlegen on the OPF file, and you are done. Your Kindle book should have a logical table of contents, making navigation easier.

Here is the script. Copy and paste into your own file.


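<?php
// The "Web Page, Filtered" file saved by Word. Adjust to your book.
$htmlFilename = "BOOK.htm";

$htm = file_get_contents($htmlFilename);
$doc = new DOMDocument();
// Word's HTML is not strictly valid, so collect parser warnings quietly.
libxml_use_internal_errors(true);
$doc->loadHTML($htm);
$paras = $doc->getElementsByTagName('p');

$currentLevel = 0;   // number of navPoint elements currently open
$uid = 'MyUid';
$firstAnchor = '';   // href of the first TOC entry, used in the OPF guide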

$ncx = <<<EOF
<?xml version="1.0"?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN"
  "http://www.daisy.org/z3986/2005/ncx-2005-1.dtd">
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head>

EOF;
$ncx .= "    <meta name='dtb:uid' content='$uid'/>\n";
$ncx .= <<<EOF
    <meta name="dtb:depth" content="3"/>
    <meta name="dtb:totalPageCount" content="0"/>
    <meta name="dtb:maxPageNumber" content="0"/>
  </head>

EOF;
$ncx .= " <navMap>\n";

$cnt = 1;

for ($i = 0; $i < $paras->length; $i++) {

  $para = $paras->item($i);
  $style = $para->getAttribute('class');

  // Work out Toc level for new entry.
  if ($style == "MsoToc1") {
    $level = 1;
  } else if ($style == "MsoToc2") {
    $level = 2;
  } else if ($style == "MsoToc3") {
    $level = 3;
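  } else {
    // Not a table of contents entry; skip it.
    continue;
  }

  // Each TOC paragraph holds a link to the heading it refers to.
  $aNode = $para->getElementsByTagName('a')->item(0);
  if ($aNode === null || $aNode->firstChild === null) continue;
  $anchor = $aNode->getAttribute('href');
  if ($cnt == 1) $firstAnchor = $anchor;
  // The first child of the link is the entry title; flatten its line breaks.
  $text = str_replace("\r", '', str_replace("\n", ' ', $aNode->firstChild->textContent));
  // Escape &, < and > so the title is safe inside the NCX XML.
  $text = htmlspecialchars($text, ENT_NOQUOTES);

  // Close any open navPoints at the same or a deeper level.
  while ($currentLevel >= $level) {
    $ncx .= str_repeat(" ", $currentLevel) . " </navPoint>\n";
    $currentLevel--;
  }

  // Open the navPoint for this entry; a later entry or the final loop closes it.
  $currentLevel++;
  $indent = str_repeat(" ", $currentLevel);
  $ncx .= "$indent <navPoint id='navPoint-$cnt' playOrder='$cnt'>\n";
  $ncx .= "$indent <navLabel>\n";
  $ncx .= "$indent <text>$text</text>\n";
  $ncx .= "$indent </navLabel>\n";
  $ncx .= "$indent <content src='$htmlFilename$anchor'/>\n";
  $cnt++;
}

// Close any navPoints still open after the last entry.
while ($currentLevel > 0) {
  $ncx .= str_repeat(" ", $currentLevel) . " </navPoint>\n";
  $currentLevel--;
}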
$ncx .= " </navMap>\n";
$ncx .= "</ncx>\n";

$opf = <<<EOF
<?xml version='1.0' encoding='utf-8'?>
<package xmlns='http://www.idpf.org/2007/opf' version='2.0' unique-identifier='MyUid'>
  <metadata xmlns:dc='http://purl.org/dc/elements/1.1/' xmlns:opf='http://www.idpf.org/2007/opf'>

EOF;
$opf .= file_get_contents($htmlFilename . ".metadata");
$opf .= "  </metadata>\n";
$opf .= "  <manifest>\n";
$opf .= "    <item id='doc' media-type='text/html' href='$htmlFilename'></item>\n";
$opf .= "    <item id='ncx' media-type='application/x-dtbncx+xml' href='$htmlFilename.ncx'/>\n";
$opf .= "  </manifest>\n";
$opf .= "  <spine toc='ncx'>\n";
$opf .= "    <itemref idref='doc'/>\n";
$opf .= "  </spine>\n";
$opf .= "  <guide>\n";
$opf .= "    <reference type='toc' title='Table of Contents' href='$htmlFilename#toc'></reference>\n";
$opf .= "    <reference type='text' title='Welcome' href='$htmlFilename$firstAnchor'></reference>\n";
$opf .= "  </guide>\n";
$opf .= "</package>\n";

file_put_contents("$htmlFilename.opf", $opf);
file_put_contents("$htmlFilename.ncx", $ncx);
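
Before handing the OPF file to kindlegen, it can be worth checking that both generated files are at least well-formed XML, since a stray & in the metadata file is enough to break them. The following is a rough sketch only: isWellFormedXml is a hypothetical helper, the file names assume the “BOOK.htm” example, and it checks well-formedness, not validity against the NCX or OPF specifications.

```php
<?php
// Hypothetical helper: true if $xml parses as well-formed XML.
function isWellFormedXml(string $xml): bool {
  if ($xml === '') return false;                 // loadXML rejects empty input
  $previous = libxml_use_internal_errors(true);  // collect errors quietly
  $doc = new DOMDocument();
  $ok = $doc->loadXML($xml);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  return $ok;
}

// Check the two files the script writes for the BOOK.htm example.
foreach (['BOOK.htm.opf', 'BOOK.htm.ncx'] as $file) {
  if (!is_file($file)) {
    echo "$file: not found (run the script above first)\n";
    continue;
  }
  $status = isWellFormedXml(file_get_contents($file)) ? 'looks well-formed' : 'has XML errors';
  echo "$file: $status\n";
}
```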

One comment

  1. Alan,
    Nice article – this will help me when I decide to publish my novel about the life and times of TeraText!
