Convert doc file to docx file


Some doc files trigger the puzzling “Word found unreadable content” error without an obvious cause. I created a doc-to-docx converter to help identify problems in a doc file by converting it to a docx file (readable by Open XML tools) and by testing portions of the file. The tool is published in my GitHub repo (here).

Symptoms include an error popup saying “Word found unreadable content in filename. Do you want to recover the contents of this document?” Workarounds that recover the file are available, but they leave the root cause of the problem unknown.

Inconsistent or non-compliant XML in any part of a doc file may lead to this error. For example, the image below shows a sample doc file. An error-injected version called sample_broken.doc is available in the examples folder. The broken file has an incorrect reference to one of its images.

The image below shows the sample Word file.

A sample readable file in the examples folder of the GitHub repo.
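To make the idea of an incorrect image reference concrete, the following is a minimal sketch (not the tool's own code) that lists relationship targets of document.xml that do not exist in a docx package. The function name find_dangling_targets is hypothetical, and the sketch assumes the file has already been converted to docx and that the fault is a missing target part; other kinds of inconsistency would need different checks.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
RELS_PATH = "word/_rels/document.xml.rels"


def find_dangling_targets(docx_bytes):
    """Return relationship targets of document.xml that are missing from the package.

    Hypothetical helper for illustration; assumes the package contains
    word/_rels/document.xml.rels.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        names = set(package.namelist())
        root = ET.fromstring(package.read(RELS_PATH))

    missing = []
    for rel in root.iter(REL_NS + "Relationship"):
        if rel.get("TargetMode") == "External":
            continue  # hyperlinks and similar targets live outside the package
        target = rel.get("Target", "")
        # Targets are relative to the "word/" folder unless they start with "/".
        if target.startswith("/"):
            part = target.lstrip("/")
        else:
            part = posixpath.normpath(posixpath.join("word", target))
        if part not in names:
            missing.append(target)
    return missing
```

A caller would read the converted file as bytes and pass it in; an empty list means every internal relationship of document.xml points at a part that exists.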

Inside the XML package of a Word document, document.xml holds the pieces that make up the whole document. The tool creates a separate file for each section. The image below shows an example of the files that the tool creates for each block of the document. The left side shows the original document and the right side illustrates the generated files. The generated files are listed under three headers. “Part of Word document” shows the separated images from the original document. “XML file for the part” lists the XML files that hold the XML data for each part. “Docx file for the part” lists the Word files that each hold only the respective section of XML data.
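As a rough illustration of splitting document.xml into blocks, the sketch below returns one XML string per top-level child of the body element. It is an assumption about how such a split could be done, not a description of the tool's internals; the function name split_body_blocks is hypothetical, and the section properties element (w:sectPr) is skipped because it carries page settings, not content.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def split_body_blocks(document_xml):
    """Return one serialized XML string per top-level block of <w:body>.

    Hypothetical helper for illustration only.
    """
    ET.register_namespace("w", W_NS)  # keep the familiar "w:" prefix on output
    root = ET.fromstring(document_xml)
    body = root.find(f"{{{W_NS}}}body")
    if body is None:
        return []

    blocks = []
    for child in body:
        # w:sectPr holds page settings for the section, not visible content.
        if child.tag == f"{{{W_NS}}}sectPr":
            continue
        blocks.append(ET.tostring(child, encoding="unicode"))
    return blocks
```

Each returned string could then be written to its own XML file, or wrapped in a fresh body element to build a small docx that contains only that block.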

The tool creates breakdown files only for the groups that contain inconsistent parts. The image below illustrates the files that are generated in order to narrow down the problematic portion. In the example file, the inconsistency is in one of the images and is traced down to the files “document_06_0216_0215.xml” and “new_06_0216_0215.docxbroken_imagedata.docx”. The identified broken parts are marked with the text “broken” in the filenames of the Word documents.
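One common way to narrow down a faulty portion is to split the blocks in half repeatedly and keep only the halves that still fail. The sketch below shows that idea under stated assumptions; it is not taken from the tool. The callback is_readable is hypothetical: it stands for whatever check decides that a document built from a subset of blocks opens cleanly. The sketch also assumes each fault is local to a single block, so a failing group always has at least one failing half.

```python
def narrow_down(blocks, is_readable):
    """Return indices of blocks that fail `is_readable` on their own.

    `is_readable(subset)` is a hypothetical callback that returns True when
    a document built only from `subset` has no inconsistency.
    """
    def search(lo, hi):
        if is_readable(blocks[lo:hi]):
            return []  # this whole group is fine, so stop splitting here
        if hi - lo == 1:
            return [lo]  # a single failing block: this is a broken part
        mid = (lo + hi) // 2
        return search(lo, mid) + search(mid, hi)

    return search(0, len(blocks)) if blocks else []
```

Groups that pass the check are never split further, which matches the behavior of producing breakdown files only where an inconsistency is present.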

A structure image of the inconsistent sections of the Word document is saved under the name “tree.png” in the execution folder. The image includes tags for the broken parts.

Sample image file: structure for the incorrect file (the image proportions are enlarged for illustration purposes).

The Word application error for the inconsistencies reads as follows.

English: “Word found unreadable content in filename. Do you want to recover the contents of this document?”

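(Supplementary note, translated from Japanese) The tool introduced on this page converts Word files in doc format to docx. If Word raises an error when opening a doc file, the tool also creates files for the locations where the error occurs as part of the conversion. In doing so, it subdivides the components inside the file and automatically creates files for the locations that cause the error.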