Recovering from a SAXParseException error with no data loss

I was working on a book I’m writing in LibreOffice Writer. I was using the docx format for the book, as I plan to finish its formatting in Microsoft Word, not knowing that LibreOffice Writer has a tendency to corrupt docx files. After closing the file and opening it again, I received the following error:

I extracted the docx file (which is just a zip file; on Windows you can rename it to something.zip to extract it, while Ubuntu Linux allows you to extract it without renaming it). I found the document.xml file (it sits in the “word” folder) and opened it in VIM. I used the following command to jump to position 791513 on line 2:

791513l

That is the position number followed by a lowercase L, which moves the cursor that many characters to the right.

I didn’t see any error there, so LibreOffice Writer was not telling the truth: the error was not at that position. I opened document.xml in Chromium, but it reported the same wrong error position. Since the error message said the “w:cstheme” attribute was redefined, I decided to use regular expressions to search for it. I spent a stressful hour trying to learn VIM’s ridiculous regular expression syntax, but couldn’t figure it out.
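For anyone who wants a third opinion on where the parser gives up, here is a minimal Python sketch that reads document.xml straight out of the docx and asks Python's own XML parser what it thinks. The function name `find_xml_error` is made up for this example, and there is no guarantee that the position it reports is any closer to the real problem than the one LibreOffice Writer or Chromium gave.

```python
import zipfile
import xml.etree.ElementTree as ET


def find_xml_error(docx_path, member="word/document.xml"):
    """Return None if the member parses, else (line, column, message).

    Hypothetical helper for illustration. A docx is a zip archive, so the
    XML can be read without extracting anything to disk.
    """
    with zipfile.ZipFile(docx_path) as archive:
        data = archive.read(member)
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        # The underlying expat parser counts lines from 1 and columns from 0.
        line, column = err.position
        return line, column, str(err)
    return None
```

Calling `find_xml_error("book.docx")` (the file name is a placeholder) returns `None` for a file that parses, or a tuple whose last item is the parser's own message, which for this kind of damage should mention a duplicate attribute.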

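In the end, I decided to use egrep (the same thing as grep -E) instead. I ran the following command on the command line, which looks for a “w:cstheme” attribute that is not separated by a forward slash from another “w:cstheme” attribute. The w:rFonts tag that carries this attribute is self-closing, so there is normally a forward slash between one tag’s “w:cstheme” and the next one’s. A match with no slash in between therefore means a tag has duplicate “w:cstheme” attributes, which is the error that LibreOffice Writer was reporting: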

egrep "w:cstheme[^/]*w:cstheme" document.xml

And voila! It highlighted the error:

I copied the highlighted text (using ctrl+shift+c), opened document.xml again in VIM, and pasted the text into VIM’s search bar (first press forward slash to open the search bar, then ctrl+shift+v to paste):

After I pressed enter twice, it jumped right to the line and position (“column”) where the error was, which was actually position 817157.
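The search can also be scripted so that it prints the offset directly and catches any repeated attribute, not only “w:cstheme”. The following is a rough Python sketch, not a real XML parser: the names `TAG`, `ATTR_NAME` and `find_duplicate_attributes` are invented for this example, and it assumes that no attribute value contains a “>” character.

```python
import re

# One start tag: "<", a first character that is not "/", "!" or "?",
# then everything up to the closing ">".
TAG = re.compile(r"<[^<>/!?][^<>]*>")
# An attribute name followed by "=" and a quoted value.
ATTR_NAME = re.compile(r"""([^\s=<>"'/]+)\s*=\s*(?:"[^"]*"|'[^']*')""")


def find_duplicate_attributes(xml_text):
    """Yield (offset, attribute_name, tag_text) for each repeated attribute.

    The offset is 0-based and points at the second (repeated) occurrence.
    """
    for tag in TAG.finditer(xml_text):
        seen = set()
        for attr in ATTR_NAME.finditer(tag.group()):
            name = attr.group(1)
            if name in seen:
                yield tag.start() + attr.start(), name, tag.group()
            seen.add(name)
```

Running it over the text of document.xml and printing each result gives a list of places to visit in the editor. Keep in mind that the offsets count from 0 and from the start of the whole text, not from the start of line 2.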

If you understand HTML/XML, you will see the issue: the “w:cstheme” attribute appears twice in the same tag. To correct it, change this:

<w:rFonts w:eastAsia="Times New Roman" w:cs="Times New Roman" w:cstheme="majorBidi" w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cstheme="majorBidi"/>

To this, removing one of the duplicate w:cstheme="majorBidi" attributes:

<w:rFonts w:eastAsia="Times New Roman" w:cs="Times New Roman" w:cstheme="majorBidi" w:ascii="Times New Roman" w:hAnsi="Times New Roman" />
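When the same mistake shows up many times, fixing each one by hand gets tedious. Below is a hedged Python sketch of the same correction done in bulk: inside each tag it keeps the first occurrence of every attribute and drops later repeats. The function name `drop_repeated_attributes` is hypothetical, the code assumes no attribute value contains “>”, and it is only a safe choice when the repeats carry the same value, as they do here. Work on a copy of the file.

```python
import re

# One start tag, from "<" to the closing ">".
TAG = re.compile(r"<[^<>/!?][^<>]*>")
# Leading whitespace, an attribute name, "=", and a quoted value.
ATTR = re.compile(r"""\s+([^\s=<>"'/]+)\s*=\s*(?:"[^"]*"|'[^']*')""")


def drop_repeated_attributes(xml_text):
    """Return xml_text with repeated attributes removed from every tag."""

    def clean_tag(tag_match):
        seen = set()

        def keep_first(attr_match):
            name = attr_match.group(1)
            if name in seen:
                # Drop the repeat together with the whitespace before it.
                return ""
            seen.add(name)
            return attr_match.group(0)

        return ATTR.sub(keep_first, tag_match.group(0))

    return TAG.sub(clean_tag, xml_text)
```

Applied to the broken tag shown earlier, this leaves a single w:cstheme="majorBidi" in place and everything else untouched.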

I searched again for the error in VIM, to make sure there were no repeated errors. I fixed multiple other occurrences of the error until I couldn’t find any more. Now, when opening the document.xml in Chromium, no error was reported:

This was a good sign. I then made the mistake of compressing the parent folder of the document, renaming it to docx, and trying to open it. LibreOffice Writer said the document was corrupted and offered to fix it. It tried but failed. After a long time, I realized my error: I shouldn’t have compressed the parent folder. I should have compressed the files and folders inside the parent folder directly, so that they sit at the top level of the zip file, as follows:

Above I have selected the files and folders that make up the docx document. I then right-clicked the selection and chose “Compress”, and chose the “zip” option. Below is the compressed file:

Next, I renamed the file to “occupy.docx_FILES.docx”:

Then I opened the file in LibreOffice Writer, and it worked!
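The compress-and-rename step can be scripted too, which makes it harder to zip the parent folder by accident. This is a minimal Python sketch with an invented function name, `zip_folder_contents`. It stores every file under a path relative to the extracted folder, so that items such as [Content_Types].xml end up at the top level of the archive. The output file should be written somewhere outside the folder being zipped.

```python
import os
import zipfile


def zip_folder_contents(folder, docx_path):
    """Zip the contents of folder (not the folder itself) into docx_path."""
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full_path = os.path.join(root, name)
                # Path inside the archive, relative to the extracted folder,
                # always written with forward slashes.
                arcname = os.path.relpath(full_path, folder).replace(os.sep, "/")
                archive.write(full_path, arcname)
```

For example, `zip_folder_contents("occupy.docx_FILES", "occupy_fixed.docx")` would rebuild the document from the extracted folder. Both names are placeholders.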

To prevent this in the future, I will save the file in the ODF Text Document format (.odt), which is the native format used by LibreOffice and supposedly doesn’t suffer from this issue. Once the book is done, I will save it as docx for use in Microsoft Word.
