Approaches to PDF -> DOCX conversion
Over the past several years, I have had to work with PDF files, specifically on automating PDF to DOCX conversion. Unfortunately, PDF to DOCX conversion is the only sustainable way of making PDFs editable that I know of.
Surprisingly, there is not that much information about it on the internet, so I gathered all the approaches I tried in one place, hoping it would help someone.
Approach #1: Xpdf and pdftohtml
Xpdf is a PDF viewer that comes with a set of command-line tools for working with PDF files. These tools can convert PDF to text, PostScript, image files, etc. You can check the whole list in the About section of the official site. Unfortunately, none of them can convert to DOCX, so we will have to settle for HTML. The binary we are interested in is pdftohtml.
You can get it by installing xpdf and pdftohtml from your OS's package manager, or by installing Poppler, another PDF library based on xpdf that ships with pdftohtml built in.
Usage is pretty straightforward:
pdftohtml sample.pdf out/sample.html
Navigating to the out directory, you'll notice that pdftohtml made one HTML file per page, which is not ideal. But the thing that makes pdftohtml worse than the other approaches is the structure of the HTML that it produces.
If you look at a page's source, you'll notice that it's a bunch of absolutely positioned divs. This format makes it hard, if not impossible, to reason about the structure of the file programmatically, for example, if you want to know how the text flows in the document.
pdftohtml might be a good choice if you only need to display the PDF file with its formatting intact and don't need to edit it.
Approach #2: LibreOffice (unoconv/unoserver)
The next possible solution is using LibreOffice's headless mode for conversion. Again, just like pdftohtml, it should be available for installation from your OS's package manager, or you might already have it, since some Linux distributions ship with it. Note that in some operating systems, the binary is called soffice and not libreoffice (apparently the libreoffice binary is just a shell script that wraps soffice). I will use soffice, but you can safely replace it with libreoffice; it shouldn't make any difference.
Using it is not much different from pdftohtml:
soffice --headless --infilter=writer_pdf_import --convert-to docx --outdir . ./sample.pdf
The --infilter parameter is essential: the conversion won't work without it. The values for this parameter are not documented anywhere, but I found this value in the source code: https://github.com/LibreOffice/core/blob/bdbb5d0389642c0d445b5779fe2a18fda3e4a4d4/sdext/source/pdfimport/config/pdf_import_filter.xcu
After running the command, you will have a DOCX file that should look something like the original PDF. For some files, the conversion preserves the formatting almost one-to-one, and for others, you get an unreadable mess.
It crashes and gets stuck pretty often, especially on large files. What makes things even worse is that after a crash, there might still be a soffice process hanging somewhere in the background, and it will prevent you from using the converter again until you kill it. Of course, you could write a wrapper that takes care of this, but, thankfully, one already exists, and it's called unoserver.
Overall, LibreOffice is hit-and-miss. It's free and open-source, easy to install on a server and use in headless mode, but it's very unstable, and the conversion parameters are not well documented. On top of that, it's not good at preserving the layout and formatting in the output.
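For illustration, a hand-rolled wrapper of the kind mentioned above could look roughly like the sketch below. It is not how unoserver works. It is a minimal POSIX-only example that assumes any leftover soffice process stays in the process group of the command that was started. The helper names `soffice_command` and `run_with_timeout` are invented for this example.

```python
import os
import signal
import subprocess


def soffice_command(pdf_path, outdir=".", binary="soffice"):
    """Build the conversion command shown above as an argument list."""
    return [
        binary,
        "--headless",
        "--infilter=writer_pdf_import",
        "--convert-to", "docx",
        "--outdir", outdir,
        pdf_path,
    ]


def run_with_timeout(command, timeout_seconds):
    """Run a command and kill its whole process group if it takes too long.

    Returns True if the command exited with status 0, and False if it
    failed or had to be killed after the timeout.
    """
    # start_new_session puts the child in its own process group (POSIX only),
    # so its children can be killed together with it.
    process = subprocess.Popen(command, start_new_session=True)
    try:
        return process.wait(timeout=timeout_seconds) == 0
    except subprocess.TimeoutExpired:
        try:
            os.killpg(process.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited just before we tried to kill it
        process.wait()  # reap the killed process
        return False
```

A call such as `run_with_timeout(soffice_command("sample.pdf"), 120)` would then give up after two minutes instead of hanging forever. The 120-second limit is an arbitrary value, not a recommendation.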
Approach #3: MS Office
This solution is a bit unconventional. I got the idea from a podcast whose name I can't remember, where a Dropbox engineer was talking about how they generate document previews for MS Office files. They run Microsoft Office in headless mode through Wine on Linux servers and then use that to convert Office files to HTML that can be viewed on the web.
As you can see, it's pretty similar to the previous approach, but instead of LibreOffice, we will use MS Office.
Initially, I did try installing MS Office in Wine, but after spending a couple of days trying to fix obscure and seemingly random errors, I gave up. Instead, I went with a different approach: installing it on a Windows Server (AWS has EC2 Windows instances). Surprisingly, it's not that hard to automate it. For example, Ansible has many Windows modules, so I could use Chocolatey to install Office 365 automatically. The only issue is that the Office license cannot be applied automatically. So after provisioning the server, you'll have to log into that server through Remote Desktop and apply the license.
But how do we automate the conversion? We can do it with BAT files, but, as with LibreOffice, this feature is barely documented, and Microsoft does not officially support or encourage it. Thankfully, there's a project called documents4j (documents4j on GitHub) that hides the complexity and hacks needed to make the conversion work. You can use it as a library in Java code or as a standalone conversion server/client.
Overall, Microsoft Office is much better at preserving the layout of the original PDF, and it feels much more stable. Still, it can get stuck on certain files, so I recommend having a scheduled task that will restart the documents4j server if it doesn't respond.
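A scheduled health check of this kind can be sketched as follows. The sketch only verifies that something accepts TCP connections on the server's port, which is a weak signal: a server stuck in a conversion may still accept connections, so a real check would probably submit a tiny test document instead. The host, port, and restart command are placeholders you would supply yourself, and the function names are invented for this example.

```python
import socket
import subprocess


def is_responsive(host, port, timeout_seconds=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def restart_if_unresponsive(host, port, restart_command, timeout_seconds=5.0):
    """Run restart_command when the server does not respond.

    restart_command is an argument list for whatever restarts the conversion
    server on your machine (hypothetical; it depends on how it was installed).
    Returns True if a restart was triggered, and False otherwise.
    """
    if is_responsive(host, port, timeout_seconds):
        return False
    subprocess.run(restart_command, check=False)
    return True
```

Running `restart_if_unresponsive` every few minutes from the operating system's task scheduler keeps the logic stateless, so the check itself has nothing that can get stuck between runs.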
This approach might be good if your application infrastructure is already hosted on Windows or you need better conversion quality than LibreOffice offers.
Approach #4: Adobe Document Cloud
This approach is straightforward: we can use the API provided by Adobe to convert PDF files to DOCX while preserving the formatting. Why did I even bother going through the previous approaches then? Surprisingly, until 2019 this API was not available at all, even though Adobe's PDF software can convert PDF files.
It's very easy to use, and multiple clients are available: Java, .NET, Node.js, and Python. It can also be used directly without any clients through a public API.
In my opinion, this is the best approach if you're looking for good conversion quality and stability and don't want to deal with setting up Windows servers or making sure that LibreOffice is not getting stuck.
Conclusion
If you want to get the best conversion quality and are ready to pay for it, use Adobe’s API. For personal side-projects or one-off tasks, LibreOffice with unoserver should be enough.