Extract text from Word files (docx) simply


If you want to extract the text content of a Word file, there are a few ways to do it in Python. Unfortunately, most of them have dependencies, need to run an external command in a subprocess, or are heavy and complex, relying on an office suite, etc. I find that the best solution among those on the Stack Overflow page is python-docx. But using it brings two dependencies: python-docx itself and lxml. Installing python-docx is not a big problem. Unfortunately, lxml is sometimes hard to install or, at a minimum, requires compilation.

To avoid that, and inspired by python-docx, I created a simple function to extract text from .docx files that does not require any dependencies and uses only the standard library. That makes it easy to incorporate into any Python project.
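
As a rough illustration, a standard-library-only function of this kind might look like the sketch below. It is not necessarily the exact function described above. It assumes that a .docx file is a ZIP archive whose main body lives in word/document.xml, and that text sits in w:t elements inside runs (w:r) inside paragraphs (w:p). The name docx_to_text is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
PARA = W_NS + "p"    # paragraph
RUN = W_NS + "r"     # run of text inside a paragraph
TEXT = W_NS + "t"    # literal text inside a run
TAB = W_NS + "tab"   # tab character (when it is a direct child of a run)
BREAK = W_NS + "br"  # line break inside a run


def docx_to_text(path):
    """Return the body text of a .docx file, one paragraph per line.

    `path` may be a file name or a binary file-like object.
    Hypothetical helper: only word/document.xml is read, so headers,
    footers, footnotes and comments are ignored.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(PARA):
        parts = []
        for run in para.iter(RUN):
            # Look only at direct children of the run, so tab-stop
            # definitions in paragraph properties are not mistaken for tabs.
            for node in run:
                if node.tag == TEXT and node.text:
                    parts.append(node.text)
                elif node.tag == TAB:
                    parts.append("\t")
                elif node.tag == BREAK:
                    parts.append("\n")
        paragraphs.append("".join(parts))
    return "\n".join(paragraphs)
```

Calling docx_to_text("report.docx") (with any .docx file name of your own) returns the text as a single string. Paragraphs inside table cells come out as ordinary lines, and paragraphs nested inside other paragraphs, such as text boxes, may have their text repeated, so treat this as a starting point rather than a full replacement for python-docx.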

Is there any way to improve it?
