Extract text from Word files (docx) simply

Nov 26, 2013 Python

If you want to extract the text content of a Word file, there are a few ways to do it in Python. Unfortunately, most of them have dependencies, need to run an external command in a subprocess, or are heavy and complex (driving an office suite, for example). I find that the best solution among those on the Stack Overflow page is python-docx. But using it brings two dependencies: python-docx itself and lxml. Installing python-docx is not a big problem. lxml, however, is sometimes hard to install or, at a minimum, requires compilation.

To avoid that, and inspired by python-docx, I wrote a simple function to extract text from .docx files that requires no dependencies and uses only the standard library. That makes it easy to incorporate into any Python project.

Is there any way to improve it?

    """
    Module that extracts text from an MS XML Word document (.docx).
    (Inspired by python-docx <https://github.com/mikemaccana/python-docx>)
    """
    import zipfile

    try:
        from xml.etree.cElementTree import XML
    except ImportError:
        # cElementTree was removed in Python 3.9
        from xml.etree.ElementTree import XML

    WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    PARA = WORD_NAMESPACE + 'p'
    TEXT = WORD_NAMESPACE + 't'


    def get_docx_text(path):
        """
        Take the path of a docx file as argument, return the text in unicode.
        """
        # A .docx file is a zip archive; the main text lives in word/document.xml
        with zipfile.ZipFile(path) as document:
            xml_content = document.read('word/document.xml')
        tree = XML(xml_content)

        paragraphs = []
        # iter() replaces getiterator(), which was removed in Python 3.9
        for paragraph in tree.iter(PARA):
            texts = [node.text for node in paragraph.iter(TEXT) if node.text]
            if texts:
                paragraphs.append(''.join(texts))

        return '\n\n'.join(paragraphs)
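One possible improvement, offered as a sketch rather than a definitive answer: the function above only collects `w:t` text nodes, so tabs (`w:tab`) and manual line breaks (`w:br`) inside a run disappear from the output. The variant below maps them to `\t` and `\n`. The names `get_docx_text_with_breaks` and `make_sample_docx` are hypothetical and exist only for this illustration; the sample archive contains just `word/document.xml`, which is enough for this function but is not a complete Word file. Page breaks are also `w:br` elements, so they come out as plain newlines here.

```python
import io
import zipfile
from xml.etree.ElementTree import XML

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
RUN = WORD_NAMESPACE + 'r'
TEXT = WORD_NAMESPACE + 't'
TAB = WORD_NAMESPACE + 'tab'
BREAK = WORD_NAMESPACE + 'br'


def get_docx_text_with_breaks(source):
    """
    Hypothetical variant of get_docx_text that keeps tabs and line breaks.
    `source` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as document:
        xml_content = document.read('word/document.xml')
    tree = XML(xml_content)

    paragraphs = []
    for paragraph in tree.iter(PARA):
        pieces = []
        # Only look at direct children of runs, so that tab-stop
        # definitions (w:tabs/w:tab in paragraph properties) are ignored.
        for run in paragraph.iter(RUN):
            for node in run:
                if node.tag == TEXT and node.text:
                    pieces.append(node.text)
                elif node.tag == TAB:
                    pieces.append('\t')
                elif node.tag == BREAK:
                    pieces.append('\n')
        if pieces:
            paragraphs.append(''.join(pieces))
    return '\n\n'.join(paragraphs)


def make_sample_docx():
    """Build a tiny in-memory archive holding only word/document.xml (demo only)."""
    ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    xml = (
        '<w:document xmlns:w="%s"><w:body>'
        '<w:p><w:r><w:t>Name</w:t><w:tab/><w:t>Value</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>line one</w:t><w:br/><w:t>line two</w:t></w:r></w:p>'
        '</w:body></w:document>' % ns
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('word/document.xml', xml)
    buffer.seek(0)
    return buffer


if __name__ == '__main__':
    print(repr(get_docx_text_with_breaks(make_sample_docx())))
```

Because `zipfile.ZipFile` accepts file-like objects as well as paths, the same function works on an uploaded file held in memory without writing it to disk first.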