Python: Best way to extract text from a Word doc without using COM/automation?

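Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platform; that is non-negotiable in this case.)

Antiword seems like it might be a reasonable option, but it appears to have been abandoned.

A Python solution would be ideal, but doesn't appear to be available.

(Same answer as for extracting text from MS Word files in Python.)

Use the native Python docx module, which I made this week. Here's how to extract all the text from a document: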

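# Assumes opendocx, getdocumenttext and wordnamespaces are provided by the
# legacy docx module described in this answer.
document = opendocx('Hello world.docx')

# This location is where most document content lives
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]

# Extract all text
print(getdocumenttext(document))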

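See the Python DocX site.

100% Python, no COM, no .NET, no Java, no parsing serialized XML with regular expressions.

I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have embedded this in Python functions, so it is easy to use from the parsing system (which is written in Python).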

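import subprocess

def doc_to_text_catdoc(filename):
    # os.popen3 no longer exists in Python 3; subprocess replaces it.
    # Passing the arguments as a list avoids shell quoting problems.
    process = subprocess.Popen(['catdoc', '-w', filename],
                               stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    retval, erroroutput = process.communicate()
    if not erroroutput:
        return retval  # raw bytes as written by catdoc
    else:
        raise OSError("Executing the command caused an error: %s" % erroroutput)

# similar doc_to_text_antiword()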

The -w switch to catdoc turns off line wrapping, by the way.
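The "whichever works" idea can be sketched as a single function that tries each converter in turn. This is only an illustration under stated assumptions: the name `doc_to_text` and the `tools` parameter are hypothetical, catdoc and antiword are assumed to be on PATH, and the output is assumed to decode as UTF-8 (the actual charset depends on how the tools are configured).

```python
import shutil
import subprocess

# Candidate commands, tried in order. "catdoc -w" disables line wrapping;
# antiword is run with its default options. Both are assumed to be on PATH.
DEFAULT_TOOLS = (("catdoc", "-w"), ("antiword",))


def doc_to_text(filename, tools=DEFAULT_TOOLS):
    """Return the text of a .doc file from the first tool that succeeds."""
    errors = []
    for tool in tools:
        if shutil.which(tool[0]) is None:
            errors.append("%s: not installed" % tool[0])
            continue
        # An argument list (no shell) keeps odd filenames safe.
        result = subprocess.run(list(tool) + [filename],
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if result.returncode == 0:
            # Assumption: the tool emits UTF-8; undecodable bytes are replaced.
            return result.stdout.decode("utf-8", errors="replace")
        message = result.stderr.decode("utf-8", errors="replace").strip()
        errors.append("%s: %s" % (tool[0], message))
    raise OSError("No converter succeeded: " + "; ".join(errors))
```

Unlike the wrapper above, this checks the exit status rather than treating any stderr output as a failure, since some tools print warnings while still producing usable text.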

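If all you want to do is extract text from Word files (.docx), you can do it with Python alone. As Guy Starbuck wrote, you just need to unzip the file and then parse the XML. Inspired by python-docx, I wrote a simple function to do this: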

try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML
import zipfile

"""
Module that extracts text from an MS XML Word document (.docx).
(Inspired by python-docx <https://github.com/mikemaccana/python-docx>)
"""

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'

def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text in unicode.
    """
    document = zipfile.ZipFile(path)
    xml_content = document.read('word/document.xml')
    document.close()
    tree = XML(xml_content)

    paragraphs = []
    # iter() replaces getiterator(), which was removed in Python 3.9
    for paragraph in tree.iter(PARA):
        texts = [node.text
                 for node in paragraph.iter(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)
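The same unzip-and-parse approach can be extended to keep tabs and manual line breaks, which WordprocessingML stores as `w:tab` and `w:br` elements inside a run rather than as text. The following is a minimal sketch, not part of the original answer: the helper names are hypothetical, and the demo builds a tiny in-memory archive containing only `word/document.xml`, which is enough for this parser but is not a complete .docx that Word would open.

```python
import io
import zipfile
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def paragraph_text(paragraph):
    """Join the runs of one w:p element, mapping w:tab and w:br to characters."""
    parts = []
    for run in paragraph.iter(W + 'r'):
        # Only direct children of a run are inspected, so tab-stop
        # definitions in the paragraph properties are not mistaken for tabs.
        for node in run:
            if node.tag == W + 't' and node.text:
                parts.append(node.text)
            elif node.tag == W + 'tab':
                parts.append('\t')
            elif node.tag == W + 'br':
                parts.append('\n')
    return ''.join(parts)


def get_docx_paragraphs(source):
    """Return the non-empty paragraphs from a path or file-like object."""
    with zipfile.ZipFile(source) as archive:
        tree = XML(archive.read('word/document.xml'))
    texts = (paragraph_text(p) for p in tree.iter(W + 'p'))
    return [t for t in texts if t]


# Demo with a hand-written document body (hypothetical content).
DOCUMENT_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body>'
    '<w:p><w:r><w:t>Hello</w:t><w:tab/><w:t>world</w:t></w:r></w:p>'
    '<w:p/>'
    '<w:p><w:r><w:t>Second</w:t><w:br/><w:t>line</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', DOCUMENT_XML)
buffer.seek(0)
paragraphs = get_docx_paragraphs(buffer)
print(paragraphs)
```

Returning a list instead of one joined string leaves the choice of paragraph separator to the caller.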

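Using the OpenOffice API, Python, and Andrew Pitonyak's excellent online macro book, I managed to do this. Section 7.16.4 is the place to start.

One other tip to make it work without needing a screen at all is to use the Hidden property: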

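from com.sun.star.beans import PropertyValue

# desktop (the UNO Desktop service) and docpath (a document URL) are assumed to exist already
RO = PropertyValue('ReadOnly', 0, True, 0)
Hidden = PropertyValue('Hidden', 0, True, 0)
xDoc = desktop.loadComponentFromURL(docpath, "_blank", 0, (RO, Hidden,))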

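Otherwise the document flicks up on the screen (probably on the web server console) when you open it.

tika-python

A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.

Note: It also works well with pyinstaller.

Install with pip:

pip install tika

Sample: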

#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"]) #To get the meta data of the file
print(parsed["content"]) # To get the content of the file

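Link to the official GitHub

For docx files, check out the Python script docx2txt, available at

http://cobweb.ecn.purdue.edu/~kak/distMisc/docx2txt

for extracting plain text from docx documents.

Open Office has an API.

Just in case anyone wants to do this in Java, there is the Apache POI API. extractor.getText() will extract plain text from a docx. Here is the link: https://www.tutorialspoint.com/apache_poi_word/apache_poi_word_text_extraction.htm

Honestly, don't use "pip install tika". It was developed for a single user (one developer working on a laptop), not for multiple users (multiple developers).

The small class TikaWrapper.py below, which runs Tika from the command line, is more than enough for our needs.

You just have to instantiate this class with the JAVA_HOME path and the Tika jar path, that's all! It works well for many formats (e.g. PDF, DOCX, ODT, XLSX, PPT).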

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Class to extract metadata and text from different file types (such as PPT, XLS, and PDF)
# Developed by Philippe ROSSIGNOL
import subprocess


#####################
# TikaWrapper class #
#####################
class TikaWrapper:

    java_home = None
    tikalib_path = None

    # Constructor
    def __init__(self, java_home, tikalib_path):
        self.java_home = java_home
        self.tikalib_path = tikalib_path

    def extractMetadata(self, filePath, encoding="UTF-8", returnTuple=False):
        '''
        - Description:
          Extract metadata from a document

        - Params:
          filePath: The document file path
          encoding: The encoding (default = "UTF-8")
          returnTuple: If True return a tuple which contains both the output and the error (default = False)

        - Examples:
          metadata = extractMetadata(filePath="MyDocument.docx")
          metadata, error = extractMetadata(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
        '''
        cmd = self._getCmd(self._cmdExtractMetadata, filePath, encoding)
        out, err = self._execute(cmd, encoding)
        if returnTuple:
            return out, err
        return out

    def extractText(self, filePath, encoding="UTF-8", returnTuple=False):
        '''
        - Description:
          Extract text from a document

        - Params:
          filePath: The document file path
          encoding: The encoding (default = "UTF-8")
          returnTuple: If True return a tuple which contains both the output and the error (default = False)

        - Examples:
          text = extractText(filePath="MyDocument.docx")
          text, error = extractText(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
        '''
        cmd = self._getCmd(self._cmdExtractText, filePath, encoding)
        out, err = self._execute(cmd, encoding)
        # Honor returnTuple as documented (the original always returned a tuple)
        if returnTuple:
            return out, err
        return out

    # ===========
    # = PRIVATE =
    # ===========

    _cmdExtractMetadata = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --metadata ${FILE_PATH}"
    _cmdExtractText = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --encoding=${ENCODING} --text ${FILE_PATH}"

    def _getCmd(self, cmdModel, filePath, encoding):
        # Plain string substitution: paths containing spaces must already be quoted by the caller
        cmd = cmdModel.replace("${JAVA_HOME}", self.java_home)
        cmd = cmd.replace("${TIKALIB_PATH}", self.tikalib_path)
        cmd = cmd.replace("${ENCODING}", encoding)
        cmd = cmd.replace("${FILE_PATH}", filePath)
        return cmd

    def _execute(self, cmd, encoding):
        # Run the command through the shell and decode both output streams
        process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = process.communicate()
        out = out.decode(encoding=encoding)
        err = err.decode(encoding=encoding)
        return out, err

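This worked well for .doc and .odt.

It calls OpenOffice on the command line to convert the file to text, which you can then simply load into Python.

(It seems to have other format options, though they are not documented.)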
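The tool this answer refers to is not named in the text. As an illustration of the same idea (run an office suite from the command line, then read the resulting text file), the sketch below drives LibreOffice's `soffice` binary in headless mode. This is a substitute chosen for the example, not necessarily what the answer used: it assumes `soffice` is on PATH, and the function names are hypothetical.

```python
import os
import subprocess
import tempfile


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the headless conversion command as an argument list."""
    return [soffice, "--headless",
            "--convert-to", "txt:Text (encoded):UTF8",
            "--outdir", out_dir, doc_path]


def office_doc_to_text(doc_path, soffice="soffice"):
    """Convert doc_path to plain text in a temporary directory and return it."""
    with tempfile.TemporaryDirectory() as out_dir:
        # check=True raises CalledProcessError if the converter fails.
        subprocess.run(build_convert_command(doc_path, out_dir, soffice),
                       check=True)
        base = os.path.splitext(os.path.basename(doc_path))[0]
        txt_path = os.path.join(out_dir, base + ".txt")
        # utf-8-sig tolerates a leading byte-order mark if one is written.
        with open(txt_path, encoding="utf-8-sig") as handle:
            return handle.read()
```

Starting an office process for each file is slow, so this fits batch jobs better than per-request use in a web application.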
