I have a vocabulary list with phonetic transcriptions here (the extraction password is 1234). I imported its contents by saving the docx file as an XML file.
Import["D:\\360Downloads\\考博英语10000词汇表.docx", "Text",
CharacterEncoding -> "UTF-8"]
Import["C:\\Users\\***\\Desktop\\考博英语10000词汇表\\考博英语10000词汇表.xml", "CDATA", CharacterEncoding -> "UTF-8"]
However, after importing this XML file with the above method, the phonetic information contains many errors.
Import["C:\\Users\\***\\Desktop\\考博英语10000词汇表\\考博英语10000词汇表.xml", "Plaintext", CharacterEncoding -> "UTF-8"]
When I use the above code to import this file, many unnecessary line breaks appear.
I tried to restructure the text with the StringCases function, but failed:
s = StringReplace[
   Import["C:\\Users\\***\\Desktop\\考博英语10000词汇表\\考博英语10000词汇表.xml",
    "Plaintext", CharacterEncoding -> "UTF-8"], "\n" -> ""];
StringCases[s, (x : CharacterRange["a", "z"] ..) ~~ (y__ /;
     StringFreeQ[y, "["]) ~~ CharacterRange["a", "z"] .. ~~
   "[" ~~ __ ~~ "]" :> {x, y}]
How can I import the contents of this file in the following format?
{{abate,{ə'beit},{v.减轻,减退;废除}},{aberrant,{æ'berənt},{a.畸变的;异常的;脱离常轨的}},...{accent ,{'æksənt, æk'sent},{ n.腔调,口音;重音,重音符号,v.加重读}}...}
python-docx. In Mathematica, you can use ExternalEvaluate to communicate with Python.
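As a rough illustration of the python-docx route, here is a minimal sketch. It assumes that each entry sits in its own paragraph and looks like `abate [ə'beit] v.减轻,减退;废除`. That layout is a guess based on the pattern attempted in the question, not something checked against the actual file. The names `read_docx_text` and `parse_entries` are hypothetical helpers, and `read_docx_text` needs the third-party python-docx package.

```python
import re

# Assumed entry layout (not verified against the real file):
#   headword, phonetics in square brackets, then the definition.
ENTRY = re.compile(r"\s*([A-Za-z][A-Za-z' -]*?)\s*\[([^\]]+)\]\s*(\S.*?)\s*$")


def read_docx_text(path):
    """Return the document's paragraphs joined by newlines (needs python-docx)."""
    from docx import Document  # imported here so parse_entries works without it
    return "\n".join(p.text for p in Document(path).paragraphs)


def parse_entries(text):
    """Parse lines into [word, [phonetics], [definition]] lists.

    Lines that do not fit the assumed layout are skipped.
    """
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            word, phonetics, definition = m.groups()
            entries.append([word, [phonetics], [definition]])
    return entries
```

Calling `parse_entries(read_docx_text(path))` on the original docx file would then give a nested list shaped like the format requested in the question. If the real file wraps entries across paragraphs or puts several on one line, the regular expression would need adjusting.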