How to import the content of an MS Word document and format it?

Question

I have a vocabulary list with phonetic transcriptions here (the extraction password is 1234). I import its contents by saving the .docx file as an XML file:

Import["D:\\360Downloads\\考博英语10000词汇表.docx", "Text", 
 CharacterEncoding -> "UTF-8"]
Import["C:\\Users\\***\\Desktop\\考博英语10000词汇表\\考博英语10000词汇表.xml", "CDATA", CharacterEncoding -> "UTF-8"]

However, after importing the XML file this way, the phonetic information contains many errors.

Import["C:\\Users\\***\\Desktop\\考博英语10000词汇表\\考博英语10000词汇表.xml", \
"Plaintext", CharacterEncoding -> "UTF-8"]

When I import the file with the code above, many unnecessary line breaks appear.

I tried to format the text with StringCases, but failed:

s = StringReplace[
   Import["C:\\Users\\***\\Desktop\\考博英语10000词汇表\\考博英语10000词汇表.\
xml", "Plaintext", CharacterEncoding -> "UTF-8"], "\n" -> ""];
StringCases[s, (x : CharacterRange["a", "z"] ..) ~~ (y__ /; 
     StringFreeQ[y, "["]) ~~ CharacterRange["a", "z"] .. ~~ 
   "[" ~~ __ ~~ "]" :> {x, y}]

How can I import the contents of this file in the following format?

{{abate,{ə'beit},{v．减轻，减退；废除}},{aberrant,{æ'berənt},{a.畸变的;异常的;脱离常轨的}},...{accent ,{'æksənt, æk'sent},{ n．腔调，口音；重音，重音符号,v．加重读}}...}

You could use the Python package python-docx. In Mathematica, you can use ExternalEvaluate to communicate with Python. (Comment by partida, Feb 2, 2021 at 0:57)
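
The suggestion in this comment could look roughly like the sketch below. It assumes a Python installation that Mathematica can find, with the python-docx package installed. The file path is a placeholder, and the entries are assumed to be stored as ordinary paragraphs rather than in a table.

```wolfram
(* Placeholder path; forward slashes avoid escaping issues in both languages *)
path = "C:/path/to/vocabulary.docx";

(* python-docx exposes the body paragraphs as Document(...).paragraphs.
   The value of the last Python line is what comes back to Mathematica. *)
paragraphs = ExternalEvaluate["Python",
   "import docx
[p.text for p in docx.Document('" <> path <> "').paragraphs]"];
```

If this works in your setup, paragraphs should be a list of strings, one per paragraph, which can then be processed with the usual string functions.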

A little mouse on the pampas · Accepted Answer · 2021-02-02 07:00:22Z

A simple way (note that the last entry is dropped):

s = StringReplace[
   Import["C:\\Users\\***\\Desktop\\考博英语10000词汇表\\考博英语10000词汇表.\
xml", "Plaintext", CharacterEncoding -> "UTF-8"], "\n" -> ""];
sol = StringSplit[s, 
   x : ((CharacterRange["a", "z"] | Whitespace) .. ~~ "[" ~~ 
       Shortest[y__] ~~ "]") :> Style[x, FontColor -> Red]] ;
Partition[sol // Rest // Most, 2]
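
To get closer to the {word, {phonetic}, {definition}} triples requested in the question, one option is a single StringCases call with a regular expression. This is a minimal sketch and not part of the code above. The sample string is a made-up stand-in for the newline-free string s, and the pattern assumes single-word lowercase headwords and definitions that contain no "[" character.

```wolfram
(* Hypothetical sample standing in for the imported, newline-free string s *)
sample = "abate [ə'beit] v．减轻，减退；废除 aberrant [æ'berənt] a.畸变的;异常的;脱离常轨的";

(* $1 = headword, $2 = text inside the brackets, $3 = everything up to
   the next "headword [" or the end of the string *)
entryPattern = RegularExpression[
   "([a-z]+)\\s*\\[(.*?)\\]\\s*([^\\[]*?)(?=\\s*[a-z]+\\s*\\[|$)"];

StringCases[sample, entryPattern :> {"$1", {"$2"}, {"$3"}}]
```

Because the definition is delimited by a lookahead instead of by Partition, the final entry is kept as well. Multi-word headwords would need a broader first group.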
