$\begingroup$

I have a vocabulary list with phonetic transcriptions here (the extraction password is 1234). I import its contents by saving the docx file as an XML file.

Import["D:\\360Downloads\\考博英语10000词汇表.docx", "Text", 
 CharacterEncoding -> "UTF-8"]
Import["C:\\Users\\***\\Desktop\\考博英语10000词汇表\\考博英语10000词汇表.xml", "CDATA", CharacterEncoding -> "UTF-8"]

However, after importing the XML file this way, the phonetic information contains many errors.

Import["C:\\Users\\***\\Desktop\\考博英语10000词汇表\\考博英语10000词汇表.xml", \
"Plaintext", CharacterEncoding -> "UTF-8"]

When I use the above code to import the data of this file, many unnecessary line breaks appear.

I tried to parse it with StringCases, but that failed:

s = StringReplace[
   Import["C:\\Users\\***\\Desktop\\考博英语10000词汇表\\考博英语10000词汇表.\
xml", "Plaintext", CharacterEncoding -> "UTF-8"], "\n" -> ""];
StringCases[s, (x : CharacterRange["a", "z"] ..) ~~ (y__ /; 
     StringFreeQ[y, "["]) ~~ CharacterRange["a", "z"] .. ~~ 
   "[" ~~ __ ~~ "]" :> {x, y}]

How can I import the contents of this file in the following format?

{{abate,{ə'beit},{v.减轻,减退;废除}},{aberrant,{æ'berənt},{a.畸变的;异常的;脱离常轨的}},...{accent ,{'æksənt, æk'sent},{ n.腔调,口音;重音,重音符号,v.加重读}}...}
$\endgroup$
  • 1
    $\begingroup$ You could use the Python package python-docx. In Mathematica, you can call Python through ExternalEvaluate. $\endgroup$
    – partida
    Commented Feb 2, 2021 at 0:57
  • $\begingroup$ @partida Thank you very much for your advice. $\endgroup$ Commented Feb 2, 2021 at 1:06

1 Answer

$\begingroup$

A simple way (note that it discards the last set of words):

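(* Read the file as plain text and strip the stray line breaks *)
s = StringReplace[
   Import["C:\\Users\\***\\Desktop\\考博英语10000词汇表\\考博英语10000词汇表.\
xml", "Plaintext", CharacterEncoding -> "UTF-8"], "\n" -> ""];
(* Split at every "headword [phonetic]" delimiter; the rule keeps the delimiter, styled red *)
sol = StringSplit[s, 
   x : ((CharacterRange["a", "z"] | Whitespace) .. ~~ "[" ~~ 
       Shortest[y__] ~~ "]") :> Style[x, FontColor -> Red]] ;
(* Drop the first and last pieces, then pair each headword with its definition *)
Partition[sol // Rest // Most, 2]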
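
To go from these pairs to the nested format requested in the question, one option is a single regular expression with a lookahead. This is only a minimal sketch and has not been run against the actual file. `sample` is a hypothetical stand-in for the newline-free string `s`. The pattern assumes that headwords consist only of lowercase ASCII letters, that each phonetic transcription sits between `[` and `]`, and that no definition contains lowercase letters directly followed by `[`.

```wolfram
(* Hypothetical sample standing in for the imported string s (line breaks already removed) *)
sample = "abate [ə'beit] v.减轻,减退;废除aberrant [æ'berənt] a.畸变的;异常的;脱离常轨的";

(* $1 = headword, $2 = text inside the brackets,
   $3 = definition, matched lazily up to the next headword-plus-bracket or the end of the string *)
entries = StringCases[sample,
   RegularExpression["([a-z]+)\\s*\\[(.*?)\\]\\s*(.*?)(?=[a-z]+\\s*\\[|$)"] :>
    {"$1", {"$2"}, {"$3"}}]
```

Each element of `entries` should have the shape `{word, {phonetic}, {definition}}`. For the real data, replace `sample` with `s`. If some headwords contain spaces, as the `Whitespace` alternative in the pattern above suggests, the `[a-z]+` parts would need to be widened accordingly.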

$\endgroup$
