Python element tree - extract text from element, stripping tags

Question

With ElementTree in Python, how can I extract all the text from a node, stripping any tags in that element and keeping only the text?

For example, say I have the following:

<tag>
  Some <a>example</a> text
</tag>

I want to return Some example text. How do I go about doing this? So far, the approaches I've taken have had fairly disastrous outcomes.

IIRC BeautifulSoup has some simple ways to take care of that... — Wayne Werner, Commented Oct 14, 2013 at 21:54
If possible, I'd like to avoid using additional external libraries — Brandon, Commented Oct 14, 2013 at 21:57
Undoubtedly it would be incorrect (I think) because regex is bad for XML, but you could try something like re.sub(r'\<.*?\>', '', text). — Wayne Werner, Commented Oct 14, 2013 at 21:59

Tomalak · Accepted Answer · 2018-11-13 17:33:10Z

If you are running under Python 2.7 or 3.2+, you can use itertext().

itertext creates a text iterator which loops over this element and all subelements, in document order, and returns all inner text:

import xml.etree.ElementTree as ET
xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))

# -> 'Some example text'
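
The same call works on the question's original, indented markup, but the newlines and indentation come along with the text. A minimal sketch, assuming Python 3 and that collapsing every run of whitespace to a single space is acceptable (the helper name inner_text is made up for illustration, not part of ElementTree):

```python
import xml.etree.ElementTree as ET


def inner_text(element):
    """Return all text inside element, with whitespace runs collapsed."""
    # itertext() yields each text and tail fragment in document order.
    raw = ''.join(element.itertext())
    # str.split() with no arguments splits on any run of whitespace and
    # ignores leading and trailing whitespace.
    return ' '.join(raw.split())


xml = '''<tag>
  Some <a>example</a> text
</tag>'''
tree = ET.fromstring(xml)
print(inner_text(tree))

# -> 'Some example text'
```

Note that this also collapses whitespace that was meaningful in the source, so it may not suit preformatted content.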

If you are running an older version of Python, you can reuse the implementation of itertext() by attaching it to the Element class, after which you can call it exactly as above:

import xml.etree.ElementTree as ET

# original implementation of .itertext() for Python 2.7
def itertext(self):
    tag = self.tag
    if not isinstance(tag, basestring) and tag is not None:
        return
    if self.text:
        yield self.text
    for e in self:
        for s in e.itertext():
            yield s
        if e.tail:
            yield e.tail

# if necessary, monkey-patch the Element class
if 'itertext' not in ET.Element.__dict__:
    ET.Element.itertext = itertext

xml = '<tag>Some <a>example</a> text</tag>'
tree = ET.fromstring(xml)
print(''.join(tree.itertext()))

# -> 'Some example text'

abarnert · Accepted Answer · 2013-10-14 22:19:17Z

As the documentation says, if you want to read only the text, without any intermediate tags, you have to recursively concatenate all text and tail attributes in the correct order.

However, recent-enough versions (including the ones in the stdlib in 2.7 and 3.2, but not 2.6 or 3.1, and the current released versions of both ElementTree and lxml on PyPI) can do this for you automatically in the tostring method:

>>> from xml.etree import ElementTree
>>> s = '''<tag>
...   Some <a>example</a> text
... </tag>'''
>>> t = ElementTree.fromstring(s)
>>> ElementTree.tostring(t, method='text')
'\n  Some example text\n'

If you also want to strip whitespace from the text, you'll need to do so manually. In your simple case, that's easy:

>>> ElementTree.tostring(t, method='text').strip()
'Some example text'
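
On Python 3, tostring() returns bytes unless told otherwise. A sketch of the same call, assuming Python 3 and passing encoding='unicode' so that a str comes back:

```python
import xml.etree.ElementTree as ET

s = '''<tag>
  Some <a>example</a> text
</tag>'''
t = ET.fromstring(s)

# method='text' drops the tags; encoding='unicode' makes tostring()
# return a str instead of bytes.
text = ET.tostring(t, encoding='unicode', method='text')
print(repr(text))
print(repr(text.strip()))
```

Without the encoding argument the result is a bytes object, which still supports .strip() but would need decoding before being mixed with str values.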

In more complicated cases, however, where you want to strip out whitespace within intermediate tags, you'll probably have to fall back on recursively processing the texts and tails. That's not too hard; you just have to remember to deal with the possibility that the attributes may be None. For example, here's a skeleton you can hook your own code on:

def textify(t):
    # Collect this element's text, its children's text, then its tail.
    s = []
    if t.text:
        s.append(t.text)
    for child in t:  # getchildren() was removed in Python 3.9
        s.append(textify(child))
    if t.tail:
        s.append(t.tail)
    return ''.join(s)

This version only works when text and tail are guaranteed to be a str or None. For trees you build up manually, that's not guaranteed to be true.
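
One way to fill in that skeleton, offered only as a sketch: strip each text and tail fragment on its own, drop the empty ones, and join the rest with single spaces. The name textify_stripped and the join-with-a-space policy are assumptions for illustration, not part of ElementTree. Unlike the skeleton, this variant ignores the root element's own tail.

```python
import xml.etree.ElementTree as ET


def textify_stripped(element):
    """Collect the text inside element, stripping each fragment."""
    pieces = []

    def walk(node):
        # text and tail may be None, so guard before stripping.
        if node.text and node.text.strip():
            pieces.append(node.text.strip())
        for child in node:
            walk(child)
            # A child's tail is the text that follows its closing tag.
            if child.tail and child.tail.strip():
                pieces.append(child.tail.strip())

    walk(element)
    return ' '.join(pieces)


doc = ET.fromstring('<tag>\n  Some <a>  example  </a> text\n</tag>')
print(textify_stripped(doc))

# -> 'Some example text'
```

Joining with a space inserts one even where the source had none between fragments (for example foo<b>bar</b> becomes "foo bar"), so adjust the join if that matters for your data.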

Michal · Accepted Answer · 2018-09-22 11:49:45Z

There is also a very simple solution if you can use XPath. It relies on XPath axes: more about them can be found here.

Suppose a node (such as a div) contains text and possibly other nodes (such as a, center, or another div) that contain text of their own, and we want to select all the text inside that div. This can be done with the following XPath: current_element.xpath("descendant-or-self::*/text()").extract(). The result is a list of all the text pieces within the current element, with any inner tags stripped.

What's nice about it is that no recursive function is needed. XPath takes care of all of this (using recursion itself, but for us it's as clean as it can be).

Here is a Stack Overflow question concerning this proposed solution.

n.b.: This applies only to lxml. The xml.etree package does not know enough XPath to do this. — Tomalak, Commented Nov 13, 2018 at 16:56
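
For readers staying with the standard library, a rough counterpart is to materialise itertext() as a list, which gives the individual text fragments in document order rather than one joined string. A sketch, assuming Python 3; the sample markup is invented for illustration:

```python
import xml.etree.ElementTree as ET

div = ET.fromstring('<div>Hello <a>link</a> and <b>bold</b> tail</div>')

# Each element's text and each child's tail comes out as its own item.
fragments = list(div.itertext())
print(fragments)

# -> ['Hello ', 'link', ' and ', 'bold', ' tail']
```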
