how to remove an element in lxml

Question

I need to completely remove elements, based on the contents of an attribute, using python's lxml. Example:

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
  pass  # remove this element from the tree

print(et.tostring(tree, pretty_print=True).decode())

I would like this to print:

<groceries>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

Is there a way to do this without building the output manually in a temporary variable, like this:

newxml="<groceries>\n"
for elt in tree.xpath("//fruit[@state='fresh']"):
  newxml+=et.tostring(elt, encoding="unicode")

newxml+="</groceries>"

Cédric Julien · Accepted Answer · answered 2011-11-02, last edited 2024-01-31 11:59:00Z by Benjamin Loison

192

Use the remove() method of the element's parent:

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
  bad.getparent().remove(bad)     # get the element's parent and call remove() on it

print(et.tostring(tree, pretty_print=True, xml_declaration=True).decode())

Compared with @Acorn's original version, mine also works when the elements to remove are not direct children of the root node of your XML.
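
The difference can be seen in a small sketch. It assumes a made-up document in which the rotten fruit sits one level below the root, and shows that remove() only accepts a direct child (lxml raises ValueError otherwise), whereas going through getparent() works at any depth:

```python
import lxml.etree as et

# Made-up document: the rotten fruit is nested inside <punnet>.
root = et.fromstring(
    "<groceries><punnet><fruit state='rotten'>strawberry</fruit></punnet></groceries>"
)
nested = root.xpath("//fruit[@state='rotten']")[0]

# remove() only accepts a direct child of the element it is called on.
try:
    root.remove(nested)
    removed_via_root = True
except ValueError:
    removed_via_root = False

# Going through the parent works regardless of depth.
nested.getparent().remove(nested)
remaining = root.xpath("//fruit")

print(removed_via_root, len(remaining))
```

The names root, nested, removed_via_root and remaining are only illustrative.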

answered Nov 2, 2011 at 14:22 by Cédric Julien
edited Jan 31 at 11:59 by Benjamin Loison

2

Can you comment on the differences between this answer and the one provided by Acorn?
– ewok
Commented Nov 2, 2011 at 14:27
2

It's a shame the Element class doesn't have a 'pop' method.
– Michael Mulich
Commented Aug 28, 2015 at 18:17
It's a shame XPath can only be used to select elements. It is like SQL with only SELECT statements.
– Eric Chow
Commented Jan 12, 2021 at 8:44
The remove function detaches an element from the tree and therefore removes the XML node (Element, PI or Comment), its content (the descendant items) and the tail text. Here, preserving the tail text is superfluous because it only contains whitespace and a newline. But in some situations you may need to keep it…
– Laurent LAPORTE
Commented Mar 17, 2021 at 8:54
To preserve the tail text and optionally keep the element content, you can consider using the remove_node function defined below.
– Laurent LAPORTE
Commented Mar 17, 2021 at 9:28


Acorn · Answer · 2011-11-02 14:39:46Z

32

You're looking for the remove function. Call remove() on the parent of the element you want to delete and pass it that element.

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="rotten">strawberry</fruit>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)

print(et.tostring(tree, pretty_print=True).decode())

Result:

<groceries>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

answered Nov 2, 2011 at 14:22 by Acorn
edited Nov 2, 2011 at 14:39

You've just got all the lxml-related answers for me, haven't you? ;-)
– ewok
Commented Nov 2, 2011 at 14:25
Can you comment on the differences between this answer and the one provided by Cedric?
– ewok
Commented Nov 2, 2011 at 14:27
4

Ah, I overlooked the fact that .remove() requires the element to be a child of the element you are calling it on. So you need to call it on the parent of the element you want to remove. Answer corrected.
– Acorn
Commented Nov 2, 2011 at 14:34
@Acorn: that's it. If the element to remove were not directly under the root node, it would have failed.
– Cédric Julien
Commented Nov 2, 2011 at 14:38
19

@ewok: give Cédric the accept as he answered 1 second earlier than me, and more importantly, his answer was correct :)
– Acorn
Commented Nov 2, 2011 at 14:47


zephor · Answer · 2018-06-04 03:57:18Z

I ran into this situation:

<div>
    <script>
        some code
    </script>
    text here
</div>

div.remove(script) also removes the "text here" part, which I didn't intend.

Following the answer here, I found that etree.strip_elements is a better solution for me: its with_tail=(bool) parameter controls whether the text after the element is removed as well.

I still don't know whether this can be combined with an XPath filter on the tag, though. I'm posting it here for information.

Here is the doc:

strip_elements(tree_or_element, *tag_names, with_tail=True)

Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.

Tag names can contain wildcards as in _Element.iter.

Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants. If you want to include the root element, check its tag name directly before even calling this function.

Example usage::
   strip_elements(some_element,
       'simpletagname',             # non-namespaced tag
       '{http://some/ns}tagname',   # namespaced tag
       '{http://some/other/ns}*',   # any tag from a namespace
       lxml.etree.Comment           # comments
       )
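
As a minimal sketch of the with_tail option described above, applied to the div/script situation from this answer (the variable names are only illustrative):

```python
from lxml import etree

snippet = "<div><script>some code</script>text here</div>"

# Default (with_tail=True): the tail text goes away with the element.
default_root = etree.fromstring(snippet)
etree.strip_elements(default_root, "script")

# with_tail=False: the element is removed but its tail text stays.
keep_root = etree.fromstring(snippet)
etree.strip_elements(keep_root, "script", with_tail=False)

print(etree.tostring(default_root, encoding="unicode"))
print(etree.tostring(keep_root, encoding="unicode"))
```

Keep in mind that strip_elements matches by tag name only, so it cannot by itself filter on an attribute value the way the XPath expression in the question does.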

Note that strip_elements (and strip_tags too) removes all descendant elements whose tag name matches one of the tag_names.
– Laurent LAPORTE
Commented Mar 17, 2021 at 9:26

Benjamin Loison · Answer · 2023-12-08 23:51:24Z

As already mentioned, you can use the remove() method to delete (sub)elements from the tree:

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
  bad.getparent().remove(bad)

But it removes the element including its tail, which is a problem if you are processing mixed-content documents like HTML:

<div><fruit state="rotten">avocado</fruit> Hello!</div>

Becomes

<div></div>

which is probably not what you always want :) I have created a helper function that removes just the element and keeps its tail:

def remove_element(el):
    parent = el.getparent()
    # tail is None when no text follows the element
    if el.tail and el.tail.strip():
        prev = el.getprevious()
        # compare with None: an element without children is falsy in lxml
        if prev is not None:
            prev.tail = (prev.tail or '') + el.tail
        else:
            parent.text = (parent.text or '') + el.tail
    parent.remove(el)

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
    remove_element(bad)

This way it will keep the tail text:

<div> Hello!</div>

Check that el.tail is not None first, as that case can occur.
– Eivydas Vilčinskas
Commented Jan 17, 2019 at 11:07
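
Two lxml details matter for a helper like this. The sketch below uses a made-up document to show them: tail is None when no text follows an element, so calling .strip() on it directly raises AttributeError, and the element returned by getprevious() may have no children, which is why an explicit comparison with None is safer than a truthiness test (as far as I know, lxml follows the ElementTree convention of treating a childless element as false).

```python
from lxml import etree

root = etree.fromstring("<div><a>x</a><b>y</b> Hello!</div>")
a, b = root

# No text follows <a>, so its tail is None rather than "".
a_tail = a.tail

# <b> is followed by text, which lxml stores as b.tail.
b_tail = b.tail

# getprevious() returns an element or None; <a> has no children,
# so test "is not None" instead of relying on truthiness.
prev = b.getprevious()
has_prev = prev is not None
prev_children = len(prev)

# A None-safe way to ask "is there a non-blank tail?"
a_has_text_tail = bool(a_tail and a_tail.strip())
b_has_text_tail = bool(b_tail and b_tail.strip())

print(a_tail, repr(b_tail), has_prev, prev_children)
```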

Benjamin Loison · Answer · 2023-12-08 23:55:19Z

You could also use lxml's html module to solve this:

from lxml import html

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree = html.fromstring(xml)

print("//BEFORE")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

for i in tree.xpath("//fruit[@state='rotten']"):
    i.drop_tree()

print("//AFTER")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

It should output this:

//BEFORE
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>


//AFTER
<groceries>
  
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  
  <fruit state="fresh">peach</fruit>
</groceries>

Laurent LAPORTE · Answer · 2021-03-17 09:21:53Z

The remove function detaches an element from the tree and therefore removes the XML node (Element, PI or Comment), its content (the descendant items) and the tail text. Here, preserving the tail text is superfluous because it only contains spaces and a newline, which can be considered ignorable whitespace.

To remove an element (and its content) while preserving its tail, you can use the following function:

def remove_node(child, keep_content=False):
    """
    Remove an XML element, preserving its tail text.

    :param child: XML element to remove
    :param keep_content: ``True`` to keep child text and sub-elements.
    """
    parent = child.getparent()
    parent_text = parent.text or u""
    prev_node = child.getprevious()
    if keep_content:
        # insert: child text
        child_text = child.text or u""
        if prev_node is None:
            parent.text = u"{0}{1}".format(parent_text, child_text) or None
        else:
            prev_tail = prev_node.tail or u""
            prev_node.tail = u"{0}{1}".format(prev_tail, child_text) or None
        # insert: child elements
        index = parent.index(child)
        parent[index:index] = child[:]
    # insert: child tail
    parent_text = parent.text or u""
    prev_node = child.getprevious()
    child_tail = child.tail or u""
    if prev_node is None:
        parent.text = u"{0}{1}".format(parent_text, child_tail) or None
    else:
        prev_tail = prev_node.tail or u""
        prev_node.tail = u"{0}{1}".format(prev_tail, child_tail) or None
    # remove: child
    parent.remove(child)

Here is a demo:

from lxml import etree

tree = etree.XML(u"<root>text <bad>before <bad>inner</bad> after</bad> tail</root>")
bad1 = tree.xpath("//bad[1]")[0]
remove_node(bad1)

etree.dump(tree)
# <root>text  tail</root>

If you want to preserve the content, you can do:

tree = etree.XML(u"<root>text <bad>before <bad>inner</bad> after</bad> tail</root>")
bad1 = tree.xpath("//bad[1]")[0]
remove_node(bad1, keep_content=True)

etree.dump(tree)
# <root>text before <bad>inner</bad> after tail</root>
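
For comparison, lxml's built-in etree.strip_tags also keeps text and children while dropping the tags, but it applies to every matching descendant rather than to one chosen node. A minimal sketch on the same demo document:

```python
from lxml import etree

tree = etree.XML("<root>text <bad>before <bad>inner</bad> after</bad> tail</root>")

# Unlike remove_node(bad1, keep_content=True), this unwraps BOTH <bad> elements.
etree.strip_tags(tree, "bad")

result = etree.tostring(tree, encoding="unicode")
print(result)
```

So strip_tags fits when all elements with a given tag should be unwrapped, while a per-node helper is needed when only selected elements (for example, the result of an XPath query) should go.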
