BeautifulSoup getText from between <p>, not picking up subsequent paragraphs

Question

Firstly, I am a complete newbie when it comes to Python. However, I have written a piece of code to look at an RSS feed, open the link and extract the text from the article. This is what I have so far:

from BeautifulSoup import BeautifulSoup
import feedparser
import re
import urllib

# Dictionaries
links = {}
titles = {}

# Variables
n = 0

rss_url = "feed://www.gfsc.gg/_layouts/GFSC/GFSCRSSFeed.aspx?Division=ALL&Article=All&Title=News&Type=doc&List=%7b66fa9b18-776a-4e91-9f80-30195001386c%7d%23%7b679e913e-6301-4bc4-9fd9-a788b926f565%7d%23%7b0e65f37f-1129-4c78-8f59-3db5f96409fd%7d%23%7bdd7c290d-5f17-43b7-b6fd-50089368e090%7d%23%7b4790a972-c55f-46a5-8020-396780eb8506%7d%23%7b6b67c085-7c25-458d-8a98-373e0ac71c52%7d%23%7be3b71b9c-30ce-47c0-8bfb-f3224e98b756%7d%23%7b25853d98-37d7-4ba2-83f9-78685f2070df%7d%23%7b14c41f90-c462-44cf-a773-878521aa007c%7d%23%7b7ceaf3bf-d501-4f60-a3e4-2af84d0e1528%7d%23%7baf17e955-96b7-49e9-ad8a-7ee0ac097f37%7d%23%7b3faca1d0-be40-445c-a577-c742c2d367a8%7d%23%7b6296a8d6-7cab-4609-b7f7-b6b7c3a264d6%7d%23%7b43e2b52d-e4f1-4628-84ad-0042d644deaf%7d"

# Parse the RSS feed
feed = feedparser.parse(rss_url)

# view the entire feed, one entry at a time
for post in feed.entries:
    # Create variables from posts
    link = post.link
    title = post.title
    # Add the link to the dictionary
    n += 1
    links[n] = link

for k,v in links.items():
    # Open RSS feed
    page = urllib.urlopen(v).read()
    page = str(page)
    soup = BeautifulSoup(page)

    # Find all of the text between paragraph tags and strip out the html
    page = soup.find('p').getText()

    # Strip ampersand codes and WATCH:
    page = re.sub('&\w+;','',page)
    page = re.sub('WATCH:','',page)

    # Print Page
    print(page)
    print(" ")

    # To stop after 3rd article, just whilst testing ** to be removed **
    if (k >= 3):
        break

This produces the following output:

>>> (executing lines 1 to 45 of "RSS_BeautifulSoup.py")
Total deposits held with Guernsey banks at the end of June 2012 increased 2.1% in sterling terms by £2.1 billion from the end of March 2012 level of £101 billion, up to £103.1 billion. This is 9.4% lower than the same time a year ago.  Total assets and liabilities increased by £2.9 billion to £131.2 billion representing a 2.3% increase over the quarter though this was 5.7% lower than the level a year ago.  The higher figures reflected the effects both of volume and exchange rate factors.

The net asset value of total funds under management and administration has increased over the quarter ended 30 June 2012 by £711 million (0.3%) to reach £270.8 billion.For the year since 30 June 2011, total net asset values decreased by £3.6 billion (1.3%).

The Commission has updated the warranties on the Form REG, Form QIF and Form FTL to take into account the Commission’s Guidance Notes on Personal Questionnaires and Personal Declarations.  In particular, the following warranty (varies slightly dependent on the application) has been inserted in the aforementioned forms,

>>>

The problem is that this is only the first paragraph of each article, but I need to show the entire article. Any help would be gratefully received.

Just an FYI, you can use soup = BeautifulSoup(urllib.urlopen(v)) to create soup objects. — Blender, Commented Sep 17, 2012 at 0:54
Also, word on the street is that if you're just learning BeautifulSoup you're better off with bs4. — Amanda, Commented Oct 26, 2012 at 13:02

Amanda · Accepted Answer · 2012-10-26 13:09:27Z

121

You are getting close!

# Find all of the text between paragraph tags and strip out the html
page = soup.find('p').getText()

Using find (as you've noticed) stops after finding one result. You need find_all (spelled findAll in BeautifulSoup 3) if you want all the paragraphs. If the pages are formatted consistently (I just looked over one), you could also use something like

soup.find('div',{'id':'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})

to zero in on the body of the article.
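As a rough illustration of both suggestions, here is a minimal sketch that assumes bs4 rather than the BeautifulSoup 3 import used in the question. The sample HTML, the `article-body` id, and the `article_text` helper are all made up for the example and are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the id "article-body" is invented for this sketch.
SAMPLE_HTML = """
<html><body>
  <p>Site navigation blurb.</p>
  <div id="article-body">
    <p>First paragraph of the article.</p>
    <p>Second paragraph of the article.</p>
  </div>
</body></html>
"""


def article_text(html, container_id="article-body"):
    """Return the text of every <p> inside the container, one per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Narrow the search to the article container, if the page has one.
    container = soup.find("div", {"id": container_id})
    if container is None:
        # No such container: fall back to searching the whole page.
        container = soup
    # find_all returns every match, whereas find returns only the first.
    paragraphs = container.find_all("p")
    return "\n".join(p.get_text().strip() for p in paragraphs)


print(article_text(SAMPLE_HTML))
```

Restricting the search to the container keeps navigation and footer paragraphs out of the result, which matters once find_all starts returning every <p> on the page.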

edited Oct 26, 2012 at 13:09

answered Oct 26, 2012 at 13:00


14

Using soup.find('p').get_text() also works (in order to conform to PEP 8).
– user5307109
Commented Jul 20, 2017 at 0:03
Only gets a single occurrence. If there are multiple paragraphs, then use: stackoverflow.com/a/69325288/6907424
– hafiz031
Commented Sep 27, 2022 at 4:02


connorbode · Accepted Answer · 2022-03-20 15:53:43Z

This works well for specific articles where the text is all wrapped in <p> tags. Since the web is an ugly place, it's not always the case.

Often, websites will have text scattered all over, wrapped in different types of tags (e.g. maybe in a <span> or a <div>, or an <li>).

To find all text nodes in the DOM, you can use soup.find_all(text=True).

This is going to return some undesired text, like the contents of <script> and <style> tags. You'll need to filter out the text contents of elements you don't want.

blocklist = [
  'style',
  'script',
  # other elements,
]

text_elements = [t for t in soup.find_all(text=True) if t.parent.name not in blocklist]

If you are working with a known set of tags, you can take the opposite approach:

allowlist = [
  'p'
]

text_elements = [t for t in soup.find_all(text=True) if t.parent.name in allowlist]
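To see the blocklist filter end to end, here is a small self-contained sketch. The sample HTML and the `visible_text` helper are invented for illustration. It uses `string=True`, which is the newer name for the `text=True` argument in bs4 (4.4.0 and later), and it also drops whitespace-only nodes, which the snippets above would keep.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page used only for this sketch.
SAMPLE_HTML = """
<html><head><style>p { color: red; }</style></head>
<body>
  <p>Paragraph text.</p>
  <ul><li>List item text.</li></ul>
  <script>var hidden = 1;</script>
</body></html>
"""

# Tags whose text content should not be treated as visible text.
BLOCKLIST = {"style", "script"}


def visible_text(html, blocklist=BLOCKLIST):
    """Return the non-empty text nodes whose parent tag is not blocked."""
    soup = BeautifulSoup(html, "html.parser")
    chunks = []
    for node in soup.find_all(string=True):
        # Skip text that lives directly inside a blocked element.
        if node.parent.name in blocklist:
            continue
        text = node.strip()
        # Ignore nodes that contain only whitespace between tags.
        if text:
            chunks.append(text)
    return chunks


print(visible_text(SAMPLE_HTML))
```

Note that the check looks only at the immediate parent, so text nested deeper inside a blocked element (for example a <span> within a blocked <nav>) would still get through; walking `node.parents` would be one way to tighten that.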

Would you mind updating the terminology to use equitable language? Guide can be found here. help.sap.com/doc/b0322267728e48a28b0c8ee7dd1ab4c7/1.0/en-US/… — ldmtwo, Commented Mar 19, 2022 at 1:05

Just Me · Accepted Answer · 2021-09-25 10:32:14Z

10

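Call get_text on each paragraph returned by find_all:

import requests
from bs4 import BeautifulSoup

# getdata was not defined in the original snippet; assumed here to fetch the page HTML
def getdata(url):
    return requests.get(url).text

htmldata = getdata("https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed")
soup = BeautifulSoup(htmldata, 'html.parser')
for data in soup.find_all("p"):
    print(data.get_text())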

answered Sep 25, 2021 at 10:32
