I have written a script to extract some information from a PDF file. Each page is read as blocks. If a block starting with [V2G is found, the script should save it together with the title, the subtitle, and the bulleted list.
My code:
from collections import namedtuple

data = []
req = namedtuple('Req', 'a b c d e f')
for page in doc:
    dic = page.get_text("dict")
    blocks = dic['blocks']  # text blocks
    ...
    for b in blocks:
        # title
        if font1 == 'Cambria-Bold':
            nr = text.partition(" ")[0]
            title = text.partition(" ")[2]
        # subtitle
        elif font1 == 'Cambria':
            Sub_nr = text.partition(" ")[0]
            sub_title = text.partition(" ")[2]
        # text
        elif text.startswith('[V2G'):
            id = text.replace('[', '')
            txt = text1.strip()
            data.append(req(nr, title, Sub_nr, sub_title, id, txt))
        # bulleted list after the text (this is the part that does not work)
        elif text.startswith("—"):
            text += "\n" + text
The problem is the bulleted list, because it is located in the block after the [V2G text. Also, not every block that begins with [V2G is followed by a bulleted list.
So how can I combine the text of the bulleted list with the text from the txt variable and store it in the namedtuple field f?
I would then like to put it in the last row of my list.
Is it possible to modify just one field of a namedtuple and update the last element of my list (data) without touching the other fields?
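To make the goal concrete, here is a minimal sketch of one possible approach, not a confirmed solution. It assumes the blocks have already been flattened into (font, text) pairs, that the requirement id and its text sit in the same string, and that a bullet always belongs to the most recent [V2G entry. The sample strings, the requirement ids, and the names BULLET, in_req, and req_id are made up for illustration. Because namedtuples are immutable, the sketch uses the documented _replace method to build a copy with only field f changed and writes that copy back to data[-1].

```python
from collections import namedtuple

Req = namedtuple("Req", "a b c d e f")
BULLET = "\u2014"  # the dash character that starts each bullet line

# Hypothetical stand-in for the PDF blocks: (font, text) pairs.
blocks = [
    ("Cambria-Bold", "8 Application layer"),
    ("Cambria", "8.1 Session setup"),
    ("Other", "[V2G2-001] The EVCC shall send:"),
    ("Other", BULLET + " message A"),
    ("Other", BULLET + " message B"),
    ("Other", "[V2G2-002] The SECC shall respond."),
]

data = []
nr = title = sub_nr = sub_title = ""
in_req = False  # True while the last block seen belongs to a [V2G entry

for font, text in blocks:
    if font == "Cambria-Bold":      # title
        nr, _, title = text.partition(" ")
        in_req = False
    elif font == "Cambria":         # subtitle
        sub_nr, _, sub_title = text.partition(" ")
        in_req = False
    elif text.startswith("[V2G"):   # requirement text
        req_id, _, body = text.partition("] ")
        data.append(Req(nr, title, sub_nr, sub_title,
                        req_id.lstrip("["), body.strip()))
        in_req = True
    elif text.startswith(BULLET) and in_req:
        # Copy the last row with only field f extended, then store it back.
        last = data[-1]
        data[-1] = last._replace(f=last.f + "\n" + text)

for row in data:
    print(row)
```

Entries without a bulleted list are simply left as appended, so no special case is needed for them.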