
I have written a script to extract some information from a PDF file. Each page is read as blocks. If a block starting with [V2G is found, the script saves it together with the title, the subtitle, and the bulleted list.

My code:

from collections import namedtuple

data = []
req = namedtuple('Req', 'a b c d e f')

for page in doc:
    dic = page.get_text("dict")
    blocks = dic['blocks']  # text blocks
    ...

    for b in blocks:
        # title
        if font1 == 'Cambria-Bold':
            nr = text.partition(" ")[0]
            title = text.partition(" ")[2]
        # subtitle
        elif font1 == 'Cambria':
            Sub_nr = text.partition(" ")[0]
            sub_title = text.partition(" ")[2]
        # text
        elif text.startswith('[V2G'):
            id = text.replace('[', '')
            txt = text1.strip()
            data.append(req(nr, title, Sub_nr, sub_title, id, txt))
        # bulleted list after the text
        elif text.startswith("—"):
            text += "\n" + text  # problem: this never reaches the tuple already stored in data

The problem is the bulleted list, because it is located in the block after the text ([V2G). Also, not every entry that begins with [V2G has a bulleted list.

So how can I append the text of the bulleted list to the text in the txt variable and store it in the namedtuple field f?

I would then like to put it in the same (last) row of my list.

Is it possible to modify just one field of a namedtuple and apply it to the last element of my list (data) without touching the other fields?

The result should look like this: (image)

3
  • 2
    Tuples (named or normal) are immutable; you can't modify their elements.
    – Barmar
    Commented Apr 19, 2023 at 17:33
  • 1
    Please provide sample input and expected output. Without the full context, it is difficult to understand what you are trying to accomplish.
    – itprorh66
    Commented Apr 19, 2023 at 21:02
  • Thank you for your comment. I have provided the input and expected output as a picture.
    – user34088
    Commented Apr 20, 2023 at 9:15

1 Answer


As Barmar has pointed out in a comment above, (named) tuples are immutable. Why not use a dictionary instead? You could assign new items to this dictionary throughout the loop and update them as well. Appending the dictionary to the data list should of course be done as the last step in each iteration.
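If you would rather keep the namedtuple, a possible workaround is its `_replace` method, which returns a new tuple with the chosen fields changed, so the last list element can be swapped for an updated copy. This is only a minimal sketch: the field values and the bullet text are made-up placeholders, not data from the PDF in the question.

```python
from collections import namedtuple

Req = namedtuple("Req", "a b c d e f")

# Made-up sample row standing in for one extracted requirement.
data = [Req("8", "Requirements", "8.1", "General", "V2G2-001", "Requirement text")]

# Direct assignment fails because namedtuple fields are read-only.
try:
    data[-1].f = "changed"
except AttributeError:
    print("cannot assign to a namedtuple field")

# _replace builds a new tuple with only field f changed;
# storing it back replaces the last element of the list.
bullet = "— first bullet"
data[-1] = data[-1]._replace(f=data[-1].f + "\n" + bullet)
print(data[-1].f)
```

The other five fields are carried over unchanged, and the list keeps the same number of rows.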

I am not sure why the bullet point lists have to be treated separately from the other text. It is hard to answer this part of the question without knowing how text and text1 are assigned.

data = []
    
for page in doc:
    dic = page.get_text("dict")
    blocks = dic['blocks']  # text blocks
    ...
    
    for b in blocks : 
        req = {'txt': ""}

        # title
        if font1 == 'Cambria-Bold':
            req['nr'] = text.partition(" ")[0]
            req['title'] = text.partition(" ")[2] 
        
        # subtitle
        elif font1 == 'Cambria':
            req['sub_nr'] = text.partition(" ")[0] 
            req['sub_title'] = text.partition(" ")[2]
    
        # text
        elif text.startswith('[V2G'):                   
            req['id'] = text.replace('[', '')
            req['txt'] = text1.strip()
         
        # bulleted list after the text
        elif text.startswith("—"):
            req['txt'] += ("\n" + text)

        data.append(req)    

data is now a list of dictionaries, which pandas' DataFrame class will accept as input.
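If the goal is one row per [V2G entry with its bullets folded into the same txt field, one way to sketch it is to append a dictionary only when a [V2G block is seen and to extend `data[-1]['txt']` when a bullet block follows. The code below is an illustration under assumptions: `blocks` is a hypothetical list of (font, text) pairs standing in for whatever the elided part of the loop extracts, the sample strings are invented, and the ID and body are assumed to share one string separated by "] ".

```python
# Hypothetical stand-in for the (font, text) pairs pulled out of each block.
blocks = [
    ("Cambria-Bold", "8 Requirements"),
    ("Cambria", "8.1 General"),
    ("Other", "[V2G2-001] The charger shall do X:"),
    ("Other", "— option one"),
    ("Other", "— option two"),
    ("Other", "[V2G2-002] The charger shall do Y."),
]


def collect(blocks):
    data = []
    current = {}  # latest title/subtitle, carried over to later blocks
    for font, text in blocks:
        if font == "Cambria-Bold":  # title
            current["nr"], _, current["title"] = text.partition(" ")
        elif font == "Cambria":  # subtitle
            current["sub_nr"], _, current["sub_title"] = text.partition(" ")
        elif text.startswith("[V2G"):  # requirement text: start a new row
            req_id, _, body = text.partition("] ")
            data.append({**current, "id": req_id.lstrip("["), "txt": body.strip()})
        elif text.startswith("—") and data:  # bullet: extend the last row
            data[-1]["txt"] += "\n" + text
    return data


rows = collect(blocks)
for row in rows:
    print(row["id"], repr(row["txt"]))
```

Entries without a bulleted list simply keep their original txt, because nothing is appended to them.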
