5

I am trying to parse bank statements and credit card statements into a machine-readable format.

The bank statements are issued as (unprotected) Pdf documents.

I am looking for a tool that can convert these Pdf documents into something easier to process, like .csv or .xls

The data in these documents is shown in a weird table-like format, e.g.

DATE      SUBJECT                                   AMOUNT
          VARIABLE NUMBER OF FIELDS,
          DEPENDING ON THE TYPE
          OF TRANSACTION. 
          TWO FIELDS MAY BE ON THE SAME LINE
          AND A SINGLE FIELD MAY SPAN TWO LINES

DATE      SUBJECT                                   AMOUNT
....

Additional difficulties include:

  • A single entry can be broken up over two pages.
  • Header and footer have variable height and content.
  • Some pages contain advertisements.

So I am looking for a software with the following features:

  • The software should deal with the weird multi-line input format. Simply parsing everything as fixed width tables, with one row per input row, would require a lot of editing afterwards.

  • should run on Windows, ideally Windows XP

  • should not require MS Office 10 or newer. Ideally not an Office plugin at all, but a stand-alone program.

  • Should not require excessive manual interaction during conversion, but also not excessive manual correction afterwards. Some amount of interaction is good, if it reduces the need for post-editing.

  • an offline tool would be strongly preferred, since these files are confidential.

  • either open source or a big, renowned publisher are preferred, for the same reason.

  • a large support base and active community for this tool would be a big plus

  • free software preferred, but a moderate one-time investment would be acceptable (if clearly superior to the free solution)

  • The volume of these pdf files is limited. For now I am looking to convert less than 100 files, so batch-conversion of 10000 files is not a priority.

  • It is no problem if the software requires advanced knowledge about regexp, parsing, etc. It should just be easier than writing my own parser from scratch.

2 Answers 2

2

I had the same problem - Belgium's ING Bank now provides statements only as format pdf, with layout ± exactly as described in the post.

I tried 4 or 5 Open Source text extractors. The best was GhostScript - with thanks to this posting on stackoverflow;

GhostScript is the only O.S. extractor that I have found that renders lines as on page. (pdftotext and friends tend to break lines half way along.)

Then a little Perl script parsed the extracted txt into csv without problems. The columns in that final csv are Date;Balance;AMOUNT;text field1;text field2; ...

Dates are easy to spot with a regex qr(\d\d-\d\d\20\d\d).

The line «DATE SUBJECT AMOUNT» is also easy to spot with a regex that matches «+99.999,00» or «-1,23» (european way of writing numbers) at end of line.

The weird multiline format is rendered into csv as

  • weird line 1 in column text field 1,

  • weird line 2 in column text field 2,

  • etc.

I dont use that information very much so have not tried to mend any extraneous line breaks that there may be in the original.

One useful tip - in developing the Perl, add an extra column of "my calculation of the balance" and check that it always equals what the bank statement gives as the balance.

1
  • Correction - Belgium's ING Bank does provide statements as format csv; but I had to telephone the ING Bank help-desk to find out how to get them. Commented Jul 9, 2015 at 17:05
0

I had a similar challenge parsing PDF statements from TD Waterhouse. They have very weird structure that even changes sometimes during the year back and forth. I ended up writing a Python script with some relatively complex logic to process them. I used PDFMiner library. But to make it safe I had to put tons of different safety checks there - it was about extracting the data I used for my tax reports :)

Another crazy alternative I could imagine is OCR. Most likely no OCR will digest these nice-looking pages but I believe most of the financial statements have fixed layout. Thus, it would be easy to cut them into fragments (with something like ImageMagick) and then OCR each fragment separately. While sounds crazy, I think for some documents it may be a good option.

Not the answer you're looking for? Browse other questions tagged or ask your own question.