4
$\begingroup$

I have a row in an Excel file in which each cell contains an embedded PDF. The PDFs are embedded in Excel as objects. How can I load these PDF files into Mathematica?

Sorry, I do not know how to provide a sample Excel file.

This is what it looks like in Excel: [image]

$\endgroup$
3
  • $\begingroup$ Can you at least show a screenshot of how this PDF is embedded because there are multiple ways of doing it? Is there a particular formula in that cell? $\endgroup$
    – Domen
    Commented Jun 14 at 11:00
  • $\begingroup$ I added an image of how it looks in Excel. Because of the German language setting, the cell content is EINBETTEN..., which is German for "embed". I am not sure what would be shown with an English language setting. $\endgroup$
    – Eisbär
    Commented Jun 14 at 11:10
  • $\begingroup$ If someone knows a way for me to share an Excel file with an embedded PDF, I can provide one. $\endgroup$
    – Eisbär
    Commented Jun 14 at 12:04

1 Answer

5
$\begingroup$

An XLSX file is nothing more than a ZIP archive. If the PDFs were properly embedded in the Excel document, you will find them in the xl/embeddings folder, stored as oleObject###.bin files. These binary files contain your embedded files, with some metadata prepended. I didn't go into the details of this metadata header; however, if you know what type of file you have embedded, you can simply search for that file type's own header. All PDF files begin with the string %PDF-.
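To make the header search concrete, here is a minimal sketch on synthetic data. The byte list below is only a made-up stand-in for the contents of an oleObject###.bin file (five arbitrary "metadata" bytes followed by the start of a PDF); the real metadata layout is not assumed here.

```mathematica
(* Synthetic stand-in for an embedded object: 5 arbitrary metadata bytes,
   followed by the first bytes of a PDF. The values are invented for illustration. *)
bytes = Join[{1, 0, 2, 0, 255}, ToCharacterCode["%PDF-1.7"]];

(* The PDF header as a list of byte values *)
pdfHeader = ToCharacterCode["%PDF-"];

(* Position of the first byte of the first occurrence of the header *)
start = SequencePosition[bytes, pdfHeader][[1, 1]]

(* Everything from that position on is the PDF data *)
FromCharacterCode[bytes[[start ;;]]]
```

Here `start` evaluates to 6, because the header begins right after the five prepended bytes, and the last line gives back the string "%PDF-1.7".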

Long story short, you can use Import in Mathematica to extract files from the ZIP archive, find all oleObject###.bin files, locate the PDF header string in each of them, and import the data from that point on as a PDF document.

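importPDFsFromXLSX[xlsxFile_] := Module[{
   (* names of all files inside the XLSX (ZIP) archive *)
   files = Import[xlsxFile, {"ZIP", "FileNames"}],
   binFiles, pdfHeader = Normal@StringToByteArray["%PDF-"]
   },
  (* read every embedded object as a list of bytes;
     __ also matches multi-digit names such as oleObject10.bin *)
  binFiles = Import[xlsxFile, {"ZIP", 
   Select[files, StringEndsQ["oleObject" ~~ __ ~~ ".bin"]], "Byte"}];
  (* drop the metadata before the first %PDF- header, import the rest as PDF *)
  Table[ImportByteArray[
    ByteArray@binFile[[SequencePosition[binFile, pdfHeader][[1, 1]] ;;]], "PDF"], 
    {binFile, binFiles}]
  ]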
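The function above assumes that every oleObject file contains a PDF. If a workbook also embeds other object types, SequencePosition returns an empty list for them and the Part extraction fails. A minimal sketch of a guard follows; `pdfBytes` is a hypothetical helper name, not part of the function above, and it works on a single byte list.

```mathematica
(* pdfBytes is a hypothetical helper: it returns the bytes from the first
   %PDF- header onward, or Missing["NotAvailable"] if no header is found. *)
pdfBytes[bytes_List] := Module[{pos},
  (* look for the first occurrence of the header only *)
  pos = SequencePosition[bytes, ToCharacterCode["%PDF-"], 1];
  If[pos === {},
   Missing["NotAvailable"],
   ByteArray[bytes[[pos[[1, 1]] ;;]]]]
  ]

(* Synthetic examples: one byte list with a PDF header, one without *)
withPDF = pdfBytes[Join[{0, 1, 2}, ToCharacterCode["%PDF-1.4"]]];
withoutPDF = pdfBytes[{0, 1, 2, 3}];
```

With such a helper, something like `ImportByteArray[#, "PDF"] & /@ DeleteMissing[pdfBytes /@ binFiles]` would skip the objects that are not PDFs instead of failing on them.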

Below, I've created a test Excel file with four embedded PDFs (VDELAJ is the Slovene translation of your EINBETTEN or English EMBED function).

enter image description here

Now we simply import the file using the function above.

enter image description here

$\endgroup$
1
  • $\begingroup$ :-))) Great work! It works perfectly for me. I imported all objects. One remark: in the definition, I added a _ in "oleObject" ~~ _ ~~ ".bin" to get more bin files selected. $\endgroup$
    – Eisbär
    Commented Jun 14 at 14:52
