I have a row in an Excel file in which every cell contains one embedded PDF. The PDFs are embedded into Excel as objects. How can I load these PDF files into Mathematica?
Sorry, I do not know how to provide a sample Excel file.
An XLSX file is nothing more than a ZIP archive. If the PDFs were properly embedded in the Excel document, you will find them in the xl/embeddings folder, stored as oleObject###.bin files. These binary files are your embedded files, but with some metadata prepended to them. I didn't go into the details of this metadata header; however, if you know what type of file you have embedded, you can simply look for the header of that file type. All PDF files begin with the string %PDF-.
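To make the header search concrete, here is a minimal sketch on synthetic data. The six leading bytes are made up and only stand in for the metadata that Excel prepends; the real header has a different length and content, which is why the code searches for the PDF signature instead of assuming a fixed offset.

```mathematica
(* Byte values of the PDF signature "%PDF-". *)
pdfHeader = Normal@StringToByteArray["%PDF-"];   (* {37, 80, 68, 70, 45} *)

(* Synthetic stand-in for an oleObject .bin file:
   six made-up metadata bytes followed by the start of a PDF. *)
binFile = Join[{1, 0, 0, 0, 2, 0}, Normal@StringToByteArray["%PDF-1.7"]];

(* Index of the first byte of the first occurrence of the signature. *)
start = SequencePosition[binFile, pdfHeader][[1, 1]];   (* 7 *)

(* Everything from that index onward is the embedded file. *)
payload = binFile[[start ;;]];
ByteArrayToString[ByteArray[payload]]   (* "%PDF-1.7" *)
```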
Long story short, you can simply use Import in Mathematica to extract files from the ZIP archive, find all oleObject###.bin files, look for the PDF header string in each of them, and import the data from that point on as a PDF document.
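importPDFsFromXLSX[xlsxFile_] := Module[{
   files = Import[xlsxFile, {"ZIP", "FileNames"}],
   binFiles, pdfHeader = Normal@StringToByteArray["%PDF-"]
   },
  (* Read every embedded OLE object as a list of bytes.
     DigitCharacter .. also matches multi-digit indices such as oleObject10.bin. *)
  binFiles = Import[xlsxFile, {"ZIP",
     Select[files, StringEndsQ["oleObject" ~~ DigitCharacter .. ~~ ".bin"]], "Byte"}];
  (* Drop the metadata in front of the first "%PDF-" and import the rest as a PDF. *)
  Table[ImportByteArray[
    ByteArray@binFile[[SequencePosition[binFile, pdfHeader][[1, 1]] ;;]], "PDF"],
   {binFile, binFiles}]
  ]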
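One caveat: if the workbook also contains embedded objects that are not PDFs, SequencePosition returns an empty list for them and the [[1, 1]] in importPDFsFromXLSX fails with a Part error. A possible guard is sketched below. The helper name pdfPayload and the Missing tag are my own choices for illustration, not part of any library.

```mathematica
pdfHeader = Normal@StringToByteArray["%PDF-"];

(* Hypothetical helper: return the bytes from the first "%PDF-" onward,
   or Missing["NoPDFHeader"] if the signature does not occur. *)
pdfPayload[bytes_List] := Replace[
  SequencePosition[bytes, pdfHeader, 1],   (* at most one match *)
  {
   {{start_, _}} :> bytes[[start ;;]],
   _ -> Missing["NoPDFHeader"]
   }]

(* Synthetic examples: one object with a PDF inside, one without. *)
withPDF = Join[{1, 2, 3}, Normal@StringToByteArray["%PDF-1.4"]];
withoutPDF = {1, 2, 3};

ByteArrayToString[ByteArray[pdfPayload[withPDF]]]   (* "%PDF-1.4" *)
pdfPayload[withoutPDF]                              (* Missing["NoPDFHeader"] *)
```

Mapping pdfPayload over the byte lists and applying DeleteMissing to the result would leave only the objects that actually contain a PDF, which can then be passed to ImportByteArray as before.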
Below, I've created a test Excel file with four embedded PDFs (VDELAJ is the Slovene translation of your EINBETTEN or English EMBED function).
Now we simply import the file using the function above.