1

I am doing some research on minerals for a book. I will eventually need to track about two thousand minerals. This is currently done by a ~800 line Word table where each line more or less looks like this picture https://imgur.com/a/YDvLXxo

The functionality I need is (a) to be able to sort the table on any set of columns (b) see the visual thumbnail for each line (c) but maintain a full resolution copy of the image (d) and be able to click on any active links in the table entry.

Currently the Word file is ~600MB in size and Word (Office 365 version) is clearly beginning to creak at the seams (the autosave no longer works, for example, and sometimes I am unable to select anything in the document.)

I currently run a python program that reads the Word file and produces a .pdf file because the output I need is to map each line of the table to one output page, with picture, notes, and other attributes appropriately scaled and positioned. The .pdf file is about ~2GB in size, so I run it through a .pdf file shrinker afterwards and get a reasonable ~200MB final output.

So my question is, is there a better way to do this? I did take a quick look at Access, and was unable to figure out how to get it to handle the pictures.

I can't really split the Word file in two because I'd lose the sort functionality. While it's true that once the table is finished, I would not need to sort it any more, I am continuing to add entries to the table still and the sorting allows me to check for several different error types.

So, is there a flag I can set in Word to allow "large" files safely? If so, then is there a way to have the same functionality but have one mineral per page, nicely formatted with appropriate styles? If not, should I switch to some other document software? (Note: I originally had this in Google Docs but that doesn't offer the Sort functionality. Add-ons kind of offered sort but each one broke in one way or another on a much smaller table.)

2
  • 1
    Have you explored using Excel versus Word? Excel is better equipped to handle a table of the size you have. Commented Jan 26, 2021 at 14:22
  • I tried that first, but I could not figure out how to insert an image in an Excel cell -- so it STAYS THERE. Images on Excel sheets appear to be floating on top rather than actually associated with a cell. Commented Jan 27, 2021 at 19:33

3 Answers 3

2

If you are staying with a Microsoft Office product you should use Excel for this application. Sorting is built-in (see screen clip), the file can be saved directly to PDF, a worksheet can have over 1 million rows, multiple worksheets are standard so you could classify items into various categories on different worksheets. And I’m sure a Python app can be written that would read the Excel data from the workbook.

enter image description here

6
  • The problem with Excel was the images -- as I said above, images appear to float on top of excel sheets rather than actually be associated with a cell. Commented Jan 27, 2021 at 19:36
  • @WaltDonovan Excel has a property setting for images. Move and Size with Cells, or Move but don't size, and also Don't move or size with cells. You can set the width of the cell and height of the row to frame the image, and then with the other property setting mentioned, they will hold their position in your worksheet. Commented Jan 27, 2021 at 20:07
  • OK, if I set the excel row/column size to large values then the cut and paste works (after several minutes.) The issue is now that the multiline entries in the Word table end up as weird split cells in excel rather than a single cell. Also, all of the links are broken and no longer clickable. Commented Jan 27, 2021 at 23:31
  • It's worst than that -- the autofit row/col ignore the image size and look at the text size only and scale the image. Sigh. Commented Jan 27, 2021 at 23:38
  • To your entire table, you need to apply Word Wrap. Also, set the row height for all rows to the same height you have for the rows of your word table. For column widths, do the same thing. Commented Jan 28, 2021 at 1:08
0

As you are already familiar with Python, I strongly recommend bringing your project onto Google Sheets with the Google API.

  1. Copy and Paste your table into a new Google Sheets table
  2. Familiarize yourself with the Sheets API (API key&auth, Python Connection, accessing and reading your table)
  3. Expand your Python script to fetch the data from Sheets and generate your PDF

Making Python scripts for Google Sheets is a powerful data science skill nowadays, because Google Sheets is very flexible, efficient and can work like an online database.

Here is a good video tutorial for starters: Google Sheets - Python API, Read & Write Data (YouTube)

Cheers!

3
  • I'll take a look at Sheets and see if you can actually insert an image in a cell so it stays attached to the cell. Thanks. Commented Jan 27, 2021 at 19:37
  • Google sheets does not allow me to paste the images, so that's out. I do see a new "image in cell" feature, but I'd really rather not have to manually copy over a thousand images. Maybe if I use the API I could do it. The other issue I had was that I originally did have the table as a google doc table. The problem was it got slower and slower to use it as google docs kept backing up all 200 MB (at the time) to the point where it got unusable. I don't think an online solution is what I need. Commented Jan 27, 2021 at 23:43
  • Cool you gave it a try, thanks for letting me know. Commented Feb 6, 2021 at 20:35
0

The simple answer appears to be that Word tables have the functionality I need (each table row adjusts to content including image, sort is easy, links inside cells work) so I will just have to be careful not to exceed any hidden file size limits.

An alternative would be to find some other good document editor that supports importing from Word accurately.

As long as Excel treats images in cells as objects rather than values, Excel is going to be unusable for my purpose.

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .