0

I have a csv-like file with complicated data and would like to open it in LibreOffice Calc, Microsoft Excel, or another spreadsheet program.

What I mean by csv-like:

  • Fields are separated by the ASCII unit separator \037/\x1F/31 decimal (a regular csv would use ,).
  • Records are separated by the null byte \0 (a regular csv would use \n).
  • Fields contain nearly arbitrary data, for instance (unpaired) quotes (", ') and linebreaks (\r, \n, or both). There are no quoting mechanisms, therefore the field and record delimiters cannot be stored inside a field.

After conversion to a spreadsheet document, I want to keep all data exactly as in the input. If a field in the file contains a linebreak, quote,or any other special symbol the corresponding cell in the output should too. This implies that all cells in the resulting spreadsheet have to be formatted as text.
Preserving the original data is extremely important. Since the files are huge I cannot manually look into the results and decide whether the conversion was successful or the data was mangled.

Since the data is generated by a bash script it would be nice if that script could do the conversion too. Other automatic solutions are welcome. Conversion to a regular csv file would probably be possible using csv's quoting mechanisms, but is not an option unless there is a reliable one-click csv import in Calc or Excel. As far as I know, importing a csv file manually takes quite some time and is error prone (in Calc and Excel you have to manually specify the delimiters, quoting mechanisms, column data type = text. In newer Excel versions you also have to deal with unwanted Data Connections).

Is there a tool (like ssconvert), or a simple script, or an option in Calc / Excel to
convert my csv-like file unmangled into any spreadsheet format (ods, odf, xls, xlxs)?

1 Answer 1

1

I solved my problem using the python library xlsxwriter and a small custom script.

Install the library using

pip install XlsxWriter

Save the following script as csvlike2xlsx.py.

#! /usr/bin/env python3
# converts an unquoted csv file with custom delimiters into a xlsx spreadsheet
# such that all data is shown exactly as in the text file (no processing of numbers, dates, formulas, ...)
import sys
import xlsxwriter

# delimiters / separators
fs="\037" # field separator is ASCII unit separator (US)
rs="\000" # record separator is ASCII null byte (NUL)

workbook = xlsxwriter.Workbook('out.xlsx', {
    'strings_to_numbers': False,
    'strings_to_formulas': False,
    'strings_to_urls': False,
    'constant_memory': True})
textformat = workbook.add_format()
textformat.set_num_format('@') # text format
# optional display option, useful when dealing with lots of linebreaks inside fields
# textformat.set_text_wrap()
for argpos, infilename in enumerate(sys.argv[1:], start=1):
    with open(infilename, 'r', newline='') as infile:
        infilecontent = infile.read();
        worksheet = workbook.add_worksheet(str(argpos))
        maxcol = 0
        for row, record in enumerate(infilecontent.split(rs)):
            if len(record) > maxcol:
                maxcol = len(record)
                worksheet.set_column(0, maxcol, None, textformat)
            for col, field in enumerate(record.split(fs)):
                worksheet.write(row, col, field)
workbook.close() # writes file to disk

Then run the script on your files using

python3 csvlike2xlsx.py file [...]

The result will be written to the file out.xlsx which contains one table/tab for each of the converted csv-files.

Room for improvement

  • For conversion, the input file is read completely into memory. I think this shouldn't be a problem since files bigger than your memory shouldn't be imported into Excel anyway. However, if you know a quick fix to efficiently read just until the next (record or field) separator in python feel free to edit this answer. According to my research, there are only predefined methods to read the next line using the predefined delimiters '' (end of file), '\n', '\r','\r\n', and None (universal line endings), see the documentation of open(..., newline=...).
  • The tables in the resulting xlsx file are numbered. The table for the first csv file is named 1, the table for the second csv file is named 2, and so on. It would be nice if the table name was the csv file name. However, Excel has many restrictions on the table name (length limit 32, no special symbols, unique name) therefore it is not always possible to use the csv file name.

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .