1. errors= is sometimes useful
If a file must have a certain encoding but the dataframe contains characters that cannot be represented in it, errors= can be used to "coerce" the data to be saved anyway, at the cost of losing information. Any value accepted by the errors= argument of Python's open() function can be passed here.
For example, the code below saves a csv with ascii encoding, replacing each Japanese character with a ?.
import pandas as pd
df = pd.DataFrame({'A': ['Shohei Ohtani は一生に一度の選手だ。']})
df.to_csv('data1.csv', encoding='ascii', errors='replace', index=False)
print(pd.read_csv('data1.csv'))
A
0 Shohei Ohtani ???????????
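Since the accepted values are the same ones open() takes, you can preview what each handler would do with plain str.encode before writing anything. This is only a sketch using the standard library, not pandas itself; the sample string and the handler list are picked for illustration.

```python
# Preview how different error handlers treat a character ascii cannot represent.
# Standard library only; the sample text is illustrative.
text = "Ohtani は"
handlers = ("replace", "ignore", "xmlcharrefreplace", "backslashreplace")

encoded = {h: text.encode("ascii", errors=h) for h in handlers}
for name, result in encoded.items():
    print(f"{name:>18}: {result!r}")

# The default handler, "strict", raises instead of losing data.
try:
    text.encode("ascii", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Of these, 'replace' and 'ignore' discard the original character, while 'xmlcharrefreplace' and 'backslashreplace' keep enough information to recover it later.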
2. float_format= is sometimes useful
You can format float dtypes using float_format=; doing so can save a lot of disk space, at the cost of losing precision. For example,
df = pd.DataFrame({'A': [*range(1,9,3)]*1000})/3
df.to_csv('data1.csv', index=False) # 61,440 bytes on disk
df.to_csv('data2.csv', index=False, float_format='%.2f') # 20,480 bytes on disk
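To see where the savings and the precision loss come from, here is a rough sketch using only the standard library. It assumes the format string is applied printf-style to each value, as '%.2f' suggests, and compares it with Python's full round-trip representation of the same float.

```python
# Compare the text length of one float written in full vs. with '%.2f'.
# Assumption: the format string is applied printf-style to each value.
value = 1 / 3

full = repr(value)        # shortest text that round-trips exactly
short = "%.2f" % value    # two decimal places

print(len(full), full)
print(len(short), short)

# Reading the short form back does not recover the original number.
round_trips_full = float(full) == value
round_trips_short = float(short) == value
print(round_trips_full, round_trips_short)
```

Each value shrinks from 18 characters to 4, which is roughly where the size difference above comes from, but the rounded text can no longer be read back as the original number.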
3. Save a compressed csv
Since pandas 1.0.0, you can pass a dict to compression= that specifies the compression method and the file name inside the archive. The code below creates a zip file named compressed_data.zip containing a single file named data.csv.
df.to_csv('compressed_data.zip', index=False, compression={'method': 'zip', 'archive_name': 'data.csv'})
# read the archived file as a csv
pd.read_csv('compressed_data.zip')
You can even add to an existing archive; simply pass mode='a'.
df.to_csv('compressed_data.zip', compression={'method': 'zip', 'archive_name': 'data_new.csv'}, mode='a')
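Once an archive holds more than one file, it helps to check what is actually in it. The sketch below uses the standard-library zipfile module on a small stand-in archive built in memory, so it runs without pandas; the member contents are made up for illustration.

```python
import io
import zipfile

# Stand-in for an archive that was written once and then appended to.
# The csv contents here are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as zf:
    zf.writestr("data.csv", "A\n0.33\n1.33\n")
with zipfile.ZipFile(buffer, mode="a") as zf:  # append, keep existing members
    zf.writestr("data_new.csv", "A\n2.33\n")

# List the members and read one of them back by name.
with zipfile.ZipFile(buffer) as zf:
    members = zf.namelist()
    with zf.open("data_new.csv") as fh:
        new_text = fh.read().decode("utf-8")

print(members)
print(new_text)
```

With a real file on disk you would pass the path (e.g. 'compressed_data.zip') instead of the buffer, and the file object returned by zf.open(...) can be handed to pd.read_csv to load one specific member.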