Avoiding automatic data type conversion in Microsoft Excel and Pandas

| coding, python

Automatic conversion of data types is often handy, but sometimes it can mess things up. For example, when you import a CSV into Microsoft Excel, it will helpfully convert and display dates/times in your preferred format–and it will use your configured format when exporting back to CSV, which is not cool when your original file had YYYY-MM-DD HH:MM:SS and someone's computer decided to turn it into MM/DD/YY HH:MM. To avoid this conversion and import the columns as strings, you can change the file extension to .txt instead of .csv and then change each column type that you care about, which can be a lot of clicking. I had to change things back with a regular expression along the lines of:

import re
s = "12/9/21 11:23"
match = re.match('([0-9]+)/([0-9]+)/([0-9]+)( [0-9]+:[0-9]+)', s)
date = '20%s-%s-%s%s:00' % (match.group(3).zfill(2), match.group(1).zfill(2), match.group(2).zfill(2), match.group(4))

The pandas library for Python also likes to do this kind of data type conversion for data types and for NaN values. In this particular situation, I wanted it to leave columns alone and leave the nan string in my input alone. Otherwise, to_csv would replace nan with the blank string, which could mess up a different script that used this data as input. This is the code to do it:

import pandas as pd
df = pd.read_csv('filename.csv', encoding='utf-8', dtype=str, na_filter=False)

I'm probably going to run into this again sometime, so I wanted to make sure I put my notes somewhere I can find them later.

You can comment with Disqus or you can e-mail me at sacha@sachachua.com.