Avoiding automatic data type conversion in Microsoft Excel and Pandas
| coding, python
Automatic conversion of data types is often handy, but sometimes it
can mess things up. For example, when you import a CSV into Microsoft
Excel, it will helpfully convert and display dates/times in your
preferred format–and it will use your configured format when
exporting back to CSV, which is not cool when your original file had
YYYY-MM-DD HH:MM:SS
and someone's computer decided to turn it into
MM/DD/YY HH:MM
. To avoid this conversion and import the columns as
strings, you can change the file extension to .txt
instead of .csv
and then change each column type that you care about, which can be a
lot of clicking. I had to change things back with a regular expression
along the lines of:
import re s = "12/9/21 11:23" match = re.match('([0-9]+)/([0-9]+)/([0-9]+)( [0-9]+:[0-9]+)', s) date = '20%s-%s-%s%s:00' % (match.group(3).zfill(2), match.group(1).zfill(2), match.group(2).zfill(2), match.group(4)) print(date)
The pandas
library for Python also likes to do this kind of data
type conversion for data types and for NaN values. In this particular
situation, I wanted it to leave columns alone and leave the nan
string in my input alone. Otherwise, to_csv
would replace nan
with
the blank string, which could mess up a different script that used
this data as input. This is the code to do it:
import pandas as pd df = pd.read_csv('filename.csv', encoding='utf-8', dtype=str, na_filter=False)
I'm probably going to run into this again sometime, so I wanted to make sure I put my notes somewhere I can find them later.