Like many of you, I enjoy doing lots of data visualization and machine learning using Pandas.
Pandas loads CSV files (among other things), so naturally I export my datasets in CSV format for easy consumption. Today’s CSV export was a bit on the slow side, so I decided to profile it.
My code looks something like this:
with open(filename, 'wb') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    for index, row in enumerate(data_iterator):
        writer.writerow(row)
And this is the reference speed:
time python csv_writer_with_default_args.py
real 0m41.033s
user 0m39.976s
sys 0m0.277s
So, let’s profile it and look at the bottlenecks:
python -m cProfile -o csv_writer.profile csv_writer.py
runsnake csv_writer.profile
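If you don’t have RunSnakeRun handy, the standard library’s pstats module can print the same kind of profile as plain text. Here is a minimal sketch of that, not my actual script: write_rows and the synthetic rows are made-up stand-ins for the real export, and it is written for Python 3, writing to an in-memory buffer instead of a file (in Python 3 a real file would be opened with open(filename, 'w', newline='') rather than 'wb').

```python
import cProfile
import csv
import io
import pstats


def write_rows(rows, fieldnames):
    """Hypothetical stand-in for the real export: writes rows to memory."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


def profile_report(rows, fieldnames, top=10):
    """Profile write_rows and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    write_rows(rows, fieldnames)
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


# Synthetic data (an assumption): 2,000 rows with 50 columns each.
fieldnames = ["col%d" % i for i in range(50)]
rows = [dict.fromkeys(fieldnames, 1) for _ in range(2000)]
report = profile_report(rows, fieldnames)
print(report)
```

Sorting by cumulative time puts the functions that account for most of the run at the top, which is roughly what the RunSnakeRun view shows graphically.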
It’s pretty clear that _dict_to_list is the major bottleneck here. Let’s look at its code, which is a method of csv.DictWriter in the standard library:
def _dict_to_list(self, rowdict):
    if self.extrasaction == "raise":
        wrong_fields = [k for k in rowdict if k not in self.fieldnames]
        if wrong_fields:
            raise ValueError("dict contains fields not in fieldnames: "
                             + ", ".join([repr(x) for x in wrong_fields]))
    return [rowdict.get(key, self.restval) for key in self.fieldnames]
The bottleneck is the safety check that this function performs if self.extrasaction == "raise". In particular, it checks that the rowdict that is about to be written doesn’t have unexpected extra fields. When fieldnames is a list, every k not in self.fieldnames test is a linear scan, so the check costs roughly the square of the number of columns for each row. Since extra fields are a condition that should never happen in my code, we can safely disable this check. Turns out we can do so by passing DictWriter a flag:
with open(filename, 'wb') as csv_file:
    writer = csv.DictWriter(csv_file, extrasaction='ignore', fieldnames=fieldnames)
    for index, row in enumerate(data_iterator):
        writer.writerow(row)
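One thing to keep in mind before copying this: with extrasaction='ignore', keys that aren’t in fieldnames are silently dropped instead of raising an error. A small sketch of the difference, assuming made-up field names and an in-memory buffer (Python 3):

```python
import csv
import io

fieldnames = ["id", "name"]                  # hypothetical columns
row = {"id": 1, "name": "a", "oops": 42}     # "oops" is not in fieldnames


def write_one(row, extrasaction):
    """Write a single row and return the CSV text that was produced."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            extrasaction=extrasaction)
    writer.writerow(row)
    return buffer.getvalue()


# Default behaviour: the unexpected key triggers a ValueError.
try:
    write_one(row, "raise")
    raised = False
except ValueError:
    raised = True

# With 'ignore', the unexpected key is dropped without any warning.
ignored_output = write_one(row, "ignore")
print(raised, repr(ignored_output))
```

So the flag is only a safe trade when you are sure your rows never carry stray keys, or when you don’t mind losing them.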
What’s the speed now?
time python csv_writer_fast.py
real 0m5.267s
user 0m4.407s
sys 0m0.147s
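If you want to check whether the same trick pays off on your own setup, here is a rough benchmark sketch. Everything in it is an assumption for illustration: the export helper, the 200 synthetic columns and the 500 rows are made up, and it writes to memory rather than to disk. The numbers will depend on your Python version and on how many columns you have; newer Python versions may implement the check differently, so the gap can be much smaller than the one I measured.

```python
import csv
import io
import timeit


def export(rows, fieldnames, extrasaction):
    """Hypothetical export helper: write all rows to an in-memory CSV."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            extrasaction=extrasaction)
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# Synthetic data (an assumption): 500 rows with 200 columns each.
fieldnames = ["col%d" % i for i in range(200)]
rows = [dict.fromkeys(fieldnames, 1) for _ in range(500)]

# Time both settings on identical input; each run is repeated 3 times.
for mode in ("raise", "ignore"):
    seconds = timeit.timeit(lambda: export(rows, fieldnames, mode), number=3)
    print("%-6s %.3fs" % (mode, seconds))
```

Both settings produce identical output as long as the rows contain no unexpected keys, so the only thing being compared is the cost of the check.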