As part of my recent work on upgrading our partner & customer ticketing system to make things sweeter for all our users, I was assigned a task involving importing data from a CSV that was generated using MySQL’s SELECT ... INTO OUTFILE. The problem was that the source table contained BLOB fields.
The resulting CSV was quite non-standard. MySQL simply cannot produce CSVs conforming to RFC 4180: it does not escape double quotes by duplicating them, but uses a dedicated escape character (backslash \ by default); and when that escape character is in use, it writes byte 0x00 as two characters: \0 (backslash + the character “0”). This requires non-standard processing – e.g., Python’s csv module can’t restore the binary data from such a file, no matter which encoding (like latin_1) you use to read the file (Python’s csv module only works with files opened in text mode), or which settings you pass to the reader. MySQL may be instructed not to use the escape character at all (using FIELDS ESCAPED BY ''), and then zero bytes are output as is; but then double quotes (byte 0x22) inside all fields (including blobs) are not escaped at all, making the CSV invalid and unreadable. MySQL should simply duplicate double quotes in this case, and everything would work out of the box… But for now, I needed to read the file with escaping by \ and all its quirks.
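To see why the csv module is of no help here, take a BLOB containing the bytes a, NUL, b, ", c (this sample value is mine, not from the actual dump). MySQL writes it as "a\0b\"c", and even the closest csv reader configuration silently turns the escaped NUL into a plain “0”:

import csv
import io

# A hypothetical one-field dump line for the BLOB b'a\x00b"c',
# written the way MySQL does with the default FIELDS ESCAPED BY '\'
raw_line = '"a\\0b\\"c"\n'   # the actual characters on disk: "a\0b\"c"

# The closest the csv module gets: backslash as escapechar, quotes not doubled
reader = csv.reader(io.StringIO(raw_line), doublequote=False, escapechar='\\')
print(next(reader))  # ['a0b"c'] - the NUL is gone, only a '0' remains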
I hadn’t had time to look for a CSV reader capable of dealing with such a syntax (actually, I did a brief search, but it was of course not exhaustive), so I had to come up with some home-grown Python code for this:
def readMySQL_CSVLine(binfile):
    rec = []
    field = bytearray()
    is_quoted = False       # inside quoted field
    is_escaped = False      # immediately after "\"
    is_after_quote = False  # inside a quoted field, immediately after
                            # a double quote that is not an opening
                            # quote of a field
    while True:
        ch = binfile.read(1)
        if len(ch) == 0:  # EOF
            if is_after_quote:
                # There was an opening quote, then a closing quote,
                # then EOF; the field must be added, even if empty
                rec.append(field)
            elif is_quoted or is_escaped:
                # Missing closing quote or a character following "\"
                if is_escaped:  # Invalid last "\"
                    field += b'\\'
                rec.append(field)
            elif len(rec) > 0 or len(field) > 0:
                # Don't return records for empty last line
                rec.append(field)
            return rec
        if is_escaped:
            is_escaped = False
            if ch == b'0':
                ch = b'\x00'
            elif ch not in b'\n"\\,':
                # "\" should only escape: FIELDS ESCAPED BY,
                # FIELDS [OPTIONALLY] ENCLOSED BY,
                # FIELDS TERMINATED BY, LINES TERMINATED BY,
                # and ASCII NUL.
                field += b'\\'
            field += ch
            continue
        if ch == b'\n' and (not is_quoted or is_after_quote):  # EOL
            rec.append(field)
            return rec
        if ch == b',' and (not is_quoted or is_after_quote):
            # End of field
            is_quoted = False
            is_after_quote = False
            rec.append(field)
            # no .clear() here, or else it would clear the content
            # already inside rec
            field = bytearray()
            continue
        if ch == b'"':
            if is_after_quote:
                # Normal case of two consecutive quotes -
                # one escapes the other
                is_after_quote = False
                field += ch
            elif is_quoted:
                is_after_quote = True
            else:  # not is_quoted
                if len(field) == 0:
                    # First character in a field
                    is_quoted = True
                else:
                    # A quote in the middle of a field that didn't
                    # start with a quote
                    field += ch
            continue
        if is_after_quote:
            # An invalid non-closing lonely quote in the middle of
            # a quoted field
            is_after_quote = False
            field += b'"'  # add it as a usual character
        if ch == b'\\':
            is_escaped = True
            continue
        field += ch
It is used like this:
with open('dumped_table.csv', 'rb') as csv_file:
    while True:
        rec = readMySQL_CSVLine(csv_file)
        if not rec:
            break
        numeric_field, text_field, blob_field = rec
        numeric_field = numeric_field.decode('ascii')
        text_field = text_field.decode('utf-8')
        with open('path/to/file', 'w+b') as outfile:
            outfile.write(blob_field)
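A quick sanity check on an in-memory buffer (the sample bytes are mine, not from the real dump) shows that the escaped NUL byte is restored properly:

import io

# The BLOB b'a\x00b"c' is dumped by MySQL as "a\0b\"c"
buf = io.BytesIO(b'"a\\0b\\"c"\n')
print(readMySQL_CSVLine(buf))  # [bytearray(b'a\x00b"c')]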
This worked for me. Of course, it may easily be converted into a proper iterator (a sketch of that follows below), but I leave the function here as is. I hope this can be useful to someone, until MySQL implements an option to produce a standards-compliant export.
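For instance, a minimal generator wrapper could look like this (a sketch; iter_mysql_csv is just a name I made up, and it relies on readMySQL_CSVLine returning an empty list only at EOF):

def iter_mysql_csv(binfile):
    # Yield one record (a list of bytearray fields) per CSV line;
    # readMySQL_CSVLine returns an empty list only at EOF
    while True:
        rec = readMySQL_CSVLine(binfile)
        if not rec:
            return
        yield rec

# Usage:
# with open('dumped_table.csv', 'rb') as csv_file:
#     for numeric_field, text_field, blob_field in iter_mysql_csv(csv_file):
#         ...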