Reading from MySQL data with BLOBs dumped to CSV

As part of my recent work in upgrading our partner & customer ticketing system to make things sweeter for all our users I have been assigned a task involving importing data from a CSV that was generated using MySQL’s SELECT ... INTO OUTFILE. The problem was that the source table contained BLOB fields.

The resulting CSV was quite non-standard. MySQL simply does not allow to produce CSVs conforming to RFC 4180: it would not escape double quotes by duplicating them, but would use a dedicated escapement symbol (backslash \ by default); and when using that escapement symbol, it will escape byte 0x00 as two characters: \0 (backslash + character “0”). This needs a non-standard processing – e.g., Python’s csv module can’t restore the binary from such a file, no matter which encoding (like latin_1) you use to read file (Python’s csv module only works with files opened in text mode), or which settings you pass to reader. MySQL may be instructed to not use the escapement symbol (using FIELDS ESCAPED BY ''), and then zero bytes will be output as is, but then the double quotes (byte 0x22) inside all fields (including blobs) will be completely not escaped, making the CSV invalid and unreadable. MySQL should just duplicate double-quotes in this case, and it would work out of the box… But for now, I needed to read from the file with escaping by \ and all the quirks.

I hadn’t had time to look for a CSV reader capable of dealing with such a syntax (actually, I did a brief search, but of course not exhaustive). So I had to come with a home-grown Python code for this:

def readMySQL_CSVLine(binfile):
    rec = []
    field = bytearray()
    is_quoted = False # inside quoted field
    is_escaped = False # immediately after "\"
    is_after_quote = False # inside a quoted field, immediately after
                           # a double quote that is not an opening
                           # quote of a field
    while True:
        ch = binfile.read(1)
        if len(ch) == 0: # EOF
            if (is_after_quote): # There was an opening quote, then
                                 # closing quote, then EOF; the field
                                 # must be added, even if empty
                rec.append(field)
            elif is_quoted or is_escaped: # Missing closing quote or a
                                          # character following "\"
                if is_escaped: # Invalid last "\"
                    field += b'\\'
                rec.append(field)
            elif len(rec) > 0 or len(field) > 0: # Don't return records
                                                 # for empty last line
                rec.append(field)
            return rec
        if is_escaped:
            is_escaped = False
            if ch == b'0':
                ch = b'\x00'
            elif ch not in b'\n"\\,': # "\" should only escape:
                                      # FIELDS ESCAPED BY,
                                      # FIELDS [OPTIONALLY] ENCLOSED BY,
                                      # FIELDS TERMINATED BY,
                                      # LINES TERMINATED BY,
                                      # and ASCII NUL.
                field += b'\\'
            field += ch
            continue
        if ch == b'\n' and (not is_quoted or is_after_quote): # EOL
            rec.append(field)
            return rec
        if ch == b',' and (not is_quoted or is_after_quote): # End of field
            is_quoted = False
            is_after_quote = False
            rec.append(field)
            field = bytearray() # no .clear() here, or else it will clear
                                # content inside rec
            continue
        if ch == b'"':
            if is_after_quote: # Normal case of two two consequent quotes -
                               # one escapes the other
                is_after_quote = False
                field += ch
            elif is_quoted:
                is_after_quote = True
            else: # not is_quoted
                if len(field) == 0: # First character in a field
                    is_quoted = True
                else: # a quote in the middle of a field that didn't
                      # start with quotes
                    field += ch
            continue
        if is_after_quote: # An invalid non-closing lonely quote in the
                           # middle of a quoted field
            is_after_quote = False
            field += b'"' # Add it as usual character
        if ch == b'\\':
            is_escaped = True
            continue
        field += ch

It is used like this:

with open('dumped_table.csv', 'rb') as csv_file:
    while True:
        rec = readMySQL_CSVLine(csv_file)
        if not rec:
            break
        numeric_field, text_field, blob_field = rec
        numeric_field = numeric_field.decode('ascii')
        text_field = text_field.decode('utf-8')
        with open('path/to/file', 'w+b') as outfile:
            outfile.write(blob_field)

This worked for me. Of course, it may easily be converted to be a proper iterator, but I just leave it here as is. Hope that this could be useful for someone, until MySQL implement an option to produce a standards-compliant export.