
Handle Unwanted Line Breaks With read_csv in Pandas

I have a problem with data that is exported from SAP. Sometimes there is a line break in the posting text, so what should be one line is split across two, and this results in rows that pandas' read_csv cannot parse correctly.

Solution 1:

You can do some pre-processing to get rid of the unwanted breaks. Below is an example that I tested.

import fileinput

# Pass 1: turn every physical line break into a '^' marker
with fileinput.FileInput('input.csv', inplace=True, backup='.orig.bak') as file:
    for line in file:
        print(line.replace('\n', '^'), end='')

# Pass 2: drop the markers that split a record just before a '~' delimiter
with fileinput.FileInput('input.csv', inplace=True, backup='.1.bak') as file:
    for line in file:
        print(line.replace('^~', '~'), end='')

# Pass 3: restore the remaining markers as real line breaks
with fileinput.FileInput('input.csv', inplace=True, backup='.2.bak') as file:
    for line in file:
        print(line.replace('^', '\n'), end='')
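On a small sample, the three passes behave as sketched below at string level (the sample text and `~` delimiter are illustrative; `fileinput` applies the same replacements file-wide). Note that this approach only repairs breaks that fall immediately before a delimiter:

```python
sample = "key~posting text\n~amount\nsecond~line~100\n"

step1 = sample.replace('\n', '^')   # mark every physical line end
step2 = step1.replace('^~', '~')    # drop marks that split a record before a delimiter
step3 = step2.replace('^', '\n')    # restore the remaining marks as real line ends

print(step3)
# key~posting text~amount
# second~line~100
```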

Solution 2:

The correct way would be to fix the file at creation time. If that is not possible, you can either pre-process the file or use a wrapper.

Here is a solution using a byte-level wrapper that combines lines until each record has the correct number of delimiters. I use a byte-level wrapper to reuse the classes of the io module and add as little code of my own as possible: a RawIOBase reads lines from an underlying byte file object and combines lines until the expected number of delimiters is reached (only readinto and readable are overridden).

import io

class csv_wrapper(io.RawIOBase):
    def __init__(self, base, delim):
        self.fd = base           # underlying (byte) file object
        self.nfields = None
        self.delim = ord(delim)  # code of the delimiter (passed as a character)
        self.numl = 0            # current line number for error reporting
        self._getline()          # load and process the header line

    def _nfields(self):
        # number of delimiters in the current line
        return len([c for c in self.line if c == self.delim])

    def _getline(self):
        while True:
            # load a new line into the internal buffer
            self.line = next(self.fd)
            self.numl += 1
            if self.nfields is None:          # store number of delims if not known
                self.nfields = self._nfields()
            else:
                while self.nfields > self._nfields():  # optionally combine lines
                    self.line = self.line.rstrip() + next(self.fd)
                    self.numl += 1
                if self.nfields != self._nfields():    # too many here...
                    print("Too many fields line {}".format(self.numl))
                    continue                  # ignore the offending line and proceed
            self.index = 0                    # reset line pointers
            self.linesize = len(self.line)
            break

    def readinto(self, b):
        if len(b) == 0:
            return 0
        if self.index == self.linesize:       # if the current buffer is exhausted
            try:                              # read a new one
                self._getline()
            except StopIteration:
                return 0
        n = 0
        for i in range(len(b)):               # copy into the passed bytearray
            if self.index == self.linesize:
                break
            b[i] = self.line[self.index]
            self.index += 1
            n += 1
        return n                              # number of bytes actually copied

    def readable(self):
        return True

You can then change your code to:

data = pd.read_csv(
    csv_wrapper(open(path_to_file, 'rb'), '~'),
    sep='~',
    encoding='latin1',
    error_bad_lines=True,   # in pandas >= 1.3 use on_bad_lines='error' instead
    warn_bad_lines=True)
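The same line-combining idea can also be sketched at text level, before pandas ever sees the file. The helper below (`merge_broken_lines` is a hypothetical name, not part of the answer above) counts delimiters in the header and keeps appending physical lines until each record is complete:

```python
def merge_broken_lines(lines, delim='~'):
    """Yield logical records from an iterable of physical lines,
    merging lines until each record has as many delimiters as the header."""
    it = iter(lines)
    header = next(it)
    nfields = header.count(delim)   # expected number of delimiters per record
    yield header
    for line in it:
        # keep appending physical lines while the record is still short
        while line.count(delim) < nfields:
            try:
                line = line.rstrip('\n') + next(it)
            except StopIteration:
                break               # truncated final record: pass it through as-is
        yield line

# The repaired records can then be handed to pandas, e.g.:
# pd.read_csv(io.StringIO(''.join(merge_broken_lines(open(path), '~'))), sep='~')
```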
