Handle Unwanted Line Breaks with read_csv in Pandas
I have a problem with data that is exported from SAP. Sometimes there is a line break in the posting text, so what should be one record ends up split across two physical lines, and read_csv can no longer parse the file correctly.
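To make the failure concrete, here is a hypothetical miniature of such an export (the column names and values are invented for illustration): a '~'-delimited file where one posting text is split across two physical lines.
import io
import pandas as pd

# Hypothetical sample mimicking the SAP export: '~'-delimited,
# with an unwanted line break inside the posting text of record 2.
raw = (
    "id~text~amount\n"
    "1~rent~100\n"
    "2~office\n"      # the record continues on the next physical line...
    "supplies~20\n"   # ...so pandas sees two short, broken rows
)

# Instead of one record [2, 'office supplies', 20], pandas produces
# two truncated rows padded with NaN.
print(pd.read_csv(io.StringIO(raw), sep='~'))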
Solution 1:
You can do some pre-processing to get rid of the unwanted breaks; the example below is one I tested.
import fileinput

# Pass 1: join all physical lines into one, marking every line break with '^'
# (this assumes '^' never occurs in the data itself)
with fileinput.FileInput('input.csv', inplace=True, backup='.orig.bak') as file:
    for line in file:
        print(line.replace('\n', '^'), end='')

# Pass 2: drop break markers that sit directly before a delimiter --
# these are the unwanted breaks inside a record
with fileinput.FileInput('input.csv', inplace=True, backup='.1.bak') as file:
    for line in file:
        print(line.replace('^~', '~'), end='')

# Pass 3: turn the remaining markers back into real line breaks
with fileinput.FileInput('input.csv', inplace=True, backup='.2.bak') as file:
    for line in file:
        print(line.replace('^', '\n'), end='')
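To see why the three passes work, here they are applied to a small in-memory string (hypothetical data; note the approach assumes that '^' never occurs in the data and that the unwanted breaks fall immediately before a delimiter):
# Hypothetical record where the break falls right before a delimiter
s = "1~rent\n~100\n2~office supplies~20\n"

s = s.replace('\n', '^')   # "1~rent^~100^2~office supplies~20^"
s = s.replace('^~', '~')   # "1~rent~100^2~office supplies~20^"
s = s.replace('^', '\n')   # "1~rent~100\n2~office supplies~20\n"
print(s)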
Solution 2:
The correct way would be to fix the file at creation time. If that is not possible, you can pre-process the file or use a wrapper.
Here is a solution using a byte-level wrapper that combines physical lines until each logical line has the correct number of delimiters. I use a byte-level wrapper so that I can reuse the classes of the io module and add as little code of my own as possible: a RawIOBase subclass reads lines from an underlying binary file object and joins them until the expected number of delimiters is reached (only readinto and readable are overridden).
import io

class csv_wrapper(io.RawIOBase):
    def __init__(self, base, delim):
        self.fd = base            # underlying (byte) file object
        self.nfields = None
        self.delim = ord(delim)   # code of the delimiter (passed as a character)
        self.numl = 0             # current line number for error reporting
        self._getline()           # load and process the header line

    def _nfields(self):
        # number of delimiters in the current line
        return len([c for c in self.line if c == self.delim])

    def _getline(self):
        while True:
            # load a new line into the internal buffer
            self.line = next(self.fd)
            self.numl += 1
            if self.nfields is None:   # store the number of delims if not yet known
                self.nfields = self._nfields()
            else:
                while self.nfields > self._nfields():  # optionally combine lines
                    self.line = self.line.rstrip() + next(self.fd)
                    self.numl += 1
                if self.nfields != self._nfields():    # too many delimiters here...
                    print("Too many fields on line {}".format(self.numl))
                    continue           # ignore the offending line and proceed
            self.index = 0             # reset line pointers
            self.linesize = len(self.line)
            break

    def readinto(self, b):
        if len(b) == 0:
            return 0
        if self.index == self.linesize:  # if the current buffer is exhausted
            try:                         # read a new one
                self._getline()
            except StopIteration:
                return 0
        written = 0
        for i in range(len(b)):          # copy into the passed bytearray
            if self.index == self.linesize:
                break
            b[i] = self.line[self.index]
            self.index += 1
            written += 1
        return written                   # number of bytes actually copied

    def readable(self):
        return True
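A quick way to sanity-check the wrapper is to feed it an in-memory sample (hypothetical data; note that combining lines strips the break but does not reinsert a space, so a word split by the break comes back joined):
import io
import pandas as pd

# Same hypothetical sample as above: record 2 is split over two lines
raw = b"id~text~amount\n1~rent~100\n2~office\nsupplies~20\n"

df = pd.read_csv(csv_wrapper(io.BytesIO(raw), '~'), sep='~')
print(df)   # record 2 is reassembled, with the text as 'officesupplies'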
You can then change your code to:
data = pd.read_csv(
    csv_wrapper(open(path_to_file, 'rb'), '~'),
    sep='~',
    encoding='latin1',
    error_bad_lines=True,
    warn_bad_lines=True)
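On recent pandas versions (1.3 and later), error_bad_lines and warn_bad_lines are deprecated in favor of the single on_bad_lines parameter, so the equivalent call would be something like:
# pandas >= 1.3: the two boolean flags are replaced by on_bad_lines
data = pd.read_csv(
    csv_wrapper(open(path_to_file, 'rb'), '~'),
    sep='~',
    encoding='latin1',
    on_bad_lines='error')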