Very Large Input and Piping Using subprocess.Popen
Solution 1:
Popen has a bufsize parameter that will limit the size of the buffer in memory. If you don't want the data in memory at all, you can pass file objects as the stdin and stdout parameters. From the subprocess docs:
bufsize, if given, has the same meaning as the corresponding argument to the built-in open() function: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size. A negative bufsize means to use the system default, which usually means fully buffered. The default value for bufsize is 0 (unbuffered).
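As a minimal sketch of passing file objects instead of pipes (using `cat` as a stand-in child process, and hypothetical filenames): the child reads and writes the files directly, so none of the data flows through the parent's memory.

```python
import subprocess

# Create a small sample input file (stands in for a multi-GB file).
with open("input.dat", "wb") as f:
    f.write(b"line1\nline2\n")

# Pass real file objects as stdin/stdout: the OS connects the child
# directly to the files, so nothing accumulates in this process.
with open("input.dat", "rb") as fin, open("output.dat", "wb") as fout:
    subprocess.Popen(["cat"], stdin=fin, stdout=fout).wait()
```

The same pattern works with any command; only the file objects and the argument list change.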
Solution 2:
Try this small change and see whether the efficiency improves:
for line in samtoolsin.stdout:
    if line.startswith("@"):
        samtoolsout.stdin.write(line)
    else:
        linesplit = line.split("\t")
        if linesplit[10] == "*":
            linesplit[9] = "*"
        samtoolsout.stdin.write("\t".join(linesplit))
Solution 3:
However, all the data are buffered to memory ...
Are you using subprocess.Popen.communicate()? By design, this function waits for the process to finish, all the while accumulating the data in a buffer, and then returns it to you. As you've pointed out, this is problematic with very large files.
If you want to process the data while it is being generated, you will need to write a loop using the poll() and .stdout.read() methods, then write that output to another socket/file/etc.
Do be sure to heed the warnings in the documentation against doing this, as it is easy to cause a deadlock (the parent process waits for the child process to generate data, while the child is in turn waiting for the parent process to empty the pipe buffer).
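A minimal sketch of such a read loop, using `seq` as a stand-in for the real child process (reading until an empty read also detects that the child has exited, so an explicit poll() is not strictly needed here):

```python
import subprocess

# Stream the child's output in bounded chunks instead of reading it
# all at once; "seq" is just a stand-in command that produces output.
proc = subprocess.Popen(["seq", "1", "100000"], stdout=subprocess.PIPE)
chunks = []
while True:
    chunk = proc.stdout.read(65536)   # read at most 64 KiB at a time
    if not chunk:                     # empty read means EOF: child is done
        break
    chunks.append(chunk)              # in practice: write to a file/socket
proc.wait()
```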
Solution 4:
I was using the .read() method on the stdout stream. Instead, I simply needed to iterate directly over the stream in a for loop. The corrected code does what I expected:
#!/usr/bin/env python
import os
import sys
import subprocess

def main(infile, reflist):
    print infile, reflist
    samtoolsin = subprocess.Popen(["samtools", "view", infile],
                                  stdout=subprocess.PIPE, bufsize=1)
    samtoolsout = subprocess.Popen(["samtools", "import", reflist, "-",
                                    infile + ".tmp"],
                                   stdin=subprocess.PIPE, bufsize=1)
    for line in samtoolsin.stdout:
        if line.startswith("@"):
            samtoolsout.stdin.write(line)
        else:
            linesplit = line.split("\t")
            if linesplit[10] == "*":
                linesplit[9] = "*"
            samtoolsout.stdin.write("\t".join(linesplit))
Solution 5:
I was trying to do some basic shell piping with very large input in Python:
svnadmin load /var/repo < r0-100.dump
I found the simplest way to get this working even with large (2-5GB) files was:
subprocess.check_output('svnadmin load %s < %s' % (repo, fname), shell=True)
I like this method because it's simple and you can do standard shell redirection.
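If you'd rather avoid shell=True, the same redirection can be sketched by handing an open file to the child as its stdin (here `wc -c` is a stand-in for svnadmin load, and the filename is hypothetical):

```python
import subprocess

# Create a sample "dump" file to redirect into the child.
with open("dump.bin", "wb") as f:
    f.write(b"x" * 1000)

# Equivalent of "wc -c < dump.bin" without invoking a shell:
# the child reads the file directly as its standard input.
with open("dump.bin", "rb") as fin:
    out = subprocess.check_output(["wc", "-c"], stdin=fin)
```

This keeps the large file out of the parent's memory while sidestepping shell quoting issues.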
I tried going the Popen route to run a redirect:
cmd = 'svnadmin load %s' % repo
p = Popen(cmd, stdin=PIPE, stdout=PIPE, shell=True)
with open(fname) as inline:
    for line in inline:
        p.communicate(input=line)
But that broke with large files. Using p.stdin.write() directly also broke with very large files.
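One reason communicate(input=line) fails in a loop is that communicate() closes the child's stdin and waits for it to exit, so it can only be called once per process. A minimal sketch of feeding data incrementally instead, using `cat` as a stand-in for svnadmin load (note that writing to p.stdin while ignoring a filling stdout pipe is exactly the deadlock scenario the docs warn about; here stdout is drained by the final communicate() call):

```python
import subprocess

# Stream input to the child one chunk at a time via its stdin pipe;
# "cat" simply echoes its input so we can check the result.
p = subprocess.Popen(["cat"], stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE)
for i in range(3):
    p.stdin.write(b"chunk %d\n" % i)   # write each piece as it is ready
out, _ = p.communicate()               # close stdin, collect remaining output
```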