
Python: Easy Way To Modify Array In Parallel

The question might sound easy, but being new to parallelization in Python I am definitely struggling. I dealt with parallelization in OpenMP for C++ and that was a hell of a lot easier.

Solution 1:

Using only the standard library, you can use shared memory, in this case a multiprocessing Array, to store and modify your array:

from multiprocessing import Pool, Array, Lock

lock = Lock()

# two shared integer arrays guarded by the same lock; forked child
# processes inherit them and see the same underlying memory
my_array = [Array('i', [1, 2, 3], lock=lock),
            Array('i', [4, 5, 6], lock=lock)]

Let me suggest some modifications to your procedure: build a list, a schedule, of all the changes you need to make to your matrix (to make it clear I am going to use a namedtuple), plus a function that applies one change.

from collections import namedtuple

Change = namedtuple('Change', 'row idx value')

scheduled_changes = [Change(0, 0, 2),
                     Change(0, 1, 2),
                     Change(1, 0, 2),
                     Change(1, 1, 2)]
# or build the scheduled changes list in any other way, e.g. with
# for loops or list comprehensions...

def modify(change, matrix=my_array):
    # the default argument binds the shared arrays defined above, so
    # forked workers write into the same underlying memory
    matrix[change.row][change.idx] = change.value

Now you can use a Pool to map the modify function over the changes. Note that this relies on the workers inheriting my_array, i.e. on the fork start method that is the default on Linux; a spawn-safe variant is sketched after the output below.

with Pool(4) as pool:
    pool.map(modify, scheduled_changes)

for row in my_array:
    for col in row:
        print(col, end=' ')
    print()

# 2 2 3
# 2 2 6
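On platforms that use the spawn start method (e.g. Windows, or macOS on recent Python versions), workers don't inherit my_array, so the version above won't work as-is. A minimal sketch of one common workaround, not part of the original answer: hand the shared arrays to each worker once via the Pool's initializer, which is the one point where multiprocessing allows shared objects to be passed to pool workers.

from collections import namedtuple
from multiprocessing import Pool, Array

Change = namedtuple('Change', 'row idx value')

def init_worker(shared_rows):
    # runs once in every worker; stash the shared arrays in a global
    global my_array
    my_array = shared_rows

def modify(change):
    my_array[change.row][change.idx] = change.value

if __name__ == '__main__':
    my_array = [Array('i', [1, 2, 3]), Array('i', [4, 5, 6])]
    changes = [Change(0, 0, 2), Change(0, 1, 2),
               Change(1, 0, 2), Change(1, 1, 2)]
    # initargs are pickled at worker start-up, so the shared Arrays
    # arrive in each worker pointing at the same memory
    with Pool(4, initializer=init_worker, initargs=(my_array,)) as pool:
        pool.map(modify, changes)
    print([list(row) for row in my_array])  # [[2, 2, 3], [2, 2, 6]]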

Solution 2:

Generally there are two ways of running work in parallel:

  • threads, which share memory
  • multiple processes (in your case), which don't

If you used the threading backend in Joblib, you wouldn't have any problem with your current code, but threads in Python (CPython) don't execute in parallel because of the GIL (Global Interpreter Lock), so that isn't true parallelism.

Using multiple processes, on the other hand, spawns a new interpreter in a new process. Each interpreter still has its own GIL, but since you can spawn several processes to work on a task, you side-step the GIL and really do execute more than one operation at the same time. The problem is that processes don't share memory: when they work on a particular piece of data, each one copies it and works on its own copy, not on some global copy that everybody writes to the way threads do (that's why the GIL was put there, to prevent different threads from unexpectedly changing the state of a variable). It is possible to share memory between processes, though it isn't exactly sharing: you have to pass the data around between the processes.
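You can watch the copying happen with a small sketch using only the standard library (the names data and mutate are just for illustration):

import os
from multiprocessing import Pool

data = [1, 2, 3]

def mutate(i):
    data[i] = 0              # mutates the worker's own copy of the list
    return os.getpid(), data

if __name__ == '__main__':
    with Pool(2) as pool:
        for pid, copy in pool.map(mutate, range(3)):
            print(pid, copy)       # worker pids, each with its own mutated copy
    print(os.getpid(), data)       # parent still sees [1, 2, 3]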

Using shared memory is covered in the Joblib docs; quoting them:

By default the workers of the pool are real Python processes forked using the multiprocessing module of the Python standard library when n_jobs != 1. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.

What this means is that, by default, if you don't specify that you want memory to be shared among your processes, each process takes a copy of the list and performs the operations on that copy (when in reality you want them all to work on the one list you passed).
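The mechanism Joblib actually provides for genuine sharing is memmapping of numpy arrays: a numpy.memmap is backed by a file on disk, so every worker that receives one reads and writes the same bytes. A minimal sketch, assuming numpy is installed and 'shared.dat' is a scratch file in the current directory (both are assumptions of this example, not part of the original answer):

import numpy as np
from joblib import Parallel, delayed

def fill_row(arr, x):
    arr[x, :] = 2   # writes land in the file backing the memmap

if __name__ == '__main__':
    my_array = np.memmap('shared.dat', dtype='int64', mode='w+', shape=(2, 3))
    my_array[:] = [[1, 2, 3], [4, 5, 6]]
    # joblib reconstructs the memmap in each worker over the same file,
    # so the workers' writes are visible to the parent
    Parallel(n_jobs=2)(delayed(fill_row)(my_array, i) for i in [0, 1])
    print(np.array(my_array))   # [[2 2 2]
                                #  [2 2 2]]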

With your previous code, and some print statements added, you'll see what's going on:

from joblib import Parallel, delayed

my_array = [[1, 2, 3], [4, 5, 6]]

def foo(array, x):
    for i in [0, 1, 2]:
        array[x][i] = 2
        print(array, id(array), 'arrays in workers')
    return 0

def main(array):
    print(id(array), 'Original array')
    inputs = [0, 1]
    if __name__ == '__main__':
        Parallel(n_jobs=2, verbose=0)(delayed(foo)(array, i) for i in inputs)
        print(my_array, id(array), 'Original array')

main(my_array)

Running the code, we get:

140464165602120 Original array
[[2, 2, 3], [4, 5, 6]] 140464163002888 arrays in workers
[[2, 2, 3], [4, 5, 6]] 140464163002888 arrays in workers
[[2, 2, 2], [4, 5, 6]] 140464163002888 arrays in workers
[[1, 2, 3], [2, 5, 6]] 140464163003208 arrays in workers
[[1, 2, 3], [2, 2, 6]] 140464163003208 arrays in workers
[[1, 2, 3], [2, 2, 2]] 140464163003208 arrays in workers
[[1, 2, 3], [4, 5, 6]] 140464165602120 Original array

From the output you can see that the operations are indeed carried out, but on different lists (with different ids), so at the end of the day the original is untouched, unless you collect the workers' return values and reassemble them in the parent, as shown below.
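A sketch of that collect-and-reassemble pattern with Joblib, having foo return the modified row instead of mutating in place:

from joblib import Parallel, delayed

my_array = [[1, 2, 3], [4, 5, 6]]

def foo(row):
    return [2 for _ in row]   # build and return a new row

if __name__ == '__main__':
    # rebind the name in the parent to the rows the workers returned
    my_array = Parallel(n_jobs=2)(delayed(foo)(row) for row in my_array)
    print(my_array)   # [[2, 2, 2], [2, 2, 2]]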

But when you write the call as in the next snippet, the output is quite different, with every line showing the same id as the original array. Be careful interpreting this, though: delayed(has_shareable_memory)(foo(array, i)) evaluates foo(array, i) eagerly in the parent process while the task list is being built, so the mutations actually happen one after the other in the parent (which is why the id never changes), and only the return value is shipped to the workers.

from joblib import Parallel, delayed
from joblib.pool import has_shareable_memory

my_array = [[1, 2, 3], [4, 5, 6]]

def foo(array, x):
    for i in [0, 1, 2]:
        array[x][i] = 2
        print(array, id(array), 'arrays in workers')
    return 0

def main(array):
    print(id(array), 'Original array')
    inputs = [0, 1]
    if __name__ == '__main__':
        Parallel(n_jobs=2, verbose=0)(delayed(has_shareable_memory)(foo(array, i)) for i in inputs)
        print(my_array, id(array), 'Original array')

main(my_array)

Output

140615324148552 Original array
[[2, 2, 3], [4, 5, 6]] 140615324148552 arrays in workers
[[2, 2, 3], [4, 5, 6]] 140615324148552 arrays in workers
[[2, 2, 2], [4, 5, 6]] 140615324148552 arrays in workers
[[2, 2, 2], [2, 5, 6]] 140615324148552 arrays in workers
[[2, 2, 2], [2, 2, 6]] 140615324148552 arrays in workers
[[2, 2, 2], [2, 2, 2]] 140615324148552 arrays in workers
[[2, 2, 2], [2, 2, 2]] 140615324148552 Original array

I'd suggest you experiment and play with the multiprocessing module proper, to really understand what's going on, before you start using Joblib:

https://docs.python.org/3.5/library/multiprocessing.html

A simple way to do this, using the multiprocessing module from the standard library would be

from multiprocessing import Pool

def foo(x):
    return [2 for i in x]


my_array = [[1 ,2 ,3],[4,5,6]]

if __name__ == '__main__':
    with Pool(2) as p:
        print(p.map(foo, my_array))

But note that this doesn't modify my_array in place; it returns a new list.
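If you want the name my_array to end up pointing at the result, just rebind it in the parent, e.g.:

from multiprocessing import Pool

def foo(x):
    return [2 for i in x]

my_array = [[1, 2, 3], [4, 5, 6]]

if __name__ == '__main__':
    with Pool(2) as p:
        my_array = p.map(foo, my_array)   # replace the old list with the result
    print(my_array)   # [[2, 2, 2], [2, 2, 2]]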

