Python: Easy Way To Modify Array In Parallel
Solution 1:
Using only the standard library, you can use shared memory, in this case a multiprocessing.Array, to store and modify your matrix:
from multiprocessing import Pool, Array, Lock

lock = Lock()
my_array = [Array('i', [1, 2, 3], lock=lock),
            Array('i', [4, 5, 6], lock=lock)]
Let me suggest some modifications to your procedure: build a list (a schedule) of all the changes you need to make to your matrix (to make this clear I am going to use a namedtuple), plus a function that applies one change:
from collections import namedtuple

Change = namedtuple('Change', 'row idx value')

scheduled_changes = [Change(0, 0, 2),
                     Change(0, 1, 2),
                     Change(1, 0, 2),
                     Change(1, 1, 2)]
# or build the scheduled changes list in any other way, e.g. with
# for loops or list comprehensions...

def modify(change, matrix=my_array):
    matrix[change.row][change.idx] = change.value
Now you can use a Pool to map the modify function over the scheduled changes. This works because on fork-based platforms (e.g. Linux) the workers inherit my_array, and since each row is a shared Array, their writes are visible in the parent:
pool = Pool(4)
pool.map(modify, scheduled_changes)

for row in my_array:
    for col in row:
        print(col, end=' ')
    print()
# 2 2 3
# 2 2 6
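One caveat, with a sketch of my own that is not part of the original answer: on platforms that use the spawn start method (e.g. Windows), the workers won't inherit my_array through the default argument above. In that case you can hand the shared rows to each worker through the Pool initializer, since initargs is passed at process startup, where shared Arrays are allowed:

from collections import namedtuple
from multiprocessing import Pool, Array

Change = namedtuple('Change', 'row idx value')

_rows = None  # set in each worker by the initializer

def _init_worker(rows):
    global _rows
    _rows = rows

def modify(change):
    _rows[change.row][change.idx] = change.value

if __name__ == '__main__':
    my_array = [Array('i', [1, 2, 3]), Array('i', [4, 5, 6])]
    changes = [Change(0, 0, 2), Change(0, 1, 2),
               Change(1, 0, 2), Change(1, 1, 2)]
    with Pool(2, initializer=_init_worker, initargs=(my_array,)) as pool:
        pool.map(modify, changes)
    print([list(row) for row in my_array])  # [[2, 2, 3], [2, 2, 6]]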
Solution 2:
Generally there are two ways of going parallel here:
- shared memory (threads)
- multiple processes (in your case)
If you used the threading backend in Joblib, you wouldn't have any problem with your current code, because threads share memory; but threads in Python (CPython) don't execute in parallel because of the GIL (Global Interpreter Lock), so that isn't true parallelism. Still, it makes the mutation itself work, as the sketch below shows.
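A minimal sketch of that option, assuming the same my_array and foo as in your code: with backend='threading' both workers mutate the one shared list in place (the GIL means they interleave rather than truly run in parallel):

from joblib import Parallel, delayed

my_array = [[1, 2, 3], [4, 5, 6]]

def foo(array, x):
    for i in [0, 1, 2]:
        array[x][i] = 2   # threads see and mutate the same list

if __name__ == '__main__':
    Parallel(n_jobs=2, backend='threading')(delayed(foo)(my_array, i) for i in [0, 1])
    print(my_array)  # [[2, 2, 2], [2, 2, 2]]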
Using multiple processes, on the other hand, spawns a new interpreter in a new process. Each still has its own GIL, but you can spawn more processes to work on a task, which means you have side-stepped the GIL and can execute more than one operation at the same time. The problem is that processes don't share memory: when they work on a particular piece of data, each one copies it and works on its own copy, not on some global copy that everybody writes to, as threads do (that's why the GIL was put there, to prevent different threads from unexpectedly changing the state of a variable). It is possible to share memory between processes (not exactly sharing; you have to pass the data around between them).
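To make that copy behaviour concrete, here is a minimal demonstration of my own (not part of the original answer): a child process mutates its view of the list, while the parent's list stays untouched:

from multiprocessing import Process

data = [1, 2, 3]

def mutate():
    data[0] = 99                 # changes the child's copy only
    print('child sees:', data)   # child sees: [99, 2, 3]

if __name__ == '__main__':
    p = Process(target=mutate)
    p.start()
    p.join()
    print('parent sees:', data)  # parent sees: [1, 2, 3]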
Using shared memory is covered in the Joblib docs. Quoting the docs:
By default the workers of the pool are real Python processes forked using the multiprocessing module of the Python standard library when n_jobs != 1. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.
What this means is that, by default, unless you specify that you want memory shared among your processes, each process takes a copy of the list and performs the operations on its own copy (when in reality you want them all to work on the single list you passed).
With your previous code and some print statements added, you'll see what's going on:
from joblib import Parallel, delayed

my_array = [[1, 2, 3], [4, 5, 6]]

def foo(array, x):
    for i in [0, 1, 2]:
        array[x][i] = 2
        print(array, id(array), 'arrays in workers')
    return 0

def main(array):
    print(id(array), 'Original array')
    inputs = [0, 1]
    if __name__ == '__main__':
        Parallel(n_jobs=2, verbose=0)(delayed(foo)(array, i) for i in inputs)
        print(my_array, id(array), 'Original array')

main(my_array)
Running the code, we get:
140464165602120 Original array
[[2, 2, 3], [4, 5, 6]] 140464163002888 arrays in workers
[[2, 2, 3], [4, 5, 6]] 140464163002888 arrays in workers
[[2, 2, 2], [4, 5, 6]] 140464163002888 arrays in workers
[[1, 2, 3], [2, 5, 6]] 140464163003208 arrays in workers
[[1, 2, 3], [2, 2, 6]] 140464163003208 arrays in workers
[[1, 2, 3], [2, 2, 2]] 140464163003208 arrays in workers
[[1, 2, 3], [4, 5, 6]] 140464165602120 Original array
From the output you can see that the operations are indeed carried out, but on different lists (with different ids), so at the end of the day there is no way to merge those two lists together to get your desired result.
But when every operation touches the same list, the output is quite different. (Read the next snippet carefully: foo(array, i) is evaluated eagerly, before delayed wraps has_shareable_memory around its return value, so the assignments actually run sequentially in the parent process rather than in the workers; that is why every id below matches the original array's.)
from joblib import Parallel, delayed
from joblib.pool import has_shareable_memory

my_array = [[1, 2, 3], [4, 5, 6]]

def foo(array, x):
    for i in [0, 1, 2]:
        array[x][i] = 2
        print(array, id(array), 'arrays in workers')
    return 0

def main(array):
    print(id(array), 'Original array')
    inputs = [0, 1]
    if __name__ == '__main__':
        Parallel(n_jobs=2, verbose=0)(delayed(has_shareable_memory)(foo(array, i)) for i in inputs)
        print(my_array, id(array), 'Original array')

main(my_array)
Output
140615324148552 Original array
[[2, 2, 3], [4, 5, 6]] 140615324148552 arrays in workers
[[2, 2, 3], [4, 5, 6]] 140615324148552 arrays in workers
[[2, 2, 2], [4, 5, 6]] 140615324148552 arrays in workers
[[2, 2, 2], [2, 5, 6]] 140615324148552 arrays in workers
[[2, 2, 2], [2, 2, 6]] 140615324148552 arrays in workers
[[2, 2, 2], [2, 2, 2]] 140615324148552 arrays in workers
[[2, 2, 2], [2, 2, 2]] 140615324148552 Original array
I suggest you experiment and play with the multiprocessing module proper to really understand what's going on before you start using Joblib:
https://docs.python.org/3.5/library/multiprocessing.html
A simple way to do this, using the multiprocessing module from the standard library, would be:
from multiprocessing import Pool

def foo(x):
    return [2 for i in x]

my_array = [[1, 2, 3], [4, 5, 6]]

if __name__ == '__main__':
    with Pool(2) as p:
        print(p.map(foo, my_array))
But note that this doesn't modify my_array; it returns a new list of rows.
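If you want to keep the result, a small follow-up sketch (same setup as above): simply rebind the name to what map returns:

from multiprocessing import Pool

def foo(x):
    return [2 for i in x]

my_array = [[1, 2, 3], [4, 5, 6]]

if __name__ == '__main__':
    with Pool(2) as p:
        my_array = p.map(foo, my_array)  # rebind to the new rows
    print(my_array)  # [[2, 2, 2], [2, 2, 2]]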