Python Matrix Multiplication And Caching
Solution 1:
The latter version (block multiplication) is only about 30% faster because you save about 30% of the index accesses by using a local variable in the inner loop!
(And FYI, this is not C++: Python's list type is like a C++ vector; otherwise people would be wasting their time trying to access elements by index. So list is the fastest standard random-access container available in Python.)
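To put a number on what a single saved index lookup costs, here is a small micro-benchmark (my own sketch, not part of the original question; the 200x200 size and repeat count are arbitrary) comparing a raw c[i][j] access against hoisting the row into a local first:

import timeit

setup = "c = [[0] * 200 for _ in range(200)]"

# two lookups per element: c[i], then [j]
two_lookups = """
s = 0
for i in range(200):
    for j in range(200):
        s += c[i][j]
"""

# one lookup per element: the row is hoisted into the local ci
one_lookup = """
s = 0
for i in range(200):
    ci = c[i]
    for j in range(200):
        s += ci[j]
"""

print("c[i][j] every time:", timeit.timeit(two_lookups, setup, number=100))
print("row hoisted to ci :", timeit.timeit(one_lookup, setup, number=100))

The second variant should come out measurably faster, for exactly the reason the _faster versions below are.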
I just made a test program based on your code to prove my point and what I was suspecting (I had to complete your code; sorry for the big block, but at least it is minimal, complete & verifiable for all).
def zero_matrix(sz):
    return [[0] * sz for i in range(sz)]

def baseline_matrix_multiply(a, b, n):
    '''
    baseline multiply
    '''
    c = zero_matrix(n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def baseline_matrix_multiply_flipjk(a, b, n):
    '''
    same as baseline but switch j and k loops
    '''
    c = zero_matrix(n)
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def baseline_matrix_multiply_flipjk_faster(a, b, n):
    '''
    same as flipjk, but hoist loop-invariant lookups into locals
    '''
    c = zero_matrix(n)
    for i in range(n):
        ci = c[i]
        for k in range(n):
            bk = b[k]
            aik = a[i][k]
            for j in range(n):
                ci[j] += aik * bk[j]
    return c

def fast_matrix_multiply_blocking(a, b, n):
    '''
    use blocking
    '''
    c = zero_matrix(n)
    block = 25
    en = block * (n // block)  # round n down to a multiple of block (here n = 200 divides evenly)
    for kk in range(0, en, block):
        for jj in range(0, en, block):
            for i in range(n):
                for j in range(jj, jj + block):
                    sum = c[i][j]
                    for k in range(kk, kk + block):
                        sum += a[i][k] * b[k][j]
                    c[i][j] = sum
    return c

def fast_matrix_multiply_blocking_faster(a, b, n):
    '''
    use blocking, plus the same local-variable hoisting
    '''
    c = zero_matrix(n)
    block = 25
    en = block * (n // block)  # round n down to a multiple of block (here n = 200 divides evenly)
    for kk in range(0, en, block):
        for jj in range(0, en, block):
            for i in range(n):
                ai = a[i]
                ci = c[i]
                for j in range(jj, jj + block):
                    s = ci[j]
                    for k in range(kk, kk + block):
                        s += ai[k] * b[k][j]
                    ci[j] = s
    return c

def init_ab(sz):
    return [list(range(sz)) for i in range(sz)], [[3] * sz for i in range(sz)]

import time

sz = 200

a, b = init_ab(sz)
start_time = time.time()
baseline_matrix_multiply(a, b, sz)
print("baseline_matrix_multiply: " + str(time.time() - start_time))

a, b = init_ab(sz)
start_time = time.time()
baseline_matrix_multiply_flipjk(a, b, sz)
print("baseline_matrix_multiply_flipjk: " + str(time.time() - start_time))

a, b = init_ab(sz)
start_time = time.time()
fast_matrix_multiply_blocking(a, b, sz)
print("fast_matrix_multiply_blocking: " + str(time.time() - start_time))

a, b = init_ab(sz)
start_time = time.time()
baseline_matrix_multiply_flipjk_faster(a, b, sz)
print("**baseline_matrix_multiply_flipjk_faster**: " + str(time.time() - start_time))

a, b = init_ab(sz)
start_time = time.time()
fast_matrix_multiply_blocking_faster(a, b, sz)
print("**fast_matrix_multiply_blocking_faster**: " + str(time.time() - start_time))
Results on my PC (the last two results, surrounded with stars, are my versions):
baseline_matrix_multiply: 2.578160047531128
baseline_matrix_multiply_flipjk: 2.5315518379211426
fast_matrix_multiply_blocking: 1.9359750747680664
**baseline_matrix_multiply_flipjk_faster**: 1.4532990455627441
**fast_matrix_multiply_blocking_faster**: 1.7031919956207275
As you can see, my version of your baseline_matrix_multiply_flipjk (the fourth one) is faster than even the block multiply, meaning that index checking & accessing dwarfs the cache effects you can exploit in compiled languages with direct pointers, like C or C++.
I just stored the values that do not change in the inner loop (the one that matters most to optimize) in locals, to avoid repeated index access.
Note that I tried to apply the same recipe to the block multiply, and I gained some time compared to your version, but it is still beaten by the flipjk_faster version because of the inability to avoid index access.
Maybe compiling the code using Cython and dropping the checks would get the result you want; a rough sketch of that route follows. But pre-computing indexes never hurts.
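Here is what that might look like in Cython's "pure Python" mode (my own sketch, not benchmarked against the numbers above; it assumes Cython is installed, the decorators only take effect once the file is compiled with cythonize, and real gains would also want typed memoryviews or arrays instead of Python lists):

import cython

@cython.boundscheck(False)  # drop index bounds checking when compiled
@cython.wraparound(False)   # drop negative-index handling when compiled
def matmul_flipjk_cy(a, b, n: cython.int):
    # same flipjk_faster loop as above; under plain CPython the
    # decorators are no-ops, so this still runs uncompiled
    c = [[0] * n for _ in range(n)]
    i: cython.int
    j: cython.int
    k: cython.int
    for i in range(n):
        ci = c[i]
        for k in range(n):
            aik = a[i][k]
            bk = b[k]
            for j in range(n):
                ci[j] += aik * bk[j]
    return c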
Solution 2:
Python tends not to cache the results of its functions. It requires explicit notice of when you want to build a cache for a function. You can do this using the lru_cache decorator.
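For a minimal illustration of the decorator before the bigger program below (a toy example of mine, unrelated to the code that follows):

from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache: every distinct argument is remembered
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))           # 832040, computed with only 31 non-cached calls
print(fib.cache_info())  # CacheInfo(hits=28, misses=31, maxsize=None, currsize=31)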
The following is code I threw together the other day and I've just added some readability. If there's something wrong, comment and I'll sort it out:
from functools import lru_cache
from time import perf_counter as clk  # time.clock was removed in Python 3.8

recur = 0

#@lru_cache(maxsize=4, typed=False)
def Func(m, n):
    global recur
    recur += 1
    if m == 0:
        return n + 1
    elif n == 0:
        return Func(m - 1, 1)
    else:
        return Func(m - 1, Func(m, n - 1))

n = []
m = 0
ITER = 50
F1 = 3
F2 = 4

staTime = clk()
for x in range(ITER):
    n.append(Func(F1, F2))
m += clk() - staTime

print("Uncached calls of Func:: " + str(recur // ITER))
print("Average time per solving of Ackermann's function:: " + str(m / ITER))
print("End result:: " + str(n[0]))
BTW: "#" indicates a comment, in case you're unfamiliar with Python.
So try running this, and then try it AGAIN without the "#" on the line #@lru_cache(maxsize=4, typed=False).
Additionally, maxsize is the maximum size of the cache (it works best in powers of two, according to the docs), and typed just makes the cache add a new cached entry whenever the same value is passed as an argument with a different type, as the snippet below shows.
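(A toy demonstration of the typed flag, separate from the program above:)

from functools import lru_cache

@lru_cache(maxsize=4, typed=True)
def square(x):
    return x * x

square(3)    # cached under the int 3
square(3.0)  # typed=True: cached again, separately, under the float 3.0
print(square.cache_info().currsize)  # 2 -- with typed=False this would be 1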
Finally, the "Func" function is known as Ackermann's function. A stupidly deep amount of recursion goes on, so be aware of stack overflows (lol) if you should change the max recursion depth.
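If you do want to raise that limit, it is one call (use with care: past a certain point a real stack overflow will crash the interpreter instead of raising RecursionError):

import sys

print(sys.getrecursionlimit())  # usually 1000 by default
sys.setrecursionlimit(10000)    # raise it before trying deeper Ackermann arguments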