Python Matrix Multiplication And Caching
Solution 1:
The latter version (block multiplication) is only about 30% faster because you save about 30% of the index accesses by using a local variable in the inner loop!
(And FYI, this is not C++: Python's list type is like a C++ vector; otherwise people would be wasting their time trying to access elements by index. So list is the fastest standard random-access container available in Python.)
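To put a number on what a single saved index lookup costs, here is a small micro-benchmark (my own sketch, not part of the original question; the 200x200 size and repeat count are arbitrary) comparing a raw c[i][j] access against hoisting the row into a local first:

import timeit

setup = "c = [[0] * 200 for _ in range(200)]"

# two lookups per element: c[i], then [j]
two_lookups = """
s = 0
for i in range(200):
    for j in range(200):
        s += c[i][j]
"""

# one lookup per element: the row is hoisted into the local ci
one_lookup = """
s = 0
for i in range(200):
    ci = c[i]
    for j in range(200):
        s += ci[j]
"""

print("c[i][j] every time:", timeit.timeit(two_lookups, setup, number=100))
print("row hoisted to ci :", timeit.timeit(one_lookup, setup, number=100))

The second variant should come out measurably faster, for exactly the reason the _faster versions below are.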
I just made a test program based on your code to prove my point and what I was suspecting (I had to complete your code; sorry for the big block, but at least it is minimal, complete & verifiable for all).
def zero_matrix(sz):
    return [[0] * sz for i in range(sz)]

def baseline_matrix_multiply(a, b, n):
    '''
    baseline multiply
    '''
    c = zero_matrix(n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def baseline_matrix_multiply_flipjk(a, b, n):
    '''
    same as baseline but switch j and k loops
    '''
    c = zero_matrix(n)
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def baseline_matrix_multiply_flipjk_faster(a, b, n):
    '''
    same as flipjk, but hoist loop-invariant lookups into locals
    '''
    c = zero_matrix(n)
    for i in range(n):
        ci = c[i]
        for k in range(n):
            bk = b[k]
            aik = a[i][k]
            for j in range(n):
                ci[j] += aik * bk[j]
    return c

def fast_matrix_multiply_blocking(a, b, n):
    '''
    use blocking
    '''
    c = zero_matrix(n)
    block = 25
    en = block * (n // block)  # round n down to a multiple of block (here n = 200 divides evenly)
    for kk in range(0, en, block):
        for jj in range(0, en, block):
            for i in range(n):
                for j in range(jj, jj + block):
                    sum = c[i][j]
                    for k in range(kk, kk + block):
                        sum += a[i][k] * b[k][j]
                    c[i][j] = sum
    return c

def fast_matrix_multiply_blocking_faster(a, b, n):
    '''
    use blocking, plus the same local-variable hoisting
    '''
    c = zero_matrix(n)
    block = 25
    en = block * (n // block)  # round n down to a multiple of block (here n = 200 divides evenly)
    for kk in range(0, en, block):
        for jj in range(0, en, block):
            for i in range(n):
                ai = a[i]
                ci = c[i]
                for j in range(jj, jj + block):
                    s = ci[j]
                    for k in range(kk, kk + block):
                        s += ai[k] * b[k][j]
                    ci[j] = s
    return c

def init_ab(sz):
    return [list(range(sz)) for i in range(sz)], [[3] * sz for i in range(sz)]

import time

sz = 200

a, b = init_ab(sz)
start_time = time.time()
baseline_matrix_multiply(a, b, sz)
print("baseline_matrix_multiply: " + str(time.time() - start_time))

a, b = init_ab(sz)
start_time = time.time()
baseline_matrix_multiply_flipjk(a, b, sz)
print("baseline_matrix_multiply_flipjk: " + str(time.time() - start_time))

a, b = init_ab(sz)
start_time = time.time()
fast_matrix_multiply_blocking(a, b, sz)
print("fast_matrix_multiply_blocking: " + str(time.time() - start_time))

a, b = init_ab(sz)
start_time = time.time()
baseline_matrix_multiply_flipjk_faster(a, b, sz)
print("**baseline_matrix_multiply_flipjk_faster**: " + str(time.time() - start_time))

a, b = init_ab(sz)
start_time = time.time()
fast_matrix_multiply_blocking_faster(a, b, sz)
print("**fast_matrix_multiply_blocking_faster**: " + str(time.time() - start_time))
Results on my PC (the last two results, surrounded with stars, are my versions):
baseline_matrix_multiply: 2.578160047531128
baseline_matrix_multiply_flipjk: 2.5315518379211426
fast_matrix_multiply_blocking: 1.9359750747680664
**baseline_matrix_multiply_flipjk_faster**: 1.4532990455627441
**fast_matrix_multiply_blocking_faster**: 1.7031919956207275
As you can see, my version of your baseline_matrix_multiply_flipjk (the fourth one) is faster than even the block multiply, meaning that index checking & accessing dwarfs the cache effects you can exploit in compiled languages with direct pointers, like C or C++.
I just stored the values that do not change in the inner loop (the one that matters most to optimize) in locals, to avoid repeated index access.
Note that I tried to apply the same recipe to the block multiply, and I gained some time compared to your version, but it is still beaten by the flipjk_faster version because of the inability to avoid index access.
Maybe compiling the code using Cython and dropping the checks would get the result you want; a rough sketch of that route follows. But pre-computing indexes never hurts.
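Here is what that might look like in Cython's "pure Python" mode (my own sketch, not benchmarked against the numbers above; it assumes Cython is installed, the decorators only take effect once the file is compiled with cythonize, and real gains would also want typed memoryviews or arrays instead of Python lists):

import cython

@cython.boundscheck(False)  # drop index bounds checking when compiled
@cython.wraparound(False)   # drop negative-index handling when compiled
def matmul_flipjk_cy(a, b, n: cython.int):
    # same flipjk_faster loop as above; under plain CPython the
    # decorators are no-ops, so this still runs uncompiled
    c = [[0] * n for _ in range(n)]
    i: cython.int
    j: cython.int
    k: cython.int
    for i in range(n):
        ci = c[i]
        for k in range(n):
            aik = a[i][k]
            bk = b[k]
            for j in range(n):
                ci[j] += aik * bk[j]
    return c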
Solution 2:
Python tends not to cache the results of its functions. It requires explicit notice of when you want to build a cache for a function. You can do this using the lru_cache decorator.
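For a minimal illustration of the decorator before the bigger program below (a toy example of mine, unrelated to the code that follows):

from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache: every distinct argument is remembered
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))           # 832040, computed with only 31 non-cached calls
print(fib.cache_info())  # CacheInfo(hits=28, misses=31, maxsize=None, currsize=31)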
The following is code I threw together the other day and I've just added some readability. If there's something wrong, comment and I'll sort it out:
from functools import lru_cache
from time import perf_counter as clk  # time.clock was removed in Python 3.8

recur = 0

#@lru_cache(maxsize=4, typed=False)
def Func(m, n):
    global recur
    recur += 1
    if m == 0:
        return n + 1
    elif n == 0:
        return Func(m - 1, 1)
    else:
        return Func(m - 1, Func(m, n - 1))

n = []
m = 0
ITER = 50
F1 = 3
F2 = 4

staTime = clk()
for x in range(ITER):
    n.append(Func(F1, F2))
m += clk() - staTime

print("Uncached calls of Func:: " + str(recur // ITER))
print("Average time per solving of Ackermann's function:: " + str(m / ITER))
print("End result:: " + str(n[0]))
BTW: "#" indicates a comment, in case you're unfamiliar with Python.
So try running this, and then try it AGAIN without the "#" on the line #@lru_cache(maxsize=4, typed=False).
Additionally, maxsize is the maximum size of the cache (it works best in powers of two, according to the docs), and typed just makes the cache add a new cached entry whenever the same value is passed as an argument with a different type, as the snippet below shows.
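(A toy demonstration of the typed flag, separate from the program above:)

from functools import lru_cache

@lru_cache(maxsize=4, typed=True)
def square(x):
    return x * x

square(3)    # cached under the int 3
square(3.0)  # typed=True: cached again, separately, under the float 3.0
print(square.cache_info().currsize)  # 2 -- with typed=False this would be 1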
Finally, the "Func" function is known as Ackermann's function. A stupidly deep amount of recursion goes on, so be aware of stack overflows (lol) if you should change the max recursion depth.
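If you do want to raise that limit, it is one call (use with care: past a certain point a real stack overflow will crash the interpreter instead of raising RecursionError):

import sys

print(sys.getrecursionlimit())  # usually 1000 by default
sys.setrecursionlimit(10000)    # raise it before trying deeper Ackermann arguments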