28

I'm using the Anaconda distribution of Python, together with Numba, and I've written the following Python function that multiplies a sparse matrix A (stored in a CSR format) by a dense vector x:

@jit
def csrMult( x, Adata, Aindices, Aindptr, Ashape ):

    numRowsA = Ashape[0]
    Ax       = numpy.zeros( numRowsA )

    for i in range( numRowsA ):
        Ax_i = 0.0
        for dataIdx in range( Aindptr[i], Aindptr[i+1] ):

            j     = Aindices[dataIdx]
            Ax_i +=    Adata[dataIdx] * x[j]

        Ax[i] = Ax_i

    return Ax 

Here A is a large scipy sparse matrix,

>>> A.shape
( 56469, 39279 )
#                  having ~ 142,258,302 nonzero entries (so about 6.4% )
>>> type( A[0,0] )
dtype( 'float32' )

and x is a numpy array. Here is a snippet of code that calls the above function:

x       = numpy.random.randn( A.shape[1] )
Ax      = A.dot( x )   
AxCheck = csrMult( x, A.data, A.indices, A.indptr, A.shape )

Notice the @jit-decorator that tells Numba to do a just-in-time compilation for the csrMult() function.

In my experiments, my function csrMult() is about twice as fast as the scipy .dot() method. That is a pretty impressive result for Numba.

However, MATLAB still performs this matrix-vector multiplication about 6 times faster than csrMult(). I believe that is because MATLAB uses multithreading when performing sparse matrix-vector multiplication.


Question:

How can I parallelize the outer for-loop when using Numba?

Numba used to have a prange() function, that made it simple to parallelize embarassingly parallel for-loops. Unfortunately, Numba no longer has prange() [actually, that is false, see the edit below]. So what is the correct way to parallelize this for-loop now, that Numba's prange() function is gone?

When prange() was removed from Numba, what alternative did the developers of Numba have in mind?


Edit 1:
I updated to the latest version of Numba, which is .35, and prange() is back! It was not included in version .33, the version I had been using.
That is good news, but unfortunately I am getting an error message when I attempt to parallelize my for loop using prange(). Here is a parallel for loop example from the Numba documentation (see section 1.9.2 "Explicit Parallel Loops"), and below is my new code:

from numba import njit, prange
@njit( parallel=True )
def csrMult_numba( x, Adata, Aindices, Aindptr, Ashape):

    numRowsA = Ashape[0]    
    Ax       = np.zeros( numRowsA )

    for i in prange( numRowsA ):
        Ax_i = 0.0        
        for dataIdx in range( Aindptr[i],Aindptr[i+1] ):

            j     = Aindices[dataIdx]
            Ax_i +=    Adata[dataIdx] * x[j]

        Ax[i] = Ax_i            

    return Ax 

When I call this function, using the code snippet given above, I receive the following error:

AttributeError: Failed at nopython (convert to parfors) 'SetItem' object has no attribute 'get_targets'


Given
the above attempt to use prange crashes, my question stands:

What is the correct way ( using prange or an alternative method ) to parallelize this Python for-loop?

As noted below, it was trivial to parallelize a similar for loop in C++ and obtain an 8x speedup, having been run on 20-omp-threads. There must be a way to do it using Numba, since the for loop is embarrassingly parallel (and since sparse matrix-vector multiplication is a fundamental operation in scientific computing).


Edit 2:
Here is my C++ version of csrMult(). Parallelizing the for() loop in the C++ version makes the code about 8x faster in my tests. This suggests to me that a similar speedup should be possible for the Python version when using Numba.

void csrMult(VectorXd& Ax, VectorXd& x, vector<double>& Adata, vector<int>& Aindices, vector<int>& Aindptr)
{
    // This code assumes that the size of Ax is numRowsA.
    #pragma omp parallel num_threads(20)
    {       
        #pragma omp for schedule(dynamic,590) 
        for (int i = 0; i < Ax.size(); i++)
        {
            double Ax_i = 0.0;
            for (int dataIdx = Aindptr[i]; dataIdx < Aindptr[i + 1]; dataIdx++)
            {
                Ax_i += Adata[dataIdx] * x[Aindices[dataIdx]];
            }

            Ax[i] = Ax_i;
        }
    }
}
9
  • Have you tried the parallel=True keyword argument to the jit decorator? I mean annotating it with @jit(parallel=True)?
    – f0xdx
    Commented Oct 25, 2017 at 4:48
  • @fxx I just tried replacing @jit with @jit(parallel=True), and when I ran my test code snippet I received the following error message: KeyError: "<class 'numba.targets.cpu.CPUTargetOptions'> does not support option: 'parallel'"
    – littleO
    Commented Oct 25, 2017 at 4:53
  • Yes, this is an experimental feature (and depending on you version of numba might not yet be available). Ok, with that option removed, the next thing I'd try is to port the implementation to @vectorize or @guvectorize (to generate ufuncs). Maybe you even have to roll out the inner loop into another function for that.
    – f0xdx
    Commented Oct 25, 2017 at 5:00
  • @littleO Let's be a bit more quantitative in the problem formulation. How large and how sparse is the A matrix ( rows, cols, dtype ) + a ( sparse / dense ) occupancy ratio? N.b.: Trying to compare a MATLAB code-execution with Py3/Numba ecosystem tooling may be a lot misleading. Commented Oct 25, 2017 at 8:49
  • @user3666197 I updated the question with some important new information. A has 56,469 rows and 39,279 columns and 142,258,302 nonzero entries (so about 6.4% of its entries are nonzero). The output of type(A[0,0]) is numpy.float32. I wrote a very similar csrMult function in C++ where it was trivial to parallelize the for loop (because C++ supports openMP natively), and my function got about 6 or 7 times faster. I would expect to achieve a similar speedup by parallelizing the for loop in Python when using Numba.
    – littleO
    Commented Oct 25, 2017 at 9:15

2 Answers 2

28

Numba has been updated and prange() works now! (I'm answering my own question.)

The improvements to Numba's parallel computing capabilities are discussed in this blog post, dated December 12, 2017. Here is a relevant snippet from the blog:

Long ago (more than 20 releases!), Numba used to have support for an idiom to write parallel for loops called prange(). After a major refactoring of the code base in 2014, this feature had to be removed, but it has been one of the most frequently requested Numba features since that time. After the Intel developers parallelized array expressions, they realized that bringing back prange would be fairly easy

Using Numba version 0.36.1, I can parallelize my embarrassingly parallel for-loop using the following simple code:

@numba.jit(nopython=True, parallel=True)
def csrMult_parallel(x,Adata,Aindices,Aindptr,Ashape): 
    
    numRowsA = Ashape[0]    
    Ax = np.zeros(numRowsA)
    
    for i in numba.prange(numRowsA):
        Ax_i = 0.0        
        for dataIdx in range(Aindptr[i],Aindptr[i+1]):
            
            j = Aindices[dataIdx]
            Ax_i += Adata[dataIdx]*x[j]
            
        Ax[i] = Ax_i            
                        
    return Ax

In my experiments, parallelizing the for-loop made the function execute about eight times faster than the version I posted at the beginning of my question, which was already using Numba, but which was not parallelized. Moreover, in my experiments the parallelized version is about 5x faster than the command Ax = A.dot(x) which uses scipy's sparse matrix-vector multiplication function. Numba has crushed scipy and I finally have a python sparse matrix-vector multiplication routine that is as fast as MATLAB.

14
  • 3
    A cool piece of news. If this works universally on either of the Intel, AMD, ARM, ... architectures, then the code re-design was indeed a brilliant move. If the trick was to just use the new possibilities, coming from hardware-based extended registers and vectorised operations instructions, not present on other processor architectures, the ARM and might be also the AMD ports will not enjoy the performance that you have enjoyed to observe. Anyway, enjoy the new powers available for further expanding your valued research. Commented Dec 15, 2017 at 11:51
  • 1
    Thank you for pointing me to this! I have forwarded a link to the Numba team for their encouragement. Commented Dec 16, 2017 at 19:42
  • 2
    @MichaelGrant I have a question for you, if you don't mind. Do you know if Numba provides a way to specify the "chunk size" when using prange() to parallelize a for-loop?
    – littleO
    Commented Dec 18, 2017 at 12:44
  • 2
    Thinking about it more, it makes sense that A * x would be slower in MATLAB than A' * x. With CSC storage, A' * x, is much easier to parallelize, because each row gets its own thread. Commented Dec 18, 2017 at 16:34
  • 1
    @GeoffreyNegiar I was hesitant to accept my own answer and to undo acceptance on a different answer, but you are right. I just made this the accepted answer.
    – littleO
    Commented Jun 18, 2020 at 19:08
9

Thanks for your quant updates, Daniel.
The following lines might be hard to swallow, but kindly believe me, there are more things to take into account. I have worked on / / problems
having matrices in the scales ~ N [TB]; N > 10 and their sparse accompanions, so some pieces of experience may be useful for your further views.

WARNING: Do not expect any dinner to be served for free

A wish to parallelise a piece of code sounds like a more and more often contemporary re-articulated mana. The problem is not the code, but the cost of such move.

The economy is the number one problem. Amdahl's Law, as it was originally formulated by Gene Amdahl, did not take into account the very costs of [PAR]-processes-setups + [PAR]-processes-finalisations & terminations, which indeed have to be paid in every real-world implementation.

The overhead-strict Amdahl's Law depicts the scale of these un-avoidable adverse effects and helps understand a few new aspects that have to be evaluated before one opts to introduce parallelisation ( at an acceptable cost of doing so, as it is very, indeed VERY EASY to pay MUCH more than one may gain from -- where a naive disappointment from a degraded processing performance is the easier part of the story ).

Feel free to read more posts on overhead-strict Amdahl's Law re-formulation, if willing to better understand this topic and to pre-calculate the actual "minimum"-subProblem-"size", for which the sum-of-[PAR]-overheads will get at least justified from real-world tools for introducing the parallel-split of the subProblem onto N_trully_[PAR]_processes ( not any "just"-[CONCURRENT], but true-[PARALLEL] -- these are way not equal ).


Python may get a dose of steroids for increased performance:

Python is a great prototyping eco-system, whereas numba, numpy and other compiled-extensions help a lot to boost performance way farther than a native, GIL-stepped python (co-)-processing typically delivers.

Here, you try to enforce numba.jit() to arrange the job almost-for-free, just by its automated jit()-time lexical-analyser ( that you throw your code on ), which ought both "understand" your global goal ( What to do ), and also propose some vectorisation-tricks ( How best assemble a heap of CPU-instructions for maximum efficiency of such code-execution ).

This sounds easy, but it is not.

Travis Oliphant's team has made immense progress on numba tools, but let's be realistic and fair not to expect any form of automated-wizardry to get implemented inside a .jit()-lexer + code-analysis, when trying to transform a code and assemble a more efficient flow of machine instructions to implement the high-level task's goal.

@guvectorize? Here? Seriously?

Due to [PSPACE] sizing, you may immediately forget to ask numba to somehow efficiently "stuff" the GPU-engine with data, a memory-footprint of which is way behind the GPU-GDDR sizings ( not speaking at all about too-"shallow" GPU-kernel sizes for such mathematically-"tiny" processing to just multiply, potentially in [PAR], but to later sum in [SEQ] ).

(Re-)-Loading GPU with data takes heaps of time. If having paid that, the In-GPU memory-latencies are not very friendly for "tiny"-GPU-kernels economy either -- your GPU-SMX code-execution will have to pay ~ 350-700 [ns] just to fetch a number ( most probably not automatically re-aligned for the best coalesced SM-cache-friendly re-use in next steps and you may notice, that you never, let me repeat it, NEVER re-use a single matrix cell at all, so caching per-se will not deliver anything under those 350~700 [ns] per matrix cell ), while a smart pure numpy-vectorised code can process matrix-vector product in less than 1 [ns] per cell on even the largest [PSPACE]-footprints.

That is a yardstick to compare to.

( Profiling would better show here the hard-facts, but the principle is well known beforehand, without testing how to move a few TB of data onto GPU-fabric just to realise this on one's own. )


The worst of the bad news:

Given the memory scales of the matrix A, the worse effect to be expected is, that the sparse-organisation of the storage of the matrix representation will most probably devastate most, if not all, possible performance gains achievable by numba-vectorised tricks on dense matrix representations, as there will likely be almost zero chance for efficient memory fetched cache-line re-uses and sparsity will also break any easy way to achieve a compact mapping of vectorised operations and these will hardly remain able to get easily translated into advanced CPU-hardware vector-processing resources.


Inventory of solvable problems:

  • always better pre-allocate the vector Ax = np.zeros_like( A[:,0] ) and pass it as another parameter into the numba.jit()-compiled parts of the code, so as to avoid repetitive paying additional [PTIME,PSPACE]-costs for creating ( again ) new memory-allocations ( the more if the vector is suspect for being used inside an externally orchestrated iterative optimisation process )
  • always better specify ( to narrow the universality, for the sake of resulting code performance )
    at least the numba.jit( "f8[:]( f4[:], f4[:,:], ... )" )-calling interface directives
  • always review all numba.jit()-options available and their respective default values ( may change version to version ) for your specific situation ( disabling GIL and better aligning the goals with numba + hardware capabilities will always help in numerically intensive parts of the code )

@jit(   signature = [    numba.float32( numba.float32, numba.int32 ),                                   #          # [_v41] @decorator with a list of calling-signatures
                         numba.float64( numba.float64, numba.int64 )                                    #
                         ],    #__________________ a list of signatures for prepared alternative code-paths, to avoid a deferred lazy-compilation if undefined
        nopython = False,      #__________________ forces the function to be compiled in nopython mode. If not possible, compilation will raise an error.
        nogil    = False,      #__________________ tries to release the global interpreter lock inside the compiled function. The GIL will only be released if Numba can compile the function in nopython mode, otherwise a compilation warning will be printed.
        cache    = False,      #__________________ enables a file-based cache to shorten compilation times when the function was already compiled in a previous invocation. The cache is maintained in the __pycache__ subdirectory of the directory containing the source file.
        forceobj = False,      #__________________ forces the function to be compiled in object mode. Since object mode is slower than nopython mode, this is mostly useful for testing purposes.
        locals   = {}          #__________________ a mapping of local variable names to Numba Types.
        ) #____________________# [_v41] ZERO <____ TEST *ALL* CALLED sub-func()-s to @.jit() too >>>>>>>>>>>>>>>>>>>>> [DONE]
 def r...(...):
      ...
15
  • 3
    I don't think specifying the signature is a good advise, it prevents optimizations based on the contiguous-ness of the data (sometimes resulting in noticable degraded performance). Also I'm not sure why you mention GPU here. Nothing in the question mentions GPU.
    – MSeifert
    Commented Oct 25, 2017 at 11:44
  • 1
    But I like the part about the cost of parallel processing, especially the often ignored part that "it is very, indeed VERY EASY to pay MUCH more than one may gain from"!
    – MSeifert
    Commented Oct 25, 2017 at 11:45
  • Ad GPU) actually it was mentioned in comments above to try numba @guvectorize tool, so I added a few remarks on hidden extreme costs of ( also indeed VERY OFTEN mis-used ) GPU-latency-masking-SMX toys for this sort of problems. GPU can help for "mathematically"-large-GPU-kernels operating on very-compact-&-small-data-region + having minimum, best none, SIMT-synchronisation, but not for anything else. Parallelisation AT ANY COSTS is so, so often these days. "Ó Tempóra, ó Mórés ..." :o) Commented Oct 25, 2017 at 11:52
  • 1
    Thanks for this detailed answer. One thing to keep in mind is that I wrote a very similar csrMult function in C++, where it was trivial to parallelize the for loop (because C++ supports openMP natively), and by parallelizing the for loop I observed a 6x or 7x speedup, using the same matrix. I would expect a similar speedup here. In any case, I think it should at least be possible to parallelize my for loop using prange() without having the code crash. In C++, I only needed to write #pragma omp parallel for above the for loop to make the loop execute in parallel.
    – littleO
    Commented Oct 25, 2017 at 11:54
  • 2
    if I am reading this correctly there seems to be a mistaken assumption that guvectorize decorators imply GPU computation, but this is not correct. Indeed I use such constructs all the time on CPU targets. Commented Dec 16, 2017 at 19:47

Not the answer you're looking for? Browse other questions tagged or ask your own question.