6
\$\begingroup\$

I have written a code to calculate the MSD of some molecules. The code averages over multiple time origins (sliding time window) and over all the molecules. I have also made it do one extra thing: do a dot product of the displacement vector selected and the eigenvectors of the molecule. The code works but it is extremely slow.

In this code which I've posted below, the function "eig" takes in a file containing eigenvectors (3x3 matrix) of 105 molecules with 5,001 frames. So the shape is like (5,001x105,3,3). The second function, "coordinates" takes in the center of mass positions of the molecules. This file contains molecule index, x,y,z positions. So shape is like (5001x105,1,4) These two functions are reasonably okay.

The last function meanSqDisp is what I need help with. I think the for loops are killing my code. But I'm not all that good with Python and don't know where to make improvements to make it faster. I need your help.

import numpy as np

def eig(filename, totstep, frequency):
    data = np.loadtxt(filename)
    data = np.array(data).reshape(-1, 3, 3)
    return data

def coordinates(filename, totstep, frequency):
    with open(filename, 'r') as f:
        lines = f.readlines()
        m = int(totstep / frequency) + 1
        n = int(len(lines) / m)
        all_positions = [[] for _ in range(n)]
        for i, line in enumerate(lines):
            parts = line.split()
            atom_index = int(parts[0])
            x_position = float(parts[1])
            y_position = float(parts[2])
            z_position = float(parts[3])
            all_positions[atom_index - 1].append([x_position, y_position, z_position])
        all_positions = np.array(all_positions)
        return all_positions

def meanSqDisp(eigenvectors, coordinates):
    msd = {}
    for coord in coordinates:
        for window_size in range(2, len(coord) + 1):
            msd_sum = np.zeros(3)
            num_disp = len(coord) - window_size + 1
    
            for i in range(num_disp):
                
                #extract the displacement and eigenvector window
                window_disp_vec = coord[i:i + window_size]
                window_evec = eigenvectors[i:i + window_size]
        
                #subtract the last coord from the first in the window
                disp = window_disp_vec[-1] - window_disp_vec[0]
                
                #extract the first eigenvector from the eigenvector window
                evec = window_evec[0]
                
                #dot product of the eigenvectors and displacement (projection)
                coord_array= np.dot (disp,evec.T)
                msd_sum += np.square (coord_array)
                
            # calculate the mean of squared displacements
            msd_per_window = msd_sum / num_disp
            if window_size not in msd:
                msd[window_size] = []
            msd[window_size].append(msd_per_window)
    
    if not msd:
        return np.zeros(0) # return an empty array if msd is empty
    
    msd_data = np.array(list(msd.values())).reshape(5000,105,3)
    #number of number of frame-1, number of molecules, x,y,z
    msd_data = np.array(msd_data).reshape(5000,105,3) 
    avg_over_molecules = np.mean(msd_data, axis=1)
    #number of frames-1, 1, x,y,z
    avg_over_molecules = np.array(avg_over_molecules).reshape(5000,1,3)
    return avg_over_molecules

flipped_eig = eig("evecs-flipped", 500000, 100)
com_coord = coordinates("com-ec-2.out", 500000, 100)
msd_values = meanSqDisp(flipped_eig, com_coord)

From the profile stat below (advised in a comment) it seems like meanSqDisp() itself and dot product operation takes the most time (if I'm reading it correctly).

(k means ×1000 (used to keep columns aligned); procedures with "filename" numpy are from
/opt/ohpc/pub/libs/intel/python2/2.7.14/lib/python2.7/site-packages/numpy/lib/npyio.py)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1 5442.061 5442.061 6476.399 6476.399 msd-many-mol-….py:25(meanSqDisp)
1312763k  1026.565    0.000 1026.565    0.000 {numpy.core.multiarray.dot}
   525231    6.652    0.000    6.652    0.000 {range}
       33    3.815    0.116   14.220    0.431 numpy:994(read_data)
  4725945    2.822    0.000    3.451    0.000 numpy:734(floatconv)
3151k/1575k  2.049    0.000    2.182    0.000 numpy:966(pack_items)
  1575316    1.732    0.000    3.541    0.000 numpy:982(split_line)
        1    0.823    0.823    2.279    2.279 msd-many-mol-….py:9(coordinates)
  1575317    0.735    0.000    0.735    0.000 {zip}
   525000    0.703    0.000    0.703    0.000 {numpy.core.multiarray.zeros}

EDIT: I have added a sample input (2 frames) of what the input file looks like for Coordinates(). 4 columns corresponding to molecule number, x,y,z.

1 19.0729 -16.9332 -0.697239
2 -6.5962 8.55415 -19.5319
3 14.3369 1.06025 29.3728
.
.
105 2.08686 -5.48649 4.95266
1 19.252 -16.8804 -0.639706
2 -6.4427 8.47382 -19.2969
3 14.4882 0.967256 29.1666
.
.
105 5.9034 -2.8569 7.976

Sample input (1 frame) for eig() looks like below. This is eigenvectors for molecules 1-3 (3 eigenvectors per molecule)...so up to 105 molecules and 5001 frames.

0.720   -0.604  0.342  
0.655   0.754   -0.048 
-0.228  0.259   0.939  
-0.302  0.087   -0.949 
-0.742  0.603   0.291  
0.598   0.793   -0.117 
0.376   -0.016  -0.927 
-0.491  0.844   -0.214 
0.786   0.535   0.309
\$\endgroup\$
12
  • 1
    \$\begingroup\$ Please understand that when you assign window_disp_vec and window_evec, you are allocating a new vector and then copying in a window's worth of data; the : slice operator implies a copy. Then we access three elements and discard those copies. You have an opportunity to perform three indexed accesses without performing the slice copy. Also, this question is about performance but it omits timing figures and there's no profiler output telling us which lines consume the bulk of the elapsed time. There's time to revise. \$\endgroup\$
    – J_H
    Commented May 20, 2023 at 2:30
  • 1
    \$\begingroup\$ Since the profiler output is now added in the correct form and the comment has been adjusted, your question looks much better now. Well done. \$\endgroup\$
    – Mast
    Commented May 21, 2023 at 8:53
  • \$\begingroup\$ @Mast, thank you. I really hope I would be able to get help on this by forum members. \$\endgroup\$
    – mjksho
    Commented May 21, 2023 at 21:54
  • \$\begingroup\$ For tinkering, a (toy?) sample input would be great. \$\endgroup\$
    – greybeard
    Commented May 23, 2023 at 11:12
  • \$\begingroup\$ The intro says coords has shape (5001x105,1,4), but from the code I think its (105, 5001, 3). The comment #subtract the last coord from the first in the window is the reverse of what the line of code does. \$\endgroup\$
    – RootTwo
    Commented May 23, 2023 at 16:20

2 Answers 2

2
\$\begingroup\$

You care mostly about performance, so I will review the inner loops only and ignore everything else, including style. This is your code again:

for window_size in range(2, len(coord) + 1):
    msd_sum = np.zeros(3)
    num_disp = len(coord) - window_size + 1

    for i in range(num_disp):
        
        #extract the displacement and eigenvector window
        window_disp_vec = coord[i:i + window_size]
        window_evec = eigenvectors[i:i + window_size]

        #subtract the last coord from the first in the window
        disp = window_disp_vec[-1] - window_disp_vec[0]
        
        #extract the first eigenvector from the eigenvector window
        evec = window_evec[0]
        
        #dot product of the eigenvectors and displacement (projection)
        coord_array= np.dot (disp,evec.T)
        msd_sum += np.square (coord_array)
        
    # calculate the mean of squared displacements
    msd_per_window = msd_sum / num_disp
    if window_size not in msd:
        msd[window_size] = []
    msd[window_size].append(msd_per_window)

The biggest problem here is that you perform all loops in (interpreted) python, as opposed to working with arrays and having numpy do the tight loops in compiled native code. You can fix that by using the einsum() function. It uses Einstein's sum notation where you only specify the indices and numpy will run the loops for you. Here is a fairly direct translation of you inner loop from python to numpy:

# Allocate everything now, avoid memory allocations in the loop.
diff = np.zeros_like(coords)         # diff[i] = coords[i+m] - coords[i]
dot = np.zeros((len(coords), 3))     # dot[i] = np.dot(diff[i], ev[i])
msd = np.zeros((len(coords) + 1, 3)) # results
# m is window size minus one for more convenient indexing.
for m in range(1, len(coords)):
    # Drop first m and last m coords, then subtract the two arrays.
    np.subtract(coords[m:], coords[:-m], out=diff[:-m])
    # Dot products for all shifts i. Let numpy do the loop for you.
    np.einsum('ij,ikj->ik', diff[:-m], eigenvectors[:-m], out=dot[:-m])
    # Calculate mean square of dot products.
    np.square(dot[:-m], out=dot[:-m])
    np.mean(dot[:-m], axis=0, out=msd[m+1])

\$\endgroup\$
1
\$\begingroup\$

Document, especially procedures (function or method) using docstrings.

What is the significance of 105 in the reshapes, and why a second reshape of msd_data?

As J_H commented, try avoiding all that slicing, even with no copies created for "slicing NumPy arrays":

def mean_square_displacement(eigenvectors, coordinates):
    """ Return an array of mean square displacements. """
    msd = {}
    for coord in coordinates:
        for window_size in range(1, len(coord)):  # window_size_1 for -1?
            squared_displacements = np.zeros(3)
            num_disp = len(coord) - window_size
    
            for start in range(num_disp):
                
                # the displacement and eigenvector windows:
                # window_disp_vec = coord[start:start + window_size+1]
                # window_evec = eigenvectors[start:start + window_size+1]
        
                # subtract the first coord from the last in the window
                displacement = coord[start + window_size] - coord[start]
                
                # extract the first eigenvector (in the eigenvector window)
                eigenvector = eigenvectors[start]
                
                # dot product of the eigenvector and displacement (projection)
                coord_array = np.dot (displacement, eigenvector.T)
                squared_displacements += np.square(coord_array)
                
            # calculate the mean of squared displacements
            …

More things to try:

  • collect displacements into a NumPy array to operate on
    - call & loop setup overhead may be killing performance
  • looping not over window size, but start
    would need keeping a lot of "accumulators"
    allows keeping coord[start], eigenvectors[start].T
\$\endgroup\$
1
  • \$\begingroup\$ Thank you. 105 is the number of molecules in the system. n is actually used in the code. I will try the start suggestion and give feedback. \$\endgroup\$
    – mjksho
    Commented May 23, 2023 at 22:31

Not the answer you're looking for? Browse other questions tagged or ask your own question.