Explaining long computing time of Python code

Question

I have a Python code adding information to a layer if its features are overlayed by other features. The base layer has 1200 features and is 10 MB. The 4 layers for the overlay have roughly 40 MB combined.

When I run the script, it takes more than 10 hours to calculate the results. This seems odd and long. I have a Ryzen 2700X and 16GB of RAM.

Is there anything off about my code which could explain the long computing time? What steps could I do to optimize this?

import time

start_time = time.time()    

project = QgsProject().instance()

listOverlays = ['Overlay1', 'Overlay2', 'Overlay3', 'Overlay4']
mainLayer = project.mapLayersByName('Baselayer')[0]
provider = mainLayer.dataProvider()
mainLayer_names = [field.name() for field in mainLayer.fields()]

listOverlaysLayers = [project.mapLayersByName(layerName)[0] for layerName in listOverlays]

# add the necessary fields to mainLayer
for name in listOverlays:
    if name not in mainLayer_names :
        provider.addAttributes([QgsField(name, QVariant.String)])
mainLayer.updateFields()

#main Layer feats list After Adding the new fields
mainLayer_features = [ftr for ftr in mainLayer.getFeatures()]

mainLayer.startEditing()

for feature in mainLayer_features :
    print('Feature '+ str(feature.id()))
    for layer in listOverlaysLayers :
        
        print('......Layer '+ str(layer.name()))      
        layer_temp = QgsVectorLayer(f"{'Polygon'}?crs={str(mainLayer.crs().authid()).lower()}", "temp", "memory")
        pr = layer_temp.dataProvider()
        pr.addFeatures([feature])
           
        param = { 'INPUT' : layer, 'INTERSECT' : layer_temp, 'METHOD' : 0, 'PREDICATE' : [0] }
        processing.run('native:selectbylocation', param)
        
        if layer.selectedFeatureCount() == 0 :
            feature[str(layer.name())]='no'
            mainLayer.updateFeature(feature)
        else :
            feature[str(layer.name())]='yes'
            mainLayer.updateFeature(feature)
            
        del layer_temp
        del pr
        del param

mainLayer.commitChanges()    

end_time = time.time()
duration = end_time - start_time

print(f"\nDuration: {duration} seconds")

You should log the processing log from your algorithm "selectbylocation", using a QgsProcessingFeedback. But mainly, what do you have when you run this algorithm by hand ? You might indeed have some logs saying you don't have index ? — etrimaille, Commented Mar 22, 2023 at 7:54
the layers are not spatial indexed geojson files. I tried to convert them to .gpkg an index them, but then the code doesn't run at all. — MrSalamikuchen, Commented Mar 22, 2023 at 8:34
You might not believe me but it only took 20 seconds. I'm preparing the answer. — Kadir Şahbaz, Commented Mar 22, 2023 at 23:02

Taras · Accepted Answer · 2023-03-23 20:56:49Z

Here are some issues with your script:

I've already learned that using GeoJSON for analyzing is a too bad idea. The script below that takes 20 seconds when I use GeoPackage takes about 8-10 hours when I use GeoJSON. (GeoJSON -> 8-10 hr, GeoPackage -> 20 sec)
Remove the print statement. Well, it may not slow down a 60-minute process, but it does slow down a 20-second process. 60 minutes may increase to 61 minutes, but 20 seconds can increase to 80 seconds. (60 min -> 61 min, 20 sec -> 80 sec)
Making a QgsVectorLayer for each feature and using it in a processing tool is another bad idea. I didn't test it but, based on my experience, if you have a chance to use pure PyQGIS, don't use a processing tool. This may not always be True.
Don't make the selection in a loop if you can list the features in a Pythonic way.
In the final script below, using the edit mode (.startEditing) didn't reduce the speed noticeably, but in most cases using the dataProvider functions instead of edit mode is much faster.
In a situation where thousands of features are involved, the most important part of the process is to exclude features that we do not need. In other words, we must get rid of thousands of features that are not near the base feature. QgsFeatureRequest().setFilterRect provides us with this opportunity. It reduces processing time very dramatically. Guess what! The processing time for this data was reduced from 90 minutes to 20 seconds.
Another slowing part is to check the intersection of the feature in the baselayer with every feature in the other layers. Your goal is to check if any feature in an overlayer intersects the base feature. So if you catch an intersection with a base feature, you don't need to check for intersections with other features in the current overlayer. So you must break the inner for loop if you catch an intersection. This step doesn't have much effect when using step 6. Otherwise, this step will speed up the process considerably. (110 min -> 80 min when step 6 is ignored)

My computer is 10 years old. Processing times given above are approximate, except for 20 seconds.

The script:

import time

start_time = time.time()

get_layer = QgsProject().instance().mapLayersByName
overlay_names = ['Overlay1', 'Overlay2', 'Overlay3', 'Overlay4']
overlayers = [get_layer(name)[0] for name in overlay_names]

main_layer = get_layer('Baselayer')[0]

for name in overlay_names:
    if name not in main_layer.fields().names():
        main_layer.dataProvider().addAttributes([QgsField(name, QVariant.String)])
main_layer.updateFields()

field_indices = [main_layer.fields().indexFromName(i) for i in overlay_names]

attr_map = {}
request = QgsFeatureRequest().setSubsetOfAttributes(overlay_names, main_layer.fields())

for feature in main_layer.getFeatures(request):
    
    FID = feature.id()
    attr_map[FID] = {}
    geom = feature.geometry()
    for field_index, overlayer in zip(field_indices, overlayers):
        intersects = False
        
        # This line is CRITICAL
        request = QgsFeatureRequest().setFilterRect(geom.boundingBox())
                    
        for overlay_feature in overlayer.getFeatures(request):
            if geom.intersects(overlay_feature.geometry()):
                intersects = True
                break
        
        if intersects:
            attr_map[FID].update({field_index: 'yes'})
        else:
            attr_map[FID].update({field_index: 'no'})
            
main_layer.dataProvider().changeAttributeValues(attr_map)

end_time = time.time()
duration = end_time - start_time
print(f"\nDuration: {duration} seconds")

Sample results:

Thank you very very much dear Kadir for this detailed answer. I will check it as soon as I get back to my desk tomorrow. — MrSalamikuchen, Commented Mar 23, 2023 at 20:26
I hope you can understand what I mean. It's hard for me to explain subtle matters in English. — Kadir Şahbaz, Commented Mar 23, 2023 at 20:41
No worrys! Every point made absolute sense and with your help I was able to execute the code in 26 seconds! That's awesome. — MrSalamikuchen, Commented Mar 24, 2023 at 7:28
@MrSalamikuchen I was preparing an answer for the question you deleted. Did you find a solution? If you don't, I will add an answer if you undelete the question. — Kadir Şahbaz, Commented Aug 21, 2023 at 14:42

Stack Exchange Network

Explaining long computing time of Python code

1 Answer 1

Not the answer you're looking for? Browse other questions tagged
pyqgis
performance
or ask your own question.

Hot Network Questions

Explaining long computing time of Python code

1 Answer 1

Not the answer you're looking for? Browse other questions tagged pyqgisperformance or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
pyqgis
performance
or ask your own question.