7

I have a shapefile attribute table in QGIS that I would like to calculate the cumulative sum. The thing is that it has over 1.5 million features.

Type of device: HDD.

The attribute table looks like this:

enter image description here

I have tried to calculate in a new field the cumulated sum for "Equitativo" like this:

 with_variable(
    'ID_FINAL',
    ID_FINAL,
    sum(Equitativo, filter:=ID_FINAL<=@ID_FINAL)
    )

The problem is that is taking way too long (it's running for 17 hours now). How could I make this faster way?

6
  • 1
    Full table scans take time. Please Edit the Question to specify the storage format of the data, the type of device on which it is stored, and the actual runtime in seconds.
    – Vince
    Commented Jan 16, 2022 at 12:20
  • I'm not sure what "type of device" means. Commented Jan 16, 2022 at 12:26
  • HDD, SSD, USB thumb drive,...
    – Vince
    Commented Jan 16, 2022 at 12:36
  • 1
    If you need to perform many calculations like this you can use geopandas. The calculation is performed in less than a second. But it takes about 30 s to read the shapefile and the same to write
    – Bera
    Commented Jan 16, 2022 at 14:30
  • 1
    @BERA (geo)pandas is really so fast. but it took 76 seconds to read, 0.02 second to sum (insanely fast), 127 seconds to write to file in my computer. Commented Jan 16, 2022 at 14:42

2 Answers 2

15

You can use the script below in QGIS Python Editor. It takes about 45 seconds (including saving to the source file) for 1.5 million features in SSD or HDD. Select the layer first.

import time
from qgis.gui import QgsRubbersBand
from qgis.gui import iface
from qgis.PyQt.QtCore import Qvariant

start time = time.time()

layer = iface.activeLayer()

# Add CUM_SUM field if not exist
field_index = layer.fields().indexFromName("CUM_SUM")
if field_index == -1:
    field = QgsField('CUM_SUM', QVariant.Double, len=20, prec=3)
    layer.dataProvider().addAttributes([field])
    layer.updateFields()

# Get field indices
id_index = layer.fields().indexFromName("ID_FINAL")
eq_index = layer.fields().indexFromName("Equitativo")
cum_sum_index = layer.fields().indexFromName("CUM_SUM")

cum_sum = 0
attribute_map = {}
request = QgsFeatureRequest().addOrderBy('ID_FINAL').setFlags(QgsFeatureRequest.NoGeometry).setSubsetOfAttributes(['ID_FINAL', 'Equitativo'], layer.fields())
for feature in layer.getFeatures(request):
    cum_sum += feature["Equitativo"]
    attribute_map.update({feature.id(): {cum_sum_index: cum_sum}})

layer.dataProvider().changeAttributeValues(attribute_map)

print("%s seconds" % (time.time() - start_time))

enter image description here

5
  • Can you explain why this Python skript is so much faster?
    – Babel
    Commented Jan 16, 2022 at 15:18
  • 2
    @Babel It is hard to me to explain such a thing in English. Nyall Dawson explains "How to speed up PyQGIS scripts". It can give you an idea why an expression is sometimes so slow. For example; using get_feature_by_id is one of the reasons. Behind the scene, QGIS needs to do a bunch of things to give you the matching feature. Commented Jan 16, 2022 at 16:34
  • @Babel. Nyall says: Well, remember that every time you call getFeatures() QGIS needs to do a whole bunch of things before it can start giving you the matching features. In this case, the code is calling getFeatures() once for every feature ID in the list. So if the list had 100 features, that means QGIS is having to create a connection to the data source, set up and prepare a query to match a single feature, wait for the provider to process that, and then finally parse the single feature result. That’s a lot of wasted processing! Commented Jan 16, 2022 at 16:37
  • An expression may include that kind of situations. @Babel Commented Jan 16, 2022 at 16:39
  • That makes sense, thanks a lot for the explanation.
    – Babel
    Commented Jan 16, 2022 at 16:42
6

Your expression is not efficient because it calculates the same sum over and over again: for feature no. 100, it is adding 100 numbers (feature1, feature2, ... feature100), even though adding 99% of these values (from feature1 to feature99) was done before.

Better rely on the already calculated sum of the last feature: for feature 100, something like (pseudocode): sum (feature100) = sum (feature99) + value (feature100)

So try this expression:

attribute(get_feature_by_id(@layer, $id), 'Equitativo')
+
attribute(get_feature_by_id(@layer, $id-1), 'Equitativo')

Will definitely take more time than the Python script by @Kadir Şahbaz, so you might want to stick to this. This solution here is just to show how to do it using QGIS expressions only.

2
  • 2
    +1. A small issue here: We don't know if the table is ordered by ID_FINAL. Its order may not be matched with $id order. Is that possible to order by ID_FINAL using expression before running it? Commented Jan 16, 2022 at 14:22
  • Indeed, this works if the order for cumulative sum is the same than $id. If not, this would complicate things and I would definitely not stick to using expressions for larger datasets as your Python skript is much more efficient.
    – Babel
    Commented Jan 16, 2022 at 14:49

Not the answer you're looking for? Browse other questions tagged or ask your own question.