
I want to update tables (maybe 20-30), each having tens of millions of records.

The problem is that the update process takes too much time, and while it runs the CPU usage also goes very high. I want to do it in such a way that it does not use much CPU while processing the data. If the processing time is increased, that is not a problem for me, but it should use limited CPU resources while updating the tables. I am using PostgreSQL as the database, and the server operating system is Linux.

A sample query of mine looks like this:

UPDATE TEMP 
SET CUSTOMERNAME = 
  ( select customername from user where user.customerid = temp.customerid );
  • You really need to include the code for the update. It could be CPU intensive because you are doing some complicated function on every row.
    – JNK
    Commented May 9, 2012 at 17:48
  • I am just executing the update query in the table. For example UPDATE TEMP SET CUSTOMERNAME = ( select customername from user where user.customerid = temp.customerid ) ; Commented May 9, 2012 at 17:49
  • Put that into your question, please. I'm not sure how PG handles subqueries but that could be an issue since it may get executed once per row. A non-specific answer here will be too general, so you need to give us a scenario to solve.
    – JNK
    Commented May 9, 2012 at 17:53
  • Is there an index on user.customerid and on temp.customerid? Adding an index if there isn't one might help. Commented May 9, 2012 at 18:14
  • Where should I place the index: on the temp table or the user table? Commented May 9, 2012 at 18:15

4 Answers


The first question is: Why is it important that you not use a lot of CPU time? The query will be bottlenecked on some resource; if you could introduce enough additional disk access, the CPU time used per second would go down, but would that really be an improvement? What resource would you prefer to saturate? Understanding why you have emphasized this might help guide people to providing an answer that you will find useful.

As suggested in a comment, your query may run faster with a join rather than a correlated subquery. Something like this:

UPDATE temp
  SET customername = user.customername
  FROM user
  WHERE user.customerid = temp.customerid;

Another important thing to know is whether you want to update all rows in the table. Are some of the values already correct? If so, you will get a big performance boost by not updating the rows that don't need it. Add AND temp.customername is distinct from user.customername to the WHERE clause.
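Putting the join rewrite and that filter together, the full statement might look like the following sketch (table and column names follow the question; note that user is a reserved word in PostgreSQL, so a real table by that name would have to be quoted):

```sql
UPDATE temp
   SET customername = u.customername
  FROM "user" u
 WHERE u.customerid = temp.customerid
   -- skip rows that already hold the correct value (also treats NULLs sanely)
   AND temp.customername IS DISTINCT FROM u.customername;
```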

If you limit the number of rows updated in each statement, and VACUUM ANALYZE after each UPDATE, you will avoid table bloat. If the point of the desire to minimize CPU time is to avoid a performance impact on concurrent transactions, this would give you the opportunity to introduce a short delay (in the form of a sleep or something similar) before you start the next UPDATE of a set of rows.

Even better, why are you redundantly storing the information in the temp table rather than joining to it when needed? (Sometimes there is a good reason; quite often there isn't.)

  • Because it is the main server and a live machine; if the CPU usage gets high, its performance will be degraded and it will not be able to do the other tasks it is meant for. Commented May 9, 2012 at 18:25
  • Great answer with many choices. And here I was just going to ask him to EXPLAIN between his query and the query which I'd rewritten (and looks like yours) :)
    – swasheck
    Commented May 9, 2012 at 18:25
  • @BhavikAmbani there is still the unanswered question of why you're storing this data in a table called TEMP. Are you preparing to introduce change into the environment and this is your DR strategy? Do you have dev/test/stage environments in which this could be attempted first?
    – swasheck
    Commented May 9, 2012 at 18:28
  • I have just named it temp here; the actual name is different. Commented May 9, 2012 at 18:33
  • @BhavikAmbani: If this is not a temporary table, how do you intend to ensure that the names stay current in the other tables as name changes occur in the source table? If you normalize to 3rd normal form, you won't have problems with that. Don't believe FUD about joins being slow -- normalize first, and look to optimize when you hit an actual problem. There will usually be a better solution than denormalizing. Premature optimization causes a great many problems.
    – kgrittn
    Commented May 9, 2012 at 18:51

If you follow the very good advice of kgrittn and still have performance issues, you may need to perform the update in batches. You would still perform set-based updates, but limit each one to the first 1,000 records that don't match (or whatever number works for you; I've used anywhere from 500 to 50,000), and then keep looping until all are done.
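One way to sketch such a batch (a hypothetical example using the question's table names; the batch size of 1000 is illustrative, and since VACUUM cannot run inside a function, the loop is driven from a client script):

```sql
-- Repeat this pair of statements from a client script until the UPDATE
-- reports 0 rows affected.
UPDATE temp
   SET customername = u.customername
  FROM "user" u
 WHERE u.customerid = temp.customerid
   AND temp.customerid IN (
         -- pick the next batch of rows that still need fixing
         SELECT t.customerid
           FROM temp t
           JOIN "user" u2 ON u2.customerid = t.customerid
          WHERE t.customername IS DISTINCT FROM u2.customername
          LIMIT 1000);

-- reclaim dead row versions between batches to avoid table bloat
VACUUM ANALYZE temp;
```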


If there is an index on TEMP.CUSTOMERNAME and you are updating a significant part of the TEMP table, then drop that index before the update and rebuild it afterwards.
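A minimal sketch of that, assuming the index is named temp_customername_idx (a hypothetical name; check pg_indexes for the real one):

```sql
DROP INDEX temp_customername_idx;   -- hypothetical index name
-- ... run the large UPDATE here ...
CREATE INDEX temp_customername_idx ON temp (customername);
```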

  • I do not have an index on the customer name. Commented May 10, 2012 at 4:40

PostgreSQL has no built-in way to limit the amount of CPU time a process can use. On Linux you can use OS features such as the renice command to do that. See Priorities for more information and some examples.
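For example, one approach is to run the client driving the updates under a lower CPU priority, so the kernel scheduler favors other work. The psql command and file name below are hypothetical placeholders for your own update script; the last line is a harmless stand-in so the sketch runs as-is:

```shell
# Run the batch-update client at the lowest CPU priority (19).
# Hypothetical invocation for your own script:
#   nice -n 19 psql -d mydb -f batch_update.sql
#
# You can also renice an already-running backend found via pg_stat_activity:
#   renice +19 -p <backend_pid>
#
# Demonstration with a placeholder command in place of psql:
nice -n 19 sh -c 'echo "batch finished"'
```

Note that renice affects only CPU scheduling; if the update is I/O-bound, ionice may matter more.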

  • For that, will I have to create another user and then assign memory to him? Commented May 10, 2012 at 7:06
