
I want to update tables (maybe 20-30), each having tens of millions of records.

The problem is that the update process takes too much time, and while it runs the CPU usage also goes very high. I want to do it in such a way that it does not use much CPU while processing the data. If the processing time is increased, that is not a problem for me, but it should use limited CPU resources while updating the tables. I am using PostgreSQL as the database, and the server operating system is Linux.

A sample query of mine looks like this:

UPDATE TEMP 
SET CUSTOMERNAME = 
  ( select customername from user where user.customerid = temp.customerid );
  • You really need to include the code for the update. It could be CPU intensive because you are doing some complicated function on every row.
    – JNK
    Commented May 9, 2012 at 17:48
  • I am just executing the update query in the table. For example UPDATE TEMP SET CUSTOMERNAME = ( select customername from user where user.customerid = temp.customerid ) ; Commented May 9, 2012 at 17:49
  • Put that into your question, please. I'm not sure how PG handles subqueries but that could be an issue since it may get executed once per row. A non-specific answer here will be too general, so you need to give us a scenario to solve.
    – JNK
    Commented May 9, 2012 at 17:53
  • Is there an index on user.customerid and on temp.customerid? Adding an index if there isn't one might help. Commented May 9, 2012 at 18:14
  • Where should I place the index: on the temp table or the user table? Commented May 9, 2012 at 18:15

4 Answers


The first question is: Why is it important that you not use a lot of CPU time? The query will be bottlenecked on some resource; if you could introduce enough additional disk access, the CPU time used per second would go down, but would that really be an improvement? What resource would you prefer to saturate? Understanding why you have emphasized this might help guide people to providing an answer that you will find useful.

As suggested in a comment, your query may run faster with a join rather than a correlated subquery. Something like this:

UPDATE temp
  SET customername = user.customername
  FROM user
  WHERE user.customerid = temp.customerid;

Another important thing to know is whether you want to update all rows in the table. Are some of the values already correct? If so, you will get a big performance boost by not updating the rows that don't need it. Add AND temp.customername is distinct from user.customername to the WHERE clause.
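Putting the join rewrite and that filter together, the full statement might look like the following sketch (table and column names follow the question; note that user is a reserved word in PostgreSQL, so a real table by that name would have to be quoted):

```sql
UPDATE temp
   SET customername = u.customername
  FROM "user" u
 WHERE u.customerid = temp.customerid
   -- skip rows that already hold the correct value (also treats NULLs sanely)
   AND temp.customername IS DISTINCT FROM u.customername;
```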

If you limit the number of rows updated in each statement, and VACUUM ANALYZE after each UPDATE, you will avoid table bloat. If the point of the desire to minimize CPU time is to avoid a performance impact on concurrent transactions, this would give you the opportunity to introduce a short delay (in the form of a sleep or something similar) before you start the next UPDATE of a set of rows.

Even better, why are you redundantly storing the information in the temp table rather than joining to it when needed? (Sometimes there is a good reason; quite often there isn't.)

  • Because it is the main server and a live machine; if the CPU usage gets high, its performance will be degraded and it will not be able to do the other tasks it is meant for. Commented May 9, 2012 at 18:25
  • Great answer with many choices. And here I was just going to ask him to EXPLAIN between his query and the query which I'd rewritten (and looks like yours) :)
    – swasheck
    Commented May 9, 2012 at 18:25
  • @BhavikAmbani there is still the unanswered question of why you're storing this data in a table called TEMP. Are you preparing to introduce change into the environment and this is your DR strategy? Do you have dev/test/stage environments in which this could be attempted first?
    – swasheck
    Commented May 9, 2012 at 18:28
  • I have just named it temp here; the actual name is different. Commented May 9, 2012 at 18:33
  • @BhavikAmbani: If this is not a temporary table, how do you intend to ensure that the names stay current in the other tables as name changes occur in the source table? If you normalize to 3rd normal form, you won't have problems with that. Don't believe FUD about joins being slow -- normalize first, and look to optimize when you hit an actual problem. There will usually be a better solution than denormalizing. Premature optimization causes a great many problems.
    – kgrittn
    Commented May 9, 2012 at 18:51

If you follow the very good advice of kgrittn and still have performance issues, you may need to perform the update in batches. You would still perform set-based updates, but limit each one to the first 1,000 records that don't match (or whatever number works for you; I've used anywhere from 500 to 50,000), and then keep looping until all are done.
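One way to sketch such a batch (a hypothetical example using the question's table names; the batch size of 1000 is illustrative, and since VACUUM cannot run inside a function, the loop is driven from a client script):

```sql
-- Repeat this pair of statements from a client script until the UPDATE
-- reports 0 rows affected.
UPDATE temp
   SET customername = u.customername
  FROM "user" u
 WHERE u.customerid = temp.customerid
   AND temp.customerid IN (
         -- pick the next batch of rows that still need fixing
         SELECT t.customerid
           FROM temp t
           JOIN "user" u2 ON u2.customerid = t.customerid
          WHERE t.customername IS DISTINCT FROM u2.customername
          LIMIT 1000);

-- reclaim dead row versions between batches to avoid table bloat
VACUUM ANALYZE temp;
```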


If there is an index on TEMP.CUSTOMERNAME and you are updating a significant part of the TEMP table, then drop that index before the update and rebuild it afterwards.
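A minimal sketch of that, assuming the index is named temp_customername_idx (a hypothetical name; check pg_indexes for the real one):

```sql
DROP INDEX temp_customername_idx;   -- hypothetical index name
-- ... run the large UPDATE here ...
CREATE INDEX temp_customername_idx ON temp (customername);
```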

  • I do not have an index on the customer name. Commented May 10, 2012 at 4:40

PostgreSQL has no built-in way to limit the amount of CPU time a process can use. On Linux you can use OS features such as the renice command to do that. See Priorities for more information and some examples.
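For example, one approach is to run the client driving the updates under a lower CPU priority, so the kernel scheduler favors other work. The psql command and file name below are hypothetical placeholders for your own update script; the last line is a harmless stand-in so the sketch runs as-is:

```shell
# Run the batch-update client at the lowest CPU priority (19).
# Hypothetical invocation for your own script:
#   nice -n 19 psql -d mydb -f batch_update.sql
#
# You can also renice an already-running backend found via pg_stat_activity:
#   renice +19 -p <backend_pid>
#
# Demonstration with a placeholder command in place of psql:
nice -n 19 sh -c 'echo "batch finished"'
```

Note that renice affects only CPU scheduling; if the update is I/O-bound, ionice may matter more.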

  • For that, will I have to create another user and then assign memory to him? Commented May 10, 2012 at 7:06
