15

I have two tables. Let's call them KEY and VALUE.
KEY is small, somewhere around 1,000,000 records.
VALUE is huge, say 1,000,000,000 records.

Between them there is a relationship such that each KEY row may have many VALUE rows. There is no foreign key constraint, but it carries the same meaning.

The DDL looks like this:

create table KEY (
 key_id int,
 primary key (key_id)
);

create table VALUE (
 key_id int,
 value_id int,
 primary key (key_id, value_id)
);

Now, my problem. About half of all key_ids in VALUE have already been deleted from KEY, and I need to delete the orphaned VALUE rows in an orderly fashion while both tables are still under high load.

It would be easy to do:

delete v 
  from VALUE v
  left join KEY k using (key_id)
 where k.key_id is null;

However, since LIMIT is not allowed on a multi-table DELETE, I don't like this approach. Such a delete would take hours to run, and that makes it impossible to throttle the deletes.

Another approach is to create a cursor that finds all missing key_ids and deletes them one by one with a LIMIT. That seems very slow and kind of backwards.
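For completeness, a rough sketch of what that cursor approach could look like as a MySQL stored procedure (the procedure name and the chunk size of 1000 are just placeholders):

DELIMITER //
CREATE PROCEDURE purge_orphan_values()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE k INT;
  -- cursor over the distinct key_ids that no longer exist in KEY
  DECLARE cur CURSOR FOR
    SELECT DISTINCT v.key_id
      FROM `VALUE` v
      LEFT JOIN `KEY` x ON x.key_id = v.key_id
     WHERE x.key_id IS NULL;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO k;
    IF done THEN LEAVE read_loop; END IF;
    -- delete one key's values in small chunks so each statement stays short
    REPEAT
      DELETE FROM `VALUE` WHERE key_id = k LIMIT 1000;
    UNTIL ROW_COUNT() = 0 END REPEAT;
  END LOOP;
  CLOSE cur;
END//
DELIMITER ;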

Are there any other options? Some nice tricks that could help?

3
  • 1
    Sometimes WHERE NOT EXISTS is faster than LEFT JOIN [...] IS NULL, but not sure in this case (stackoverflow.com/questions/6777910/…). Hope it will help!
    – rap-2-h
    Commented Oct 24, 2013 at 10:21
  • you mean that you have already deleted the keys, and now want to delete orphan records in VALUE, don't you?
    – newtover
    Commented Oct 24, 2013 at 16:06
  • Right, at least that would be the exact same problem :) Commented Oct 24, 2013 at 16:55

12 Answers

25

Any solution that tries to delete so much data in one transaction is going to overwhelm the rollback segment and cause a lot of performance problems.

A good tool to help is pt-archiver. It performs incremental operations on moderate-sized batches of rows, as efficiently as possible. pt-archiver can copy, move, or delete rows depending on options.

The documentation includes an example of deleting orphaned rows, which is exactly your scenario:

pt-archiver --source h=host,D=db,t=VALUE --purge \
  --where 'NOT EXISTS(SELECT * FROM `KEY` WHERE key_id=`VALUE`.key_id)' \
  --limit 1000 --commit-each

Executing this will take significantly longer to delete the data, but it won't consume too many resources and it won't interrupt service on your existing database. I have used it successfully to purge hundreds of millions of rows of outdated data.
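If you need to throttle it even more, pt-archiver has options for pausing between batches and for rehearsing the run first, for example (the values here are only a starting point; check the pt-archiver documentation for your version):

pt-archiver --source h=host,D=db,t=VALUE --purge \
  --where 'NOT EXISTS(SELECT * FROM `KEY` WHERE key_id=`VALUE`.key_id)' \
  --limit 1000 --commit-each --sleep 1 --progress 10000 --dry-run

Drop --dry-run once you are happy with the statements it prints.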

pt-archiver is part of the Percona Toolkit for MySQL, a free (GPL) set of scripts that help with common tasks on MySQL and compatible databases.

8

Directly from the MySQL documentation:

If you are deleting many rows from a large table, you may exceed the lock table size for an InnoDB table. To avoid this problem, or simply to minimize the time that the table remains locked, the following strategy (which does not use DELETE at all) might be helpful:

Select the rows not to be deleted into an empty table that has the same structure as the original table:

INSERT INTO t_copy SELECT * FROM t WHERE ... ;

Use RENAME TABLE to atomically move the original table out of the way and rename the copy to the original name:

RENAME TABLE t TO t_old, t_copy TO t;

Drop the original table:

DROP TABLE t_old;

No other sessions can access the tables involved while RENAME TABLE executes, so the rename operation is not subject to concurrency problems. See Section 12.1.9, “RENAME TABLE Syntax”.

So in your case you may do:

INSERT INTO value_copy SELECT * FROM VALUE WHERE key_id IN
    (SELECT key_id FROM `KEY`);

RENAME TABLE value TO value_old, value_copy TO value;

DROP TABLE value_old;

And according to what they wrote here, the RENAME operation is quick and the number of records doesn't affect it.

5
  • 2
    The problem is that it will take a while to insert half a billion records into a separate table and commit the transaction. You will also have a problem keeping the copy table in sync with the original one if the latter is under high load and being updated.
    – newtover
    Commented Oct 25, 2013 at 9:17
  • I just did a little test to make sure. MySQL will copy all INSERTs and perform all UPDATEs that are made during the execution of this INSERT ... SELECT query.
    – Gustek
    Commented Oct 25, 2013 at 12:13
  • All the rows that should stay in the table, i.e. not be deleted, are constantly being updated, and this method would force me to block writes until it's done, as the transaction would hold a read lock (blocking writes) for a really long time. Commented Oct 29, 2013 at 5:42
  • It would not: at the time the first query finishes, you will have exactly the same data in both tables (except for the rows you want to delete from the new one). All updates/inserts on the first table will be taken into consideration when creating the new table, and the read lock doesn't prevent you from writing. So the only time you may lose data integrity is the time between executing the first and second query, but if you make it a procedure to minimize this time it should be OK.
    – Gustek
    Commented Oct 29, 2013 at 18:50
  • If more rows are inserted after the first INSERT of this set of statements, these new rows will never be preserved even if they meet the WHERE clause. The whole table must be locked before the insert into the copy.
    – Sam
    Commented Jun 22, 2018 at 10:41
6
+100

What about this for having a limit?

delete x 
  from `VALUE` x
  join (select key_id, value_id
          from `VALUE` v
          left join `KEY` k using (key_id)
         where k.key_id is null
         limit 1000) y
    on x.key_id = y.key_id AND x.value_id = y.value_id;
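To keep this throttled, you can repeat the statement until it affects no rows, sleeping between rounds. A rough sketch as a stored procedure (the name, batch size and sleep time are placeholders; with autocommit on, each DELETE commits on its own):

DELIMITER //
CREATE PROCEDURE purge_orphans_in_batches()
BEGIN
  DECLARE affected INT DEFAULT 1;
  REPEAT
    DELETE x
      FROM `VALUE` x
      JOIN (SELECT key_id, value_id
              FROM `VALUE` v
              LEFT JOIN `KEY` k USING (key_id)
             WHERE k.key_id IS NULL
             LIMIT 1000) y
        ON x.key_id = y.key_id AND x.value_id = y.value_id;
    SET affected = ROW_COUNT();
    DO SLEEP(1);  -- throttle between batches
  UNTIL affected = 0 END REPEAT;
END//
DELIMITER ;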
5
  • This seems to work. It's slower than what I was hoping for but it's very simple and from my measurements it might be fast enough. Commented Oct 25, 2013 at 18:34
  • 4
    This will hurt performance because 1) joining two large tables can be slow; 2) the nested SELECT has to scan more than 1000 x N rows in order to find the "first" 1000 rows to delete; 3) the last 999 rows will be the slowest, as it runs two full index scans without an early exit; 4) the deleted rows in VALUE could be positioned very randomly in the table, so the IO will likely not be sequential
    – Dennis C
    Commented Oct 29, 2013 at 5:45
  • You are correct, it still seems to be faster than doing a cursor over all rows as data doesn't have to be retrieved. Also, KEY isn't large, just one million rows or so. Commented Oct 29, 2013 at 12:27
  • Have you tried comparing the performance of this join with that of creating a temporary table? In my case, it was much faster without being overly complicated. Commented Oct 30, 2013 at 13:22
  • Deleting data in chunks reduced the overall time; in addition, changing a few configuration parameters also increased performance (rathishkumar.in/2017/12/…), but some precautions are required when implementing this on a production server. Commented Dec 6, 2017 at 8:06
2

First, examine your data. Find the keys which have too many values to be deleted "fast" (for example with a query like the one sketched below). Then find out which times during the day you have the smallest load on the system. Perform the deletion of the "bad" keys during that time. For the rest, start deleting them one by one with some downtime between deletes so that you don't put too much pressure on the database while you do it.
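A query along these lines can show which orphaned keys carry the most rows (a sketch; the threshold of 100000 is arbitrary):

SELECT v.key_id, COUNT(*) AS cnt
  FROM `VALUE` v
  LEFT JOIN `KEY` k USING (key_id)
 WHERE k.key_id IS NULL
 GROUP BY v.key_id
HAVING cnt > 100000
 ORDER BY cnt DESC;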

1

Maybe, instead of a limit, divide the whole set of rows into small parts by key_id:

delete v 
  from VALUE v
  left join KEY k using (key_id)
 where k.key_id is null and v.key_id > 0 and v.key_id < 100000;

Then delete rows with key_id in 100000..200000, and so on.

2
  • 2
    There is nothing saying that key_id 100001 won't have 1 million value_ids associated with it and that's too much for one delete. Commented Oct 18, 2013 at 12:07
  • This answer does the best you can hope for on MySQL: minimal index lookups, table scans and disk IO. If your table is under heavy load, you break the work into smaller transactions to prevent slave lag and lock blocking.
    – Dennis C
    Commented Oct 29, 2013 at 5:40
1

You can try to delete in separate transaction batches. This is for MSSQL, but it should be similar in MySQL (a translation is sketched after the code).

declare @i INT
declare @step INT
set @i = 0
set @step = 100000

while (@i< (select max(VALUE.key_id) from VALUE))
BEGIN
  BEGIN TRANSACTION
  delete from VALUE where
    VALUE.key_id between @i and @i+@step and
    not exists(select 1 from KEY where KEY.key_id = VALUE.key_id and KEY.key_id between @i and @i+@step)

  set @i = (@i+@step)
  COMMIT TRANSACTION
END
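A rough MySQL translation as a stored procedure might look like this (an untested sketch; the procedure name and the step size are placeholders):

DELIMITER //
CREATE PROCEDURE delete_orphans_by_range()
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE step INT DEFAULT 100000;
  DECLARE max_key INT;

  SELECT MAX(key_id) INTO max_key FROM `VALUE`;

  WHILE i < max_key DO
    -- with autocommit enabled, each DELETE runs in its own transaction
    DELETE FROM `VALUE`
     WHERE key_id BETWEEN i AND i + step
       AND NOT EXISTS (SELECT 1 FROM `KEY` k WHERE k.key_id = `VALUE`.key_id);
    SET i = i + step;
  END WHILE;
END//
DELIMITER ;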
1

Create a temporary table!

drop table if exists batch_to_delete;
create temporary table batch_to_delete as
select v.* from `VALUE` v
left join `KEY` k on k.key_id = v.key_id
where k.key_id is null
limit 10000; -- tailor batch size to your taste

-- optional but may help for large batch size
create index batch_to_delete_ix_key on batch_to_delete(key_id); 
create index batch_to_delete_ix_value on batch_to_delete(value_id);

-- do the actual delete
delete v from `VALUE` v
join batch_to_delete d on d.key_id = v.key_id and d.value_id = v.value_id;
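To clear everything, repeat those three steps (rebuild the temporary table, optional indexes, delete) until the batch comes back empty; one simple check (a sketch) is:

SELECT COUNT(*) AS batch_size FROM batch_to_delete;  -- stop looping, or sleep and retry, once this is 0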
1

To me this is the kind of task whose progress I would want to see in a log file, and I would avoid solving it in pure SQL; I would use some scripting in Python or another similar language. Another thing that bothers me is that lots of LEFT JOINs with WHERE ... IS NULL between the tables might cause unwanted locks, so I would avoid JOINs as well.

Here is some pseudo code:

max_key = select_db('SELECT MAX(key_id) FROM VALUE')
while max_key > 0:
    cur_range = range(max_key, max_key - 100, -1)
    good_keys = select_db('SELECT key_id FROM KEY WHERE key_id IN (%s)' % cur_range)
    keys_to_del = set(cur_range) - set(good_keys)
    if keys_to_del:
        while 1:
            deleted_count = update_db('DELETE FROM VALUE WHERE key_id IN (%s) LIMIT 1000' % keys_to_del)
            db_commit
            log_something
            if not deleted_count:
                break
    max_key -= 100

This should not bother the rest of the system very much, but it may take a long time. Another issue is optimizing the table after you have deleted all those rows, but that is another story.

1

If the target columns are properly indexed, this should go fast:

DELETE FROM `VALUE`
WHERE NOT EXISTS(SELECT 1 FROM `key` k WHERE k.key_id = `VALUE`.key_id)
-- ORDER BY key_id, value_id -- order by PK is good idea, but check the performance first.
LIMIT 1000

Adjust the LIMIT (somewhere between 10 and 10000) to get acceptable performance, and rerun it until no more rows are deleted.

Also keep in mind that mass deletes like this take locks and keep rollback information for each row, which can multiply the execution time per row several times.

There are some advanced methods to prevent this, but the easiest workaround is just to put a transaction around this query.

0

Do you have a SLAVE or a Dev/Test environment with the same data?

The first step is to find out your data distribution, if you are worried about a particular key having 1 million value_ids:

SELECT v.key_id, COUNT(IFNULL(k.key_id,1)) AS cnt 
FROM `value` v  LEFT JOIN `key` k USING (key_id) 
WHERE k.key_id IS NULL 
GROUP BY v.key_id ;

The EXPLAIN output for the above query is much better than when adding

ORDER BY COUNT(IFNULL(k.key_id,1)) DESC ;

Since you don't have partitioning on key_id (too many partitions in your case) and want to keep the database running during your delete process, the option is to delete in chunks, with SLEEP() between the deletes for different key_ids to avoid overwhelming the server (a sketch follows below). Don't forget to keep an eye on your binary logs to avoid filling the disk.
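The per-key chunked delete with a pause could look roughly like this, repeated for each orphaned key_id (12345 and the sizes are placeholders):

DELETE FROM `VALUE` WHERE key_id = 12345 LIMIT 10000;  -- one chunk for one orphaned key
DO SLEEP(1);                                           -- pause between chunks to keep load down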

The quickest way is:

  1. Stop application so data is not changed.
  2. Dump key_id and value_id from VALUE table with only matching key_id in KEY table by using

    mysqldump YOUR_DATABASE_NAME value --where="key_id in (select key_id from YOUR_DATABASE_NAME.key)" --lock-all-tables --opt --quick --quote-names --skip-extended-insert > VALUE_DATA.txt

  3. Truncate VALUE table

  4. Load data exported in step 2
  5. Start Application

As always, try this in a Dev/Test environment with Prod data and the same infrastructure, so you can calculate the downtime.

Hope this helps.

5
  • Perhaps I'm not reading this right, but between steps 1 and 5 it seems that the database is practically down, with no applications having access to its data. I can't do that; both the entire database and these two tables need to be up and running during the entire process. Commented Oct 23, 2013 at 7:34
  • Yes. The other options are to partition the VALUE table based on key_id, BUT that will be too many partitions in your case, OR to loop through deleting rows based on key_id and put SLEEP() in your loop, as explained. Did you find out how many key_ids you have to delete from the VALUE table?
    – Parag
    Commented Oct 23, 2013 at 21:04
  • It's about half of all values that need to be deleted. The table can't be partitioned; doing so requires a total rebuild of the table, and while that is happening the table is down and not accessible. This question is about deleting the data without taking down the table or database. There are far simpler approaches if that were allowed, like just one huge delete. Commented Oct 24, 2013 at 4:40
  • 2
    Have you looked into Percona's pt-archiver utility? You don't have to transfer data into another table. Read the description on the website.
    – Parag
    Commented Oct 24, 2013 at 16:03
  • I have looked at pt-archiver and it seems too slow for me. It's a very good tool otherwise. Commented Oct 29, 2013 at 5:38
0

I am just curious what the effect would be of adding a non-unique index on key_id in table VALUE. The selectivity is not high at all (~0.001), but I wonder how it would affect the join performance.

0

Why don't you split your VALUE table into several tables according to some rule like key_id modulo some power of 2 (256, for example)?
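One possible reading of this is native hash partitioning (treat it only as a sketch; the comments above note the OP ruled out a partitioning rebuild), after which deletes for a given key_id touch a single partition:

ALTER TABLE `VALUE` PARTITION BY HASH(key_id) PARTITIONS 256;  -- note: this rebuilds the table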
