Efficiently deleting entries in a MySQL 5.7.X table with 30 million rows?

Question

I'm a SQL rookie in need of some help. I have a MySQL 5.7.X table that is used for log entries. Each log entry contains an ID of a system entity, plus some other crud. Currently there are about 30M rows in this table. Another table holds a different type of record, but uses the same ID. Something like this:

Table A
+----+-----------+
| ID | Timestamp |
+----+-----------+
| 1  | 2022-...  |
+----+-----------+

Table B
+----+-----------+-------------+
| ID | Timestamp | Logger_name |
+----+-----------+-------------+
| 1  | 2022-...  | XYZ         |
+----+-----------+-------------+

Table B above is the one that currently has something like 30 million rows. What I'd like to do is "remove all rows from table B where the ID is NOT found in table A". We've tried this using a DELETE, but it takes a very long time, and the application is blocked during this period. I've read about approaches like CTAS, as well as moving rows in combination with an ALTER command, but the issue is that both tables need to remain online and active during this operation.

The "row deletion" for table B will be done from a scheduled Cron job in Kubernetes, so there will definitely be multiple threads/processes writing to/reading from both these tables. I believe there are no foreign key constraints or indexes involved.

EDIT:

I can't post all the details of the tables due to corporate policy. Table A will have maybe a few hundred rows at any given time. Table B almost certainly will have millions of rows. For table A, the ID is the primary key. Table B has no explicit primary key. I'm guessing the ID being the first column listed will be the primary key? In both cases, InnoDB is the engine. I don't think we're using MyISAM explicitly anywhere.

I do understand that logging to a DB is less than ideal, and it wouldn't have been my choice. But I'm stuck with this solution for the time being and need to make the best of it.

There's a lot of stuff we have to know to help you. We can get most of it if you give the SQL statements SHOW CREATE TABLE for both tables and edit your question to show us the output -- the table definitions. Primary key definitions matter, as do other index definitions. Whether you use InnoDB or MyISAM also matters, a lot. — O. Jones, Commented Mar 1, 2022 at 1:28

Nick Bailey · Accepted Answer · 2022-03-01 00:40:30Z


If the scale of table A is substantially smaller, you could pretty plausibly take all records from table A, get a list of all the ids, save that in some format (CSV, etc.), and then write a script that loops over the list and deletes records matching each id in its own DELETE query. That way, you shouldn't lock either table during the delete.
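A minimal sketch of that loop, adapted to what the question asks for (removing IDs that are NOT in table A). It uses Python's built-in sqlite3 module purely as a runnable stand-in for MySQL. The names `table_a`, `table_b`, `idx_table_b_id`, and `delete_orphaned_log_rows` are hypothetical, as is the index on table B's ID column. On InnoDB, a DELETE that filters on an unindexed column has to scan the whole table, so some index along those lines is probably needed before each small DELETE stays cheap.

```python
import sqlite3


def delete_orphaned_log_rows(conn):
    """Delete table_b rows whose id is absent from table_a, one id per transaction."""
    # Collect the distinct ids that appear in table_b but not in table_a.
    orphan_ids = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT b.id FROM table_b AS b "
            "WHERE NOT EXISTS (SELECT 1 FROM table_a AS a WHERE a.id = b.id)"
        )
    ]
    deleted = 0
    for orphan_id in orphan_ids:
        # One short DELETE per id, committed right away so locks are released quickly.
        cur = conn.execute("DELETE FROM table_b WHERE id = ?", (orphan_id,))
        conn.commit()
        deleted += cur.rowcount
    return deleted


# Tiny in-memory demo with made-up rows (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (id INTEGER PRIMARY KEY, ts TEXT)")
conn.execute("CREATE TABLE table_b (id INTEGER, ts TEXT, logger_name TEXT)")
# Assumed index, so that each per-id DELETE does not scan the whole table.
conn.execute("CREATE INDEX idx_table_b_id ON table_b (id)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?)",
    [(1, "2022-03-01"), (2, "2022-03-01")],
)
conn.executemany(
    "INSERT INTO table_b VALUES (?, ?, ?)",
    [
        (1, "2022-03-01", "XYZ"),
        (1, "2022-03-01", "XYZ"),
        (2, "2022-03-01", "XYZ"),
        (3, "2022-03-01", "XYZ"),
        (3, "2022-03-01", "XYZ"),
        (4, "2022-03-01", "XYZ"),
    ],
)
conn.commit()

deleted = delete_orphaned_log_rows(conn)
remaining = sorted(row[0] for row in conn.execute("SELECT id FROM table_b"))
print(deleted, remaining)
```

Against MySQL the same shape should carry over with a MySQL driver and its placeholder style. If a single ID owns a very large number of rows, each DELETE could additionally be capped with `LIMIT` and repeated until it affects no rows.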

More generally, this is a great example of why log data should never go in an application database. There are tons of great log aggregators out there, and even if you don't feel like using one of those, pushing logs to some sort of external data store keeps them from ballooning the size of your application DB.


Yes, table A is significantly smaller. But I need to delete the IDs from table B that are NOT in table A. Why wouldn't a DELETE for a single ID lock table B? The legacy log DB is there, and it's outside of my control. I would have used a log aggregator too, but that's not an option at this moment.
– user2337270
Commented Mar 1, 2022 at 2:37
