
So I've had a problem where I need to compare data in two tables that live on two different servers. Now, I know MySQL supports CHECKSUM TABLE, but from my testing and understanding it's not reliable across server instances and versions.

So I created this query:

// Build the per-row checksum expression. For columns col1, col2 it expands to:
//   @CRC := MD5(CONCAT_WS('#', COALESCE(`col1`, "#NULL#"), COALESCE(`col2`, "#NULL#")))
// COALESCE turns NULLs into the "#NULL#" sentinel so they still affect the hash.
$part = '@CRC := MD5(CONCAT_WS(\'#\', COALESCE(`'.
    implode('`, "#NULL#"), COALESCE(`', $this->_columns).
    '`, "#NULL#")))';
// @CRC is assigned inside the first SUM() and reused by the later ones,
// relying on MySQL evaluating the select list left to right.
$sql1 = "SELECT COUNT(*) AS cnt, 
        SUM(CONV(SUBSTRING({$part}, 1, 4), 16, 10)) as a1, 
        SUM(CONV(SUBSTRING(@CRC, 5, 4), 16, 10)) as a2, 
        SUM(CONV(SUBSTRING(@CRC, 9, 4), 16, 10)) as a3, 
        SUM(CONV(SUBSTRING(@CRC, 13, 4), 16, 10)) as a4, 
        SUM(CONV(SUBSTRING(@CRC, 17, 4), 16, 10)) as a5, 
        SUM(CONV(SUBSTRING(@CRC, 21, 4), 16, 10)) as a6, 
        SUM(CONV(SUBSTRING(@CRC, 25, 4), 16, 10)) as a7, 
        SUM(CONV(SUBSTRING(@CRC, 29, 4), 16, 10)) as a8 
    FROM `dbname`.`tablename` 
    WHERE `id` >= $min AND `id` <= $max ";

So basically, it concatenates each row together (all the columns of each row, more specifically) and then MD5s the result. Then it walks through that MD5 four hex digits (16 bits) at a time and sums each chunk across all rows (using 16-bit chunks lets me handle huge tables without worrying about the sums overflowing). Then I just compare the result of this query on both tables to see if everything is the same.
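
To make that concrete, here's a rough sketch of the comparison step. ($prod, $qc, and fetchChecksum() are just illustrative names, not the actual code; assume they're PDO connections to the two servers and a tiny helper.)

// Run the aggregate query on both servers and compare the results.
// fetchChecksum() is a hypothetical helper; $prod and $qc are assumed
// PDO connections to the production and QC servers.
function fetchChecksum(PDO $db, $sql)
{
    return $db->query($sql)->fetch(PDO::FETCH_ASSOC); // cnt, a1 ... a8
}

$prodRow = fetchChecksum($prod, $sql1);
$qcRow   = fetchChecksum($qc, $sql1);

// The id range matches only if the row count and all eight partial
// sums are identical on both servers.
$rangeMatches = ($prodRow == $qcRow);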

By running this query over successively narrower id ranges (a binary search over the primary key), I am able to rather quickly narrow down where the changes are so that I can port them.
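
For illustration, the narrowing looks roughly like this; rangesMatch() stands in for building and running the query above on both servers for a given id range, and the 1000-row cutoff and $maxId are arbitrary placeholders:

// Recursively split a mismatching id range in half until the ranges are
// small enough to diff row by row. rangesMatch($min, $max) is assumed to
// build the query for that range, run it on both servers and compare.
function findMismatchedRanges($min, $max, array &$out)
{
    if (rangesMatch($min, $max)) {
        return; // identical on both servers, nothing to port
    }
    if ($max - $min <= 1000) {
        $out[] = array($min, $max); // small enough to inspect directly
        return;
    }
    $mid = (int) floor(($min + $max) / 2);
    findMismatchedRanges($min, $mid, $out);
    findMismatchedRanges($mid + 1, $max, $out);
}

$mismatches = array();
findMismatchedRanges(1, $maxId, $mismatches);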

It's actually reasonably efficient, so I'm not too concerned about that. What I am concerned about is whether this is even necessary. It's screaming "You're doing it wrong" at me, but I can't figure out any cleaner method around it...

What are your thoughts?

  • It may be overkill for your problem, but have you looked at MySQL Cluster to handle mirroring and distributing the data? It would offer other features that may be useful.
    – user60
    Commented Jan 19, 2011 at 21:37
  • They need to be synchronized lazily; basically we're pushing QC into Production. And the data volume makes it prohibitive (or at least expensive) to dump and restore the entire dataset each cycle...
    – ircmaxell
    Commented Jan 19, 2011 at 21:38

2 Answers


We use mk-table-checksum.

It works really well in a Master-Slave context, where it also lets you sync the differences in either direction, depending on what you choose.

Sadly, from what I've seen, most people use it for replication, and I can't provide any copy/pasteable output, but if you don't know it, it's definitely worth looking into. If you do know it, I'd like to hear why it doesn't work for you.

To get an overview across many tables you can use something like:

    mk-table-checksum host1 host2 | mk-checksum-filter

  • The only issue I have with that is that it'll be hard to write a binary search using it. It does have a chunk size, but it doesn't seem that I can specify the range of the PK as well, so all it would tell me is whether they differed. But looking there, the tool mk-table-sync would do it. I need some custom work wrapped around it, though, so calling an external program is a last resort at best. Thanks though!
    – ircmaxell
    Commented Jan 20, 2011 at 13:16

Simply run CHECKSUM TABLE `yourtable` in MySQL.

Or, for a PHP solution, read How to check MySQL Replication databases are in Sync.
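
A minimal sketch of that in PHP (assuming two hypothetical PDO connections, $prod and $qc; note the caveat from the question that CHECKSUM TABLE results may not be comparable across MySQL versions):

// CHECKSUM TABLE returns one row per table with `Table` and `Checksum` columns.
$sql = 'CHECKSUM TABLE `dbname`.`tablename`';

$prodSum = $prod->query($sql)->fetch(PDO::FETCH_ASSOC);
$qcSum   = $qc->query($sql)->fetch(PDO::FETCH_ASSOC);

if ($prodSum['Checksum'] !== $qcSum['Checksum']) {
    // The tables differ (or the two servers compute the checksum differently).
}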

