If I wanted to compare two similar Accumulo tables and highlight their differences, how could I do this?

My first thought is creating database dumps and running Unix diff over the files, but that wouldn't scale.

My second thought is maybe there's a way to sync two Accumulo tables, hopefully with a dry-run option, that could collect the differences somewhere.

... is there at least a way to do this in HBase?

1 Answer

Sadly, I don't know of anything that exists out of the box to do this.

Trivially, you could implement this with two Scanners and do a merged read. Because both Scanners return sorted data, you only ever need to compare the current entry from each. If the two key-values are equal, you advance both Scanners. If the Key from Scanner1 sorts before the Key from Scanner2, you know that Key doesn't exist in the table from Scanner2, and you advance Scanner1. If the Key from Scanner2 sorts before the Key from Scanner1, that Key doesn't exist in the table from Scanner1, and you advance Scanner2. When one Scanner runs out, everything left in the other is missing from the exhausted table.
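To make the merged read concrete, here is a minimal sketch in Python rather than against the Accumulo client API. It assumes each "scanner" is simply an iterable of `(key, value)` tuples already sorted by key; the function name `diff_sorted` and the result tags are made up for illustration. It also reports keys present in both tables with different values, which is one possible reading of "the two key-values are equal".

```python
from typing import Iterable, Iterator, Optional, Tuple

Entry = Tuple[str, str]  # (key, value); stand-in for an Accumulo Key/Value pair
Diff = Tuple[str, str, Optional[str], Optional[str]]  # (tag, key, left value, right value)

_END = object()  # sentinel marking an exhausted scanner


def diff_sorted(left: Iterable[Entry], right: Iterable[Entry]) -> Iterator[Diff]:
    """Merged read over two key-sorted streams, yielding only the differences."""
    li, ri = iter(left), iter(right)
    l, r = next(li, _END), next(ri, _END)
    while l is not _END or r is not _END:
        if r is _END or (l is not _END and l[0] < r[0]):
            # Left key sorts first (or right is exhausted): missing from right.
            yield ("only_left", l[0], l[1], None)
            l = next(li, _END)
        elif l is _END or r[0] < l[0]:
            # Right key sorts first (or left is exhausted): missing from left.
            yield ("only_right", r[0], None, r[1])
            r = next(ri, _END)
        else:
            # Same key in both tables: report it only if the values differ.
            if l[1] != r[1]:
                yield ("value_differs", l[0], l[1], r[1])
            l, r = next(li, _END), next(ri, _END)


table1 = [("a", "1"), ("b", "2"), ("d", "4")]
table2 = [("b", "2"), ("c", "3"), ("d", "5")]
for diff in diff_sorted(table1, table2):
    print(diff)
```

With real tables you would feed in the entries from two Scanners instead of lists; the comparison logic stays the same, and memory use stays constant because only one entry per side is held at a time.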

However, like you said, that would be pretty slow: a single thread reads both tables from start to finish, while you likely have multiple cores that could be doing the work concurrently.

To make this scale, you can "partition" your tables into buckets (e.g. if your table keys are the alphabet [A, B, C, ... Z], each partition could be a letter) and run the same algorithm on every bucket in parallel. The same bucket boundaries must be used for both tables so that each worker compares matching key ranges. Using the alphabet example, you could have 26 clients reading over their portions of the tables concurrently. This is something that could be easily implemented as a map-reduce job, too.
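A rough sketch of the partitioned version, again in plain Python rather than the Accumulo API. The tables are modeled as key-sorted lists of `(key, value)` tuples, `scan_range` is a hypothetical stand-in for a Scanner restricted to a key range, and `split_points` are boundaries you choose yourself (the table's own split points might be a natural choice, but that is an assumption). Each worker runs the same merged read over one range.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor


def scan_range(table, start, end):
    """Hypothetical range scan: entries with start <= key < end (None = unbounded)."""
    keys = [k for k, _ in table]
    lo = 0 if start is None else bisect_left(keys, start)
    hi = len(keys) if end is None else bisect_left(keys, end)
    return table[lo:hi]


def diff_sorted(left, right):
    """Merged read over two key-sorted lists; returns a list of (tag, key)."""
    end = object()
    li, ri = iter(left), iter(right)
    l, r = next(li, end), next(ri, end)
    out = []
    while l is not end or r is not end:
        if r is end or (l is not end and l[0] < r[0]):
            out.append(("only_left", l[0]))
            l = next(li, end)
        elif l is end or r[0] < l[0]:
            out.append(("only_right", r[0]))
            r = next(ri, end)
        else:
            if l[1] != r[1]:
                out.append(("value_differs", l[0]))
            l, r = next(li, end), next(ri, end)
    return out


def parallel_diff(left, right, split_points, workers=4):
    """Diff each key range concurrently, using the same ranges for both tables."""
    bounds = [None] + sorted(split_points) + [None]
    ranges = list(zip(bounds[:-1], bounds[1:]))

    def diff_one(rng):
        start, end = rng
        return diff_sorted(scan_range(left, start, end), scan_range(right, start, end))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in range order, so the output stays key-sorted.
        return [d for part in pool.map(diff_one, ranges) for d in part]


table1 = [("apple", "1"), ("banana", "2"), ("mango", "3"), ("zebra", "4")]
table2 = [("banana", "2"), ("cherry", "9"), ("mango", "7"), ("zebra", "4")]
print(parallel_diff(table1, table2, split_points=["h", "p"]))
```

Threads are used here because the real work would be waiting on scans over the network; in a map-reduce job, each range would instead become one map task.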
