Technique to sync medium/large amounts of data

Question

I implemented sync for my app (with a REST backend). It works well, but there are some problems:

1. Blocks - Since I don't want to overwrite possible inputs of the user during sync, I block the UI (progress overlay) during sync. This of course hurts the user experience.
2. Size - If the data is large, the request may take a long time, and there may also be out-of-memory errors while processing the response.

So I'm thinking about how to improve this. For point 2, I thought I could simply paginate the request, i.e. send how many entries I want and a reference date (e.g. creation date) to make the database query simpler.

But point 1 is a problem independent of what I do for point 2: as long as the user can manipulate data during sync, there's a possibility that this data will be overwritten with the sync result. I could, for example, lock the client database during sync, such that only the sync process can write to it, and queue the client operations. But what happens when one of these operations tries to, e.g., increment some data which was just changed by sync, and this leads to a different result than what the client intended?

My app is about personal grocery management, so the requirements are not super strict. If an update is lost or something, it's not tragic. I'm looking for a balanced solution that is easy to implement while keeping a fairly good UX.

I would need to know more about the architecture to offer a meaningful answer. You speak of a client database, so I presume there is a client app that has a local DB, which I presume is being synced with the server DB for performance reasons? Why is it not feasible to have a server app that interacts with the DB to provide the client views and interface into the DB? — Thomas Carlisle, Commented Jun 5, 2016 at 16:03
Might be interesting to take a look at this example: meteorhacks.com/understanding-mergebox DDP is a protocol which fixes the kind of issues you mention. Their approach may be of good use for these issues. It shows how to reduce the amount of data synced and how to solve conflicts. — Luc Franken, Commented Sep 6, 2016 at 6:39

Brandon · Accepted Answer · 2016-06-08 01:30:51Z

Blocks

The problem you are referring to for scenario 1 is consistency, that is, having 2 stores holding the same information and needing them to be consistent.

In a distributed environment like this, optimistic concurrency can prevail. That means, instead of locking, you allow anything to change at any time and build in a mechanism to track overlaps.

First step: is there any data which truly has overlaps? Or is your data set truly just a set of items? If the latter, and items can be added by any client and then replicate down, you simply need each client to keep track of which items have not yet been pushed.
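To make the "keep track of which items have not yet been pushed" idea concrete, here is a minimal sketch. The LocalStore class and all of its method names are made up for illustration, and skipping remote copies of items with unpushed local changes is my assumption about a reasonable policy, not something the approach requires.

```javascript
// Hypothetical client-side store; every name here is illustrative.
class LocalStore {
  constructor() {
    this.items = new Map();      // id -> item
    this.pendingIds = new Set(); // ids created or changed locally, not yet pushed
  }

  // Local edit: store the item and remember that it still has to be pushed.
  saveLocal(item) {
    this.items.set(item.id, item);
    this.pendingIds.add(item.id);
  }

  // Items the next sync should upload.
  pendingItems() {
    return [...this.pendingIds].map((id) => this.items.get(id));
  }

  // Called once the server has acknowledged the given ids.
  markPushed(ids) {
    for (const id of ids) this.pendingIds.delete(id);
  }

  // Items replicated down from the server. Assumption: anything with
  // unpushed local changes is left alone until it has been pushed.
  applyRemote(remoteItems) {
    for (const item of remoteItems) {
      if (!this.pendingIds.has(item.id)) this.items.set(item.id, item);
    }
  }
}

const store = new LocalStore();
store.saveLocal({ id: "a", name: "milk" });
store.applyRemote([{ id: "a", name: "stale" }, { id: "b", name: "eggs" }]);
```

Here the local "milk" entry survives the incoming sync because it is still pending, while "eggs" is added normally. After a successful upload, markPushed clears the pending flag.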

If there can be overlaps, for example, you are managing documents (sounds unlikely), then you either have locking, or allow concurrent updates and flag overlaps as a problem to be resolved by the consumer. Examples of this pattern:

Software version control systems like git, mercurial, subversion, etc.
Evernote
Google Docs

If you have ever worked with software version control, you'll recognize the idea of a merge conflict and/or overlap. That is the same problem you have. Source control systems like Vault avoid this problem by each client treating each file as read-only until a file is "checked out", which creates a mutually exclusive lock across all clients. Git, on the other hand, will identify conflicts, merge them if it can do so, and if not, force the end user (developer) to manually merge.

Evernote allows my wife and me to each edit our grocery list. To minimize conflicts, we sync as often as possible, but there is always a chance of conflicting changes, and it has happened. Evernote simply appends the conflicting document versions one after the other, with markers indicating there was a conflict.

In a case where concurrent changes are rare, usually this sort of optimistic concurrency is ideal. In a highly concurrent transaction system, locking may be a better approach.

Size

As far as data size, you have two options I can think of:

Pagination
Streaming

Pagination is probably the way I'd go because it's simpler to code, usually, and easier to track your progress, and in a way, gives you more control. With my recommendations for handling consistency above, hopefully pagination becomes a non-issue.

Streaming may be an alternative for you too and would reduce the risk when it comes to consistency since you'd have a single round-trip. Streaming basically means to change this data flow:

// Buffered: the whole result set is held in memory before the loop starts.
const results = getResultsFromServer();
for (const result of results) {
  doSomething(result);
}

to:

getResultsFromServer((result) => {
  doSomething(result)
})

In the former, the entire result set must be buffered so that you can iterate over it, even though you only consume the data one item at a time in a forward-only manner. In the latter, you only need one item in memory at a time. I implemented this approach once for a large report that would overwhelm our production servers when a customer ran it for a very large date range.
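For a version you can actually run, a generator gives the same forward-only, one-item-at-a-time behavior as the callback form. This is only a sketch: resultsFromServer is a made-up stand-in that fabricates rows, where a real client would parse records off the response stream as they arrive.

```javascript
// Hypothetical source that produces rows lazily instead of building an array.
function* resultsFromServer(count) {
  for (let i = 1; i <= count; i++) {
    // In a real app: one parsed record from the response stream.
    yield { id: i, quantity: i * 2 };
  }
}

// Only the current row is alive at any point; nothing is buffered.
let total = 0;
for (const result of resultsFromServer(1000)) {
  total += result.quantity;
}
```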

Kasey Speakman · Accepted Answer · 2016-06-07 22:38:30Z

I think I understand that your main concern is that the input is directly bound to the local database, and the sync also writes to the local database. So if the user saves some value during sync, it is undefined as to which information will be shown. (Last one wins.)

One solution that I often use for UI is to have the user edit a copy of the data and only save it back to the database after they have chosen to save it (even if that is an implicit choice like clicking off the input, applicable to things like To Do lists). That also makes canceling an operation pretty easy: throw away the edited copy.

Doing something like that will help with your sync story, because you can persist the changes from sync to the local database without disturbing the user's input. And if they "save" during a sync, you can wait to save until after the sync finishes. You wouldn't need to block. If you need the "in progress" data to be persistent (say, in case the device crashes), you could still save it... just in a different place in the database.
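Here is a small sketch of the edit-a-copy idea combined with holding saves until the sync is done. ItemEditor, its methods, and the duringSync hook are all invented for the example, and a Map stands in for the local database. A real sync would be asynchronous, but the ordering is the same.

```javascript
class ItemEditor {
  constructor(db) {
    this.db = db;        // Map of id -> record, standing in for the local database
    this.syncing = false;
    this.deferred = [];  // saves requested while a sync was running
  }

  // Start editing: hand the UI a copy so sync can update the stored record freely.
  beginEdit(id) {
    return { ...this.db.get(id) };
  }

  // User confirmed the edit: write now, or hold it until the sync has finished.
  save(draft) {
    if (this.syncing) this.deferred.push(draft);
    else this.db.set(draft.id, draft);
  }

  // Run a sync: apply server records, then the saves that were held back.
  sync(remoteRecords, duringSync = () => {}) {
    this.syncing = true;
    for (const record of remoteRecords) this.db.set(record.id, record);
    duringSync(); // hook so the example can simulate a user save mid-sync
    this.syncing = false;
    for (const draft of this.deferred.splice(0)) this.db.set(draft.id, draft);
  }
}

const db = new Map([[1, { id: 1, name: "Milk", quantity: 1 }]]);
const editor = new ItemEditor(db);
const draft = editor.beginEdit(1);
draft.quantity = 3; // the user's edit touches only the copy
editor.sync([{ id: 1, name: "Whole milk", quantity: 1 }], () => editor.save(draft));
```

Note that in this sketch the held-back draft is written as a whole record, so it replaces what the sync just stored (last one wins). Whether that is acceptable, or whether you want to merge field by field, depends on your app.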

As far as incrementing some value and arriving at a different result, this is a classic case for using idempotent operations. A simple example to illustrate:

AddQuantity(1); // result depends on previous quantity
SetQuantity(13); // result does not depend on previous quantity
                 // can be applied numerous times w/o changing result
                 // idempotent operation

Using literal numbers is contrived of course, but 13 could be pre-calculated and carried in the queue, instead of calculating it afterwards via the AddQuantity method.
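A quick runnable illustration of the difference, written as a sketch. The addQuantity/setQuantity helpers mirror the names above, and replay is a made-up helper that applies a queued operation a given number of times.

```javascript
// Each helper returns an operation that can be queued and applied later.
const addQuantity = (n) => (target) => { target.quantity += n; };
const setQuantity = (n) => (target) => { target.quantity = n; };

// Apply one queued operation `times` times to an item starting at `startQuantity`.
function replay(startQuantity, op, times) {
  const target = { quantity: startQuantity };
  for (let i = 0; i < times; i++) op(target);
  return target.quantity;
}

// The user saw 12 and tapped "+1", so the intended result is 13.
const addTwice = replay(12, addQuantity(1), 2);      // relative op applied twice drifts
const setTwice = replay(12, setQuantity(13), 2);     // absolute op applied twice is stable
const setAfterSync = replay(20, setQuantity(13), 1); // sync changed the base; intent still holds
```

Replaying the relative operation twice overshoots, while the absolute one lands on the intended value however often it is applied, and regardless of what sync did to the stored quantity in between.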

When you get multiple users synchronizing the same items, you are going to run into concurrency conflicts. One user updates and syncs, while another is updating from an old version, and when the second user goes to sync, they overwrite the first user's changes.

At that point, usually the answer is some sort of concurrency control like Optimistic Concurrency with version numbers (or ETags). The server will then raise a conflict if you try to update an old version of the item. It will be up to you on how to handle that conflict. You could display a message to the user and ask them what to do. (Hint: they will probably always choose to overwrite the server version with their version.) Or you could silently take one of the versions as authoritative. Or you could try to apply an algorithm to merge the differences (if applicable to your situation). Basically, do what provides the most value for your app, with trade-offs you can live with.
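As a sketch of the version-number check, here is an in-memory stand-in for the server. The records map, read, and update are all hypothetical. Over HTTP the same check is commonly expressed with an ETag and an If-Match header, but that part is not shown here.

```javascript
// Hypothetical in-memory "server": each record carries a version number.
const records = new Map([["milk", { quantity: 1, version: 1 }]]);

function read(id) {
  return { ...records.get(id) };
}

// Accept the write only if the caller edited the version the server still has.
function update(id, quantity, baseVersion) {
  const current = records.get(id);
  if (current.version !== baseVersion) {
    return { ok: false, current: { ...current } }; // conflict: caller decides what to do
  }
  const next = { quantity, version: current.version + 1 };
  records.set(id, next);
  return { ok: true, current: { ...next } };
}

// Two clients start from the same version.
const alice = read("milk");
const bob = read("milk");

const first = update("milk", 2, alice.version); // accepted
const second = update("milk", 5, bob.version);  // rejected: bob's copy is stale

// One possible policy ("client wins"): retry against the version the server reported.
const retried = second.ok ? second : update("milk", 5, second.current.version);
```

The retry at the end is just one of the options listed above. Asking the user or merging would hook in at the same place, where the conflict result comes back.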

Note that if you are syncing while the user is editing something, you have the opportunity to detect updates to the item they are changing and display a banner "This item has been updated by another user."
