QUICK SUMMARY:

We are trying to set up a remote CD server near China using SQL Server Transactional Replication. We almost have it working but are noticing that when we update an item, the remote CD is always displaying the previous version of that item.

DETAILS:

We are trying to set up Sitecore 10.3.1 in a test env (before deploying this config to production) with the following VMs:

  • US CM
  • US CM database
  • US CD
  • US CD database
  • US Solr
  • China CD
  • China CD database
  • China Solr

The Web database is being replicated from the US CD database server to the China CD database server using SQL Server Transactional Replication.

Following the Sitecore docs, we have moved the EventQueue, Properties and Tasks tables from the Web database to a separate WebShared database and have configured the CM and both CDs to use it.

We also have US and CN Solr instances with Solr Replication keeping them in sync.

Sitecore documents we have consulted in setting this up:

  1. https://support.sitecore.com/kb?id=kb_article_view&sysparm_article=KB0610106
  2. https://doc.sitecore.com/xp/en/developers/103/platform-administration-and-architecture/configure-sitecore-roles-for-separate-eventqueue%2C-properties%2C-and-tasks-tables.html

Here are some reproducible steps that illustrate our problem:

  1. On US CM, update item to include "Test 1" in the content, save it and publish it
  2. On US CD site, corresponding page displays "Test 1"
  3. On China CD site, same page does not display "Test 1"
  4. On US CM, update same content item to read "Test 2", save it and publish it
  5. On US CD site, page now displays "Test 2"
  6. On China CD site, page now displays "Test 1" (note that it is one update behind)
  7. Recycle app pool on China CD site
  8. On China CD site, page now displays "Test 2" (note that it now has the latest content)

We believe the China CD site learns that an item has been updated via the EventQueue table, which it seems to poll pretty frequently. We also know that SQL Replication can take a few seconds to push or pull changes.

We believe the following is happening:

  1. China CD site gets notified via EventQueue.
  2. China CD site immediately pulls the content item from the China Web database (before SQL Replication has updated it) and caches it.
  3. User requests the page and gets the cached item with old content.
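To make the suspected sequence concrete, here is a minimal Python sketch of the race. It is only a toy model, not Sitecore code: `ReplicaDb`, `CdServer` and `publish` are hypothetical names, and it assumes that a request re-reads the item right after the cache clear, before replication has landed.

```python
class ReplicaDb:
    """Toy stand-in for the China Web database, which lags behind the US one."""

    def __init__(self, content):
        self.content = content   # what the replica currently holds
        self.pending = None      # change published in the US, not yet replicated

    def receive(self, content):
        self.pending = content

    def apply_replication(self):
        if self.pending is not None:
            self.content = self.pending
            self.pending = None


class CdServer:
    """Toy stand-in for the China CD with a single cached item."""

    def __init__(self, db):
        self.db = db
        self.cache = None

    def render(self):
        # Cache miss: read from the local (replicated) database and cache it.
        if self.cache is None:
            self.cache = self.db.content
        return self.cache

    def on_publish_event(self):
        # Assumption: the event arrives through the shared EventQueue, the
        # cache is cleared, and a request repopulates it straight away.
        self.cache = None
        self.render()


def publish(db, cd, content):
    db.receive(content)      # change is on its way to the replica
    cd.on_publish_event()    # event wins the race against replication
    db.apply_replication()   # replication lands a few seconds later
    return cd.render()       # what a visitor sees afterwards


replica = ReplicaDb("Original")
china_cd = CdServer(replica)
after_first = publish(replica, china_cd, "Test 1")
after_second = publish(replica, china_cd, "Test 2")
china_cd.cache = None        # equivalent of recycling the app pool
after_recycle = china_cd.render()
```

Under these assumptions the model reproduces the steps above: the page is one update behind after each publish and only catches up once the cache is emptied again.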

We tried a test where we waited a few minutes before requesting the page on the China CD site. And before requesting the page, we checked the item content in the China Web database and verified it had the latest content. But when we requested the page, we still got the old content!

Does Sitecore have any way to tell when the update of an item's data via SQL replication has been completed? If not, it seems like a fundamental flaw to have the trigger mechanism (detecting an update in the shared EventQueue table) be much faster than the mechanism that transfers the actual data (SQL replication). In our old Sitecore 8 env, we think the EventQueue and item data were both updated via SQL replication so maybe that's why that env didn't have these issues?

Any thoughts on how to resolve this? We submitted this problem to Sitecore Support over a month ago and haven't made any significant progress towards a solution.

1 Answer

I think what you are seeing here is a caching issue on the secondary CD. The official Sitecore documentation at https://doc.sitecore.com/xp/en/developers/104/platform-administration-and-architecture/walkthrough--replicating-the-web-database-using-azure-active-geo-replication.html#deploy-the-application-and-storage-roles has a warning message at the bottom:

On CD roles using replicated Web databases, visitor requests can potentially repopulate the newly cleared cache with old content, until the replication is completed. Sitecore is working on a solution to this.

Unfortunately, that warning has been there since version 9, and Sitecore Support did not help here because fixing it is not a priority for them.

My advice is to first create a CacheClear.aspx page and add it directly to the CD to determine whether this is your actual issue. If it is, you will have to come up with a custom cache clear that runs on the remote item saved/published events in a delayed manner, using a timer or similar. It is also worth adding a retry pattern, for example with the Polly library, to make it more resilient. The official Azure geo-replication latency is 5s and is backed by an SLA, so in an ideal world a 5s delay would be enough before clearing the cache for the item. See https://www.microsoft.com/licensing/docs/view/Service-Level-Agreements-SLA-for-Online-Services?lang=1 and https://learn.microsoft.com/en-us/answers/questions/1040341/how-to-monitor-azure-sql-db-geo-replication-latenc
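As a rough, language-neutral illustration of the delayed clear with retries, here is a small Python sketch. On a real CD this would be a C# event handler (possibly using Polly). Everything named here is hypothetical: `is_replicated` stands for whatever check you can make against the local Web database (for example, comparing the item's revision), `clear_cache` stands for your cache clear call, and the delay values are placeholders.

```python
import time
from typing import Callable, Sequence


def clear_cache_after_replication(
    is_replicated: Callable[[], bool],
    clear_cache: Callable[[], None],
    delays: Sequence[float] = (5.0, 10.0, 20.0),
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Wait, check the replica, and clear the cache once it has caught up.

    Returns True if replication was confirmed before clearing, and False if
    all attempts were used up and the cache was cleared anyway.
    """
    for delay in delays:
        sleep(delay)              # give replication time to land
        if is_replicated():       # hypothetical check against the local Web DB
            clear_cache()
            return True
    # Replication was never confirmed: clear anyway as a last resort.
    clear_cache()
    return False


# Example run with fakes instead of a real database and cache.
waits = []
answers = iter([False, True])     # replica is behind on the first check
cleared = []
confirmed = clear_cache_after_replication(
    is_replicated=lambda: next(answers),
    clear_cache=lambda: cleared.append("cleared"),
    sleep=waits.append,           # record the waits instead of sleeping
)
```

In a real handler this should run on a background timer or task rather than blocking the event processing thread, and the delays would need tuning against the replication latency you actually observe.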
