22 January 2016

Streaming vs Synchronous Replication in Postgres

I recently faced a strange issue in a Rails application that seemed to call some basic relational database principles into question. It cost me an almost sleepless night before I was able to get to the root cause.

The problem

The problem was pretty straightforward. A Rake task generates an email, and the email mentions the count of documents in two places. The two counts are supposed to be the same, but for some reason they were different.

The pain point

This particular problem was painful because it had not occurred for a few years, and when it did occur it was only intermittent. The thing about intermittent problems is that there is always some explanation behind them. Here too there was one. Here are the steps I took to find the root cause.

The approach

I first looked into the log file that the Rake task writes when my specific email job runs. Things looked fine there, meaning the job completed in under 90 seconds as expected.
The next step was to look at the production logs. The logs, as expected, had 30 insert statements. Check. They also had a read of the rows just inserted, a typical count(*) query. This is where the problem occurred. The count(*) should have returned 30, but instead it returned 4. There was another count(*) further down in the code, and that one returned 30 as expected!

The above step revealed that the problem was not in the Rails layer but had something to do with our production database setup. So I directed my energy there.
The production database environment is a Master-Slave configuration, with the Master taking writes and reads and the Slave configured purely to take reads. Both nodes are load balanced via a PG Pool server. My gut instinct was to spend some time investigating PG Pool, but that was not much use, as all PG Pool does is route traffic.
So I went and read about the Master-Slave replication configuration. I read about two types of replication, one being Synchronous Replication and the other being (asynchronous) Streaming Replication. Digging into that, I found my root cause!

Synchronous vs Streaming Replication

Assume you have two databases, A and B, with A being a read/write Master and B being a read-only Slave. If an insert or update command is issued, it is written to A, as A is the node configured for writes. If A returns only after it has ensured that all the slaves have received this write, it is called Synchronous or 2-safe Replication. If A does not wait for that step, but acknowledges the write as soon as it has succeeded locally and streams the change to B afterwards, it is called Streaming Replication. (Strictly speaking, Postgres streaming replication can run in either mode and is asynchronous by default. In this post, "Streaming Replication" means the asynchronous mode.)
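To make the difference concrete, here is a toy Ruby model of the two commit paths. It is only a sketch: nothing in it talks to Postgres, the `Primary` and `Standby` classes and their methods are made up for illustration, and it assumes a row is readable on B the moment B receives it. In a real cluster the exact guarantee depends on how `synchronous_commit` is configured.

```ruby
# Toy in-memory model of a read/write primary (A) and a read-only standby (B).
# All names are hypothetical; this only illustrates when B gets to see a write.
class Standby
  attr_reader :rows

  def initialize
    @rows = []
  end

  def apply(row)
    @rows << row
  end
end

class Primary
  attr_reader :rows, :pending

  def initialize(standby, synchronous:)
    @standby = standby
    @synchronous = synchronous
    @rows = []
    @pending = [] # rows committed on A that B has not received yet
  end

  # Returns once the write is acknowledged to the client.
  def insert(row)
    @rows << row
    if @synchronous
      @standby.apply(row) # wait for B before acknowledging
    else
      @pending << row     # acknowledge now, ship to B later
    end
    :ok
  end

  # Stands in for the background streaming of changes to B.
  def stream_pending
    @standby.apply(@pending.shift) until @pending.empty?
  end
end

sync_b = Standby.new
sync_a = Primary.new(sync_b, synchronous: true)
sync_a.insert("doc-1")

async_b = Standby.new
async_a = Primary.new(async_b, synchronous: false)
async_a.insert("doc-1")

puts "sync standby rows right after insert:  #{sync_b.rows.size}"
puts "async standby rows right after insert: #{async_b.rows.size}"
async_a.stream_pending
puts "async standby rows after streaming:    #{async_b.rows.size}"
```

In the synchronous case the standby already holds the row when `insert` returns. In the asynchronous case there is a window, however short, in which a read against B misses it.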

Both have their obvious pros and cons. Streaming Replication is for raw speed and is a very good configuration when there are a lot of writes. Synchronous Replication, although not as fast as Streaming, guarantees that a write has reached the slave before it is acknowledged. We unfortunately were in Streaming Replication mode. The 30 inserts happened so fast at A that, before it could even stream them all to B, the count query came in and read the half-replicated data from B. I am talking about a matter of milliseconds.
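The race can be replayed in a few lines of Ruby. This is an illustrative sketch, not a reproduction of our setup: the numbers 30 and 4 are the ones from the logs, while `replay_incident` and its parameters are hypothetical names, and the model simply assumes B had applied 4 rows at the moment the first count was routed to it.

```ruby
# Replays the incident on plain arrays; no database involved.
def replay_incident(total:, applied_before_first_read:)
  primary = []
  standby = []
  backlog = [] # rows committed on A that B has not applied yet

  # The Rake task inserts the documents on A; each commit returns immediately.
  total.times do |i|
    row = "document-#{i + 1}"
    primary << row
    backlog << row
  end

  # B has only applied a few rows when the first count(*) is routed to it.
  standby.concat(backlog.shift(applied_before_first_read))
  first_count = standby.size

  # Moments later B has applied the rest, so the second count(*) is correct.
  standby.concat(backlog.shift(backlog.size))
  second_count = standby.size

  { first_count: first_count, second_count: second_count, primary_count: primary.size }
end

result = replay_incident(total: 30, applied_before_first_read: 4)
puts "first count(*) on B:  #{result[:first_count]}"
puts "second count(*) on B: #{result[:second_count]}"
puts "rows on A:            #{result[:primary_count]}"
```

The first count depends entirely on how far B has got when the read arrives, which is why the problem only showed up intermittently.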

How did we fix it?

We isolated all our cron jobs to run on a dedicated node and pointed that node's database connection directly at the Master database server, skipping PG Pool in the process. When all reads and writes go to a single database, the question of Streaming or Synchronous Replication does not come into play. Hope this was helpful!
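As a rough sketch of that routing decision in Ruby: the URLs, host names, and the `database_url_for` helper below are placeholders invented for illustration, not our actual configuration. In a Rails app the same effect can be had by giving the cron node its own database settings, for example through the `DATABASE_URL` environment variable.

```ruby
# Hypothetical connection URLs; hosts, ports and database name are placeholders.
PGPOOL_URL  = "postgres://pgpool.internal.example:9999/app_production"
PRIMARY_URL = "postgres://db-master.internal.example:5432/app_production"

# Cron/Rake jobs bypass the load balancer, so their reads hit the same node
# that took their writes. Web traffic keeps going through PG Pool.
def database_url_for(process_type)
  case process_type
  when :cron then PRIMARY_URL
  when :web  then PGPOOL_URL
  else raise ArgumentError, "unknown process type: #{process_type.inspect}"
  end
end

puts "cron jobs connect to: #{database_url_for(:cron)}"
puts "web requests use:     #{database_url_for(:web)}"
```

The trade-off is that the Master takes the read load of the cron jobs as well, which was acceptable for us.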

Cheers!
Braga
