Infrastructure/DB Replication Plan
Databases are currently a single point of failure in infrastructure. We'd like to come up with something that lets us reboot a db server and not have downtime. We have mostly postgres databases and one thing (the wiki) on mysql.
Owners: Toshio (abadger1999), Seth (skvidal), Kevin (nirik)
Ticket: Infra ticket 2718
Features we're looking for
These are the reasons that we want db replication. Anything less than this would be unacceptable
- Want to reboot db server. Sysadmin manually specifies that db1 is going away and db2 should take over
- Very short downtime
- less than 5 minutes on a switchover/failover event
- No loss of data. Once the db says data is committed there must be copies on other boxes
- Performance must meet our current demands but only our current demands.
- if we need to service 100 fas commits per second but the current (unreplicated) service could theoretically handle 1000 commits, the replication solution only needs to handle 100 commits, not 1000.
Really really want
If a solution has these and its competition doesn't chances are we're going to go with that solution.
- Auto failover
- Db1 stops responding. db2 automatically takes over.
- No downtime (as long as one db node is up)
Won't lose sleep over
May I have a pony too?
- load balancing (reads or writes)
- Currently we don't have load issues
- replication to other data centers
postgres technologies to explore
synchronous streaming replication + repmgr
The core replication is in postgres 9.x. repmgr adds command line scripts and automation to make switchover and failover much easier to manage.
Note: synchronous replication gives us the no data loss guarantee. However, it is hard to mix that with high availability (as it means transactions that write data can't be completed until the slaves reply that they've committed the data too). Reading the synchronous replication docs, it may be that this can be mitigated by having multiple slaves. From the documentation it seems that postgres might be making sure that the data is replicated to at least one other slave but not demanding that it be written to all the other slaves. So if one of two slaves is stopped, the system would still be able to function.
PGPoolII is very flexible. We'll have to pick out the set of features that gives us what we want and test it heavily to see how it works.