## Master-slave and sharding with Doctrine, together at last #### Davide Borsatto, sfday 2019
### About me * PHP developer since 2007 * Worked on all major versions of Symfony, starting from 1.0 * Will add missing `@throws` annotations to your code
### About Madisoft * We work remotely! * We work full time on a Symfony project * [We're hiring](https://labs.madisoft.it/entra-nel-team/)!
### About Nuvola * Created in 2013 as a Symfony application (currently on 4.3) * Web application for schools * Uses Amazon Web Services * Uses MySQL and has a huge amount of database queries, _more than a billion_ during a regular school day * In 2014 sharding was first introduced
### What is sharding? Many different views and specifications; our approach is to split data into different databases, according to domain boundaries (i.e. schools) Our setup has a global database and multiple tenants (database shards)

Sharding: why and why not

Pros Cons
  • Scalability
  • Security
  • Complexity
  • Manual data aggregation
### Things to keep in mind when working with a sharded database
* Be aware of where data is supposed to live: some might be stored in a shard, some in the global database, some in both * Even when working locally, you need to take sharding into account in your workflow: your local database will work using shards, too
### Sharding: implementation with Doctrine Two main interfaces to implement: `Doctrine\DBAL\Sharding\ShardManager` and `Doctrine\DBAL\Sharding\ShardChoser\ShardChoser` [Documentation on Doctrine's website](https://www.doctrine-project.org/projects/doctrine-dbal/en/2.9/reference/sharding.html)
#### ShardManager Works with the connection class in order to select the right shard. Contains two types of methods: for handling shard selection (`selectGlobal`, `selectShard`, `getCurrentDistributionValue`, `getShards`) and for performing queries across all shards (`queryAll`)
#### ShardChoser Provides a single `pickShard` method, it is used for abstracting the logic of converting a "distribution value" to a shard identifier
#### DBAL Configuration ```yaml doctrine: dbal: connections: default: shard_choser_service: nuvola.app.sharding.shard_choser shard_manager_class: Nuvola\AppBundle\Sharding\ShardManager wrapper_class: Doctrine\DBAL\Sharding\PoolingShardConnection # ... ```
### More about sharding * [Symfony and multi tenant apps: Nuvola](https://labs.madisoft.it/symfony-and-multi-tenant-apps-nuvola/) * [Symfony and multi tenant apps: sharding](https://labs.madisoft.it/symfony-and-multi-tenant-apps-sharding/) * [Doctrine sharding](https://labs.madisoft.it/doctrine-sharding/) * SymfonyDay 2016 talk [video](https://vimeo.com/196399354) / [slides](https://www.slideshare.net/matteomoro8/scaling-symfony-apps)
## Optimizing our setup
  • Built to handle peak usage
  • Usage varies widly: extreme peaks during school hours (8AM to 3PM) and certain periods (end of school year), but low to no use during summer and night time
  • The setup was rigid: we could not scale it down gently to save resources
  • We were spending a ton of money for resources we did not use
### Possible solutions
* Moving [from RDS GP2 to RDS IOPS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html): still not enough for our needs * Database clustering: not cost effective and possibly lack of support * Some proxy to handle read/write splitting: complex to set up and maintain * Manually optimize certain areas of the application: inefficient and prone to error
We eventually decided to switch to Amazon Aurora and use its master-slave functionality
### What do we mean with "master-slave"? _Master-slave replication enables data from one database server (the master) to be replicated to one or more other database servers (the slaves). The master logs the updates, which then ripple through to the slaves._ [MariaDB.com](https://mariadb.com/resources/blog/database-master-slave-replication-in-the-cloud/)
### Master slave and sharding at the same time? * Doctrine natively supports either-or * We had to DIY a few things to make it work
### Some implementation gotchas
#### You have to extend the base MasterSlaveConnection class ```php class Doctrine\Migrations\Configuration\Configuration { public function connect() : bool { if ($this->connection instanceof MasterSlaveConnection) { return $this->connection->connect('master'); } return $this->connection->connect(); } } ```
#### Getting creative for specifying multiple hosts No way of changing the configuration format with a single host; as a workaround, we provide master/slave addresses separated with a colon ```yaml doctrine: dbal: connections: default: host: "192.168.0.10:192.168.0.11" ```
#### SchemaTool does something unexpected We created a patch for `SchemaTool` which fixes the use of `Connection::executeQuery` to `Connection::executeUpdate` in writing contexts
### Enter a custom connection class * `MasterSlavePoolingShardConnection` extends `MasterSlaveConnection` * A mix of `MasterSlaveConnection` and `PoolingShardConnection`
Works by overriding methods where `MasterSlaveConnection` would access `$this->connections['master/slave']` and adding an extra layer, turning it into `$this->activeConnections[$this->shardId]['master/slave']`.
#### DBAL Configuration ```yaml doctrine: dbal: connections: default: wrapper_class: Nuvola\AppBundle\Sharding\MasterSlavePoolingShardConnection #... ```
## Deployment
### Step 1 #### Problem We should never write to a slave #### Solution We set up a special version used by our help desk with extra logging features enabled, and we carefully monitored those logs
### Step 2 #### Problem The base scenario (master only) works #### Solution We deployed the code with no slaves enabled, with just one host configured
### Step 3 #### Problem We test the migration to Aurora #### Solution We have some demo environments for our help desk to use (for testing features and reproducing issues), so we migrated those
### Step 4 #### Problem We move all our data #### Solution We slowly batch migrate data, about 50 shards at a time
### Wait for issues to pop up
### The basic issue: replication lag * You change something, data is written to master, but on a following request the slave is still not updated * Amazon says the delay should stay below 100 milliseconds * Waypoint solution implemented by [Booking](https://www.slideshare.net/JeanFranoisGagn/how-bookingcom-avoids-and-deals-with-replication-lag): interesting but too complex for our needs
### Problem Users will make a change, then they are forwarded to another page which will display stale data
### Solution: a Symfony event listener
* Basic idea: if last write was less than 2 seconds ago, force current request to use master * On **request**, checks whether the last write happened less than 2 seconds ago; if so, force connection pool to use master * On **response**, checks whether the current connection was using master (which means some data was written); if so, save current time as last write event
* Not a perfect solution, but extemely pragmatic * Theoretical risk of being stuck to master indefinitely * Works using user sessions, cross-user stale data is not a real concern
### Problem Spikes in failure rates for email and SMS notification delivery
  • We save a notification (email, SMS) row to database, and dispatch a message to the message queue
  • Sometimes the message queue is too fast and consumer looks for a database row which does not exist yet
  • The problem was already there, but failure rate was so low that we never really noticed
### Possible solutions Either add a slight delay to the queue consumer or implement some sort of exponential backoff logic
### Are we happy with this setup? * Yes! It saves us quite some money in billing * Some things are not by the book, but they work for our use case and that's what matters to us
### What did we achieve?
#### We use about 50% less nodes Every node holds about twice as many shards. We could pack even more shards in a single node, but we choose not to do it
#### Replication lag is low and we can easily work around it Usually stays around 20ms
#### Very reduced load on master databases Even with our permissive setup, about 75% of queries go to slaves, without any sort of optimization in the application
#### Our code does not depend on AWS The infrastructure is built using Aurora, but the application configuration could be easily transferred elsewhere
## Thanks! Please rate this talk: [https://joind.in/talk/056f3](https://joind.in/talk/056f3) Take a look at our openings: [https://labs.madisoft.it/entra-nel-team/](https://labs.madisoft.it/entra-nel-team/) Find me on [Twitter](https://twitter.com/davide_b88) and [Github](https://github.com/dborsatto)