Onboarding to Accord
Intro
Accord supports all existing CQL and can be enabled on a per table and per token range within that table basis. Enabling Accord on existing tables requires a migration process that can be done on this same per table and per range basis that safely transitions data from being managed by Cassandra + Paxos to Cassandra + Accord without downtime.
A migration is required because Accord can’t safely read data written by non-SERIAL writes. Accord requires deterministic reads in order to have deterministic transaction recovery and non-SERIAL writes can’t be read deterministically while still being highly available.
This guide describes how to enable Accord and what differences to expect when migrating your existing CQL workload to Accord.
This guide does not cover the new transaction syntax.
Configuration
YAML
You need to set accord.enabled to true for Accord to be initialized at
startup.
accord.default_transactional_mode allows you to set a default
transactional mode for newly created tables which will be used in create
table statements when no transactional_mode is specified. This
prevents accidentally creating non-Accord tables that will need
migration to Accord.
accord.range_migration configures the behavior of altering the
transactional_mode of a table. When set to auto the entire ring
will be marked as migrating when the transactional_mode of a table
is altered. When set to explicit no ranges will be marked as migrating
when the transactional_mode of a table is altered.
Table parameters
transactional_mode can be set when a table is created
CREATE TABLE foo WITH transactional_mode = ‘full' or it can be set
by altering an existing table
ALTER TABLE foo WITH transactional_mode = ‘full'.
transactional_mode designates the target or intended transaction
system for the table and for a newly created table this will be the
transaction system that is used, but for existing tables that are being
altered the table will still need to be migrated to the target system.
transactional_mode can be set to full, mixed_reads, and
off. off means that Paxos will be used and transaction statements
will be rejected. full means that all reads and writes will execute on
Accord. mixed_reads means that all writes will execute on Accord
along with SERIAL reads/writes, but non-SERIAL reads/writes will
execute on the existing eventually consistent path. Applying the
mutations for blocking read repair will always be done through Accord in
full in and mixed_reads.
transactional_migration_from indicates whether a migration is
currently in progress although it does not indicate which ranges are
actively being migrated. This is set automatically when you create a
table or alter transactional_mode and should not be set manually.
It’s possible to manually set transactional_migration_from to
force the completion of migration without actually running the necessary
migration steps.
transactional_migration_from can be set to none, off,
full, and mixed_reads. off, full, and mixed_reads
correspond to the transactional_mode being migrated away from and
none indicates that no migration is in progress either because the
migration has completed or because the table was created with its
current transactional_mode.
mixed_reads vs full
When Accord is running with transactional_mode full it will be
able to perform asynchronous commit saving a WAN roundtrip.
mixed_reads allows non-SERIAL reads to continue to execute using
the original eventually consistent read path. mixed_reads, unlikes
full, always requires Accord to always synchronously commit at the
requested consistency level in order to make acknowledged Accord writes
visible to non-SERIAL reads.
There is no transactional_mode that allows non-SERIAL writes
because they break Accord’s transaction recovery resulting in
transactions appearing to have different outcomes at different nodes.
Accord repair
Repair can now include an optional Accord repair that nodetool repair
will enable by default like Paxos repair. This repair doesn’t actually
synchronize any data it just runs a transaction that checks that Accord
has resolved the state of all transactions in the repaired range up to
the point the transaction was created and that the transactions are
applied at ALL.
Accord is normally doing this in the background anyways this just
ensures that it has occurred at ALL and hasn’t experienced any delays.
Migration to Accord
Migrating an existing table to run on Accord starts by altering the table:
ALTER TABLE foo WITH transactional_mode = 'full'
After the table is altered it is required to run
nodetool consensus_admin begin-migration on ranges in the table
unless accord.range_migration=auto.
When a range is initially marked migrating to Accord all non-SERIAL
writes will execute on Accord while SERIAL writes will continue to
execute on Paxos. non-SERIAL writes include regular writes, logged and
unlogged batches, hints, and read repair. Accord will perform
synchronous commit the specified consistency level requiring 2x WAN RTT.
Tables that are migrating or are partially migrated to Accord (or back to Paxos) can be listed using
nodetool consensus_admin list or the sytem table system_accord_debug.migration_state.
Migration to Accord consists of two phases with the first phase starting
when a range is marked migrating, and the second phase starting after a
full or incremental data repair, and then the migration completing after
a second repair which must be a full data repair + Paxos repair.
While marking the range as migrating can be done automatically with
accord.range_migration=auto, there is not automation for
triggering the repairs. If you regularly run compatible repairs then the
migration will eventually complete, but if you don’t run them or want
the migration to complete sooner then you will need to either trigger
them manually or invoke nodetool consensus_admin finish-migration
to trigger them.
Any repair that is compatible will drive migration forward whether it
only covers part of the migrating range or whether is started via
nodetool consensus_admin finish-migration or some other external
process that initiates repair. Force repair with down nodes will not be
eligible to drive any type or phase of migration forward. Force repair
with all nodes up will still work.
First phase
In the first phase of migration Accord is unable to safely read
non-SERIAL writes so Paxos continues to be used for SERIAL operations
and Accord executes all writes and synchronously commits at the
requested consistency level in order to allow Paxos to safely read
Accord writes. Accord’s read and write metrics are all counted towards the existing Read and Write scope
along with the eventually consistent operations, but you should also start to see writes also being counted in the AccordWrite scope.
A data repair either incremental or full replicates all non-SERIAL
writes at ALL making it safe for Accord to read non-SERIAL writes that
occurred before the migration started. non-SERIAL writes that occurred
after the migration started were executed through Accord so Accord can
safely read them.
Second phase
In the second phase all reads and writes execute through Accord
(assuming transactional_mode="full"). Before an operation can execute on
Accord it is necessary to run a Paxos key repair in order to ensure that
any uncommitted Paxos transactions are committed and this check will
take at least one extra WAN RTT. Additionally Accord has to read at QUORUM
(where it would normally only read from a single replica in transactional_mode="full" and migration completed) because
Paxos writes are only visible at QUORUM.
All reads and CAS operations in the range should start showing up in the Accord metrics and not the existing metrics.
Once a key has been repaired, the repaired state of the key is stored in a small in-memory cache and system table so that it doesn’t need to be repaired again. This information is only stored at replicas of the key so if the coordinator is not a replica it will not know that it can skip repairing the key. Use token aware routing to avoid redundant key repairs.
A full repair + Paxos repair is necessary to complete the second phase of migration to Accord. An incremental repair can’t currently be used because incremental repair doesn’t include the transactions that are repaired by Paxos repair because it selects the data to include in the repair before running the Paxos repair.
Migration from Accord
Migration from Accord to Paxos occurs in a single phase and begins by
altering the table’s transactional_mode to off and then
optionally marking ranges as migrating as discussed above.
Once a range is marked migrating all operations in the migrating range
will stop executing on Accord. Before each operation occurs they will
have to run an Accord key repair similar to the Paxos key repair to
ensure Accord transactions for that key have committed at QUORUM.
An Accord repair needs to be run on the migrating range, triggered
manually or via nodetool finish-migration, and once that completes
non-SERIAL operations will run using the usual eventually consistent
path and SERIAL operations will execute on Paxos.
Migration commands
All the nodetool migration commands are based on new
StorageServiceMBean JMX methods. These methods are
migrateConsensusProtocol, finishConsensusMigration,
listConsensusMigrations, getAccordManagedKeyspaces, and
getAccordManagedTables and can be used by external management tools to
manage consensus migration. The existing methods for starting repairs
can also be used to start the repairs that are needed to complete
migration.
nodetool consensus_admin list
Invoking nodetool with
consensus_admin list [<keyspace> <tables>…]
will connect to the specified node and retrieve that nodes view of what
tables are currently being migrated from transactional cluster metadata.
Tables that are not being migrated are not listed.
The results can be printed out in several different formats using the
format parameter which supports json, minified-json, yaml, and
minified-yaml.
nodetool consensus_admin begin-migration
Invoking nodetool with
consensus_admin begin-migration [<keyspace> <tables>…]
can be used to mark ranges on a table as migrating. This can only be
done after the migration has been started by altering the tables.
Marking ranges as migrating is a lightweight operation and does not
trigger the repairs that will finish the migration.
The range to mark migrating needs to be explicitly
provided otherwise the entire ring will be marked migrating for the
specified keyspace and tables. If the entire range is marked migrating it is
only necessary to invoke begin-migration on one node.
This is only needed if
accord.default_transactional_mode=explicit is set in
cassandra.yaml otherwise all the ranges will already have been marked
migrating when the alter occurred.
Ranges that are migrating will require at least an extra WAN roundtrip for each request that touches a migrating range because both transaction systems may need to be used to execute the request.
nodetool consensus_admin finish-migration
Invoking nodetool with
consensus_admin finish-migration [<keyspace> <tables>…
will run the repairs needed to complete the migration for the specified
ranges. If no range is specified it will default to the primary range of
the node that nodetool is connecting to so you can call it once on
every node to complete migration.
When migrating from Paxos to Accord it will run an incremental data repair and then a full data repair + Paxos repair. When migrating from Accord to Paxos it will run an Accord repair.
Supported consistency levels
Migration requires support for read and write consistency levels because
Accord ends up being required to read Paxos writes at QUORUM and
Accord needs to execute non-SERIAL writes while Paxos is still being
used for SERIAL writes and thus needs to perform synchronous commit at
the requested consistency level.
Once migration is complete the read and write consistency levels will be
ignored with transactional mode full . With transactional mode
mixed_reads Accord will continue to do synchronous commit and
honor the requested commit/write consistency level.
Accord will always reject any requests to execute at unsupported consistency levels to ensure that migration to/from Accord is always possible.
Supported read consistency levels are ONE, QUORUM, SERIAL, and
ALL. Supported write consistency levels are ANY, ONE, QUORUM,
SERIAL, and ALL. LOCAL, TWO, and THREE are not supported.
ANY is executed as an asynchronous commit similar to Paxos.
non-SERIAL consistency
non-SERIAL operations are linearizable by default when migrated to Accord, but during migration they might not be linearizable due to timestamp handling changes. During migration, normal CQL operations will use the Accord timestamp once a range starts migration, but will fall back to server timestamp when migrating away from Accord.
Hints and batch log replay will use server timestamp and they only come into play during migration as hints and batch log replay won’t be generated when a range is on Accord.
USING TIMESTAMP is allowed and the application of the operations will
occur in a linearizable order, but from the perspective of a reader the
merged result may not appear linearizable.
For BATCH operations, timestamp handling is more complex:
- If the batch uses USING TIMESTAMP, the user timestamp will be used
- If all mutations in a BATCH use USING TIMESTAMP, the user timestamp will be used
- If not using USING TIMESTAMP and all partition keys are on Accord, the Accord timestamp is used
- If the BATCH has some partitions on Accord and others not on Accord, the server timestamp will be used (writes to the Accord table will not be linearizable for multi-table batches where one table is not migrating to Accord)
- If the batch mixes server timestamp and USING TIMESTAMP mutations, the default behavior is to reject the batch, configurable via accord.mixed_time_source_handling with values:
* reject (default) - reject the CQL operation
* log - accept the operation using server timestamp but log operations where writes to the Accord table will not be linearizable
* ignore - accept the operation silently
Logged batches that only touch Accord tables will not be written to the batch log because that functionality is redundant with Accord. Batches that touch both Accord and non-Accord tables will be written to the batch log.
Paging runs a separate transaction per page and does not produce a linearizable result.
Partition range reads are split into multiple transactions during
execution and will not produce a strict serializable result.
Additionally during migration there are no barriers/repairs executed
before partition range reads. When migrating from Accord to Paxos the
effective commit CL for Accord writes as viewed from partition range
reads will be ANY. Adding barriers/repairs before partition range
reads would cause them to time out so they are not done.
Batchlog and hints
Pre-existing batchlog entries and hints will be processed during and after migration until they are completed. If they need to be executed through Accord they will be routed through Accord automatically.
Logged batches that only touch Accord data will not be written to the batch log because that functionality is redundant with Accord. Batches that touch both Accord and non-Accord data are split: the Accord mutations are replayed via the Accord transaction protocol, while non-Accord mutations follow the standard batch log path. If Accord replay fails, hint-based fallback is used.
Hints are not written for Accord writes although the batch log may result in new hints because batch log entries are converted to hints after the first retry.
Operations spanning Accord/non-Accord data
Various operations can access both Accord and non-Accord managed data. These are transparently split into parts that execute on Accord and parts that execute outside of Accord and the results are merged. If the splitting process races with migration then the operations is re-split and retried without surfacing an error to the client.
Partition range read with LIMIT performance
Partition range reads with a limit use more memory and CPU at the nodes
being read from and at the coordinator. Accord splits the ranges owned
by each node into smaller subranges and each subrange is owned by a
command store. The partition range read will execute at every
intersecting command store on a node and each will return LIMIT N
results which are sent back to the coordinator. The coordinator then
merges them and re-applies the limit.
The additional memory and CPU will be amplified proportional to the
number of command stores which defaults to
DatabaseDescriptor.getAvailableProcessors().
Metrics
Accord’s read and write metrics are counted under the existing Read and Write scope along with eventually consistent
operations. To see Accord specific metrics you can look at the AccordRead and AccordWrite scope. CASRead and CASWrite will not track
CAS or SERIAL read operations that end up running on Accord and they will instead show up in AccordRead/AccordWrite and Read/Write.
If a single request ends up running on both systems due to misrouting it
will show up as multiple requests. Misrouted requests are counted under the RetryDifferentSystem meter and will show
up in AccordRead and AccordWrite if Accord was the system the request was misrouted to as well as Read and Write.
If the request was misrouted to non-Accord code then it will show up under Read and Write metrics or CASRead and CASWrite metrics.
Hints can be misrouted and this is tracked in HintsServiceMetrics under the HintsRetryDifferentSystem meter.
Partition range reads can also potentially generate additional Accord transactions depending on how the reads end up having to be split due to intersection with migrating ranges.