Live upgrade from Tarantool 1.6 to 1.10
This page includes explanations and solutions to some common issues when upgrading a replica set from Tarantool 1.6 to 1.10.
Versions later than 1.6 have incompatible .snap and .xlog file formats: 1.6 files are supported during upgrade, but you won’t be able to return to 1.6 after running under 1.10 or 2.x for a while. A few configuration parameters have also been renamed.
To perform a live upgrade from Tarantool 1.6 to a more recent version, such as 2.8.4 or 2.10.1, it is necessary to take an intermediate step by upgrading 1.6 -> 1.10 -> 2.x. This is the only way to perform the upgrade without downtime.
A direct upgrade of a replica set from 1.6 to 2.x is also possible, but only with downtime.
The procedure of live upgrade from 1.6 to 1.10 is similar to the general upgrade procedure, which is as follows:
Pick any read-only instance in the replica set.
Upgrade this replica to the new Tarantool version. See details in Upgrading Tarantool on a standalone instance. This requires stopping the instance for a while, which won’t interrupt the replica set operation. When the upgraded replica is up again, it will synchronize with the other instances in the replica set so that the data are consistent across all the instances.
Make sure the upgraded replica is running and connected to the rest of the replica set. To do this, run box.info.replication in the instance’s console and check the output table for values like upstream, downstream, and lag.
For each instance id, there are upstream and downstream values. Both of them should have the status follow, except on the instance where you run this code. This means that the replicas are connected and there are no errors in the data flow.
The value of the lag field can be less than or equal to box.cfg.replication_timeout, but it can also be moderately larger. For example, if box.cfg.replication_timeout is 1 second and the write load on the master is high, it’s generally OK to have a lag of about 10 seconds on the master. It is up to the user to decide what lag values are fine.
Upgrade all the read-only instances by repeating steps 1–3 until only the master keeps running the old Tarantool version.
Make one of the updated replicas the new master:
- If the replica set is using asynchronous replication without RAFT-based leader elections, first run box.cfg{ read_only = true } on the old master and then run box.cfg{ read_only = false } on the replica that will be the new master.
- If the replica set is using synchronous replication or RAFT-based leader elections, run box.ctl.promote() on the new master and then run box.cfg{ election_mode = 'voter' } on the old master. This will automatically change the read_only statuses on the instances.
- For a Cartridge replica set, it is possible to select the new master in the web UI.
There is no need to restart the new master.
Check that the new master continues following and being followed by all other replicas, similarly to step 3.
Upgrade the former master, which is now a read-only instance.
Run box.schema.upgrade() on the new master. This will update the Tarantool system spaces to match the currently installed version of Tarantool. There is no need to run box.schema.upgrade() on every node: changes are propagated to other nodes via the regular replication mechanism.
Run box.snapshot() on every node in the replica set to make sure that the replicas immediately see the upgraded database state in case of restart.
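The replication check described in the procedure above (every other instance reports the status follow, and the lag stays within a limit you accept) can be sketched as a small Lua helper. This is only an illustration: the function name find_replication_problems and the max_lag threshold are assumptions, not part of Tarantool. The helper also treats a missing upstream or downstream status as a problem, which follows the wording above but may need adjusting for your Tarantool version.

```lua
-- Sketch: list peers whose replication state looks unhealthy.
-- `replication` is expected to have the shape of box.info.replication,
-- `self_id` is the id of the instance where the check runs (box.info.id),
-- `max_lag` is a threshold in seconds chosen by the user (hypothetical).
local function find_replication_problems(replication, self_id, max_lag)
    -- Sort instance ids so that the report order is stable.
    local ids = {}
    for id in pairs(replication) do
        table.insert(ids, id)
    end
    table.sort(ids)

    local problems = {}
    for _, id in ipairs(ids) do
        -- The entry for the current instance has no upstream/downstream.
        if id ~= self_id then
            local peer = replication[id]
            local up, down = peer.upstream, peer.downstream
            if up == nil or up.status ~= 'follow' then
                table.insert(problems, string.format(
                    'instance %d: upstream is %s', id,
                    tostring(up and up.status or 'missing')))
            elseif up.lag ~= nil and up.lag > max_lag then
                table.insert(problems, string.format(
                    'instance %d: lag %.3f s exceeds %.3f s',
                    id, up.lag, max_lag))
            end
            if down == nil or down.status ~= 'follow' then
                table.insert(problems, string.format(
                    'instance %d: downstream is %s', id,
                    tostring(down and down.status or 'missing')))
            end
        end
    end
    return problems
end

-- In the Tarantool console this could be called as:
-- find_replication_problems(box.info.replication, box.info.id, 10)
```

An empty result means that every other instance is followed and following, and that no upstream lag exceeds the chosen threshold.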
What’s different when upgrading from Tarantool 1.6:
Step 2: Tarantool 1.10+ fails to recover from 1.6 xlogs, unless box.cfg{force_recovery = true} is set.
There is a small difference between 1.6 and 1.10 xlogs, which makes 1.6 xlogs appear erroneous to 1.10+ instances. To work around this, start the instance in force_recovery mode. To do so, add the line force_recovery = true to the box.cfg{} call in the file where the instance is initialized (for example, init.lua).
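As an illustration, an instance file with force_recovery enabled might look like the fragment below. This is a minimal sketch: the listen port is a placeholder, and your existing box.cfg{} options should stay as they are.

```lua
-- init.lua (sketch): only force_recovery comes from the text above,
-- the other option is a placeholder for your existing configuration.
box.cfg{
    listen = 3301,          -- placeholder: keep your own settings here
    force_recovery = true,  -- lets 1.10+ recover from 1.6 xlogs
}
```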
Step 3: New Tarantool nodes follow 1.6 nodes just fine, but some 1.6 nodes might disconnect from new nodes with an ER_LOADING error. This is not critical: the error goes away when replication on the 1.6 node is restarted:
old_repl = box.cfg.replication
box.cfg{replication = ""}
box.cfg{replication = old_repl}
Step 7: There was a breaking change between 1.6 and 1.10: in 1.6, the field type num was an alias to number, and in 1.10, num is converted to unsigned.
This means that after box.schema.upgrade() is performed on the master, the user might have some spaces with unsigned fields containing non-unsigned values: double, int, and so on. This will make the snapshot inconsistent, unless an extra action is performed after box.schema.upgrade().
Run this code in the Tarantool console on the new master:
-- First find all spaces containing unsigned fields with non-unsigned values in them.
-- Say, we have one such space denoted problematic_space and the problem is in field problematic_field_no.
a = box.space.problematic_space:format()
a[problematic_field_no].type = 'number'
box.space.problematic_space:format(a)
Once this is performed on the master, it’s safe to proceed to step 8, making a snapshot on every node.
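The first comment in the snippet above leaves the search for problematic spaces to the reader. One possible way to do it is sketched below, under stated assumptions: the helper names looks_unsigned and find_problematic_fields are hypothetical, the check runs a full scan of the space for every unsigned field (which can be slow on large spaces), and a value stored as a double with an integral value cannot be told apart from an integer on the Lua side, so such values go unnoticed.

```lua
-- Sketch: does a Lua value look like a valid unsigned integer?
local function looks_unsigned(v)
    if type(v) == 'number' then
        return v >= 0 and v == math.floor(v)
    end
    if type(v) == 'cdata' then
        -- 64-bit integers are cdata in Tarantool's LuaJIT; pcall guards
        -- against cdata types that cannot be compared with a number.
        local ok, res = pcall(function() return v >= 0 end)
        return ok and res
    end
    return false
end

-- Sketch: return the numbers of fields that are declared unsigned in the
-- space format but hold at least one value that is not unsigned.
-- `space` is expected to behave like a space object (box.space.<name>).
local function find_problematic_fields(space)
    local bad = {}
    for field_no, field in ipairs(space:format()) do
        if field.type == 'unsigned' then
            for _, tuple in space:pairs() do
                local v = tuple[field_no]
                -- Absent values are skipped: they are a separate issue.
                if v ~= nil and not looks_unsigned(v) then
                    table.insert(bad, field_no)
                    break
                end
            end
        end
    end
    return bad
end

-- In the Tarantool console this could be called as:
-- find_problematic_fields(box.space.problematic_space)
```

Each returned field number is a candidate for problematic_field_no in the snippet above.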
Step 8: The user might be concerned with snapshot size in 1.10: it’s drastically smaller than the one created by 1.6 (for example, ~300 MB vs. 6 GB in some corner cases). There is nothing to worry about. Tarantool 1.6 didn’t compress snapshots, while Tarantool 1.10 and above do.