Developer’s guide

For a quick start, skip the details below and jump right away to the Cartridge getting started guide.

For a deep dive into what you can develop with Tarantool Cartridge, go on with the Cartridge developer’s guide.

Introduction

To develop and start an application, you need to go through the following steps:

Install Tarantool Cartridge and other components of the development environment.
Create a project.
Develop the application. In case it is a cluster-aware application, implement its logic in a custom (user-defined) cluster role to initialize the database in a cluster environment.
Deploy the application to target server(s). This includes configuring and starting the instance(s).
In case it is a cluster-aware application, deploy the cluster.

The following sections provide details for each of these steps.

Installing Tarantool Cartridge

Install cartridge-cli, a command-line tool for developing, deploying, and managing Tarantool applications.

Important

cartridge-cli is deprecated in favor of the tt CLI utility. This guide uses cartridge-cli as the native tool for developing Cartridge applications. However, we encourage you to switch to tt to simplify the migration to Tarantool 3.0 and newer versions.

Install git, a version control system.
Install npm, a package manager for node.js.
Install the unzip utility.

Creating a project

To set up your development environment, create a project using the Tarantool Cartridge project template. In any directory, run:

$ cartridge create --name <app_name> /path/to/

This will automatically set up a Git repository in a new /path/to/<app_name>/ directory, tag it with version 0.1.0, and put the necessary files into it.

In this Git repository, you can develop the application (by simply editing the default files provided by the template), plug the necessary modules, and then easily pack everything to deploy on your server(s).

The project template creates the <app_name>/ directory with the following contents:

<app_name>-scm-1.rockspec file where you can specify the application dependencies.
deps.sh script that resolves dependencies from the .rockspec file.
init.lua file which is the entry point for your application.
.git directory necessary for a Git repository.
.gitignore file to ignore the unnecessary files.
env.lua file that sets common rock paths so that the application can be started from any directory.
custom-role.lua file that is a placeholder for a custom (user-defined) cluster role.

The entry point file (init.lua), among other things, loads the cartridge module and calls its initialization function:

...
local cartridge = require('cartridge')
...
cartridge.cfg({
    -- cartridge options example
    workdir = '/var/lib/tarantool/app',
    advertise_uri = 'localhost:3301',
    cluster_cookie = 'super-cluster-cookie',
    ...
}, {
    -- box options example
    memtx_memory = 1000000000,
    ...
})
...

The cartridge.cfg() call renders the instance operable via the administrative console but does not call box.cfg() to configure instances.

Warning

Calling the box.cfg() function is forbidden.

The cluster itself will do it for you when it is time to:

bootstrap the current instance once you:
- run cartridge.bootstrap() via the administrative console, or
- click Create in the web interface;
join the instance to an existing cluster once you:
- run cartridge.join_server({uri = 'other_instance_uri'}) via the console, or
- click Join (an existing replica set) or Create (a new replica set) in the web interface.

Notice that you can specify a cookie for the cluster (cluster_cookie parameter) if you need to run several clusters in the same network. The cookie can be any string value.

Now you can develop an application that will run on a single or multiple independent Tarantool instances (e.g. acting as a proxy to third-party databases) – or will run in a cluster.

If you plan to develop a cluster-aware application, first familiarize yourself with the notion of cluster roles.

Cluster roles

Cluster roles are Lua modules that implement some specific functions and/or logic. In other words, a Tarantool Cartridge cluster segregates instance functionality in a role-based way.

Since all instances running cluster applications use the same source code and are aware of all the defined roles (and plugged modules), you can dynamically enable and disable multiple different roles without restarts, even during cluster operation.

Note that every instance in a replica set performs the same roles and you cannot enable/disable roles individually on some instances. In other words, configuration of enabled roles is set up per replica set. See a step-by-step configuration example in this guide.

Built-in roles

The cartridge module comes with two built-in roles that implement automatic sharding:

vshard-router that handles the vshard’s compute-intensive workload: routes requests to storage nodes.
vshard-storage that handles the vshard’s transaction-intensive workload: stores and manages a subset of a dataset.

Note

For more information on sharding, see the vshard module documentation.

With the built-in and custom roles, you can develop applications with separated compute and transaction handling – and enable relevant workload-specific roles on different instances running on physical servers with workload-dedicated hardware.

Custom roles

You can implement custom roles for any purposes, for example:

define stored procedures;
implement extra features on top of vshard;
go without vshard at all;
implement one or multiple supplementary services such as e-mail notifier, replicator, etc.

To implement a custom cluster role, do the following:

Take the app/roles/custom.lua file in your project as a sample. Rename this file as you wish, e.g. app/roles/custom-role.lua, and implement the role’s logic. For example:
```
-- Implement a custom role in app/roles/custom-role.lua
local role_name = 'custom-role'

local function init()
...
end

local function stop()
...
end

return {
    role_name = role_name,
    init = init,
    stop = stop,
}
```
Here the role_name value may differ from the module name passed to the cartridge.cfg() function. If the role_name variable is not specified, the module name is the default value.

Note

Role names must be unique as it is impossible to register multiple roles with the same name.

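Register the new role in the cluster by adding it to the roles list of the cartridge.cfg() call in the init.lua entry point file:

-- Register a custom role in init.lua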
...
local cartridge = require('cartridge')
...
cartridge.cfg({
    workdir = ...,
    advertise_uri = ...,
    roles = {'custom-role'},
})
...

where custom-role is the name of the Lua module to be loaded.

The role module does not have required functions, but the cluster may execute the following ones during the role’s life cycle:

init() is the role’s initialization function.

Inside the function’s body you can call any box functions: create spaces, indexes, grant permissions, etc. Here is what the initialization function may look like:

 local function init(opts)
     -- The cluster passes an 'opts' Lua table containing an 'is_master' flag.
     if opts.is_master then
         local customer = box.schema.space.create('customer',
             { if_not_exists = true }
         )
         customer:format({
             {'customer_id', 'unsigned'},
             {'bucket_id', 'unsigned'},
             {'name', 'string'},
         })
         customer:create_index('customer_id', {
             parts = {'customer_id'},
             if_not_exists = true,
         })
     end
 end

Note

Neither vshard-router nor vshard-storage manage spaces, indexes, or formats. You should do it within a custom role: add a box.schema.space.create() call to your first cluster role, as shown in the example above.
The function’s body is wrapped in a conditional statement that lets you call box functions on masters only. This protects against replication collisions as data propagates to replicas automatically.

stop() is the role’s termination function. Implement it if initialization starts a fiber that has to be stopped or does any job that needs to be undone on termination.
validate_config() and apply_config() are functions that validate and apply the role’s configuration. Implement them if some configuration data needs to be stored cluster-wide.

Next, get a grip on the role’s life cycle to implement the functions you need.

Defining role dependencies

You can instruct the cluster to apply some other roles if your custom role is enabled.

For example:

-- Role dependencies defined in app/roles/custom-role.lua
local role_name = 'custom-role'
...
return {
    role_name = role_name,
    dependencies = {'cartridge.roles.vshard-router'},
    ...
}

Here the vshard-router role is initialized automatically on every instance with custom-role enabled.

Using multiple vshard storage groups

Replica sets with vshard-storage roles can belong to different groups, for example, hot and cold groups meant to process hot and cold data independently.

Groups are specified in the cluster’s configuration:

-- Specify groups in init.lua
cartridge.cfg({
    vshard_groups = {'hot', 'cold'},
    ...
})

If no groups are specified, the cluster assumes that all replica sets belong to the default group.

With multiple groups enabled, every replica set with a vshard-storage role enabled must be assigned to a particular group. The assignment can never be changed.

Another limitation is that you cannot add groups dynamically (this will become available in the future).

Finally, mind the syntax for router access. Every instance with a vshard-router role enabled initializes multiple routers. All of them are accessible through the role:

local router_role = cartridge.service_get('vshard-router')
router_role.get('hot'):call(...)

If you have no groups specified, you can access a static router as before (when Tarantool Cartridge was unaware of groups):

local vshard = require('vshard')
vshard.router.call(...)

However, when using the current group-aware API, you must call a static router with a colon:

local router_role = cartridge.service_get('vshard-router')
local default_router = router_role.get() -- or router_role.get('default')
default_router:call(...)

Role’s life cycle (and the order of function execution)

The cluster displays the names of all custom roles along with the built-in vshard-* roles in the web interface. Cluster administrators can enable and disable them for particular replica sets, either via the web interface or via the cluster public API. For example:

cartridge.admin.edit_replicaset('replicaset-uuid', {roles = {'vshard-router', 'custom-role'}})

If you enable multiple roles on an instance at the same time, the cluster first initializes the built-in roles (if any) and then the custom ones (if any) in the order the latter were listed in cartridge.cfg().

If a custom role has dependent roles, the dependencies are registered and validated first, prior to the role itself.
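The "dependencies first" rule can be pictured with a small, self-contained sketch. This is a toy model, not Cartridge's implementation: the roles table, the reporting role, and the init_order() helper are hypothetical and only show how an order that respects dependencies could be computed.

```lua
-- Toy model: compute an order in which every role comes after
-- the roles it depends on. All names here are hypothetical.
local roles = {
    ['vshard-router'] = {dependencies = {}},
    ['custom-role'] = {dependencies = {'vshard-router'}},
    ['reporting'] = {dependencies = {'custom-role'}},
}

local function init_order(role_map, enabled)
    local order, visited = {}, {}
    local function visit(name)
        if visited[name] == 'done' then
            return
        end
        if visited[name] == 'in_progress' then
            error('circular dependency at role ' .. name)
        end
        local role = role_map[name]
        if role == nil then
            error('unknown role ' .. name)
        end
        visited[name] = 'in_progress'
        -- Handle dependencies before the role itself
        for _, dep in ipairs(role.dependencies or {}) do
            visit(dep)
        end
        visited[name] = 'done'
        table.insert(order, name)
    end
    for _, name in ipairs(enabled) do
        visit(name)
    end
    return order
end

local order = init_order(roles, {'reporting'})
print(table.concat(order, ' -> '))
```

Enabling only the hypothetical reporting role pulls in custom-role and vshard-router ahead of it.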

The cluster calls the role’s functions in the following circumstances:

The init() function, typically, once: either when the role is enabled by the administrator or at the instance restart. Enabling a role once is normally enough.
The stop() function – only when the administrator disables the role, not on instance termination.
The validate_config() function, first, before the automatic box.cfg() call (database initialization), then – upon every configuration update.
The apply_config() function upon every configuration update.

As a tryout, let’s task the cluster with some actions and see the order of executing the role’s functions:

Join an instance or create a replica set, both with an enabled role:
1. validate_config()
2. init()
3. apply_config()
Restart an instance with an enabled role:
1. validate_config()
2. init()
3. apply_config()
Disable role: stop().
Upon the cartridge.confapplier.patch_clusterwide() call:
1. validate_config()
2. apply_config()
Upon a triggered failover:
1. validate_config()
2. apply_config()
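To see the call order without a running cluster, you can record the callbacks in plain Lua. The sketch below is only an illustration: simulate_start() and simulate_update() are hypothetical stand-ins for what the cluster does on join/restart and on a configuration update, not Cartridge functions.

```lua
-- Toy driver that records the order of role callbacks.
-- The simulate_* helpers are hypothetical stand-ins for the cluster.
local calls = {}

local role = {
    role_name = 'custom-role',
    validate_config = function(conf_new, conf_old)
        table.insert(calls, 'validate_config')
        return true
    end,
    init = function(opts)
        table.insert(calls, 'init')
    end,
    apply_config = function(conf, opts)
        table.insert(calls, 'apply_config')
    end,
    stop = function()
        table.insert(calls, 'stop')
    end,
}

-- Join or restart: validate, initialize, then apply.
local function simulate_start(r, conf, opts)
    assert(r.validate_config(conf, {}))
    r.init(opts)
    r.apply_config(conf, opts)
end

-- Configuration update or failover: validate, then apply.
local function simulate_update(r, conf_new, conf_old, opts)
    assert(r.validate_config(conf_new, conf_old))
    r.apply_config(conf_new, opts)
end

simulate_start(role, {}, {is_master = true})
simulate_update(role, {['custom-role'] = {}}, {}, {is_master = true})
role.stop() -- the administrator disables the role

print(table.concat(calls, ', '))
```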

Considering the described behavior:

The init() function may:
- Call box functions.
- Start a fiber and, in this case, the stop() function should take care of the fiber’s termination.
- Configure the built-in HTTP server.
- Execute any code related to the role’s initialization.
The stop() function must undo any job that needs to be undone on role’s termination.
The validate_config() function must validate any configuration change.
The apply_config() function may execute any code related to a configuration change, e.g., take care of an expirationd fiber.

The validation and application functions together allow you to change the cluster-wide configuration as described in the next section.

Configuring custom roles

You can:

Store configurations for your custom roles as sections in cluster-wide configuration, for example:

# in YAML configuration file
my_role:
  notify_url: "https://localhost:8080"

-- in init.lua file
local my_role = {}
local notify_url = 'http://localhost'
function my_role.apply_config(conf, opts)
    -- Read this role's section of the cluster-wide configuration
    local role_conf = conf['my_role'] or {}
    notify_url = role_conf.notify_url or 'default'
end

Download and upload cluster-wide configuration using the web interface or API (via GET/PUT queries to admin/config endpoint like curl localhost:8081/admin/config and curl -X PUT -d "{'my_parameter': 'value'}" localhost:8081/admin/config).
Utilize it in your role’s apply_config() function.

Every instance in the cluster stores a copy of the configuration file in its working directory (configured by cartridge.cfg({workdir = ...})):

/var/lib/tarantool/<instance_name>/config.yml for instances deployed from RPM packages and managed by systemd.
/home/<username>/tarantool_state/var/lib/tarantool/config.yml for instances deployed from tar+gz archives.

The cluster’s configuration is a Lua table, downloaded and uploaded as YAML. If some application-specific configuration data, e.g. a database schema as defined by DDL (data definition language), needs to be stored on every instance in the cluster, you can implement your own API by adding a custom section to the table. The cluster will help you spread it safely across all instances.

Such a section goes in the same file with the topology-specific and vshard-specific sections that the cluster generates automatically. Unlike the generated sections, the custom section’s modification, validation, and application logic has to be defined.

The common way is to define two functions:

validate_config(conf_new, conf_old) to validate changes made in the new configuration (conf_new) versus the old configuration (conf_old).
apply_config(conf, opts) to execute any code related to a configuration change. As input, this function takes the configuration to apply (conf, which is actually the new configuration that you validated earlier with validate_config()) and options (the opts argument that includes is_master, a Boolean flag described later).

Important

The validate_config() function must detect all configuration problems that may lead to apply_config() errors. For more information, see the next section.
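As an illustration of a strict validator, here is a minimal sketch for the my_role section from the YAML example earlier in this section. The particular checks (type and URL scheme of notify_url) are assumptions made for the example; a real role should check whatever its own apply_config() relies on.

```lua
-- Sketch of a strict validate_config() for the 'my_role' section.
-- The checks below are illustrative assumptions, not a Cartridge API.
local function validate_config(conf_new, conf_old)
    local section = conf_new['my_role']
    if section == nil then
        return true -- nothing to validate, defaults will be used
    end
    if type(section) ~= 'table' then
        error('my_role must be a table')
    end
    local url = section.notify_url
    if url ~= nil then
        if type(url) ~= 'string' then
            error('my_role.notify_url must be a string')
        end
        if not url:match('^https?://') then
            error('my_role.notify_url must start with http:// or https://')
        end
    end
    return true
end

print(validate_config({my_role = {notify_url = 'https://localhost:8080'}}, {}))
```

Rejecting a bad value here keeps apply_config() from ever seeing it.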

When implementing validation and application functions that call box ones for some reason, mind the following precautions:

Due to the role’s life cycle, the cluster does not guarantee an automatic box.cfg() call prior to calling validate_config().

If the validation function calls any box functions (e.g., to check a format), make sure the calls are wrapped in a protective conditional statement that checks if box.cfg() has already happened:
```
-- Inside the validate_config() function:
if type(box.cfg) == 'table' then
    -- Here you can call box functions
end
```
Unlike the validation function, apply_config() can call box functions freely as the cluster applies custom configuration after the automatic box.cfg() call.

However, creating spaces, users, etc., can cause replication collisions when performed on both master and replica instances simultaneously. The appropriate way is to call such box functions on masters only and let the changes propagate to replicas automatically.

Upon the apply_config(conf, opts) execution, the cluster passes an is_master flag in the opts table which you can use to wrap collision-inducing box functions in a protective conditional statement:
```
-- Inside the apply_config() function:
if opts.is_master then
    -- Here you can call box functions
end
```

Custom configuration example

Consider the following code as part of the role’s module (custom-role.lua) implementation:

-- Custom role implementation

local cartridge = require('cartridge')

local role_name = 'custom-role'

-- Modify the config by implementing some setter (an alternative to HTTP PUT)
local function set_secret(secret)
    local custom_role_cfg = cartridge.confapplier.get_deepcopy(role_name) or {}
    custom_role_cfg.secret = secret
    cartridge.confapplier.patch_clusterwide({
        [role_name] = custom_role_cfg,
    })
end
-- Validate
local function validate_config(cfg)
    local custom_role_cfg = cfg[role_name] or {}
    if custom_role_cfg.secret ~= nil then
        assert(type(custom_role_cfg.secret) == 'string', 'custom-role.secret must be a string')
    end
    return true
end
-- Apply
local function apply_config(cfg)
    local custom_role_cfg = cfg[role_name] or {}
    local secret = custom_role_cfg.secret or 'default-secret'
    -- Make use of it
end

return {
    role_name = role_name,
    set_secret = set_secret,
    validate_config = validate_config,
    apply_config = apply_config,
}

Once the configuration is customized, do one of the following:

continue developing your application and pay attention to its versioning;
(optional) enable authorization in the web interface;
in case the cluster is already deployed, apply the configuration cluster-wide.

Applying custom role’s configuration

With the implementation shown in the example, you can call the set_secret() function to apply the new configuration via the administrative console – or an HTTP endpoint if the role exports one.

The set_secret() function calls cartridge.confapplier.patch_clusterwide() which performs a two-phase commit:

It patches the active configuration in memory: copies the table and replaces the "custom-role" section in the copy with the one given by the set_secret() function.
The cluster checks if the new configuration can be applied on all instances except disabled and expelled. All instances subject to update must be healthy and alive according to the membership module.
(Preparation phase) The cluster propagates the patched configuration. Every instance validates it with the validate_config() function of every registered role. Depending on the validation’s result:
- If successful (i.e., returns true), the instance saves the new configuration to a temporary file named config.prepare.yml within the working directory.
- (Abort phase) Otherwise, the instance reports an error and all the other instances roll back the update: remove the file they may have already prepared.
(Commit phase) Upon successful preparation of all instances, the cluster commits the changes. Every instance:
1. Creates a hard link to the active configuration.
2. Atomically replaces the active configuration file with the prepared one. The atomic replacement is indivisible – it can either succeed or fail entirely, never partially.
3. Calls the apply_config() function of every registered role.

If any of these steps fail, an error pops up in the web interface next to the corresponding instance. The cluster does not handle such errors automatically; they require manual repair.

You will avoid the repair if the validate_config() function can detect all configuration problems that may lead to apply_config() errors.
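The prepare/abort/commit flow can be modeled in a few lines of plain Lua. This is a toy model under simplifying assumptions, not Cartridge's implementation: instances are plain tables, the prepared field stands in for config.prepare.yml, backup stands in for the hard link, and new_instance() is a hypothetical helper.

```lua
-- Toy model of the prepare/abort/commit flow; all names are hypothetical.
local function two_phase_commit(instances, new_conf)
    -- Preparation phase: every instance validates and stages the config.
    for _, inst in ipairs(instances) do
        local ok, res = pcall(inst.validate_config, new_conf, inst.active)
        if not ok or res ~= true then
            -- Abort phase: drop whatever has been staged so far.
            for _, other in ipairs(instances) do
                other.prepared = nil
            end
            return false, string.format('%s: %s', inst.name, tostring(res))
        end
        inst.prepared = new_conf
    end
    -- Commit phase: keep a backup, swap the staged config in, apply it.
    for _, inst in ipairs(instances) do
        inst.backup = inst.active
        inst.active, inst.prepared = inst.prepared, nil
        inst.apply_config(inst.active)
    end
    return true
end

local function new_instance(name)
    local inst = {name = name, active = {}, applied = 0}
    inst.validate_config = function(conf_new, conf_old)
        assert(type(conf_new.secret) == 'string', 'secret must be a string')
        return true
    end
    inst.apply_config = function(conf)
        inst.applied = inst.applied + 1
    end
    return inst
end

local cluster = {new_instance('s1'), new_instance('s2')}
local ok_good = two_phase_commit(cluster, {secret = 'qwerty'})
local ok_bad, err_bad = two_phase_commit(cluster, {secret = 42})
print(ok_good, ok_bad, err_bad)
```

In this model, a failed validation leaves every instance on its previous active configuration, which is the property the real preparation phase is designed to provide.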

Using the built-in HTTP server

The cluster launches an httpd server instance during initialization (cartridge.cfg()). You can bind a port to the instance via an environment variable:

-- Get the port from an environment variable or the default one:
local http_port = os.getenv('HTTP_PORT') or '8080'

local ok, err = cartridge.cfg({
    ...
    -- Pass the port to the cluster:
    http_port = http_port,
    ...
})

To make use of the httpd instance, access it and configure routes inside the init() function of some role, e.g. a role that exposes API over HTTP:

local function init(opts)

...

    -- Get the httpd instance:
    local httpd = cartridge.service_get('httpd')
    if httpd ~= nil then
        -- Configure a route to, for example, metrics:
        httpd:route({
            method = 'GET',
            path = '/metrics',
            public = true,
        },
        function(req)
            return req:render({json = stat.stat()})
        end
        )
    end
end

For more information on using Tarantool’s HTTP server, see its documentation.

Implementing authorization in the web interface

To implement authorization in the web interface of every instance in a Tarantool cluster:

Implement a new, say, auth module with a check_password function. It should check the credentials of any user trying to log in to the web interface.

The check_password function accepts a username and password and returns an authentication success or failure.

-- auth.lua

-- Add a function to check the credentials
local function check_password(username, password)

    -- Check the credentials any way you like and keep the result:
    local ok = false -- placeholder; replace with the actual check

    -- Return an authentication success or failure
    if not ok then
        return false
    end
    return true
end
...

Pass the implemented auth module name as a parameter to cartridge.cfg(), so the cluster can use it:
```
-- init.lua

local ok, err = cartridge.cfg({
    auth_backend_name = 'auth',
    -- The cluster will automatically call 'require()' on the 'auth' module.
    ...
})
```
This adds a Log in button to the upper right corner of the web interface but still lets users who are not logged in interact with the interface. This is convenient for testing.

Note

Also, to authorize requests to the cluster API, you can use the HTTP basic authorization header.

To require the authorization of every user in the web interface even before the cluster bootstrap, add the following line:
```
-- init.lua

local ok, err = cartridge.cfg({
    auth_backend_name = 'auth',
    auth_enabled = true,
    ...
})
```
With the authentication enabled and the auth module implemented, the user will not be able to even bootstrap the cluster without logging in. After the successful login and bootstrap, the authentication can be enabled and disabled cluster-wide in the web interface and the auth_enabled parameter is ignored.
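For illustration, a complete check_password() could look like the sketch below. The hard-coded users table and its single admin entry are hypothetical placeholders for a real credentials store, and plain-text passwords are used only to keep the example short.

```lua
-- auth.lua (sketch): a minimal credentials check.
-- The 'users' table is a hypothetical stand-in for a real credentials
-- store; do not keep plain-text passwords in production.
local users = {
    admin = 'secret-admin-password',
}

local function check_password(username, password)
    if type(username) ~= 'string' or type(password) ~= 'string' then
        return false
    end
    local expected = users[username]
    return expected ~= nil and expected == password
end

local auth = {
    check_password = check_password,
}

print(auth.check_password('admin', 'secret-admin-password'))
```

In an actual auth.lua module file, the last statement would be return auth, so that the cluster's require() call receives the table.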

Application versioning

Tarantool Cartridge understands semantic versioning as described at semver.org. When developing an application, create new Git branches and tag them appropriately. These tags are used to calculate version increments for subsequent packing.

For example, if your application has version 1.2.1, tag your current branch with 1.2.1 (annotated or not).

To retrieve the current version from Git, run:

$ git describe --long --tags
1.2.1-12-g74864f2

This output shows that we are 12 commits after the version 1.2.1. If we are to package the application at this point, it will have a full version of 1.2.1-12 and its package will be named <app_name>-1.2.1-12.rpm.
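The mapping from the git describe output to the package version can be sketched in Lua as follows. The package_version() helper and the myapp application name are hypothetical, and the pattern assumes a plain MAJOR.MINOR.PATCH tag; this is not how cartridge-cli is implemented, only an illustration of the naming rule.

```lua
-- Derive the package version from `git describe --long --tags` output.
-- Assumes the tag is a plain MAJOR.MINOR.PATCH semantic version.
local function package_version(describe_output)
    local tag, commits =
        describe_output:match('^(%d+%.%d+%.%d+)%-(%d+)%-g%x+$')
    if tag == nil then
        return nil, 'unexpected git describe output: ' .. describe_output
    end
    return tag .. '-' .. commits
end

local version = package_version('1.2.1-12-g74864f2')
-- 'myapp' is a hypothetical application name
local package_name = string.format('%s-%s.rpm', 'myapp', version)
print(package_name)
```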

Non-semantic tags are prohibited. You will not be able to create a package from a branch with the latest tag being non-semantic.

Once you package your application, the version is saved in a VERSION file in the package root.

Using .cartridge.ignore files

You can add a .cartridge.ignore file to your application repository to exclude particular files and/or directories from package builds.

For the most part, the logic is similar to that of .gitignore files. The major difference is that in .cartridge.ignore files the order of exceptions relative to the rest of the patterns does not matter, while in .gitignore files the order does matter.

.cartridge.ignore entry	ignores
`target/`	every folder (due to the trailing `/`) named `target`, recursively
`target`	every file or folder named `target`, recursively
`/target`	every file or folder named `target` in the top-most directory (due to the leading `/`)
`/target/`	every folder named `target` in the top-most directory (leading and trailing `/`)
`*.class`	every file or folder ending with `.class`, recursively
`#comment`	nothing, this is a comment (the first character is a `#`)
`\#comment`	every file or folder named `#comment` (`\` for escaping)
`target/logs/`	every folder named `logs` that is a subdirectory of a folder named `target`
`target/*/logs/`	every folder named `logs` two levels under a folder named `target` (`*` doesn’t include `/`)
`target/**/logs/`	every folder named `logs` somewhere under a folder named `target` (`**` includes `/`)
`*.py[co]`	every file or folder ending in `.pyc` or `.pyo`; however, it doesn’t match `.py!`
`*.py[!co]`	every file or folder ending in `.py` followed by any single character other than `c` or `o`
`*.file[0-9]`	every file or folder ending in `.file` followed by a digit
`*.file[!0-9]`	every file or folder ending in `.file` followed by any single character other than a digit
`*`	everything
`/*`	everything in the top-most directory (due to the leading `/`)
`**/*.tar.gz`	every `*.tar.gz` file or folder that is one or more levels under the starting folder
`!file`	nothing; a file or folder named `file` is not ignored even if it matches other patterns

Failover architecture

An important concept in cluster topology is appointing a leader. A leader is an instance that is responsible for performing key operations. To keep things simple, you can think of a leader as the only writable master. Every replica set has its own leader, and there’s usually not more than one.

Which instance will become a leader depends on topology settings and failover configuration. For more information about the failover configuration in Cartridge, see Enabling automatic failover.

An important topology parameter is the failover priority within a replica set. This is an ordered list of instances. By default, the first instance in the list becomes a leader, but with the failover enabled it may be changed automatically if the first one is malfunctioning.

Instance configuration upon a leader change

When Cartridge configures roles, it takes into account the leadership map (consolidated in the failover.lua module). The leadership map is composed when the instance enters the ConfiguringRoles state for the first time. Later the map is updated according to the failover mode.

Every change in the leadership map is accompanied by instance re-configuration. When the map changes, Cartridge updates the read_only setting and calls the apply_config callback for every role. It also specifies the is_master flag (which actually means is_leader, but hasn’t been renamed yet due to historical reasons).

It’s important to say that we discuss a distributed system where every instance has its own opinion. Even if all opinions coincide, there still may be races between instances, and you (as an application developer) should take them into account when designing roles and their interaction.
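Since apply_config() may be called repeatedly and the is_master flag may flip between calls, it helps to make the callback idempotent. The sketch below assumes a role that must run some background job on the leader only; start_worker() and stop_worker() are hypothetical placeholders for that job.

```lua
-- Sketch: an apply_config() that tolerates repeated calls and
-- leadership changes. start_worker/stop_worker are hypothetical
-- placeholders for whatever must run on the leader only.
local worker_running = false

local function start_worker()
    worker_running = true
end

local function stop_worker()
    worker_running = false
end

local function apply_config(conf, opts)
    if opts.is_master and not worker_running then
        start_worker()
    elseif not opts.is_master and worker_running then
        stop_worker()
    end
    return true
end

-- The cluster may call apply_config() many times with the same flag.
apply_config({}, {is_master = true})
apply_config({}, {is_master = true})
local running_as_leader = worker_running

-- After a leadership change the same instance becomes a replica.
apply_config({}, {is_master = false})
local running_as_replica = worker_running

print(running_as_leader, running_as_replica)
```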

Leader appointment rules

The logic behind leader election depends on the failover mode: disabled, eventual, or stateful.

Disabled mode

This is the simplest case. The leader is always the first instance in the failover priority. No automatic switching is performed. When it’s dead, it’s dead.

Eventual failover

In the eventual mode, the leader isn’t elected consistently. Instead, every instance in the cluster thinks that the leader is the first healthy instance in the failover priority list, while instance health is determined according to the membership status (the SWIM protocol). This mode is not recommended for large production clusters. If you have a high-load production cluster, use stateful failover with etcd instead.

The member is considered healthy if both are true:

It reports either ConfiguringRoles or RolesConfigured state;
Its SWIM status is either alive or suspect.

A suspect member becomes dead after the failover_timeout expires.

Leader election is done as follows. Suppose there are two replica sets in the cluster:

a single router “R”,
two storages, “S1” and “S2”.

Then we can say: all the three instances (R, S1, S2) agree that S1 is the leader.

The SWIM protocol guarantees that eventually all instances will find a common ground, but it’s not guaranteed for every intermediate moment of time. So we may get a conflict.

For example, soon after S1 goes down, R is already informed and thinks that S2 is the leader, but S2 hasn’t received the gossip yet and still thinks it is not. This is a conflict.

Similarly, when S1 recovers and takes the leadership, S2 may be unaware of that yet. So, both S1 and S2 consider themselves as leaders.
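The rule "first healthy instance in the failover priority list" and the resulting disagreement can be reproduced with a short sketch. It is a toy model: each view table is a hypothetical snapshot of one instance's own opinion about its peers, not data obtained from the membership module.

```lua
-- Toy model of leader selection in the eventual mode.
-- Each "view" is one instance's own opinion about its peers.
local HEALTHY_STATES = {ConfiguringRoles = true, RolesConfigured = true}
local HEALTHY_STATUSES = {alive = true, suspect = true}

local function is_healthy(member)
    return member ~= nil
        and HEALTHY_STATES[member.state] == true
        and HEALTHY_STATUSES[member.status] == true
end

local function pick_leader(failover_priority, view)
    for _, name in ipairs(failover_priority) do
        if is_healthy(view[name]) then
            return name
        end
    end
    return nil -- nobody is healthy from this point of view
end

local priority = {'S1', 'S2'}

-- R has already heard that S1 is dead; S2 has not received the gossip yet.
local view_of_R = {
    S1 = {state = 'RolesConfigured', status = 'dead'},
    S2 = {state = 'RolesConfigured', status = 'alive'},
}
local view_of_S2 = {
    S1 = {state = 'RolesConfigured', status = 'alive'},
    S2 = {state = 'RolesConfigured', status = 'alive'},
}

print(pick_leader(priority, view_of_R), pick_leader(priority, view_of_S2))
```

With these two views, R picks S2 while S2 still picks S1, which is exactly the kind of temporary conflict a role has to tolerate.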

Moreover, the SWIM protocol isn’t perfect and can still produce false-negative gossip (announce that an instance is dead when it’s not). This may cause “failover storms”, when failover triggers too many times per minute under a high load.

You can pause failover at runtime using the Lua API (require('cartridge.lua-api.failover').pause()) or a GraphQL mutation (mutation { cluster { failover_pause } }). These calls pause failover on every instance they can reach. To see if failover is paused, check the logs or call require('cartridge.failover').is_paused(). Don’t forget to resume failover using the Lua API (require('cartridge.lua-api.failover').resume()) or a GraphQL mutation (mutation { cluster { failover_resume } }).

You can also enable failover suppressing with the enable_failover_suppressing parameter of cartridge.cfg(). It automatically pauses failover at runtime if failover triggers too many times per minute. Suppressing is configured by two argparse parameters: failover_suppress_threshold (the number of failover triggers within failover_suppress_timeout after which failover is suppressed) and failover_suppress_timeout (the time window in seconds; if failover triggers more than failover_suppress_threshold times within it, failover is suppressed and then released after failover_suppress_timeout seconds).
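The interplay of the threshold and the timeout can be sketched as a small counter. This is a toy model built only from the description above, not Cartridge's actual algorithm; new_suppressor() and on_trigger() are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```lua
-- Toy model of failover suppressing; not Cartridge's implementation.
-- 'threshold' and 'timeout' mirror failover_suppress_threshold and
-- failover_suppress_timeout.
local function new_suppressor(threshold, timeout)
    local self = {triggers = {}, suppressed_until = nil}

    -- Returns true if failover may proceed at time 'now' (seconds).
    function self.on_trigger(now)
        if self.suppressed_until ~= nil then
            if now < self.suppressed_until then
                return false -- still suppressed
            end
            -- The suppression period is over: release and start afresh.
            self.suppressed_until = nil
            self.triggers = {}
        end
        -- Keep only the triggers that fall inside the current window.
        local recent = {}
        for _, ts in ipairs(self.triggers) do
            if now - ts < timeout then
                table.insert(recent, ts)
            end
        end
        table.insert(recent, now)
        self.triggers = recent
        if #recent > threshold then
            self.suppressed_until = now + timeout
            return false
        end
        return true
    end

    return self
end

local s = new_suppressor(3, 10)
local results = {}
for _, now in ipairs({0, 1, 2, 3, 5, 14}) do
    table.insert(results, tostring(s.on_trigger(now)))
end
print(table.concat(results, ' '))
```

With a threshold of 3 and a timeout of 10 seconds, the fourth trigger within the window suppresses failover, and triggers are accepted again once the timeout has passed.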

Stateful failover

Similarly to the eventual mode, every instance composes its own leadership map, but now the map is fetched from an external state provider (that’s why this failover mode is called “stateful”). Currently, two state providers are supported: etcd and stateboard (a standalone Tarantool instance). The state provider serves as a domain-specific key-value storage (simply replicaset_uuid -> leader_uuid) and a locking mechanism.

Changes in the leadership map are obtained from the state provider with the long polling technique.

All decisions are made by the coordinator – the one that holds the lock. The coordinator is implemented as a built-in Cartridge role. There may be many instances with the coordinator role enabled, but only one of them can acquire the lock at the same time. We call this coordinator the “active” one.

The lock is released automatically when the TCP connection is closed, or it may expire if the coordinator becomes unresponsive (in stateboard it’s set by the stateboard’s --lock_delay option, for etcd it’s a part of clusterwide configuration), so the coordinator renews the lock from time to time in order to be considered alive.

The coordinator makes a decision based on the SWIM data, but the decision algorithm is slightly different from that in case of eventual failover:

Right after acquiring the lock from the state provider, the coordinator fetches the leadership map.
If there is no leader appointed for the replica set, the coordinator appoints the first leader according to the failover priority, regardless of the SWIM status.
If a leader becomes dead, the coordinator makes a decision. A new leader is the first healthy instance from the failover priority list. If an old leader recovers, no leader change is made until the current leader goes down. Changing failover priority doesn’t affect this.
Every appointment (self-made or fetched) is immune for a while (controlled by the IMMUNITY_TIMEOUT option).
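The appointment rules listed above can be condensed into one decision function. The sketch is a simplified model: the appoint() helper is hypothetical, the immunity timeout is not modeled, and keeping the current leader when nobody is healthy is an assumption made for the example.

```lua
-- Toy model of the coordinator's appointment rules.
local function appoint(current_leader, failover_priority, healthy)
    -- No leader yet: take the first one by priority, whatever its health.
    if current_leader == nil then
        return failover_priority[1]
    end
    -- The current leader is fine: keep it, even if a higher-priority
    -- instance has recovered in the meantime.
    if healthy[current_leader] then
        return current_leader
    end
    -- The current leader is dead: pick the first healthy instance.
    for _, name in ipairs(failover_priority) do
        if healthy[name] then
            return name
        end
    end
    return current_leader -- assumption: nobody to switch to, keep as is
end

local priority = {'S1', 'S2', 'S3'}
local first = appoint(nil, priority, {S1 = false, S2 = true, S3 = true})
local after_failure = appoint('S1', priority, {S1 = false, S2 = true, S3 = true})
local after_recovery = appoint('S2', priority, {S1 = true, S2 = true, S3 = true})
print(first, after_failure, after_recovery)
```

Note how S2 stays the leader after S1 recovers, in contrast to the eventual mode, where S1 would take the leadership back.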

You can also enable leader_autoreturn to return leadership to the first leader in the failover_priority list after failover was triggered. It might be useful when you have active and passive data centers. The delay before failover tries to return the leader is configured by the autoreturn_delay option in the failover configuration. Note that leader_autoreturn won’t work if the prime leader is unhealthy.

Stateful failover automatically checks whether a cluster is already registered in the state provider. The check is performed when stateful failover is configured for the first time and every time the cluster is restarted. You can disable the check by setting check_cookie_hash = false in the failover configuration.

Stateful failover may call box.ctl.promote on the leader instance. This doesn’t work with ALL_RW replicasets or with replicasets that have only one existing or enabled node. It works on any Tarantool version where box.ctl.promote is available. If you face any issue with promotion, you can try calling box.ctl.promote manually on the leader. To use this functionality, enable it in your init.lua file:

cartridge.cfg({
    ...
    enable_synchro_mode = true,
})

Migrating a stateful replicaset to manual election mode

Stateful failover can also work with Tarantool election_mode = 'manual'. This mode gives stronger guarantees than election_mode = 'off'.

With election_mode = 'off', Cartridge consistent promotion relies on wait_lsn against the previous vclockkeeper. With election_mode = 'manual', the new leader must also win a built-in Tarantool election triggered via box.ctl.promote(). Tarantool elections use Raft-based terms and quorum, so the switchover verifies that the target instance can actually become the leader, not only catch up to the previous leader’s WAL.

In practice this makes a stateful replicaset more resilient during leader changes and failovers: inability to gather quorum is detected during promotion, before the replicaset continues as writable on the new leader.

Manual election mode is supported only when all of the following are true:

failover mode is stateful;
enable_synchro_mode = true is set in cartridge.cfg();
the replicaset doesn’t use all_rw.

Warning

Use Tarantool election_fencing_mode = 'off' together with election_mode = 'manual'. If Tarantool election fencing makes an instance read-only, Cartridge doesn’t switch it back to writable mode automatically. If you need fencing in stateful failover, prefer Cartridge fencing (failover.fencing_enabled).

To migrate a replicaset with restart, set startup configuration on every instance of the replicaset:

export TARANTOOL_ELECTION_MODE=manual
export TARANTOOL_ELECTION_FENCING_MODE=off

After restart, Tarantool starts in manual election mode and Cartridge uses it together with stateful failover.

Starting with Cartridge 2.17.0, you can also switch a running replicaset at runtime via Lua API:

require('cartridge.lua-api.failover').switch_to_manual_election_mode()

Call the helper on the current appointed leader of the replicaset. The helper switches only the current replicaset and fails unless all instances in it are enabled and healthy.

Important

Runtime helpers are not persistent. They change only the current runtime box.cfg state and don’t modify startup configuration. Before calling switch_to_manual_election_mode(), roll out TARANTOOL_ELECTION_MODE=manual and TARANTOOL_ELECTION_FENCING_MODE=off in deployment configuration for future restarts, but do not restart the instances yet. Otherwise the next restart restores the old startup values.

To roll back a running replicaset from manual to off, first update startup configuration for future restarts and then call:

require('cartridge.lua-api.failover').switch_to_off_election_mode()

By default the rollback helper switches the replicaset to election_mode = 'off' with election_fencing_mode = 'soft'. You can override the fencing mode explicitly:

require('cartridge.lua-api.failover').switch_to_off_election_mode({
    election_fencing_mode = 'off',
})

Just like switch_to_manual_election_mode(), the rollback helper affects only the current replicaset and must be called on its current appointed leader.

Case: external provider outage

In this case, instances do nothing: the leader remains a leader, read-only instances remain read-only. If any instance restarts during an external state provider outage, it composes an empty leadership map: it doesn’t know who actually is a leader and thinks there is none.

Case: coordinator outage

An active coordinator may be absent in a cluster either because of a failure or due to disabling the role on all instances. Just like in the previous case, instances do nothing about it: they keep fetching the leadership map from the state provider. But it will remain the same until a coordinator appears.

Raft failover (beta)

Raft failover in Cartridge is based on built-in Tarantool Raft failover, the box.ctl.on_election trigger that was introduced in Tarantool 2.10.0, and eventual failover mechanisms. The replicaset leader is chosen by built-in Raft, then the other replicasets get information about the leader change from membership. This is needed for Cartridge RPC calls. You can control an instance’s election mode with the argparse option TARANTOOL_ELECTION_MODE or --election-mode, or with the box.cfg{election_mode = ...} API at runtime.

Raft failover can be enabled only on replicasets of 3 or more instances (you can change the behavior by using cartridge.cfg option disable_raft_on_small_clusters) and can’t be enabled with ALL_RW replicasets.

Important

Raft failover in Cartridge is in beta. Don’t use it in production.

Manual leader promotion

Manual leader promotion differs a lot depending on the failover mode.

In the disabled and eventual modes, you can only promote a leader by changing the failover priority (and applying a new clusterwide configuration).

In the stateful mode, the failover priority doesn’t make much sense (except for the first appointment). Instead, you should use the promotion API (the Lua cartridge.failover_promote or the GraphQL mutation {cluster{failover_promote()}}) which pushes manual appointments to the state provider.

The stateful failover mode implies consistent promotion: before becoming writable, each instance performs the wait_lsn operation to sync up with the previous one.

Information about the previous leader (we call it a vclockkeeper) is also stored on the external storage. Even when the old leader is demoted, it remains the vclockkeeper until the new leader successfully awaits and persists its vclock on the external storage.

If replication is stuck and consistent promotion isn’t possible, a user has two options: to revert the promotion (re-promote the old leader) or to force it inconsistently (every variant of the failover_promote API has a force_inconsistency flag).

Consistent promotion doesn’t work for replicasets with all_rw flag enabled and for single-instance replicasets. In these two cases an instance doesn’t even try to query vclockkeeper and to perform wait_lsn. But the coordinator still appoints a new leader if the current one dies.

In the Raft failover mode, the user can also use the promotion API: cartridge.failover_promote in Lua or mutation {cluster{failover_promote()}} in GraphQL, which calls box.ctl.promote on the specified instances. Note that box.ctl.promote starts fair elections, so some other instance may become the leader in the replicaset.

Unelectable nodes

You can restrict the election of a particular node in the stateful failover mode using the GraphQL or Lua API. An “unelectable” node can’t become a leader in a replicaset. This can be useful for nodes that should only take part in the election process and for routers that shouldn’t store data.

In edit_topology:

{
   "replicasets": [
     {
         "alias": "storage",
         "uuid": "aaaaaaaa-aaaa-0000-0000-000000000000",
         "join_servers": [
             {
                 "uri": "localhost:3301",
                 "uuid": "aaaaaaaa-aaaa-0000-0000-000000000001",
                 "electable": false
             }
         ],
         "roles": []
     }
   ]
 }

In Lua API:

-- to make nodes unelectable:
require('cartridge.lua-api.topology').api_topology.set_unelectable_servers(uuids)
-- to make nodes electable:
require('cartridge.lua-api.topology').api_topology.set_electable_servers(uuids)

You can also make a node unelectable in WebUI:

If everything is ok, you will see a crossed-out crown to the left of the instance name.

Fencing

Neither eventual nor stateful failover mode protects a replicaset from the presence of multiple leaders when the network is partitioned. But fencing does. It enforces at-most-one leader policy in a replicaset.

Fencing operates as a fiber that occasionally checks connectivity with the state provider and with replicas. The fencing fiber runs on vclockkeepers; it starts right after consistent promotion succeeds. Replicasets that don’t need consistency (single-instance and all_rw) aren’t protected by fencing, though.

The condition for fencing actuation is the loss of both the state provider quorum and at least one replica. Otherwise, if either the state provider is healthy or all replicas are alive, the fencing fiber waits and doesn’t intervene.
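
As a rough illustration, the actuation condition can be written as a small predicate. The should_fence function and its arguments are hypothetical names used only for this sketch; the real fencing fiber also takes fencing_timeout and fencing_pause into account.

```lua
-- Hypothetical predicate: returns true when fencing should be actuated.
-- provider_quorum_ok: boolean, the state provider quorum is reachable
-- replicas_alive:     array of booleans, one per replica
local function should_fence(provider_quorum_ok, replicas_alive)
    if provider_quorum_ok then
        -- The state provider is healthy: don't intervene.
        return false
    end
    for _, alive in ipairs(replicas_alive) do
        if not alive then
            -- The quorum is lost and at least one replica is lost too.
            return true
        end
    end
    -- The quorum is lost, but all replicas are alive: keep waiting.
    return false
end

print(should_fence(false, {true, false})) -- true
print(should_fence(false, {true, true}))  -- false
print(should_fence(true, {false, false})) -- false
```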

When fencing is actuated, it generates a fake appointment locally and sets the leader to nil. Consequently, the instance becomes read-only. Subsequent recovery is only possible when the quorum reestablishes; replica connection isn’t a must for recovery. Recovery is performed according to the rules of consistent switchover unless some other instance has already been promoted to a new leader.

Raft failover supports fencing too. See the election_fencing_mode parameter of box.cfg{}.

Failover configuration

These are clusterwide parameters:

mode: “disabled” / “eventual” / “stateful” / “raft”.
state_provider: “tarantool” / “etcd2”.
failover_timeout – time (in seconds) to mark suspect members as dead and trigger failover (default: 20).
tarantool_params: {uri = "...", password = "..."}.
etcd2_params: {endpoints = {...}, prefix = "/", lock_delay = 10, username = "", password = ""}.
fencing_enabled: true / false (default: false).
fencing_timeout – time to actuate fencing after the check fails (default: 10).
fencing_pause – the period of performing the check (default: 2).
leader_autoreturn: true / false (default: false).
autoreturn_delay – the time before failover will try to return leader in replicaset to the first instance in failover_priority list (default: 300).
check_cookie_hash – enable check that nobody else uses this stateboard.

It’s required that failover_timeout > fencing_timeout >= fencing_pause.
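
The constraint can be checked before the parameters are applied. Below is a minimal sketch in the return nil, err style; check_failover_timings is a hypothetical helper, not a Cartridge function, and it falls back to the defaults listed above when an option is omitted.

```lua
-- Hypothetical validator for the timing constraint:
-- failover_timeout > fencing_timeout >= fencing_pause
local function check_failover_timings(params)
    local failover_timeout = params.failover_timeout or 20
    local fencing_timeout = params.fencing_timeout or 10
    local fencing_pause = params.fencing_pause or 2

    if not (failover_timeout > fencing_timeout) then
        return nil, 'failover_timeout must be greater than fencing_timeout'
    end
    if not (fencing_timeout >= fencing_pause) then
        return nil, 'fencing_timeout must be not less than fencing_pause'
    end
    return true
end

print(check_failover_timings({}))                      -- true
print(check_failover_timings({failover_timeout = 5}))  -- nil and a message
```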

Lua API

See:

GraphQL API

Use your favorite GraphQL client (e.g. Altair) for request introspection:

query {cluster{failover_params{}}},
mutation {cluster{failover_params(){}}},
mutation {cluster{failover_promote()}}.

Here is an example of how to set up stateful failover:

mutation {
  cluster { failover_params(
    mode: "stateful"
    failover_timeout: 20
    state_provider: "etcd2"
    etcd2_params: {
        endpoints: ["http://127.0.0.1:4001"]
        prefix: "etcd-prefix"
    }) {
        mode
        }
    }
}

Stateboard configuration

Like other Cartridge instances, the stateboard supports cartridge.argparse options:

listen
workdir
password
lock_delay

Similarly to other argparse options, they can be passed via command-line arguments or via environment variables, e.g.:

.rocks/bin/stateboard --workdir ./dev/stateboard --listen 4401 --password qwerty

Fine-tuning failover behavior

Besides failover priority and mode, there are some other private options that influence failover operation:

LONGPOLL_TIMEOUT (failover) – the long polling timeout (in seconds) to fetch new appointments (default: 30);
NETBOX_CALL_TIMEOUT (failover/coordinator) – stateboard client’s connection timeout (in seconds) applied to all communications (default: 1);
RECONNECT_PERIOD (coordinator) – time (in seconds) to reconnect to the state provider if it’s unreachable (default: 5);
IMMUNITY_TIMEOUT (coordinator) – minimal amount of time (in seconds) to wait before overriding an appointment (default: 15).

Configuring instances

Cartridge orchestrates a distributed system of Tarantool instances – a cluster. One of the core concepts is clusterwide configuration. Every instance in a cluster stores a copy of it.

Clusterwide configuration contains options that must be identical on every cluster node, such as the topology of the cluster, failover and vshard configuration, authentication parameters and ACLs, and user-defined configuration.

Clusterwide configuration doesn’t provide instance-specific parameters: ports, workdirs, memory settings, etc.

Configuration basics

Instance configuration includes two sets of parameters:

You can set any of these parameters in:

Command line arguments.
Environment variables.
YAML configuration file.
init.lua file.

The order here indicates the priority: command-line arguments override environment variables, and so forth.
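
The precedence can be pictured as a merge in which later sources overwrite earlier ones. This is only a sketch of the idea, not the cartridge.argparse implementation; resolve_options and the sample values are hypothetical.

```lua
-- Hypothetical merge of option sources, from the lowest priority
-- (init.lua) to the highest (command-line arguments).
local function resolve_options(init_lua, yaml_file, env, cli)
    local result = {}
    local sources = {init_lua or {}, yaml_file or {}, env or {}, cli or {}}
    for _, source in ipairs(sources) do
        for key, value in pairs(source) do
            -- A source with a higher priority overwrites the value.
            result[key] = value
        end
    end
    return result
end

local opts = resolve_options(
    {http_port = 8081, workdir = '.'},  -- init.lua
    {http_port = 8080},                 -- YAML configuration file
    {},                                 -- environment variables
    {workdir = './dev/router'}          -- command-line arguments
)
print(opts.http_port, opts.workdir) -- 8080    ./dev/router
```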

No matter how you start the instances, you need to set the following cartridge.cfg() parameters for each instance:

advertise_uri – either <HOST>:<PORT>, or <HOST>:, or <PORT>. Used by other instances to connect to the current one. DO NOT specify 0.0.0.0 – this must be an external IP address, not a socket bind.
http_port – port to open administrative web interface and API on. Defaults to 8081. To disable it, specify "http_enabled": False.
workdir – a directory where all data will be stored: snapshots, WAL files, and the Cartridge configuration file. Defaults to . (the current directory).

If you start instances using cartridge CLI or systemctl, save the configuration as a YAML file, for example:

my_app.router: {"advertise_uri": "localhost:3301", "http_port": 8080}
my_app.storage_A: {"advertise_uri": "localhost:3302", "http_enabled": False}
my_app.storage_B: {"advertise_uri": "localhost:3303", "http_enabled": False}

With cartridge CLI, you can pass the path to this file as the --cfg command-line argument to the cartridge start command – or specify the path in cartridge CLI configuration (in ./.cartridge.yml or ~/.cartridge.yml):

cfg: cartridge.yml
run-dir: tmp/run

With systemctl, save the YAML file to /etc/tarantool/conf.d/ (the default systemd path) or to a location set in the TARANTOOL_CFG environment variable.

If you start instances with tarantool init.lua, you need to pass other configuration options as command-line parameters and environment variables, for example:

$ tarantool init.lua --alias router --memtx-memory 100 --workdir "~/db/3301" --advertise_uri "localhost:3301" --http_port "8080"

Internal representation of clusterwide configuration

In the file system, clusterwide configuration is represented by a file tree. Inside workdir of any configured instance you can find the following directory:

config/
├── auth.yml
├── topology.yml
└── vshard_groups.yml

This is the clusterwide configuration with three default config sections – auth, topology, and vshard_groups.

For historical reasons, clusterwide configuration has two representations:

old-style single-file config.yml with all sections combined, and
modern multi-file representation mentioned above.

Before cartridge v2.0 it used to look as follows, and this representation is still used in HTTP API and luatest helpers.

# config.yml
---
auth: {...}
topology: {...}
vshard_groups: {...}
...

Beyond these essential sections, clusterwide configuration may be used for storing some other role-specific data. Clusterwide configuration supports YAML as well as plain text sections. It can also be organized in nested subdirectories.

In Lua it’s represented by the ClusterwideConfig object (a table with metamethods). Refer to the cartridge.clusterwide-config module documentation for more details.

Two-phase commit

Cartridge keeps clusterwide configuration identical on every instance using the two-phase commit algorithm implemented in the cartridge.twophase module. A change in clusterwide configuration implies applying it on every instance in the cluster.

Almost every change in cluster parameters triggers a two-phase commit: joining/expelling a server, editing replica set roles, managing users, setting failover and vshard configuration.

Two-phase commit requires all instances to be alive and healthy, otherwise it returns an error.
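
The following in-memory model shows the general shape of a two-phase commit and why a single unreachable instance fails the whole operation. It is a sketch of the generic algorithm under simplifying assumptions, not the cartridge.twophase code: patch_clusterwide and the alive, pending, and config fields are hypothetical.

```lua
-- Hypothetical in-memory model of a two-phase commit.
-- Each instance is a table: {name = ..., alive = ..., config = ...}.
local function patch_clusterwide(instances, new_config)
    -- Phase 1 (prepare): every instance must accept the new config.
    local prepared = {}
    for _, instance in ipairs(instances) do
        if not instance.alive then
            -- Abort: discard the prepared config on the instances
            -- that have already accepted it.
            for _, p in ipairs(prepared) do
                p.pending = nil
            end
            return nil, ('instance %s is unreachable'):format(instance.name)
        end
        instance.pending = new_config
        table.insert(prepared, instance)
    end

    -- Phase 2 (commit): all instances switch to the new config.
    for _, instance in ipairs(instances) do
        instance.config = instance.pending
        instance.pending = nil
    end
    return true
end

local router = {name = 'router', alive = true, config = 'v1'}
local storage = {name = 'storage', alive = false, config = 'v1'}
print(patch_clusterwide({router, storage}, 'v2')) -- nil and an error message
print(router.config)                              -- v1
```

In this model, a failed prepare phase leaves every instance on its previous configuration.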

For more details, please refer to the cartridge.config_patch_clusterwide API reference.

Managing role-specific data

Besides system sections, clusterwide configuration may be used for storing some other role-specific data. It supports YAML as well as plain text sections, and it can also be organized in nested subdirectories.

Role-specific sections are used by some third-party roles, e.g. sharded-queue and cartridge-extensions.

A user can influence clusterwide configuration in various ways. You can alter configuration using Lua, HTTP or GraphQL API. Also there are luatest helpers available.

HTTP API

It works with the old-style single-file representation only. It’s useful when only a few sections are needed.

Example:

cat > config.yml << CONFIG
---
custom_section: {}
...
CONFIG

Upload new config:

curl -v "localhost:8081/admin/config" -X PUT --data-binary @config.yml

Download it:

curl -v "localhost:8081/admin/config" -o config.yml

It’s suitable for role-specific sections only. System sections (topology, auth, vshard_groups, users_acl) can be neither uploaded nor downloaded.

If authorization is enabled, use the curl option --user username:password.

GraphQL API

GraphQL API, by contrast, is only suitable for managing plain-text sections in the modern multi-file appearance. It is mostly used by WebUI, but sometimes it’s also helpful in tests:

g.cluster.main_server:graphql({query = [[
    mutation($sections: [ConfigSectionInput!]) {
        cluster {
            config(sections: $sections) {
                filename
                content
            }
        }
    }]],
    variables = {sections = {
      {
        filename = 'custom_section.yml',
        content = '---\n{}\n...',
      }
    }}
})

Unlike HTTP API, GraphQL affects only the sections mentioned in the query. All the other sections remain unchanged.

Similarly to HTTP API, GraphQL cluster {config} query isn’t suitable for managing system sections.

Lua API

It’s not the most convenient way to configure a third-party role, but it may be useful for role development. Please refer to the corresponding API reference:

cartridge.config_patch_clusterwide
cartridge.config_get_deepcopy
cartridge.config_get_readonly

Example (from sharded-queue, simplified):

local cartridge = require('cartridge')

-- Add (or replace) a tube definition in the clusterwide 'tubes' section.
function create_tube(tube_name, tube_opts)
    local tubes = cartridge.config_get_deepcopy('tubes') or {}
    tubes[tube_name] = tube_opts or {}

    return cartridge.config_patch_clusterwide({tubes = tubes})
end

-- Role callback: validate the new configuration before it is applied.
local function validate_config(conf)
    local tubes = conf.tubes or {}
    for tube_name, tube_opts in pairs(tubes) do
        -- validate tube_opts
    end
    return true
end

-- Role callback: apply the configuration on the instance.
local function apply_config(conf, opts)
    if opts.is_master then
        local tubes = conf.tubes or {}
        -- create tubes according to the configuration
    end
    return true
end

Luatest helpers

Cartridge test helpers provide methods for configuration management:

cartridge.test-helpers.cluster:upload_config,
cartridge.test-helpers.cluster:download_config.

Internally they wrap the HTTP API.

Example:

g.before_all(function()
    g.cluster = helpers.Cluster.new(...)
    g.cluster:upload_config({some_section = 'some_value'})
    t.assert_equals(
        g.cluster:download_config(),
        {some_section = 'some_value'}
    )
end)

Deploying an application

After you’ve developed your Tarantool Cartridge application locally, you can deploy it to a test or production environment.

Deploying includes:

packing the application into a specific distribution format
installing it to the target server
running the application.

You have four options to deploy a Tarantool Cartridge application:

as an RPM package (for production)
as a DEB package (for production)
as a tar+gz archive (for testing or as a workaround for production if root access is unavailable)
from sources (for local testing only).

Deploying as an RPM or DEB package

The choice between DEB and RPM depends on the package manager of the target OS. DEB is used for Debian Linux and its derivatives, and RPM is used for CentOS/RHEL and other RPM-based Linux distributions.

Important

If you use the Tarantool Community Edition while packing the application, the package will have a dependency on this version of Tarantool.

In this case, on the target server, add the Tarantool repository for a version equal to or later than the one used for packing the application. This lets the package manager install the dependency correctly. See details for your OS on the Download page.

For a production environment, it is recommended to use the systemd subsystem for managing the application instances and accessing log entries.

To deploy your Tarantool Cartridge application:

Pack the application into a deliverable:
```
$ cartridge pack rpm [APP_PATH] [--use-docker]
$ # -- OR --
$ cartridge pack deb [APP_PATH] [--use-docker]
```
where
- APP_PATH – a path to the application directory. Defaults to . (the current directory).
- --use-docker – the flag to use if packing the application on a different Linux distribution or on macOS. It ensures the resulting artifact contains the Linux compatible external modules and executables.
This creates an RPM or DEB package with the following naming: <APP_NAME>-<VERSION>.{rpm,deb}. For example, ./my_app-0.1.0-1-g8c57dcb.rpm or ./my_app-0.1.0-1-g8c57dcb.deb. For more details on the format and usage of the cartridge pack command, refer to the command description.
Upload the generated package to a target server.

Install the application:

$ sudo yum install <APP_NAME>-<VERSION>.rpm
$ # -- OR --
$ sudo dpkg -i <APP_NAME>-<VERSION>.deb

Configure the application instances.

The configuration is stored in the /etc/tarantool/conf.d/instances.yml file. Create the file and specify parameters of the instances. For details, refer to Configuring instances.

For example:
```
my_app:
  cluster_cookie: secret-cookie

my_app.router:
  advertise_uri: localhost:3301
  http_port: 8081

my_app.storage-master:
  advertise_uri: localhost:3302
  http_port: 8082

my_app.storage-replica:
  advertise_uri: localhost:3303
  http_port: 8083
```
Note

Do not specify working directories of the instances in this configuration. They are defined via the TARANTOOL_WORKDIR environment variable in the instantiated unit file (/etc/systemd/system/<APP_NAME>@.service).

Start the application instances by using systemctl.

For more details, see Start/stop using systemctl.

$ sudo systemctl start my_app@router
$ sudo systemctl start my_app@storage-master
$ sudo systemctl start my_app@storage-replica

In case of a cluster-aware application, proceed to deploying the cluster.
Note

If you’re migrating your application from local test environment to production, you can re-use your test configuration at this step:
1. In the cluster web interface of the test environment, click Configuration files > Download to save the test configuration.
2. In the cluster web interface of the production environment, click Configuration files > Upload to upload the saved configuration.

You can further manage the running instances by using the standard operations of the systemd utilities:

systemctl for stopping, re-starting, checking the status of the instances, and so on
journalctl for collecting logs of the instances.

Entities created during installation

During the installation of a Tarantool Cartridge application, the following entities are additionally created:

The tarantool user group.
The tarantool system user. All the application instances start under this user. The tarantool user group is the main group for the tarantool user. The user is created with the option -s /sbin/nologin.
Directories and files listed in the table below (<APP_NAME> is the application name, %i is the instance name):

Path	Access Rights	Owner:Group	Description
`/etc/systemd/system/<APP_NAME>.service`	`-rw-r--r--`	`root:root`	systemd unit file for the <APP_NAME> service
`/etc/systemd/system/<APP_NAME>@.service`	`-rw-r--r--`	`root:root`	systemd instantiated unit file for the <APP_NAME> service
`/usr/share/tarantool/<APP_NAME>/`	`drwxr-xr-x`	`root:root`	Directory. Contains executable files of the application.
`/etc/tarantool/conf.d/`	`drwxr-xr-x`	`root:root`	Directory for YAML files with the configuration of the application instances, such as `instances.yml`.
`/var/lib/tarantool/<APP_NAME>.%i/`	`drwxr-xr-x`	`tarantool:tarantool`	Working directories of the application instances. Each directory contains the instance data, namely, the WAL and snapshot files, and also the application configuration YAML files.
`/var/run/tarantool/`	`drwxr-xr-x`	`tarantool:tarantool`	Directory. Contains the following files for each instance: `<APP_NAME>.%i.pid` and `<APP_NAME>.%i.control`.
`/var/run/tarantool/<APP_NAME>.%i.pid`	`-rw-r--r--`	`tarantool:tarantool`	Contains the process ID.
`/var/run/tarantool/<APP_NAME>.%i.control`	`srwxr-xr-x`	`tarantool:tarantool`	Unix socket to connect to the instance via the tt CLI utility.

Deploying as a tar+gz archive

Pack the application into a distributable:
```
$ cartridge pack tgz APP_NAME
```
This will create a tar+gz archive (e.g. ./my_app-0.1.0-1.tgz).
Upload the archive to target servers, with tarantool and (optionally) cartridge-cli installed.
Extract the archive:
```
$ tar -xzvf APP_NAME-VERSION.tgz
```

Configure the instance(s). Create a file called /etc/tarantool/conf.d/instances.yml. For example:

my_app:
 cluster_cookie: secret-cookie

my_app.instance-1:
 http_port: 8081
 advertise_uri: localhost:3301

my_app.instance-2:
 http_port: 8082
 advertise_uri: localhost:3302

See details here.

Start Tarantool instance(s). You can do it using:

tarantool, for example:

$ tarantool init.lua # starts a single instance

or cartridge, for example:

$ # in application directory
$ cartridge start # starts all instances
$ cartridge start .router_1 # starts a single instance

$ # in multi-application environment
$ cartridge start my_app # starts all instances of my_app
$ cartridge start my_app.router # starts a single instance

In case it is a cluster-aware application, proceed to deploying the cluster.
Note

If you’re migrating your application from local test environment to production, you can re-use your test configuration at this step:
1. In the cluster web interface of the test environment, click Configuration files > Download to save the test configuration.
2. In the cluster web interface of the production environment, click Configuration files > Upload to upload the saved configuration.

Deploying from sources

This deployment method is intended for local testing only.

Pull all dependencies to the .rocks directory:
```
$ tt rocks make
```

Configure the instance(s). Create a file called /etc/tarantool/conf.d/instances.yml. For example:

my_app:
 cluster_cookie: secret-cookie

my_app.instance-1:
 http_port: 8081
 advertise_uri: localhost:3301

my_app.instance-2:
 http_port: 8082
 advertise_uri: localhost:3302

See details here.

Start Tarantool instance(s). You can do it using:

tarantool, for example:

$ tarantool init.lua # starts a single instance

or cartridge, for example:

$ # in application directory
$ cartridge start # starts all instances
$ cartridge start .router_1 # starts a single instance

$ # in multi-application environment
$ cartridge start my_app # starts all instances of my_app
$ cartridge start my_app.router # starts a single instance

In case it is a cluster-aware application, proceed to deploying the cluster.
Note

If you’re migrating your application from local test environment to production, you can re-use your test configuration at this step:
1. In the cluster web interface of the test environment, click Configuration files > Download to save the test configuration.
2. In the cluster web interface of the production environment, click Configuration files > Upload to upload the saved configuration.

Starting/stopping instances

Depending on your deployment method, you can start/stop the instances using tarantool, cartridge CLI, or systemctl.

Start/stop using tarantool

With tarantool, you can start only a single instance:

# the simplest command
$ tarantool init.lua

You can also specify more options on the command line or in environment variables.

To stop the instance, use Ctrl+C.

Start/stop using cartridge CLI

With cartridge CLI, you can start one or multiple instances:

$ cartridge start [APP_NAME[.INSTANCE_NAME]] [options]

The options are listed in the cartridge start reference.

Here are some commonly used options:

--script FILE

Application’s entry point. Defaults to:

TARANTOOL_SCRIPT, or
./init.lua when running from the app’s directory, or
app_name/init.lua in a multi-app environment.

--run-dir DIR

Directory with pid and sock files. Defaults to TARANTOOL_RUN_DIR or /var/run/tarantool.

--cfg FILE

Cartridge instances YAML configuration file. Defaults to TARANTOOL_CFG or ./instances.yml. The instances.yml file contains cartridge.cfg() parameters described in the configuration section of this guide.

For example:

$ cartridge start my_app --cfg demo.yml --run-dir ./tmp/run

This starts all Tarantool instances specified in the cfg file, in the foreground, with enforced environment variables.

When APP_NAME is not provided, cartridge parses it from ./*.rockspec filename.

When INSTANCE_NAME is not provided, cartridge reads cfg file and starts all defined instances:

$ # in application directory
$ cartridge start # starts all instances
$ cartridge start .router_1 # starts a single instance

$ # in multi-application environment
$ cartridge start my_app # starts all instances of my_app
$ cartridge start my_app.router # starts a single instance

To stop the instances, run:

$ cartridge stop [APP_NAME[.INSTANCE_NAME]] [options]

These options from the cartridge start command are supported:

--run-dir DIR
--cfg FILE

Start/stop using systemctl

To run a single instance:
```
$ systemctl start APP_NAME
```
This will start a systemd service that will listen on the port specified in the instance configuration (the http_port parameter).
To run multiple instances on one or multiple servers:
```
$ systemctl start APP_NAME@INSTANCE_1
$ systemctl start APP_NAME@INSTANCE_2
...
$ systemctl start APP_NAME@INSTANCE_N
```
where APP_NAME@INSTANCE_N is the instantiated service name for systemd with an incremental N – a number, unique for every instance, added to the port the instance will listen on (e.g., 3301, 3302, etc.).
To stop all services on a server, use the systemctl stop command and specify instance names one by one. For example:
```
$ systemctl stop APP_NAME@INSTANCE_1 APP_NAME@INSTANCE_2 ... APP_NAME@INSTANCE_N
```

When running instances with systemctl, keep these practices in mind:

You can specify instance configuration in a YAML file.

This file can contain these options (see an example here).

Save this file to /etc/tarantool/conf.d/ (the default systemd path) or to a location set in the TARANTOOL_CFG environment variable (if you’ve edited the application’s systemd unit file). The file name doesn’t matter: it can be instances.yml or anything else you like.

Here’s what happens next:
- systemd obtains app_name (and instance_name, if specified) from the name of the application’s systemd unit file (e.g. APP_NAME@default or APP_NAME@INSTANCE_1);
- systemd sets the default console socket (e.g. /var/run/tarantool/APP_NAME@INSTANCE_1.control), PID file (e.g. /var/run/tarantool/APP_NAME@INSTANCE_1.pid), and workdir (e.g. /var/lib/tarantool/<APP_NAME>.<INSTANCE_NAME>). The workdir is set in the unit file as Environment=TARANTOOL_WORKDIR=${workdir}.%i.
Finally, cartridge looks across all YAML files in /etc/tarantool/conf.d for a section with the appropriate name (e.g. app_name, which contains common configuration for all instances, and app_name.instance_1, which contains instance-specific configuration). As a result, the cartridge.cfg options workdir, console_sock, and pid_file set in the YAML file become useless, because systemd overrides them.
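
The section lookup can be sketched as follows. This is an illustration only: instance_options is a hypothetical function, and the assumption that instance-specific values take precedence over the common section is ours. The sections table stands for the already parsed contents of the YAML files.

```lua
-- Hypothetical lookup of instance options in parsed YAML sections.
local function instance_options(sections, app_name, instance_name)
    local result = {}
    -- Common configuration for all instances of the application.
    for key, value in pairs(sections[app_name] or {}) do
        result[key] = value
    end
    -- Instance-specific configuration (assumed to take precedence).
    if instance_name ~= nil then
        local specific = sections[app_name .. '.' .. instance_name] or {}
        for key, value in pairs(specific) do
            result[key] = value
        end
    end
    return result
end

local sections = {
    ['my_app'] = {cluster_cookie = 'secret-cookie'},
    ['my_app.router'] = {advertise_uri = 'localhost:3301', http_port = 8081},
}
local opts = instance_options(sections, 'my_app', 'router')
print(opts.cluster_cookie, opts.advertise_uri, opts.http_port)
```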

The default tool for querying logs is journalctl. For example:

$ # show log messages for a systemd unit named APP_NAME@INSTANCE_1
$ journalctl -u APP_NAME@INSTANCE_1

$ # show only the most recent messages and continuously print new ones
$ journalctl -f -u APP_NAME@INSTANCE_1

If necessary, you can change logging-related box.cfg options in the YAML configuration file: see log and other related options.

Error handling guidelines

Almost all errors in Cartridge follow the return nil, err style, where err is an error object produced by Tarantool’s errors module. Cartridge doesn’t raise errors except for bugs and function contract violations. New roles should follow these guidelines as well.

Note that in triggers (cartridge.graphql.on_resolve and cartridge.twophase.on_patch) return values are ignored. So if you want to raise an error from a trigger function, you need to call error() explicitly.
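
The effect of ignored return values can be shown in plain Lua, without Cartridge. In the sketch below, run_triggers is a hypothetical stand-in for a trigger runner that discards return values; it is not Cartridge's implementation, and the trigger names are made up for illustration.

```lua
-- Hypothetical trigger runner that ignores return values, mirroring the
-- behavior described above. This is NOT Cartridge's implementation.
local function run_triggers(triggers, ...)
    for _, trigger in ipairs(triggers) do
        trigger(...) -- return values are discarded
    end
    return true
end

-- Returning nil, err from a trigger has no effect on the caller.
local function polite_trigger()
    return nil, 'access denied'
end

-- Calling error() is the only way to interrupt the operation.
local function strict_trigger()
    error('access denied', 0) -- level 0: no position prefix in the message
end

print(pcall(run_triggers, {polite_trigger})) -- values: true, true
print(pcall(run_triggers, {strict_trigger})) -- values: false, 'access denied'
```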

Error objects in Lua

Error classes help to locate the problem’s source. For this purpose, an error object contains its class, stack traceback, and a message.

local errors = require('errors')
local DangerousError = errors.new_class("DangerousError")

local function some_fancy_function()

    local something_bad_happens = true

    if something_bad_happens then
        return nil, DangerousError:new("Oh boy")
    end

    return "success" -- not reachable due to the error
end

print(some_fancy_function())

nil DangerousError: Oh boy
stack traceback:
    test.lua:9: in function 'some_fancy_function'
    test.lua:15: in main chunk

For uniform error handling, errors provides the :pcall API:

local ret, err = DangerousError:pcall(some_fancy_function)
print(ret, err)

nil DangerousError: Oh boy
stack traceback:
    test.lua:9: in function <test.lua:4>
    [C]: in function 'xpcall'
    .rocks/share/tarantool/errors.lua:139: in function 'pcall'
    test.lua:15: in main chunk

print(DangerousError:pcall(error, 'what could possibly go wrong?'))

nil DangerousError: what could possibly go wrong?
stack traceback:
    [C]: in function 'xpcall'
    .rocks/share/tarantool/errors.lua:139: in function 'pcall'
    test.lua:15: in main chunk

For errors.pcall there is no difference between the return nil, err and error() approaches.

Note that the errors.pcall API differs from the vanilla Lua pcall. On success, it returns only the values returned from the call, without a leading true. On error, it returns nil instead of false, plus an error object.
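
The two return conventions can be compared in plain Lua. The helper errors_style_pcall below is hypothetical and only imitates the return convention; it is not the errors module, and it passes the raw error value through instead of building an error object with a class name and stack trace.

```lua
-- Hypothetical stand-in that imitates the errors.pcall return convention.
-- NOT the errors module: the real API returns an error object.
local function errors_style_pcall(fn, ...)
    local function finish(ok, ...)
        if not ok then
            -- Failure: nil instead of false, followed by the error value.
            return nil, (...)
        end
        -- Success: the call's own return values, without a leading true.
        return ...
    end
    return finish(pcall(fn, ...))
end

-- Example function that raises on bad input.
local function checked_upper(s)
    if type(s) ~= 'string' then
        error('string expected', 0) -- level 0: no position prefix
    end
    return s:upper()
end

-- Vanilla pcall: a status flag always comes first.
print(pcall(checked_upper, 'ok'))              -- values: true, 'OK'
print(pcall(checked_upper, 42))                -- values: false, 'string expected'

-- errors-style convention: results only, or nil plus the error.
print(errors_style_pcall(checked_upper, 'ok')) -- value: 'OK'
print(errors_style_pcall(checked_upper, 42))   -- values: nil, 'string expected'
```

With this convention, a function that returns nil, err and a function that calls error() look the same to the caller.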

Remote net.box calls keep no stack trace from the remote host. In that case, errors.netbox_eval comes to the rescue. It collects stack traces from both the local and remote hosts and restores metatables.

> conn = require('net.box').connect('localhost:3301')
> print( errors.netbox_eval(conn, 'return nil, DoSomethingError:new("oops")') )
nil     DoSomethingError: oops
stack traceback:
        eval:1: in main chunk
during net.box eval on localhost:3301
stack traceback:
        [string "return print( errors.netbox_eval("]:1: in main chunk
        [C]: in function 'pcall'

However, vshard doesn’t use the errors module. Instead, it uses its own error objects. Keep this in mind when working with vshard functions.

Data included in an error object (class name, message, traceback) can be easily converted to a string using the tostring() function.

GraphQL

GraphQL implementation in Cartridge wraps the errors module, so a typical error response looks as follows:

{
    "errors":[{
        "message":"what could possibly go wrong?",
        "extensions":{
            "io.tarantool.errors.stack":"stack traceback: ...",
            "io.tarantool.errors.class_name":"DangerousError"
        }
    }]
}

Read more about errors in the GraphQL specification.

If you’re going to implement a GraphQL handler, you can add your own extension like this:

local err = DangerousError:new('I have extension')
err.graphql_extensions = {code = 403}

It will lead to the following response:

{
    "errors":[{
        "message":"I have extension",
        "extensions":{
            "io.tarantool.errors.stack":"stack traceback: ...",
            "io.tarantool.errors.class_name":"DangerousError",
            "code":403
        }
    }]
}

HTTP

In a nutshell, an errors object is a table. This means that it can be easily represented in JSON. Cartridge uses this approach to handle errors via HTTP:

local err = DangerousError:new('Who would have thought?')

local resp = req:render({
    status = 500,
    headers = {
        ['content-type'] = "application/json; charset=utf-8"
    },
    json = json.encode(err),
})

{
    "line":27,
    "class_name":"DangerousError",
    "err":"Who would have thought?",
    "file":".../app/roles/api.lua",
    "stack":"stack traceback:..."
}

Cluster instance lifecycle

Every instance in the cluster has an internal state machine. It helps manage cluster operation and makes the distributed system easier to describe.

The instance lifecycle starts with a cartridge.cfg call. During initialization, the Cartridge instance binds TCP (iproto) and UDP (SWIM) sockets and checks the working directory. Depending on the result, it enters one of the following states:

Unconfigured

If the working directory is clean and neither snapshots nor cluster-wide configuration files exist, the instance enters the Unconfigured state.

The instance starts to accept iproto requests (Tarantool binary protocol) and remains in this state until the user decides to join it to a cluster (to create a replicaset or join an existing one).

After that, the instance moves to the BootstrappingBox state.

ConfigFound

If the instance finds all configuration files and snapshots, it enters the ConfigFound state. The instance does not load the files and snapshots yet, because it will download and validate the config first. On success, the instance enters the ConfigLoaded state. On failure, it moves to the InitError state.

ConfigLoaded

The config is found, loaded, and validated. The next step is configuring the instance. If there are any snapshots, the instance changes its state to RecoveringSnapshot. Otherwise, it moves to the BootstrappingBox state. By default, all instances start in read-only mode and don’t start listening until bootstrap/recovery finishes.

InitError

The following events can cause an instance initialization error:

- Error occurred during cartridge.remote-control’s connection to binary port
- Missing config.yml from workdir (tmp/), while snapshots are present
- Error while loading configuration from disk
- Invalid config - Server is not present in the cluster configuration

BootstrappingBox

In this state, the instance configures arguments for box.cfg if snapshots or config files are not present, executes box.cfg, sets up users, and stops remote-control. The instance then tries to start listening on the full-featured iproto protocol. If the attempt fails, the instance changes its state to BootError. If there is no replicaset in the cluster-wide config, the instance also sets the state to BootError. On success, the instance enters the ConnectingFullmesh state.

RecoveringSnapshot

If snapshots are present, box.cfg will start a recovery process. After that, the process is similar to BootstrappingBox.

BootError

This state can be caused by the following events:

- Failed binding to binary port for iproto usage
- Server is missing in cluster-wide config
- Replicaset is missing in cluster-wide config
- Failed replication configuration

ConnectingFullmesh

In this state, servers and replicasets are being configured. Eventually, the cluster topology described in the config is implemented. In case of an error, the instance moves to the BootError state. Otherwise, it proceeds to configuring roles.

BoxConfigured

This state follows the successful configuration of replicasets and cluster topology. The next step is role configuration.

ConfiguringRoles

The state of role configuration. The instance enters this state during initial setup, after a failover trigger (failover.lua), or after the cluster-wide config is altered (twophase.lua).

RolesConfigured

Successful role configuration.

OperationError

Error during role configuration.
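
The transitions described in this section can be summarized as a lookup table. The sketch below is plain Lua and only a reading aid transcribed from the descriptions above; it is not Cartridge's internal state table. The helper names are hypothetical, the RolesConfigured to ConfiguringRoles step is an assumption drawn from the note about failover and config changes, and the error states have no outgoing transitions because the text describes none.

```lua
-- Transition table transcribed from the state descriptions above.
-- A reading aid only, NOT Cartridge's internal table.
local transitions = {
    Unconfigured       = {'BootstrappingBox'},
    ConfigFound        = {'ConfigLoaded', 'InitError'},
    ConfigLoaded       = {'RecoveringSnapshot', 'BootstrappingBox'},
    BootstrappingBox   = {'ConnectingFullmesh', 'BootError'},
    RecoveringSnapshot = {'ConnectingFullmesh', 'BootError'},
    ConnectingFullmesh = {'BoxConfigured', 'BootError'},
    BoxConfigured      = {'ConfiguringRoles'},
    ConfiguringRoles   = {'RolesConfigured', 'OperationError'},
    -- Assumption: roles are reconfigured after failover or a config change.
    RolesConfigured    = {'ConfiguringRoles'},
    -- The text describes no way out of the error states.
    InitError          = {},
    BootError          = {},
    OperationError     = {},
}

-- Hypothetical helper: check whether a single step is allowed.
local function can_move(from, to)
    for _, state in ipairs(transitions[from] or {}) do
        if state == to then
            return true
        end
    end
    return false
end

-- Hypothetical helper: validate a whole path of states.
local function is_valid_path(path)
    for i = 1, #path - 1 do
        if not can_move(path[i], path[i + 1]) then
            return false, ('%s -> %s is not allowed'):format(path[i], path[i + 1])
        end
    end
    return true
end

-- First start on a clean working directory, followed by joining a cluster.
print(is_valid_path({
    'Unconfigured', 'BootstrappingBox', 'ConnectingFullmesh',
    'BoxConfigured', 'ConfiguringRoles', 'RolesConfigured',
}))
-- A restart cannot skip loading the config and configuring box.
print(is_valid_path({'ConfigFound', 'RolesConfigured'}))
```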
