Top.Mail.Ru
LuaJIT memory profiler | Tarantool
Tarantool
Check out the new release 2.8
Application server LuaJIT memory profiler

LuaJIT memory profiler

Starting from version 2.7.1, Tarantool has a built‑in module called misc.memprof that implements a LuaJIT memory profiler (which we will just call the profiler in this section). The profiler provides a memory allocation report that helps analyze Lua code and find the places that put the most pressure on the Lua garbage collector (GC).

Usage of the profiler involves two steps:

  1. Collecting a binary profile of allocations, reallocations, and deallocations in memory related to Lua (further, binary memory profile or binary profile for short).
  2. Parsing the collected binary profile to get a human-readable profiling report.

To collect a binary profile for a particular part of the Lua code, you need to place this part between two misc.memprof functions, namely, misc.memprof.start() and misc.memprof.stop(), and then execute the code under Tarantool.

Below is a chunk of simple Lua code named test.lua to illustrate this.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
-- Prevent allocations on traces.
jit.off()
local str, err = misc.memprof.start("memprof_new.bin")
-- Lua doesn't create a new frame to call string.rep, and all allocations
-- are attributed not to the append() function but to the parent scope.
local function append(str, rep)
    return string.rep(str, rep)
end

local t = {}
for i = 1, 1e4 do
    -- table.insert is the built-in function and all corresponding
    -- allocations are reported in the scope of the main chunk.
    table.insert(t,
        append('q', i)
    )
end
local stp, err = misc.memprof.stop()

The Lua code for starting the profiler – as in line 3 in the test.lua example above – is:

local str, err = misc.memprof.start(FILENAME)

where FILENAME is the name of the binary file where profiling events are written.

If the operation fails, for example if it is not possible to open a file for writing or if the profiler is already running, misc.memprof.start() returns nil as the first result, an error-message string as the second result, and a system-dependent error code number as the third result. If the operation succeeds, misc.memprof.start() returns true.

The Lua code for stopping the profiler – as in line 18 in the test.lua example above – is:

local stp, err = misc.memprof.stop()

If the operation fails, for example if there is an error when the file descriptor is being closed or if there is a failure during reporting, misc.memprof.stop() returns nil as the first result, an error-message string as the second result, and a system-dependent error code number as the third result. If the operation succeeds, misc.memprof.stop() returns true.

To generate the file with memory profile in binary format (in the test.lua code example above the file name is memprof_new.bin), execute the code under Tarantool:

$ tarantool test.lua

Tarantool collects the allocation events in memprof_new.bin, puts the file in its working directory, and closes the session.

The test.lua code example above also illustrates the memory allocation logic in some cases that are important to understand for reading and analyzing a profiling report:

  • Line 2: It is recommended to switch the JIT compilation off by calling jit.off() before the profiler start. Refer to the following note about jitoff for more details.
  • Lines 6-8: Tail call optimization doesn’t create a new call frame, so all allocations inside the function called via the CALLT/CALLMT bytecodes are attributed to the function’s caller. See also the comments preceding these lines.
  • Lines 14-16: Usually the information about allocations inside Lua built‑ins is not really useful for developers. That’s why if a Lua built‑in function is called from a Lua function, the profiler attributes all allocations to the Lua function. Otherwise, this event is attributed to a C function. See also the comments preceding these lines.

After getting the memory profile in binary format, the next step is to parse it to get a human-readable profiling report. You can do this via Tarantool by using the following command (mind the hyphen - prior to the file name):

$ tarantool -e 'require("memprof")(arg)' - memprof_new.bin

where memprof_new.bin is the binary profile generated earlier by tarantool test.lua. (Warning: there is a slight behavior change here, the tarantool -e ... command was slightly different in Tarantool versions prior to Tarantool 2.8.1.)

Tarantool generates a profiling report and displays it on the console before closing the session:

ALLOCATIONS
@test.lua:14: 10000 events  +50240518 bytes -0 bytes
@test.lua:9: 1 events       +32 bytes       -0 bytes
@test.lua:8: 1 events       +20 bytes       -0 bytes
@test.lua:13: 1 events      +24 bytes       -0 bytes

REALLOCATIONS
@test.lua:13: 13 events     +262216 bytes   -131160 bytes
    Overrides:
        @test.lua:13

@test.lua:14: 11 events     +49536 bytes    -24768 bytes
            Overrides:
        @test.lua:14
        INTERNAL

INTERNAL: 3 events          +8448 bytes     -16896 bytes
    Overrides:
        @test.lua:14

DEALLOCATIONS
INTERNAL: 1723 events       +0 bytes        -483515 bytes
@test.lua:14: 1 events      +0 bytes        -32768 bytes

HEAP SUMMARY:
@test.lua:14 holds 50248326 bytes: 10010 allocs, 10 frees
@test.lua:13 holds 131080 bytes: 14 allocs, 13 frees
INTERNAL holds 8448 bytes: 3 allocs, 3 frees
@test.lua:9 holds 32 bytes: 1 allocs, 0 frees
@test.lua:8 holds 20 bytes: 1 allocs, 0 frees

Note

On MacOS, a report will be different for the same chunk of code because Tarantool and LuaJIT are built with the GC64 mode enabled for MacOS.

Let’s examine the report structure. A report has four sections:

Each section contains event records that are sorted from the most frequent to the least frequent.

An event record has the following format:

@<filename>:<line_number>: <number_of_events> events +<allocated> bytes -<freed> bytes
  • <filename>—a name of the file containing Lua code.
  • <line_number>—the line number where the event is detected.
  • <number_of_events>—a number of events for this code line.
  • +<allocated> bytes—amount of memory allocated during all the events on this line.
  • -<freed> bytes—amount of memory freed during all the events on this line.

The Overrides label shows what allocation has been overridden.

See the test.lua chunk above with the explanation in the comments for some examples.

The INTERNAL label indicates that this event is caused by internal LuaJIT structures.

Note

Important note regarding the INTERNAL label and the recommendation of switching the JIT compilation off (jit.off()): this version of the profiler doesn’t support verbose reporting for allocations on traces. If memory allocations are made on a trace, the profiler can’t associate the allocations with the part of Lua code that generated the trace. In this case, the profiler labels such allocations as INTERNAL.

So, if the JIT compilation is on, new traces will be generated and there will be a mixture of events labeled INTERNAL in the profiling report: some of them are really caused by internal LuaJIT structures, but some of them are caused by allocations on traces.

If you want to have a more definite report without JIT compiler allocations, call jit.off() before starting the profiling. And if you want to completely exclude the trace allocations from the report, remove also the old traces by additionally calling jit.flush() after jit.off().

Nevertheless, switching the JIT compilation off before the profiling is not “a must”. It is rather a recommendation, and in some cases, for example in a production environment, you may need to keep JIT compilation on to see the full picture of all the memory allocations. In this case, the majority of the INTERNAL events are most probably caused by traces.

As for investigating the Lua code with the help of profiling reports, it is always code-dependent and there can’t be hundred per cent definite recommendations in this regard. Nevertheless, you can see some of the things in the Profiling report analysis example later.

Also, below is the FAQ section with the questions that most probably can arise while using profiler.

In this section, some profiler-related points are discussed in a Q&A format.

Question (Q): Is the profiler suitable for C allocations or allocations inside C code?

Answer (A): The profiler reports only allocation events caused by the Lua allocator. All Lua-related allocations, like table or string creation are reported. But the profiler doesn’t report allocations made by malloc() or other non-Lua allocators. You can use valgrind to debug them.


Q: Why are there so many INTERNAL allocations in my profiling report? What does it mean?

A: INTERNAL means that these allocations/reallocations/deallocations are related to the internal LuaJIT structures or are made on traces. Currently, the profiler doesn’t verbosely report allocations of objects that are made during trace execution. Try adding jit.off() before the profiler start.


Q: Why are there some reallocations/deallocations without an Overrides section?

A: These objects can be created before the profiler starts. Adding collectgarbage() before the profiler’s start enables collecting all previously allocated objects that are dead when the profiler starts.


Q: Why are some objects not collected during profiling? Is it a memory leak?

A: LuaJIT uses incremental Garbage Collector (GC). A GC cycle may not be finished at the moment the profiler stops. Add collectgarbage() before stopping the profiler to collect all the dead objects for sure.


Q: Can I profile not just a current chunk but the entire running application? Can I start the profiler when the application is already running?

A: Yes. Here is an example of code that can be inserted in the Tarantool console for a running instance.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
local fiber = require "fiber"
local log = require "log"

fiber.create(function()
  fiber.name("memprof")

  collectgarbage() -- Collect all objects already dead
  log.warn("start of profile")

  local st, err = misc.memprof.start(FILENAME)
  if not st then
    log.error("failed to start profiler: %s", err)
  end

  fiber.sleep(TIME)

  collectgarbage()
  st, err = misc.memprof.stop()

  if not st then
    log.error("profiler on stop error: %s", err)
  end

  log.warn("end of profile")
end)

where

  • FILENAME—the name of the binary file where profiling events are written
  • TIME—duration of profiling, in seconds.

Also, you can directly call misc.memprof.start() and misc.memprof.stop() from a console.

In the example below, the following Lua code named format_concat.lua is investigated with the help of the memory profiler reports.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
-- Prevent allocations on new traces.
jit.off()

local function concat(a)
  local nstr = a.."a"
  return nstr
end

local function format(a)
  local nstr = string.format("%sa", a)
  return nstr
end

collectgarbage()

local binfile = "/tmp/memprof_"..(arg[0]):match("([^/]*).lua")..".bin"

local st, err = misc.memprof.start(binfile)
assert(st, err)

-- Payload.
for i = 1, 10000 do
  local f = format(i)
  local c = concat(i)
end
collectgarbage()

local st, err = misc.memprof.stop()
assert(st, err)

os.exit()

When you run this code under Tarantool and then parse the binary memory profile in /tmp/memprof_format_concat.bin, you will get the following profiling report:

ALLOCATIONS
@format_concat.lua:10: 19996 events +624284 bytes   -0 bytes
INTERNAL: 1 events                  +65536 bytes    -0 bytes

REALLOCATIONS

DEALLOCATIONS
INTERNAL: 19996 events              +0 bytes        -558778 bytes
    Overrides:
        @format_concat.lua:10

@format_concat.lua:10: 2 events     +0 bytes        -98304 bytes
    Overrides:
        @format_concat.lua:10

HEAP SUMMARY:
INTERNAL holds 65536 bytes: 1 allocs, 0 frees

Reasonable questions regarding the report can be:

  • Why are there no allocations related to the concat() function?
  • Why is the number of allocations not a round number?
  • Why are there approximately 20K allocations instead of 10K?

First of all, LuaJIT doesn’t create a new string if the string with the same payload exists (see details on lua-users.org/wiki). This is called string interning. So, when a string is created via the format() function, there is no need to create the same string via the concat() function, and LuaJIT just uses the previous one.

That is also the reason why the number of allocations is not a round number as could be expected from the cycle operator for i = 1, 10000...: Tarantool creates some strings for internal needs and built‑in modules, so some strings already exist.

But why are there so many allocations? It’s almost twice as big as the expected amount. This is because the string.format() built‑in function creates another string necessary for the %s identifier, so there are two allocations for each iteration: for tostring(i) and for string.format("%sa", string_i_value). You can see the difference in behavior by adding the line local _ = tostring(i) between lines 22 and 23.

To profile only the concat() function, comment out line 23 (which is local f = format(i)) and run the profiler. Now the output will look like this:

ALLOCATIONS
@format_concat.lua:5: 10000 events  +284411 bytes    -0 bytes

REALLOCATIONS

DEALLOCATIONS
INTERNAL: 10000 events              +0 bytes         -218905 bytes
    Overrides:
        @format_concat.lua:5

@format_concat.lua:5: 1 events      +0 bytes         -32768 bytes

HEAP SUMMARY:
@format_concat.lua:5 holds 65536 bytes: 10000 allocs, 9999 frees

Q: But what will change if JIT compilation is enabled?

A: In the code, comment out line 2 (which is jit.off()) and run the profiler . Now there are only 56 allocations in the report, and all other allocations are JIT-related (see also the related dev issue):

ALLOCATIONS
@format_concat.lua:5: 56 events +1112 bytes -0 bytes
@format_concat.lua:0: 4 events  +640 bytes  -0 bytes
INTERNAL: 2 events              +382 bytes  -0 bytes

REALLOCATIONS

DEALLOCATIONS
INTERNAL: 58 events             +0 bytes    -1164 bytes
    Overrides:
        @format_concat.lua:5
        INTERNAL


HEAP SUMMARY:
@format_concat.lua:0 holds 640 bytes: 4 allocs, 0 frees
INTERNAL holds 360 bytes: 2 allocs, 1 frees

This happens because a trace has been compiled after 56 iterations (the default value of the hotloop compiler parameter). Then, the JIT-compiler removed the unused variable c from the trace, and, therefore, the dead code of the concat() function is eliminated.

Next, let’s profile only the format() function with JIT enabled. For that, comment out lines 2 and 24 (jit.off() and local c = concat(i)), do not comment out line 23 (local f = format(i)), and run the profiler. Now the output will look like this:

ALLOCATIONS
@format_concat.lua:10: 19996 events +624284 bytes  -0 bytes
INTERNAL: 4 events                  +66928 bytes   -0 bytes
@format_concat.lua:0: 4 events      +640 bytes     -0 bytes

REALLOCATIONS

DEALLOCATIONS
INTERNAL: 19997 events              +0 bytes       -559034 bytes
    Overrides:
        @format_concat.lua:0
        @format_concat.lua:10

@format_concat.lua:10: 2 events     +0 bytes       -98304 bytes
    Overrides:
        @format_concat.lua:10


HEAP SUMMARY:
INTERNAL holds 66928 bytes: 4 allocs, 0 frees
@format_concat.lua:0 holds 384 bytes: 4 allocs, 1 frees

Q: Why are there so many allocations in comparison to the concat() function?

A: The answer is simple: the string.format() function with the %s identifier is not yet compiled via LuaJIT. So, a trace can’t be recorded and the compiler doesn’t perform the corresponding optimizations.

If we change the format() function in lines 9-12 of the Profiling report analysis example in the following way

local function format(a)
  local nstr = string.format("%sa", tostring(a))
  return nstr
end

the profiling report becomes much prettier:

ALLOCATIONS
@format_concat.lua:10: 109 events   +2112 bytes -0 bytes
@format_concat.lua:0: 4 events      +640 bytes  -0 bytes
INTERNAL: 3 events                  +1206 bytes -0 bytes

REALLOCATIONS

DEALLOCATIONS
INTERNAL: 112 events                +0 bytes    -2460 bytes
    Overrides:
        @format_concat.lua:0
        @format_concat.lua:10
        INTERNAL


HEAP SUMMARY:
INTERNAL holds 1144 bytes: 3 allocs, 1 frees
@format_concat.lua:0 holds 384 bytes: 4 allocs, 1 frees

This feature was added in version 2.8.1.

The end of each display is a HEAP SUMMARY section which looks like this:

@<filename>:<line number> holds <number of still reachable bytes> bytes:
<number of allocation events> allocs, <number of deallocation events> frees

Sometimes a program can cause many deallocations, so the DEALLOCATION section can become large, so the display is not easy to read. To minimize output, start the parsing with an extra flag: --leak-only, for example

$ tarantool -e 'require("memprof")(arg)' - --leak-only memprof_new.bin

When –leak-only` is used, only the HEAP SUMMARY section is displayed.