Tuesday, 9 March 2021

How mdb calculates ZFS-related values and how those differ from the ZFS ARC size

 


Applies to:

Solaris Operating System - Version 10 6/06 U2 and later
Information in this document applies to any platform.

Purpose

This document describes how mdb calculates ZFS-related values and how those differ from the ZFS ARC size, so that the relationship between the two is understood correctly.

Details

ARC size reported by arcstats

The arcstats kernel statistics report the current ZFS ARC usage.

# kstat -n arcstats
module: zfs                             instance: 0     
name:   arcstats                        class:    misc
        buf_size                        37861488
        data_size                       7838309824
        l2_hdr_size                     0
        meta_used                       170464568
        other_size                      115650152
        prefetch_meta_size              16952928
        rawdata_size                    0
        size                            8008774392

(The output is cut for brevity.)

'size' is the amount of active data in the ARC and it can be broken down as follows.

Solaris 11.x prior to Solaris 11.3 SRU 13.4 and Solaris 10 without 150400-46/150401-46

size = meta_used + data_size;

Solaris 11.3 SRU 13.4 or later and Solaris 10 with 150400-46/150401-46 or later

size = data_size;


meta_used = buf_size + other_size + l2_hdr_size + rawdata_size + prefetch_meta_size;

buf_size: size of in-core data to manage ARC buffers.

other_size: size of in-core data to manage ZFS objects.

l2_hdr_size: size of in-core data to manage L2ARC.

rawdata_size: size of raw data used for persistent L2ARC. (Solaris 11.2.8 or later)

prefetch_meta_size: size of in-core data to manage prefetch. (Solaris 11.3 or later)

data_size: size of cached on-disk file data and on-disk meta data.
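
These values can be pulled directly with kstat(1M) to verify the breakdown above. The following is only a sketch and not part of the original document; the helper function name 'arcval' is made up for illustration, and the statistic names are the arcstats fields listed above.

#!/bin/sh
# Sketch: read the arcstats components and compare them with 'size'.
arcval() {
        kstat -p zfs:0:arcstats:"$1" | awk '{print $2}'
}
size=`arcval size`
data=`arcval data_size`
meta=`arcval meta_used`
echo "size                  = $size"
echo "data_size             = $data"
echo "meta_used             = $meta"
# Prior to Solaris 11.3 SRU 13.4 (or Solaris 10 without 150400-46/150401-46),
# 'size' should equal data_size + meta_used; on later releases it should
# equal data_size alone.
echo "data_size + meta_used = `echo $data $meta | awk '{print $1 + $2}'`"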

 

How ZFS ARC is allocated from kernel memory

The way the ZFS ARC is allocated from kernel memory depends on the Solaris version.

Solaris 10, Solaris 11.0, Solaris 11.1

To cache on-disk file data, ARC is allocated from 'zio_data_buf_XXX' (XXX indicates cache unit size, such as '4096', '8192' etc.) kmem caches allocated from 'zfs_file_data_buf' virtual memory (vmem) arena.
To cache on-disk meta data, ARC is allocated from 'zio_buf_XXX' kmem caches allocated from 'kmem_default' vmem arena.
In-core data is allocated from other kmem caches ('arc_buf_t', 'dmu_buf_impl_t', 'l2arc_buf_t', etc.), which are allocated from the 'kmem_default' vmem arena.
Note that 'zio_data_buf_XXX' and 'zio_buf_XXX' are not only used to cache on-disk file data and meta data; they are also used by ZFS I/O routines for purposes other than the ARC.

Pages for 'zio_data_buf_XXX' are associated with the 'zvp' vnode and in the 'kzioseg' kernel segment.
Pages for 'zio_buf_XXX' and other caches are associated with 'kvp', the usual kernel vnode.

On Solaris 11.1 with SRU 3.4 or later, in addition to the above, 'zfs_file_data_lp_buf' vmem arena is used to allocate large pages.
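
A quick way to see these allocations on a live system is to look at the caches and arenas named above with mdb and kstat; both interfaces are also used later in this document. This is only a sketch, assuming the standard kmem/vmem kstat statistic names:

# Per-cache usage of the zio caches (same ::kmastat output as shown later):
echo "::kmastat" | mdb -k | egrep 'zio_data_buf_|zio_buf_'

# Memory currently held by the 'zfs_file_data_buf' vmem arena:
kstat -p -m vmem | egrep zfs_file_data_buf | egrep 'mem_inuse|mem_total'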

Solaris 11.2

To cache on-disk file data, ARC is allocated from 'zio_data_buf_XXX' kmem caches allocated from 'zfs_file_data_buf' vmem arena.
To cache on-disk meta data, ARC is allocated from 'zio_buf_XXX' kmem caches allocated from the 'zfs_metadata_buf' vmem arena.
In-core data is allocated from other kmem caches ('arc_buf_t', 'dmu_buf_impl_t', 'l2arc_buf_t', 'zfetch_triggert_t', etc.), which are allocated from the 'kmem_default' vmem arena.
Note that 'zio_data_buf_XXX' and 'zio_buf_XXX' are not only used to cache on-disk file data and meta data; they are also used by ZFS I/O routines for purposes other than the ARC.

Pages for both 'zio_data_buf_XXX' and 'zio_buf_XXX' are associated with the 'zvp' vnode and in the 'kzioseg' kernel segment.
Pages for other caches are associated with 'kvp', the usual kernel vnode.

Solaris 11.3 prior to SRU 21.5

A new kernel memory allocation mechanism, the Kernel Object Manager (KOM), is introduced.
To cache on-disk file data, ARC is allocated from 'arc_data' kom class.
To cache on-disk meta data, ARC is allocated from 'arc_meta' kom class.
In-core data is allocated from other kmem caches ('arc_buf_t', 'dmu_buf_impl_t', 'l2arc_buf_t', 'zfetch_triggert_t', etc.), which are allocated from the 'kmem_default' vmem arena.
Memory used by ZFS I/O routines for purposes other than the ARC is allocated as 'kmem_alloc_XXX' from the 'kmem_default' vmem arena.

'kzioseg' segment and 'zvp' vnode no longer exist.
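
On these releases the KOM classes can be listed from mdb; the '::kom_class' output shown further below is an example. A minimal check:

# RSS and MEM_TOTAL of the 'arc_data' and 'arc_meta' KOM classes:
echo "::kom_class" | mdb -k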

Solaris 11.3 SRU 21.5 or later

To cache on-disk file data, ARC is allocated from 'arc_data' kom class.
To cache on-disk meta data, ARC is allocated from 'arc_meta' kom class.

The 'kmem_default_zfs' vmem arena is introduced to account for kernel memory used by ZFS for purposes other than caching on-disk data.

In-core data ('arc_buf_t', 'dmu_buf_impl_t', 'l2arc_buf_t', 'zfetch_triggert_t', etc.) is now allocated from the 'kmem_default_zfs' vmem arena.
Memory used by ZFS I/O routines for purposes other than the ARC is also allocated as 'zio_buf_XXX' from the 'kmem_default_zfs' vmem arena.
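
Assuming the new arena is exported through the usual vmem kstats, its current usage can be checked with a one-line sketch:

# Memory held by the 'kmem_default_zfs' vmem arena:
kstat -p -m vmem | egrep kmem_default_zfs | egrep 'mem_inuse|mem_total'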

 

ZFS information reported by ::memstat in mdb
::memstat also reports ZFS-related memory usage, but it is not exactly the same as arcstats, and its implementation depends on the OS version.

Solaris 10, Solaris 11.0, Solaris 11.1

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     540356              2110   13%
ZFS File Data              609140              2379   15%
Anon                        41590               162    1%
Exec and libs                5231                20    0%
Page cache                   2883                11    0%
Free (cachelist)           800042              3125   19%
Free (freelist)           2192512              8564   52%

Total                     4191754             16374
Physical                  4102251             16024

'ZFS File Data' shows the size of pages associated with 'zvp', which is the size allocated from the 'zio_data_buf_XXX' kmem caches.
It does not include on-disk meta data or in-core data. It also contains some amount of data used by ZFS I/O routines.

Solaris 11.2

> ::memstat
Page Summary                 Pages             Bytes  %Tot
----------------- ----------------  ----------------  ----
Kernel                      237329              1.8G   23%
Guest                            0                 0    0%
ZFS Metadata                 28989            226.4M    3%
ZFS File Data               699858              5.3G   67%
Anon                         41418            323.5M    4%
Exec and libs                 1366             10.6M    0%
Page cache                    4782             37.3M    0%
Free (cachelist)              1017              7.9M    0%
Free (freelist)              33817            264.1M    3%
Total                      1048576                8G

'ZFS File Data' shows the size allocated from the 'zfs_file_data_buf' vmem arena. 'ZFS Metadata' shows the size of pages associated with 'zvp' minus 'ZFS File Data'.
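
As a sketch (not part of the original document), the two values can be cross-checked directly on Solaris 11.2:

# 'ZFS File Data' reported by ::memstat should be close to the arena total:
echo "::memstat" | mdb -k | grep 'ZFS File Data'
kstat -p | egrep zfs_file_data_buf | egrep mem_total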

Solaris 11.3 prior to SRU 17.5.0

> ::memstat
Page Summary                 Pages             Bytes  %Tot
----------------- ----------------  ----------------  ----
Kernel                      558607              4.2G    7%
ZFS Metadata                 27076            211.5M    0%
ZFS File Data              2743214             20.9G   33%
Anon                         68656            536.3M    1%
Exec and libs                 2067             16.1M    0%
Page cache                    7285             56.9M    0%
Free (cachelist)             21596            168.7M    0%
Free (freelist)            4927709             37.5G   59%
Total                      8372224             63.8G

> ::kom_class
ADDR             FLAGS NAME             RSS        MEM_TOTAL
4c066e91d80      -L-   arc_meta         211.5m     280m      
4c066e91c80      ---   arc_data         20.9g      20.9g 

'ZFS File Data' shows the KOM statistics of 'arc_data'. 'ZFS Metadata' shows the KOM statistics of 'arc_meta'.

Solaris 11.3 SRU 17.5 or later, prior to SRU 21.5

> ::memstat -v
Page Summary                            Pages             Bytes  %Tot
---------------------------- ----------------  ----------------  ----
Kernel                                 636916              4.8G    4%
Kernel (ZFS ARC excess)                 16053            125.4M    0%
Defdump prealloc                       291049              2.2G    2%
ZFS Metadata                           137434              1.0G    1%
ZFS File Data                         4244593             32.3G   25%
Anon                                   114975            898.2M    1%
Exec and libs                            2000             15.6M    0%
Page cache                              15548            121.4M    0%
Free (cachelist)                       253689              1.9G    2%
Free (freelist)                      11064959             84.4G   66%
Total                                16777216              128G

On Solaris 11.3 SRU 17.5 or later, ::memstat has a '-v' option to show the details.

'ZFS File Data' and 'ZFS Metadata' show the KOM statistics, the same as before.

In addition, 'Kernel (ZFS ARC excess)' shows the memory wasted on top of the sum of 'ZFS File Data' and 'ZFS Metadata'.

KOM can keep allocated memory that is not actually used at the moment; such memory is considered wasted.
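
A rough way to see this from mdb, assuming the '::kom_class' fields shown earlier (MEM_TOTAL is what the class holds, RSS is what is actually in use), is to compare the two dcmds; the gap between MEM_TOTAL and RSS roughly corresponds to the excess:

echo "::kom_class" | mdb -k
echo "::memstat -v" | mdb -k | grep 'ZFS ARC excess'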

Solaris 11.3 SRU 21.5 or later

> ::memstat -v
Page Summary                            Pages             Bytes  %Tot
---------------------------- ----------------  ----------------  ----
Kernel                                 671736              2.5G    6%
Kernel (ZFS ARC excess)                 21159             82.6M    0%
Defdump prealloc                       361273              1.3G    3%
ZFS Kernel Data                        131699            514.4M    1%
ZFS Metadata                            42962            167.8M    0%
ZFS File Data                         8857479             33.7G   84%
Anon                                    99066            386.9M    1%
Exec and libs                            2050              8.0M    0%
Page cache                               9265             36.1M    0%
Free (cachelist)                        14663             57.2M    0%
Free (freelist)                        273905              1.0G    3%
Total                                10485257             39.9G

In addition to the information shown prior to Solaris 11.3 SRU 21.5, 'ZFS Kernel Data' shows the size allocated from the 'kmem_default_zfs' arena (and its overhead).

Solaris 11.4 or later

> ::memstat -v
Usage Type/Subtype                      Pages    Bytes  %Tot  %Tot/%Subt
---------------------------- ---------------- -------- ----- -----------
Kernel                                3669091    13.9g  7.2%
  Regular Kernel                      2602037     9.9g        5.1%/70.9%
  ZFS ARC Fragmentation                 14515    56.6m        0.0%/ 0.3%
  Defdump prealloc                    1052539     4.0g        2.0%/28.6%
ZFS                                  28359638   108.1g 56.3%
  ZFS Metadata                         116083   453.4m        0.2%/ 0.4%
  ZFS Data                           27959629   106.6g       55.5%/98.5%
  ZFS Kernel Data                      283926     1.0g        0.5%/ 1.0%
User/Anon                              201462   786.9m  0.4%
Exec and libs                            3062    11.9m  0.0%
Page Cache                              29372   114.7m  0.0%
Free (cachelist)                          944     3.6m  0.0%
Free                                 18033911    68.7g 35.8%
Total                                50297480   191.8g  100%

 'ZFS ARC Fragmentation' under 'Kernel' shows the wasted memory.

 

Why do values reported by ::memstat differ from the size reported by arcstats?

There are a few factors.

The ARC size includes cached on-disk file data, cached on-disk meta data, and various in-core data, but ::memstat does not report each of them. Prior to Solaris 11.2, only 'ZFS File Data' is reported.
Even on Solaris 11.2 and 11.3, in-core data is not reported. Also, the accounting by arcstats and ::memstat does not completely match.

::memstat on Solaris 11.3 SRU 21.5 or later reports in-core data as 'ZFS Kernel Data', though in-core data counted by arcstats and by ::memstat are not exactly the same.

Another factor is wasted memory in kmem caches.
Consider a possible scenario: a customer ran a workload that was largely based on a 128K blocksize. This filled up the ARC with, say, X GB of 128K blocks. The customer then switched to a workload that was 8K based, and the ARC filled up with Y GB of 8K blocks (the 128K blocks were evicted). When the 128K blocks are evicted from the ARC, they are returned to the 'zio_data_buf_131072' cache, where they stay (unused by the ARC) until they are either re-allocated or "reaped" by the VM system.

Under such a condition, 'ZFS File Data' shown by ::memstat can be much higher than the ARC size.
In particular, from Solaris 11.1 SRU 3.4 through Solaris 11.1 SRU 21.4, large pages are used by default and the situation can be worse.
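
On releases that use the zio kmem caches, this kind of waste shows up as the gap between 'buf total' and 'buf in use' for the affected cache. A sketch for the 128K cache from the scenario above:

# Buffers still held by the cache but no longer used by the ARC:
echo "::kmastat" | mdb -k | egrep zio_data_buf_131072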

::memstat reports such waste as 'Kernel (ZFS ARC excess)' on Solaris 11.3 SRU 17.5 or later, or 'ZFS ARC Fragmentation' on Solaris 11.4 or later.

It can also happen that 'ZFS File Data' is higher than the ARC size even though 'ZFS ARC excess' / 'ZFS ARC Fragmentation' is not high.
In this case, the ARC memory has been freed but still has KOM objects associated with it.

As discussed above, the values reported by ::memstat do not have to match the ZFS ARC size. It is not an issue if the ::memstat values are larger or smaller than the ZFS ARC size.

 

-------------

 



Applies to:

Solaris Operating System - Version 8.0 to 11.4 [Release 8.0 to 11.0]
All Platforms
*** Checked for currency and updated for Solaris 11.2 11-March-2015 ***


Goal

This document is intended to give hints on where to look when checking and troubleshooting memory usage.
In principle, the investigation of memory usage is split into checking the usage of kernel memory and of user memory.

Please be aware that in case of a memory-usage problem on a system, corrective actions usually require deep knowledge and must be performed with great care.

Solution

A general system practice is to keep the system up to date with the latest Solaris releases and patches.

First, you need to check how much memory is used in the kernel and how much is used in user memory. This is important in order to decide which further troubleshooting steps are required.

A very useful mdb dcmd is '::memstat' (this command can take several minutes to complete).
For more information on using the modular debugger, see the Oracle Solaris Modular Debugger Guide.
'::memstat' is available on the Solaris[TM] 9 Operating System or greater only. The format varies with the OS release; this example is from Solaris 11.2.

# echo "::memstat" | mdb -k
Page Summary                 Pages             Bytes  %Tot
----------------- ----------------  ----------------  ----
Kernel                      585584              4.4G   14%
Defdump prealloc            204802              1.5G    5%
Guest                            0                 0    0%
ZFS Metadata                 21436            167.4M    0%
ZFS File Data               342833              2.6G    8%
Anon                         56636            442.4M    1%
Exec and libs                 1131              8.8M    0%
Page cache                    4339             33.8M    0%
Free (cachelist)              8011             62.5M    0%
Free (freelist)            2969532             22.6G   71%
Total                      4194304               32G



User Memory Usage: print out the processes using the most user memory
% prstat -s size # sorted by userland virtual memory consumption
% prstat -s rss # sorted by userland physical memory consumption

% prstat -s rss
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  4051 user1     297M  258M sleep   59    0   1:35:05 0.0% mysqld/10
 26286 user2     229M  180M sleep   59    0   0:05:07 0.0% java/53
 27101 user2     237M  150M sleep   59    0   0:02:21 0.0% soffice.bin/5
 23335 user2     193M  135M sleep   59    0   0:12:33 0.0% firefox-bin/10
  3727 noaccess  192M  131M sleep   59    0   0:36:22 0.0% java/18
 22751 root      165M  131M sleep   59    0   1:13:12 0.0% java/46
  1448 noaccess  192M  108M sleep   59    0   0:34:47 0.0% java/18
 10115 root      129M   82M sleep   59    0   0:31:29 0.0% java/41
 20274 root      136M   77M stop    59    0   0:04:08 0.0% java/25
  3397 root      138M   76M sleep   59    0   0:12:42 0.0% java/37
 12949 pgsql      81M   70M sleep   59    0   0:09:36 0.0% postgres/1
 12945 pgsql      80M   70M sleep   59    0   0:00:05 0.0% postgres/1



User Memory Usage: show Shared Memory and Semaphores:

% ipcs -a

IPC status from
T  ID     KEY        MODE     OWNER   GROUP  CREATOR  CGROUP CBYTES  QNUM     QBYTES  LSPID  LRPID   STIME    RTIME    CTIME
Message Queues:
q  0  0x55460272 -Rrw-rw----   root    root     root    root    0       0     4194304  1390  18941  14:12:20  14:12:21  10:23:32
q  1  0x41460272 --rw-rw----   root    root     root    root    0       0     4194304  5914   1390   8:03:34   8:03:34  10:23:39
q  2  0x4b460272 --rw-rw----   root    root     root    root    0       0     4194304     0      0  no-entry  no-entry  10:23:39

T  ID      KEY       MODE      OWNER     GROUP CREATOR    CGROUP    NATTCH       SEGSZ  CPID   LPID     ATIME     DTIME    CTIME
Shared Memory:
m  0  0x50000b3f --rw-r--r--   root      root     root      root         1           4   738   738   18:50:36  18:50:36  18:50:36
m  1  0x52574801 --rw-rw----   root    oracle     root    oracle        35  1693450240  2049  26495  10:30:00  10:30:00  18:51:13
m  2  0x52574802 --rw-rw----   root    oracle     root    oracle        35  1258291200  2049  26495  10:30:00  10:30:00  18:51:16
m  3  0x52594801 --rw-rw----   root    oracle     root    oracle        12   241172480  2098  14328   7:58:33   7:58:33  18:51:27
m  4  0x52594802 --rw-rw----   root    oracle     root    oracle        12    78643200  2098  14329   7:58:32   7:58:33  18:51:27
m  5  0x52584801 --rw-rw----   root    oracle     root    oracle        13   125829120  2125  27492   1:36:12   1:36:12  18:51:34
m  6  0x52584802 --rw-rw----   root    oracle     root    oracle        13   268435456  2125  27487   1:36:10   1:36:11  18:51:34
m  7  0x525a4801 --rw-rw----   root    oracle     root    oracle        15   912261120  2160  27472   1:36:09   1:36:09  18:51:40
m  8  0x525a4802 --rw-rw----   root    oracle     root    oracle        15   268435456  2160  27467   1:36:08   1:36:09  18:51:42
m 8201 0x4d2     --rw-rw-rw-   root      root     root      root         0       32008  1528   1543  10:26:03  10:26:04  10:25:53

T  ID  KEY       MODE     OWNER       GROUP       CREATOR        CGROUP         NSEMS     OTIME    CTIME
Semaphores:
s  0   0x1   --ra-ra-ra-   root        root          root         root              1     16:17:35  18:50:33
s  1     0   --ra-ra----   root       oracle         root         oracle           36     10:33:28  18:51:17
s  2     0   --ra-ra----   root       oracle         root         oracle           13     10:33:28  18:51:27
s  3     0   --ra-ra----   root       oracle         root         oracle           14     10:33:28  18:51:34
s  4     0   --ra-ra----   root       oracle         root         oracle           16     10:33:27  18:51:42
s  5 0x4d2   --ra-ra-ra-   root       root           root         root               1    no-entry  10:25:53
s  6 0x4d3   --ra-ra-ra-   root       root           root         root               1    no-entry  10:25:53




User Memory Usage: list the User Memory usage of all processes (except PIDs 0, 2, 3)

# pmap -x /proc/* > /var/tmp/pmap-x
Short list of the total usage of these processes:

% egrep "[0-9]:|^total" /var/tmp/pmap-x
     1:   /sbin/init
total Kb 2336 2080  128 -
1006:  rlogin cores4
total Kb 2216 1696    80 -
1007:  rlogin cores4
total Kb 2216 1696  104 -
  115:  /usr/sbin/nscd
total Kb 4208 3784 1704 -
-- snip --




User Memory Usage: check the usage of /tmp

% df -kl /tmp
Filesystem kbytes        used        avail capacity  Mounted on 
swap        1355552    2072 1353480        1%      /tmp

Print the 10 biggest files and directories in /tmp:

% du -akd /tmp/ | sort -n | tail -10
288     /tmp/SUNWut
328     /tmp/log
576     /tmp/ips2
584     /tmp/explo
608     /tmp/ipso
3408    /tmp/sshd-truss.out
17992   /tmp/truss.p
22624   /tmp/js
49208   /tmp



 
User Memory Usage: overall memory usage on the system

% vmstat -p 3
     memory           page          executable      anonymous      filesystem
   swap  free     re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
19680912 27487976 21  94   0   0   0    0    0    0    0    0    0   14    0    0
 3577608 11959480  0  20   0   0   0    0    0    0    0    0    0    0    0    0
 3577328 11959240  0   5   0   0   0    0    0    0    0    0    0    0    0    0
 3577328 11959112 38 207   0   0   0    0    0    0    0    0    0    0    0    0
 3577280 11958944  0   1   0   0   0    0    0    0    0    0    0    0    0    0

 

The scan rate 'sr' should be zero or near zero.



 
User Memory Usage: swap usage

% swap -l
swapfile              dev    swaplo  blocks      free
/dev/dsk/c0t0d0s1   32,25        16  1946032  1946032

% swap -s
total: 399400k bytes allocated + 18152k reserved = 417552k used, 1355480k available




Common kernel statistics

Print out all kernel statistics in a parseable format:

% kstat -p > /var/tmp/kstat-p



Kernel memory statistics:

% kstat -p -c kmem_cache
% kstat -p -m vmem
% kstat -p -c vmem
% kstat -p | egrep zfs_file_data_buf | egrep mem_total
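
As a sketch, the snapshot saved above can be used to compare the ARC size with the file-data arena in one pass (on releases that still have the 'zfs_file_data_buf' arena):

# ARC size and file-data arena total from the saved kstat output:
grep 'zfs:0:arcstats:size' /var/tmp/kstat-p
grep 'zfs_file_data_buf:mem_total' /var/tmp/kstat-p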



As an alternative to kstat, you can get kernel memory usage with '::kmastat', which prints the kmem cache buffer statistics:

# echo "::kmastat" | mdb -k > /var/tmp/kmastat
% more /var/tmp/kmastat
    cache                     buf    buf    buf     memory     alloc  alloc
    name                     size in use   total    in use   succeed  fail
 ------------------------- ------ ------  ------ --------- --------- -----
  kmem_magazine_1              16    470     508      8192       470     0
  kmem_magazine_3              32    970    1016     32768      1164     0
  kmem_magazine_7              64   1690    1778    114688      1715     0


Look for the highest numbers in the "memory in use" column and for any values greater than 0 in the "alloc fail" column.
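
As a sketch, assuming the plain byte values shown in the sample above (newer releases print scaled units such as 'K' or 'M', in which case a plain numeric sort will not work), the saved file can be post-processed like this:

# Ten caches with the most memory in use (5th column):
sort -n -k5,5 /var/tmp/kmastat | tail -10
# Caches with allocation failures (7th column):
awk '$7+0 > 0' /var/tmp/kmastat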

 

ZFS File Data:
    Keep the system up to date with the latest Solaris releases and patches.
    Size memory requirements to the actual system workload.
        With a known application memory footprint, such as for a database application, you might cap the ARC size so that the application will not need to reclaim its necessary memory from the ZFS cache.
    Consider de-duplication memory requirements.
    Identify ZFS memory usage with the following command:

# mdb -k
Loading modules: [ unix genunix specfs dtrace zfs scsi_vhci sd mpt mac px ldc ip
 hook neti ds arp usba kssl sockfs random mdesc idm nfs cpc crypto fcip fctl ufs
 logindmux ptm sppp ipc ]
> ::memstat
Page Summary                 Pages             Bytes  %Tot
----------------- ----------------  ----------------  ----
Kernel                      261969              1.9G    6%
Guest                            0                 0    0%
ZFS Metadata                 13915            108.7M    0%
ZFS File Data               111955            874.6M    3%
Anon                         52339            408.8M    1%
Exec and libs                 1308             10.2M    0%
Page cache                    5932             46.3M    0%
Free (cachelist)             16460            128.5M    0%
Free (freelist)            3701754             28.2G   89%
Total                      4165632             31.7G
> $q

In case the amount of ZFS File Data is too high on the system, you might consider limiting how much memory ZFS can consume.

For Solaris revisions prior to Solaris 11, the only way to accomplish this is to limit the ARC cache by setting zfs:zfs_arc_max in /etc/system:

set zfs:zfs_arc_max = [size]

For example, to limit the cache to 1 GB in size:

set zfs:zfs_arc_max = 1073741824
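
After a reboot, the effective cap can be checked against arcstats; this is a minimal sketch ('c_max' is the ARC's maximum target size):

# Current ARC maximum target size in bytes:
kstat -p zfs:0:arcstats:c_max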

Please check the following documents on how to check and limit the ARC:
How to Understand "ZFS File Data" Value by mdb and ZFS ARC Size. (Doc ID 1430323.1)
Oracle Solaris Tunable Parameters Reference Manual

Starting with Solaris 11, a second method, reserving memory for applications, may be used to prevent ZFS from using too much memory.

The entry in /etc/system looks like this:

set user_reserve_hint_pct=60
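
The value currently active in the running kernel can be read back with a generic mdb variable read; this is only a sketch using the tunable name above:

# Print the current value of user_reserve_hint_pct (decimal):
echo "user_reserve_hint_pct/D" | mdb -k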

 
