Author: Dan Magenheimer
Ramster maintainer: Konrad Wilk <konrad.wilk@oracle.com>

This is a HOWTO document for ramster which, as of this writing, is in
the kernel as a subdirectory of zcache in drivers/staging, called ramster.
(Zcache can be built with or without ramster functionality.)  If enabled
and properly configured, ramster allows memory capacity load balancing
across multiple machines in a cluster.  Further, the ramster code serves
as an example of asynchronous access for zcache (as well as cleancache and
frontswap) that may prove useful for future transcendent memory
implementations, such as KVM and NVRAM.  While ramster works today on
any network connection that supports kernel sockets, its features may
become more interesting on future high-speed fabrics/interconnects.

Ramster requires both kernel and userland support.  The userland support,
called ramster-tools, is known to work with EL6-based distros, but is a
set of poorly-hacked slightly-modified cluster tools based on ocfs2, which
includes an init file, a config file, and a userland binary that interfaces
to the kernel.  This state of userland support reflects the abysmal userland
skills of this suitably-embarrassed author; any help/patches to turn
ramster-tools into more distributable rpms/debs useful for a wider range
of distros would be appreciated.  The source RPM that can be used as a
starting point is available at:

    http://oss.oracle.com/projects/tmem/files/RAMster/

As a result of this author's ignorance, userland setup described in this
HOWTO assumes an EL6 distro and is described in EL6 syntax.  Apologies
if this offends anyone!

Kernel support has only been tested on x86_64.  Systems with an active
ocfs2 filesystem should work, but since ramster leverages a lot of
code from ocfs2, there may be latent issues.  A kernel configuration that
includes CONFIG_OCFS2_FS should build OK, and should certainly run OK
if no ocfs2 filesystem is mounted.

This HOWTO demonstrates memory capacity load balancing for a two-node
cluster, where one node called the "local" node becomes overcommitted
and the other node called the "remote" node provides additional RAM
capacity for use by the local node.  Ramster is capable of more complex
topologies; see the last section titled "ADVANCED RAMSTER TOPOLOGIES".

If you find any terms in this HOWTO unfamiliar or don't understand the
motivation for ramster, the following LWN reading is recommended:
-- Transcendent Memory in a Nutshell (lwn.net/Articles/454795)
-- The future calculus of memory management (lwn.net/Articles/475681)
And since ramster is built on top of zcache, this article may be helpful:
-- In-kernel memory compression (lwn.net/Articles/545244)

Now that you've memorized the contents of those articles, let's get started!

A. PRELIMINARY

1) Install two x86_64 Linux systems that are known to work when
   upgraded to a recent upstream Linux kernel version.

2) Configure, build and install, then boot Linux, just to ensure it
   can be done with an unmodified upstream kernel.  Confirm you booted
   the upstream kernel with "uname -a".

3) Unless you plan to test only swapping, and especially if you plan to
   do any performance testing, the "WasActive" patch is also highly
   recommended.  (Search lkml.org for WasActive, apply the patch, and
   rebuild your kernel.)  For a demo or simple testing, the patch can
   be ignored.

4) Install ramster-tools as root.  An x86_64 rpm for EL6-based systems
   can be found at:

    http://oss.oracle.com/projects/tmem/files/RAMster/

   (Sorry but for now, non-EL6 users must recreate ramster-tools on
   their own from source.  See above.)

5) Ensure that debugfs is mounted at each boot.  Examples below assume it
   is mounted at /sys/kernel/debug; one way to arrange this is sketched
   below.
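
   For reference, debugfs can be mounted by hand, or once and for all
   via /etc/fstab (the mount point matches the examples used throughout
   this HOWTO):

    # mount -t debugfs none /sys/kernel/debug

   or, in /etc/fstab:

    debugfs    /sys/kernel/debug    debugfs    defaults    0 0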

B. BUILDING RAMSTER INTO THE KERNEL

Do the following on each system:

1) Using the kernel configuration mechanism of your choice, change
   your config to include:

    CONFIG_CONFIGFS_FS=y	# NOTE: MUST BE y, not m

   For a linux-3.10 or later kernel, you should also set:

    CONFIG_RAMSTER_DEBUG=y

   Before building the kernel please doublecheck your kernel config
   file to ensure all of the settings are correct.
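
   The options shown above are only part of the picture.  As a rough
   sketch (verify the exact symbol names against the Kconfig files in
   your own tree, since the staging zcache/ramster options have moved
   around between releases), a ramster-capable config generally also
   enables cleancache, frontswap, zcache, and ramster itself:

    CONFIG_CLEANCACHE=y
    CONFIG_FRONTSWAP=y
    CONFIG_CONFIGFS_FS=y	# NOTE: MUST BE y, not m
    CONFIG_ZCACHE=y
    CONFIG_RAMSTER=y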

2) Build this kernel and change your boot file (e.g. /etc/grub.conf)
   so that the new kernel will boot.

3) Add "zcache" and "ramster" as kernel boot parameters for the new kernel.
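
   For example, on an EL6-style grub.conf the kernel line might end up
   looking roughly like the following sketch (the kernel version, root
   device, and other parameters are illustrative only):

    kernel /vmlinuz-3.10.0-ramster ro root=/dev/sda1 zcache ramster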

4) Reboot each system approximately simultaneously.

5) Check dmesg to ensure there are some messages from ramster, prefixed
   by "ramster":

    # dmesg | grep ramster

   You should also see a lot of files in:

    # ls /sys/kernel/debug/zcache
    # ls /sys/kernel/debug/ramster

   These are mostly counters for various zcache and ramster activities.
   You should also see files in:

    # ls /sys/kernel/mm/ramster

   These are sysfs files that control ramster as we shall see.

Ramster now will act as a single-system zcache on each system
but doesn't yet know anything about the cluster, so it can't yet do
anything remotely.

C. CONFIGURING THE RAMSTER CLUSTER

This part can be error prone unless you are familiar with clustering
filesystems.  We need to describe the cluster in a /etc/ramster.conf
file, and the init scripts that parse it are extremely picky about
the syntax.

1) Create a /etc/ramster.conf file and ensure it is identical on both
   systems.  This file mimics the ocfs2 format, and a good amount of
   documentation can be found by searching for ocfs2.conf.  A sketch of
   a complete file follows this item; the important per-node lines are
   the ip_address lines, one for each node:

	ip_address = my.ip.ad.r1
	ip_address = my.ip.ad.r2

   You must ensure that the "name" field in the file exactly matches
   the output of "hostname" on each system; if "hostname" shows a
   fully-qualified hostname, ensure the name is fully qualified in
   /etc/ramster.conf.  Obviously, substitute my.ip.ad.rx with proper
   IP addresses.
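
   As a concrete starting point, here is a minimal sketch of a two-node
   /etc/ramster.conf in the ocfs2 cluster.conf style (the field layout is
   assumed from ocfs2 documentation; the hostnames and port number are
   illustrative, the field lines must be tab-indented, and each "name"
   must match the node's hostname as described above):

cluster:
	node_count = 2
	name = ramster

node:
	ip_port = 7777
	ip_address = my.ip.ad.r1
	number = 0
	name = system1
	cluster = ramster

node:
	ip_port = 7777
	ip_address = my.ip.ad.r2
	number = 1
	name = system2
	cluster = ramster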

2) Enable the ramster service and configure it.  If you used the
   EL6 ramster-tools, this would be:

    # chkconfig --add ramster
    # service ramster configure

   Set "load on boot" to "y", cluster to start is "ramster" (or whatever
   name you chose in ramster.conf), heartbeat dead threshold as "500",
   network idle timeout as "1000000".  Leave the others as default.

3) Reboot both systems.  After reboot, try (assuming EL6 ramster-tools):

    # service ramster status

   You should see "Checking RAMSTER cluster "ramster": Online".  If you do
   not, something is wrong and ramster will not work.  Note that you
   should also see that the driver for "configfs" is loaded and mounted,
   the driver for ocfs2_dlmfs is not loaded, and some numbers for network
   parameters.  You will also see "Checking RAMSTER heartbeat: Not active".

4) Now you need to start the cluster heartbeat; the cluster is not "up"
   until all nodes detect a heartbeat.  In a real cluster, heartbeat detection
   is done via a cluster filesystem, but ramster doesn't require one.  Some
   hack-y kernel code in ramster can start the heartbeat for you though if
   you tell it what nodes are "up".  To enable the heartbeat, do:

    # echo 0 > /sys/kernel/mm/ramster/manual_node_up
    # echo 1 > /sys/kernel/mm/ramster/manual_node_up

   This must be done on BOTH nodes and, to avoid timeouts, must be done
   approximately concurrently on both nodes.  On an EL6 system, it is
   convenient to put these lines in /etc/rc.local.  To confirm that the
   cluster is now up, on both systems do:

    # dmesg | grep ramster

   You should see ramster "Accepted connection" messages in dmesg on both
   nodes after this.  Note that if you check userland status again with

    # service ramster status

   you will still see "Checking RAMSTER heartbeat: Not active".  That's
   still OK... the ramster kernel heartbeat hack doesn't communicate to
   userland.

5) You now must tell each node the node to which it should "remotify" pages.
   On this two node cluster, we will assume the "local" node, node 0, has
   memory overcommitted and will use ramster to utilize RAM capacity on
   the "remote node", node 1.  To configure this, on node 0, you do:

    # echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum

   You should see "ramster: node 1 set as remotification target" in dmesg
   on node 0.  Again, on EL6, /etc/rc.local is a good place to put this
   on node 0 so you don't forget to do it at each boot.

6) One more step:  By default, the ramster code does not "remotify" any
   pages; this is primarily for testing purposes, but sometimes it is
   useful.  This may change in the future, but for now, on node 0, you do:

    # echo 1 > /sys/kernel/mm/ramster/pers_remotify_enable
    # echo 1 > /sys/kernel/mm/ramster/eph_remotify_enable

   The first enables remotifying swap (persistent, aka frontswap) pages;
   the second enables remotifying of page cache (ephemeral, cleancache)
   pages.

   On EL6, these lines can also be put in /etc/rc.local (AFTER the
   node_up lines), or at the beginning of a script that runs a workload;
   a combined sketch appears just below.
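
   Pulling the sysfs settings from steps 4 through 6 together, the
   /etc/rc.local additions on node 0 might look like this sketch (in this
   two-node example, node 1 needs only the two manual_node_up lines):

    # mark nodes 0 and 1 as up (step 4); do the same on node 1
    echo 0 > /sys/kernel/mm/ramster/manual_node_up
    echo 1 > /sys/kernel/mm/ramster/manual_node_up
    # remotify to node 1 and enable remotification (steps 5 and 6, node 0 only)
    echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum
    echo 1 > /sys/kernel/mm/ramster/pers_remotify_enable
    echo 1 > /sys/kernel/mm/ramster/eph_remotify_enable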

7) Note that most testing has been done with both/all machines booted
   roughly simultaneously to avoid cluster timeouts.  Ideally, you should
   do this too unless you are trying to break ramster rather than just
   use it.

D. TESTING RAMSTER

1) Note that ramster has no value unless pages get "remotified".  For
   swap/frontswap/persistent pages, this doesn't happen unless/until
   the workload would cause swapping to occur, at which point pages
   are put into frontswap/zcache, and the remotification thread starts
   working.  To get to the point where the system swaps, you either
   need a workload for which the working set exceeds the RAM in the
   system; or you need to somehow reduce the amount of RAM one of
   the systems sees.  This latter is easy when testing in a VM, but
   harder on physical systems.  In some cases, "mem=xxxM" on the
   kernel command line restricts memory, but for some values of xxx
   the kernel may fail to boot.  One may also try creating a fixed
   RAMdisk, doing nothing with it, but ensuring that it eats up a fixed
   amount of RAM; a sketch of that trick follows this item.
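
   For example, assuming the brd ram disk driver is available as a module
   (sizes here are illustrative; brd's rd_size parameter is in kilobytes),
   something like the following consumes roughly 2GB of RAM:

    # modprobe brd rd_nr=1 rd_size=2097152
    # dd if=/dev/zero of=/dev/ram0 bs=1M

   The dd fills the ram disk once so that its pages are actually
   allocated; it exits with a harmless "no space left" message when the
   device is full.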

2) To see if ramster is working, on the "remote node", node 1, try:

    # grep . /sys/kernel/debug/ramster/foreign_*
    # # note, that is space-dot-space between grep and the pathname

   to monitor the number (and max) of ephemeral and persistent pages
   that ramster has sent over from the local node.  If these stay at
   zero, ramster is not working, either because the workload on the
   local node (node 0) isn't creating enough memory pressure or because
   "remotifying" isn't working.  On the local system, node 0, you can
   also watch lots of useful information.  For example, try:

    grep . /sys/kernel/debug/zcache/*pageframes* \
           /sys/kernel/debug/zcache/*zbytes* \
           /sys/kernel/debug/zcache/*zpages* \
           /sys/kernel/debug/ramster/*remote*

   Of particular note are the remote_*_pages_succ_get counters.  These
   show how many disk reads and/or disk writes have been avoided on the
   overcommitted local system by storing pages remotely using ramster.

   At the risk of information overload, you can also grep:

    /sys/kernel/debug/cleancache/* and /sys/kernel/debug/frontswap/*

   These show, for example, how many disk reads and/or disk writes have
   been avoided by using zcache to optimize RAM on the local system.
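
   To keep an eye on these counters while a workload runs, a simple
   sketch using watch(1) works on either node; for example, on node 1:

    # watch -d -n 5 'grep . /sys/kernel/debug/ramster/foreign_*'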

AUTOMATIC SWAP REPATRIATION

You may notice that while the systems are idle, the foreign persistent
page count on the remote machine slowly decreases.  This is because
ramster implements "frontswap selfshrinking":  When possible, swap
pages that have been remotified are slowly repatriated to the local
machine.  This is so that local RAM can be used when possible and
so that, in case of remote machine crash, the probability of loss
of data is reduced.

REBOOTING / POWEROFF

If a system is shut down while some of its swap pages still reside
on a remote system, the system may lock up during the shutdown
sequence.  This will occur if the network is shut down before the
swap mechanism is shut down, which is the default ordering on many
distros.  To avoid this annoying problem, simply shut off the swap
subsystem before starting the shutdown sequence, e.g.:

    # swapoff -a
    # reboot

Ideally, this swapoff-before-ifdown ordering should be enforced permanently
using shutdown scripts; one possible approach is sketched below.
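
On an EL6-style (SysV init) distro, one possible sketch (the script name,
priorities, and chkconfig header here are assumptions, not something
shipped in ramster-tools) is a tiny init script, say
/etc/init.d/ramster-swapoff, containing:

    #!/bin/sh
    # ramster-swapoff: turn swap off before the network goes down at shutdown
    # chkconfig: 2345 99 01
    # description: enforce swapoff-before-ifdown ordering for ramster
    case "$1" in
        start) touch /var/lock/subsys/ramster-swapoff ;;
        stop)  swapoff -a; rm -f /var/lock/subsys/ramster-swapoff ;;
    esac

which can then be enabled with:

    # chmod +x /etc/init.d/ramster-swapoff
    # chkconfig --add ramster-swapoff
    # service ramster-swapoff start

The stop priority of 01 makes the K01 script run before the network is
taken down, and the subsys lock file created at start ensures the stop
action actually runs during shutdown.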

KNOWN PROBLEMS

1) You may periodically see messages such as:

    ramster_r2net, message length problem

   This is harmless but indicates that a node is sending messages
   containing compressed pages that exceed the maximum for zcache
   (PAGE_SIZE*15/16).  The sender side needs to be fixed.

2) If you see a "No longer connected to node..." message or a "No connection
   established with node X after N seconds" message, it is possible you may
   be in an unrecoverable state.  If you are certain all of the
   appropriate cluster configuration steps described above have been
   performed, try rebooting the two servers concurrently to see if
   the cluster recovers.

   Note that "Connection to node... shutdown, state 7" is an intermediate
   connection state.  As long as you later see "Accepted connection", the
   intermediate states are harmless.

3) There are known issues in counting certain values.  As a result
   you may see periodic warnings from the kernel.  Almost always you
   will see "ramster: bad accounting for XXX".  There are also "WARN_ONCE"
   messages.  If you see kernel warnings with a tombstone, please report
   them.  They are harmless but reflect bugs that eventually need to be fixed.

ADVANCED RAMSTER TOPOLOGIES

The kernel code for ramster can support up to eight nodes in a cluster,
but no testing has been done with more than three nodes.

In the example described above, the "remote" node serves as a RAM
overflow for the "local" node.  This can be made symmetric by appropriate
settings of the sysfs remote_target_nodenum file.  For example, by setting:

    # echo 1 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 0, and

    # echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 1, each node can serve as a RAM overflow for the other.

For more than two nodes, a "RAM server" can be configured.  For a
three node system, set:

    # echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 1, and

    # echo 0 > /sys/kernel/mm/ramster/remote_target_nodenum

on node 2.  Then node 0 is a RAM server for node 1 and node 2.

In this implementation of ramster, any remote node is potentially a single
point of failure (SPOF).  Though the probability of failure is reduced
by automatic swap repatriation (see above), a proposed future enhancement
to ramster improves high availability for the cluster by sending a copy
of each page of data to two other nodes.  Patches welcome!