Documentation/scheduler/sched-tune.txt

   1              Central, scheduler-driven, power-performance control
   2                                (EXPERIMENTAL)
   3
   4 Abstract
   5 ========
   6
   7 The topic of a single, simple power-performance tunable that is wholly
   8 scheduler-centric and has well-defined and predictable properties has come up
   9 on several occasions in the past [1,2]. With techniques such as scheduler-driven
  10 DVFS [3], we now have a good framework for implementing such a tunable.
  11 This document describes the overall ideas behind its design and implementation.
  12
  13
  14 Table of Contents
  15 =================
  16
  17 1. Motivation
  18 2. Introduction
  19 3. Signal Boosting Strategy
  20 4. OPP selection using boosted CPU utilization
  21 5. Per task group boosting
  22 6. Questions and Answers
  23    - What about "auto" mode?
  24    - How are multiple groups of tasks with different boost values
  25      managed?
  26 7. References
  27
  28
  29 1. Motivation
  30 =============
  31
  32 Sched-DVFS [3] is a new event-driven cpufreq governor which allows the
  33 scheduler to select the optimal DVFS operating point (OPP) for running a task
  34 allocated to a CPU. The introduction of sched-DVFS enables running workloads at
  35 the most energy efficient OPPs.
  36
  37 However, sometimes it may be desired to intentionally boost the performance of
  38 a workload even if that could imply a reasonable increase in energy
  39 consumption. For example, in order to reduce the response time of a task, we
  40 may want to run the task at a higher OPP than the one that is actually required
  41 by its CPU bandwidth demand.
  42
  43 This last requirement is especially important if we consider that one of the
  44 main goals of the sched-DVFS component is to replace all currently available
  45 CPUFreq policies. Since sched-DVFS is event based, as opposed to the sampling
  46 driven governors we currently have, it is already more responsive at selecting
  47 the optimal OPP to run tasks allocated to a CPU. However, just tracking the
  48 actual task load demand may not be enough from a performance standpoint.  For
  49 example, it is not possible to get behaviors similar to those provided by the
  50 "performance" and "interactive" CPUFreq governors.
  51
  52 This document describes an implementation of a tunable, stacked on top of
  53 sched-DVFS, which extends its functionality to support task performance
  54 boosting.
  55
  56 By "performance boosting" we mean the reduction of the time required to
  57 complete a task activation, i.e. the time elapsed from a task wakeup to its
  58 next deactivation (e.g. because it goes back to sleep or it terminates).  For
  59 example, if we consider a simple periodic task which executes the same workload
  60 for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
  61 that task must complete each of its activations in less than 5[s].
  62
  63 A previous attempt [5] to introduce such a boosting feature has not been
  64 successful mainly because of the complexity of the proposed solution.  The
  65 approach described in this document exposes a single simple interface to
  66 user-space.  This single tunable knob allows the tuning of system wide
  67 scheduler behaviours ranging from energy efficiency at one end through to
  68 incremental performance boosting at the other end.  This first tunable affects
  69 all tasks. However, a more advanced extension of the concept is also provided
  70 which uses CGroups to boost the performance of only selected tasks while using
  71 the energy efficient default for all others.
  72
  73 The rest of this document describes in more detail the proposed solution,
  74 which has been named SchedTune.
  75
  76
  77 2. Introduction
  78 ===============
  79
  80 SchedTune exposes a simple user-space interface with a single power-performance
  81 tunable:
  82
  83   /proc/sys/kernel/sched_cfs_boost
  84
  85 This permits expressing a boost value as an integer in the range [0..100].
  86
  87 A value of 0 (default) configures the CFS scheduler for maximum energy
  88 efficiency. This means that sched-DVFS runs the tasks at the minimum OPP
  89 required to satisfy their workload demand.
  90 A value of 100 configures the scheduler for maximum performance, which
  91 translates to the selection of the maximum OPP on that CPU.
  92
  93 Values between 0 and 100 can be used to suit intermediate scenarios, for
  94 example to improve interactive response or to react to other system events
  95 (battery level, etc.).
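
As a minimal user-space sketch of how a management daemon might drive this
knob, the Python helpers below validate a value against the documented
[0..100] range before writing it. The helper names (validate_boost,
write_boost, read_boost) are hypothetical and not part of SchedTune; only the
/proc path comes from the text above, and it exists only on kernels that carry
this feature.

```python
from pathlib import Path

# Path taken from the text above; present only on kernels carrying SchedTune.
SCHED_CFS_BOOST = Path("/proc/sys/kernel/sched_cfs_boost")


def validate_boost(value: int) -> int:
    """Return value unchanged if it is an integer in [0..100], else raise."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("boost must be an integer")
    if not 0 <= value <= 100:
        raise ValueError("boost must be in the range [0..100]")
    return value


def write_boost(value: int, knob: Path = SCHED_CFS_BOOST) -> None:
    """Write a validated boost value to the knob (needs root on a real system)."""
    knob.write_text(f"{validate_boost(value)}\n")


def read_boost(knob: Path = SCHED_CFS_BOOST) -> int:
    """Read back the boost value currently stored in the knob."""
    return int(knob.read_text().strip())
```

The knob path is a parameter so that the helpers can be exercised against an
ordinary file on systems without SchedTune support.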
  96
  97 A CGroup based extension is also provided, which permits further user-space
  98 defined task classification to tune the scheduler for different goals depending
  99 on the specific nature of the task, e.g. background vs interactive vs
 100 low-priority.
 101
 102 The overall design of the SchedTune module is built on top of "Per-Entity Load
 103 Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating
 104 Performance Point (OPP) selection.
 105 Each time a task is allocated on a CPU, sched-DVFS has the opportunity to tune
 106 the operating frequency of that CPU to better match the workload demand. The
 107 selection of the actual OPP being activated is influenced by the global boost
 108 value, or the boost value for the task CGroup when in use.
 109
 110 This simple biasing approach leverages existing frameworks, which means minimal
 111 modifications to the scheduler, and yet it makes it possible to achieve a range
 112 of different behaviours, all from a single simple tunable knob.
 113 The only new concept introduced is that of signal boosting.
 114
 115
 116 3. Signal Boosting Strategy
 117 ===========================
 118
 119 The whole PELT machinery works based on the value of a few load tracking signals
 120 which basically track the CPU bandwidth requirements for tasks and the capacity
 121 of CPUs. The basic idea behind the SchedTune knob is to artificially inflate
 122 some of these load tracking signals to make a task or RQ appear more demanding
 123 than it actually is.
 124
 125 Which signals have to be inflated depends on the specific "consumer".  However,
 126 independently from the specific (signal, consumer) pair, it is important to
 127 define a simple and possibly consistent strategy for the concept of boosting a
 128 signal.
 129
 130 A boosting strategy defines how the "abstract" user-space defined
 131 sched_cfs_boost value is translated into an internal "margin" value to be added
 132 to a signal to get its inflated value:
 133
 134   margin         := boosting_strategy(sched_cfs_boost, signal)
 135   boosted_signal := signal + margin
 136
 137 Different boosting strategies were identified and analyzed before selecting the
 138 one found to be most effective.
 139
 140 Signal Proportional Compensation (SPC)
 141 --------------------------------------
 142
 143 In this boosting strategy the sched_cfs_boost value is used to compute a
 144 margin which is proportional to the complement of the original signal.
 145 When a signal has a maximum possible value, its complement is defined as
 146 the delta between the actual value and its possible maximum.
 147
 148 Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as
 149 the maximum possible value, the margin becomes:
 150
 151         margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal)
 152
 153 Using this boosting strategy:
 154 - a 100% sched_cfs_boost means that the signal is scaled to the maximum value
 155 - each value in the range of sched_cfs_boost effectively inflates the signal in
 156   question by a quantity proportional to its distance from the maximum value.
 157
 158 For example, by applying the SPC boosting strategy to the selection of the OPP
 159 to run a task it is possible to achieve these behaviors:
 160
 161 -   0% boosting: run the task at the minimum OPP required by its workload
 162 - 100% boosting: run the task at the maximum OPP available for the CPU
 163 -  50% boosting: run at the half-way OPP between minimum and maximum
 164
 165 This means that, at 50% boosting, a task is scheduled to run midway between the
 166 performance its workload requires and the maximum theoretically achievable
 167 performance on the specific target platform.
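
The SPC computation can be sketched in a few lines of Python. This is an
illustrative model, not the kernel code: it assumes SCHED_LOAD_SCALE is 1024
and treats sched_cfs_boost as a percentage, so the product in the margin
formula is divided by 100. The function names are hypothetical.

```python
SCHED_LOAD_SCALE = 1024  # assumed maximum signal value, for illustration only


def spc_margin(boost_pct: int, signal: int) -> int:
    """Margin proportional to the signal's complement (SCHED_LOAD_SCALE - signal)."""
    if not 0 <= boost_pct <= 100:
        raise ValueError("boost must be in the range [0..100]")
    if not 0 <= signal <= SCHED_LOAD_SCALE:
        raise ValueError("signal must be in the range [0..SCHED_LOAD_SCALE]")
    # boost_pct is a percentage, hence the integer division by 100.
    return boost_pct * (SCHED_LOAD_SCALE - signal) // 100


def boosted_signal(boost_pct: int, signal: int) -> int:
    """Inflated signal: the original value plus its SPC margin."""
    return signal + spc_margin(boost_pct, signal)
```

Under these assumptions a signal of 256 boosted by 50% gets a margin of 384
and becomes 640, which is midway between 256 and 1024. A 0% boost leaves the
signal at 256 and a 100% boost saturates it at 1024.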
 168
 169 A graphical representation of an SPC boosted signal is shown in the
 170 following figure where:
 171  a) "-" represents the original signal
 172  b) "b" represents a  50% boosted signal
 173  c) "p" represents a 100% boosted signal
 174
 175
 176    ^
 177    |  SCHED_LOAD_SCALE
 178    +-----------------------------------------------------------------+
 179    |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
 180    |
 181    |                                             boosted_signal
 182    |                                          bbbbbbbbbbbbbbbbbbbbbbbb
 183    |
 184    |                                            original signal
 185    |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
 186    |                                          |
 187    |bbbbbbbbbbbbbbbbbb                        |
 188    |                                          |
 189    |                                          |
 190    |                                          |
 191    |                  +-----------------------+
 192    |                  |
 193    |                  |
 194    |                  |
 195    |------------------+
 196    |
 197    |
 198    +----------------------------------------------------------------------->
 199
 200 The plot above shows a ramped load signal (labelled 'original signal') and its
 201 boosted equivalent. For each step of the original signal the boosted signal
 202 corresponding to a 50% boost is midway between the original signal and the
 203 upper bound. Boosting by 100% generates a boosted signal which is always
 204 saturated to the upper bound.
 205
 206
 207 4. OPP selection using boosted CPU utilization
 208 ==============================================
 209
 210 It is worth calling out that the implementation does not introduce any new load
 211 signals. Instead, it provides an API to tune existing signals. This tuning is
 212 done on demand and only in scheduler code paths where it is sensible to do so.
 213 The new API calls are defined to return either the default signal or a boosted
 214 one, depending on the value of sched_cfs_boost. This is a clean and non-invasive
 215 modification of the existing code paths.
 216
 217 The signal representing a CPU's utilization is boosted according to the
 218 previously described SPC boosting strategy. To sched-DVFS, this allows a CPU
 219 (i.e. CFS run-queue) to appear more utilized than it actually is.
 220
 221 Thus, with sched_cfs_boost enabled, we have the following main functions to
 222 get the current utilization of a CPU:
 223
 224   cpu_util()
 225   boosted_cpu_util()
 226
 227 The new boosted_cpu_util() is similar to the first but returns a boosted
 228 utilization signal which is a function of the sched_cfs_boost value.
 229
 230 This function is used in the CFS scheduler code paths where sched-DVFS needs to
 231 decide the OPP to run a CPU at.
 232 For example, this allows selecting the highest OPP for a CPU which has
 233 the boost value set to 100%.
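
To illustrate how a boosted utilization can bias OPP selection, the Python
sketch below models boosted_cpu_util() and a simple "lowest OPP that fits"
policy. It is a hedged approximation only: the OPP table, the select_opp()
helper and the SCHED_LOAD_SCALE value of 1024 are assumptions made for the
example, and the real sched-DVFS selection logic may apply further margins or
constraints.

```python
SCHED_LOAD_SCALE = 1024  # assumed maximum utilization value

# Hypothetical OPP table: (frequency in kHz, CPU capacity at that frequency),
# sorted by increasing capacity.
OPPS = [(500_000, 256), (1_000_000, 512), (1_500_000, 768), (2_000_000, 1024)]


def boosted_cpu_util(util: int, boost_pct: int) -> int:
    """Model of a boosted utilization: util plus its SPC margin."""
    margin = boost_pct * (SCHED_LOAD_SCALE - util) // 100
    return util + margin


def select_opp(util: int, boost_pct: int, opps=OPPS) -> int:
    """Return the frequency of the lowest OPP whose capacity fits the boosted util."""
    target = boosted_cpu_util(util, boost_pct)
    for freq, capacity in opps:
        if capacity >= target:
            return freq
    return opps[-1][0]  # nothing fits: fall back to the highest OPP
```

With this table, a utilization of 200 selects the 500 MHz OPP when unboosted,
the 1.5 GHz OPP at a 50% boost (boosted utilization 612) and the 2 GHz OPP at
a 100% boost.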
 234
 235
 236 5. Per task group boosting
 237 ==========================
 238
 239 The availability of a single knob which is used to boost all tasks in the
 240 system is certainly a simple solution but it quite likely doesn't fit many
 241 utilization scenarios, especially in the mobile device space.
 242
 243 For example, on battery powered devices there usually are many background
 244 services which are long running and need energy efficient scheduling. On the
 245 other hand, some applications are more performance sensitive and require an
 246 interactive response and/or maximum performance, regardless of the energy cost.
 247 To better service such scenarios, the SchedTune implementation has an extension
 248 that provides a more fine grained boosting interface.
 249
 250 A new CGroup controller, namely "schedtune", can be enabled, which allows
 251 defining and configuring task groups with different boosting values.
 252 Tasks that require special performance can be put into separate CGroups.
 253 The value of the boost associated with the tasks in this group can be specified
 254 using a single knob exposed by the CGroup controller:
 255
 256    schedtune.boost
 257
 258 This knob allows the definition of a boost value that is to be used for
 259 SPC boosting of all tasks attached to this group.
 260
 261 The current schedtune controller implementation is really simple and has these
 262 main characteristics:
 263
 264   1) It is only possible to create hierarchies that are one level deep
 265
 266      The root control group defines the system-wide boost value to be applied
 267      by default to all tasks. Its direct subgroups are named "boost groups" and
 268      they define the boost value for a specific set of tasks.
 269      Further nested subgroups are not allowed since they do not have a sensible
 270      meaning from a user-space standpoint.
 271
 272   2) It is possible to define only a limited number of "boost groups"
 273
 274      This number is defined at compile time and by default configured to 16.
 275      This is a design decision motivated by two main reasons:
 276      a) In a real system we do not expect utilization scenarios with more than a
 277         few boost groups. For example, a reasonable collection of groups could be
 278         just "background", "interactive" and "performance".
 279      b) It simplifies the implementation considerably, especially for the code
 280         which has to compute the per CPU boosting once there are multiple
 281         RUNNABLE tasks with different boost values.
 282
 283 Such a simple design should allow servicing the main utilization scenarios identified
 284 so far. It provides a simple interface which can be used to manage the
 285 power-performance of all tasks or only selected tasks.
 286 Moreover, this interface can be easily integrated by user-space run-times (e.g.
 287 Android, ChromeOS) to implement a QoS solution for task boosting based on task
 288 classification, which has been a long standing requirement.
 289
 290 Setup and usage
 291 ---------------
 292
 293 0. Use a kernel with CGROUP_SCHEDTUNE support enabled
 294
 295 1. Check that the "schedtune" CGroup controller is available:
 296
 297    root@linaro-nano:~# cat /proc/cgroups
 298    #subsys_name hierarchy       num_cgroups     enabled
 299    cpuset       0               1               1
 300    cpu          0               1               1
 301    schedtune    0               1               1
 302
 303 2. Mount a tmpfs to create the CGroups mount point (Optional)
 304
 305    root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup
 306
 307 3. Mount the "schedtune" controller
 308
 309    root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
 310    root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune
 311
 312 4. Setup the system-wide boost value (Optional)
 313
 314    If not configured the root control group has a 0% boost value, which
 315    basically disables boosting for all tasks in the system thus running in
 316    an energy-efficient mode.
 317
 318    root@linaro-nano:~# echo $SYSBOOST > /sys/fs/cgroup/stune/schedtune.boost
 319
 320 5. Create task groups and configure their specific boost value (Optional)
 321
 322    For example here we create a "performance" boost group configured to boost
 323    all its tasks to 100%
 324
 325    root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
 326    root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost
 327
 328 6. Move tasks into the boost group
 329
 330    For example, the following moves the task with PID $TASKPID (and all its
 331    threads) into the "performance" boost group.
 332
 333    root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs
 334
 335 This simple configuration allows only the threads of the $TASKPID task to run,
 336 when needed, at the highest OPP on the most capable CPU of the system.
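
The check in step 1 and the configuration in steps 5 and 6 could be scripted
from user-space along the following lines. This Python sketch is an
assumption-laden illustration: the helper names are hypothetical, the mount
point is the one used in the steps above, and error handling for a missing
controller or insufficient privileges is left out.

```python
from pathlib import Path

STUNE_ROOT = Path("/sys/fs/cgroup/stune")  # mount point used in the steps above


def schedtune_available(proc_cgroups_text: str) -> bool:
    """Return True if the /proc/cgroups text lists an enabled "schedtune" controller."""
    for line in proc_cgroups_text.splitlines():
        fields = line.split()
        if len(fields) == 4 and fields[0] == "schedtune":
            return fields[3] == "1"
    return False


def create_boost_group(name: str, boost: int, root: Path = STUNE_ROOT) -> Path:
    """Create a boost group directly under the root group and set its boost value."""
    if "/" in name:
        raise ValueError("only one level of boost groups is supported")
    if not 0 <= boost <= 100:
        raise ValueError("boost must be in the range [0..100]")
    group = root / name
    group.mkdir(exist_ok=True)
    (group / "schedtune.boost").write_text(f"{boost}\n")
    return group


def move_task(pid: int, group: Path) -> None:
    """Move a task (and all its threads) into the group via cgroup.procs."""
    (group / "cgroup.procs").write_text(f"{pid}\n")
```

On a system with the controller mounted, create_boost_group("performance", 100)
followed by move_task(pid, group) mirrors the shell commands shown in steps 5
and 6. Passing a different root makes it possible to try the helpers against a
scratch directory.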
 337
 338
 339 6. Questions and Answers
 340 ========================
 341
 342 What about "auto" mode?
 343 -----------------------
 344
 345 The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
 346 with some suitable user-space element. This element could use the exposed
 347 system-wide or cgroup based interface.
 348
 349 How are multiple groups of tasks with different boost values managed?
 350 ---------------------------------------------------------------------
 351
 352 The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
 353 on a CPU. When sched-DVFS selects the OPP to run a CPU at, the CPU utilization
 354 is boosted with a value which is the maximum of the boost values of the
 355 currently RUNNABLE tasks in its RQ.
 356
 357 This allows sched-DVFS to boost a CPU only while there are boosted tasks ready
 358 to run and switch back to the energy efficient mode as soon as the last boosted
 359 task is dequeued.
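
The per-CPU bookkeeping described above can be modelled with a small Python
class. This is a simplified sketch rather than the kernel data structure: the
CpuBoostTracker name and its methods are hypothetical, and it assumes that a
CPU with no RUNNABLE tasks falls back to a boost of 0.

```python
from collections import Counter


class CpuBoostTracker:
    """Tracks the boost values of the RUNNABLE tasks on one CPU."""

    def __init__(self) -> None:
        # Maps a boost value to the number of RUNNABLE tasks using it.
        self._runnable = Counter()

    def enqueue(self, boost_pct: int) -> None:
        """Account for a task with the given boost becoming RUNNABLE."""
        self._runnable[boost_pct] += 1

    def dequeue(self, boost_pct: int) -> None:
        """Account for a task with the given boost leaving the run-queue."""
        if self._runnable[boost_pct] <= 0:
            raise ValueError("no RUNNABLE task with this boost value")
        self._runnable[boost_pct] -= 1
        if self._runnable[boost_pct] == 0:
            del self._runnable[boost_pct]

    def cpu_boost(self) -> int:
        """Boost applied to the CPU: the maximum among RUNNABLE tasks, else 0."""
        return max(self._runnable, default=0)
```

Counting tasks per boost value means that dequeuing the last task of the
highest boost group immediately lowers the CPU boost to the next highest value
still present, or back to 0 once the run-queue holds no boosted tasks.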
 360
 361
 362 7. References
 363 =============
 364 [1] http://lwn.net/Articles/552889
 365 [2] http://lkml.org/lkml/2012/5/18/91
 366 [3] http://lkml.org/lkml/2015/6/26/620