firefly-linux-kernel-4.4.55.git
6 years agosched/fair: Simplify idle_idx handling in select_idle_sibling()
Dietmar Eggemann [Mon, 16 Jan 2017 12:42:59 +0000 (12:42 +0000)]
sched/fair: Simplify idle_idx handling in select_idle_sibling()

Rename best_idle to best_idle_cpu so the same name is used like in
find_best_target().

Fix if (best_idle > 0) since best_idle_cpu = 0 is a valid target.

Use 'unsigned long' data type for best_idle_capacity.

Since we're looking for the shallowest best_idle_cstate initialize
best_idle_cstate = INT_MAX. For cpus which are not idle (idle_idx = -1)
the condition 'if (idle_idx < best_idle_cstate && ...)' is never
executed.

Change-Id: Ic5b63d58478696b3d1ec6253cf739a69a574cf99
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 8bff5e9c0968108d465e1f2a4624fc5ec2f00849)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agosched/fair: refactor find_best_target() for simplicity
Dietmar Eggemann [Sat, 7 Jan 2017 14:33:51 +0000 (14:33 +0000)]
sched/fair: refactor find_best_target() for simplicity

Simplify backup_capacity handling and use 'unsigned long'
data type for cpu capacity, simplify target_util handling,
simplify idle_idx handling & refactor min_util, new_util.

Also return first idle cpu for prefer_idle task immediately.

Change-Id: Ic89e140f7b369f3965703fdc8463013d16e9b94a
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agosched/fair: Change cpu iteration order in find_best_target()
Dietmar Eggemann [Fri, 13 Jan 2017 17:54:34 +0000 (17:54 +0000)]
sched/fair: Change cpu iteration order in find_best_target()

The schedtune task parameter 'boosted' is mapped into the cpu iteration
order. Currently for 'boosted' equal true the iteration starts at the
last cpu (NR_CPUS-1) whereas for 'boosted' equal false it starts at the
first cpu (0).

This only has the desired effect if the cpu topology oerdering matches
the underlying assumption. This e.g. is the case for the
Qc snapdragon 821 with its [L0 L1 b0 b1] cpu topology layout
(L=lower max freq, b=higher max freq). This results in cpus with higher
maximum capacity being given the highest logical cpu ids. However not
all big.LITTLE systems enumerate their cpus in the same way. For example,
the ARM Versatile Express Juno board has 6 cpus for which the default
configuration has topology [L0 b0 b1 L1 L2 L3].

To make this approach independent from the cpu topology layout it now
iterates over the cpus in the order of the sched_groups of the EAS
sched_domain (sd_ea). The order of cpu iteration is different for the
different cpu types in case the cpu is used to dereference sd_ea.

Considering the Qc snapdragon 821 again, for cpu L0 and L1 the order is
'b0->b1->L0->L1' whereas for b0 and b1 the order is 'b0->b1->L0->L1'.

This approach does not allow the exact same iteration order as with the
currently used flat iteration over [0 .. NR_CPUS-1] but the cpus
are ordered by the original cpu capacity.

The cpu iteration is now done in the sd_ea sched_group order required by
the 'boosted' value ['L0->L1->b0->b1'/'b0->b1->L0->L1'] rather than
forward/backward over the flat cpu space ['L0->L1->b0->b1'/
'b1->b0->L1->L0'].

Change-Id: I8fbe2073dedd2ecb1c750620c6000c11a5ff4358
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit a0c6a4272c3968c0ff50d3fed65f5865b72d777b)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agosched/core: Add first cpu w/ max/min orig capacity to root domain
Dietmar Eggemann [Sun, 8 Jan 2017 16:16:59 +0000 (16:16 +0000)]
sched/core: Add first cpu w/ max/min orig capacity to root domain

This will allow to start iterating from a cpu with max or min original
capacity in the wakeup path regardless on which cpu the scheduler is
currently running (smp_processor_id()) or the previous cpu of the task
(task_cpu(p)). This iteration has to happen on a sched_domain spanning
all cpus in the order of the sched_groups of this sched_domain seen by
the starting cpu.

In case of an SMP system the first cpu with max orig capacity and the
the one with min orig capacity is the same. This can temporally happen
on a big.LITTLE system with hotplug as well.

E.g. the different order of cpu iteration can be used to map schedtune
task parameter 'boosted' into the cpu iteration order in
find_best_target().

Use of READ_ONCE()/WRITE_ONCE() to avoid load/store tearing.

Change-Id: I812fbd9c7e5f506617e456c0eec3edcd2c016e92
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit fd6e9543c1fd8971a5e2e68e39b2f6e591d46114)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agosched/core: Remove remnants of commit fd5c98da1a42
Dietmar Eggemann [Sun, 8 Jan 2017 14:40:38 +0000 (14:40 +0000)]
sched/core: Remove remnants of commit fd5c98da1a42

Commit fd5c98da1a42 "WIP: sched: Store system-wide maximum cpu capacity
in root domain" was repalced by commit 8148bdfff4f5 "WIP: sched: Update
max cpu capacity in case of max frequency constraints" which didn't
remove all the now unused bits.

Change-Id: I067f6366431f43337cffa7a2a8e0de32dd33d2f9
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 6d284a607cec51bcafca313bc396bc3103b1e876)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agosched: Remove sysctl_sched_is_big_little
Dietmar Eggemann [Fri, 13 Jan 2017 13:51:24 +0000 (13:51 +0000)]
sched: Remove sysctl_sched_is_big_little

With the new wakeup approach this sysctl is not necessary any more.

Change-Id: I52114b3c918791f6a4f9f30f50002919ccbc1a9c
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 885c0d503bcdf0ef4e9b46822496f16b20aa3bbd)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agosched/fair: Code !is_big_little path into select_energy_cpu_brute()
Dietmar Eggemann [Wed, 22 Mar 2017 18:16:03 +0000 (18:16 +0000)]
sched/fair: Code !is_big_little path into select_energy_cpu_brute()

This patch replaces the existing EAS upstream implementation of
select_energy_cpu_brute() with the one of find_best_target() used
in Android previously.

It also removes the cpumask 'and' from select_energy_cpu_brute,
see the existing use of 'cpu = smp_processor_id()' in
select_task_rq_fair().

Change-Id: If678c002efaa87d1ba3ec9989a4e9f8df98b83ec
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[ added guarding for non-schedtune builds ]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoEAS: sched/fair: Re-integrate 'honor sync wakeups' into wakeup path
Dietmar Eggemann [Mon, 5 Dec 2016 14:15:54 +0000 (14:15 +0000)]
EAS: sched/fair: Re-integrate 'honor sync wakeups' into wakeup path

This patch re-integrates the part which was initially provided by
3b9d7554aeec ("EAS: sched/fair: tunable to honor sync wakeups") into
energy_aware_wake_cpu() into select_energy_cpu_brute().

Change-Id: I748fde3ecdeb44651179bce0a5bb8dd82d1903f6
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit b75b7286cb068d5761621ea134c23dd131db953f)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoFixup!: sched/fair.c: Set SchedTune specific struct energy_env.task
Dietmar Eggemann [Sun, 4 Dec 2016 17:29:34 +0000 (17:29 +0000)]
Fixup!: sched/fair.c: Set SchedTune specific struct energy_env.task

This has to be done in the caller function of energy_diff() version of
SchedTune to avoid Null pointer dereference in energy_diff().

Change-Id: I3f0f68dbd11efb15bbb3b1832f8294419ed85241
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 14531d4e245d063f713ee5ed835df958e6c7838f)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agosched/fair: Energy-aware wake-up task placement
Morten Rasmussen [Wed, 30 Mar 2016 13:29:48 +0000 (14:29 +0100)]
sched/fair: Energy-aware wake-up task placement

When the systems is not overutilized, place waking tasks on the most
energy efficient cpu. Previous attempts reduced the search space by
matching task utilization to cpu capacity before consulting the energy
model as this is an expensive operation. The search heuristics didn't
work very well and lacking any better alternatives this patch takes the
brute-force route and tries all potential targets.

This approach doesn't scale, but it might be sufficient for many
embedded applications while work is continuing on a heuristic that can
minimize the necessary computations. The heuristic must be derrived from
the platform energy model rather than make additional assumptions, such
lower capacity implies better energy efficiency. PeterZ mentioned in the
past that we might be able to derrive some simpler deciding functions
using mathematical (modal?) analysis.

Change-Id: I772bacb4c8fd599f8006fa422f842e66377a9c6c
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[rebase: on top of msm-google/android-msm-marlin-3.18]
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit a894422dbdb7b77ea2acfe7ff909ccb5ded23514)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agosched/fair: Add energy_diff dead-zone margin
Morten Rasmussen [Wed, 30 Mar 2016 13:20:12 +0000 (14:20 +0100)]
sched/fair: Add energy_diff dead-zone margin

It is not worth the overhead to migrate tasks for tiny insignificant
energy savings. To prevent this, an energy margin is introduced in
energy_diff() which effectively adds a dead-zone that rounds tiny energy
differences to zero. Since no scale is enforced for energy model data
the margin can't be absolute. Instead it is defined as +/-1.56% energy
saving compared to the current total estimated energy consumption.

Change-Id: I6be069c752c701fb825430896b3b768a7ab2fee4
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[rebase: on top of msm-google/android-msm-marlin-3.18,
         massage original patch which changes code in energy_diff()
 into __energy_diff() introduced by SchedTune]
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
(cherry picked from commit 780cb5a5fa47adf13d4fc2b77e8e94448cd56098)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agosched/fair: Decommission energy_aware_wake_cpu()
Dietmar Eggemann [Thu, 26 Jan 2017 16:04:34 +0000 (16:04 +0000)]
sched/fair: Decommission energy_aware_wake_cpu()

The EAS functionality in the wakeup path will be brought back by the
following patch ("sched/fair: Energy-aware wake-up task placement")
providing the function select_energy_cpu_brute().

Change-Id: I927fb9e8261cfacfe404695f853941c7959aa146
[ Trivial merge conflicts resolved. ]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 80aee424fb7765a777267e144037642625a71304)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agosched/fair: Do not force want_affine eq. true if EAS is enabled
Dietmar Eggemann [Thu, 1 Dec 2016 18:37:58 +0000 (18:37 +0000)]
sched/fair: Do not force want_affine eq. true if EAS is enabled

This lets us use Capacity-Aware Scheduling (CAS) if EAS is enabled.

Change-Id: I2e647a201ea0b733d1487c3e153047a49fb22847
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 00b7da2ae58bf568529e67614980f77e275b8d29)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoarm64: Set SD_ASYM_CPUCAPACITY sched_domain flag on DIE level
Dietmar Eggemann [Wed, 30 Nov 2016 20:36:31 +0000 (20:36 +0000)]
arm64: Set SD_ASYM_CPUCAPACITY sched_domain flag on DIE level

Enable Capacity-Aware_scheduling.

This is a shortcut from the EAS integration implementation. We know
there are SoCs in use which have asymmetric cpu capacities on DIE level.

Change-Id: I7f1326e86b8fc08b65d1c1341da7d1ef1013dcbc
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
(cherry picked from commit 82f95a8cdff67770f02b7e4f5b1c084f42afc0e2)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoUPSTREAM: sched/fair: Fix incorrect comment for capacity_margin
Morten Rasmussen [Fri, 14 Oct 2016 13:41:12 +0000 (14:41 +0100)]
UPSTREAM: sched/fair: Fix incorrect comment for capacity_margin

The comment for capacity_margin introduced in:

  3273163c6775 ("sched/fair: Let asymmetric CPU configurations balance at wake-up")

... got its usage the wrong way round - fix it.

Change-Id: Ie46eac3e5ff43397b5bed61d0999d2817f1a1d96
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-7-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 893c5d2279041afeb593f1fa8edd9d02edf5b7cb)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoUPSTREAM: sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
Morten Rasmussen [Fri, 14 Oct 2016 13:41:10 +0000 (14:41 +0100)]
UPSTREAM: sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups

For asymmetric CPU capacity systems it is counter-productive for
throughput if low capacity CPUs are pulling tasks from non-overloaded
CPUs with higher capacity. The assumption is that higher CPU capacity is
preferred over running alone in a group with lower CPU capacity.

This patch rejects higher CPU capacity groups with one or less task per
CPU as potential busiest group which could otherwise lead to a series of
failing load-balancing attempts leading to a force-migration.

Change-Id: I428875bb6267c780026ef75e2882300738d016e7
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 9e0994c0a1c1f82c705f1f66388e1bcffcee8bb9)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoUPSTREAM: sched/fair: Add per-CPU min capacity to sched_group_capacity
Morten Rasmussen [Fri, 14 Oct 2016 13:41:09 +0000 (14:41 +0100)]
UPSTREAM: sched/fair: Add per-CPU min capacity to sched_group_capacity

struct sched_group_capacity currently represents the compute capacity
sum of all CPUs in the sched_group.

Unless it is divided by the group_weight to get the average capacity
per CPU, it hides differences in CPU capacity for mixed capacity systems
(e.g. high RT/IRQ utilization or ARM big.LITTLE).

But even the average may not be sufficient if the group covers CPUs of
different capacities.

Instead, by extending struct sched_group_capacity to indicate min per-CPU
capacity in the group a suitable group for a given task utilization can
more easily be found such that CPUs with reduced capacity can be avoided
for tasks with high utilization (not implemented by this patch).

Change-Id: If3cae1be62d01a199e752bca5abb45357d5d0fbd
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit bf475ce0a3dd75b5d1df6c6c14ae25168caa15ac)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoUPSTREAM: sched/fair: Consider spare capacity in find_idlest_group()
Morten Rasmussen [Fri, 14 Oct 2016 13:41:08 +0000 (14:41 +0100)]
UPSTREAM: sched/fair: Consider spare capacity in find_idlest_group()

In low-utilization scenarios comparing relative loads in
find_idlest_group() doesn't always lead to the most optimum choice.
Systems with groups containing different numbers of cpus and/or cpus of
different compute capacity are significantly better off when considering
spare capacity rather than relative load in those scenarios.

In addition to existing load based search an alternative spare capacity
based candidate sched_group is found and selected instead if sufficient
spare capacity exists. If not, existing behaviour is preserved.

Change-Id: I6097af76c302a5a12e240ca24c70f707ad118242
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 6a0b19c0f39a7a7b7fb77d3867a733136ff059a3)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoUPSTREAM: sched/fair: Compute task/cpu utilization at wake-up correctly
Morten Rasmussen [Fri, 14 Oct 2016 13:41:07 +0000 (14:41 +0100)]
UPSTREAM: sched/fair: Compute task/cpu utilization at wake-up correctly

At task wake-up load-tracking isn't updated until the task is enqueued.
The task's own view of its utilization contribution may therefore not be
aligned with its contribution to the cfs_rq load-tracking which may have
been updated in the meantime. Basically, the task's own utilization
hasn't yet accounted for the sleep decay, while the cfs_rq may have
(partially). Estimating the cfs_rq utilization in case the task is
migrated at wake-up as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg
is therefore incorrect as the two load-tracking signals aren't time
synchronized (different last update).

To solve this problem, this patch synchronizes the task utilization with
its previous rq before the task utilization is used in the wake-up path.
Currently the update/synchronization is done _after_ the task has been
placed by select_task_rq_fair(). The synchronization is done without
having to take the rq lock using the existing mechanism used in
remove_entity_load_avg().

Change-Id: I5605cca0c94c6ba43d9ce11554765a2456cf85bc
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1476452472-24740-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 104cb16d9eb684f071d5bf3aa87c0d01af259b7c)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoUPSTREAM: sched/fair: Let asymmetric CPU configurations balance at wake-up
Morten Rasmussen [Mon, 25 Jul 2016 13:34:26 +0000 (14:34 +0100)]
UPSTREAM: sched/fair: Let asymmetric CPU configurations balance at wake-up

Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
configurations SD_WAKE_AFFINE is only desirable if the waking task's
compute demand (utilization) is suitable for the waking CPU and the
previous CPU, and all CPUs within their respective
SD_SHARE_PKG_RESOURCES domains (sd_llc). If not, let wakeup balancing
take over (find_idlest_{group, cpu}()).

This patch makes affine wake-ups conditional on whether both the waker
CPU and the previous CPU has sufficient capacity for the waking task,
or not, assuming that the CPU capacities within an SD_SHARE_PKG_RESOURCES
domain (sd_llc) are homogeneous.

Change-Id: I6d5d0426713da9ef6198f574ad9afbe58dacc1f0
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1469453670-2660-10-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 3273163c6775c4c21823985304c2364b08ca6ea2)
[removed existing definition of capacity_margin]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoUPSTREAM: sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems
Morten Rasmussen [Mon, 25 Jul 2016 13:34:24 +0000 (14:34 +0100)]
UPSTREAM: sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems

A domain with the SD_ASYM_CPUCAPACITY flag set indicate that
sched_groups at this level and below do not include CPUs of all
capacities available (e.g. group containing little-only or big-only CPUs
in big.LITTLE systems). It is therefore necessary to put in more effort
in finding an appropriate CPU at task wake-up by enabling balancing at
wake-up (SD_BALANCE_WAKE) on all lower (child) levels.

Change-Id: I4615917f540d03d7e7ef7de8f0da33b1ad97387c
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1469453670-2660-8-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 9ee1cda5ee25c7dd82acf25892e0d229e818f8c7)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoUPSTREAM: sched/core: Pass child domain into sd_init()
Morten Rasmussen [Mon, 25 Jul 2016 13:34:23 +0000 (14:34 +0100)]
UPSTREAM: sched/core: Pass child domain into sd_init()

If behavioural sched_domain flags depend on topology flags set at higher
domain levels we need a way to update the child domain flags. Moving the
child pointer assignment inside sd_init() should make that possible.

Change-Id: If043921fdf102c310adcc9e0280afa33c48c4783
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1469453670-2660-7-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 3676b13e8524c576825fe1e731e347dba0083888)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoUPSTREAM: sched/core: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
Morten Rasmussen [Mon, 25 Jul 2016 13:34:22 +0000 (14:34 +0100)]
UPSTREAM: sched/core: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag

Add a topology flag to the sched_domain hierarchy indicating the lowest
domain level where the full range of CPU capacities is represented by
the domain members for asymmetric capacity topologies (e.g. ARM
big.LITTLE).

The flag is intended to indicate that extra care should be taken when
placing tasks on CPUs and this level spans all the different types of
CPUs found in the system (no need to look further up the domain
hierarchy). This information is currently only available through
iterating through the capacities of all the CPUs at parent levels in the
sched_domain hierarchy.

  SD 2      [  0      1      2      3]  SD_ASYM_CPUCAPACITY

  SD 1      [  0      1] [   2      3]  !SD_ASYM_CPUCAPACITY

  CPU:         0      1      2      3
  capacity:  756    756   1024   1024

If the topology in the example above is duplicated to create an eight
CPU example with third sched_domain level on top (SD 3), this level
should not have the flag set (!SD_ASYM_CPUCAPACITY) as its two group
would both have all CPU capacities represented within them.

Change-Id: I1526407b90567cac387419719b7d7fdc8b259a85
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1469453670-2660-6-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 1f6e6c7cb9bcd58abb5ee11243e0eefe6b36fc8e)
[trivial merge conflict]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoUPSTREAM: sched/core: Remove unnecessary NULL-pointer check
Morten Rasmussen [Mon, 25 Jul 2016 13:34:21 +0000 (14:34 +0100)]
UPSTREAM: sched/core: Remove unnecessary NULL-pointer check

Checking if the sched_domain pointer returned by sd_init() is NULL seems
pointless as sd_init() neither checks if it is valid to begin with nor
set it to NULL.

Change-Id: I5e16fd0c2ca7234b097be7c95409ddb15c5e9de9
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: freedom.tan@mediatek.com
Cc: keita.kobayashi.ym@renesas.com
Cc: mgalbraith@suse.de
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1469453670-2660-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 0e6d2a67a41321b3ef650b780a279a37855de08e)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoUPSTREAM: sched/fair: Optimize find_idlest_cpu() when there is no choice
Morten Rasmussen [Wed, 22 Jun 2016 17:03:14 +0000 (18:03 +0100)]
UPSTREAM: sched/fair: Optimize find_idlest_cpu() when there is no choice

In the current find_idlest_group()/find_idlest_cpu() search we end up
calling find_idlest_cpu() in a sched_group containing only one CPU in
the end. Checking idle-states becomes pointless when there is no
alternative, so bail out instead.

Change-Id: Ic62bf09b53a7984143ac2431aaa69c69b204cd56
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: linux-kernel@vger.kernel.org
Cc: mgalbraith@suse.de
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1466615004-3503-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit eaecf41f5abf80b63c8e025fcb9ee4aa203c3038)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoBACKPORT: sched/fair: Make the use of prev_cpu consistent in the wakeup path
Morten Rasmussen [Wed, 22 Jun 2016 17:03:13 +0000 (18:03 +0100)]
BACKPORT: sched/fair: Make the use of prev_cpu consistent in the wakeup path

In commit:

  ac66f5477239 ("sched/numa: Introduce migrate_swap()")

select_task_rq() got a 'cpu' argument to enable overriding of prev_cpu
in special cases (NUMA task swapping).

However, the select_task_rq_fair() helper functions: wake_affine() and
select_idle_sibling(), still use task_cpu(p) directly to work out
prev_cpu, which leads to inconsistencies.

This patch passes prev_cpu (potentially overridden by NUMA code) into
the helper functions to ensure prev_cpu is indeed the same CPU
everywhere in the wakeup path.

Change-Id: I4951c4eead2e6045e4fb34e89f6cda17d881d4d7
cc: Ingo Molnar <mingo@redhat.com>
cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: linux-kernel@vger.kernel.org
Cc: mgalbraith@suse.de
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1466615004-3503-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 772bd008cd9a1d4e8ce566f2edcc61d1c28fcbe5)
[merged with Android/EAS wakeup path]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoUPSTREAM: sched/core: Fix power to capacity renaming in comment
Morten Rasmussen [Wed, 22 Jun 2016 17:03:12 +0000 (18:03 +0100)]
UPSTREAM: sched/core: Fix power to capacity renaming in comment

It is seems that this one escaped Nico's renaming of cpu_power to
cpu_capacity a while back.

Change-Id: Ic2569d714db7b740f1df4ccc381ba8c1772c2793
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: linux-kernel@vger.kernel.org
Cc: mgalbraith@suse.de
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1466615004-3503-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit bd425d4bfc7a1a6064dbbadfbac9c7eec0e426ec)
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoPartial Revert: "WIP: sched: Add cpu capacity awareness to wakeup balancing"
Dietmar Eggemann [Wed, 25 Jan 2017 10:03:36 +0000 (10:03 +0000)]
Partial Revert: "WIP: sched: Add cpu capacity awareness to wakeup balancing"

Revert the changes in find_idlest_cpu() and find_idlest_group().

Keep the infrastructure bits which are used in following EAS patches.

Change-Id: Id516ca5f3e51b9a13db1ebb8de2df3aa25f9679b
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
6 years agoRevert "WIP: sched: Consider spare cpu capacity at task wake-up"
Dietmar Eggemann [Sun, 4 Dec 2016 17:47:53 +0000 (17:47 +0000)]
Revert "WIP: sched: Consider spare cpu capacity at task wake-up"

This reverts commit 75a9695b619741019363f889c99c97c7bb823797.

Change-Id: I846b21f2bdeb0b0ca30ad65683564ed07a429428
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
[ minor merge changes ]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoFROM-LIST: cpufreq: schedutil: Redefine the rate_limit_us tunable
Viresh Kumar [Tue, 21 Feb 2017 04:45:18 +0000 (10:15 +0530)]
FROM-LIST: cpufreq: schedutil: Redefine the rate_limit_us tunable

The rate_limit_us tunable is intended to reduce the possible overhead
from running the schedutil governor.  However, that overhead can be
divided into two separate parts: the governor computations and the
invocation of the scaling driver to set the CPU frequency.  The latter
is where the real overhead comes from.  The former is much less
expensive in terms of execution time and running it every time the
governor callback is invoked by the scheduler, after rate_limit_us
interval has passed since the last frequency update, would not be a
problem.

For this reason, redefine the rate_limit_us tunable so that it means the
minimum time that has to pass between two consecutive invocations of the
scaling driver by the schedutil governor (to set the CPU frequency).

Change-Id: Iced64116b826c25441ef537c27a3dabfcf81919e
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[pulled from linux-pm linux-next https://patchwork.kernel.org/patch/9583949/ ]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agocpufreq: schedutil: add up/down frequency transition rate limits
Steve Muckle [Thu, 17 Nov 2016 05:18:45 +0000 (10:48 +0530)]
cpufreq: schedutil: add up/down frequency transition rate limits

The rate-limit tunable in the schedutil governor applies to transitions
to both lower and higher frequencies. On several platforms it is not the
ideal tunable though, as it is difficult to get best power/performance
figures using the same limit in both directions.

It is common on mobile platforms with demanding user interfaces to want
to increase frequency rapidly for example but decrease slowly.

One of the example can be a case where we have short busy periods
followed by similar or longer idle periods. If we keep the rate-limit
high enough, we will not go to higher frequencies soon enough. On the
other hand, if we keep it too low, we will have too many frequency
transitions, as we will always reduce the frequency after the busy
period.

It would be very useful if we can set low rate-limit while increasing
the frequency (so that we can respond to the short busy periods quickly)
and high rate-limit while decreasing frequency (so that we don't reduce
the frequency immediately after the short busy period and that may avoid
frequency transitions before the next busy period).

Implement separate up/down transition rate limits. Note that the
governor avoids frequency recalculations for a period equal to minimum
of up and down rate-limit. A global mutex is also defined to protect
updates to min_rate_limit_us via two separate sysfs files.

Note that this wouldn't change behavior of the schedutil governor for
the platforms which wish to keep same values for both up and down rate
limits.

This is tested with the rt-app [1] on ARM Exynos, dual A15 processor
platform.

Testcase: Run a SCHED_OTHER thread on CPU0 which will emulate work-load
for X ms of busy period out of the total period of Y ms, i.e. Y - X ms
of idle period. The values of X/Y taken were: 20/40, 20/50, 20/70, i.e
idle periods of 20, 30 and 50 ms respectively. These were tested against
values of up/down rate limits as: 10/10 ms and 10/40 ms.

For every test we noticed a performance increase of 5-10% with the
schedutil governor, which was very much expected.

[Viresh]: Simplified user interface and introduced min_rate_limit_us +
      mutex, rewrote commit log and included test results.

[1] https://github.com/scheduler-tools/rt-app/

Change-Id: I18720a83855b196b8e21dcdc8deae79131635b84
Signed-off-by: Steve Muckle <smuckle.linux@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
(applied from https://marc.info/?l=linux-kernel&m=147936011103832&w=2)
[trivial adaptations]
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
6 years agotrace/sched: add rq utilization signal for WALT
Juri Lelli [Wed, 30 Nov 2016 11:09:42 +0000 (11:09 +0000)]
trace/sched: add rq utilization signal for WALT

It is useful to be able to check current capacity against rq utilization
signal generated by WALT (to check how a cpufreq governor is behaving
for example).

Add rq utilization signal (same scale as capacity) to the walt_update_
task_ravg tracepoint.

Change-Id: I9aae3884a741d23ac494bef80d2303f107f135ce
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
6 years agosched/cpufreq: make schedutil use WALT signal
Juri Lelli [Wed, 14 Dec 2016 16:10:10 +0000 (16:10 +0000)]
sched/cpufreq: make schedutil use WALT signal

If WALT is available and enabled, make schedutil governor use its
utilization signal.

Change-Id: I92bc37989447a76616e9bcc4e9e8616774fb9925
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
[we need to use boosted_cpu_util for schedutil, so make it
 not static]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agosched: cpufreq: use rt_avg as estimate of required RT CPU capacity
Steve Muckle [Thu, 25 Aug 2016 22:59:17 +0000 (15:59 -0700)]
sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

A policy of going to fmax on any RT activity will be detrimental
for power on many platforms. Often RT accounts for only a small amount
of CPU activity so sending the CPU frequency to fmax is overkill. Worse
still, some platforms may not be able to even complete the CPU frequency
change before the RT activity has already completed.

Cpufreq governors have not treated RT activity this way in the past so
it is not part of the expected semantics of the RT scheduling class. The
DL class offers guarantees about task completion and could be used for
this purpose.

Modify the schedutil algorithm to instead use rt_avg as an estimate of
RT utilization of the CPU.

Based on previous work by Vincent Guittot <vincent.guittot@linaro.org>.

Change-Id: I1ed605a3e2512a94d34217a8e57c3fd97cca60be
Signed-off-by: Steve Muckle <smuckle@linaro.org>
6 years agocpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task
Viresh Kumar [Tue, 15 Nov 2016 08:23:22 +0000 (13:53 +0530)]
cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task

If slow path frequency changes are conducted in a SCHED_OTHER context
then they may be delayed for some amount of time, including
indefinitely, when real time or deadline activity is taking place.

Move the slow path to a real time kernel thread. In the future the
thread should be made SCHED_DEADLINE. The RT priority is arbitrarily set
to 50 for now.

Hackbench results on ARM Exynos, dual core A15 platform for 10
iterations:

$ hackbench -s 100 -l 100 -g 10 -f 20

Before After
---------------------------------
1.808 1.603
1.847 1.251
2.229 1.590
1.952 1.600
1.947 1.257
1.925 1.627
2.694 1.620
1.258 1.621
1.919 1.632
1.250 1.240

Average:

1.8829 1.5041

Based on initial work by Steve Muckle.

Change-Id: I8f53037e94f353960c6d10abf07822d671631ef7
Signed-off-by: Steve Muckle <smuckle.linux@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from 02a7b1ee3baa)
[adapt to the 3.18 kthread interface]
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
6 years agoBACKPORT: kthread: allow to cancel kthread work
Petr Mladek [Tue, 11 Oct 2016 20:55:43 +0000 (13:55 -0700)]
BACKPORT: kthread: allow to cancel kthread work

We are going to use kthread workers more widely and sometimes we will need
to make sure that the work is neither pending nor running.

This patch implements cancel_*_sync() operations as inspired by
workqueues.  Well, we are synchronized against the other operations via
the worker lock, we use del_timer_sync() and a counter to count parallel
cancel operations.  Therefore the implementation might be easier.

First, we check if a worker is assigned.  If not, the work has newer been
queued after it was initialized.

Second, we take the worker lock.  It must be the right one.  The work must
not be assigned to another worker unless it is initialized in between.

Third, we try to cancel the timer when it exists.  The timer is deleted
synchronously to make sure that the timer call back is not running.  We
need to temporary release the worker->lock to avoid a possible deadlock
with the callback.  In the meantime, we set work->canceling counter to
avoid any queuing.

Fourth, we try to remove the work from a worker list. It might be
the list of either normal or delayed works.

Fifth, if the work is running, we call kthread_flush_work().  It might
take an arbitrary time.  We need to release the worker-lock again.  In the
meantime, we again block any queuing by the canceling counter.

As already mentioned, the check for a pending kthread work is done under a
lock.  In compare with workqueues, we do not need to fight for a single
PENDING bit to block other operations.  Therefore we do not suffer from
the thundering storm problem and all parallel canceling jobs might use
kthread_flush_work().  Any queuing is blocked until the counter gets zero.

Change-Id: I8a8ece0f93c828f311d0ad5c88d80db2388e4808
Link: http://lkml.kernel.org/r/1470754545-17632-10-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-picked from 37be45d49dec2a411e29d50c9597cfe8184b5645)
[major changes to the original patch while cherry-picking; only rebased
the sync variant]
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
6 years agosched/cpufreq: fix tunables for schedfreq governor
Steve Muckle [Tue, 25 Oct 2016 01:22:19 +0000 (18:22 -0700)]
sched/cpufreq: fix tunables for schedfreq governor

The schedfreq governor does not currently handle cpufreq drivers which
use a global set of tunables (!have_governor_per_policy).

For example on x86 and using the acpi cpufreq driver, doing this

  cat /sys/devices/system/cpu/cpufreq/sched/up_throttle_nsec

will result in a bad pointer access.

Update the tunable code using the upstream schedutil tunable code by
Rafael Wysocki as a guide.

Includes a partial backport of the reorganized cpufreq tunable
infrastructure.

Change-Id: I7e6f8de1dac297077ad43f37dd2f6ddbfe921c98
Signed-off-by: Steve Muckle <smuckle@linaro.org>
[fixed cherry-pick issue]
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
[fixed cherry-pick issue]
Signed-off-by: Thierry Strudel <tstrudel@google.com>
6 years agoBACKPORT: cpufreq: schedutil: New governor based on scheduler utilization data
Steve Muckle [Sat, 12 Nov 2016 01:14:39 +0000 (17:14 -0800)]
BACKPORT: cpufreq: schedutil: New governor based on scheduler utilization data

Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.

Doing that is possible after commit 34e2c55 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the the utilization data passed to cpufreq_update_util() by CFS.

The new governor is relatively simple.

The frequency selection formula used by it depends on whether or not
the utilization is frequency-invariant.  In the frequency-invariant
case the new CPU frequency is given by

next_freq = 1.25 * max_freq * util / max

where util and max are the last two arguments of cpufreq_update_util().
In turn, if util is not frequency-invariant, the maximum frequency in
the above formula is replaced with the current frequency of the CPU:

next_freq = 1.25 * curr_freq * util / max

The coefficient 1.25 corresponds to the frequency tipping point at
(util / max) = 0.8.

All of the computations are carried out in the utilization update
handlers provided by the new governor.  One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).

The governor supports fast frequency switching if that is supported
by the cpufreq driver in use and possible for the given policy.
In the fast switching case, all operations of the governor take
place in its utilization update handlers.  If fast switching cannot
be used, the frequency switch operations are carried out with the
help of a work item which only calls __cpufreq_driver_target()
(under a mutex) to trigger a frequency update (to a value already
computed beforehand in one of the utilization update handlers).

Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes.  That
heavy-handed approach should be replaced with something more
subtle and specifically targeted at RT and DL tasks.

The governor shares some tunables management code with the
"ondemand" and "conservative" governors and uses some common
definitions from cpufreq_governor.h, but apart from that it
is stand-alone.

Change-Id: I03876e622768e4b3ee4dc28682af7cce771f2f4c
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
(cherry-picked from 9bdcb44e391da5c41b98573bf0305a0e0b1c9569)
[ Backport the schedutil cpufreq governor from 4.9. Some cpufreq
  tunable infrastructure as well as the resolve_freq API is also
  backported as those are dependencies]
Signed-off-by: Steve Muckle <smuckle@linaro.org>
[trivial cherry-picking fixes]
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
[fixed default governor machinery]
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agosched: backport cpufreq hooks from 4.9-rc4
Steve Muckle [Fri, 11 Nov 2016 22:04:43 +0000 (14:04 -0800)]
sched: backport cpufreq hooks from 4.9-rc4

The scheduler cpufreq hooks are required by the schedutil cpufreq
governor.

Change-Id: Ied6c46262bb33b7e81bbb3d3d2761124e0c676b7
Signed-off-by: Steve Muckle <smuckle@linaro.org>
[trivial cherry-picking fixes]
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Chris Redpath <chris.redpath@arm.com>
6 years agoANDROID: Kconfig: add depends for UID_SYS_STATS
Ganesh Mahendran [Wed, 24 May 2017 02:28:27 +0000 (10:28 +0800)]
ANDROID: Kconfig: add depends for UID_SYS_STATS

uid_io depends on TASK_XACCT and TASK_IO_ACCOUNTING.
So add depends in Kconfig before compiling code.

Change-Id: Ie6bf57ec7c2eceffadf4da0fc2aca001ce10c36e
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
6 years agoANDROID: hid: uhid: implement refcount for open and close
Dmitry Torokhov [Tue, 30 May 2017 21:46:26 +0000 (14:46 -0700)]
ANDROID: hid: uhid: implement refcount for open and close

Fix concurrent open and close activity sending a UHID_CLOSE while
some consumers still have the device open.

Temporary solution for reference counts on device open and close
calls, absent a facility for this in the HID core likely to appear
in the future.

[toddpoynor@google.com: commit text]
Bug: 38448648
Signed-off-by: Todd Poynor <toddpoynor@google.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Change-Id: I57413e42ec961a960a8ddc4942228df22c730d80

6 years agoRevert "ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY"
Eric Biggers [Mon, 8 May 2017 17:21:34 +0000 (10:21 -0700)]
Revert "ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY"

This reverts commit e2968fb8e7980dccc199dac2593ad476db20969f.

For various reasons, we've had to start enforcing upstream that ext4
encryption can only be used if the filesystem superblock has the
EXT4_FEATURE_INCOMPAT_ENCRYPT flag set, as was the intended design.
Unfortunately, Android isn't ready for this quite yet, since its
userspace still needs to be updated to set the flag at mkfs time, or
else fix it later with tune2fs.  It will need some more time to be fixed
properly, so for now to avoid breaking some devices, revert the kernel
change.

Bug: 36231741
Signed-off-by: Eric Biggers <ebiggers@google.com>
Change-Id: I30bd54afb68dbaf9801f8954099dffa90a2f8df1

6 years agoANDROID: mnt: Fix next_descendent
Daniel Rosenberg [Mon, 29 May 2017 23:38:16 +0000 (16:38 -0700)]
ANDROID: mnt: Fix next_descendent

next_descendent did not properly handle the case
where the initial mount had no slaves. In this case,
we would look for the next slave, but since don't
have a master, the check for wrapping around to the
start of the list will always fail. Instead, we check
for this case, and ensure that we end the iteration
when we come back to the root.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 62094374
Change-Id: I43dfcee041aa3730cb4b9a1161418974ef84812e

6 years agoMerge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android
Alex Shi [Thu, 8 Jun 2017 04:11:52 +0000 (12:11 +0800)]
Merge branch 'linux-linaro-lsk-v4.4' into linux-linaro-lsk-v4.4-android

6 years ago Merge tag 'v4.4.71' into linux-linaro-lsk-v4.4
Alex Shi [Thu, 8 Jun 2017 04:11:49 +0000 (12:11 +0800)]
 Merge tag 'v4.4.71' into linux-linaro-lsk-v4.4

 This is the 4.4.71 stable release

6 years agoLinux 4.4.71
Greg Kroah-Hartman [Wed, 7 Jun 2017 10:06:14 +0000 (12:06 +0200)]
Linux 4.4.71

6 years agoxfs: only return -errno or success from attr ->put_listent
Eric Sandeen [Tue, 5 Apr 2016 21:57:18 +0000 (07:57 +1000)]
xfs: only return -errno or success from attr ->put_listent

commit 2a6fba6d2311151598abaa1e7c9abd5f8d024a43 upstream.

Today, the put_listent formatters return either 1 or 0; if
they return 1, some callers treat this as an error and return
it up the stack, despite "1" not being a valid (negative)
error code.

The intent seems to be that if the input buffer is full,
we set seen_enough or set count = -1, and return 1;
but some callers check the return before checking the
seen_enough or count fields of the context.

Fix this by only returning non-zero for actual errors
encountered, and rely on the caller to first check the
return value, then check the values in the context to
decide what to do.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoxfs: in _attrlist_by_handle, copy the cursor back to userspace
Darrick J. Wong [Wed, 3 Aug 2016 00:58:53 +0000 (10:58 +1000)]
xfs: in _attrlist_by_handle, copy the cursor back to userspace

commit 0facef7fb053be4353c0a48c2f48c9dbee91cb19 upstream.

When we're iterating inode xattrs by handle, we have to copy the
cursor back to userspace so that a subsequent invocation actually
retrieves subsequent contents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Cc: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoxfs: fix unaligned access in xfs_btree_visit_blocks
Eric Sandeen [Tue, 23 May 2017 02:54:10 +0000 (19:54 -0700)]
xfs: fix unaligned access in xfs_btree_visit_blocks

commit a4d768e702de224cc85e0c8eac9311763403b368 upstream.

This structure copy was throwing unaligned access warnings on sparc64:

Kernel unaligned access at TPC[1043c088] xfs_btree_visit_blocks+0x88/0xe0 [xfs]

xfs_btree_copy_ptrs does a memcpy, which avoids it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoxfs: bad assertion for delalloc an extent that start at i_size
Zorro Lang [Mon, 15 May 2017 15:40:02 +0000 (08:40 -0700)]
xfs: bad assertion for delalloc an extent that start at i_size

commit 892d2a5f705723b2cb488bfb38bcbdcf83273184 upstream.

By run fsstress long enough time enough in RHEL-7, I find an
assertion failure (harder to reproduce on linux-4.11, but problem
is still there):

  XFS: Assertion failed: (iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c

The assertion is in xfs_getbmap() funciton:

  if (map[i].br_startblock == DELAYSTARTBLOCK &&
-->   map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip)))
          ASSERT((iflags & BMV_IF_DELALLOC) != 0);

When map[i].br_startoff == XFS_B_TO_FSB(mp, XFS_ISIZE(ip)), the
startoff is just at EOF. But we only need to make sure delalloc
extents that are within EOF, not include EOF.

Signed-off-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoxfs: fix indlen accounting error on partial delalloc conversion
Brian Foster [Fri, 12 May 2017 17:44:08 +0000 (10:44 -0700)]
xfs: fix indlen accounting error on partial delalloc conversion

commit 0daaecacb83bc6b656a56393ab77a31c28139bc7 upstream.

The delalloc -> real block conversion path uses an incorrect
calculation in the case where the middle part of a delalloc extent
is being converted. This is documented as a rare situation because
XFS generally attempts to maximize contiguity by converting as much
of a delalloc extent as possible.

If this situation does occur, the indlen reservation for the two new
delalloc extents left behind by the conversion of the middle range
is calculated and compared with the original reservation. If more
blocks are required, the delta is allocated from the global block
pool. This delta value can be characterized as the difference
between the new total requirement (temp + temp2) and the currently
available reservation minus those blocks that have already been
allocated (startblockval(PREV.br_startblock) - allocated).

The problem is that the current code does not account for previously
allocated blocks correctly. It subtracts the current allocation
count from the (new - old) delta rather than the old indlen
reservation. This means that more indlen blocks than have been
allocated end up stashed in the remaining extents and free space
accounting is broken as a result.

Fix up the calculation to subtract the allocated block count from
the original extent indlen and thus correctly allocate the
reservation delta based on the difference between the new total
requirement and the unused blocks from the original reservation.
Also remove a bogus assert that contradicts the fact that the new
indlen reservation can be larger than the original indlen
reservation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoxfs: wait on new inodes during quotaoff dquot release
Brian Foster [Wed, 26 Apr 2017 15:30:40 +0000 (08:30 -0700)]
xfs: wait on new inodes during quotaoff dquot release

commit e20c8a517f259cb4d258e10b0cd5d4b30d4167a0 upstream.

The quotaoff operation has a race with inode allocation that results
in a livelock. An inode allocation that occurs before the quota
status flags are updated acquires the appropriate dquots for the
inode via xfs_qm_vop_dqalloc(). It then inserts the XFS_INEW inode
into the perag radix tree, sometime later attaches the dquots to the
inode and finally clears the XFS_INEW flag. Quotaoff expects to
release the dquots from all inodes in the filesystem via
xfs_qm_dqrele_all_inodes(). This invokes the AG inode iterator,
which skips inodes in the XFS_INEW state because they are not fully
constructed. If the scan occurs after dquots have been attached to
an inode, but before XFS_INEW is cleared, the newly allocated inode
will continue to hold a reference to the applicable dquots. When
quotaoff invokes xfs_qm_dqpurge_all(), the reference count of those
dquot(s) remain elevated and the dqpurge scan spins indefinitely.

To address this problem, update the xfs_qm_dqrele_all_inodes() scan
to wait on inodes marked on the XFS_INEW state. We wait on the
inodes explicitly rather than skip and retry to avoid continuous
retry loops due to a parallel inode allocation workload. Since
quotaoff updates the quota state flags and uses a synchronous
transaction before the dqrele scan, and dquots are attached to
inodes after radix tree insertion iff quota is enabled, one INEW
waiting pass through the AG guarantees that the scan has processed
all inodes that could possibly hold dquot references.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoxfs: update ag iterator to support wait on new inodes
Brian Foster [Wed, 26 Apr 2017 15:30:39 +0000 (08:30 -0700)]
xfs: update ag iterator to support wait on new inodes

commit ae2c4ac2dd39b23a87ddb14ceddc3f2872c6aef5 upstream.

The AG inode iterator currently skips new inodes as such inodes are
inserted into the inode radix tree before they are fully
constructed. Certain contexts require the ability to wait on the
construction of new inodes, however. The fs-wide dquot release from
the quotaoff sequence is an example of this.

Update the AG inode iterator to support the ability to wait on
inodes flagged with XFS_INEW upon request. Create a new
xfs_inode_ag_iterator_flags() interface and support a set of
iteration flags to modify the iteration behavior. When the
XFS_AGITER_INEW_WAIT flag is set, include XFS_INEW flags in the
radix tree inode lookup and wait on them before the callback is
executed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoxfs: support ability to wait on new inodes
Brian Foster [Wed, 26 Apr 2017 15:30:39 +0000 (08:30 -0700)]
xfs: support ability to wait on new inodes

commit 756baca27fff3ecaeab9dbc7a5ee35a1d7bc0c7f upstream.

Inodes that are inserted into the perag tree but still under
construction are flagged with the XFS_INEW bit. Most contexts either
skip such inodes when they are encountered or have the ability to
handle them.

The runtime quotaoff sequence introduces a context that must wait
for construction of such inodes to correctly ensure that all dquots
in the fs are released. In anticipation of this, support the ability
to wait on new inodes. Wake the appropriate bit when XFS_INEW is
cleared.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoxfs: fix up quotacheck buffer list error handling
Brian Foster [Fri, 21 Apr 2017 19:40:44 +0000 (12:40 -0700)]
xfs: fix up quotacheck buffer list error handling

commit 20e8a063786050083fe05b4f45be338c60b49126 upstream.

The quotacheck error handling of the delwri buffer list assumes the
resident buffers are locked and doesn't clear the _XBF_DELWRI_Q flag
on the buffers that are dequeued. This can lead to assert failures
on buffer release and possibly other locking problems.

Move this code to a delwri queue cancel helper function to
encapsulate the logic required to properly release buffers from a
delwri queue. Update the helper to clear the delwri queue flag and
call it from quotacheck.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoxfs: prevent multi-fsb dir readahead from reading random blocks
Brian Foster [Thu, 20 Apr 2017 15:06:47 +0000 (08:06 -0700)]
xfs: prevent multi-fsb dir readahead from reading random blocks

commit cb52ee334a45ae6c78a3999e4b473c43ddc528f4 upstream.

Directory block readahead uses a complex iteration mechanism to map
between high-level directory blocks and underlying physical extents.
This mechanism attempts to traverse the higher-level dir blocks in a
manner that handles multi-fsb directory blocks and simultaneously
maintains a reference to the corresponding physical blocks.

This logic doesn't handle certain (discontiguous) physical extent
layouts correctly with multi-fsb directory blocks. For example,
consider the case of a 4k FSB filesystem with a 2 FSB (8k) directory
block size and a directory with the following extent layout:

 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL
   0: [0..7]:          88..95            0 (88..95)             8
   1: [8..15]:         80..87            0 (80..87)             8
   2: [16..39]:        168..191          0 (168..191)          24
   3: [40..63]:        5242952..5242975  1 (72..95)            24

Directory block 0 spans physical extents 0 and 1, dirblk 1 lies
entirely within extent 2 and dirblk 2 spans extents 2 and 3. Because
extent 2 is larger than the directory block size, the readahead code
erroneously assumes the block is contiguous and issues a readahead
based on the physical mapping of the first fsb of the dirblk. This
results in read verifier failure and a spurious corruption or crc
failure, depending on the filesystem format.

Further, the subsequent readahead code responsible for walking
through the physical table doesn't correctly advance the physical
block reference for dirblk 2. Instead of advancing two physical
filesystem blocks, the first iteration of the loop advances 1 block
(correctly), but the subsequent iteration advances 2 more physical
blocks because the next physical extent (extent 3, above) happens to
cover more than dirblk 2. At this point, the higher-level directory
block walking is completely off the rails of the actual physical
layout of the directory for the respective mapping table.

Update the contiguous dirblock logic to consider the current offset
in the physical extent to avoid issuing directory readahead to
unrelated blocks. Also, update the mapping table advancing code to
consider the current offset within the current dirblock to avoid
advancing the mapping reference too far beyond the dirblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoxfs: handle array index overrun in xfs_dir2_leaf_readbuf()
Eric Sandeen [Thu, 13 Apr 2017 22:15:47 +0000 (15:15 -0700)]
xfs: handle array index overrun in xfs_dir2_leaf_readbuf()

commit 023cc840b40fad95c6fe26fff1d380a8c9d45939 upstream.

Carlos had a case where "find" seemed to start spinning
forever and never return.

This was on a filesystem with non-default multi-fsb (8k)
directory blocks, and a fragmented directory with extents
like this:

0:[0,133646,2,0]
1:[2,195888,1,0]
2:[3,195890,1,0]
3:[4,195892,1,0]
4:[5,195894,1,0]
5:[6,195896,1,0]
6:[7,195898,1,0]
7:[8,195900,1,0]
8:[9,195902,1,0]
9:[10,195908,1,0]
10:[11,195910,1,0]
11:[12,195912,1,0]
12:[13,195914,1,0]
...

i.e. the first extent is a contiguous 2-fsb dir block, but
after that it is fragmented into 1 block extents.

At the top of the readdir path, we allocate a mapping array
which (for this filesystem geometry) can hold 10 extents; see
the assignment to map_info->map_size.  During readdir, we are
therefore able to map extents 0 through 9 above into the array
for readahead purposes.  If we count by 2, we see that the last
mapped index (9) is the first block of a 2-fsb directory block.

At the end of xfs_dir2_leaf_readbuf() we have 2 loops to fill
more readahead; the outer loop assumes one full dir block is
processed each loop iteration, and an inner loop that ensures
that this is so by advancing to the next extent until a full
directory block is mapped.

The problem is that this inner loop may step past the last
extent in the mapping array as it tries to reach the end of
the directory block.  This will read garbage for the extent
length, and as a result the loop control variable 'j' may
become corrupted and never fail the loop conditional.

The number of valid mappings we have in our array is stored
in map->map_valid, so stop this inner loop based on that limit.

There is an ASSERT at the top of the outer loop for this
same condition, but we never made it out of the inner loop,
so the ASSERT never fired.

Huge appreciation for Carlos for debugging and isolating
the problem.

Debugged-and-analyzed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoxfs: fix over-copying of getbmap parameters from userspace
Darrick J. Wong [Mon, 3 Apr 2017 22:17:57 +0000 (15:17 -0700)]
xfs: fix over-copying of getbmap parameters from userspace

commit be6324c00c4d1e0e665f03ed1fc18863a88da119 upstream.

In xfs_ioc_getbmap, we should only copy the fields of struct getbmap
from userspace, or else we end up copying random stack contents into the
kernel.  struct getbmap is a strict subset of getbmapx, so a partial
structure copy should work fine.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoxfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()
Eryu Guan [Tue, 23 May 2017 15:30:46 +0000 (08:30 -0700)]
xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()

commit 8affebe16d79ebefb1d9d6d56a46dc89716f9453 upstream.

xfs_find_get_desired_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.

When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size XFS on x86_64 host.

  # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
       -c "seek -d 0" /mnt/xfs/testfile
  wrote 1024/1024 bytes at offset 2048
  1 KiB, 1 ops; 0.0000 sec (33.675 MiB/sec and 34482.7586 ops/sec)
  Whence  Result
  DATA    EOF

Data at offset 2k was missed, and lseek(2) returned ENXIO.

This is uncovered by generic/285 subtest 07 and 08 on ppc64 host,
where pagesize is 64k. Because a recent change to generic/285
reduced the preallocated file size to smaller than 64k.

Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoxfs: Fix missed holes in SEEK_HOLE implementation
Jan Kara [Thu, 18 May 2017 23:36:22 +0000 (16:36 -0700)]
xfs: Fix missed holes in SEEK_HOLE implementation

commit 5375023ae1266553a7baa0845e82917d8803f48c upstream.

XFS SEEK_HOLE implementation could miss a hole in an unwritten extent as
can be seen by the following command:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
       -c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (49.312 MiB/sec and 12623.9856 ops/sec)
wrote 8192/8192 bytes at offset 131072
8 KiB, 2 ops; 0.0000 sec (70.383 MiB/sec and 18018.0180 ops/sec)
Whence Result
HOLE 139264

Where we can see that hole at offset 56k was just ignored by SEEK_HOLE
implementation. The bug is in xfs_find_get_desired_pgoff() which does
not properly detect the case when pages are not contiguous.

Fix the problem by properly detecting when found page has larger offset
than expected.

Fixes: d126d43f631f996daeee5006714fed914be32368
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agomlock: fix mlock count can not decrease in race condition
Yisheng Xie [Fri, 2 Jun 2017 21:46:43 +0000 (14:46 -0700)]
mlock: fix mlock count can not decrease in race condition

commit 70feee0e1ef331b22cc51f383d532a0d043fbdcc upstream.

Kefeng reported that when running the follow test, the mlock count in
meminfo will increase permanently:

 [1] testcase
 linux:~ # cat test_mlockal
 grep Mlocked /proc/meminfo
  for j in `seq 0 10`
  do
  for i in `seq 4 15`
  do
  ./p_mlockall >> log &
  done
  sleep 0.2
 done
 # wait some time to let mlock counter decrease and 5s may not enough
 sleep 5
 grep Mlocked /proc/meminfo

 linux:~ # cat p_mlockall.c
 #include <sys/mman.h>
 #include <stdlib.h>
 #include <stdio.h>

 #define SPACE_LEN 4096

 int main(int argc, char ** argv)
 {
  int ret;
  void *adr = malloc(SPACE_LEN);
  if (!adr)
  return -1;

  ret = mlockall(MCL_CURRENT | MCL_FUTURE);
  printf("mlcokall ret = %d\n", ret);

  ret = munlockall();
  printf("munlcokall ret = %d\n", ret);

  free(adr);
  return 0;
 }

In __munlock_pagevec() we should decrement NR_MLOCK for each page where
we clear the PageMlocked flag.  Commit 1ebb7cc6a583 ("mm: munlock: batch
NR_MLOCK zone state updates") has introduced a bug where we don't
decrement NR_MLOCK for pages where we clear the flag, but fail to
isolate them from the lru list (e.g.  when the pages are on some other
cpu's percpu pagevec).  Since PageMlocked stays cleared, the NR_MLOCK
accounting gets permanently disrupted by this.

Fix it by counting the number of page whose PageMlock flag is cleared.

Fixes: 1ebb7cc6a583 (" mm: munlock: batch NR_MLOCK zone state updates")
Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joern Engel <joern@logfs.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: zhongjiang <zhongjiang@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agomm/migrate: fix refcount handling when !hugepage_migration_supported()
Punit Agrawal [Fri, 2 Jun 2017 21:46:40 +0000 (14:46 -0700)]
mm/migrate: fix refcount handling when !hugepage_migration_supported()

commit 30809f559a0d348c2dfd7ab05e9a451e2384962e upstream.

On failing to migrate a page, soft_offline_huge_page() performs the
necessary update to the hugepage ref-count.

But when !hugepage_migration_supported() , unmap_and_move_hugepage()
also decrements the page ref-count for the hugepage.  The combined
behaviour leaves the ref-count in an inconsistent state.

This leads to soft lockups when running the overcommitted hugepage test
from mce-tests suite.

  Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
  soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate|head)
  INFO: rcu_preempt detected stalls on CPUs/tasks:
   Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
    (detected by 7, t=5254 jiffies, g=963, c=962, q=321)
    thugetlb_overco R  running task        0  2715   2685 0x00000008
    Call trace:
      dump_backtrace+0x0/0x268
      show_stack+0x24/0x30
      sched_show_task+0x134/0x180
      rcu_print_detail_task_stall_rnp+0x54/0x7c
      rcu_check_callbacks+0xa74/0xb08
      update_process_times+0x34/0x60
      tick_sched_handle.isra.7+0x38/0x70
      tick_sched_timer+0x4c/0x98
      __hrtimer_run_queues+0xc0/0x300
      hrtimer_interrupt+0xac/0x228
      arch_timer_handler_phys+0x3c/0x50
      handle_percpu_devid_irq+0x8c/0x290
      generic_handle_irq+0x34/0x50
      __handle_domain_irq+0x68/0xc0
      gic_handle_irq+0x5c/0xb0

Address this by changing the putback_active_hugepage() in
soft_offline_huge_page() to putback_movable_pages().

This only triggers on systems that enable memory failure handling
(ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration
(!ARCH_ENABLE_HUGEPAGE_MIGRATION).

I imagine this wasn't triggered as there aren't many systems running
this configuration.

[akpm@linux-foundation.org: remove dead comment, per Naoya]
Link: http://lkml.kernel.org/r/20170525135146.32011-1-punit.agrawal@arm.com
Reported-by: Manoj Iyer <manoj.iyer@canonical.com>
Tested-by: Manoj Iyer <manoj.iyer@canonical.com>
Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agodrm/gma500/psb: Actually use VBT mode when it is found
Patrik Jakobsson [Tue, 18 Apr 2017 11:43:32 +0000 (13:43 +0200)]
drm/gma500/psb: Actually use VBT mode when it is found

commit 82bc9a42cf854fdf63155759c0aa790bd1f361b0 upstream.

With LVDS we were incorrectly picking the pre-programmed mode instead of
the prefered mode provided by VBT. Make sure we pick the VBT mode if
one is provided. It is likely that the mode read-out code is still wrong
but this patch fixes the immediate problem on most machines.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=78562
Signed-off-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170418114332.12183-1-patrik.r.jakobsson@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoslub/memcg: cure the brainless abuse of sysfs attributes
Thomas Gleixner [Fri, 2 Jun 2017 21:46:25 +0000 (14:46 -0700)]
slub/memcg: cure the brainless abuse of sysfs attributes

commit 478fe3037b2278d276d4cd9cd0ab06c4cb2e9b32 upstream.

memcg_propagate_slab_attrs() abuses the sysfs attribute file functions
to propagate settings from the root kmem_cache to a newly created
kmem_cache.  It does that with:

     attr->show(root, buf);
     attr->store(new, buf, strlen(bug);

Aside of being a lazy and absurd hackery this is broken because it does
not check the return value of the show() function.

Some of the show() functions return 0 w/o touching the buffer.  That
means in such a case the store function is called with the stale content
of the previous show().  That causes nonsense like invoking
kmem_cache_shrink() on a newly created kmem_cache.  In the worst case it
would cause handing in an uninitialized buffer.

This should be rewritten proper by adding a propagate() callback to
those slub_attributes which must be propagated and avoid that insane
conversion to and from ASCII, but that's too large for a hot fix.

Check at least the return value of the show() function, so calling
store() with stale content is prevented.

Steven said:
 "It can cause a deadlock with get_online_cpus() that has been uncovered
  by recent cpu hotplug and lockdep changes that Thomas and Peter have
  been doing.

     Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(cpu_hotplug.lock);
                                   lock(slab_mutex);
                                   lock(cpu_hotplug.lock);
      lock(slab_mutex);

     *** DEADLOCK ***"

Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705201244540.2255@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoALSA: hda - apply STAC_9200_DELL_M22 quirk for Dell Latitude D430
Alexander Tsoy [Mon, 22 May 2017 17:58:11 +0000 (20:58 +0300)]
ALSA: hda - apply STAC_9200_DELL_M22 quirk for Dell Latitude D430

commit 1fc2e41f7af4572b07190f9dec28396b418e9a36 upstream.

This model is actually called 92XXM2-8 in Windows driver. But since pin
configs for M22 and M28 are identical, just reuse M22 quirk.

Fixes external microphone (tested) and probably docking station ports
(not tested).

Signed-off-by: Alexander Tsoy <alexander@tsoy.me>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agopcmcia: remove left-over %Z format
Nicolas Iooss [Fri, 2 Jun 2017 21:46:28 +0000 (14:46 -0700)]
pcmcia: remove left-over %Z format

commit ff5a20169b98d84ad8d7f99f27c5ebbb008204d6 upstream.

Commit 5b5e0928f742 ("lib/vsprintf.c: remove %Z support") removed some
usages of format %Z but forgot "%.2Zx".  This makes clang 4.0 reports a
-Wformat-extra-args warning because it does not know about %Z.

Replace %Z with %z.

Link: http://lkml.kernel.org/r/20170520090946.22562-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Harald Welte <laforge@gnumonks.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agodrm/radeon: Unbreak HPD handling for r600+
Lyude [Thu, 11 May 2017 23:31:12 +0000 (19:31 -0400)]
drm/radeon: Unbreak HPD handling for r600+

commit 3d18e33735a02b1a90aecf14410bf3edbfd4d3dc upstream.

We end up reading the interrupt register for HPD5, and then writing it
to HPD6 which on systems without anything using HPD5 results in
permanently disabling hotplug on one of the display outputs after the
first time we acknowledge a hotplug interrupt from the GPU.

This code is really bad. But for now, let's just fix this. I will
hopefully have a large patch series to refactor all of this soon.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Lyude <lyude@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agodrm/radeon/ci: disable mclk switching for high refresh rates (v2)
Alex Deucher [Thu, 11 May 2017 17:14:14 +0000 (13:14 -0400)]
drm/radeon/ci: disable mclk switching for high refresh rates (v2)

commit 58d7e3e427db1bd68f33025519a9468140280a75 upstream.

Even if the vblank period would allow it, it still seems to
be problematic on some cards.

v2: fix logic inversion (Nils)

bug: https://bugs.freedesktop.org/show_bug.cgi?id=96868

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoscsi: mpt3sas: Force request partial completion alignment
Ram Pai [Thu, 26 Jan 2017 18:37:01 +0000 (16:37 -0200)]
scsi: mpt3sas: Force request partial completion alignment

commit f2e767bb5d6ee0d988cb7d4e54b0b21175802b6b upstream.

The firmware or device, possibly under a heavy I/O load, can return on a
partial unaligned boundary. Scsi-ml expects these requests to be
completed on an alignment boundary. Scsi-ml blindly requeues the I/O
without checking the alignment boundary of the I/O request for the
remaining bytes. This leads to errors, since devices cannot perform
non-aligned read/write operations.

This patch fixes the issue in the driver. It aligns unaligned
completions of FS requests, by truncating them to the nearest alignment
boundary.

[mkp: simplified if statement]

Reported-by: Mauricio Faria De Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Acked-by: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoHID: wacom: Have wacom_tpc_irq guard against possible NULL dereference
Jason Gerecke [Tue, 25 Apr 2017 18:29:56 +0000 (11:29 -0700)]
HID: wacom: Have wacom_tpc_irq guard against possible NULL dereference

commit 2ac97f0f6654da14312d125005c77a6010e0ea38 upstream.

The following Smatch complaint was generated in response to commit
2a6cdbd ("HID: wacom: Introduce new 'touch_input' device"):

    drivers/hid/wacom_wac.c:1586 wacom_tpc_irq()
             error: we previously assumed 'wacom->touch_input' could be null (see line 1577)

The 'touch_input' and 'pen_input' variables point to the 'struct input_dev'
used for relaying touch and pen events to userspace, respectively. If a
device does not have a touch interface or pen interface, the associated
input variable is NULL. The 'wacom_tpc_irq()' function is responsible for
forwarding input reports to a more-specific IRQ handler function. An
unknown report could theoretically be mistaken as e.g. a touch report
on a device which does not have a touch interface. This can be prevented
by only calling the pen/touch functions are called when the pen/touch
pointers are valid.

Fixes: 2a6cdbd ("HID: wacom: Introduce new 'touch_input' device")
Signed-off-by: Jason Gerecke <jason.gerecke@wacom.com>
Reviewed-by: Ping Cheng <ping.cheng@wacom.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agommc: sdhci-iproc: suppress spurious interrupt with Multiblock read
Srinath Mannam [Thu, 18 May 2017 16:57:40 +0000 (22:27 +0530)]
mmc: sdhci-iproc: suppress spurious interrupt with Multiblock read

commit f5f968f2371ccdebb8a365487649673c9af68d09 upstream.

The stingray SDHCI hardware supports ACMD12 and automatically
issues after multi block transfer completed.

If ACMD12 in SDHCI is disabled, spurious tx done interrupts are seen
on multi block read command with below error message:

Got data interrupt 0x00000002 even though no data
operation was in progress.

This patch uses SDHCI_QUIRK_MULTIBLOCK_READ_ACMD12 to enable
ACM12 support in SDHCI hardware and suppress spurious interrupt.

Signed-off-by: Srinath Mannam <srinath.mannam@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: b580c52d58d9 ("mmc: sdhci-iproc: add IPROC SDHCI driver")
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoi2c: i2c-tiny-usb: fix buffer not being DMA capable
Sebastian Reichel [Fri, 5 May 2017 09:06:50 +0000 (11:06 +0200)]
i2c: i2c-tiny-usb: fix buffer not being DMA capable

commit 5165da5923d6c7df6f2927b0113b2e4d9288661e upstream.

Since v4.9 i2c-tiny-usb generates the below call trace
and longer works, since it can't communicate with the
USB device. The reason is, that since v4.9 the USB
stack checks, that the buffer it should transfer is DMA
capable. This was a requirement since v2.2 days, but it
usually worked nevertheless.

[   17.504959] ------------[ cut here ]------------
[   17.505488] WARNING: CPU: 0 PID: 93 at drivers/usb/core/hcd.c:1587 usb_hcd_map_urb_for_dma+0x37c/0x570
[   17.506545] transfer buffer not dma capable
[   17.507022] Modules linked in:
[   17.507370] CPU: 0 PID: 93 Comm: i2cdetect Not tainted 4.11.0-rc8+ #10
[   17.508103] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[   17.509039] Call Trace:
[   17.509320]  ? dump_stack+0x5c/0x78
[   17.509714]  ? __warn+0xbe/0xe0
[   17.510073]  ? warn_slowpath_fmt+0x5a/0x80
[   17.510532]  ? nommu_map_sg+0xb0/0xb0
[   17.510949]  ? usb_hcd_map_urb_for_dma+0x37c/0x570
[   17.511482]  ? usb_hcd_submit_urb+0x336/0xab0
[   17.511976]  ? wait_for_completion_timeout+0x12f/0x1a0
[   17.512549]  ? wait_for_completion_timeout+0x65/0x1a0
[   17.513125]  ? usb_start_wait_urb+0x65/0x160
[   17.513604]  ? usb_control_msg+0xdc/0x130
[   17.514061]  ? usb_xfer+0xa4/0x2a0
[   17.514445]  ? __i2c_transfer+0x108/0x3c0
[   17.514899]  ? i2c_transfer+0x57/0xb0
[   17.515310]  ? i2c_smbus_xfer_emulated+0x12f/0x590
[   17.515851]  ? _raw_spin_unlock_irqrestore+0x11/0x20
[   17.516408]  ? i2c_smbus_xfer+0x125/0x330
[   17.516876]  ? i2c_smbus_xfer+0x125/0x330
[   17.517329]  ? i2cdev_ioctl_smbus+0x1c1/0x2b0
[   17.517824]  ? i2cdev_ioctl+0x75/0x1c0
[   17.518248]  ? do_vfs_ioctl+0x9f/0x600
[   17.518671]  ? vfs_write+0x144/0x190
[   17.519078]  ? SyS_ioctl+0x74/0x80
[   17.519463]  ? entry_SYSCALL_64_fastpath+0x1e/0xad
[   17.519959] ---[ end trace d047c04982f5ac50 ]---

Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.co.uk>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Till Harbaum <till@harbaum.org>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agovlan: Fix tcp checksum offloads in Q-in-Q vlans
Vlad Yasevich [Tue, 23 May 2017 17:38:41 +0000 (13:38 -0400)]
vlan: Fix tcp checksum offloads in Q-in-Q vlans

commit 35d2f80b07bbe03fb358afb0bdeff7437a7d67ff upstream.

It appears that TCP checksum offloading has been broken for
Q-in-Q vlans.  The behavior was execerbated by the
series
    commit afb0bc972b52 ("Merge branch 'stacked_vlan_tso'")
that that enabled accleleration features on stacked vlans.

However, event without that series, it is possible to trigger
this issue.  It just requires a lot more specialized configuration.

The root cause is the interaction between how
netdev_intersect_features() works, the features actually set on
the vlan devices and HW having the ability to run checksum with
longer headers.

The issue starts when netdev_interesect_features() replaces
NETIF_F_HW_CSUM with a combination of NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM,
if the HW advertises IP|IPV6 specific checksums.  This happens
for tagged and multi-tagged packets.   However, HW that enables
IP|IPV6 checksum offloading doesn't gurantee that packets with
arbitrarily long headers can be checksummed.

This patch disables IP|IPV6 checksums on the packet for multi-tagged
packets.

CC: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
CC: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet: phy: marvell: Limit errata to 88m1101
Andrew Lunn [Tue, 23 May 2017 15:49:13 +0000 (17:49 +0200)]
net: phy: marvell: Limit errata to 88m1101

commit f2899788353c13891412b273fdff5f02d49aa40f upstream.

The 88m1101 has an errata when configuring autoneg. However, it was
being applied to many other Marvell PHYs as well. Limit its scope to
just the 88m1101.

Fixes: 76884679c644 ("phylib: Add support for Marvell 88e1111S and 88e1145")
Reported-by: Daniel Walker <danielwa@cisco.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Harini Katakam <harinik@xilinx.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonetem: fix skb_orphan_partial()
Eric Dumazet [Thu, 11 May 2017 22:24:41 +0000 (15:24 -0700)]
netem: fix skb_orphan_partial()

commit f6ba8d33cfbb46df569972e64dbb5bb7e929bfd9 upstream.

I should have known that lowering skb->truesize was dangerous :/

In case packets are not leaving the host via a standard Ethernet device,
but looped back to local sockets, bad things can happen, as reported
by Michael Madsen ( https://bugzilla.kernel.org/show_bug.cgi?id=195713 )

So instead of tweaking skb->truesize, lets change skb->destructor
and keep a reference on the owner socket via its sk_refcnt.

Fixes: f2f872f9272a ("netem: Introduce skb_orphan_partial() helper")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Michael Madsen <mkm@nabto.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoipv4: add reference counting to metrics
Eric Dumazet [Thu, 25 May 2017 21:27:35 +0000 (14:27 -0700)]
ipv4: add reference counting to metrics

[ Upstream commit 3fb07daff8e99243366a081e5129560734de4ada ]

Andrey Konovalov reported crashes in ipv4_mtu()

I could reproduce the issue with KASAN kernels, between
10.246.7.151 and 10.246.7.152 :

1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &

2) At the same time run following loop :
while :
do
 ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
 ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
done

Cong Wang attempted to add back rt->fi in commit
82486aa6f1b9 ("ipv4: restore rt->fi for reference counting")
but this proved to add some issues that were complex to solve.

Instead, I suggested to add a refcount to the metrics themselves,
being a standalone object (in particular, no reference to other objects)

I tried to make this patch as small as possible to ease its backport,
instead of being super clean. Note that we believe that only ipv4 dst
need to take care of the metric refcount. But if this is wrong,
this patch adds the basic infrastructure to extend this to other
families.

Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
for his efforts on this problem.

Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agosctp: fix ICMP processing if skb is non-linear
Davide Caratti [Thu, 25 May 2017 17:14:56 +0000 (19:14 +0200)]
sctp: fix ICMP processing if skb is non-linear

[ Upstream commit 804ec7ebe8ea003999ca8d1bfc499edc6a9e07df ]

sometimes ICMP replies to INIT chunks are ignored by the client, even if
the encapsulated SCTP headers match an open socket. This happens when the
ICMP packet is carried by a paged skb: use skb_header_pointer() to read
packet contents beyond the SCTP header, so that chunk header and initiate
tag are validated correctly.

v2:
- don't use skb_header_pointer() to read the transport header, since
  icmp_socket_deliver() already puts these 8 bytes in the linear area.
- change commit message to make specific reference to INIT chunks.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agotcp: avoid fastopen API to be used on AF_UNSPEC
Wei Wang [Wed, 24 May 2017 16:59:31 +0000 (09:59 -0700)]
tcp: avoid fastopen API to be used on AF_UNSPEC

[ Upstream commit ba615f675281d76fd19aa03558777f81fb6b6084 ]

Fastopen API should be used to perform fastopen operations on the TCP
socket. It does not make sense to use fastopen API to perform disconnect
by calling it with AF_UNSPEC. The fastopen data path is also prone to
race conditions and bugs when using with AF_UNSPEC.

One issue reported and analyzed by Vegard Nossum is as follows:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Thread A:                            Thread B:
------------------------------------------------------------------------
sendto()
 - tcp_sendmsg()
     - sk_stream_memory_free() = 0
         - goto wait_for_sndbuf
     - sk_stream_wait_memory()
        - sk_wait_event() // sleep
          |                          sendto(flags=MSG_FASTOPEN, dest_addr=AF_UNSPEC)
  |                           - tcp_sendmsg()
  |                              - tcp_sendmsg_fastopen()
  |                                 - __inet_stream_connect()
  |                                    - tcp_disconnect() //because of AF_UNSPEC
  |                                       - tcp_transmit_skb()// send RST
  |                                    - return 0; // no reconnect!
  |                           - sk_stream_wait_connect()
  |                                 - sock_error()
  |                                    - xchg(&sk->sk_err, 0)
  |                                    - return -ECONNRESET
- ... // wake up, see sk->sk_err == 0
    - skb_entail() on TCP_CLOSE socket

If the connection is reopened then we will send a brand new SYN packet
after thread A has already queued a buffer. At this point I think the
socket internal state (sequence numbers etc.) becomes messed up.

When the new connection is closed, the FIN-ACK is rejected because the
sequence number is outside the window. The other side tries to
retransmit,
but __tcp_retransmit_skb() calls tcp_trim_head() on an empty skb which
corrupts the skb data length and hits a BUG() in copy_and_csum_bits().
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Hence, this patch adds a check for AF_UNSPEC in the fastopen data path
and return EOPNOTSUPP to user if such case happens.

Fixes: cf60af03ca4e7 ("tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agovirtio-net: enable TSO/checksum offloads for Q-in-Q vlans
Vlad Yasevich [Tue, 23 May 2017 17:38:43 +0000 (13:38 -0400)]
virtio-net: enable TSO/checksum offloads for Q-in-Q vlans

[ Upstream commit 2836b4f224d4fd7d1a2b23c3eecaf0f0ae199a74 ]

Since virtio does not provide it's own ndo_features_check handler,
TSO, and now checksum offload, are disabled for stacked vlans.
Re-enable the support and let the host take care of it.  This
restores/improves Guest-to-Guest performance over Q-in-Q vlans.

Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agobe2net: Fix offload features for Q-in-Q packets
Vlad Yasevich [Tue, 23 May 2017 17:38:42 +0000 (13:38 -0400)]
be2net: Fix offload features for Q-in-Q packets

[ Upstream commit cc6e9de62a7f84c9293a2ea41bc412b55bb46e85 ]

At least some of the be2net cards do not seem to be capabled
of performing checksum offload computions on Q-in-Q packets.
In these case, the recevied checksum on the remote is invalid
and TCP syn packets are dropped.

This patch adds a call to check disbled acceleration features
on Q-in-Q tagged traffic.

CC: Sathya Perla <sathya.perla@broadcom.com>
CC: Ajit Khaparde <ajit.khaparde@broadcom.com>
CC: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
CC: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoipv6: fix out of bound writes in __ip6_append_data()
Eric Dumazet [Fri, 19 May 2017 21:17:48 +0000 (14:17 -0700)]
ipv6: fix out of bound writes in __ip6_append_data()

[ Upstream commit 232cd35d0804cc241eb887bb8d4d9b3b9881c64a ]

Andrey Konovalov and idaifish@gmail.com reported crashes caused by
one skb shared_info being overwritten from __ip6_append_data()

Andrey program lead to following state :

copy -4200 datalen 2000 fraglen 2040
maxfraglen 2040 alloclen 2048 transhdrlen 0 offset 0 fraggap 6200

The skb_copy_and_csum_bits(skb_prev, maxfraglen, data + transhdrlen,
fraggap, 0); is overwriting skb->head and skb_shared_info

Since we apparently detect this rare condition too late, move the
code earlier to even avoid allocating skb and risking crashes.

Once again, many thanks to Andrey and syzkaller team.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: <idaifish@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agobridge: start hello_timer when enabling KERNEL_STP in br_stp_start
Xin Long [Fri, 19 May 2017 14:20:29 +0000 (22:20 +0800)]
bridge: start hello_timer when enabling KERNEL_STP in br_stp_start

[ Upstream commit 6d18c732b95c0a9d35e9f978b4438bba15412284 ]

Since commit 76b91c32dd86 ("bridge: stp: when using userspace stp stop
kernel hello and hold timers"), bridge would not start hello_timer if
stp_enabled is not KERNEL_STP when br_dev_open.

The problem is even if users set stp_enabled with KERNEL_STP later,
the timer will still not be started. It causes that KERNEL_STP can
not really work. Users have to re-ifup the bridge to avoid this.

This patch is to fix it by starting br->hello_timer when enabling
KERNEL_STP in br_stp_start.

As an improvement, it's also to start hello_timer again only when
br->stp_enabled is KERNEL_STP in br_hello_timer_expired, there is
no reason to start the timer again when it's NO_STP.

Fixes: 76b91c32dd86 ("bridge: stp: when using userspace stp stop kernel hello and hold timers")
Reported-by: Haidong Li <haili@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Ivan Vecera <cera@cera.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoqmi_wwan: add another Lenovo EM74xx device ID
Bjørn Mork [Wed, 17 May 2017 14:31:41 +0000 (16:31 +0200)]
qmi_wwan: add another Lenovo EM74xx device ID

[ Upstream commit 486181bcb3248e2f1977f4e69387a898234a4e1e ]

In their infinite wisdom, and never ending quest for end user frustration,
Lenovo has decided to use a new USB device ID for the wwan modules in
their 2017 laptops.  The actual hardware is still the Sierra Wireless
EM7455 or EM7430, depending on region.

Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agobridge: netlink: check vlan_default_pvid range
Tobias Jungel [Wed, 17 May 2017 07:29:12 +0000 (09:29 +0200)]
bridge: netlink: check vlan_default_pvid range

[ Upstream commit a285860211bf257b0e6d522dac6006794be348af ]

Currently it is allowed to set the default pvid of a bridge to a value
above VLAN_VID_MASK (0xfff). This patch adds a check to br_validate and
returns -EINVAL in case the pvid is out of bounds.

Reproduce by calling:

[root@test ~]# ip l a type bridge
[root@test ~]# ip l a type dummy
[root@test ~]# ip l s bridge0 type bridge vlan_filtering 1
[root@test ~]# ip l s bridge0 type bridge vlan_default_pvid 9999
[root@test ~]# ip l s dummy0 master bridge0
[root@test ~]# bridge vlan
port vlan ids
bridge0  9999 PVID Egress Untagged

dummy0  9999 PVID Egress Untagged

Fixes: 0f963b7592ef ("bridge: netlink: add support for default_pvid")
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Tobias Jungel <tobias.jungel@bisdn.de>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoipv6: Check ip6_find_1stfragopt() return value properly.
David S. Miller [Thu, 18 May 2017 02:54:11 +0000 (22:54 -0400)]
ipv6: Check ip6_find_1stfragopt() return value properly.

[ Upstream commit 7dd7eb9513bd02184d45f000ab69d78cb1fa1531 ]

Do not use unsigned variables to see if it returns a negative
error or not.

Fixes: 2423496af35d ("ipv6: Prevent overrun when parsing v6 header options")
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoipv6: Prevent overrun when parsing v6 header options
Craig Gallek [Tue, 16 May 2017 18:36:23 +0000 (14:36 -0400)]
ipv6: Prevent overrun when parsing v6 header options

[ Upstream commit 2423496af35d94a87156b063ea5cedffc10a70a1 ]

The KASAN warning repoted below was discovered with a syzkaller
program.  The reproducer is basically:
  int s = socket(AF_INET6, SOCK_RAW, NEXTHDR_HOP);
  send(s, &one_byte_of_data, 1, MSG_MORE);
  send(s, &more_than_mtu_bytes_data, 2000, 0);

The socket() call sets the nexthdr field of the v6 header to
NEXTHDR_HOP, the first send call primes the payload with a non zero
byte of data, and the second send call triggers the fragmentation path.

The fragmentation code tries to parse the header options in order
to figure out where to insert the fragment option.  Since nexthdr points
to an invalid option, the calculation of the size of the network header
can made to be much larger than the linear section of the skb and data
is read outside of it.

This fix makes ip6_find_1stfrag return an error if it detects
running out-of-bounds.

[   42.361487] ==================================================================
[   42.364412] BUG: KASAN: slab-out-of-bounds in ip6_fragment+0x11c8/0x3730
[   42.365471] Read of size 840 at addr ffff88000969e798 by task ip6_fragment-oo/3789
[   42.366469]
[   42.366696] CPU: 1 PID: 3789 Comm: ip6_fragment-oo Not tainted 4.11.0+ #41
[   42.367628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
[   42.368824] Call Trace:
[   42.369183]  dump_stack+0xb3/0x10b
[   42.369664]  print_address_description+0x73/0x290
[   42.370325]  kasan_report+0x252/0x370
[   42.370839]  ? ip6_fragment+0x11c8/0x3730
[   42.371396]  check_memory_region+0x13c/0x1a0
[   42.371978]  memcpy+0x23/0x50
[   42.372395]  ip6_fragment+0x11c8/0x3730
[   42.372920]  ? nf_ct_expect_unregister_notifier+0x110/0x110
[   42.373681]  ? ip6_copy_metadata+0x7f0/0x7f0
[   42.374263]  ? ip6_forward+0x2e30/0x2e30
[   42.374803]  ip6_finish_output+0x584/0x990
[   42.375350]  ip6_output+0x1b7/0x690
[   42.375836]  ? ip6_finish_output+0x990/0x990
[   42.376411]  ? ip6_fragment+0x3730/0x3730
[   42.376968]  ip6_local_out+0x95/0x160
[   42.377471]  ip6_send_skb+0xa1/0x330
[   42.377969]  ip6_push_pending_frames+0xb3/0xe0
[   42.378589]  rawv6_sendmsg+0x2051/0x2db0
[   42.379129]  ? rawv6_bind+0x8b0/0x8b0
[   42.379633]  ? _copy_from_user+0x84/0xe0
[   42.380193]  ? debug_check_no_locks_freed+0x290/0x290
[   42.380878]  ? ___sys_sendmsg+0x162/0x930
[   42.381427]  ? rcu_read_lock_sched_held+0xa3/0x120
[   42.382074]  ? sock_has_perm+0x1f6/0x290
[   42.382614]  ? ___sys_sendmsg+0x167/0x930
[   42.383173]  ? lock_downgrade+0x660/0x660
[   42.383727]  inet_sendmsg+0x123/0x500
[   42.384226]  ? inet_sendmsg+0x123/0x500
[   42.384748]  ? inet_recvmsg+0x540/0x540
[   42.385263]  sock_sendmsg+0xca/0x110
[   42.385758]  SYSC_sendto+0x217/0x380
[   42.386249]  ? SYSC_connect+0x310/0x310
[   42.386783]  ? __might_fault+0x110/0x1d0
[   42.387324]  ? lock_downgrade+0x660/0x660
[   42.387880]  ? __fget_light+0xa1/0x1f0
[   42.388403]  ? __fdget+0x18/0x20
[   42.388851]  ? sock_common_setsockopt+0x95/0xd0
[   42.389472]  ? SyS_setsockopt+0x17f/0x260
[   42.390021]  ? entry_SYSCALL_64_fastpath+0x5/0xbe
[   42.390650]  SyS_sendto+0x40/0x50
[   42.391103]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[   42.391731] RIP: 0033:0x7fbbb711e383
[   42.392217] RSP: 002b:00007ffff4d34f28 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[   42.393235] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbbb711e383
[   42.394195] RDX: 0000000000001000 RSI: 00007ffff4d34f60 RDI: 0000000000000003
[   42.395145] RBP: 0000000000000046 R08: 00007ffff4d34f40 R09: 0000000000000018
[   42.396056] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000400aad
[   42.396598] R13: 0000000000000066 R14: 00007ffff4d34ee0 R15: 00007fbbb717af00
[   42.397257]
[   42.397411] Allocated by task 3789:
[   42.397702]  save_stack_trace+0x16/0x20
[   42.398005]  save_stack+0x46/0xd0
[   42.398267]  kasan_kmalloc+0xad/0xe0
[   42.398548]  kasan_slab_alloc+0x12/0x20
[   42.398848]  __kmalloc_node_track_caller+0xcb/0x380
[   42.399224]  __kmalloc_reserve.isra.32+0x41/0xe0
[   42.399654]  __alloc_skb+0xf8/0x580
[   42.400003]  sock_wmalloc+0xab/0xf0
[   42.400346]  __ip6_append_data.isra.41+0x2472/0x33d0
[   42.400813]  ip6_append_data+0x1a8/0x2f0
[   42.401122]  rawv6_sendmsg+0x11ee/0x2db0
[   42.401505]  inet_sendmsg+0x123/0x500
[   42.401860]  sock_sendmsg+0xca/0x110
[   42.402209]  ___sys_sendmsg+0x7cb/0x930
[   42.402582]  __sys_sendmsg+0xd9/0x190
[   42.402941]  SyS_sendmsg+0x2d/0x50
[   42.403273]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[   42.403718]
[   42.403871] Freed by task 1794:
[   42.404146]  save_stack_trace+0x16/0x20
[   42.404515]  save_stack+0x46/0xd0
[   42.404827]  kasan_slab_free+0x72/0xc0
[   42.405167]  kfree+0xe8/0x2b0
[   42.405462]  skb_free_head+0x74/0xb0
[   42.405806]  skb_release_data+0x30e/0x3a0
[   42.406198]  skb_release_all+0x4a/0x60
[   42.406563]  consume_skb+0x113/0x2e0
[   42.406910]  skb_free_datagram+0x1a/0xe0
[   42.407288]  netlink_recvmsg+0x60d/0xe40
[   42.407667]  sock_recvmsg+0xd7/0x110
[   42.408022]  ___sys_recvmsg+0x25c/0x580
[   42.408395]  __sys_recvmsg+0xd6/0x190
[   42.408753]  SyS_recvmsg+0x2d/0x50
[   42.409086]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[   42.409513]
[   42.409665] The buggy address belongs to the object at ffff88000969e780
[   42.409665]  which belongs to the cache kmalloc-512 of size 512
[   42.410846] The buggy address is located 24 bytes inside of
[   42.410846]  512-byte region [ffff88000969e780ffff88000969e980)
[   42.411941] The buggy address belongs to the page:
[   42.412405] page:ffffea000025a780 count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
[   42.413298] flags: 0x100000000008100(slab|head)
[   42.413729] raw: 0100000000008100 0000000000000000 0000000000000000 00000001800c000c
[   42.414387] raw: ffffea00002a9500 0000000900000007 ffff88000c401280 0000000000000000
[   42.415074] page dumped because: kasan: bad access detected
[   42.415604]
[   42.415757] Memory state around the buggy address:
[   42.416222]  ffff88000969e880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   42.416904]  ffff88000969e900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   42.417591] >ffff88000969e980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   42.418273]                    ^
[   42.418588]  ffff88000969ea00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   42.419273]  ffff88000969ea80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   42.419882] ==================================================================

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet: Improve handling of failures on link and route dumps
David Ahern [Tue, 16 May 2017 06:19:17 +0000 (23:19 -0700)]
net: Improve handling of failures on link and route dumps

[ Upstream commit f6c5775ff0bfa62b072face6bf1d40f659f194b2 ]

In general, rtnetlink dumps do not anticipate failure to dump a single
object (e.g., link or route) on a single pass. As both route and link
objects have grown via more attributes, that is no longer a given.

netlink dumps can handle a failure if the dump function returns an
error; specifically, netlink_dump adds the return code to the response
if it is <= 0 so userspace is notified of the failure. The missing
piece is the rtnetlink dump functions returning the error.

Fix route and link dump functions to return the errors if no object is
added to an skb (detected by skb->len != 0). IPv6 route dumps
(rt6_dump_route) already return the error; this patch updates IPv4 and
link dumps. Other dump functions may need to be ajusted as well.

Reported-by: Jan Moskyto Matejka <mq@ucw.cz>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agotcp: eliminate negative reordering in tcp_clean_rtx_queue
Soheil Hassas Yeganeh [Mon, 15 May 2017 21:05:47 +0000 (17:05 -0400)]
tcp: eliminate negative reordering in tcp_clean_rtx_queue

[ Upstream commit bafbb9c73241760023d8981191ddd30bb1c6dbac ]

tcp_ack() can call tcp_fragment() which may dededuct the
value tp->fackets_out when MSS changes. When prior_fackets
is larger than tp->fackets_out, tcp_clean_rtx_queue() can
invoke tcp_update_reordering() with negative values. This
results in absurd tp->reodering values higher than
sysctl_tcp_max_reordering.

Note that tcp_update_reordering indeeds sets tp->reordering
to min(sysctl_tcp_max_reordering, metric), but because
the comparison is signed, a negative metric always wins.

Fixes: c7caf8d3ed7a ("[TCP]: Fix reord detection due to snd_una covered holes")
Reported-by: Rebecca Isaacs <risaacs@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agosctp: do not inherit ipv6_{mc|ac|fl}_list from parent
Eric Dumazet [Wed, 17 May 2017 14:16:40 +0000 (07:16 -0700)]
sctp: do not inherit ipv6_{mc|ac|fl}_list from parent

[ Upstream commit fdcee2cbb8438702ea1b328fb6e0ac5e9a40c7f8 ]

SCTP needs fixes similar to 83eaddab4378 ("ipv6/dccp: do not inherit
ipv6_mc_list from parent"), otherwise bad things can happen.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agosctp: fix src address selection if using secondary addresses for ipv6
Xin Long [Fri, 12 May 2017 06:39:52 +0000 (14:39 +0800)]
sctp: fix src address selection if using secondary addresses for ipv6

[ Upstream commit dbc2b5e9a09e9a6664679a667ff81cff6e5f2641 ]

Commit 0ca50d12fe46 ("sctp: fix src address selection if using secondary
addresses") has fixed a src address selection issue when using secondary
addresses for ipv4.

Now sctp ipv6 also has the similar issue. When using a secondary address,
sctp_v6_get_dst tries to choose the saddr which has the most same bits
with the daddr by sctp_v6_addr_match_len. It may make some cases not work
as expected.

hostA:
  [1] fd21:356b:459a:cf10::11 (eth1)
  [2] fd21:356b:459a:cf20::11 (eth2)

hostB:
  [a] fd21:356b:459a:cf30::2  (eth1)
  [b] fd21:356b:459a:cf40::2  (eth2)

route from hostA to hostB:
  fd21:356b:459a:cf30::/64 dev eth1  metric 1024  mtu 1500

The expected path should be:
  fd21:356b:459a:cf10::11 <-> fd21:356b:459a:cf30::2
But addr[2] matches addr[a] more bits than addr[1] does, according to
sctp_v6_addr_match_len. It causes the path to be:
  fd21:356b:459a:cf20::11 <-> fd21:356b:459a:cf30::2

This patch is to fix it with the same way as Marcelo's fix for sctp ipv4.
As no ip_dev_find for ipv6, this patch is to use ipv6_chk_addr to check
if the saddr is in a dev instead.

Note that for backwards compatibility, it will still do the addr_match_len
check here when no optimal is found.

Reported-by: Patrick Talbert <ptalbert@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agotcp: avoid fragmenting peculiar skbs in SACK
Yuchung Cheng [Thu, 11 May 2017 00:01:27 +0000 (17:01 -0700)]
tcp: avoid fragmenting peculiar skbs in SACK

[ Upstream commit b451e5d24ba6687c6f0e7319c727a709a1846c06 ]

This patch fixes a bug in splitting an SKB during SACK
processing. Specifically if an skb contains multiple
packets and is only partially sacked in the higher sequences,
tcp_match_sack_to_skb() splits the skb and marks the second fragment
as SACKed.

The current code further attempts rounding up the first fragment
to MSS boundaries. But it misses a boundary condition when the
rounded-up fragment size (pkt_len) is exactly skb size.  Spliting
such an skb is pointless and causses a kernel warning and aborts
the SACK processing. This patch universally checks such over-split
before calling tcp_fragment to prevent these unnecessary warnings.

Fixes: adb92db857ee ("tcp: Make SACK code to split only at mss boundaries")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agos390/qeth: avoid null pointer dereference on OSN
Julian Wiedmann [Wed, 10 May 2017 17:07:53 +0000 (19:07 +0200)]
s390/qeth: avoid null pointer dereference on OSN

[ Upstream commit 25e2c341e7818a394da9abc403716278ee646014 ]

Access card->dev only after checking whether's its valid.

Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agos390/qeth: unbreak OSM and OSN support
Julian Wiedmann [Wed, 10 May 2017 17:07:52 +0000 (19:07 +0200)]
s390/qeth: unbreak OSM and OSN support

[ Upstream commit 2d2ebb3ed0c6acfb014f98e427298673a5d07b82 ]

commit b4d72c08b358 ("qeth: bridgeport support - basic control")
broke the support for OSM and OSN devices as follows:

As OSM and OSN are L2 only, qeth_core_probe_device() does an early
setup by loading the l2 discipline and calling qeth_l2_probe_device().
In this context, adding the l2-specific bridgeport sysfs attributes
via qeth_l2_create_device_attributes() hits a BUG_ON in fs/sysfs/group.c,
since the basic sysfs infrastructure for the device hasn't been
established yet.

Note that OSN actually has its own unique sysfs attributes
(qeth_osn_devtype), so the additional attributes shouldn't be created
at all.
For OSM, add a new qeth_l2_devtype that contains all the common
and l2-specific sysfs attributes.
When qeth_core_probe_device() does early setup for OSM or OSN, assign
the corresponding devtype so that the ccwgroup probe code creates the
full set of sysfs attributes.
This allows us to skip qeth_l2_create_device_attributes() in case
of an early setup.

Any device that can't do early setup will initially have only the
generic sysfs attributes, and when it's probed later
qeth_l2_probe_device() adds the l2-specific attributes.

If an early-setup device is removed (by calling ccwgroup_ungroup()),
device_unregister() will - using the devtype - delete the
l2-specific attributes before qeth_l2_remove_device() is called.
So make sure to not remove them twice.

What complicates the issue is that qeth_l2_probe_device() and
qeth_l2_remove_device() is also called on a device when its
layer2 attribute changes (ie. its layer mode is switched).
For early-setup devices this wouldn't work properly - we wouldn't
remove the l2-specific attributes when switching to L3.
But switching the layer mode doesn't actually make any sense;
we already decided that the device can only operate in L2!
So just refuse to switch the layer mode on such devices. Note that
OSN doesn't have a layer2 attribute, so we only need to special-case
OSM.

Based on an initial patch by Ursula Braun.

Fixes: b4d72c08b358 ("qeth: bridgeport support - basic control")
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agos390/qeth: handle sysfs error during initialization
Ursula Braun [Wed, 10 May 2017 17:07:51 +0000 (19:07 +0200)]
s390/qeth: handle sysfs error during initialization

[ Upstream commit 9111e7880ccf419548c7b0887df020b08eadb075 ]

When setting up the device from within the layer discipline's
probe routine, creating the layer-specific sysfs attributes can fail.
Report this error back to the caller, and handle it by
releasing the layer discipline.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
[jwi: updated commit msg, moved an OSN change to a subsequent patch]
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoipv6/dccp: do not inherit ipv6_mc_list from parent
WANG Cong [Tue, 9 May 2017 23:59:54 +0000 (16:59 -0700)]
ipv6/dccp: do not inherit ipv6_mc_list from parent

[ Upstream commit 83eaddab4378db256d00d295bda6ca997cd13a52 ]

Like commit 657831ffc38e ("dccp/tcp: do not inherit mc_list from parent")
we should clear ipv6_mc_list etc. for IPv6 sockets too.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agodccp/tcp: do not inherit mc_list from parent
Eric Dumazet [Tue, 9 May 2017 13:29:19 +0000 (06:29 -0700)]
dccp/tcp: do not inherit mc_list from parent

[ Upstream commit 657831ffc38e30092a2d5f03d385d710eb88b09a ]

syzkaller found a way to trigger double frees from ip_mc_drop_socket()

It turns out that leave a copy of parent mc_list at accept() time,
which is very bad.

Very similar to commit 8b485ce69876 ("tcp: do not inherit
fastopen_req from parent")

Initial report from Pray3r, completed by Andrey one.
Thanks a lot to them !

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Pray3r <pray3r.z@gmail.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agosparc: Fix -Wstringop-overflow warning
Orlando Arias [Tue, 16 May 2017 19:34:00 +0000 (15:34 -0400)]
sparc: Fix -Wstringop-overflow warning

[ Upstream commit deba804c90642c8ed0f15ac1083663976d578f54 ]

Greetings,

GCC 7 introduced the -Wstringop-overflow flag to detect buffer overflows
in calls to string handling functions [1][2]. Due to the way
``empty_zero_page'' is declared in arch/sparc/include/setup.h, this
causes a warning to trigger at compile time in the function mem_init(),
which is subsequently converted to an error. The ensuing patch fixes
this issue and aligns the declaration of empty_zero_page to that of
other architectures. Thank you.

Cheers,
Orlando.

[1] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg02308.html
[2] https://gcc.gnu.org/gcc-7/changes.html

Signed-off-by: Orlando Arias <oarias@knights.ucf.edu>
--------------------------------------------------------------------------------
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7 years agoBACKPORT: mm/slab: clean up DEBUG_PAGEALLOC processing code
Joonsoo Kim [Tue, 15 Mar 2016 21:54:21 +0000 (14:54 -0700)]
BACKPORT: mm/slab: clean up DEBUG_PAGEALLOC processing code

Currently, open code for checking DEBUG_PAGEALLOC cache is spread to
some sites.  It makes code unreadable and hard to change.

This patch cleans up this code.  The following patch will change the
criteria for DEBUG_PAGEALLOC cache so this clean-up will help it, too.

[akpm@linux-foundation.org: fix build with CONFIG_DEBUG_PAGEALLOC=n]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[AmitP: fix build with CONFIG_DEBUG_PAGEALLOC=n]
(cherry picked from commit 40b44137971c2e5865a78f9f7de274449983ccb5)
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
7 years agoMerge branch 'v4.4/topic/hibernate' into linux-linaro-lsk-v4.4
Alex Shi [Fri, 26 May 2017 05:55:23 +0000 (13:55 +0800)]
Merge branch 'v4.4/topic/hibernate' into linux-linaro-lsk-v4.4

7 years agoarm64: make ARCH_SUPPORTS_DEBUG_PAGEALLOC depend on !HIBERNATION
Will Deacon [Thu, 28 Apr 2016 18:38:16 +0000 (19:38 +0100)]
arm64: make ARCH_SUPPORTS_DEBUG_PAGEALLOC depend on !HIBERNATION

Selecting both DEBUG_PAGEALLOC and HIBERNATION results in a build failure:

| kernel/built-in.o: In function `saveable_page':
| memremap.c:(.text+0x100f90): undefined reference to `kernel_page_present'
| kernel/built-in.o: In function `swsusp_save':
| memremap.c:(.text+0x1026f0): undefined reference to `kernel_page_present'
| make: *** [vmlinux] Error 1

James sayeth:

"This is caused by DEBUG_PAGEALLOC, which clears the PTE_VALID bit from
'free' pages. Hibernate uses it as a hint that it shouldn't save/access
that page. This function is used to test whether the PTE_VALID bit has
been cleared by kernel_map_pages(), hibernate is the only user.

Fixing this exposes a bigger problem with that configuration though: if
the resume kernel has cut free pages out of the linear map, we copy this
swiss-cheese view of memory, and try to use it to restore...

We can fixup the copy of the linear map, but it then explodes in my lazy
'clean the whole kernel to PoC' after resume, as now both the kernel and
linear map have holes in them."

On closer inspection, the whole Kconfig machinery around DEBUG_PAGEALLOC,
HIBERNATION, ARCH_SUPPORTS_DEBUG_PAGEALLOC and PAGE_POISONING looks like
it might need some affection. In particular, DEBUG_ALLOC has:

> depends on !HIBERNATION || ARCH_SUPPORTS_DEBUG_PAGEALLOC && !PPC && !SPARC

which looks pretty fishy.

For the moment, require ARCH_SUPPORTS_DEBUG_PAGEALLOC to depend on
!HIBERNATION on arm64 and get allmodconfig building again.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
(cherry picked from commit da24eb1f3f9e2c7b75c5f8c40d8e48e2c4789596)
Signed-off-by: Alex Shi <alex.shi@linaro.org>