diff --git a/specs/ocata/stochastic-weighing-scheduler.rst b/specs/ocata/stochastic-weighing-scheduler.rst
new file mode 100644
index 0000000..f2d71c3
--- /dev/null
+++ b/specs/ocata/stochastic-weighing-scheduler.rst
@@ -0,0 +1,258 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============================
Stochastic Weighing Scheduler
=============================

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/manila/+spec/stochastic-weighing-scheduler

The filter scheduler is the de facto standard scheduler for Manila and has
a lot of desirable properties. However, there are certain scenarios where it's
hard or impossible to get it to do the right thing. I think some small tweaks
could help make admins' lives easier.


Problem description
===================

I'm concerned about two specific problems:

1. When there are multiple backends that are not identical, it can be hard to
   ensure that load is spread across all the backends. Consider the case of a
   few "large" backends mixed with some "small" backends. Even if they're from
   the same vendor, by default new shares will go exclusively to the large
   backends until their free space decreases to the same level as the small
   backends. This can be worked around by using something other than free
   space to weigh hosts, but no matter what you choose, you'll have a similar
   issue whenever the backends aren't homogeneous.

2. Even if the admin is able to ensure that all the backends are identical in
   every way, at some point the cloud will probably need to grow by adding
   new storage backends. When this happens there will be a mix of brand new
   empty backends and mostly full backends. No matter what kind of weighing
   function you use, initially 100% of new requests will be scheduled on the
   new backends. Depending on how good or bad the weighing function is, it
   could take a long time before the old backends start receiving new
   requests, and during this period system performance is likely to drop
   dramatically. The problem is particularly bad if the upgrade is a small
   one: consider adding 1 new backend to a system with 10 existing backends.
   If 100% of new shares go to the new backend, then for some period there
   will be 10x load on that single backend.

There is one existing partial solution to the above problems -- the goodness
weigher -- but it has some limitations worth mentioning. Consider an ideal
goodness function -- an oracle that always returns the right value, such
that the best backend for new shares is sorted to the top. Because the inputs
to the goodness function (other than available space) are only evaluated every
60 seconds, bursts of creation requests will nearly always go to the same
backend within a 60 second window. While we could shrink the time window of
this problem by sending more frequent updates, that has its own costs and also
has diminishing returns. In the more realistic case of a goodness function
that's non-ideal, it may take longer than 60 seconds for the goodness function
output to reflect changes based on recent creation requests.
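
For illustration only, here is a minimal, self-contained sketch of the burst
behavior described above. It is not Manila code: the pool names and weights
are invented, and ``pick_winner_take_all`` is a simplified stand-in for the
scheduler's "sort and take the top host" step::

    # Hypothetical weight snapshot, frozen until the next stats update.
    stale_weights = {"pool-a": 72.0, "pool-b": 69.5, "pool-c": 40.0}

    def pick_winner_take_all(weights):
        """Simplified current behavior: always pick the highest weight."""
        return max(weights, key=weights.get)

    # A burst of 20 creation requests arriving within one stats window...
    placements = [pick_winner_take_all(stale_weights) for _ in range(20)]
    print(placements.count("pool-a"))  # 20 -- every request lands on pool-a,
                                       # even though pool-b is nearly as good.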


Use Cases
=========

The existing scheduler handles homogeneous backends best, and relies on a
relatively low rate of creation requests compared to the capacity of the whole
system, so that it can keep up-to-date information with which to make optimal
decisions. It also deals best with cases where you don't add capacity over
time.

I'm interested in making the scheduler perform well across a broad range of
deployment scenarios:

1. Mixed vendor scenarios
2. A mix of generations of hardware from a single vendor
3. A mix of capacities of hardware (small vs. large configs)
4. Adding new capacity to a running cloud to deal with growth

These are all deployer/administrator concerns. Part of the proposed solution
is to enable certain things which are impossible today, but mostly the goal
is to make the average case "just work" so that administrators don't have to
babysit the system to get reasonable behavior.


Proposed change
===============

Currently the filter scheduler does 3 things:

1. Takes a list of all pools and filters out the ones that are unsuitable for
   a given creation request.
2. Generates a weight for each pool based on one of the available weighers.
3. Sorts the pools and chooses the one with the highest weight.

I call the above system "winner-take-all" because whether the top 2 weights
are 100 and 0 or 49 and 51, the winner gets the request 100% of the time.

I propose renaming the existing HostWeightHandler to OrderedHostWeightHandler
and adding a new weight handler to the filter scheduler called
StochasticHostWeightHandler. The OrderedHostWeightHandler would continue to
be the default and would not change behavior. The new weight handler would
implement different behavior, as follows:

In step 3 above, rather than simply selecting the highest weight, the weight
handler would sum the weights of all choices to form a range, assign each pool
a sub-interval of that range with a size equal to that pool's weight, then
generate a random number across the whole range and choose the pool whose
sub-interval contains it (a small sketch of this selection step appears at the
end of this section).

An easier way to visualize the above algorithm is to imagine a raffle drawing.
Each pool is given a number of raffle tickets equal to the pool's weight
(assume weights normalized from 0-100). The winning pool is chosen by a raffle
drawing. Every creation request results in a new raffle being held.

Pools with higher weights get more raffle tickets and thus have a higher
chance to win, but any pool with a weight higher than 0 has some chance to
win.

The advantage of the above algorithm is that it distinguishes between weights
that are close (49 and 51) and weights that are far apart (0 and 100), so just
because one pool is slightly better than another pool, it doesn't always win.
Also, it can give different results within a 60 second window of time when the
inputs to the weigher aren't updated, significantly decreasing the pain of
slow share stats updates.

It should be pointed out that this algorithm not only requires that weights
are properly normalized (the current goodness weigher also requires this) but
also that the weights be roughly linear across the range of possible values.
Any deviation from linear "goodness" can result in bad decisions being made,
due to the randomness inherent in this approach.
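
For illustration only, here is a minimal sketch of the selection step
described above. It is not the proposed implementation: ``stochastic_pick``
is an invented helper, the pool data is made up, and the sketch assumes the
weights passed in are already normalized and non-negative::

    import random

    def stochastic_pick(weighed_pools, rng=random.random):
        """Pick a pool at random, with probability proportional to weight.

        weighed_pools is a list of (pool, weight) tuples with non-negative
        weights.
        """
        total = sum(weight for _, weight in weighed_pools)
        if total <= 0:
            # Degenerate case: no pool has a positive weight. Fall back to
            # the first pool; the real handler would need to define this.
            return weighed_pools[0][0]
        draw = rng() * total
        cursor = 0.0
        for pool, weight in weighed_pools:
            cursor += weight
            if draw < cursor:
                return pool
        return weighed_pools[-1][0]  # Guard against floating point round-off.

    # A pool weighted 51 wins only slightly more often than one weighted 49,
    # instead of winning 100% of the time.
    pools = [("pool-a", 51.0), ("pool-b", 49.0)]
    wins = sum(stochastic_pick(pools) == "pool-a" for _ in range(10000))
    print(wins)  # roughly 5100

Because every creation request triggers a fresh draw, two back-to-back
requests can land on different pools even when the underlying stats have not
changed since the last update.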


Alternatives
------------

There aren't many good options to deal with the problem of bursty requests
relative to the update frequency of share stats. You can update stats more
frequently, but there's a limit: the extreme case is to have the scheduler
synchronously request the absolute latest share stats from every backend for
every request, and clearly that approach won't scale.

To deal with the heterogeneous backends problem, we have the goodness
function, but it's challenging to pick a goodness function that yields
acceptable results across all types of variation in backends. This proposal
keeps the goodness function and builds upon it to make it both stronger and
more tolerant of imperfection.


Data model impact
-----------------

No database changes.


REST API impact
---------------

No REST API changes.


Security impact
---------------

No security impact.


Notifications impact
--------------------

No notification impact.


Other end user impact
---------------------

End users may indirectly experience better (or conceivably worse) scheduling
choices made by the modified scheduler.


Performance Impact
------------------

No performance impact. In fact, this approach is proposed expressly because
alternative solutions would have a performance impact and I want to avoid
that.


Other deployer impact
---------------------

I propose a single new weight handler class for the scheduler. The default
would continue to be the existing ordered weight handler. An administrator
would need to intentionally modify the weight handler config option to
observe the changed behavior.


Developer impact
----------------

Developers wouldn't be directly impacted, but anyone working on goodness
functions or other weighers would need to be aware of the linearity
requirement for getting good behavior out of this new scheduler mode.

In order to avoid accidentally feeding nonlinear goodness values into the
stochastic weighing scheduler, we may want to create alternatively-named
versions of the various weights or weighers, forcing driver authors to
explicitly opt in to the new scheme and thus indicate that the weights
they're returning are suitably linear.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  bswartz

Work Items
----------

This should be doable in a single patch.


Dependencies
============

* Filter scheduler (manila)
* Goodness weigher (manila)


Testing
=======

Testing this feature will require a multibackend configuration (otherwise
scheduling is just a no-op).

Because randomness is inherently required for the correctness of the
algorithm, it will be challenging to write automated functional test cases
without subverting the random number generation. I propose that we rely on
unit tests to ensure correctness, because it's easy to "fake" random numbers
in unit tests.


Documentation Impact
====================

Dev docs need to be updated to explain to driver authors what the
expectations are for goodness functions.

The config reference needs to explain to deployers what the new config option
does.


References
==========

This spec is a copy of an idea accepted by the Cinder community:
https://github.com/openstack/cinder-specs/blob/master/specs/newton/stochastic-weighing-scheduler.rst