Scheduling

In this part, we will talk about job scheduling in the OS. We will cover different scheduling algorithms and real-life schedulers. Prepare for the journey of scheduling!

Scheduling

Learning Materials

Videos:

Computer Architecture Primer

Lecture Slides:

Scheduling

Advanced scheduling

Readings:

[OSTEP] Scheduling

[OSTEP] The multi-level feedback queue

[OSTEP] Lottery scheduling and Linux CFS

[OSTEP] Multiprocessor scheduling

A Decade of Wasted Cores (Paper)

Architecture Overview

Basic Processor:

Multicore:

Modern Multiprocessor:

sys/161:

Not multithreaded: each core/processor runs only one hardware thread

Does not distinguish cores from processors; models N-way single-core multiprocessors

Scheduling: Introduction

How to build an optimal scheduler if we only consider turnaround time?

  • FIFO (First In, First Out)

  • SJF (Shortest Job First)

  • STCF (Shortest Time-to-Completion First)

STCF is optimal if we know job lengths and turnaround time is our only metric.

How can we build a scheduler that is sensitive to response time?

  • RR (Round Robin)

Round Robin trades worse turnaround time for better response time.

How to take I/O into account?

The answer is overlap: treat each CPU burst as a job, and run another job while one is blocked on I/O.

How to determine the timeslice?

The timeslice should be short to maximize multitasking, but not too short: every switch incurs a context-switch cost, and with a very short timeslice that overhead dominates the useful work.

context switch:

  • save the registers of the old thread
  • pick the new thread from the runqueue
  • restore the new thread's registers

How do we pick the next thread from the runqueue?

Scheduling Algorithm:

Use multiple queues with different priorities and timeslices.

Higher-priority queues use shorter timeslices.

If a lower-priority thread hasn't run for a long time, it is moved up to a higher-priority queue.

The multi-level feedback queue

MLFQ observes the execution of a job and prioritizes it accordingly.

  • Rule 1: If Priority(A) > Priority(B), A runs (B doesn’t).
  • Rule 2: If Priority(A) = Priority(B), A & B run in Round-Robin fashion using the time slice (quantum length) of the given queue.
  • Rule 3: When a job enters the system, it is placed at the highest priority (the topmost queue).
  • Rule 4: Once a job uses up its time allotment at a given level (regardless of how many times it has given up the CPU), its priority is reduced (i.e., it moves down one queue).
  • Rule 5: After some time period S, move all the jobs in the system to the topmost queue.

MLFQ Problem:

Simulate the Multi-Level Feedback Queue Algorithm discussed in the book. Assume the final version of the algorithm presented in the book, but no tuning. You have the following parameters and constraints:

  1. Time starts at 0ms.
  2. There are three queues: highest-priority, middle-priority and low-priority.
  3. The timeslice is 10ms.
  4. The allotment is 20ms.
  5. S is 100ms.
  6. All jobs use up their entire timeslice every time they are scheduled.
  7. Job J1 arrives at 0ms.
  8. Job J2 arrives at 20ms.
  9. J1 requires 60ms of CPU time to complete.
  10. J2 requires 70ms of CPU time to complete.

(Figure: execution timeline; bright color: J1, light color: J2)

Scheduling: Proportional Share

Guarantee that each job obtains a certain percentage of CPU time.

Use tickets to represent a process's share of the CPU


Lottery Scheduling Decision Code

// counter: used to track if we've found the winner yet
int counter = 0;

// winner: call some random number generator to
// get a value >= 0 and <= (totaltickets - 1)
int winner = getrandom(0, totaltickets);

// current: use this to walk through the list of jobs
node_t *current = head;
while (current) {
    counter = counter + current->tickets;
    if (counter > winner)
        break; // found the winner
    current = current->next;
}
// 'current' is the winner: schedule it...

Stride Scheduling

Deterministic fair-share scheduler

A process's stride is inversely proportional to its number of tickets: it is computed by dividing some large number by the number of tickets assigned to that process.

Every time a process runs, we will increment a counter for it (called its pass value) by its stride to track its global progress. At any given time, the scheduler picks the process to run that has the lowest pass value so far.

curr = remove_min(queue); // pick client with min pass
schedule(curr); // run for quantum
curr->pass += curr->stride; // update pass using stride
insert(queue, curr); // return curr to queue

Lottery scheduling handles new processes more gracefully than stride scheduling: a new process simply receives its tickets, whereas stride scheduling must pick a pass value for it (setting it to 0 would let the new process monopolize the CPU).


Linux Completely Fair Scheduler (CFS)

As each process runs, it accumulates vruntime in proportion to real (physical) time.

When a scheduling decision occurs, CFS will pick the process with the lowest vruntime to run next.

CFS uses sched_latency to determine how long one process should run before considering a switch, determining its time slice but in a dynamic fashion.

CFS will never set the time slice to less than min_granularity.

  • Weighting (Niceness)

CFS gives each process a nice value indicating its priority.

CFS maps the nice value of each process to a weight:

static const int prio_to_weight[40] = {
    /* -20 */ 88761, 71755, 56483, 46273, 36291,
    /* -15 */ 29154, 23254, 18705, 14949, 11916,
    /* -10 */  9548,  7620,  6100,  4904,  3906,
    /*  -5 */  3121,  2501,  1991,  1586,  1277,
    /*   0 */  1024,   820,   655,   526,   423,
    /*   5 */   335,   272,   215,   172,   137,
    /*  10 */   110,    87,    70,    56,    45,
    /*  15 */    36,    29,    23,    18,    15,
};

Then calculate each process's time slice:

time_slice_k = (weight_k / sum of all weights) * sched_latency

Change the way CFS calculates vruntime:

vruntime_i = vruntime_i + (weight_0 / weight_i) * runtime_i

That is to say, a higher-priority process gets the CPU for longer stretches, but its vruntime increases more slowly.

  • CFS Red Black Tree

CFS uses a red-black tree to keep the runnable processes ordered by their vruntime.

CFS handles sleeping processes by setting the vruntime of a job to the minimum value found in the tree when it wakes up.


Problems of fair-share schedulers

  • Do not particularly mesh well with I/O.

  • leave open the hard problem of ticket or priority assignment (How to assign tickets? How to set nice values?)

Multiprocessor scheduling

How should the OS schedule jobs on multiple CPUs?

Cache Affinity

A process, when run on a particular CPU, builds up a fair bit of state in the caches (and TLBs) of the CPU. Therefore, a multiprocessor scheduler should consider keeping a process on the same CPU if at all possible.


Single-Queue Scheduling

It does not scale well (due to synchronization overheads), and it does not readily preserve cache affinity.


Multi-Queue Scheduling

How to deal with load imbalance?

work stealing: A (source) queue that is low on jobs will occasionally peek at another (target) queue, to see how full it is. If the target queue is (notably) more full than the source queue, the source will “steal” one or more jobs from the target to help balance load.

Advanced Scheduling

Avoid scheduling memory-intensive threads on the same shared cache to prevent cache contention.

Why did performance suffer on a NUMA system?

We didn't take into account where the data lives in memory, resulting in remote-access latency.

Our algorithm:

Continuously measure the traffic (memory requests) of each thread.

Why place threads on a random node?

To balance load across the nodes and memory controllers.

Linux’s Completely Fair Scheduling (CFS)

Bugs in the Linux Scheduler: Cores may stay idle for seconds while ready threads are waiting in runqueues.

  • On a single-CPU system, CFS is very simple

The scheduler defines a fixed time interval (sched_latency) during which each thread in the system must run at least once. The interval is divided among the threads proportionally to their weights; the resulting per-thread interval is what we call the timeslice.

Threads are organized in a runqueue, implemented as a red-black tree.

  • On multi-core systems, CFS becomes quite complex

Periodically run a load-balancing algorithm that will keep the queues roughly balanced.

The scheduler also invokes “emergency” load balancing when a core becomes idle.

  • The (Hierarchical) load balancing algorithm

CFS balances runqueues based on a metric called load, which is the combination of the thread’s weight and its average CPU utilization. If a thread does not use much of a CPU, its load will be decreased accordingly.

load = weight * (cpuUse)^2

Function running on each cpu cur_cpu:

/* Iterate over each scheduling domain of cur_cpu. */
for all sd in sched domains of cur_cpu do

    /* At each level, either the first idle core of the scheduling domain
       or the first core of the scheduling domain is responsible for
       balancing the load. */
    if sd has idle cores then
        first_cpu = 1st idle CPU of sd
    else
        first_cpu = 1st CPU of sd
    end if
    if cur_cpu != first_cpu then
        continue
    end if

    /* Compute the average load of each scheduling group of the
       scheduling domain and pick the busiest group. */
    for all sched groups sg in sd do
        sg.load = average load of CPUs in sg
    end for
    busiest = overloaded sg with the highest load
        (or, if nonexistent) imbalanced sg with the highest load
        (or, if nonexistent) sg with the highest load

    /* If the busiest group's load is no higher than the local group's,
       the load is considered balanced at this level. */
    local = sg containing cur_cpu
    if busiest.load <= local.load then
        continue
    end if

    /* Otherwise, balance the load between the local CPU and the
       busiest CPU of the group. */
    busiest_cpu = pick busiest CPU of busiest
    try to balance load between busiest_cpu and cur_cpu
    if load cannot be balanced due to tasksets then
        exclude busiest_cpu and retry from the busiest-CPU selection step
    end if

end for
  • Special case

When a thread wakes up after sleeping or waiting for a resource (e.g., locks, I/O), the scheduler tries to place it on the idlest core.

The exception is when the thread is awoken by another thread (the waker): in that case the scheduler favors cores sharing a cache with the waker thread, to improve cache reuse.


Scheduling
http://oooscar8.github.io/2024/10/10/Scheduling/
Author: Alex Sun
Published: October 10, 2024
License