Honing my craft

CS-537 Introduction to Operating Systems - Lecture 2

August 03, 2020


Link: http://pages.cs.wisc.edu/~remzi/Classes/537/Spring2018/Discussion/videos.html


Scheduling takes its origins from operations research and concerns itself with making sure that resources are utilized most effectively given a set of metrics. The most obvious metric is time to completion, aka how long does it take to finish a set of work, but we’ll introduce other metrics and strategies for scheduling. In our case, we’re asking the question of “how do we utilize the CPU most effectively to share resources across many jobs of various lengths and CPU use needs?”

The textbook makes some very clear but incorrect assumptions about workload, the processes in the system, in order to simplify our starting point. We then remove those assumptions in order to understand the limitations of a proposed solution.

We start with the following assumptions about our jobs:

  • Each job runs for the same amount of time
  • All jobs arrive at the same time
  • Once started, all jobs run to completion
  • All jobs only use CPU (no disk or network IO)
  • The runtime of each job is known ahead of time

We also need to define scheduling metrics, a measurement of SOMETHING in our system. The book starts with turnaround time, which is defined at the time a job completes minute the time a job arrived in the system. That is, the turnaround time of me at McDonalds is the time I get my food minus the time I walk in the door.

Since we are currently assuming all jobs arrive at the same time, we can say that turnaround time = completion time, and arrival time = 0.

Turnaround time is a performance metric, but we could also measure fairness. Something that is high performance may not be very fair, and vice versa.

First In, First Out (FIFO)

The most basic strategy is FIFO, implementing a queue structure. FIFO works for our assumptions, particularly when we say that all jobs are CPU-only, and arrive at the same time.

The book’s example takes 3 jobs that need 10 seconds a piece, called A, B and C. In a FIFO strategy, we do job A and complete it at time = 10, job B and complete at time = 20, then job C and complete at time = 30. Thus making the average turnaround time 20 seconds (10 + 20 + 30 / 3).

We now relax assumption 1 - not every job runs for the same amount of time. What happens when A needs 100 seconds, then B and C need 10 seconds each? We complete A at time = 100, B at time = 110 and C at time = 120. Thus making the average turnaround time 110 seconds - much worse as a result of simply being FIFO.

This effect of low-resource-needs jobs getting stuck behind a high-resource-needs job is called the convoy effect.

Shortest Job First (SJF)

Given the one issue we have now identified, perhaps there’s an obvious solution - we always take the shortest job first, leaving the resource-intensive ones for last. Assuming our last case (What happens when A needs 100 seconds, then B and C need 10 seconds each?) - we now finish B at time = 10, C at time = 20, then A at time = 120, giving us an average turnaround time of 50 seconds for the same set of jobs.

We now relax assumption number 2 - not every job arrives at the same time, and in fact they arrive at various times. What happens now?

Well, what if job A arrives before job B and C? Specifically, what if B and C arrive at time = 10?

In that case, our time looks like (100 + (110-10) + (120 - 10)) / 3 = ~103 seconds. Not good!

Shortest Time-to-Completion First (STCF)

We can relax another assumption - we no longer need jobs to run to completion. We’ve discussed in virtualization the idea of timer interrupts and context switching - we can borrow that same concept here. We will introduce preemptive scheduling, where STCF is the preemptive version of the non-preemtive SJF. We also call STCF Preemptive Shortest Job First (PSJF).

Our STCF scheduler, whenever a new job enters the system, will look to see which of the jobs needs the LEAST amount of time left and run that first. So, given our above job where A needs 100 seconds and arrives at time=0, B and C need 10 seconds and arrive at time = 10:

The scheduler will run A for 10 seconds, then see that B and C have arrived. It will switch to B, finish B at time = 20, then C at time = 30, the finish the remaining 90 seconds of A at time = 120 - thus giving us our turnaround time of (120 - 0) + (20 - 10) + (30 -10) / 3 of 50 seconds. Less than half the time of SJF just by introducing preemptive scheduling.

Response Time

Given our current assumptions we are happy. However, we’re now going to formalize a new metric called response time - defined as the time a job arrives in a system to the time it’s scheduled. Metaphorically, the time I walk in the door at McDonalds vs the time my order is taken. So the response time for our STCF case ( A needs 100 seconds and arrives at time=0, B and C need 10 seconds and arrive at time = 10) is 0 for A, 0 for B, and 10 for C for an average of 3.33 seconds of response time.

Since all our previous strategies, including STCF, try to run short jobs to completion, they are not necessarily good for response time. If I’m the person that submitted job C, I’d be waiting 10 seconds for B to complete prior to a response from the system. So, to improve response time, we introduce new strategies.

Round Robin

If we want to improve response time, we need a strategy to…respond faster. Round robin aims to handle that. Instead of running jobs to completion, we have a time slice (or a scheduling quantum), a unit of time dedicated to working on one job at which point we switch to the next time. We switch jobs every slice until all jobs are done. We can also call this strategy time slicing. The length of a time slice must be a multiple of the timer interrupt period, that is if the timer interrupt is every 10 ms, then a time slice cannot be 25ms, but can be 20 or 30 ms.

Imagine a system where jobs A, B and C all arrive at time = 0 and need 5 seconds to run. In our SJF or SJTF model, we run A at time = 0, B at time = 5, and C at time = 10 for an average response time of 5 seconds.

In round robin, with a time slice length of 1, we run A at time = 0, switch to B at time = 1, then switch to C at time = 2 for an average response time of 1 second - a greatly improved number.

This switching between jobs has a cost, including loading and saving register state, CPU caches being blown away, branch predictors, etc all flushing. As a result, we may want to limit our switching frequency in order to amortize the switching cost and pay it off over a longer period of time.

So if round robin is great for response time, how does it do with average turnaround time?

The turnaround time of our SJF model would be (5 + 10 + 15) / 3 = 10 seconds. However, in round robin we don’t finish A until 13 seconds in, B until 14 and C until time = 15 seconds for an average turnaround time of 14 seconds.

The book notes that any fair policy (that is, spreading CPU resources evenly across processes over a small unit of time) will do well on response time at a tradeoff of turnaround time.

Incorporating I/O

We now relax assumption 4 and allow processes to perform IO - Chrome making a network request, VS Code reading / writing to disk. When a process does so, it is blocked on waiting for IO completion. The resource (disk, network) needs some number of milliseconds to perform work so the CPU can and should move onto another task in the mean time.

When a block, an interrupt is raised and the OS can move the job from blocked to the ready state.

Let’s use two jobs as example. Jobs A and B need 50 ms of CPU. Job A runs for 10ms, then switches to IO and blocks (assume IO takes 10 ms) while B only does CPU work for the whole 50 ms.

In a naive, SJF world, A would run for 10 ms, block for 10 ms, run for 10 ms, block, run, block, run again for 50ms of CPU and 40ms of IO and B won’t start until 90ms have passed and won’t finish until time = 140.

A better system would treat each 10ms sub-job of A as an independent job. So then the system sees a 10ms job of A, and a 50ms job of B. So what does it do?

Schedule A for 10ms, then while that does IO the scheduler works on B for 10ms. Then the next sub-job of A comes in, so the scheduler switches back and works on A again for 10ms - this switching back and forth means that A will complete at time = 90 again, but B will have already done 40ms of work in the gaps so B will complete at time = 100.

Thus, the CPU can treat every discrete “burst” of CPU-utilization as its own job and when they interact with other resources, the CPU can switch over to another CPU-intensive job.

Multi-Level Feedback Queue

We are now giving up our last assumption - we no longer know ahead of time how long a job will take to run. We introduce a new approach to scheduling, called the Multi-Level Feedback Queue which tries to optimize turnaround time while also minimizing response time.

The book uses an implementation of MLFQ that has a number of distinct queues, each of which are assigned a different priority level. The scheduler will choose to run jobs with highest priority first, then will move down the priority list. Formally:

  • Rule 1: If Priority(A) > Priority(B), A runs and B does not
  • Rule 2: If Priority(A) == Priority(B), A and B run in round-robin

How do we set priority? Well, it depends on the observed behavior of the work. If a job tends to often relinquish CPU in order to do something else, the MLFQ will identify that as a high priority job as it seems like an interactive, user-blocking task. If something tends to run its entire time-slice on CPU, the scheduler will slowly decrease its priority as that works akin to long batch jobs. Thus, we use previous behavior to predict future behavior.

Are there some flaws? Yes! Imagine we have three priority queues (high, medium, low) - with two high priority jobs A and B, a medium priority job C and a low priority job D. A and B will run in round-robin, starving our C and D jobs of resources. Thus, we need strategies to change priority.

Attempt 1

Since our workload is a blend of CPU intensive and interactive jobs, we need to identify a way to adjust priorities. Here’s one try

  • Rule 3: When a job enters the system, it’s placed at the highest priority
  • Rule 4a: If a job uses its entire time slice while running, we move it down a priority level
  • Rule 4b: If a job gives up CPU before time slice is up, it stays at the same priority.

Imagine for example a very basic job. Job A takes 200ms. We run it at high priority (what we call Q2, or queue 2) for 10ms, then move it down to a lower priority (Q1) for 10ms, then down to Q0 for the remainder.


Let’s make a change - we introduce job B, a short interactive job. B arrives at time = 100 in Q2 and runs for 20 ms. Now we, at time = 100, run B for 10 ms from, move it from Q2 to Q1, run it for 10ms again, then we go back to Q0 to finish our low priority job A.

In other words, we approximate SJF by assuming ANY new job could be short, high priority work and adjust behavior according to how it operates in the real world. If it is short, we complete it soon. If it’s not, we round-robin with other lower priority work.

What if B runs for 1ms and then does I/O for 1ms?

Well then at time = 100, we’ve moved A to Q0 (lowest priority) and now have B. We work on B for 1ms, then while that’s blocked we look for the next job in Q2, Nothing - go to Q1 - nothing. So while B is blocked, we work on low priority A and keep B in highest priority since it gave up CPU before its time slice. We then alternative back and forth between high priority B and low priority A until B completes.


What if we have a lot of interactive tasks? Our low-priority A will be starved of resources since high priority tasks will always be running.

What if someone is malicious and wants to game the system? What if someone wrote a process B that always gave up CPU right before its time slice ended? The scheduler would always treat that as high priority work, and it would never demote priority thus allowing a user to stay in high priority even those, in reality, it’s CPU work. Imagine if they just did a file read for no reason to force disk I/O at 9ms in. Not good!

Finally, what if a process just changes behavior over time? How can something move from low priority to high? If it starts very CPU bound but eventually becomes interactive (a test runner with suggested fixes?) it will get demoted to low and then its I/O phase will stay on the lowest priority.

Attempt 2

If we know somethnig can get stuck on a low priority, and if we know that CPU-bound jobs can starve or not make much progress, what can we do?

  • Rule 5: After some time period S, move all jobs in the system back to the highest priority queue

We call this boosting priority.

This will take our CPU bound jobs, and if they’re been stuck waiting, will get a little bit of CPU resources. Additionally, if our CPU-bound job became interactive it’ll get a chance to do IO and give up a time slice again. So, hypothetically, every 50ms or so, we flush all queues into the highest priority queue and let tasks settle back down into lowest priority as needed.

Attempt 3

We still allow people to game our system - giving up CPU just before a time slice ends to stay in high priority. Thus we rewrite rules 4a and 4b into

  • Rule 4: once a job uses up its time allotment (regardless of how many times it released the CPU) we move it down a priority.

This new rule change means that even if we give up CPU and switch to I/O, we’ll eventually run out our time slice and be forced down a priority queue.

Tuning MLFQ

The book makes points - how do we parameterize our scheduler? Number of queues? Length of time slice? Length of time period before we boost priority?

The short answer is…trial and error. We want to be able to monitor and benchmark our workloads and test our assumptions accordingly. We can start with some default values and tweak them according to desired turnaround and response time.

Nikhil Thomas

I work and live in Brooklyn, NY building software.