Completely Fair Scheduler (CFS)

Relevant source files

This document covers the Completely Fair Scheduler (CFS) implementation provided by the scheduler crate. The CFS implementation aims to provide fair CPU time allocation among tasks based on virtual runtime calculations and supports priority adjustment through nice values.

For information about the overall scheduler architecture and BaseScheduler trait, see Core Architecture. For comparisons with other scheduler implementations, see Scheduler Comparison.

Purpose and Fairness Model

The CFScheduler implements a simplified version of the Linux Completely Fair Scheduler algorithm. It maintains fairness by tracking virtual runtime for each task and always selecting the task with the lowest virtual runtime for execution. Tasks with higher priority (lower nice values) accumulate virtual runtime more slowly, allowing them to run more frequently.

Key Fairness Principles:

  • Tasks are ordered by virtual runtime in a red-black tree (implemented as BTreeMap)
  • Virtual runtime accumulates based on actual execution time and task weight
  • Higher priority tasks (negative nice values) get lower weights, accumulating vruntime slower
  • The scheduler always picks the leftmost task (lowest virtual runtime) from the tree

Sources: src/cfs.rs(L100 - L122) 

Core Data Structures

CFSTask Wrapper

The CFSTask<T> struct wraps user tasks with CFS-specific metadata:

FieldTypePurpose
innerTThe wrapped user task
init_vruntimeAtomicIsizeInitial virtual runtime baseline
deltaAtomicIsizeAccumulated execution time delta
niceAtomicIsizePriority value (-20 to 19 range)
idAtomicIsizeUnique task identifier
flowchart TD
CFSTask["CFSTask<T>"]
inner["inner: T"]
init_vruntime["init_vruntime: AtomicIsize"]
delta["delta: AtomicIsize"]
nice["nice: AtomicIsize"]
id["id: AtomicIsize"]
vruntime_calc["Virtual Runtime Calculation"]
weight_lookup["Nice-to-Weight Lookup"]
scheduling_decision["Scheduling Decision"]

CFSTask --> delta
CFSTask --> id
CFSTask --> init_vruntime
CFSTask --> inner
CFSTask --> nice
delta --> vruntime_calc
init_vruntime --> vruntime_calc
nice --> weight_lookup
vruntime_calc --> scheduling_decision
weight_lookup --> vruntime_calc

Sources: src/cfs.rs(L7 - L14)  src/cfs.rs(L31 - L41) 

CFScheduler Structure

The main scheduler maintains a BTreeMap as the ready queue and tracks global minimum virtual runtime:

flowchart TD
CFScheduler["CFScheduler<T>"]
ready_queue["ready_queue: BTreeMap<(isize, isize), Arc<CFSTask<T>>>"]
min_vruntime["min_vruntime: Option<AtomicIsize>"]
id_pool["id_pool: AtomicIsize"]
key_structure["Key: (vruntime, taskid)"]
value_structure["Value: Arc<CFSTask<T>>"]
ordering["BTreeMap Natural Ordering"]
leftmost["Leftmost = Lowest vruntime"]
pick_next["pick_next_task()"]

CFScheduler --> id_pool
CFScheduler --> min_vruntime
CFScheduler --> ready_queue
key_structure --> ordering
leftmost --> pick_next
ordering --> leftmost
ready_queue --> key_structure
ready_queue --> value_structure

Sources: src/cfs.rs(L103 - L107)  src/cfs.rs(L109 - L122) 

Virtual Runtime Calculation

The virtual runtime algorithm ensures fairness by adjusting how quickly tasks accumulate runtime based on their priority:

Nice-to-Weight Conversion

CFS uses Linux-compatible lookup tables to convert nice values to weights:

flowchart TD
nice_value["Nice Value (-20 to 19)"]
weight_check["nice >= 0?"]
pos_table["NICE2WEIGHT_POS[nice]"]
neg_table["NICE2WEIGHT_NEG[-nice]"]
weight_result["Task Weight"]
vruntime_formula["vruntime = init + delta * 1024 / weight"]

neg_table --> weight_result
nice_value --> weight_check
pos_table --> weight_result
weight_check --> neg_table
weight_check --> pos_table
weight_result --> vruntime_formula

The weight tables are defined as constants with nice value 0 corresponding to weight 1024:

  • Positive nice values (0-19): Lower priority, higher weights
  • Negative nice values (-20 to -1): Higher priority, lower weights

Sources: src/cfs.rs(L16 - L29)  src/cfs.rs(L43 - L50) 

Runtime Accumulation

flowchart TD
task_tick["task_tick()"]
delta_increment["delta.fetch_add(1)"]
get_vruntime["get_vruntime()"]
nice_check["nice == 0?"]
simple_calc["init_vruntime + delta"]
weighted_calc["init_vruntime + delta * 1024 / weight"]
final_vruntime["Virtual Runtime"]
btreemap_key["BTreeMap Key: (vruntime, id)"]

final_vruntime --> btreemap_key
get_vruntime --> nice_check
nice_check --> simple_calc
nice_check --> weighted_calc
simple_calc --> final_vruntime
task_tick --> delta_increment
weighted_calc --> final_vruntime

Sources: src/cfs.rs(L56 - L63)  src/cfs.rs(L83 - L85) 

Scheduling Algorithm Flow

Task Addition and Removal

sequenceDiagram
    participant Kernel as Kernel
    participant CFScheduler as CFScheduler
    participant BTreeMap as BTreeMap
    participant CFSTask as CFSTask

    Kernel ->> CFScheduler: add_task(task)
    CFScheduler ->> CFScheduler: assign_id()
    CFScheduler ->> CFSTask: set_vruntime(min_vruntime)
    CFScheduler ->> CFSTask: set_id(taskid)
    CFScheduler ->> BTreeMap: insert((vruntime, taskid), task)
    CFScheduler ->> CFScheduler: update_min_vruntime()
    Note over CFScheduler,BTreeMap: Task ordered by virtual runtime
    Kernel ->> CFScheduler: pick_next_task()
    CFScheduler ->> BTreeMap: pop_first()
    BTreeMap -->> CFScheduler: leftmost task
    CFScheduler -->> Kernel: task with lowest vruntime

Sources: src/cfs.rs(L129 - L143)  src/cfs.rs(L161 - L167) 

Preemption Decision

The CFS scheduler makes preemption decisions in task_tick() by comparing the current task's virtual runtime with the minimum virtual runtime in the ready queue:

flowchart TD
task_tick_call["task_tick(current)"]
increment_delta["current.task_tick()"]
check_empty["ready_queue.is_empty()?"]
no_preempt["return false"]
compare_vruntime["current.vruntime > min_vruntime?"]
should_preempt["return true"]
continue_running["return false"]

check_empty --> compare_vruntime
check_empty --> no_preempt
compare_vruntime --> continue_running
compare_vruntime --> should_preempt
increment_delta --> check_empty
task_tick_call --> increment_delta

Sources: src/cfs.rs(L176 - L183) 

Priority Management

CFS supports dynamic priority changes through the nice value interface:

flowchart TD
set_priority["set_priority(task, nice)"]
validate_range["nice in [-20, 19]?"]
return_false["return false"]
save_current["current_vruntime = task.get_vruntime()"]
update_init["init_vruntime = current_vruntime"]
reset_delta["delta = 0"]
set_nice["nice = new_nice"]
return_true["return true"]

reset_delta --> set_nice
save_current --> update_init
set_nice --> return_true
set_priority --> validate_range
update_init --> reset_delta
validate_range --> return_false
validate_range --> save_current

This implementation preserves the task's current virtual runtime position while applying the new priority for future runtime accumulation.

Sources: src/cfs.rs(L69 - L77)  src/cfs.rs(L185 - L192) 

BaseScheduler Trait Integration

The CFScheduler implements the BaseScheduler trait with the following type and method mappings:

BaseScheduler MethodCFS ImplementationBehavior
SchedItemArc<CFSTask>Reference-counted task wrapper
add_task()Insert into BTreeMapAssigns ID and initial vruntime
remove_task()Remove from BTreeMapUpdates min_vruntime tracking
pick_next_task()pop_first()Selects leftmost (lowest vruntime)
put_prev_task()Re-insert with new IDMaintains vruntime ordering
task_tick()Compare with min_vruntimePreemption based on fairness
set_priority()Update nice valueRange validation and vruntime preservation

Sources: src/cfs.rs(L124 - L193) 

Thread Safety and Atomics

All CFS task metadata uses atomic operations for thread-safe access:

flowchart TD
CFSTask_atomics["CFSTask Atomic Fields"]
init_vruntime_atomic["init_vruntime: AtomicIsize"]
delta_atomic["delta: AtomicIsize"]
nice_atomic["nice: AtomicIsize"]
id_atomic["id: AtomicIsize"]
ordering_acquire["Ordering::Acquire/Release"]
memory_safety["Memory-safe concurrent access"]

CFSTask_atomics --> delta_atomic
CFSTask_atomics --> id_atomic
CFSTask_atomics --> init_vruntime_atomic
CFSTask_atomics --> nice_atomic
delta_atomic --> ordering_acquire
id_atomic --> ordering_acquire
init_vruntime_atomic --> ordering_acquire
nice_atomic --> ordering_acquire
ordering_acquire --> memory_safety

The scheduler itself is not thread-safe and requires external synchronization, but individual task metadata can be safely accessed concurrently.

Sources: src/cfs.rs(L10 - L13)  src/cfs.rs(L44)  src/cfs.rs(L57)  src/cfs.rs(L66)  src/cfs.rs(L76)  src/cfs.rs(L84)