📊
linux-insides
  • README
  • Summary
    • Booting
      • From bootloader to kernel
      • First steps in the kernel setup code
      • Video mode initialization and transition to protected mode
      • Transition to 64-bit mode
      • Kernel decompression
      • Kernel load address randomization
    • Initialization
      • First steps in the kernel
      • Early interrupts handler
      • Last preparations before the kernel entry point
      • Kernel entry point
      • Continue architecture-specific boot-time initializations
      • Architecture-specific initializations, again...
      • End of the architecture-specific initializations, almost...
      • Scheduler initialization
      • RCU initialization
      • End of initialization
    • Interrupts
      • Introduction
      • Start to dive into interrupts
      • Interrupt handlers
      • Initialization of non-early interrupt gates
      • Implementation of some exception handlers
      • Handling Non-Maskable interrupts
      • Dive into external hardware interrupts
      • Initialization of external hardware interrupts structures
      • Softirq, Tasklets and Workqueues
      • Last part
    • System calls
      • Introduction to system calls
      • How the Linux kernel handles a system call
      • vsyscall and vDSO
      • How the Linux kernel runs a program
      • Implementation of the open system call
      • Limits on resources in Linux
    • Timers and time management
      • Introduction
      • Clocksource framework
      • The tick broadcast framework and dyntick
      • Introduction to timers
      • Clockevents framework
      • x86 related clock sources
      • Time related system calls
    • Synchronization primitives
      • Introduction to spinlocks
      • Queued spinlocks
      • Semaphores
      • Mutex
      • Reader/Writer semaphores
      • SeqLock
      • RCU
      • Lockdep
    • Memory management
      • Memblock
      • Fixmaps and ioremap
      • kmemcheck
    • Cgroups
      • Introduction to Control Groups
    • SMP
    • Concepts
      • Per-CPU variables
      • Cpumasks
      • The initcall mechanism
      • Notification Chains
    • Data Structures in the Linux Kernel
      • Doubly linked list
      • Radix tree
      • Bit arrays
    • Theory
      • Paging
      • Elf64
      • Inline assembly
      • CPUID
      • MSR
    • Initial ram disk
    • Misc
      • Linux kernel development
      • How the kernel is compiled
      • Linkers
      • Program startup process in userspace
      • Write and Submit your first Linux kernel Patch
      • Data types in the kernel
    • KernelStructures
      • IDT
    • Useful links
    • Contributors
Powered by GitBook
On this page
  • Semaphores
  • Introduction to the semaphores in the Linux kernel
  • Semaphore API
  • Conclusion
  • Links

Was this helpful?

  1. Summary
  2. Synchronization primitives

Semaphores

PreviousQueued spinlocksNextMutex

Last updated 2 years ago

Was this helpful?

Semaphores

This is the third part of the which describes synchronization primitives in the Linux kernel and in the previous part we saw special type of - queued spinlocks. The previous was the last part which describes spinlocks related stuff. So we need to go ahead.

The next after spinlock which we will see in this part is . We will start from theoretical side and will learn what is it semaphore and only after this, we will see how it is implemented in the Linux kernel as we did in the previous part.

So, let's start.

Introduction to the semaphores in the Linux kernel

So, what is it semaphore? As you may guess - semaphore is yet another mechanism for support of thread or process synchronization. The Linux kernel already provides implementation of one synchronization mechanism - spinlocks, why do we need in yet another one? To answer on this question we need to know details of both of these mechanisms. We already familiar with the spinlocks, so let's start from this mechanism.

spinlock creates a lock which will be acquired to protect a shared resource from being modified by more than one process. As a result, other processes that try to acquire the current lock get stopped (aka "spin-in-place" or busy waiting). is not allowed because is disabled to avoid . As a result, spinlock should only be used if the lock will only be acquired for a very short period of time, otherwise amount of busy waiting accumulated by other processes results in extremely inefficient operation. For locks that need to be acquired for a relatively long period of time, we turn to semaphore.

is a good solution for locks which may be acquired for a long time. In other way this mechanism is not optimal for locks that acquired for a short time. To understand this, we need to know what is semaphore.

As usual synchronization primitive, a semaphore is based on a variable. This variable may be incremented or decremented and it's state will represent ability to acquire lock. Notice that value of the variable is not limited to 0 and 1. There are two types of semaphores:

  • binary semaphore;

  • normal semaphore.

In the first case, value of semaphore may be only 1 or 0. In the second case value of semaphore any non-negative number. If the value of semaphore is greater than 1 it is called as counting semaphore and it allows to acquire a lock to more than 1 process. This allows us to keep records of available resources, when spinlock allows to hold a lock only on one task. Besides all of this, one more important thing that semaphore allows to sleep. Moreover when processes waits for a lock which is acquired by other process, the may switch on another process.

Semaphore API

So, we know a little about semaphores from theoretical side, let's look on its implementation in the Linux kernel. All semaphore is located in the header file.

We may see that the semaphore mechanism is represented by the following structure:

struct semaphore {
	raw_spinlock_t		lock;
	unsigned int		count;
	struct list_head	wait_list;
};

in the Linux kernel. The semaphore structure consists of three fields:

  • lock - spinlock for a semaphore data protection;

  • count - amount available resources;

  • wait_list - list of processes which are waiting to acquire a lock.

  • statically;

  • dynamically.

ways. Let's look at the first approach. We are able to initialize a semaphore statically with the DEFINE_SEMAPHORE macro:

#define DEFINE_SEMAPHORE(name)  \
         struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)

as we may see, the DEFINE_SEMAPHORE macro provides ability to initialize only binary semaphore. The DEFINE_SEMAPHORE macro expands to the definition of the semaphore structure which is initialized with the __SEMAPHORE_INITIALIZER macro. Let's look at the implementation of this macro:

#define __SEMAPHORE_INITIALIZER(name, n)              \
{                                                                       \
        .lock           = __RAW_SPIN_LOCK_UNLOCKED((name).lock),        \
        .count          = n,                                            \
        .wait_list      = LIST_HEAD_INIT((name).wait_list),             \
}
#define __ARCH_SPIN_LOCK_UNLOCKED       { { 0 } }
static inline void sema_init(struct semaphore *sem, int val)
{
       static struct lock_class_key __key;
       *sem = (struct semaphore) __SEMAPHORE_INITIALIZER(*sem, val);
       lockdep_init_map(&sem->lock.dep_map, "semaphore->lock", &__key, 0);
}
void down(struct semaphore *sem);
void up(struct semaphore *sem);
int  down_interruptible(struct semaphore *sem);
int  down_killable(struct semaphore *sem);
int  down_trylock(struct semaphore *sem);
int  down_timeout(struct semaphore *sem, long jiffies);

The down_killable function does the same as the down_interruptible function, but set the TASK_KILLABLE flag for the current process. This means that the waiting process may be interrupted by the kill signal.

void down(struct semaphore *sem)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&sem->lock, flags);
        if (likely(sem->count > 0))
                sem->count--;
        else
                __down(sem);
        raw_spin_unlock_irqrestore(&sem->lock, flags);
}
EXPORT_SYMBOL(down);

As you already may guess, the main work is done between the raw_spin_lock_irqsave and raw_spin_unlock_irqrestore macros in the down function. We compare the value of the semaphore counter with zero and if it is bigger than zero, we may decrement this counter. This means that we already acquired the lock. In other way counter is zero. This means that all available resources already finished and we need to wait to acquire this lock. As we may see, the __down function will be called in this case.

static noinline void __sched __down(struct semaphore *sem)
{
        __down_common(sem, TASK_UNINTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}

The __down function just calls the __down_common function with three parameters:

  • semaphore;

  • flag - for the task;

  • timeout - maximum timeout to wait semaphore.

Before we will consider implementation of the __down_common function, notice that implementation of the down_trylock, down_timeout and down_killable functions based on the __down_common too:

static noinline int __sched __down_interruptible(struct semaphore *sem)
{
        return __down_common(sem, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
}

The __down_killable:

static noinline int __sched __down_killable(struct semaphore *sem)
{
        return __down_common(sem, TASK_KILLABLE, MAX_SCHEDULE_TIMEOUT);
}

And the __down_timeout:

static noinline int __sched __down_timeout(struct semaphore *sem, long timeout)
{
        return __down_common(sem, TASK_UNINTERRUPTIBLE, timeout);
}
struct task_struct *task = current;
struct semaphore_waiter waiter;
#define current get_current()
DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
        return this_cpu_read_stable(current_task);
}

The second variable is waiter represents an entry of a semaphore.wait_list list:

struct semaphore_waiter {
        struct list_head list;
        struct task_struct *task;
        bool up;
};

Next we add current task to the wait_list and fill waiter fields after definition of these variables:

list_add_tail(&waiter.list, &sem->wait_list);
waiter.task = task;
waiter.up = false;

In the next step we join into the following infinite loop:

for (;;) {
        if (signal_pending_state(state, task))
            goto interrupted;

        if (unlikely(timeout <= 0))
            goto timed_out;

        __set_task_state(task, state);

        raw_spin_unlock_irq(&sem->lock);
        timeout = schedule_timeout(timeout);
        raw_spin_lock_irq(&sem->lock);

        if (waiter.up)
            return 0;
}
static inline int signal_pending_state(long state, struct task_struct *p)
{
         if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
                 return 0;
         if (!signal_pending(p))
                 return 0;

         return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
}
interrupted:
    list_del(&waiter.list);
    return -EINTR;
if (unlikely(timeout <= 0))
    goto timed_out;

we jump at the timed_out label:

timed_out:
    list_del(&waiter.list);
    return -ETIME;

Where we do almost the same that we did in the interrupted label. We delete task from the list of lock waiters, but return the -ETIME error code. If a task has no pending signal and the given timeout is not expired yet, the given state will be set in the given task:

__set_task_state(task, state);

and call the schedule_timeout function:

raw_spin_unlock_irq(&sem->lock);
timeout = schedule_timeout(timeout);
raw_spin_lock_irq(&sem->lock);

That is all about the __down_common function. A task which wants to acquire a lock which is already acquired by another task will be spun in the infinite loop while it will not be interrupted by a signal, the given timeout will not be expired or the task which holds a lock will not release it. Now let's look at the implementation of the up function.

void up(struct semaphore *sem)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&sem->lock, flags);
        if (likely(list_empty(&sem->wait_list)))
                sem->count++;
        else
                __up(sem);
        raw_spin_unlock_irqrestore(&sem->lock, flags);
}
EXPORT_SYMBOL(up);

It looks almost the same as the down function. There are only two differences here. First of all we increment a counter of a semaphore if the list of waiters is empty. In other way we call the __up function from the same source code file. If the list of waiters is not empty we need to allow the first task from the list to acquire a lock:

static noinline void __sched __up(struct semaphore *sem)
{
        struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
                                                struct semaphore_waiter, list);
        list_del(&waiter->list);
        waiter->up = true;
        wake_up_process(waiter->task);
}

That's all.

Conclusion

Links

Before we will consider an of the semaphore mechanism in the Linux kernel, we need to know how to initialize a semaphore. Actually the Linux kernel provides two approaches to execute initialization of the given semaphore structure. These methods allows to initialize a semaphore in a:

The __SEMAPHORE_INITIALIZER macro takes the name of the future semaphore structure and does initialization of the fields of this structure. First of all we initialize a spinlock of the given semaphore with the __RAW_SPIN_LOCK_UNLOCKED macro. As you may remember from the parts, the __RAW_SPIN_LOCK_UNLOCKED is defined in the header file and expands to the __ARCH_SPIN_LOCK_UNLOCKED macro which just expands to zero or unlocked state:

The last two fields of the semaphore structure count and wait_list are initialized with the given value which represents count of available resources and empty .

The second way to initialize a semaphore structure is to pass the semaphore and number of available resources to the sema_init function which is defined in the header file:

Let's consider implementation of this function. It looks pretty easy and actually it does almost the same. Thus function executes initialization of the given semaphore with the __SEMAPHORE_INITIALIZER macro which we just saw. As I already wrote in the previous parts of this , we will skip the stuff which is related to the of the Linux kernel.

So, from now we are able to initialize a semaphore let's look at how to lock and unlock. The Linux kernel provides following to manipulate semaphores:

The first two functions: down and up are for acquiring and releasing of the given semaphore. The down_interruptible function tries to acquire a semaphore. If this try was successful, the count field of the given semaphore will be decremented and lock will be acquired, in other way the task will be switched to the blocked state or in other words the TASK_INTERRUPTIBLE flag will be set. This TASK_INTERRUPTIBLE flag means that the process may returned to ruined state by .

The down_trylock function is similar on the spin_trylock function. This function tries to acquire a lock and exit if this operation was unsuccessful. In this case the process which wants to acquire a lock, will not wait. The last down_timeout function tries to acquire a lock. It will be interrupted in a waiting state when the given timeout will be expired. Additionally, you may notice that the timeout is in

We just saw definitions of the semaphore . We will start from the down function. This function is defined in the source code file. Let's look on the implementation function:

We may see the definition of the flags variable at the beginning of the down function. This variable will be passed to the raw_spin_lock_irqsave and raw_spin_lock_irqrestore macros which are defined in the header file and protect a counter of the given semaphore here. Actually both of these macro do the same that spin_lock and spin_unlock macros, but additionally they save/restore current value of interrupt flags and disables .

The __down function is defined in the source code file and its implementation looks:

Now let's look at the implementation of the __down_common function. This function is defined in the source code file too and starts from the definition of the two following local variables:

The first represents current task for the local processor which wants to acquire a lock. The current is a macro which is defined in the header file:

Where the get_current function returns value of the current_task variable:

In the previous piece of code we set waiter.up to false. So, a task will spin in this loop while up will not be set to true. This loop starts from the check that the current task is in the pending state or in other words flags of this task contains TASK_INTERRUPTIBLE or TASK_WAKEKILL flag. As I already wrote above a task may be interrupted by during wait of ability to acquire a lock. The signal_pending_state function is defined in the source code file and looks:

We check that the state contains TASK_INTERRUPTIBLE or TASK_WAKEKILL bits and if the bitmask does not contain this bit we exit. At the next step we check that the given task has a pending signal and exit if there is not. In the end we just check TASK_INTERRUPTIBLE bit in the state bitmask again or the signal. So, if our task has a pending signal, we will jump at the interrupted label:

where we delete task from the list of lock waiters and return the -EINTR . If a task has no pending signal, we check the given timeout and if it is less or equal zero:

which is defined in the source code file. The schedule_timeout function makes the current task sleep until the given timeout.

The up function is defined in the source code file as down function. As we already know, the main purpose of this function is to release a lock. This function looks:

Here we takes the first task from the list of waiters, delete it from the list, set its waiter-up to true. From this point the infinite loop from the __down_common function will be stopped. The wake_up_process function will be called in the end of the __up function. As you remember we called the schedule_timeout function in the infinite loop from the __down_common this function. The schedule_timeout function makes the current task sleep until the given timeout will not be expired. So, as our process may sleep right now, we need to wake it up. That's why we call the wake_up_process function from the source code file.

This is the end of the third part of the chapter in the Linux kernel. In the two previous parts we already met the first synchronization primitive spinlock provided by the Linux kernel which is implemented as ticket spinlock and used for a very short time locks. In this part we saw yet another synchronization primitive - which is used for long time locks as it leads to . In the next part we will continue to dive into synchronization primitives in the Linux kernel and will see next synchronization primitive - .

If you have questions or suggestions, feel free to ping me in twitter , drop me or just create .

Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes please send me PR to .

chapter
spinlocks
part
synchronization primitive
semaphore
Context switch
preemption
deadlocks
semaphores
scheduler
API
include/linux/semaphore.h
API
previous
include/linux/spinlock_types.h
list
include/linux/semaphore.h
chapter
lock validator
API
signal
jiffies
API
kernel/locking/semaphore.c
include/linux/spinlock.h
interrupts
same
kernel/locking/semaphore.c
arch/x86/include/asm/current.h
per-cpu
signal
include/linux/sched.h
bitmask
SIGKILL
error code
kernel/time/timer.c
same
kernel/sched/core.c
synchronization primitives
semaphore
context switch
mutex
0xAX
email
issue
linux-insides
spinlocks
synchronization primitive
semaphore
context switch
preemption
deadlocks
scheduler
Doubly linked list in the Linux kernel
jiffies
interrupts
per-cpu
bitmask
SIGKILL
errno
API
mutex
Previous part