Introduction

The FreeBSD disk scheduler is the standard "elevator", also known as C-LOOK, which is designed to maximize throughput, but cannot grant fairness among clients of the disk subsystem. In fact, large sequential I/O requests, or certain access patterns from one or a few processes, might almost completely starve other processes. This is particularly critical for applications with soft real time requirements such as audio/video players.
For those reasons, support for pluggable (ie. run-time selectable) disk schedulers has been added to FreeBSD, in all branches (4.x, 5,x and 6.x). The three branches implementation is quite different, but luckly the interface you have to deal with when writing your own scheduler is just the same, except some little name-changes for release 4.x.

The present document will guide you step-by-step in writing your own scheduler and in adding it to the FreeBSD source. The scheduler will be run-time selectable thanks to a sysctl call.

Architectural overview

A scheduler is basically a set of three callback functions which are called by the OS when needed. Those funcions are:

* disksort Called by OS to put a new disk request into the scheduler
* getfirst Called by the disk driver to get the request to serve. Must return NULL if queue is empty.
* remove Called by disk driver to delete a previuous got request. This funcion is always called after getfirst, but in the meantime, other disksort or getfirst calls may be involved.

Prototypes for functions are:
void my_disksort(struct bio_queue_head *bioq, struct bio *bp);
struct bio *my_first(struct bio_queue_head *head);
void my_remove(struct bio_queue_head *head, struct bio *bp);

The bio struct is the disk request. You should never modify its fields, since they are set for disk driver use; the only exception is the class field which we will discuss later.

bio_queue_head identifies a disk device. There is always a bio_queue_head instance for each of the following:
* Hard Disk (not a partition/slice, just a phisycal disk)
* MFS mount point
* procfs

bioq_queue_head has a field named queue which is a fifo queue (of TAILQ type. See man queue(3) for more infos).
This queue is the one used by the default cscan scheduler (if you want to see how, take a look at kern/subr_disk.c,functions: cscan_bioq_disksort, cscan_bioq_first and cscan_bioq_remove). If you want you can use this queue for your own purposes, but you don't really need it. More important is the sched_info field: this member is void *, and you can set its value as you wish. Always remember that scheduler works per-device based, so sched_info is the only piece of information you have to distinguish between devices. Each per-device scheduler data (e.g. accounting infos) should be stored in a structure; sched_info pointer should then be set to this structure. This is useful because at each disksort, getfirst and remove calls you have instant access to all device data you need.

Other than disksort, getfirst and remove there are four optionals functions which you could implement. Prototypes are:
void my_load(void);
void my_unload(void);
void my_delete(struct bio_queue_head *bioq);
void my_init(struct bio_queue_head *bioq);

Load and unload are called each time the scheduler module is loaded/unloaded in/from memory. If scheduler code is statically loaded at boot time, then we'll have only a load call (at boot) and only an unload (during shotdown).

init and delete are called per-device based; the first one is called each time a new device is attached (e.g. a new MFS is mounted), while delete is called on device detach. init is a good point to do allocate needed data and to set the sched_info field; delete instead, is a good point to free data. You can think that init is called just after head allocation, and delete just before head deallocation. Those two functions are also called during scheduler switching. When user executes a sysctl to change current scheduler, the delete function will be called for all devices, followed by an init call for all devices.

Registering

After writing functions you should register callbacks functions. This is made filling a global structure and calling some macro. Following code is mandatory to correctly register the scheduler:
static struct _disk_sched_interface my_sched = {
	.next= NULL,
	.name= "my",
	.disksort= my_disksort,
	.remove= my_remove,
	.get_first= my_first,
	.init= my_init,
	.delete= my_delete,
	.load= my_load,
	.unload= my_unload,
};

SYSINIT(myload, SI_SUB_RUN_QUEUE, SI_ORDER_FIRST, disk_sched_load, &my_sched);

SYSCTL_DECL(_vfs_scheduler);
SYSCTL_NODE(_vfs_scheduler, OID_AUTO, my,
            CTLFLAG_RW, 0, "My scheduler");
SYSCTL_INT(_vfs_scheduler_my, OID_AUTO, refcount,
                CTLTYPE_INT, &my_sched.refcount, 0, "refcount");

.init, .delete, .load, .unload can also be set to NULL. The first SYSINIT call, will register the my_sched structure inside the kernel, so that the scheduler will be accessible through sysctl. The name member of the structure is the string you should use in sysctl for scheduler switching; other SYSCTL stuffs are just needed to create nodes in syscl tree.
If your scheduler needs it, you can add as many sysctl nodes and handlers you need; just remember that you should put them under vfs.scheduler.my. tree. This is not mandatory, however, just a good rule.

To perform a scheduler switching you simply type:
# sysctl vfs.scheduler.name=my

Locking Notes

The seven functions previous described are always called by the kernel after locking semaphores. Locking is relative to the head parameter; this means that in each function you can safety modify any per-device data.
Warning: It's guaranteed that only the head parameter passed to the function is protected by lock; this is another reason to use a separate structure for each device and having sched_info which points to it. You should not have global variables which refers to all devices, because you can never assume you can safety modify them.

Classifiers

Some schedulers require a classifier module to annotate requests with some identifier of the entity they should be charged to -- e.g. the process/user/group ID. This is necessary in case the scheduler wants to provide some kind of fairness among different entities.

To this purpose I added the class field to the bio structure, containing classifier information. This is the structure:

struct buf_class {
	enum client_type_t	t;
	int			pug;
	int			weight;
	char *msg;		/* set to non-null when initialized */
};

The msg field is used only for interla purposes and you should never us it. t tells you which classifier you are currently using. Actually only 3 classifiers are supported, so t can assume one of the following values:
* PID
* UID
* GID
Value of pug depends on the t one. When t is PID/UID/GID then pug contains the pid/uid/gid of the process which created the request.

The weight field can be used to retrieve the process/user/gid weight to give (for example) different bandwidths to the clients.

pug field is set by the kernel, but t and weight should be set by user.

t is set with a sysctl call:
*# sysctl vfs.scheduler.classifier=PID
*# sysctl vfs.scheduler.classifier=UID
*# sysctl vfs.scheduler.classifier=GID

Weight can be set in the following way:
* for PID you can use the nice command, with a new parameter which I added:
nice -d n... runs a process with I/O weight n; renice -d m changes the I/O weight.
* for UID and GID you must use a sysctl call:
sysctl vfs.scheduler.set=UID X Y sets the weight for UID X equal to Y, and a similar syntax is used to set the per-GID weight.

Example: the "dumb" scheduler

As example I report here the code of a "dumb" scheduler which is just a FIFO one. It used the queue field as queue:

#include  
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 

#include 
#include    /* curproc ? */
#define LEN(x) ( (dn_key) (x)->bio_bcount )
#endif

/*
 *	dumb scheduler
 *
 * This one does just FIFO
 */

/*---- the disk scheduler API ----*/
/* 
 * Show first buf in a queue (bioq_first) 
 */
static struct bio* 
dumb_first(struct bio_queue_head *bioq)
{	
	return TAILQ_FIRST(&bioq->queue);
}

/*
 * Remove a buf from a device queue (bioq_remove)
 */
static void 
dumb_remove(struct bio_queue_head *bioq, struct bio *bp)
{
	TAILQ_REMOVE(&bioq->queue, bp, bio_queue);
}


/* 
 * Insert a buf into a sorted device queue (bioq_disksort) 
 */
static void 
dumb_disksort(struct bio_queue_head *bioq, struct bio *bp)
{
	TAILQ_INSERT_TAIL(&bioq->queue, bp, bio_queue);
}

static void
dumb_init(struct bio_queue_head *head)
{
	TAILQ_INIT(&head->queue);
}

static struct _disk_sched_interface dumb_sched = {
	.next= NULL,
	.name= "dumb",
	.disksort= dumb_disksort,
	.remove= dumb_remove,
	.get_first= dumb_first,
	.init= dumb_init,
};

SYSINIT(dumbload, SI_SUB_RUN_QUEUE, SI_ORDER_FIRST, disk_sched_load, &dumb_sched);

SYSCTL_DECL(_vfs_scheduler);
SYSCTL_NODE(_vfs_scheduler, OID_AUTO, dumb,
            CTLFLAG_RW, 0, "cscan disk I/O scheduler");
SYSCTL_INT(_vfs_scheduler_dumb, OID_AUTO, refcount,
                CTLTYPE_INT, &dumb_sched.refcount, 0, "refcount");