Blog

Author Archive

Openstack Swift Data Storage Explained

Posted on: June 6th, 2012 by stephenbroeker No Comments

I would like to take this opportunity to describe how Openstack Swift stores data. Swift storage basically consists of four components: Ring, Database, Zones, and File system.

Openstack Swift Accounts, Containers & Objects

Openstack Swift data consists of Accounts, Containers, and Objects. A Swift item then, can be an account, a container, or an object. A Swift Cluster can have multiple accounts, an account can have multiple containers, and a container can have multiple objects. Containers cannot be nested. That is, a container can only contain objects, it can not contain other containers. So Swift storage is considered to be a flat file system.

By default, Swift uses three replicas. This means that when an item get created (via a PUT), two additional copies are also created. Thus if two copies of the item are not available, the item can still be read.

Two Types of Nodes

Basically, a Swift cluster consists of two types of nodes: Proxy and Storage. Proxy Nodes provide an external interface into the Swift repository. Proxy nodes perform authentication and provide a REST API to the data. Storage nodes are used to store the actual data items: account, container, and object. Storage nodes consist of a number of Storage Devices. Each Storage Device is mounted as a separate file system using XFS. A Swift Cluster could (and should) contain a number of Proxy and Storage nodes.

Openstack Swift groups Storage nodes into Zones. By default, a Swift Cluster has four zones. Thus if a Swift Cluster has 40 Storage nodes, then each Zone consists of 10 Storage nodes. A zone is commonly a single Storage Rack, where a Storage Rack consists of multiple Storage nodes. Item replicas are guaranteed to be distributed accross zones. Thus the three replicas for a given Item are guaranteed to reside on different zones. This greatly improves data reliability: if a Storage Rack is down, then only a single replica is unavailable.

So how do Proxy nodes keep track of Storage nodes?

If a Proxy node receives a GET Object operation, then the object is stored on three different Storage nodes. How does the Proxy Node figure out which Storage Nodes are holding the Object data? The answer is: by using Consistent Hashing. In Openstack Swift Data Storage, this caching is called a “ring.” There is a separate ring for accounts, containers, and objects.

Next time I’ll cover more about Rings.

SMP: Queued Locks – Part 9

Posted on: May 30th, 2012 by stephenbroeker No Comments

I’m wrapping up a nine-part series on Symmetric Multi Processors (SMP). Following up on #8 the Random Back-off Problem which focused on a solution to isolating locks so that a cache line can only hold a single lock, I’ll conclude with another solution that completely solves the Thundering Herd problem in an optimal manner: Queued Locks.

I’ve shown you how a 32 bit cpu commonly uses an 8-word cache line. And we only want a single lock (or word) on a cache line. We can use the rest of the cache line to implement a Queued Lock.

Queued Locks

The Queued Lock algorithm essentually consists of having a cpu add itself to the end of a queue when it is waiting for a lock (this gives us fairness). The cpu then does a tight spin on a private cache line, waiting for the lock to be released (this gives us optimality). When the lock is released, the holding cpu picks the first cpu off of the queue (this is fair) and pokes the private cache line of that cpu to wake it up (optimal).

Single Cache

Here is the Queued Lock data structure that resides on a single cache line:

struct queued_lock
{
    int  mutex; // locks this data structure.

    int  value; // 0 = free, 1 = locked

    char  queue[24];
};

This data structure can thus handle a 25 cpu system. The queue field contains cpu numbers. A zero value indicates that the entry is free.

Private Wait Array

Here is the Private Wait Array:

struct private_wait
{
    int  value;

    int  fill[7];
};

struct private_wait private_wait_array[25];

This array is indexed by the cpu number.

Queued Lock Algorithms

Here is the Queued Lock lock algorithm:

queued_lock_lock (struct queued_lock *lock)
{

Now that I've laid out various scenarios and solutions, have I missed anything? What solutions would you add?
    while (test_and_set(lock->mutex) == 1)    // lock the data structure
        ;

    if (lock->value == 0)    // lock is not set, take it
    {
        lock->value = 1;
        lock->mutex = 0;    // unlock the data structure
    }
    else    // lock is set, add myself to the queue, and wait
    {
        for (index = 0; ; index++)
        {
            assert(index < 24)

            if (lock->queue[index] == 0)
                break;
        }

        lock->queue[index] = my_cpu_number + 1;
        private_wait_array[my_cpu_number]->value = 0;
        lock->mutex = 0;    // unlock the data structure

        while (private_wait_array[my_cpu_number]->value == 0)
            ;
    }
}

And finally here is the Queued Lock unlock algorithm:

queued_lock_unlock (struct queued_lock *lock)
{
    while (test_and_set(lock->mutex) == 1)    // lock the data structure
        ;

    if (lock->queue[0] == 0)  // no one is waiting
    {
        lock->value = 0;
    }
    else  // somebody is waiting, poke them
    {
        cpu = lock->queue[0] - 1;
        memmove(&lock->queue[0], &lock->queue[1], 23);
        private_wait_array[cpu]->value = 1;
    }

    lock->mutex = 0;    // unlock the data structure
}

SMP: Random Backoff Solution – Part 8

Posted on: May 23rd, 2012 by stephenbroeker No Comments

Last time I presented a solution to the Thundering Herd problem: the Random Backoff, but noted a problem with solution. It is quite possible that when a cpu releases a contentious lock, waiting cpus will not wake up immediately. That is, it could take up to a second before a waiting lock tries to obtain the now-free lock. This solution is thus not optimal.

As a follow-up I’d like to present a SMP refinement that I believe solves this problem: isolating and separating all locks so that there is a single lock per cache line. Why do we do this?

Isolated & Separated Locks

Consider two SMP locks: L1 and L2, both on a single cache line. Note: on a 32 bit cpu, a word is 32 bits and a cache line is commonly 8 words, and a single cache line could thus hold 8 locks. If a cpu locks L1, then its cached copy of L1 is updated and the cache copy of L1 for the other cpus is invalidated. A waiting cpu will thus have to get the up to date copy of L1 from main memory. This is an expensive operation in time and bus traffic. If L2 is also on the same cache line, then L1 traffic will invalidate L2 cache copies. The L2 cache invalidates are not necessary for L2 cache coheriency and thus adversly effect system performance.

If we seperate L1 and L2 and place them on different cache lines, then L1 activity does not effect L2 activity and visa versa.

Memory Wasted

The downside of this solution is that memory is wasted. Optimally (as far as memory is concerned) we could pack 8 locks into a single cache line. So for every lock we are wasting 7 words. If we consider Moore’s Law and how the price of memory chips is constantly dropping, I think this solution is worth the price.

So how much does this solution help? Turns out that cache line isolation does not completely solve this problem. It improves the situation, but does not completely elliminate it. So in my next post, I will present another solution to the SMP Random Backoff problem.

SMP: Thundering Herd Solution – Part 7

Posted on: May 16th, 2012 by stephenbroeker No Comments

In Part 6 I presented Symmetric Multi-Processor’s (smp) Thundering Herd problem; it’s the result of a lack of fairness in the mutex algorithm. This problem can occur when multiple cpus are using a lock under high contention. Next I’d like to share one of the smp solutions we used at Pyramid to solve this problem.

SMP Solution 1: Random Backoff

You may recall the mutex algorithm as:

mutex (void *addr)
{
    while (test_and_set(addr) == 1)
        ;
}

The problem is that when multiple cpus are waiting for a single lock, the while loop is too predictable. It lacks variation. What we need is a random period before the next call to call test_and_set(). We thus create a Random Backoff array. The size of this array should be large relative to the number of cpus. In the case of 24 cpus, we use an array size of 1024. We then initialize the array entries to random numbers (in micro-seconds) with an appropriate upper bound of 1 second.

So the mutex algorithm now becomes:

mutex (void *addr)
{
    int  index;
    extern int random_backoff[];

    while (test_and_set(addr) == 1)
    {
        index = random() % 1000000;
        usec = random_backoff[index];
        usleep(usec);
    }
}

The particular use of random() is not required, any type of random number generater will do. What is important is how the random number generater is seeded. Each cpu must use a different seed. I recommend using the cpu number multiplied by the current date.

Notice how we index the array with a random number. If we did not did this, then indexing the array could become predictable.

Contentious Locks

There is a problem with this smp solution though. It is quite possible that when a cpu releases a contentious lock, waiting cpus will not wake up immediately. That is, it could take up to a second before a waiting lock trys to obtain the now free lock. Thus the Random Backoff solution is not optimal.

In my next blog, I will present solutions and refinements to this situation.

SMP: Thundering Herd Problem – Part 6

Posted on: May 9th, 2012 by stephenbroeker No Comments

My last five blogs posts have presented:

Now, let’s explore something called the Thundering Herd Problem. If you recall, a previous blog presented the mutex algorithm as:

mutex (void *addr)
{
    while (test_and_set(addr) == 1)
        ;
}

Now consider the scenerio where Process 1 (P1) is holding a lock and two other processes (P2 and P3) are trying to obtain the same lock. If P2 has waiting for a long time and P3 has been waiting for a small amount of time, then there is nothing in the mutext algorithm that enforces fairness. That is, once P1 releases the lock, P3 can immediately obtain it, even though P2 has been waiting for a much longer time.

Now extend this problem to a 12 CPU SMP system. One cpu could hold a lock and 11 cpu’s could be waiting on the same lock. And extend this problem to a 24 CPU SMP system. One cpu could hold a lock and 23 cpu’s could be waiting on the same lock.

In this case, when a cpu releases a lock, possibly 23 other cpus could all try to get the same lock. This mass attempt to obtain a single lock results in a non linear increase in bus and cache traffic. The Thundering Herd is the 23 other cpus. And the problem is, only one cpu is going to obtain the lock. The other 23 cpus go back into a wait state and the process continues. Since fairness is not built into this algorithm, there is no gaurantee that a particular cpu will obtain the lock in a reletively fair fashon.

Are there locks that suffer from this kind of contention? Does the Thundering Herd problem occur in a real engineering envirnoment? Or, is all of this just an interesting theory?

At Pyramid we actually ran into this problem. In the Unix kernel, there a number of locks that experience this kind of contention. And we experienced the Thundering Herd problem when we expanded our cachsis from 12 cpus to 24 cpus. We actually ran into Lock Timeouts (30 seconds) where no single cpu was holding the lock for an excessive amount of time. The problem was that the mutex algorithm was not fair, a waiting cpu was starved for attention.

In my next post I will present solutions to this problem.

SMP: Debugging the System Part 5

Posted on: May 2nd, 2012 by stephenbroeker No Comments

My last four blogs posts have presented:

It’s logical to now explore how to debug an SMP system. But first I need to present the classic locking problem as defined by Dijkstra.

Classic Locking Problem

Consider two processes: P1 and P2; and two locks: L1 and L2. Now consider the following locking scenerio: P1 locks L1 (P1.L1) and then L2 (P2.L1); P2 locks L2 (P2.L2) and then L1 (P2.L1).

If the locks are aquired in the order: (1: P1.L1, 2: P2.L2, 3: P1.L2, 4: P2,L1), then steps 3 and 4 will never complete. This scenerio is called a deadlock in that both processes are stuck waiting for locks that are held by each other.

So how do we deal with a SMP deadlock? There are essentually two ways to deal with deadlocks. Method 1: try to detect when a deadlock has occured. Method 2: prevent deadlocks from occuring.

The first method is somewhat problomatic in that you have to keep track of all of the locks in the system and periodically monitor them for a deadlock.

The second method is much easier to implement. Pyramid dealt with this problem by adding code to detect when a deadlock was immenant. To wit, a positive, non-zero level was assigned to each lock and the lock protocol was modified to ensure that a lock could only be aquired if the lock level was increased.

Now let us apply lock levels to the Dijkstra lock problem. L1 will now be defined as L1.5 (lock 1 with level 5) and L2 will now be defined as L2.10 (lock 1 with level 10). If the locks are aquired in the order: (1: P1.L1, 2: P2.L2, 3: P1.L2, 4: P2,L1), then step 4 will generate a Lock Level Violation since P2 is holding L2 (which has a level of 10) and P2 is trying to lock L1 (which has a level of 5) – you can only increase the lock levels.

Handling SMP Interrupts

There is another problem that we have to deal with: interrupts. If a process is holding a lock and interrupts occur, then it will look like the process is holding the lock for a long time.

This problem is dealt with by adding an interrupt level to the definition of a lock. In addition, the lock protocol is expanded by requiring that the cpu interrupt level be set to the lock interrupt level. Setting the cpu interrupt level blocks interrupts at a lower level. System interrupts are defined by levels, for example: 0 = none, 1 = tty, 2 = network, 3 = disk, etc. The higher the interrupt level, the higher the priority. Higher level interrupts block out lower level interrupts. Thus, setting the interrupt level of a cpu to priority X, blocks all interrupts lower than and including level X.

Defining the Locking Protocol

So a lock is now defined as: (Name, Level, Interrupt). And the locking protocol is now defined as:

1) If the new lock level is less than the current lock level, then generate a Lock Level Protocol Violation.

2) If the new lock interrupt priority is less than the current cpu interrupt priority, then generate a Lock Interrupt Protocol Violation.

Next up? SMP’s Thundering Herd Problem.

SMP: Solving The Lock Problem Part 4

Posted on: April 25th, 2012 by stephenbroeker No Comments

My last three blog posts have presented the SMP (Symmetric Multi Processor scaling problem, Performance Solution as presented by Pyramid Technology, and the user space SMP problem. Now, I’d like to delve into the locks themselves. What actually is a lock?

There are essentually two types of SMP locks: mutexes and semaphores.

SMP Mutex Lock

A mutex (short for mutual exclusion), is a lock that waits for a field to be released before proceeding. In its simplist incantation, a mutex constantly checks the field in question. This type of lock thus uses the following alorithm:

mutex (void *addr)  {
while (test_and_set(addr) == 1) ;
}

For this algorithm to work, there must be an atomic operation: test_and_set(void *ptr). This operation writes a 1 into the given word address. If the value was previously 0, then test_and_set() returns 0. Otherwise this operation returns 1.

It is important to note that a mutex does not voluntarily give up the cpu. In the kernel, this means never. In user space, the schedular can run another process if it has higher priority.

SMP Semaphore Lock

A semaphore is a lock in which the cpu is released if the field is not free. In its simplist incantation, a semaphore uses a mutex to try to set the lock. If the lock is not set, then the semaphore gives up the cpu and tries later. This type of lock thus uses the following alorithm:

semaphore (void *addr)
{
    while (test_and_set(addr) == 1)
        sleep(0);
}

The call to sleep with a seconds value of 0, simply results in a call to sched(). And sched() will preempt the semaphore if there is another process ready to run.

So why is mutex used at all? Why not always use a semaphore? The reason is that the kernel has to use a mutex, it cannot use a semaphore, since there is only a single copy of the kernel that can run on a given cpu. But in user mode, semaphores allow other processes to run, thus improving system efficiency.

2012 Beehives: B-Haven Apiary Started

Posted on: April 20th, 2012 by stephenbroeker No Comments

On April 14, we picked up five new queen bees packages from Honey Bee Genetics. Our intent was to kick of 2012 for B-Haven Apiary.

But first I had to get the apiary ready for the new beehives. If you will recall from B-Haven’s 2011 Annual Report, all four of last years hives died; they did not make it through the Winter and Spring. As a result I decided to employ a Hive Nucleus in the apiary.

New Beehives

I had two supers (20 frames) of honey and pollen left over from last year. I set up four new beehives plus the nucleus. I placed four honey/pollen frames in each empty hive – to give the new hives a good start.

This year, our youngest son-in-law (Mathew Togami) was nice enough to assist me in installing the new beehives. When we went to install the first new hive, we had quite the surprise. A swarm had moved in and taken over – awesome!

Installing Hive Nucleus

So we left the swarm alone and installed the four new beehives and nucleus. I had to rebalance the old honey/pollen frames. So instead of four old frames per hive, we used three old frames in two of the hives, four old frames in three of the hives, and 2 old frames in the nucleus.

All of the new hives look good. The new hives consist of four pounds of bees and a caged queen. I will check in three days to make sure that the new hives have released the queens from their cages. The swarm hive is on a single super. The four new hives are all on a single super. The nucleus is on two supers since each super is only five frames.

In my next bee blog, I will report on how queen acceptance is going. Did the new hives accept their queens? And how is the swarm hive doing?

SMP: Symmetric Multi-Processor User Processes (Part 3)

Posted on: April 18th, 2012 by stephenbroeker No Comments

Continuing on the information presented in Part 1: The Symmetric Multi Processor (SMP) scaling problem and Part 2: The Performance Solution as presented by Pyramid Technology, my comments were addressed at the kernel. But what about user processes? Do they also need to use locking?

The answer is yes, if multiple processes share the same physical memory. This can occur with Shared Memory. User space proccesses use Shared Memory so that proccesses can easily and quickly share data. Shared Memory is the fastest Inter Process Communication (IPC) mechanism that can be employed. But it does have a cost (alas nothing is free): lack of synchronization.

If a single process is writing to shared memory, and a single process is reading the same shared memory, then locking is not needed.

If a single process is writing to shared memory, and multiple processes are reading the shared memory, then locking may or may not be required. If the shared memory is used as a resource allocator, then locks will be required. On the other hand, if the shared memory is used to post events, then locks might not be required.

Multiple Processes, Locking Required

If SMP multiple processes are writing to shared memory, then locking will be required.

User space Shared Memory is thus like the kernel. Each user process can be thought of as a seperate kernel running on different cpu. Multiple instances of a user process can run on a single cpu. If they use Shared Memory, they will then all use the same physical memory, just like the kernel. It turns out that using user processes in this manner is a great way to simulate a kernel SMP environment.

A true SMP system requires hardware memory support, called cache coherency. A cache coheriancy problem occurs when a cpu writes to physical memory and does not update or invalidate the caches on the other cpus in the system.

Next up I’ll delve more into locks.

SMP: Solving The Performance Problem (Part 2)

Posted on: April 11th, 2012 by stephenbroeker No Comments

In my last blog post, I described the SMP problem. How adding cpus to an SMP system does not necessarily reliably increase performance.

Pyramid Technology solved this problem by decreasing the granularity of the lock. This means that the amount of real estate that was controlled by the lock, was decreased. Thus instead of a single lock for the entire kernel, multiple locks were added for the kernel. That is, locks were added for each major data structure: Process Table, Schedular, File Systems, etc.

Pyramid found that this increase in kernel locks greatly increased system performance. But it was not enough, processing were still stalling for long periods of time (greater than 30 seconds), waiting for kernel locks to be released. The problem was that some of the locks were too course.

Process Table Example

For example: consider the Process Table. Initially, a single lock was used to control access to this data structure: the Process Table Lock. This lock was used when entries were added or deleted from the table. This lock was also used were entries were modified. Thus, when a process was created or killed, no other cpu could modify an existing process: change the wait state, post interrupts, etc. The solution was to add a Process Entry Lock. Thus the Process Table Lock controlled access when entries were added and deleted, and each Process Entry Lock was used to synchronize access to a single Process Table Entry.

This process thus decreased the granularity of the Process Table Lock. More locks were added to the system and the Process Table Lock was used to synchronize less code.

This process had to be repeated until cpus were no longer stalling. The definition of stalling is left up to the designer. A lower stall time means that the system is more efficient. It also means that more locks will have to be added.

By the way, the brains behind Pyramid Technology was Rich Hammonds. He designed both the software and hardware at Pyramid. He is the finest engineer that I have ever had the pleasure of working with.