Blog

Openstack Swift Data Storage Explained

Posted on: June 6th, 2012 by stephenbroeker No Comments

I would like to take this opportunity to describe how Openstack Swift stores data.  Swift storage basically consists of four components: Ring, Database, Zones, and File system.

Openstack Swift Accounts, Containers & Objects

Openstack Swift data consists of Accounts, Containers, and Objects. A Swift item then, can be an account, a container, or an object.  A Swift Cluster can have multiple accounts, an account can have multiple containers, and a container can have multiple objects.  Containers cannot be nested.  That is, a container can only contain objects,  it can not contain other containers.  So Swift storage is considered to be a flat file system.

By default, Swift uses three replicas.  This means that when an item get created (via a PUT),  two additional copies are also created.  Thus if two copies of the item are not available, the item can still be read.

Two Types of Nodes

Basically, a Swift cluster consists of two types of nodes: Proxy and Storage.  Proxy Nodes provide an external interface into the Swift repository.  Proxy nodes perform authentication and provide a REST API to the data.  Storage nodes are used to store the actual data items: account, container, and object.  Storage nodes consist of a number of Storage Devices.  Each Storage Device is mounted as a separate file system using XFS.  A Swift Cluster could (and should) contain a number of Proxy and Storage nodes.

Openstack Swift groups Storage nodes into Zones.  By default, a Swift Cluster has four zones.  Thus if a Swift Cluster has 40 Storage nodes, then each Zone consists of 10 Storage nodes.  A zone is commonly a single Storage Rack, where a Storage Rack consists of multiple Storage nodes.  Item replicas are guaranteed to be distributed accross zones.  Thus the three replicas for a given Item are guaranteed to reside on different zones.  This greatly improves data reliability: if a Storage Rack is down, then only a single replica is unavailable.

So how do Proxy nodes keep track of Storage nodes?

If a Proxy node receives a GET Object operation, then the object is stored on three different Storage nodes.  How does the Proxy Node figure out which Storage Nodes are holding the Object data?  The answer is: by using Consistent Hashing.  In Openstack Swift Data Storage, this caching is called a “ring.”  There is a separate ring for accounts, containers, and objects.

Next time I’ll cover more about Rings.

SMP: Queued Locks – Part 9

Posted on: May 30th, 2012 by stephenbroeker No Comments

I’m wrapping up a nine-part series on Symmetric Multi Processors (SMP). Following up on #8 the Random Back-off Problem which focused on a solution to isolating locks so that a cache line can only hold a single lock, I’ll conclude with another solution that completely solves the Thundering Herd problem in an optimal manner: Queued Locks.

I’ve shown you how a 32 bit cpu commonly uses an 8-word cache line.  And we only want a single lock (or word) on a cache line.  We can use the rest of the cache line to implement a Queued Lock.

Queued Locks

The Queued Lock algorithm essentually consists of having a cpu add itself to the end of a queue when it is waiting for a lock (this gives us fairness).  The cpu then does a tight spin on a private cache line, waiting for the lock to be released (this gives us optimality).  When the lock is released, the holding cpu picks the first cpu off of the queue (this is fair) and pokes the private cache line of that cpu to wake it up (optimal).

Single Cache

Here is the Queued Lock data structure that resides on a single cache line:

struct queued_lock
{
    int  mutex; // locks this data structure.

    int  value; // 0 = free, 1 = locked

    char  queue[24];
};

This data structure can thus handle a 25 cpu system.  The queue field contains cpu numbers.  A zero value indicates that the entry is free.

Private Wait Array

Here is the Private Wait Array:

struct private_wait
{
    int  value;

    int  fill[7];
};

struct private_wait private_wait_array[25];

This array is indexed by the cpu number.

Queued Lock Algorithms

Here is the Queued Lock lock algorithm:

queued_lock_lock (struct queued_lock *lock)
{

Now that I've laid out various scenarios and solutions, have I missed anything? What solutions would you add?
    while (test_and_set(lock->mutex) == 1)    // lock the data structure
        ;

    if (lock->value == 0)    // lock is not set, take it
    {
        lock->value = 1;
        lock->mutex = 0;    // unlock the data structure
    }
    else    // lock is set, add myself to the queue, and wait
    {
        for (index = 0; ; index++)
        {
            assert(index < 24)

            if (lock->queue[index] == 0)
                break;
        }

        lock->queue[index] = my_cpu_number + 1;
        private_wait_array[my_cpu_number]->value = 0;
        lock->mutex = 0;    // unlock the data structure

        while (private_wait_array[my_cpu_number]->value == 0)
            ;
    }
}

And finally here is the Queued Lock unlock algorithm:

queued_lock_unlock (struct queued_lock *lock)
{
    while (test_and_set(lock->mutex) == 1)    // lock the data structure
        ;

    if (lock->queue[0] == 0)  // no one is waiting
    {
        lock->value = 0;
    }
    else  // somebody is waiting, poke them
    {
        cpu = lock->queue[0] - 1;
        memmove(&lock->queue[0], &lock->queue[1], 23);
        private_wait_array[cpu]->value = 1;
    }

    lock->mutex = 0;    // unlock the data structure
}

SMP: Random Backoff Solution – Part 8

Posted on: May 23rd, 2012 by stephenbroeker No Comments

Last time I presented a solution to the Thundering Herd problem: the Random Backoff, but noted a problem with solution.  It is quite possible that when a cpu releases a contentious lock, waiting cpus will not wake up immediately.  That is, it could take up to a second before a waiting lock tries to obtain the now-free lock.  This solution is thus not optimal.

As a follow-up I’d like to present a SMP refinement that I believe solves this problem: isolating and separating all locks so that there is a single lock per cache line.  Why do we do this?

Isolated & Separated Locks

Consider two SMP locks: L1 and L2, both on a single cache line.  Note: on a 32 bit cpu, a word is 32 bits and a cache line is commonly 8 words, and a single cache line could thus hold 8 locks.  If a cpu locks L1, then its cached copy of L1 is updated and the cache copy of L1 for the other cpus is invalidated.  A waiting cpu will thus have to get the up to date copy of L1 from main memory.  This is an expensive operation in time and bus traffic.  If L2 is also on the same cache line, then L1 traffic will invalidate L2 cache copies.  The L2 cache invalidates are not necessary for L2 cache coheriency and thus adversly effect system performance.

If we seperate L1 and L2 and place them on different cache lines, then L1 activity does not effect L2 activity and visa versa.

Memory Wasted

The downside of this solution is that memory is wasted.  Optimally (as far as memory is concerned) we could pack 8 locks into a single cache line.  So for every lock we are wasting 7 words.  If we consider Moore’s Law and how the price of memory chips is constantly dropping, I think this solution is worth the price.

So how much does this solution help?  Turns out that cache line isolation does not completely solve this problem.  It improves the situation, but does not completely elliminate it.  So in my next post, I will present another solution to the SMP Random Backoff problem.

SMP: Thundering Herd Solution – Part 7

Posted on: May 16th, 2012 by stephenbroeker No Comments

In Part 6 I presented Symmetric Multi-Processor’s (smp) Thundering Herd problem; it’s the result of a lack of fairness in the mutex algorithm.  This problem can occur when multiple cpus are using a lock under high contention. Next I’d like to share one of the smp solutions we used at Pyramid to solve this problem.

SMP Solution 1: Random Backoff

You may recall the mutex algorithm as:

mutex (void *addr)
{
    while (test_and_set(addr) == 1)
        ;
}

The problem is that when multiple cpus are waiting for a single lock, the while loop is too predictable.  It lacks variation.  What we need is a random period before the next call to call test_and_set().  We thus create a Random Backoff array.  The size of this array should be large relative to the number of cpus.  In the case of 24 cpus, we use an array size of 1024.  We then initialize the array entries to random numbers (in micro-seconds) with an appropriate upper bound of 1 second.

So the mutex algorithm now becomes:

mutex (void *addr)
{
    int  index;
    extern int random_backoff[];

    while (test_and_set(addr) == 1)
    {
        index = random() % 1000000;
        usec = random_backoff[index];
        usleep(usec);
    }
}

The particular use of random() is not required, any type of random number generater will do.  What is important is how the random number generater is seeded.  Each cpu must use a different seed.  I recommend using the cpu number multiplied by the current date.

Notice how we index the array with a random number.  If we did not did this, then indexing the array could become predictable.

Contentious Locks

There is a problem with this smp solution though.  It is quite possible that when a cpu releases a contentious lock, waiting cpus will not wake up immediately.  That is, it could take up to a second before a waiting lock trys to obtain the now free lock.  Thus the Random Backoff solution is not optimal.

In my next blog, I will present solutions and refinements to this situation.

SMP: Thundering Herd Problem – Part 6

Posted on: May 9th, 2012 by stephenbroeker No Comments

My last five blogs posts have presented:

  1. The Symmetric Multi Processor (SMP) scaling problem.
  2. The solution as presented by Pyramid Technology.
  3. The user space SMP problem.
  4. The various lock primitives.
  5. Debugging locks.

Now, let’s explore something called the Thundering Herd Problem. If you recall, a previous blog presented the mutex algorithm as:

mutex (void *addr)
{
    while (test_and_set(addr) == 1)
        ;
}

Now consider the scenerio where Process 1 (P1) is holding a lock and two other processes (P2 and P3) are trying to obtain the same lock.  If P2 has waiting for a long time and P3 has been waiting for a small amount of time, then there is nothing in the mutext algorithm that enforces fairness.  That is, once P1 releases the lock, P3 can immediately obtain it, even though P2 has been waiting for a much longer time.

Now extend this problem to a 12 CPU SMP system.  One cpu could hold a lock and 11 cpu’s could be waiting on the same lock.  And extend this problem to a 24 CPU SMP system.  One cpu could hold a lock and 23 cpu’s could be waiting on the same lock.

In this case, when a cpu releases a lock, possibly 23 other cpus could all try to get the same lock.  This mass attempt to obtain a single lock results in a non linear increase in bus and cache traffic.  The Thundering Herd is the 23 other cpus.  And the problem is, only one cpu is going to obtain the lock.  The other 23 cpus go back into a wait state and the process continues.  Since fairness is not built into this algorithm, there is no gaurantee that a particular cpu will obtain the lock in a reletively fair fashon.

Are there locks that suffer from this kind of contention?  Does the Thundering Herd problem occur in a real engineering envirnoment?  Or, is all of this just an interesting theory?

At Pyramid we actually ran into this problem.  In the Unix kernel, there a number of locks that experience this kind of contention.  And we experienced the Thundering Herd problem when we expanded our cachsis from 12 cpus to 24 cpus.  We actually ran into Lock Timeouts (30 seconds) where no single cpu was holding the lock for an excessive amount of time.  The problem was that the mutex algorithm was not fair, a waiting cpu was starved for attention.

In my next post I will present solutions to this problem.

SMP: Debugging the System Part 5

Posted on: May 2nd, 2012 by stephenbroeker No Comments

My last four blogs posts have presented:

  1. Symmetric Multi Processor (SMP) scaling problem.
  2. Solution as presented by Pyramid Technology.
  3. User space SMP problem.
  4. Various lock primitives.

It’s logical to now explore how to debug an SMP system. But first I need to present the classic locking problem as defined by Dijkstra.

Classic Locking Problem

Consider two processes: P1 and P2; and two locks: L1 and L2.  Now consider the following locking scenerio: P1 locks L1 (P1.L1) and then L2 (P2.L1); P2 locks L2 (P2.L2) and then L1 (P2.L1).

If the locks are aquired in the order: (1: P1.L1, 2: P2.L2, 3: P1.L2, 4: P2,L1), then steps 3 and 4 will never complete.  This scenerio is called a deadlock in that both processes are stuck waiting for locks that are held by each other.

So how do we deal with a SMP deadlock?  There are essentually two ways to deal with deadlocks.  Method 1: try to detect when a deadlock has occured.  Method 2: prevent deadlocks from occuring.

The first method is somewhat problomatic in that you have to keep track of all of the locks in the system and periodically monitor them for a deadlock.

The second method is much easier to implement.  Pyramid dealt with this problem by adding code to detect when a deadlock was immenant.  To wit, a positive, non-zero level was assigned to each lock and the lock protocol was modified to ensure that a lock could only be aquired if the lock level was increased.

Now let us apply lock levels to the Dijkstra lock problem.  L1 will now be defined as L1.5 (lock 1 with level 5) and L2 will now be defined as L2.10 (lock 1 with level 10).  If the locks are aquired in the order: (1: P1.L1, 2: P2.L2, 3: P1.L2, 4: P2,L1), then step 4 will generate a Lock Level Violation since P2 is holding L2 (which has a level of 10) and P2 is trying to lock L1 (which has a level of 5) – you can only increase the lock levels.

Handling SMP Interrupts

There is another problem that we have to deal with: interrupts.  If a process is holding a lock and interrupts occur, then it will look like the process is holding the lock for a long time.

This problem is dealt with by adding an interrupt level to the definition of a lock.  In addition, the lock protocol is expanded by requiring that the cpu interrupt level be set to the lock interrupt level.  Setting the cpu interrupt level blocks interrupts at a lower level.  System interrupts are defined by levels, for example: 0 = none, 1 = tty, 2 = network, 3 = disk, etc.  The higher the interrupt level, the higher the priority.  Higher level interrupts block out lower level interrupts.  Thus, setting the interrupt level of a cpu to priority X, blocks all interrupts lower than and including level X.

Defining the Locking Protocol

So a lock is now defined as: (Name, Level, Interrupt).  And the locking protocol is now defined as:

1)  If the new lock level is less than the current lock level, then generate a Lock Level Protocol Violation.

2)  If the new lock interrupt priority is less than the current cpu interrupt priority, then generate a Lock Interrupt Protocol Violation.

Next up? SMP’s Thundering Herd Problem.

SMP: Solving The Lock Problem Part 4

Posted on: April 25th, 2012 by stephenbroeker No Comments

My last three blog posts have presented the SMP (Symmetric Multi Processor scaling problem, Performance Solution as presented by Pyramid Technology, and the user space SMP problem. Now, I’d like to delve into the locks themselves.  What actually is a lock?

There are essentually two types of SMP locks: mutexes and semaphores.

SMP Mutex Lock

A mutex (short for mutual exclusion), is a lock that waits for a field to be released before proceeding.  In its simplist incantation, a mutex constantly checks the field in question.  This type of lock thus uses the following alorithm:

mutex (void *addr)  {
while (test_and_set(addr) == 1) ;
}

For this algorithm to work, there must be an atomic operation: test_and_set(void *ptr).  This operation writes a 1 into the given word address.  If the value was previously 0, then test_and_set() returns 0.  Otherwise this operation returns 1.

It is important to note that a mutex does not voluntarily give up the cpu.  In the kernel, this means never.  In user space, the schedular can run another process if it has higher priority.

SMP Semaphore Lock

A semaphore is a lock in which the cpu is released if the field is not free.  In its simplist incantation, a semaphore uses a mutex to try to set the lock.  If the lock is not set, then the semaphore gives up the cpu and tries later.  This type of lock thus uses the following alorithm:

semaphore (void *addr)
{
    while (test_and_set(addr) == 1)
        sleep(0);
}

The call to sleep with a seconds value of 0, simply results in a call to sched().  And sched() will preempt the semaphore if there is another process ready to run.

So why is mutex used at all?  Why not always use a semaphore?  The reason is that the kernel has to use a mutex, it cannot use a semaphore, since there is only a single copy of the kernel that can run on a given cpu.  But in user mode, semaphores allow other processes to run, thus improving system efficiency.

Hello world!

Posted on: April 20th, 2012 by admin 1 Comment

Welcome to WordPress. This is your first post. Edit or delete it, then start blogging!

2012 Beehives: B-Haven Apiary Started

Posted on: April 20th, 2012 by stephenbroeker No Comments

On April 14, we picked up five new queen bees packages from Honey Bee Genetics.  Our intent was to kick of 2012 for B-Haven Apiary.

But first I had to get the apiary ready for the new beehives. If you will recall from B-Haven’s 2011 Annual Report, all four of last years hives died; they did not make it through the Winter and Spring. As a result I decided to employ a Hive Nucleus in the apiary.

New Beehives

I had two supers (20 frames) of honey and pollen left over from last year. I set up four new beehives plus the nucleus. I placed four honey/pollen frames in each empty hive – to give the new hives a good start.

This year, our youngest son-in-law (Mathew Togami) was nice enough to assist me in installing the new beehives. When we went to install the first new hive, we had quite the surprise. A swarm had moved in and taken over – awesome!

Installing Hive Nucleus

So we left the swarm alone and installed the four new beehives and nucleus. I had to rebalance the old honey/pollen frames. So instead of four old frames per hive, we used three old frames in two of the hives, four old frames in three of the hives, and 2 old frames in the nucleus.

All of the new hives look good. The new hives consist of four pounds of bees and a caged queen. I will check in three days to make sure that the new hives have released the queens from their cages. The swarm hive is on a single super. The four new hives are all on a single super. The nucleus is on two supers since each super is only five frames.

In my next bee blog, I will report on how queen acceptance is going. Did the new hives accept their queens? And how is the swarm hive doing?

SMP: Symmetric Multi-Processor User Processes (Part 3)

Posted on: April 18th, 2012 by stephenbroeker No Comments

Continuing on the information presented in Part 1:  The Symmetric Multi Processor (SMP) scaling problem and Part 2: The Performance Solution as presented by Pyramid Technology, my comments were addressed at the kernel.  But what about user processes?  Do they also need to use locking?

The answer is yes, if multiple processes share the same physical memory.  This can occur with Shared Memory. User space proccesses use Shared Memory so that proccesses can easily and quickly share data.  Shared Memory is the fastest Inter Process Communication (IPC) mechanism that can be employed.  But it does have a cost (alas nothing is free): lack of synchronization.

If a single process is writing to shared memory, and a single process is reading the same shared memory, then locking is not needed.

If a single process is writing to shared memory, and multiple processes are reading the shared memory, then locking may or may not be required.  If the shared memory is used as a resource allocator, then locks will be required.  On the other hand, if the shared memory is used to post events, then locks might not be required.

Multiple Processes, Locking Required

If SMP multiple processes are writing to shared memory, then locking will be required.

User space Shared Memory is thus like the kernel.  Each user process can be thought of as a seperate kernel running on different cpu.  Multiple instances of a user process can run on a single cpu.  If they use Shared Memory, they will then all use the same physical memory, just like the kernel.  It turns out that using user processes in this manner is a great way to simulate a kernel SMP environment.

A true SMP system requires hardware memory support, called cache coherency.  A cache coheriancy problem occurs when a cpu writes to physical memory and does not update or invalidate the caches on the other cpus in the system.

Next up I’ll delve more into locks.

SMP: Solving The Performance Problem (Part 2)

Posted on: April 11th, 2012 by stephenbroeker No Comments

In my last blog post, I described the SMP problem.  How adding cpus to an SMP system does not necessarily reliably increase performance.

Pyramid Technology solved this problem by decreasing the granularity of the lock.  This means that the amount of real estate that was controlled by the lock, was decreased.  Thus instead of a single lock for the entire kernel, multiple locks were added for the kernel.  That is, locks were added for each major data structure: Process Table, Schedular, File Systems, etc.

Pyramid found that this increase in kernel locks greatly increased system performance.  But it was not enough, processing were still stalling for long periods of time (greater than 30 seconds), waiting for kernel locks to be released.  The problem was that some of the locks were too course.

Process Table Example

For example: consider the Process Table.  Initially, a single lock was used to control access to this data structure: the Process Table Lock.  This lock was used when entries were added or deleted from the table.  This lock was also used were entries were modified.  Thus, when a process was created or killed, no other cpu could modify an existing process: change the wait state, post interrupts, etc.  The solution was to add a Process Entry Lock.  Thus the Process Table Lock controlled access when entries were added and deleted, and each Process Entry Lock was used to synchronize access to a single Process Table Entry.

This process thus decreased the granularity of the Process Table Lock. More locks were added to the system and the Process Table Lock was used to synchronize less code.

This process had to be repeated until cpus were no longer stalling. The definition of stalling is left up to the designer.  A lower stall time means that the system is more efficient.  It also means that more locks will have to be added.

By the way, the brains behind Pyramid Technology was Rich Hammonds. He designed both the software and hardware at Pyramid.  He is the finest engineer that I have ever had the pleasure of working with.

SMP: Symmetric Multi-Processor Explained (Part 1)

Posted on: April 4th, 2012 by stephenbroeker No Comments

As a software engineer at Pyramid Technology, Symmetric Multi-Processors (SMPs) served as a technology backbone. An SMP system essentually consists of multiple cpus that all use the same physical memory.

Each cpu ran some flavor of the Unix operating system.  In Unix, user spaces processes have seperate physical address spaces, so user processes are not a problem in SMP.  But the kernel maps to hard-wired physical addresses; thus multiple cpus, running the same kernel in SMP, is a problem.

SMP Locks to Avoid Corruption

A problem occurs when data structures are manipulated or changed.  If the same code or related code (on different cpus) is changing the same data structure, at the same time, then data corruption can occur. To prevent this data corruption, we protect the data structures with locks.  That is, a process must obtain a read or write lock before accessing a data structure.  Such operations thus synchronize data access.

So how many locks do we add?

AT&T was one of the first companies to create an SMP system: the 3B5000.  This system consisted of multiple cpus running Unix (AT&T invented Unix) and a single kernel lock.  This single lock was used to make sure that one and only one process was in the kernel at the same time.

If a process was in user mode, then performance was wonderful.  On the other hand, if a process needed to be in kernel mode, then performance could be terrible.  Adding additional cpus did not improve the problem since only a single process could be in the kernel. Thus this lock did not scale.  That is, adding X number of cpus did not result in a linear increase in system performance, since all processes need the kernel at some point.

In this blog entry, I have described the SMP problem.  My next blog post will describe the solution.

Openstack Swift TempAuth Module

Posted on: March 28th, 2012 by stephenbroeker No Comments

Last time I presented a Swift REST API example. Now I’ll explain the Openstack Swift Test Authentication and Authorization System (tempauth). This is an excellent authentication module for Swift All In One (SAIO) and for development work.

Add Tempauth to Openstack Swift Proxy Server

The first thing that you will need to do is to add tempauth to the Proxy Server configuration. So make sure that the following is in proxy-server.conf:

[pipeline:main]
pipeline = catch_errors cache tempauth proxy-server

[app:proxy-server]
account_autocreate = true

[filter:tempauth]
use = egg:swift#tempauth
user_admin_admin = admin .admin .reseller_admin
user_test_tester = testing .admin
user_test2_tester2 = testing2 .admin
user_test_tester3 = testing3

The configuration lines under the [filter:tempauth] section that begin with “user_”, define a users login, password, and privileges. These lines are of the form: user_<login1>_<login2> = <password> <privileges>

Four Users, Four Permissions

This configuration enables tempauth and creates the following four users, each with different permissions.

1) login = admin:admin
password = admin
privileges = .admin, .reseller_admin

2) login = test:tester
password = testing
privileges = .admin

3) login = test2:tester2
password = testing2
privileges = .admin

4) login = test:tester3
password = testing3
privileges = None

Openstack Swift Privileges

So what do the privileges mean?

1) admin
Admin users can do anything within their account.

2) reseller_admin
Reseller Admin users can do anything to any account.

3) None
Non-Admin users can only perform operations per container based on the container’s X-Container-Read and

X-Container-Write ACLs.

To allow a “user” to read the objects in container, then set the container header “X-Container-Read: .r:user”.

To allow a “user” to list the contents of a container, then set the container header “X-Container-Read: .rlistings”.

To allow a “user” to read and list the objects in container, then set the container header

“X-Container-Read: .r:user, .rlistings”.

To allow anyone to write to a container, then set the container header “X-Container-Write: .r:*”.

For complete ACL details, check out Openstack Swift dev documentation.

When a Non-Admin user is created, then the only way to create X-Container-Read and X-Container-Write headers is via a Reseller Admin user.

Or, do you have another solution?

Swift REST API Example

Posted on: March 14th, 2012 by stephenbroeker No Comments

Swift REST API Example

Last post I explained REST interfaces, and promised to use a Swift REST API as an example.   This application programming interface (API) supports the following operations.

1)  Swift REST API Authorization

GET Authorization

————————

This operation is used to obtain an authorization token and URL for a  given user login and password.   This token and URL are then used in any subsequent operations.

2) Swift REST API Account

DELETE Account
————————

Mark an account as deleted.   Swift REST API will clean up the account as time permits.

GET Account
————————

Get the list of containers in an account.

HEAD Account
————————

Get account statistics.

POST Account
————————

PUT Account
————————

Create an account.

3) Swift REST API Container

DELETE Container
————————

Mark a container as deleted. Swift will clean up the container as time permits.

GET Container
————————

Get the list of objects in a container.

HEAD Container
————————

Get container statistics.

POST Container
————————

PUT Container
————————

Create a container.

4) Swift REST API Object

COPY Object
————————

Copy an object from one container to another.

DELETE Object
————————

Delete an object from a container.

GET Object
————————

Get an object from a container.

POST Object
————————

Post meta-data to an object.

PUT Object
————————

Create an object in a container.

What would you add to this Swift REST API example?

REST Interfaces Explained

Posted on: March 7th, 2012 by stephenbroeker No Comments

REST (REpresentational State Transfer) is an interface type that was created in 2000 by Roy Fielding in a doctoral dissertation. It is based on HTTP, so a REST interface is basically composed of GET, PUT, POST, HEAD, and DELETE operations.  I’ll explain REST Interfaces in this post, and follow-up with a Swift REST API example.

REST Interface HTTP Components

First, let’s look at each operation as defined by the HTTP components:

  1. Method
    The operations typically are: COPY, DELETE, GET, HEAD, PUT, and POST.
    The server has complete freedom in defining these operations.
    Typically though, these operations are defined as follows:
    COPY – create a copy of an item.
    • DELETE – delete an item.
    • GET – read an item.
    • HEAD – read item meta-data.
    • PUT – create an item.
    • POST – write item meta-data.
  2. URL
    The URL is commonly of the form:
    <http>://<ip><port>/<version>/<auth>/<parameters>
    Where:

    • <http> = “http” or “https”.
    • <ip> = ip address.
    • <port> = port number.
    • <version> = version string.
    • <auth> = authorization token.
    • <data> = operation specific data.
    • <parameters> = options.
      Parameters begins with a “?”.   Multiple parameters are seperated by a “&”.
  3. Request Headers
    Operation input headers.
  4. Response Headers
    Headers returned by the interface.
  5. Response Data.
    Data returned by the interface in ASCII format.
  6. Status Code
    HTTP status code. This code is in the range 1 – 600 and can be broken down as follows:

    • 1xx Informational.
    • 2xx Success.
    • 3xx Redirection.
    • 4xx Client Error.
    • 5xx Server Error.

Next time I’ll present a SWIFT REST Interface API example. In the meantime, what would you add to this explanation?

Managing a Hive Nucleus

Posted on: March 2nd, 2012 by stephenbroeker No Comments

I just finished reading the February 2012 edition of the American Bee Journal.   In this edition, I discovered a most interesting article:  “The Half-Hive: Setting Up and Managing a Nucleus Hive“.

The author, Larry Connor, does a wonderful job of explaining the proper use of a Hive Nucleus in an apiary. I have never really understood the purpose before. Mr. Connor does such a fine job,  that I am now totally motivated to build and use a nucleus in my apiary this  year.

To start with, a nucleus is essentually a half hive. That is, instead of using a super with 10 frames,  a nucleus uses a super with only 5 frames. The reason being that the purpose of a nucleus is to be an apiary nursery,  not to produce honey,  like a normal hive.

Apiary Nursery

So what exactly is an apiary nursery?   An apiary nursery is a nucleus that is used to produce frames of brood. Once the brood is capped, the frames are moved from the nucleus hive to a  normal hive. Worker bees and the queen must be removed from the brood frames  (shake em baby, shake em)  before they are transferred. Otherwise, you would be moving worker bees from one hive to another  and a fight might break out.
Besides, you want the nucleus work bees to stay at home and produce more brood  frames.

Adding brood frames to a hive, strengthens the hive by quickly adding worker bees. I have tried added swarms to an existing hive, in the past, with minimal  results. The worker bees started fighting and the result was lots of dead bees. Adding brood frames is like adding a swarm without a fight.

You can create as many nukes as you want. But I have decided to use a single nucleus this year in my apiary. I want to start simple. I will ensure that the nucleus always has at least one frame of brood. Extra frames will be moved to normal hives, when the brood is capped. Thus the nucleus will never get larger than a single super.

At the end of the season, I will add a new queen to the nucleus, to help it make it through the winter.  Hopefully, the nucleus will result in stronger honey hives that survive the  winter.

B-Haven 2011 Annual Report

Posted on: February 28th, 2012 by stephenbroeker No Comments

I am very excited about the up and coming Honey Flow.  That is, when the local flowers are blooming and there is sufficient necter production for bees to grow the hives.  The Honey Flow usually starts some time in April.

2011 Honey Production

In the mean time, I would like to recap last years (2011) honey production.  We started the season with a single hive, that had made it successfully through the winter.  I decided to expand the apiary to four total hives.  I thus needed to create three new hives.

In the past, I had always created new hives by purchasing bee packages.  A Bee Package is a box full of bees (4 pounds worth) with a queen bee (in a cage) and a food source (a reservoir of sugar syrup).  The queen is in a cage so that the worker bees have a chance to bond to her scent and thus adopt her as their queen.  If the worker bees do not accept the queen, then they will kill her.

This year I decided to try a new approach to creating a new hive, split an existing hive into two hives.  I had never done this before and had no idea if my attempt would be successfull, even though it was standard procedure in the world of beekeeping.  Splitting a hive envolves the following:

1)  Count the number of frames with brood.

2)  Count the number of frames with only honey.

3) Count the number of frames with only pollen.

4)  Move half of the brood frames into the new hive.

5)  Move half of the honey frames into the new hive.

6) Move half of the pollen frames into the new hive.

7)  Add a caged queen into both hives.

This procedure resulted in two identical hives.  After three days, I opened the hives and made sure that the queens were out of their cages.  The old hive now had two queens.  They will fight it out, let the best female survive!  The ultimate chick fight! So I started with a single hive, and now I had two hives.  To get to my total of four, I created two more new hives from bee packages.  Once I had the apiary up to full strength,  all I had to do was let the bees do their thing.

Spring & Summer Tending

Over the course of the Spring and Summer I checked on the hives once a week.  I made sure that each hive had enough free space.  If a hive runs out of space, then the bees will swarm.  This is essentually hive reproduction, in which the bees create a new queen.  When the new queen emerges from a cell, she mates and leaves to start a new hive.

The new queen cannot survive on her own, so roughly half of the worker population leaves with her, after they have gourged themselves with honey.  The result is that the old hive has approximately half the number of worker bees and the hive thus has more free space.  This process is of no benefit to my apiary and does in fact weaken a hive.  So I wanted to avoid a swarm as much as possible.  Hence I frequently checked each hive and added empty supers if there was not enough free space.  I always erred on the side of caution, better to have to much free space than not enough.

In 2011, the Spring was somewhat delayed, but we made up for it with a long, mild Summer.  This resulted in a tremendous harvest!  We averaged 16 gallons per hive!

The hives had 4 to  5 honey supers.  I had to use a step stool to examine these hives!

I should take this opportunity to explain what a honey super is.  The first two supers are for the hive to raise bees and store honey and pollen for the winter.  Any additional supers are strictly for honey.  I ensure that by placing a screen (called a queen excluder) on top of the second super.  The wholes in this screen is large enough for the workers to travel through, but too small for the queen to squeeze through (she is after all somewhat larger than a worker bee due to her egg laying capabilities).

Once the Honey Flow died down in August, I decided to harvest the honey in September.  And as previously stated, the harvest was awsome.  I collected honey, honey comb, and bee bread.  I also melted down the wax trimmings from the pure honey, so that I could easily save it and use it in the future for candles and such.

Colony Collapse Disorder

But once the summer ended and the temperatures cooled down, I started loosing hives.  Three of the hives experienced Colony Collapse Disorder (CCD).  Which means that the worker bees absconded and the hive was reduced to a very small number of worker bees (less than 50) to keep the hive health and strong through the upcoming Winter.  So why did the worker bees leave?  No one really knows for sure.  The possible reasons include:

1)  The worker bees were killed during foraging by pesticides.

2)  The hive become infected with some kind of disease.

3)  The hive become infested with mites, hive beattles or some other pest.

4)  Parasitic wasps or flies that kill the worker bees.

5)  The hive swarmed and the reduced old hive was too weak.

The bottom line was that once CCD hit, the hive was doomed.  It was not strong enough to make it through the winter.  Two of the hives died in November and a third died in December.  When a hive died, I harvested the remaining honey and bee bread and discarded frames with dead brood.

So I am now left with a single hive.  The hope is that this hive will make it to the Honey Bloom this year, and regain honey production levels reached last year.  My next blog will be about my trip to Honey Bee Genetics to pick up the queens and packages for this years apiary.  The pick up date is April 14.

Swift Storage Devices

Posted on: February 28th, 2012 by stephenbroeker No Comments

In a previous blog, I presented the topology of Openstack Swift Object Storage.   That is, how Swift uses XFS  to store Accounts, Containers, and Objects.

A separate Account Database is created for each Account and an Account Database is a SQLite  instantiation in the file system. An Account Database consists of a Account Stat Table and a Container Table. The Account Stat Table contains meta-data for that specific account.   The Container Table contains an entry for each container in that account.

Similarly,  a separate Container Database is created for each Container  and an Container Database is an SQLite instantiation in the file system. A Container Database consists of a Container Stat Table and an Object Table. The Container Stat Table contains meta-data for that specific container. The Object Table contains an entry for each object in that container.

And finally, object data is stored in XFS files.

So what happens when a Swift Storage Device becomes full?   The storage device is managed by XFS,  so the file system will run out of space when this happens.   Thus XFS will NOT be able to perform subsequent writes,  but reads and deletes will still work.

What happens with the Swift REST API? Reads (GETs) continue to work, this is good.
Writes (PUTs and POSTs) obviously do not work  and this is as expected. But what about deletes (DELETEs)? If a Swift Account is “full”,  then users should be able to free up space in the account.

The sad, fact of the matter is that deletes do NOT work at this point. A DELETE Object operation fails with a 503 Internal Error. This error is the result of a Proxy Server exception. The exception occures because a database update fails. To understand this exception, we must understand how objects are deleted in  Swift.

As previously stated, a Container Database has an Object Table,  which contains an entry for each object in it. The Object Data itself is stored in an XFS file. So to delete an object, the appropriate Object Table entry has to be updated  and marked as “deleted”. The Object Data is left for the Object Reaper to clean up at a later time.

The Problem

The problem is with the Object Table update. This is a write, and since the file system is full, this update will fail.

So how do we get around this problem?   The Swift Account is full and users CANNOT free up account space. Currently, OpenStack recommends monitoring the Storage Devices so that an alarm  is generated when a device becomes 80% full. The Swift Administrator then,  has to add new Storage Devices to the Swift Rings. This involves modifying the rings and restarting the Proxy Servers. The Account will no longer be full,  since it now has additional space,  but users will NEVER be able to delete objects and containers from the full  storage device.

The Solution?

This solution is less than adequate and is clearly lacking in Enterprise quality. I would now like to propose more robust solutions, two in fact.

Solution 1: create a utility that removes objects from the full Storage Device. This utility would remove the Object Data file and then update the Object Table. The customer would obviously have to identify which objects should be removed. Once there is enough free space,  the customer could then use the normal DELETE operations to free up further
space. This solution is good in that it allows an account to be quickly repaired,  but it is bad in that a cluster administrator has to perform the action.

Soultion 2: create a configurable Swift Storage Device High Water Mark (HWM). The HWM would be expressed as a storage percentage,  that is how full can any Storage Device get before PUT operations are turned off?  So when the HWM is hit for any Storage Device, all PUT operations would fail for all accounts. But users could still perform GETs, POSTs, and DELETEs. They could thus still manage their accounts. The cluster admin would then have time to add storage devices,  without users clammering about “full” accounts.

Database Performance with Openstack Swift Improvements – Part 3

Posted on: February 22nd, 2012 by stephenbroeker No Comments

In Part 2 of this blog post series, I proposed replacing SQLite with MySQL as a database engine and changing the database schema to use chunking. These changes ensure that database performance is consistent for Account and Container Databases.

But what about Objects? As previously stated, Object data is stored in files. Are there problems with this?

Yes. Storing Object data in files is nice in that it is simple and Object data is easy to find – just look in the file system. But what about performance?

A file system adds a lot of overhead to the equation, functionality is not free. There are a number of problems with this approach.

Database Performance Problem 1: Buffer Cache

File systems use an in memory buffer cache to speed up repetitive, sequential, small file IO. But Swift Objects are written once with no partial updates or reads. Object are always completely rewritten. They are also always completely read. So the buffer cache sucks up a lot of memory and does not improve performance. In fact it degrades performance because of overhead.

Database Performance Problem 2: Multi Use of the File System

You will recall that the file system is also used to store the Account and Container databases. In addition, XFS supports Extended Attributes. These allow name/value pairs to be attached to a file. Swift uses XFS Extended Attributes to store user defined headers (meta data).

The end result is that the file system is multi use: database, meta data, and object data. This results in poor disk performance. All three of these components use the file system in different ways and thus have different effects on the file system and causes constant disk head seeks. For the best performance, we want to minimize seeks and keep the disk heads moving in one direction.

Database Performance Problem 3: Using File System Meta Data

Querying file system meta data performs poorly. File systems are great at moving data. They are terrible at querying data. Traversing inodes and data blocks is expensive. Consider the code to do an “ls”. This is basically a straight forward problem. Get the inode for a directory, perform multiple reads on the inode data, and filter and sort the results. This is a substantial ammount of code. Now consider if the meta data was in a database. The “ls” code becomes a single database query.

Database Performance Problem 4: All Components in a Single Repository

The database, meta data, and object data are all in the same file system. They thus are all in the same repository. Thus if the repo is down, then nothing is available. Now consider moving the meta data into the database and moving the database to a different repo. The result is that the database is still available if the file system is down. And the file system is still available if the database is down.

Database Performance Solution

The solution to these four problems is to move the file system meta data into the database and then move the database to another repo. In addition, scrap the file system and change the database to point directly into raw disk partitions. This means that Object Data would no longer be stored as files but the Object Table (in the Container Database) would point to the: partition name, partition offset, and object size. An object would thus be defined by the three tuple: (partition, offset, size). We thus remove the file system in all of its complexity and overhead and use raw disk partitions. To summerize:

  1. Replace SQLite with MySQL.
  2. Separate the database repo from the file system.
  3. Remove the file system.
  4. Move all meta data into the database.
  5. Change Object Table to use raw disk partitions.

That’s how I’ve fixed the problem. How have you solved these database performance issues?

Openstack Swift Database Performance Improvements Part 2

Posted on: February 15th, 2012 by stephenbroeker No Comments

In my previous Openstack post I established the groundwork for my proposed solution to improve the performance of Openstack Swift Database that consists of two parts: MySQL and Database Chunking.

Database Performance with Openstack Improvement: MySQL

For the first part of the solution, I propose replacing SQLite with MySQL as a database engine. As the name implies, SQLite is fine for small databases, but has performance problems with larger Openstack databases. MySQL is perfect for this problem, since a database is represented as a file system directory, and database tables are represented as files.

Openstack Performance Improvement: Database Chunking

For the second part of the solution, I propose using Database Chunking. That is, breaking up the Container and Object Tables into chunks. The result would be database queries on reasonably sized tables. The table structure would be tiered and tables would either be of type “index” or “data”. Index Tables would point to the next level of table, which would be “index” or “data”. Data Tables contain the actual table data and are thus leaf nodes in the table schema.

The optimal size of each table chunk would have to be determined through experimentation, but for purposes of argument, let us assume a table chunk size of 100,000 rows. So for an empty Account, the Container Table schema would be:

Container Index 1 -> empty

After the first container is created, the Container Table schema would be:

Container Index 1 -> Container Data 1
Container Data 1 -> 1 row

When container number 100,001 gets created, the Container Table schema would be:

Container Index 1 -> Container Data 1
Container Data 2
Container Data 1 -> 100,000 rows
Container Data 2 -> 1 row

So Container Index 1 can map (100,000 X 100,000 = 10 Billion) containers. When container number 10,000,000,001 gets created, the Container Table schema would be:

Container Index 1 -> Container Index 2
Container Index 3
Container Index 2 -> Container Data 1

Container Data 100,000
Container Index 3 -> Container Data 100,001
Container Data 1 -> 100,000 rows

Container Data 100,000 -> 100,000 rows
Container Data 100,001 -> 1 row

This Database Table Schema will essentually scale forever. Thus these database changes will ensure that database performance is consistent for Accounts and Containers. But what about Objects? In my next blog, I will first identify problems with Swift Object data storage and then present a solution.

In the meantime, how do you deal with Objects?

Openstack Swift Database Performance Part 1

Posted on: February 8th, 2012 by stephenbroeker No Comments

In my last two posts (Python’s Strengths & Weaknesses) I have been describing the operation of Openstack Swift Storage. Swift storage basically consists of four components: Ring, Database, Zones, and File system. I’m proposing some performance improvements to this design. But first we need to understand the Swift database schema. An Openstack Account Database consists of two tables: Account Stat and Container. And a Container Database consists of two tables: Container Stat and Object.

Openstack Account Stat Table

The following is a detailed view of the Openstack Database Account Stat Table:
account
created_at
put_timestamp
delete_timestamp
container_count
object_count
bytes_used
hash
id
status
status_changed_at
metadata

Openstack Container Table

And the following is a detailed view of the Container Table:
ROWID
name
put_timestamp
delete_timestamp
object_count
bytes_used
deleted

Notice that both the Account Stat Table and the Container Table have deleted attributes. These attributes are required since rows in these tables are never deleted, they are just marked as deleted. The reason that rows are not deleted is that this would require some time of synchronization (locking), in case another thread was accessing the same database. And we all know that locking in the Cloud is a very bad thing, it would destroy scaling. So these tables are append or update, deletes are not allowed.

This is all well and good for performance, but happens when these tables grow? The Account Stat Table will never grow, it will always have one and only one row. But the Container Table will grow with time, as containers are created for the account. So what happens to SQLite performance when the Container Table gets large? First, since an SQLite database is a file, file performance will degrade as the file grows. Second, database query performance will also degrade as the file grows.

Container Database

The following is a detailed view of the Container Stat Table:
account
container
created_at
put_timestamp
delete_timestamp
object_count
reported_put_timestamp
reported_delete_timestamp
reported_object_count
reported_bytes_used
hash
id
status
status_changed_at
metadata

Object Table

And the following is a detailed view of the Object Table:
ROWID
name
created_at
size
content_type
etag
deleted

Notice that both the Container Stat Table and the Object Table have deleted attributes, just like the Account Database, and for the same reasons. So both the Container and Objects Tables will have performance problems as containers and objects are created over time.

Next time I’ll propose a solution that consists of two parts: MySQL and Database Chunking.

Python: It’s Programming Weaknesses

Posted on: February 1st, 2012 by stephenbroeker No Comments

In my last post, I presented Python and it’s strengths. To be fair, I’d like to continue and share my thoughts about some its weaknesses.

Python’s Weaknesses

1. Documentation Revisited

Python programs are poorly documented. On of the goals of Python is that the code should be self-documenting; programs are thus written in such a way that you don’t need comments.  This is a wonderful little theory that doesn’t work so well in the  real world. For anyone who has had to scroll through code, good luck trying to understand it. A little judicious documentation goes a long way in explaining algorithms and finding bugs.

2. Where’s The White Space?

Another Python standard that is white space is discouraged. This results in code that is extremely difficult to read. I understand the goal is to limit the size of each function to one page but extremes are dangerous and I would like to see a compromise of limiting the size and function and including proper comments. In fact the whole point of Sphinx is to convert code comments to code documentation. Sphinx can’t work unless there’s code comments.

3. Debugging

I have noticed that a lot of Python code is defensive. By that I mean the code assumes correctness and that any bugs will be easy to find at run time. Again this is another wonderful little theory that doesn’t work out so well in the real world. A lot of bugs are a result of data structure corruption; the quicker code can detect this corruption the easier it is to find and fix bugs. Exceptions are standard Python feature and are great. Asserts are another feature that is unfortunately not commonly used. Asserts should be used for assumptions that should “never” happen. They are in essence self documenting and help greatly in error detection and debugging.

4. Performance

Because it is an interpreter Python will obviously be slower than compiled code. But Python is implemented in C and it creates “compiled” .pyc files. The performance will not match programming languages like C.

For those Python experts out there, what weaknesses have you experienced? I invite you to share your thoughts about my analysis.

 

Python: It’s Programming Strengths

Posted on: January 21st, 2012 by stephenbroeker 1 Comment

Python is a relatively new programming language. It can be thought of an extension of Perl. Essentially Python is an interpreter. In my current position as a Cloud Architect at Internap, I have been exclusively using it on the OpenStack project.

Though I am not an expert in Python (been using it for less than a year), I am an accomplished programmer in other languages like C. With this experience I bring a different viewpoint to the Python discussion. As a Python newbie, I understand that it has a different perspective than C. I am not opposed to change. I am not stuck in the past. I embrace the rapid progress in the computer engineering community.

Python seems to be the new hot kid on the block, and here’s my opinion about it’s strengths:

Python’s Strengths

1. Interpreter

It’s an interpreter; you don’t have to pre-compile your program. It is thus interactive.   So in theory, a Python program that works on a given system should work on any other system.

2. Open Source & Python

A lot of newer Open Source code is written in Python. By that I mean it is in wide use. Thus the engineering community is focused on its use. This should result in Python becoming refined over time.

3. Do You Have Standards?

There is a serious effort put into standardization. Excellent examples are Pep8 and Pylint.

4. Documentation

There’s a wonderful document generator called Sphinx that is similar to Doxygen. I am a big fan of using Doxygen in my C programs. The basic idea of Sphinx is that the program comments should be good enough to generate proper design documents.  This means Python functions need to have comments (surrounded by “””).  Such comments are in fact required by pylint.

5. Data Types

It supports high level data types like lists, dictionaries, and tuples. This makes it much easier to design data structures.

For those Python experts out there, what strengths have you experienced? I invite you to share your thoughts about my analysis.

Thanks for checking out my blog

Posted on: November 30th, 2011 by stephenbroeker No Comments

If you are interested in Cloud Computing, or OLAP for streaming data, then drop me a line. Especially if you are seeking research assistance in these – or related areas.

If you are interested in bees, making honey, specialty beers, extraordinary cuisine, or roses, then drop me a line on any of these topics.

I also like football and archery, but not as a combination. You might be able to provoke a discussion with me on these topics if your strong opinions are supported by extensive knowledge, critical thinking skills, and civility brought about by self-directed maturity.