Thursday, 13 August 2009

Debugging pecl/memcache

Some background: we have 14 servers, each running both Apache and memcached on the same machine. Each machine has 32 GB of RAM, of which memcached is allocated 25 GB.

Over the last few days we've started experiencing problems with our web servers: they start using all the CPU, and the load average climbs to 10-14 (it is rarely above 2 under normal conditions). Debugging suggests the cause is either pecl/memcache 3.0.4 or memcached itself.

Running strace on the httpd process reveals a seemingly infinite loop around the select() syscall, waiting for activity on a socket. The socket is a connection to one of the memcached (1.2.6) servers.

The relevant source in pecl/memcache 3.0.4 looks like this:

void mmc_pool_select(mmc_pool_t *pool TSRMLS_DC) /*
  runs one select() round on all scheduled requests {{{ */
{
  ...
    result = select(nfds + 1, &(pool->rfds), &(pool->wfds), NULL, &tv);
  ...
  [lots of code for sending and receiving data]
}


Nothing seems odd about the code in mmc_pool_select(), and there are no loops inside it that could call select() endlessly. The most likely source of the loop is mmc_pool_run():

void mmc_pool_run(mmc_pool_t *pool TSRMLS_DC) /*
  runs all scheduled requests to completion {{{ */
{
  ...
  while (pool->reading->len || pool->sending->len) {
    mmc_pool_select(pool TSRMLS_CC);
  }
}


It looks like something keeps triggering the socket, so that select() returns and mmc_pool_select() runs again, but nothing is ever actually received or sent. As a result, pool->reading->len and pool->sending->len never drop to 0, and you get an infinite loop.
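To illustrate how select() can keep firing without any data moving, here is a minimal sketch. It is not pecl/memcache code and not a confirmed diagnosis; it only assumes one possible trigger, namely a peer that has closed its end of the connection. The helper name count_ready_rounds() is made up for this example. A socket at end-of-file stays readable, so a loop that never consumes or handles that state sees select() return immediately every round.

```c
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of pecl/memcache).
 * Creates a connected socket pair, closes one end, then calls select()
 * on the other end max_rounds times without ever reading from it.
 * Returns how many rounds reported the socket as readable, or -1 if
 * the socket pair could not be created.
 */
int count_ready_rounds(int max_rounds)
{
    int sv[2];
    int ready_rounds = 0;
    int i;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return -1;
    }

    /* Assumed trigger: the peer goes away. sv[0] is now readable at EOF. */
    close(sv[1]);

    for (i = 0; i < max_rounds; i++) {
        fd_set rfds;
        struct timeval tv;

        /* select() modifies both the fd set and the timeout: reset them. */
        FD_ZERO(&rfds);
        FD_SET(sv[0], &rfds);
        tv.tv_sec = 1;
        tv.tv_usec = 0;

        /* Nothing reads from sv[0], so the "readable" state never clears. */
        if (select(sv[0] + 1, &rfds, NULL, NULL, &tv) > 0 &&
            FD_ISSET(sv[0], &rfds)) {
            ready_rounds++;
        }
    }

    close(sv[0]);
    return ready_rounds;
}
```

Every round reports the socket as ready, which is the pattern strace shows. In a loop like mmc_pool_run(), the exit condition would then only be reached if the handling code removes the request from the queue when a read returns 0 bytes or an error.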

What is the most likely source of the problem here: pecl/memcache or memcached itself? If it's the former, we could probably solve it by switching to pecl/memcached.
