Echoes of my mind‎ > ‎Scribbles‎ > ‎

DB polling done right

DB polling done wrong

Most database polling is naive polling - simply get the rows that need to be processed, process them and mark them done - beautiful, simple, and very naive.

Basic naive polling pseudocode:
foreach row in (
  select id,data,status
  from table
  where status = 'ready'
) do {
  result = process(row.data);
  update table
  set status = result
  where id = row.id;
}

Of course, depending on the situation, there are better or worse ways you can do this naively and each "row" may be a batch of rows but this has all the basic features you want:
  • each row is processed independently
  • and doesn't affect the other rows
  • and results (success? error?) are recorded.

Hard vs soft delete

Hard delete means that when you've finished processing a row, you remove it from the table. Soft delete means you update a (status) column to indicate that the row has been processed.
I'm also assuming an ideal polling table here. There is an ID column. There is a status column. Sometimes these are not so obvious. The ID column really stands for a unique identifier - something that uniquely identifies each row - it may in fact be a composite key across several columns, possibly including the status column. The ID or status columns may be a monotonic sequence, perhaps a processed or created date; or may be missing completely - using a hard delete strategy instead of soft delete. There are strategies for dealing with all of these complications, but ultimately that all they are - complications.

So, "What's the problem?" you ask. The problem is that as soon as you introduce any sort of concurrency - which can happen even when you only have one poller - you risk rows being processed multiple times and/or data corruption. The cause can be as simple as taking longer to process a row that there is between polls.

A better way

The solution is to get exclusive control over the rows you're processing. For an ideal polling table, we do it like this:

Basic ideal table polling pseudocode:
if 0 < (
  select count(1)
  from table
  where status = 'ready'
) then {
  var batchkey = newhash();
  update table
  set status = batchkey
  where status = 'ready';

  foreach row in (
    select id,data,status
    from table
    where status = batchkey
  ) do {
    result = process(row.data);
    update table
    set status = result
    where id = row.id;
  }
}

By setting the status to a random batchkey each time we poll the table, any concurrent polling will skip those rows because their status is no longer 'ready'. For all modern databases this will also lock the table for the duration of the batch process. If your row processing takes a long time (which may be why you have concurrency problems at all) then you want to release the table as soon as possible (aside from generally being a good practise in itself). With this algorithm you can safely commit the changes after every table interaction - check your SQLs flavour, there may be a way to auto-commit after each operation. I'm not going to get in working with transactions here as they really deal with the details of the process, not the polling.

Processing errors

Before I look at the various complications, I should address the process result/status. Both of the algorithms above assume that the process result is a) obtainable and b) recordable. If you're going to abstract the database polling, then it needs to be separated from the processing - and the result. For this we can really only know about two results 1) nothing bad happened or 2) something bad happened. While it's generally a Good Thing™ to handle your own errors and not pass them up the chain, we can't assumes that the process will do that. We do this by wrapping the processing like so:

Catch exceptions caused by processing:
...
    do {
      try {
        process(row.data);
      } catch (exception e){maybetellsomeone('row.id had a boo boo')}
      update table
      set status = 'done'
      where id = row.id;
    }

...

..or better..
...
    do {
      try {
        process(row.data);
        update table
        set status = 'done'
        where id = row.id;
      } catch (exception e){
        maybetellsomeone('row.id had a boo boo');
        update table

        set status = 'error'
        where id = row.id;
      }

    }
...

..and if you're working with a hard delete strategy then..
...
    do {
      try {
        process(row.data);
      } catch (exception e){definitelytellsomeone('row.id had a boo boo')}
      delete from table
      where id = row.id;
    }

...

There are a bunch of possible variations here - e.g. maybe you want to retry erroring rows so set them back to 'ready' (or don't delete them). Going forward I'm going to work with a good - error handling - process, so I can keep the code here a bit simpler.

Complications

Polling table has IDs BUT no status

Ah!

Polling table has a sequence instead of IDs

Aah!

Polling table has a sequence instead of IDs AND no status

Aaah!!

Polling table has IDs and uses a hard delete strategy

Aaaah!!1

Polling table has no IDs at all and uses a hard delete strategy

Aaaah!!1


Comments