Echoes of my mind‎ > ‎Scribbles‎ > ‎

DB polling done right

Most database polling is naive polling - simply get the rows that need to be processed, process them and mark them done - beautiful, simple, and very naive.

Basic naive pseudocode:

foreach row in (
    select id,data,status 
    from table
    where status = 'ready'
) do {
    result = process(row.data);
    update table
    set status = result
    where id = row.id;
}

Of course, depending on the situation, there are better or worse ways you can do this naively and each "row" may be a batch of rows but this has all the basic features you want - each row is processed independently and doesn't affect the other rows and results (success? error?) are recorded.

Hard vs soft delete

Hard delete means that when you've finished processing a row, you remove it from the table. Soft delete means you update a (status) column to indicate that the row has been processed.

I'm also assuming an ideal table here. There is an ID column. There is a status column. Sometimes these are not so obvious. The ID column really stands for a unique identifier - something that uniquely identifies each row - it may in fact be a composite key across several columns, possibly including the status column. The status column may be an monotonic sequence, perhaps a processed or created date; or may be missing completely - using a hard delete strategy instead of soft delete. There are strategies for dealing with all of these complications, but ultimately that all they are - complications.

DB polling done wrong

So, "What's the problem?" you ask. The problem is that as soon as you introduce any sort of concurrency - which can happen even when you only have one poller - you risk rows being processed multiple times and/or data corruption.

Comments