Hashing tables to ensure consistency in Postgres, Redshift and MySQL
Maintaining cache consistency
Maintaining a data cache optimized to serve different queries can help speed them up, and benefits from economies of scale.
Maintaining this cache leads us to a critical question: How do I know if the cache is still valid? Put another way: How do I know if the data in a table in database A matches the data in a table in database B?
Enter hashing, a general technique to detect if two datasets are the same. We can use a hash to validate that our cache is fresh without needing to understand any application-specific logic.
What we need now is a query that returns the exact same hash given the same table structure and data on all databases.
The algorithm
For each row:
- Take the MD5 of each column. Use a space for NULL values.
- Concatenate those hashes, and take the MD5 of the concatenation.
- Split into 4 8-character hex strings.
- Convert into 32-bit integers and sum.
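The steps above can be sketched in Python (a client-side illustration of the algorithm, not part of the original toolchain; column values are assumed to already be rendered as canonical strings):

```python
import hashlib

def table_signature(rows):
    """Per row: MD5 each column (a space for NULL), concatenate and MD5
    again, split the digest into four 8-char hex chunks, and sum each
    chunk position across rows as a 32-bit integer."""
    sums = [0, 0, 0, 0]
    for row in rows:
        col_hashes = [
            ' ' if v is None else hashlib.md5(v.encode()).hexdigest()
            for v in row
        ]
        row_digest = hashlib.md5(''.join(col_hashes).encode()).hexdigest()
        for i in range(4):
            sums[i] += int(row_digest[8 * i:8 * i + 8], 16)
    return sums
```

Because the chunks are summed, the signature is independent of row order, matching what the SQL queries below return.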
We choose MD5 as our hash function because it’s fast and supported on all databases.
We break each row hash into integer-sized chunks and sum them across rows, leaving just four bigints per table, in order to save memory and transfer time. Ideally we'd hash all the individual row hashes into a single digest, but this isn't possible on all databases.
Finally, note that we must convert the columns into the same format before encoding in step (1) to ensure cross-database consistency.
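Concretely, that normalization might look like this in Python (our own sketch; the rendering choices mirror the SQL casts used below: booleans as 0/1, datetimes as floored epoch seconds):

```python
import math
from datetime import datetime, timezone

def canonical(value):
    """Render a value as the canonical string the SQL casts produce,
    so the same cell hashes identically on every database (a sketch)."""
    if value is None:
        return None  # NULLs are handled later: hashed as a single space
    if isinstance(value, bool):
        return str(int(value))  # true/false -> '1'/'0'
    if isinstance(value, datetime):
        # naive datetimes assumed UTC; hash the floored epoch seconds
        ts = value.replace(tzinfo=timezone.utc).timestamp()
        return str(math.floor(ts))
    return str(value)
```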
In Postgres
Taking the MD5 of a column looks like this:
md5("column name"::text)
Some extra massaging may be required for more complex types. Examples for integer, datetime, text, and boolean columns are shown below.
Now we’ll layer on spaces for NULL values:
coalesce(md5("column name"::text), ' ')
Concatenating and hashing those results is a simple matter:
select md5(
coalesce(md5("column name"::text), ' ') ||
coalesce(md5("second column name"::text), ' ')
) as "hash"
from "my_table"."my_schema"
We then wrap this all in a subquery so we can split the result into 4 8-character hex strings, which are each converted into 32-bit integers and summed.
As we add that in, we get the final query:
select
sum(('x' || substring(hash, 1, 8))::bit(32)::bigint),
sum(('x' || substring(hash, 9, 8))::bit(32)::bigint),
sum(('x' || substring(hash, 17, 8))::bit(32)::bigint),
sum(('x' || substring(hash, 25, 8))::bit(32)::bigint)
from (
select md5(
coalesce(md5("integer column"::text), ' ') ||
coalesce(md5(floor(
extract(epoch from "datetime column")
)::text), ' ') ||
coalesce(md5("string column"::text), ' ') ||
coalesce(md5("boolean column"::integer::text), ' ')
) as "hash"
from "my_schema"."my_table";
) as t;
Note the ‘x’ prepended to the hash strings, which tells Postgres to interpret them as hex strings when casting to a number.
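For intuition, each of those casts is just a base-16 parse of an 8-character chunk; in Python (our illustration, not from the post) the same conversion is:

```python
# An 8-char hex chunk parses to a 32-bit unsigned integer.
chunk = 'deadbeef'
value = int(chunk, 16)
print(value)  # 3735928559, i.e. 0xDEADBEEF
```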
In Redshift
Redshift supports the handy strtol function, making our hash-string-to-integer conversion a bit easier. Otherwise, the full query is the same:
select
sum(trunc(strtol(substring(hash, 1, 8), 16))),
sum(trunc(strtol(substring(hash, 9, 8), 16))),
sum(trunc(strtol(substring(hash, 17, 8), 16))),
sum(trunc(strtol(substring(hash, 25, 8), 16)))
from (
select md5(
coalesce(md5("integer column"::text), ' ') ||
coalesce(md5(floor(
extract(epoch from "datetime column")
)::text), ' ') ||
coalesce(md5("string column"::text), ' ') ||
coalesce(md5("boolean column"::integer::text), ' ')
) as "hash"
from "my_schema"."my_table"
) as t;
In MySQL
MySQL sports a few changes from the Postgres and Redshift variants:
First, identifiers are quoted with backticks, and the syntax for casting many of the columns to helpful strings is different, e.g. for datetimes:
floor(unix_timestamp(`datetime column`))
Second, an explicit concat call is required to concatenate the column hashes, since we’re missing Postgres’s || syntax.
Finally, we use conv to get our 32-bit numbers, and cast to cast them to integers:
cast(conv(substring(hash, 1, 8), 16, 10) as unsigned)
Putting it all together, we get this final query:
select
sum(cast(conv(substring(hash, 1, 8), 16, 10) as unsigned)),
sum(cast(conv(substring(hash, 9, 8), 16, 10) as unsigned)),
sum(cast(conv(substring(hash, 17, 8), 16, 10) as unsigned)),
sum(cast(conv(substring(hash, 25, 8), 16, 10) as unsigned))
from (
select md5(
concat(
coalesce(md5(`integer column`), ' '),
coalesce(md5(floor(
unix_timestamp(`datetime column`)
)), ' '),
coalesce(md5(`string column`), ' '),
coalesce(md5(cast(`boolean column` as signed)), ' ')
)
) as `hash`
from `my_table`
) as t;
In practice
The end result is four bigints representing the state of the table. Changing any row will, with overwhelming probability, change the result.
All this being said, we can't recommend that you write this all by hand! We've open-sourced a simple Go script to build the hash query given a database type and a list of column names and types.