Hashing tables to ensure consistency in Postgres, Redshift and MySQL
Maintaining cache consistency
Maintaining a data cache optimized to serve different queries can help speed them up, and benefits from economies of scale.
Maintaining this cache leads us to a critical question: How do I know if the cache is still valid? Put another way: How do I know if the data in a table in database A matches the data in a table in database B?
Enter hashing, a general technique to detect if two datasets are the same. We can use a hash to validate that our cache is fresh without needing to understand any application-specific logic.
What we need now is a query that returns the exact same hash given the same table structure and data on all databases.
The algorithm
For each row:
- Take the MD5 of each column. Use a space for NULL values.
- Concatenate those hashes, and take the MD5 of the concatenation.
- Split into 4 8-character hex strings.
- Convert into 32-bit integers and sum.
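The steps above can be sketched in Python (a client-side illustration of the algorithm, not part of the original toolchain; column values are assumed to already be rendered as canonical strings):

```python
import hashlib

def table_signature(rows):
    """Per row: MD5 each column (a space for NULL), concatenate and MD5
    again, split the digest into four 8-char hex chunks, and sum each
    chunk position across rows as a 32-bit integer."""
    sums = [0, 0, 0, 0]
    for row in rows:
        col_hashes = [
            ' ' if v is None else hashlib.md5(v.encode()).hexdigest()
            for v in row
        ]
        row_digest = hashlib.md5(''.join(col_hashes).encode()).hexdigest()
        for i in range(4):
            sums[i] += int(row_digest[8 * i:8 * i + 8], 16)
    return sums
```

Because the chunks are summed, the signature is independent of row order, matching what the SQL queries below return.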
We choose MD5 as our hash function because it’s fast and supported on all databases.
We break each row hash into integer-sized chunks and sum them across rows, leaving just four bigints per table, in order to save memory and transfer time. Ideally we'd hash all the individual row hashes into a single digest, but this isn't possible on all databases.
Finally, note that we must convert the columns into the same format before encoding in step (1) to ensure cross-database consistency.
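Concretely, that normalization might look like this in Python (our own sketch; the rendering choices mirror the SQL casts used below: booleans as 0/1, datetimes as floored epoch seconds):

```python
import math
from datetime import datetime, timezone

def canonical(value):
    """Render a value as the canonical string the SQL casts produce,
    so the same cell hashes identically on every database (a sketch)."""
    if value is None:
        return None  # NULLs are handled later: hashed as a single space
    if isinstance(value, bool):
        return str(int(value))  # true/false -> '1'/'0'
    if isinstance(value, datetime):
        # naive datetimes assumed UTC; hash the floored epoch seconds
        ts = value.replace(tzinfo=timezone.utc).timestamp()
        return str(math.floor(ts))
    return str(value)
```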
In Postgres
Taking the MD5 of a column looks like this:
md5("column name"::text)
Some extra massaging may be required for more complex types. Examples for integer, datetime, text, and boolean columns are shown below.
Now we’ll layer on spaces for NULL values:
coalesce(md5("column name"::text), ' ')
Concatenating and hashing those results is a simple matter:
select md5(
coalesce(md5("column name"::text), ' ') ||
coalesce(md5("second column name"::text), ' ')
) as "hash"
from "my_table"."my_schema"
We then wrap this all in a subquery so we can split the result into 4 8-character hex strings, which are each converted into 32-bit integers and summed.
As we add that in, we get the final query:
select
sum(('x' || substring(hash, 1, 8))::bit(32)::bigint),
sum(('x' || substring(hash, 9, 8))::bit(32)::bigint),
sum(('x' || substring(hash, 17, 8))::bit(32)::bigint),
sum(('x' || substring(hash, 25, 8))::bit(32)::bigint)
from (
select md5(
coalesce(md5("integer column"::text), ' ') ||
coalesce(md5(floor(
extract(epoch from "datetime column")
)::text), ' ') ||
coalesce(md5("string column"::text), ' ') ||
coalesce(md5("boolean column"::integer::text), ' ')
) as "hash"
from "my_schema"."my_table";
) as t;
Note the ‘x’ prepended to the hash strings, which tells Postgres to interpret them as hex strings when casting to a number.
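For intuition, each of those casts is just a base-16 parse of an 8-character chunk; in Python (our illustration, not from the post) the same conversion is:

```python
# An 8-char hex chunk parses to a 32-bit unsigned integer.
chunk = 'deadbeef'
value = int(chunk, 16)
print(value)  # 3735928559, i.e. 0xDEADBEEF
```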
In Redshift
Redshift supports the handy strtol function, making our hash-string-to-integer conversion a bit easier. Otherwise, the full query is the same:
select
sum(trunc(strtol(substring(hash, 1, 8), 16))),
sum(trunc(strtol(substring(hash, 9, 8), 16))),
sum(trunc(strtol(substring(hash, 17, 8), 16))),
sum(trunc(strtol(substring(hash, 25, 8), 16)))
from (
select md5(
coalesce(md5("integer column"::text), ' ') ||
coalesce(md5(floor(
extract(epoch from "datetime column")
)::text), ' ') ||
coalesce(md5("string column"::text), ' ') ||
coalesce(md5("boolean column"::integer::text), ' ')
) as "hash"
from "my_schema"."my_table"
) as t;
In MySQL
MySQL sports a few changes from the Postgres and Redshift variants:
First, identifiers are quoted with backticks, and the syntax for casting many of the columns to helpful strings is different, e.g. for datetimes:
floor(unix_timestamp(`datetime column`))
Second, an explicit concat call is required to concatenate the column hashes, since we’re missing Postgres’s || syntax.
Finally, we use conv to get our 32-bit numbers, and cast to cast them to integers:
cast(conv(substring(hash, 1, 8), 16, 10) as unsigned)
Putting it all together, we get this final query:
select
sum(cast(conv(substring(hash, 1, 8), 16, 10) as unsigned)),
sum(cast(conv(substring(hash, 9, 8), 16, 10) as unsigned)),
sum(cast(conv(substring(hash, 17, 8), 16, 10) as unsigned)),
sum(cast(conv(substring(hash, 25, 8), 16, 10) as unsigned))
from (
select md5(
concat(
coalesce(md5(`integer column`), ' '),
coalesce(md5(floor(
unix_timestamp(`datetime column`)
)), ' '),
coalesce(md5(`string column`), ' '),
coalesce(md5(cast(`boolean column` as signed)), ' ')
)
) as `hash`
from `my_table`
) as t;
In practice
The end result is four bigints representing the state of the table. Changing any row will, with overwhelming probability, change the result.
All this being said, we can't recommend that you write this all by hand! We've open-sourced a simple Go script to build the hash query given a database type and a list of column names and types.