Skip to content

Better know an algorithm: Burrows-Wheeler transformation

Introduction

I’ve always thought this algorithm was particularly awesome, and you’ll see why. Here’s the description by wikipedia:

The Burrows–Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding.

So a string such as “BANANA” is rendered as “BNN|AAA” (the | denotes EOF, we’ll cover this later). The algorithm can take a piece of text with repeating _sequences_ and transform it into a string of repeating letters. Remarkably, this process is reversible. The strings that benefit the most from this kind of transformation are ones with repeating sets of characters. Imagine a DNA sequence such as follows: “CGATCGATCGATCGATCGAT”. This would not be typically compressible by a scheme such as Run-Length Encryption. But after a BWT application this becomes “GGGGG|TTTTCCCCCAAAAAT”, which can be efficiently compressed. And on decompression the string can be transformed back into its original string form.

The Algorithm

def bw(string)
  string += "|"
  columns = []
  0.upto(string.length - 1) do |n|
    columns.push(string.chars.rotate(n*-1))
  end
  columns.sort!.map { |c| c[-1] }
end

    Steps

  • Add an EOF (“|”) character to mark the end of the string, this is needed for reversing the process
  • Create an array of cyclic rotated versions of the string
  • Sort the array lexographically, so similar sequences are next to each other
  • Take the last column

In the canonical example of “BANANA” we go through the following steps:

    BANANA

    BANANA|

    BANANA|
    |BANANA
    A|BANAN
    NA|BANA
    ANA|BAN
    NANA|BA
    ANANA|B

    ANANA|B
    ANA|BAN
    A|BANAN
    BANANA|
    NANA|BA
    NA|BANA
    |BANANA
  

Why does this work?

Initially, I implemented the algorithm and then just stared at it, wondering how the hell it could possibly work. The key insight is that it works primarily with repeating sequences of characters. The DNA sequence example shows this more clearly:

 ATCGATCGAT|CG
 ATCGAT|CGATCG
 AT|CGATCGATCG
 CGATCGATCGAT|
 CGATCGAT|CGAT
 CGAT|CGATCGAT
 GATCGATCGAT|C
 GATCGAT|CGATC
 GAT|CGATCGATC
 TCGATCGAT|CGA
 TCGAT|CGATCGA
 T|CGATCGATCGA
 |CGATCGATCGAT

By grouping together the similar substrings, letters that precede the said substring end up aligned next to each other. Pretty cool. While this works, there’s a caveat. This isn’t particularly efficient as it requires holding N rotations of the string in memory. In the next article I’ll be covering how to improve the space requirements, and more importantly how to reverse the process.

Published inAlgorithms

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *