The Burrows–Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as the move-to-front transform and run-length encoding. More importantly, the transformation is reversible, without needing to store any additional data except the position of the first original character. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation.
The Burrows–Wheeler transform is an algorithm used to prepare data for use with data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at DEC Systems Research Center in Palo Alto, California. It is based on a previously unpublished transformation discovered by Wheeler in 1983. The algorithm can be implemented efficiently using a suffix array thus reaching linear time complexity.^{[1]}
When a character string is transformed by the BWT, the transformation permutes the order of the characters. If the original string had several substrings that occurred often, then the transformed string will have several places where a single character is repeated multiple times in a row.
For example:
Input:   SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output:  TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT^{[2]}
The output is easier to compress because it has many repeated characters. In this example the transformed string contains six runs of identical characters: XX, SS, PP, .., II, and III, which together make 13 out of the 44 characters.
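The run count quoted above can be checked with a short Python sketch. The helper name `runs` is hypothetical, and a "run" is assumed here to mean two or more identical adjacent characters.

```python
from itertools import groupby

def runs(s: str, min_len: int = 2) -> list[str]:
    """Return the maximal runs of identical characters of length >= min_len."""
    groups = ("".join(group) for _, group in groupby(s))
    return [g for g in groups if len(g) >= min_len]

original = "SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES"
transformed = "TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT"

print(runs(original))                           # []
print(runs(transformed))                        # ['XX', 'SS', 'PP', '..', 'II', 'III']
print(sum(len(r) for r in runs(transformed)))   # 13
```

The input contains no repeated adjacent characters at all, which is why a run-length coder gains nothing on it before the transform.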
The transform is done by sorting all the circular shifts of a text in lexicographic order and by extracting the last column and the index of the original string in the set of sorted permutations of S.
Given an input string S = ^BANANA| (step 1 in the table below), rotate it N times (step 2), where N = 8 is the length of S, counting both the symbol ^ that marks the start of the string and the | character that represents the 'EOF' pointer. These rotations, or circular shifts, are then sorted lexicographically (step 3). The output of the encoding phase is the last column L = BNN^AA|A after step 3, together with the 0-based index I of the row containing the original string S, in this case I = 6.
Transformation

1. Input: ^BANANA|

2. All rotations    3. Sort into lexical order    4. Take the last column
^BANANA|            ANANA|^B                      B
|^BANANA            ANA|^BAN                      N
A|^BANAN            A|^BANAN                      N
NA|^BANA            BANANA|^                      ^
ANA|^BAN            NANA|^BA                      A
NANA|^BA            NA|^BANA                      A
ANANA|^B            ^BANANA|                      |
BANANA|^            |^BANANA                      A

5. Output: BNN^AA|A

The following pseudocode gives a simple (though inefficient) way to calculate the BWT and its inverse. It assumes that the input string s contains a special character 'EOF' which is the last character and occurs nowhere else in the text.

function BWT (string s)
    create a table, rows are all possible rotations of s
    sort rows alphabetically
    return (last column of the table)

function inverseBWT (string s)
    create empty table
    repeat length(s) times
        // first insert creates first column
        insert s as a column of table before first column of the table
        sort rows of the table alphabetically
    return (row that ends with the 'EOF' character)
To understand why this creates more easily compressible data, consider transforming a long English text frequently containing the word "the". Sorting the rotations of this text will group rotations starting with "he " together, and the last character of that rotation (which is also the character before the "he ") will usually be "t", so the result of the transform would contain a number of "t" characters along with the perhaps less common exceptions (such as if it contains "Brahe ") mixed in. So it can be seen that the success of this transform depends upon one value having a high probability of occurring before a sequence, so that in general it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text).
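The grouping effect can be seen on a toy sentence. The sketch below is only an illustration: the sample text, the helper name `sorted_rotations` and the use of a NUL character as the 'EOF' marker are all assumptions, not part of any reference implementation.

```python
def sorted_rotations(text: str, eof: str = "\0") -> list[str]:
    """Return all rotations of text + eof in sorted order (naive, quadratic memory)."""
    s = text + eof
    return sorted(s[i:] + s[:i] for i in range(len(s)))

text = "the cat and the dog saw the bird over the hedge "

# For every sorted rotation that starts with "he ", look at its last character,
# which is the character that precedes "he " in the original text.
before_he = [row[-1] for row in sorted_rotations(text) if row.startswith("he ")]
print(before_he)  # ['t', 't', 't', 't']
```

Because these rows are adjacent in the sorted table, the four "t" characters end up next to each other in the last column.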
The remarkable thing about the BWT is not that it generates a more easily encoded output—an ordinary sort would do that—but that it is reversible, allowing the original document to be regenerated from the last column data.
The inverse can be understood this way. Take the final table in the BWT algorithm, and erase all but the last column. Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters alphabetically to get the first column. Then, the first and last columns (of each row) together give you all pairs of successive characters in the document, where pairs are taken cyclically so that the last and first character form a pair. Sorting the list of pairs gives the first and second columns. Continuing in this manner, you can reconstruct the entire list. Then, the row with the "end of file" character at the end is the original text. Reversing the example above is done like this:
Inverse transformation

Input:   BNN^AA|A

Add 1:   B N N ^ A A | A
Sort 1:  A A A B N N ^ |
Add 2:   BA NA NA ^B AN AN |^ A|
Sort 2:  AN AN A| BA NA NA ^B |^
Add 3:   BAN NAN NA| ^BA ANA ANA |^B A|^
Sort 3:  ANA ANA A|^ BAN NAN NA| ^BA |^B
Add 4:   BANA NANA NA|^ ^BAN ANAN ANA| |^BA A|^B
Sort 4:  ANAN ANA| A|^B BANA NANA NA|^ ^BAN |^BA
Add 5:   BANAN NANA| NA|^B ^BANA ANANA ANA|^ |^BAN A|^BA
Sort 5:  ANANA ANA|^ A|^BA BANAN NANA| NA|^B ^BANA |^BAN
Add 6:   BANANA NANA|^ NA|^BA ^BANAN ANANA| ANA|^B |^BANA A|^BAN
Sort 6:  ANANA| ANA|^B A|^BAN BANANA NANA|^ NA|^BA ^BANAN |^BANA
Add 7:   BANANA| NANA|^B NA|^BAN ^BANANA ANANA|^ ANA|^BA |^BANAN A|^BANA
Sort 7:  ANANA|^ ANA|^BA A|^BANA BANANA| NANA|^B NA|^BAN ^BANANA |^BANAN
Add 8:   BANANA|^ NANA|^BA NA|^BANA ^BANANA| ANANA|^B ANA|^BAN |^BANANA A|^BANAN
Sort 8:  ANANA|^B ANA|^BAN A|^BANAN BANANA|^ NANA|^BA NA|^BANA ^BANANA| |^BANANA

Output:  ^BANANA|

A number of optimizations can make these algorithms run more efficiently without changing the output. There is no need to represent the table in either the encoder or decoder. In the encoder, each row of the table can be represented by a single pointer into the strings, and the sort performed using the indices. Some care must be taken to ensure that the sort does not exhibit bad worst-case behavior: standard library sort functions are unlikely to be appropriate. In the decoder, there is also no need to store the table, and in fact no sort is needed at all. In time proportional to the alphabet size and string length, the decoded string may be generated one character at a time from right to left. A "character" in the algorithm can be a byte, or a bit, or any other convenient size.
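A table-free decoder of the kind described above can be sketched as follows. This is a minimal illustration rather than a tuned implementation. The function name `inverse_bwt_no_table` is hypothetical, and the code assumes that the text ended with a unique EOF character (written `|` here) and that the encoder sorted rotations by plain code-point order.

```python
from collections import Counter

def inverse_bwt_no_table(last: str, eof: str = "|") -> str:
    """Invert a BWT without building the rotation table.

    Assumes the original text ended with a unique `eof` character.
    """
    # first_row[c]: row where character c first appears in the sorted first column.
    counts = Counter(last)
    first_row = {}
    total = 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    # step[i]: row whose rotation starts one position earlier in the text than
    # row i. The k-th occurrence of c in the last column corresponds to the
    # k-th occurrence of c in the first column.
    seen = Counter()
    step = []
    for c in last:
        step.append(first_row[c] + seen[c])
        seen[c] += 1

    # The row ending in `eof` is the original text; emit it right to left.
    row = last.index(eof)
    out = []
    for _ in range(len(last)):
        out.append(last[row])
        row = step[row]
    return "".join(reversed(out))

print(inverse_bwt_no_table("BNN^AA|A"))  # ^BANANA|
```

The only sorting done here is over the distinct characters of the alphabet, which is where the dependence on alphabet size comes from. Everything else is a single pass over the string.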
One may also make the observation that mathematically, the encoded string can be computed as a simple modification of the suffix array, and suffix arrays can be computed with linear time and memory. The BWT can be defined with regard to the suffix array SA of text T as (1-based indexing):
\({\displaystyle BWT[i]={\begin{cases}T[SA[i]-1],&{\text{if }}SA[i]>1\\\$,&{\text{otherwise}}\end{cases}}}\)^{[3]}
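The suffix-array definition translates almost directly into code. The sketch below uses 0-based indexing, so the condition becomes SA[i] > 0. It builds the suffix array naively by sorting suffixes, which is far from linear time and is used only to keep the example short. The function name and the choice of `$` as a sentinel that sorts before every other character are assumptions.

```python
def bwt_from_suffix_array(text: str, sentinel: str = "$") -> str:
    """Compute the BWT of text + sentinel via a (naively built) suffix array."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    # Naive suffix array: indices of suffixes in sorted order (illustration only).
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    # 0-based form of the definition: the character just before each suffix,
    # or the sentinel for the suffix that starts at position 0.
    return "".join(t[i - 1] if i > 0 else sentinel for i in sa)

print(bwt_from_suffix_array("BANANA"))  # ANNB$AA
```

A practical implementation would swap the `sorted` call for a linear-time suffix array construction and leave the final line unchanged.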
There is no need to have an actual 'EOF' character. Instead, a pointer can be used that remembers where in a string the 'EOF' would be if it existed. In this approach, the output of the BWT must include both the transformed string, and the final value of the pointer. That means the BWT does expand its input slightly. The inverse transform then shrinks it back down to the original size: it is given a string and a pointer, and returns just a string.
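The pointer-based variant can be sketched as a pair of functions that exchange a string plus an integer instead of relying on a marker character. Both function names are hypothetical, and the inverse uses the simple table-building method for clarity rather than speed.

```python
def bwt_with_pointer(s: str) -> tuple[str, int]:
    """Return (last column, row index of the original string); no EOF is added."""
    n = len(s)
    # Sort rotation start positions by the rotation they denote.
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    # The last character of the rotation starting at i is s[i - 1].
    last = "".join(s[i - 1] for i in order)
    pointer = order.index(0) if n else 0
    return last, pointer

def inverse_bwt_with_pointer(last: str, pointer: int) -> str:
    """Rebuild the sorted rotation table and return the row the pointer selects."""
    if not last:
        return ""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    return table[pointer]

encoded, index = bwt_with_pointer("BANANA")
print(encoded, index)                            # NNBAAA 3
print(inverse_bwt_with_pointer(encoded, index))  # BANANA
```

The output is one integer larger than the input, which is the slight expansion mentioned above.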
A complete description of the algorithms can be found in Burrows and Wheeler's paper, or in a number of online sources.^{[1]} The algorithms vary somewhat by whether EOF is used, and in which direction the sorting was done. In fact, the original formulation did not use an EOF marker.^{[4]}
Since any rotation of the input string will lead to the same transformed string, the BWT cannot be inverted without adding an EOF marker to the end of the input or doing something equivalent, making it possible to distinguish the input string from all its rotations. Increasing the size of the alphabet (by appending the EOF character) makes later compression steps awkward.
There is a bijective version of the transform, by which the transformed string uniquely identifies the original, and the two have the same length and contain exactly the same characters, just in a different order.^{[5]}^{[6]}
The bijective transform is computed by factoring the input into a non-increasing sequence of Lyndon words; such a factorization exists and is unique by the Chen–Fox–Lyndon theorem,^{[7]} and may be found in linear time.^{[8]} The algorithm sorts the rotations of all the words; as in the Burrows–Wheeler transform, this produces a sorted sequence of n strings. The transformed string is then obtained by picking the final character of each string in this sorted list. The one important caveat here is that strings of different lengths are not ordered in the usual way; the two strings are repeated forever, and the infinite repeats are sorted. For example, "ORO" precedes "OR" because "OROORO..." precedes "OROROR...".
For example, the text "^BANANA|" is transformed into "ANNBAA^|" through the steps below, where the | character marks the EOF pointer in the original string. The EOF character is unneeded in the bijective transform, so it is dropped during the transform and re-added to its proper place in the file.
The string is broken into Lyndon words so the words in the sequence are decreasing using the comparison method above. (Note that we're sorting '^' as succeeding other characters.) "^BANANA" becomes (^) (B) (AN) (AN) (A).
Bijective transformation

Input: ^BANANA|

All rotations        Sorted alphabetically   Last column of rotated Lyndon word
^^^^^^^^... (^)      AAAAAAAA... (A)         A
BBBBBBBB... (B)      ANANANAN... (AN)        N
ANANANAN... (AN)     ANANANAN... (AN)        N
NANANANA... (NA)     BBBBBBBB... (B)         B
ANANANAN... (AN)     NANANANA... (NA)        A
NANANANA... (NA)     NANANANA... (NA)        A
AAAAAAAA... (A)      ^^^^^^^^... (^)         ^

Output: ANNBAA^|
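The forward bijective transform can be sketched in a few lines of Python. The factorization below uses Duval's algorithm, one standard linear-time method and not necessarily the one the cited sources use. The infinite repeats of two rotations u and v are compared by comparing u + v with v + u, which gives the same order. The function names are hypothetical, and characters are ordered by code point, so '^' sorts after the letters as in the example.

```python
from functools import cmp_to_key

def lyndon_factors(s: str) -> list[str]:
    """Factor s into a non-increasing sequence of Lyndon words (Duval's algorithm)."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit every complete copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

def compare_infinite_repeats(u: str, v: str) -> int:
    """Order u and v by their infinite repetitions uuu... and vvv..."""
    a, b = u + v, v + u
    return (a > b) - (a < b)

def bijective_bwt(s: str) -> str:
    """Sort the rotations of all Lyndon factors and take the last characters."""
    rotations = [w[i:] + w[:i] for w in lyndon_factors(s) for i in range(len(w))]
    rotations.sort(key=cmp_to_key(compare_infinite_repeats))
    return "".join(r[-1] for r in rotations)

print(lyndon_factors("^BANANA"))  # ['^', 'B', 'AN', 'AN', 'A']
print(bijective_bwt("^BANANA"))   # ANNBAA^
```

The sort here is a generic comparison sort over explicit rotation strings, so this sketch does not reach the linear running time of the factorization step.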

Inverse bijective transform

Input:   ANNBAA^|

Add 1:   A N N B A A ^
Sort 1:  A A A B N N ^
Add 2:   AA NA NA BB AN AN ^^
Sort 2:  AA AN AN BB NA NA ^^
Add 3:   AAA NAN NAN BBB ANA ANA ^^^
Sort 3:  AAA ANA ANA BBB NAN NAN ^^^
Add 4:   AAAA NANA NANA BBBB ANAN ANAN ^^^^
Sort 4:  AAAA ANAN ANAN BBBB NANA NANA ^^^^

Output:  ^BANANA|
Up until the last step, the process is identical to the inverse Burrows–Wheeler process, but here it will not necessarily give rotations of a single sequence; it instead gives rotations of Lyndon words (which will start to repeat as the process is continued). Here, we can see (repetitions of) four distinct Lyndon words: (A), (AN) (twice), (B), and (^). (NANA... doesn't represent a distinct word, as it is a cycle of ANAN....) At this point, these words are sorted into reverse order: (^), (B), (AN), (AN), (A). These are then concatenated to get ^BANANA.
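The same result can be obtained without building the table, by following the cycles of the permutation that a stable sort of the transformed string induces. The sketch below is an alternative to the table-based process described above, not a transcription of it. It has been checked here only on small examples, and the function name is hypothetical.

```python
def inverse_bijective_bwt(last: str) -> str:
    """Invert the bijective transform by following stable-sort cycles."""
    n = len(last)
    # nxt[i]: position in `last` of the i-th character of the stably sorted string.
    nxt = sorted(range(n), key=lambda i: last[i])
    visited = [False] * n
    out = []
    # Each cycle spells one Lyndon word backwards; cycles starting at smaller
    # indices belong to words that come later in the original text.
    for start in range(n):
        k = start
        while not visited[k]:
            out.append(last[k])
            visited[k] = True
            k = nxt[k]
    return "".join(reversed(out))

print(inverse_bijective_bwt("ANNBAA^"))  # ^BANANA
```

Python's `sorted` is stable, which this sketch relies on to pair equal characters in the same order as they appear in the input.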
The Burrows–Wheeler transform can indeed be viewed as a special case of this bijective transform; instead of the traditional introduction of a new letter from outside our alphabet to denote the end of the string, we can introduce a new letter that compares as preceding all existing letters that is put at the beginning of the string. The whole string is now a Lyndon word, and running it through the bijective process will therefore result in a transformed result that, when inverted, gives back the Lyndon word, with no need for reassembling at the end.
Relatedly, the transformed text will only differ from the result of BWT by one character per Lyndon word; for example, if the input is decomposed into six Lyndon words, the output will only differ in six characters. For example, applying the bijective transform gives:
Input  SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES 

Lyndon words  SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES 
Output  STEYDST.E.IXXIIXXSMPPXS.B..EE..SUSFXDIOIIIIT 
The bijective transform includes eight runs of identical characters. These runs are, in order: XX, II, XX, PP, .., EE, .., and IIII.
In total, 18 characters are used in these runs.
When a text is edited, its Burrows–Wheeler transform will change. Salson et al.^{[9]} propose an algorithm that deduces the Burrows–Wheeler transform of an edited text from that of the original text, doing a limited number of local reorderings in the original Burrows–Wheeler transform, which can be faster than constructing the Burrows–Wheeler transform of the edited text directly.
This Python implementation sacrifices speed for simplicity: the program is short, but takes more than the linear time that would be desired in a practical implementation. It essentially does what the pseudocode section does.
Using the STX/ETX control codes to mark the start and end of the text, and using s[i:] + s[:i] to construct the i-th rotation of s, the forward transform takes the last character of each of the sorted rows:
def bwt(s: str) -> str:
    """Apply Burrows–Wheeler transform to input string."""
    assert "\002" not in s and "\003" not in s, "Input string cannot contain STX and ETX characters"
    s = "\002" + s + "\003"  # Add start and end of text marker
    table = sorted(s[i:] + s[:i] for i in range(len(s)))  # Table of rotations of string
    last_column = [row[-1:] for row in table]  # Last characters of each row
    return "".join(last_column)  # Convert list of characters into string
The inverse transform repeatedly inserts r as the left column of the table and sorts the table. After the whole table is built, it returns the row that ends with ETX, minus the STX and ETX.
def ibwt(r: str) -> str:
    """Apply inverse Burrows–Wheeler transform."""
    table = [""] * len(r)  # Make empty table
    for _ in range(len(r)):
        table = sorted(r[i] + table[i] for i in range(len(r)))  # Add a column of r
    s = [row for row in table if row.endswith("\003")][0]  # Find the correct row (ending in ETX)
    return s.rstrip("\003").strip("\002")  # Get rid of start and end markers
Following implementation notes from Manzini, it is equivalent to use a simple null character suffix instead. The sorting should be done in colexicographic order (string read right-to-left), i.e. sorted(..., key=lambda s: s[::-1]) in Python.^{[4]} (The above control codes actually fail to satisfy EOF being the last character; the two codes are actually the first. The rotation holds nevertheless.)
The advent of next-generation sequencing (NGS) techniques at the end of the 2000s decade has led to another application of the Burrows–Wheeler transformation. In NGS, DNA is fragmented into small pieces, of which the first few bases are sequenced, yielding several millions of "reads", each 30 to 500 base pairs ("DNA characters") long. In many experiments, e.g., in ChIP-Seq, the task is now to align these reads to a reference genome, i.e., to the known, nearly complete sequence of the organism in question (which may be up to several billion base pairs long). A number of alignment programs, specialized for this task, were published, which initially relied on hashing (e.g., Eland, SOAP,^{[10]} or Maq^{[11]}). In an effort to reduce the memory requirement for sequence alignment, several alignment programs were developed (Bowtie,^{[12]} BWA,^{[13]} and SOAP2^{[14]}) that use the Burrows–Wheeler transform.