Version: 0.4.24
6. Prefix and Suffix Trees#
- Lecture:
Lecture 5.6
(slides)
- Objectives:
Understand what is a prefix and what is a suffix tree
- Concepts:
Insertion, deletion and search into a Prefix-tree
As a last example of tree data structures, we shall look at prefix and suffix trees. These are data structures used to speed up text search such as auto completion in a text editors, or in text box on modern apps and websites. Fig. 6.13 shows an example of “default completion”, where the editor make suggestions based on words already in the document.
- Prefixes vs. Suffixes
What are prefixes and suffixes? Intuitively, we can say the are just “beginning” and “ending” of sequences. More formally, given a sequence of characters \(s = (c_1, c_2, \dots, c_n)\) any shorted sequence \(s' = (c_1, \dots, c_k)\) where \(0 \leq k \leq n\) is a prefix of \(S\). For instance, the word “algorithm” has the following prefixes (ordered by length):
\[P = \{ '',\, 'a',\, 'al',\, 'alg',\, 'algo',\, 'algor',\, 'algori',\, 'algorit',\, 'algorith',\, 'algorithm' \}\]Note that the set of prefixes includes the empty sequence and the complete word itself.
Similarly, suffixes are possible endings. Formally, and sequence \(s' = (c_k, \dots, c_n)\) is a suffix of \(s\). The word algorithm therefore admits the following suffixes:
\[S = \{ '',\, 'm',\, 'hm',\, 'thm',\, 'ithm',\, 'rithm',\, 'orithm',\, 'gorithm',\, 'lgorithm', \, 'algorithm' \}\]
6.1. Prefix Trees#
How can we implement such auto-completion mechanisms efficiently? What we need is kind of Map ADT that could give us all the word that starts by the “prefix2 the user has typed. For instance, given a set of words already typed,one could index them by prefix, using a hash table.
6.1.1. Structure of Prefix Trees#
A prefix tree is built over a set of words where paths from the root to the leaves represent different prefixes. At the end of each path, the tree holds all the words that start with the corresponding prefix.
A prefix tree is a \(n\)-ary tree where every leaf node holds a complete word. Every internal node identifies each of its children with a single character, so that the path from the root to any leaf (i.e., word) represents the shortest prefix that distinguishes that word from all the others (also known as the shortest common prefix, or LCP). Fig. 6.14 below gives an example of prefix tree built over the set of words:
Consider the word “apart” for instance. It is the only word among these eight words that has the prefix “apa”. Consequently, starting from the root, the path that corresponds to the prefix “apa” leads to “apart”.
Fig. 6.14 A prefix tree built on the set of eight words \(S = \{'apart', \, 'ape', \, 'apear',\, 'apple',\, 'apple',\, 'apply',\, 'apricot',\, 'april' \}\)#
In this example, if we need all the words that start with the prefix “apr”, we can follow the corresponding path from the root and we land a the root of a subtree where every leaves starts with that prefix. The path “apr” ends on Node 5, and this subtree contains the words “apricot” and “april”, the only two words that start with “apr”.
- What about word-prefixes?
We use a sentinel character
- Compression
How can we compress the tree