Package it.unimi.dsi.sux4j.mph


package it.unimi.dsi.sux4j.mph
Implementations of ([compressed] static | minimal perfect hash) functions.

This package provides a number of state-of-the-art implementations of static functions, that is, immutable structures that map a set of objects to a set of values. In particular, we provide minimal perfect hash functions (hence the package name), which map n objects to the first n natural numbers bijectively, and monotone minimal perfect hash functions, which map n objects to their lexicographical rank (i.e., the bijective mapping respects lexicographical order). All structures can compute their result using a space storage that is way below the information-theoretical lower bound for a dictionary because they have an unpredictable result when queried with keys outside of the original set. You can sometimes control the behaviour outside of the original set, however, using signatures (read below). In particular, some of the maps can actually be used to store an approximate dictionary in a space very close to the optimum.

Most of the functions have constant lookup time, except for the computation of a hash function on the key.

Usage

Most functions in this package provide a builder, rather than a constructor, and feature an easy-to-use main method to build serialized functions from the command line. Moreover, they implement the Object2LongFunction interface, but the underlying machinery manipulates bit vectors only. To bring you own data into the bit-vector world, each function builder requires to specify a transformation strategy that maps your objects into bit vectors. Many ready-made strategies can be found in TransformationStrategies: for example, rawUtf16(), utf16() and prefixFreeUtf16() can be used with character sequences, whereas rawIso(), iso() and prefixFreeIso() can be used with character sequences that are known to be within the range of the ISO-8859-1 charset.

Note that if you plan to use monotone hashing, you must provide objects in an order such that the corresponding bit vectors are lexicographically ordered. For instance, utf16() obtain this results by concatenating the reversed 16-bit representation of each character. Sometimes, you need also the resulting bit vectors to be prefix-free. If order is not an issue, raw transformation strategies do not reverse bit order, so they are usually faster.

For maps whose size depends on the length of the bit vectors being hashed, the HuTuckerTransformationStrategy provides lexicographical optimal compression, at the price of a slower lookup.

Signing functions

All functions in this package will return a value in their range for most of the keys that are not in their domain. In the few cases in which it is possible to detect a key out of the original set, you will get the default return value.

If you are interested in getting a more precise behaviour, you can sign a function, that is, you can record a signature for each key and use it to filter keys out of the original set. For that to happen, the function must be a bijection of the original set with the first natural numbers: this happens with (monotone) minimal perfect hash functions and with suitable static functions. Several classes provide built-in signing (look at the methods of their builder).

An interesting application of the signing technique makes it possible to store a set of objects with a small probability of error (false positives): one simply stores a function mapping each object to its signature. At lookup time, the signature of the object is compared with the output of the function. For instance, using a GOV4Function one can store a dictionary with probability of a false positive 2−w using just 1.03nw bits, that is, just 3% more than the theoretical lower bound. Look for the dictionary() method in builders.

There are also external signing classes for character sequences provided by the DSI utilities which actually implement the interface StringMap; in particular, LiterallySignedStringMap will provide a full StringMap implementation.

Available data structures

The classes can be gathered in three broad groups:
  • General functions. They can be used to associate arbitrary values to a set of objects, so, in particular, they can be used to implement order-preserving minimal perfect hashing (elements are mapped to their order in which they were provided, independently of their lexicographical order). They are also essential building blocks for all other classes. Out of this class we suggest as a workhorse GOV3Function, GOV4Function (slightly slower, smaller space) and TwoStepsGOV3Function (slower, but uses less space if the distribution of the output values is skewed, i.e., some values are very frequent).
  • Compressed functions. As above, but static functions of this type are targeted at functions whose output is not evenly distributed (i.e., some values are much more frequent than others). The space usage per key is proportional to the empirical 0-th order entropy of the output values, rather than to the number of bits that are necessary to express the larger value. Out of this class we suggest as a workhorse GV3CompressedFunction.
  • Minimal perfect hash function; they map a set of n object bijectively to the set n = { 0, 1,… n − 1 }. Out of this class we suggest as a workhorse GOVMinimalPerfectHashFunction, which uses ≈2.24 bits per key and has very fast lookups.
  • Monotone minimal perfect hash functions; these functions requires keys to be prefix-free and provided in lexicographical order. They will map back each key to its position using a very small number of bits per element, providing different space/time tradeoffs (in what follows, ℓ is the maximum string length):

LcpMonotoneMinimalPerfectHashFunction and ZFastTrieDistributorMonotoneMinimalPerfectHashFunction were introduced by Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna in “Monotone Minimal Perfect Hashing: Searching a Sorted Table with O(1) Accesses”, Proc. of the 20th Annual ACM–SIAM Symposium On Discrete Mathematics (SODA), ACM Press, 2009. TwoStepsLcpMonotoneMinimalPerfectHashFunction was introduced by the same authors in “Theory and practice of monotone minimal perfect hashing”, ACM Journal of Experimental Algorithmics, 16(3):132−144, 2011. All the GOV/GV-prefixed functions have been described by Marco Genuzio, Giuseppe Ottaviano, and Sebastiano Vigna in “Fast scalable construction of ([compressed] static | minimal perfect hash) functions”, Information and Computation, 273:104517, 2020.

Additional data structures

PaCoTrieDistributorMonotoneMinimalPerfectHashFunction, HollowTrieMonotoneMinimalPerfectHashFunction, and HollowTrieDistributorMonotoneMinimalPerfectHashFunction were introduced by Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna in “Theory and practice of monotone minimal perfect hashing”, ACM Journal of Experimental Algorithmics, 16(3):132−144, 2011.