Class MinimalPerfectHashFunction<T>
- All Implemented Interfaces:
Function<T,
,Long> Object2LongFunction<T>
,Size64
,Serializable
,Function<T,
,Long> ToLongFunction<T>
Given a list of keys without duplicates, the builder of this class finds a
minimal perfect hash function for the list. Subsequent calls to the getLong(Object)
method will return a distinct number for each key in the list. For keys out of the list, the
resulting number is not specified. In some (rare) cases it might be possible to establish that a
key was not in the original list, and in that case -1 will be returned; by signing the
function (see below), you can guarantee with a prescribed probability that -1 will be returned on
keys not in the original list. The class can then be saved by serialisation and reused later.
This class uses a chunked hash store to
provide highly scalable construction. Note that at construction time you can
pass a
it.unimi.dsi.sux4j.io.ChunkedHashStore containing the keys (associated with any value); however,
if the store is rebuilt because of a
ChunkedHashStore.DuplicateException
it will be rebuilt associating
with each key its ordinal position.
The theoretical memory requirements for the algorithm we use are 2γ
=2.46 + o(n) bits per key, plus the bits for the random hashes (which are
usually negligible). The o(n) part is due to an embedded ranking scheme that increases
space occupancy by 6.25%, bringing the actual occupied space to around 2.68 bits per key.
For convenience, this class provides a main method that reads from standard input a (possibly
gzip
'd) sequence of newline-separated strings, and writes a serialised minimal
perfect hash function for the given list.
Signing
Optionally, it is possible to sign the minimal perfect
hash function. A w-bit signature will be associated with each key, so that
getLong(Object)
will return -1 on strings that are not in the original key set. As
usual, false positives are possible with probability 2-w.
How it Works
The technique used is very similar (but developed independently) to that described by Fabiano C. Botelho, Rasmus Pagh and Nivio Ziviani in “Simple and Efficient Minimal Perfect Hashing Functions”, Algorithms and data structures: 10th international workshop, WADS 2007, number 4619 of Lecture Notes in Computer Science, pages 139−150, 2007. In turn, the mapping technique described therein was actually first proposed by Bernard Chazelle, Joe Kilian, Ronitt Rubinfeld and Ayellet Tal in “The Bloomier Filter: an Efficient Data Structure for Static Support Lookup Tables”, Proc. SODA 2004, pages 30−39, 2004, as one of the steps to implement a mutable table.
The basic ingredient is the Majewski-Wormald-Havas-Czech 3-hypergraph technique. After generating a random 3-hypergraph, we sort its 3-hyperedges to that a distinguished vertex in each 3-hyperedge, the hinge, never appeared before. We then assign to each vertex a two-bit number in such a way that for each 3-hyperedge the sum of the values associated to its vertices modulo 3 gives the index of the hash function generating the hinge. As as a result we obtain a perfect hash of the original set (one just has to compute the three hash functions, collect the three two-bit values, add them modulo 3 and take the corresponding hash function value).
To obtain a minimal perfect hash, we simply notice that we whenever we have to assign a value to
a vertex, we can take care of using the number 3 instead of 0 if the vertex is actually the
output value for some key. The final value of the minimal perfect hash function is the number of
nonzero pairs of bits that precede the perfect hash value for the key. To compute this number, we
use a simple table-free ranking scheme, recording the number of nonzero pairs each
BITS_PER_BLOCK
bits and using Long.bitCount(long)
to
count the number of nonzero pairs of bits in a word.
- Since:
- 0.1
- Author:
- Sebastiano Vigna
- See Also:
-
Nested Class Summary
Modifier and TypeClassDescriptionstatic class
Deprecated.A builder class forMinimalPerfectHashFunction
. -
Field Summary
Modifier and TypeFieldDescriptionprotected long[]
Deprecated.The bit array supportingvalues
.static final int
Deprecated.The number of bits per block in the rank structure.protected final LongArrayBitVector
Deprecated.The bit vector underlyingvalues
.protected final long[]
Deprecated.The array of counts for blocks and superblocks.protected final long
Deprecated.The seed used to generate the initial hash triple.static final int
Deprecated.The logarithm of the desired chunk size.protected final long
Deprecated.The number of keys.protected final long[]
Deprecated.The start offset of each chunk.protected final long[]
Deprecated.The seed of the underlying 3-hypergraphs.static final long
Deprecated.protected final long
Deprecated.The mask to compare signatures, or zero for no signatures.protected final LongBigList
Deprecated.The signatures.protected final TransformationStrategy<? super T>
Deprecated.The transformation strategy.protected final LongBigList
Deprecated.The final magick—the list of modulo-3 values that define the output of the minimal perfect hash function.Fields inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
defRetValue
-
Constructor Summary
ModifierConstructorDescriptionprotected
MinimalPerfectHashFunction
(Iterable<? extends T> keys, TransformationStrategy<? super T> transform, int signatureWidth, File tempDir, ChunkedHashStore<T> chunkedHashStore) Deprecated.Creates a new minimal perfect hash function for the given keys. -
Method Summary
Modifier and TypeMethodDescriptionstatic int
countNonzeroPairs
(long x) Deprecated.Counts the number of nonzero pairs of bits in a long.long
Deprecated.long
getLongBySignature
(long[] triple) Deprecated.Low-level access to the output of this minimal perfect hash function.static void
Deprecated.long
numBits()
Deprecated.Returns the number of bits used by this structure.long
size64()
Deprecated.Methods inherited from class it.unimi.dsi.sux4j.mph.AbstractHashFunction
containsKey, size
Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
defaultReturnValue, defaultReturnValue
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface it.unimi.dsi.fastutil.objects.Object2LongFunction
andThen, andThenByte, andThenChar, andThenDouble, andThenFloat, andThenInt, andThenLong, andThenObject, andThenReference, andThenShort, applyAsLong, composeByte, composeChar, composeDouble, composeFloat, composeInt, composeLong, composeObject, composeReference, composeShort, get, getOrDefault, getOrDefault, put, put, remove, removeLong
-
Field Details
-
serialVersionUID
public static final long serialVersionUIDDeprecated.- See Also:
-
BITS_PER_BLOCK
public static final int BITS_PER_BLOCKDeprecated.The number of bits per block in the rank structure.- See Also:
-
LOG2_CHUNK_SIZE
public static final int LOG2_CHUNK_SIZEDeprecated.The logarithm of the desired chunk size.- See Also:
-
n
protected final long nDeprecated.The number of keys. -
globalSeed
protected final long globalSeedDeprecated.The seed used to generate the initial hash triple. -
seed
protected final long[] seedDeprecated.The seed of the underlying 3-hypergraphs. -
offset
protected final long[] offsetDeprecated.The start offset of each chunk. -
values
Deprecated.The final magick—the list of modulo-3 values that define the output of the minimal perfect hash function. -
bitVector
Deprecated.The bit vector underlyingvalues
. -
array
protected transient long[] arrayDeprecated.The bit array supportingvalues
. -
transform
Deprecated.The transformation strategy. -
signatureMask
protected final long signatureMaskDeprecated.The mask to compare signatures, or zero for no signatures. -
signatures
Deprecated.The signatures. -
count
protected final long[] countDeprecated.The array of counts for blocks and superblocks.
-
-
Constructor Details
-
MinimalPerfectHashFunction
protected MinimalPerfectHashFunction(Iterable<? extends T> keys, TransformationStrategy<? super T> transform, int signatureWidth, File tempDir, ChunkedHashStore<T> chunkedHashStore) throws IOException Deprecated.Creates a new minimal perfect hash function for the given keys.- Parameters:
keys
- the keys to hash, ornull
.transform
- a transformation strategy for the keys.signatureWidth
- a signature width, or 0 for no signature.tempDir
- a temporary directory for the store files, ornull
for the standard temporary directory.chunkedHashStore
- a chunked hash store containing the keys, ornull
; the store can be unchecked, but in this casekeys
andtransform
must be non-null
.- Throws:
IOException
-
-
Method Details
-
countNonzeroPairs
public static int countNonzeroPairs(long x) Deprecated.Counts the number of nonzero pairs of bits in a long.- Parameters:
x
- a long.- Returns:
- the number of nonzero bit pairs in
x
.
-
numBits
public long numBits()Deprecated.Returns the number of bits used by this structure.- Returns:
- the number of bits used by this structure.
-
getLong
Deprecated.- Specified by:
getLong
in interfaceObject2LongFunction<T>
-
getLongBySignature
public long getLongBySignature(long[] triple) Deprecated.Low-level access to the output of this minimal perfect hash function.This method makes it possible to build several kind of functions on the same
ChunkedHashStore
and then retrieve the resulting values by generating a single triple of hashes. The methodTwoStepsMWHCFunction.getLong(Object)
is a good example of this technique.- Parameters:
triple
- a triple generated as documented inChunkedHashStore
.- Returns:
- the output of the function.
-
size64
public long size64()Deprecated.- Specified by:
size64
in interfaceSize64
- Overrides:
size64
in classAbstractHashFunction<T>
-
main
Deprecated.
-
GOVMinimalPerfectHashFunction
.