Class GOVMinimalPerfectHashFunction<T>
- All Implemented Interfaces:
Function<T,
,Long> Object2LongFunction<T>
,Size64
,Serializable
,Function<T,
,Long> ToLongFunction<T>
Given a list of keys without duplicates, the builder of this class finds a
minimal perfect hash function for the list. Subsequent calls to the getLong(Object)
method will return a distinct number for each key in the list. For keys out of the list, the
resulting number is not specified. In some (rare) cases it might be possible to establish that a
key was not in the original list, and in that case -1 will be returned; by signing the
function (see below), you can guarantee with a prescribed probability that -1 will be returned on
keys not in the original list. The class can then be saved by serialisation and reused later.
This class uses a bucketed hash store to provide highly scalable
construction. Note that at construction time you can pass a BucketedHashStore containing the keys (associated with any value); however, if the store
is rebuilt because of a BucketedHashStore.DuplicateException
it
will be rebuilt associating with each key its ordinal position.
For convenience, this class provides a main method that reads from standard input a (possibly
gzip
'd) sequence of newline-separated strings, and writes a serialised minimal
perfect hash function for the given list.
Signing
Optionally, it is possible to sign the minimal perfect
hash function. A w-bit signature will be associated with each key, so that
getLong(Object)
will return -1 on strings that are not in the original key set. As
usual, false positives are possible with probability 2-w.
Multithreading
This implementation is multithreaded: each bucket returned by the BucketedHashStore
is
processed independently. By default, this class uses Runtime.availableProcessors()
parallel threads, but by default no more than 4. If you wish to set a specific number of threads,
you can do so through the system property "it.unimi.dsi.sux4j.mph.threads".
How it Works
The detail of the data structure can be found in “Fast Scalable Construction of (Minimal Perfect Hash) Functions”, by Marco Genuzio, Giuseppe Ottaviano and Sebastiano Vigna, 15th International Symposium on Experimental Algorithms — SEA 2016, Lecture Notes in Computer Science, Springer, 2016. We generate a random 3-regular hypergraph and give it an orientation. From the orientation, we generate a random linear system on F3, where the variables in the k-th equation are the vertices of the k-th hyperedge, and the known term of the k-th equation is the vertex giving orientation to the k-th hyperedge. Then, we solve the system and store the solution, which provides a perfect hash function.
To obtain a minimal perfect hash function, we simply notice that we whenever we have to assign a
value to a vertex, we can take care of using the number 3 instead of 0 if the vertex is actually
the output value for some key. The final value of the minimal perfect hash function is the number
of nonzero pairs of bits that precede the perfect hash value for the key. To compute this number,
we use use in each bucket broadword programming.
Since the system must have ≈10% more variables than equations to be solvable, a
GOVMinimalPerfectHashFunction
on n keys requires 2.2n bits.
- Since:
- 4.0.0
- Author:
- Sebastiano Vigna
- See Also:
-
Nested Class Summary
-
Field Summary
Modifier and TypeFieldDescriptionprotected long[]
The bit array supportingbitVector
.protected final LongArrayBitVector
The bit vector underlyingvalues
.static final int
The expected bucket size.protected final long[]
A long containing the cumulating function of the bucket edges (i.e., keys) in the lower 56 bits, and the local seed of each bucket in the upper 8 bits.protected final long
The seed used to generate the initial signature.protected final long
The number of keys.static final String
The system property used to set the number of parallel threads.static final long
protected final long
The mask to compare signatures, or zero for no signatures.protected final LongBigList
The signatures.protected final TransformationStrategy<? super T>
The transformation strategy.protected final LongBigList
The final magick—the list of modulo-3 values that define the output of the minimal perfect hash function.Fields inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
defRetValue
-
Constructor Summary
ModifierConstructorDescriptionprotected
GOVMinimalPerfectHashFunction
(Iterable<? extends T> keys, TransformationStrategy<? super T> transform, int signatureWidth, File tempDir, BucketedHashStore<T> bucketedHashStore) Creates a new minimal perfect hash function for the given keys. -
Method Summary
Modifier and TypeMethodDescriptionstatic final int
countNonzeroPairs
(long x) Counts the number of nonzero pairs of bits in a long.void
long
long
getLongBySignature
(long[] signature) Low-level access to the output of this minimal perfect hash function.static void
long
numBits()
Returns the number of bits used by this structure.long
size64()
protected static long
vertexOffset
(long edgeOffsetSeed) Methods inherited from class it.unimi.dsi.sux4j.mph.AbstractHashFunction
containsKey, size
Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
defaultReturnValue, defaultReturnValue
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface it.unimi.dsi.fastutil.objects.Object2LongFunction
andThen, andThenByte, andThenChar, andThenDouble, andThenFloat, andThenInt, andThenLong, andThenObject, andThenReference, andThenShort, applyAsLong, composeByte, composeChar, composeDouble, composeFloat, composeInt, composeLong, composeObject, composeReference, composeShort, get, getOrDefault, getOrDefault, put, put, remove, removeLong
-
Field Details
-
serialVersionUID
public static final long serialVersionUID- See Also:
-
NUMBER_OF_THREADS_PROPERTY
The system property used to set the number of parallel threads.- See Also:
-
BUCKET_SIZE
public static final int BUCKET_SIZEThe expected bucket size.- See Also:
-
n
protected final long nThe number of keys. -
globalSeed
protected final long globalSeedThe seed used to generate the initial signature. -
edgeOffsetAndSeed
protected final long[] edgeOffsetAndSeedA long containing the cumulating function of the bucket edges (i.e., keys) in the lower 56 bits, and the local seed of each bucket in the upper 8 bits. The methodvertexOffset(long)
returns the bucket (i.e., vertex) cumulative value starting from the edge cumulative value. -
values
The final magick—the list of modulo-3 values that define the output of the minimal perfect hash function. -
bitVector
The bit vector underlyingvalues
. -
array
protected transient long[] arrayThe bit array supportingbitVector
. -
transform
The transformation strategy. -
signatureMask
protected final long signatureMaskThe mask to compare signatures, or zero for no signatures. -
signatures
The signatures.
-
-
Constructor Details
-
GOVMinimalPerfectHashFunction
protected GOVMinimalPerfectHashFunction(Iterable<? extends T> keys, TransformationStrategy<? super T> transform, int signatureWidth, File tempDir, BucketedHashStore<T> bucketedHashStore) throws IOException Creates a new minimal perfect hash function for the given keys.- Parameters:
keys
- the keys to hash, ornull
.transform
- a transformation strategy for the keys.signatureWidth
- a signature width, or 0 for no signature.tempDir
- a temporary directory for the store files, ornull
for the standard temporary directory.bucketedHashStore
- a bucketed hash store containing the keys, ornull
; the store can be unchecked, but in this casekeys
andtransform
must be non-null
.- Throws:
IOException
-
-
Method Details
-
countNonzeroPairs
public static final int countNonzeroPairs(long x) Counts the number of nonzero pairs of bits in a long.- Parameters:
x
- a long.- Returns:
- the number of nonzero bit pairs in
x
.
-
vertexOffset
protected static long vertexOffset(long edgeOffsetSeed) -
numBits
public long numBits()Returns the number of bits used by this structure.- Returns:
- the number of bits used by this structure.
-
getLong
- Specified by:
getLong
in interfaceObject2LongFunction<T>
-
getLongBySignature
public long getLongBySignature(long[] signature) Low-level access to the output of this minimal perfect hash function.This method makes it possible to build several kind of functions on the same
BucketedHashStore
and then retrieve the resulting values by generating a single signature. The methodTwoStepsGOV3Function.getLong(Object)
is a good example of this technique.- Parameters:
signature
- a signature generated as documented inBucketedHashStore
.- Returns:
- the output of the function.
-
size64
public long size64()- Specified by:
size64
in interfaceSize64
- Overrides:
size64
in classAbstractHashFunction<T>
-
dump
- Throws:
IOException
-
main
-