public class GOV3Function<T> extends AbstractObject2LongFunction<T> implements java.io.Serializable, Size64
Instances of this class store a function from keys to values. Keys are provided by an iterable object (whose iterators
must return elements in a consistent order), whereas values are provided by a LongIterable
. If you do not specify
values, each key will be assigned its rank (e.g., its position in iteration order starting from zero).
For convenience, this class provides a main method that reads from
standard input a (possibly gzip
'd) sequence of newlineseparated strings, and
writes a serialised function mapping each element of the list to its position, or to a given list of values.
Optionally, it is possible to sign a GOV3Function
.
Signing is possible if no list of values has been specified (otherwise, there
is no way to associate a key with its signature). A wbit signature will
be associated with each key, so that getLong(Object)
will return a default return value (by default, 1) on strings that are not
in the original key set. As usual, false positives are possible with probability 2^{w}.
If you're not interested in the rank of a key, but just to know whether the key was in the original set, you can turn the function into a dictionary. In this case, the value associated by the function with a key is exactly its signature, which means that the only space used by the function is that occupied by signatures: this is one of the fastest and most compact way of storing a static dictionary. In this case, the only returned value is one, and the default return value is set to zero.
This class provides a great amount of flexibility when creating a new function; such flexibility is exposed through the builder. To exploit the various possibilities, you must understand some details of the construction.
In a first phase, we build a ChunkedHashStore
containing hashes of the keys. By default,
the store will associate each hash with the rank of the key. If you specify values,
the store will associate with each hash the corresponding value.
However, if you further require an indirect
construction the store will associate again each hash with the rank of the corresponding key, and access randomly the values
(which must be either a LongList
or a LongBigList
). Indirect construction is useful only in complex, multilayer
hashes (such as an LcpMonotoneMinimalPerfectHashFunction
) in which we want to reuse a checked ChunkedHashStore
.
Storing values in the ChunkedHashStore
is extremely scalable because the values must just be a LongIterable
that
will be scanned sequentially during the store construction. On the other hand, if you have already a store that
associates ordinal positions, and you want to build a new function for which a LongList
or LongBigList
of values needs little space (e.g.,
because it is described implicitly), you can opt for an indirect construction using the already built store.
Note that if you specify a store it will be used before building a new one (possibly because of a DuplicateException
),
with obvious benefits in terms of performance. If the store is not checked, and a DuplicateException
is
thrown, the constructor will try to rebuild the store, but this requires, of course, that the keys, and possibly the values, are available.
Note that it is your responsibility to pass a correct store.
This implementation is multithreaded: each chunk returned by the ChunkedHashStore
is processed independently. By
default, this class uses Runtime.availableProcessors()
parallel threads, but never more than 16. If you wish to
set a specific number of threads, you can do so through the system property "it.unimi.dsi.sux4j.mph.threads".
The detail of the data structure
can be found in “Fast Scalable Construction of (Minimal Perfect Hash) Functions”, by
Marco Genuzio, Giuseppe Ottaviano and Sebastiano Vigna,
15th International Symposium on Experimental Algorithms — SEA 2016,
Lecture Notes in Computer Science, Springer, 2016. We generate a random 3regular linear system on F_{2}, where
the known term of the kth equation is the output value for the kth key.
Then, we solve it and store the solution. Since the system must have ≈10% more variables than equations to be solvable,
an rbit GOV3Function
on n keys requires 1.1rn
bits.
Optionally, you may require a compacted function, that is, a
function storing nonzero values
in a separate array and using a ranked marker array to record the positions of nonzero values.
In this case, the function requires just (1.1 + r)n bits (plus the bits that are necessary for the
ranking structure; the current implementation uses Rank16
), but has slightly slower lookups.
GOV4Function
,
Serialized FormModifier and Type  Class and Description 

static class 
GOV3Function.Builder<T>
A builder class for
GOV3Function . 
Modifier and Type  Field and Description 

static double 
C
The ratio between variables and equations.

protected LongBigList 
data
The final magick—the list of modulo3 values that define the output of the minimal hash function.

protected long 
globalSeed
The seed used to generate the initial hash triple.

static int 
LOG2_CHUNK_SIZE
The logarithm of the desired chunk size.

protected long 
m
The number of variables.

protected LongArrayBitVector 
marker

protected long 
n
The number of keys.

static java.lang.String 
NUMBER_OF_THREADS_PROPERTY
The system property used to set the number of parallel threads.

protected long[] 
offsetAndSeed
A long containing the start offset of each chunk in the lower 56 bits, and the local seed of each chunk in the upper 8 bits.

protected Rank16 
rank
The ranking structure on
marker . 
protected long 
signatureMask
The mask to compare signatures, or zero for no signatures.

protected LongBigList 
signatures
The signatures.

protected TransformationStrategy<? super T> 
transform
The transformation strategy to turn objects of type
T into bit vectors. 
protected int 
width
The data width.

defRetValue
Modifier  Constructor and Description 

protected 
GOV3Function(java.lang.Iterable<? extends T> keys,
TransformationStrategy<? super T> transform,
int signatureWidth,
LongIterable values,
int dataWidth,
boolean indirect,
boolean compacted,
java.io.File tempDir,
ChunkedHashStore<T> chunkedHashStore)
Creates a new function for the given keys and values.

Modifier and Type  Method and Description 

boolean 
containsKey(java.lang.Object o) 
long 
getLong(java.lang.Object o) 
long 
getLongByTriple(long[] triple)
Lowlevel access to the output of this function.

static void 
main(java.lang.String[] arg) 
long 
numBits()
Returns the number of bits used by this structure.

int 
size()
Deprecated.

long 
size64()
Returns the number of keys in the function domain.

defaultReturnValue, defaultReturnValue
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
applyAsLong, get, put, put, remove, removeLong
public static double C
public static final java.lang.String NUMBER_OF_THREADS_PROPERTY
public static final int LOG2_CHUNK_SIZE
protected final long n
protected final long m
protected final int width
protected final long globalSeed
protected final long[] offsetAndSeed
protected final LongBigList data
protected final LongArrayBitVector marker
protected final TransformationStrategy<? super T> transform
T
into bit vectors.protected final long signatureMask
protected final LongBigList signatures
protected GOV3Function(java.lang.Iterable<? extends T> keys, TransformationStrategy<? super T> transform, int signatureWidth, LongIterable values, int dataWidth, boolean indirect, boolean compacted, java.io.File tempDir, ChunkedHashStore<T> chunkedHashStore) throws java.io.IOException
keys
 the keys in the domain of the function, or null
.transform
 a transformation strategy for the keys.signatureWidth
 a positive number for a signature width, 0 for no signature, a negative value for a selfsigned function; if nonzero, values
must be null
and width
must be 1.values
 values to be assigned to each element, in the same order of the iterator returned by keys
; if null
, the
assigned value will the the ordinal number of each element.dataWidth
 the bit width of the values
, or 1 if values
is null
.indirect
 if true, chunkedHashStore
contains ordinal positions, and values
is a LongIterable
that
must be accessed to retrieve the actual values.compacted
 if true, the coefficients will be compacted.tempDir
 a temporary directory for the store files, or null
for the standard temporary directory.chunkedHashStore
 a chunked hash store containing the keys associated with their ranks (if there are no values, or indirect
is true)
or values, or null
; the store
can be unchecked, but in this case keys
and transform
must be nonnull
.java.io.IOException
public long getLong(java.lang.Object o)
getLong
in interface Object2LongFunction<T>
public long getLongByTriple(long[] triple)
This method makes it possible to build several kind of functions on the same ChunkedHashStore
and
then retrieve the resulting values by generating a single triple of hashes. The method
TwoStepsGOV3Function.getLong(Object)
is a good example of this technique.
triple
 a triple generated as documented in ChunkedHashStore
.public long size64()
public long numBits()
public boolean containsKey(java.lang.Object o)
public static void main(java.lang.String[] arg) throws java.lang.NoSuchMethodException, java.io.IOException, JSAPException
java.lang.NoSuchMethodException
java.io.IOException
JSAPException