Class GV4CompressedFunction<T>
- All Implemented Interfaces:
Function<T,
,Long> Object2LongFunction<T>
,Size64
,Serializable
,Function<T,
,Long> ToLongFunction<T>
Instances of this class store a function from keys to values. Keys are provided by an
iterable object (whose iterators must return elements in a consistent
order), whereas values are provided by a LongIterable
. If you do not specify values, each
key will be assigned its rank (e.g., its position in iteration order starting from zero).
The values must have a skewed distribution: some values must be much more frequent than others.
In that case, this data structure uses much less space than, say, a GOV4Function
because
it is able to use, for each key, a number of bits close to the empirical entropy of the value
list (with an additional ≈3%). It is slower than a GV3CompressedFunction
, and it
takes more time to build, but it uses less space.
For convenience, this class provides a main method that reads from standard input a (possibly
gzip
'd) sequence of newline-separated strings, and writes a serialised function
mapping each element of the list to its position, or to a given list of values.
Building a function
This class provides a great amount of flexibility when creating a new function; such flexibility is exposed through the builder. To exploit the various possibilities, you must understand some details of the construction.
In a first phase, we build a BucketedHashStore
containing hashes of the keys. By default,
the store will associate each hash with the rank of the key. If you
specify values, the store will associate with each hash
the corresponding value.
However, if you further require an indirect construction the
store will associate again each hash with the rank of the corresponding key, and access randomly
the values (which must be either a LongList
or a LongBigList
). Indirect
construction is useful only in complex, multi-layer hashes (such as an
LcpMonotoneMinimalPerfectHashFunction
) in which we want to reuse a checked
BucketedHashStore
. Storing values in the BucketedHashStore
is extremely scalable
because the values must just be a LongIterable
that will be scanned sequentially during
the store construction. On the other hand, if you have already a store that associates ordinal
positions, and you want to build a new function for which a LongList
or
LongBigList
of values needs little space (e.g., because it is described implicitly), you
can opt for an indirect construction using the already built
store.
Note that if you specify a store it will be used before building a new one (possibly because of a
DuplicateException
), with
obvious benefits in terms of performance. If the store is not checked, and a
DuplicateException
is thrown,
the constructor will try to rebuild the store, but this requires, of course, that the keys, and
possibly the values, are available. Note that it is your responsibility to pass a correct store.
Multithreading
This implementation is multithreaded: each bucket returned by the BucketedHashStore
is
processed independently. By default, this class uses Runtime.availableProcessors()
parallel threads, but by default no more than 4. If you wish to set a specific number of threads,
you can do so through the system property "it.unimi.dsi.sux4j.mph.threads".
Implementation Details
The detail of the data structure can be found in “Engineering Compressed Functions”, by Marco Genuzio and Sebastiano Vigna, 2017. The theoretical basis for the construction is described by Jóhannes B. Hreinsson, Morten Krøyer, and Rasmus Pagh in “Storing a compressed function with constant time access”, Algorithms - ESA 2009, 17th Annual European Symposium, 2009, pages 730−741. Each output value is represented by a codeword from a prefix-free code (by default, a length-limited Huffman code). We generate a random 4-regular linear system on F2, where the known term of the k-th equation is a bit in a bit array. When we read the bit array at four suitable positions depending on an input key, we can recover the codeword representing the output value.
- Since:
- 4.2.0
- Author:
- Sebastiano Vigna, Marco Genuzio
- See Also:
-
Nested Class Summary
-
Field Summary
Modifier and TypeFieldDescriptionstatic final int
The expected bucket size.protected final BitVector
A bit vector storing the main data array.protected final Codec.Decoder
The decoder that will be used to yield output values.static final double
static final int
protected final int
Codec.Decoder.escapedSymbolLength()
fromdecoder
, cached.protected final int
Codec.Decoder.escapeLength()
fromdecoder
, cached.protected final int
Length of longest codewordprotected long
The seed used to generate the initial signature.protected final long
The number of keys.static final String
The system property used to set the number of parallel threads.protected static final int
protected final long[]
protected static final int
protected final TransformationStrategy<? super T>
The transformation strategy to turn objects of typeT
into bit vectors.Fields inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
defRetValue
-
Constructor Summary
ModifierConstructorDescriptionprotected
GV4CompressedFunction
(Iterable<? extends T> keys, TransformationStrategy<? super T> transform, LongIterable values, boolean indirect, File tempDir, BucketedHashStore<T> bucketedHashStore, Codec codec) Creates a new function for the given keys and values. -
Method Summary
Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
defaultReturnValue, defaultReturnValue
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface it.unimi.dsi.fastutil.objects.Object2LongFunction
andThen, andThenByte, andThenChar, andThenDouble, andThenFloat, andThenInt, andThenLong, andThenObject, andThenReference, andThenShort, applyAsLong, composeByte, composeChar, composeDouble, composeFloat, composeInt, composeLong, composeObject, composeReference, composeShort, get, getOrDefault, getOrDefault, put, put, remove, removeLong
-
Field Details
-
SEED_BITS
protected static final int SEED_BITS- See Also:
-
OFFSET_BITS
protected static final int OFFSET_BITS- See Also:
-
NUMBER_OF_THREADS_PROPERTY
The system property used to set the number of parallel threads.- See Also:
-
DELTA
public static final double DELTA- See Also:
-
DELTA_TIMES_256
public static final int DELTA_TIMES_256 -
BUCKET_SIZE
public static final int BUCKET_SIZEThe expected bucket size.- See Also:
-
n
protected final long nThe number of keys. -
globalMaxCodewordLength
protected final int globalMaxCodewordLengthLength of longest codeword -
globalSeed
protected long globalSeedThe seed used to generate the initial signature. -
offsetAndSeed
protected final long[] offsetAndSeed -
data
A bit vector storing the main data array. -
transform
The transformation strategy to turn objects of typeT
into bit vectors. -
decoder
The decoder that will be used to yield output values. -
escapeLength
protected final int escapeLengthCodec.Decoder.escapeLength()
fromdecoder
, cached. -
escapedSymbolLength
protected final int escapedSymbolLengthCodec.Decoder.escapedSymbolLength()
fromdecoder
, cached.
-
-
Constructor Details
-
GV4CompressedFunction
protected GV4CompressedFunction(Iterable<? extends T> keys, TransformationStrategy<? super T> transform, LongIterable values, boolean indirect, File tempDir, BucketedHashStore<T> bucketedHashStore, Codec codec) throws IOException Creates a new function for the given keys and values.- Parameters:
keys
- the keys in the domain of the function, ornull
.transform
- a transformation strategy for the keys.values
- values to be assigned to each element, in the same order of the iterator returned bykeys
; ifnull
, the assigned value will the ordinal number of each element.indirect
- if true,bucketedHashStore
contains ordinal positions, andvalues
is aLongIterable
that must be accessed to retrieve the actual values.tempDir
- a temporary directory for the store files, ornull
for the standard temporary directory.bucketedHashStore
- a bucketed hash store containing the keys associated with their values and counting value frequencies, ornull
; the store can be unchecked, but in this casekeys
andtransform
must be non-null
.codec
- theCodec
used to encode values.- Throws:
IOException
-
-
Method Details
-
getLong
- Specified by:
getLong
in interfaceObject2LongFunction<T>
-
size64
public long size64()Returns the number of keys in the function domain. -
size
Deprecated. -
numBits
public long numBits()Returns the number of bits used by this structure.- Returns:
- the number of bits used by this structure.
-
containsKey
- Specified by:
containsKey
in interfaceFunction<T,
Long>
-
dump
- Throws:
IOException
-
main
-