Class CompactSet
- All Implemented Interfaces:
HLLRepresentation
HyperLogLog
magic.
The compact representation can track up to 264 elements with an accurate cardinality estimation and occupy
only 12Kb of memory. At instantiation, is reserved a long array with length STORE_SIZE
. We use
P=HLL_BUCKET_COUNT
bits to identify the bucket and Q=HLL_MAX_CONSECUTIVE_ZEROES
bits to count the
consecutive zeroes. The Q number is representable with 6 bits. Therefore, a register has a width of 6 bits
(REGISTER_WIDTH
).
In C, we could allocate the memory space and manipulate it as we want, but here in Java, we have to do some bit twiddling magic to make the registers fit. In other words, each position in the array uses long with 64 bits, whereas one register uses only 6 bits. Multiple registers are within a single array position or spread across two positions. Some manipulation is necessary to retrieve the correct register value from the array.
The basic idea is that with a 64-bit hash, the first P bits identify the bucket, and the remaining Q bits we use to count the trailing zeroes, and in case all bits are 0, the count is Q + 1. The next step is to update the bucket value if it is smaller than the current count. From the bucket number and some bit-shifting, we identify the register.
Credits
This implementation takes inspiration from the Redis implementation [1], which is based [2]. The bit-shifting manipulation has some basis in [3], which implements the original algorithm without optimizations.Our cardinality estimation uses the optimizations proposed in [2]. And this is the reason we keep an additional int array. We use this approach as it is the same as Redis uses. This algorithm does not require a bias correction as the Google implementation or the [3]. That is, it does not need empirical data to correct estimations.
- Since:
- 15.0
- Author:
- José Bolina
- See Also:
-
Constructor Details
-
CompactSet
public CompactSet()
-
-
Method Details
-
set
public boolean set(byte[] data) Description copied from interface:HLLRepresentation
Add the given to the representation set.- Specified by:
set
in interfaceHLLRepresentation
- Parameters:
data
- : The data to include.- Returns:
- true if the data was added, and false otherwise.
-
cardinality
public long cardinality()The cardinality estimation is based on [2] (see class doc).This is an implementation of Algorithm 6 [2]. It uses the
multiplicity
vector and the methods oftau(double)
andsigma(double)
. Note that this does not need to iterate over the real registers to make an estimation,- Specified by:
cardinality
in interfaceHLLRepresentation
- Returns:
- The estimate cardinality of the set.
-
equals
-
hashCode
public int hashCode()
-