isitthere 0.2.9

A tool that generates perfect hash sets


To use this package, put the following dependency into your project's dependencies section:

dub.json
dub.sdl

Is it there ?

It's a D program that generates perfect hash sets for constant dictionaries made of strings.

It's based on the fact that for any small set of unique words there's a probability that exists a hash function able to produce different results for each element. IsItThere finds, using brute force and a PRNG, the coefficients that produce the same results as this hypothetical function. These coefficients are then used to generate the source code of the hash set.

The hash set produced is faster than any equivalent build at runtime:

  • the coefficients are stored on the stack (vs on the heap for a standard hash set).
  • the hash function simply adds the coefficients, mapped by each input byte of an entry (vs often several bit operations for a standard hash set).
  • "perfect" means no collision, so there's a single string comparison for each bucket that's not empty (vs possibly more than one for a standard hash set).

Build

DUB is required:

dub build --build=release

Usage

  • --cc<=true/false>: Optional, defines if the set is letter-case sensitive, which is the default.
  • --fk=<arg>: The formater kind. "0" or "agnostic" prints the raw data, "1" or "dlang" writes a ready to be used struct, "2" or "pascal" writes a pascal record.
  • --fn=<arg>: The name of the aggregate produced by the formatter (i.e a struct or a class name).
  • --if=<arg>: The input file. The dictionary, must contain a list of words, one by line.
  • --is=<arg>: The initial seed value as an integer, "0" means unpredictable.
  • --of=<arg>: The name of the file that will contain the tables, after formatting and if a perfect hash map exists for the word list. When not specified stdout is used.
  • --ml=<arg>: Optional, the map length as an int. This number should be at least equal to the count of words in the input list, but a bigger number, if possible a power of two, is recomended.
  • --nt=<arg>: Optional, the number of working threads.

The dictionary length should not be over (0xFFFF / 2) words. This program is designed for small dictionaries such as the keyword list of a programing language.

Notes:

  • A compact dictionary is not always faster since it can lead to more comparisons.
  • It's often interesting to choose a map length that's a power of two in order to avoid an integer division after the hashing. The formaters write a standard modulo but since the set size is constant it's very likely to be optimized by any compiler.
  • After generating d code, pass it to dfmt ! Constant arrays are written on very long lines.

Output

The programs formats the results either in a file or in stdout. The formatting relies on the fk parameters.

D

The input encoding is expected to be UTF8. The result of the D formater is ready to be used.

Pascal

The input encoding is expected to be UTF8. The result of the Pascal formater needs edition:

  • the right content must be put in the interface and the implementation sections.
  • the switch {$MODESWITCH ADVANCEDRECORDS} that's written before the record must be cut and paste between the unit declaration and the interface beginning

Examples

buildable
  • cd to the repository root.
  • bin/isitthere --if=test/d2kw.txt --of="test/d2kw.d" --ml=512 --fk=d --fn=DlangKeywords --is=2.
  • open the new test/d2kw.d file.
  • adds this code:
unittest
{
    assert( "qsdfsf" !in DlangKeywords);
    assert( "static" in DlangKeywords);
}
  • save and test the file: dmd d2kw.d -unittest -main.

in D

The Toy programming language Yatol uses a hash map generated by the program in order to filter the keywords when identifiers are created.

in Pascal

The D IDE Coedit uses two hash sets produced by this program.

Other information

  • licensed under the Boost Software license, 1.0
  • status: working, however the command line interface is not finished and more formatters could be added.
Authors:
  • Basile Burg
Dependencies:
none
Versions:
0.2.9 2018-Apr-08
0.2.8 2017-Jan-10
0.2.7 2017-Jan-10
0.2.6 2017-Jan-09
0.2.5 2017-Jan-09
Show all 10 versions
Download Stats:
  • 0 downloads today

  • 0 downloads this week

  • 0 downloads this month

  • 4 downloads total

Score:
0.7
Short URL:
isitthere.dub.pm