Module vocab

Vocabulary construction with frequency counting, subsampling, and the unigram noise distribution table for negative sampling.

§Subsampling

Frequent words are stochastically discarded using Mikolov’s formula:

P(discard) = 1 - sqrt(t / f)

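where t is Config::subsample_threshold and f is the word’s relative corpus frequency (its count divided by the total number of tokens). For f ≤ t the expression is non-positive, so such words are never discarded.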
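
As a rough illustration of the formula above, the sketch below computes the discard probability and makes a keep/discard decision from a uniform random draw. The names discard_probability and keep_token are hypothetical and not part of this module's API. Clamping to [0, 1] and keeping words with zero frequency are assumptions about how the edge cases might be handled.

```rust
/// Hypothetical helper: probability of discarding a word with relative
/// corpus frequency `f`, given subsampling threshold `t`.
/// Implements 1 - sqrt(t / f), clamped to [0, 1] (the clamp is an assumption).
fn discard_probability(t: f64, f: f64) -> f64 {
    if f <= 0.0 {
        // Assumption: a word with no recorded frequency is never discarded.
        return 0.0;
    }
    (1.0 - (t / f).sqrt()).clamp(0.0, 1.0)
}

/// Hypothetical helper: decide whether to keep a token, given a uniform
/// random draw `u` in [0, 1) supplied by the caller.
fn keep_token(t: f64, f: f64, u: f64) -> bool {
    u >= discard_probability(t, f)
}
```

With t = 0.001, a word that makes up 10% of the corpus has a discard probability of about 0.9, while a word with f = 0.0001 is always kept.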

§Negative Sampling Table

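A flat array of TABLE_SIZE word indices in which each word appears in proportion to freq^0.75. Compared with the raw unigram distribution, the 0.75 exponent downweights very frequent words as negatives and gives rare words a larger share.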
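
One way such a table could be filled is sketched below. The function build_noise_table and its parameters are hypothetical, and the fill strategy (a deterministic pass over cumulative shares) is an assumption, not necessarily what this module does.

```rust
/// Hypothetical sketch: build a table of `table_size` word indices in which
/// word `i` occupies a share of slots proportional to counts[i]^0.75.
fn build_noise_table(counts: &[u64], table_size: usize) -> Vec<usize> {
    // Smoothed weight of each word: count raised to the 0.75 power.
    let weights: Vec<f64> = counts.iter().map(|&c| (c as f64).powf(0.75)).collect();
    let total: f64 = weights.iter().sum();

    let mut table = Vec::with_capacity(table_size);
    if total <= 0.0 {
        // Empty vocabulary or all-zero counts: nothing to sample from.
        return table;
    }

    let mut word = 0usize;
    // Upper bound of the current word's cumulative share of [0, 1).
    let mut cumulative = weights[0] / total;
    for slot in 0..table_size {
        // Midpoint of this slot's position within [0, 1).
        let position = (slot as f64 + 0.5) / table_size as f64;
        // Advance to the word whose cumulative range contains the midpoint.
        while position >= cumulative && word + 1 < counts.len() {
            word += 1;
            cumulative += weights[word] / total;
        }
        table.push(word);
    }
    table
}
```

Drawing a negative sample then amounts to picking a uniformly random slot in the table. For counts [16, 1] the smoothed weights are 8 and 1, so the rare word's share rises from 1/17 under the raw unigram distribution to 1/9 in the table.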

Structs§

Vocabulary: Maps tokens ↔ integer indices and stores frequency statistics.