Vocabulary construction with frequency counting, subsampling, and the unigram noise distribution table for negative sampling.
§Subsampling
Frequent words are stochastically discarded using Mikolov’s formula:
P(discard) = 1 - sqrt(t / f)
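where t is Config::subsample_threshold and f is the word’s relative corpus frequency (its count divided by the total token count). For words with f <= t the expression is zero or negative, so those words are never discarded.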
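The formula can be sketched as a pair of small functions. This is an illustrative sketch, not this module's actual code: the names `discard_probability` and `keep_word` are hypothetical, `t` is assumed to be positive, and the uniform random draw `u` is passed in by the caller so the example needs no external crate.

```rust
/// Probability of discarding one occurrence of a word with relative
/// frequency `f` under threshold `t` (assumed positive), clamped to [0, 1].
/// Hypothetical helper, not part of the documented API.
fn discard_probability(f: f64, t: f64) -> f64 {
    if f <= 0.0 {
        return 0.0; // a word with no occurrences has nothing to discard
    }
    // 1 - sqrt(t / f) is negative when f < t, so clamp it to a valid probability.
    (1.0 - (t / f).sqrt()).clamp(0.0, 1.0)
}

/// Decide whether to keep one occurrence, given a uniform draw `u` in [0, 1).
fn keep_word(f: f64, t: f64, u: f64) -> bool {
    u >= discard_probability(f, t)
}
```

With t = 1e-3, a word that makes up 10% of the corpus is dropped about 90% of the time, while any word rarer than the threshold is always kept.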
§Negative Sampling Table
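A flat array of TABLE_SIZE word indices distributed in proportion to freq^0.75,
so that picking a uniformly random slot samples a negative from the smoothed unigram distribution. The 0.75 exponent downweights very frequent words as negatives relative to their raw frequency.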
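One common way to fill such a table is a single pass over the slots with a running cumulative share. The sketch below assumes that approach; `build_noise_table` and `sample_negative` are hypothetical names, the table size is a parameter rather than the crate's `TABLE_SIZE` constant, and the random value is supplied by the caller.

```rust
/// Build a table of `table_size` word indices in which word `i` occupies a
/// share of slots roughly proportional to `counts[i]^0.75`.
/// Hypothetical helper, not part of the documented API.
fn build_noise_table(counts: &[u64], table_size: usize) -> Vec<usize> {
    let weights: Vec<f64> = counts.iter().map(|&c| (c as f64).powf(0.75)).collect();
    let total: f64 = weights.iter().sum();
    let mut table = Vec::with_capacity(table_size);
    if counts.is_empty() || total <= 0.0 {
        return table; // nothing to sample from
    }
    let mut word = 0usize;
    let mut cumulative = weights[0] / total;
    for slot in 0..table_size {
        table.push(word);
        // Move to the next word once the filled fraction passes its cumulative share.
        if (slot + 1) as f64 / table_size as f64 > cumulative && word + 1 < counts.len() {
            word += 1;
            cumulative += weights[word] / total;
        }
    }
    table
}

/// Pick a negative sample from a non-empty table using a caller-supplied random `r`.
fn sample_negative(table: &[usize], r: u64) -> usize {
    table[(r % table.len() as u64) as usize]
}
```

For counts of 16 and 1, the raw ratio is 16:1 but the smoothed weights are 8:1, so the rarer word receives about one slot in nine instead of one in seventeen.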
Structs§
- Vocabulary
- Maps tokens ↔ integer indices and stores frequency statistics.