pub struct Vocabulary {
pub word2idx: HashMap<String, usize>,
pub idx2word: Vec<String>,
pub counts: Vec<u64>,
pub noise_table: Vec<u32>,
pub total_tokens: u64,
}
Maps tokens ↔ integer indices and stores frequency statistics.
Fields§
word2idx: HashMap<String, usize>
word → index
idx2word: Vec<String>
index → word
counts: Vec<u64>
Raw corpus frequency per index
noise_table: Vec<u32>
Flat noise table for O(1) negative sampling
total_tokens: u64
Total token count (after min_count filter)
Implementations§
impl Vocabulary
pub fn build(sentences: &[Vec<String>], config: &Config) -> Result<Self>
Build vocabulary from a tokenised corpus.
Steps:
- Count every token
- Drop tokens below config.min_count
- Sort by descending frequency (stable index order)
- Build unigram noise table
use word2vec::{Config, vocab::Vocabulary};
let corpus = vec!["the cat sat on the mat".to_string()];
let tokens: Vec<Vec<String>> = corpus.iter()
.map(|s| s.split_whitespace().map(str::to_string).collect())
.collect();
let vocab = Vocabulary::build(&tokens, &Config::default()).unwrap();
assert!(vocab.word2idx.contains_key("the"));
assert_eq!(vocab.count("the"), 2);
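The first three build steps (count, filter, sort) can be sketched with the standard library alone. This is a minimal illustration, not the crate's actual implementation: `VocabSketch` and `build_sketch` are hypothetical names, `min_count` stands in for `config.min_count`, and the alphabetical tie-break is an assumption made only to keep the index order deterministic. The noise table is left out here.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `Vocabulary`; the noise table is omitted.
pub struct VocabSketch {
    pub word2idx: HashMap<String, usize>,
    pub idx2word: Vec<String>,
    pub counts: Vec<u64>,
    pub total_tokens: u64,
}

/// Count, filter, sort, and index a tokenised corpus.
/// `min_count` plays the role of `config.min_count`.
pub fn build_sketch(sentences: &[Vec<String>], min_count: u64) -> VocabSketch {
    // Step 1: count every token.
    let mut freq: HashMap<&str, u64> = HashMap::new();
    for token in sentences.iter().flatten() {
        *freq.entry(token.as_str()).or_insert(0) += 1;
    }

    // Step 2: drop tokens whose count is below `min_count`.
    let mut kept: Vec<(&str, u64)> = freq
        .into_iter()
        .filter(|&(_, c)| c >= min_count)
        .collect();

    // Step 3: sort by descending frequency. Ties are broken alphabetically
    // (an assumption) so that the resulting indices are deterministic.
    kept.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(b.0)));

    // Assign indices in sorted order: index 0 is the most frequent word.
    let mut word2idx = HashMap::with_capacity(kept.len());
    let mut idx2word = Vec::with_capacity(kept.len());
    let mut counts = Vec::with_capacity(kept.len());
    for (i, (word, count)) in kept.into_iter().enumerate() {
        word2idx.insert(word.to_string(), i);
        idx2word.push(word.to_string());
        counts.push(count);
    }

    // Total token count after the min_count filter.
    let total_tokens: u64 = counts.iter().sum();
    VocabSketch { word2idx, idx2word, counts, total_tokens }
}
```

Sorting by descending frequency puts the most common words at the lowest indices, so `counts` and `idx2word` can be read as a frequency ranking.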
pub fn count(&self, word: &str) -> u64
Frequency of a word (0 if not in vocab).
use word2vec::{Config, vocab::Vocabulary};
let corpus = vec!["a a b".to_string()];
let tokens: Vec<Vec<String>> = corpus.iter()
.map(|s| s.split_whitespace().map(str::to_string).collect())
.collect();
let vocab = Vocabulary::build(&tokens, &Config::default()).unwrap();
assert_eq!(vocab.count("a"), 2);
assert_eq!(vocab.count("z"), 0);
pub fn should_subsample(&self, idx: usize, threshold: f64, dice: f64) -> bool
Returns true if this word should be subsampled (discarded), given
a uniformly random dice value in [0, 1).
Uses Mikolov’s formula: P(keep) = min(1, sqrt(t/f) + t/f), where t is threshold and f is the word’s relative frequency in the corpus.
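The keep probability can be worked through in a few lines. The sketch below assumes that f is computed as the word's count divided by the total token count and that a word is discarded when the dice value is at or above P(keep). Neither detail is confirmed by the documentation, and `keep_probability` and `should_subsample_sketch` are hypothetical names.

```rust
/// P(keep) = min(1, sqrt(t/f) + t/f), with f = count / total_tokens
/// (how the crate derives f is an assumption).
pub fn keep_probability(count: u64, total_tokens: u64, threshold: f64) -> f64 {
    if count == 0 || total_tokens == 0 {
        // Degenerate case: nothing to subsample (an assumption).
        return 1.0;
    }
    let f = count as f64 / total_tokens as f64;
    let ratio = threshold / f;
    (ratio.sqrt() + ratio).min(1.0)
}

/// Discard the word when the dice roll lands at or above P(keep).
/// With dice in [0, 1), a word whose P(keep) is 1 is never discarded.
pub fn should_subsample_sketch(
    count: u64,
    total_tokens: u64,
    threshold: f64,
    dice: f64,
) -> bool {
    dice >= keep_probability(count, total_tokens, threshold)
}
```

For example, a word that makes up 10% of the corpus with a threshold of 0.001 gives t/f = 0.01, so P(keep) = 0.1 + 0.01 = 0.11, while a word seen once in a million tokens is always kept.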
pub fn negative_sample(&self, rng: &mut SmallRng) -> usize
Draw a negative sample index from the noise distribution.
Uses the precomputed unigram table for O(1) lookup.
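A flat table makes sampling O(1) because each word owns a number of slots proportional to its weight, so one uniform draw picks a slot directly. The sketch below is illustrative only: it assumes the customary word2vec weighting of count^0.75, which this documentation does not confirm, and `build_noise_table`, `sample_from_table`, and `table_size` are hypothetical. To stay free of external crates, the uniform value `u` is passed in rather than drawn from `SmallRng`.

```rust
/// Build a flat table in which index `i` occupies a share of the slots
/// proportional to counts[i]^0.75 (the exponent is an assumption).
pub fn build_noise_table(counts: &[u64], table_size: usize) -> Vec<u32> {
    let weights: Vec<f64> = counts.iter().map(|&c| (c as f64).powf(0.75)).collect();
    let total: f64 = weights.iter().sum();

    let mut table = Vec::with_capacity(table_size);
    let mut cumulative = 0.0;
    for (idx, w) in weights.iter().enumerate() {
        cumulative += w / total;
        // Fill slots until this word's cumulative share is reached.
        let end = ((cumulative * table_size as f64).round() as usize).min(table_size);
        while table.len() < end {
            table.push(idx as u32);
        }
    }

    // Guard against rounding leaving the tail of the table unfilled.
    if let Some(last) = counts.len().checked_sub(1) {
        while table.len() < table_size {
            table.push(last as u32);
        }
    }
    table
}

/// O(1) draw: `u` is a uniform value in [0, 1) supplied by the caller's RNG.
pub fn sample_from_table(table: &[u32], u: f64) -> usize {
    assert!(!table.is_empty(), "noise table must not be empty");
    let slot = ((u * table.len() as f64) as usize).min(table.len() - 1);
    table[slot] as usize
}
```

The trade-off is memory for speed: a larger `table_size` approximates the target distribution more closely, at the cost of a bigger `Vec<u32>`.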
Trait Implementations§
impl Clone for Vocabulary
fn clone(&self) -> Vocabulary
Returns a duplicate of the value.
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for Vocabulary
impl<'de> Deserialize<'de> for Vocabulary
fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where
    __D: Deserializer<'de>,
Deserialize this value from the given Serde deserializer.
Auto Trait Implementations§
impl Freeze for Vocabulary
impl RefUnwindSafe for Vocabulary
impl Send for Vocabulary
impl Sync for Vocabulary
impl Unpin for Vocabulary
impl UnsafeUnpin for Vocabulary
impl UnwindSafe for Vocabulary
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.