word2vec::vocab

Struct Vocabulary

pub struct Vocabulary {
    pub word2idx: HashMap<String, usize>,
    pub idx2word: Vec<String>,
    pub counts: Vec<u64>,
    pub noise_table: Vec<u32>,
    pub total_tokens: u64,
}

Maps tokens ↔ integer indices and stores frequency statistics.

Fields§

§word2idx: HashMap<String, usize>

word → index

§idx2word: Vec<String>

index → word

§counts: Vec<u64>

Raw corpus frequency per index

§noise_table: Vec<u32>

Flat noise table for O(1) negative sampling

§total_tokens: u64

Total token count (after min_count filter)

Implementations§

impl Vocabulary

pub fn build(sentences: &[Vec<String>], config: &Config) -> Result<Self>

Build vocabulary from a tokenised corpus.

Steps:

1. Count every token
2. Drop tokens below config.min_count
3. Sort by descending frequency (stable index order)
4. Build unigram noise table

use word2vec::{Config, vocab::Vocabulary};

let corpus = vec!["the cat sat on the mat".to_string()];
let tokens: Vec<Vec<String>> = corpus.iter()
    .map(|s| s.split_whitespace().map(str::to_string).collect())
    .collect();

let vocab = Vocabulary::build(&tokens, &Config::default()).unwrap();
assert!(vocab.word2idx.contains_key("the"));
assert_eq!(vocab.count("the"), 2);
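The first three steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `count_filter_sort` is a hypothetical helper, and the alphabetical tie-break between equally frequent words is an assumption made so that indices are reproducible.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of the crate): returns
/// (idx2word, counts, word2idx) for a tokenised corpus.
fn count_filter_sort(
    sentences: &[Vec<String>],
    min_count: u64,
) -> (Vec<String>, Vec<u64>, HashMap<String, usize>) {
    // Step 1: count every token.
    let mut freq: HashMap<&str, u64> = HashMap::new();
    for sentence in sentences {
        for token in sentence {
            *freq.entry(token.as_str()).or_insert(0) += 1;
        }
    }

    // Step 2: drop tokens whose count is below min_count.
    let mut entries: Vec<(&str, u64)> = freq
        .into_iter()
        .filter(|&(_, c)| c >= min_count)
        .collect();

    // Step 3: sort by descending frequency. Ties are broken alphabetically
    // (an assumption) so the resulting indices do not depend on hash order.
    entries.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(b.0)));

    // Derive the three parallel structures from the sorted entries.
    let idx2word: Vec<String> = entries.iter().map(|(w, _)| w.to_string()).collect();
    let counts: Vec<u64> = entries.iter().map(|&(_, c)| c).collect();
    let word2idx: HashMap<String, usize> = idx2word
        .iter()
        .enumerate()
        .map(|(i, w)| (w.clone(), i))
        .collect();
    (idx2word, counts, word2idx)
}
```

Sorting by descending frequency puts the most common words at the lowest indices, so `counts[0]` is always the largest count in the vocabulary.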

pub fn len(&self) -> usize

Number of unique tokens in vocabulary.

pub fn is_empty(&self) -> bool

Returns true if the vocabulary contains no words.

pub fn count(&self, word: &str) -> u64

Frequency of a word (0 if not in vocab).

use word2vec::{Config, vocab::Vocabulary};
let corpus = vec!["a a b".to_string()];
let tokens: Vec<Vec<String>> = corpus.iter()
    .map(|s| s.split_whitespace().map(str::to_string).collect())
    .collect();
let vocab = Vocabulary::build(&tokens, &Config::default()).unwrap();
assert_eq!(vocab.count("a"), 2);
assert_eq!(vocab.count("z"), 0);

pub fn should_subsample(&self, idx: usize, threshold: f64, dice: f64) -> bool

Returns true if the word at idx should be subsampled (discarded), given dice, a uniformly random value in [0, 1).

Uses Mikolov’s formula: P(keep) = min(1, sqrt(t/f) + t/f), where t is threshold and f is the word’s relative frequency in the corpus.
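The keep probability can be worked through numerically. The sketch below assumes that f is `count / total_tokens` and that a word is discarded when the dice value exceeds P(keep); both function names are hypothetical and the exact comparison used by the crate is not shown on this page.

```rust
/// Keep probability: min(1, sqrt(t/f) + t/f),
/// with f = count / total_tokens (assumed definition of f).
fn keep_probability(count: u64, total_tokens: u64, threshold: f64) -> f64 {
    if count == 0 || total_tokens == 0 {
        // Nothing to subsample; this guard is an assumption of the sketch.
        return 1.0;
    }
    let f = count as f64 / total_tokens as f64;
    let ratio = threshold / f;
    (ratio.sqrt() + ratio).min(1.0)
}

/// Discard the word when the dice value lands above the keep probability
/// (assumed comparison direction).
fn should_discard(count: u64, total_tokens: u64, threshold: f64, dice: f64) -> bool {
    dice > keep_probability(count, total_tokens, threshold)
}
```

For a word that makes up 10% of a corpus with t = 0.001, t/f = 0.01, so P(keep) = 0.1 + 0.01 = 0.11. A word with f equal to t gets sqrt(1) + 1 = 2, which is clamped to 1, so it is never discarded.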

pub fn negative_sample(&self, rng: &mut SmallRng) -> usize

Draw a negative sample index from the noise distribution.

Uses the precomputed unigram table for O(1) lookup.
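A flat table makes sampling O(1) because each word index is repeated in proportion to its weight, so a draw is a single array read at a random position. The sketch below is one way to build and use such a table, not the crate's code: `build_noise_table` and `sample_from_table` are hypothetical names, the `power` parameter is an assumption (word2vec implementations commonly raise counts to 0.75), and the random number is passed in as a plain `u64` to keep the example free of external crates.

```rust
/// Build a flat table in which index i occupies a share of slots
/// proportional to counts[i]^power. Returns an empty table if there
/// is nothing to sample from.
fn build_noise_table(counts: &[u64], table_size: usize, power: f64) -> Vec<u32> {
    let weights: Vec<f64> = counts.iter().map(|&c| (c as f64).powf(power)).collect();
    let total: f64 = weights.iter().sum();
    let mut table = Vec::with_capacity(table_size);
    if total <= 0.0 {
        return table;
    }
    let mut cumulative = 0.0;
    for (idx, w) in weights.iter().enumerate() {
        cumulative += w / total;
        // The last word fills whatever remains, absorbing rounding error.
        let end = if idx + 1 == weights.len() {
            table_size
        } else {
            ((cumulative * table_size as f64).round() as usize).min(table_size)
        };
        while table.len() < end {
            table.push(idx as u32);
        }
    }
    table
}

/// O(1) draw: reduce a caller-supplied random u64 to a slot and read it.
/// Panics if the table is empty.
fn sample_from_table(table: &[u32], random: u64) -> usize {
    table[(random % table.len() as u64) as usize] as usize
}
```

With counts of 3 and 1, a table of 8 slots, and `power` set to 1.0, the first word fills six slots and the second fills two, so a uniform slot choice returns the first word three times out of four.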

pub fn tokenise_and_subsample(&self, sentence: &[String], threshold: f64, rng: &mut SmallRng) -> Vec<usize>

Tokenise and subsample a sentence, returning word indices.
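As a rough sketch of what this combination might look like, the free function below maps words to indices and then applies the subsampling formula to each one. Several things here are assumptions rather than documented behaviour: out-of-vocabulary words are skipped, f is taken as `counts[idx] / total_tokens`, and the random source is a caller-supplied closure standing in for `SmallRng` so the example needs no external crates.

```rust
use std::collections::HashMap;

/// Hypothetical stand-alone version: look up each word, then keep it
/// with probability min(1, sqrt(t/f) + t/f).
fn tokenise_and_subsample(
    word2idx: &HashMap<String, usize>,
    counts: &[u64],
    total_tokens: u64,
    sentence: &[String],
    threshold: f64,
    mut dice: impl FnMut() -> f64, // stands in for the RNG; yields values in [0, 1)
) -> Vec<usize> {
    sentence
        .iter()
        // Out-of-vocabulary words are skipped (assumed behaviour).
        .filter_map(|w| word2idx.get(w.as_str()).copied())
        .filter(|&idx| {
            let f = counts[idx] as f64 / total_tokens as f64;
            let ratio = threshold / f;
            let keep = (ratio.sqrt() + ratio).min(1.0);
            // One fresh dice value per in-vocabulary word.
            dice() <= keep
        })
        .collect()
}
```

Passing the dice as a closure also makes the function deterministic under test: a constant closure shows exactly which words survive at a given threshold.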

Trait Implementations§

impl Clone for Vocabulary

fn clone(&self) -> Vocabulary

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for Vocabulary

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl<'de> Deserialize<'de> for Vocabulary

fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where D: Deserializer<'de>,

Deserialize this value from the given Serde deserializer.

impl Serialize for Vocabulary

fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where S: Serializer,

Serialize this value into the given Serde serializer.

Auto Trait Implementations§

impl Freeze for Vocabulary

impl RefUnwindSafe for Vocabulary

impl Send for Vocabulary

impl Sync for Vocabulary

impl Unpin for Vocabulary

impl UnsafeUnpin for Vocabulary

impl UnwindSafe for Vocabulary

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)

Performs copy-assignment from self to dest.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> DeserializeOwned for T
where T: for<'de> Deserialize<'de>,