Ensure the gensim generate the same Word2Vec model for different runs on the same data

Question

In LDA model generates different topics everytime i train on the same corpus , by setting the np.random.seed(0), the LDA model will always be initialized and trained in exactly the same way.

Is it the same for the Word2Vec models from gensim? By setting the random seed to a constant, would the different run on the same dataset produce the same model?

But strangely, it's already giving me the same vector at different instances.

>>> from nltk.corpus import brown
>>> from gensim.models import Word2Vec
>>> sentences = brown.sents()[:100]
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)
>>> exit()
alvas@ubi:~$ python
Python 2.7.11 (default, Dec 15 2015, 16:46:19) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.corpus import brown
>>> from gensim.models import Word2Vec
>>> sentences = brown.sents()[:100]
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> word0 = sentences[0][0]
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)

Is it true then that the default random seed is fixed? If so, what is the default random seed number? Or is it because I'm testing on a small dataset?

If it's true that the the random seed is fixed and different runs on the same data returns the same vectors, a link to a canonical code or documentation would be much appreciated.

score 14 · Answer 1 · answered Feb 16 '17 at 05:23

As per the docs of Gensim, for executing a fully deterministically-reproducible run, you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling.

A simple parameter edit to your code should do the trick.

model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=1)

score 10 · Accepted Answer · edited Aug 07 '20 at 12:22

Yes, default random seed is fixed to 1, as described by the author in https://radimrehurek.com/gensim/models/word2vec.html. Vectors for each word are initialised using a hash of the concatenation of word + str(seed).

Hashing function used, however, is Python’s rudimentary built in hash function and can produce different results if two machines differ in

32 vs 64 bit, reference
python versions, reference
different Operating Systems/ Interpreters, reference1, reference2

Above list is not exhaustive. Does it cover your question though?

EDIT

If you want to ensure consistency, you can provide your own hashing function as an argument in word2vec

A very simple (and bad) example would be:

def hash(astring):
   return ord(astring[0])

model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4, hashfxn=hash)

print model[sentences[0][0]]

I tried this. Still gives different results to me – Vahid the Great Jul 06 '21 at 11:15 — Vahid the Great, Jul 06 '21 at 11:15

score 10 · Answer 3 · answered Nov 28 '17 at 00:59

Just a remark on the randomness.

If one is working with gensim's W2V model and is using Python version >= 3.3, keep in mind that hash randomisation is turned on by default. If you're seeking consistency between two executions, make sure to set the PYTHONHASHSEED environment variable. E.g. when running your code like so PYTHONHASHSEED=123 python3 mycode.py, next time you generate a model (using the same hash seed) it would be the same as previously generated model (provided, that all other randomness control steps are followed, as mentioned above - random state and single worker). See gensim's W2V source and Python docs for details.

score 1 · Answer 4 · answered Jul 06 '21 at 11:36

For a fully deterministically reproducible run, next to defining a seed, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires the use of the PYTHONHASHSEED environment variable to control hash randomization).

def hash(astring):
  return ord(astring[0])

model = gensim.models.Word2Vec (texts, workers=1, seed=1,hashfxn=hash)

score 0 · Answer 5 · answered Apr 25 '22 at 02:15

0

"PYTHONHASHSEED=0 python yourcode.py" should solve your problem.

answered Apr 25 '22 at 02:15

Rise of Kingdom

33
6

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Apr 25 '22 at 05:54

Peter O. · Answer 6 · 2021-07-06T23:33:27.270

Your problem is indeed a small dataset: only 100 sentences.

Note what the Gensim FAQ says:

[Because randomness is part of Word2Vec and similar models], it is to be expected that models vary from run to run, even trained on the same data. There's no single "right place" for any word-vector or doc-vector to wind up: just positions that are at progressively more-useful distances & directions from other vectors co-trained inside the same model. [...]

Suitable training parameters should yield models that are roughly as useful, from run-to-run, as each other. Testing and evaluation processes should be tolerant of any shifts in vector positions, and of small "jitter" in the overall utility of models, that arises from the inherent algorithm randomness. (If the observed quality from run-to-run varies a lot, there may be other problems: too little data, poorly-tuned parameters, or errors/weaknesses in the evaluation method.)

You can try to force determinism[.] But [...] you'd be obscuring the inherent randomness/approximateness of the underlying algorithms[.] It's better to tolerate a little jitter, and use excessive jitter as an indicator of problems elsewhere in the data or model setup – rather than impose a superficial determinism.

Ensure the gensim generate the same Word2Vec model for different runs on the same data

6 Answers6

Linked