I have a function that extracts the pre-trained embeddings from GloVe.txt and loads them as Keras Embedding layer weights, but how can I do the same for the two files given (fasttext.vec and word2vec.bin)?
This accepted Stack Overflow answer gave me the impression that a .vec file can be treated as a .txt file, and that we can extract the embeddings from fasttext.vec with the same technique we use for glove.txt. Is my understanding correct?
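To check this, I peeked at the top of a .vec file (the filename here is just an example). It is indeed plain text, but unlike GloVe it starts with a header line giving the vocabulary size and vector dimension, which a GloVe-style loader would need to skip:

with open('wiki-news-300d-1M.vec', encoding='utf8') as f:
    print(f.readline().strip())  # header line, e.g. "999994 300" (vocab size, dimension)
    print(f.readline()[:60])     # first real entry: "<word> v1 v2 ... v300"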
I went through a lot of blogs and Stack Overflow answers to find out what to do with the binary file, and I found in this stack answer that the binary (.bin) file is the MODEL itself, not just the embeddings, and that you can convert the .bin file to a text file using Gensim. As I understand it, this saves the embeddings in text form, so we can then load the pre-trained embeddings just like we load GloVe. Is my understanding correct?
Here is the code to do that. I want to know if I'm on the right path, because I could not find a satisfactory answer to my question anywhere.
from numpy import asarray, zeros
from gensim.models import KeyedVectors
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()  # Keras Tokenizer()
tokenizer.fit_on_texts(data)  # data is a list of sentence strings
vocab_size = len(tokenizer.word_index) + 1  # extra 1 because index 0 is reserved (padding / unknown words)
encoded_docs = tokenizer.texts_to_sequences(data)
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')  # max_length is, say, 30
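# As a made-up example: if data = ["nice movie", "bad plot twist"], then
# texts_to_sequences gives something like [[1, 2], [3, 4, 5]] (indices depend on
# word frequency), and pad_sequences pads each row with zeros to max_length,
# so padded_docs has shape (len(data), 30).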
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)  # load the binary Word2Vec vectors
model.save_word2vec_format('GoogleNews-vectors-negative300.txt', binary=False)  # re-save the VECTORS as plain text. Can I then load that with the function below?
def load_embeddings(vocab_size, fitted_tokenizer, emb_file_path, emb_dim=300):
    '''
    It can load GloVe.txt for sure. But is it the right way to load paragram.txt,
    fasttext.vec, and word2vec.bin after converting it to .txt?
    '''
    embeddings_index = dict()
    with open(emb_file_path, encoding='utf8') as f:
        for line in f:
            values = line.split()
            if len(values) == 2:  # fasttext.vec starts with a "vocab_size dim" header line; skip it
                continue
            word = values[0]
            coefs = asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    embedding_matrix = zeros((vocab_size, emb_dim))
    for word, i in fitted_tokenizer.word_index.items():  # use the parameter, not the global tokenizer
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix
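For reference, this is how I plan to plug the returned matrix into the network (a minimal sketch; the GloVe filename is just an example, and weights=[...] with trainable=False follows the standard Keras Embedding API for frozen pre-trained vectors):

from keras.models import Sequential
from keras.layers import Embedding

embedding_matrix = load_embeddings(vocab_size, tokenizer, 'glove.6B.300d.txt', emb_dim=300)
keras_model = Sequential()
keras_model.add(Embedding(vocab_size, 300, weights=[embedding_matrix],
                          input_length=max_length, trainable=False))  # freeze the pre-trained vectors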
My question is: can we load the .vec file directly with the given load_embeddings() function, and can we load the .bin file as I have described above, after converting it to .txt?
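If the .txt round trip turns out to be unnecessary, an alternative I am also considering is building the matrix straight from the binary file with Gensim's KeyedVectors (a minimal sketch, assuming the tokenizer above is already fitted):

kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
embedding_matrix = zeros((vocab_size, 300))
for word, i in tokenizer.word_index.items():
    if word in kv:  # KeyedVectors supports membership tests and dict-style lookup
        embedding_matrix[i] = kv[word]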