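In Section 15.4, we trained a word2vec model on a small dataset and used it to find semantically similar words for an input word. In practice, word vectors pretrained on large corpora can be applied to downstream natural language processing tasks, which will be covered later in Section 16. To demonstrate the semantics of pretrained word vectors from large corpora in a straightforward way, let us apply them to word similarity and analogy tasks.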
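15.7.1. Loading Pretrained Word Vectors
Listed below are pretrained GloVe embeddings of dimension 50, 100, and 300, which can be downloaded from the GloVe website. Pretrained fastText embeddings are available in multiple languages. Here we consider one English version (the 300-dimensional "wiki.en") that can be downloaded from the fastText website.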
import os
import torch
from d2l import torch as d2l

#@save
d2l.DATA_HUB['glove.6b.50d'] = (d2l.DATA_URL + 'glove.6B.50d.zip',
                                '0b8703943ccdb6eb788e6f091b8946e82231bc4d')

#@save
d2l.DATA_HUB['glove.6b.100d'] = (d2l.DATA_URL + 'glove.6B.100d.zip',
                                 'cd43bfb07e44e6f27cbcc7bc9ae3d80284fdaf5a')

#@save
d2l.DATA_HUB['glove.42b.300d'] = (d2l.DATA_URL + 'glove.42B.300d.zip',
                                  'b5116e234e9eb9076672cfeabf5469f3eec904fa')

#@save
d2l.DATA_HUB['wiki.en'] = (d2l.DATA_URL + 'wiki.en.zip',
                           'c1816da3821ae9f43899be655002f6c723e91b88')
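Each entry above pairs a download URL with a 40-character hexadecimal string. Assuming that string is a SHA-1 checksum of the archive (the text does not say so explicitly), a downloaded file could be validated roughly as follows. The helper name `sha1_of_file` and the toy payload are hypothetical and serve only as illustration; this is a sketch, not the library's own download routine.

```python
import hashlib
import os
import tempfile

def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sha1.update(chunk)
    return sha1.hexdigest()

# Hypothetical stand-in for a downloaded archive
payload = b'toy archive contents'
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    toy_path = tmp.name

expected = hashlib.sha1(payload).hexdigest()   # what a registry entry would store
digest = sha1_of_file(toy_path, chunk_size=8)  # small chunks to exercise the loop
os.remove(toy_path)
print(digest == expected)
```

If the digest of a cached file does not match the stored value, the file is likely corrupt or outdated and would need to be downloaded again.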
To load these pretrained GloVe and fastText embeddings, we define the following TokenEmbedding class.
#@save
class TokenEmbedding:
    """Token Embedding."""
    def __init__(self, embedding_name):
        self.idx_to_token, self.idx_to_vec = self._load_embedding(
            embedding_name)
        # Index 0 is reserved for tokens that are not in the vocabulary
        self.unknown_idx = 0
        self.token_to_idx = {token: idx for idx, token in
                             enumerate(self.idx_to_token)}

    def _load_embedding(self, embedding_name):
        idx_to_token, idx_to_vec = ['<unk>'], []
        data_dir = d2l.download_extract(embedding_name)
        # GloVe website: https://nlp.stanford.edu/projects/glove/
        # fastText website: https://fasttext.cc/
        with open(os.path.join(data_dir, 'vec.txt'), 'r') as f:
            for line in f:
                elems = line.rstrip().split(' ')
                token, elems = elems[0], [float(elem) for elem in elems[1:]]
                # Skip header information, such as the top row in fastText
                if len(elems) > 1:
                    idx_to_token.append(token)
                    idx_to_vec.append(elems)
        # Prepend an all-zero vector for the unknown token at index 0
        idx_to_vec = [[0] * len(idx_to_vec[0])] + idx_to_vec
        return idx_to_token, torch.tensor(idx_to_vec)

    def __getitem__(self, tokens):
        # Tokens missing from the vocabulary fall back to the unknown index
        indices = [self.token_to_idx.get(token, self.unknown_idx)
                   for token in tokens]
        vecs = self.idx_to_vec[torch.tensor(indices)]
        return vecs

    def __len__(self):
        return len(self.idx_to_token)
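The parsing and lookup logic of the class can be tried without downloading anything. The following is a minimal sketch that assumes the same "token v1 v2 ..." line format, fed from an in-memory toy file. The names `load_vectors` and `lookup` and the toy vectors are hypothetical, and plain Python lists stand in for tensors.

```python
import io

def load_vectors(lines):
    """Parse 'token v1 v2 ...' lines; index 0 is reserved for unknown tokens."""
    idx_to_token, idx_to_vec = ['<unk>'], []
    for line in lines:
        elems = line.rstrip().split(' ')
        token, values = elems[0], [float(e) for e in elems[1:]]
        # A fastText-style header such as "2 3" leaves a single value: skip it
        if len(values) > 1:
            idx_to_token.append(token)
            idx_to_vec.append(values)
    # The unknown token maps to an all-zero vector of the same dimension
    idx_to_vec = [[0.0] * len(idx_to_vec[0])] + idx_to_vec
    return idx_to_token, idx_to_vec

# Toy file: one header row followed by two 3-dimensional vectors
toy_file = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
idx_to_token, idx_to_vec = load_vectors(toy_file)
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}

def lookup(tokens, unknown_idx=0):
    """Return one vector per token, falling back to the unknown vector."""
    return [idx_to_vec[token_to_idx.get(t, unknown_idx)] for t in tokens]

vectors = lookup(['world', 'missing'])
print(vectors)
```

Here 'world' resolves to its own vector, while 'missing' is not in the toy vocabulary and therefore receives the all-zero vector stored at index 0.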
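Below we load the 50-dimensional GloVe embeddings (pretrained on a Wikipedia subset). When the TokenEmbedding instance is created, the specified embedding file is downloaded if it has not been already.
glove_6b50d = TokenEmbedding('glove.6b.50d')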
Downloading ../data/glove.6B.50d.zip from http://d2l-data.s3-accelerate.amazonaws.com/glove.6B.50d.zip...
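Output the vocabulary size. The vocabulary contains 400000 words (tokens) and a special unknown token.
len(glove_6b50d)
400001
We can get the index of a word in the vocabulary, and vice versa.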
glove_6b50d.token_to_idx['beautiful'], glove_6b50d.idx_to_token[3367]
(3367, 'beautiful')
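15.7.2. Applying Pretrained Word Vectors
Using the loaded GloVe vectors, we will demonstrate their semantics by applying them to the following word similarity and analogy tasks.
15.7.2.1. Word Similarity
Similar to Section 15.4.3, in order to find semantically similar words for an input word based on the cosine similarities between word vectors, we implement the following knn (k-nearest neighbors) function.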
def knn(W, x, k):
    # Add 1e-9 for numerical stability
    cos = np.dot(W, x.reshape(-1,)) / (
        np.sqrt(np.sum(W * W, axis=1) + 1e-9) * np
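The listing above breaks off mid-expression. As a minimal sketch of how a cosine-similarity nearest-neighbor search of this kind can be written, the block below assumes plain NumPy rather than the book's own array library. The name `knn_sketch`, the use of `np.argsort` for the top-k step, and the toy matrix are assumptions made for illustration and do not come from the original.

```python
import numpy as np

def knn_sketch(W, x, k):
    """Return indices and cosine similarities of the k rows of W closest to x."""
    x = x.reshape(-1)
    # Add 1e-9 inside the square root so all-zero rows do not divide by zero
    cos = W @ x / (np.sqrt(np.sum(W * W, axis=1) + 1e-9)
                   * np.sqrt(np.sum(x * x)))
    topk = np.argsort(-cos)[:k]  # indices sorted by decreasing similarity
    return topk, [float(cos[i]) for i in topk]

W = np.array([[0.0, 0.0],   # all-zero row, like an unknown-token vector
              [1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
topk, sims = knn_sketch(W, np.array([1.0, 0.0]), k=2)
print(topk, sims)
```

With the toy matrix, the query points along the first axis, so row 1 ranks first and row 2 second. Applied to embeddings, `W` would hold one word vector per row and `x` the vector of the query word, which will typically rank itself first.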