With the proliferation of online social media and review platforms, a plethora of opinionated data has been logged, bearing great potential for supporting decision making processes. Sentiment analysis studies people's sentiments in their produced text, such as product reviews, blog comments, and forum discussions. It enjoys wide applications in fields as diverse as politics (e.g., analysis of public sentiment towards policies), finance (e.g., analysis of market sentiment), and marketing (e.g., product research and brand management).

Since sentiments can be categorized as discrete polarities or scales (e.g., positive and negative), we can consider sentiment analysis as a text classification task, which transforms a varying-length text sequence into a fixed-length text category. In this chapter, we will use Stanford's large movie review dataset for sentiment analysis. It consists of a training set and a test set, each containing 25000 movie reviews downloaded from IMDb. In both datasets, there are equal numbers of "positive" and "negative" labels, indicating different sentiment polarities.
# PyTorch implementation
import os
import torch
from torch import nn
from d2l import torch as d2l
# MXNet implementation
import os
from mxnet import np, npx
from d2l import mxnet as d2l

npx.set_np()
16.1.1. Reading the Dataset
First, download and extract this IMDb review dataset in the path ../data/aclImdb.
#@save
d2l.DATA_HUB['aclImdb'] = (d2l.DATA_URL + 'aclImdb_v1.tar.gz',
                           '01ada507287d82875905620988597833ad4e0903')

data_dir = d2l.download_extract('aclImdb', 'aclImdb')
Downloading ../data/aclImdb_v1.tar.gz from http://d2l-data.s3-accelerate.amazonaws.com/aclImdb_v1.tar.gz...
Next, read the training and test datasets. Each example is a review and its label: 1 for "positive" and 0 for "negative".
#@save
def read_imdb(data_dir, is_train):
    """Read the IMDb review dataset text sequences and labels."""
    data, labels = [], []
    for label in ('pos', 'neg'):
        folder_name = os.path.join(data_dir, 'train' if is_train else 'test',
                                   label)
        for file in os.listdir(folder_name):
            with open(os.path.join(folder_name, file), 'rb') as f:
                review = f.read().decode('utf-8').replace('\n', '')
                data.append(review)
                labels.append(1 if label == 'pos' else 0)
    return data, labels

train_data = read_imdb(data_dir, is_train=True)
print('# trainings:', len(train_data[0]))
for x, y in zip(train_data[0][:3], train_data[1][:3]):
    print('label:', y, 'review:', x[:60])
# trainings: 25000
label: 1 review: Henry Hathaway was daring, as well as enthusiastic, for his
label: 1 review: An unassuming, subtle and lean film, "The Man in the White S
label: 1 review: Eddie Murphy really made me laugh my ass off on this HBO sta
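As a quick sanity check on the claim that the two classes are balanced, we can count the labels returned by read_imdb (a minimal sketch; it only relies on the labels being 1 for "positive" and 0 for "negative"):

# Sum of the 0/1 labels = number of positive reviews; the training set is
# expected to be balanced, i.e. 12500 positive and 12500 negative reviews
num_pos = sum(train_data[1])
print('# positive:', num_pos, '# negative:', len(train_data[1]) - num_pos)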
16.1.2. Preprocessing the Dataset
Treating each word as a token and filtering out words that appear fewer than 5 times, we create a vocabulary out of the training dataset.
train_tokens = d2l.tokenize(train_data[0], token='word')
vocab = d2l.Vocab(train_tokens, min_freq=5, reserved_tokens=['<pad>'])
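To get a feel for the resulting vocabulary, we can check its size and map a few tokens to indices (a minimal sketch based on the d2l.Vocab interface; the exact numbers depend on the tokenized corpus):

# Vocabulary size, including reserved tokens such as '<unk>' and '<pad>'
print(len(vocab))
# Convert a few example tokens into their integer indices
print(vocab[['this', 'movie', 'is', 'great']])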
After tokenization, let us plot the histogram of review lengths in tokens.
d2l.set_figsize()
d2l.plt.xlabel('# tokens per review')
d2l.plt.ylabel('count')
d2l.plt.hist([len(line) for line in train_tokens], bins=range(0, 1000, 50));
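Besides the histogram, a few summary statistics of the review lengths can help motivate the sequence length chosen below (a minimal sketch in plain Python; no extra libraries are assumed):

# Sort the per-review token counts and report the minimum, median, and maximum
lengths = sorted(len(line) for line in train_tokens)
print('min:', lengths[0], 'median:', lengths[len(lengths) // 2],
      'max:', lengths[-1])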
As we expected, the reviews have varying lengths. To process a minibatch of such reviews at each time, we set the length of each review to 500 with truncation and padding, which is similar to the preprocessing step for the machine translation dataset in Section 10.5.
# PyTorch implementation
num_steps = 500  # sequence length
train_features = torch.tensor([d2l.truncate_pad(
    vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
print(train_features.shape)
torch.Size([25000, 500])
# MXNet implementation
num_steps = 500  # sequence length
train_features = np.array([d2l.truncate_pad(
    vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
print(train_features.shape)
(25000, 500)
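The effect of d2l.truncate_pad is easy to see on a toy sequence: a sequence longer than the target length is cut off, and a shorter one is padded with the given padding value (a minimal sketch; 0 stands in for vocab['<pad>'] here):

# Truncation: the sequence is longer than 3, so it is cut to 3 tokens
print(d2l.truncate_pad([1, 2, 3, 4, 5], 3, 0))  # expected: [1, 2, 3]
# Padding: the sequence is shorter than 3, so it is filled with the value 0
print(d2l.truncate_pad([1, 2], 3, 0))           # expected: [1, 2, 0]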
16.1.3. Creating Data Iterators
Now we can create data iterators. At each iteration, a minibatch of examples is returned; with a batch size of 64, the 25000 training examples yield 391 minibatches (25000/64 rounded up).
# PyTorch implementation
train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])), 64)

for X, y in train_iter:
    print('X:', X.shape, ', y:', y.shape)
    break
print('# batches:', len(train_iter))
X: torch.Size([64, 500]) , y: torch.Size([64])
# batches: 391
# MXNet implementation
train_iter = d2l.load_array((train_features, train_data[1]), 64)

for X, y in train_iter:
    print('X:', X.shape, ', y:', y.shape)
    break
print('# batches:', len(train_iter))
X: (64, 500) , y: (64,)
# batches: 391
16.1.4. Putting It All Together
Last, we wrap up the above steps into the load_data_imdb function. It returns training and test data iterators and the vocabulary of the IMDb review dataset.
# PyTorch implementation
#@save
def load_data_imdb(batch_size, num_steps=500):
    """Return data iterators and the vocabulary of the IMDb review dataset."""
    data_dir = d2l.download_extract('aclImdb', 'aclImdb')
    train_data = read_imdb(data_dir, True)
    test_data = read_imdb(data_dir, False)
    train_tokens = d2l.tokenize(train_data[0], token='word')
    test_tokens = d2l.tokenize(test_data[0], token='word')
    vocab = d2l.Vocab(train_tokens, min_freq=5)
    train_features = torch.tensor([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
    test_features = torch.tensor([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in test_tokens])
    train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])),
                                batch_size)
    test_iter = d2l.load_array((test_features, torch.tensor(test_data[1])),
                               batch_size, is_train=False)
    return train_iter, test_iter, vocab
# MXNet implementation
#@save
def load_data_imdb(batch_size, num_steps=500):
    """Return data iterators and the vocabulary of the IMDb review dataset."""
    data_dir = d2l.download_extract('aclImdb', 'aclImdb')
    train_data = read_imdb(data_dir, True)
    test_data = read_imdb(data_dir, False)
    train_tokens = d2l.tokenize(train_data[0], token='word')
    test_tokens = d2l.tokenize(test_data[0], token='word')
    vocab = d2l.Vocab(train_tokens, min_freq=5)
    train_features = np.array([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
    test_features = np.array([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in test_tokens])
    train_iter = d2l.load_array((train_features, train_data[1]), batch_size)
    test_iter = d2l.load_array((test_features, test_data[1]), batch_size,
                               is_train=False)
    return train_iter, test_iter, vocab
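In the following sections, this function is typically used as below (a minimal usage sketch; the batch size of 64 is only an illustrative choice):

# Load the IMDb review dataset into iterators and build the vocabulary
batch_size = 64
train_iter, test_iter, vocab = load_data_imdb(batch_size)
print(len(vocab))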
16.1.5. Summary
Sentiment analysis studies people's sentiments in their produced text, which is considered as a text classification problem that transforms a varying-length text sequence into a fixed-length text category.
After preprocessing, we can load Stanford's large movie review dataset (the IMDb review dataset) into data iterators with a vocabulary.
16.1.6. Exercises
What hyperparameters in this section can we modify to accelerate training sentiment analysis models?
Can you implement a function to load the dataset of Amazon reviews into data iterators and labels for sentiment analysis?