Sentiment classification using LSTM
在这个笔记本中,我们将使用LSTM架构在电影评论数据集上训练一个模型来预测评论的情绪。首先,让我们看看什么是LSTM?

LSTM,即长短时记忆,是一种序列神经网络架构,它利用其结构保留了对前一序列的记忆。第一个被引入的序列模型是RNN。但是,很快研究人员发现,RNN并没有保留很多以前序列的记忆。这导致在长文本序列中失去上下文。
为了维护这一背景,LSTM被引入。在LSTM单元中,有一些特殊的结构被称为门和单元状态,它们被改变和维护以保持LSTM中的记忆。要了解这些结构如何工作,请阅读 this blog.
从代码上看,我们正在使用tensorflow和keras来建立模型和训练它。为了进一步了解本项目的代码/概念,我们使用了以下参考资料。
References:
(1) Medium article on keras lstm
(2) Keras embedding layer documentation
(3) Keras example of text classification from scratch
(4) Bi-directional lstm model example
(5) kaggle notebook for text preprocessing
Notebook:
# This Python 3 environment comes with many helpful analytics libraries installed # It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python # For example, here's several helpful packages to load import numpy as np # linear algebra import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv) # Input data files are available in the read-only "../input/" directory # For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory import os for dirname, _, filenames in os.walk('/kaggle/input'): for filename in filenames: print(os.path.join(dirname, filename)) # You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" # You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/sentiment-analysis-on-movie-reviews/sampleSubmission.csv
/kaggle/input/sentiment-analysis-on-movie-reviews/train.tsv.zip
/kaggle/input/sentiment-analysis-on-movie-reviews/test.tsv.zip
train_data = pd.read_csv('/kaggle/input/sentiment-analysis-on-movie-reviews/train.tsv.zip',sep = 't') test_data = pd.read_csv('/kaggle/input/sentiment-analysis-on-movie-reviews/train.tsv.zip',sep = 't') train_data.head()
| PhraseId | SentenceId | Phrase | Sentiment | |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |
train_data = train_data.drop(['PhraseId','SentenceId'],axis = 1) test_data = test_data.drop(['PhraseId','SentenceId'],axis = 1)
import keras from keras.models import Sequential from keras.layers import Dense #层lyer from keras.layers import LSTM from keras.layers import Activation from keras.layers import Embedding from keras.layers import Bidirectional
max_features = 20000 # 只考虑前20千字 maxlen = 200
train_data.head()
| Phrase | Sentiment | |
|---|---|---|
| 0 | A series of escapades demonstrating the adage ... | 1 |
| 1 | A series of escapades demonstrating the adage ... | 2 |
| 2 | A series | 2 |
| 3 | A | 2 |
| 4 | series | 2 |
from nltk.corpus import stopwords import re # 定义文本清理函数 def text_cleaning(text): forbidden_words = set(stopwords.words('english'))#停用词,对于理解文章没有太大意义的词,比如"the"、“an”、“his”、“their” if text: text = ' '.join(text.split('.')) text = re.sub('/',' ',text) text = re.sub(r'\',' ',text) text = re.sub(r'((http)S+)','',text) text = re.sub(r's+', ' ', re.sub('[^A-Za-z]', ' ', text.strip().lower())).strip() text = re.sub(r'W+', ' ', text.strip().lower()).strip() text = [word for word in text.split() if word not in forbidden_words] return text return []
# 将句子转化为词语列表 train_data['flag'] = 'TRAIN' test_data['flag'] = 'TEST' total_docs = pd.concat([train_data,test_data],axis = 0,ignore_index = True) total_docs['Phrase'] = total_docs['Phrase'].apply(lambda x: ' '.join(text_cleaning(x))) phrases = total_docs['Phrase'].tolist() from keras.preprocessing.text import one_hot vocab_size = 50000 encoded_phrases = [one_hot(d, vocab_size) for d in phrases] total_docs['Phrase'] = encoded_phrases train_data = total_docs[total_docs['flag'] == 'TRAIN'] test_data = total_docs[total_docs['flag'] == 'TEST'] x_train = train_data['Phrase'] y_train = train_data['Sentiment'] x_val = test_data['Phrase'] y_val = test_data['Sentiment']
x_train.head()
y_train.unique()
array([1, 2, 3, 4, 0])
tf.keras.preprocessing.sequence.pad_sequences()的用法:https://blog.csdn.net/qq_45465526/article/details/109400926)
# 将序列转化为经过填充以后得到的一个长度相同新的序列 x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen) x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen)
model = Sequential() inputs = keras.Input(shape=(None,), dtype="int32") # 将每个整数嵌入一个128维的向量中 model.add(inputs) model.add(Embedding(50000, 128)) # 增加2个双向的LSTM model.add(Bidirectional(LSTM(64, return_sequences=True))) model.add(Bidirectional(LSTM(64))) # 添加一个分类器 model.add(Dense(5, activation="sigmoid")) #model = keras.Model(inputs, outputs) model.summary()
result:
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, None, 128) 6400000 _________________________________________________________________ bidirectional (Bidirectional (None, None, 128) 98816 _________________________________________________________________ bidirectional_1 (Bidirection (None, 128) 98816 _________________________________________________________________ dense (Dense) (None, 5) 645 ================================================================= Total params: 6,598,277 Trainable params: 6,598,277 Non-trainable params: 0 _________________________________________________________________
model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"]) model.fit(x_train, y_train, batch_size=32, epochs=30, validation_data=(x_val, y_val))
result:
Epoch 1/30 4877/4877 [==============================] - 562s 115ms/step - loss: 0.9593 - accuracy: 0.6107 - val_loss: 0.7819 - val_accuracy: 0.6798 Epoch 2/30 4877/4877 [==============================] - 520s 107ms/step - loss: 0.7942 - accuracy: 0.6729 - val_loss: 0.7094 - val_accuracy: 0.7114 ..................................................................... Epoch 29/30 4877/4877 [==============================] - 539s 111ms/step - loss: 0.3510 - accuracy: 0.8117 - val_loss: 0.3220 - val_accuracy: 0.8242 Epoch 30/30 4877/4877 [==============================] - 553s 113ms/step - loss: 0.3485 - accuracy: 0.8124 - val_loss: 0.3187 - val_accuracy: 0.8238
<tensorflow.python.keras.callbacks.History at 0x7fa9b82520d0>
model.fit(x_train, y_train, batch_size=32, epochs=5, validation_data=(x_val, y_val))
result:
Epoch 1/5 4877/4877 [==============================] - 535s 110ms/step - loss: 0.3477 - accuracy: 0.8128 - val_loss: 0.3193 - val_accuracy: 0.8240 Epoch 2/5 4877/4877 [==============================] - 543s 111ms/step - loss: 0.3457 - accuracy: 0.8134 - val_loss: 0.3173 - val_accuracy: 0.8250 Epoch 3/5 4877/4877 [==============================] - 542s 111ms/step - loss: 0.3428 - accuracy: 0.8140 - val_loss: 0.3158 - val_accuracy: 0.8254 Epoch 4/5 4877/4877 [==============================] - 541s 111ms/step - loss: 0.3429 - accuracy: 0.8144 - val_loss: 0.3165 - val_accuracy: 0.8257 Epoch 5/5 4877/4877 [==============================] - 557s 114ms/step - loss: 0.3395 - accuracy: 0.8150 - val_loss: 0.3136 - val_accuracy: 0.8259
<tensorflow.python.keras.callbacks.History at 0x7fa8e0763150>
总之,我们创建了一个双向的LSTM模型,并对其进行了检测情感的训练。我们达到了80%的训练和82%的验证准确率。
Notebook code:https://www.kaggle.com/code/ranxi169/sentiment-classification-using-lstm/notebook
原创作者:孤飞-博客园
个人博客:https://blog.onefly.top