Sentiment Analysis: A Complete Pipeline Using Word Clouds, LSTM, and Naive Bayes
Throughout this article, I walk you through a complete sentiment analysis project covering a range of NLP techniques. From data preprocessing and word clouds to training a deep learning model (LSTM) and a simpler classifier (Naive Bayes), this article gives you a complete guide to building a sentiment analysis model from scratch.
Introduction
Sentiment analysis determines whether a piece of writing is positive, negative, or neutral. It is commonly used in marketing, customer service, and product reviews to understand user sentiment and improve those experiences. In this project, I worked with a dataset of user comments and classified them as positive or negative. In this post, we will walk through the complete workflow.
Data Preprocessing
Before we build any models, it’s crucial to clean and preprocess the text data. This helps the machine learning algorithms focus on the meaningful parts of the text.
import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
# Define stopwords and stemmer
stop_words = stopwords.words('english')
stemmer = SnowballStemmer('english')
# Clean text
# Regex that strips @mentions, URLs, and non-alphanumeric characters
text_cleaning_re = r"@\S+|https?:\S+|http?:\S+|[^A-Za-z0-9]+"

def preprocess(text, stem=False):
    text = re.sub(text_cleaning_re, ' ', str(text).lower()).strip()
    tokens = [stemmer.stem(token) if stem else token
              for token in text.split() if token not in stop_words]
    return " ".join(tokens)

We used regex to clean the text and a Snowball Stemmer to reduce words to their root forms. Stop words like “the,” “is,” and “in” are removed to improve the relevance of the features used in the model.
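As a quick sanity check, here is how the function behaves on a tweet-style string (this assumes the NLTK stopword list has already been downloaded via nltk.download('stopwords'); the outputs in the comments are approximate):

# Example usage of preprocess()
print(preprocess("@user I really LOVE this product!! https://example.com"))
# -> "really love product"  (mention, URL, punctuation, and stop words removed)
print(preprocess("@user I really LOVE this product!!", stem=True))
# -> "realli love product"  (tokens reduced to their stems)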
Visualizing the Data with Word Clouds
To get a sense of the distribution of words in positive and negative sentiments, I generated word clouds.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Word cloud for positive sentiment
wc = WordCloud(max_words=2000, width=1600, height=800).generate(" ".join(data[data.sentiment == 'Positive'].text))
plt.imshow(wc, interpolation='bilinear')
plt.show()

# Word cloud for negative sentiment
wc = WordCloud(max_words=2000, width=1600, height=800).generate(" ".join(data[data.sentiment == 'Negative'].text))
plt.imshow(wc, interpolation='bilinear')
plt.show()

Word Clouds Explained:
Word clouds are a visual representation of word frequency. The larger the word in the cloud, the more frequently it appears in the text. This allows us to quickly spot the dominant words for positive and negative sentiments.
Building a Deep Learning Model Using LSTM
For this project, I opted to build a deep learning model using Long Short-Term Memory (LSTM), which is a special kind of recurrent neural network (RNN). LSTMs are particularly effective for sequential data like text.
Tokenization & Padding
We used TensorFlow’s Tokenizer and pad_sequences to convert text into numerical values that our LSTM model can process.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_data.text)
vocab_size = len(tokenizer.word_index) + 1
x_train = pad_sequences(tokenizer.texts_to_sequences(train_data.text), maxlen=30)
x_test = pad_sequences(tokenizer.texts_to_sequences(test_data.text), maxlen=30)

Word Embeddings with GloVe
Pre-trained word embeddings like GloVe (Global Vectors for Word Representation) can help the model by encoding words into high-dimensional vectors based on their context in large datasets.
import numpy as np

embeddings_index = {}
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype='float32')

LSTM Model Architecture
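The model below expects an embedding_matrix that maps each word index from the tokenizer to its GloVe vector. That step isn't shown above, so here is a minimal sketch of how it can be built (words missing from GloVe simply keep a zero vector):

# Build the embedding matrix from the loaded GloVe vectors
embedding_dim = 300
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector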
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional, Conv1D, Input, SpatialDropout1D
from tensorflow.keras.models import Model
embedding_layer = Embedding(vocab_size, 300, weights=[embedding_matrix], input_length=30, trainable=False)
sequence_input = Input(shape=(30,), dtype='int32')
x = SpatialDropout1D(0.2)(embedding_layer(sequence_input))
x = Conv1D(64, 5, activation='relu')(x)
x = Bidirectional(LSTM(64, dropout=0.2, recurrent_dropout=0.2))(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.5)(x)
outputs = Dense(1, activation='sigmoid')(x)
model = Model(sequence_input, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=1024, epochs=10, validation_data=(x_test, y_test))

I added a Conv1D layer and a Bidirectional LSTM to capture both past and future context in the text data. This architecture helps the model capture more complex relationships between words.
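Note that model.fit expects numeric labels. The snippets above don't show how y_train and y_test were created; a minimal sketch, assuming the sentiment column holds the strings 'Positive' and 'Negative', could look like this:

# Map string labels to 0/1 for the sigmoid output
y_train = (train_data.sentiment == 'Positive').astype(int).values
y_test = (test_data.sentiment == 'Positive').astype(int).values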
Naive Bayes Classifier: A Simpler Approach
For comparison, I also built a Naive Bayes classifier using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
pipeline = Pipeline([
('tfidf', TfidfVectorizer(max_features=100000, max_df=0.8, min_df=5, stop_words='english')),
('nb', MultinomialNB())
])
pipeline.fit(train_data.text, train_data.sentiment)
y_pred = pipeline.predict(test_data.text)

This method is simpler but often highly effective for text classification tasks. Naive Bayes assumes word independence, which isn’t always realistic but often works surprisingly well.
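To compute the evaluation numbers reported below, scikit-learn's built-in metrics can be used; a small sketch, assuming test_data.sentiment holds the true labels:

from sklearn.metrics import accuracy_score, classification_report

# Compare predictions against the true test labels
print("Accuracy:", accuracy_score(test_data.sentiment, y_pred))
print(classification_report(test_data.sentiment, y_pred))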
Lessons Learned from Sentiment Analysis: LSTM vs. Naive Bayes
Sentiment analysis is one of the best-known applications of natural language processing (NLP). Machine learning has improved by leaps and bounds, and a sentiment classifier for text data can be built with many types of model. In this project, I worked with two different sentiment classification methods: an LSTM (Long Short-Term Memory) network and a Naive Bayes classifier. The goal was to measure how both methods perform on sentiment analysis and which delivers a good result in the least time.
LSTM (Long Short-Term Memory)
LSTMs are a specific type of recurrent neural network (RNN) created to handle sequential data. They can retain information across long stretches of text, which makes them well suited to tasks like sentiment analysis, where relationships between words that appear far apart still matter.
LSTM models, however, are computationally expensive and take longer to train due to their complexity.
Naive Bayes (with TF-IDF features)
Naive Bayes is a probabilistic model that assumes feature independence (i.e., for text, it assumes word occurrences are conditionally independent given the class). This assumption usually does not hold in the real world, but it makes Naive Bayes computationally efficient to train.
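In its simplest form, the classifier picks the class c that maximizes P(c) × P(w1 | c) × P(w2 | c) × … × P(wn | c), where w1 … wn are the words in the document; this product form is exactly what the independence assumption buys us.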
I used the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to convert text into numerical features. TF-IDF captures the importance of a word in a document relative to a collection of documents, which helps improve model performance.
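In its classic form, the weight of a term t in a document d is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the term's count in d, N is the number of documents, and df(t) is the number of documents containing t (scikit-learn's TfidfVectorizer uses a smoothed variant of this formula).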
LSTM Model: Training Process and Results
Here’s how the LSTM performed over the 10 epochs during training:
Epoch 1/10: Accuracy = 0.7138, Loss = 0.5494, Validation Accuracy = 0.7646
Epoch 2/10: Accuracy = 0.7571, Loss = 0.4949, Validation Accuracy = 0.7699
...
Epoch 10/10: Accuracy = 0.7799, Loss = 0.4596, Validation Accuracy = 0.7816

Training Progress: The LSTM began with an initial accuracy of 0.7138 and gradually improved to 0.7799 over the course of 10 epochs. The validation accuracy similarly increased to 0.7816 by the end of training.
The training time for each epoch was quite long, especially the first epoch, which took around 7852 seconds (~2 hours). This shows the computational cost of deep learning models.
Naive Bayes Model: Classification Results
For the Naive Bayes model, I used the TF-IDF vectorizer to preprocess the text data. After training, the model’s performance was as follows:
Accuracy: 0.76
Classification Report:
precision recall f1-score support
Negative 0.76 0.75 0.76 160542
Positive 0.75 0.76 0.76 159458
Accuracy: 0.76

Accuracy: The Naive Bayes model achieved an accuracy of 0.76, which is comparable to the LSTM model’s performance.
Naive Bayes is fast and efficient. Training took only a few seconds compared to LSTM’s much longer epochs, making it ideal for time-sensitive tasks.
Despite its simplicity and the assumption of feature independence, Naive Bayes performed surprisingly well, offering nearly the same accuracy as the more complex LSTM model.
Comparison and Analysis
Here’s a comparison of both models based on key factors:
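Accuracy: the LSTM reached roughly 0.78 validation accuracy, while Naive Bayes reached 0.76.
Training time: the LSTM’s first epoch alone took around two hours, while Naive Bayes trained in seconds.
Context: the LSTM captures word order and long-range context, while Naive Bayes treats words as independent features.
Complexity: the LSTM requires GloVe embeddings, far more compute, and careful tuning, while Naive Bayes needs only a TF-IDF pipeline.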
Lessons Learned
The power of LSTM: LSTMs are better at handling sequential data because they can learn relationships between words that are far apart from each other. This is useful for sentiment analysis tasks where word context matters (for example, capturing sarcasm or irony) or where sentence structures are complex.
Efficiency of Naive Bayes: Naive Bayes is a simple, fast model that is especially well suited to smaller datasets. It is very handy when you need results quickly and the text patterns are relatively simple.
Trade-offs between models: Whether to use LSTM or Naive Bayes depends on the problem. When you have enough computational power and need to model the text deeply (reviews, long articles), go with LSTM. For simpler text tasks where you want fast results, Naive Bayes strikes a good balance between accuracy and speed.
Conclusion
This project gave me some interesting perspective on how different models approach the same problem. Naive Bayes is useful for getting results quickly on simpler tasks, while LSTM shines at understanding sequential dependencies and is better suited to more nuanced text processing. As always, the choice of model depends on the specific problem to solve, the available computational resources, and above all the complexity of the data.
For more details and the full code implementation of this project, check out my GitHub repository here.
