Sentiment Analysis for Customer Reviews in R
Sentiment analysis, also known as opinion mining, computationally identifies and categorizes opinions expressed in text data. It involves analyzing the polarity (positive, negative or neutral) of textual content to gauge the sentiment or attitude of the author. In the context of customer reviews, sentiment analysis helps businesses understand how customers perceive their products or services. In this article, we delve into the world of sentiment analysis for customer reviews using the R Programming Language.
Understanding the Dataset
The dataset used in this project contains TripAdvisor Hotel Reviews where each row represents a customer review. The dataset includes the following key columns:
- S.No.: Serial number of the review.
- Review: The actual customer review text.
- Rating: Customer rating (usually on a scale of 1 to 5, representing their experience).
We will focus on the Review column, which contains the textual data we need to analyze for sentiment. The Rating column can be used as an additional reference to check how well our sentiment scores match the numerical ratings.
You can download the dataset from here: TripAdvisor
1. Installing and Loading Required Packages
We need to install and load the following R packages:
- tm: Provides text mining functions
- SnowballC: Implements stemming for text data
- syuzhet: Provides sentiment analysis functions
- tidyverse: A collection of packages for data manipulation
- wordcloud: Used for visualizing word frequencies
- ggplot2: Used for data visualization
```r
install.packages(c("tm", "SnowballC", "syuzhet", "tidyverse", "wordcloud", "ggplot2"))

library(tm)
library(SnowballC)
library(syuzhet)
library(tidyverse)
library(wordcloud)
library(ggplot2)
```
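If some of these packages are already installed, you can skip reinstalling them. A minimal sketch of that pattern (the pkgs vector simply repeats the package names listed above):

```r
# Install only the packages that are not already present, then load them all
pkgs <- c("tm", "SnowballC", "syuzhet", "tidyverse", "wordcloud", "ggplot2")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)

invisible(lapply(pkgs, library, character.only = TRUE))
```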
2. Loading the Dataset
Next, we will load the CSV file containing the reviews. The str() function will display the structure of the dataframe, showing the data types of each column and a preview of the data.
```r
# Load the reviews and inspect the structure of the data frame
data <- read.csv("/content/tripadvisor.csv", header = TRUE)
str(data)
```
Output:
[Output image: Loading the Dataset]

3. Creating and Inspecting the Corpus
We convert the review text to a character vector and create a corpus for text processing.
```r
# Convert the review text to UTF-8 and build a corpus for text processing
corpus <- iconv(data$Review, to = "UTF-8", sub = "byte")
corpus <- Corpus(VectorSource(corpus))
inspect(corpus[1:5])
```
Output:
[Output image: Corpus]

4. Cleaning the Corpus
We will clean the text by converting it to lowercase, removing punctuation, numbers, stopwords and extra whitespace, and applying stemming.
```r
# Clean the full corpus: lowercase, then remove punctuation, numbers,
# English stopwords and extra whitespace, and finally stem each document
cleaned_corpus <- tm_map(corpus, content_transformer(tolower))
cleaned_corpus <- tm_map(cleaned_corpus, removePunctuation)
cleaned_corpus <- tm_map(cleaned_corpus, removeNumbers)
cleaned_corpus <- tm_map(cleaned_corpus, removeWords, stopwords('english'))
cleaned_corpus <- tm_map(cleaned_corpus, stripWhitespace)
cleaned_corpus <- tm_map(cleaned_corpus, stemDocument)
inspect(cleaned_corpus[1:5])
```
Output:
[Output image: Cleaning the Corpus]

5. Sampling the Data
We will sample a subset of 200 reviews to keep the rest of the analysis manageable.
```r
# Draw a reproducible sample of 200 reviews and build a corpus from it
set.seed(123)
sampled_reviews <- sample(data$Review, 200)
sampled_corpus <- Corpus(VectorSource(iconv(sampled_reviews, to = "UTF-8", sub = "byte")))
```
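If you also want to keep the ratings that correspond to the sampled reviews (handy for later sanity checks), one option is to sample row indices instead; sampled_idx and sampled_ratings below are names introduced here for illustration:

```r
# Sample row indices so the reviews and their ratings stay aligned
set.seed(123)
sampled_idx <- sample(nrow(data), 200)
sampled_reviews <- data$Review[sampled_idx]
sampled_ratings <- data$Rating[sampled_idx]
```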
6. Cleaning the Sampled Corpus
We will now clean the sampled corpus in the same way we cleaned the full corpus.
```r
# Apply the same cleaning steps to the sampled corpus
cleaned_sampled_corpus <- tm_map(sampled_corpus, content_transformer(tolower))
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, removePunctuation)
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, removeNumbers)
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, removeWords, stopwords('english'))
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, stripWhitespace)
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, stemDocument)
```
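Since the same six cleaning steps are applied to both corpora, they can also be wrapped in a small helper; clean_corpus below is a name introduced here, not part of the original code:

```r
# Reusable cleaning pipeline: lowercase, strip punctuation, numbers and
# stopwords, collapse whitespace, then stem each document
clean_corpus <- function(x) {
  x <- tm_map(x, content_transformer(tolower))
  x <- tm_map(x, removePunctuation)
  x <- tm_map(x, removeNumbers)
  x <- tm_map(x, removeWords, stopwords("english"))
  x <- tm_map(x, stripWhitespace)
  tm_map(x, stemDocument)
}

# Equivalent to the block above
cleaned_sampled_corpus <- clean_corpus(sampled_corpus)
```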
7. Creating a Sparse Term-Document Matrix
We will create a term-document matrix (TDM) with TF-IDF weighting, stored in sparse form for efficient processing and memory usage.
```r
# Build a TF-IDF weighted term-document matrix from the cleaned sample
tdm_sparse <- TermDocumentMatrix(cleaned_sampled_corpus,
                                 control = list(weighting = weightTfIdf))
tdm_m_sparse <- as.matrix(tdm_sparse)
```
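Before moving on, it can help to check the size of the matrix and, optionally, drop terms that occur in very few documents. A short sketch using tm's removeSparseTerms, where the 0.99 sparsity threshold is an assumption you can tune:

```r
# Dimensions of the TDM: number of terms x number of documents
dim(tdm_m_sparse)

# Optionally drop terms that are missing from more than 99% of documents
tdm_reduced <- removeSparseTerms(tdm_sparse, sparse = 0.99)
tdm_reduced
```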
8. Analyzing Term Frequencies
We analyze the term frequencies in the sampled corpus and display the most frequent terms.
```r
# Sum the weights for each term, sort them and show the top five terms
term_freq <- rowSums(tdm_m_sparse)
term_freq_sorted <- sort(term_freq, decreasing = TRUE)
tdm_d_sparse <- data.frame(word = names(term_freq_sorted), freq = term_freq_sorted)
head(tdm_d_sparse, 5)
```
Output:
[Output image: Term Frequencies]

9. Performing Sentiment Analysis
We use three different lexicon-based methods (syuzhet, bing and afinn) to score the sentiment of the review text.
```r
# Score every review with three sentiment lexicons and preview the first scores
text <- iconv(data$Review)

syuzhet_vector <- get_sentiment(text, method = "syuzhet")
cat("Syuzhet method:", head(syuzhet_vector), "\n")

bing_vector <- get_sentiment(text, method = "bing")
cat("Bing method:", head(bing_vector), "\n")

afinn_vector <- get_sentiment(text, method = "afinn")
cat("Afinn method:", head(afinn_vector), "\n")
```
Output:
[Output image: Sentiment Analysis]

10. Comparing Sentiment Methods
We compare the polarity (the sign of the scores) assigned by the three methods to the first few reviews.
```r
# Compare only the sign (positive / negative / neutral) of the first few scores
rbind(
  sign(head(syuzhet_vector)),
  sign(head(bing_vector)),
  sign(head(afinn_vector))
)
```
Output:
[Output image: Comparing Sentiment Methods]

Visualization of Sentiment Analysis for Customer Reviews in R
We will now visualize the sentiment analysis results using different methods, including a Word Cloud, Sentiment Histogram, Emotion Bar Plot and Pie Chart of Sentiment Distribution.
1. Word Cloud
We create a word cloud to visualize the most frequent terms in the reviews. A word cloud provides a quick and intuitive way to visualize the most common words in a text corpus, making it easier to identify patterns and trends.
```r
# Plot up to 100 words that appear at least 5 times, sized by frequency
wordcloud(words = tdm_d_sparse$word, freq = tdm_d_sparse$freq,
          min.freq = 5, max.words = 100, colors = brewer.pal(8, "Dark2"))
```
Output:
[Output image: Word Cloud]

Words with higher frequencies appear larger and more prominent in the word cloud. The colors come from the specified Dark2 palette and, by default, vary with word frequency.
2. Sentiment Histogram
We create a histogram to visualize the distribution of sentiment scores using the Syuzhet method. A histogram allows for a quick assessment of the overall sentiment distribution within the sampled text data.
```r
# Score the sampled reviews with the syuzhet method and plot the distribution
text_sampled <- iconv(sampled_reviews)
syuzhet_vector_sampled <- get_sentiment(text_sampled, method = "syuzhet")

ggplot(data.frame(syuzhet_vector_sampled), aes(x = syuzhet_vector_sampled)) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
  labs(title = "Sentiment Distribution using Syuzhet Method (Sampled Data)",
       x = "Sentiment Score", y = "Frequency") +
  theme_minimal()
```
Output:
[Output image: Sentiment Histogram]

Each bar in the histogram represents a range of sentiment scores, and its height indicates how many reviews fall within that range.
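A quick numeric summary of the same sampled scores complements the histogram; this just uses base R's summary():

```r
# Minimum, quartiles, median and mean of the sampled syuzhet scores
summary(syuzhet_vector_sampled)
```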
3. Bar Plot of Emotions
We will use the ggplot2 package to create a bar plot of emotions, with sentiment scores from the NRC lexicon categorized into eight emotions.
```r
# Score the sampled reviews against the NRC emotion lexicon
nrc_sampled <- get_nrc_sentiment(text_sampled)

# Total the counts for each emotion across all sampled reviews
nrct_sampled <- data.frame(t(nrc_sampled))
nrcs_sampled <- data.frame(rowSums(nrct_sampled))
nrcs_sampled <- cbind(sentiment = rownames(nrcs_sampled), nrcs_sampled)
rownames(nrcs_sampled) <- NULL
names(nrcs_sampled) <- c("sentiment", "frequency")
nrcs_sampled <- nrcs_sampled %>% mutate(percent = frequency / sum(frequency))

# Keep the eight emotions (rows 9-10 hold the overall negative/positive counts)
nrcs2_sampled <- nrcs_sampled[1:8, ]
colnames(nrcs2_sampled)[1] <- "emotion"

ggplot(nrcs2_sampled, aes(x = reorder(emotion, -frequency), y = frequency, fill = emotion)) +
  geom_bar(stat = "identity") +
  labs(title = "Emotion Distribution (Sampled Data)", x = "Emotion", y = "Frequency") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")
```
Output:
[Output image: Emotion Bar Plot]

The bar plot shows the distribution of emotions based on sentiment analysis using the NRC lexicon on the sampled dataset. Each bar represents an emotion, and its height indicates how often that emotion appears in the text. The bar colors come from the Set3 palette, making the emotions easy to distinguish.
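The NRC output also contains overall negative and positive counts, which the plot above drops (rows 9 and 10 of nrcs_sampled). A one-line sketch to view them:

```r
# Overall negative and positive counts from the NRC lexicon
nrcs_sampled[9:10, ]
```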
4. Bar Plot of Most Popular Words
We plot the ten most frequent words in the corpus as a horizontal bar chart, which makes the dominant vocabulary in the reviews easy to spot.
```r
# Keep the ten most frequent words and order them by frequency for plotting
tdm_d_sparse <- tdm_d_sparse[1:10, ]
tdm_d_sparse$word <- reorder(tdm_d_sparse$word, tdm_d_sparse$freq)

ggplot(tdm_d_sparse, aes(x = word, y = freq, fill = word)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Most Popular Words", x = "Word", y = "Frequency") +
  theme_minimal()
```
Output:
[Output image: Bar plot of most popular words]

The horizontal bar plot shows the ten most frequent words in the text data. Each bar represents a word, and its length indicates how often that word occurs. Each word is given its own fill color, which visually separates the bars.
5. Pie Chart of Sentiment Distribution
Finally, we visualize the proportion of positive, negative and neutral reviews in the sample as a pie chart.
```r
library(RColorBrewer)

# Count positive, negative and neutral reviews from the sampled syuzhet scores
sentiment_df <- data.frame(
  sentiment = c("Positive", "Negative", "Neutral"),
  count = c(sum(syuzhet_vector_sampled > 0),
            sum(syuzhet_vector_sampled < 0),
            sum(syuzhet_vector_sampled == 0))
)

# A pie chart is a stacked bar chart drawn in polar coordinates
ggplot(sentiment_df, aes(x = "", y = count, fill = sentiment)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Sentiment Distribution", x = "", y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")
```
Output:
[Output image: Pie Chart of Sentiment Distribution]

The pie chart shows the share of each sentiment category in the sample. Each segment represents one category ("Positive", "Negative" or "Neutral"), its size corresponds to the number of reviews in that category, and the segments are colored with the Set3 palette so the categories are easy to tell apart.
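As noted when introducing the dataset, the Rating column can be used as a rough check on these scores. A minimal sketch, assuming ratings of 4-5 count as positive, 3 as neutral and 1-2 as negative (these cut-offs are an assumption), using the full-dataset syuzhet_vector computed earlier:

```r
# Map star ratings to three polarity classes (assumed cut-offs)
rating_class <- cut(data$Rating, breaks = c(0, 2, 3, 5),
                    labels = c("Negative", "Neutral", "Positive"))

# Map syuzhet scores to polarity classes by their sign
score_class <- ifelse(syuzhet_vector > 0, "Positive",
                      ifelse(syuzhet_vector < 0, "Negative", "Neutral"))

# Cross-tabulate lexicon-based polarity against rating-based polarity
table(score_class, rating_class)
```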
Conclusion
From our analysis, we can see that the majority of customers reported a positive experience in their TripAdvisor hotel reviews, with trust, joy and anticipation being the most frequently expressed emotions.