Explore the power of Natural Language Processing (NLP) and Data Science in uncovering valuable insights from Flipkart product reviews. This presentation delves into the methodology, tools, and techniques used to analyze customer sentiments, identify trends, and extract actionable intelligence from a vast sea of textual data. From understanding customer preferences to improving product offerings, discover how NLP Data Science is revolutionizing the way businesses leverage consumer feedback on Flipkart. Visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
2. Introduction to Natural Language Processing (NLP)
• According to industry estimates, only 21% of the available data is present in structured form. Data is being generated as we speak, tweet, send messages on WhatsApp, and carry out various other activities.
• Despite this abundance of data, the information it carries is not directly accessible unless it is processed (read and understood) manually or analyzed by an automated system.
• To produce significant and actionable insights from text data, it is important to get acquainted with the techniques and principles of Natural Language Processing (NLP).
3. What is Sentiment Analysis?
• Sentiment Analysis, as the name suggests, means identifying the view or emotion behind a situation.
• We humans communicate with each other in a variety of languages, and any language is just a medium through which we try to express ourselves. Whatever we say has a sentiment associated with it: it might be positive, negative, or neutral.
• Sentiment Analysis is a sub-field of NLP that uses machine learning techniques to identify and extract these insights from text.
• Let's look at an example to get a clear view of Sentiment Analysis:
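As a toy illustration of the idea, a simple lexicon-based scorer can assign a polarity to a piece of text. Note that the word lists below are hypothetical stand-ins, not the lexicon or model used in this project:

```python
# Toy lexicon-based sentiment scorer (illustrative only; these tiny
# word lists are hypothetical, not the project's actual lexicon).
POSITIVE = {"good", "great", "love", "amazing", "excellent"}
NEGATIVE = {"bad", "terrible", "waste", "disappointed", "poor"}

def sentiment(text: str) -> str:
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("Great product, amazing quality"))   # positive
print(sentiment("Total waste of money, terrible"))   # negative
```

Real systems go far beyond word counting (handling negation, intensifiers, and context), but the input/output shape is the same: text in, polarity out.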
4. Challenges faced by NLP in the real world
1) Ambiguity and Context: NLP struggles with the multiple meanings that words and phrases take on in different contexts.
2) Data Quality and Quantity: NLP models need large amounts of high-quality data, but obtaining and labeling it can be challenging.
3) Domain Adaptation: Models trained in one domain often fail to generalize well to others, requiring adaptation for real-world use.
4) Ethical and Bias Concerns: Biases in training data can lead to unfair outcomes, necessitating measures to address ethical concerns and mitigate bias.
5) Interpretability and Trust: Complex NLP models are difficult to interpret, making it hard to trust their decisions without explanation.
5. Real-life applications of NLP
1) Virtual Assistants: Siri, Alexa, and Google Assistant aid in tasks such as setting reminders, answering questions, and controlling smart devices.
2) Email Filtering and Categorization: Sorting emails into folders or labeling them as spam based on their content.
3) Language Translation Apps: Tools such as Google Translate help users understand and communicate in different languages.
4) Customer Support Chatbots: Providing instant responses to customer queries on websites and messaging platforms.
5) Social Media Monitoring: Analyzing trends, sentiments, and customer feedback on platforms like Twitter and Facebook for brand reputation management.
6. Basic Libraries of Python
1) NumPy: For numerical computing with large arrays and mathematical operations.
2) Pandas: For data manipulation and analysis, especially with structured data.
3) Matplotlib: For creating various types of plots and visualizations.
4) scikit-learn: For machine learning tasks like classification, regression, and clustering.
7. Important Libraries for NLP
1) NLTK: Offers sentiment analysis via the VADER Sentiment Analyzer.
2) TextBlob: Provides simple functions for sentiment polarity.
3) scikit-learn: Offers machine learning algorithms for sentiment classification.
4) spaCy: Supports sentiment analysis via rule-based or integrated approaches.
5) VADER: Specifically tuned for sentiment analysis in social media text.
6) Gensim: Python library for topic modeling and document similarity analysis, including LSA and LDA.
8. Dataset
• This dataset contains Product name, Product price, Rate, Review, Summary, and Sentiment columns in CSV format. It covers 104 different types of products on flipkart.com, such as electronics, clothing for men, women, and kids, home decor items, automated systems, and so on, and has 205053 rows and 6 columns.
• The dataset has multiclass sentiment labels: positive, neutral, and negative. The sentiment was derived from the Summary column using NLP and the VADER model. After that, we manually checked the labels and moved reviews to the appropriate category: if a summary contained text like "okay" or "just ok", or mixed one positive and one negative remark, we labeled it neutral so the dataset better reflects human language use.
• The data was collected from flipkart.com through web scraping using the Beautiful Soup library.
9. First 5 rows of data
Shape of data
There are 205052 rows and 6 features. From the above table, we can see that the Sentiment column is our target variable, since we have to classify whether the reviews are positive, negative, or neutral.
10. Checking the type of columns
All the columns in the data are of Object type.
11. Checking the null values in the data
Review and Summary have null values present.
After dropping the null values, there are 841 unique products available in the Flipkart data.
12. Top 10 products in the data
The product name column contained many punctuation marks and Cyrillic text, which created noise in the data. After removing the punctuation marks and converting the Cyrillic text into human-readable form, these are the 10 products most frequently purchased online.
13. Distribution of Price
From the KDE plot, we can see that most products fall in the 0 to 1000 price range. The minimum product price is 59 and the maximum is 86990.
15. Top 10 Frequently Used Words in Review
These are the top 10 words used most frequently in product reviews, and they reflect positive sentiment about the products. We also saw that most people have given a 5-star rating.
17. Relationship between Sentiment and Rate
This is a count plot of Sentiment and Rate. For positive sentiment, ratings of 5 and 4 are the most common; for negative sentiment, a rating of 1 dominates; and for neutral sentiment, the ratings are distributed evenly. The same pattern can be seen in the line plot.
18. Relationship between Product price and Rate
The correlation between product price and rate is 0.062 (weakly positive), and it is visible that low-priced products receive more ratings than products in higher price ranges.
19. Plotting the Word Cloud for Sentiment columns
1) Positive Sentiment 2) Negative Sentiment
20. Data Preprocessing
Now, we will pre-process the data before converting it into vectors and passing it to the machine learning model. We will create a function for the pre-processing of data:
1) First, we iterate through each record and split the text into individual words, or tokens.
2) Then, we convert each string to lowercase, since the word "Good" would otherwise be treated as different from the word "good".
3) Next, we check for stopwords in the data and remove them. Stopwords are commonly used words such as "the", "an", and "to" that do not add much value.
4) Then, we perform lemmatization on each word, i.e. we reduce the different forms of a word to a single item called a lemma.
5) A lemma is the base form of a word. For example, "run", "running", and "runs" are all forms of the same lexeme, where "run" is the lemma. Hence, we convert all occurrences of the same lexeme to their respective lemma.
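The steps above can be sketched as a single function. This is a self-contained illustration: the stop-word set and lemma dictionary below are tiny hand-made stand-ins for what NLTK's full stopword corpus and WordNet lemmatizer would provide in the actual project:

```python
# Tiny stand-ins for NLTK's stopword list and WordNet lemmatizer.
STOPWORDS = {"the", "an", "a", "to", "is", "and", "of"}
LEMMAS = {"running": "run", "runs": "run", "ran": "run"}

def preprocess(text: str) -> list[str]:
    tokens = text.split()                                      # 1) tokenize
    tokens = [t.lower().strip(".,!?") for t in tokens]         # 2) lowercase, strip punctuation
    tokens = [t for t in tokens if t and t not in STOPWORDS]   # 3) drop stopwords
    return [LEMMAS.get(t, t) for t in tokens]                  # 4-5) map to lemma

print(preprocess("The battery is Good and it runs smoothly."))
# ['battery', 'good', 'it', 'run', 'smoothly']
```

With NLTK, steps 3 to 5 would instead use `nltk.corpus.stopwords.words("english")` and `nltk.stem.WordNetLemmatizer`.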
23. Topic Modelling using Latent Dirichlet Allocation (LDA)
• Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for extracting topics from a given corpus. The word "latent" means hidden or concealed.
• LDA generates probabilities for the words, from which the topics are formed, and eventually the topics are assigned to documents.
• Any corpus, i.e. a collection of documents, can be represented as a document-word (or document-term) matrix, also known as a DTM.
24. Vectorization
To convert text data into numerical data, we need techniques known as vectorization or, in the NLP world, word embeddings.
Count Vectorizer
• It creates a document-term matrix: a set of dummy variables indicating whether a particular word appears in a document.
• The count vectorizer fits and learns the word vocabulary and builds a document-term matrix in which each cell denotes the frequency of a word in a particular document (also known as its term frequency), with one column dedicated to each word in the corpus.
25. TF-IDF Vectorization
Term frequency-inverse document frequency (TF-IDF) gives a measure that considers the importance of a word depending on how frequently it occurs in a document and in the corpus.
Term Frequency
Term frequency denotes the frequency of a word in a document.
26. Inverse Document Frequency
Inverse document frequency measures the importance of a word in the corpus: how common a particular word is across all the documents.
For example, in any corpus, a few words like "is" or "and" are very common, and most likely they will be present in almost every document.
Let's say the word "is" is present in all the documents in a corpus of 1000 documents. The idf for that word would be:
idf("is") = log(1000/1000) = log 1 = 0
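The idf computation from the example can be reproduced directly. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant by default (idf = ln((1 + N) / (1 + df)) + 1), so its numbers differ slightly from this textbook formula:

```python
import math

def idf(n_docs: int, docs_containing_word: int) -> float:
    # Textbook definition: idf(w) = log(N / df(w)).
    return math.log(n_docs / docs_containing_word)

# 'is' appears in all 1000 documents -> idf = log(1000/1000) = log(1) = 0
print(idf(1000, 1000))  # 0.0

# A rarer word appearing in only 10 of 1000 documents scores much higher.
print(round(idf(1000, 10), 3))  # 4.605
```

This is exactly why common words like "is" contribute nothing to a TF-IDF vector while rare, discriminative words dominate it.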
28. Machine Learning Model
• This is a machine learning classification problem where the goal is to predict the sentiment based on reviews. To do this, I fitted Multinomial Naive Bayes, Random Forest, and XGBoost classifiers.
• Since our task is a classification problem, we can use performance metrics like precision, recall, accuracy, and F1-score.
• We will evaluate our model using the Accuracy Score, Precision Score, Recall Score, and Confusion Matrix, and create a ROC curve to visualize how the model performed.
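A compact sketch of the modeling step: TF-IDF features feeding a Multinomial Naive Bayes classifier. The six reviews below are toy stand-ins; the actual project trained on the full Flipkart dataset of roughly 205k reviews:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Toy stand-in data (hypothetical reviews, for illustration only).
reviews = ["great product", "love it", "amazing quality",
           "terrible waste", "very bad", "disappointed totally"]
labels  = ["positive", "positive", "positive",
           "negative", "negative", "negative"]

# Vectorizer + classifier chained into one estimator.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(reviews, labels)

preds = model.predict(["amazing great quality", "bad waste"])
print(preds)
print(accuracy_score(["positive", "negative"], preds))
```

Swapping `MultinomialNB()` for `RandomForestClassifier()` or an XGBoost classifier reuses the same pipeline, which is how the three models in this project can be compared on identical features.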
37. Conclusion
1. The majority of the reviews (59%) were rated 5 out of 5, indicating a high level of customer satisfaction.
2. Positive sentiment was the most common sentiment in the reviews, followed by neutral and negative sentiment.
3. There was a positive correlation between product price and rate, suggesting that customers were more likely to give higher ratings to more expensive products.
4. The most frequently used words in positive reviews included "good", "great", "love", and "amazing", while the most frequently used words in negative reviews included "bad", "terrible", "waste", and "disappointed".
5. The topic modeling analysis identified several key topics in the reviews, including product quality, customer service, value for money, and shipping.
6. The Multinomial Naive Bayes classifier achieves an accuracy of around 70% with both the count vectorizer and the TF-IDF vectorizer, suggesting that it is a suitable model for sentiment analysis on this dataset.
7. The Random Forest classifier achieves an accuracy of around 75% with both the count vectorizer and the TF-IDF vectorizer, outperforming the Multinomial Naive Bayes classifier.
38. 9. The XGBoost classifier achieves an accuracy of around 80% with the TF-IDF vectorizer, outperforming both the Multinomial Naive Bayes and Random Forest classifiers.
10. Hyperparameter tuning further improves the performance of the XGBoost classifier, achieving an accuracy of around 85% with the TF-IDF vectorizer.
11. The analysis suggests that customers tend to be more satisfied with products that are of good quality, offer good value for money, and come with a good customer service experience.
12. The insights gained from this project can be used by Flipkart to make data-driven decisions to improve its business and provide a better customer experience.