Here we cover tools for pre-processing text data, several supervised and unsupervised models for sentiment prediction, and model causation.

We will use the following packages. Make sure to have them installed correctly on your machine.

# Usual data representation and manipulation libraries
import pandas as pd
import numpy as np
from collections import Counter

# NLTK is very useful for natural language applications
import nltk
nltk.download('stopwords')
nltk.download('sentiwordnet')
nltk.download('wordnet')
nltk.download('vader_lexicon')

# This will be used to tokenize sentences
from nltk.tokenize.toktok import ToktokTokenizer

# We use spacy for extracting useful information from English words
import spacy
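# "parser" and "ner" are the component names spaCy uses; the tagger stays enabled because POS tags and lemmas rely on it
nlp = spacy.load('en_core_web_sm', disable=["parser", "ner"])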

# This dictionary will be used to expand contractions (e.g. we'll -> we will)
from contractions import contractions_dict
import re

# Unicodedata will be used to remove accented characters
import unicodedata

# BeautifulSoup will be used to remove html tags
from bs4 import BeautifulSoup

# Lexicon models
from afinn import Afinn
from nltk.corpus import sentiwordnet as swn
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Evaluation libraries
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

The IMDb dataset

We will predict the sentiment of movie reviews obtained from the Internet Movie Database (IMDb). The dataset contains 50,000 movie reviews, each labeled as “positive” or “negative” based on the review content.


The original dataset is available from http://ai.stanford.edu/~amaas/data/sentiment/, courtesy of Stanford University and Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts.

To simplify things, I have created the CSV file below, which you can use to import the reviews.

movie_reviews.csv
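
As a minimal sketch of importing the reviews, the snippet below reads a CSV with pandas and checks the class balance. The column names "review" and "sentiment" are assumptions about movie_reviews.csv, and a two-row in-memory stand-in replaces the real file so that the example runs on its own.

```python
import io

import pandas as pd

# Stand-in for movie_reviews.csv so the sketch runs on its own.
# The column names "review" and "sentiment" are assumptions about the file.
sample_csv = io.StringIO(
    "review,sentiment\n"
    '"A wonderful, moving film.",positive\n'
    '"Dull plot and wooden acting.",negative\n'
)

# With the real file, pass the path "movie_reviews.csv" instead of sample_csv
reviews_df = pd.read_csv(sample_csv)

print(reviews_df.shape)                        # (rows, columns)
print(reviews_df["sentiment"].value_counts())  # number of reviews per label
```

If the actual file uses different column names, adjust the two column references accordingly.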

Task 1: Importing and pre-processing the input
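
One possible shape for the pre-processing step is sketched below, using only the standard library. It removes accented characters, expands contractions, and strips special characters. The function names and the tiny TOY_CONTRACTIONS map are hypothetical and stand in for the much larger contractions_dict imported above. HTML stripping, stopword removal, and lemmatization are left out to keep the sketch short.

```python
import re
import unicodedata

# Tiny illustrative map (hypothetical); the imported contractions_dict is far larger.
TOY_CONTRACTIONS = {"we'll": "we will", "can't": "cannot", "it's": "it is"}


def remove_accented_chars(text):
    """Decompose accented characters and drop the non-ASCII marks."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("utf-8")


def expand_contractions(text, mapping=TOY_CONTRACTIONS):
    """Replace each known contraction with its expanded (lowercase) form."""
    pattern = re.compile(
        "|".join(re.escape(key) for key in mapping), flags=re.IGNORECASE
    )
    return pattern.sub(lambda match: mapping[match.group(0).lower()], text)


def remove_special_characters(text):
    """Keep only ASCII letters, digits, and whitespace."""
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)


def normalize_review(text):
    """Apply the steps in order, lowercase, and collapse repeated whitespace."""
    text = remove_accented_chars(text)
    text = expand_contractions(text)
    text = remove_special_characters(text.lower())
    return re.sub(r"\s+", " ", text).strip()


print(normalize_review("We'll love this café... it's GREAT!"))
# prints: we will love this cafe it is great
```

Contractions are expanded before special characters are removed; otherwise the apostrophes would already be gone and "we'll" would turn into "well".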


Task 2: Using Unsupervised Lexicon-based models
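
To illustrate the principle behind lexicon-based models, here is a toy scorer that sums word polarities from a small dictionary. TOY_LEXICON, its scores, and both function names are made up for illustration; they are not AFINN, SentiWordNet, or VADER values, and those libraries handle negation, intensifiers, and punctuation far more carefully.

```python
# Hypothetical mini-lexicon: word -> polarity score (not AFINN or VADER values)
TOY_LEXICON = {
    "great": 3,
    "love": 3,
    "good": 2,
    "boring": -2,
    "bad": -2,
    "awful": -3,
}


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the polarity of every whitespace-separated token found in the lexicon."""
    return sum(lexicon.get(token, 0) for token in text.lower().split())


def predict_sentiment(text, threshold=0):
    """Label a review positive when its score reaches the threshold.

    A score equal to the threshold counts as positive; that tie-break is arbitrary.
    """
    return "positive" if lexicon_score(text) >= threshold else "negative"


print(lexicon_score("a great film i love it"))  # 3 + 3 = 6
print(predict_sentiment("boring and awful"))    # -2 + -3 = -5, so negative
```

No labeled training data is needed here, which is why this family of models is called unsupervised; the labels in the dataset are used only to evaluate the predictions.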


Task 3: Using SVM/LR with TF-IDF and BOW features
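
A supervised pipeline for this task could look roughly like the sketch below, which chains scikit-learn's TfidfVectorizer with LogisticRegression. The six-review toy corpus and the variable names are placeholders for the pre-processed IMDb reviews, and the hyperparameters are defaults rather than tuned choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Toy data standing in for the pre-processed IMDb reviews (illustration only)
train_texts = [
    "a great and moving film",
    "i love this wonderful movie",
    "good acting and a great story",
    "boring plot and awful acting",
    "a bad and dull movie",
    "awful script i hated it",
]
train_labels = ["positive", "positive", "positive", "negative", "negative", "negative"]
test_texts = ["a wonderful story", "dull and boring"]
test_labels = ["positive", "negative"]

# TF-IDF features followed by logistic regression
tfidf_lr = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
tfidf_lr.fit(train_texts, train_labels)

predictions = tfidf_lr.predict(test_texts)
print(predictions)
print(accuracy_score(test_labels, predictions))
```

For bag-of-words features, CountVectorizer can take the place of TfidfVectorizer, and sklearn.svm.LinearSVC can replace LogisticRegression to obtain a linear SVM; the rest of the pipeline stays the same.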


BONUS

For the tasks below, create a new Conda environment and make sure that you have the following packages.