Dataset
I am using the "Large Movie Review Dataset" for this sentiment analysis tutorial. It contains 50,000 movie reviews collected from the Internet Movie Database (IMDb) for this specific task. Each review is labeled with one of two polarities, negative or positive, and each class contains 25,000 reviews.
After downloading, extract the archive into your working directory (for example, with WinRAR).
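If you would rather extract the archive from Python, here is a minimal sketch using the standard `tarfile` module. It assumes the download is a gzip-compressed tar archive; the function name `extract_dataset` and the file name in the commented call are illustrative, so check them against your own download.

```python
import tarfile

def extract_dataset(archive_path, target_dir):
    """Extract a gzip-compressed tar archive into target_dir."""
    with tarfile.open(archive_path, 'r:gz') as archive:
        archive.extractall(target_dir)

# Hypothetical archive name and target directory; adjust both to your setup.
# extract_dataset('aclImdb_v1.tar.gz', r'C:\Users\nlpgeek\sentiment')
```

Only extract archives from sources you trust, since `extractall` writes whatever paths the archive contains.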
Convert Dataset into .CSV file
import os
import pandas as pd
import numpy as np

labels = {'pos': 'positive', 'neg': 'negative'}
# Note: change this path to the directory where you extracted the dataset
base_dir = r'C:\Users\nlpgeek\sentiment\aclImdb'

# Collect rows in a plain list first; DataFrame.append was removed in pandas 2.0
rows = []
for directory in ('test', 'train'):
    for sentiment in ('pos', 'neg'):
        path = os.path.join(base_dir, directory, sentiment)
        for review_file in os.listdir(path):
            with open(os.path.join(path, review_file), 'r', encoding='utf8') as input_file:
                review = input_file.read()
            rows.append([review, labels[sentiment]])

dataset = pd.DataFrame(rows, columns=['review', 'sentiment'])

# Shuffle the rows so positive and negative reviews are mixed
dataset = dataset.reindex(np.random.permutation(dataset.index))
dataset.to_csv('movie_reviews.csv', index=False)
Converting 50,000 text documents into a single .CSV file will take some time.
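Once `movie_reviews.csv` exists, it is worth checking that both classes made it into the file. The sketch below counts the labels with the standard `csv` module only; the helper name `count_sentiments` is illustrative, and it assumes the file has a `sentiment` column as written by the conversion step.

```python
import csv
from collections import Counter

def count_sentiments(csv_path):
    """Count how many rows carry each sentiment label."""
    # newline='' lets the csv module handle line breaks inside quoted reviews
    with open(csv_path, 'r', encoding='utf8', newline='') as csv_file:
        reader = csv.DictReader(csv_file)
        return Counter(row['sentiment'] for row in reader)

# For the full dataset, each label should appear 25,000 times.
# print(count_sentiments('movie_reviews.csv'))
```

If the counts are lower than expected, the most likely cause is a wrong base path, so that some of the `pos` or `neg` folders were never read.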