# Movie Review Sentiment Analysis

You are provided with a data set consisting of 50,000 IMDB movie reviews, where each review is labelled as positive or negative. The goal is to build a binary classification model to predict the sentiment of a movie review.


## Movie Review Data

### Dataset
The data set, [[alldata.tsv](https://liangfgithub.github.io//Data/alldata.tsv)], has 50,000 rows (i.e., reviews) and 4 columns:

* Col 1: "id", the identification number; 
* Col 2: "sentiment", 0 = negative and 1 = positive; 
* Col 3: "score", the 10-point score assigned by the reviewer. Scores 1-4 correspond to negative sentiment; Scores 7-10 correspond to positive sentiment. This data set contains no reviews with score 5 or 6. 
* Col 4: "review". 
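A quick way to load the data and sanity-check the columns described above (a minimal sketch; it assumes `alldata.tsv` sits in the working directory):

```{r, eval = FALSE}
# Read the full data set; header = TRUE picks up the four column names
data <- read.table("alldata.tsv", stringsAsFactors = FALSE, header = TRUE)
dim(data)                           # expect 50000 rows and 4 columns
table(data$sentiment)               # counts of negative (0) and positive (1) reviews
table(data$score, data$sentiment)   # scores 1-4 should map to 0, scores 7-10 to 1
```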

### Test IDs


This file, [[project3_splits.csv](https://liangfgithub.github.io//Data/project3_splits.csv)], contains 25,000 rows and 5 columns; each column contains the 25,000 row numbers of one test set.

You can use the following code to generate the 5 sets of training/test splits, with each split (3 files: `train.tsv`, `test.tsv`, `test_y.tsv`) stored in its own subfolder. Note that the training data do not contain the "score" column; this prevents students from mistakenly using "score" as an input feature.

```{r, eval = FALSE}
data <- read.table("alldata.tsv", stringsAsFactors = FALSE,
                   header = TRUE)
testIDs <- read.csv("project3_splits.csv", header = TRUE)
for(j in 1:5){
  dir.create(paste("split_", j, sep=""))
  train <- data[-testIDs[,j], c("id", "sentiment", "review")]
  test <- data[testIDs[,j], c("id", "review")]
  test.y <- data[testIDs[,j], c("id", "sentiment", "score")]
  
  tmp_file_name <- paste("split_", j, "/", "train.tsv", sep="")
  write.table(train, file=tmp_file_name, 
              quote=TRUE, 
              row.names = FALSE,
              sep='\t')
  tmp_file_name <- paste("split_", j, "/", "test.tsv", sep="")
  write.table(test, file=tmp_file_name, 
              quote=TRUE, 
              row.names = FALSE,
              sep='\t')
  tmp_file_name <- paste("split_", j, "/", "test_y.tsv", sep="")
  write.table(test.y, file=tmp_file_name, 
              quote=TRUE, 
              row.names = FALSE,
              sep='\t')
}
```



### Source
A subset of this data set was used in a Kaggle competition: [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial). You can find useful discussion and sample code there. 

## Performance Target

The goal is to build a binary classification model to predict the sentiment of a review with a **vocabulary size** of at most **1000**. A phrase, for example "worst_movie", counts as one word.


You should use the **same** vocabulary for all five training/test data sets. 


The evaluation metric is **AUC** on the test data. You should aim for an AUC of at least **0.96** on each of the five test sets.
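One common way to build a compact shared vocabulary (a sketch of one possible approach, not a required method) is to tokenize with the `text2vec` package, keep 1- and 2-grams (which yields "worst_movie"-style phrases), prune rare and overly frequent terms, and then screen terms with a simple marginal statistic such as the two-sample t-statistic. The snippet below assumes `text2vec` is installed and `train` is a training data frame with `review` and `sentiment` columns; the stop-word list shown is a small illustrative placeholder:

```{r, eval = FALSE}
library(text2vec)
# Tokenize reviews into 1- and 2-grams
it_train <- itoken(train$review, preprocessor = tolower,
                   tokenizer = word_tokenizer)
vocab <- create_vocabulary(it_train, ngram = c(1L, 2L),
                           stopwords = c("i", "me", "my", "the", "and", "a"))
vocab <- prune_vocabulary(vocab, term_count_min = 10,
                          doc_proportion_max = 0.5)
dtm_train <- create_dtm(it_train, vocab_vectorizer(vocab))

# Marginal screening: keep the 1000 terms with the largest absolute
# two-sample t-statistic between positive and negative reviews
pos <- dtm_train[train$sentiment == 1, ]
neg <- dtm_train[train$sentiment == 0, ]
m_pos <- colMeans(pos);            m_neg <- colMeans(neg)
v_pos <- colMeans(pos^2) - m_pos^2; v_neg <- colMeans(neg^2) - m_neg^2
tstat <- (m_pos - m_neg) / sqrt(v_pos / nrow(pos) + v_neg / nrow(neg))
myvocab <- colnames(dtm_train)[order(abs(tstat), decreasing = TRUE)[1:1000]]
writeLines(myvocab, "myvocab.txt")
```

Other screening strategies (e.g., keeping terms with nonzero lasso coefficients) work as well; whatever you choose, build the vocabulary once and reuse it for all five splits.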

## What to Submit

Submit the following **four** items on Coursera:

* A text file named **myvocab.txt** that contains your word list: one word/phrase per row, with fewer than 2000 rows. Partial credit will be given to vocabularies of size greater than 1000 but less than 2000.

* An R/Python Markdown/Notebook file in HTML explaining how you construct your vocabulary. Include **all necessary code** in this file so we can reproduce your results.

* A report (4-page maximum, PDF or HTML) that provides the technical details of your model, your implementation and any interesting findings. In addition, report the performance and running time for the five test data sets, and the computer system you use (e.g., Macbook Pro, 2.53 GHz, 4GB memory, or AWS t2.large). **More requirements on your report will be posted on Campuswire.**

  You could include some details on how you construct the vocabulary in the report, but it is not required since we can find those details in the Markdown file.

 

* R/Python code in a single file named **mymain.R** (or **mymain.py**) that takes your myvocab.txt, a training data set, and a test data set as input and outputs one submission file (the format of the submission file is given below).

  Your mymain.R should look like the following. 

```{r eval=FALSE}
#####################################
# Load libraries
# Load your vocabulary and training data
#####################################
myvocab <- scan(file = "myvocab.txt", what = character())
train <- read.table("train.tsv", stringsAsFactors = FALSE,
                   header = TRUE)

#####################################
# Train a binary classification model
#####################################


#####################################
# Load test data, and 
# Compute prediction
#####################################
test <- read.table("test.tsv", stringsAsFactors = FALSE,
                    header = TRUE)


#####################################
# Store your prediction for test data in a data frame
# "output": col 1 is test$id
#           col 2 is the predicted probs
#####################################
write.table(output, file = "mysubmission.txt", 
            row.names = FALSE, sep='\t')
```
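For reference, the skeleton above could be completed with, for example, a ridge-penalized logistic regression from `glmnet` (one possible choice, not the required model). The vectorizer is built from the terms in `myvocab.txt`, so the same vocabulary is used for every split:

```{r, eval = FALSE}
library(text2vec)
library(glmnet)

myvocab <- scan(file = "myvocab.txt", what = character())
train <- read.table("train.tsv", stringsAsFactors = FALSE, header = TRUE)

# Document-term matrix restricted to the fixed vocabulary
vectorizer <- vocab_vectorizer(create_vocabulary(myvocab, ngram = c(1L, 2L)))
dtm_train <- create_dtm(itoken(train$review, preprocessor = tolower,
                               tokenizer = word_tokenizer), vectorizer)

# Ridge logistic regression; lambda chosen by cross-validated AUC
cv_fit <- cv.glmnet(dtm_train, train$sentiment, family = "binomial",
                    alpha = 0, type.measure = "auc")

test <- read.table("test.tsv", stringsAsFactors = FALSE, header = TRUE)
dtm_test <- create_dtm(itoken(test$review, preprocessor = tolower,
                              tokenizer = word_tokenizer), vectorizer)

# Predicted probabilities for the positive class
output <- data.frame(id = test$id,
                     prob = as.vector(predict(cv_fit, dtm_test,
                                              s = "lambda.min",
                                              type = "response")))
write.table(output, file = "mysubmission.txt",
            row.names = FALSE, sep = '\t')
```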



## Code Evaluation

We shall run the command `source("mymain.R")` in a directory which contains only **three** files:

* `myvocab.txt`
* `train.tsv`
* `test.tsv`

After running your code, we should see **one** txt file in the same directory named `mysubmission.txt`. Then we shall move `test_y.tsv` to this directory, and load `test_y.tsv` and `mysubmission.txt` to compute AUC.

**Submission File Format**: `mysubmission.txt` should look like the following:

```
"id" "prob"
47604 0.940001011154441
36450 0.584891891011812
30088 0.499236341444505
18416 0.00687786009135037
```


Our evaluation R code looks like the following:
```{r eval=FALSE}
library(pROC)
source("mymain.R")
# move "test_y.tsv" to this directory
test.y <- read.table("test_y.tsv", header = TRUE)
pred <- read.table("mysubmission.txt", header = TRUE)
pred <- merge(pred, test.y, by="id")
roc_obj <- roc(pred$sentiment, pred$prob)
pROC::auc(roc_obj)
```

The evaluation for **Python code** is similar: after typing `python mymain.py` in the directory containing the three files `myvocab.txt`, `train.tsv`, and `test.tsv`, we should see **one** txt file in the same directory named `mysubmission.txt`. Then we can use the R code above to compute the corresponding AUC, which should be the same as the output from `sklearn.metrics.roc_auc_score`.
