## Source
You can find relevant information, useful discussion, and sample code on Kaggle: https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting.
**Note**: (1) ONLY the training data is used in this project, and our evaluation procedure is **different** from the one on Kaggle. (2) The Kaggle competition provides an additional CSV file containing features such as temperature, fuel price, CPI, etc. We **do not** use this feature set in our project.
## Datasets
Download the zip file **train.csv.zip** [[Here](https://liangfgithub.github.io/Data/train.csv.zip)], then use the following code to generate the datasets you need for this project.
```{r eval = FALSE}
library(lubridate)
library(tidyverse)

# read raw data and extract date column
train_raw <- readr::read_csv(unz('train.csv.zip', 'train.csv'))
train_dates <- train_raw$Date

# training data from 2010-02 to 2011-02
start_date <- ymd("2010-02-01")
end_date <- start_date %m+% months(13)

# split dataset into training / testing
train_ids <- which(train_dates >= start_date & train_dates < end_date)
train <- train_raw[train_ids, ]
test <- train_raw[-train_ids, ]

# create the initial training data
readr::write_csv(train, 'train_ini.csv')

# create test.csv by removing the Weekly_Sales column
test %>%
  select(-Weekly_Sales) %>%
  readr::write_csv('test.csv')

# create 10 two-month folds:
# month 1 --> 2011-03, ..., month 20 --> 2012-10;
# Fold 1: months 1 & 2, Fold 2: months 3 & 4, ...
num_folds <- 10
for (i in 1:num_folds) {
  # date range covered by fold i
  start_date <- ymd("2011-03-01") %m+% months(2 * (i - 1))
  end_date <- ymd("2011-05-01") %m+% months(2 * (i - 1))
  test_fold <- test %>%
    filter(Date >= start_date & Date < end_date)
  # write fold i to a file
  readr::write_csv(test_fold, paste0('fold_', i, '.csv'))
}
```
The code above will generate the following files:
* **train_ini.csv**: 5 columns ("Store", "Dept", "Date", "Weekly_Sales", "IsHoliday"), same as the train.csv file on Kaggle but restricted to 2010-02 through 2011-02.
* **test.csv**: 4 columns ("Store", "Dept", "Date", "IsHoliday"), same as the train.csv file on Kaggle but with the "Weekly_Sales" column removed, covering 2011-03 through 2012-10.
* **fold_1.csv**, ..., **fold_10.csv**: 5 columns ("Store", "Dept", "Date", "Weekly_Sales", "IsHoliday"), same format as the train.csv file on Kaggle, one file for each two-month period from 2011-03 through 2012-10.
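As an optional sanity check (a sketch, assuming the files above were written to the working directory), you can confirm that the generated files have the expected columns and date ranges:

```{r eval = FALSE}
library(readr)

# the initial training data should have all 5 columns
train_ini <- read_csv('train_ini.csv')
stopifnot(identical(names(train_ini),
                    c("Store", "Dept", "Date", "Weekly_Sales", "IsHoliday")))
range(train_ini$Date)   # should span 2010-02 to 2011-02

# each fold covers a two-month window; e.g., fold 1 is 2011-03 to 2011-04
fold_1 <- read_csv('fold_1.csv')
range(fold_1$Date)
```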
We also provide a zip file **F22_proj2_data.zip** that contains the above 12 files on Campuswire.
## Goal
The file, **train_ini.csv**, provides the weekly sales data for various stores and departments from 2010-02 (February 2010) to 2011-02 (February 2011).
Given **train_ini.csv** (the data up to 2011-02), you need to predict the weekly sales for 2011-03 and 2011-04. You will then be provided with the actual weekly sales for 2011-03 and 2011-04 (**fold_1.csv**), and you need to predict the weekly sales for 2011-05 and 2011-06, and so on:
* `t = 1`, predict 2011-03 to 2011-04 based on data from 2010-02 to 2011-02 (train_ini.csv);
* `t = 2`, predict 2011-05 to 2011-06 based on data from 2010-02 to 2011-04 (train_ini.csv, fold_1.csv);
* `t = 3`, predict 2011-07 to 2011-08 based on data from 2010-02 to 2011-06 (train_ini.csv, fold_1.csv, fold_2.csv);
* ......
* `t = 10`, predict 2012-09 to 2012-10 based on data from 2010-02 to 2012-08 (train_ini.csv, fold_1.csv, fold_2.csv, ..., fold_9.csv)
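For concreteness, here is a sketch of a naive baseline `mypredict` that satisfies this protocol; it simply carries forward each (Store, Dept) series' most recent observed sales and is not meant to be a competitive model. It assumes `tidyverse` and `lubridate` are loaded and relies on the global variables `train`, `test`, and `t` described under Code Evaluation below.

```{r eval = FALSE}
mypredict <- function() {
  # the two-month window to predict at step t
  start_date <- ymd("2011-03-01") %m+% months(2 * (t - 1))
  end_date <- start_date %m+% months(2)
  test_current <- test %>%
    filter(Date >= start_date & Date < end_date)

  # for each (Store, Dept), take the most recent observed Weekly_Sales
  last_obs <- train %>%
    group_by(Store, Dept) %>%
    slice_max(Date, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    select(Store, Dept, Weekly_Pred = Weekly_Sales)

  # attach the naive predictions; pairs unseen in training get NA,
  # which the evaluation code treats as a prediction of 0
  test_current %>%
    left_join(last_obs, by = c('Store', 'Dept'))
}
```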
## Code Evaluation
Name your submission as **mymain.R**. Our evaluation code looks like the following:
```{r eval = FALSE}
source("mymain.R")

# read in train / test dataframes
train <- readr::read_csv('train_ini.csv')
test <- readr::read_csv('test.csv')

# wae: vector recording the weighted mean absolute error (WMAE) per fold
num_folds <- 10
wae <- rep(0, num_folds)

for (t in 1:num_folds) {
  # *** THIS IS YOUR PREDICTION FUNCTION ***
  test_pred <- mypredict()

  # read new data from fold_t
  fold_file <- paste0('fold_', t, '.csv')
  new_train <- readr::read_csv(fold_file, col_types = cols())

  # extract predictions matching up to the new data
  scoring_tbl <- new_train %>%
    left_join(test_pred, by = c('Date', 'Store', 'Dept'))

  # compute WMAE
  actuals <- scoring_tbl$Weekly_Sales
  preds <- scoring_tbl$Weekly_Pred
  preds[is.na(preds)] <- 0
  weights <- if_else(scoring_tbl$IsHoliday, 5, 1)
  wae[t] <- sum(weights * abs(actuals - preds)) / sum(weights)

  # update train data and get ready to predict at (t+1)
  train <- train %>% add_row(new_train)
}
print(wae)
mean(wae)
```
We use the same evaluation metric as the [one described on Kaggle](https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/overview/evaluation), which places a higher weight on the following **four** holiday weeks:
* Super Bowl
* Labor Day
* Thanksgiving
* Christmas
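Concretely, the metric computed in the evaluation loop above is

$$
\mathrm{WMAE} = \frac{1}{\sum_i w_i} \sum_i w_i \, \lvert y_i - \hat{y}_i \rvert,
\qquad
w_i = \begin{cases} 5 & \text{if week } i \text{ is one of the four holiday weeks,} \\ 1 & \text{otherwise,} \end{cases}
$$

where $y_i$ is the actual weekly sales and $\hat{y}_i$ is your prediction for observation $i$.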
## What to Submit
Submit the following **two** items on Coursera/Canvas:
* R/Python code in a single file named **mymain.R** (or **mymain.py**). No zip files; no Markdown/Notebook files.
  Your file, mymain.R, should load all the libraries it needs and contain the function `mypredict` that will be called in our evaluation code.
  - `mypredict` should return predictions for the next two months stored in a column named "Weekly_Pred".
  - Variables like `train`, `test`, and `t` are global, so `mypredict` can access them.
* A report (3 pages maximum, PDF) that:
  1. Provides technical details (e.g., pre-processing, non-trivial implementation details) for the model you use.
  2. Reports the accuracy on the test data (the WMAE on each of the 10 folds and the average over the 10 folds), the running time of your code, and the computer system you used (e.g., MacBook Pro, 2.53 GHz, 4GB memory, or AWS t2.large).
  3. Does not copy-and-paste your code into the report.