Unit 2 – Part 2 – Worksheet
For part 2, you will be using this guided worksheet to analyze the data set provided. You will be required to submit both the worksheet (saved as a word document) with all your responses AND your completed R script. You must place your answers in the provided empty boxes; please provided ONLY the answers in the box that way we know where to find the answers.
Within each step, you will be using R to work towards our end goal of developing a model to predict the current week’s volume of egg sales to know how much to order. Make sure you answer all the questions within this worksheet. You will be required to include some of the visualizations within the document, so you may copy and paste those at the end (Appendix) the worksheet (make sure the visualizations have an appropriate title).
——————————————————————————————————————–
Step 1:
First, we read in the data into R and look at the first 6 rows to get a sense of the structure of the information.
Question 1: How many qualitative variables are there? How many quantitative/numeric variables are there?
Answer:
There are 7 quantitative variables and 3 qualitative variables
Question 2: What are the levels used for the “Easter” variable?
Answer:
Step 2 (A):
Having briefly ined the structure of the data, now we take a deeper look at the relationship between the variables. Knowing today's egg price per dozen, we first ine Egg Price and Cases. Take a deeper look at Egg.Pr vs Cases: Fit a SLR and create a data visualization with a fitted simple linear regression line. Are there some noticeable outliers? If so also add a fitted SLR to the data without including them to see if the line changes.
Question 3: Fill in the blank cells in the below table.
Estimate
Std Error
T value
Pr( > |t| )
Intercept
153414.5
15992.9
<< 0.0001
Egg.Pr
168.2
Question 4: Write out the regression model and interpret the slope coefficient in context.
Model:
Slope interpretation:
Question 5: Does it appear that prices of eggs is useful in predicting the current week’s volume of egg sales? Explain your reasoning.
Answer:
Question 6: What is the R2 value? Interpret this statistic.
Answer:
Question 7: Create a data visualization with a fitted simple linear regression line. (1) Are there some noticeable outliers? If so, also add a fitted SLR to the data without including them to see if the line changes. (2) Finally, interpret your final data visualization. Copy and paste your final visualization below.
Answer:
(1)
(2)
Step 2 (B):
Additionally, based on our own knowledge of egg usage we know that eggs are a common theme in Easter (paintings eggs and Easter egg hunts). Is this the case based on what information we have from the last two years? Common sense says if there are outliers that occur around Easter we can make a prediction that starts with a price effect on sales which do not occur at Easter, and then, if our Monday eggs order is around Easter, make an adjustment to that prediction. Create a helpful data visualization that explores Week versus Cases to see if there is an effect around Easter.
Question 8: (1) Create a helpful data visualization that explores Week versus Cases to see if there is an effect around Easter (paste plot in appendix). (2) Based on the visualization you created, does there appear to be an effect around Easter?
Answer for (2):
Step 2 (C):
Before we do some modeling, we should take a look at the remaining variables by creating a Pairwise Comparison Chart or Correlation Color Matrix (quantitative variables only).
Question 9: (1) Create one of these graphics as a data visualization (include in appendix). (2) Interpret your findings (i.e. should we be concern about collinearity or multicollinearity, etc?).
Answer for (2):
Step 2 (D):
At this point we checked the relationship between the other (quantitative) variables included in the data set and case sales of eggs, but what about the qualitative variables. To explore this, we create a box plot of Cases for different levels of the variable First.Week (a variable that indicates whether an observation corresponds to the first week in a month). Make this graphic into a data visualization.
Question 10: (1) Create a box plot of Cases for different levels of the variable First.Week. (include in appendix). (2) What are the median and quartile values (approximately) shown in the boxplots? (3) Does First.Week appear to have an effect? What is that effect? (4) Can you think of a plausible reason why this effect might exist?
Answers:
(2)
(3)
(4)
Having explored the data set, we now can make a Multiple Linear Regression Model BUT first we want to make sure the qualitative variables have an appropriate baseline category before we have R create the corresponding dummy variables. For the moment, we are not going to use the Month variable and just look at the remaining other variables making sure "Non Easter" for Easter and "No" for First.Week are the defaults/baseline level. Reorder these variables to do this.
——————————————————————————————————————–
Step 3:
We decided to create a full model to predict Cases of Eggs. Even though from earlier we know that there isn't strong correlation between Cases Sold and any of the other food prices. We decide to include each of these main effects in our model to see if a linear combination of them has an effect. Additionally, we aren't sure if the First Week of the month affects pricing of Eggs, Beef, Pork, Chicken, or Cereal, so we decide to include the interaction of First.Week with these variables to see if there is an effect. Since we know that Egg prices are affected by Easter we know we want to include that interaction into our model, but we don't believe there is an Easter effect on any of the food prices so we leave those out. We aren't interested in week number so will leave that out but are also interested in relationship between Month and Egg Price so will include that interaction and main effect in the model. Fit a Multiple Linear Regression model to this full model and interpret the results.
Question 11: Write the fitted regression model calculated above.
Answer:
Question 12: Provide the hypotheses, test statistic, and the p-value for the F Test. Based on this test, draw some conclusions about the overall significance of the model. Assume we are using a significance level of 5%.
Answer:
Question 13: Go through the output for the T test in R and provide a list of the variables that are significant. Assume we are using a significance level of 5%.
Answer:
Question 14: Based on the analysis from step 3, draw some final conclusions on the variables and interactions (if any) that seem important to the model.
Answer:
Step 4:
Having explored this full model we know that some model changes need to be made. We decide to run a Backward Selection Regression to simplify the model.
Question 15: Provide your line of code below that will run the Backward Selection Regression.
Answer:
Question 16: Write the final regression model from the Backward Selection method and include the appropriate R2 value for the final model.
Answer:
Step 5:
Having estimated the linear model specification, we will next estimate the multiplicative power function specification. This specification assumes that, instead of Cases being a function of the sum of the predictor variables multiplied by a coefficient (that is estimated), Cases is the product of the predictor variables, each raised to the power of an estimated coefficient. This is easy to do, because if we take the natural logarithm of each side of the multiplicative power function model, it becomes a linear model of logs of the variables, and we can use standard linear regression to estimate the coefficients. To estimate the power function specification we must, therefore, apply a natural logarithm transformation to the dependent variable (Cases) and all the continuous (ratio or interval scaled) predictor variables. However, we do not apply the natural logarithm transformation to factor variables.
ple –
A power function model of egg sales as a function of egg prices, cereal prices, and chicken prices looks like Sale = beta0 x (EggPr)^beta1 x (CerealPr)^beta2 x (ChickenPr)^beta3 x First.Week, where beta0, beta1, beta2, and beta3 are the parameters we wish to estimate (x are multiplication). We transform this model into a form we can apply linear regression to by taking the natural logarithm of both sides, resulting in the equation log(Sales) = log(beta0) + beta1 x [log(eggPr)] + beta2 x [log(CerealPr)] + beta3 x [log(ChickenPr)] + beta4 x First.Week. This is now a linear model with target variable being log(Sales) rather than Sales, and the predictor variables being the terms inside the square brackets. When we run a multiple linear regression with these logged variables, the resulting estimated intercept is log(beta0). To find beta0, we simply take the inverse, exp(intercept) = exp(log(beta0)) = beta0. Similarly, the exponents in the power function formulation, beta1, beta2, and beta3, are the estimated coefficients.
Fit a OLS MLR to all the explanatory variables in the data set minus Week (no interaction terms). Using these same variables apply a multiplicative power function to the same variables. Compare the results of the two estimated models.
Question 17: Compare the results of the two estimated models. Question to consider when comparing the two models: Which fits the data better?, How do the estimated coefficients differ across these two models, particularly their signs and significance?, Although the magnitudes are very different between the two models, are the larger magnitudes in one model the same as the larger magnitudes in the other?, Which variables appear to have effects that are “robust” to the selection of a particular functional form (that is, remain fairly similar in sign and relative magnitude and significance in both models)?
Answer:
APPENDIX
All data visualizations that we ask for need to be included here with labeling for each graph (aka what step is it for?).