**Problem Set #5**

- (40 points) The dataset “nlsy97.dta” contains data on the 2017 hourly wages (in U.S. dollars), job tenure (time spent at job) in years, and union status for the primary job of a sample of working Americans who were teenagers in 1997.
- Suppose that you want to understand how length of job tenure impacts wages. You are not sure which functional form to use. Start with a linear functional form. Report your estimated regression equation with all standard statistics. (7 points)
- Generate a scatterplot showing the actual and predicted values of the dependent variable on the y-axis and job tenure on the x-axis. Include your scatterplot in the problem set. (5 points)
- Interpret the job tenure coefficient. (5 points)
- Another researcher suggests that you try a semi-log functional form in which the dependent variable is in natural logs but not the independent variables. Estimate this regression and include the results in your problem set as a formatted regression table. (6 points)
- Interpret the job tenure coefficient. Based on theory and the real workings of labor markets, explain which functional form makes more sense—the semi-log or the linear form. (6 points)
- Use the marginal effect of job tenure that you described in part e. to calculate the following. (*Hint:* Remember the formula for the marginal effect in the semi-log form.) (6 points)
  - The dollar impact on wages of a one-year increase in job tenure for someone who earns $15.00 per hour.
  - The dollar impact on wages of a one-year increase in job tenure for someone who earns $30.00 per hour.
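
For reference, the marginal effect in a semi-log model can be sketched as follows. The notation is an assumption made here for illustration ($wage$, $tenure$, $\beta_0$, $\beta_1$, and $\varepsilon$ are labels chosen for this sketch and may differ from the notation used in class).

```latex
\ln(wage) = \beta_0 + \beta_1\, tenure + \varepsilon
\quad\Longrightarrow\quad
\frac{\partial\, wage}{\partial\, tenure} = \beta_1 \cdot wage
```

Under this form, the dollar effect of one more year of tenure is not constant: it scales with the wage level at which it is evaluated.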

- Finally, visualize the semi-log regression equation using a scatterplot. You will need to a) predict the natural log of the hourly wage from your last regression, b) generate an exponentiated version of that predicted value, and c) graph the predicted and actual values as you did in part b. above. For comparison, display a linear best fit line (**lfit**) on the scatterplot as well. Include your graph in your problem set. (5 points)
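
The three steps in the last part (predict, exponentiate, graph) might look like the following in a do-file. This is a minimal sketch, not a model answer: the variable names `wage` and `tenure` are hypothetical, so check the actual names in the dataset with `describe` first.

```stata
* Hypothetical variable names: wage (hourly wage), tenure (years at job).
use "nlsy97.dta", clear

* Semi-log model: natural log of the wage regressed on tenure
generate ln_wage = ln(wage)
regress ln_wage tenure

* a) predict ln(wage); b) exponentiate to get back to dollars
predict ln_wage_hat, xb
generate wage_hat = exp(ln_wage_hat)

* c) actual and predicted wages against tenure, plus a linear fit line
twoway (scatter wage tenure) ///
       (scatter wage_hat tenure) ///
       (lfit wage tenure), ///
       ytitle("Hourly wage (USD)") xtitle("Job tenure (years)") ///
       legend(order(1 "Actual" 2 "Semi-log prediction" 3 "Linear fit"))
```

Plotting the exponentiated prediction rather than `ln_wage_hat` keeps both series in dollars, which is what makes the comparison with the **lfit** line meaningful.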

- (35 points) For this problem, we will return to the sleep data, but with an updated “SLEEP75_V2.dta” dataset. This dataset changes the sleep variable to __hours__ of work instead of minutes of work.

In the last problem set, we delved deeper into the negative relationship between total work time per week and total sleep time per week. The relationship might lead you to expect that greater work time leads to worse health, as sleep is known to be related to health outcomes. At the same time, poor health can make it difficult to work (backwards causality), and greater time spent working tends to increase earnings, which in turn increases access to health care.

- You would like to estimate the following regression equation to determine whether greater work time is associated with better or worse health:

  Use a linear probability model to estimate the model. Interpret your finding for the coefficient on total work time. You do not need to present the rest of your results. (4 points)
- Generate a scatterplot with the actual and predicted values of the good health variable on the y-axis and total work time on the x-axis. Include the graph in your problem set and explain how it illustrates a problem with the linear probability model. (7 points)
- Estimate the same regression equation as part a. using a binomial logistic regression (**logit**). (3 points)
- We know from class that the predicted left-hand-side of the regression equation cannot be interpreted as a probability. However, the **predict** command will automatically generate the predicted probabilities for logit and probit models. Use this to generate a scatterplot like the one in part b but for your logit results. Explain how it shows that logit deals with the problem you discussed in part b. (7 points)
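
As a rough illustration of how the two sets of predictions can be generated and plotted, consider the sketch below. The variable names `gdhlth` (good health indicator) and `totwrk` (total work time) are assumptions, and the single-regressor specification is a simplification for illustration; use the regression equation given in part a.

```stata
* Hypothetical variable names: gdhlth (1 = good health), totwrk (work time).
use "SLEEP75_V2.dta", clear

* Linear probability model: OLS with a binary dependent variable
regress gdhlth totwrk
predict p_lpm, xb

* Logit: the pr option (the default after logit) returns predicted probabilities
logit gdhlth totwrk
predict p_logit, pr

* Compare the range of the two sets of predictions
summarize p_lpm p_logit

* One graph per model: actual and predicted values against work time
twoway (scatter gdhlth totwrk) (scatter p_lpm totwrk), name(lpm, replace)
twoway (scatter gdhlth totwrk) (scatter p_logit totwrk), name(logit, replace)
```

Naming the graphs with `name()` keeps both in memory so they can be exported separately.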

- Similarly, we discussed in class that the coefficient estimates in a logit regression cannot be interpreted as the correct marginal effects. Use **margins** along with the **atmeans** and **post** options to generate the marginal effects at the means. Be careful with your choice of other options: do you need **eyex**, **dydx**, **eydx**, etc.? Try **help margins**. (3 points)
- It is common to include both parameter estimates and marginal effects when you report logit and probit results. Use **outreg2** to output the results of i) your logit model parameter estimates and ii) your marginal effects. Title the first column “Parameter Estimates” and the second column “Marginal Effects.” Include this output in your problem set. (6 points)
- Interpret the total work hours marginal effect. (5 points)
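
The two-column table can be built by calling **outreg2** once after each set of results. The sketch below is one possible workflow under stated assumptions: `gdhlth` and `totwrk` are hypothetical variable names, `results.doc` is an arbitrary output file name, and `dydx(*)` is shown only as one candidate option, so confirm the choice against **help margins**.

```stata
* Hypothetical names: gdhlth, totwrk; results.doc is an arbitrary file name.
* outreg2 is user-written; install it once with: ssc install outreg2
logit gdhlth totwrk
outreg2 using results.doc, replace ctitle(Parameter Estimates)

* post stores the margins results as estimation results (e(b), e(V)),
* which is what lets outreg2 pick them up as a second column
margins, dydx(*) atmeans post
outreg2 using results.doc, append ctitle(Marginal Effects)
```

Because `post` replaces the logit results in memory, the logit column has to be written out before **margins** is run.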

- (70 points) It has been widely noted that the prevalence of “eating out” (outside of the home) has changed over time. Part of this may be due to changing food options and new social norms around eating meals. At the same time, other factors such as household size have also been changing over the years, impacting the time and incentives that household members have to spend cooking at home.

Use the dataset “diets.dta” from the 1990 and 2000 UK National Food Surveys to answer the following questions. This dataset contains information on the number of meals eaten “outside of the home” per person in the household, per week. You would like to know whether British households ate more or fewer meals outside of the home per person, per week in 2000 compared to 1990, controlling for changes in household size.

- Present relevant descriptive statistics about the dependent and control variables in this study __by year__. (10 points)
- Produce a relevant data visualization and use it to tell us more about important or concerning characteristics of one or more of the variables of interest. Indicate where the relevant visualization code appears in your do-file using comments. (10 points)
- Run the multivariate regression that you would need to answer the research question discussed in the paragraph above. (*Hint:* You can think of the year like a variable.) Present the results in the standard equation format used in Question 2, rounding all coefficients and standard errors to the nearest thousandth. (15 points)
- Interpret all of the slope coefficient estimates and comment on statistical significance. (10 points)
- Choose another variable in the dataset that you believe belongs in the regression. Add this variable to your regression equation and estimate the new regression. Is this new equation a better fit to the data? Explain how you know. (15 points)
- Most research suggests that households are eating out at restaurants *more* in recent years than they used to. Provide one possible explanation for the difference between your results and other research. (10 points)
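
A possible skeleton for the do-file in this question is sketched below. All variable names (`mealsout`, `hhsize`, `year`) are hypothetical placeholders, as is the commented-out `othervar`; replace them with the names that `describe` reports for “diets.dta”.

```stata
* Hypothetical variable names: mealsout (meals out per person per week),
* hhsize (household size), year (1990 or 2000).
use "diets.dta", clear

* Descriptive statistics by survey year
tabstat mealsout hhsize, by(year) statistics(mean sd min max n)

* --- Visualization: distribution of the dependent variable by year ---
histogram mealsout, by(year)

* Treat year as a variable: i.year adds an indicator for the later survey
regress mealsout i.year hhsize
display "Adjusted R-squared: " e(r2_a)

* Re-estimate with one additional control and compare the fit
* (othervar is a placeholder, not a variable known to be in the data)
* regress mealsout i.year hhsize othervar
* display "Adjusted R-squared: " e(r2_a)
```

The adjusted R-squared is displayed because, unlike the plain R-squared, it does not rise mechanically when a regressor is added.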
