
RSH901 - Techniques and Interpretation for Advanced Statistical Research

 

WEEK 4:

Discussion Question:

 

Explain why the Poisson model is or is not suited to each of the following applications?

(a) The number of customers who enter a store each hour.

(b) The number of auto accidents reported to a claims office each day.

(c) The number of broken items in a shipment of glass vials.

 

Assignment:

1.  Read and answer the following questions in the case study Modeling Sampling Variation in the textbook.

a.  M&M’s weigh 0.86 grams on average with SD = 0.04 grams, so the coefficient of variation is 0.04/0.86 ≈ 0.047. Suppose that we decide to label packages by count rather than weight. The system adds candy to a package until the weight of the package exceeds a threshold. How large would we have to set the threshold weight to be 99.5% sure that a package has 60 pieces?

b.  Suppose the same system is used (packaging by count), but this time we only want to have 10 pieces in a package. Where is the target weight? (Again, we want to be 99.5% sure that the package has at least 10 pieces.) What assumption is particularly relevant with these small counts?

c.  In which situation (10 pieces or 60 pieces) would the packaging system be more likely to put more than the labeled count of candies into a package?
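One possible way to set up parts (a) and (b), offered as a sketch rather than the textbook's own solution: a package ends up with at least n pieces exactly when the first n - 1 pieces together have not yet exceeded the threshold. The code below assumes the piece weights are independent and that their total is roughly normal; the function name `threshold_weight` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist

MEAN_G = 0.86  # average weight of one piece in grams (from the question)
SD_G = 0.04    # SD of one piece in grams (from the question)


def threshold_weight(count, confidence=0.995):
    """Threshold that the first (count - 1) pieces stay under with the given probability.

    ASSUMPTIONS: independent piece weights, normal model for their total.
    """
    k = count - 1                         # pieces that must not trip the threshold
    z = NormalDist().inv_cdf(confidence)  # normal quantile for the chosen confidence
    return k * MEAN_G + z * SD_G * sqrt(k)


for count in (60, 10):
    print(count, "pieces -> threshold of about", round(threshold_weight(count), 2), "grams")
```

With only 9 or 10 pieces, the normal model for the total leans heavily on the individual weights being close to normal, which is the assumption part (b) asks about.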

 

2. Select one of the following:

a. Review your chosen thesis and explain if the concepts in the module are applicable. If yes, then review and assess the model.

 Or

b. 4M Market Survey - Problem 45 in Chapter 11.

A marketing research firm interviewed visitors at a car show. The show featured new cars with various types of environmentally oriented enhancements, such as hybrid engines, alternative types of fuels (such as biofuels), and aerodynamic styling. The interviews of 25 visitors who indicated they were interested in buying a hybrid included the question: What appeals most to you about a hybrid car? The question offered three possible answers:

• Savings on fuel expenses,

• Concern about global warming, and

• Desire for cutting-edge technology.

Motivation

(a) Why would a manufacturer be interested in the opinions of those who have already stated they want to buy a hybrid? How would such information be useful?

Method

(b) The question offered three choices, and customers could select more than one. If a manufacturer of hybrids is interested in the desire for cutting-edge technology, can we obtain Bernoulli trials from these responses?

(c) What random variable describes the number of visitors who indicated that cutting-edge technology appeals to them in a hybrid car? Be sure to think about the necessary assumptions for this random variable.

Mechanics

(d) Past shows have found that 30% of those interested in a hybrid car are drawn to the technology of the car. If that were still the case, how many of the 25 interviewed visitors at this show would you expect to express interest in the technology?

(e) If five visitors at this show were drawn to the cutting-edge technology, would this lead you to think that the appeal of technology had changed from the 30% found at prior shows?

Message

(f) Summarize the implications of the interviews for management, assuming that 5 of the 25 visitors expressed a desire for cutting-edge technology.
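For the Mechanics parts, a binomial model can be evaluated directly. This is a minimal sketch that assumes the 25 responses behave like independent Bernoulli trials with a common success probability of 0.30; the helper names are illustrative only.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for a binomial count with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(X <= k) for the same binomial count."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


n, p = 25, 0.30                        # sample size and past rate from the question
expected = n * p                       # expected number drawn to the technology
prob_5_or_fewer = binom_cdf(5, n, p)   # chance of seeing 5 or fewer if p is still 0.30

print("expected count:", expected)
print("P(X <= 5):", round(prob_5_or_fewer, 4))
```

A tail probability that is not small suggests that 5 of 25 is within ordinary sampling variation around the old rate.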

 

3. 4M Normality and Transformation - Problem 51 in Chapter 12.

We usually do not think of the distribution of income as being normally distributed. Most histograms show the data to be skewed, with a long right tail reaching out toward Bill Gates. This sort of skewness is so common, but normality so useful, that the lognormal model has become popular. The lognormal model says that the logarithm of the data follows a normal distribution. (It does not matter which log you use because one log is just a constant multiple of another.)

For this exercise, we’ll use a sample of household incomes from the 2010 U.S. Community Survey, which has replaced the decennial census as a source of information about U.S. households. This sample includes 392 households from coastal South Carolina.

Motivation

(a) What advantage would there be in using a normal model for the logs rather than a model that described the skewness directly?

(b) If poverty in this area is defined as having a household income less than $20,000, how can you use a lognormal model to find the percentage of households in poverty?

Method

(c) These data are reported to be a sample of households in coastal South Carolina. If the households are equally divided between only two communities in this region, would that cause problems with using these data?

(d) How do you plan to check whether the lognormal model is appropriate for these incomes?

Mechanics

(e) Does a normal model offer a good description of the household incomes? Explain.

(f) Does a normal model offer a good description of the logarithm of household incomes? Explain.

(g) Using the lognormal model with parameters set to match this sample, find the probability of finding a household with income less than $20,000.

(h) Is the lognormal model suitable for determining this probability?

Message

(i) Describe these data using a lognormal model, pointing out strengths and any important weaknesses or limitations.
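The lognormal calculation in parts (b) and (g) reduces to a normal probability on the log scale. The sketch below uses HYPOTHETICAL values for the mean and SD of log income, since the sample itself is not reproduced here; it also checks that switching from natural logs to base-10 logs leaves the probability unchanged.

```python
from math import log, log10
from statistics import NormalDist

# HYPOTHETICAL parameters of natural-log income, for illustration only.
MU_LN = 10.8
SIGMA_LN = 0.9


def prob_income_below(limit, mu, sigma, log_fn=log):
    """P(income < limit) when log_fn(income) is normal with mean mu and SD sigma."""
    return NormalDist(mu, sigma).cdf(log_fn(limit))


p_natural = prob_income_below(20_000, MU_LN, SIGMA_LN)

# The same model in base-10 logs: both parameters are rescaled by 1 / ln(10).
scale = 1 / log(10)
p_base10 = prob_income_below(20_000, MU_LN * scale, SIGMA_LN * scale, log_fn=log10)

print(round(p_natural, 4), round(p_base10, 4))
```

To apply this to the assignment, replace the two constants with the sample mean and SD of the logged incomes.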

 

4.  Similar Homes - Problem 37 in Chapter 12.

A contractor built 30 similar homes in a suburban development. The homes have comparable size and amenities, but each has been sold with features that customize the appearance, landscape, and interior.

The contractor expects the homes to sell for about $450,000. He expects that one-third of the homes will sell either for less than $400,000 or more than $500,000.

(a) Would a normal model be appropriate to describe the distribution of sale prices?

(b) What data would help you decide if a normal model is appropriate? (You cannot use the prices of these 30 homes; the model is to describe the prices of as-yet-unsold homes.)

(c) What normal model has properties that are consistent with the intuition of the contractor?
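Part (c) amounts to solving for the SD of a normal model centered at $450,000 that puts two-thirds of its probability between $400,000 and $500,000. A short sketch, assuming the contractor's "one-third" is split evenly between the two tails:

```python
from statistics import NormalDist

MEAN_PRICE = 450_000   # expected sale price from the question
HALF_WIDTH = 50_000    # distance from the mean to $400,000 or $500,000
CENTRAL_PROB = 2 / 3   # share of homes expected inside that range

# The upper limit sits at the 5/6 quantile when the tails are symmetric (ASSUMPTION).
z = NormalDist().inv_cdf(0.5 + CENTRAL_PROB / 2)
sigma = HALF_WIDTH / z

model = NormalDist(MEAN_PRICE, sigma)
inside = model.cdf(500_000) - model.cdf(400_000)

print("sigma is roughly", round(sigma, -2))
print("probability between the limits:", round(inside, 4))
```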

 

5. 4M Stock Returns - Problem 50 in Chapter 12.

Percentage changes (or returns) on the stock market follow a normal distribution—if we don’t reach too far into the tails of the distribution. Are returns on stocks in individual companies also roughly normally distributed? This example uses monthly returns on McDonald’s stock from 1990 through the end of 2011 (264 months).

Motivation

(a) If the returns on stock in McDonald’s during this period are normally distributed, then how can the performance of this investment be summarized?

(b) Should the analysis be performed using the stock price, the returns on the stock, or the percentage changes? Does it matter?

Method

(c) Before summarizing the data with a histogram, what assumption needs to be checked?

(d) What plot should be used to check for normality if the data can be summarized with a histogram?

Mechanics

(e) If the data are summarized with a histogram, what features are lost, and are these too important to conceal?

(f) Does a normal model offer a good description of the percentage changes in the price of this stock?

(g) Using a normal model, estimate the chance that the price of the stock will increase by 10% or more in at least one month during the next 12 months.

(h) Count the historical data rather than using the normal model to estimate the probability of the price of the stock going up by 10% or more in a coming month. Does the estimate differ from that given by a normal model?

Message

(i) Describe these data using a normal model, pointing out strengths and any important weaknesses or limitations.
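Part (g) combines a normal tail probability with a complement rule over 12 months. The sketch below uses a HYPOTHETICAL monthly mean and SD (in percent), because the McDonald's data are not reproduced here, and it assumes returns in different months are independent.

```python
from statistics import NormalDist

# HYPOTHETICAL summary statistics for monthly percentage changes.
MEAN_RETURN = 1.4
SD_RETURN = 6.5

# Chance that a single month's return is 10% or more under the normal model.
p_month = 1 - NormalDist(MEAN_RETURN, SD_RETURN).cdf(10.0)

# Chance of at least one such month in 12, ASSUMING independent months.
p_year = 1 - (1 - p_month) ** 12

print(round(p_month, 4), round(p_year, 4))
```

For part (h), the empirical counterpart would be the fraction of the 264 historical months with a return of at least 10%.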

 

WEEK 5:

 

Discussion Question:

A bank with branches in a large metropolitan area is considering opening its offices on Saturday, but it is uncertain whether customers will prefer (1) having walk-in hours on Saturday or (2) having extended branch hours during the week. Listed below are some of the ideas proposed for gathering data. For each, indicate what kind of sampling strategy is involved and what (if any) biases might result.

(a) Put a big ad in the newspaper asking people to log their opinions on the bank’s website.

(b) Randomly select one of the branches and contact every customer at that bank by phone.

(c) Send a survey to every customer’s home, and ask the customers to fill it out and return it.

(d) Randomly select 20 customers from each branch. Send each a survey, and follow up with a phone call if he or she does not return the survey within a week.

 Do you think the following data would represent processes that were under control or out of control? Explain your thinking.

(a) Monthly shipments of snow skis to retail stores.

(b) Number of daily transactions at the service counter in a local bank.

(c) Attendance at NFL games during the 16-week season.

(d) Number of hourly hits on a corporate website.

Assignment:

1. Case: Rare Events Part 1 in the textbook.

Study the case carefully and then answer one of the first four Questions for Thought at the end of the case study.

 

2. Case: Rare Events Part 2 in the textbook.

Using the same case answer either question 5 or 6 from the Questions for Thought at the end of the case study. 

 

3. Select one of the following:

a. Review your chosen thesis and check if the sample size, validation of the sample, and the inferences are correct. Support your observation by citing examples.

Or

b. Shopping Downtown - Problem 38 in Chapter 15.

Consider each situation described below. Identify the population and the sample, explain what the parameter p or μ represents, and tell whether the methods of this chapter can be used to create a confidence interval. If so, find the interval.

(a) The service department at a new car dealer checks for small dents in cars brought in for scheduled maintenance. It finds that 22 of 87 cars have a dent that can be removed easily. The service department wants to estimate the percentage of all cars with these easily repaired dents.

(b) A survey of customers at a supermarket asks whether they found shopping at this market more pleasing than at a nearby store. Of the 2,500 forms distributed to customers, 325 were filled in and 250 of these said that the experience was more pleasing.

(c) A poll asks visitors to a website for the number of hours spent Web surfing daily. The poll gets 223 responses one day. The average response is three hours per day with s = 1.5.

(d) A sample of 1,000 customers given loans during the past two years contains 2 who have defaulted.
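For the situations that involve a proportion, the usual z-interval can be sketched as follows. The threshold of 10 successes and 10 failures is a common rule of thumb and is an assumption here, not something stated in the problem; the function name is illustrative. The check says nothing about whether the sample is representative, which matters for parts (b) and (c).

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, n, confidence=0.95, min_count=10):
    """Return (low, high, size_ok) for a z-interval around a sample proportion."""
    p_hat = successes / n
    # Sample size condition (ASSUMED rule of thumb): enough successes and failures.
    size_ok = successes >= min_count and (n - successes) >= min_count
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, size_ok


low, high, ok_dents = proportion_interval(22, 87)      # part (a): dented cars
_, _, ok_defaults = proportion_interval(2, 1000)       # part (d): loan defaults

print(round(low, 3), round(high, 3), ok_dents)
print("defaults meet the size condition:", ok_defaults)
```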

 

4. Insulator - Problem 35 in Chapter 14. 

One stage in the manufacture of semiconductor chips applies an insulator on the chips. This process must coat the chip evenly to the desired thickness of 250 microns or the chip will not be able to run at the desired speed. If the coating is too thin, the chip will leak voltage and not function reliably. If the coating is too thick, the chip will overheat. The process has been designed so that 95% of the chips have a coating thickness between 247 and 253 microns. Twelve chips were measured daily for 40 days, a total of 480 chips.

(a) Do the data meet the sample size condition if we look at samples taken each day?

(b) Group the data by days and generate X-bar and S-charts with control limits at ±3 SE. Is the process under control?

(c) Describe the nature of the problem found in the control charts. Is the problem an isolated incident (which might be just a chance event), or does there appear to be a more systematic failure?
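The X-bar limits in part (b) follow from the design specification. This sketch assumes coating thickness is normally distributed, so that "95% between 247 and 253" pins down the process SD; it covers only the X-bar chart, and the helper `out_of_control` is a made-up name.

```python
from math import sqrt
from statistics import NormalDist, mean

TARGET = 250.0   # desired thickness in microns
N_PER_DAY = 12   # chips measured each day

# ASSUMPTION: normal process, so 3 microns spans about 1.96 SDs.
sigma = 3.0 / NormalDist().inv_cdf(0.975)
se = sigma / sqrt(N_PER_DAY)              # standard error of a daily mean

lower, upper = TARGET - 3 * se, TARGET + 3 * se


def out_of_control(daily_sample):
    """True when the mean of one day's measurements falls outside the limits."""
    return not (lower <= mean(daily_sample) <= upper)


print("X-bar limits:", round(lower, 2), "to", round(upper, 2))
```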

 

5. 4M Monitoring an Email System - Problem 37 in Chapter 14. 

A firm monitors the use of its email system. A sudden change in activity might indicate a virus spreading in the system, and a lull in activity might indicate problems on the network. When the system and office are operating normally, about 16.5 messages move through the system on average every minute, with a standard deviation near 8.

The data for this exercise count the number of messages sent every minute, with 60 values for each hour and eight hours of data for four days (1,920 rows). The data cover the period from 9 A.M. to 5 P.M. The number of users on the system is reasonably consistent during this time period.

Motivation

(a) Explain why the firm needs to allow for variation in the underlying volume. Why not simply send engineers in search of the problem whenever email use exceeds a rate of, say, 1,000 messages?

(b) Explain why it is important to monitor both the mean and the variance of volume of email on this system.

Method

(c) Because the computer support team is well staffed, there is minimal cost (aggravation aside) in having someone check for a problem. On the other hand, failing to identify a problem could be serious because it would allow the problem to grow in magnitude. What value do you recommend for α, the chance of a Type I error?

(d) To form a control chart, accumulate the counts into blocks of 15 minutes rather than use the raw counts. What are the advantages and disadvantages of computing averages and SDs over a 15-minute period compared to using the data for 1-minute intervals?

Mechanics

(e) Build the X-bar and S-charts for these data with α = 0.0027 (i.e., using control limits at ±3 SE). Do these charts indicate that the process is out of control?

(f) What is the probability that the control charts in part (e) signal a problem even if the system remains under control over these four days?

(g) Repeat part (e), but with the control limits set according to your choice of α. (Hint: If you used α = 0.0027, think harder about part (c).) Do you reach a different conclusion?

Message

(h) Interpret the result from your control charts (using your choice of α) using nontechnical language. Does a value outside the control limits guarantee that there’s a problem?
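The false-alarm question in part (f) is a complement-rule calculation over all the 15-minute blocks. The sketch assumes each plotted point has a 0.0027 chance of a false signal, that blocks are independent, and (for the second figure) that the X-bar and S-charts signal independently of each other.

```python
ALPHA = 0.0027           # per-point false-alarm rate with limits at 3 SE
MINUTES = 4 * 8 * 60     # four days of eight hours, one row per minute
BLOCK = 15               # minutes per plotted point

n_blocks = MINUTES // BLOCK   # number of points on each chart

# Chance that one chart signals at least once while the process is in control.
p_one_chart = 1 - (1 - ALPHA) ** n_blocks

# Chance that either chart signals, ASSUMING the two charts are independent.
p_either_chart = 1 - (1 - ALPHA) ** (2 * n_blocks)

print(n_blocks, round(p_one_chart, 3), round(p_either_chart, 3))
```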

WEEK 6:

 

Discussion Question:

Identify a decision process within your work or application area that could be resolved based on data.

a) Describe the null and research hypotheses.

b) Compute (or use an educated guess for) an appropriate estimate and its standard error.

c) Find a confidence interval.

d) Test the hypotheses.

e) Interpret and explain the results.

Discuss any surprises and/or insights.

Assignment:

1. Case: Data Mining using Chi Squared Part 1 in the textbook.  Study the case carefully and then answer one of the first four questions from the Questions for Thought section.

 

2. Case: Data Mining using Chi Squared Part 2 in the textbook.  Using the same case answer one question from questions 5-8 from the Questions for Thought section. 

 

3. Select one of the following:

a. Review your chosen thesis and check if the topics from the module have been used in the thesis. How have they been used? Discuss any improvements.

Or

b. Damaged Machines - Problem 39 in Chapter 16.

An appliance manufacturer stockpiles washers and dryers in a large warehouse for shipment to retail stores. Some appliances get damaged in handling. The long-term goal has been to keep the level of damaged machines below 2%. In a recent test, an inspector randomly checked 60 washers and discovered that 5 of them had scratches or dents. Test the null hypothesis H0: p ≤ 0.02, in which p represents the probability of a damaged washer.

(a) Do these data supply enough evidence to reject H0? Use a binomial model from Chapter 11 to obtain the p-value.

(b) What assumption is necessary in order to use the binomial model for the count of the number of damaged washers?

(c) Test H0 by using a normal model for the sampling distribution of p̂. Does this test reject H0?

(d) Which test procedure should be used to test H0? Explain your choice.
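The two tests in parts (a) and (c) can be compared side by side. This is a sketch that treats the 60 inspections as independent trials and tests against the boundary value p = 0.02; the "at least 10 expected successes" check is a rule-of-thumb assumption.

```python
from math import comb, sqrt
from statistics import NormalDist

N, DAMAGED, P0 = 60, 5, 0.02   # sample size, damaged count, hypothesized rate


def binom_pmf(k, n, p):
    """P(X = k) for a binomial count."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Exact binomial p-value: chance of 5 or more damaged washers when p = 0.02.
p_value_binomial = 1 - sum(binom_pmf(k, N, P0) for k in range(DAMAGED))

# Normal approximation for the sample proportion under the same hypothesis.
p_hat = DAMAGED / N
z = (p_hat - P0) / sqrt(P0 * (1 - P0) / N)
p_value_normal = 1 - NormalDist().cdf(z)

# Rule-of-thumb check (ASSUMPTION): expected damaged count under H0 should be at least 10.
normal_ok = N * P0 >= 10

print(round(p_value_binomial, 4), round(z, 2), round(p_value_normal, 5), normal_ok)
```

Both p-values are small, but the expected count under the null hypothesis is well below 10, which bears on the choice asked for in part (d).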

 

4. New Contact Lens - Problem 24 in Chapter 17.

Doctors tested a new type of contact lens. Volunteers who normally wear contact lenses were given a standard type of lens for one eye and a lens made of the new material for the other. After a month of wear, the volunteers rated the level of perceived comfort for each eye.

(a) Should the new lens be used for the left or right eye for every patient?

(b) How should the data on comfort be analyzed?

 

5. Stock Movement - Problem 27 in Chapter 17.

A stock market analyst recorded the number of stocks that went up or went down each day for 5 consecutive days, producing a contingency table with 2 rows (up or down) and 5 columns (Monday through Friday). Are these data suitable for applying the chi-squared test of independence?

 

WEEK 7:

 

Discussion Question:

What are the most important models and methods for regression analysis? Discuss some application of regression models relevant to your application area. Explain the relevance of the chosen regression model.

Assignment:

1. Case: Analyzing Experiments in the textbook. 

Study the case carefully and then answer question 1 and one other from question 2 through question 5 of the Questions for Thought section. 

 

2. Select one of the following:

a. Study your chosen thesis and find out if any of the topics covered in this module are applicable. Have they been applied correctly? What improvements would you suggest, and why? Support your thinking.

Or

b. OECD Part 1 - Problem 45 in Chapter 19.

The Organization for Economic Cooperation and Development (OECD) tracks various summary statistics of the member economies. The countries lie in Europe, parts of Asia, and North America. Two variables of interest are GDP (gross domestic product per capita, a measure of the overall production in an economy per citizen) and trade balances (measured as a percentage of GDP). Exporting countries tend to have large positive trade balances. Importers have negative balances. These data are from the 2005 report of the OECD.

(a) Describe the association in the scatterplot of GDP on Trade Balance. Does the association in this plot move in the right direction? Does the association appear linear?

(b) Estimate the least squares linear equation for GDP on Trade Balance. Interpret the fitted intercept and slope. Be sure to include their units. Note if either estimate represents a large extrapolation and is consequently not reliable.

(c) Interpret r² and s_e associated with the fitted equation. Attach units to these summary statistics as appropriate.

(d) Plot the residuals from this regression. After considering this plot, does s_e provide an adequate summary of the residual variation?

(e) Which country has the largest values of both variables? Is it the country that you expected?

(f) Locate the United States in the scatterplot and find the residual for the United States. Interpret the value of the residual for the United States.

 

3. OECD Part 2 - Problem 45 in Chapter 19.

The Organization for Economic Cooperation and Development (OECD) tracks various summary statistics of its member economies. The countries lie in Europe, parts of Asia, and North America. Two variables of interest are GDP (gross domestic product per capita, a measure of the overall production in an economy per citizen) and trade balances (measured as a percentage of GDP). Exporting countries tend to have large positive trade balances. Importers have negative balances. These data are from the 2005 report of the OECD. Formulate the SRM with GDP as the response and Trade Balance as the explanatory variable.

 (a) On average, what is the per capita GDP for countries with balanced imports and exports (i.e., with trade balance zero)? Give your answer as a range, suitable for presentation.

 (b) The foreign minister of Krakozia has claimed that by increasing the trade surplus of her country by 2%, she expects to raise GDP per capita by $4,000. Is this claim plausible given this model?

 (c) Suppose that OECD uses this model to predict the GDP for a country with balanced trade. Give the 95% prediction interval.

 (d) Do your answers for parts (a) and (c) differ from each other? Should they? 

 

4. OECD Part 3 - Problem 45 in Chapter 19.

The Organization for Economic Cooperation and Development (OECD) tracks summary statistics of the member economies. The countries are located in Europe, parts of Asia, and North America. Two variables of interest are GDP (gross domestic product per capita, a measure of the overall production in an economy per citizen) and trade balance (measured as a percentage of GDP). Exporting countries have positive trade balances; importers have negative trade balances. These data are from the 2005 report of the OECD. Formulate the SRM with GDP as the response and Trade Balance as the explanatory variable.

(a) On average, what is the per capita GDP for countries with balanced imports and exports (i.e., with trade balance zero)? Give your answer as a range, suitable for presentation.

(b) The foreign minister of Krakozia has claimed that by increasing the trade surplus of her country by 2%, she expects to raise GDP per capita by $4,000. Is this claim plausible given this model?

(c) Suppose that OECD uses this model to predict the GDP for a country with balanced trade. Give the 95% prediction interval.

(d) Do your answers for parts (a) and (c) differ from each other? Should they?

 

5. OECD Part 4 - Problem 45 in Chapter 19.

An analyst at the United Nations is developing a model that describes GDP (gross domestic product per capita, a measure of the overall production in an economy per citizen) among developed countries. She is using national data for 29 countries from the 2005 report of the Organization for Economic Cooperation and Development (OECD). She started with the equation (estimated by least squares):

Estimated per capita GDP = $26,714 + $1.441 Trade Balance

The trade balance is measured as a percentage of GDP. Exporting countries tend to have large positive trade balances. Importers have negative balances. This equation explains only 37% of the variation in per capita GDP, so she added a second explanatory variable, the number of kilograms of municipal waste per person.

(a) Examine scatterplots of the response versus the two explanatory variables as well as the scatterplot between the explanatory variables. Do you notice any unusual features in the data? Do the relevant plots appear straight enough for multiple regression?

(b) Do you think, before fitting the multiple regression, that the partial slope for trade balance will be the same as in the equation shown? Explain.

(c) Fit the multiple regression that expands the one-predictor equation by adding the second explanatory variable to the model. Summarize the estimates obtained for the fitted model.

(d) Does the estimated model appear to meet the conditions for the use of the MRM?

(e) Draw the path diagram for this estimated model. Use it to explain why the estimated slope for the trade balance has become smaller than in the simple regression shown.

(f) Give a confidence interval, to presentation precision, for the slope of the municipal waste variable. Does this interval imply that countries can increase their GDP by encouraging residents to produce more municipal waste? 

 

WEEK 8:

 

Discussion Question:

Smoothing reduces the random variation in data, producing a sequence that reveals the systematic trend in the data. Shouldn’t we build models from the smoothed data, which have less random noise, rather than from the original data? Explain why this is, or is not, such a good approach to building a model for forecasting.

The exponentially weighted moving average is a one-sided moving average of the time series. The smoothed value at time t is an average of Y_t and prior values. The regular moving average is two-sided, averaging values on both sides of Y_t. For example, a three-term moving average of Y_t is:

(Y_{t-1} + Y_t + Y_{t+1}) / 3

and a five-term moving average is:

(Y_{t-2} + Y_{t-1} + Y_t + Y_{t+1} + Y_{t+2}) / 5

(a) Which series will be smoother: a three-term moving average or a five-term moving average? Explain your thinking.

(b) What problem does a two-sided moving average have when the smoothing reaches the last value of the time series?

(c) The most common moving averages have an odd number of terms, such as the three-term and five-term averages in this exercise or the 13-term average used to smooth computer shipments in this module. What problem happens if you try to use a moving average with an even number of terms? Suggest a simple remedy for the problem.
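A two-sided moving average with an odd number of terms can be sketched in a few lines. The function name and the choice to mark unavailable positions with `None` are illustrative; the gaps at both ends are the issue raised in part (b).

```python
def centered_moving_average(values, terms):
    """Two-sided moving average with an odd number of terms.

    Positions without enough neighbors on both sides are left as None.
    """
    if terms % 2 == 0:
        raise ValueError("this sketch handles only an odd number of terms")
    half = terms // 2
    smoothed = [None] * len(values)
    for t in range(half, len(values) - half):
        window = values[t - half : t + half + 1]   # Y_{t-half} through Y_{t+half}
        smoothed[t] = sum(window) / terms
    return smoothed


series = [1, 2, 3, 4, 5, 6]
print(centered_moving_average(series, 3))
print(centered_moving_average(series, 5))
```

A wider window averages over more values, and it also leaves more positions at each end of the series without a smoothed value.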

Benchmark Assignment

This is a benchmark assignment for DCS students. Store your submission with any grading feedback in your Professional's Portfolio and use the following tag: DCS-PG3

Assignment:

1. Case: Automated Modeling in the textbook.

Study the case carefully and then answer question 1 and one other from questions 2-8 from the Questions for Thought section.

 

2. Select one of the following:

a. Review your chosen thesis and assess if the topics covered in this module have been used. Discuss the application and how they can be improved.

Or

b. R & D Expenses - Problem 43 in Chapter 19.

This data file contains a variety of accounting and financial values that describe 493 companies operating in several technology industries in 2004: software, systems design, and semiconductor manufacturing. One column gives the expenses on research and development (R&D), and another gives the total assets of the companies. Both columns are reported in millions of dollars.

(a) Scatterplot R&D Expense on Assets. Does a line seem to you to be a good summary of the relationship between these two variables? Describe the outlying companies.

(b) Estimate the least squares linear equation for R&D Expense on Assets. Interpret the fitted intercept and slope. Be sure to include their units. Note if either estimate represents a large extrapolation and is consequently not reliable.

(c) Interpret the summary values r² and s_e associated with the fitted equation. Attach units to these summary statistics as appropriate. Does the value of r² seem fair to you as a characterization of how well the equation summarizes the association?

(d) Inspect the histograms of the x- and y-variables in this regression. Do the shapes of these histograms anticipate some aspects of the scatterplot and the linear relationship between these variables?

(e) Plot the residuals from this regression. Does this plot reveal patterns in the residuals? Does s_e provide an adequate summary of the residual variation?

 

3. R & D Expenses - Problem 43 in Chapter 19.

This data file contains a variety of accounting and financial values that describe 324 companies operating in the information sector in 2010. The largest of these provide telephone services. One column gives the expenses on research and development (R&D), and another gives the total assets of the companies. Both columns are reported in millions of dollars. These data need to be expressed on a log scale; otherwise, outlying companies dominate the analysis. Use the natural logs of both variables rather than the original variables in the data table. (Note that the variables are recorded in millions, so 1,000 = 1 billion.)

(a) What difference in R&D spending (as a percentage) is associated with a 1% increase in the assets of a firm? Give your answer as a range, rounded to meaningful precision.

(b) Revise your model to use base 10 logs of assets and R&D expenses. Does using a different base for both log transformations affect your answer to part (a)?

(c) Find a 95% prediction interval for the R&D expenses of a firm with $1 billion in assets. Be sure to express your range on a dollar scale. Do you expect this interval to have 95% coverage? 

 

4. R & D Expenses - Problem 43 in Chapter 22.

This table contains accounting and financial data that describe 324 companies operating in the information sector in 2010. The largest of these provide telephone services. One column gives the expenses on research and development (R&D), and another gives the total assets of the companies. Both columns are reported in millions of dollars. Use the logs of both variables rather than the originals. (That is, set Y to the natural log of R&D expenses, and set X to the natural log of assets. Note that the variables are recorded in millions, so 1,000 = 1 billion.)

(a) What problem with the use of the SRM is evident in the scatterplot of y on x as well as in the plot of the residuals from the fitted equation on x?

(b) If the residuals are nearly normal, of the values that lie outside the 95% prediction intervals, what proportion should be above the fitted equation?

(c) Based on the property of residuals identified in part (b), can you anticipate that these residuals are not nearly normal—without needing the normal quantile plot?

 

5. R & D Expenses - Problem 43 in Chapter 23.

This data table contains accounting and financial data that describe 324 companies operating in the information sector. The variables include the expenses on research and development (R&D), total assets of the company, and the cost of goods sold (CGS). All columns are reported in millions of dollars; the variables are recorded in millions, so 1,000 = 1 billion. Use natural logs of all variables rather than the originals.

(a) Examine scatterplots of the log of spending on R&D versus the log of total assets and the log of the cost of goods sold. Then consider the scatterplot of the log of total assets versus the log of the cost of goods sold. Do you notice any unusual features in the data? Do the relevant plots appear straight enough for multiple regression?

(b) Fit the indicated multiple regression and show a summary of the estimated features of the model.

(c) Does the estimated model appear to meet the conditions for the use of the MRM?

(d) Does the fit of this model explain statistically significantly more variation in the log of spending on R&D than a model that uses the log of assets alone?

The multiple regression in part (b) has all variables on a natural log scale. To interpret the equation, note that the sum of natural logs is the log of the product,

log(x) + log(y) = log(x y)

and that

b log(x) = log(x^b)

Hence, the equation

log(y) = b0 + b1 log(x1) + b2 log(x2)

is equivalent to

y = e^b0 × x1^b1 × x2^b2

The slopes in the log-log regression are exponents in an equation that describes y as the product of the explanatory variables raised to different powers. These powers are the partial elasticities of the response with respect to the predictors. (See Chapter 20 for a discussion of elasticities.)
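The link between log-log slopes and elasticities can be checked numerically. The coefficients below are HYPOTHETICAL placeholders, not estimates from the R&D data; the sketch confirms that the log form and the product form give the same prediction, and that a 1% increase in one predictor changes the prediction by roughly its slope, in percent.

```python
from math import exp, isclose, log

# HYPOTHETICAL coefficients for illustration only.
B0, B1, B2 = 0.5, 0.7, 0.2


def predict_log_form(x1, x2):
    """Prediction from the equation written on the natural log scale."""
    return exp(B0 + B1 * log(x1) + B2 * log(x2))


def predict_product_form(x1, x2):
    """The same prediction written as a product of powers."""
    return exp(B0) * x1**B1 * x2**B2


base = predict_product_form(1000.0, 400.0)
bumped = predict_product_form(1010.0, 400.0)   # x1 raised by 1%, x2 held fixed
pct_change = 100 * (bumped / base - 1)         # close to B1 percent

print(isclose(predict_log_form(1000.0, 400.0), base, rel_tol=1e-9))
print(round(pct_change, 3))
```

Holding the other predictor fixed is what makes the slope a partial elasticity rather than a marginal one.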

(e) Interpret the slope for the log of the cost of goods sold in the equation estimated by the fitted model in part (b). Include the confidence interval in your calculation.

(f) The marginal elasticity of R&D spending with respect to CGS is about 0.60. Why is the partial elasticity in the multiple regression for CGS so different? Is it really that different?
