Data Analysis Practice Exam
The following questions are designed to indicate the scope and style of questions that may be on your midterm. Warning: Do not stop your preparations prematurely. The real exam may seem more difficult!

1. You would use a scatter plot to display the:
A. percent of students in a college course who come from each class (freshman to senior).
B. relationship between the sex of the students and their scores on a course examination.
C. distribution of the GPA among all students enrolled in the course.
D. relationship between the GPA of the students and their scores on a course exam.

2. Which of the following statements have something wrong with them (the answer could be A, B, neither, or both):
A. "The correlation between a voter's religion and his or her political party is r = .45."
B. "Since the R Squared between grades and hours studied per week is .06, we can conclude that a one hour increase in study hours will result in a .06 increase in GPA."

3. The regression line for the relationship between X and Y is Y = 10 + 5X. One actual data point has X = 9 and Y = 65. The mean of Y is 25. Compute the conceptual equivalent of the unadjusted R Square for this single point.

4. The null hypothesis regarding the R square for any regression is ______________

5. An important advantage of experiments over observational studies is:
A. experiments are usually less expensive
B. experiments are more realistic
C. experiments are easier to design and carry out
D. none of these

6. You read that Scholastic Aptitude Test scores in high school explain only 9% of the variation in students' later grades in college. Therefore, the R Squared = __________, the "Multiple R" = _________, and the standard error of estimate = ______________.

The next two questions are based on the following situation. A news report says that a national opinion poll of 1500 randomly selected adults in the United States found that 43 percent thought they would be worse off during the next year. The news report went on to say that the margin of error in the poll result is ± 3 percentage points with 95 percent confidence.

7. If the poll had interviewed 1000 persons rather than 1500 (and still found 43% believing they would be worse off), the margin of error for 95% confidence would be:
A. less than ± 3 percentage points
B. equal to ± 3 percentage points
C. greater than ± 3 percentage points
D. any of the above -- the margin of error is random

8. If the poll had obtained the outcome 43 percent by a similar random sampling method from all adults in New York State (population 18 million) instead of from all adults in the U.S. (population 249 million), the margin of error for 95 percent confidence would be:
A. less than ± 3 percentage points
B. equal to ± 3 percentage points
C. greater than ± 3 percentage points
D. any of the above -- the margin of error is random

9. According to the Central Limit Theorem, what distribution must the population have before we can assume that the means of large samples selected from that population would follow a normal distribution? What is a 'large' sample?

10. In hypothesis testing, the worst error is usually referred to as the:
A. Type I error
B. Type II error
C. Type III error
D. null hypothesis
E. none of these

11. Whenever possible, the null hypothesis should be set up so as to:
A. minimize sample error
B. avoid making a Type I error
C. avoid making a Type II error
D. minimize sample size
E. none of these

 

12. The mean of the sample is used to estimate the mean of the population and the standard deviation of the sample is used to estimate the standard deviation of the population. The standard error of the sample is used to:

A. estimate the range of the population

B. estimate the variance of the population

C. estimate the "non-response bias" in the sample

D. estimate the skewness of the population

E. none of these

13. The most commonly used form of nonlinear multiple equations is the:

A. quadratic

B. power

C. cubic

D. exponential

14. Review the tables below, then answer the questions.

A recent newspaper headline reported that death rates of patients undergoing operations were higher in public hospitals than in private hospitals, namely 3 percent versus 2 percent, respectively. A statistician was hired to search for lurking variables. The first thing she did was to separate out the death rates for patients who were in good condition healthwise when they were admitted from those who were in poor condition. Do these figures suggest a "Simpson's Paradox" situation? Why or why not? Explain fully.

 

                 Good Condition          Poor Condition
                 Public     Private      Public     Private
 Died                 6           8          57           8
 Survived           594         492        1443         192

15. Bill is a politician who very much wants to be reelected. His ratings are going down, however, and he is worried. His support is eroding quickly, and his campaign director believes that unless he has over 60 percent of the vote right now, he will have lost his lead by the time the election takes place in three weeks. A TV campaign could help but is very expensive, and his campaign funds are low. He plans to have a pollster estimate his current support to see how close he is to 60 percent.

a. What would be the two mistakes Bill could make regarding the TV campaign?

b. Which would be the worse mistake? Why?

c. Which of the following should be Bill's null hypothesis regarding the 60 percent?

A. Ho: P <= 60%

B. Ho: P >= 60%

C. Ho: P = 60%

D. Ho: P not = 60%

E. None of these

Consider the regression printout from Excel shown below (fictitious data). Answer the following questions based on your analysis of the printout. The variables were:

Y = quantity of units produced during month t by a manufacturing firm

X1 = inventory at the end of period t - 1

X2 = anticipated sales in period t + 1

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.970192167
R Square             0.94127284
Adjusted R Square    0.924493652
Standard Error       42.68771303
Observations         10

ANOVA
              df    SS             MS             F              Significance F
Regression    2     204446.8141    102223.407     56.09763792    4.90836E-05
Residual      7     12755.68591    1822.240844
Total         9     217202.5

                Coefficients     Standard Error    t Stat          P-value
Intercept       82.67643636      612.7538014       0.134926028     0.89646812
X Variable 1    482.9224421      46.47003698       10.39212519     1.65934E-05
X Variable 2    -680.6792932     102.0377531       -6.670857331    0.000284983

 

16. Does the regression equation make sense in terms of the coefficients? Why or why not?

17. Is the quantified impact of the variables significant?

18. Is the R squared value significantly different from zero?

19. Is the equation a good fit?

20. What non-linear equational form would make sense in this case?

21. If the X values in a simple linear regression are all multiplied by a constant, what will be altered in the regression results?


Answers:
1. D - It is the only option with numerical data for both variables.
2. Both - A is wrong because correlations can be calculated only between variables that are numerical in nature, whereas religion and political party affiliation are attributes. B is wrong because the speaker is confusing R Square with the slope coefficient in the regression analysis.
3. The predicted value is 10 + 5(9) = 55, or 30 above the mean of 25, and the actual value is 65, or 40 above the mean. Thus, 30/40 = 75% of Y's variation at this point is explained by the regression.
4. R Squared = 0
5. D - experiments can give good evidence for causation
6. R Squared = .09, Multiple R = .30, Standard Error of Estimate cannot be calculated from the information given.
7. C - The basic equation is Sxbar = S/√n. If n drops from 1500 to 1000, Sxbar increases.
8. B - The size of the population being sampled has no bearing on the formula above (provided it is at least ten times larger than the sample).
9. Any distribution, regardless of its shape. A "large" sample is generally considered to be any randomly selected sample that has 30 or more observations (although some scientists prefer at least 125; the bottom line is the larger the better).
10. A
11. B - It is a matter of logic to set up the null hypothesis so as to avoid the worst error because you will take action based on believing that the null is true unless the data indicate otherwise. The classic example is law, where we think it is worse to convict an innocent person than free a guilty one, so we believe everyone is innocent and will free them unless the evidence shows 'beyond the shadow of a doubt' that the person is guilty.
12. E
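
The reasoning in answers 7 and 8 can be checked numerically. The following is a minimal sketch, assuming the usual large-sample formula for the margin of error of a proportion, a critical value of 1.96 for 95 percent confidence, and the standard finite population correction; the function name and constant are illustrative and not part of the exam. The computed figures need not match the rounded ± 3 quoted in the news report; the comparison between cases is the point.

```python
import math

Z_95 = 1.96  # assumed critical value for 95% confidence


def margin_of_error(p, n, population=None):
    """Margin of error for a sample proportion, as a proportion (not percent)."""
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction: close to 1 when the population
        # is far larger than the sample.
        se *= math.sqrt((population - n) / (population - 1))
    return Z_95 * se


# Answer 7: same 43 percent result, smaller sample.
moe_1500 = margin_of_error(0.43, 1500)
moe_1000 = margin_of_error(0.43, 1000)

# Answer 8: same sample size, very different population sizes.
moe_ny = margin_of_error(0.43, 1500, population=18_000_000)
moe_us = margin_of_error(0.43, 1500, population=249_000_000)

print(f"n = 1500: +/- {100 * moe_1500:.2f} points")
print(f"n = 1000: +/- {100 * moe_1000:.2f} points")
print(f"New York: +/- {100 * moe_ny:.4f} points")
print(f"U.S.:     +/- {100 * moe_us:.4f} points")
```

Under these assumptions the margin grows from roughly 2.5 to roughly 3.1 points when n drops to 1000, while the New York and U.S. figures agree to several decimal places.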

13. B (all variables must be transformed to their logarithmic values first)

14. The news report says 3% died in public hospitals vs. 2% in private. Compare these to the percentages that died in each when the lurking variable of initial patient condition is separated out:

 

                 Good Condition          Poor Condition
                 Public     Private      Public     Private
 Died                 6           8          57           8
 Survived           594         492        1443         192
 Total:             600         500        1500         200
 As Percent:         1%        1.6%        3.8%          4%

Note that when the patients are separated by condition, a smaller percentage of public hospital patients die in both groups (1% vs. 1.6% and 3.8% vs. 4%). Since this is a reversal of the overall comparison, Simpson's paradox is present.
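
The reversal can be confirmed by recomputing the death rates from the counts in the table. This is a small sketch; the dictionary layout and variable names are illustrative choices, and only the eight counts come from the exam.

```python
# (died, survived) counts copied from the table, keyed by (condition, hospital).
counts = {
    ("good", "public"): (6, 594),
    ("good", "private"): (8, 492),
    ("poor", "public"): (57, 1443),
    ("poor", "private"): (8, 192),
}

# Death rate within each condition group.
by_condition = {
    key: died / (died + survived) for key, (died, survived) in counts.items()
}

# Death rate for each hospital type with the condition groups pooled.
overall = {}
for hospital in ("public", "private"):
    died = sum(d for (_, h), (d, s) in counts.items() if h == hospital)
    total = sum(d + s for (_, h), (d, s) in counts.items() if h == hospital)
    overall[hospital] = died / total

# Simpson's paradox: public looks worse overall but better within every group.
reversal = overall["public"] > overall["private"] and all(
    by_condition[(c, "public")] < by_condition[(c, "private")]
    for c in ("good", "poor")
)

for key, rate in by_condition.items():
    print(key, f"{100 * rate:.1f}%")
print("overall", {h: f"{100 * r:.1f}%" for h, r in overall.items()})
print("reversal present:", reversal)
```

The pooled rates come out near the 3 percent and 2 percent in the headline because public hospitals treat a much larger share of patients admitted in poor condition.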

15. a.

1. To spend the money when he should not because he will win anyway.

2. Not to spend the money when he should.

b. The worse mistake is not spending the money when he should, because he cares more about being reelected than about saving money.

c. A. Null hypothesis: P <= 60% (he will therefore spend the money)
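
To see how the null hypothesis in part c drives the decision, here is a sketch of a one-sided z test for a proportion. The poll figures (400 respondents, 256 supporters) are hypothetical and do not come from the exam, and the 5 percent significance level with its critical value of 1.645 is likewise an assumption.

```python
import math

P0 = 0.60            # threshold from the question
N_POLLED = 400       # hypothetical sample size (not from the exam)
N_SUPPORT = 256      # hypothetical number of supporters (not from the exam)
Z_CRITICAL = 1.645   # assumed one-sided critical value at the 5% level

p_hat = N_SUPPORT / N_POLLED

# Standard error computed under the null hypothesis P = 60%.
standard_error = math.sqrt(P0 * (1 - P0) / N_POLLED)
z = (p_hat - P0) / standard_error

# Ho: P <= 60% is rejected only if support is convincingly above 60%.
reject_null = z > Z_CRITICAL
decision = "skip the TV campaign" if reject_null else "run the TV campaign"

print(f"p_hat = {p_hat:.2f}, z = {z:.3f}, decision: {decision}")
```

With these made-up numbers the observed support is 64 percent, yet z stays just under the critical value, so the null is retained and Bill spends the money. That is the intended effect of choosing Ho: P <= 60%: the default action guards against the worse mistake.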

16. No - the coefficient should be negative for X1 (produce less when inventories are high) and positive for X2 (higher future sales should cause us to produce more).

17. Yes - very puzzling. Why should a variable that has the wrong sign theoretically be statistically significant?

18. Yes - ditto

19. Yes - Why should the fit be so good given that the variables are opposite what theory says they ought to be? Has someone punched in the wrong data?
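
The significance and fit figures cited in answers 17 through 19 can be cross-checked from the printout itself. The sketch below recomputes them using the standard definitions (R Square = SS regression / SS total, F = MS regression / MS residual, t Stat = coefficient / standard error); the variable names are illustrative, and the inputs are copied from the printout.

```python
import math

# Values copied from the ANOVA table in the printout.
ss_regression = 204446.8141
ss_residual = 12755.68591
ss_total = 217202.5
n_obs = 10
k_predictors = 2

df_residual = n_obs - k_predictors - 1

r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * (n_obs - 1) / df_residual
ms_regression = ss_regression / k_predictors
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual
standard_error = math.sqrt(ms_residual)

# (coefficient, standard error) pairs copied from the coefficient table.
coefficients = {
    "Intercept": (82.67643636, 612.7538014),
    "X Variable 1": (482.9224421, 46.47003698),
    "X Variable 2": (-680.6792932, 102.0377531),
}
t_stats = {name: coef / se for name, (coef, se) in coefficients.items()}

print(f"R Square          {r_square:.6f}")
print(f"Adjusted R Square {adjusted_r_square:.6f}")
print(f"F                 {f_stat:.4f}")
print(f"Standard Error    {standard_error:.4f}")
for name, t in t_stats.items():
    print(f"t Stat, {name}: {t:.4f}")
```

The recomputed values agree with the printout, so the figures are at least internally consistent; that does not settle whether the underlying data were entered correctly.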

20. Nothing obvious - there is no reason to choose one form over another, so use the simplest, namely the linear form.

21. The b coefficients will be altered, but not the Adjusted R Square, Significance F, or P-values.
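
Answer 21 can be illustrated with a small least-squares fit. The data points and the constant below are hypothetical, chosen only for illustration, and the helper function is a plain textbook implementation rather than anything from the exam.

```python
def simple_ols(xs, ys):
    """Return (slope, intercept, r_square) for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_square = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_square


# Hypothetical data (not from the exam).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
C = 10.0  # hypothetical constant multiplied into every X value

slope, intercept, r_square = simple_ols(xs, ys)
slope_c, intercept_c, r_square_c = simple_ols([C * x for x in xs], ys)

print(f"original: slope = {slope:.4f}, intercept = {intercept:.4f}, R Square = {r_square:.6f}")
print(f"scaled:   slope = {slope_c:.4f}, intercept = {intercept_c:.4f}, R Square = {r_square_c:.6f}")
```

In this sketch the slope is divided by the constant while the intercept and R Square are unchanged. Adjusted R Square depends only on R Square, the number of observations, and the number of predictors, so it is unchanged as well.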
