

40 Cards in this Set

The correction for attenuation formula is used to
measure the impact on the validity of a test of an increase in its reliability. Because the validity of a test can be affected by its reliability, a test user or developer might want to estimate what the validity of a test would be if its reliability were perfect. This is the function of the correction for attenuation formula.
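As a sketch of the arithmetic, the correction for attenuation divides the observed validity coefficient by the square root of the product of the reliabilities; setting a reliability to 1.0 estimates validity with that measure made perfectly reliable. The numbers below are made up for illustration:

```python
from math import sqrt

def correct_for_attenuation(r_xy, r_xx=1.0, r_yy=1.0):
    """Estimate validity if the test (r_xx) and/or criterion (r_yy)
    were perfectly reliable: r_xy / sqrt(r_xx * r_yy)."""
    return r_xy / sqrt(r_xx * r_yy)

# Hypothetical observed validity .42, test reliability .70, criterion reliability .80:
corrected = correct_for_attenuation(0.42, r_xx=0.70, r_yy=0.80)
print(round(corrected, 2))  # 0.56
```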
To prevent criterion contamination from affecting the validity of a test, it is essential that
the persons who assign criterion ratings have no knowledge of the subjects' test scores. Criterion contamination occurs when a rater knows how a ratee did on a predictor test, and this knowledge affects the rating. For example, if an employee obtained a very high score on an intelligence test before being hired and the employee's supervisor knows this, the supervisor's ratings of the employee's on-the-job performance might be biased upward. Therefore, to prevent criterion contamination, the rater should have no knowledge of the ratees' predictor scores.
The best example of a criterion-referenced scoring system is
pass/fail. Criterion-referenced interpretation pegs performance to some criterion level: the person either meets the criterion or does not. A simple example is a driving test. Does the person know how to parallel park? Yes or no. If yes, then that criterion is met. Of the choices listed, a pass/fail system is the best and clearest example of criterion-referenced scoring, even though ratings or even grades could conceivably be criterion-referenced scores under some circumstances.
Which of the following is a unique characteristic of a criterion-referenced test?
each person's score is compared to an external standard. A criterion-referenced test, by definition, is interpreted in terms of an external standard. An example would be a content mastery test, which is interpreted by comparing examinees' scores to a "passing" score indicating a pre-determined level of mastery. By comparison, a norm-referenced test is interpreted by comparing examinees' scores to the scores of others who have taken the test.
Average difficulty on a test should be set at what level for maximum discrimination among test takers?
.50. If an item has a difficulty level of .50, this means that half the examinees answer it correctly and half answer it incorrectly. If the average item difficulty on a test is .50, you will have the maximum discrimination between high and low scorers. The average difficulty means that some items can be more difficult and some less difficult.
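The reasoning can be checked numerically: a dichotomous item's variance is p(1 - p), which peaks at a difficulty of .50, the point of maximum possible discrimination between high and low scorers. A small sketch with assumed difficulty levels:

```python
# Item variance p * (1 - p) for a range of difficulty (p) values.
difficulties = [0.10, 0.30, 0.50, 0.70, 0.90]
variances = {p: round(p * (1 - p), 2) for p in difficulties}
print(variances)  # variance is largest at p = 0.50
```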
If scores on several ability tests are positively correlated, one might draw the conclusion that
there is at least one factor common to all the tests. Tests which are correlated are likely to be measuring the same underlying construct. The Stanford-Binet and the WISC-III are highly correlated; this is probably because they both measure the same underlying construct of basic intellectual ability. Indeed, factor analysis, a technique which is designed to identify common factors measured by many tests, is based on identifying tests which have high correlations with each other.
At a cocktail party attended only by psychometricians, one of the guests is heard loudly arguing that only oblique rotations should be used when a factor analysis is conducted. The basis of his argument is probably
most factors are correlated. The purpose of a factor analysis is to determine the degree to which a few underlying factors or traits can account for scores on a set of variables (e.g., tests). Factors can be orthogonal (uncorrelated) or oblique (correlated), depending on how they are rotated — rotation is a statistical procedure that facilitates interpretation of the factors. A researcher can choose the rotation on the basis of his or her purpose or theory; however, many would argue that, in most cases, oblique rotations are more justifiable, since most traits or factors are correlated.
In factor analysis, an eigenvalue represents
the amount of variance in the tests accounted for by an unrotated factor. An eigenvalue is a statistic that indicates the variance accounted for by a factor extracted from a factor analysis. Eigenvalues are used to determine whether or not a factor should be "retained"; i.e., whether it is accounting for a significant amount of variance in the tests included in the analysis.
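As a sketch, eigenvalues can be extracted from a correlation matrix; the three-test matrix below is an assumed example in which one common factor accounts for most of the variance (the eigenvalues sum to the number of tests):

```python
import numpy as np

# Assumed correlation matrix for three tests sharing one common factor.
R = np.array([
    [1.0, 0.6, 0.6],
    [0.6, 1.0, 0.6],
    [0.6, 0.6, 1.0],
])
eigenvalues = np.linalg.eigvalsh(R)[::-1]  # sorted largest first
# First eigenvalue is 2.2: the first unrotated factor accounts for
# 2.2 / 3 of the total variance, so it would be retained (Kaiser's
# rule retains factors with eigenvalues greater than 1).
print(eigenvalues)
```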
In personnel selection, if you want to hire as many individuals as possible who will be successful on the job, you would want to reduce the number of
false negatives. If your concern is to hire as many successful people as possible, this means that you want to reduce the number of people who are rejected for the position even though they would have been good on the job. Such individuals are referred to as false negatives, since a selection procedure has falsely identified them as "negative" (i.e., as unqualified for the job).
A personnel director at a company is distressed that, due to a screening test, qualified candidates are not being hired, even though they would have done very well on the job. In other words, the director is distressed about the high number of
false negatives. A false negative is an examinee who comes up negative on a selection test, but actually should have been selected. An analogy is a false negative on a drug screening test. The person comes up negative on the test, but actually has drugs in his or her system. Similarly, a false negative on a job selection test is someone the test identifies as not being a good worker, when in actuality he or she is a good worker.
Item analysis is a procedure used to evaluate the
difficulty and discriminability of test items. Item analysis involves a quantitative and/or qualitative study of potential test items to determine which items will be retained for the final version of the test. There are a number of different dimensions of item analysis, but two of the most commonly employed are item difficulty (the percentage of examinees who answer an item correctly) and item discrimination (the ability of individual items to differentiate between two subgroups).
Which of the following is considered to be a drawback of norm-referenced interpretation?
It does not provide an absolute standard of performance. Norm-referenced interpretation involves comparing an individual examinee's performance to that of other individuals in the norm group. A problem with this type of interpretation is that it does not provide an absolute standard of performance. For instance, if you do well on a norm-referenced math test, it could mean that you are a math whiz or that others in the normative group are math dunces. From the norm-referenced score alone, we would not know exactly how much of the test you have mastered.
A set of predictor scores is normally distributed. Which range of scores would yield the most unreliable prediction if used to assess the predictor's relationship with a criterion?
those of the middle 33% of scorers. This question is assessing your understanding of two concepts: 1) any correlation coefficient will be lowered if a restricted range of scores is used on either the X or Y variable, and 2) in a normal distribution, most scores fall in the middle of the distribution; therefore, as compared to the lowest or highest third, the range of scores for the middle 33% of scorers will be more restricted. Thus, the lowest validity will result if only scores of the middle 33% of the distribution are used.
One good way to reduce observer drift would be to
have raters review videotapes of the units being observed. Observer drift refers to the tendency of observers to change the manner in which they apply definitions of behavior over time. It could be a problem in any situation in which human observers record behaviors by classifying them according to categories or rating criteria, such as in observational research or behavioral assessment. One way to reduce observer drift is to show videotapes of the people under observation in periodic retraining sessions for the observers, where the criteria for categorizing behaviors are discussed. Periodically bringing in new observers is another way of reducing observer drift: comparing the ratings of current observers to those of newly trained observers, who presumably will be more likely to stay with the original definitions, can reveal whether the behavioral categories are being applied differentially over time.
Which of the following two types of tests are most opposite in nature?
power; speed. A power test is designed to assess the level of difficulty a person can master; by contrast, the goal of a speed test is to assess examinees' response rate. On a pure power test, items are arranged in order of difficulty; on a pure speed test, items are uniformly easy. Sufficient time is given for examinees to attempt all items on a power test; whereas, on a speed test, the goal is to assess how far an examinee can get within the confines of a strict time limit. So as you can see, power tests and speed tests are polar opposites; most tests, however, assess elements of both power and speed.
When choosing a test to use in a vocational counseling situation, one would choose a test with
validity at least .45 and reliability at least .80. In psychological tests, reliability must be high; otherwise, the test is not measuring anything at all. A reliability coefficient of .80 is usually considered to be the level that is minimally acceptable. The validity coefficient is never as high. A validity coefficient of .40 or above is considered high in psychology.
The Kuder-Richardson Formula 20 (KR-20) would not be useful for assessing the reliability of
a speeded test. The Kuder-Richardson Formula 20 (along with Cronbach's alpha and the split-half method) is a measure of the internal consistency of a test. On a speeded test, measures of internal consistency are not appropriate indices of the test's reliability. This is because a speeded test is one in which the examinee is expected to answer every question correctly — the response rate, rather than the mastery of items, is being assessed. Such tests will always be highly internally consistent; that is, all attempted items will be highly correlated with other attempted items, since they will all be answered correctly. Therefore, measures of internal consistency yield spuriously high estimates of the reliability of speeded tests. The test-retest method and the alternative forms method are more accurate gauges of the reliability of speeded tests.
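The KR-20 formula itself is (k / (k - 1)) * (1 - sum(pq) / total-score variance), where p and q are the proportions passing and failing each item. A minimal sketch, using a made-up 4-item, 4-examinee data set:

```python
def kr20(responses):
    """KR-20 internal consistency for dichotomously scored (0/1) items.
    responses: one list of item scores per examinee."""
    n = len(responses)            # number of examinees
    k = len(responses[0])         # number of items
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    pq = 0.0
    for i in range(k):
        p = sum(person[i] for person in responses) / n  # item difficulty
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

# Hypothetical power-test data: items answered correctly (1) or not (0).
scores = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
print(round(kr20(scores), 2))  # 0.87
```

On a pure speed test every attempted item would be scored 1, inflating the inter-item correlations and the resulting coefficient, which is why the card warns against using KR-20 there.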
Cohen's Kappa Coefficient is used to determine:
inter-rater reliability. The Kappa coefficient is a measure of the agreement between two judges who each rate a set of objects using the nominal scales. It is one of the most common ways to estimate interscorer reliability.
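The coefficient is (observed agreement - chance agreement) / (1 - chance agreement). A small sketch with two hypothetical raters classifying six observations:

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters assigning nominal categories."""
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    # Observed proportion of agreement.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal proportions.
    p_e = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n)
        for c in categories
    )
    return (p_o - p_e) / (1 - p_e)

# Made-up ratings for illustration:
rater_a = ["yes", "yes", "no", "no", "yes", "no"]
rater_b = ["yes", "no", "no", "no", "yes", "yes"]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.33
```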
A personality test is administered in August. The exact same test is administered again in December to the exact same group of examinees. Most likely, this was done in order to
obtain a coefficient of stability. A coefficient of stability, or a test-retest reliability coefficient, is the correlation between scores on the same measure administered at two separate times to the same group of examinees.
If raters are checked for inter-rater agreement and trained to a high agreement criterion, what happens after the checks are done?
Agreement declines if the raters work alone. Over time, the inter-rater reliability of a subjectively scored instrument tends to decrease, especially if the raters do not regularly get together to discuss the rating criteria. This decrease is due to observer drift, or the tendency of raters, over time, to substitute their own interpretations for the original definitions of rated behaviors.
To ensure the highest possible inter-rater reliability, you would want to make sure
the scoring categories are mutually exclusive and exhaustive. If rating categories are mutually exclusive, there is no overlap between them; if they are exhaustive, every behavior fits into some category. Thus, it will be easier to determine which category a particular behavior falls into; as a result, the number of rating errors should be reduced and agreement among raters should be increased. By contrast, if the behavior occurs rarely, raters will be more prone to errors in identifying its occurrence. If raters know they are being checked, they are more likely to produce accurate ratings. And if categories are open-ended, there is more room for classifying a behavior into the wrong one.
When feasible, the best way to estimate reliability is with
a coefficient of equivalence and stability. You may have already come across questions in which the alternate forms reliability coefficient was identified as the best reliability coefficient to use when practical. The alternate forms coefficient is usually a coefficient of equivalence and stability -- equivalence because it is the correlation between two different tests that are considered to be equivalent (i.e., different forms of the same test) and stability because some time elapses between the administration of the first and second forms of the test. The alternate forms reliability coefficient may be referred to as a parallel forms coefficient, equivalent forms coefficient, and (as you saw here) a coefficient of equivalence and stability. So this shows you that you need to be somewhat flexible in your thinking and not expect that the concepts on the exam will always be presented using the exact vocabulary that you memorized.
A test has reliability of .45. This indicates that
it shouldn’t be used. A reliability coefficient of .45 is not acceptable for any test. Such a test will yield vastly different scores each time it is administered to the same individual. The minimally acceptable reliability coefficient is between .70 and .80, depending on the type of test involved.
Which of the following statements would be considered most correct?
A test that has high validity always has high reliability. The reliability of a test refers to the degree to which it yields consistent, repeatable results. A test is valid to the degree that it measures what it purports to measure. A test cannot have high validity if it does not have high reliability. However, the converse is not true: A test can have high reliability without being valid.
A test which does not correlate highly with itself most likely
lacks adequate reliability. If test-retest correlations are low, you'd have to conclude that the test lacked reliability. As such, it would also never be valid for anything, since a valid test needs first to be reliable. Note that the question says "most" -- and first and foremost, a test such as this will lack reliability. It isn't necessarily too short: a short test doesn't have to be unreliable just because it's short, and even a long test isn't guaranteed to be reliable. And how the test was normed wouldn't really affect its reliability if it's a poor test to begin with.
If the purpose of a test is to select applicants for a job for which there are 5 applicants for every position, the psychologist should make sure to use items which, on the average, are
failed 80% of the time. The test will be most effective when the average pass rate of the items matches the selection ratio. With 5 applicants for every position, the selection ratio is .20 (1 opening for every 5 applicants), so the average item difficulty level should be .20 -- that is, items passed on the average by only 20% of test-takers or, looked at another way, failed 80% of the time.
A test with a mean of 50, a standard deviation of 10, and a reliability coefficient of .75 would have a standard error of measurement of
5. The formula for the standard error of measurement is the standard deviation of test scores multiplied by the square root of [1 - the reliability coefficient]. The reliability coefficient is .75, so [1 - the reliability coefficient] is .25; the square root of .25 is .50. The standard deviation is 10; 10 times .50 is 5.0.
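The arithmetic above can be written as a one-line function; the values are the ones from the card:

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability coefficient)."""
    return sd * sqrt(1 - reliability)

print(standard_error_of_measurement(10, 0.75))  # 5.0
```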
Which of the following represents the range of possible values of the standard error of measurement?
0 to the test’s SD. This question can be answered with reference to the formula for the standard error of measurement, which is the standard deviation of the test scores multiplied by the square root of [1 - the reliability coefficient]. The reliability coefficient ranges in value from 0.0 to 1.0. If reliability is 1.0, the formula evaluates to 0; if reliability is 0.0, the formula evaluates to the standard deviation of the test scores.
You are studying with a friend, and you both take the practice test. Supposing that you obtain a score of 120 and your friend obtains a 125, and using the standard error of measurement (calculated to be 5) in comparing your scores, you can conclude that
you cannot determine for sure whose true score is higher or lower. The standard error of measurement means that you can conclude (with 68% accuracy) that your true score lies within plus or minus 5 points of the score you actually obtained. If you scored 120, all you know about your true score is that it is likely to fall somewhere between 115 and 125. The possibility cannot be ruled out therefore that your true score is 125, and that your friend's is 120. Given these scores, it is therefore impossible to determine whose true score is higher or lower.
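The logic of the comparison can be sketched as an overlap check: the one-SEM bands around two observed scores overlap whenever the scores differ by no more than twice the SEM, in which case the true scores cannot be ranked with confidence. The function name and cutoff are illustrative, not a standard procedure:

```python
def cannot_rank_true_scores(score_a, score_b, sem, z=1.0):
    """True if the +/- z*SEM bands around two observed scores overlap,
    meaning we cannot say whose true score is higher."""
    return abs(score_a - score_b) <= 2 * z * sem

# Scores of 120 and 125 with SEM = 5: bands are 115-125 and 120-130.
print(cannot_rank_true_scores(120, 125, sem=5))  # True
```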
A multitrait-multimethod matrix provides information regarding
convergent and discriminant validity. Use of a multitrait-multimethod matrix is one method of assessing a test's construct validity. The matrix contains correlations among different tests that measure both the same and different traits using similar and different methodologies. Convergent validity is indicated when different methods measuring the same trait yield similar results. For instance, the Stanford-Binet and the WAIS-III should yield similar results when measuring a person's intelligence. Tests have discriminant (or divergent) validity when there is a low correlation between two tests that are supposed to measure different constructs. For instance, you would expect a test for depression to have a low correlation with a test of self-esteem. The presence of convergent and discriminant validity provides evidence of construct validity.
If a job selection test results in higher performance for Whites as compared to Hispanics, ethnicity would be considered a(n):
moderator variable. Variables that affect the relationship between two other variables are moderator variables. When a moderator variable is present, a test is said to have differential validity. That is, in this example, there would be a different validity coefficient for Whites and Hispanics. Confounding or extraneous variables are similar, but they are variables that are not of interest in a research study yet exert a systematic effect on the DV.
An industrial-organizational psychologist conducts a study on the usefulness of a predictor. The study involves collecting data on a predictor and a criterion measure for different ethnic groups. Which of the following findings would be the strongest indication that the predictor has differential validity?
There are significant differences between the two groups in the predictor's standard error of estimate. The term "differential validity," as applied to a predictor measure, means that the predictor has a different validity coefficient for one subgroup than another. Although this finding does not say so directly, it is the best indication because the standard error of estimate is a direct function of the validity coefficient. Specifically, the higher the validity coefficient, the lower the standard error of estimate. So if a predictor test's standard error of estimate is different for two different ethnic groups, this would be a good sign that the predictor has differential validity. It is theoretically possible for tests with different standard errors of estimate to have the same validity coefficient, but this finding is still the strongest indication because it is the only one that is a direct function of the validity coefficient.
A study was undertaken to determine the validity of the diagnosis of anxiety disorder and the subclassifications of panic and generalized anxiety. It was further questioned if either or both of these subclassifications differed from depression as distinct categories. The study involved endocrinological data, family studies, genetics, course, and treatment comparisons for the different groups. The type of validity the researchers were attempting to determine was
construct. If we are trying to determine if something exists, if it has substance beyond just giving a name to it, then we are determining its construct validity -- is it a real construct? For example, if we want to show that the personality dimensions of introversion/extroversion exist, we would need to establish their construct validity: do people who are introverts differ from people who are extroverts, and do they each differ from a third or fourth or fifth group of people who are something else? In the case here we are comparing different types of anxiety disorders -- panic and generalized -- and seeing if these people differ from each other and differ from something else, depression. The methods we'd be using could be convergent and discriminant validation techniques, which are ways of establishing construct validity.
Among the following, which is the most important consideration for a test such as the psychology licensing exam?
content validity. Licensing exams are designed to measure examinees' knowledge in a particular professional domain. Thus, content validity, or the degree to which the test measures knowledge of the content domain it is supposed to measure knowledge of, is an important concern.
You are a personnel counselor working for an international telecommunications firm. The latest group of management trainees has just been selected and it is your job to make decisions as to which training program each is to be assigned. To maximize differential validity, you would recommend that
an aptitude battery be used, with the tests having different correlations with the different criterion variables. To make placements into different training programs, you want to ensure that the aptitudes of the selectees are congruent with the requirements of the various programs. Therefore, your battery must assess the different aptitudes adequately. That's what differential validity is all about: the aptitudes tested are really different and predict different behaviors.
Among the following criteria, which one is least appropriate for evaluating the content validity of a standardized achievement test in English for high school students? The
correlation of test scores with college English marks. That correlation would be used to evaluate the test's predictive validity, or the degree to which scores on the test are predictive of the examinees' performance on another measure administered at a later time. The other criteria all have to do with evaluating the test's content validity, or how well the test measures knowledge of a particular content domain (in this case, high school English).
Which of the following is the lowest validity coefficient?
.10. To answer this question correctly, you have to understand that a correlation coefficient's magnitude has nothing to do with whether it is positive or negative. A negative correlation coefficient simply indicates that there is an inverse relationship between two variables: as one variable increases in value, the other decreases. To determine the magnitude of a correlation coefficient, one has to look at its absolute value. Of the choices listed (.90, .50, -.15, .10), the coefficient with the lowest absolute value is .10.
Validity coefficients between a predictor and a criterion behavior will be reduced most as a function of
criterion unreliability. If the measures used are unreliable, you will get a spuriously low validity coefficient. You need to get this clear as there will be questions regarding this concept. Keep in mind that an unreliable measure can never be valid.
A job selection test which predicts differently for different subgroups in the population would have
differential validity. A test has differential validity when it has a different validity coefficient for two subgroups of a population. For example, if a test has a high validity coefficient for women but a low validity coefficient for men, that test would have differential validity.
In the process of constructing a new test of mechanical aptitude, you obtain the correlation between your new test and a series of previously published tests of mechanical aptitude. You do this in order to establish
convergent validity. In this situation, you are obtaining the correlation between a new measure and a set of established measures of the same construct. A high correlation between a new test and an established test of the same construct provides evidence of the new test's convergent validity. Convergent validity is a type of construct validity.