55 Cards in this Set
relevance |
the extent to which a test item contributes to the test's goal
based on qualitative judgment, including:
-content appropriateness
-taxonomic level (does it reflect the appropriate cognitive level?)
-extraneous abilities (does it require skills beyond the domain of interest?) |
|
item difficulty
|
p = total # passing item / total # of examinees
.50 is ideal for multiple choice; .75 is ideal for true/false tests |
|
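As a minimal sketch of the item-difficulty index above, assuming a hypothetical list of scored responses (1 = correct, 0 = incorrect):

```python
def item_difficulty(responses):
    """p = number of examinees passing the item / total number of examinees."""
    return sum(responses) / len(responses)

# Hypothetical responses to one item from ten examinees.
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
p = item_difficulty(responses)  # 7 of 10 passed, so p = .70
```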
item discrimination
|
extent to which a test item discriminates between high versus low scorers on the test or on the entire criterion
D = (% upper group answered correctly) - (% lower group answered correctly)
range: -1 to +1
D = .35 is acceptable
D = .50 is maximum discrimination |
|
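The D index can be sketched the same way; the group counts below are made up for illustration:

```python
def discrimination_index(upper_correct, upper_n, lower_correct, lower_n):
    """D = (% of upper group correct) - (% of lower group correct).
    Ranges from -1 to +1; negative values flag items low scorers do better on."""
    return upper_correct / upper_n - lower_correct / lower_n

# Hypothetical item: 9 of 10 high scorers and 4 of 10 low scorers answered correctly.
d = discrimination_index(9, 10, 4, 10)  # D = .50
```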
Item Response Theory
|
-item characteristics are considered to be the same across samples
-possible to equate scores on different tests because performance is reported in terms of the LATENT trait being measured (not just the test score)
-easier to develop computer-adaptive tests
-an ITEM CHARACTERISTIC CURVE is developed to determine the relationship between an examinee's ability and the probability he/she will respond correctly; the slope indicates the item's ability to discriminate between high and low achievers |
|
Classical Test Theory
|
obtained test score is the sum of true score and error
difficult to equate scores on different tests because the scales are different |
|
measurement error
|
random error / unsystematic
|
|
reliability
|
estimate of the proportion of variance in obtained scores accounted for by true differences on the attribute the test measures
range: 0 to +1
do NOT square the reliability coefficient - interpret it directly |
|
test-retest reliability
|
coefficient of stability
administer the same test at two points in time and compare scores; a measure of stability
source of error: factors occurring in the time between tests
best for tests designed to measure stable traits that are not affected by repeated measurement |
|
Alternate forms reliability
|
coefficient of equivalence
2 equivalent forms of a test are given to the same group and scores are compared
source of error: content sampling, plus the interaction between different examinees' knowledge and the content each form samples
best for stable traits not affected by repeated measurement
most rigorous form of reliability, but difficult to develop |
|
internal consistency
|
split-half reliability: compare odd/even items or first/second halves
underestimates reliability because each half is shorter than the full test
use the SPEARMAN-BROWN prophecy formula to estimate the reliability of the full-length test
Cronbach's COEFFICIENT ALPHA is the formula for determining inter-item consistency (the Kuder-Richardson formulas can be used when items are scored right/wrong)
source of error: content sampling; a heterogeneous content domain decreases coefficient alpha |
|
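The Spearman-Brown prophecy formula and coefficient alpha can be sketched as follows (the item data are hypothetical, and plain population variances are used throughout):

```python
def spearman_brown(r, n=2):
    """Estimated reliability when a test is lengthened by a factor of n
    (n = 2 projects a split-half coefficient to the full-length test)."""
    return n * r / (1 + (n - 1) * r)

def cronbach_alpha(items):
    """items: one list of scores per item, aligned across examinees.
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k, n = len(items), len(items[0])
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(item) for item in items) / var(totals))

r_full = spearman_brown(0.70)  # split-half r = .70 projects to about .82
```

With right/wrong (0/1) items this alpha calculation reduces to the Kuder-Richardson case noted in the card.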
Inter-rater reliability
|
kappa statistic or percent agreement
the kappa statistic accounts for the percent of agreement that occurs by chance
source of error: rater bias, rater lack of motivation, non-exhaustive or non-mutually-exclusive categories, consensual observer drift
remedy: provide training & emphasize the difference between observation and interpretation |
|
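A minimal sketch of Cohen's kappa for two raters, using made-up ratings:

```python
def cohens_kappa(rater_a, rater_b):
    """kappa = (observed agreement - chance agreement) / (1 - chance agreement).
    Chance agreement comes from each rater's marginal category proportions."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    chance = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n)
        for c in set(rater_a) | set(rater_b)
    )
    return (observed - chance) / (1 - chance)

# Hypothetical yes/no ratings of four cases by two observers.
kappa = cohens_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"])  # 0.5
```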
Factors that affect reliability
|
-test length: the longer the test, the higher the reliability (Spearman-Brown can be used to estimate reliability for a given number of items, but tends to OVERestimate)
-range of test scores: reliability is best when the range is unrestricted - heterogeneous examinees and item difficulty around .50 (or .75 for true/false)
-guessing: as the probability of guessing the correct answer increases, reliability decreases |
|
Standard error of measurement
|
SEmeas = SD * (square root of 1 - rxx)
used to build a confidence interval around an obtained score (95% = add & subtract 2 SEmeas; 99% = add & subtract 3 SEmeas)
the lower the SD and the higher the reliability, the smaller the SEmeas |
|
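The SEmeas formula and its confidence intervals, sketched with an assumed mean-100/SD-15 scale and rxx = .91:

```python
import math

def se_meas(sd, rxx):
    """SEmeas = SD * sqrt(1 - rxx)."""
    return sd * math.sqrt(1 - rxx)

def score_interval(score, sd, rxx, n_se=2):
    """n_se = 2 gives the ~95% interval, n_se = 3 the ~99% interval."""
    se = se_meas(sd, rxx)
    return (score - n_se * se, score + n_se * se)

low, high = score_interval(100, 15, 0.91)  # SEmeas = 15 * sqrt(.09) = 4.5 -> (91.0, 109.0)
```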
Content validity
|
when a test is used to obtain information about an examinee's familiarity with a content/behavior domain
-associated with achievement tests
-relies on the judgment of subject matter experts to determine whether the test is valid |
|
Construct Validity
|
when a test is used to measure a construct (hypothetical trait)
-e.g., intelligence, mechanical aptitude, self-esteem, neuroticism
-no single way to test; use multiple:
1) assess internal consistency to ensure only 1 construct is being measured
2) study group differences: does the test differentiate between people who are known to differ on the construct?
3) test whether scores change following manipulation of the construct (i.e., treatment, education)
4) assess convergent and discriminant validity
5) assess factorial validity
*most theory-laden of the validation methods* |
|
Multitrait - Multimethod Matrix
|
monotrait-monomethod: correlation between a measure and itself; useful for knowing whether the matrix is useful (should be high)
monotrait-heteromethod: if high, shows convergent validity because the same trait correlates highly across different measures
heterotrait-monomethod: if low, shows discriminant/divergent validity because a trait should not correlate with a different trait measured by the same method
heterotrait-heteromethod: should be low to show discriminant/divergent validity |
|
Factor Analysis
|
conducted to determine the minimum number of factors that account for the intercorrelations among variables
used to assess construct validity (convergent or divergent validity)
1) administer the target test along with other tests (of different constructs) to a group
2) correlate scores on each test with scores on every other test (R) - the pattern of correlations indicates how factors are extracted
3) convert the correlation matrix to a factor matrix to develop "factor loadings"
4) simplify by "rotating" for ease of interpretation
5) interpret and name the factors |
|
Factor Loadings
|
correlation coefficients that indicate the degree of association between each test and each factor
SQUARE a factor loading to get the proportion of variability in that variable accounted for by that factor |
|
Communality
|
the common variance (shared variability) in a test's scores that is accounted for by the factors
can only be calculated when the rotation is ORTHOGONAL
DO NOT SQUARE - INTERPRET DIRECTLY |
|
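The squaring rules for loadings and communality can be shown with made-up orthogonal loadings for one test on two factors:

```python
# Hypothetical orthogonal loadings of one test on two factors.
loadings = [0.60, 0.50]

# Square each loading to get the proportion of the test's variance
# accounted for by that factor.
explained = [l ** 2 for l in loadings]  # [0.36, 0.25]

# Communality = sum of squared loadings (orthogonal rotation only);
# interpret it directly, do not square it again.
communality = sum(explained)  # 0.61
```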
orthogonal
|
refers to a type of rotation in factor analysis
UNCORRELATED FACTORS
allows you to calculate communality |
|
oblique
|
refers to a type of rotation in factor analysis
CORRELATED FACTORS |
|
criterion-related validity
|
when test scores are used to draw conclusions about, or predict, standing/performance on another measure
the test is the predictor and the other measure (performance) is the criterion |
|
concurrent vs. predictive validity
|
both are forms of criterion-related validity
CONCURRENT: criterion data are collected around the same time as predictor data (good if the goal is to estimate current status)
PREDICTIVE: criterion is measured some time after the predictor (preferred if the goal is to predict future performance) |
|
criterion-related validity coefficient
|
rarely exceeds .60
coefficients as low as .2 to .3 can be acceptable
square the coefficient to determine shared variability |
|
standard error of estimate
|
similar to SEmeas, except it helps build a confidence interval around a PREDICTED CRITERION score (not an obtained score, as SEmeas does)
SEest = SD of criterion * (square root of 1 - rxy squared)
as with SEmeas, +/- 2 SEest gives the 95% confidence interval and +/- 3 SEest gives the 99% confidence interval
the smaller the criterion's standard deviation and the larger the validity coefficient, the smaller the SEest |
|
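A sketch of SEest with assumed values (criterion SD = 10, rxy = .60, predicted score = 50):

```python
import math

def se_est(sd_criterion, rxy):
    """SEest = SD of criterion * sqrt(1 - rxy**2)."""
    return sd_criterion * math.sqrt(1 - rxy ** 2)

se = se_est(10, 0.60)               # 10 * sqrt(1 - .36) = 8.0
ci_95 = (50 - 2 * se, 50 + 2 * se)  # ~95% interval around a predicted score of 50
```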
incremental validity
|
the increase in correct decisions that can be expected if the predictor is used as a decision-making tool
1) construct a scatterplot
2) set cutoff scores for the predictor and the criterion |
|
True Positive
|
predicted to be successful and meet cutoff for criterion (are successful in reality)
|
|
False Positive
|
predicted to be successful but do not meet criterion cutoff (i.e. not successful in reality)
|
|
True Negative
|
predicted to be unsuccessful and do not meet the criterion cutoff (are not successful in reality)
|
|
False Negative
|
predicted to be unsuccessful but meet the criterion cutoff (predicted unsuccessful but are successful in reality)
|
|
incremental validity formula
|
positive hit rate - base rate
positive hit rate = true positives / all predicted positives (true positives + false positives)
base rate = (true positives + false negatives) / total number of people - i.e., the proportion who would be successful if everyone were selected without the predictor
what counts as valid or invalid is a matter of judgment (there is no standard) |
|
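The formula above, as a sketch over the four scatterplot quadrants (the counts are hypothetical):

```python
def incremental_validity(tp, fp, fn, tn):
    """positive hit rate - base rate."""
    total = tp + fp + fn + tn
    positive_hit_rate = tp / (tp + fp)  # successes among those the predictor selects
    base_rate = (tp + fn) / total       # successes if everyone were selected
    return positive_hit_rate - base_rate

# Hypothetical validation sample of 100 people.
gain = incremental_validity(tp=30, fp=10, fn=20, tn=40)  # .75 - .50 = .25
```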
sensitivity
|
correct identification of TRUE POSITIVES
|
|
specificity
|
the correct identification of TRUE NEGATIVES
|
|
positive predictive value
|
percent of the validation sample identified by the predictor as having the disorder who actually have it
|
|
negative predictive value
|
percent of the validation sample identified by the predictor as not having the disorder who actually do not have it
|
|
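The four indices in the preceding cards, computed from the same outcome counts (hypothetical numbers):

```python
def classification_stats(tp, fp, fn, tn):
    """Decision-accuracy indices from the four outcome counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all actual positives
        "specificity": tn / (tn + fp),  # true negatives among all actual negatives
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

stats = classification_stats(tp=40, fp=10, fn=20, tn=30)
# sensitivity = 40/60, specificity = 30/40, ppv = 40/50, npv = 30/50
```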
relationship between reliability and validity
|
a test's validity is always limited by its reliability
a predictor's validity coefficient cannot exceed the square root of its reliability coefficient
to increase the validity coefficient, you must increase predictor and criterion reliability |
|
Correction for attenuation
|
used to estimate what the validity would be if the reliability were 1.0
you need: 1) the predictor's current reliability 2) the criterion's current reliability 3) the criterion-related validity coefficient
tends to overestimate the actual validity that can be achieved |
|
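The correction can be sketched from the three quantities the card lists (the values here are assumed):

```python
import math

def correct_for_attenuation(rxy, rxx, ryy):
    """Estimated validity if both measures were perfectly reliable:
    corrected rxy = rxy / sqrt(rxx * ryy)."""
    return rxy / math.sqrt(rxx * ryy)

# Hypothetical: validity .40, predictor reliability .80, criterion reliability .50.
corrected = correct_for_attenuation(0.40, 0.80, 0.50)  # .40 / sqrt(.40), about .63
```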
Criterion contamination
|
occurs when knowledge of predictor scores influences criterion ratings; tends to inflate the relationship between predictor and criterion
avoid by ensuring raters are not familiar with predictor scores |
|
cross-validation
|
"try out" your predictor with a different group
you will get "shrinkage" of the validity coefficient because chance factors specific to the original sample do not hold in the new one |
|
Norm-Referenced Interpretation
|
compare a test score to a normative sample
raw score is converted to another score, e.g., percentile ranks, standard scores
(-): interpretation relies on the extent to which an examinee matches the normative sample |
|
Percentile Ranks
|
express a raw score in terms of the percent of the normative sample who achieved lower scores
(+): easy to interpret
the distribution is always flat (rectangular) - a nonlinear transformation
(-): ordinal scale of measurement; do not reflect absolute differences between examinees; transforming scores to percentiles maximizes differences in the middle range and minimizes differences in the extreme ranges |
|
Standard Scores
|
norm-referenced interpretation
(+) permit comparisons of scores obtained on different tests |
|
z-score
|
z = (score - mean)/SD
mean = 0, standard deviation = 1
all scores below the mean are negative; all above the mean are positive
unless "normalized," z-scores retain the shape of the original distribution (the standard z-transformation is linear; normalizing is the nonlinear transformation) |
|
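The z-score transformation, sketched with an assumed mean-100/SD-15 scale:

```python
def z_score(score, mean, sd):
    """z = (score - mean) / SD; a linear transformation."""
    return (score - mean) / sd

z = z_score(130, 100, 15)  # 2.0: two standard deviations above the mean
```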
percentage score
|
criterion-referenced interpretation
indicates the percentage of the test that the examinee answered correctly
usually a cutoff score is set; often used for mastery evaluation when all examinees must meet a certain performance level
can also interpret a score on a predictor in terms of "likely scores on an external criterion" - create a regression equation |
|
Correction for guessing
|
involves calculating the number correct, the number incorrect, and the number of alternatives for each question
if the correction involves subtracting points from examinees' scores, the distribution will have a lower mean and a larger SD than the original distribution |
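One common correction formula (the card does not spell out the exact form, so treat it as an assumption) subtracts a fraction of the wrong answers:

```python
def corrected_score(n_correct, n_incorrect, n_alternatives):
    """score = right - wrong / (alternatives - 1); omitted items count as neither."""
    return n_correct - n_incorrect / (n_alternatives - 1)

# Hypothetical: 40 right, 12 wrong on a 4-option multiple-choice test.
score = corrected_score(40, 12, 4)  # 40 - 12/3 = 36.0
```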