Def: reliability

Repeatable and consistent
Free from error
Reflects 'true score'

Def: validity

Measures what it says it does


Def: power test

Assesses the attainable level of difficulty
No time limit
Graduated difficulty
Qs that everyone can do
Qs that no one can do
Eg: WAIS information subtest

Def: ipsative measures

Scores reported in terms of relative strength within the individual
Preference is expressed for one item over another 

Def: mastery test

Cutoff for predetermined level of performance


Def: normative measures

Absolute strength measured
All items answered
Comparison among people possible

Range and interpretation of a reliability coefficient

0 (unreliable) to 1 (perfectly reliable)
.9 means 90% of the observed score variance is accounted for by true score variance
You do NOT square a reliability coefficient

Factors affecting reliability coefficient

Anything reducing the range of obtained scores (eg a homogeneous population)
Anything increasing measurement error
Short (vs long) tests
Presence of floor or ceiling effects
High probability of guessing a correct answer

Factors affecting test-retest reliability

Maturation
Difference in conditions
Practice effects

Measures of internal consistency

Split-half: divide the test in 2 and correlate scores on the two halves; sensitive to how the items are split
Coefficient alpha: used with items scored on more than two points (eg rating scales)
Kuder-Richardson Formula 20 (KR-20): used for questions with dichotomous answers
Reliability increases with item homogeneity
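
As a rough illustration of the two coefficients on this card, the sketch below computes coefficient alpha and KR-20 from a small matrix of item scores. The function names and the data are hypothetical, and population variances are assumed throughout. With 0/1 items the two formulas should give the same value.

```python
from statistics import pvariance


def coefficient_alpha(scores):
    """Coefficient alpha for a list of examinee rows (one score per item)."""
    k = len(scores[0])  # number of items
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def kr20(scores):
    """KR-20 for dichotomous (0/1) items: each item variance is p * (1 - p)."""
    n = len(scores)     # number of examinees
    k = len(scores[0])  # number of items
    p = [sum(row[i] for row in scores) / n for i in range(k)]
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


# Hypothetical data: 5 examinees by 4 right/wrong items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(scores), 3))
print(round(coefficient_alpha(scores), 3))
```

For this made-up data set both coefficients work out to .8.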

Utility of internal consistency measures

Measurement of unstable traits
Not good for speed tests
Sensitive to item content / sampling

Appropriate measure of speed test reliability

Test-retest
Alternate forms

Measure of interrater reliability

Kappa coefficient
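
A minimal sketch of a kappa calculation, assuming the two-rater form (Cohen's kappa), which the card does not specify. The function name and the ratings are hypothetical. Kappa compares the observed agreement with the agreement expected by chance from each rater's category frequencies.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same cases."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: 1 = behavior present, 0 = behavior absent.
rater_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
rater_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
print(round(cohens_kappa(rater_a, rater_b), 3))
```

Here the raters agree on 80% of cases, chance agreement is 50%, and kappa is (.8 - .5) / (1 - .5) = .6.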


Factors improving interrater reliability

Well trained raters
Explicit observation of the raters
Mutually exclusive and exhaustive scoring categories

Def: interval recording

Recording whether a behavior occurs within each of a series of specified time intervals


Def: standard error of measurement

How much error is expected from an individual test score


Formula: standard error of measurement *

SE = SD * square root of (1 - r)
where r = the reliability coefficient, which ranges from 0 to 1

Use: standard error of measurement

Construction of a confidence interval


Probability of scores falling within a specified confidence interval

68% +/- 1 SE
95% +/- 1.96 SE
99% +/- 2.58 SE
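
The standard error of measurement and the confidence intervals built from it can be sketched as below. The function names and the example numbers (SD of 15, reliability of .91, obtained score of 110) are assumptions chosen for illustration only.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SE = SD * sqrt(1 - r), with r the reliability coefficient."""
    return sd * math.sqrt(1 - reliability)


def confidence_interval(score, sd, reliability, z=1.96):
    """Band around an obtained score; z = 1, 1.96 or 2.58 for 68%, 95%, 99%."""
    se = standard_error_of_measurement(sd, reliability)
    return score - z * se, score + z * se


se = standard_error_of_measurement(15, 0.91)
low, high = confidence_interval(110, 15, 0.91)
print(round(se, 2), round(low, 2), round(high, 2))
```

With these numbers the SE is 4.5, so the 95% interval runs from about 101.2 to 118.8.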

Use: eta *

Correlation between continuous variables that have a nonlinear relationship


Def: types of criterion related validity

Concurrent validity: predictor and criterion scores collected at the same time; useful for diagnostic tests
Predictive validity: predictor scores collected first and criterion scores later; useful for eg job selection tests

Factors affecting criterion related validity

Restricted range of scores
Unreliability of predictor or criterion
Regression
Criterion contamination

Def: criterion contamination

Occurs when the person assessing the criterion knows an individual's predictor score


Def: convergent/divergent analysis

Convergent validity is a high correlation between different measures of the same construct
Divergent validity is a low correlation between measures of different constructs

Relationship between reliability and validity

The criterion-related validity coefficient cannot exceed the square root of the predictor's reliability coefficient
Reliability coefficient sets a ceiling on the validity coefficient 
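
The ceiling described on this card is a one-line calculation. The function name and the reliability value are hypothetical.

```python
import math


def max_validity(predictor_reliability):
    """Upper bound on the validity coefficient: sqrt of predictor reliability."""
    return math.sqrt(predictor_reliability)


# A predictor with reliability .81 cannot have a validity coefficient above .9.
print(round(max_validity(0.81), 2))
```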

Def: face validity

Appearance of validity to test takers, administrators and other untrained people


Def: criterion related validity coefficient

Pearson r correlation between predictor and criterion
Acceptable range is +/- .3 to .6

Differences between standard error of measurement and standard error of estimate

Standard error of measurement: related to the reliability coefficient; used to estimate the true score on a given test
Standard error of estimate: determines where a criterion score will fall given a predictor score
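
For comparison with the standard error of measurement, the usual formula for the standard error of estimate is SD of the criterion * square root of (1 - validity coefficient squared). The card itself does not give this formula, so treat the sketch below, with its hypothetical function name and numbers, as a supplement.

```python
import math


def standard_error_of_estimate(sd_criterion, validity):
    """SEE = SD_y * sqrt(1 - r_xy ** 2), with r_xy the validity coefficient."""
    return sd_criterion * math.sqrt(1 - validity ** 2)


# Criterion SD of 10 and a validity coefficient of .6 give an SEE of 8.
print(round(standard_error_of_estimate(10, 0.6), 2))
```

Note that the validity coefficient is squared here, whereas the reliability coefficient in the standard error of measurement is not.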

Def: shrinkage

Reduction in validity coefficient on cross-validation (revalidation with a second sample)
A result of noise in original sample 

Factors affecting shrinkage

Small original validation sample
Large original item pool
Relative number of items retained is small
Items not sensibly chosen

Def: construct validity

Extent to which a test successfully measures an unobservable, abstract concept such as IQ


Techniques for assessing construct validity

Convergent validity techniques: high correlation on a trait even with different methods
Divergent / discriminant validity techniques: low correlation on different traits even with the same method
Factor analysis

Def: factor loading

Correlation between a given test and a factor derived from a factor analysis
Can be squared to give % of variance that the test accounts for in the factor 

Def: communality (factor analysis)

The proportion of variance of a test accounted for by the factors
Sum of the squared factor loadings
Interpreted directly, ie .4 = 40%
Only valid when factors are orthogonal

Def: unique variance (factor analysis)

Variance not accounted for by the factors
u^2 = 1 - h^2, where h^2 is the communality

Def: eigenvalue

Variance explained by a factor
= sum of the squared loadings on that factor
Sum of the eigenvalues <= number of tests
Applied to unrotated factors only

Formula to convert eigenvalue to %

= eigenvalue * 100 / number of tests
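
The factor analysis quantities on the preceding cards (communality, unique variance, eigenvalue, and percent of variance) can all be read off a loading matrix. The sketch below assumes a made-up matrix of 4 tests by 2 orthogonal, unrotated factors. The function names are hypothetical.

```python
def communalities(loadings):
    """h^2 per test: sum of squared loadings across the factors (row sums)."""
    return [sum(value ** 2 for value in row) for row in loadings]


def eigenvalues(loadings):
    """Eigenvalue per factor: sum of squared loadings down each column."""
    n_factors = len(loadings[0])
    return [sum(row[j] ** 2 for row in loadings) for j in range(n_factors)]


def percent_of_variance(eigenvalue, n_tests):
    """Convert an eigenvalue to a percentage of total variance."""
    return eigenvalue * 100 / n_tests


# Hypothetical loadings: rows are tests, columns are factors.
loadings = [
    [0.8, 0.1],
    [0.7, 0.2],
    [0.2, 0.6],
    [0.1, 0.9],
]
h2 = communalities(loadings)
u2 = [1 - h for h in h2]  # unique variance per test
eig = eigenvalues(loadings)
pct = [percent_of_variance(e, len(loadings)) for e in eig]
print([round(h, 2) for h in h2])
print([round(e, 2) for e in eig], [round(p, 1) for p in pct])
```

For the first test the communality is .64 + .01 = .65, leaving a unique variance of .35. The first factor has an eigenvalue of 1.18, or 29.5% of the variance across the 4 tests.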


Types of rotation (factor analysis) *

Orthogonal - uncorrelated
Oblique - correlated
Choice depends on what you believe the relationship is among the factors

Differences between principal components analysis and factor analysis

In principal components analysis:
Factors are always uncorrelated
Variance = explained + error
In factor analysis:
Variance = common + specific + error

Use: cluster analysis

Categorize or taxonomize a set of objects


Differences between cluster analysis and factor analysis

Cluster analysis: all types of data; clusters interpreted as categories
Factor analysis: interval or ratio data only; factors interpreted as underlying constructs

Def: correction for attenuation

Estimate of how much more valid a predictor would be if it and the criterion were perfectly reliable
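
The standard correction for attenuation divides the observed validity coefficient by the square root of the product of the two reliabilities. The card does not state the formula, so the sketch below, with its hypothetical function name and values, is offered as an illustration only.

```python
import math


def correct_for_attenuation(validity, predictor_reliability, criterion_reliability):
    """Estimated validity if predictor and criterion were perfectly reliable."""
    return validity / math.sqrt(predictor_reliability * criterion_reliability)


# Observed validity .36 with reliabilities of .64 and .81.
print(round(correct_for_attenuation(0.36, 0.64, 0.81), 2))
```

In this example the corrected coefficient is .36 / .72 = .5.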


Def: content validity

Adequate sampling of the relevant content domain


To reduce the number of false positives...

Raise the predictor cutoff
and / or
Lower the criterion cutoff

Def: false negative

Predicted not to meet a criterion but in reality does
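
The effect of the two cutoffs on false positives and false negatives can be checked with a small sketch. The scores, cutoffs, and function names below are all hypothetical.

```python
from collections import Counter


def classify(predictor, criterion, predictor_cutoff, criterion_cutoff):
    """Label one examinee by predicted vs actual success."""
    predicted = predictor >= predictor_cutoff
    actual = criterion >= criterion_cutoff
    if predicted and actual:
        return "true positive"
    if predicted:
        return "false positive"  # predicted to meet the criterion but did not
    if actual:
        return "false negative"  # predicted not to meet it but did
    return "true negative"


def count_outcomes(pairs, predictor_cutoff, criterion_cutoff):
    """Tally the four outcomes over (predictor, criterion) score pairs."""
    return Counter(
        classify(p, c, predictor_cutoff, criterion_cutoff) for p, c in pairs
    )


# Hypothetical (predictor score, criterion score) pairs.
pairs = [(40, 30), (55, 45), (60, 70), (70, 50), (80, 85), (45, 65)]
baseline = count_outcomes(pairs, 50, 60)
higher_predictor_cutoff = count_outcomes(pairs, 75, 60)
lower_criterion_cutoff = count_outcomes(pairs, 50, 45)
print(baseline["false positive"])
print(higher_predictor_cutoff["false positive"])
print(lower_criterion_cutoff["false positive"])
```

In this data set either change removes both false positives, although raising the predictor cutoff also turns one true positive into a false negative.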


Def: item difficulty or difficulty index *

% of examinees answering correctly
an ordinal value, because an item with an index of .2 is not necessarily half the difficulty of an item with an index of .4 

Def: item discriminability

Degree to which an item differentiates between low and high scorers
D = difference between the % of high scorers and the % of low scorers answering correctly
Ranges from -100 to +100
Moderate difficulty is optimal
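
A minimal sketch of the difficulty index and the discrimination index D for a single item. How the high and low scoring groups are formed is left open here (the card does not say), and the function names and responses are hypothetical.

```python
def item_difficulty(responses):
    """Proportion of examinees answering the item correctly (1 = correct)."""
    return sum(responses) / len(responses)


def discrimination_index(high_group, low_group):
    """D = % correct among high scorers minus % correct among low scorers."""
    return 100 * (item_difficulty(high_group) - item_difficulty(low_group))


# Hypothetical responses to one item, split by total test score.
high_group = [1, 1, 1, 0]
low_group = [1, 0, 0, 0]
print(item_difficulty(high_group + low_group))
print(discrimination_index(high_group, low_group))
```

Here the item has a difficulty of .5 overall and D = 75 - 25 = 50. A negative D would mean low scorers answered correctly more often than high scorers.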

Target values for item difficulty by objective

.5 for most tests
.25 for high cutoff (matching selection %)
.8 or .9 for mastery
Half way between chance and 1, eg t/f exams would be .75
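
The "half way between chance and 1" rule is simple arithmetic. The sketch below assumes chance is 1 divided by the number of answer options. The function name is hypothetical.

```python
def target_difficulty(n_options):
    """Midpoint between the chance success rate and 1."""
    chance = 1 / n_options
    return (chance + 1) / 2


print(target_difficulty(2))  # true/false items
print(target_difficulty(4))  # four-option multiple choice
```

This gives .75 for true/false items and .625 for four-option multiple choice items.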

Relationship between item difficulty and discriminability

Difficulty creates a ceiling for discriminability
Difficulty of .5 creates maximum discriminability
The greater the mean discriminability the greater the reliability

What can you determine from an item response (aka item characteristic) curve?

Difficulty: point where p(correct response) = .5
Discriminability: slope of the curve; steeper is more discriminating
Probability of a correct guess: intersection with the y axis
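
One way to picture these three features is a three-parameter logistic curve. The card does not name a model, so the choice of this curve, the function name, and the parameter names (a for slope, b for difficulty, c for guessing) are all assumptions made for illustration.

```python
import math


def item_characteristic_curve(theta, a=1.0, b=0.0, c=0.0):
    """P(correct) at ability theta; a = slope, b = difficulty, c = guessing."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))


# With no guessing, the probability at theta == b is exactly .5.
print(item_characteristic_curve(0.0, a=1.0, b=0.0))

# A steeper slope separates low (-1) and high (+1) ability more sharply.
steep = item_characteristic_curve(1, a=2.0) - item_characteristic_curve(-1, a=2.0)
flat = item_characteristic_curve(1, a=0.5) - item_characteristic_curve(-1, a=0.5)
print(round(steep, 3), round(flat, 3))

# At very low ability the curve levels off at the guessing rate.
print(round(item_characteristic_curve(-50, c=0.25), 3))
```

Under this model, when guessing is possible the probability at the difficulty point is half way between c and 1 rather than .5, so the ".5" rule applies to the no-guessing case.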

Def: computer adaptive assessment

Computerized selection of test items based on periodic estimates of ability


What are the advantages of a test item of moderate difficulty (p = .5)

Increases variability which increases reliability and validity
Maximally differentiates between low and high scorers 

Techniques for assessing an item's discriminability

Correlation with:
Total score
An external criterion

What are the mean and std deviation for the following standard scores: z, t, stanine and deviation IQ?

               mean   SD
z                 0    1
T                50   10
stanine           5   ~2
deviation IQ    100   15
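
Because each of these scales is a linear rescaling of z, conversions are straightforward. The sketch below takes T scores as mean 50 and SD 10, deviation IQ as mean 100 and SD 15, and stanines as mean 5 and SD of about 2. The stanine conversion rounds 2z + 5 and clips to 1 through 9, which is a common approximation rather than an exact percentile-band assignment. The function names are hypothetical.

```python
def z_to_t(z):
    """T score: mean 50, SD 10."""
    return 50 + 10 * z


def z_to_deviation_iq(z):
    """Deviation IQ: mean 100, SD 15."""
    return 100 + 15 * z


def z_to_stanine(z):
    """Approximate stanine: mean 5, SD about 2, limited to 1 through 9."""
    return max(1, min(9, round(2 * z + 5)))


for z in (-1.0, 0.0, 1.0):
    print(z, z_to_t(z), z_to_deviation_iq(z), z_to_stanine(z))
```

A score one SD above the mean (z = 1) corresponds to T = 60, a deviation IQ of 115, and a stanine of 7.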

The difference between norm-referenced and criterion-referenced scores

Norm-referenced is a comparison to others in a sample
Criterion-referenced is a measure against an external criterion

Characteristics of alternate forms reliability coefficient

Considered the best, because to be high it must be consistent across both time and content
Likely to have a lower magnitude than other coefficients 

Def: moderator variable

Variables affecting validity of a test
A moderator variable confers differential validity on the test 

Def: 'testing the limits' in dynamic assessment

Following a standardized test, using hints to elicit correct performance. The more hints necessary, the more severe the learning disability


Contents of the Mental Measurements Yearbook

Author
Publisher
Target population
Administrative time
Critical reviews

Effect on the floor of adding easy questions to a test *

Will lower the floor, allowing better discrimination among examinees at the low end


Def: dynamic assessment

Variety of procedures that follow standardized testing to get further information, usually used with learning disability or retardation
