relevance

the extent that a test item contributes to the test's goal

based on qualitative judgement, including:
-content appropriateness
-taxonomic level (does it reflect the appropriate cognitive level?)
-extraneous abilities (does it require skills beyond the domain of interest?)
item difficulty
p = total # passing item/total number of examinees

.50 is ideal for multiple choice
.75 ideal for true/false tests
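
A quick sketch (not from the cards) of the difficulty index in Python, assuming items scored 1 = correct, 0 = incorrect; the function name and responses are illustrative:

```python
def item_difficulty(scores):
    """p = number passing the item / total number of examinees."""
    return sum(scores) / len(scores)

# Hypothetical item scored right (1) or wrong (0) by 8 examinees:
responses = [1, 1, 0, 1, 0, 1, 1, 0]
print(item_difficulty(responses))  # 0.625
```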
item discrimination
extent to which a test item discriminates between high and low scorers on the entire test or on an external criterion

D = (% upper group answered correctly) - (% lower group answered correctly)

range: -1 to +1

D = .35 is acceptable
D = .50 is maximum discrimination
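
A sketch of the D index from upper/lower group counts; the groups and counts are hypothetical:

```python
def item_discrimination(upper_correct, upper_n, lower_correct, lower_n):
    """D = (% of upper group correct) - (% of lower group correct)."""
    return upper_correct / upper_n - lower_correct / lower_n

# Hypothetical: 18 of 20 high scorers vs. 8 of 20 low scorers answer correctly.
print(item_discrimination(18, 20, 8, 20))  # 0.5
```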
Item Response Theory
-item characteristics are considered to be the same across samples
-possible to equate scores on different tests because performance is reported in terms of the LATENT trait being measured (not just test score)
-easier to develop computer-adaptive tests

- ITEM CHARACTERISTIC CURVE is developed to determine the relationship between examinee ability and the probability of responding correctly. Slope indicates ability to discriminate between high and low achievers
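
As an illustration of an item characteristic curve, a sketch assuming the common two-parameter logistic (2PL) model; the a (slope/discrimination) and b (difficulty) values are hypothetical:

```python
import math

def icc_2pl(theta, a, b):
    """Probability of a correct response for ability theta under a 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A steeper slope (a) means sharper discrimination between low and high ability:
for theta in (-2, 0, 2):
    print(theta, round(icc_2pl(theta, a=1.5, b=0.0), 3))
```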
Classical Test Theory
obtained test score is the sum of true score and error

difficult to equate scores on different tests because scales are different
measurement error
random error / unsystematic
reliability
estimate of the proportion of variance in obtained scores that is accounted for by true differences on the attribute the test measures

range: 0 to +1

do NOT square reliability coefficient - interpret directly
test-retest reliability
coefficient of stability

administer same test over time and compare scores.

measure of stability

source of error: factors that occur during the time between tests

appropriate for tests that measure stable traits not affected by repeated administration
Alternate forms reliability
coefficient of equivalence

two equivalent forms of a test are given to the same group and scores are compared

source of error: content sampling; interaction between different examinees' knowledge and the content sampled

appropriate for stable traits not affected by repeated administration

most rigorous form of reliability, but difficult to develop
internal consistency
split-half reliability: compare odds/evens or first/second half

underestimates reliability because each half is shorter than the full test

Use the Spearman-Brown prophecy formula to estimate the reliability of the full-length test

Cronbach's COEFFICIENT ALPHA is the formula for inter-item consistency (the Kuder-Richardson formulas can be used when items are scored right/wrong)

source of error: content sampling; a heterogeneous content domain decreases coefficient alpha
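
A sketch of the split-half step-up (Spearman-Brown) and of coefficient alpha, assuming numpy is available; function names and values are illustrative:

```python
import numpy as np

def spearman_brown(r_half):
    """Step a split-half correlation up to an estimated full-length reliability."""
    return 2 * r_half / (1 + r_half)

def cronbach_alpha(items):
    """Coefficient alpha for an examinees-by-items matrix of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

print(round(spearman_brown(0.70), 3))  # ~0.824: full-length estimate from a half-test r of .70
```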
Inter-rater reliability
kappa statistic or percent agreement

kappa statistic accounts for percent of agreement that occurs by chance

source of error: rater bias, rater lack of motivation, categories that are not exhaustive or not mutually exclusive, consensual observer drift -- reduce by providing training & emphasizing the difference between observation and interpretation
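
A minimal sketch of the kappa computation from an agreement table, assuming two raters and counts in a k x k table; the table below is hypothetical:

```python
def cohens_kappa(table):
    """table[i][j] = count of cases rater A placed in category i and rater B in category j."""
    total = sum(sum(row) for row in table)
    p_obs = sum(table[i][i] for i in range(len(table))) / total  # observed agreement
    p_chance = sum(
        sum(table[i]) * sum(row[i] for row in table) for i in range(len(table))
    ) / total ** 2  # agreement expected by chance from the marginals
    return (p_obs - p_chance) / (1 - p_chance)

print(round(cohens_kappa([[20, 5], [10, 15]]), 3))  # 0.4
```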
Factors that affect reliability
- test length: the longer the test, the higher the reliability (Spearman-Brown can be used to estimate reliability for a given number of items, but tends to OVERestimate)

-Range of test scores: best when unrestricted, with heterogeneous examinees and item difficulty around .5 (or .75 for true/false)

-Guessing: as probability of guessing correct answer increases, reliability decreases
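
The general Spearman-Brown prophecy formula behind the test-length point, as a sketch; n is the factor by which the number of items is multiplied (values below are hypothetical):

```python
def spearman_brown_general(r_old, n):
    """Estimated reliability when a test is lengthened (or shortened) by a factor of n."""
    return n * r_old / (1 + (n - 1) * r_old)

print(spearman_brown_general(0.60, 2))    # doubling the test: 0.75
print(round(spearman_brown_general(0.60, 0.5), 3))  # halving the test: ~0.429
```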
Standard error of measurement
SEmeas = SD x (square root of 1 - rxx)

provides information to build a confidence interval (95% = add & subtract 2 SEmeas from the score; 99% = add & subtract 3 SEmeas from the score)

the lower the SD and the higher the reliability, the smaller the SEmeas
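
A sketch of the SEmeas band, using the card's rule of thumb of +/- 2 SEmeas for roughly 95%; the score, SD, and rxx below are hypothetical:

```python
def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - rxx)."""
    return sd * (1 - reliability) ** 0.5

score, sd, rxx = 100, 15, 0.91
e = sem(sd, rxx)  # 15 * sqrt(.09) = 4.5
print(score - 2 * e, score + 2 * e)  # ~95% confidence interval: 91.0 to 109.0
```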
Content validity
when a test is used to obtain information about examinees' familiarity with a content/behavior domain

-associated with achievement tests

-relies on the judgment of subject matter experts to determine if valid
Construct Validity
When test is used to measure a construct (hypothetical trait)

-intelligence, mechanical aptitude, self-esteem, neuroticism

-no single way to test it; use multiple approaches:
1) assess internal consistency to ensure only 1 construct is being measured
2) study group differences. Does the test differentiate between people who are known to differ on the construct?
3) test to see if scores change following manipulation of the construct (e.g. treatment, education)
4) Assess convergent and discriminant validity
5) Assess factorial validity

* most theory laden of validation tests *
Multitrait - Multimethod Matrix
monotrait-monomethod: correlation between the measure and itself (reliability); should be high for the rest of the matrix to be useful

Monotrait-heteromethod: if high, shows convergent validity because high correlation between same trait on different measures

heterotrait-monomethod: if low, shows discriminant/divergent validity because a trait should not correlate highly with a different trait even when measured by the same method

heterotrait-heteromethod: should be low to show discriminant/divergent validity
Factor Analysis
Conducted to determine the minimum number of factors that account for the intercorrelations among variables

assess construct validity (assess convergent or divergent validity)

1) administer target test along with other tests (of different constructs) to a group

2) correlate scores on each test with scores on every other test (R) - the pattern of correlations indicates how factors are extracted

3) Convert correlation matrix to factor matrix to develop "factor loadings"

4) simplify by "rotating" for ease of interpretation

5) interpret and name factors
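
A rough sketch of steps 2-3 using a principal-components-style extraction with numpy; the correlation matrix and the choice to keep two factors are hypothetical, and a real analysis would usually also rotate the solution (step 4):

```python
import numpy as np

# Step 2: hypothetical correlation matrix (R) for four tests given to one group.
R = np.array([
    [1.00, 0.62, 0.12, 0.08],
    [0.62, 1.00, 0.09, 0.11],
    [0.12, 0.09, 1.00, 0.58],
    [0.08, 0.11, 0.58, 1.00],
])

# Step 3: convert R to a factor (loading) matrix, here via eigendecomposition,
# keeping the two largest factors; loadings = eigenvectors scaled by sqrt(eigenvalue).
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1][:2]
loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
print(np.round(loadings, 2))  # each entry is a test-by-factor correlation (a loading)
```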
Factor Loadings
correlation coefficients that indicate the degree of association between each test and each factor

SQUARE THIS CORRELATION COEFFICIENT TO GET THE PROPORTION OF VARIANCE IN THE TEST ACCOUNTED FOR BY THAT FACTOR
Communality
the common variance (shared variability) of a test's scores that is accounted for by the identified factors.

can only be calculated WHEN ORTHOGONAL

DO NOT SQUARE - INTERPRET DIRECTLY
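
A tiny sketch of the communality computation, assuming orthogonal factors as the card requires; the loadings are hypothetical:

```python
import numpy as np

# Communality for one test = sum of its squared loadings across orthogonal factors.
loadings_for_test = np.array([0.78, 0.10])  # hypothetical loadings on two factors
communality = np.sum(loadings_for_test ** 2)
print(round(communality, 3))  # ~0.618 -- interpret directly, do not square again
```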
orthogonal
refers to rotation of a factor analysis

UNCORRELATED FACTORS

allows you to calculate communality
oblique
refers to rotation of a factor analysis

CORRELATED FACTORS
criterion-related validity
when test scores are used to draw conclusions or predict standing/performance on another measure

the test is the predictor and the other measure (performance) is the criterion
concurrent vs. predictive validity
both are forms of criterion-related validity

CONCURRENT: criterion data collected around the same time as predictor data (good if goal is to predict current status)

PREDICTIVE: when criterion is measured some time after predictor (preferred if goal is to predict future performance)
criterion-related validity coefficient
rarely exceeds .60
as low as .2 to .3 can be acceptable

square to determine shared variability
standard error of estimate
similar to SEmeas, except it helps determine a confidence interval for the PREDICTED CRITERION score (not the obtained score, as SEmeas does)

SEest = SD of criterion x (square root of 1 - rxy squared)

same as SEmeas, +/- 2 SEest is 95% confidence interval and +/- 3 SEest is 99% confidence interval

the smaller the criterion SD and the larger the validity coefficient, the smaller the SEest
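
A sketch of the SEest interval around a predicted criterion score; the predicted score, criterion SD, and rxy below are hypothetical:

```python
def see(sd_criterion, r_xy):
    """Standard error of estimate: SD of criterion * sqrt(1 - rxy**2)."""
    return sd_criterion * (1 - r_xy ** 2) ** 0.5

predicted, sd_y, r_xy = 50, 10, 0.60
e = see(sd_y, r_xy)  # 10 * sqrt(1 - .36) = 8.0
print(predicted - 2 * e, predicted + 2 * e)  # ~95% CI for the predicted criterion score: 34.0 to 66.0
```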
incremental validity
increase in correct decisions that can be expected if the predictor is used as a tool

1) construct a scatterplot
2) set cutoff scores for predictor and criterion
True Positive
predicted to be successful and meet cutoff for criterion (are successful in reality)
False Positive
predicted to be successful but do not meet criterion cutoff (i.e. not successful in reality)
True Negative
predicted to be unsuccessful and do not meet criterion cutoff (are not successful in reality)
False Negative
predicted to be unsuccessful but meet criterion cutoff (predicted unsuccessful but are successful in reality)
incremental validity formula
positive hit rate - base rate

positive hit rate = true positives divided by all predicted positives (true positives + false positives)

base rate = (true positives + false negatives) divided by total number of people
- the base rate is the proportion of people who would be successful if selected without the predictor

what is considered valid or invalid is up to judgement (there is no standard)
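
A sketch of the positive-hit-rate-minus-base-rate computation from the four decision counts; the counts below are hypothetical:

```python
def incremental_validity(tp, fp, tn, fn):
    """Positive hit rate minus base rate, computed from decision counts."""
    positive_hit_rate = tp / (tp + fp)           # successes among those selected by the predictor
    base_rate = (tp + fn) / (tp + fp + tn + fn)  # successes among everyone, without the predictor
    return positive_hit_rate - base_rate

print(incremental_validity(tp=30, fp=10, tn=40, fn=20))  # 0.75 - 0.50 = 0.25
```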
sensitivity
correct identification of TRUE POSITIVES
specificity
the correct identification of TRUE NEGATIVES
positive predictive value
percent of those the predictor identifies as having the disorder who actually have the disorder
negative predictive value
percent of those the predictor identifies as not having the disorder who actually do not have it
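
The four indices above can be computed from the same decision counts; a sketch with hypothetical counts (the function name is illustrative):

```python
def classification_indices(tp, fp, tn, fn):
    return {
        "sensitivity": tp / (tp + fn),  # true positives correctly identified
        "specificity": tn / (tn + fp),  # true negatives correctly identified
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

print(classification_indices(tp=30, fp=10, tn=40, fn=20))
```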
relationship between reliability and validity
a test's validity is always limited by its reliability

a predictor's validity coefficient cannot exceed the square root of the predictor's reliability; to increase the validity coefficient, you must increase predictor and criterion reliability
Correction for attenuation
used to estimate the validity if the reliability was 1.0

you need:
1) predictor's current reliability
2) criterion's current reliability
3) criterion-related validity coefficient

tends to overestimate actual validity that can be achieved
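
A sketch of the standard correction-for-attenuation formula, rxy / sqrt(rxx * ryy), with hypothetical reliabilities and validity coefficient:

```python
def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated validity if predictor and criterion were perfectly reliable."""
    return r_xy / (r_xx * r_yy) ** 0.5

print(round(correct_for_attenuation(r_xy=0.40, r_xx=0.80, r_yy=0.70), 3))  # ~0.535
```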
Criterion contamination
occurs when the person rating the criterion knows examinees' predictor scores; tends to inflate the relationship between predictor and criterion

avoid by: ensuring raters are not familiar with predictor scores
cross-validation
"try out" your predictor with a different group.

You will get "shrinkage" of the validity coefficient due to the different sample
Norm-Referenced Interpretation
compare test score to normative sample

raw score converted to another score

ex: percentile ranks, standard scores

(-): interpretation depends on the extent to which an examinee matches the norm sample
Percentile Ranks
express raw score in terms of percent of normative sample who achieved lower scores

(+): easy to interpret

(-): ordinal scale of measurement; do not reflect absolute differences between examinees; the transformation to percentiles is nonlinear (the distribution is always flat), so it maximizes differences in the middle range and minimizes differences in the extreme ranges
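
A minimal sketch using the card's "percent scoring lower" definition (some implementations also count half of tied scores); the norm scores are hypothetical:

```python
def percentile_rank(score, norm_scores):
    """Percent of the normative sample scoring below the given score."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)

print(percentile_rank(24, [10, 14, 18, 20, 22, 24, 26, 30]))  # 62.5
```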
Standard Scores
norm-referenced interpretation

(+) permit comparisons of scores obtained on different tests
z-score
z = (Score - Mean)/SD

Mean = 0
Standard deviation = 1
all scores below the mean are (-), all above the mean are (+)
unless "normalized," z-scores retain the shape of the original distribution (a linear transformation); normalizing is a nonlinear transformation
percentage score
Criterion-referenced interpretation

indicates the percentage of the test items that the examinee answered correctly

usually a cutoff score is set

often used for mastery evaluation when all examinees must meet certain performance

a predictor score can also be interpreted in terms of "likely scores on an external criterion" by creating a regression equation
Correction for guessing
involves calculating the number correct, the number incorrect, and the number of answer alternatives for each question

if the correction involves subtracting points from examinees' scores, the corrected distribution will have a lower mean and a larger SD than the original distribution
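
The card doesn't give the formula, but the usual correction is right minus wrong/(k - 1), where k is the number of answer alternatives; a sketch with hypothetical counts:

```python
def corrected_score(num_right, num_wrong, num_alternatives):
    """Standard correction for guessing: right - wrong / (k - 1)."""
    return num_right - num_wrong / (num_alternatives - 1)

# Hypothetical: 40 right, 12 wrong (items omitted are not penalized) on a 4-alternative test:
print(corrected_score(40, 12, 4))  # 36.0
```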