52 Cards in this Set
- Front
- Back
What are we trying to measure in reliability |
When we get a test result for a difficult-to-measure psychological construct it may be an over- or underestimate of the construct. We need to know how much variability there is in the total test score (we need acceptable levels of consistency for scores to be meaningful). We are trying to measure a person's true score on what the test is assessing. What we actually measure is the observed score on the test |
|
What is the error of measurement |
The difference between the true score and the observed score. This error is estimated by the standard error of measurement |
|
What does the error of measurement or standard error of measurement mean |
Doesn't mean a mistake. It is the variability of observed scores around the true score |
|
What is reliability |
The degree to which test scores are free from measurement error. Test scores that are free of error are consistent and stable. The higher the reliability (0-1), the lower the measurement error and the more confidence one can have that the observed score mirrors the true score |
|
When is reliability applicable |
Whenever something is measured, reliability is an issue. Not limited to psychological tests. Ex: blood pressure readings have lower reliability than a well-constructed psychological test, and economic indicators (GDP, poverty, SES) are unreliable |
|
Who created classical test theory |
Spearman (1907). Also called the theory of true and error scores |
|
What does classical test theory assume |
Assumes that a person has a true score that could be measured if there were no errors of measurement
But there are errors
These errors are the differences between the observed score and the true score
Observed score (X) = True score (T) + Error of measurement (E)
Error of measurement (E) = Observed score (X) - True score (T) |
|
What is the underlying assumption of classical test theory |
Measurement errors are randomly distributed around the true score.
Random means that chance factors or nonsystematic error increase or decrease observed scores
If a person repeated the same test many times, the results would produce a normal distribution of errors around that person's true score (mean)
In a reliable test it is assumed that these error distributions overlap and differ only due to true scores |
|
Pooled variance of errors |
Tells us the magnitude of the variability of the sample observed scores around the true score of the sample
The pooled standard deviation from all test takers becomes the basic measure of the error present in a test |
|
Pooled standard deviation in classic test theory |
Is called the standard error of measurement. The mean of repeated testing is the true score estimate; the standard deviation of all these measurements is the standard error of measurement |
|
What is the standard error of measurement used to calculate |
Calculates the range of scores around the observed score within which the true score is likely to fall. Allows us to calculate a confidence interval around the observed score: the true score is believed to fall within plus or minus 1.96 standard errors of measurement (95% confidence). (See the sketch below.) |
|
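A minimal Python sketch of this card, assuming the standard textbook formula SEM = SD x sqrt(1 - reliability); the score, standard deviation, and reliability values below are hypothetical.

    import math

    def standard_error_of_measurement(sd, reliability):
        # SEM: the standard deviation of observed scores around the true score
        return sd * math.sqrt(1 - reliability)

    def true_score_interval(observed, sd, reliability, z=1.96):
        # 95% confidence interval for the true score around an observed score
        margin = z * standard_error_of_measurement(sd, reliability)
        return observed - margin, observed + margin

    # Hypothetical example: SD = 15, reliability = .90, observed score = 110
    print(true_score_interval(110, 15, 0.90))   # roughly (100.7, 119.3)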
What is domain sampling theory |
Classical test theory contains elements of domain sampling theory
The concern is to estimate true score from a limited sample of items where sampling from the full domain is impossible
From a sample a true score is estimated |
|
What is the main problem that domain sampling theory explores |
The problem is how much error of measurement there is in one sample of items
This is an important issue when the sample of test items is small relative to the size of the domain of items
Reliability increases as sample size approaches the size of the domain |
|
In order to overcome these issues domain sampling theory uses repeated random sampling of items from the domain |
Each test is an unbiased estimate of the true score. Due to measurement and sampling error these estimates will differ. These differences will be random and normally distributed. The mean of the correlations between the various test scores is the test reliability. One does not average the raw correlations directly: each correlation is converted to a Fisher z score, the z scores are averaged, and the average is transformed back to a correlation. (See the sketch below.) |
|
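A minimal Python sketch of the averaging step described above; the correlation values are hypothetical.

    import math

    def average_correlation(correlations):
        # Convert each r to Fisher z, average the z scores, transform back to r
        z_scores = [math.atanh(r) for r in correlations]
        mean_z = sum(z_scores) / len(z_scores)
        return math.tanh(mean_z)

    # Hypothetical correlations among repeated item samples from the same domain
    print(round(average_correlation([0.78, 0.82, 0.85]), 3))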
What does domain sampling theory allow us to do |
Allows for the calculation of the maximum unbiased reliability estimate that a test could achieve |
|
What are the 3 sources of measurement error/reliability (the higher the reliability, the lower the error) |
Content sampling error Time sampling error Other sources of error |
|
What is content sampling error |
Error that results from differences between the sample of items (test) and the domain of items it comes from
Is the largest source of error in test scores (should be a concern)
Is the easiest and most accurately estimated source of measurement error |
|
How is content sampling error determined |
Is determined by how well the domain is sampled
Do the items sample all components of the domain? Do the items test all relevant forms of knowledge?
Is estimated by analyzing the degree of similarity among the items making up the test -analyze the correlations between test items and the examinee's standing on the construct being measured |
|
What are time sampling errors |
Random changes in the test taker or testing environment impact test performance. Errors reflect random fluctuations in performance from one situation or time to another |
|
What do time sampling errors do |
They limit our ability to generalize test results across different situations. A major concern for psychological testing since tests are rarely given in exactly the same environment. Methods have been developed to estimate this error |
|
What are other possible sources of error |
Include errors in test administration and scoring. Clerical errors committed while adding up scores. Administrative errors on an individually administered test. When scoring relies heavily on the subjective judgement of the tester, subtle differences in scoring can happen -need to calculate inter-rater and inter-scorer agreement |
|
Ratio of reliability |
Reliability is usually expressed as a correlation coefficient but it is preferable to express it as the ratio of the true score variance of a sample to the observed score variance of the sample. In this ratio, reliability is the proportion of observed score variance that is accounted for by true score variance; the closer to 1, the higher the reliability. Because observed score variance equals true score variance plus error variance, the ratio can be rewritten as true score variance divided by (true score variance plus error variance). (See the sketch below.) |
|
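A minimal Python sketch of the ratio described above, using hypothetical variance values.

    def reliability_ratio(true_variance, error_variance):
        # Observed score variance = true score variance + error variance
        observed_variance = true_variance + error_variance
        # Reliability = proportion of observed variance accounted for by true variance
        return true_variance / observed_variance

    print(reliability_ratio(true_variance=80.0, error_variance=20.0))   # 0.8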
How to get an estimate of error variance |
Error variance proportion = 1 - (true score variance / observed score variance) = 1 - reliability coefficient. Ex: if the test reliability coefficient is .8 then the error is .2. Means 80% of test score variance reflects true score variability and 20% reflects random, nonsystematic error variability |
|
What is reliability index |
Reflects the correlation between true and observed scores. Can't be calculated directly because true scores are unknown. The index is equal to the square root of the reliability coefficient: if the reliability coefficient is .8, the reliability index is about .9. This means that the correlation between the observed score and the true score is about 0.9. (See the sketch below.) |
|
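A minimal Python sketch combining this card with the previous one; the reliability coefficient of .8 matches the example in the cards.

    import math

    reliability = 0.8                                # reliability coefficient from the example
    error_variance_proportion = 1 - reliability      # proportion of observed variance due to error
    reliability_index = math.sqrt(reliability)       # correlation between true and observed scores

    print(error_variance_proportion)                 # 0.2
    print(round(reliability_index, 2))               # 0.89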
There are many ways to estimate reliability |
Which way is chosen will depend on what the test is presumed to measure and what the test constructor wants to demonstrate |
|
Ways to estimate reliability |
1) test retest reliability (stability coefficient)
2) Parallel forms reliability (alternate or equivalent forms)
3) Split-half reliability |
|
What is the test retest reliability coefficient |
The same test is given to the same people at two points in time and the two sets of scores are correlated; the correlation is the stability coefficient. Assumes the construct is stable over time -not used for constructs that change over time
|
|
Error concept in test retest reliability due to time intervals |
If the time interval is short we see random fluctuations and practice effects
If the time interval is long we see random fluctuations, unknown sources of error, and changes in the construct over time
There is no single best time interval; the optimal interval is determined by the way the test results are to be used and the nature of the construct |
|
What do positive correlations in test retest reliability mean |
Generalize across time (scores are stable)
Low susceptibility to testing or test taker conditions
Generalize over testing environments |
|
Limitations of test retest correlations |
Assumes construct is stable over time -not used for constructs that change over time
Depending on the interval, correlations may be susceptible to carry over and practice effects -presence of either overestimates true reliability when effects are random
Follows classical test theory -it assumes attribute stability -variability of test scores across assessments of the construct is seen as error |
|
What is parallel forms reliability |
Two or more equivalent but different forms of a test are given over several time periods and the results are correlated
Tests must be truly parallel in terms of content, difficulty, and other relevant characteristics |
|
Why is parallel forms the most informative form of reliability for psychological studies |
1) contains estimate of consistency over time
2) contains two or more samples of items from the domain
3) can estimate error attributable to selection from item sets
4) practice or carry-over effects are reduced |
|
Two kinds of parallel testing ????. |
Concurrent reliability -when the two tests are given close in time, sources of error are due to random factors and content sampling of items |
|
The nature of items in parallel forms |
Same number of items; cover the same domain; expressed in the same way; equal difficulty |
|
Drawbacks of parallel forms reliability |
Practice or carry-over effects are reduced but not eliminated
Practice or carry-over effects change the meaning of the second or third testing
Creation of the many items needed for parallel forms is costly and time consuming |
|
What is split half reliability (internal reliability) |
Reflects errors related to content sampling.
These estimates are based on the relationship between items within the test
Test responses are split in half and the two halves are correlated
Underestimates the true reliability of the whole test -the Spearman-Brown correction is used to fix the underestimate |
|
How can tests be split |
First half/second half -if the test is long and all items are of equal difficulty
Odd/even split -if test items are of increasing difficulty, or practice effects, fatigue, or declining attention affect scores on items later in the test |
|
Spearman brown correction |
Assumes equal variances in both halves of the test
The correction underlines the general point that reliability increases as the number of items drawn from the domain increases. r_SB = 2r / (1 + r), where r is the correlation between the two half-test scores and r_SB is the corrected (full-test) split-half reliability. (See the sketch below.) |
|
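A minimal Python sketch of a split-half estimate stepped up with the Spearman-Brown formula; the half-test scores are hypothetical.

    import statistics

    def pearson(x, y):
        # Pearson correlation between two lists of scores
        mx, my = statistics.mean(x), statistics.mean(y)
        num = sum((a - mx) * (b - my) for a, b in zip(x, y))
        den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
        return num / den

    def spearman_brown(half_test_r):
        # Step the half-test correlation up to an estimate for the full-length test
        return 2 * half_test_r / (1 + half_test_r)

    # Hypothetical odd-half and even-half scores for five examinees
    odd_half = [10, 14, 9, 16, 12]
    even_half = [11, 13, 10, 15, 12]
    print(round(spearman_brown(pearson(odd_half, even_half)), 3))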
KR20 (Kuder and Richardson) |
Is a measure of internal reliability. Considers all possible splits simultaneously. This reliability measure can only be used for items that can be scored in a dichotomous manner (only 2 options). Not often used today. (See the sketch below.) |
|
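A minimal Python sketch of KR-20 for dichotomously scored items; the response matrix is hypothetical, and textbooks differ on whether the total-score variance uses n or n - 1.

    def kr20(item_scores):
        # item_scores: one list of 0/1 item scores per examinee
        k = len(item_scores[0])                     # number of items
        n = len(item_scores)                        # number of examinees
        totals = [sum(person) for person in item_scores]
        mean_total = sum(totals) / n
        var_total = sum((t - mean_total) ** 2 for t in totals) / (n - 1)
        pq_sum = 0.0
        for i in range(k):
            p = sum(person[i] for person in item_scores) / n   # proportion passing item i
            pq_sum += p * (1 - p)
        return (k / (k - 1)) * (1 - pq_sum / var_total)

    # Hypothetical right/wrong responses: 5 examinees x 4 items
    responses = [
        [1, 1, 1, 0],
        [1, 0, 1, 0],
        [1, 1, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 0, 1],
    ]
    print(round(kr20(responses), 3))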
Coefficient alpha |
Is a measure of internal reliability. Examines the consistency of responses to all test items regardless of how those items are scored. Can be thought of as the average of all possible split-half coefficients corrected for the length of the whole test. Sensitive to content sampling measurement error like split-half reliability. Also sensitive to heterogeneity of the test content. (See the sketch below.) |
|
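A minimal Python sketch of coefficient alpha for items on any scoring scale; the rating matrix is hypothetical.

    def cronbach_alpha(item_scores):
        # item_scores: one list of item scores per examinee (any scale, not just 0/1)
        k = len(item_scores[0])

        def sample_var(values):
            m = sum(values) / len(values)
            return sum((v - m) ** 2 for v in values) / (len(values) - 1)

        item_vars = [sample_var([person[i] for person in item_scores]) for i in range(k)]
        total_var = sample_var([sum(person) for person in item_scores])
        return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

    # Hypothetical 1-5 ratings: 4 examinees x 3 items
    ratings = [
        [4, 5, 4],
        [2, 3, 3],
        [5, 5, 4],
        [3, 3, 2],
    ]
    print(round(cronbach_alpha(ratings), 3))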
What is heterogeneity of test content |
The degree to which the test items measure unrelated characteristics. As item heterogeneity increases, the alpha coefficient decreases |
|
Relationship between KR20 and Cronbach's coefficient alpha |
KR20 is a simplified version of the alpha coefficient; the alpha coefficient reduces to KR20 when all items are dichotomous |
|
What does Cronbach's coefficient alpha tell us |
The coefficient provides a lower bound estimate of reliability. A high alpha suggests that the true reliability is at least that high; a low alpha only means that the true reliability may be higher. To overcome this issue, 95% confidence intervals around alpha can be constructed |
|
Limitations of coefficient alpha |
It assumes tau equivalence or a unidimensional factor structure: all indicators (test items) of a factor load or correlate in a similar manner on one dimension. When we don't have tau equivalence the alpha coefficient will underestimate the test's level of reliability |
|
McDonalds omega coefficient |
Does not assume tau equivalence and can be used to assess internal reliability for non-tau-equivalent items. Calculation of the omega coefficient is not straightforward as it relies on the outcome of a structural equation model; SPSS macros are available to do the calculation. (See the sketch below.) |
|
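The following Python sketch is only an illustration under simplifying assumptions (a one-factor model with standardized loadings and uncorrelated errors, so each item's error variance is 1 minus its squared loading); in practice the loadings come from a fitted structural equation model, and the loading values below are hypothetical.

    def mcdonald_omega(loadings):
        # loadings: standardized loadings from a one-factor model
        common = sum(loadings) ** 2                      # variance due to the common factor
        error = sum(1 - l ** 2 for l in loadings)        # summed unique/error variances
        return common / (common + error)

    # Hypothetical standardized loadings for a 4-item scale
    print(round(mcdonald_omega([0.70, 0.60, 0.80, 0.65]), 3))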
Sources of errors that take place when estimating reliability for behavioral observation |
Individuals scoring the test (judges)
Rating errors
Definitional issues
Ignores errors due to item sampling |
|
What must be done to estimate true scores (reliability) for observational and subjective behaviours |
Interrater reliability (inter judge, inter scorer or inter observer ratings) must be calculated |
|
Interrater reliability |
Refers to estimating the consistency among judges or raters who are evaluating the behaviour
The percentage of agreement between raters is sometimes used as a measure of interrater reliability but percentages do not take into account the chance level of agreement
The Kappa coefficient is used to account for chance agreement for ordinal level data |
|
Kappa coeffiecnt |
Indicates the actual level of agreement as a proportion of possible agreement, corrected for chance agreement. Ranges from +1 to -1 (negative values mean less agreement than expected by chance alone). Greater than .75 is exceptional, .40-.74 is satisfactory, less than .40 is poor agreement. Used when agreement is sought between two raters -with more than two raters use Fleiss' kappa or Krippendorff's alpha |
|
When can Kappa coefficients be used (When the agreement in classification is of interest) |
1) when a test is administered at two different points in time to classify people into diagnostic groups, or groups such as who to hire and who to reject -the person is classified to a group using the obtained test scores on each occasion and the degree of agreement across times is compared via Kappa
2) one could use two different tests on the same group of people at the same point in time and classify them separately using each set of test scores and then compute the cross test agreement in classification with Kappa |
|
Know |
KR20, split-half reliability and coefficient alpha are interrelated. The maximum true value of a test retest correlation cannot exceed the square root of alpha |
|
What is the formula for Kappa |
K = (proportion of observed agreement - proportion of chance agreement) / (1 - proportion of chance agreement). Equivalently, in frequencies: K = (observed agreement frequency - expected agreement frequency) / (overall total frequency - expected agreement frequency). (See the sketch below.) |
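A minimal Python sketch of Cohen's kappa for two raters, matching the frequency form of the formula above; the category labels and ratings are hypothetical.

    def cohens_kappa(rater_a, rater_b):
        # Each rater assigns every case to one category
        n = len(rater_a)
        categories = set(rater_a) | set(rater_b)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n      # observed agreement
        expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)    # chance agreement
                       for c in categories)
        return (observed - expected) / (1 - expected)

    # Hypothetical diagnostic classifications from two raters
    rater_1 = ["anxious", "depressed", "anxious", "neither", "depressed", "anxious"]
    rater_2 = ["anxious", "depressed", "neither", "neither", "depressed", "anxious"]
    print(round(cohens_kappa(rater_1, rater_2), 3))   # 0.75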