52 Cards in this Set

  • Front
  • Back

What are we trying to measure in reliability

When we measure a difficult-to-measure psychological construct, the test result may over- or underestimate the construct



We need to know how much variability there is in the total test score


(Scores need acceptable levels of consistency in order to be meaningful)



We are trying to measure a person's true score on what the test is assessing



What we actually measure is the observed score on the test

What is the error of measurement

The difference between the true score and the observed score



This error is estimated in the standard error of measurement

What does the error of measurement or standard error of measurement mean

It does not mean a mistake



It is the variability of observed scores around the true score

What is reliability

The degree to which test scores are free from measurement error



Test scores that are free of error are consistent and stable



The higher the reliability (0-1), the lower the measurement error and the more confidence one can have that the observed score mirrors the true score

When is reliability applicable

Whenever something is measured, reliability is an issue



Not limited to psychological tests



Ex. Blood pressure tests have lower reliability than a well-constructed psychological test



Economic indicators (GDP, poverty, SES) are unreliable

Who created classical test theory

Also called theory of true and error scores



Spearman 1907

What does classical test theory assume

Assumes that a person has a true score that could be measured if there were no errors of measurement



But there are errors



These errors are the differences between the observed score and the true score



Observed score (X) = True score (T) + Error of measurement (E)



Error of measurement (E) = Observed score (X) - True score (T)
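As a rough illustration of this decomposition, here is a minimal Python sketch that simulates observed scores as true scores plus random error; the score scale and error size are made-up numbers, not values from the cards.

```python
# Minimal sketch of classical test theory: observed = true + random error.
import numpy as np

rng = np.random.default_rng(0)

true_scores = rng.normal(loc=100, scale=15, size=1000)  # hypothetical true scores
errors = rng.normal(loc=0, scale=5, size=1000)          # random measurement error
observed = true_scores + errors                         # X = T + E

# Error of measurement recovered as observed minus true
error_of_measurement = observed - true_scores
print(round(error_of_measurement.mean(), 3))  # near 0: errors are random around the true score
```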

What is the underlying assumption of classical test theory

Measurement errors are randomly distributed around the true score.



Random means that chance factors or nonsystematic errors increase or decrease observed scores



If a person repeated the same test many times, the results would produce a normal distribution of errors around that person's true score



In a reliable test it is assumed that these error distributions overlap and that scores differ only due to true scores

Pooled variance of errors

Tells us the magnitude of the variability of the sample observed scores around the true score of the sample



The pooled standard deviation from all test takers becomes the basic measure of the error present in a test

Pooled standard deviation in classical test theory

Is called the standard error of measurement



The mean of repeated testing is the true score estimate



The standard deviation of all these measurements is the standard error of measurement

What is the standard error of measurement used to calculate

Calculates the range of scores around the observed score within which the true score is likely to fall



Allows us to calculate the confidence interval around the observed score



The true score is believed to fall within + or - 1.96 standard errors of measurement of the observed score (95% confidence)
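A minimal Python sketch of this confidence interval, assuming the common formula SEM = SD x sqrt(1 - reliability) (not stated on the card) and made-up values for the standard deviation, reliability, and observed score:

```python
# Sketch: standard error of measurement and a 95% confidence interval
# around an observed score (all input values are hypothetical).
import math

sd_observed = 15.0      # hypothetical SD of observed scores
reliability = 0.90      # hypothetical reliability coefficient
observed_score = 108.0  # hypothetical observed score

sem = sd_observed * math.sqrt(1 - reliability)  # standard error of measurement
ci_low = observed_score - 1.96 * sem            # lower bound of 95% interval
ci_high = observed_score + 1.96 * sem           # upper bound of 95% interval
print(round(sem, 2), round(ci_low, 1), round(ci_high, 1))
```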

What is domain sampling theory

Classical test theory contains elements of domain sampling theory



The concern is to estimate true score from a limited sample of items where sampling from the full domain is impossible



From a sample a true score is estimated

What is the main problem that domain sampling theory explores

The problem is how much error of measurement is there from one sample of items



This is an important issue when the sample of test items is small relative to the size of the domain of items



Reliability increases as sample size approaches the size of the domain

In order to overcome these issues domain sampling theory uses repeated random sampling of items from the domain

Each test is an unbiased estimate of the true score



Due to measurement and sampling error these estimates will differ



These differences will be random and normally distributed



The mean of the correlations between the various test scores is the test reliability



One does not simply average the sample correlations: each correlation is converted into a Fisher z score, the z scores are averaged, and the mean z is transformed back into a correlation
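A minimal Python sketch of that averaging procedure, using hypothetical correlations between repeated samplings of items:

```python
# Sketch: average correlations via Fisher's r-to-z transformation,
# then transform the mean z back into a correlation.
import numpy as np

correlations = np.array([0.78, 0.84, 0.81, 0.75])  # hypothetical between-test correlations

z_scores = np.arctanh(correlations)  # Fisher r-to-z
mean_z = z_scores.mean()
mean_r = np.tanh(mean_z)             # back-transform to a correlation

print(round(mean_r, 3))              # reliability estimate under domain sampling
```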

What does domain sampling theory allow us to do

Allows for the calculation of the maximum, unbiased reliability estimate that a test could achieve

What are 3 sources of measurement error / reliability (the higher the reliability, the lower the error)

Content sampling error



Time sampling error



Other sources of error

What is content sampling error

Error that results from differences between the sample of items (test) and the domain of items it comes from



It is the largest source of error in test scores (and should be of concern)



It is the easiest and most accurately estimated source of measurement error

How is content sampling error determined

It is determined by how well the domain is sampled



Do the items sample all components of the domain? Do the items test all relevant forms of knowledge?



It is estimated by analyzing the degree of similarity among the items making up the test


-analyze the correlations between the test items and the examinees' standing on the construct being measured

What are time sampling errors

Random changes in the test taker or testing environment impact test performance



The error reflects random fluctuations in performance from one situation or time to another

What do time sampling errors do

They limit our ability to generalize test results across different situations



A major concern for psychological testing since tests are rarely given in exactly the same environment



Methods have been developed to estimate this error

What are other possible sources of error

Include errors in testing, administration, and scoring



Clerical errors committed while adding up scores



Administrative errors on an individually administered test



When scoring relies heavily on the subjective judgement of the tester, subtle discriminations in scoring can happen


-need to calculate inter-rater and inter-scorer agreement

Ratio of reliability

Reliability is usually expressed as a correlation coefficient, but it is preferable to express it as the ratio of the true score variance of a sample to the observed score variance of the sample



In the ratio, reliability is the proportion of observed score variance that is accounted for by true score variance



Closer to 1 the higher the reliability



The ratio can be rearranged using the fact that observed score variance is equal to true score variance plus error variance
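A minimal Python sketch of the variance ratio, using made-up variance components:

```python
# Sketch: reliability as the proportion of observed score variance
# accounted for by true score variance (values are hypothetical).
true_variance = 80.0   # hypothetical true score variance
error_variance = 20.0  # hypothetical error variance

observed_variance = true_variance + error_variance  # observed = true + error
reliability = true_variance / observed_variance
print(reliability)  # 0.8 -> 80% of observed variance is true score variance
```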

How to get an estimate of error variance

1 - (true score variance divided by observed score variance)


1 - reliability coefficient



Ex


If the test reliability coefficient is .8, then the error is .2



This means 80% of the test score variance reflects true score variability and 20% reflects random, nonsystematic error variability

What is reliability index

Reflects the correlation between true and observed scores



Can't be calculated directly because true scores are unknown



The index is equal to the square root of the reliability coefficient



If the reliability is .8, the reliability index is about .9 (the square root of .8 is about .89)



This means that the correlation between observed score and true score is 0.9
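A tiny Python check of that square-root relationship, using the .8 value from the card:

```python
# Reliability index = square root of the reliability coefficient.
import math

reliability_coefficient = 0.8
reliability_index = math.sqrt(reliability_coefficient)
print(round(reliability_index, 2))  # ~0.89: estimated correlation between observed and true scores
```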

There are many ways to estimate reliability

Which way is chosen will depend on what the test is presumed to measure and what the test constructor wants to demonstrate

Ways to estimate reliability

1) test retest reliability (stability coefficient)



2) Parallel forms reliability (alternate or equivalent forms)



3) split half reliability

What is the test retest reliability coefficient

Assumes construct is stable over time


-not used for constructs that change over time
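As a sketch, the stability coefficient is simply the correlation between scores from two administrations of the same test; the scores below are hypothetical:

```python
# Sketch: test-retest (stability) coefficient as the correlation
# between a first administration and a retest (made-up scores).
import numpy as np

time1 = np.array([12, 18, 15, 22, 9, 17, 20, 14])   # hypothetical first administration
time2 = np.array([13, 17, 16, 21, 10, 18, 19, 15])  # hypothetical retest scores

test_retest_r = np.corrcoef(time1, time2)[0, 1]
print(round(test_retest_r, 3))
```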


Error in test retest reliability due to time intervals

If time interval is short we see random fluctuations (practice effects)



If the time interval is long we see random fluctuations, unknown sources of error, and changes in the construct over time



There is no single best time interval; the optimal interval is determined by the way the test results are to be used and the nature of the construct

What do positive correlations in test retest reliability mean

Generalize across time (scores are stable)



Low susceptibility to testing or test taker conditions



Generalize over testing environments

Limitations of test retest correlations

Assumes construct is stable over time -not used for constructs that change over time



Depending on the interval, correlations may be susceptible to carry over and practice effects


-presence of either overestimates true reliability when effects are random



Follows classical test theory


-it assumes attribute stability


-variability of test scores in assessments of the construct is seen as error

What is parallel forms reliability

Two or more equivalent but different forms of a test are given over several time periods and the results are correlated



Tests must be truly parallel in terms of content, difficulty, and other relevant characteristics

Why is parallel forms the most informative form of reliability for psychological studies

1) contains estimate of consistency over time



2) contains two or more samples of items from the domain



3) can estimate error attributable to selection from item sets



4) practice or carry over effects are reduced

Two kinds of parallel testing ????.

Concurrent reliability


-when the two tests are given close in time, sources of error are due to random factors and content sampling of items

The nature of items in parallel forms

Same number


Cover the same domain


Expressed in the same way


Equal difficulty

Drawbacks of parallel forms reliability

Practice or carry over effects are reduced



Practice or carry over effects change the meaning of the second or third testing



Creation of the many items needed for parallel forms is costly and time consuming

What is split half reliability (internal reliability)

Reflects errors related to content sampling



These estimates are based on the relationship between items within the test



Test responses are split in half and two halves are correlated



Underestimates true reliability of the whole test


-the Spearman-Brown correction is used to fix the underestimate
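A minimal Python sketch of an odd-even split-half estimate on a hypothetical 0/1 item-response matrix (the data-generating step is made up purely for illustration):

```python
# Sketch: split-half reliability with an odd-even split.
import numpy as np

rng = np.random.default_rng(1)
ability = rng.normal(size=50)                                   # hypothetical examinee ability
difficulty = rng.normal(size=20)                                # hypothetical item difficulty
p_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty)))  # chance of a correct answer
responses = (rng.random((50, 20)) < p_correct).astype(int)      # examinees x items, scored 0/1

odd_half = responses[:, 0::2].sum(axis=1)   # score on odd-numbered items
even_half = responses[:, 1::2].sum(axis=1)  # score on even-numbered items

half_test_r = np.corrcoef(odd_half, even_half)[0, 1]
print(round(half_test_r, 3))  # underestimates full-test reliability; Spearman-Brown corrects this
```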

How can tests be split

First half / second half


-if the test is long and all items are of equal difficulty



Odd even split


-if test items are of increasing difficulty, or if practice effects, fatigue, or declining attention affect scores later in the test

Spearman brown correction

Assumes equal variances in both halves of the test



The correction underlines the general point that reliability increases as the number of items drawn from the domain increases



Rsb = 2r / (1 + r)


Rsb = the corrected reliability estimate for the full test


r = the correlation between the two half-test scores
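The correction as a small Python function, using the formula from the card:

```python
# Spearman-Brown correction: step a half-test correlation up to a
# full-test reliability estimate (Rsb = 2r / (1 + r)).
def spearman_brown(half_test_r: float) -> float:
    """Return the corrected full-test reliability for a split-half correlation."""
    return 2 * half_test_r / (1 + half_test_r)

print(round(spearman_brown(0.70), 3))  # a half-test r of .70 -> about .82 for the full test
```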

KR20 (Kuder and Richardson)

Is a measure of internal reliability



Considers all possible splits simultaneously



This reliability measure can only be used for items that can be scored in a dichotomous manner (only 2 options)



Not often used today
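A minimal Python sketch of KR-20, assuming the standard formula KR20 = k/(k-1) x (1 - sum(p*q) / variance of total scores), which is not spelled out on the card:

```python
# Sketch: KR-20 for dichotomously scored (0/1) items.
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """responses: examinees x items matrix of 0/1 scores."""
    k = responses.shape[1]
    p = responses.mean(axis=0)                     # proportion passing each item
    q = 1 - p                                      # proportion failing each item
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of total test scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)
```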

Coefficient alpha

Is a measure of internal reliability



Examines the consistency of responses to all test items regardless of how those items are scored



Can be thought of as the average of all possible split half coefficients corrected for the length of the whole test



Sensitive to content sampling measurement error, like split half reliability



Also sensitive to heterogeneity of the test content
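A minimal Python sketch of coefficient alpha, assuming the standard variance-based formula (not spelled out on the card); with 0/1 items it reduces to KR-20, as the later cards note:

```python
# Sketch: coefficient (Cronbach's) alpha from an examinees x items score matrix.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: examinees x items matrix (any scoring, not just 0/1)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total test scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```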

What is heterogeneity of test content

The degree to which the test items measure unrelated characteristics



As item heterogeneity increases, the alpha coefficient decreases

Relationship between KR20 and Cronbach's coefficient alpha

KR20 is a simplified version of the alpha coefficient



Alpha coefficient reduces to KR20 when all items are dichotomous

What does Cronbach's coefficient alpha tell us

The coefficient provides a lower bound estimate of reliability



A high alpha suggests that the true reliability is higher



A low alpha only means that the true reliability may be higher



To overcome this issue 95% confidence intervals around alpha can be made

Limitations of coefficient alpha

It assumes tau equivalence (τ) or a unidimensional factor structure



All indicators (test items) of a factor load or correlate in a similar manner on one dimension (item homogeneity)



When we don't have tau equivalence, the alpha coefficient will underestimate the test's level of reliability

McDonalds omega coefficient

Does not assume tau equivalence and can be used to assess internal reliability for items that are not tau equivalent



Calculation of the omega coefficient is not straightforward as it relies on the outcome of a structural equation model



SPSS macros are available to do calculation
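As a hedged illustration only: once a one-factor model has been fitted (e.g. in SEM or factor-analysis software), omega total can be computed from its standardized loadings. The loadings below are hypothetical, and the formula used here ((sum of loadings)^2 divided by that quantity plus the summed error variances) is the common textbook form, not something taken from the card:

```python
# Sketch: McDonald's omega (total) from hypothetical standardized loadings
# of a one-factor model fitted elsewhere.
import numpy as np

loadings = np.array([0.70, 0.65, 0.72, 0.60, 0.68])  # hypothetical standardized loadings
uniquenesses = 1 - loadings**2                        # item error variances under the model

omega = loadings.sum()**2 / (loadings.sum()**2 + uniquenesses.sum())
print(round(omega, 3))
```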

Sources of error that arise when estimating reliability for behavioral observation

Individuals scoring the test (judges)



Rating errors



Definitional issues



Ignores errors due to item sampling

What must be done to estimate true scores (reliability) for observational and subjective behaviours

Interrater reliability (inter judge, inter scorer or inter observer ratings) must be calculated


Interrater reliability

Refers to estimating the consistency among judges or raters who are evaluating the behaviour



The percentage of agreement between raters is sometimes used as a measure of interrater reliability, but percentages do not take into account the chance level of agreement



The kappa coefficient is used to account for chance agreement for ordinal level data

Kappa coefficient

Indicates the actual level of agreement as a proportion of the potential agreement, corrected for chance agreement



Ranges from +1 to -1 (negative values mean less agreement than expected by chance alone)



Greater than .75 is exceptional, .40-.74 is satisfactory, less than .40 is poor



Used when agreement is sought between two raters


-for more than two raters use the Fleiss interrater correlation or Krippendorff's alpha



When can Kappa coefficients be used


(When the agreement in classification is of interest)

1) when a test is administered at two different points in time to classify people into diagnostic groups, or groups such as who to hire and who to reject


-a person is classified into a group using the obtained test scores on each occasion, and the degree of agreement across times is compared via kappa



2) one could use two different tests on the same group of people at the same point in time and classify them separately using each set of test scores and then compute the cross test agreement in classification with Kappa

Know

KR20, split half reliability, and coefficient alpha are interrelated



The maximum true value of a test retest correlation cannot exceed the square root of alpha

What is the formula for Kappa

K = (proportion of observed agreement - proportion of chance agreement) / (1 - proportion of chance agreement)



K = (frequency of observed agreement - expected frequency of agreement) / (overall total frequency - expected frequency of agreement)
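A minimal Python sketch of that computation for two raters, using a made-up 2x2 agreement table:

```python
# Sketch: Cohen's kappa for two raters from an agreement table
# (rows = rater A's classifications, columns = rater B's).
import numpy as np

table = np.array([[20, 5],
                  [10, 15]])  # hypothetical classification counts

n = table.sum()
observed_agreement = np.trace(table) / n  # proportion of cases both raters agree on
expected_agreement = (table.sum(axis=1) * table.sum(axis=0)).sum() / n**2  # chance agreement

kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)
print(round(kappa, 3))
```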