
Issue with measuring psychological constructs

Psychological constructs, unlike physical objects, are hard to measure



When we get results, the obtained score may over- or underestimate the construct



This issue means we need to know how much variability there is in the total test scores



Scores must demonstrate acceptable levels of consistency between observed scores and true scores in order for them to be meaningful

True scores vs observed scores

We are trying to assess a person's true score on what the test is assessing



We actually measure the observed score on the test, not the true score

Error of measurement

The difference between the true score and the observed score



Error doesn't mean a mistake, but rather the variability of observed scores around the true score



Error is estimated in the standard error of measurement

Reliability

Is the degree to which test scores are free from measurement error



The higher the reliability of a test (0-1), the lower the measurement error



The more confident we can be that the observed score mirrors the true score

When test scores are free of measurement errors

Test scores that are free of errors display consistent and stable test results



Reliable assessments are relatively free from measurement error whereas less reliable results reflect measurement error

Reliability outside of psychological tests

Whenever something is measured, reliability is an issue



Blood pressure tests have lower reliability than a well-constructed psychological test



Economic indicators (GDP, poverty, SES) are particularly unreliable

Classical test theory (Spearman)


Also called test score theory

Assumes that a person has a true score that could be measured if there were no errors of measurement (high reliability)



However there are errors between observed scores and true scores



These measurement errors are then the difference between the observed score and true score

Classical test score theory formulas

X is observed score


T is true score


E is measurement error



X = T + E



Because we want to know the reliability or measurement error



E = X - T

Major assumption of classical test theory.

Assumption is that measurement errors are randomly distributed around the true score



Meaning that chance factors or nonsystematic error increase or decrease observed scores



If people were to repeat the same test, the results would produce a normal distribution of errors around each person's true score mean


- scores that greatly differ from the true score occur less often



Can estimate true score by finding the mean of the observations from repeated applications
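
A minimal Python sketch (not from the cards) of the X = T + E idea: random errors simulated around a hypothetical true score, with the mean of repeated observations estimating T. All numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 50       # hypothetical true score T
error_sd = 5          # spread of random, nonsystematic measurement error
n_repeats = 1_000     # imagined repeated administrations of the same test

errors = rng.normal(0, error_sd, n_repeats)   # E: normally distributed around zero
observed = true_score + errors                # X = T + E

print(observed.mean())  # close to 50: the mean of repeated observations estimates T
print(observed.std())   # close to 5: the variability of X around T
```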

Standard error of measurement

Tells us on average how much a score varies from the true score



The standard deviation of observed score and the reliability of the test are used to estimate the standard error of measurement

Pooled standard deviations

In a reliable test it is assumed that error distributions overlap and differ only because of true test scores



Pooled variance of these errors tells us the magnitude of the variability of the sample observed scores around the true score of the sample



Pooled standard deviations from all test takers becomes the basic measure of error present in a test



Pooled standard deviation is called standard error of measurement

Standard error of measurement


The standard error of measurement is used to calculate the range of scores around the observed score within which the true score is likely to fall



Allows for confidence interval around the observed mean


-the true score will fall within plus or minus 2 standard errors of measurement about 95% of the time



The mean of repeated testing is the true score estimate -the SD is the standard error of measurement

Domain sampling model

Considers the problems created by using a limited number of items to represent a larger and more complicated construct



Concern is to estimate true score from a limited sample of items where sampling from the full domain is impossible

Domain sampling theory used in classical test theory

Classical theory uses elements of domain sampling



From a sample of items (repeated test scores) a true score is estimated

When is domain sampling important

The problem is how much measurement error there is from one sample of items



This is important when the sample of test items is small relative to the size of the domain of items



Reliability increases as sample size approaches the size of the domain

Repeated random sampling of items from the domain

Each test has an unbiased estimate of true score



Due to measurement and sampling error these estimates will differ



These differences will be random and normally distributed



The mean of the correlations between the various test scores is the test reliability



Do not average sample correlations; each correlation is converted into a z score, the z scores are averaged, and the average is transformed back to a correlation
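
A small sketch of that averaging step, assuming Fisher's z transform is what is meant; the example correlations are invented.

```python
import numpy as np

def average_correlations(rs):
    """Average correlations via Fisher's z transform instead of averaging r directly."""
    z = np.arctanh(rs)           # convert each r to z
    return np.tanh(np.mean(z))   # average the z scores, then convert back to r

print(average_correlations([0.70, 0.80, 0.90]))  # about 0.82, slightly above the raw mean of 0.80
```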

What does domain sampling allow us to calculate

Allows for the calculation of the maximum, unbiased reliability estimate that a test can achieve

Sources of measurement error

Content sampling error



Time sampling errors



Other sources of error

Content sampling error

The error that results from differences between the sample of items and the domain of items



Largest source of error



Is the easiest and most accurate source of error to estimate




Determined by how well the domain is sampled


-same difficulty, and sampling all components of the domain

How is content sampling measurement error estimated

Estimated by analyzing the degree of similarity among the items making up the test



To do so we analyze the correlations between test items and the examinees' standing on the construct being measured

Time sampling errors

Random changes in the test taker or testing environment can impact test performance



Reflect random fluctuations in performance from one situation to another



Limit our ability to generalize test results



Major concern since psychological tests are rarely given in the exact same environment

Other sources of errors

Include errors in test administration and scoring



Clerical errors committed while adding up scores or administrative errors on an individually administered test



When scoring relies heavily on subjective judgment of the tester, subtle discriminations in scoring can happen


-must calculate inter-rater or inter-scorer agreement

Expressing reliability

Reliability is often expressed as a correlation coefficient



It is preferable to express it as a ratio of the variance of the true score to the variance of the observed score



The reliability is the proportion of observed score variance that is accounted for by true score variance

Equations of reliability ratio

Rxx = variance of true score / variance of observed score



(Standard deviation squared)



Variance of the observed score = true score variance + error variance
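
A hedged sketch of the reliability ratio on simulated data (true scores plus random error); the distributions and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

true_scores = rng.normal(100, 15, n)   # hypothetical true scores (SD 15, so variance 225)
errors = rng.normal(0, 7.5, n)         # random measurement error (variance 56.25)
observed = true_scores + errors        # X = T + E

rxx = true_scores.var() / observed.var()   # reliability = true score variance / observed score variance
print(rxx)        # about .80
print(1 - rxx)    # estimated error variance proportion, about .20
```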


Estimate of error variance

Estimate of error variance = 1 - Rxx


-if reliability (Rxx) is .8, error variance is .2



Means 80% of the test score variance reflects true score variance and 20% reflects random, nonsystematic error variance

Reliability index

Reflects the correlation between true and observed scores



Can't be calculated directly as true scores are unknown



Is equal to the square root of the reliability coefficient



If Rxx = .81 then the index is .9


-the correlation between observed and true scores is 0.9

Know

There are a number of ways to estimate reliability



Each measures a different aspect of random error



Which reliability estimate is chosen will depend on what the test is presumed to measure and what the test constructor wants to demonstrate

Three ways to estimate test reliability

Test retest


Parallel forms


Internal consistency

Test retest method


Also called stability coefficients

Measures time sampling errors



Used to evaluate the error associated with administering a test at two different times



Applies only to stable traits



Susceptible to carryover effects



Follows classical test theory


-theory assumes attribute stability


-test score variability is construed as error variability

Time intervals in test retest

If the time interval is short, then random fluctuations take place


-practice / carryover effects



If the time interval is long, then random fluctuations, unknown sources of error, and changes in the construct can take place over time



There is no single best time interval



The optimal interval is determined by the way the test results are to be used and the nature of the construct

In test retest what does a positive correlation mean

The scores are stable as they generalize across time



Low susceptibility to testing or test taker conditions



Generalize over testing environments

Carryover and practice effects

Happens when the first testing session influences scores from the second session



When there are these effects the test retest correlation usually overestimates true reliability



Only a problem when the changes over time are random


-not predictable, affects some but not all



Practice effects are a form of carryover effects


-have sharpened their skills after the first test

Test retest administrating

Administer the same test on two well-specified occasions and then compute the correlation between scores from the two administrations

Poor test retest correlations

Poor correlations do not necessarily mean that a test is unreliable



Suggests that the characteristic under study has changed

Parallel forms reliability


Also called alternate forms or equivalent forms

Compares two or more different but equivalent forms of a test that measures the same attribute



Must be made of different items, but the rules used to select items are the same



Tests must be parallel in terms of content, difficulty, etc.



Makes sure that the test scores do not represent any one particular set of items or a subset of items from the entire domain (content error)

Why is parallel forms the most informative form of reliability for psychological studies

1) contains estimate of consistency over time



2) contains two or more samples of items from the domain



3) can estimate error attribute selection from item sets



4) practice or carryover effects are reduced

Nature of items in parallel forms

Same number


Cover the same domain


Expressed in the same way


Equal difficulty

Drawbacks of parallel forms

Practice or carryover effects change the meaning of the second test



Creation of the many items needed for parallel forms is costly and time consuming

Internal consistency reliability


-split half reliability

Reflects errors related to content sampling



These estimates are based on the relationship between items within a test



Test is given once and responses are split into two halves and are correlated


-congeneric tests

Ways tests can be split

Frist half second half split


-if test is long and items are of equal difficulty



Odd even split


-if items increase in difficulty, or if practice effects, fatigue, or declining attention are a concern

Problem with split half internal reliability

This reliability is an underestimate because each subset is only half as long as the full test



An estimate of reliability would be deflated because each half would be less reliable than the whole test



Since only half of the items are used the reliability underestimates true reliability for the whole test



-a test gains reliability as the number of items increases

Spearman-Brown correction

Allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test



Corrects underestimation



Underlies the general point that reliability increases as the number of items increases



Assumes equal variances in both halves of the test

Spearman brown formulation

Rsb = 2r / (1 + r)



Rsb = the correlation between the two halves of the test if each half had the total number of items (the corrected split-half reliability)



r = correlation between the two halves of the test
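
A one-function sketch of the correction; the split-half correlation of .70 is invented.

```python
def spearman_brown(r_half):
    """Estimate full-length reliability from the correlation between two half-tests."""
    return (2 * r_half) / (1 + r_half)

print(spearman_brown(0.70))  # about .82: the half-test correlation understates full-test reliability
```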

Issue with Spearman brown correction

Assumes that there are equal variances in both halves of the test



When variances are unequal we can't use it



Instead have to use alpha

Kuder Richardson KR20

Is a measure of internal reliability so the test only has to be given once



Considers all possible splits simultaneously


-avoids the problems of split half



Can only be used for items that are scored in a dichotomous manner (0 or 1)

KR20 formula

KR20 = [N / (N - 1)] × [(S squared - sum of pq) / S squared]



Kr20 = reliability estimate



N = number of items on the test



S squared = variance of total test score



P = the proportion of people getting each item correct



Q = the proportion of people getting each item incorrect (1 -p)



Sum of pq = the sum of the products of p × q for each item on the test


-pq is the variance for an individual item


-the sum of pq is the sum of the individual item variances
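
A sketch of KR-20 for a small, made-up people-by-items matrix of 0/1 scores; the sample variance (ddof=1) is assumed for S squared.

```python
import numpy as np

def kr20(items):
    """KR-20 for a people-by-items matrix of dichotomous (0/1) scores."""
    items = np.asarray(items)
    n_items = items.shape[1]
    p = items.mean(axis=0)               # proportion passing each item
    q = 1 - p                            # proportion failing each item
    s2 = items.sum(axis=1).var(ddof=1)   # variance of total test scores (sample variance assumed)
    return (n_items / (n_items - 1)) * (s2 - np.sum(p * q)) / s2

scores = [[1, 1, 0, 1],
          [1, 0, 0, 1],
          [0, 0, 0, 0],
          [1, 1, 1, 1]]
print(kr20(scores))  # about .96 for these invented responses
```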

Know

To have reliability greater than 0, the variance for the total test score must be greater than the sum of the variances for the individual items



Only happens when there is covariance between items



Covariance happens when the items are correlated with each other



The greater the covariance, the smaller the sum of pq term will be relative to the total test score variance

Cronbach's coefficient alpha

Estimates the internal consistency of tests in which the items are not scored as 0 or 1



Examines the consistency of responses to all test items regardless of how those items are scored



Can be thought of as the average of all possible split-half coefficients corrected for the length of the whole test



The sum of pq is replaced with sum of individual item variances



Most general method for finding estimates of reliability through internal consistency
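
A sketch of coefficient alpha with the sum of pq replaced by the sum of individual item variances, as described above; the Likert responses are invented.

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for a people-by-items matrix of item scores (not restricted to 0/1)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # individual item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total test scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

likert = [[4, 5, 4],
          [2, 3, 2],
          [5, 5, 4],
          [1, 2, 1]]
print(cronbach_alpha(likert))  # high alpha here because the invented items are strongly correlated
```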

What is coefficient alpha sensitive to

Content sampling measurement errors



Heterogeneity of the test content


-the degree to which the test items measure unrelated characteristics


-as item heterogeneity increases, the alpha coefficient decreases

Coefficient alpha and kr20

Both estimate internal reliability



Kr20 is a simplified version of the alpha coefficient



Alpha coefficient reduces to Kr20 when all items are dichotomous



If all test items are dichotomous then alpha and kr20 should give identical results within rounding error

What does coefficient alpha provide us with

Gives us the lowest bound estimate of reliability



A high alpha (>.80) suggests that the true reliability is higher



A low alpha only means that the true reliability may be higher



To overcome this issue 95% confidence intervals around alpha can be constructed

Reliability for a test that measures more than one trait

Factor analysis is a popular method for dealing with this situation



When factor analysis is used correctly, these subsets will be internally consistent (highly reliable) and independent from one another

Limitations of alpha coefficient

It assumes tau equivalence (T) or a unidimensional factor structure



When tau equivalence is not met, the alpha coefficient will underestimate the test's level of reliability

Tau equivalence

All the indicators of a factor (the test items) load or correlate in a similar manner on one dimension


-item heterogeneity

McDonald's omega coefficient

Does not assume tau equivalence and can be used to assess internal reliability for non-tau-equivalent items



Calculation is not straightforward, relying on the outcome of a structural equation model



SPSS macros are available to do the calculation

Sources of error when estimating reliability for behavioral observation data

Individuals scoring the test (judges)


Rating errors


Definitional issues


Item sampling errors

How to estimate true scores in behavior observation studies

Are unreliable because of discrepancies between true scores and scores recorded by the observer



To address these problems we need to estimate the reliability of the observers



This is known as interrater reliability


(Inter judge, inter scorer or inter observer ratings)

Interrater reliability

Estimating the consistency among judges or raters who are evaluating the behavior or output



The percentage of agreement between raters is sometimes used as a measure of interrater reliability



This is problematic because percentages do not take into account the chance level of agreement

Most common form of interrater reliability

Is to record the percentage of times that two or more observers agree



Not the best because


1) percentage does not consider the level of agreement that would be expected by chance alone


2) percentages should not be mathematically manipulated

What is the best way to assess interrater reliability

Kappa coefficient



Indicates the actual agreement as a proportion of the potential agreement following correction for chance agreement



Is a measure of agreement between two judges who each rate a set of objects using nominal scales



A weighted coefficient is available for ordinal level data and takes into consideration how disparate the ratings are



Ranges from 1 (perfect agreement) to -1 (less agreement than can be expected on the basis of chance alone)



Less than .4 is poor. Greater than .75 is excellent
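
A minimal sketch of Cohen's kappa for two raters and nominal categories; the ratings and labels are invented.

```python
import numpy as np

def cohens_kappa(rater_a, rater_b, categories):
    """Agreement between two raters corrected for chance agreement."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    observed = np.mean(a == b)                                             # raw proportion of agreement
    expected = sum(np.mean(a == c) * np.mean(b == c) for c in categories)  # agreement expected by chance
    return (observed - expected) / (1 - expected)

a = ["yes", "yes", "no", "no", "yes", "no"]
b = ["yes", "no", "no", "no", "yes", "yes"]
print(cohens_kappa(a, b, categories=["yes", "no"]))  # about .33, which the guidelines above would call poor
```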

When are Kappa coefficients used


(when agreement in classification is of interest)

When a test is administered at two different points in time to classify people into diagnostic groups


-person would be classified or assigned to a group using the obtained test scores on each occasion and the degree of agreement across times is compared via Kappa



Could use two different tests on the same group of people at the same point in time, classify them separately using each set of test scores, and then compute the cross-test agreement in classification with Kappa

Fleiss' kappa or Krippendorff's alpha

When there are more than two raters

Internal consistency

Evaluates the extent to which the different items on a test measure the same ability



Measures of internal consistency will all give low estimates of reliability if the test is designed to measure several traits

Standard error of measurement and reliability

Reliability coefficients reflects the proportion of observed variance attributable to true score variance



Reliability coefficients are a useful way of comparing the consistency of test scores produced by different assessment procedures



The standard error of measurement is useful for interpretation



It is the standard deviation of the distribution of scores that would be obtained by one person if they were tested on an infinite number of parallel forms of a test comprised of items randomly sampled from the same domain

How is standard error of measurement calculated

SEM = test standard deviation × the square root of (1 - test reliability)



As reliability decreases, the standard error of measurement increases.


-this relationship occurs because the reliability coefficient reflects the proportion of observed score variance due to true score variance and the standard error of measurement is an estimate of the amount of error in test scores



Low test reliability means larger standard error of measurement and the less confidence we have in the precision of the test
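
A short sketch of the SEM formula and a plus-or-minus 2 SEM band; the SD, reliability, and observed score are invented.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement = test SD x sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

s = sem(sd=15, reliability=0.91)
observed = 108
print(s)                                   # 4.5: lower reliability would make this larger
print(observed - 2 * s, observed + 2 * s)  # roughly 99 to 117: an approximate 95% band for the true score
```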

Test batteries

The standard error of measurement is needed to interpret individual scores and scores from test batteries



In test batteries a number of attributes are assessed within a single test



Test battery results are displayed in percentiles using the standard error of measurement



When interpreting test results, the use of + or - 2 standard error of measurement is recommended


-prevents over-interpretation of small test score differences

Test battery SEM consistency

In Test batteries the standard error of measurement is not constant across an entire set of scores



Standard error of measurement is lower for scores close to the score mean and higher for scores at both extremes



Scores at the extremes need to be checked for accuracy

Reliability for composite scores

Many tests have multiple scores that can be combined to form composite scores



In all cases, scores on tests are combined to yield a score on a composite measure



The more scores that make up the composite, the higher the correlation between those scores, and the higher the individual test reliabilities, the higher the composite reliability

The advantage of composite scores is that their reliability is the result of

1) number of test scores in the composite



2) the reliability of the individual test scores



3) the correlation between those scores

Reliability of difference scores

There are a number of situations where researchers and clinicians want to consider the difference between two scores



Difference scores are used in pre post experimental designs



When using difference scores, the reliability of a difference score is often lower than the reliability of either the pre or post test



Increases when the original test measures have high reliabilities and low correlations with each other

Problems with difference scores

Not only is the reliability lower than the pre and post measures, but those measures are often highly correlated

How large a reliability coefficient needs to be depends on

The nature of the construct



The amount of time available for testing



How test scores will be used



The method of estimating reliability

High reliability

Diagnostic tests that inform major decisions about individuals should be held to a higher reliability standard than tests used with group research or for screening large numbers of individuals



High stake decisions demand highly reliable information (.9 to .95)



.98 on the Stanford-Binet Intelligence Scales for adolescents

Reliability strength outside of high stakes

Reliability estimates of .8 are considered acceptable in many testing situations and are commonly reported for group and individually administered achievement and personality tests



For teacher made classroom tests and tests used for screening, reliability estimates of at least .7 are expected



Classroom tests are combined to form linear composites to form a final grade


-the reliability of composite scores is greater than the reliabilities of the individual scores

What to do when reliability is too low

Increase the number of items



Factor and item analysis



Discriminate analysis



Correction for attenuation

Increase the number of items

Adding more items increases reliability by increasing the number of items sampled from the domain



The Spearman-Brown prophecy formula allows us to estimate how many items need to be added



N = Rd(1 - Ro) / [Ro(1 - Rd)]



N = the number of tests of the same length as the original test


Ro = the observed level of reliability


Rd = the desired level of reliability



Once N is known, multiply it by the current number of items
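
A sketch of the prophecy calculation; the observed reliability, desired reliability, and current test length are invented.

```python
def tests_needed(r_observed, r_desired):
    """Spearman-Brown prophecy: how many times longer the test must be to reach the desired reliability."""
    return (r_desired * (1 - r_observed)) / (r_observed * (1 - r_desired))

n = tests_needed(r_observed=0.70, r_desired=0.90)
current_items = 20                 # hypothetical current number of items
print(n)                           # about 3.86: the test must be nearly four times longer
print(round(n * current_items))    # about 77 items in total
```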

What does increasing the number of items depend on

Item availability


Whether the time, effort and expense is worth the increase

Factor analysis and item analysis

Reliability can be increased by eliminating those items that do not correlate with other items



Those items that do not correlate are measuring a different construct than that assessed by other items and can be dropped from the scale



Increasing the homogeneity (congeneric nature) of the test



However, fewer items decreases reliability

Discriminate analysis

Item total correlations



Correlate each item with total test scores (.2 to .3)



Those items with low item-total correlations (<.30) are probably not measuring what the other items are measuring


-these items are dropped from the final scale



This is a point-biserial correlation


-continuous variable (total score) and one true dichotomous variable (item right or wrong)



Don't want correlations higher than .35 because we don't want individual items carrying the whole test

Correction for attenuation

Used to correct for the unreliability of scores being correlated



Eliminating the error



Estimates the maximum correlation between tests if there were no measurement errors



Can be used with any reliability estimate



Referred to as the attenuated reliability coefficient

Formula for correction for attenuation

r'12 = r12 / (square root of r11) (square root of r22)



r'12 = the maximum correlation between the two tests



r12 = the observed correlation between the two tests



r11 = reliability for test 1



r22 = reliability for test 2



The maximum correlation a test can achieve is equal to the square root of Rxx (the reliability index)
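
A sketch of the correction; the observed correlation and the two reliabilities are invented.

```python
import math

def correct_for_attenuation(r12, r11, r22):
    """Estimated correlation between two tests if both were measured without error."""
    return r12 / (math.sqrt(r11) * math.sqrt(r22))

print(correct_for_attenuation(r12=0.40, r11=0.80, r22=0.70))  # about .53, up from the observed .40
```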

What is validity

Refers to evidence that supports interpretation of test results as reflecting the psychological construct that the test was designed to measure



Refers to investigations into what tests measure



How well a test measures what it says it measures



The degree to which evidence and theory support interpretations of test scores for the proposed uses of the test



Validity is the most fundamental consideration in developing and evaluating tests

Validity and reliability

Reliability is stability and accuracy of test scores, reflects amount of random error



Reliability is a necessary but insufficient condition for validity



An unreliable test cannot produce valid interpretations


-only true score variance can be reliably related to the construct the test is supposed to measure



However no matter how reliable a measurement is, it does not guarantee validity


-it is a characteristic of test performance not the test



Validity is not

Indicated by the title of the test



A brief description of the test given in the test manual



Represented by a single number



Neither high nor low, good nor bad



Indicated by the nature of the items (face validity)



Is not static but a constantly moving target

What is validity then

Is what the test measures



How well it measures what it says it measures



Useful or not for certain purposes (test utility)



Is a process that involves ongoing dynamic effort to accumulate evidence for a sound scientific basis for proposed test score interpretations



Indicated by empirical associations between test scores and other measures



Reflected by a nomological net

How valid are psychological tests

Meyer (2001)


Found psychological tests often provide results that are equal to or exceed the validity of medical tests



The validity of the Pap smear is 0.36 while the MMPI-2 is 0.37



Even when both medical and psychological tests are used to detect the same disorder, psychological tests can provide superior results



An MRI for dementia is 0.57 and neuropsychological tests are 0.68



Yes, psychological tests can provide information that is as valid as common medical tests

Maximum validity correlation

Reliability places limits on the magnitude of validity coefficients when a test score is correlated with a criterion or outcome variable



Using IQ to predict reading achievement, the reliability coefficient for the IQ test imposes a theoretical limit on the true value of the correlation that is equal to the square root of the reliability coefficient



Maximum validity correlation = square root of reliability coefficient

Definition of validity from the standards for educational and psychological testing

The degree to which evidence and theory support the interpretations of test scores for proposed uses of the tests



Validity is a unitary concept with subtypes



These types represent different ways of collecting evidence to support the validity of interpretations of performance on a test, which we commonly express as a test score

4 subtypes of validity

Content related


Criterion related


Construct related


Structural related

Current view of validity

There is a consensus that the older concept of test validity be abandoned in favor of an emphasis on the appropriateness or accuracy of interpretation of test scores



We do not refer to validity of a test but rather validity of the interpretation of test scores



Responses to test items must be interpreted, and it is the interpretation of performance that possesses validity



When test scores are interpreted in multiple ways, each interpretation needs to be validated

Importance of validity evidence

Sources of validity evidence differ in importance according to factors like the construct being measured, the intended test usage, and the population being assessed

Content related validity

Examination of the test content to see if the test covers a representative sample of the domain being measured



Test content includes the themes, wording, and format of the items and tasks, and the administration and scoring rules

What does content validity validate

Criterion or domain referenced tests (job based tests)



Tests used in education



Achievement tests



Aptitude tests



(Employee, student, training selection, evaluation or classification)

Inspection of test items is not sufficient to assess content validity as...

Inspection can't tell you if all relevant subdomains in an area have been covered



Whether all objectives of the instruction have been assessed, not just subject matter



Whether there are items that assess the process by which people answer questions



Irrelevant sources of error in the test have been removed

How is content validity assessed

A test blueprint or test specification plan is constructed (must be written down)



Items are written according to the blueprint, then area experts are consulted about whether important content, objectives, and processes are covered



Care at this stage establishes the foundation for correspondence between test content and the construct



Area experts are used to systematically review the test and reevaluate the correspondence between the test content and its construct



Administer the test, and use empirical procedures to assess test and item characteristics (item difficulty, item discrimination, test-item correlations)



Examine errors and check for irrelevant sources of error

What does a test blueprint specify

Areas to be covered



Objectives to be tested and a rational for inclusion



Processes to be assessed



Important topics and issues that warrant study

Two types of errors examined in content validity test evaluation

Item or domain relevance



Item content or domain coverage

What do content validity studies allow us to conclude

The test covers a representative sample of domain-relevant skills and knowledge



Obtained score is relatively free from irrelevant sources of error (reading level, item difficulty, domain irrelevancy)

When we have these errors

The presence of such errors indicates that there is construct irrelevant variance



Poorly designed tests are described as having construct underrepresentation or construct overrepresentation



Criterion related validity (predictive validity)

Refers to the degree to which a test predicts future performance or a future outcome


-Future performance or outcome is the criterion



Requires an independent criterion to assess the outcome


-independent means that collection and knowledge of the predictor test scores should be independent or isolated from the collection and knowledge of criterion

What happens when predictor scores and criterion scores are not independent

Criterion contamination



This raises validity coefficients

Two types of criterion related validity tests

Predictive study


- there is a passage of time between predictor (test) and the criterion (outcome)



Concurrent validity


-when both are given within a short period of time



Some tests may be excellent when used in concurrent applications but poor for predictive applications and vice versa

Concurrent validity

Becomes an issue when diagnoses are being made or when the goal of testing is to determine the current status of the examinee as opposed to predicting future outcomes

Criterion related predictive validity

Becomes an issue when the test is being used for selection and classification of individuals or personnel, or when prediction is the ultimate goal of assessment

Know

While there is no limit or restriction on the nature of the predictors or criterion measures, both must be reliable and valid measures of the construct to be assessed



Criterion should be viewed as the gold standard, the best existing measure of construct



Only concern is whether there is an empirical link between predictor and criterion and whether the predictor is useful to those using the test

4 classes predictor and criterion measures fall into

Academic achievement



Specialized training



Personality and interest inventories



Simpler/ shorter new test

Ratings made by judges, supervisors, interviewers

An often used criterion measure is ratings made by judges, supervisors, teachers, advisors, interviewers, social workers, or coaches about some attribute of behavior (someone is asked to make an evaluation of someone else)



Very often ratings, evaluations or judgments made by others of someone's else's behavior are the core of many criterion measures



Ratings are often subject to bias, but when they are obtained under controlled conditions, their reliability is confirmed (Kappa), and their validity has been established, the ratings can be a valuable source of criterion data

Cross validating sample

In industrial and other applied settings, tests are devised in one location and used to fulfill similar functions in another



Early studies on the generalization of criterion related validity showed poor generalization or wide variability



Later studies indicate that poor generalizability reflected statistical artifacts such as small group sizes or samples that were restricted in range



It is unnecessary to do local validation studies for previously validated tests.


-if there is little existing research on the test, then local validation may be needed

When artifacts are removed

When artifacts are removed, the predictive validity of tests that assess verbal, numerical, and reasoning skills generalizes across samples



Cognitive skills that tap a common core of abilities are broadly predictive of academic and occupational outcomes

Validity coefficient

Predictive validity assessed by correlation between the test and criterion



Coefficients should be large enough to indicate that information from the test will help predict how individuals will perform in criterion



Usually .3 to .4

Coefficient of determination

The Coefficient of determination (r squared) is used to see how much variability is explained by the correlation

Reasons why validity coefficients are small and transfer poorly across situations

1) Differences in sample size and characteristics of the initial validating and cross validating samples



2) Validity, reliability and appropriateness of the criterion



3) The manner of measuring the criterion, nature of the job or curriculum, people who take the test all affect generalizabiliy



If generalizability is low, consider differential prediction (does the test predict better for some groups than others)



Coefficients may be small but still may be useful

Hunter 1984

Demonstrated the far-reaching implications on worker productivity, and subsequently the US gross domestic product, if employers used employment tests to place workers in the best jobs, even if the employment tests had validity coefficients only in the .20s or .30s



Even though the validity coefficients are small for these employment tests, and certainly well below what is acceptable for a reliability coefficient, the very practical relationship they have with an overall increase in gross domestic product justifies their use



It is important to emphasize the importance of the context of assessment, measurement, and prediction in deciding whether a validity coefficient is large enough to be useful for application

Criterion and accuracy

The criterion must be the most accurate measure available of what you are trying to predict if it is to serve as the standard for the behaviour



If that criterion exists, then only expense would justify using a lesser outcome measure



Using a lesser criterion for less than adequate reasons fails to do justice to the test, clientele, patient, or student

The development of what concepts mean and how to measure them with criterion related validity

Tests in personality, social or developmental psychology involve the simultaneous development of what the concept means and how to measure the concept



These two goals cannot be done by criterion related validity



Meaning and measurement of a construct requires processes of construct related validity

Constructs in construct related validity

Constructs are used in all sciences to explain why things happen



In Psychology constructs are used to explain why people do the things they do or to explain test responses



Constructs organize and explain observed behaviour

Frequently used constructs

Intelligence


Learning


Love


Anxiety

How are constructs built

Constructs are not physically real



They are built up through the accumulation of evidence and the integration of that evidence into some theoretically meaningful pattern (nomological net)

How is construct validity established

Construct validity evidence is established through activities by which the construct is defined and measured at the same time



This is relevant particularly when there is no consensus as to what measures adequately define the construct

Pap 1953 ideas of open concepts

Pap argued that psychological constructs like love, intelligence, extraversion, schizophrenia, and anxiety are open



They are not clear and well defined



They are not necessarily problematic

Open concepts are characterized by

Having intrinsically fuzzy boundaries



A large extendable and variable list of indicators



An unclear inner nature

Concepts become less open concepts, better defined and understood as...

Evidence is gathered as to the meaning of the concept



How well the measures of the concept clarify the meaning of the concept

What is construct validity

Is the degree to which a test measures the trait or attribute under consideration



-trait or attribute is theoretical



Construct validity is not a single number; there is no single coefficient that indicates construct validity



Because constructs are open concepts, over a series of studies the meaning of the construct begins to emerge



The results and observations over time gradually clarify what the test is measuring and what the scores mean (nomological net)

What does construct validity do

It gradually accumulates information gathered from a variety of sources that relate to the construct


-not just experiments



Development of construct validity entails the measurement and understanding of a concept moving from an open to a more closed concept



Any information that aids in explaining the underlying construct is appropriate for construct validity

Construct related validity based on internal structure evidence

Examination of the internal structure of a test (or battery of tests)


-the relationships between test items (test battery components) are consistent with the construct the test is designed to measure



By examining internal structure, the actual internal structure can be compared with the hypothesized structure of the construct the test is supposed to measure

Age differentiation

Tests created to assess developmental processes use age difference to validate the underlying construct



Used to validate achievement, aptitude and ability tests



The assumption is that measures of the construct should increase with age or cognitive ability



Age differentiation is necessary but not a sufficient evidence for internal structure validity

Problem with age differentiation

Failure to find age differences may mean the test is invalid



But the presence of age differences does not indicate whether developmental change was due to the underlying construct or its presumed structure

New test old test correlation

When a new test of a construct is developed, correlations with the old test can help validate the new measure



Assumes the old test has established construct validity



Should be moderately high (.3 to .4)


-don't want too high as it means it's the same test


-don't want the old test measurement error



New test should also be correlated with other measures to see if the new test is free from irrelevant sources of error

Indirect validity

What the test scores should not correlate with



Doesn't tell you what you are measuring, just that the test does not correlate with irrelevant measures



High correlations would render the construct validity of the test suspect



Low correlations do not necessarily ensure internal structure construct validity

Factor analysis

Statistical procedure used to determine the number of conceptually distinct factors



Allows one to evaluate the presence and structure of any latent constructs existing among a set of variables



The latent construct underlies and is partly responsible for the way examinees respond to questions on the variables that make up the factor



Used to validate the internal structure of test batteries

Process of factor analysis

Begins with a correlation matrix that contains the intercorrelations among the individual items



Produces as many factors as there are variables



There are statistical guidelines for determining factors



Most important guideline is "does the factor solution make psychological sense"

Does the factor solution make psychological sense

The variables loading on a factor share a common theme or meaning



A psychologically uninterpretable factor solution has little practical value and will be unlikely to provide validity evidence

Exploratory and confirmatory factor analysis

If the theory underlying test construction hypothesized 3 factors and 3 factors emerge from the confirmatory factor analysis


- the test is said to have factorial validity


-factor structure of the test is supported



There are a number of statistics available that statistically test the fit or match between the actual and hypothesized factor structure



Internal consistency

Internal consistency methods are used to assess construct validity



Measures of internal consistency reflect the homogeneity of test items


-indicates something about the domain from which the items were drawn



Does not tell you about the underlying construct from that domain



It tells you what domain the items came from not what the construct means

Ways of measuring internal consistency

Extreme groups analysis


-on personality, social, or abnormal psychological tests, validity is established if those scoring high or low differ on some criteria



Item total correlations


-on ability, personality or social psychological tests


-examine item total correlations for pass fail scored items


-point-biserial correlation for dichotomously scored items


-.2 to .35

Convergent divergent validity (discriminant)

Campbell and Fiske argue that construct validity is shown when



- test correlates positively with tests that it should correlate with (convergent)



-and shows low or zero correlations with measures that it theoretically should not correlate with (divergent)



Most convincing method to demonstrate construct validity

How to do convergent divergent analysis

Creation of a multitrait-multimethod matrix



We measure two or more traits by two or more different methods



Validity coefficients should be greater than correlations between different traits measured by different methods and also greater than different traits measured by the same method

Common method variance and construct variance

The multitrait-multimethod matrix also provides evidence on common method variance and the distinction between method and construct variance


Method variance

Refers to the correlation between measures due to their common assessment procedures

Common methods

The triangles indicating different traits same method reflect common methods



Correlations between different traits measured by different methods (lower left triangle)



Different traits measured by the same method (immediate off diagonal triangle)

How is construct validity different than predictive or content validity

It focuses on the role of the theory in test construction and the need to formulate Hypotheses that can be tested in validation studies



Is not subjective or intuitive judgments or rationalization independent of the data

Construct validity evidence based on response processes

Validity evidence based on the response processes invoked by a test involves an analysis of the fit between the performance and psychological processes examinees engage in while the construct is being assessed



Collect this evidence by interviewing examinees about their response processes and strategies, recording behavioral indicators such as response times and eye movements, or analyzing the types of errors committed

Ways to categorize types of questions

Objective subjective distinction classification



Selected response or constructed response classification



-when creating test items the overriding goal is to develop items that measure the specified construct and contribute to psychometrically sound tests

Objective subjective distinction categorization

Often used



The difficulty with this categorization is that it is sometimes difficult to say whether a question is subjective or objective

Selected response or constructed response classification categorization

If an item requires an examinee to select a response from available alternatives, it is classified as a selected response item (multiple choice, true or false, matching)



If an item requires examinees to create or construct a response, it is classified as a constructed response item (fill in the blank, short answer, essay item, oral examination, interviews)

Advantages of selected response questions

A large number of items can be answered in a short time period


-can include more items from the domain to increase reliability



Items are flexible and can be used to assess a wide range of constructs with greatly varying levels of complexity



Decrease the influence of certain construct-irrelevant factors that can impact test scores (writing abilities)

Limitations of selected responses

Items are challenging to write


-for multiple choice tests it can be difficult to come up with foils that are plausible yet incorrect



There are some constructs that cannot be measured using selected response items



Blind guessing and random responding are seen in such items

Types of constructed response items

Short answers



Essays



Performance assessments



Portfolio assessment

Short answer

Items can take a number of forms


-putting a word, phrase, number in response to a direct question or complete sentence

Performance assessments

Require examinees to complete a process or produce a product in a context that closely resembles real-life situations

Portfolio assessments

Involves the systematic collection of examinee work products over a specified period of time according to a specific set of guidelines



Writers, artists, architects

Strengths of constructed response assessments

Are easy to write and develop


-developing a framework for how to properly score response can take time and effort



Work well for assessing higher-order cognitive abilities, complex task performance, and tasks that require a constructed response format like problem solving



Items eliminate guessing and random responding

Limitations of constructed response classification

Take more time to complete, and as a result you are not able to sample the content domain as thoroughly


-less reliable and time consuming to score



Do eliminate blind guessing but are vulnerable to faking or creative construction when answers are unknown



Vulnerable to the influence of extraneous or construct-irrelevant factors that can impact test scores (writing abilities)

How to select an item format

A key factor in selecting an item format involves identifying the format that most directly measures the construct



Select the item format or task that will be the most direct measure of the construct of interest



Selected response items are recommended since they allow broader sampling of content domain and more objective and reliable scoring procedures

Writing items

There is an art to writing good items

Rules for writing items

Textbook

Types of selected response test questions


-which one is selected depends on the area of inquiry

Dichotomous items are scored as either true or false, agree or disagree



Polytomous items



Likert format

Dichotomous items

Advantages include simple to understand, score and administer



Appears in education and personality tests where absolute judgments are required



On maximal performance tests, items should be easily scored (correct/incorrect) according to scoring criteria


-all scored in an objective manner and are classified as objective questions



Disadvantages include increased chance of guessing and many items need to be written


-True or false items encourage superficial understanding



Because of guessing, these items are less reliable and less precise than other formats

Polytomous items

Several response alternatives are given (multiple choice)



One alternative is preferred and the others are wrong or not indicative of the construct or answer



3 alternatives (distractors, foils) in addition to the correct answer maximizes item and test reliability and discrimination between test takers



Reliability of tests in this format is constrained by guessing


-correction for guessing may be used

Likert format

5 to 7 alternatives are given, such as strongly agree, agree, don't know, etc.



Test taker chooses the alternative that comes closest to their attitude toward the issue



Negatively worded items are reversed scored



Used for personality and attitude tests

Item analysis

General procedure for methods used to evaluate specific items or groups of items (not whole test)



All tests need to undergo this, as it is useful for test developers to decide which items to keep on a test and which items to modify or eliminate



If the quality of individual items improves, the overall test quality improves



Tells you about test item homogeneity

Components of item analysis

Item difficulty


Item discrimination


Distractor analysis



Items not independent

Item difficulty

The proportion of people who pass an item



P = number who get the item right / number of examinees



0 to 1 with lower proportion meaning more difficult item


-0 or 1 provide no information



Optimal level of difficulty is .3 to .7



To differentiate test takers, items must range in difficulty
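
A sketch of the difficulty index for a small, invented people-by-items matrix of right/wrong responses.

```python
import numpy as np

responses = np.array([   # rows = examinees, columns = items (1 = correct, 0 = incorrect)
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
])

p = responses.mean(axis=0)   # proportion passing each item = the item difficulty index
print(p)                     # [0.75 0.75 0.25 0.25]; lower p means a harder item
```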

How hard an item should be depends on

1) probability of correct response by chance



2) what the test is designed to do

Item difficulty needs to consider item guessing

To take into consideration the effects of guessing the optimal item difficulty level is set higher than for constructed response items



Lord argues that for 4 option multiple choice items the average p should be about .74 with a range of .64 to .84

Percent endorsement

The item difficulty index is only applicable to maximal performance tests where items are scored correct/incorrect

Item discriminability

Refers to how well an item discriminates among test takers who differ on the construct being measured by the test



Above .40 is considered Excellent


.3 to .39 are good


.11 to .29 are fair


0 to .10 are poor

Two item discriminability statistics

Item total correlations


Discrimination index



-Separate test takers into top and bottom groups



-compute a discrimination index for each item



Di discrimination index = proportion of top who got it correct - proportion of bottom who got it correct



-remove or revise items with negative (inverted) discrimination
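
A sketch of the discrimination index using a top/bottom split on invented responses; with only six examinees the top and bottom groups are three people each.

```python
import numpy as np

scores = np.array([      # rows = examinees, columns = items (1 = correct)
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])

order = np.argsort(scores.sum(axis=1))           # rank examinees by total score
bottom, top = scores[order[:3]], scores[order[-3:]]

d = top.mean(axis=0) - bottom.mean(axis=0)       # D = p(top correct) - p(bottom correct)
print(d)   # positive values discriminate; negative (inverted) values flag items to revise or drop
```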

Item discrimination and difficulty

Item discrimination indices are biased in favor of items with intermediate difficulty levels (p value)



Because of the relationship between p and d, items that have excellent d values will have p values between .2 and .8

Item total correlations coefficients

Point-biserial correlations (rbis)


-correlation between a true dichotomy (yes or no) and a continuous measure (adjusted test score)



Large correlations mean that an item is measuring the same construct as the overall test and discriminates between individuals with high and low construct ability



If the correlation is low, that item should be eliminated; acceptable values fall between .2 and .35



Allows the test developer to select items that discriminate between respondents that are high and low on the construct
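
A sketch of item-total correlations using the adjusted (rest-of-test) total described above; the response matrix is invented.

```python
import numpy as np

def item_total_correlations(items):
    """Point-biserial correlation between each dichotomous item and the rest-of-test (adjusted) score."""
    items = np.asarray(items, dtype=float)
    totals = items.sum(axis=1)
    corrs = []
    for j in range(items.shape[1]):
        rest = totals - items[:, j]              # exclude the item from its own total
        corrs.append(np.corrcoef(items[:, j], rest)[0, 1])
    return np.array(corrs)

scores = [[1, 1, 1, 0],
          [1, 1, 0, 1],
          [1, 0, 0, 0],
          [0, 1, 0, 0],
          [0, 0, 0, 1]]
print(item_total_correlations(scores))  # items with low or negative values would be candidates to drop
```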

Item total correlations with typical response tests

On a test designed to measure sensation seeking using true or false items



1 means yes and high sensation seeking



0 means low sensation seeking

Distracter analysis

On multiple choice tests, incorrect alternatives are referred to as distractors (foils) since they distract those who don't actually know the correct response



Allows you to examine how many examinees in top and bottom groups selected each option on a multiple choice item



Good distractors should show negative item discrimination and be selected by some examinees

Relationship between item distracters item difficulty and discrimination

The selection of distractors impacts both item difficulty and discrimination



Significantly impacts item difficulty and consequently item discrimination

Item validity

Becomes an issue when the test constructor wants to maximize criterion or predictive validity



Is assessed by the point-biserial correlation between responses to each test item and a criterion measure, multiplied by the item's standard deviation


Rbis x SD

Item reliability

Assesses internal consistency of the test



Rbis × standard deviation of the test

Item response theory (IRT )


-alternative to item analysis

The construct underlying the test responses can be known by observing the performance of the test items



Responses to test items are explained by latent traits



Models the probability of a correct answer (or saying yes to a personality test item) as a mathematical function of the person's standing on the latent trait



The latent trait (theta) is assumed to be unobservable, known only through test responses

Latent trait

A latent trait is an ability or trait that is inferred to exist based on theories of behaviour or empirical evidence (it can't be assessed directly)



Those with more of the latent trait should get more items correct than those who score low on the construct

Two parameters of Item response theory

Person parameter


-conceptualized as a single latent trait or dimension



Item parameters


-are Item difficulty, Item discrimination and Item guessing

Item response curves or item characteristic curves

Each item can have a graph created that maps the probability of getting an item correct against the latent dimension



S shaped curves

Three most common item parameters in Item response theory

Item discrimination


-slope function



Item difficulty


-location function


-receives the most attention



Item guessing

Rasch models

Models relating item difficulty to probability of getting the item correct

Item difficulty in item response theory

Assumes that only item difficulty differentiates between items and all items are similar in slope and item guessing



Item difficulty is determined at the point of median probability


-the ability at which 50% of respondents endorse the correct answer (inflection point)

Know

On an ICC, difficult items are hard or unlikely to be endorsed and are shifted to the right of the scale, indicating the higher ability of the respondents who endorse them correctly



Easier items are shifted more to the left of the ability scale



Typical range: -3 to 3

Item discrimination in item response theory

Determines the rate at which the probability of endorsing a correct item changes with given ability levels



This parameter is imperative in differentiating between individuals possessing similar levels of the latent construct



Purpose is to include items with high discrimination in order to be able to map individuals along the continuum of the latent trait (0 to 2)
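
A sketch of a logistic item characteristic curve with difficulty (b), discrimination (a), and an optional guessing parameter (c); the parameter values are invented.

```python
import numpy as np

def icc(theta, a, b, c=0.0):
    """Probability of a correct response at trait level theta under a logistic IRT model.

    a = discrimination (slope), b = difficulty (location), c = lower asymptote (guessing);
    c = 0 gives the one- and two-parameter cases.
    """
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)              # the typical ability range
print(icc(theta, a=1.0, b=0.0))            # probability is .5 at theta = b (the inflection point)
print(icc(theta, a=2.0, b=1.0, c=0.2))     # steeper, harder item with some guessing; curve sits further right
```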

Know

What the item response curves mean and why people respond as they do depends on the theory that led to test development



To avoid the impression that the latent traits are real, terms such as item response analysis and item response theory are used

Advantages of an Item response approach to item analysis

Item analysis using classical theory uses correlations, means and variances which are sample dependent and may not generalize to other samples



In an Item response analyses, Item characteristics are independent of the participant sample



What we want are items that are sample independent (sample invariant), known as invariance of item parameters


-items can be given to people of varying ability, known as population invariance

Invariance

To create invariance, item characteristics need to be independent of the sample characteristics, known as anchoring


-structural validity



Items can be given to people of varying ability, known as population invariance

Significant contributions of item response theory

In the area of computer-administered tests, known as computer adaptive testing



In the detection of item or test biases


-ICCs for different groups are generated and statistically compared to determine the degree of differential item functioning

Items too hard

Frequently, subjective judgments are made about the suitability of items despite all efforts of item analysis



Item is too hard, too easy, too demeaning to minority groups



These subjective reviews have not been successful in predicting item difficulty or discrimination (need statistical analysis)



Hard items are not necessarily biased or unfair

Category format

Used in observational studies, developmental and organizational psychology



Scale consists of 10 category responses (on a scale from 1 to 10)



Category response scales are influenced by context and anchoring effects


-ratings are influenced by the behaviour of the other people being rated, and once rated, other ratings become anchored at one end or the other



Overcome this by labeling endpoints and middle of scale to remind respondents of definitions.



Increasing the number of categories decreases reliability and validity because more categories call for more discrimination than is possible

Checklists and Q Sort

In a checklist, respondents are given a list of objects and asked to indicate which ones are self-descriptive


-dichotomous scoring method



Q sort is when a number of cards, each containing a trait, are sorted along a 9-category scale labeled from least to most


-constraints are imposed on the number of attributes in each category