191 Cards in this Set
Issue with measuring psychological constructs |
Psychological constructs, unlike physical objects, are hard to measure
When we get results, the obtained score may over- or underestimate the construct
This issue means we need to know how much variability there is in the total test scores
Scores must demonstrate acceptable levels of consistency between observed scores and true scores in order to be meaningful |
|
True scores vs observed scores |
We are trying to assess a person's true score on what the test is assessing
We actually measure the observed score on the test not true score |
|
Error of measurement |
The difference between the true score and the observed score
Error doesn't mean mistake but rather the variability of observed scores around the true score
Error is estimated by the standard error of measurement |
|
Reliability |
Is the degree to which test scores are free from measurement error
The higher the reliability of a test (0-1), the lower the measurement error
The more confident we can be that the observed score mirrors the true score |
|
When test scores are free of measurement error |
Test scores that are free of errors display consistent and stable test results
Reliable assessments are relatively free from measurement error, whereas less reliable results reflect measurement error |
|
Reliability outside of psychological tests |
Whenever something is measured, reliability is an issue
Blood pressure tests have lower reliability than a well constructed psychological test
Economic indicators (GDP, poverty, SES) are particularly unreliable |
|
Classical test theory (Spearman) Also called test score theory |
Assumes that a person has a true score that could be measured if there were no errors of measurement (high reliability)
However there are errors between observed scores and true scores
These measurement errors are then the difference between the observed score and true score |
|
Classical test score theory formulas |
X is the observed score, T is the true score, E is the measurement error
X = T + E
Because we want to know reliability or measurement error
E = X - T |
|
Major assumption of classical test theory. |
Assumption is that measurement errors are randomly distributed around the true score
Meaning chance factors or nonsystematic error increase or decrease observed scores
If people were to repeat the same test, results would produce a normal distribution of errors around each person's true score mean -scores that greatly differ from the true score happen less often
Can estimate true score by finding the mean of the observations from repeated applications |
|
Standard error of measurement |
Tells us on average how much a score varies from the true score
The standard deviation of observed score and the reliability of the test are used to estimate the standard error of measurement |
|
Pooled standard deviations |
In a reliable test it is assumed that error distributions overlap and differ only because of true test scores
Pooled variance of these errors tells us the magnitude of the variability of the sample observed scores around the true score of the sample
Pooled standard deviations from all test takers becomes the basic measure of error present in a test
Pooled standard deviation is called standard error of measurement |
|
Standard error of measurement |
The standard error of measurement is used to calculate the range of scores around the observed score within which the true score is likely to fall
Allows for a confidence interval around the observed score -the true score will fall within approximately + or - 2 standard errors of measurement (95%)
The mean of repeated testing is the true score estimate -the SD is the standard error of measurement |
|
Domain sampling model |
Considers the problems created by using a limited number of items to represent a larger and more complicated construct
Concern is to estimate true score from a limited sample of items where sampling from the full domain is impossible |
|
Domain sampling theory used in classical test theory |
Classical theory uses elements of domain sampling
From a sample of items (repeated test scores) a true score is estimated |
|
When is domain sampling important |
Problem is how much measurement error there is from one sample of items
This is important when the sample of test items is small relative to the size of the domain of items
Reliability increases as sample size approaches the size of the domain |
|
Repeated random sampling of items from the domain |
Each test has an unbiased estimate of true score
Due to measurement and sampling error these estimates will differ
These differences will be random and normally distributed
The mean of the correlations between the various test scores is the test reliability
Do not average sample correlations directly; each correlation is converted into a z score, the z scores are averaged, and the average is transformed back into a correlation |
|
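The averaging procedure above (convert each correlation to a Fisher z score, average in z space, then convert back) can be sketched in Python; the correlation values below are hypothetical:

```python
import math

def average_correlations(rs):
    """Average correlations via the Fisher z transform rather than directly."""
    zs = [math.atanh(r) for r in rs]   # r -> z
    mean_z = sum(zs) / len(zs)         # average in z space
    return math.tanh(mean_z)           # z -> r

# Hypothetical correlations between scores from repeated item samples
print(average_correlations([0.70, 0.80, 0.90]))
```

Note that the z-transformed average comes out slightly higher than the naive arithmetic mean, because the transform stretches the scale as r approaches 1.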
What does domain sampling allow us to calculate |
Allows for the calculation of the maximum, unbiased reliability estimate that a test can achieve |
|
Sources of measurement error |
Content sampling error
Time sampling errors
Other sources of error |
|
Content sampling error |
The error that results from differences between the sample of items and the domain of items
Largest source of error
Is the easiest and most accurate source of error to estimate
Determined by how well the domain is sampled -same difficulty, sampling all components of the domain |
|
How is content sampling measurement error estimated |
Estimated by analyzing the degree of similarity among the items making up the test
To do so we analyze the correlations between test items and the examinee's standing on the construct being measured |
|
Time sampling errors |
Random changes in the test taker or testing environment can impact test performance
Reflect random fluctuations in performance from one situation to another
Limit our ability to generalize test results
Major concern since psychological tests are rarely given in the exact same environment |
|
Other sources of errors |
Include testing, administrative, and scoring errors
Clerical errors committed while adding up scores or administrative errors on an individually administered test
When scoring relies heavily on subjective judgment of the tester, subtle discriminations in scoring can happen -must calculate inter rater or inter scorer agreement |
|
Expressing reliability |
Reliability is often expressed as a correlation coefficient
It is preferable to express it as a ratio of the variance of the true score to the variance of the observed score
The reliability is the proportion of observed score variance accounted for by true score variance |
|
Equations of reliability ratio |
Rxx = variance of true score / variance of observed score
(Standard deviation squared)
Variance of the observed score = true score variance + error variance
|
|
Estimate of error variance |
Estimate of error variance = 1 - Rxx -if reliability (Rxx) is .8, error variance is .2
Means 80% of test score variance reflects true score variance and 20% reflects random nonsystematic error variance |
|
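The two cards above amount to a one-line computation; a minimal sketch with hypothetical variance values:

```python
def reliability_ratio(true_var, error_var):
    """Rxx = true score variance / observed score variance,
    where observed variance = true variance + error variance."""
    return true_var / (true_var + error_var)

# Hypothetical variances for illustration only
rxx = reliability_ratio(true_var=80.0, error_var=20.0)
print(rxx)                # proportion of observed variance due to true scores
print(round(1 - rxx, 2))  # estimate of error variance (1 - Rxx)
```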
Reliability index |
Reflects the correlation between true and observed scores
Can't be calculated directly as true scores are unknown
Is equal to the square root of the reliability coefficient
If Rxx = .81 then the index is .9 -the correlation between observed and true scores is 0.9 |
|
Know |
There are a number of ways to estimate reliability
Each measures a different aspect of random error
Which reliability estimate is chosen will depend on what the test is presumed to measure and what the test constructor wants to demonstrate |
|
Three ways to estimate test reliability |
Test retest Parallel forms Internal consistency |
|
Test retest method Also called stability coefficients |
Measures time sampling errors
Used to evaluate the error associated with administering a test at two different times
Applies only to stable traits
Susceptible to carryover effects
Follows classical test theory -theory assumes attribute stability -test score variability is construed as error variability |
|
Time intervals in test retest |
If the time interval is short, practice / carryover effects take place
If the time interval is long, random fluctuations, unknown sources of error, and changes in the construct can take place over time
There is no single best time interval
The optimal interval is determined by the way the test results are to be used and the nature of the construct |
|
In test retest what does a positive correlation mean |
The scores are stable as they are generalized across time
Low susceptibility to testing or test taker conditions
Generalize over testing environments |
|
Carryover and practice effects |
Happens when the first testing session influences scores from the second session
When there are these effects the test retest correlation usually overestimates true reliability
Only a problem when the changes over time are random -not predictable, affects some but not all
Practice effects are a form of carryover effect -test takers have sharpened their skills after the first test |
|
Test retest administration |
Administer the same test on two well specified occasions and then compute the correlation between scores from the two administrations |
|
Poor test retest correlations |
A poor correlation does not mean that a test is unreliable
It may suggest that the characteristic under study has changed |
|
Parallel forms reliability Also called alternate forms or equivalent forms |
Compares two or more different but equivalent forms of a test that measures the same attribute
Must be made of different items, but the rules used to select items are the same
Tests must be parallel in terms of content, difficulty, etc.
Makes sure that the test scores do not represent any one particular set of items or subset of items from the entire domain (content sampling error) |
|
Why is parallel forms the most informative form of reliability for psychological studies |
1) contains estimate of consistency over time
2) contains two or more samples of items from the domain
3) can estimate the error attributable to item selection
4) practice or carryover effects are reduced |
|
Nature of items in parallel forms |
Same number Cover the same domain Expressed in the same way Equal difficulty |
|
Drawbacks of parallel forms |
Practice or carryover effects change the meaning of the second test
Creation of the many items needed for parallel forms is costly and time consuming |
|
Internal consistency reliability -split half reliability |
Reflects errors related to content sampling
These estimates are based on the relationship between items within a test
Test is given once and responses are split into two halves which are correlated -congeneric tests |
|
Ways tests can be split |
First half / second half split -if test is long and items are of equal difficulty
Odd even split -if items increase in difficulty or practice effects, fatigue, or declining attention effects |
|
Problem with split half internal reliability |
This reliability is an underestimate because each subset is only half as long as the full test
An estimate of reliability would be deflated because each half would be less reliable than the whole test
Since only half of the items are used the reliability underestimates true reliability for the whole test
-test gains reliability as the number of items increase |
|
Spearman brown correction correlation |
Allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test
Corrects underestimation
Underlies the general point that reliability increases as the number of items increases
Assumes equal variances in both halves of the test |
|
Spearman brown formulation |
Rsb = 2r / (1 + r)
Rsb = the correlation between the two halves of the test if each had the total number of items (corrected split half correlation)
r = the correlation between the two halves of the test |
|
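The Spearman-Brown correction above is a one-liner; the split-half correlation used here is hypothetical:

```python
def spearman_brown(r_half):
    """Estimate full-length reliability from a split-half correlation:
    Rsb = 2r / (1 + r)."""
    return 2 * r_half / (1 + r_half)

# Hypothetical split-half correlation of .70
print(spearman_brown(0.70))
```

The corrected value always exceeds the raw split-half correlation, reflecting the point that reliability rises with test length.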
Issue with Spearman brown correction |
Assumes that there are equal variances in both halves of the test
When variances are unequal we can't use it
Instead have to use alpha |
|
Kuder Richardson KR20 |
Is a measure of internal reliability so the test only has to be given once
Considers all possible splits simultaneously -avoids the problems of split half
Can only be used for items that are scored in a dichotomous manner (0 or 1) |
|
KR20 formula |
Kr20 = (N / (N - 1)) × ((S squared - sum of pq) / S squared)
Kr20 = reliability estimate
N = number of items on the test
S squared = variance of total test score
P = the proportion of people getting each item correct
Q = the proportion of people getting each item incorrect (1 -p)
Sum of pq = sum of the products of p x q for each item on the test -pq is the variance for an individual item, so the sum of pq is the sum of individual item variances |
|
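The KR-20 formula above can be computed directly from a matrix of 0/1 responses; the response data below are hypothetical:

```python
def kr20(responses):
    """KR-20 = (N / (N-1)) * ((S^2 - sum(pq)) / S^2) for dichotomous items.
    responses: one list of 0/1 item scores per examinee."""
    n_items = len(responses[0])
    n_people = len(responses)
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n_people
    s2 = sum((t - mean_total) ** 2 for t in totals) / n_people  # total score variance
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(person[i] for person in responses) / n_people   # proportion correct
        sum_pq += p * (1 - p)                                   # single-item variance
    return (n_items / (n_items - 1)) * ((s2 - sum_pq) / s2)

# Five hypothetical examinees on a four-item test
data = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
print(kr20(data))
```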
Know |
To have reliability greater than 0, the variance for the total test score must be greater than the sum of the variances for the individual items
Only happens when there is covariance between items
Covariance happens when the items are correlated with each other
The greater the covariance, the smaller the sum of pq term will be relative to the total test variance |
|
Cronbach's coefficient alpha |
Estimates the internal consistency of tests in which the items are not scored as 0 or 1
Examines the consistency of responses to all test items regardless of how those items are scored
Can be thought of as the average of all possible split half coefficients corrected for the length of the whole test
The sum of pq is replaced with sum of individual item variances
Most general method for finding estimates of reliability through internal consistency |
|
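Coefficient alpha can be sketched the same way as KR-20, with the sum of pq replaced by the sum of individual item variances; the rating data below are hypothetical:

```python
def cronbach_alpha(responses):
    """alpha = (N / (N-1)) * (1 - sum(item variances) / total score variance).
    Works for any item scoring, not just 0/1."""
    n_people = len(responses)
    n_items = len(responses[0])
    item_vars = []
    for i in range(n_items):
        col = [person[i] for person in responses]
        m = sum(col) / n_people
        item_vars.append(sum((x - m) ** 2 for x in col) / n_people)
    totals = [sum(person) for person in responses]
    mt = sum(totals) / n_people
    total_var = sum((t - mt) ** 2 for t in totals) / n_people
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical 1-5 ratings from four respondents on three items
ratings = [[4, 5, 4], [3, 4, 3], [5, 5, 4], [2, 3, 2]]
print(cronbach_alpha(ratings))
```

On all-dichotomous data this function gives the same value as KR-20, matching the card's point that alpha reduces to KR-20 in that case.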
What is coefficient alpha sensitive to |
Content sampling measurement errors
Heterogeneity of the test content -the degree to which the test items measure unrelated characteristics -as item heterogeneity increases alpha coefficient decreases |
|
Coefficient alpha and kr20 |
Both estimate internal reliability
Kr20 is a simplified version of the alpha coefficient
Alpha coefficient reduces to Kr20 when all items are dichotomous
If all test items are dichotomous then alpha and kr20 should give identical results within rounding error |
|
What does coefficient alpha provide us with |
Gives us a lower-bound estimate of reliability
A high alpha (>.80) suggests that the true reliability is at least that high
A low alpha only means that the true reliability may still be higher
To overcome this issue, 95% confidence intervals around alpha can be constructed |
|
Reliability for a test that measures more than one trait |
Factor analysis is a popular method for dealing with this situation
When factor analysis is used correctly, these subsets will be internally consistent (highly reliable) and independent from one another |
|
Limitations of alpha coefficient |
It assumes tau equivalence (T) or a unidimensional factor structure
When tau equivalence is not met, the alpha coefficient will underestimate the test's level of reliability |
|
Tau equivalence |
All the indicators of a factor, the test items, all load or correlate in a similar manner on one dimension -item homogeneity |
|
McDonald's omega coefficient |
Does not assume tau equivalence and can be used to assess internal reliability for non equivalent tau items
Calculation is not straightforward, relying on the outcome of a structural equation model
SPSS macros are available to do the calculation |
|
Sources of error when estimating reliability for behavioral observation data |
Individuals scoring the test (judges) Rating errors Definitional issues Item sampling errors |
|
How to estimate true scores in behavior observation studies |
Are unreliable because of discrepancies between true scores and scores recorded by the observer
To address these problems we need to estimate the reliability of the observers
This is known as interrater reliability (Inter judge, inter scorer or inter observer ratings) |
|
Interrater reliability |
Estimating the consistency among judges or raters who are evaluating the behavior or output
The percentage of agreement between raters is sometimes used as a measure of interrater reliability
This is incorrect because percentages do not take into account the chance level of agreement |
|
Most common form of interrater reliability |
Is to record the percentage of times that two or more observers agree
Not the best because 1) percentage does not consider the level of agreement that would be expected by chance alone 2) percentages should not be mathematically manipulated |
|
What is the best way to assess interrater reliability |
Kappa coefficient
Indicates the actual agreement as a proportion of the potential agreement following correction for chance agreement
Is a measure of agreement between two judges who each rate a set of objects using nominal scales
A weighted coefficient is available for ordinal level data and takes into consideration how disparate the ratings are
Ranges from 1 (perfect agreement) to -1 (less agreement than can be expected on the basis of chance alone)
Less than .4 is poor. Greater than .75 is excellent |
|
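A minimal sketch of the kappa computation for two raters using nominal categories (the ratings below are hypothetical):

```python
def cohens_kappa(rater1, rater2):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from each rater's marginal category proportions
    p_chance = sum((rater1.count(c) / n) * (rater2.count(c) / n)
                   for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Two hypothetical raters classifying six cases
r1 = ["A", "A", "B", "B", "A", "B"]
r2 = ["A", "A", "B", "A", "A", "B"]
print(cohens_kappa(r1, r2))
```

Here the raw percentage agreement is 5/6, but kappa is lower because part of that agreement would be expected by chance alone.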
When are kappa coefficients used (when agreement in classification is of interest) |
When a test is administered at two different points in time to classify people into diagnostic groups -a person would be classified or assigned to a group using the obtained test scores on each occasion and the degree of agreement across times is compared via kappa
Could use two different tests on the same group of people at the same point in time, classify them separately using each set of test scores, and then compute the cross test agreement in classification with kappa |
|
Fleiss' kappa or Krippendorff's alpha |
When there are more than two raters |
|
Internal consistency |
Evaluates the extent to which the different items on a test measure the same ability
Measures of internal consistency will all give low estimates of reliability if the test is designed to measure several traits |
|
Standard error of measurement and reliability |
Reliability coefficients reflect the proportion of observed variance attributable to true score variance
Reliability coefficients are a useful way of comparing the consistency of test scores produced by different assessment procedures
The standard error of measurement is useful for interpretation
It is the standard deviation of the distribution of scores that would be obtained by one person if they were tested on an infinite number of parallel forms of a test comprised of items randomly sampled from the same domain |
|
How is standard error of measurement calculated |
= test standard deviation × the square root of (1 - test reliability)
As reliability decreases, the standard error of measurement increases -this relationship occurs because the reliability coefficient reflects the proportion of observed score variance due to true score variance and the standard error of measurement is an estimate of the amount of error in test scores
Low test reliability means a larger standard error of measurement and less confidence in the precision of the test |
|
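The SEM formula above, sketched with hypothetical scale values (an IQ-style SD of 15 and two illustrative reliabilities):

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = test standard deviation * sqrt(1 - test reliability)."""
    return sd * math.sqrt(1 - reliability)

sem_high = standard_error_of_measurement(15, 0.91)  # highly reliable test
sem_low = standard_error_of_measurement(15, 0.64)   # less reliable test
print(sem_high, sem_low)  # the less reliable test has the larger SEM
```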
Test batteries |
The standard error of measurement is needed to interpret individual scores and scores from test batteries
In test batteries a number of attributes are assessed within a single test
Test battery results are displayed in percentiles using the standard error of measurement
When interpreting test results, the use of + or - 2 standard errors of measurement is recommended -prevents over-interpretation of small test score differences |
|
Test battery SEM consistency |
In test batteries the standard error of measurement is not constant across an entire set of scores
Standard error of measurement is lower for scores close to the score mean and higher for scores at both extremes
Scores at the extremes need to be checked for accuracy |
|
Reliability for composite scores |
Many tests have multiple scores that can be combined to form composite scores
In all cases, scores on tests are combined to yield a score on a composite measure
The more scores that make up the composite, the higher the correlations between those scores, and the higher the individual test reliabilities, the higher the composite reliability |
|
The advantage of composite scores is that their reliability is the result of |
1) number of test scores in the composite 2) the reliability of the individual test scores 3) the correlation between those scores |
|
Reliability of difference scores |
There are a number of situations where researchers and clinicians want to consider the difference between two scores
Difference scores are used in pre-post experimental designs
The reliability of a difference score is often lower than the reliability of either the pre or post test
Reliability is increased when the original test measures have high reliabilities and low correlations with each other |
|
Problems with difference scores |
Not only is the reliability lower than the pre and post measures, but those measures are often highly correlated |
|
How large a reliability coefficient needs to be depends on |
The nature of the construct The amount of time available for testing How test scores will be used The method of estimating reliability |
|
High reliability |
Diagnostic tests that inform major decisions about individuals should be held to a higher reliability standard than tests used in group research or for screening large numbers of individuals
High stakes decisions demand highly reliable information (.9 to .95)
.98 on the Stanford-Binet intelligence scales for adolescents |
|
Reliability strength outside of high stakes |
Reliability estimates of .8 are considered acceptable in many testing situations and are commonly reported for group and individually administered achievement and personality tests
For teacher made classroom tests and tests used for screening, reliability estimates of at least .7 are expected
Classroom tests are combined to form linear composites to form a final grade -the reliability of composite scores is greater than the reliabilities of the individual scores |
|
What to do when reliability is too low |
Increase the number of items
Factor and item analysis
Item discrimination analysis
Correction for attenuation |
|
Increase the number of items |
Adding more items increases reliability by increasing the number of item samples from the domain
Spearman Brown prophecy formula allows us to estimate how many items need to be added
N = Rd (1 - Ro) / Ro (1 - Rd)
N = the number of tests of the same length as the original test Ro = the observed level of reliability Rd = the desired level of reliability
Once N is known, multiply it by the current number of items |
|
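The prophecy formula above, sketched in Python; the test length and reliability targets are hypothetical:

```python
def prophecy_factor(r_observed, r_desired):
    """N = Rd(1 - Ro) / (Ro(1 - Rd)): how many times longer the
    test must be to reach the desired reliability."""
    return (r_desired * (1 - r_observed)) / (r_observed * (1 - r_desired))

# Hypothetical 20-item test with reliability .70; target reliability .90
n = prophecy_factor(0.70, 0.90)
print(n)               # lengthening factor
print(round(n * 20))   # total items needed at the target reliability
```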
What does increasing the number of items depend on |
Item availability Whether the time, effort and expense is worth the increase |
|
Factor analysis and item analysis |
Reliability can be increased by eliminating those items that do not correlate with other items
Items that do not correlate are measuring a different construct than that assessed by the other items and can be dropped from the scale
This increases the homogeneity (congeneric nature) of the test
However, fewer items decrease reliability |
|
Item discrimination analysis |
Item total correlations
Correlate each item with the total test score (.2 to .3)
Items with low item total correlations (<.30) are probably not measuring what the other items are measuring -these items are dropped from the final scale
This is a point biserial correlation -one continuous variable (total score) and one true dichotomous variable (item right or wrong)
Don't want correlations higher than .35 because we don't want a few items carrying the whole test |
|
Correction for attenuation |
Used to correct for the unreliability of scores being correlated
Eliminates the error
Estimates the maximum correlation between tests if there were no measurement errors
Can be used with any reliability estimate
Referred to as the attenuation-corrected (disattenuated) coefficient |
|
Formula for correction for attenuation |
r'12 = r12 / (square root of r11) (square root of r22)
r'12 = the maximum correlation between the two tests
r12 = the observed correlation between the tests
r11 = reliability for test 1
r22 = reliability for test 2
The maximum validity correlation of a test is equal to the square root of Rxx (the reliability index) |
|
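The correction for attenuation above, sketched with hypothetical values:

```python
import math

def correct_for_attenuation(r12, r11, r22):
    """r'12 = r12 / (sqrt(r11) * sqrt(r22)): the correlation between
    two tests if both were free of measurement error."""
    return r12 / (math.sqrt(r11) * math.sqrt(r22))

# Hypothetical: observed correlation .40, test reliabilities .80 and .90
print(correct_for_attenuation(0.40, 0.80, 0.90))
```

The corrected value always exceeds the observed correlation, since unreliability in either measure attenuates the observed relationship.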
What is validity |
Refers to evidence that supports interpretation of test results as reflecting the psychological construct that the test was designed to measure
Refers to investigations into what a test measures
How well it measures what it says it measures
The degree to which evidence and theory support interpretations of test scores for the proposed uses of the test
Validity is the most fundamental consideration in developing and evaluating tests |
|
Validity and reliability |
Reliability is the stability and accuracy of test scores; it reflects the amount of random error
Reliability is a necessary but insufficient condition for validity
An unreliable test cannot produce valid interpretations -only true score variance can be reliable and related to the construct the test is supposed to measure
However, no matter how reliable a measurement is, it does not guarantee validity -validity is a characteristic of test performance, not the test |
|
Validity is not |
Indicated by the title of the test
A brief description of the test given in the test manual
Represented by a single number
High nor low, good or bad
Indicated by the nature of the items (face validity)
Is not static but a constantly moving target |
|
What is validity then |
Is what the test measures
How well it measures what it says it measures
Useful or not for certain purposes (test utility)
Is a process that involves ongoing dynamic effort to accumulate evidence for a sound scientific basis for proposed test score interpretations
Indicated by empirical associations between test scores and other measures
Reflected by a nomological net |
|
How valid are psychological tests |
Meyer (2001) Found psychological tests often provide results that are equal to or exceed the validity of medical tests
Pap smear is 0.36 while MMPI-2 is 0.37
Even when both medical and psychological tests are used to detect the same disorder, psychological tests can provide superior results
MRI for dementia is 0.57 and neuropsychological tests 0.68
Yes, psychological tests can provide information that is as valid as common medical tests |
|
Maximum validity correlation |
Reliability places limits on the magnitude of validity coefficients when a test score is correlated with a criterion or outcome variable
Using IQ to predict reading achievement, the reliability coefficient for the IQ test imposes a theoretical limit on the true value of the correlation that is equal to the square root of the reliability coefficient
Maximum validity correlation = square root of the reliability coefficient |
|
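The ceiling described above is a one-liner; the reliability value used here is hypothetical:

```python
import math

def max_validity(reliability):
    """Theoretical upper bound on a validity coefficient given test reliability."""
    return math.sqrt(reliability)

print(max_validity(0.81))  # a test with Rxx = .81 cannot correlate above .9
```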
Definition of validity from the standards for educational and psychological testing |
The degree to which evidence and theory support the interpretations of test scores for proposed uses of the tests
Validity is a unitary concept with subtypes
These subtypes represent different ways of collecting evidence to support the validity of interpretations of performance on a test, which we commonly express as a test score |
|
4 subtypes of validity |
Content related Criterion related Construct related Structural related |
|
Current view of validity |
There is a consensus that the older concept of test validity be abandoned in favor of an emphasis on the appropriateness or accuracy of interpretation of test scores
We do not refer to validity of a test but rather validity of the interpretation of test scores
We interpret responses to test items, and it is the interpretation of performance that possesses validity
When test scores are interpreted in multiple ways, each interpretation needs to be validated |
|
Importance of validity evidence |
Sources of validity evidence differ in importance according to factors like the construct being measured, the intended test usage, and the population being assessed |
|
Content related validity |
Examination of the test content to see if the test covers a representative sample of the domain being measured
Test content includes the themes, wording, and format of the items and tasks, and the administration and scoring rules |
|
What does content validity validate |
Criterion or domain referenced tests (job based tests) Tests used in education Achievement tests Aptitude tests (Employee, student, training selection, evaluation or classification) |
|
Inspection of test items is not sufficient to assess content validity as... |
Inspection cannot tell you whether all relevant subdomains in an area have been covered
Whether all objectives of the instruction have been assessed, not just subject matter
Whether there are items that assess the process by which people answer questions
Whether irrelevant sources of error in the test have been removed |
|
How is content validity assessed |
A test blueprint or test specification plan is constructed (must be written down)
Items are written according to the blueprint, then area experts are consulted about whether important content, objectives and processes are covered
Care at this stage establishes the foundation for correspondence between test content and the construct
Area experts are used to systematically review the test and reevaluate the correspondence between the test content and its construct
Administer the test and use empirical procedures to assess test and item characteristics (item difficulty, item discrimination, test item correlations)
Examine errors and check for irrelevant sources of error |
|
What does a test blueprint specify |
Areas to be covered Objectives to be tested and a rationale for inclusion Processes to be assessed Important topics and issues that warrant study |
|
Two types of errors examined in content validity test evaluation |
Item or domain relevance Item content or domain coverage |
|
What do content validity studies allow us to conclude |
The test covers a representative sample of domain relevant skills and knowledge
The obtained score is relatively free from irrelevant sources of error (reading level, item difficulty, domain irrelevancy) |
|
When we have these errors |
The presence of such errors indicates that there is construct irrelevant variance
Poorly designed tests are described as having construct underrepresentation or construct overrepresentation |
|
Criterion related validity (predictive validity) |
Refers to the degree to which a test predicts future performance or a future outcome -Future performance or outcome is the criterion
Requires an independent criterion to assess the outcome -independent means that collection and knowledge of the predictor test scores should be independent or isolated from the collection and knowledge of the criterion |
|
What happens when predictor scores and criterion scores are not independent |
Criterion contamination
This artificially inflates validity coefficients |
|
Two types of criterion related validity tests |
Predictive study - there is a passage of time between predictor (test) and the criterion (outcome)
Concurrent validity -when both are given within a short period of time
Some tests may be excellent when used in concurrent applications but poor for predictive applications and vice versa |
|
Concurrent validity |
Becomes an issue when diagnoses are being made or when the goal of testing is to determine the current status of the examinee as opposed to predicting future outcomes |
|
Criterion related predictive validity |
Becomes an issue when the test is being used for selection and classification of individuals or personnel or when prediction is the ultimate goal of assessment |
|
Know |
While there is no limit or restriction on the nature of the predictors or criterion measures, both must be reliable and valid measures of the construct to be assessed
The criterion should be viewed as the gold standard, the best existing measure of the construct
The only concern is whether there is an empirical link between predictor and criterion and whether the predictor is useful to those using the test |
|
4 classes predictor and criterion measures fall into |
Academic achievement
Specialized training
Personality and interest inventories
Simpler/shorter new tests |
|
Ratings made by judges, supervisors, interviewers |
An often used criterion measure is ratings made by judges, supervisors, teachers, advisors, interviewers, social workers, or coaches about some attribute of behaviour (someone is asked to make an evaluation of someone else)
Very often ratings, evaluations or judgments made by others of someone else's behaviour are the core of many criterion measures
Ratings are often subject to bias, but when obtained under controlled conditions, with reliability confirmed (Kappa) and validity established, ratings can be a valuable source of criterion data |
|
Cross validating sample |
In industrial and other applied settings, tests are devised in one location and used to fulfill similar functions in another
Early studies on the generalization of criterion related validity showed poor generalization or wide variability
Later studies indicate that poor generalizability reflected statistical artifacts such as small group sizes or samples that were restricted in range
It is unnecessary to do local validation studies for previously validated tests -if there is little existing research on the test then local validation may be needed |
|
When artifacts are removed |
When artifacts are removed, the predictive validity of tests that assess verbal, numerical and reasoning skills generalizes across samples
Cognitive skills that tap a common core of abilities are broadly predictive of academic and occupational outcomes |
|
Validity coefficient |
Predictive validity assessed by correlation between the test and criterion
Coefficients should be large enough to indicate that information from the test will help predict how individuals will perform in criterion
Usually .3 to .4 |
|
Coefficient of determination |
The coefficient of determination (r squared) is used to see how much variability in the criterion is explained by the correlation |
|
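The two statistics on the cards above can be computed directly; a minimal Python sketch (the predictor and criterion values are invented for illustration):

```python
from math import sqrt

def pearson_r(x, y):
    """Validity coefficient: Pearson correlation between predictor test and criterion."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

test_scores = [55, 60, 65, 70, 75, 80]       # hypothetical predictor scores
criterion = [2.1, 2.4, 2.3, 3.0, 3.2, 3.5]   # hypothetical outcome (e.g., GPA)
r = pearson_r(test_scores, criterion)        # validity coefficient
r_squared = r ** 2                           # coefficient of determination
```

Note that a real validity study would use far larger samples; the r of roughly .96 here is artificially high because the toy data were chosen to correlate.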
Reasons why validity coefficients are small and transfer poorly across situations |
1) Differences in sample size and characteristics of the initial validating and cross validating samples
2) Validity, reliability and appropriateness of the criterion
3) The manner of measuring the criterion, nature of the job or curriculum, and people who take the test all affect generalizability
If generalizability is low consider differential prediction (does the test predict better for some groups than others)
Coefficients may be small but still may be useful |
|
Hunter 1984 |
Demonstrated the far reaching implications for worker productivity, and subsequently the US gross domestic product, if employers used employment tests to place workers in the best jobs, even if the employment tests had validity coefficients only in the .20s or .30s
Even though the validity coefficients for these employment tests are small, and certainly well below what is acceptable for a reliability coefficient, the very practical relationship they have with an overall increase in gross domestic product justifies their use
It is important to emphasize the context of assessment, measurement and prediction in deciding whether a validity coefficient is large enough to be useful for application |
|
Criterion and accuracy |
The criterion must be the most accurate measure available of what you are trying to predict if it is to serve as the standard for the behaviour
If such a criterion exists, then only expense would justify using a lesser outcome measure
Using a lesser criterion for less than adequate reasons fails to do justice to the test, clientele, patient or student |
|
The development of what concepts mean and how to measure them with criterion related validity |
Tests in personality, social or developmental psychology involve the simultaneous development of what the concept means and how to measure the concept
These two goals cannot be met by criterion related validity
The meaning and measurement of a construct require the processes of construct related validity |
|
Constructs in construct related validity |
Constructs are used in all sciences to explain why things happen
In psychology, constructs are used to explain why people do the things they do or to explain test responses
Constructs organize and explain observed behaviour |
|
Frequently used constructs |
Intelligence
Learning
Love
Anxiety |
|
How are constructs built |
Constructs are not physically real
They are built up through the accumulation of evidence and the integration of that evidence into some theoretically meaningful pattern (nomological net) |
|
How is construct validity established |
Construct validity evidence is established through activities by which the construct is defined and measured at the same time
This is particularly relevant when there is no consensus as to what measures adequately define the construct |
|
Pap 1953 ideas of open concepts |
Pap argued that psychological constructs like love, intelligence, extraversion, schizophrenia and anxiety are open
They are not clear and well defined
They are not necessarily problematic |
|
Open concepts are characterized by |
Having intrinsically fuzzy boundaries
A large extendable and variable list of indicators
An unclear inner nature |
|
Concepts become less open concepts, better defined and understood as... |
Evidence is gathered as to the meaning of the concept
How well the measures of the concept clarify the meaning of the concept |
|
What is construct validity |
Is the degree to which a test measures the trait or attribute under consideration -the trait or attribute is theoretical
Construct validity is not a single number; there is no single coefficient that indicates construct validity
Because constructs are open concepts, over a series of studies the meaning of the construct begins to emerge
The results and observations over time gradually clarify what the test is measuring and what the scores mean (nomological net) |
|
What does construct validity do |
It gradually accumulates information gathered from a variety of sources that relate to the construct -not just experiments
The development of construct validity entails the measurement and understanding of a concept moving from an open to a more closed concept
Any information that aids in explaining the underlying construct is appropriate for construct validity |
|
Construct related validity based in on internal structure evidence |
Examination of the internal structure of a test (or battery of tests) -whether the relationships between test items (test battery components) are consistent with the construct the test is designed to measure
By examining internal structure, the actual internal structure can be compared with the hypothesized structure of the construct the test is supposed to measure |
|
Age differentiation |
Tests created to assess developmental processes use age differences to validate the underlying construct
Used to validate achievement, aptitude and ability tests
The assumption is that measures of the construct should increase with age or cognitive ability
Age differentiation is necessary but not a sufficient evidence for internal structure validity |
|
Problem with age differentiation |
Failure to find age differences may mean the test is invalid
But the presence of age differences does not indicate whether developmental change was due to the underlying construct or its presumed structure |
|
New test old test correlation |
When a new test of a construct is developed, correlations with the old test can help validate the new measure
Assumes the old test has established construct validity
Correlations should be moderately high (.3 to .4) -don't want them too high as it means it's the same test -don't want to inherit the old test's measurement error
The new test should also be correlated with other measures to see if it is free from irrelevant sources of error |
|
Indirect validity |
What the test scores should not correlate with
Doesn't tell you what you are measuring, just what it is not
High correlations would render the construct validity of the test suspect
Low correlations do not necessarily ensure internal structure construct validity |
|
Factor analysis |
Statistical procedure used to determine the number of conceptually distinct factors
Allows one to evaluate the presence and structure of any latent constructs existing among a set of variables
The latent construct underlies and is partly responsible for the way examinees respond to the variables that make up the factor
Used to validate the internal structure of test batteries |
|
Process of factor analysis |
Begins with a correlation matrix that contains the intercorrelations among the individual items
Produces as many factors as there are variables
There are statistical guidelines for determining the number of factors to retain
The most important guideline is "does the factor solution make psychological sense" |
|
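A minimal sketch of the first steps of a factor analysis as the card describes, using NumPy's eigendecomposition of a hypothetical correlation matrix (the matrix values are invented; real analyses use dedicated routines with factor rotation):

```python
import numpy as np

# Hypothetical correlation matrix for 4 items: items 1-2 and items 3-4
# intercorrelate strongly, suggesting two latent factors.
R = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
])

eigenvalues, eigenvectors = np.linalg.eigh(R)  # eigh: R is symmetric
# One common statistical guideline (the Kaiser criterion): retain
# factors whose eigenvalue exceeds 1.
n_factors = int((eigenvalues > 1.0).sum())
print(n_factors)  # → 2
```

The two-cluster structure of R yields two eigenvalues above 1, matching the "does the factor solution make psychological sense" check: items 1-2 and items 3-4 each load on their own factor.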
Does thr factor solution make psychological sense |
The variables loading on a factor share a common theme or meaning
A psychologically uninterpretable factor solution has little practical value and is unlikely to provide validity evidence |
|
Exploratory and confirmatory factor analysis |
If the theory underlying test construction hypothesized 3 factors and 3 factors emerge from the confirmatory factor analysis, the test is said to have factorial validity -the factor structure of the test is supported
There are a number of statistics available that statistically test the fit or match between the actual and hypothesized factor structure |
|
Internal consistency |
Internal consistency methods are used to assess construct validity
Measures of internal consistency reflect the homogeneity of test items -indicates something about the domain from which the items were drawn
Does not tell you about the underlying construct from that domain
It tells you what domain the items came from, not what the construct means |
|
Ways of measuring internal consistency |
Extreme groups analysis -on personality, social or abnormal psychology tests, validity is established by showing that those scoring high or low differ on some criteria
Item total correlations -on ability, personality or social psychology tests -examine item total correlations for pass/fail scored items -point biserial correlation for dichotomously scored items -.2 to .35 |
|
Convergent divergent validity (discriminant) |
Campbell and Fiske argue that construct validity is shown when -the test correlates positively with tests that it should correlate with (convergent) -and shows low or zero correlations with measures that it theoretically should not correlate with (divergent)
Most convincing method to demonstrate construct validity |
|
How to do convergent divergent analysis |
Creation of a multitrait-multimethod matrix
We measure two or more traits by two or more different methods
Validity coefficients should be greater than correlations between different traits measured by different methods and also greater than different traits measured by the same method |
|
Common method variance and construct variance |
The multitrait-multimethod matrix also provides evidence on common method variance and the distinction between method and construct variance |
|
Method variance |
Refers to the correlation between measures due to their common assessment procedures |
|
Common methods |
The triangles indicating different traits, same method reflect common methods
Correlations between different traits measured by different methods (lower left triangle)
Different traits measured by the same method (immediate off diagonal triangle) |
|
How is construct validity different than predictive or content validity |
It focuses on the role of theory in test construction and the need to formulate hypotheses that can be tested in validation studies
It is not based on subjective or intuitive judgments or rationalizations independent of the data |
|
Construct validity evidence based on response processes |
Validity evidence based on response processes involves an analysis of the fit between the psychological processes examinees engage in while responding and the construct being assessed
Collect this evidence by interviewing examinees about their response processes and strategies, recording behavioural indicators such as response times and eye movements, or analyzing the types of errors committed |
|
Ways to categorize types of questions |
Objective subjective distinction classification
Selected response or constructed response classification -when creating test items the overriding goal is to develop items that measure the specified construct and contribute to psychometrically sound tests |
|
Objective subjective distinction categorization |
Often used
The difficulty with this categorization is that it is sometimes difficult to say whether a question is subjective or objective |
|
Selected response or constructed response classification categorization |
If an item requires an examinee to select a response from available alternatives it is classified as a selected response item (multiple choice, true or false, matching)
If an item requires examinees to create or construct a response it is classified as a constructed response item (fill in the blank, short answer, essay, oral examination, interview) |
|
Advantages of selected response questions |
A large number of items can be answered in a short time period -can include more items from the domain to increase reliability
Items are flexible and can be used to assess a wide range of constructs with greatly varying levels of complexity
Decrease the influence of certain construct irrelevant factors that can impact test scores (writing abilities) |
|
Limitations of selected responses |
Items are challenging to write -for multiple choice tests it can be difficult to come up with foils that are plausible yet incorrect
There are some constructs that cannot be measured using selected response items
Blind guessing and random responding are seen in such items |
|
Types of constructed response items |
Short answers
Essays
Performance assessments
Portfolio assessment |
|
Short answer |
Items can take a number of forms -putting a word, phrase or number in response to a direct question or to complete a sentence |
|
Performance assessments |
Require examinees to complete a process or produce a product in a context that closely resembles real life situations |
|
Portfolio assessments |
Involves the systematic collection of examinee work products over a specified period of time according to a specific set of guidelines
Writers, artists, architects |
|
Strengths of constructed response assessments |
Items are easy to write -but developing a framework for how to properly score responses can take time and effort
Work well for assessing higher order cognitive abilities, complex task performance, and tasks that require a constructed response format like problem solving
Items eliminate guessing and random responding |
|
Limitations of constructed response classification |
Take more time to complete; as a result you are not able to sample the content domain as thoroughly -less reliable and time consuming to score
Do eliminate blind guessing but are vulnerable to faking or creative construction when answers are unknown
Vulnerable to influence of extraneous or construct irrelevant factors that can impact test scores (writing abilities) |
|
How to select an item format |
The key factor in selecting an item format involves identifying the format that most directly measures the construct
Select the item format or task that will be the most direct measure of the construct of interest
Selected response items are recommended since they allow broader sampling of content domain and more objective and reliable scoring procedures |
|
Writing items |
There is an art to writing good items |
|
Rules for writing items |
Textbook |
|
Types of selected response test questions -which one is selected depends on the area of inquiry |
Dichotomous items -scored as either true or false, agree or disagree
Polytomous items
Likert format |
|
Dichotomous items |
Advantages include simple to understand, score and administer
Appears in education and personality tests where absolute judgments are required
On maximal performance tests items should be easily scored (correct/incorrect) according to a scoring criterion -all are scored in an objective manner and are classified as objective questions
Disadvantages include increased chance of guessing and many items need to be written -true/false items encourage superficial understanding
Because of guessing these items are less reliable and less precise than other formats |
|
Polytomous items |
Several response alternatives are given (multiple choice)
One alternative is preferred and the others are wrong or not indicative of the construct or answer
3 alternatives (distractors, foils) in addition to the correct answer maximizes item and test reliability and discrimination between test takers
Reliability of tests in this format is constrained by guessing -a correction for guessing may be used |
|
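The correction for guessing mentioned above is conventionally the formula score R − W/(k − 1), where R is the number right, W the number wrong, and k the number of options; a minimal sketch (function name and example numbers are mine):

```python
def formula_score(n_right, n_wrong, n_options):
    """Correct a raw score for blind guessing on k-option multiple choice items.

    A wrong answer implies a guess; a blind guesser gets about 1 item right
    per (k - 1) items wrong, so that expected gain is subtracted back out.
    """
    return n_right - n_wrong / (n_options - 1)

# 30 right and 12 wrong on 4-option items: 30 - 12/3
print(formula_score(30, 12, 4))  # → 26.0
```

Omitted items are neither right nor wrong, so they simply do not enter the correction.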
Likert format |
5 to 7 alternatives are given such as strongly agree, agree, don't know, etc.
The test taker chooses the alternative that comes closest to their attitude toward the issue
Negatively worded items are reverse scored
Used for personality and attitude tests |
|
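Reverse scoring of negatively worded Likert items, as the card describes, maps each response onto the opposite end of the scale; a minimal sketch for a 1-5 scale (the responses are invented):

```python
def reverse_score(response, scale_min=1, scale_max=5):
    """Reverse-score a negatively worded Likert item (1 <-> 5, 2 <-> 4, 3 stays)."""
    return scale_min + scale_max - response

responses = [5, 2, 4, 1, 3]  # raw responses to a negatively worded item
print([reverse_score(r) for r in responses])  # → [1, 4, 2, 5, 3]
```

After reversing, a high total score means more of the trait on every item, so item responses can be summed directly.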
Item analysis |
General procedure for methods used to evaluate specific items or groups of items (not whole test)
All tests need to undergo this as it is useful for test developers to decide which items to keep on a test and which items to modify or eliminate
Improving the quality of individual items improves the overall test quality
Tells you about test item homogeneity |
|
Components of item analysis |
Item difficulty
Item discrimination
Distractor analysis
These components are not independent of one another |
|
Item difficulty |
The proportion of people who pass an item
p = number who get the item right / number of examinees
Ranges from 0 to 1, with a lower proportion meaning a more difficult item -p of 0 or 1 provides no information
Optimal level of difficulty is .3 to .7
To differentiate test takers, items must range in difficulty |
|
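The item difficulty index on the card above is just a proportion; a minimal sketch over 0/1-scored responses (the data are invented):

```python
def item_difficulty(responses):
    """p = number who pass the item / number of examinees (0/1 scoring)."""
    return sum(responses) / len(responses)

item = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # 10 examinees, 7 correct
p = item_difficulty(item)
print(p)                 # → 0.7
print(0.3 <= p <= 0.7)   # falls within the optimal difficulty range → True
```

An item everyone passes (p = 1) or everyone fails (p = 0) tells you nothing about differences between examinees.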
How hard an item should be depends on |
1) probability of correct response by chance
2) what the test is designed to do |
|
Item difficulty need to consider item guessing |
To take into consideration the effects of guessing, the optimal item difficulty level for selected response items is set higher than for constructed response items
Lord argues that for 4 option multiple choice items the average p should be about .74 with a range of .64 to .84 |
|
Percent endorsement |
The item difficulty index is only applicable to maximal performance tests where items are scored correct/incorrect
For typical response tests, the percent of examinees endorsing an item is reported instead |
|
Item discriminability |
Refers to how well an item discriminates among test takers who differ on the construct being measured by the test
Above .40 is considered excellent
.30 to .39 is good
.11 to .29 is fair
0 to .10 is poor |
|
Two item discriminability statistics |
Item total correlations
Discrimination index
-separate test takers into top and bottom scoring groups
-compute a discrimination index for each item
D (discrimination index) = proportion of the top group who got the item correct - proportion of the bottom group who got it correct
-remove or revise inverted (negative d) items |
|
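The discrimination index D from the card above compares pass rates in the top and bottom scoring groups; a minimal sketch (group data invented):

```python
def discrimination_index(top_responses, bottom_responses):
    """D = proportion correct in top group - proportion correct in bottom group."""
    p_top = sum(top_responses) / len(top_responses)
    p_bottom = sum(bottom_responses) / len(bottom_responses)
    return p_top - p_bottom

top = [1, 1, 1, 1, 0]     # 80% of high total scorers pass the item
bottom = [1, 0, 0, 1, 0]  # 40% of low total scorers pass
d = discrimination_index(top, bottom)
print(round(d, 2))  # → 0.4
```

A negative D (an inverted item) means low scorers outperform high scorers on the item, flagging it for removal or revision.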
Item discrimination and difficulty |
Item discrimination indices are biased in favor of items with intermediate difficulty levels (p values)
Because of the relationship between p and d, items that have excellent d values will have p values between .2 and .8 |
|
Item total correlations coefficients |
Point biserial correlation (rbis) -correlation between a true dichotomy (yes or no) and a continuous measure (adjusted test score)
Large correlations mean that an item is measuring the same construct as the overall test and discriminates between individuals with high and low construct ability
If the correlation is low (below .2 to .35) the item should be eliminated
Allows the test developer to select items that discriminate between respondents that are high and low on the construct |
|
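The item-total (point-biserial) correlation above is an ordinary Pearson correlation between a 0/1-scored item and the total score with that item removed; a sketch with invented data:

```python
from math import sqrt

def point_biserial(item, totals):
    """Pearson r between a dichotomous item (0/1) and continuous scores.

    `totals` should be the adjusted total: test score minus this item,
    so the item is not correlated with itself.
    """
    n = len(item)
    mi, mt = sum(item) / n, sum(totals) / n
    cov = sum((i - mi) * (t - mt) for i, t in zip(item, totals))
    si = sqrt(sum((i - mi) ** 2 for i in item))
    st = sqrt(sum((t - mt) ** 2 for t in totals))
    return cov / (si * st)

item = [1, 1, 1, 0, 0, 0]   # pass/fail on one item
rest = [9, 8, 7, 3, 2, 1]   # total score with the item removed
print(round(point_biserial(item, rest), 3))  # → 0.965
```

Here high scorers pass the item and low scorers fail it, so the correlation is high and the item would be retained.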
Item total correlations with typical response tests |
On a test designed to measure sensation seeking using true or false items, 1 means yes (high sensation seeking) and 0 means no (low sensation seeking) |
|
Distracter analysis |
On multiple choice tests incorrect alternatives are referred to as distractors (foils) since they distract those who don't actually know the correct response
Allows you to examine how many examinees in the top and bottom groups selected each option on a multiple choice item
Good distractors should show negative item discrimination and be selected by some examinees |
|
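Distractor analysis, as described above, tallies option choices separately for the top and bottom scoring groups; a minimal sketch (the options and choices are invented):

```python
from collections import Counter

# Option chosen by each examinee on one item; "B" is the keyed answer.
top_group = ["B", "B", "B", "A", "B", "C", "B", "B"]
bottom_group = ["A", "C", "B", "D", "A", "C", "B", "D"]

for label, group in [("top", top_group), ("bottom", bottom_group)]:
    counts = Counter(group)
    print(label, {opt: counts.get(opt, 0) for opt in "ABCD"})
# A good distractor is chosen by some examinees, more often in the bottom
# group (negative discrimination); one nobody picks adds nothing to the item.
```

Here the keyed answer B discriminates well (6 of 8 top vs 2 of 8 bottom), while a distractor chosen by no one (D in the top group) is doing no work.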
Relationship between item distractors, item difficulty and discrimination |
The selection of distractors impacts both item difficulty and discrimination
Distractor quality significantly impacts item difficulty and consequently item discrimination |
|
Item validity |
Becomes an issue when the test constructor wants to maximize criterion or predictive validity
Is assessed by the point biserial correlation between responses to each test item and a criterion measure, multiplied by the item's standard deviation
rbis × SD |
|
Item reliability |
Assesses the internal consistency of the test
rbis × standard deviation of the item |
|
Item response theory (IRT ) -alternative to item analysis |
The construct underlying the test responses can be known by observing performance on the test items
Responses to test items are explained by latent traits
Models the probability of a correct answer (or saying yes to a personality test item) as a mathematical function of the examinee's standing on the latent trait
The latent trait (theta) is assumed to be unobservable, known only through test responses |
|
Latent trait |
A latent trait is an ability or trait that is inferred to exist based on theories of behaviour or empirical evidence (can't be assessed directly)
Those with more of the latent trait should get more items correct than those who score low on the construct |
|
Two parameters of Item response theory |
Person parameter -constructed as a single latent trait or dimension
Item parameters -are Item difficulty, Item discrimination and Item guessing |
|
Item response curves or item characteristic curves |
Each item can have a graph created that maps the probability of getting the item correct against the latent dimension
S shaped curves |
|
Three most common item parameters in Item response theory |
Item discrimination -the slope function
Item difficulty -the location function -has received the most attention
Item guessing |
|
Rasch models |
Models relating item difficulty to probability of getting the item correct |
|
Item difficulty in item response theory |
Assumes that only item difficulty differentiates between items and all items are similar in slope and item guessing
Item difficulty is determined at the point of median probability -the ability level at which 50% of respondents endorse the correct answer (the inflection point) |
|
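The Rasch (one-parameter logistic) model above writes the probability of a correct response as a logistic function of ability minus difficulty, P(θ) = 1 / (1 + e^−(θ−b)); at θ = b the probability is exactly .50, the inflection point the card describes. A minimal sketch:

```python
from math import exp

def rasch_probability(theta, b):
    """P(correct | ability theta, item difficulty b) under the Rasch model."""
    return 1.0 / (1.0 + exp(-(theta - b)))

b = 1.0                                   # a fairly hard item (typical b range: -3 to +3)
print(rasch_probability(1.0, b))          # ability equals difficulty → 0.5
print(rasch_probability(3.0, b) > 0.8)    # high ability, likely correct → True
print(rasch_probability(-1.0, b) < 0.2)   # low ability, unlikely correct → True
```

Shifting b to the right slides the whole S-shaped curve toward higher ability, which is exactly what the ICC card below describes for harder items.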
Know |
On an ICC, difficult items are hard or unlikely to be endorsed and are shifted to the right of the scale, indicating the higher ability of the respondents who answer them correctly
Easier items are shifted more to the left of the ability scale
Typical range: -3 to +3 |
|
Item discrimination in item response theory |
Determines the rate at which the probability of endorsing a correct item changes with given ability levels
This parameter is imperative in differentiating between individuals possessing similar levels of the latent construct
The purpose is to include items with high discrimination in order to be able to map individuals along the continuum of the latent trait (0 to 2) |
|
Know |
What the item response curves mean and why people respond as they do depends on thr theory that lead to test development To avoid the impression that the latent traits are real terms such as item response analysis and item respond theory are used |
|
Advantages of an Item response approach to item analysis |
Item analysis using classical theory uses correlations, means and variances, which are sample dependent and may not generalize to other samples
In an item response analysis, item characteristics are independent of the participant sample
What we want are items that are sample independent (sample invariant), known as invariance of item parameters -items can be given to people of varying ability, known as population invariance |
|
Invariance |
To create invariance, item characteristics need to be independent of the sample characteristics, known as anchoring -structural validity
Items can be given to people of varying ability, known as population invariance |
|
Significant contributions of item response theory |
In the area of computer administered tests, known as computer adaptive testing
In the detection of item or test biases -ICCs for different groups are generated and statistically compared to determine the degree of differential item functioning |
|
Items too hard |
Frequently, subjective judgments are made about the suitability of items despite all the effort of item analysis
An item is too hard, too easy, or too demeaning to minority groups
These subjective reviews have not been successful in predicting item difficulty or discrimination (need statistical analysis)
Hard items are not necessarily biased or unfair |
|
Category format |
Used in observational studies, developmental and organizational psychology
The scale consists of 10 category responses (on a scale from 1 to 10)
Category response scales are influenced by context and anchoring effects -ratings are influenced by the behaviour of the other people being rated, and once rated, other ratings become anchored at one end or the other
Overcome this by labeling the endpoints and middle of the scale to remind respondents of the definitions
Increasing the number of categories decreases reliability and validity because more categories call for more discrimination than is possible |
|
Checklists and Q Sort |
In a checklist, respondents are given a list of objects and asked to indicate which ones are self descriptive -a dichotomous scoring method
Q sort is when a number of cards, each containing a trait, are sorted along a 9 category scale labeled from least to most -constraints are imposed on the number of attributes in each category |