191 Cards in this Set
Issue with measuring psychological constructs |
Psychological constructs, unlike physical objects, are hard to measure
When we get results, the obtained score may over- or underestimate the construct
This issue means we need to know how much variability there is in the total test scores
Scores must demonstrate acceptable levels of consistency between observed scores and true scores in order to be meaningful |
|
True scores vs observed scores |
We are trying to assess a person's true score on what the test is assessing
We actually measure the observed score on the test not true score |
|
Error of measurement |
The difference between the true score and the observed score
Error doesn't mean mistake but rather the variability of observed scores around the true score
Error is estimated by the standard error of measurement |
|
Reliability |
Is the degree to which test scores are free from measurement error
The higher the reliability of a test (0-1), the lower the measurement error
The more confident we can be that the observed score mirrors the true score |
|
When test scores are free of measurement error |
Test scores that are free of errors display consistent and stable test results
Reliable assessments are relatively free from measurement error, whereas less reliable results reflect measurement error |
|
Reliability outside of psychological tests |
Whenever something is measured, reliability is an issue
Blood pressure tests have lower reliability than a well constructed psychological test
Economic indicators (GDP, poverty, SES) are particularly unreliable |
|
Classical test theory (Spearman) Also called test score theory |
Assumes that a person has a true score that could be measured if there were no errors of measurement (high reliability)
However there are errors between observed scores and true scores
These measurement errors are then the difference between the observed score and true score |
|
Classical test score theory formulas |
X is the observed score, T is the true score, E is the measurement error
X = T + E
Because we want to know reliability or measurement error
E = X - T |
|
Major assumption of classical test theory. |
Assumption is that measurement errors are randomly distributed around the true score
Meaning chance factors or nonsystematic error increase or decrease observed scores
If people were to repeat the same test, results would produce a normal distribution of errors around each person's true score mean -scores that greatly differ from the true score happen less often
Can estimate true score by finding the mean of the observations from repeated applications |
|
Standard error of measurement |
Tells us on average how much a score varies from the true score
The standard deviation of observed score and the reliability of the test are used to estimate the standard error of measurement |
|
Pooled standard deviations |
In a reliable test it is assumed that error distributions overlap and differ only because of true test scores
Pooled variance of these errors tells us the magnitude of the variability of the sample observed scores around the true score of the sample
Pooled standard deviations from all test takers becomes the basic measure of error present in a test
Pooled standard deviation is called standard error of measurement |
|
Standard error of measurement |
The standard error of measurement is used to calculate the range of scores around the observed score within which the true score is likely to fall
Allows for a confidence interval around the observed score -the true score will fall within approximately + or - 2 standard errors of measurement (95%)
The mean of repeated testing is the true score estimate -the SD is the standard error of measurement |
|
Domain sampling model |
Considers the problems created by using a limited number of items to represent a larger and more complicated construct
Concern is to estimate true score from a limited sample of items where sampling from the full domain is impossible |
|
Domain sampling theory used in classical test theory |
Classical theory uses elements of domain sampling
From a sample of items (repeated test scores) a true score is estimated |
|
When is domain sampling important |
Problem is how much measurement error there is from one sample of items
This is important when the sample of test items is small relative to the size of the domain of items
Reliability increases as sample size approaches the size of the domain |
|
Repeated random sampling of items from the domain |
Each test has an unbiased estimate of true score
Due to measurement and sampling error these estimates will differ
These differences will be random and normally distributed
The mean of the correlations between the various test scores is the test reliability
Do not average sample correlations directly; each correlation is converted into a z score, the z scores are averaged, and the average is transformed back into a correlation |
|
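The averaging procedure above (convert each correlation to a Fisher z score, average in z space, then convert back) can be sketched in Python; the correlation values below are hypothetical:

```python
import math

def average_correlations(rs):
    """Average correlations via the Fisher z transform rather than directly."""
    zs = [math.atanh(r) for r in rs]   # r -> z
    mean_z = sum(zs) / len(zs)         # average in z space
    return math.tanh(mean_z)           # z -> r

# Hypothetical correlations between scores from repeated item samples
print(average_correlations([0.70, 0.80, 0.90]))
```

Note that the z-transformed average comes out slightly higher than the naive arithmetic mean, because the transform stretches the scale as r approaches 1.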
What does domain sampling allow us to calculate |
Allows for the calculation of the maximum, unbiased reliability estimate that a test can achieve |
|
Sources of measurement error |
Content sampling error
Time sampling errors
Other sources of error |
|
Content sampling error |
The error that results from differences between the sample of items and the domain of items
Largest source of error
Is the easiest and most accurate source of error to estimate
Determined by how well the domain is sampled -same difficulty, sampling all components of the domain |
|
How is content sampling measurement error estimated |
Estimated by analyzing the degree of similarity among the items making up the test
To do so we analyze the correlations between test items and the examinee's standing on the construct being measured |
|
Time sampling errors |
Random changes in the test taker or testing environment can impact test performance
Reflect random fluctuations in performance from one situation to another
Limit our ability to generalize test results
Major concern since psychological tests are rarely given in the exact same environment |
|
Other sources of errors |
Include testing, administrative, and scoring errors
Clerical errors committed while adding up scores or administrative errors on an individually administered test
When scoring relies heavily on subjective judgment of the tester, subtle discriminations in scoring can happen -must calculate inter rater or inter scorer agreement |
|
Expressing reliability |
Reliability is often expressed as a correlation coefficient
It is preferable to express it as a ratio of the variance of the true score to the variance of the observed score
The reliability is the proportion of observed score variance accounted for by true score variance |
|
Equations of reliability ratio |
Rxx = variance of true score / variance of observed score
(Standard deviation squared)
Variance of the observed score = true score variance + error variance
|
|
Estimate of error variance |
Estimate of error variance = 1 - Rxx -if reliability (Rxx) is .8, error variance is .2
Means 80% of test score variance reflects true score variance and 20% reflects random nonsystematic error variance |
|
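The two cards above amount to a one-line computation; a minimal sketch with hypothetical variance values:

```python
def reliability_ratio(true_var, error_var):
    """Rxx = true score variance / observed score variance,
    where observed variance = true variance + error variance."""
    return true_var / (true_var + error_var)

# Hypothetical variances for illustration only
rxx = reliability_ratio(true_var=80.0, error_var=20.0)
print(rxx)                # proportion of observed variance due to true scores
print(round(1 - rxx, 2))  # estimate of error variance (1 - Rxx)
```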
Reliability index |
Reflects the correlation between true and observed scores
Can't be calculated directly as true scores are unknown
Is equal to the square root of the reliability coefficient
If Rxx = .81 then the index is .9 -the correlation between observed and true scores is 0.9 |
|
Know |
There are a number of ways to estimate reliability
Each measures a different aspect of random error
Which reliability estimate is chosen will depend on what the test is presumed to measure and what the test constructor wants to demonstrate |
|
Three ways to estimate test reliability |
Test retest Parallel forms Internal consistency |
|
Test retest method Also called stability coefficients |
Measures time sampling errors
Used to evaluate the error associated with administering a test at two different times
Applies only to stable traits
Susceptible to carryover effects
Follows classical test theory -theory assumes attribute stability -test score variability is construed as error variability |
|
Time intervals in test retest |
If the time interval is short, practice / carryover effects take place
If the time interval is long, random fluctuations, unknown sources of error, and changes in the construct can take place over time
There is no single best time interval
The optimal interval is determined by the way the test results are to be used and the nature of the construct |
|
In test retest what does a positive correlation mean |
The scores are stable as they are generalized across time
Low susceptibility to testing or test taker conditions
Generalize over testing environments |
|
Carryover and practice effects |
Happens when the first testing session influences scores from the second session
When there are these effects the test retest correlation usually overestimates true reliability
Only a problem when the changes over time are random -not predictable, affects some but not all
Practice effects are a form of carryover effect -test takers have sharpened their skills after the first test |
|
Test retest administration |
Administer the same test on two well specified occasions and then compute the correlation between scores from the two administrations |
|
Poor test retest correlations |
A poor correlation does not mean that a test is unreliable
It may suggest that the characteristic under study has changed |
|
Parallel forms reliability Also called alternate forms or equivalent forms |
Compares two or more different but equivalent forms of a test that measures the same attribute
Must be made of different items, but the rules used to select items are the same
Tests must be parallel in terms of content, difficulty, etc.
Makes sure that the test scores do not represent any one particular set of items or subset of items from the entire domain (content sampling error) |
|
Why is parallel forms the most informative form of reliability for psychological studies |
1) contains estimate of consistency over time
2) contains two or more samples of items from the domain
3) can estimate the error attributable to item selection
4) practice or carryover effects are reduced |
|
Nature of items in parallel forms |
Same number Cover the same domain Expressed in the same way Equal difficulty |
|
Drawbacks of parallel forms |
Practice or carryover effects change the meaning of the second test
Creation of the many items needed for parallel forms is costly and time consuming |
|
Internal consistency reliability -split half reliability |
Reflects errors related to content sampling
These estimates are based on the relationship between items within a test
Test is given once and responses are split into two halves which are correlated -congeneric tests |
|
Ways tests can be split |
First half / second half split -if test is long and items are of equal difficulty
Odd even split -if items increase in difficulty or practice effects, fatigue, or declining attention effects |
|
Problem with split half internal reliability |
This reliability is an underestimate because each subset is only half as long as the full test
An estimate of reliability would be deflated because each half would be less reliable than the whole test
Since only half of the items are used the reliability underestimates true reliability for the whole test
-test gains reliability as the number of items increase |
|
Spearman brown correction correlation |
Allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test
Corrects underestimation
Underlies the general point that reliability increases as the number of items increases
Assumes equal variances in both halves of the test |
|
Spearman brown formulation |
Rsb = 2r / (1 + r)
Rsb = the correlation between the two halves of the test if each had the total number of items (corrected split half correlation)
r = the correlation between the two halves of the test |
|
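The Spearman-Brown correction above is a one-liner; the split-half correlation used here is hypothetical:

```python
def spearman_brown(r_half):
    """Estimate full-length reliability from a split-half correlation:
    Rsb = 2r / (1 + r)."""
    return 2 * r_half / (1 + r_half)

# Hypothetical split-half correlation of .70
print(spearman_brown(0.70))
```

The corrected value always exceeds the raw split-half correlation, reflecting the point that reliability rises with test length.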
Issue with Spearman brown correction |
Assumes that there are equal variances in both halves of the test
When variances are unequal we can't use it
Instead have to use alpha |
|
Kuder Richardson KR20 |
Is a measure of internal reliability so the test only has to be given once
Considers all possible splits simultaneously -avoids the problems of split half
Can only be used for items that are scored in a dichotomous manner (0 or 1) |
|
KR20 formula |
Kr20 = (N / (N - 1)) × ((S squared - sum of pq) / S squared)
Kr20 = reliability estimate
N = number of items on the test
S squared = variance of total test score
P = the proportion of people getting each item correct
Q = the proportion of people getting each item incorrect (1 -p)
Sum of pq = sum of the products of p x q for each item on the test -pq is the variance for an individual item, so the sum of pq is the sum of individual item variances |
|
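The KR-20 formula above can be computed directly from a matrix of 0/1 responses; the response data below are hypothetical:

```python
def kr20(responses):
    """KR-20 = (N / (N-1)) * ((S^2 - sum(pq)) / S^2) for dichotomous items.
    responses: one list of 0/1 item scores per examinee."""
    n_items = len(responses[0])
    n_people = len(responses)
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n_people
    s2 = sum((t - mean_total) ** 2 for t in totals) / n_people  # total score variance
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(person[i] for person in responses) / n_people   # proportion correct
        sum_pq += p * (1 - p)                                   # single-item variance
    return (n_items / (n_items - 1)) * ((s2 - sum_pq) / s2)

# Five hypothetical examinees on a four-item test
data = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
print(kr20(data))
```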
Know |
To have reliability greater than 0, the variance for the total test score must be greater than the sum of the variances for the individual items
Only happens when there is covariance between items
Covariance happens when the items are correlated with each other
The greater the covariance, the smaller the sum of pq term will be relative to the total test variance |
|
Cronbach's coefficient alpha |
Estimates the internal consistency of tests in which the items are not scored as 0 or 1
Examines the consistency of responses to all test items regardless of how those items are scored
Can be thought of as the average of all possible split half coefficients corrected for the length of the whole test
The sum of pq is replaced with sum of individual item variances
Most general method for finding estimates of reliability through internal consistency |
|
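Coefficient alpha can be sketched the same way as KR-20, with the sum of pq replaced by the sum of individual item variances; the rating data below are hypothetical:

```python
def cronbach_alpha(responses):
    """alpha = (N / (N-1)) * (1 - sum(item variances) / total score variance).
    Works for any item scoring, not just 0/1."""
    n_people = len(responses)
    n_items = len(responses[0])
    item_vars = []
    for i in range(n_items):
        col = [person[i] for person in responses]
        m = sum(col) / n_people
        item_vars.append(sum((x - m) ** 2 for x in col) / n_people)
    totals = [sum(person) for person in responses]
    mt = sum(totals) / n_people
    total_var = sum((t - mt) ** 2 for t in totals) / n_people
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical 1-5 ratings from four respondents on three items
ratings = [[4, 5, 4], [3, 4, 3], [5, 5, 4], [2, 3, 2]]
print(cronbach_alpha(ratings))
```

On all-dichotomous data this function gives the same value as KR-20, matching the card's point that alpha reduces to KR-20 in that case.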
What is coefficient alpha sensitive to |
Content sampling measurement errors
Heterogeneity of the test content -the degree to which the test items measure unrelated characteristics -as item heterogeneity increases alpha coefficient decreases |
|
Coefficient alpha and kr20 |
Both estimate internal reliability
Kr20 is a simplified version of the alpha coefficient
Alpha coefficient reduces to Kr20 when all items are dichotomous
If all test items are dichotomous then alpha and kr20 should give identical results within rounding error |
|
What does coefficient alpha provide us with |
Gives us a lower-bound estimate of reliability
A high alpha (>.80) suggests that the true reliability is at least that high
A low alpha only means that the true reliability may still be higher
To overcome this issue, 95% confidence intervals around alpha can be constructed |
|
Reliability for a test that measures more than one trait |
Factor analysis is a popular method for dealing with this situation
When factor analysis is used correctly, these subsets will be internally consistent (highly reliable) and independent from one another |
|
Limitations of alpha coefficient |
It assumes tau equivalence (T) or a unidimensional factor structure
When tau equivalence is not met, the alpha coefficient will underestimate the test's level of reliability |
|
Tau equivalence |
All the indicators of a factor, the test items, all load or correlate in a similar manner on one dimension -item homogeneity |
|
McDonald's omega coefficient |
Does not assume tau equivalence and can be used to assess internal reliability for non equivalent tau items
Calculation is not straightforward, relying on the outcome of a structural equation model
SPSS macros are available to do the calculation |
|
Sources of error when estimating reliability for behavioral observation data |
Individuals scoring the test (judges) Rating errors Definitional issues Item sampling errors |
|
How to estimate true scores in behavior observation studies |
Are unreliable because of discrepancies between true scores and scores recorded by the observer
To address these problems we need to estimate the reliability of the observers
This is known as interrater reliability (Inter judge, inter scorer or inter observer ratings) |
|
Interrater reliability |
Estimating the consistency among judges or raters who are evaluating the behavior or output
The percentage of agreement between raters is sometimes used as a measure of interrater reliability
This is incorrect because percentages do not take into account the chance level of agreement |
|
Most common form of interrater reliability |
Is to record the percentage of times that two or more observers agree
Not the best because 1) percentage does not consider the level of agreement that would be expected by chance alone 2) percentages should not be mathematically manipulated |
|
What is the best way to assess interrater reliability |
Kappa coefficient
Indicates the actual agreement as a proportion of the potential agreement following correction for chance agreement
Is a measure of agreement between two judges who each rate a set of objects using nominal scales
A weighted coefficient is available for ordinal level data and takes into consideration how disparate the ratings are
Ranges from 1 (perfect agreement) to -1 (less agreement than can be expected on the basis of chance alone)
Less than .4 is poor. Greater than .75 is excellent |
|
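A minimal sketch of the kappa computation for two raters using nominal categories (the ratings below are hypothetical):

```python
def cohens_kappa(rater1, rater2):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from each rater's marginal category proportions
    p_chance = sum((rater1.count(c) / n) * (rater2.count(c) / n)
                   for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Two hypothetical raters classifying six cases
r1 = ["A", "A", "B", "B", "A", "B"]
r2 = ["A", "A", "B", "A", "A", "B"]
print(cohens_kappa(r1, r2))
```

Here the raw percentage agreement is 5/6, but kappa is lower because part of that agreement would be expected by chance alone.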
When are kappa coefficients used (when agreement in classification is of interest) |
When a test is administered at two different points in time to classify people into diagnostic groups -a person would be classified or assigned to a group using the obtained test scores on each occasion and the degree of agreement across times is compared via kappa
Could use two different tests on the same group of people at the same point in time, classify them separately using each set of test scores, and then compute the cross test agreement in classification with kappa |
|
Fleiss' kappa or Krippendorff's alpha |
When there are more than two raters |
|
Internal consistency |
Evaluates the extent to which the different items on a test measure the same ability
Measures of internal consistency will all give low estimates of reliability if the test is designed to measure several traits |
|
Standard error of measurement and reliability |
Reliability coefficients reflect the proportion of observed variance attributable to true score variance
Reliability coefficients are a useful way of comparing the consistency of test scores produced by different assessment procedures
The standard error of measurement is useful for interpretation
It is the standard deviation of the distribution of scores that would be obtained by one person if they were tested on an infinite number of parallel forms of a test comprised of items randomly sampled from the same domain |
|
How is standard error of measurement calculated |
= test standard deviation × the square root of (1 - test reliability)
As reliability decreases, the standard error of measurement increases -this relationship occurs because the reliability coefficient reflects the proportion of observed score variance due to true score variance and the standard error of measurement is an estimate of the amount of error in test scores
Low test reliability means a larger standard error of measurement and less confidence in the precision of the test |
|
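The SEM formula above, sketched with hypothetical scale values (an IQ-style SD of 15 and two illustrative reliabilities):

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = test standard deviation * sqrt(1 - test reliability)."""
    return sd * math.sqrt(1 - reliability)

sem_high = standard_error_of_measurement(15, 0.91)  # highly reliable test
sem_low = standard_error_of_measurement(15, 0.64)   # less reliable test
print(sem_high, sem_low)  # the less reliable test has the larger SEM
```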
Test batteries |
The standard error of measurement is needed to interpret individual scores and scores from test batteries
In test batteries a number of attributes are assessed within a single test
Test battery results are displayed in percentiles using the standard error of measurement
When interpreting test results, the use of + or - 2 standard errors of measurement is recommended -prevents over-interpretation of small test score differences |
|
Test battery SEM consistency |
In test batteries the standard error of measurement is not constant across an entire set of scores
Standard error of measurement is lower for scores close to the score mean and higher for scores at both extremes
Scores at the extremes need to be checked for accuracy |
|
Reliability for composite scores |
Many tests have multiple scores that can be combined to form composite scores
In all cases, scores on tests are combined to yield a score on a composite measure
The more scores that make up the composite, the higher the correlations between those scores, and the higher the individual test reliabilities, the higher the composite reliability |
|
The advantage of composite scores is that their reliability is the result of |
1) number of test scores in the composite 2) the reliability of the individual test scores 3) the correlation between those scores |
|
Reliability of difference scores |
There are a number of situations where researchers and clinicians want to consider the difference between two scores
Difference scores are used in pre-post experimental designs
The reliability of a difference score is often lower than the reliability of either the pre or post test
Reliability is increased when the original test measures have high reliabilities and low correlations with each other |
|
Problems with difference scores |
Not only is the reliability lower than the pre and post measures, but those measures are often highly correlated |
|
How large a reliability coefficient needs to be depends on |
The nature of the construct The amount of time available for testing How test scores will be used The method of estimating reliability |
|
High reliability |
Diagnostic tests that inform major decisions about individuals should be held to a higher reliability standard than tests used in group research or for screening large numbers of individuals
High stakes decisions demand highly reliable information (.9 to .95)
.98 on the Stanford-Binet intelligence scales for adolescents |
|
Reliability strength outside of high stakes |
Reliability estimates of .8 are considered acceptable in many testing situations and are commonly reported for group and individually administered achievement and personality tests
For teacher made classroom tests and tests used for screening, reliability estimates of at least .7 are expected
Classroom tests are combined to form linear composites to form a final grade -the reliability of composite scores is greater than the reliabilities of the individual scores |
|
What to do when reliability is too low |
Increase the number of items
Factor and item analysis
Item discrimination analysis
Correction for attenuation |
|
Increase the number of items |
Adding more items increases reliability by increasing the number of item samples from the domain
Spearman Brown prophecy formula allows us to estimate how many items need to be added
N = Rd (1 - Ro) / Ro (1 - Rd)
N = the number of tests of the same length as the original test Ro = the observed level of reliability Rd = the desired level of reliability
Once N is known, multiply it by the current number of items |
|
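The prophecy formula above, sketched in Python; the test length and reliability targets are hypothetical:

```python
def prophecy_factor(r_observed, r_desired):
    """N = Rd(1 - Ro) / (Ro(1 - Rd)): how many times longer the
    test must be to reach the desired reliability."""
    return (r_desired * (1 - r_observed)) / (r_observed * (1 - r_desired))

# Hypothetical 20-item test with reliability .70; target reliability .90
n = prophecy_factor(0.70, 0.90)
print(n)               # lengthening factor
print(round(n * 20))   # total items needed at the target reliability
```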
What does increasing the number of items depend on |
Item availability Whether the time, effort and expense is worth the increase |
|
Factor analysis and item analysis |
Reliability can be increased by eliminating those items that do not correlate with other items
Items that do not correlate are measuring a different construct than that assessed by the other items and can be dropped from the scale
This increases the homogeneity (congeneric nature) of the test
However, fewer items decrease reliability |
|
Item discrimination analysis |
Item total correlations
Correlate each item with the total test score (.2 to .3)
Items with low item total correlations (<.30) are probably not measuring what the other items are measuring -these items are dropped from the final scale
This is a point biserial correlation -one continuous variable (total score) and one true dichotomous variable (item right or wrong)
Don't want correlations higher than .35 because we don't want a few items carrying the whole test |
|
Correction for attenuation |
Used to correct for the unreliability of scores being correlated
Eliminates the error
Estimates the maximum correlation between tests if there were no measurement errors
Can be used with any reliability estimate
Referred to as the attenuation-corrected (disattenuated) coefficient |
|
Formula for correction for attenuation |
r'12 = r12 / (square root of r11) (square root of r22)
r'12 = the maximum correlation between the two tests
r12 = the observed correlation between the tests
r11 = reliability for test 1
r22 = reliability for test 2
The maximum validity correlation of a test is equal to the square root of Rxx (the reliability index) |
|
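The correction for attenuation above, sketched with hypothetical values:

```python
import math

def correct_for_attenuation(r12, r11, r22):
    """r'12 = r12 / (sqrt(r11) * sqrt(r22)): the correlation between
    two tests if both were free of measurement error."""
    return r12 / (math.sqrt(r11) * math.sqrt(r22))

# Hypothetical: observed correlation .40, test reliabilities .80 and .90
print(correct_for_attenuation(0.40, 0.80, 0.90))
```

The corrected value always exceeds the observed correlation, since unreliability in either measure attenuates the observed relationship.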
What is validity |
Refers to evidence that supports interpretation of test results as reflecting the psychological construct that the test was designed to measure
Refers to investigations into what a test measures
How well it measures what it says it measures
The degree to which evidence and theory support interpretations of test scores for the proposed uses of the test
Validity is the most fundamental consideration in developing and evaluating tests |
|
Validity and reliability |
Reliability is the stability and accuracy of test scores; it reflects the amount of random error
Reliability is a necessary but insufficient condition for validity
An unreliable test cannot produce valid interpretations -only true score variance can be reliable and related to the construct the test is supposed to measure
However, no matter how reliable a measurement is, it does not guarantee validity -validity is a characteristic of test performance, not the test |
|
Validity is not |
Indicated by the title of the test
A brief description of the test given in the test manual
Represented by a single number
High nor low, good or bad
Indicated by the nature of the items (face validity)
Is not static but a constantly moving target |
|
What is validity then |
Is what the test measures
How well it measures what it says it measures
Useful or not for certain purposes (test utility)
Is a process that involves ongoing dynamic effort to accumulate evidence for a sound scientific basis for proposed test score interpretations
Indicated by empirical associations between test scores and other measures
Reflected by a nomological net |
|
How valid are psychological tests |
Meyer (2001) Found psychological tests often provide results that are equal to or exceed the validity of medical tests
Pap smear is 0.36 while MMPI-2 is 0.37
Even when both medical and psychological tests are used to detect the same disorder, psychological tests can provide superior results
MRI for dementia is 0.57 and neuropsychological tests 0.68
Yes, psychological tests can provide information that is as valid as common medical tests |
|
Maximum validity correlation |
Reliability places limits on the magnitude of validity coefficients when a test score is correlated with a criterion or outcome variable
Using IQ to predict reading achievement, the reliability coefficient for the IQ test imposes a theoretical limit on the true value of the correlation that is equal to the square root of the reliability coefficient
Maximum validity correlation = square root of the reliability coefficient |
|
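The ceiling described above is a one-liner; the reliability value used here is hypothetical:

```python
import math

def max_validity(reliability):
    """Theoretical upper bound on a validity coefficient given test reliability."""
    return math.sqrt(reliability)

print(max_validity(0.81))  # a test with Rxx = .81 cannot correlate above .9
```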
Definition of validity from the standards for educational and psychological testing |
The degree to which evidence and theory support the interpretations of test scores for proposed uses of the tests
Validity is a unitary concept with subtypes
These subtypes represent different ways of collecting evidence to support the validity of interpretations of performance on a test, which we commonly express as a test score |
|
4 subtypes of validity |
Content related Criterion related Construct related Structural related |
|
Current view of validity |
There is a consensus that the older concept of test validity be abandoned in favor of an emphasis on the appropriateness or accuracy of interpretation of test scores
We do not refer to validity of a test but rather validity of the interpretation of test scores
We interpret responses to test items, and it is the interpretation of performance that possesses validity
When test scores are interpreted in multiple ways, each interpretation needs to be validated |
|
Importance of validity evidence |
Sources of validity evidence differ in importance according to factors like the construct being measured, the intended test usage, and the population being assessed |
|
Content related validity |
Examination of the test content to see if the test covers a representative sample of the domain being measured
Test content includes the themes, wording, and format of the items and tasks, and the administration and scoring rules |
|
What does content validity validate |
Criterion or domain referenced tests (job based tests) Tests used in education Achievement tests Aptitude tests (Employee, student, training selection, evaluation or classification) |
|
Inspection of test items is not sufficient to assess content validity as... |
Inspection cannot tell you whether all relevant subdomains in an area have been covered
Whether all objectives of the instruction have been assessed, not just subject matter
Whether there are items that assess the process by which people answer questions
Whether irrelevant sources of error in the test have been removed |
|
How is content validity assessed |
A test blueprint or test specification plan is constructed (must be written down)
Items are written according to the blueprint, then area experts are consulted about whether important content, objectives and processes are covered
Care at this stage establishes the foundation for correspondence between test content and the construct
Area experts are used to systematically review the test and reevaluate the correspondence between the test content and its construct
Administer the test and use empirical procedures to assess test and item characteristics (item difficulty, item discrimination, test item correlations)
Examine errors and check for irrelevant sources of error |
|
What does a test blueprint specify |
Areas to be covered Objectives to be tested and a rationale for inclusion Processes to be assessed Important topics and issues that warrant study |
|
Two types of errors examined in content validity test evaluation |
Item or domain relevance Item content or domain coverage |
|
What do content validity studies allow us to conclude |
The test covers a representative sample of domain relevant skills and knowledge
The obtained score is relatively free from irrelevant sources of error (reading level, item difficulty, domain irrelevancy) |
|
When we have these errors |
The presence of such errors indicates that there is construct irrelevant variance
Poorly designed tests are described as having construct underrepresentation or construct overrepresentation |
|
Criterion related validity (predictive validity) |
Refers to the degree to which a test predicts future performance or a future outcome -Future performance or outcome is the criterion
Requires an independent criterion to assess the outcome -independent means that collection and knowledge of the predictor test scores should be independent or isolated from the collection and knowledge of the criterion |
|
What happens when predictor scores and criterion scores are not independent |
Criterion contamination
This artificially inflates validity coefficients |
|
Two types of criterion related validity tests |
Predictive study - there is a passage of time between predictor (test) and the criterion (outcome)
Concurrent validity -when both are given within a short period of time
Some tests may be excellent when used in concurrent applications but poor for predictive applications and vice versa |
|
Concurrent validity |
Becomes an issue when diagnoses are being made or when the goal of testing is to determine the current status of the examinee as opposed to predicting future outcomes |
|
Criterion related predictive validity |
Becomes an issue when the test is being used for selection and classification of individuals or personnel or when prediction is the ultimate goal of assessment |
|
Know |
While there is no limit or restriction on the nature of the predictors or criterion measures, both must be reliable and valid measures of the construct to be assessed
The criterion should be viewed as the gold standard, the best existing measure of the construct
The only concern is whether there is an empirical link between predictor and criterion and whether the predictor is useful to those using the test |
|
4 classes predictor and criterion measures fall into |
Academic achievement
Specialized training
Personality and interest inventories
Simpler/shorter new tests |
|
Ratings made by judges, supervisors, interviewers |
An often used criterion measure is ratings made by judges, supervisors, teachers, advisors, interviewers, social workers, or coaches about some attribute of behaviour (someone is asked to make an evaluation of someone else)
Very often ratings, evaluations or judgments made by others of someone else's behaviour are the core of many criterion measures
Ratings are often subject to bias, but when obtained under controlled conditions, with reliability confirmed (Kappa) and validity established, ratings can be a valuable source of criterion data |
|
Cross validating sample |
In industrial and other applied settings, tests are devised in one location and used to fulfill similar functions in another
Early studies on the generalization of criterion related validity showed poor generalization or wide variability
Later studies indicate that poor generalizability reflected statistical artifacts such as small group sizes or samples that were restricted in range
It is unnecessary to do local validation studies for previously validated tests -if there is little existing research on the test then local validation may be needed |
|
When artifacts are removed |
When artifacts are removed, the predictive validity of tests that assess verbal, numerical and reasoning skills generalizes across samples
Cognitive skills that tap a common core of abilities are broadly predictive of academic and occupational outcomes |
|
Validity coefficient |
Predictive validity assessed by correlation between the test and criterion
Coefficients should be large enough to indicate that information from the test will help predict how individuals will perform in criterion
Usually .3 to .4 |
|
Coefficient of determination |
The coefficient of determination (r squared) is used to see how much variability in the criterion is explained by the correlation |
|
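The two statistics on the cards above can be computed directly; a minimal Python sketch (the predictor and criterion values are invented for illustration):

```python
from math import sqrt

def pearson_r(x, y):
    """Validity coefficient: Pearson correlation between predictor test and criterion."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

test_scores = [55, 60, 65, 70, 75, 80]       # hypothetical predictor scores
criterion = [2.1, 2.4, 2.3, 3.0, 3.2, 3.5]   # hypothetical outcome (e.g., GPA)
r = pearson_r(test_scores, criterion)        # validity coefficient
r_squared = r ** 2                           # coefficient of determination
```

Note that a real validity study would use far larger samples; the r of roughly .96 here is artificially high because the toy data were chosen to correlate.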
Reasons why validity coefficients are small and transfer poorly across situations |
1) Differences in sample size and characteristics of the initial validating and cross validating samples
2) Validity, reliability and appropriateness of the criterion
3) The manner of measuring the criterion, nature of the job or curriculum, and people who take the test all affect generalizability
If generalizability is low consider differential prediction (does the test predict better for some groups than others)
Coefficients may be small but still may be useful |
|
Hunter 1984 |
Demonstrated the far reaching implications for worker productivity, and subsequently the US gross domestic product, if employers used employment tests to place workers in the best jobs, even if the employment tests had validity coefficients only in the .20s or .30s
Even though the validity coefficients for these employment tests are small, and certainly well below what is acceptable for a reliability coefficient, the very practical relationship they have with an overall increase in gross domestic product justifies their use
It is important to emphasize the context of assessment, measurement and prediction in deciding whether a validity coefficient is large enough to be useful for application |
|
Criterion and accuracy |
The criterion must be the most accurate measure available of what you are trying to predict if it is to serve as the standard for the behaviour
If such a criterion exists, then only expense would justify using a lesser outcome measure
Using a lesser criterion for less than adequate reasons fails to do justice to the test, clientele, patient or student |
|
The development of what concepts mean and how to measure them with criterion related validity |
Tests in personality, social or developmental psychology involve the simultaneous development of what the concept means and how to measure the concept
These two goals cannot be met by criterion related validity
The meaning and measurement of a construct require the processes of construct related validity |
|
Constructs in construct related validity |
Constructs are used in all sciences to explain why things happen
In psychology, constructs are used to explain why people do the things they do or to explain test responses
Constructs organize and explain observed behaviour |
|
Frequently used constructs |
Intelligence
Learning
Love
Anxiety |
|
How are constructs built |
Constructs are not physically real
They are built up through the accumulation of evidence and the integration of that evidence into some theoretically meaningful pattern (nomological net) |
|
How is construct validity established |
Construct validity evidence is established through activities by which the construct is defined and measured at the same time
This is particularly relevant when there is no consensus as to what measures adequately define the construct |
|
Pap 1953 ideas of open concepts |
Pap argued that psychological constructs like love, intelligence, extraversion, schizophrenia and anxiety are open
They are not clear and well defined
They are not necessarily problematic |
|
Open concepts are characterized by |
Having intrinsically fuzzy boundaries
A large extendable and variable list of indicators
An unclear inner nature |
|
Concepts become less open concepts, better defined and understood as... |
Evidence is gathered as to the meaning of the concept
How well the measures of the concept clarify the meaning of the concept |
|
What is construct validity |
Is the degree to which a test measures the trait or attribute under consideration -the trait or attribute is theoretical
Construct validity is not a single number; there is no single coefficient that indicates construct validity
Because constructs are open concepts, over a series of studies the meaning of the construct begins to emerge
The results and observations over time gradually clarify what the test is measuring and what the scores mean (nomological net) |
|
What does construct validity do |
It gradually accumulates information gathered from a variety of sources that relate to the construct -not just experiments
The development of construct validity entails the measurement and understanding of a concept moving from an open to a more closed concept
Any information that aids in explaining the underlying construct is appropriate for construct validity |
|
Construct related validity based in on internal structure evidence |
Examination of the internal structure of a test (or battery of tests) -whether the relationships between test items (test battery components) are consistent with the construct the test is designed to measure
By examining internal structure, the actual internal structure can be compared with the hypothesized structure of the construct the test is supposed to measure |
|
Age differentiation |
Tests created to assess developmental processes use age differences to validate the underlying construct
Used to validate achievement, aptitude and ability tests
The assumption is that measures of the construct should increase with age or cognitive ability
Age differentiation is necessary but not a sufficient evidence for internal structure validity |
|
Problem with age differentiation |
Failure to find age differences may mean the test is invalid
But the presence of age differences does not indicate whether developmental change was due to the underlying construct or its presumed structure |
|
New test old test correlation |
When a new test of a construct is developed, correlations with the old test can help validate the new measure
Assumes the old test has established construct validity
Correlations should be moderately high (.3 to .4) -don't want them too high as it means it's the same test -don't want to inherit the old test's measurement error
The new test should also be correlated with other measures to see if it is free from irrelevant sources of error |
|
Indirect validity |
What the test scores should not correlate with
Doesn't tell you what you are measuring, just what it is not
High correlations would render the construct validity of the test suspect
Low correlations do not necessarily ensure internal structure construct validity |
|
Factor analysis |
Statistical procedure used to determine the number of conceptually distinct factors
Allows one to evaluate the presence and structure of any latent constructs existing among a set of variables
The latent construct underlies and is partly responsible for the way examinees respond to the variables that make up the factor
Used to validate the internal structure of test batteries |
|
Process of factor analysis |
Begins with a correlation matrix that contains the intercorrelations among the individual items
Produces as many factors as there are variables
There are statistical guidelines for determining the number of factors to retain
The most important guideline is "does the factor solution make psychological sense" |
|
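A minimal sketch of the first steps of a factor analysis as the card describes, using NumPy's eigendecomposition of a hypothetical correlation matrix (the matrix values are invented; real analyses use dedicated routines with factor rotation):

```python
import numpy as np

# Hypothetical correlation matrix for 4 items: items 1-2 and items 3-4
# intercorrelate strongly, suggesting two latent factors.
R = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
])

eigenvalues, eigenvectors = np.linalg.eigh(R)  # eigh: R is symmetric
# One common statistical guideline (the Kaiser criterion): retain
# factors whose eigenvalue exceeds 1.
n_factors = int((eigenvalues > 1.0).sum())
print(n_factors)  # → 2
```

The two-cluster structure of R yields two eigenvalues above 1, matching the "does the factor solution make psychological sense" check: items 1-2 and items 3-4 each load on their own factor.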
Does thr factor solution make psychological sense |
The variables loading on a factor share a common theme or meaning
A psychologically uninterpretable factor solution has little practical value and is unlikely to provide validity evidence |
|
Exploratory and confirmatory factor analysis |
If the theory underlying test construction hypothesized 3 factors and 3 factors emerge from the confirmatory factor analysis, the test is said to have factorial validity -the factor structure of the test is supported
There are a number of statistics available that statistically test the fit or match between the actual and hypothesized factor structure |
|
Internal consistency |
Internal consistency methods are used to assess construct validity
Measures of internal consistency reflect the homogeneity of test items -indicates something about the domain from which the items were drawn
Does not tell you about the underlying construct from that domain
It tells you what domain the items came from, not what the construct means |
|
Ways of measuring internal consistency |
Extreme groups analysis -on personality, social or abnormal psychology tests, validity is established by showing that those scoring high or low differ on some criteria
Item total correlations -on ability, personality or social psychology tests -examine item total correlations for pass/fail scored items -point biserial correlation for dichotomously scored items -.2 to .35 |
|
Convergent divergent validity (discriminant) |
Campbell and Fiske argue that construct validity is shown when -the test correlates positively with tests that it should correlate with (convergent) -and shows low or zero correlations with measures that it theoretically should not correlate with (divergent)
Most convincing method to demonstrate construct validity |
|
How to do convergent divergent analysis |
Creation of a multitrait-multimethod matrix
We measure two or more traits by two or more different methods
Validity coefficients should be greater than correlations between different traits measured by different methods and also greater than different traits measured by the same method |
|
Common method variance and construct variance |
The multitrait-multimethod matrix also provides evidence on common method variance and the distinction between method and construct variance |
|
Method variance |
Refers to the correlation between measures due to their common assessment procedures |
|
Common methods |
The triangles indicating different traits, same method reflect common methods
Correlations between different traits measured by different methods (lower left triangle)
Different traits measured by the same method (immediate off diagonal triangle) |
|
How is construct validity different than predictive or content validity |
It focuses on the role of theory in test construction and the need to formulate hypotheses that can be tested in validation studies
It is not based on subjective or intuitive judgments or rationalizations independent of the data |
|
Construct validity evidence based on response processes |
Validity evidence based on response processes involves an analysis of the fit between the psychological processes examinees engage in while responding and the construct being assessed
Collect this evidence by interviewing examinees about their response processes and strategies, recording behavioural indicators such as response times and eye movements, or analyzing the types of errors committed |
|
Ways to categorize types of questions |
Objective subjective distinction classification
Selected response or constructed response classification -when creating test items the overriding goal is to develop items that measure the specified construct and contribute to psychometrically sound tests |
|
Objective subjective distinction categorization |
Often used
The difficulty with this categorization is that it is sometimes difficult to say whether a question is subjective or objective |
|
Selected response or constructed response classification categorization |
If an item requires an examinee to select a response from available alternatives it is classified as a selected response item (multiple choice, true or false, matching)
If an item requires examinees to create or construct a response it is classified as a constructed response item (fill in the blank, short answer, essay, oral examination, interview) |
|
Advantages of selected response questions |
A large number of items can be answered in a short time period -can include more items from the domain to increase reliability
Items are flexible and can be used to assess a wide range of constructs with greatly varying levels of complexity
Decrease the influence of certain construct irrelevant factors that can impact test scores (writing abilities) |
|
Limitations of selected responses |
Items are challenging to write -for multiple choice tests it can be difficult to come up with foils that are plausible yet incorrect
There are some constructs that cannot be measured using selected response items
Blind guessing and random responding are seen in such items |
|
Types of constructed response items |
Short answers
Essays
Performance assessments
Portfolio assessment |
|
Short answer |
Items can take a number of forms -putting a word, phrase or number in response to a direct question or to complete a sentence |
|
Performance assessments |
Require examinees to complete a process or produce a product in a context that closely resembles real life situations |
|
Portfolio assessments |
Involves the systematic collection of examinee work products over a specified period of time according to a specific set of guidelines
Writers, artists, architects |
|
Strengths of constructed response assessments |
Items are easy to write -but developing a framework for how to properly score responses can take time and effort
Work well for assessing higher order cognitive abilities, complex task performance, and tasks that require a constructed response format like problem solving
Items eliminate guessing and random responding |
|
Limitations of constructed response classification |
Take more time to complete; as a result you are not able to sample the content domain as thoroughly -less reliable and time consuming to score
Do eliminate blind guessing but are vulnerable to faking or creative construction when answers are unknown
Vulnerable to influence of extraneous or construct irrelevant factors that can impact test scores (writing abilities) |
|
How to select an item format |
The key factor in selecting an item format involves identifying the format that most directly measures the construct
Select the item format or task that will be the most direct measure of the construct of interest
Selected response items are recommended since they allow broader sampling of content domain and more objective and reliable scoring procedures |
|
Writing items |
There is an art to writing good items |
|
Rules for writing items |
Textbook |
|
Types of selected response test questions -which one is selected depends on the area of inquiry |
Dichotomous items -scored as either true or false, agree or disagree
Polytomous items
Likert format |
|
Dichotomous items |
Advantages include simple to understand, score and administer
Appears in education and personality tests where absolute judgments are required
On maximal performance tests items should be easily scored (correct/incorrect) according to a scoring criterion -all are scored in an objective manner and are classified as objective questions
Disadvantages include increased chance of guessing and many items need to be written -true/false items encourage superficial understanding
Because of guessing these items are less reliable and less precise than other formats |
|
Polytomous items |
Several response alternatives are given (multiple choice)
One alternative is preferred and the others are wrong or not indicative of the construct or answer
3 alternatives (distractors, foils) in addition to the correct answer maximizes item and test reliability and discrimination between test takers
Reliability of tests in this format is constrained by guessing -a correction for guessing may be used |
|
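The correction for guessing mentioned above is conventionally the formula score R − W/(k − 1), where R is the number right, W the number wrong, and k the number of options; a minimal sketch (function name and example numbers are mine):

```python
def formula_score(n_right, n_wrong, n_options):
    """Correct a raw score for blind guessing on k-option multiple choice items.

    A wrong answer implies a guess; a blind guesser gets about 1 item right
    per (k - 1) items wrong, so that expected gain is subtracted back out.
    """
    return n_right - n_wrong / (n_options - 1)

# 30 right and 12 wrong on 4-option items: 30 - 12/3
print(formula_score(30, 12, 4))  # → 26.0
```

Omitted items are neither right nor wrong, so they simply do not enter the correction.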
Likert format |
5 to 7 alternatives are given such as strongly agree, agree, don't know, etc.
The test taker chooses the alternative that comes closest to their attitude toward the issue
Negatively worded items are reverse scored
Used for personality and attitude tests |
|
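Reverse scoring of negatively worded Likert items, as the card describes, maps each response onto the opposite end of the scale; a minimal sketch for a 1-5 scale (the responses are invented):

```python
def reverse_score(response, scale_min=1, scale_max=5):
    """Reverse-score a negatively worded Likert item (1 <-> 5, 2 <-> 4, 3 stays)."""
    return scale_min + scale_max - response

responses = [5, 2, 4, 1, 3]  # raw responses to a negatively worded item
print([reverse_score(r) for r in responses])  # → [1, 4, 2, 5, 3]
```

After reversing, a high total score means more of the trait on every item, so item responses can be summed directly.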
Item analysis |
General procedure for methods used to evaluate specific items or groups of items (not whole test)
All tests need to undergo this as it is useful for test developers to decide which items to keep on a test and which items to modify or eliminate
Improving the quality of individual items improves the overall test quality
Tells you about test item homogeneity |
|
Components of item analysis |
Item difficulty
Item discrimination
Distractor analysis
These components are not independent of one another |
|
Item difficulty |
The proportion of people who pass an item
p = number who get the item right / number of examinees
Ranges from 0 to 1, with a lower proportion meaning a more difficult item -p of 0 or 1 provides no information
Optimal level of difficulty is .3 to .7
To differentiate test takers, items must range in difficulty |
|
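The item difficulty index on the card above is just a proportion; a minimal sketch over 0/1-scored responses (the data are invented):

```python
def item_difficulty(responses):
    """p = number who pass the item / number of examinees (0/1 scoring)."""
    return sum(responses) / len(responses)

item = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # 10 examinees, 7 correct
p = item_difficulty(item)
print(p)                 # → 0.7
print(0.3 <= p <= 0.7)   # falls within the optimal difficulty range → True
```

An item everyone passes (p = 1) or everyone fails (p = 0) tells you nothing about differences between examinees.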
How hard an item should be depends on |
1) probability of correct response by chance
2) what the test is designed to do |
|
Item difficulty need to consider item guessing |
To take into consideration the effects of guessing, the optimal item difficulty level for selected response items is set higher than for constructed response items
Lord argues that for 4 option multiple choice items the average p should be about .74 with a range of .64 to .84 |
|
Percent endorsement |
The item difficulty index is only applicable to maximal performance tests where items are scored correct/incorrect
For typical response tests, the percent of examinees endorsing an item is reported instead |
|
Item discriminability |
Refers to how well an item discriminates among test takers who differ on the construct being measured by the test
Above .40 is considered excellent
.30 to .39 is good
.11 to .29 is fair
0 to .10 is poor |
|
Two item discriminability statistics |
Item total correlations
Discrimination index
-separate test takers into top and bottom scoring groups
-compute a discrimination index for each item
D (discrimination index) = proportion of the top group who got the item correct - proportion of the bottom group who got it correct
-remove or revise inverted (negative d) items |
|
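The discrimination index D from the card above compares pass rates in the top and bottom scoring groups; a minimal sketch (group data invented):

```python
def discrimination_index(top_responses, bottom_responses):
    """D = proportion correct in top group - proportion correct in bottom group."""
    p_top = sum(top_responses) / len(top_responses)
    p_bottom = sum(bottom_responses) / len(bottom_responses)
    return p_top - p_bottom

top = [1, 1, 1, 1, 0]     # 80% of high total scorers pass the item
bottom = [1, 0, 0, 1, 0]  # 40% of low total scorers pass
d = discrimination_index(top, bottom)
print(round(d, 2))  # → 0.4
```

A negative D (an inverted item) means low scorers outperform high scorers on the item, flagging it for removal or revision.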
Item discrimination and difficulty |
Item discrimination indices are biased in favor of items with intermediate difficulty levels (p values)
Because of the relationship between p and d, items that have excellent d values will have p values between .2 and .8 |
|
Item total correlations coefficients |
Point biserial correlation (rbis) -correlation between a true dichotomy (yes or no) and a continuous measure (adjusted test score)
Large correlations mean that an item is measuring the same construct as the overall test and discriminates between individuals with high and low construct ability
If the correlation is low (below .2 to .35) the item should be eliminated
Allows the test developer to select items that discriminate between respondents that are high and low on the construct |
|
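The item-total (point-biserial) correlation above is an ordinary Pearson correlation between a 0/1-scored item and the total score with that item removed; a sketch with invented data:

```python
from math import sqrt

def point_biserial(item, totals):
    """Pearson r between a dichotomous item (0/1) and continuous scores.

    `totals` should be the adjusted total: test score minus this item,
    so the item is not correlated with itself.
    """
    n = len(item)
    mi, mt = sum(item) / n, sum(totals) / n
    cov = sum((i - mi) * (t - mt) for i, t in zip(item, totals))
    si = sqrt(sum((i - mi) ** 2 for i in item))
    st = sqrt(sum((t - mt) ** 2 for t in totals))
    return cov / (si * st)

item = [1, 1, 1, 0, 0, 0]   # pass/fail on one item
rest = [9, 8, 7, 3, 2, 1]   # total score with the item removed
print(round(point_biserial(item, rest), 3))  # → 0.965
```

Here high scorers pass the item and low scorers fail it, so the correlation is high and the item would be retained.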
Item total correlations with typical response tests |
On a test designed to measure sensation seeking using true or false items, 1 means yes (high sensation seeking) and 0 means no (low sensation seeking) |
|
Distracter analysis |
On multiple choice tests incorrect alternatives are referred to as distractors (foils) since they distract those who don't actually know the correct response
Allows you to examine how many examinees in the top and bottom groups selected each option on a multiple choice item
Good distractors should show negative item discrimination and be selected by some examinees |
|
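Distractor analysis, as described above, tallies option choices separately for the top and bottom scoring groups; a minimal sketch (the options and choices are invented):

```python
from collections import Counter

# Option chosen by each examinee on one item; "B" is the keyed answer.
top_group = ["B", "B", "B", "A", "B", "C", "B", "B"]
bottom_group = ["A", "C", "B", "D", "A", "C", "B", "D"]

for label, group in [("top", top_group), ("bottom", bottom_group)]:
    counts = Counter(group)
    print(label, {opt: counts.get(opt, 0) for opt in "ABCD"})
# A good distractor is chosen by some examinees, more often in the bottom
# group (negative discrimination); one nobody picks adds nothing to the item.
```

Here the keyed answer B discriminates well (6 of 8 top vs 2 of 8 bottom), while a distractor chosen by no one (D in the top group) is doing no work.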
Relationship between item distractors, item difficulty and discrimination |
The selection of distractors impacts both item difficulty and discrimination
Distractor quality significantly impacts item difficulty and consequently item discrimination |
|
Item validity |
Becomes an issue when the test constructor wants to maximize criterion or predictive validity
Is assessed by the point biserial correlation between responses to each test item and a criterion measure, multiplied by the item's standard deviation
rbis × SD |
|
Item reliability |
Assesses the internal consistency of the test
rbis × standard deviation of the item |
|
Item response theory (IRT ) -alternative to item analysis |
The construct underlying the test responses can be known by observing performance on the test items
Responses to test items are explained by latent traits
Models the probability of a correct answer (or saying yes to a personality test item) as a mathematical function of the examinee's standing on the latent trait
The latent trait (theta) is assumed to be unobservable, known only through test responses |
|
Latent trait |
A latent trait is an ability or trait that is inferred to exist based on theories of behaviour or empirical evidence (can't be assessed directly)
Those with more of the latent trait should get more items correct than those who score low on the construct |
|
Two parameters of Item response theory |
Person parameter -constructed as a single latent trait or dimension
Item parameters -are Item difficulty, Item discrimination and Item guessing |
|
Item response curves or item characteristic curves |
Each item can have a graph created that maps the probability of getting the item correct against the latent dimension
S shaped curves |
|
Three most common item parameters in Item response theory |
Item discrimination -the slope function
Item difficulty -the location function -has received the most attention
Item guessing |
|
Rasch models |
Models relating item difficulty to probability of getting the item correct |
|
Item difficulty in item response theory |
Assumes that only item difficulty differentiates between items and all items are similar in slope and item guessing
Item difficulty is determined at the point of median probability -the ability level at which 50% of respondents endorse the correct answer (the inflection point) |
|
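The Rasch (one-parameter logistic) model above writes the probability of a correct response as a logistic function of ability minus difficulty, P(θ) = 1 / (1 + e^−(θ−b)); at θ = b the probability is exactly .50, the inflection point the card describes. A minimal sketch:

```python
from math import exp

def rasch_probability(theta, b):
    """P(correct | ability theta, item difficulty b) under the Rasch model."""
    return 1.0 / (1.0 + exp(-(theta - b)))

b = 1.0                                   # a fairly hard item (typical b range: -3 to +3)
print(rasch_probability(1.0, b))          # ability equals difficulty → 0.5
print(rasch_probability(3.0, b) > 0.8)    # high ability, likely correct → True
print(rasch_probability(-1.0, b) < 0.2)   # low ability, unlikely correct → True
```

Shifting b to the right slides the whole S-shaped curve toward higher ability, which is exactly what the ICC card below describes for harder items.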
Know |
On an ICC, difficult items are hard or unlikely to be endorsed and are shifted to the right of the scale, indicating the higher ability of the respondents who answer them correctly
Easier items are shifted more to the left of the ability scale
Typical range: -3 to +3 |
|
Item discrimination in item response theory |
Determines the rate at which the probability of endorsing a correct item changes with given ability levels
This parameter is imperative in differentiating between individuals possessing similar levels of the latent construct
The purpose is to include items with high discrimination in order to be able to map individuals along the continuum of the latent trait (0 to 2) |
|
Know |
What the item response curves mean and why people respond as they do depends on thr theory that lead to test development To avoid the impression that the latent traits are real terms such as item response analysis and item respond theory are used |
|
Advantages of an Item response approach to item analysis |
Item analysis using classical theory uses correlations, means and variances, which are sample dependent and may not generalize to other samples
In an item response analysis, item characteristics are independent of the participant sample
What we want are items that are sample independent (sample invariant), known as invariance of item parameters -items can be given to people of varying ability, known as population invariance |
|
Invariance |
To create invariance, item characteristics need to be independent of the sample characteristics, known as anchoring -structural validity
Items can be given to people of varying ability, known as population invariance |
|
Significant contributions of item response theory |
In the area of computer administered tests, known as computer adaptive testing
In the detection of item or test biases -ICCs for different groups are generated and statistically compared to determine the degree of differential item functioning |
|
Items too hard |
Frequently, subjective judgments are made about the suitability of items despite all the effort of item analysis
An item is too hard, too easy, or too demeaning to minority groups
These subjective reviews have not been successful in predicting item difficulty or discrimination (need statistical analysis)
Hard items are not necessarily biased or unfair |
|
Category format |
Used in observational studies, developmental and organizational psychology
The scale consists of 10 category responses (on a scale from 1 to 10)
Category response scales are influenced by context and anchoring effects -ratings are influenced by the behaviour of the other people being rated, and once rated, other ratings become anchored at one end or the other
Overcome this by labeling the endpoints and middle of the scale to remind respondents of the definitions
Increasing the number of categories decreases reliability and validity because more categories call for more discrimination than is possible |
|
Checklists and Q Sort |
In a checklist, respondents are given a list of objects and asked to indicate which ones are self descriptive -a dichotomous scoring method
Q sort is when a number of cards, each containing a trait, are sorted along a 9 category scale labeled from least to most -constraints are imposed on the number of attributes in each category |