105 Cards in this Set
- Front
- Back
Assumptions About Psychological Testing |
1.) Psychological States and Traits Exist 2.) Traits and States Can Be Quantified and Measured 3.) Test-related Behavior Predicts Non-Test-Related Behavior 4.) Tests Have Strengths and Weaknesses 5.) Error is Part of Assessment 6.) Testing Can Be Conducted in a Fair Manner 7.) Testing and Assessment Benefit Society |
|
Norm-Referenced Testing and Assessment |
a method of evaluation and a way of deriving meaning from test scores by evaluating an individual test taker's score and comparing it to scores of a group of test takers. |
|
Norms |
the test performance data of a particular group of test takers that are designed for use as a reference when evaluating or interpreting individual test scores (the normative sample is the reference group) |
|
Stratified Sampling |
Sampling that includes different subgroups, or strata, from the population -in stratified-random sampling, members are drawn at random from within each stratum, so each member of a stratum has an equal opportunity of being included in the sample |
|
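The two-step logic above (divide the population into strata, then draw randomly within each) can be sketched in Python; the grade-level strata and sample sizes here are hypothetical:

```python
import random

def stratified_random_sample(population, stratum_of, per_stratum, seed=0):
    """Group the population into strata, then draw members at random
    from within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))
    return sample

# Hypothetical norm group: (student id, grade) pairs across two strata.
population = [(i, "9th" if i < 60 else "10th") for i in range(100)]
sample = stratified_random_sample(population, lambda m: m[1], per_stratum=5)
```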
Purposive Sample |
Arbitrarily selecting a sample that is believed to be representative of the population (not a random selection) |
|
Incidental/ convenience sample: |
A sample that is convenient or available for use. May not be representative of the population -generalization of findings must be made with caution |
|
Standardization |
the process of administering a test to a representative sample of test takers for the purpose of establishing norms |
|
Sampling |
when test developers select a population, for which the test is intended, that has at least one common, observable characteristic |
|
Percentile Norms |
percentile: the percentage of people whose score on a test or measure falls below a particular raw score -popular for tests because they are easily calculated -differences between raw scores may be minimized at the ends and exaggerated in the middle of the distribution |
|
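As a minimal sketch of the definition above (the norm-group scores are hypothetical):

```python
def percentile_rank(norm_scores, raw_score):
    """Percentage of people in the norm group whose score falls
    below the given raw score."""
    below = sum(1 for s in norm_scores if s < raw_score)
    return 100.0 * below / len(norm_scores)

# Hypothetical norm group of 10 scores.
norm_group = [55, 60, 62, 65, 70, 72, 75, 80, 85, 90]
rank = percentile_rank(norm_group, 72)  # 5 of 10 scores fall below 72
```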
Other Types of Norms |
-age norms -grade norms -national norms -national anchor norms -subgroup norms -local norms |
|
Fixed Reference Group Scoring Systems |
The distribution of scores obtained on the test from one group of test takers is used as the basis for the calculation of test scores for future administrations of the test |
|
Reliability |
consistency in measurement |
|
Reliability Coefficient |
an index of reliability; a proportion that indicates the ratio between the true score variance on a test and the total variance. Observed score = true score plus error (X = T + E) |
|
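The ratio can be illustrated with toy numbers, assuming hypothetical true scores plus errors that happen to be uncorrelated with them:

```python
from statistics import pvariance

# X = T + E: each observed score is a hypothetical true score plus error.
true_scores = [10, 10, 14, 14]
errors      = [1, -1, 1, -1]   # mean zero, uncorrelated with true scores
observed    = [t + e for t, e in zip(true_scores, errors)]

# Reliability coefficient: true score variance / total observed variance.
reliability = pvariance(true_scores) / pvariance(observed)
```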
Variance |
true variance plus error variance (the standard deviation squared)
|
|
Measurement Error |
all the factors associated with the process of measuring some variable, other than the variable being measured. random error: caused by unpredictable fluctuations and inconsistencies of other variables. systematic error: error that is constant or proportional and can therefore be accounted for |
|
Sources of Error Variance |
-Test construction -Test administration -Test scoring and interpretation -sampling error -methodological error |
|
Test-Retest Reliability |
an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test -good for variables that should be stable over time -estimates decrease over time -after 6 months, estimate is called the coefficient of stability |
|
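A sketch of the estimate, correlating hypothetical score pairs from two administrations:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between paired scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores from the same five people, tested twice.
time1 = [88, 92, 75, 80, 95]
time2 = [86, 94, 78, 79, 96]
r_test_retest = pearson_r(time1, time2)
```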
Parallel-Forms |
-for each form of the test, the means and the variances of observed test scores are equal |
|
Alternate-Forms |
different versions of a test that have been constructed so as to be parallel. do not meet the strict requirements of parallel forms, but typically item content and difficulty are similar between tests |
|
Coefficient of Equivalence |
the degree of the relationship between various forms of a test |
|
Split-Half Reliability |
obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once. step 1.) divide the test into equivalent halves step 2.) calculate a Pearson r between the two halves step 3.) adjust the half-test reliability using the Spearman-Brown formula |
|
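The three steps can be sketched as follows, using an odd-even split and a hypothetical matrix of 0/1 item scores:

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x)) * sqrt(sum((b - my) ** 2 for b in y))
    return num / den

def split_half_reliability(item_scores):
    """item_scores: one row of item scores per test taker."""
    odd  = [sum(row[0::2]) for row in item_scores]   # step 1: split into halves
    even = [sum(row[1::2]) for row in item_scores]
    r_half = pearson_r(odd, even)                    # step 2: Pearson r
    return (2 * r_half) / (1 + r_half)               # step 3: Spearman-Brown

# Hypothetical 0/1 scores: 3 test takers x 4 items.
scores = [[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 0, 0]]
```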
Spearman-Brown formula |
allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test |
|
Inter-Item Consistency |
The degree of relatedness of items on a test. Able to gauge the homogeneity of a test |
|
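One widely used index of inter-item consistency is coefficient alpha; a minimal sketch (the score matrices are hypothetical):

```python
from statistics import pvariance

def coefficient_alpha(item_scores):
    """item_scores: one row per test taker, one column per item.
    alpha = (k / (k - 1)) * (1 - sum of item variances / total variance)."""
    k = len(item_scores[0])
    columns = list(zip(*item_scores))
    sum_item_vars = sum(pvariance(col) for col in columns)
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

# Hypothetical 0/1 scores: a perfectly homogeneous two-item test.
homogeneous = [[1, 1], [0, 0], [1, 1], [0, 0]]
```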
Inter-Scorer Reliability |
The degree of agreement or consistency between two or more scorers with regard to a particular measure. Often used with behavioral measures, guards against biases. |
|
Classical Test Theory (CTT) |
most widely used true-score model due to its simplicity
|
|
True Score |
a value that according to CTT genuinely reflects an individual's ability or trait level as measured by a particular test |
|
Domain-Sampling Theory |
estimates the extent to which specific sources of variation under defined conditions are contributing to the test score
|
|
Generalizability Theory |
based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation |
|
Item-Response Theory (IRT) |
provides a way to model the probability that a person with X ability will be able to perform at a level of Y |
|
Discrimination |
the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or other variable being measured -considered alongside item difficulty, which relates to how easily an item is accomplished |
|
Standard Error of Measurement (SEM) |
provides a measure of the precision of an observed test score. an estimate of the amount of error inherent in an observed score or measurement -the higher the reliability, the lower the SEM |
|
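A common formula for the SEM is SD × √(1 − r); a sketch with hypothetical values:

```python
from math import sqrt

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * sqrt(1 - reliability)

# Hypothetical IQ-style scale: SD = 15, reliability = .91.
error = sem(15, 0.91)
```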
Confidence Interval |
a range or band of test scores that is likely to contain the true score |
|
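Built from the SEM, assuming an approximately normal error distribution (values hypothetical; z = 1.96 gives roughly 95% confidence):

```python
from math import sqrt

def confidence_interval(observed, sd, reliability, z=1.96):
    """Band around an observed score that is likely to contain the true score."""
    sem = sd * sqrt(1 - reliability)
    return observed - z * sem, observed + z * sem

low, high = confidence_interval(observed=100, sd=15, reliability=0.91)
```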
Standard Error of Difference |
a measure that can aid a test user in determining how large a difference in test scores should be expected before it is considered statistically significant |
|
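A standard way to compute it combines the two tests' SEMs (the sd and reliabilities are hypothetical; both tests are assumed to be on the same score scale):

```python
from math import sqrt

def sed(sd, r1, r2):
    """Standard error of the difference: sqrt(SEM1^2 + SEM2^2)."""
    return sqrt(sd ** 2 * (1 - r1) + sd ** 2 * (1 - r2))

# A difference larger than ~1.96 SEDs is unlikely to be chance alone.
difference_needed = 1.96 * sed(15, 0.91, 0.84)
```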
SED Answers These Questions: |
1.) How did this individual's performance on test 1 compare with his or her performance on test 2? 2.) How did this individual's performance on test 1 compare with someone else's performance on test 1? 3.) How did this individual's performance on test 1 compare with someone else's performance on test 2? |
|
Validity |
a judgement or estimate of how well a test measures what it purports to measure in a certain context |
|
Validation |
the process of gathering and evaluating evidence about validity |
|
3 Types of Validity |
1.) Content Validity: a measure of validity based on an evaluation of the subjects, topics, or content covered by the items in a test 2.) Criterion-Related Validity: a measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests 3.) Construct Validity: a measure of validity arrived at by executing a comprehensive analysis of: a.) how scores on the test relate to other test scores and measures and b.) how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure |
|
Face Validity |
a judgement concerning how relevant the test items appear to be -self-report tests tend to be high, the Rorschach is low -low face validity can lead to low confidence in a test's validity |
|
Test Blueprint |
a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, etc. |
|
Criterion |
the standard against which a test or test score is evaluated. -a criterion should be relevant, valid, and uncontaminated, meaning it is not based in part on the predictor itself |
|
Concurrent Validity |
an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently) |
|
Predictive Validity: |
an index of the degree to which a test score predicts some criterion, or outcome, measure in the future. Tests are evaluated as to their predictive validity. |
|
Validity Coefficient |
a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure -affected by restriction or inflation of range |
|
Incremental Validity |
the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use |
|
Expectancy Tables |
show the percentage of people within specified test-score intervals who subsequently were placed in various categories of the criterion (pass vs fail categories) |
|
Evidence of Construct Validity |
-homogeneity: how uniform a test is in measuring a single construct -changes with age/changes over time -pretest/posttest changes: scores change as a result of experience -from distinct groups: scores vary because of membership in a group
|
|
Convergent Evidence |
scores on the test undergoing construct validation tend to correlate highly in the predicted direction with scores on older, more established tests designed to measure the same/ similar construct |
|
Discriminant Evidence |
validity coefficient showing little relationship between test scores and other variables with which scores on the test should not theoretically be correlated |
|
Factor Analysis |
a new test should load on a common factor with other tests of the same construct |
|
Bias |
a factor inherent in a test that systematically prevents accurate, impartial measurement -implies systematic variation in tests -prevention during development is best cure |
|
Rating Error |
a judgement resulting from the intentional or unintentional misuse of a rating scale |
|
Halo Effect |
a tendency to give a particular person a higher rating than he or she objectively deserves because of a favorable overall impression |
|
Fairness |
The extent to which a test is used in an impartial, just, and equitable way |
|
Utility |
the usefulness or practical value of testing to improve efficiency |
|
Factors Affecting Utility |
-psychometric soundness: high criterion validity, higher utility -costs -benefits
|
|
Costs of Utility |
-money -time -place you're using -doctor insurance -skill of assessor -travel |
|
Benefits of Utility |
-conclusive outcome -improved well-being -better functioning in life -better vocational placement -increase in performance -assessment is worth it |
|
Utility Analysis |
a family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment -ask question "what test will give us the most bang for our buck" |
|
Taylor-Russell Tables |
provide an estimate of the % of employees hired by the use of a particular test who will be successful at their jobs, given the test's validity (validity coefficient), the selection ratio (a # that represents the # of people to be hired vs. the # of people available to be hired), and the base rate (the % of people hired under the existing system for a particular position) |
|
Naylor-Shine Tables |
entail obtaining the difference between the means of the selected and unselected groups to derive an index of what the test is adding to already established procedures |
|
Brogden-Cronbach-Gleser Formula |
used to calculate the dollar amount of a utility gain resulting from the use of a particular selection instrument under specified conditions |
|
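A hedged sketch of the computation: the form shown here (N × T × r × SDy × mean z of those selected, minus total testing cost) and every input value are illustrative assumptions, not the only parameterization in use:

```python
def bcg_utility_gain(n_hired, avg_tenure_years, validity, sd_y,
                     mean_z_selected, n_applicants, cost_per_applicant):
    """Dollar utility gain = (N)(T)(r_xy)(SD_y)(mean z of selected)
    minus the total cost of testing all applicants (hypothetical form)."""
    benefit = n_hired * avg_tenure_years * validity * sd_y * mean_z_selected
    cost = n_applicants * cost_per_applicant
    return benefit - cost

# Hypothetical: hire 10 of 100 applicants, 2-year average tenure, r = .40,
# SD of job performance in dollars = $10,000, selected-group mean z = 1.0,
# test costs $25 per applicant.
gain = bcg_utility_gain(10, 2, 0.40, 10_000, 1.0, 100, 25)
```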
Cronbach and Gleser presented |
1.) a classification of decision problems 2.) various selection strategies ranging from single-stage to sequential analysis 3.) quantitative analysis of the relationship between test utility, selection ratio, cost, and expected value of the outcome 4.) a recommendation that in some instances, job requirements be tailored to the applicant's ability instead of vice versa (adaptive treatment) |
|
Practical Considerations |
-the pool of job applicants (infinite vs finite) -the complexity of the job -the cut score in use -multiple cut scores (Student gets A, B, C, etc.) -multiple hurdles (need to answer a question right to move onto the next one) |
|
The Angoff Method |
judgements of experts are averaged to yield cut scores for the test -problems arise if there is low agreement between experts |
|
The Known Groups Method |
entails collection of data on the predictor of interest from groups known to possess, and not to possess, a trait, attribute, or ability of interest -no standard guidelines exist for establishing the cut score |
|
IRT Based Method |
each item is associated with a particular level of difficulty -in order to "pass" the test, test taker must answer items that are deemed to be above minimum level of difficulty |
|
Method of Predictive Yield |
takes into account the # of positions to be filled, projections regarding the likelihood of offer acceptance, and the distribution of applicant scores |
|
Discriminant Analysis |
a family of statistical techniques used to shed light on the relationship between identified variables and naturally occurring groups |
|
Five Stages of Test Development |
1.) Conceptualization 2.) Construction 3.) Tryout 4.) Analysis 5.) Revision |
|
Scaling |
the process of setting rules for assigning numbers in measurement |
|
Rating Scales |
a grouping of words, statements, or symbols on which judgements of the strength of a particular trait, attitude or emotion are indicated by the test taker |
|
Likert Scale |
Each item presents the test taker with five alternative responses (sometimes seven), usually on an agree-disagree or approve-disapprove continuum |
|
Scaling Methods |
-numbers can be assigned to responses to calculate test scores using a number of methods -unidimensional: one dimension is presumed to underlie the ratings -multidimensional: more than one dimension is thought to underlie the ratings
|
|
Method of Paired-Comparisons |
test takers must choose between two alternatives according to some rule (which is more justified) -test takers receive more points for choosing option deemed more justifiable by majority group of judges
|
|
Comparative Scaling |
Entails judgements of a stimulus in comparison with every other stimulus on the scale
|
|
Categorical Scaling |
stimuli are placed into one of two or more alternative categories |
|
Guttman Scale |
Items range sequentially from weaker to strong expressions of attitude, belief, or feeling being measured -respondents who agree with the stronger statements of the attitude will agree with milder statements |
|
Item Pool |
the reservoir or well from which items will or will not be drawn for the final version of a test |
|
Item Format |
includes variables like form, plan, structure, arrangement, and layout of individual questions |
|
Selected Response Format |
Items require test takers to select a response from a set of alternative responses |
|
Constructed Response Format |
items require test takers to supply or create the correct answer, not just select it |
|
Multiple Choice |
three components: 1.) a stem, the question 2.) a correct alternative or option, the answer 3.) incorrect alternatives referred to as distractors or foils |
|
Computerized Adaptive Testing (CAT) |
an interactive, computer-administered test-taking process wherein the items presented to the test taker are based in part on the test taker's performance on previous items -provides economy in testing time and the # of items presented |
|
Floor Effect |
a test's diminished ability to distinguish test takers at the low end of the trait or ability being measured; your base level |
|
Ceiling Effect |
a test's diminished ability to distinguish test takers at the high end of the trait or ability being measured -in CAT, for example, the test may conclude after three consecutive wrong answers, establishing your ceiling |
|
Cumulatively Scored Test |
assumption that the higher the score on the test, the higher the test taker is on the ability, trait, or other characteristic that the test purports to measure |
|
Class Scoring |
responses earn credit toward placement in a particular class or category with other test takers whose pattern of responses is presumably similar in some way (diagnostic testing) |
|
Ipsative Scoring |
comparing a test taker's score on one scale within a test to another scale within that same test |
|
Test Tryout |
-should be tried out on the population for which it was designed -should be tried out on no fewer than 5 (and preferably 10) subjects per test item -should be administered consistently and fairly |
|
Item Analysis |
test developers use indexes of an item's difficulty, reliability, validity, and discrimination |
|
Item-Difficulty Index |
the proportion of respondents answering an item correctly |
|
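A minimal sketch (1 = correct, 0 = incorrect; the responses are hypothetical):

```python
def item_difficulty(responses):
    """Proportion of test takers answering the item correctly.
    (Despite the name, higher values mean an easier item.)"""
    return sum(responses) / len(responses)

p = item_difficulty([1, 1, 1, 0, 0])  # 3 of 5 answered correctly
```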
Item Reliability Index |
indication of the internal consistency of the scale -factor analysis can also provide this |
|
Item-Validity Index |
allows test developers to evaluate the validity of items in relation to a criterion measure |
|
Item-Discrimination Index |
indicates how adequately an item separates or discriminates between high scorers and low scorers on an entire test -a measure of the difference between the proportion of high scorers answering a question right and the proportion of low scorers answering the question right |
|
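A sketch of the index d = p(upper) − p(lower), here defining the comparison groups as the top and bottom 27% of total scorers (a common but not universal choice); all data are hypothetical:

```python
def item_discrimination(item_correct, total_scores, fraction=0.27):
    """d = proportion of high scorers answering the item right minus
    the proportion of low scorers answering it right."""
    n = max(1, int(len(total_scores) * fraction))
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    lower, upper = order[:n], order[-n:]
    p_upper = sum(item_correct[i] for i in upper) / n
    p_lower = sum(item_correct[i] for i in lower) / n
    return p_upper - p_lower

# Hypothetical: only the 5 highest total scorers got this item right.
totals  = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
correct = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
d = item_discrimination(correct, totals)
```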
Item Characteristic Curves (ICC) |
a graphic representation of item difficulty and discrimination |
|
Other Considerations |
-guessing -item fairness -biased test items -speed tests |
|
Qualitative Methods |
techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures -e.g., in a "think aloud" administration, a question is read out loud and respondents are asked to verbalize their thoughts as they occur during testing |
|
Sensitivity Review |
items are examined in relation to fairness to all prospective test takers (check for offensiveness) |
|
Revision In Test Development |
items are evaluated as to their strengths and weaknesses -some may be replaced by items from item pool -revised tests will be tried again -once a test has been finalized, norms may be developed from the data and it is said to be standardized |
|
Cross-Validation |
the revalidation of a test on a sample of test takers other than those on whom test performance was originally found to be a valid predictor of some criterion |
|
Co-Validation |
a test validation process conducted on two or more tests using the same sample of test takers |
|
Anchor Protocol |
test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies |
|
Scoring Drift |
a discrepancy between scoring in an anchor protocol and the scoring of another protocol |
|
IRT Applications In Building/Revising Tests |
1.) evaluating existing tests for the purpose of mapping test revisions 2.) determining measurement equivalence across test taker populations 3.) developing item banks |
|
hey cas guess what |
YOU'RE AWESOME AS HECK OKAY GO YOU. YOU CAN DO THIS. YOU WILL DO AMAZING ON THIS TEST. ILY.
-cas |