105 Cards in this Set

Assumptions About Psychological Testing

1.) Psychological States and Traits Exist


2.) Traits and States Can Be Quantified and Measured


3.) Test-related Behavior Predicts Non-Test-Related Behavior


4.) Tests Have Strengths and Weaknesses


5.) Error is Part of Assessment


6.) Testing Can Be Conducted in a Fair Manner


7.) Testing and Assessment Benefit Society

Norm-Referenced Testing and Assessment

a method of evaluation and a way of deriving meaning from test scores by evaluating an individual test taker's score and comparing it to scores of a group of test takers.

Norms

the test performance data of a particular group of test takers that are designed for use as a reference when evaluating or interpreting individual test scores (the normative sample is the reference group)

Stratified Sampling

Sampling that includes different subgroups, or strata, from the population


-stratified-random sampling: members are drawn at random from each stratum, so every member of each stratum has an equal chance of being included in the sample
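
A minimal sketch of proportional stratified random sampling, assuming a hypothetical population list in which each person carries a stratum label; the data, the function name, and the 10% sampling fraction below are illustrative, not from the cards:

```python
import random
from collections import defaultdict

def stratified_random_sample(population, stratum_of, fraction, seed=0):
    """Draw the same fraction at random from every stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for person in population:
        by_stratum[stratum_of(person)].append(person)
    sample = []
    for members in by_stratum.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population of 200 students stratified by year in school
population = [{"id": i, "year": ["fresh", "soph", "junior", "senior"][i % 4]}
              for i in range(200)]
print(len(stratified_random_sample(population, lambda p: p["year"], fraction=0.10)))  # 20
```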

Purposive Sample

Deliberately (non-randomly) selecting a sample that is believed to be representative of the population

Incidental/ convenience sample:

A sample that is convenient or available for use. May not be representative of the population


-generalizations of findings must be made with caution

Standardization

the process of administering a test to a representative sample of test takers for the purpose of establishing norms

Sampling

when test developers select a target population, for which the test is intended, that has at least one common, observable characteristic

Percentile Norms

percentile: the percentage of people whose score on a test or measure falls below a particular raw score


-popular for tests because they are easily calculated


-difference between raw scores may be minimized at the ends and exaggerated in the middle
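
A minimal sketch of how a percentile rank is calculated from the card's definition (the percentage of scores in the norm group falling below a given raw score); the norm-group scores below are made up:

```python
def percentile_rank(raw_score, norm_scores):
    """Percentage of norm-group scores that fall below the given raw score."""
    below = sum(1 for s in norm_scores if s < raw_score)
    return 100.0 * below / len(norm_scores)

norm_scores = [42, 47, 50, 53, 55, 58, 60, 64, 67, 71]   # hypothetical norm group
print(percentile_rank(58, norm_scores))                   # 50.0 -> 50th percentile
```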

Other Types of Norms

-age norms


-grade norms


-national norms


-national anchor norms


-subgroup norms


-local norms

Fixed Reference Group Scoring Systems

The distribution of scores obtained on the test from one group of test takers is used as the basis for the calculation of test scores for future administrations of the test

Reliability

consistency in measurement

Reliability Coefficient

an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance


observed score: true score plus error (X=T+E)

Variance

true variance plus error variance (the standard deviation squared)
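
A minimal numeric sketch of the classical model X = T + E using made-up variance values, showing the reliability coefficient as the ratio of true-score variance to total (observed) variance:

```python
# Hypothetical variance components; not taken from any real test.
true_variance = 8.0
error_variance = 2.0
total_variance = true_variance + error_variance   # sigma^2_X = sigma^2_T + sigma^2_E
reliability = true_variance / total_variance       # r_xx = sigma^2_T / sigma^2_X
print(total_variance, reliability)                  # 10.0 0.8
```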


Measurement Error

all the factors associated with the process of measuring some variable, other than the variable being measured


random error: caused by unpredictable fluctuations & inconsistencies of other variables


systematic error: error that is constant or proportionate across measurements; because it is predictable, it can be identified and accounted for

Sources of Error Variance

-Test construction


-Test administration


-Test scoring and interpretation


-sampling error


-methodological error

Test-Retest Reliability

an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test


-good for variables that should be stable over time


-estimates decrease over time


-when the interval between administrations is longer than six months, the estimate is called the coefficient of stability

Parallel-Forms

-for each form of the test, the means and the variances of observed test scores are equal

Alternate-Forms

different versions of a test that have been constructed so as to be parallel. They do not meet the strict requirements of parallel forms, but item content and difficulty are typically similar across forms

Coefficient of Equivalence

the degree of the relationship between various forms of a test

Split-Half Reliability

obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.


step 1.) divide the test into equivalent halves


step 2.) calculate pearson r between two halves


step 3.) adjust the half-test reliability using the spearman-brown formula

Spearman-Brown formula

allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test
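
A sketch of the split-half steps followed by the Spearman-Brown correction, using made-up half-test scores; statistics.correlation requires Python 3.10 or later:

```python
from statistics import correlation

odd_half  = [10, 12, 15, 9, 14, 11, 13, 16]   # hypothetical odd-item half scores
even_half = [11, 13, 14, 10, 15, 10, 12, 17]  # hypothetical even-item half scores

r_half = correlation(odd_half, even_half)      # step 2: Pearson r between the halves
r_full = (2 * r_half) / (1 + r_half)           # step 3: Spearman-Brown estimate for full test
print(round(r_half, 3), round(r_full, 3))
```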

Inter-Item Consistency

The degree of relatedness of items on a test. Able to gauge the homogeneity of a test
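
One widely used index of inter-item consistency is coefficient alpha (not named on the card itself); this sketch assumes a small made-up matrix of item scores, rows being test takers and columns being items:

```python
from statistics import pvariance

def coefficient_alpha(scores):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(scores[0])                                    # number of items
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])   # variance of total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

scores = [[1, 2, 2, 3], [2, 3, 3, 4], [0, 1, 1, 2], [3, 4, 3, 4]]  # hypothetical data
print(round(coefficient_alpha(scores), 3))
```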

Inter-Scorer Reliability

The degree of agreement or consistency between two or more scorers with regard to a particular measure. Often used with behavioral measures, guards against biases.

Classical Test Theory (CTT)

most widely used true-score model due to its simplicity


True Score

a value that according to CTT genuinely reflects an individual's ability or trait level as measured by a particular test

Domain-Sampling Theory

estimates the extent to which specific sources of variation under defined conditions are contributing to the test score


Generalizability Theory

based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation

Item-Response Theory (IRT)

provides a way to model the probability that a person with X ability will be able to perform at a level of Y

Discrimination

the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or other variable being measured


-IRT incorporates considerations of both item difficulty and item discrimination; difficulty refers to how hard an item is to answer correctly (i.e., not easily accomplished)
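
A sketch of one common IRT model, the two-parameter logistic, showing how a difficulty parameter (b) and a discrimination parameter (a) shape the probability of a correct response; the item parameters below are made up:

```python
import math

def p_correct(theta, a, b):
    """Probability that a person with ability theta answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Two hypothetical items with the same difficulty but different discrimination
print(round(p_correct(theta=0.5, a=0.5, b=0.0), 3))   # low-discrimination item
print(round(p_correct(theta=0.5, a=2.0, b=0.0), 3))   # high-discrimination item
```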

Standard Error of Measurement (SEM)

provides a measure of the precision of an observed test score. an estimate of the amount of error inherent in an observed score or measurement


-the higher the reliability, the lower the SEM

Confidence Interval

a range or band of test scores that is likely to contain the true score
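
A sketch (hypothetical numbers) of the usual SEM formula, SEM = SD * sqrt(1 - reliability), and of a roughly 95% confidence band placed around an observed score:

```python
import math

sd, reliability, observed = 15.0, 0.91, 106      # made-up test statistics and score
sem = sd * math.sqrt(1 - reliability)            # standard error of measurement
lo, hi = observed - 1.96 * sem, observed + 1.96 * sem   # ~95% confidence interval
print(round(sem, 2), (round(lo, 1), round(hi, 1)))
```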

Standard Error of Difference

a measure that can aid a test user in determining how large a difference in test scores should be expected before it is considered statistically significant
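
A sketch of the usual standard error of the difference, assuming both tests are expressed on the same scale with the same standard deviation; the reliabilities and scores below are hypothetical:

```python
import math

sd, r1, r2 = 15.0, 0.90, 0.84
sed = sd * math.sqrt(2 - r1 - r2)        # equivalent to sqrt(SEM1**2 + SEM2**2)
score_1, score_2 = 112, 103
significant = abs(score_1 - score_2) > 1.96 * sed   # rough check at the .05 level
print(round(sed, 2), significant)
```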

SED Answers These Questions:

1.) How did this individual's performance on test 1 compare with his or her performance on test 2?


2.) How did this individual's performance on test 1 compare with someone else's performance on test 1?


3.) How did this individual's performance on test 1 compare with someone else's performance on test 2?

Validity

a judgement or estimate of how well a test measures what it purports to measure in a certain context

Validation

the process of gathering and evaluating evidence about validity

3 Types of Validity

1.) Content Validity: a measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test


2.) Criterion-Related Validity: measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests


3.) Construct Validity: a measure of validity arrived at by executing a comprehensive analysis of: a.) how scores on the test relate to other test scores and measures and b.) how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure

Face Validity

a judgement concerning how relevant the test items appear to be


- self-report tests tend to be high in face validity; the Rorschach is low


- low face validity can undermine test takers' and test users' confidence in the test

Test Blueprint

a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, etc.

Criterion

the standard against which a test or test score is evaluated.


- a criterion should be relevant, valid, and uncontaminated (criterion contamination occurs when the criterion is itself based in part on the predictor)

Concurrent Validity

an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently)

Predictive Validity:

an index of the degree to which a test score predicts some criterion, or outcome, measure in the future. Tests are evaluated as to their predictive validity.

Validity Coefficient

a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure


-affected by restriction or inflation of range

Incremental Validity

the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use

Expectancy Tables

show the percentage of people within specified test-score intervals who subsequently were placed in various categories of the criterion (pass vs fail categories)

Evidence of Construct Validity

-homogeneity: how uniform a test is in measuring a single construct


-changes with age/ changes over time


-pretest/posttest changes: scores change as result of experience


-from distinct groups: scores vary because of membership of a group


Convergent Evidence

scores on the test undergoing construct validation tend to correlate highly in the predicted direction with scores on older, more established tests designed to measure the same/ similar construct

Discriminant Evidence

validity coefficient showing little relationship between test scores and other variables with which scores on the test should not theoretically be correlated

Factor Analysis

a new test should load on a common factor with other tests of the same construct

Bias

a factor inherent in a test that systematically prevents accurate, impartial measurement


-implies systematic variation in tests


-prevention during development is best cure

Rating Error

a judgement resulting from the intentional or unintentional misuse of a rating scale

Halo Effect

a tendency to give a particular person a higher rating than he or she objectively deserves because of a favorable overall impression

Fairness

The extent to which a test is used in an impartial, just, and equitable way

Utility

the usefulness or practical value of testing to improve efficiency

Factors Affecting Utility

-psychometric soundness: the higher the criterion-related validity, the higher the potential utility


-costs


-benefits


Costs of Utility

-money


-time


-place you're using


-doctor insurance


-skill of assessor


-travel

Benefits of Utility

-conclusive outcome


-improved well-being


-better functioning in life


-better vocational placement


-increase in performance


-assessment is worth it

Utility Analysis

a family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment


-ask question "what test will give us the most bang for our buck"

Taylor-Russell Tables

provide an estimate of the percentage of employees hired through use of a particular test who will be successful at their jobs, given the test's validity (the validity coefficient), the selection ratio (the number of people to be hired relative to the number of people available to be hired), and the base rate (the percentage of people hired under the existing system for a particular position)

Naylor-Shine Tables

entail obtaining the difference between the means of the selected and unselected groups to derive an index of what the test is adding to already established procedures

Brogden-Cronbach-Gleser Formula

used to calculate the dollar amount of the utility gain resulting from the use of a particular selection instrument under specified conditions
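
A sketch of one common statement of the Brogden-Cronbach-Gleser utility gain, with hypothetical numbers; the variable names are illustrative labels, not terms from the cards:

```python
n_selected   = 10       # number of people selected
tenure_years = 2.0      # average time they stay in the job
validity     = 0.40     # validity coefficient of the selection test
sd_dollars   = 12000.0  # SD of job performance expressed in dollars
mean_z       = 1.0      # mean standardized test score of those selected
cost_each    = 300.0    # cost of testing one applicant

utility_gain = (n_selected * tenure_years * validity * sd_dollars * mean_z
                - n_selected * cost_each)
print(utility_gain)     # 93000.0 under these made-up figures
```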

Cronbach and Gleser presented

1.) a classification of decision problems


2.) various selection strategies ranging from single-stage to sequential analysis


3.) quantitative analysis of the relationship between test utility, selection ratio, cost, and expected value of the outcome


4.) a recommendation that in some instances, job requirements be tailored to the applicant's ability instead of vice versa (adaptive treatment)

Practical Considerations

-the pool of job applicants (infinite vs finite)


-the complexity of the job


-the cut score in use


-multiple cut scores (Student gets A, B, C, etc.)


-multiple hurdles (need to answer a question right to move onto the next one)

The Angoff Method

judgements of experts are averaged to yield cut scores for the test


-problems arise if there is low agreement between experts

The Known Groups Method

entails collection of data on the predictor of interest from groups known to possess, and not to possess, a particular trait, attribute, or ability of interest


-no standard guidelines exist for defining the "known" groups or for setting the cut score

IRT Based Method

each item is associated with a particular level of difficulty


-in order to "pass" the test, the test taker must correctly answer items deemed to be above a minimum level of difficulty

Method of Predictive Yield

takes into account the number of positions to be filled, projections regarding the likelihood of offer acceptance, and the distribution of applicant scores

Discriminant Analysis

a family of statistical techniques used to shed light on the relationship between identified variables and naturally occurring groups

Five Stages of Test Development

1.) Conceptualization


2.) Construction


3.) Tryout


4.) Analysis


5.) Revision

Scaling

the process of setting rules for assigning numbers in measurement

Rating Scales

a grouping of words, statements, or symbols on which judgements of the strength of a particular trait, attitude or emotion are indicated by the test taker

Likert Scale

Each item presents the test taker with five alternative responses (sometimes seven), usually on an agree-disagree or approve-disapprove continuum

Scaling Methods

-numbers can be assigned to responses to calculate test scores using a number of methods


-unidimensional: one dimension is presumed to underlie the ratings


-multidimensional: more than one dimension is thought to underlie the ratings


Method of Paired-Comparisons

test takers must choose between two alternatives according to some rule (which is more justified)


-test takers receive more points for choosing option deemed more justifiable by majority group of judges


Comparative Scaling

Entails judgements of a stimulus in comparison with every other stimulus on the scale


Categorical Scaling

stimuli are placed into one of two or more alternative categories

Guttman Scale

Items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured


-respondents who agree with the stronger statements of the attitude will agree with milder statements

Item Pool

the reservoir or well from which items will or will not be drawn for the final version of a test

Item Format

includes variables like form, plan, structure, arrangement, and layout of individual questions

Selected-Response Format

Items require test takers to select a response from a set of alternative responses

Constructed Response Format

items require test takers to supply or create the correct answer, not just select it

Multiple Choice

three components: 1.) a stem (the question), 2.) a correct alternative or option (the answer), and 3.) incorrect alternatives, referred to as distractors or foils

Computerized Adaptive Testing (CAT)

an interactive, computer-administered test-taking process wherein the items presented to the test taker are based in part on the test taker's performance on previous items


-provides economy in testing time and in the number of items presented
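
A highly simplified sketch of the adaptive idea: after each response, present the unused item whose difficulty is closest to the current ability estimate. The item parameters, answer rule, and crude ability update below are made up for illustration and are not a real estimation procedure:

```python
def run_cat(items, answer_fn, n_items=5, ability=0.0):
    """Administer n_items adaptively, returning the final (rough) ability estimate."""
    used = set()
    for _ in range(n_items):
        item = min((i for i in items if i["id"] not in used),
                   key=lambda i: abs(i["difficulty"] - ability))
        used.add(item["id"])
        correct = answer_fn(item)
        ability += 0.5 if correct else -0.5    # crude step update, not a real estimator
    return ability

items = [{"id": k, "difficulty": d} for k, d in enumerate([-2, -1, 0, 1, 2])]
print(run_cat(items, answer_fn=lambda item: item["difficulty"] <= 0.5))
```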

Floor Effect

the diminished ability of a test to distinguish test takers at the low end of the ability or trait being measured; informally, the test taker's base (lowest) level

Ceiling Effect

the diminished ability of a test to distinguish test takers at the high end of the ability or trait being measured; in adaptive testing, the test may stop after a set number of consecutive wrong answers (e.g., three in a row), and that point is the test taker's ceiling

Cumulatively Scored Test

assumption that the higher the score on the test, the higher the test taker is on the ability, trait, or other characteristic that the test purports to measure

Class Scoring

responses earn credit toward placement in a particular class or category with other test takers whose pattern of responses is presumably similar in some way (diagnostic testing)

Ipsative Scoring

comparing a test taker's score on one scale within a test to another scale within that same test

Test Tryout

-should be tried out on the population for which the test was designed


-should be tried out on an adequately sized sample (a common rule of thumb is no fewer than 5, and preferably as many as 10, test takers per item)


-should be administered consistently and fairly

Item Analysis

test developers use statistical tools including indexes of item difficulty, item reliability, item validity, and item discrimination

Item-Difficulty Index

the proportion of respondents answering an item correctly

Item Reliability Index

indication of the internal consistency of the scale


-factor analysis can also provide this

Item-Validity Index

allows test developers to evaluate the validity of items in relation to a criterion measure

Item-Discrimination Index

indicates how adequately an item separates or discriminates between high scorers and low scorers on an entire test


-computed as the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the same item correctly
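
A sketch of both indexes using a made-up 0/1 response matrix (rows = test takers, columns = items); the "high" and "low" groups here are simply the top and bottom halves by total score:

```python
def item_difficulty(responses, item):
    """Proportion of test takers answering the item correctly (p value)."""
    return sum(r[item] for r in responses) / len(responses)

def item_discrimination(responses, item):
    """d index: proportion correct in the high group minus the low group."""
    ranked = sorted(responses, key=sum, reverse=True)   # order by total score
    half = len(ranked) // 2
    high, low = ranked[:half], ranked[-half:]
    p_high = sum(r[item] for r in high) / len(high)
    p_low  = sum(r[item] for r in low) / len(low)
    return p_high - p_low

responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]   # hypothetical answers
print(item_difficulty(responses, 0), item_discrimination(responses, 0))
```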

Item Characteristic Curves (ICC)

a graphic representation of item difficulty and discrimination

Other Considerations

-guessing


-item fairness


-biased test items


-speed tests

Qualitative Methods

techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures


-e.g., "think aloud" administration: a question is read aloud and respondents are asked to verbalize their thoughts as they occur during testing

Sensitivity Review

items are examined in relation to fairness to all prospective test takers (check for offensiveness)

Revision In Test Development

items are evaluated as to their strengths and weaknesses


-some may be replaced by items from item pool


-revised tests will be tried again


-once a test has been finalized, norms may be developed from the data and it is said to be standardized

Cross-Validation

the revalidation of a test on a sample of test takers other than those on whom test performance was originally found to be a valid predictor of some criterion

Co-Validation

a test validation process conducted on two or more tests using the same sample of test takers

Anchor Protocol

test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies

Scoring Drift

a discrepancy between scoring in an anchor protocol and the scoring of another protocol

IRT Applications In Building/Revising Tests

1.) evaluating existing tests for the purpose of mapping test revisions


2.) determining measurement equivalence across test taker populations


3.) developing item banks

hey cas guess what

YOU'RE AWESOME AS HECK OKAY GO YOU. YOU CAN DO THIS. YOU WILL DO AMAZING ON THIS TEST. ILY.



-cas