64 Cards in this Set

3 common characteristics of psychological tests
1. Samples Behavior
2. Uses standardized procedures
3. Rules for scoring and categorizing.
Differences in psychological tests.
-Behavior assessed.
-Attribute measured.
-Content (types of questions).
-Administration & format.
-Scoring & interpretation.
-Psychometric quality
What are some assumptions of psychometric quality?
-We assume individual differences exist
-Validity
-Reliability
-Individuals can understand test items similarly.
-Individuals can accurately report about themselves.
-Individuals will report their thoughts and feelings honestly.
The equation for test scores. What does a test score represent?
X = T + e

Test scores are equal to true ability plus some error, which may come from the test itself, from the examiner, or from the environment. ** We do our best to make e as small as possible.
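A minimal Python sketch of this relationship (the true score, error distribution, and numbers are illustrative assumptions, not part of the card):

    import random

    # A minimal sketch (not the card's own example): simulate X = T + e.
    def observed_score(true_score, error_sd=2.0):
        # e: random error from the test, examiner, or environment
        error = random.gauss(0.0, error_sd)
        return true_score + error  # X = T + e

    # Repeated administrations of the same true ability scatter around T;
    # shrinking error_sd (making e small) pulls observed scores toward T.
    print([round(observed_score(50.0), 1) for _ in range(5)])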
What do we use psychological tests for?
1. Classification (placement, screening, certification, diagnosis).
2. Self-knowledge
3. Program evaluation
4. Research
Why do we restrict access to psychological tests?
1. Can cause harm (Rorschach= psychotic break).
2. Tests become invalid if you preview items.
3. Tests become useless if general public sees them.
What are the "6 Easy Steps to Test Construction"?
1. Plan the test (define your construct, review the literature, review other tests).
2. Write items for each of the areas of the plan (2x as many as you think you will need).
3. Administer all items to a reasonably large sample of at least 50 examinees; pilot testing.
4. Conduct an item analysis to select out the good items.
5. Administer the revised test to another representative sample of examinees; test-norming/standardization.
6. Perform reliability & validity studies.
Things to consider when planning a test.
Purpose/Objective
Need
Population
Content
Administration
Item Format
Alternate forms necessary?
Training requirements for administration of the test.
What are 3 types of data that a psychological test can yield?
1. Cumulative model of scoring: total # of correct answers is the raw score.
2. Categorical model of scoring: place in a category.
3. Ipsative model of scoring: test taker's response on various scales are compared with each other to yield a "profile."
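A short Python sketch of the three scoring models side by side (the responses, cutoff, and scale names are hypothetical, not from the card):

    # Hypothetical responses and scale names, showing the three scoring models.
    responses = [1, 0, 1, 1, 0]          # 1 = correct, 0 = incorrect

    # 1. Cumulative: raw score = total number of correct answers.
    raw_score = sum(responses)

    # 2. Categorical: place the test taker in a category (cutoff is illustrative).
    category = "pass" if raw_score >= 3 else "fail"

    # 3. Ipsative: compare the test taker's own scale scores to one another
    #    to yield a profile (highest to lowest).
    scale_scores = {"verbal": 12, "quantitative": 9, "spatial": 15}
    profile = sorted(scale_scores, key=scale_scores.get, reverse=True)

    print(raw_score, category, profile)  # 3 pass ['spatial', 'verbal', 'quantitative']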
Reliability
The degree of consistency in our measurement. It is an attempt to gauge the likely accuracy, or repeatability of test scores. **Think of it on a continuum.

Reliability is the extent to which scores on a measure change only as a result of changes in the underlying variable being measured. That is, scores on the measure should stay the same if the underlying variable does not change, and should change if it does. Any change in scores not related to the underlying variable is considered error variance.
Classical Test Theory vs. Item Response Theory
Looking at reliability of a measure in terms of summing the number of items answered correctly vs. looking at the behavior of each individual item. For example, number correct on the SAT vs. an SAT that changes as you take it based on your ability.
Reliability coefficient
The correlation of one test with another test in the domain.
-Btwn 0 and 1, want closer to 1
-For people: .9
-For research: .7
What are 4 types of random measurement error?
1. Item selection
2. Test administration
3. Test scoring
4. Attitude of the examiner
Systematic Measurement Error
When a test consistently measures something other than the trait it was intended to measure. This affects all test takers the same way. Ex. social introversion and anxiety: it is difficult to isolate one from the other.
Test-Retest Reliability
Administer the test 2 times to the SAME GROUP of people. The second score should be completely predictable from the 1st score (despite practice effects).

**Not appropriate when measuring states unless time interval btwn testing is short.
Alternate Forms Reliability
Administer 2 different forms of the same test (similar content and difficulty) to the SAME GROUP.
-Eliminates practice effects.
-Expensive!
-Introduces "item-sampling" differences as additional source of error variance.
Split-Half Reliability (internal consistency)
Correlate pairs of scores obtained from equivalent halves of a test administered ONLY ONCE to a single sample.
-Methods for splitting the test: top/bottom, odd/even, random assignment
-Less precise because reliability estimate depends on how you choose to split the test.
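A minimal Python sketch of an odd/even split-half estimate (the item scores are made up; the Spearman-Brown step-up is a standard adjustment for half-length tests, though the card does not mention it):

    import statistics

    def pearson_r(x, y):
        # Pearson correlation between two lists of scores.
        mx, my = statistics.mean(x), statistics.mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        var_x = sum((a - mx) ** 2 for a in x)
        var_y = sum((b - my) ** 2 for b in y)
        return cov / (var_x * var_y) ** 0.5

    def split_half_reliability(item_matrix):
        # item_matrix: one row of 0/1 item scores per examinee, from a single
        # administration. Here the split is odd/even items.
        odd = [sum(row[0::2]) for row in item_matrix]
        even = [sum(row[1::2]) for row in item_matrix]
        r_half = pearson_r(odd, even)
        # Spearman-Brown step-up to estimate full-length reliability
        # (an assumed standard correction, not stated on the card).
        return 2 * r_half / (1 + r_half)

    scores = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [0, 0, 1, 0], [1, 1, 1, 0]]
    print(split_half_reliability(scores))

A different split (top/bottom, random) would give a somewhat different estimate, which is the imprecision the card notes.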
Interrater Reliability
When scores are heavily dependent on the judgment of the examiner, it is CRITICAL to report interrater reliability. A sample of tests is independently scored by 2 or more examiners and pairs of examiners' scores are correlated.
What factors influence reliability?
-Sources of Measurement Error: item selection, test administration, systematic error, changes over time, item sampling, nature of split, scorer differences
-Test length
-Variability (range) of test scores
Construct Validity
The degree to which the test measures the theoretical construct or trait that it was designed to measure (in the context in which it is to be applied).
What are 4 types of experimental hypotheses when looking at evidence for construct validity?
Group differences (btwn groups)
Changes (over time, before/after)
Processes (differences in problem solving)
Correlations (w/other constructs).
Multitrait-Multimethod Validity
Variance in observed scores can be thought of as arising from both the trait (e.g., team coordination) and the method (e.g., instructor ratings). Each measure can be thought of as a trait-method unit. With a single trait-method unit, we cannot separate variance due to traits and methods; with multiple trait-method units, we can begin to do so. With multiple trait-method units we can construct a correlation matrix that shows the relations among all the units.
Convergent Validity
1 trait, 2 methods
Demonstrated by high correlations between scores on tests measuring the same trait by different methods.
Discriminant Validity
2 traits, 1 method
Demonstrated by low correlations between scores on tests measuring different traits, especially when using the same methods.
Content Validity
Degree to which test items are representative of the behavior domain that the test was designed to measure. Established using a qualitative analysis, rather than a quantitative analysis.
What are the 2 ways to establish content validity?
BEFORE test development: define the test universe and develop a test plan.
AFTER test development: use the judgment of experts.
Criterion-Related Validity
OR Predictive Validity
The extent to which a test is shown to be effective in estimating an examinee's performance on some outcome measure; we are establishing a statistical relationship with a particular criterion.
-NOT convergent validity, i.e., comparing the scores on a test with a test examining the same construct.
Ex. The SAT is designed to predict the outcome on some criterion: grades in college.
2 methods for demonstrating Criterion-Related Validity
1. Predictive Method- criterion measures are obtained in the future, sometimes months or even years after the test has been given.
2. Concurrent Method- criterion measures are obtained at approximately the same time as the test scores.
What is the coefficient used for Criterion-related validity?
Pearson r: the correlation btwn the test score and the criterion score.
Explain the two types of regression used in making predictions from validity information.
1. Linear Regression: use 1 set of test scores to predict 1 set of criterion scores.
2. Multiple Regression: use more than 1 set of test scores to predict 1 set of criterion scores.
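A small Python sketch of the two approaches (the predictor names and data are hypothetical, not from the cards):

    import numpy as np

    # Hypothetical data: predict one criterion (college GPA) from one
    # predictor vs. from several predictors.
    sat = np.array([1100.0, 1250.0, 1300.0, 1450.0, 1500.0])
    hs_gpa = np.array([3.0, 3.2, 3.5, 3.8, 3.9])
    college_gpa = np.array([2.8, 3.0, 3.4, 3.7, 3.8])

    # 1. Linear regression: one set of test scores predicts one set of criterion scores.
    X1 = np.column_stack([np.ones_like(sat), sat])
    b1, *_ = np.linalg.lstsq(X1, college_gpa, rcond=None)

    # 2. Multiple regression: more than one set of test scores predicts the criterion.
    X2 = np.column_stack([np.ones_like(sat), sat, hs_gpa])
    b2, *_ = np.linalg.lstsq(X2, college_gpa, rcond=None)

    print("linear:", b1)    # [intercept, slope for SAT]
    print("multiple:", b2)  # [intercept, slope for SAT, slope for HS GPA]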
Utility
Looking at psychological testing NOT as a measurement in and of itself, but measurement in the service of decision making.

The degree to which the predictor's use improves the quality of the decision beyond what it would have been had the predictor not been used. It refers to Return on Investment, or the usefulness or meaningfulness of a test.
Cutting Score
If you have a test that you use to make a decision, the simplest way you may wish to use the data is to screen or separate people into dichotomous categories (hire/don't hire, accept/reject). This procedure involves setting a cutting score and placing people who obtain a score equal to or higher than the cutting score in one category and all other people in the other category. The cutoff score will affect how accurate the decisions based on the test will be.
"Hit" Rate
The percentage of predictions that are correct (predicting that a successful person will be successful).
Sensitivity
OR Detection Rate
The number of correctly identified people who have the quality (ie, the test says the quality is present and in actuality, it is present) divided by the number of people who actually have the quality. (Can the test detect when this quality is present?)
Specificity
The number of correctly identified people who do NOT have the quality (test says NO) divided by the number of people who actually do not have the quality. (Can the test detect when the quality is NOT present?)
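A worked example in Python using a hypothetical 2x2 decision table (the counts are made up); hit rate is computed here as the overall proportion of correct predictions, in line with the "Hit" Rate card above:

    # Hypothetical counts from a 2x2 decision table.
    true_positives = 40    # test says the quality is present, and it is
    false_negatives = 10   # test says absent, but the quality is present
    true_negatives = 35    # test says absent, and the quality is absent
    false_positives = 15   # test says present, but the quality is absent
    total = 100

    sensitivity = true_positives / (true_positives + false_negatives)  # 40/50 = 0.80
    specificity = true_negatives / (true_negatives + false_positives)  # 35/50 = 0.70
    hit_rate = (true_positives + true_negatives) / total               # 75/100 = 0.75

    print(sensitivity, specificity, hit_rate)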
Base Rate
Tests should always provide more information than what is known without any testing (base rate).
Tests are most useful when the base rate, or ratio of success to failure, is ...
50/50
100/0, where everyone who applies is successful, would have a test be useless because it cannot improve on perfect success.
0/100, where everyone who applies fails is also useless because it cannot pick anyone who will succeed on the job.
-As the base rate becomes larger, it becomes harder for a test to improve on the base rate.
Major problem of base rates.
The only concern is for the selected people who end up successful, not for the rejected people who could have been successful. Applicants care about this, but decision makers usually do not.
Incremental Validity
Whether a test adds anything to what is already known from other sources. This is done by seeing if a test makes a difference when added to other variables in a multiple regression equation.
Stratified Random Sampling
Dividing your population into homogeneous subgroups (ex. sex, age, race) and then taking a simple random sample in each subgroup.
Test Bias
Systematic error in the estimation of some 'true' value for a group of individuals. A test is deemed biased if it is differentially valid for different subgroups.
Test Fairness
Reflects social values and philosophies of test use, particularly when test use extends to selection for privilege or employment. Even a test that is unbiased according to the traditional technical criteria might still be deemed unfair because of the social consequences of using it for selection decisions. In the assessment of test fairness, subjective values are of overarching importance; the statistical criteria of test bias are merely ancillary.
Is an item or subscale biased just because it feels like it?
NO! An item or subscale of a test is considered biased in content when it is demonstrated to be relatively more difficult for members of one group than for members of another in a situation where the general ability level of the groups being compared is held constant and no reasonable theoretical rationale exists to explain group differences on the item or subscale in question.

-You may feel like it is biased; however, the statistic showing that two groups differ, regardless of the reason you come up with, may still be true.
Differential Item Functioning (DIF)
When a particular population's performance on a test item differs even when trait level is controlled.
Ex. consider an item from a depression scale that asks about the amount of treatment sought for the person's depressive episodes. Two people from different ethnic groups may accept/acknowledge mental illness differently, which can affect how much treatment is sought independently of actual depression severity. So two people of different ethnicities may have the same level of depression but respond differently to the item.
How do we detect DIF?
Item Characteristic Curves (ICCs)
A separate ICC is graphed for each item. A good item has a positive slope, and the graphs are useful for identifying items that perform differently for subgroups of examinees.
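A minimal Python sketch of what an ICC-based DIF comparison looks like; the two-parameter logistic form and the parameter values are illustrative assumptions, not named on the card:

    import math

    def icc(theta, a, b):
        # A two-parameter logistic ICC: probability of a correct/keyed response.
        # theta = trait level, a = discrimination (slope), b = difficulty (location).
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    # DIF in miniature: at the SAME trait level, the item is harder for group 2
    # (its curve is shifted right), so response probabilities differ.
    theta = 0.0
    print(icc(theta, a=1.2, b=-0.5))  # reference group
    print(icc(theta, a=1.2, b=0.7))   # focal group, lower probability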
Has research supported the popular belief that the specific content of test items is a source of cultural bias against minorities in well-known standardized tests of ability and aptitude?
NO!!!
Discuss bias in Predictive (Criterion)-Related Validity
A test is considered biased with respect to predictive validity if there is a constant error in an inference for prediction as a function of membership in a particular group.
-Ex. an unbiased SAT will predict future academic performance of African Americans and white Americans with near-identical accuracy. If the predictive validity coefficient for one group is .9 and the other is .3, then that would be evidence of significant predictive validity bias, and that the test is more predictive for one group than another.

**Research suggests infrequent evidence of predictive validity bias for standardized ability and aptitude tests.
Discuss bias in construct validity.
Bias exists in regard to construct validity when a test is shown to measure different hypothetical traits for one group than another or to measure the same trait but with differing degrees of accuracy.

**There is very little evidence of construct validity bias in standardized intelligence tests.
What does Jensen say about test bias?
The answers to questions about test bias surely need not await a scientific consensus on the so-called nature-nurture question. A proper assessment of test bias, on the other hand, is an essential step towards a scientific understanding of the observed differences in all the important educational, occupational, and social correlates of test scores. Test scores themselves are merely correlates, predictors, and indicators of other socially important variables, which would not be altered in the least if tests did not exist. The problem of individual differences and group differences would not be made to disappear by abolishing tests. One cannot treat a fever by throwing away a thermometer.
Rights of Test Takers
-Right to Informed Consent
-Right to know and understand the test results.
-Protection from invasion of privacy.
-Protection from stigma.
-Right to confidentiality.
Responsibilities of Test Publisher
-Make sure only qualified people can purchase the test.
-Make sure marketing is truthful and test items are kept secure.
-Make sure there is a comprehensive manual to go with the test.
Responsibilities of Test Users
KSAO
K-Knowledge
S-Skills
A-Abilities
O-Other
Comprehensive Assessment
-Select tests that sufficiently sample the behaviors for a specific purpose.
-Include a personal history to integrate with test results.
-Base decisions on wider info than a single test score.
-Base decisions on multiple sources of convergent data.
Proper Test Use
-Accept full responsibility.
-Gain appropriate consent.
-Inform test takers why they are being tested and how the test scores will be interpreted and used.
-Select tests that are appropriate to both the purpose of the measurement and the test taker.
-Select quality tests.
-Use tests only for their designed purposes.
-Know the tests and their limitations.
-Only allow qualified personnel to administer, score, and interpret tests.
-Use appropriate settings for testing.
-Safeguard tests and records by storing them in a secure location.
-Keep up with your field.
-Establish rapport with test takers.
-Ensure test takers follow directions and follow standardized procedures.
Psychometric Knowledge
-Know the factors that can affect the accuracy of a test score.
-Consider the SEM when interpreting scores.
-Appreciate the implications of test reliability and validity.
-Understand why and how scores are manipulated and transformed.
Use of Norms
-Know what norms are.
-Know the different types.
-Know their uses and limitations.
-Select the appropriate norm group when interpreting scores.
Maintaining Integrity of Test Results
-Investigate low or deviant scores to determine their causes.
-Appreciate the individual differences of test takers instead of presenting test scores directly from manual descriptions or printouts.
-Avoid interpreting scores beyond the limits of the test.
-Understand that test scores represent only one point in time and are subject to change over time from experience.
-Understand that absolute cutoff scores are questionable because they ignore measurement error.
Accuracy of Scoring
-Follow scoring directions.
-Carefully score the test to avoid scoring and recording errors.
-Use checks to ensure scoring accuracy.
Feedback to test takers
-Have qualified staff who are willing and available to provide counseling to test takers and significant others.
-Provide feedback in understandable language/terms.
-Allow only qualified staff to provide feedback.
What are some relevant conditions that may differ from culture to culture?
Intrinsic interest of the test content.
Rapport with examiner.
Drive to do well.
Desire to excel others.
Past habits of individual vs. cooperative problem solving.
When do cultural differences become a handicap?
When the individual moves out of the culture or subculture in which s/he was reared and tries to function, compete, or succeed in another culture.
How can cultural stereotypes affect test performance?
Longitudinal research finds that knowledge about existing stereotypes may affect some test takers' motivation and attitudes toward the test through distraction, self-concept, reduced effort, and low expectation of successful performance.

**Stereotype Vulnerability (seen in both gender and ethnic comparisons)
What is the role of the examiner in cross-cultural testing?
Obtain:
Cultural identity
Degree & type of acculturation
Characteristics of initial culture likely to affect the individual's test performance.

Adapt behavior in how test is introduced, explained; how taker is motivated and rapport is developed.

Re-consider how interpretation is done, nature of feedback, and to whom feedback is given.
What about test bias vs. stereotyping vulnerability?
"While bias internal to intelligence tests is rare, test overuse, misuse, misinterpretation, and differential impact remains problematic."