197 Cards in this Set
- Front
- Back
Reliability |
Measures the amount of random or nonsystematic error present in a test |
|
Part of the error resides in the test situation |
Who gives the test (examiner effects)
Who takes the test (test taker characteristics)
Item effects |
|
Relationship between examiner and taker |
Familiarity with and rapport between examiner and test taker can increase test scores
People report more health concerns in response to interview questions delivered online, by self-report, or by telephone than in direct questioning
Face-to-face questioning can lead to expectancy effects, particularly when questioning children or vulnerable people |
|
Effects of tester race and gender on test scores (Great truth) |
There is little evidence that tester race or gender influences test scores on individual or group administered ability tests
Belief in such effects is a myth unsupported by the data |
|
Why do tester race and gender not matter |
Test administration guidelines are very specific for most ability tests, and training for administration of these tests is common
When effects are found, there is usually a deviation from the administration procedures given in the manual
-effects are small and insignificant |
|
Rosenthal effects |
Experimenter expectancy effects |
|
Rosenthal effects (experimenter expectancy effects) |
Expectancy effects are real, but the overall effect on test scores is small
It is not clear whether expectancy effects can be replicated in the manner they were found in Rosenthal's early studies
Whether expectancy effects happen in standardized testing is unclear
-when present, they are small |
|
Responses after a correct or incorrect answer |
Inconsistent feedback can reduce the reliability of test results
Results are inconclusive, with some studies showing increased scores after praise and others showing little effect
Reinforcement does alter responses on attitude surveys
-frequently increasing yea-saying responses |
|
What to do after a response |
After a response no response should be given to avoid the possibility of reinforcement or random reinforcement
What to say or not say is outlined in the test manual -there are no exceptions to following standardized testing procedures |
|
Advantages of presenting instructions and items on a computer (computer administered tests) |
Complete standardization
Branching and adaptive testing are possible
Precise timing
Self paced presentation and response
Complete randomization of question presentation |
|
Differences between computer and standard test administration |
Little evidence of score differences between the two administration methods
Reliability is comparable with both |
|
Responses and feelings about computer administered tests |
Rather than feeling alienated or frightened, most test takers find the interaction enjoyable
This may not be true for disadvantaged test takers who have little exposure to technology
People may be more likely to respond honestly to personally sensitive questions on a computer than on self report interviews |
|
Computer adaptive testing (CAT) procedure Also known as branching |
All test takers start with the same set of questions of moderate difficulty
The program presents harder or easier questions depending on how these initial questions were answered
Test takers spend little time on questions that are too hard or too easy
The program presents items based on the test taker's skill level until a predetermined number of items have been answered incorrectly (the test is then over) |
|
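The branching procedure above can be sketched in a few lines of Python. This is a toy illustration with invented parameters (integer difficulty levels 1-10 and a deterministic examinee who answers correctly whenever an item is at or below their skill level), not any real CAT engine:

```python
def run_cat(skill_level, start=5, max_wrong=3, max_items=40):
    """Toy adaptive test: branch harder after a correct answer,
    easier after an incorrect one, stop after max_wrong misses."""
    level = start
    wrong = 0
    for _ in range(max_items):
        correct = level <= skill_level      # deterministic toy examinee
        if correct:
            level = min(level + 1, 10)      # branch to a harder item
        else:
            wrong += 1
            level = max(level - 1, 1)       # branch to an easier item
            if wrong == max_wrong:          # predetermined misses: stop
                break
    return level                            # settles near the true skill

print(run_cat(skill_level=8))  # → 8
```

Note how the loop spends almost no items far from the examinee's level, which is the efficiency advantage claimed for CAT.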
Computer adaptive testing (CAT) advantages |
Provide a better profile of a person in a shorter period of time
Test scores can be given almost immediately |
|
Behavioral assessments include measures of |
Job samples in which ratings are made of ongoing job-related activities
Ratings of children's in-class or playground behavior
Ratings of psychiatric patients before and after treatment
Ratings of ongoing social interactions |
|
What is being measured / what effects are we looking at in behavioral assessments |
In all behavioral assessments, a rater is evaluating someone else's behavior using some form of evaluative scales or dimensions
Both the test and its reliability are based in the rater |
|
What is the main concern/ issue in behavioral assessments |
Reliability of the ratings from raters |
|
Reliability of ratings/ raters |
Reactivity -When raters know their ratings will be evaluated, ratings are more accurate than when not being checked |
|
Overcoming reactivity |
Surprise spot checks are made from time to time
Kappa coefficients are needed to assess interrater reliability (agreement between raters)
Interrater reliability needs to be assessed during the actual observation periods, not training |
|
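The kappa coefficient mentioned above can be computed directly from two raters' codes. A minimal sketch with invented data (two raters coding ten observation intervals as on-task "T" or off-task "F"):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    # Observed agreement: proportion of items coded identically
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal category rates
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

a = ["T", "T", "F", "T", "F", "T", "T", "F", "T", "T"]
b = ["T", "F", "F", "T", "F", "T", "T", "T", "T", "T"]
print(round(cohens_kappa(a, b), 3))  # → 0.524
```

A kappa of 1 means perfect agreement; 0 means agreement no better than chance.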
Training of observers (raters) |
Behavioral ratings require that observers be extensively trained on what to observe and how to code what is observed
-need to generate a code book to train raters on what to look for and how to evaluate
-anchor the scale (everyone knows what a 1 or a 5 means) |
|
Drift |
In the field, actual ratings may depart from the ratings done at training and become unique to the rater
These unique individual ratings can take many forms |
|
Forms that unique individual ratings of the rater can take |
Inconsistency effects
-the same behavior is rated differently on different occasions; reliability is gone
Shifting standards
-ratings of the same behavior differ across people
Group standards effects
-when groups of raters are observing, they may adopt informal, implicit rules for observing; unless these are known, they do not match the training and reliability is gone
Contrast and assimilation effects
-behavior is rated differently depending on the preceding behavior |
|
Overcoming drift effects |
Periodic retraining on the original coding format is often necessary
The best way is to videotape the whole observation and then evaluate the tape with the raters |
|
Interview |
One person asks another a series of questions believed to be diagnostic of a quality or attribute the interviewer is trying to assess Best known technique for assessment of individual differences |
|
Landy (1985) |
Companies interview between 5 and 20 people for each hire
Given the costs in time, effort and resources for companies, the utility, validity, and reliability of interviews have been studied |
|
What is the purpose of interviews |
To ask questions that reveal whether or not the interviewee has the skills, ability, interest and motivation to do the job, profit from additional training, be of benefit to the organization, enjoy the new position, and get along well with co-workers
These concerns are about predictive validity |
|
3 forms of interviews |
Structured
Semi-structured
Unstructured |
|
Structured interviews |
Everyone gets the same questions in the same order from a panel -in essence an orally administered questionnaire |
|
Semi structured interview |
A patterned or guided interview covering certain predetermined areas of interest |
|
Unstructured interview |
A nondirective depth interview where the interviewer sets the situation and encourages the interviewee to talk as freely as possible |
|
Interviews gather (observe) different types of information |
Observation of a limited sample of behavior such as speech rate, language usage, poise, reaction to being in an unfamiliar situation, nervousness, style of dress, posture
It is empirically unknown whether any of this collected information from interviews relate to or predict the criterion |
|
Interviews can elicit information that may predict the criterion (it is believed) |
It is often claimed that what a person has done in the past is a good predictor of what they will do, or are likely to do, in the future |
|
The claim that past is a predictor of future behavior is true if |
1) the situation and person remain stable 2) the person's interpretation of their behavior in that or similar situation remains constant |
|
What does the effectiveness of an interview rest on |
The ability to collect useful information from an interview rests solely on the skill of the interviewers
-their skill in asking the right questions and correctly interpreting the respondents' answers
(Interviewers are the test) |
|
Interviews can go wrong in predicting outcomes when |
Respondents or interviewers conceal important information
Important questions were not asked
Information was not correctly interpreted
The interviewer is insensitive to cues in the interviewee's behavior
The interviewer is inattentive to information that was reported |
|
Why is sensitivity to responses important |
Sensitivity to what was stated and how it was stated may lead to further probing, learning new information and qualifying previous answers |
|
Different types of interviews include |
Employment interviews
Mental status exams
Clinical interviews |
|
What is an employment interview |
The most frequently used pre-employment assessment done by organizations
Can vary along a dimension from traditional to structured |
|
Traditional interviews |
Where a number of different areas are discussed with each job applicant
Serves to acquaint the applicant with possible work colleagues and the work environment |
|
Structured interviews |
Standardized, with each applicant receiving the same questions and responses are scored using a scoring format |
|
Reliability and validity of traditional unstructured interviews |
Traditional unstructured interviews are often invalid and are unreliable predictors of future work performance |
|
Hunter and Hunter (1984) |
Found a predictive validity coefficient of 0.14 between the Interviewer judgments during a traditional interview and future job evaluations and performance
Reasons for low validity are not hard to find
Results from these interviews say more about the Interviewer than interviewee |
|
Traditional interviews have low validity because they are |
Plagued with age, gender and attractiveness stereotypes
Halo and horns effects
Do not address the major concerns of the organization |
|
Halo and horns effects |
A form of rater bias which occurs when someone walks in and projects competence or incompetence in one area, and the rater correspondingly rates that person high or low in all areas |
|
Negative search strategy |
Search for any negative information that would disqualify an applicant
Any negative information will be enough to reject applicants unless demand is high enough and few workers are available
When impressions are favorable (halo effect) the rejection rate drops to 25%
(Most interviews operate on negative search) |
|
Webster (1964) |
Found that one unfavorable impression was enough to sink the applicants chances in 90% of cases |
|
What constitutes negative information |
Poor communication skills
Lack of confidence or poise
Low enthusiasm
Nervousness
Failure to maintain eye contact
(These are all signs of introversion and social anxiety) |
|
What constitutes positive information |
Ability to express oneself
Self confidence
Poise
Enthusiasm
Ability to sell ones self
(All signs of extroversion, assertiveness and social skills) |
|
What causes a good first impression (tipping the balance in your favor) |
Looking professional
Well groomed
Project an aura of competence and expertise
Nonverbal cues that imply friendliness and warmth |
|
Structured job interviews |
Address issues of reliability and validity that are raised by traditional interviews
Change from interpersonal relationships to job focused questions |
|
Structured employment interviews uses questions that are |
Job focused
Pre-planned
Presented in the same order for all
Answers are scored according to a predetermined scoring procedure |
|
Interviewers in structured job interviews |
Interviewers are trained on how to ask questions, how to take notes, and how to score answers
These procedures standardize the interview for each candidate |
|
What do structured interviews focus on |
Focuses on the relationship between past behavior and current and future behavior
Linking past, present and future behavior provides better predictions of future behavior
Traditional interviews focus on questions that assess attitudes, opinions and interpersonal dynamics |
|
In structured job interviews all job seekers are asked to |
Provide specific examples of behaviors they have used in the past
Provide examples of what they would do under specific circumstances
-answers are rated on behaviorally anchored rating scales |
|
What interview styles are used most often |
Many companies and organizations use traditional employment interviews
So standard questions result in standard answers |
|
Successful candidates and turnover |
People do not have a great track record when it comes to identifying successful job candidates
Harvard Business Review points out that 80% of employee turnover is due to bad hiring decisions
Hiring is difficult and mistakes are expensive |
|
Society for human resource management reports that |
36% of new hires fail within the first 18 months
40% of senior managers hired from outside the organization fail within 18 months
It costs on average one third of a new hire's yearly salary to replace them |
|
Reasons why so many hires fail reflect |
Improper gathering of information during the interview
Improper analysis of information
Improper interpretation and integration of data
Implicit reliance on stereotypes, halo/horn effects, and heuristic reasoning
(All play key roles in the success or failure of selecting good employees) |
|
Mental status exam |
A 15-20 minute interview in which an intake worker assesses the likelihood of brain damage, drug or alcohol problems, psychosis, and other major mental and physical health issues |
|
What is the purpose of mental status exams |
To assess neurological or emotional problems in terms of variables known to be possible causes |
|
What is noted in mental status exams |
The patient's appearance, behavior, speech, perception, thoughts and attitudes are noted |
|
What is assessed in mental status exams |
Emotional states
-flat affect (little fluctuation in emotion)
-emotional inappropriateness
-emotional lability
Intellectual functioning
-speed and accuracy of thinking
-richness of thought content
-memory capacity
-judgmental accuracy
-proverbs tests
Attention processes
-level of distraction
-perseverance
-presence of hallucinations
-delusions |
|
What are emotional, intellectual, attention and thought problems associated with |
Are markers of schizophrenia, drug dependency, anxiety disorders and brain disease |
|
What do the results from Mental status exams tell us |
Tentative diagnosis
Likelihood of injury to self or others
Outcome of psychotherapy |
|
How to do a mental status exam competently |
A complete understanding of the major mental disorders is required:
-thorough knowledge of the various forms of brain damage
-thorough knowledge of neurological impairment
-thorough knowledge of the DSM-5 coding system
(Not one fixed exam) |
|
Clinical interviews cover same ground as mental status exams but can be broader and also explore |
Job prospects
Career alternatives
Self-knowledge
Information to make more appropriate life choices
Therapy and therapeutic related outcomes |
|
What is the purpose of clinical interviews |
The task is to obtain information important for the person, but what is important depends on the nature and purpose of the interview |
|
Clinical interviews can be broad or narrow depending on |
Nature of the referral question
Nature and quality of the background information
Time demands
Concurrent clinical judgments |
|
Interviewers in mental status exams and clinical interviews |
The tone, interview climate, and answers elicited hinge on the behavior of the interviewers |
|
Research on clinical judgments |
Just how accurate are individuals or panels of individuals in
-synthesizing and integrating information about another person
-arriving at a correct decision
-how accurate are clinical diagnoses, judgments and evaluations made by a single judge or a panel |
|
What types of information and outcomes are used |
Judges or panelists have a number of sources of information on the person being interviewed
-test scores
-test score patterns
-interview results
-family histories
-medical information
-biological information
-school records |
|
Studies on how accurate judges' judgments are ask |
Given the wealth of information available to decision makers, how accurate are evaluators, teachers, clinicians and coaches in predicting outcomes
Evaluators are the test instruments
-the validity and reliability of evaluations made by evaluators becomes an issue |
|
What are actuarial methods |
Involves converting all available background information into numbers and entering all of the information into regression equations
The method allows you to see which pieces of information best predict an outcome |
|
Cleary model |
Another name for actuarial methods
Often contrasted with clinical judgments, in which evaluators arrive at a judgment, evaluation or determination of a particular case |
|
Who comes up with a better prediction -a regression equation or humans integrating the available data |
A regression model often equals or exceeds the diagnoses, predictions, judgments and evaluations made by individuals or panels |
|
Why are evaluators reports not better than regression |
Evaluators' or judges' reports of how they combine and weight data bear little relationship to how
-the information is actually combined
-the weight or importance attached to that information
(Origin and rationalization issues)
-they don't know where the results come from, but come up with reasons |
|
Sarbin (1943) compared high school counselors' predictions of grade 12 students' success in college against the accuracy of a regression equation |
Counselors used college aptitude test scores, grades, interview results, scores on a vocational interest inventory, personality test scores, and post high school interviews
The regression equation used only college aptitude test scores and high school grades as predictors
Counselor predictions came close to the regression equation's predictions for girls, but regression did better for boys
(Regression did better overall) |
|
Meehl (1954) |
Asked if clinical judgments were better predictors than regression equations
Judges were clinical psychologists, counselors, teachers, clinical social workers, and other professionals with varying degrees of education and work experience |
|
What did Meehl find |
Found that, with few exceptions (administrative assistants), actuarial methods yielded as many correct predictions as, and frequently more correct predictions than, the clinical analyses given by professionals |
|
Meehl 1965 -repeated the study but included 50 new clinical outcome studies |
67% of studies favored statistical prediction
33% showed no difference between clinical and actuarial judgments |
|
Goldberg (1965) |
Reported that statistical predictions from MMPI profiles predicted future mental health status better than did clinical predictions |
|
Grove (1996) -meta analysis of 136 studies that directly compared clinical and actuarial judgments |
64 studies (47%) found actuarial methods predicted better than clinical judgments
64 studies (47%) showed no difference
-clinicians had the advantage because more information was available to them than to the actuarial predictions
8 studies (5%) favored clinical judgments |
|
Enhancing clinical predictions Neither... |
-the amount of clinical experience
-the number of years of professional training
enhanced predictive accuracy over regression equations or the use of mechanical prediction rules
In mechanical prediction, weights are given to each predictor on the basis of past outcomes |
|
Why are clinical judgments sub optimal |
Fail to realize that diagnostic cues are probabilistic, not absolute categorical cues for outcomes
Fail to account for cultural, subcultural or gender differences
Use racial or gender stereotypes when making judgments
Use illusory correlations rather than decision rules
-looks like a relationship but is not one
Overuse and rely on inaccurate predictive principles
-ignore base rate information and regression to the mean, or use too many correlated predictors |
|
The most important reason why clinical judgments sub optimal |
Use intuition, emotion or gut feelings when making judgments or interpreting information |
|
What is regression to the mean |
What happens when an extreme event occurs and you assume it will continue to happen, and are then shocked when it doesn't
The extreme score was an outlier, and later scores drift back toward the average |
|
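The idea can be demonstrated with a small simulation (all parameters here are invented): people selected for extreme first scores tend to score closer to the mean the second time, purely because the noise component of their first score does not repeat:

```python
import random

random.seed(1)

# Toy model: each observed score = stable true ability + random noise
def score(ability):
    return ability + random.gauss(0, 10)

abilities = [random.gauss(100, 5) for _ in range(10000)]
first = [(score(a), a) for a in abilities]
top = sorted(first, reverse=True)[:100]   # the top 1% on the first test

mean_first = sum(s for s, _ in top) / len(top)
mean_second = sum(score(a) for _, a in top) / len(top)
print(round(mean_first, 1), round(mean_second, 1))  # second mean is lower
```

The extreme group was selected partly for lucky noise, so on retest their average falls back toward the population mean even though no one's true ability changed.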
Is information intergration or gathering the issue in clinical interviews |
In clinical interviews, data synthesis (integration) is the issue, not information gathering |
|
When are Clinical interviews better |
Clinical interviews are better than actuarial procedures for obtaining information on infrequent behavior |
|
Test bias |
Refers to questions concerning several issues such as
-item fairness
-comparable prediction scores across groups
-the construct validity of the test across groups |
|
Concern with test biases |
The concern with test bias arises when test characteristics detract from the construct measurement
-it does not refer to attributes associated with test takers |
|
How does the standards define bias |
A bias is a systematic error in a test score
A biased assessment is one that systematically under or over estimates the construct it is designed to measure
Bias exists in the test not the people |
|
When is a test not biased |
If an achievement test produces different mean scores for different ethnic groups, but there are actual true score differences between the groups, then the test is not biased |
|
When is a test biased |
If the observed differences in achievement scores are the result of the test underestimating or overestimating the achievement of a group, then the test is culturally biased |
|
The concept of test bias focuses on what question |
Questions about the interpretation of the validity of the test score (test performance) |
|
How can test bias affect test takers |
Test bias refers to systematic error in the estimation of some true value for a group of individuals
Construct over- or under-representation and construct-irrelevant components may affect the performance of different groups of test takers |
|
Most controversial finding in Psychology |
The persistent one standard deviation difference between the intelligence test performance of black and white students
-15 standard score points |
|
Cultural test bias hypothesis (CTBH) |
Any gender, ethnic, racial or group difference in performance is attributed to test bias |
|
Any gender, ethnic, racial, or group performance difference on mental tests can be attributed to |
-Inherent artificial biases produced within the test through flawed psychometric methodology
-group differences are believed to stem from test characteristics
-group differences are unrelated to any actual differences in the psychological trait, skill or ability in question |
|
Cultural loading |
Refers to the degree of cultural specificity present in the test or items
A test can be culturally loaded without being culturally biased
-the greater the cultural specificity, the greater the likelihood of an item being biased when used with individuals from other cultures
-all tests in current use are bound in some way by their cultural specificity |
|
Mean difference hypothesis |
Mean level differences in performance on tasks between two groups are believed to constitute test bias
Asserts that there is no valid scientific reason to believe that performance levels should differ across racial, ethnic or gender groups
Tests that demonstrate differences are deemed biased
-this is not correct, as there is no prior basis for deciding that differences don't exist |
|
Thinking behind the mean difference hypothesis |
Requires that the distribution of test scores in each population be identical before the test can be assumed nonbiased, regardless of its validity
Portraying a test as biased regardless of its purpose or the validity of its interpretations suggests a poor understanding of the construct being assessed and of the issues of bias |
|
Jensen (1980) |
Discusses the mean-differences-as-bias definition in terms of the egalitarian fallacy
Under this fallacy, a difference in any aspect of the distribution of mental test scores indicates that something is wrong with the test |
|
Egalitarian fallacy |
The idea that all human populations are identical on all mental traits or abilities |
|
Berry and Annis (1974) |
The Temne live in a vertical world and the Inuit live in a horizontal world
They differ in susceptibility to the vertical-horizontal illusion |
|
Features of a test that indicate fairness |
Interpreting test scores
Minimizing error in test presentation and scoring
Enhancing test validity
Accommodations for those with disabilities
Writing appropriate items
Evaluating potential job candidates through standard criteria for all |
|
Irrelevant factors in fair tests |
Factors irrelevant to the construct Are eliminated during assessment to help ensure that the construct is measured in a way that is impacted only by knowledge, skills or abilities relevant to the construct itself |
|
Why are other definitions of test bias in CTBH or cross group test validity unacceptable as a scientific perspective |
The imprecise nature of other uses of the term makes empirical investigation and rational inquiry exceedingly difficult
Other uses of the term invoke specific moral value systems that are the subject of intense emotional debate and have no mechanism for rational resolution
-emotional appeals, legal adversarial approaches and political remedies for scientific issues are not scientifically acceptable and not useful |
|
Once mean group difference are identified there are 4 common explanations for these differences |
The differences primarily have a genetic basis
The differences have an environmental basis
The differences are due to the interactive effect of genes and environment
Tests are defective and systematically underestimate the knowledge and skills of minorities, which leads to differential validity (CTBH) |
|
Unfairness as a measurement bias |
When test items are unrelated to the intended construct, it can result in test score differences across subgroups |
|
What is Differential item functioning |
Differences in the functioning of test items between defined groups
Indicates that individuals from different groups who have the same standing on the construct being measured do not have the same expected test score
Happens when test takers of equal ability do not have the same probability of answering a test item correctly
Leads to predictive bias |
|
Differential item functioning needs what |
An indication of differential item functioning must be accompanied by a suitable explanation for the differential item functioning to justify calling an item biased |
|
Predictive bias |
Differences exist in the pattern of associations between test scores and other variables for different groups, causing concerns about bias in the inferences drawn from the use of test scores |
|
Fairness |
Fairness is concerned with the validity of interpreting individual scores for their intended uses -unfairness means that the test score interpretations are invalid for the intended uses |
|
To have fairness |
Individuals need to be treated as similarly as possible (an important aspect of fairness)
It is important to take into account the individual characteristics of the test taker and to understand how these characteristics may interact with contextual factors of the testing situation and the interpretation of test scores |
|
What are the major issues with giving achievement tests to minority groups |
Inappropriate content
Inappropriate standardization samples
Examiner and language bias
Measurement of different constructs
Differential predictive validity
Qualitatively distinct aptitude and personality
Inequitable social consequences |
|
Inappropriate content |
Black and other minority children have not been exposed to the material involved in the test questions
Tests are geared primarily toward white middle class homes, vocabulary, knowledge and values
Inappropriate content makes the test unsuitable for use with minority children |
|
Inappropriate standardization samples |
Ethnic minorities are underrepresented in the standardization samples used in the collection of normative reference data
Inappropriate standardization samples make the test unsuitable for use with minority children |
|
Examiner and language bias |
Because most psychologists are white and speak English, they may intimidate black and ethnic minority children
Examiner race and language use bias test results
Biases happen because examiners are unable to accurately communicate with minority children and are insensitive to ethnic pronunciations of words on the test |
|
Measurement of different constructs |
Tests measure different constructs when used with children from cultures other than the middle class culture on which the tests are largely based
Not a valid measure of intelligence in minority groups |
|
Differential predictive validity |
Tests measure constructs more accurately and make more valid predictions for individuals from the groups that tests are mainly based on than other groups |
|
Qualitatively distinct aptitude and personality |
Majority and minority groups have qualitatively different aptitude and personality traits
So test developers should begin with different definitions for different groups
Helms argued that European and African values and beliefs are different, which affects responses |
|
Inequitable social consequences |
Due to educational and psychological test bias, minority group members are already disadvantaged in educational and vocational markets because of past discrimination, assumptions of inability to learn, and disproportionate assignment to dead-end educational tracks
These represent the inequitable social consequences of biased testing
|
|
What is a biased item |
An item is biased when it is demonstrated to be significantly more difficult for one group than another
-test items must be unidimensional (all items must measure the same factor)
-items identified as biased must be differentially more difficult for one group than another
-under this definition groups may have different mean test scores, but group differences must be reflected in an equivalent fashion across all items |
|
How to determine biased test items |
A number of statistical techniques, many based on item response theory, are used to detect differential item functioning |
|
Research results on biased items |
Very little bias is found in tests at the level of individual items
Some biased items are nearly always found, but they account for no more than 2-5% of the variance in performance
For every item favoring one group there is an item favoring the other group |
|
Similarity among biased items |
Very little similarity among biased items has been found
Poorly written, sloppy and ambiguous items tend to be identified as biased with greater frequency than items encountered in well constructed standardized instruments |
|
How to eliminate biased items |
Expert panels of minority psychologists are asked to indicate which items would be too difficult for minority or disadvantaged individuals
Items that are seen as culturally biased by the panel are removed |
|
Use of expert panels show two consistent findings |
Expert judges were no better than chance in choosing the test items on which minority children scored lower than whites
Judges are not able to detect items which are more difficult for minority children, and the ethnic background of the judge makes no difference in the accuracy of item selection |
|
Methods used for the internal analysis of test items (item biases in construct measurement) |
Factor analysis across groups
Correlation of raw item scores with age
Comparison of item total correlations across groups
Comparisons of parallel forms and test retest correlations |
|
Comparative item selection (Reynolds 1998) |
Multiple retesting of item sets across groups
Unbiased tests will show a 90% overlap rate between tests
Biased tests and tests with low reliability will show low overlap
Need large samples for stable results |
|
Bias in construct measurement |
Construct measurement of a large number of often-used assessment instruments has been investigated across ethnicity and gender with a divergent set of methodologies. No consistent evidence of bias in construct measurement has been found in the many prominent standardized tests investigated |
|
Psychological tests |
Function and are measured in the same manner across people of diverse ethnicities and genders. Tests appear to be unbiased for the groups investigated, and mean score differences do not appear to be an artifact of test item bias |
|
What is the recommended method for detecting item bias |
Item response theory followed by a logical analysis of item content. These methods are used to determine the degree of differential item functioning (whether items function differently across groups, via the model parameters associated with the items) |
|
Item response theory models have various item parameters that describe item behavior (three-parameter model) |
1) item difficulty (most important) -the point on the latent trait continuum at which the examinee has a 50% chance of correctly answering the item 2) discrimination power of the item (slope) 3) guessing parameter |
|
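Note, beyond the card text: with a nonzero guessing parameter c, the probability of a correct answer at theta = b is (1 + c)/2 rather than exactly 50%; the 50% figure holds when c = 0. A minimal sketch of the 3PL item characteristic curve (parameter values are illustrative, not from the source):

```python
import math

def p_correct(theta, a, b, c):
    """3PL model: probability of a correct response given ability theta,
    discrimination a, difficulty b, and guessing parameter c."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# With no guessing (c = 0), an examinee at theta == b answers correctly 50% of the time.
print(p_correct(0.0, a=1.0, b=0.0, c=0.0))   # 0.5
# With c = 0.2, the floor rises: probability at theta == b is (1 + 0.2) / 2 = 0.6.
print(p_correct(0.0, a=1.2, b=0.0, c=0.2))   # 0.6
```

DIF detection compares these curves across groups: if the curves differ for examinees of equal ability, the item functions differently.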
Rasch model |
Single parameter model that models item difficulty |
|
Using Item Response theory to determine Differential item functioning |
Compares the item characteristic curves of two groups to create a differential item functioning index. Various statistical methods have been developed for measuring the gaps between item characteristic curves across groups of examinees |
|
Partial correlation analysis |
Simple but less precise way to determine item bias
Tests for differences between groups in the degree to which there is meaningful variation in observed item scores not attributable to the total test score
Meaningfulness is based on effect size, obtained from the coefficient of determination
Need to be attentive to experimentwise error rates |
|
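The partial-correlation approach above can be sketched in a few lines (pure-Python and illustrative; the variable roles are assumptions, not from the source): correlate group membership with item scores while partialling out the total test score, then square the result for the effect size.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def partial_r(x, y, z):
    """First-order partial correlation of x and y with z partialled out."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Illustrative use: x = group membership (0/1), y = item score, z = total score.
# partial_r(x, y, z) ** 2 is the effect size (coefficient of determination).
```

Running many such tests, one per item, is what makes attention to experimentwise error rates necessary.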
Biases when using tests to predict future outcomes are constrained by two problems |
Biases in the measurement of the criterion/outcome. The correlation between predictor and criterion is limited by the poor measurement characteristics of the criterion -the square root of the criterion's reliability sets the maximum possible validity -from the standpoint of the application of aptitude, achievement, and intelligence tests in forecasting probabilities of future performance, prediction is the most crucial use of test scores to examine |
|
Predictive accuracy can be determined in a few different ways |
An item analysis can determine if items function the same in all groups (no criterion)
Assess unstandardized regression weights (slopes) to see if weights are comparable across groups
Differences in group averages on the test and averages on the criterion
Examine cut-off scores separated by group and assess differences |
|
Job performance tests |
Tests that are similar to actual job performance show little bias across groups. Biases arise when inferences are made from test results to behavior unrelated to the test |
|
Regression equations |
Regression equations are used to assess biases in prediction. Predictions take the form y = aX + b, where a is the regression coefficient (slope) and b is the constant (intercept). An unbiased test requires errors of prediction to be independent of group membership, and the X-Y regression line must be the same for each group -error around the regression line should be similar |
|
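The check described above amounts to fitting the regression line separately for each group and comparing slopes and intercepts. A minimal sketch (the data are invented to illustrate equal slopes with an intercept difference):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (slope a, intercept b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Hypothetical groups: identical slopes but different intercepts -> intercept bias.
group_a = ([1, 2, 3, 4], [3, 5, 7, 9])    # follows y = 2x + 1
group_b = ([1, 2, 3, 4], [5, 7, 9, 11])   # follows y = 2x + 3
a1, b1 = fit_line(*group_a)
a2, b2 = fit_line(*group_b)
print((a1, b1), (a2, b2))  # (2.0, 1.0) (2.0, 3.0)
```

Equal slopes with unequal intercepts means a single common equation would consistently over- or under-predict one group's criterion scores.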
Homogeneity of regression across groups -simultaneous regression -fairness in prediction |
When the regression equations for two groups are equivalent (the prediction is the same for those groups). When homogeneity of regression across groups does not hold, separate regression equations should be used for each group |
|
Cleary model |
The use of a single equation to make predictions from test scores Refers to the use of regression weights (slope) to predict job success or outcomes |
|
Clinical use of regression equations |
In clinical practice, regression equations are rarely generated for the prediction of future performance
Rather, some arbitrary or statistically derived cutoff score is set, with failure inferred for scores below it -usually based on clinical lore or past practices
2 SD below test mean is used to infer a high probability of failure in school performance |
|
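As a worked example of the 2-SD rule above (the mean of 100 and SD of 15 are the conventional IQ scaling, not stated on the card):

```python
mean, sd = 100, 15          # conventional IQ-style scaling (assumed values)
cutoff = mean - 2 * sd      # score below which failure is inferred
print(cutoff)               # 70
```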
Cut off criterion |
Using cutoff scores, clinicians are implicitly establishing prediction (regression) equations about mental aptitude that are assumed to be equivalent across race, sex, etc. |
|
Gordon's 4 types of test bias |
Case 1 -groups A and B differ on the test but have identical slopes relating the test and criterion -example of homogeneity of regression across groups
Case 2 -there are different test scores, different slopes and different intercepts meaning different test validities from the two groups
Case 3 -similar slopes but minority receives higher criterion scores than majority
Case 4 -similar slopes but majority receives higher criterion scores than minority |
|
Interpreting Case 1 and 2 |
The issue is differential test validity for the two groups
In both cases the regression slope and intercept are examined
-the slope of the line (regression weight) is the correlation between test score and criterion (for standardized scores)
-the correlation is the predictive validity coefficient
If the test has significantly different validity coefficients for one group compared to another, then slope bias and differential test validity are present |
|
Common issue in case 2 Differential validity issues |
There are often many more test takers in the majority than minority group
This means that the regression weight will be significant for the majority group but not the minority, and because of the sample size differences, between-group comparisons of the coefficients lack power
In such cases the analysis will suggest that the test is more suitable for the majority than the minority group (discriminates against the minority) |
|
Hunter and Schmidt 1997 |
In a review of 866 black-white prediction comparisons
There was no evidence for the hypotheses of differential or single-group validity with regard to the prediction of job performance across race for whites and blacks |
|
Great secret truths of differences |
Large scale industrial samples, tests on armed services personnel, school division wide testing all typically fail to find significant differences in validity coefficients
Validity coefficients for nationally administered tests typically fail to show differences between racial groups
In terms of predictive validity, ability tests are equally valid for minority and majority groups in predicting occupational and educational outcomes
When sample size and composition are comparable and the test and criterion are properly constructed, no slope bias is reported |
|
Why is there slope bias in well constructed tests |
Performance on the test and on the criterion are influenced by a number of factors (language skills, age, motivation)
These factors can influence scores on the predictor or the criterion
Making the test culturally appropriate does not address the underlying issue -low scores need to be addressed directly |
|
Intercept bias |
Even though tests show comparable predictive validity across groups, intercept bias may still be present
Intercept bias is present if the test consistently over or under estimates performance on the criterion by one group compared to another |
|
Case 3 |
Although the validity coefficient is the same for both groups, any score on the test (X) will lead to different criterion scores for the groups
Test scores have different predictive meaning for the two groups
Selecting people on the basis of majority scores underpredicts minority group criterion performance
Case 3 is the situation concerning those who view tests as biased |
|
Case 4 |
Use of majority test scores in predictive regression overpredicts minority group performance, discriminating in favor of them -evidence on intercept bias indicates that on well-constructed tests, there is no significant intercept bias -or a slight tendency in the opposite direction |
|
When does case 4 happen |
Occurs when other variables are correlated with the test and the criterion -reading ability, language proficiency, but primarily test familiarity and test preparation |
|
Current work on test bias |
Over the last several decades emphasis has shifted from evaluation of test bias to the design of selection strategies for fair test usage with minority groups
Examples: the Cleary model and compensatory models. These selection models cause public uproar, but what they do is assign different weights to other factors to be considered when accepting minority group candidates |
|
Cleary model |
The use of regression weights to predict job success of other outcomes
Selection is based strictly on the test and criterion scores without regard for other goals in the selection process |
|
Compensatory models |
Select larger proportion of minority group members by lowering acceptable test scores or select applicants based on other criteria (proportion of minority applicants) |
|
Selection models like the Cleary model are called |
Expected utility models -in such models, clear statements of values and the intended consequences of selection decisions are made explicit. Issues such as providing equal opportunity, increasing demographic mix, and preferential selection of people from historically disadvantaged groups are all part of the selection process |
|
No agreed-upon definition of intelligence But there are three broad conceptual ideas about intelligence which can be described |
Psychometric tradition -examines the structure of test items, dimensions underlying responses, and correlates of test responses Information processing approach -examines the underlying encoding, processing, and solving of various problems Cognitive approach -focuses on how people solve real world problems and adapt to real world demands |
|
Development of the Binet scale |
In the late 1800s current theory suggested a relationship between head size and school success (craniometry)
Binet failed to find any relationship
In 1899 Binet dropped craniometry as a measure of intelligence and began a search for other measures. Binet returned to measuring intelligence in 1904, at the same time Francis Galton published his work on intelligence |
|
What did Galton believe |
That intelligence could be assessed through physical measures -grip strength, reaction time, keenness of vision, auditory acuity, and mental imagery. Binet wanted something more |
|
Binet wanted a measure that reflected |
What people do (not who they are) A number or numbers that reflected whether questions were answered correctly Answers to questions that indicated an underlying mental process |
|
What did Binet ask children to do |
Asked children to respond to tasks that reflect common experiences -counting coins, giving and receiving instructions, making simple inferences, answering questions, and solving problems. Tasks were presented by a trained tester. Items were graded in terms of difficulty and covered a wide range of problems |
|
These tasks used by Binet were thought to tap three processes that reflected intelligence |
Comprehension Invention Correction |
|
General mental ability |
General mental ability meant that the reason children who were correct on one question were also right on others is that intelligence is a general mental ability made up of several different processes (positive manifold). He thought that performance on a wide range of varied tasks could reflect a measure of general mental ability |
|
Original Binet scale |
The original 1905 scale had 25 age-graded items The 1908 scale with Theodore Simon had 32 and an age criterion for each item Start with the simplest items and progress until the child continues to make mistakes |
|
What was considered normal intelligence |
The criterion Binet and Simon adopted was the age at which children could correctly answer a question 66%-75% of the time
The age associated with the last correct answer became known as the child's mental age -children whose intelligence level (mental age minus chronological age) was less than 0 were identified for special education |
|
Mental age |
Binet and Simon defined a child's intelligence level as mental age - chronological age |
|
How did Binet define intelligence |
Adopted a functional perspective where intelligence must be reflected in behaviors that are adaptive and goal directed
-Take and maintain a definite course of action (comprehension)
-Capacity to change plans or method to attain a desired end (invention)
-Ability to see errors and correct them (correction) Intelligent people use information more efficiently to meet their desired goals than do less Intelligent people |
|
What did Stern 1912 argue |
That mental age should be divided by chronological age to give an intelligence quotient -division is more appropriate than subtraction because the relative, not absolute, difference between mental age and chronological age is important. Interest now becomes the rate of development relative to age |
|
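Stern's point about relative differences can be shown with a one-line computation (the conventional factor of 100 and the ages below are illustrative, not from the card):

```python
def ratio_iq(mental_age, chronological_age):
    """Stern's intelligence quotient: mental age / chronological age,
    conventionally scaled by 100."""
    return 100 * mental_age / chronological_age

# The same 2-year lead in mental age means more at age 4 than at age 12:
print(ratio_iq(6, 4))            # 150.0
print(round(ratio_iq(14, 12), 1))  # 116.7
```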
Terman 1916 |
Brought the Binet-Simon test to the US, where it became the Stanford-Binet, ushering in the modern age of intelligence testing |
|
Spearman (physiological efficiency) |
Binet saw intelligence reflected in people's ability to solve problems that arise when attaining their desired goals. The problem with this idea was that some people had more functionally adaptive problem-solving ability than others. Spearman's two-factor theory of intelligence saw intelligence as a central general ability (g) plus levels of specific abilities (s) -sought to isolate intellectual power from knowledge content |
|
How does Spearman see intelligence |
Intelligence was less about goal-directed activities and more about abstract reasoning -our ability to perceive and apply relationships. Termed this abstract reasoning ability g (general mental ability). G is one of the best predictors of occupational and educational success |
|
How is G measured |
Analogy problems that require people to perceive relationships between problem components and apply those relationships to the problem. Analogy problems can be expressed verbally or in symbols, pictures, and geometric forms |
|
Functional unity of intelligence |
A physiological structure from which a mental energy or process flows. Spearman and others thought that differences in intellectual functioning reflected a functional unity. One measure of this inner unity is neural speed or processing speed -the idea is that the faster a person can process information, the higher their level of intelligence |
|
Speed of response as a measure of intelligence |
With a speed measure it is critical to separate speed from knowledge
It is processing power (g) that is being measured, not expertise or past knowledge. To separate processing power from expertise, you have to use tasks that are completely novel or tasks that are very familiar or easy |
|
Response time from movement times |
It is also essential in speed measures to separate response time (decision time) from movement time, so that the measure reflects decision processes rather than motor speed |
|
Galton and reaction time |
Reaction time as a measure of intelligence was suggested by Galton before Binet produced the first intelligence test in 1905. Every methodological and statistical problem conceivable plagued Galton's attempts at measuring response time as a measure of human intelligence. Binet's new intelligence test had obvious face validity for intelligence, and Galton's idea of chronometry was easily overtaken and soon forgotten |
|
Advantages of chronometry |
One advantage lies in the scale of measurement
IQ test results produce an ordinal scale
Level of intelligence is always relative to the scores of others in the norm group
Speed-based measures produce an absolute ratio scale -a response time of 30 ms is twice as fast as one of 60 ms no matter who takes the test |
|
Advantages of chronometry |
There is a theoretical advantage with chronometry. Binet's approach to intelligence was entirely functional -why a person scored the way they did was not Binet's concern. Time-based measures permit theoretical development -why are some people faster than others? Fast response times are thought to reflect a speedy rate of oscillation in neural responsiveness and are therefore indicative of intelligence |
|
Inspection time as a measure of physiological efficiency |
Spearman thought that reaction time might reflect encoding sensitivity and retrieval speed
While informative, reaction time data are difficult to interpret because it is hard to know what processes are being assessed
Inspection time is an alternative measure of physiological efficiency -the dependent variable is the correct answer, not response speed |
|
Inspection time accuracy |
Accuracy is assessed against exposure time
The longer the exposure time the greater the accuracy but the lower the processing speed
For each person, the exposure duration yielding 75% accuracy is recorded |
|
What actually is inspection time |
Since it is concerned with accuracy and exposure duration, it is referred to as the speed of taking in information and encoding sensitivity. Correlations between inspection time and tests loading on g are about -.50 to -.55. As inspection time increases, intelligence test scores decrease |
|
Correlations between response time and test scores |
0.2- 0.3 |
|
Does inspection time cause intelligence or vice versa |
Several studies report that intelligence causes fast inspection time. Developmental studies suggest that inspection times at an early age are more closely related to IQ at a later age than vice versa. Inspection time is a measure of something occurring inside the head; it is not a construct, a process, or an explanation |
|
What is inspection time tapping into |
While the evidence is incomplete, inspection time taps into the efficiency with which information is processed after it has been received |
|
Fast inspection time and high IQ scores |
Fast inspection time and high IQ scores occur in people whose evoked potentials are maximal at 140-200 ms after stimulus display |
|
Cigarette smoking |
Cigarette smoking is associated with faster inspection time and higher scores on the Raven's matrices. This happens because smoking stimulates the brain's cholinergic processes, highlighting one mechanism that underlies intelligence |
|
Cattell's two-factor theory of intelligence |
Proposed that intelligence is composed of two components
Fluid and crystallized abilities
These abilities are called Gf and Gc |
|
Gc crystallized abilities |
Reflect past lessons or well-learned responses that have become crystallized (reading and driving). Reflected in shared educational experiences and seen in tests of computational speed, word recognition, pattern matching, basic information, and vocabulary |
|
Gf fluid abilities |
Label applied to an adaptive process of encoding and correctly processing unfamiliar configurations, rearranging those configurations to meet some requirement (Raven's matrices or block design test)
Is spoken of in the singular but there are several components to this factor |
|
Relationship between fluid and crystallized abilities |
Crystallized develops out of fluid, because when tasks are new there is no crystallized knowledge to use
When people of equal fluid ability differ in crystallized ability, the reason probably reflects educational experience, motivation, or environmental factors. Although fluid is largely perceptual and nonverbal in nature, words (crystallized) are needed to formulate and check hypotheses and answers. Crystallized abilities such as language and numerical skills are needed to express creative and novel ideas that come from fluid abilities |
|
What do fluid and crystallized reflect |
A common view holds that fluid is without content, reflecting pure intellectual power, while crystallized reflects book learning, experience, or knowledge and really isn't intelligence -this view is misleading and incorrect |
|
How much fluid and crystallized abilities do we need |
How much fluid ability a task requires reflects the test taker's culture and learning history. With enough practice or experience, any new test or problem could become crystallized |
|
Fluid and crystallized abilities in intelligence tests |
Many intelligence tests reflect both fluid and crystallized abilities. No matter how bright a person is, a poor reader will not do well on ability tests