77 Cards in this Set


issue with guessing


it leads to problems in understanding what one's true test score is, especially on achievement tests

Abbott's formula for blind guessing

Corrected score = R - W / (K - 1), where R = number of correct responses, W = number of wrong responses, and K = number of alternatives per item
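A minimal Python sketch of the correction (function name and example numbers are illustrative, not from the cards):

```python
def corrected_score(right: int, wrong: int, alternatives: int) -> float:
    """Correction for blind guessing: R - W / (K - 1).

    Omitted (blank) items count as neither right nor wrong.
    """
    return right - wrong / (alternatives - 1)

# 60 right and 20 wrong on a 4-alternative test: 60 - 20/3 = 53.33
print(corrected_score(60, 20, 4))
```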

To overcome the influence of blind guessing, one should advise examinees to

attempt every question

Items that are clear in multiple choice formats may be confusing in

short answer formats

According to Ebel, a better way to increase test reliability is

to add more items

The best way to calculate reliability for speeded tests is to

do a split-half reliability with the two halves administered as separately timed sections (an ordinary split-half on a single timed administration inflates the estimate)
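A minimal odd-even split-half sketch with the Spearman-Brown step-up (the 0/1 responses are simulated; for a speeded test the two halves would come from the separately timed sections):

```python
import numpy as np

def split_half_reliability(scores: np.ndarray) -> float:
    """Odd-even split-half reliability with Spearman-Brown correction.

    scores: examinees x items matrix of item scores (0/1 here).
    """
    odd = scores[:, 0::2].sum(axis=1)   # half-test score on odd items
    even = scores[:, 1::2].sum(axis=1)  # half-test score on even items
    r = np.corrcoef(odd, even)[0, 1]    # correlation between the halves
    return 2 * r / (1 + r)              # step up to full-test length

rng = np.random.default_rng(0)
fake = (rng.random((50, 20)) > 0.4).astype(int)  # simulated responses
print(split_half_reliability(fake))
```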

Halo Effect

a rater's tendency to perceive an individual who is high (or low) in one area as also high (or low) in other areas

general-impression model

the tendency of a rater to allow an overall impression of an individual to influence judgments of that person's performance (e.g., a rater may find a reporter "impressive" and thus also rate his/her speech as strong)

Salient Dimension model

When the rating of one quality affects the rating of another, independent quality (e.g., people rated as attractive are also rated as more honest)

Simpson's Paradox

aggregating data can change the meaning of the data; pooling can obscure the conclusions because of a third variable

In terms of minority hiring: minorities applied to two levels of positions, clerical and executive. Overall hiring rates showed that only 11% (110/1010) of the minority group were hired, compared to 14% (85/600) of the majority group.




What is this scenario an example of?

the Simpson Paradox
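A small Python illustration of the reversal. The stratum-level counts below are hypothetical (the card gives only the aggregates), chosen so they sum to 110/1010 and 85/600 while the minority rate is higher within each level:

```python
# Hypothetical (hired, applied) counts per level, consistent with the
# card's aggregates: 110/1010 minority, 85/600 majority.
hires = {
    "minority": {"clerical": (50, 100), "executive": (60, 910)},
    "majority": {"clerical": (80, 200), "executive": (5, 400)},
}

for group, levels in hires.items():
    hired = sum(h for h, _ in levels.values())
    applied = sum(n for _, n in levels.values())
    print(f"{group} overall: {hired}/{applied} = {hired / applied:.1%}")
    for level, (h, n) in levels.items():
        print(f"  {level}: {h}/{n} = {h / n:.1%}")

# Overall, the majority group is hired at the higher rate (14.2% vs
# 10.9%), yet within each level the minority rate is higher: the paradox.
```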

There is a debate about whether our clinical judgment is superior to

mechanical judgment

mechanical judgment

statistical predictions, or predictions based on some type of quantitative index

Marital relationship satisfaction was determined from the ratio of sex to arguments: people tend to rate relationships higher if they have more sex and fewer fights.




This is an example of what kind of mechanical decision making?

crude

Mechanical or quantitative prediction can only work when

people specify which variables to examine in making the prediction

In terms of prediction, people are not as good as mechanical methods at

integrating the data in unbiased ways

Our belief in prediction is reinforced by the

isolated incidents we can access

Factor Analysis

a statistical tool that is used to mathematically determine which items are associated with various latent constructs

Factor analysis requires that one come up with

a pool of candidate items

Steps in factor analysis

1. administer the sample of items to 200-500 subjects


2. input how the sample rated each item


3. run the factor analysis, look at the pattern of where items load, and then name the factors (a minimal sketch of these steps follows)
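A minimal Python sketch of steps 1-3, assuming simulated ratings in place of real subjects (sklearn's FactorAnalysis is one of several tools that could be used; all names and numbers here are illustrative):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Steps 1-2 stand-in: simulated ratings from 300 "subjects" on 10 items
# driven by 2 latent factors (real data would come from administration).
latent = rng.normal(size=(300, 2))
loadings_true = rng.normal(size=(2, 10))
ratings = latent @ loadings_true + rng.normal(scale=0.5, size=(300, 10))

# Step 3: run the factor analysis and inspect where the items load.
fa = FactorAnalysis(n_components=2, random_state=0).fit(ratings)
for i, row in enumerate(fa.components_.T):  # items x factors
    print(f"item {i}: loadings {np.round(row, 2)}")
```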

When doing item development for factor analysis, you need to have ___________ items because they give you greater ability to tap into multiple aspects of the construct

more

Facets

well-defined, homogeneous item clusters that map directly onto the higher-order factors

Dichotomous item response formats cannot be used for factor analysis because

they can cause a serious disturbance in the correlation matrix

When utilizing factor analysis, more response options per item generate a greater amount of

variance

For well defined factors, you can use a sample size of _____________ for factor analysis

100-200

If factors are not well defined you may need a sample size of up to _________ for factor analysis

500

4 Reasons for Conducting Factor Analysis

1. Developing and Identifying Hierarchical Factor Structure


2. Improving Psychometric Properties of a Test


3. Developing Items that Discriminate between Samples


4. Developing more unique items

All tests with sound items should have a strong

internal consistency

Factor analysis can help developers determine which items to remove, revise, or add in order to improve

internal consistency

2 Primary Objections to Short Form Development

1. Rigorous and comprehensive evaluation is crucial, and a short form cannot give the level of information that is required for an appropriate assessment


2. Short forms are often developed without careful and thorough examination of the new form's validity

2 General Problems for Short Forms

1. Assumption that all the reliability and validity of the long form automatically applies to the abbreviated form


2. Assumption that the new, shorter measure requires less validity evidence

7 Problems in Regard to Empirical Evidence for Short Forms

1. Researchers found that if the long form does not have good validity, neither will the short one!


2. Found that by reducing the items, content coverage may be compromised; very few short-form designers performed content domain checks


3. Found significant reduction in reliability coefficients


4. Found that many times researchers do not run another factor analysis on the short form to see if the same factor structure is present


5. Need to administer the short form to an independent sample to determine validity, not the sample the long form was developed on


6. Need to use the short form to classify clinical populations and compare whether it is as accurate as the long form


7. Need to establish whether there are genuine time and money savings with a short form



Item Analysis

general term for a set of methods used to evaluate test items

2 Types of Item Analysis

Item Difficulty vs. Item Discriminability

Item Difficulty

defined by the proportion of people who get a particular item correct

Item difficulty should usually fall between

.3 and .7
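A quick Python sketch of computing difficulty as a proportion and flagging items outside the .3-.7 band (the response matrix is simulated, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated 0/1 response matrix: 100 examinees x 8 items.
responses = (rng.random((100, 8)) > rng.random(8)).astype(int)

p = responses.mean(axis=0)  # item difficulty = proportion correct
for i, diff in enumerate(p):
    flag = "" if 0.3 <= diff <= 0.7 else "  <- outside .3-.7"
    print(f"item {i}: p = {diff:.2f}{flag}")
```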

When developing item difficulty, you need to consider whom

you are testing (like medical students vs. disabled students)

test floor

a sufficient number of easy items

test ceiling

a sufficient number of hard items

Item Discriminability

determines whether the people who have done well on a particular item have also done well on the entire test

Extreme group method for Item Discriminability

compares people who have done very well on a test with those who have done very poorly

discrimination index in extreme group method

the difference between the proportions of people in each group who got the item correct

Item Difficulty Formula

Difficulty = (U + M + L) / N, where U, M, and L are the numbers answering correctly in the upper, middle, and lower scoring groups, and N is the total number of examinees

Item Discrimination Formula

Discrimination index = (U - L) / n, where n is the number of people in each extreme group
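A sketch of the extreme-group computations; the 27% split used for the upper and lower groups is a common convention, assumed here rather than stated on the cards:

```python
import numpy as np

def extreme_group_stats(total_scores, item_correct, frac=0.27):
    """Difficulty (U+M+L)/N and discrimination (U-L)/n for one item."""
    order = np.argsort(total_scores)
    n = max(1, int(len(order) * frac))
    lower, upper = order[:n], order[-n:]
    U = item_correct[upper].sum()          # correct in upper group
    L = item_correct[lower].sum()          # correct in lower group
    M = item_correct.sum() - U - L         # correct in the middle
    difficulty = (U + M + L) / len(order)  # proportion correct overall
    discrimination = (U - L) / n           # extreme-group index
    return difficulty, discrimination

rng = np.random.default_rng(0)
totals = rng.normal(size=200)
item = (totals + rng.normal(size=200) > 0).astype(int)  # tracks ability
print(extreme_group_stats(totals, item))
```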

2 Methods of Item Discriminability

1) Extreme Group Method


2) Point Biserial Method

Point Biserial Method for Item Discriminability

find the correlation between performance on the item and performance on the entire test
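A minimal sketch using SciPy's point-biserial correlation (the item and total-score data are simulated; in practice some analysts first remove the item's own points from the total):

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)
item = rng.integers(0, 2, size=100)       # 0/1 scores on one item
total = 5 * item + rng.normal(0, 3, 100)  # simulated total test scores

r_pb, p_value = pointbiserialr(item, total)
print(f"point biserial = {r_pb:.2f}")  # ranges from -1 to +1
```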

Item Response Theory (IRT) is a collection of mathematical and statistical models that do these 3 things:

1. analyze items and scales


2. measure psychological constructs


3. compare individuals on psychological constructs

The basic unit of IRT is

item response function

item response function is a mathematical function describing

the relation between where an individual falls on the continuum of a given construct, such as depression, and the probability that he/she will give a particular response to a scale item designed to measure that construct
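The cards don't name a specific model, but one common concrete form of the item response function is the two-parameter logistic (2PL); a small sketch with illustrative parameter values:

```python
import math

def irf_2pl(theta: float, a: float, b: float) -> float:
    """P(endorse item) under the two-parameter logistic model.

    theta: person's location on the latent trait (e.g., depression)
    a: discrimination (steepness of the curve)
    b: difficulty/severity (location of the curve on the trait)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# The endorsement probability rises as theta passes the item's b = 1.0:
for theta in (-2, 0, 1, 2):
    print(theta, round(irf_2pl(theta, a=1.5, b=1.0), 2))
```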

In IRT, a construct is called a

latent variable

in terms of the item difficulty index (the proportion answering correctly), the higher the number, the

easier the question

Point biserial ranges from

-1 to +1

A positive point biserial tells us that

the item discriminates well, because those who scored higher on the test also got the question correct

The closer a point biserial is to +1, the more _______________________ it has

discrimination power

Discrimination power means that

it does well at discriminating between upper and lower ranges

A negative point biserial generally indicates that people in the higher scoring ranges got the item _________, as compared to those in the lower scoring range.

wrong

A negative point biserial means that there is something wrong with

your question, but we don't know what.

Classical Testing Theory or CTT is limited by

only 2 sources of error: random and systematic

True Score Model from Classical Testing Theory

X (Observed Score) = T (True Score) + E (Error)

Random Error

fluctuations in the measurement based purely on chance

Systematic Error

error that affects a score because of some particular characteristic of the person or the test that has nothing to do with the construct being measured

CTT recognizes only two sources of variance, and cannot adequately estimate

individual sources of error influencing a measurement

Generalizability Theory acknowledges that

multiple factors may affect the error associated with measurement of one’s true score

Generalizability Theory allows researchers to estimate the total variance or error in terms of

individual factors that vary in terms of the assessment, setting, time, items, and raters

Dependability

whether the test taker's score is dependable across a myriad of conditions

Reliability is dependent on

the inferences (generalizations) that the investigator wishes to make with the data from the measurement

2 Types of Error Analyses

1. G-Study


2. D-Study

G-Study (Generalizability Study)

to provide as much information as possible about the sources of variation in the measurement

D-Study

uses G-Study information to evaluate the effectiveness of alternative designs for minimizing error and maximizing reliability

Generalizability coefficient

A reliable measure is one where the observed value closely estimates the expected score over all acceptable observations

dependability coefficient

how dependable are the measures from one judge to the next

Reliability Formula

X = T + E



Reliability = Var(T) / Var(X)




or Var(T) / (Var(T) + Var(E))
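A small simulation of this identity, with arbitrary true-score and error variances (15^2 = 225 and 5^2 = 25, giving reliability 225/250 = .90):

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(100, 15, size=10_000)  # true scores, Var(T) ~ 225
E = rng.normal(0, 5, size=10_000)     # random error, Var(E) ~ 25
X = T + E                             # observed scores

print(T.var() / X.var())  # ~ 225 / (225 + 25) = 0.90
```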

Item difficulty is synonymous with severity, which means the

more severe the person's diagnosis is, the more likely they will be to endorse that item.

Item difficulty is indicated by the curve that is furthest away

from the Y axis

Item discrimination is determined by the steepness of

the slope

Generalizability coefficients range from

.8-1.0: good generalizability


.6-.8: marginal generalizability


<.6: poor generalizability

The biggest advantage of IRT over CTT is that
you can map differential severity patterns for each item. You can look at individual items and their differential scoring patterns to determine levels of severity as well as discriminability, independent of test bias, which can keep a clinician from overpathologizing.
you can map differential severity patterns for each item. You can look at individual items and look at differential scoring patterns to determine levels of severity as well as discriminability, independent of test bias, which can prevent a clinician from overpathologizing.