Test
measurement device to

QUANTIFY & PREDICT

behavior
Item
specific question or problem that makes up a test
Be able to define, recognize, and differentiate between states and traits
State: specific condition or status of an individual.

Trait: enduring characteristic or tendency to respond in a certain manner.
Achievement
refers to previous learning
Aptitude
refers to potential for learning.
Intelligence
refers to general potential to solve problems
What encompasses achievement, aptitude, and intelligence?
Human Ability
Psychological test
-a set of items that are designed to measure

CHARACTERISTICS OF HUMAN BEINGS

that pertain to behavior
Psychological testing
refers to all possible

USES
APPLICATIONS
UNDERLYING CONCEPTS

of psychological and educational tests
Psychological assessment
GATHER AND INTEGRATE

information to make an evaluation.
If a test is reliable its results are what
-consistent
-Accurate
-dependable

CAD
How and where did testing originate?
China 4000 years ago to help determine work evaluations and promotion decisions
What are test batteries?
Two or more tests used in conjunction
Define standardization?
Uniformity of test conditions
Why is it important to obtain a standardization sample?
Because then you have a set of data to compare all new scores to
representative sample
Similar to a standardization sample except the sample is constructed to represent the population to whom the test will be given
factor analysis
Identifies the minimum number of dimensions (factors) needed to account for a large number of variables
hypothetical construct

(memorize)
Processes that are not directly measurable

inferred to exist

give rise to measurable phenomena
operational definition
A QUANTIFICATION of the HYPOTHETICAL CONSTRUCT.

Measurable phenomenon: what it is that you are measuring to indicate the hypothetical construct.

Veil of measurability: reference to the iceberg; you can only measure so much of a construct, and the rest remains the hypothetical construct.
Be able to reconstruct the figures (including labels) from Chapter 1 lecture. See Power points
see slides
psychometry

(Definition and 2 properties)
-The branch of psychology dealing with the properties of psychological tests

1.Reliability
2.Validity
What are norm referenced tests?
compare test taker’s performance with others
What are criterion referenced tests?
Predict performance outcome outside of the test
What types of questions are answered by psychologists through assessment
Diagnosis and treatment planning

monitoring treatment progress

help client make more effective life choices and changes

program evaluation

help third parties make informed decisions
measurement
The act of assigning numbers or symbols to characteristics of objects according to prescribed rules
Measuring Scale
a set of numbers whose properties model empirical properties of the objects to which the numbers are assigned

any progressive series of values or magnitudes according to which a phenomenon can be defined
What are the three properties of scales that make scales different from one another?
Magnitude

Equal Intervals

Absolute Zero
Magnitude
property of ‘moreness.’
Equal Intervals
difference between two points on a scale has the same meaning as the difference between two other points that differ by the same number of units
Absolute Zero
When nothing of the measured property exists
four scales of measurement
Nominal
Ordinal
Interval
Ratio
Nominal
categorization without any other real differences, e.g., football jersey numbers, gender, etc.

-can be classified, counted, and proportioned

-can't be ranked, added/subtracted, or divided
Ordinal
Assignment of ranks, e.g., race finishers

can be classified, counted, proportioned, and ranked.

can’t be added/subtracted or divided
Interval
equal intervals, e.g., Fahrenheit. No absolute zero.

Can be classified, counted, proportioned, ranked, added/subtracted, and averaged

Can't form meaningful ratios (no true zero point)
Ratio
has all of the qualities of the other scales plus an absolute zero

e.g., Kelvin; all mathematical operations can be performed with this scale.
Know the Levels of Measurement Summary Table from lecture
see powerpoint
frequency distribution
Displays scores on a variable to show how frequently each value was obtained.

Histogram: a frequency distribution in graph form; provides limited information about the actual values.

Bar graph: doesn't show information about differences within a specific interval.

Stem-and-leaf plot: also a frequency distribution, but provides more information than the histogram about actual values because it represents each case.
Define percentile rank and percentile. How do percentile ranks differ from percentiles?
Percentiles are the specific scores or points within a distribution; they divide the total frequency for a set of observations into hundredths.

Percentile ranks indicate what percentage of scores fall below a particular score.
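A quick way to check the distinction: the sketch below computes a percentile rank (percentage of scores falling below a given score) for a small set of made-up test scores. Conventions differ on how ties are counted; this version counts only scores strictly below.

```python
# Percentile rank: the percentage of scores in a distribution falling below a given score.
scores = [55, 60, 62, 67, 70, 70, 74, 78, 81, 90]   # made-up test scores

def percentile_rank(score, distribution):
    below = sum(1 for s in distribution if s < score)
    return 100.0 * below / len(distribution)

print(percentile_rank(74, scores))   # 60.0 -> a score of 74 has a percentile rank of 60
```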
central tendency
Score around which other scores congregate. Know the three components and how to calculate each: mean, median, and mode…you know this.
Variance
Average squared deviations around the mean.
Standard Deviation
A useful approximation of how much a typical score is above or below the average score
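If it helps, here is a minimal sketch of the variance and SD calculations on a handful of made-up scores (population formulas, i.e., dividing by N):

```python
from statistics import mean

scores = [4, 8, 6, 5, 7]   # made-up scores

m = mean(scores)                                            # 6.0
variance = sum((x - m) ** 2 for x in scores) / len(scores)  # 2.0 (average squared deviation)
sd = variance ** 0.5                                        # ~1.41 (square root of the variance)
print(m, variance, sd)
```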
Equations for Variance
MUST KNOW SEE STUDY GUIDE
Normal Distribution
Understand conceptually and memorize the figure from lecture: 50% of scores above and 50% below the mean; skewness of zero; 34% within each 1st SD (above and below the mean); 95% within 2 SDs; 99.7% within 3 SDs.
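You can verify the 68-95-99.7 figures yourself with Python's standard library; this just uses the standard normal curve, no lecture data needed:

```python
from statistics import NormalDist

z = NormalDist()          # standard normal: mean 0, SD 1
for k in (1, 2, 3):
    pct = z.cdf(k) - z.cdf(-k)          # area within k SDs of the mean
    print(f"within {k} SD: {pct:.1%}")
# within 1 SD: 68.3%, within 2 SD: 95.4%, within 3 SD: 99.7%
```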
Skewness
index of the degree to which symmetry is absent
Identify Skewness
Look at the direction of the tail
Kurtosis
Index of the ‘peakedness’ v. ‘Flatness’ of a distribution
Platykurtic
Negative kurtosis value reflects a flatter than normal distribution (Plate)
Leptokurtic
Positive kurtosis value reflects a more peaked than normal distribution (leaping)
Mesokurtic
Value of 0 reflects a medium distribution that is shaped like a normal distribution.
What is a z score?
A standardization of raw scores to allow for comparison
How is a z score calculated?
Subtract mean score from observed score and divide by SD.

z = (observed score - mean score) / standard deviation
How are T scores different from Z scores?
Z scores: mean of 0 SD of 1.

T scores: mean of 50 SD of 10.

T scores are all positive
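A tiny worked example tying the two cards together (the observed score, mean, and SD are invented):

```python
observed, mean_score, sd = 85, 70, 10     # made-up numbers

z = (observed - mean_score) / sd          # 1.5 -> 1.5 SDs above the mean
t = 50 + 10 * z                           # 65.0 -> same standing on the T scale (mean 50, SD 10)
print(z, t)
```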
What are quartiles?
points dividing the distribution into four equal parts (at the 25th, 50th, and 75th percentiles)
Interquartile range
MIDDLE 50%

Discards the upper and lower 25% and takes what remains (the middle 50%). Caution: may discard too much of the data.
Norm
performance is based on a standardized sample of individuals (normative sample). Individual test scores are compared to the normative sample. The norm is the standardized sample.
Norming
the process of creating norms (gender norms, age norms, grade norms, ethnic norms, etc).
Standardization
SPECIFIC PROCEDURES

for test administration, scoring, and interpretation.
Norm-referenced tests
compares each person to a norm, derives meaning by comparing scores to the normative sample
Criterion-referenced:
references scores to some external standard (criterion).

Criterion is based on values and standards of the test consumers. Sometimes known as "domain-referenced" or "content-referenced" tests because of their focus on a specified content area or domain. Used to assess mastery or achievement. Disadvantage: performance relative to others is lost due to a lack of normative information.
5 Characteristics of a Good Theory
1)Explanatory power
2)Broad scope
3)Systematic
4)Fruitful
5)Parsimonious
Explanatory Power
explain patterns of a variable’s behavior that we know or suspect to exist with some accuracy and precision.
Broad Scope
applies to a wide variety of phenomena.
Systematic
interconnected, internally consistent (coherent) statements rather than odd assemblage of loosely related items that may not make logical sense when considered as a whole.
Fruitful
predicts the existence of regularities (patterns) that may not have been suspected before the theory was proposed; facilitates the discovery, useful organization, and interpretation of newly discovered facts; not circular (does not assume the facts they are investigating or concluding).
Parsimonious
Abides by Occam’s Razor-explains and predicts phenomena using a minimum of variables, concepts, and propositions needed to adequately perform the task.
What is a scatterplot (scatter diagram)? How does it work?
A scatterplot is a graphical tool used to show linear and nonlinear relationships.

Bivariate: one variable is plotted on the x-axis, and the other variable is plotted on the y-axis.

It’s useful for visual data inspection to determine the degree of linear relationship, revealing curvilinear relationships, detecting outliers, detecting restricted range problems (ceiling/floor effect, central tendency rater errors).
Understand and be able to differentiate and plot positive, negative, and 0 correlation
Positive correlation is an upward diagonal

negative correlation is a downward diagonal

A zero correlation has no upward or downward trend; the points form a shapeless cloud and the best-fit line is flat (horizontal).
What is the principle of least squares? How does it relate to the regression line?
BEST PREDICTION LINE

It minimizes the squared deviation around the regression line.

Used to figure out the regression line
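Here is a minimal sketch of how the least-squares slope and intercept are computed; the x and y values are made up for illustration:

```python
from statistics import mean

x = [1, 2, 3, 4, 5]        # made-up predictor values
y = [2, 4, 5, 4, 6]        # made-up criterion values

mx, my = mean(x), mean(y)
# Slope and intercept that minimize the squared deviations around the line
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
print(f"y' = {a:.2f} + {b:.2f}x")   # the regression (best-prediction) line
```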
covariance
The degree to which two variables vary together, the degree to which x and y share a linear relationship, the degree to which points of two variables fall along a least squares regression line. As one value changes, the other changes in the same or opposite direction.
What is the principle of dilution in correlation?
There is no dilution with a perfect 1:1 correspondence between x and y.

All pairings are of equal magnitude.

Dilution is evident with an imperfect correspondence between x and y. If r=.5, x=.5y and thus y “waters down” x
What is the Pearson product moment correlation? What meaning do the values -1.0 to 1.0 have?
A ratio used to determine the degree of variation in one variable that can be estimated from knowledge about variation in the other variable. It is the most commonly used index of correlation and is used appropriately when both variables are continuous. Values of -1.0 and 1.0 indicate a perfect negative or positive linear relationship; 0 indicates no linear relationship. The closer to -1 or 1, the stronger the relationship.
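A sketch of the Pearson r calculation on the same kind of made-up data (co-variation of x and y divided by the product of their variations):

```python
from statistics import mean

x = [1, 2, 3, 4, 5]        # made-up data
y = [2, 4, 5, 4, 6]

mx, my = mean(x), mean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))    # co-variation of x and y
sxx = sum((a - mx) ** 2 for a in x)                     # variation in x
syy = sum((b - my) ** 2 for b in y)                     # variation in y
r = sxy / (sxx * syy) ** 0.5
print(round(r, 3))   # ~0.853 for these numbers: a strong positive linear relationship
```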
Define residual and how is it calculated?
The amount of observed y “left over” after the regression equation has taken its best guess.

Residuals always sum to zero due to the slope and intercept calculations.
What is the standard error of estimate?
The standard deviation of residuals. It is an index of the accuracy of prediction. If highly accurate, the differences between observed and predicted y will be small because the data points will lie close to the regression line. Consequently, the standard error of the estimate will also be small. If poor, the differences between observed and predicted y will be large because the data points will fall far away from the regression line. Consequently, the standard error of the estimate will also be large.
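A short sketch showing residuals and the standard error of estimate on made-up data; note that texts differ on whether the squared residuals are averaged over N or N - 2:

```python
from statistics import mean

x = [1, 2, 3, 4, 5]        # made-up data
y = [2, 4, 5, 4, 6]

mx, my = mean(x), mean(y)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]   # observed y minus predicted y
print(round(sum(residuals), 10))                          # 0.0: residuals sum to zero
see = (sum(e ** 2 for e in residuals) / (len(y) - 2)) ** 0.5   # SD of residuals (n - 2 here)
print(round(see, 3))   # small SEE = points lie close to the regression line
```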
What is the coefficient of alienation?
A measure of nonassociation between two variables.
What is shrinkage?
Shrinkage often occurs when a regression equation is calculated using one group of subjects and then used to predict performance in another group of subjects. Regression analysis is prone to overestimate the strength of the relationship between variables: it not only takes the true nature of the relationships into account, but it also takes advantage of "chance" relationships. When a regression equation is exported to a new group of subjects, the chance factors tend to drop out, revealing the true (weaker) relationship. Bottom line: replicated studies tend to identify the strength of the actual relationships between variables more clearly than "first time" studies.
What is the purpose of using discriminant analysis?
Used to predict group membership (passed vs. failed, stayed in class vs. dropped out).

The criterion (predicted) variable in discriminant analysis is group membership (2 or more groups are required to perform the analysis).

The criterion is a categorical (nominal-level), and not a continuous, variable. In test construction, discriminant analysis is useful in identifying items that discriminate between known (pre-classified) groups. Which test items best discriminate between recruits who will drop out of boot camp and those who will stay on? (predicting members in two groups).
What is factor analysis? Memorize all portions of the figure
Studies interrelationships among a set of variables without reference to an external criterion. A data reduction technique that identifies the number of underlying constructs (factors) that the variables in a correlation matrix are collectively measuring. (By contrast, multiple regression and discriminant analysis find linear combinations of predictor variables that maximize the prediction of some external criterion: a continuous criterion variable or group membership.) Operates according to the following basic principles:
When two reliable variables in a correlation matrix correlate highly, they measure the same thing or load on to the same factor.
When two reliable variables in a correlation matrix do not correlate, they measure different things or load on to different factors.
One factor is extracted for each “cluster” of highly intercorrelated variables in a correlation matrix.
Factors are desirable because they simplify and reduce data complexity. One deals with a smaller set of factors instead of many individual variables. Since multiple variables are used to measure each factor, factors are more reliably measured.
*There are two or three slides at the end of the Chapter 3 powerpoint that have the figures on them. I'm assuming that's what he wants us to memorize. Sorry I couldn't include them in the study guide.
What word can be used in place of reliability?
Consistency (stability or dependability)
What components make up Classical Test Score Theory?
X=T+E

X (observed score) = T (true score) + E (error). Assumes each person has a true score that would be obtained if there were no measurement error (this score will not change despite repeated applications). Imperfect instruments give scores that almost always differ from true ability. Measurement error leads to differences between true and observed scores. Classical Test Score Theory assumes random error. Because we assume the distribution of random errors will be the same for all people, CTT uses the standard deviation of errors as the basic measure of error.
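A simulation sketch of X = T + E (all numbers invented): with purely random error, the ratio of true-score variance to observed-score variance behaves like a reliability coefficient.

```python
import random

random.seed(1)
true_scores = [random.gauss(100, 15) for _ in range(5000)]    # invented true scores
observed = [t + random.gauss(0, 5) for t in true_scores]      # X = T + random error (SD 5)

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Ratio of true-score variance to observed-score variance, roughly 225 / (225 + 25) = 0.90
print(round(var(true_scores) / var(observed), 2))
```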
What are the most important components of the Domain Sampling Model?
Domain (universe, population): an extremely large collection of items. A true score is conceptualized as a universe, population, or domain score.

Test items are a sample of a universe of hypothetical items measuring a domain. Each item should equally represent the studied ability, and the larger the sample, the more accurately it represents the domain. Thus, the greater the number of items, the higher the reliability. Sampling error can cause random samples of items to give different true score estimates.

Internal consistency (reliability): measures the item pool's homogeneity, and thereby the sampled universe's homogeneity. The more homogeneous the "sample," the more homogeneous the universe.
Know the reliability formula as requested in lecture
see lecture (ch.4 slides)
In what ways can error impact the observed score?
Error is potentially due to situational factors and how the test was created.

Three areas of importance are:

test construction (item content and wording)

test administration (environment, test-taker variables: fatigue, motivation, etc and examiner-related variables: demeanor, body language, professionalism),

test scoring and interpretation (different scoring criteria, administrator bias, arbitrary scoring criteria, low inter-rater agreement)
Test reliability is usually estimated in one of what three ways? Know the major concepts in each way.
1. TEST-RETEST
2. PARALLEL FORMS
3. INTERNAL CONSISTENCY
TEST RETEST
(Most common.) Evaluates error from the same test given on two administrations. How close to the identical percentile rank are the time 1 and time 2 distributions? Shifts in the z-score rankings between time 1 and time 2 reduce reliability; the more drastic the shift, the lower the reliability. This is only valuable when measuring stable characteristics.
PARALLEL FORMS
Compares two equivalent test forms measuring the same attribute. Different items between forms, selected with the same selection rules. Equivalent content, difficulty, means, and variances. Advantages: reduces memory bias; it's one of the most rigorous assessments of reliability. Disadvantages: hard to construct, potential sources of error variance.
INTERNAL CONSISTENCY
examines the homogeneity with which a test measures a construct.

Based on the domain sampling theory.

Advantage: requires only one test administration, simplifies test interpretation because the test items all measure the same thing.

Disadvantages: reliability coefficients may be biased by which items are sampled from the universe, less appropriate for measuring multi-faceted constructs, item “heterogeneity” lowers average inter-item correlations.

Solution: when measuring multifaceted constructs, create homogeneous subscales, each measuring a different component.
What is a carryover effect?
When the first session influences the second.

Test-retest correlation usually overestimates (inflates) the true reliability: practice, memory.
Define parallel forms reliability. What are its advantages and disadvantages?
Compares two equivalent test forms measuring the same attribute. Different items between forms with the same selection rules. Equivalent content, difficulty, means, and variances.

Advantages: reduces memory bias; it's one of the most rigorous assessments of reliability.

Disadvantages: hard to construct, potential sources of error variance.
Define split half reliability.
A test is given and divided into halves that are scored separately. Each half is compared by correlating the two halves. This examines the stability in scores for two equivalent halves of the same test. To correct for half-length, the Spearman-Brown formula is used.

Advantage: requires only one test administration.

Disadvantage: reliability on 2 half-tests is deflated because each half is less reliable than the whole, difficult to split the test into “mini-parallel forms” equivalent in content, difficulty, means, variances.
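The Spearman-Brown correction mentioned above, as a one-liner (the .70 half-test correlation is just an example value):

```python
# Corrects the half-test correlation up to an estimate for the full-length test.
def spearman_brown(r_half):
    return 2 * r_half / (1 + r_half)

print(round(spearman_brown(0.70), 2))   # 0.82
# General prophecy form for a test n times longer: n * r / (1 + (n - 1) * r)
```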
How do the different aspects of internal consistency differ?
Kuder-Richardson Formula 20: Simultaneously considers all ways of splitting the items. Statistic of choice for measuring dichotomous items. Proofs have shown that it gives the same estimate of reliability as if you took the mean of the split-half reliability estimates obtained by dividing the test in all possible ways.

Cronbach’s Alpha: most general method of finding estimates of reliability through internal consistency. Appropriate for dichotomous and nondichotomous values (multi-point or continuous). Based on the average inter-item covariances.
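A bare-bones Cronbach's alpha calculation on a tiny made-up data set (5 people x 4 items), using the k/(k-1) * (1 - sum of item variances / total-score variance) form:

```python
# Rows = people, columns = item scores (all values invented).
data = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 5, 4, 4],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

k = len(data[0])                                              # number of items
item_vars = [var([row[i] for row in data]) for i in range(k)]
total_var = var([sum(row) for row in data])                   # variance of total scores
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))   # ~0.95: these made-up items hang together tightly
```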
Understand the major components of inter-rater reliability.
Examines the consistency among different judges evaluating the same behavior (also called inter-scorer, inter-observer, or inter-judge).
How does the Kappa statistic relate?
Kappa Coefficient: A measure of agreement between two or more judges who each rate a set of objects using nominal (categorical) data. Calculated as the actual agreement beyond chance divided by the potential agreement beyond chance. Designed to correct for problems with tabulating “percent agreement” between raters. “Percent agreement” does not correct for chance agreement. “Percent agreement” is affected by the base rate of what is rated.
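A minimal kappa calculation for a made-up 2 x 2 agreement table between two judges, showing how chance agreement is removed from raw percent agreement:

```python
# Cohen's kappa sketch: (observed agreement - chance agreement) / (1 - chance agreement).
# Rows = judge A (yes, no); columns = judge B (yes, no); counts are invented.
table = [[20, 5],
         [10, 15]]

n = sum(sum(row) for row in table)                    # 50 cases
p_observed = (table[0][0] + table[1][1]) / n          # 0.70 raw "percent agreement"
p_a_yes = sum(table[0]) / n                           # judge A's base rate for "yes"
p_b_yes = (table[0][0] + table[1][0]) / n             # judge B's base rate for "yes"
p_chance = p_a_yes * p_b_yes + (1 - p_a_yes) * (1 - p_b_yes)   # agreement expected by chance
kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 2))   # 0.4: well below the raw 70% agreement once chance is removed
```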
What factors should be considered when choosing a reliability coefficient?
1.Homogeneity vs. heterogeneity of items (Internal consistency)


2. Dynamic vs. static characteristics (Test-retest)
Homogeneity vs. heterogeneity of items
Is the test measuring a multi-faceted or a uni-faceted construct?
Dynamic vs. static (test-retest)
i.Is the “true score” fluctuating or relatively stable over time?

ii.Does it change from moment to moment, or from situation to situation?
How can one address low reliability?
-Increase test length (sample a better domain)

-Throw out items that run down reliability

-Estimate the true correlation if the test did not have measurement error

(slide 27 & 28)
What is the purpose of factor and item analysis?
Since reliability depends on items measuring a common characteristic, it is possible to make the test more accurate by leaving out items that do not measure the construct, which increases reliability.
Factor analysis is used to ensure the items measure the same thing.
What is the correction for attenuation?
Low reliability diminishes (attenuates) the magnitude of correlation coefficients, reducing the likelihood of finding significant correlations between measures.

Used to estimate the true correlation between variables that have been measured with error
Don’t need to know formula
What example was given in class regarding reliability
Reliability refers to the carpenter's rubber yardstick

Reliability=consistency

You will have to do a matching section on RELIABILITY!!
What are the five iterative steps of the test development process?
Test Conceptualization
Test Construction
Test Tryout
Item Analysis
Test Revision
Dichotomous Format
offer only 2 alternatives for each item

Ex: true/False or Yes/No

Advantage: simplicity, ease of administration

Disadvantages: encourages memorization (test takers can perform well on tests they do not understand)
Truth often comes in shades of gray
50% probability of guessing correctly
To achieve reliability, the tests must contain many items
Polytomous
Multiple choice and matching
Advantages:
Ease of administration/scoring; familiar format; probability of guessing the correct answer is lower than with true/false; short response time
Disadvantages:
Harder to write (3-4 distractors per item), relatively shallow coverage
Which types of questions are “selected-response format”?
Dichotomous and Polytomous
Which three types of questions would be considered “constructed-response format”?
Short answer
Fill in the blank
Essay
What are the two major formats of summative scales, as given in lecture? What type of data do they create?
Likert Format
Category Format
Likert Format.

Which scales most frequently use the Likert format?
Likert Format

5, 6 & 7-point scales indicating degree of agreement
Ex: You like cheese.
Strongly disagree, moderately disagree, etc.
Category Format
9 or 10 point max.
Ex: On a scale of 1-10, with 1 being the lowest, what is the level of pain you feel?
1 2 3 4 5 6 7 8 9 10
What are the primary differences between the Likert and Category formats?
Likert
Requires assessment of item discriminability
Category
Used to make more fine-grained discriminations
In creating a category format, the use of what will reduce error variance?
Clearly defined anchor intervals (endpoints and midpoints)
When does the category format begin to reduce reliability?
More points may reduce reliability (i.e. a 9 point scale is better than a 50 point scale)
What are the four questions that should be asked when generating a pool of candidate test items?
What content domain (construct) should the test items cover?
How many items should I generate to form a good test item pool?
What are the demographics of my test-taker population?
Cultural background
Reading level
Developmental level

How shall I word my items?

WHAT CONTENT
HOW MANY ?'s
DEMOGRAPHICS
WORDING
What are the four ways to score tests and how is each differentiated from the others?
Cumulative Scoring

Full-Scale Score

Subscale Scores
Cumulative Scoring
Cumulative Scoring
Performed on summative scales
Items are summed for a total scale score
Subscale Scoring
Subscale scoring (Alternative):
Total test score is divided into theoretically or empirically derived subscales that are independently summed
Example, a measure of PTSD may be scored by:

Subscales Scores: Sum Items for Intrusion, Avoidance, etc.
“Full-” and “subscale” scoring are not mutually exclusive
Both can be used to adjust the level of specificity in measurement and interpretation (global vs. single facet of the construct)
Full-Scale Scoring
Full-Scale Score: Sum all 17 items to create “Total PTSD Symptoms”
Item Analysis:
Describes a set of methods to evaluate the “goodness” of candidate test items
What two methods are closely associated with item analysis?
Item Difficulty
Item Discriminability
Define item difficulty. What does the proportion of people getting the item correct indicate?
Item Difficulty: the proportion of people who get a particular item correct
The higher the proportion who get it right, the easier the item
What is the point midway between 100% correct and the level of success expected by chance alone?
Optimum Difficulty Level: the point midway between 100% and the level of success expected by chance alone
Expected level of chance performance
He will give us an example, such as a 100-item multiple-choice exam with 4 choices
Level of chance is 25%
For most tests, items in the difficulty range of _____ tend to maximize information about differences
.3-.7
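Putting the last few cards together in one small sketch (the counts are invented; the 4-choice example matches the one above):

```python
num_correct, num_takers = 62, 100
difficulty = num_correct / num_takers    # 0.62: proportion getting the item right

chance = 1 / 4                           # guessing level for a 4-choice item
optimum = (1.0 + chance) / 2             # 0.625: midway between 100% and chance
print(difficulty, optimum)
```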
Define item discriminability.
Whether the people who have done well on particular items have also done well on the whole test.
Extreme Group Method
Calculate the proportion of people in each group (typically the top and bottom 25-33%) who answered the item correctly, then take the difference between the two groups.
High values reflect good discrimination
0 values reflect no discrimination (bad/useless item)
Negative values (below 0) reflect reverse discrimination (very bad item)
“Red flags” are items with low, especially negative, d values
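A quick sketch of the extreme group discrimination index d with invented counts:

```python
top_correct, top_n = 22, 27          # invented counts: top-scoring group
bottom_correct, bottom_n = 9, 27     # invented counts: bottom-scoring group

d = top_correct / top_n - bottom_correct / bottom_n
print(round(d, 2))   # ~0.48: high scorers pass the item far more often, so it discriminates well
```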
Point Biserial Method
Find the correlation between item performance and total test performance
Point Biserial Correlation - Correlation between dichotomous & continuous variables
Problematic on tests with only a few items because performance on the item contributes to the total test score.
The closer the value to 1.0 the better the item
If the value is negative or low, then the item should be eliminated
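And a sketch of the point-biserial calculation (an item scored 0/1 against made-up total scores); it is just Pearson r applied to a dichotomous and a continuous variable:

```python
from statistics import mean, pstdev

item = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]             # right/wrong on one item (invented)
total = [88, 75, 60, 80, 55, 90, 65, 58, 72, 85]  # total test scores (invented)

mi, mt = mean(item), mean(total)
cov = sum((i - mi) * (t - mt) for i, t in zip(item, total)) / len(item)
r_pb = cov / (pstdev(item) * pstdev(total))
print(round(r_pb, 2))   # ~0.89: people who pass the item tend to score high overall
```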
Explanation of how they differ
The Extreme Group method basically looks at the people who did the best and the people who did the worst and compares them. The Point Biserial correlation basically looks to see whether the people who did the best tended to miss a certain question; if so, it was obviously a bad question.
Define item characteristic curve. Know what information the X and Y axes give as well as slope
A graph which helps to show the characteristics of an item

X axis: ability (high vs. low scoring groups)
Y axis: probability of correct response (proportion correct)
PLEASE REFER TO GRAPHS on Ch. 6 slides 32-34 to be sure you understand how to read them
When shown an ICC (item characteristic curve), be able to determine good or poor discrimination
see study guide
Define criterion-reference test. What is its primary purpose?
Compares performance with some clearly defined criterion for learning
Define antimode
Least frequent score; in a bimodal distribution it can be used as the cutting score (pass/fail)