29 Cards in this Set
- Front
- Back
Instrument |
any type of measurement device (e.g. test, questionnaire, interview schedule, or personality scale). |
|
Instrumentation |
heading in formal research reports for the section where measurement devices are described. |
|
Validity |
BLUF: does an instrument measure the construct that you are trying to measure?
it is the extent to which an instrument measures what it is designed to measure and accurately performs the function(s) it purports to perform.
Validity is always a matter of degree, not a yes/no question--no test is perfectly valid, because: 1. all tests tap only a sample of the behavior underlying the constructs being measured. 2. some traits researchers want to measure are inherently difficult to measure. |
|
Content validity |
the appropriateness of an instrument's contents. Example: achievement tests--what questions and in what format should you ask to evaluate comprehension?
Researchers make judgments on content validity, and generally: 1. a broad sample of content is usually better than a narrow one. 2. important material should be emphasized. 3. questions should be written to measure the appropriate skills, such as knowledge of facts and definitions, application of definitions to new situations, drawing inferences, making critical appraisals, etc. |
|
Face validity |
whether an instrument appears to be valid on the face of it. Example: pilot training test--drawings of shapes moving in space seemingly less valid than drawings of airplanes moving in space. |
|
Criterion |
standard by which a test is being judged. |
|
Predictive validity |
the extent to which a test predicts the outcome it is supposed to predict at some future time. Example: SAT tests predicting college abilities/performance. |
|
Concurrent validity |
the extent to which a test predicts your current state. Example: mid-term exam predicts current understanding in the class. |
|
Validity coefficient |
a correlation coefficient used to express validity. In principle it ranges from 0 to 1; in practice, coefficients above about 0.60 are rarely seen, and 0.60 already counts as a strong correlation.
In general, a higher number (i.e. closer to 1) means a more valid test.
|
|
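Since a validity coefficient is just a correlation between test scores and a criterion measure, it can be computed directly. A minimal sketch with hypothetical scores (all data invented for illustration):

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical predictive-validity check: test scores vs. later performance.
test_scores = [50, 60, 70, 80, 90]
criterion = [2.1, 2.5, 3.0, 3.2, 3.8]  # e.g. later GPA (invented)
r = pearson_r(test_scores, criterion)
```

These toy numbers yield r close to 0.99, far above what real validity studies produce; as the card notes, observed validity coefficients rarely exceed 0.60.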
Construct validity |
the type of validity that relies on both subjective judgments and empirical data. It offers only indirect evidence regarding the validity of a measure.
To determine construct validity, researchers begin by hypothesizing about how the construct the instrument is designed to measure should affect or relate to other variables. |
|
Construct |
stands for a collection of related behaviors that are associated in a meaningful way.
Example: "Depression" is a construct that stands for lethargy, apathy, loss of appetite, etc. |
|
Reliability |
BLUF: how repeatable is the measure?
it is the consistency of results between measurements.
Equation: Reliability = (System Variance)/(System Variance + Error Variance)
Reliability is necessary, but not sufficient for validity. If you have to pick--go for VALIDITY first, then RELIABILITY second.
|
|
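The reliability equation above can be sketched numerically; the variance figures below are invented for illustration:

```python
def reliability(system_variance, error_variance):
    """Reliability as the proportion of total variance that is systematic
    (true-score) variance rather than error variance."""
    return system_variance / (system_variance + error_variance)

# As error variance shrinks relative to system variance, reliability rises.
high = reliability(9.0, 1.0)  # 0.9
low = reliability(9.0, 9.0)   # 0.5
```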
Interobserver reliability |
reliability or agreement between observers/testers. |
|
Correlation coefficient |
correlation between two variables expressed as a number from -1 to +1, where +/-1 = perfect correlation and 0 = no relationship. Note: a perfect correlation is never found in the field. |
|
Reliability coefficients |
a correlation coefficient used to describe reliability; the term can be extended to the particular type of reliability.
Low reliability = below 0.70. Good reliability = 0.70 and above.
Example: Interobserver reliability = reliability coefficient (correlation) between observers. |
|
Two methods for estimating the reliability of a test: |
1. Test-retest reliability 2. Parallel-forms reliability |
|
Test-retest reliability (aka Stability) |
BLUF: researchers measure at *two different points in time*.
extent to which the instrument gives the same score each time you use it (i.e. pre-test and post-test). Expect high correlation with stable tests. Note: Time is a factor in the stability.
Example: Personality test over time -- results may be less reliable if you wait a long time between testing. |
|
Parallel-forms reliability |
BLUF: the test comes in two versions with the same content, administered at *two points in time*.
when a test comes in two parallel (equivalent) forms designed to be interchangeable with each other (different items but the same content).
Example: GRE tests (Version A and Version B) |
|
Split-half reliability |
BLUF: comparing test content to itself based on results, at a *single point in time*.
the researcher administers one test but scores the items as though they constituted two separate tests, using an even-odd split. |
|
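A sketch of the even-odd split, using an invented 0/1-scored item matrix (rows = examinees, columns = items). The final Spearman-Brown step is a standard correction for the halved test length, though the card does not mention it:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def split_half_reliability(item_matrix):
    """Score odd- and even-numbered items as two half-tests, correlate them,
    then apply the Spearman-Brown correction for full test length."""
    odd_halves = [sum(row[0::2]) for row in item_matrix]
    even_halves = [sum(row[1::2]) for row in item_matrix]
    r = pearson_r(odd_halves, even_halves)
    return 2 * r / (1 + r)

items = [  # hypothetical responses, 1 = correct
    [1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 0],
]
```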
Cronbach's alpha (aka coefficient alpha) |
BLUF: comparing test content to itself based on results, at a *single point in time*.
measures internal consistency (the extent to which the items in a test are homogeneous *WITHIN that test only*). Based on computing all possible split-halves from a single administration of a test.
higher alpha value (higher internal consistency) = more reliable test |
|
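The usual computing formula for Cronbach's alpha, k/(k-1) times one minus the ratio of summed item variances to total-score variance, can be sketched as follows; the response matrix is invented:

```python
def variance(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(item_matrix):
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))."""
    k = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    item_vars = sum(variance([row[j] for row in item_matrix]) for j in range(k))
    return (k / (k - 1)) * (1 - item_vars / variance(totals))

# Hypothetical 3-item, 4-examinee matrix (rows = examinees).
data = [[1, 1, 1], [1, 0, 1], [0, 1, 0], [0, 0, 0]]
alpha = cronbach_alpha(data)
```

Perfectly consistent responses drive alpha to 1; inconsistent item patterns pull it down.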
Norm-referenced tests (NRTs) |
tests designed to facilitate a comparison of an individual's performance with that of a norm group (peer sample or population).
NRTs are intentionally built to be of medium difficulty.
Example: a percentile rank of 64 means that the examinee scored better than 64% of the individuals in the norm group.
|
|
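The percentile-rank example can be sketched directly; the norm-group scores are invented:

```python
def percentile_rank(score, norm_group):
    """Percent of the norm group scoring below the given score."""
    below = sum(1 for s in norm_group if s < score)
    return 100 * below / len(norm_group)

# A hypothetical norm group of 25 scores: an examinee scoring 16 outperforms
# 16 of the 25 members, i.e. a percentile rank of 64.
norm_group = list(range(25))
rank = percentile_rank(16, norm_group)  # 64.0
```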
Criterion-referenced tests (CRTs) |
tests designed to measure the extent to which individual examinees have met performance standards (i.e. specific criteria). |
When building CRTs, item difficulty typically is of little concern--expert judgment is used to determine the desired level of performance and how to test for it.
Example: PT test |
|
Which type of test should be favored in research? |
Answer depends on the research purpose. Two general guidelines: 1. If purpose is to describe specifically what examinees know and can do -- use CRTs. 2. If the purpose is to examine how a local group differs from a larger norm group -- use NRTs. |
|
Achievement test |
measures knowledge and skills individuals have acquired.
Example: Midterm exam |
|
Aptitude test |
designed to predict some *specific type* of achievement.
The most widely used aptitude tests typically have low to modest validity (validity coefficients between 0.20 and 0.60).
Example: SAT |
|
Intelligence test |
designed to predict achievement *in general*.
Most popular intelligence tests have low to moderate validity (0.20 to 0.60) and: 1. are culturally loaded -- measure skills acquired in a particular cultural milieu. 2. measure knowledge and skills that can be acquired through instruction.
Example: IQ test, Wonderlic test |
|
Three basic approaches to reducing social desirability in participant responses: |
1. administer personality measures anonymously 2. observe behavior unobtrusively (without the participant's awareness, to the extent that is ethically possible) and rate selected characteristics 3. use projective techniques (present loosely structured or ambiguous stimuli, such as ink blots) |
|
Likert-type scales |
scales that offer choices from "strongly agree" to "strongly disagree". Objective in the sense that responses are restricted to fixed choices and each item presents a clear statement on a single topic. |
|
Reverse scoring |
mixing positively and negatively worded statements to detect and reduce response bias (i.e. to guard against respondents marking answers without considering the individual items).
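Reverse scoring on a 1-to-5 Likert scale can be sketched as follows (the item responses and the choice of which items are negatively worded are hypothetical):

```python
def reverse_score(response, scale_min=1, scale_max=5):
    """Flip a Likert response so negatively worded items run in the same
    direction as positively worded ones (5 -> 1, 4 -> 2, etc.)."""
    return scale_max + scale_min - response

# Hypothetical 4-item scale; the items at index 1 and 3 are negatively worded.
responses = [5, 1, 4, 2]
negative_items = {1, 3}
scored = [reverse_score(r) if i in negative_items else r
          for i, r in enumerate(responses)]
```

After reverse scoring, a respondent who simply marked the same column for every item would produce an inconsistent score pattern, which is how the bias is detected.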