105 Cards in this Set

Holland's Six Occupational Themes
Holland's classification of occupations and occupational interests into six thematic areas, which Holland believed reflected basic personality characteristics: realistic, investigative, artistic, social, enterprising and conventional. Different patterns of high scores on Holland's occupational themes place people into different occupational categories.
Infant Tests (Bayley's Scales/Denver/Cattell)
"Infant tests" are mental ability tests designed for children aged 5 years and under. In contrast to standard IQ tests, infant tests usually assess sensorimotor functioning (e.g., ability to lift head, turn over, grasp objects). Infant tests are not considered good
predictors of future intelligence test performance or academic achievement.
1. BAYLEY'S SCALES OF INFANT DEVELOPMENT: An infant/preschool test used to
assess the developmental status of children aged 2 through 30 months.
2. DENVER DEVELOPMENTAL SCREENING TEST: An infant/preschool test used to
assess developmental delays in children from birth to age 6.4 years. It can be
administered by a nonprofessional who has received minimal training.
3. CATTELL INFANT INTELLIGENCE SCALE: An infant test designed as a downward extension of the Stanford-Binet for infants aged 2 to 30 months.
Intelligence (Crystallized/Fluid/IQ Fluctuations)
Despite widespread use of intelligence tests, no widely-accepted definition of
"intelligence" exists. Psychologists tend to view intelligence as either a global mental capacity (e.g., Wechsler, Spearman) or as independent factors (e.g., Gardner).
Operationally, intelligence is what intelligence tests measure.
1. CRYSTALLIZED INTELLIGENCE: Abilities that are primarily a function of learning and experience. Crystallized intelligence is less affected by physiological condition and, therefore, less negatively affected by aging.
2. FLUID INTELLIGENCE: Abilities, such as memory span and mental speed, that are affected by physiological condition and maturation.
3. IQ FLUCTUATIONS: Although cross-sectional studies demonstrate a gradual decline in IQ starting in late adolescence, longitudinal studies show that IQ scores are relatively stable throughout the lifespan. At least during the school years, however, IQ fluctuations as high as 15 points are not uncommon.
Intelligence Tests (Stanford-Binet/Wechsler Tests)
1. STANFORD-BINET: An individually-administered intelligence test for individuals aged 2 through adult. The current version (Fourth Edition) yields a Composite Standard Age Score (SAS), Area SASs and subtest SASs. Composite and Area SASs have a mean of 100 and a
standard deviation of 16.
2. WECHSLER TESTS: (a) WAIS-R: An individual intelligence test that yields Verbal, Performance and Full Scale IQs, as well as individual subtest scores. Verbal, Performance and Full Scale IQs have a mean of 100 and a standard deviation of 15. The WAIS-R is appropriate for individuals aged 16 and over. (b) WISC-III: A downward extension of the WAIS-R for children aged 6 through 16 years, 11 months. (c) WPPSI-R: A downward
extension of the WAIS-R and WISC-III for children aged 3 through 7 years, 3 months.
Interest Tests (Kuder Preference Record-Vocational/Kuder Occupational Interest Inventory/Strong Interest Inventory)
Interest tests have relatively low (but positive) correlations with measures of ability, but higher correlations with occupational and educational choice. They are less valid than intelligence tests for predicting occupational and academic success.
1. KUDER PREFERENCE RECORD-VOCATIONAL (KPR-V): Appropriate for persons in grades 9 through 16 and adults. The KPR-V yields information on interests in 10 broad vocational interest areas. Like the other Kuder tests, it uses a forced-choice format; i.e., examinees respond to items by indicating which of three statements they most and least prefer. Scores are ipsative; i.e., they indicate the relative strengths and weaknesses of interests within the examinee.
2. KUDER OCCUPATIONAL INTEREST INVENTORY (KOIS): Appropriate for persons in grades 11 and 12 and adults. Scores are reported in terms of occupations, college majors and vocational interests. Items selected for the KOIS scales were those that differentiated between persons in different occupational groups (i.e., empirical criterion keying).
3. SVIB-SCII: The most recent version of the Strong interest inventories. The SVIB-SCII is used for vocational counseling and personnel selection of persons in grades 11 and 12 and above. It yields scores on basic interests, occupational themes and occupations. The SVIB-SCII was developed using empirical criterion keying (people in various occupations were compared to a general reference group).
Ipsative and Normative Measures
1. IPSATIVE MEASURE: A test that reports an examinee's scores in relative, rather than absolute, terms. Ipsative scores permit comparisons of the relative strengths of the
attributes measured by the test within an examinee, but do not permit comparisons of the absolute strengths within an examinee or between different examinees. When using ipsative measures, the sum of scores over all scales is constant for all individuals. The opposite of "ipsative" is "normative".
2. NORMATIVE MEASURE: A test that yields scores that provide information about the
absolute amount of the characteristic(s) measured by the test possessed by the examinee. Normative scores permit comparisons both within and between examinees.
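The constant-sum property of ipsative scores can be illustrated with a small sketch. It assumes a hypothetical forced-choice test built from triads in which the examinee marks one statement as most preferred and one as least preferred; the scale names and the 2/1/0 scoring weights are illustrative assumptions, not the scoring rules of any published test.

```python
from collections import Counter

# Hypothetical scoring weights: most preferred = 2, unmarked = 1, least preferred = 0.
WEIGHTS = {"most": 2, "neither": 1, "least": 0}

def score_forced_choice(responses):
    """Return per-scale totals for a list of forced-choice triads.

    Each triad maps a scale name to "most", "neither" or "least";
    exactly one scale is marked "most" and one "least".
    """
    totals = Counter()
    for triad in responses:
        for scale, mark in triad.items():
            totals[scale] += WEIGHTS[mark]
    return dict(totals)

# Two examinees with opposite preferences on the same two triads.
examinee_a = [
    {"outdoor": "most", "clerical": "least", "artistic": "neither"},
    {"outdoor": "most", "clerical": "neither", "artistic": "least"},
]
examinee_b = [
    {"outdoor": "least", "clerical": "most", "artistic": "neither"},
    {"outdoor": "neither", "clerical": "most", "artistic": "least"},
]

scores_a = score_forced_choice(examinee_a)
scores_b = score_forced_choice(examinee_b)

print(scores_a, sum(scores_a.values()))
print(scores_b, sum(scores_b.values()))
```

Every triad hands out the same three points, so both examinees end up with the same total. The scores show which interest is strongest within each person, but they cannot show whether one person has more interest overall than another.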
Mastery Testing
Involves specifying the terminal level of performance required for all learners (e.g., 90% correct) and periodically administering a test to learners to assess their degree of mastery. If a learner's test performance indicates deficiencies, he/she is given appropriate remedial
instruction. This process is repeated until the learner has reached the desired level of performance. The goal of mastery testing is not to identify differences between examinees (as it is in many other types of testing), but to make sure that all examinees eventually reach the same performance level.
MAXIMAL & TYPICAL PERFORMANCE
1. MAXIMAL PERFORMANCE: A test of maximal performance provides information about
an examinee's "best" possible performance (what a person can do). A test would be designed as a measure of maximal performance if it is an achievement or aptitude test.
2. TYPICAL PERFORMANCE: A test of typical performance yields information about an
examinee's usual performance (what he/she usually does or feels). A test would be
designed as a measure of typical performance if it is a test of personality, interest, or attitude or a behavioral observation scale.
Mental Status Exam
An "MSE" is a systematic evaluation conducted to determine an individual's level of psychological, emotional and intellectual functioning and orientation to time and place. It consists of observing the individual's affect, thought content, perceptive and cognitive functions and need and motivation for treatment and asking him/her specific questions (e.g., to interpret a proverb, what day it is, to repeat a series of numbers).
Multiaptitude Test Batteries (Differential/General Aptitude Test)
1. DAT (DIFFERENTIAL APTITUDE TEST): An educational and vocational guidance test for individuals in grades 8 through 12. The DAT contains eight subtests (e.g., verbal
reasoning, number ability, abstract reasoning, space relations).
2. GATB (GENERAL APTITUDE TEST BATTERY): A vocational aptitude test battery for
high school seniors and adults. The GATB is less "school-oriented" than the DAT and more useful for vocational counseling and personnel selection and placement.
Neuropsychological Assessments (Halstead-Reitan/Luria Nebraska)
Neuropsychological Assessment Tests (Bender-Gestalt/Benton Visual Retention Test)
1. BENDER-GESTALT: Originally developed as a measure of visual-perceptual skills, the Bender-Gestalt test is now also used as an objective technique to detect brain damage and estimate intelligence and as a projective test to assess personality. While many clinicians interpret examinees' drawings on the basis of intuition, a number of objective scoring systems are available (e.g., Koppitz system, Pascal-Suttell).
2. BENTON VISUAL RETENTION TEST: Used to assess visual memory, spatial perception
and visual-motor skills for the purpose of identifying brain damage in individuals aged 8 and above. Studies assessing the validity of the Benton have generally found it useful for identifying brain injury, especially in adults.
Objective and Subjective Tests
1. OBJECTIVE TEST: A test that has clear and unambiguous scoring criteria.
Multiple-choice and true/false tests are objective tests.
2. SUBJECTIVE TEST: A test that is scored according to subjective (nonobjective)
standards; i.e., that is dependent, to some degree, on the scorer's judgment.
PL 94-142, SOMPA & THE VINELAND
1. PL 94-142: The Education for All Handicapped Children Act of 1975, which guarantees
public education for all handicapped children and requires that an individualized educational plan (IEP) be prepared for each handicapped child. The law also requires that any tests used to evaluate handicapped children be reliable, valid, and
non-discriminatory.
2. SOMPA (SYSTEM OF MULTICULTURAL PLURALISTIC ASSESSMENT): A comprehensive assessment technique for children aged 5 through 11. SOMPA is designed specifically to prevent the misclassification of minority children as mentally-retarded on the
basis of standard IQ test scores alone.
3. VINELAND ADAPTIVE BEHAVIOR SCALE: The Vineland is designed to measure
adaptive functioning ("social competence") in children and adults. It is often used in conjunction with an IQ test for diagnosing Mental Retardation.
Power and Speeded Tests
1. POWER TEST: Pure power tests are made up of items of varying difficulty levels (often arranged in order of difficulty from easiest to most difficult) and are administered with no time limit or a limit that allows most or all examinees to attempt all items. An examinee's score on a power test reflects the level of difficulty he/she has mastered.
2. SPEED TEST: Pure speed tests contain items that are so easy that all examinees could answer all items correctly if given sufficient time. However, strict time limits are imposed so that no examinee can complete all items and, as a result, differences in examinees' scores reflect differences in response rates. Despite their differences, speed and power tests are both designed to prevent all examinees from attaining perfect scores so that differences between examinees can be detected.
Profile Analysis
A method of score interpretation that involves looking at the pattern of subtest or scale scores. Profile analysis is commonly used to interpret MMPI scores: For the MMPI, profile analysis involves looking at the scales with the highest scores only and using a codebook or computerized interpretation system to determine the meaning of those scores.
Projective Personality Tests (Rorschach/TAT)
"Projective personality tests" are relatively unstructured tests; the stimuli presented to an examinee are ambiguous and responses are open-ended. Projective tests are based on the projective hypothesis, which proposes that a person's interpretation of ambiguous stimuli reflects his/her personality traits, feelings, needs, etc.
1. RORSCHACH: Presents an examinee with 10 inkblots and his/her responses
presumably reflect his/her underlying personality, conflicts, etc. Scoring and interpretation involve looking at dimensions of a response (e.g., content, form, originality). Administration includes a free association and inquiry phase. The Rorschach appears to be most valid for assessing cognitive style and perceptual organization and least valid for psychoanalytic interpretations.
2. TAT: The TAT is based on Murray's theory of needs. The examinee is asked to make
up a story about a series of vague pictures. Murray's scoring and interpretation system involves identifying the story's hero and evaluating the intensity, frequency and duration of needs, environmental press, thema and outcomes expressed in each story. The TAT appears to have little utility for assigning specific diagnoses, but may be useful for gross
diagnostic distinctions.
Psychomotor Ability Tests
"Psychomotor ability tests" are measures of special abilities related to speed and quality of movement that are used primarily to predict success in certain skilled jobs or trades. Factor analyses of psychomotor tests have demonstrated that there is no underlying "g" factor and that the various abilities are relatively independent.
Scholastic Tests (Scholastic Aptitude Test/Graduate Record Examination)
1. SCHOLASTIC APTITUDE TEST (SAT): A widely used group test administered to high
school students for the purpose of predicting college success. The SAT provides Verbal and Mathematical scores. Research on its validity has shown that SAT Verbal and
Mathematical scores combined have slightly better predictive validity than either score alone and that the highest level of validity is achieved when SAT scores are combined with high school grades.
2. GRADUATE RECORD EXAMINATION (GRE): An aptitude-achievement test of verbal, mathematical and analytical ability used to select applicants to graduate school. The GRE General Test provides Verbal, Quantitative and Analytic scores, and GRE Subject (Advanced) Tests are available for 20 subject areas. The GRE composite score (General plus Subject scores) has been found to be more valid than undergraduate GPA for
predicting graduate school performance. The highest predictive validity is obtained with weighted composites that include GPA and one or more GRE scores.
Standardized Test
A "standardized test" is one that has a specific set of procedures for administration,
scoring and interpretation. A "standardization sample" is the sample used to develop
standardized procedures and to obtain test norms.
Structured Personality Test (MMPI/MMPI-2/CPI/EPPS)
"Structured personality tests" provide an examinee with specific statements and require him/her to make specific responses (e.g., multiple-choice questions). Structured
personality tests are usually self-report measures that require an examinee to observe and report on his/her own behaviors.
1. THE MMPI: Reports performance in terms of 10 clinical scales and four validity scales. The MMPI was originally intended as a tool for deriving psychiatric diagnoses, but is more
commonly used to assess personality through profile analysis. The clinical scales were developed on the basis of empirical criterion keying.
2. MMPI-2: A revision of the MMPI that includes 15 new content scales (derived on the basis of content analysis) and three new validity scales. The standardization sample is more representative of the population than the sample for the original MMPI.
3. CPI: Contains many items from the MMPI, but is designed to assess "normal"
personality characteristics, such as self-control, dominance and sociability.
4. EPPS: Provides scores on Murray's 15 basic needs (e.g., Achievement, Affiliation, Autonomy). The EPPS uses a forced-choice format, which presumably controls the social desirability response set, and yields ipsative scores.
TESTS IN PRINT & MENTAL MEASUREMENTS YEARBOOK
1. TESTS IN PRINT: A comprehensive bibliography of all commercially-published tests in
English.
2. MENTAL MEASUREMENTS YEARBOOK (MMY): A periodically-published review of commercially-available tests published in English. It includes factual information about tests, as well as critical reviews.
MMPI Validity Scales
"Validity scales" are special scales designed to assess test-taking attitudes and to determine if the results of a test for a particular examinee are valid. The MMPI-2 includes seven validity scales: L, K, F(1), F(2), Cannot Say, VRIN and TRIN.
1. F (VALIDITY) SCALE: A high F scale score suggests response carelessness,
eccentricity, psychiatric dysfunction or scoring errors.
2. L (LIE) SCALE: A high L scale score indicates that the examinee has attempted to present himself/herself in an unrealistically favorable light.
WECHSLER VERBAL-PERFORMANCE IQ SCORE DISCREPANCY, SCATTER ANALYSIS
1. WECHSLER VERBAL-PERFORMANCE IQ SCORE DISCREPANCY: A method for
interpreting Wechsler test scores. A discrepancy of 15 points or higher is generally
considered significant. A higher Verbal IQ suggests right hemisphere damage, neurosis or psychosis. A higher Performance IQ may indicate left hemisphere damage, educational deficits or sociopathy.
2. SCATTER (PATTERN) ANALYSIS: A type of pattern analysis of subtest or scale scores that involves looking at score variability. WAIS-R subtest scores are often interpreted in terms of scatter analysis, but the research suggests that scatter analysis is not particularly valid and results in too many "false positives."
Wonderlic Personnel Test
The Wonderlic is a 12-minute test of intelligence. Although the Wonderlic is often used in industry to assist in personnel decisions, its use for this purpose has been criticized on the ground that it unfairly discriminates against members of certain minority groups.
Base and Positive Hit Rate
1. BASE RATE: The proportion of correct decisions (e.g., hiring decisions, diagnoses)
being made without use of the predictor of interest. The base rate is calculated by dividing the number of correct decisions by the total number of decisions. A predictor is likely to have the greatest incremental validity when the base rate is moderate (around .50).
2. POSITIVE HIT RATE: When assessing a test's incremental validity, the proportion of correct acceptances. The positive hit rate is calculated by dividing the number of true positives by the total number of positives.
Classical Test Theory
Views an examinee's obtained test score (X) as being composed of two additive and independent components, a true score component (T) and an error component (E): X = T + E. The true score component of an obtained score reflects an examinee's actual status with regard to the attribute that is measured by the test. The error component represents measurement error, which is random error due to factors that are irrelevant to what is being measured by the test and have an unpredictable (unsystematic) effect on the test score. A measure of reliability provides an estimate of the proportion of variability in examinees' obtained scores that is due to true differences among examinees on the attribute(s) measured by the test.
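The relationship between X = T + E and reliability can be sketched with a small simulation. The normal distributions, the means and standard deviations, and the function name `simulate_scores` are assumptions chosen for illustration; in practice true scores are never observed directly, which is why reliability has to be estimated.

```python
import random
import statistics

def simulate_scores(n_examinees, true_sd, error_sd, seed=0):
    """Simulate obtained scores as X = T + E with independent T and E."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(100, true_sd) for _ in range(n_examinees)]
    errors = [rng.gauss(0, error_sd) for _ in range(n_examinees)]  # random error, mean 0
    obtained = [t + e for t, e in zip(true_scores, errors)]
    return true_scores, errors, obtained

true_scores, errors, obtained = simulate_scores(5000, true_sd=12, error_sd=6)

# Reliability as the proportion of obtained-score variance due to true-score variance.
reliability = statistics.pvariance(true_scores) / statistics.pvariance(obtained)

# Theoretical value for these assumed parameters: 12^2 / (12^2 + 6^2) = 0.8
theoretical = 12**2 / (12**2 + 6**2)
print(round(reliability, 3), theoretical)
```

The simulated ratio should land close to the theoretical 0.8, with small deviations due to sampling.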
Concurrent and Predictive Validity
1. CONCURRENT VALIDITY: A type of criterion-related validity that involves administering the predictor and criterion at about the same time. Concurrent validity is the appropriate type of validity when a predictor will be used to estimate current status on the criterion.
2. PREDICTIVE VALIDITY: A type of criterion-related validity that is determined by correlating predictor scores with subsequently-obtained criterion scores. Predictive
validity is the appropriate type of validity when a predictor will be used to predict future performance on the criterion.
CONTENT VALIDITY, CONSTRUCT VALIDITY & CRITERION-RELATED VALIDITY
1. CONTENT VALIDITY: The extent to which a test adequately samples the domain of information, knowledge or skill that it purports to measure. Content validity is determined primarily by "expert judgment" and is most important for achievement tests and job sample
tests.
2. CONSTRUCT VALIDITY: The extent to which a test measures the hypothetical trait (construct) it is intended to measure. Methods for establishing construct validity include correlating test scores with scores on measures that do and do not measure the same trait (i.e., assessing convergent and discriminant validity); conducting a factor analysis to assess the test's factorial validity; determining if changes in test scores reflect expected developmental changes; and seeing if experimental manipulations have the expected impact on test scores.
3. CRITERION-RELATED VALIDITY: Involves determining the relationship (correlation) between the predictor and the criterion. Criterion-related validity can be either concurrent or predictive.
CONVERGENT & DIVERGENT VALIDITY
1. CONVERGENT VALIDITY: Used to assess a test's construct validity. Convergent
validity exists when a test correlates highly with measures designed to assess the same or a related trait.
2. DIVERGENT (DISCRIMINANT) VALIDITY: Used to assess a test's construct validity. Divergent validity exists when a test has low correlations with measures designed to assess unrelated traits.
Correction for Attenuation
The formula used to correct a validity coefficient for the unreliability of the predictor and/or criterion. The correction for attenuation formula estimates what a predictor's criterion-related validity coefficient would be if the predictor and/or criterion had a reliability coefficient of 1.0.
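The card does not state the formula itself. The sketch below uses the standard form, in which the observed validity coefficient is divided by the square root of the product of the two reliability coefficients. The function name and the example values are illustrative assumptions.

```python
import math

def correct_for_attenuation(r_xy, r_xx=1.0, r_yy=1.0):
    """Estimate the validity coefficient with perfectly reliable measures.

    r_xy: observed predictor-criterion correlation
    r_xx: predictor reliability (leave at 1.0 to skip correcting the predictor)
    r_yy: criterion reliability (leave at 1.0 to skip correcting the criterion)
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliability coefficients must be in (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)

# Assumed example values: observed validity .40, predictor reliability .80,
# criterion reliability .50.
both = correct_for_attenuation(0.40, r_xx=0.80, r_yy=0.50)
criterion_only = correct_for_attenuation(0.40, r_yy=0.50)
print(round(both, 3), round(criterion_only, 3))
```

Correcting for the criterion only is the more conservative choice when the predictor will be used in its current, imperfectly reliable form.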
Criterion and Predictor
1. CRITERION: In test construction and in organizations, the criterion is the variable predicted by the predictor. In research design, the criterion is the dependent variable. A criterion is a measure of behavior or performance.
2. PREDICTOR: Any test or measuring device used to predict performance on some other variable. In test construction, the predictor is the measure used to predict or estimate
performance on a criterion. In research design, the predictor is the independent variable. In organizations, selection tests are used as predictors of future job performance to assist in selection decisions.
Criterion Contamination
The bias introduced into a person's criterion score as a result of the knowledge of the scorer about his/her performance on the predictor. Criterion contamination tends to
artificially inflate the relationship between the predictor and criterion, resulting in an artificially high criterion-related validity coefficient.
Criterion and Norm Referenced Interpretation
1. CRITERION-REFERENCED INTERPRETATION: Interprets test scores in terms of a
pre-specified standard: (a) PERCENTAGE SCORE: The percent of test content answered correctly. A score of 80 means that the examinee answered 80% of the test content correctly. (b) EXPECTANCY TABLE: Used to interpret examinees' scores in terms of their likely status on an external criterion. A test developer can determine the probability that examinees will achieve different criterion scores, given their obtained predictor (test) score.
2. NORM-REFERENCED INTERPRETATION: Interprets an examinee's test score relative to the performance of examinees in a normative (standardization) sample: (a)
PERCENTILE RANK: The percent of examinees in the normative group who obtained a lower score. If a raw score is equal to a PR of 60, then 60% of examinees in the norm
group obtained lower raw scores. (b) STANDARD SCORE: Reports test performance in
terms of standard deviation units from the mean achieved by examinees in the norm group. For example, T-SCORES have a mean of 50 and standard deviation of 10, Z-SCORES
have a mean of 0 and standard deviation of 1 and the DEVIATION IQ is a standard score on an intelligence test that has a mean of 100 and a fixed standard deviation (16 for the Stanford-Binet and 15 for the Wechsler tests).
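The norm-referenced scores described above can be sketched as follows. The tiny norm group and the use of the population standard deviation are assumptions made for illustration; real tests use large standardization samples and published norm tables.

```python
import statistics

def percentile_rank(raw, norm_scores):
    """Percent of the norm group who obtained a lower score than `raw`."""
    below = sum(1 for s in norm_scores if s < raw)
    return 100.0 * below / len(norm_scores)

def z_score(raw, norm_scores):
    """Distance from the norm-group mean in standard deviation units."""
    mean = statistics.mean(norm_scores)
    sd = statistics.pstdev(norm_scores)  # assumption: population SD of the norm group
    return (raw - mean) / sd

def t_score(z):
    """T-score: mean 50, standard deviation 10."""
    return 50 + 10 * z

def deviation_iq(z, sd=15):
    """Deviation IQ: mean 100; SD 15 (Wechsler) or 16 (Stanford-Binet, per the text)."""
    return 100 + sd * z

norm_group = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]  # hypothetical raw scores
raw = 24

pr = percentile_rank(raw, norm_group)
z = z_score(raw, norm_group)
print(pr, round(z, 2), round(t_score(z), 1),
      round(deviation_iq(z), 1), round(deviation_iq(z, sd=16), 1))
```

The same raw score maps to different deviation IQs under the two standard deviations, which is why scores from tests with different scales are not directly interchangeable.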
Criterion-Related Validity Coefficient
The correlation coefficient derived by correlating a predictor with a criterion. When the coefficient is large, this confirms that the predictor has criterion-related validity.
Cross-validation and Shrinkage
"Cross-validation" entails determining predictor-criterion validity on a second sample independently drawn from the same population used in the original validation study. This is necessary because when a predictor is developed, the items from the original sample that are included in the final version of the test are the items that correlate most highly with the criterion. Some high correlations, however, are caused by unique characteristics (chance factors) of the item try-out sample, rather than by a true relationship between the items and the criterion. Thus, the predictor is tailor-made for the item try-out sample and, if the same sample is used to validate the test, the criterion-related validity coefficient may be spuriously high. Cross-validation is associated with SHRINKAGE (the cross-validation coefficient is often lower than the original coefficient) because chance factors in the item try-out sample are not all present in the cross-validation sample.
FACTOR ANALYSIS (FACTOR LOADINGS, COMMUNALITY, SPECIFICITY)
"Factor analysis" is a multivariate statistical method that identifies the minimum number of common factors (constructs) needed to account for the intercorrelations among a set of tests, subtests or items. It can establish construct validity: Construct (factorial) validity
exists when a test has high correlations with the factor(s) it is expected to correlate with and low correlations with the factor(s) it is not expected to correlate with; i.e., a factor analysis provides information about a test's convergent and discriminant validity.
1. FACTOR MATRIX/FACTOR LOADINGS: A factor matrix displays the correlations (factor loadings) between each test and each identified factor and each test's communality.
FACTOR LOADINGS are correlation coefficients that show the degree of association between each test and each identified factor; they can be squared to determine the
amount of score variability explained by the factor.
2. COMMUNALITY: Test score variability due to factors that the test shares in common with the other tests in the factor analysis. If the factors are orthogonal, communality can be computed by summing the squared factor loadings.
3. SPECIFICITY: Test score variability due to factors that are specific (unique) to the test and not measured by any other test in the factor analysis.
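As a sketch of the arithmetic for orthogonal factors: each squared loading gives the proportion of a test's variance explained by that factor, and the squared loadings summed across factors give the test's communality. The factor matrix below is invented for illustration and does not come from any real analysis.

```python
def variance_explained(loading):
    """Proportion of a test's score variability explained by one factor."""
    return loading ** 2

def communality(loadings):
    """Sum of squared loadings across orthogonal factors."""
    return sum(variance_explained(l) for l in loadings)

# Hypothetical factor matrix: test -> loadings on Factor I and Factor II.
factor_matrix = {
    "vocabulary": [0.80, 0.10],
    "arithmetic": [0.30, 0.70],
    "block_design": [0.20, 0.60],
}

communalities = {test: communality(l) for test, l in factor_matrix.items()}
for test, h2 in communalities.items():
    print(f"{test}: communality = {h2:.2f}")
```

The summation only holds for orthogonal (uncorrelated) factors; with correlated factors, the shared variance between factors would be counted more than once.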
Factors that affect the reliability coefficient
1. TEST LENGTH: The larger the sample of the attribute being measured by a test, the less the relative effects of measurement error and the more likely the sample will provide consistent information. In general, the longer the test, the larger its reliability coefficient. The SPEARMAN-BROWN PROPHECY FORMULA can be used to estimate the effects of lengthening or shortening a test on its reliability coefficient.
2. RANGE OF SCORES: The reliability coefficient is maximized when the range of scores is unrestricted. The range is affected by the degree of similarity of examinees with regard to the attribute measured by the test (when they are heterogeneous, the range is
maximized) and the difficulty level of the items (when all items are very difficult or very
easy, all examinees will obtain either low or high scores resulting in a restricted range). It
is best to make sure that most items are moderately difficult.
3. GUESSING: The reliability coefficient is affected by the probability that examinees can guess the correct answers. As the probability of correctly guessing increases, the reliability coefficient decreases.
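The effect of test length can be sketched with the Spearman-Brown prophecy formula in its standard form, where n is the factor by which the test is lengthened (n = 2 doubles it, n = 0.5 halves it). The function name and example values are illustrative assumptions, and the formula presumes that added items are comparable to the existing ones.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length by `length_factor`.

    r_new = (n * r) / (1 + (n - 1) * r)
    """
    n, r = length_factor, reliability
    return (n * r) / (1 + (n - 1) * r)

# Assumed example: a test with reliability .60.
doubled = spearman_brown(0.60, 2)    # twice as many comparable items
halved = spearman_brown(0.60, 0.5)   # half as many items
print(round(doubled, 3), round(halved, 3))
```

With n = 2, the same function gives the full-test estimate from a correlation between two half-tests.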
INCREMENTAL VALIDITY (TRUE OR FALSE POSITIVE/TRUE OR FALSE NEGATIVE)
"Incremental validity" (utility) refers to the increase in decision-making accuracy resulting from the use of a new predictor. Utility is calculated by subtracting the base rate from the
positive hit rate. The TAYLOR-RUSSELL TABLES provide estimates of a predictor's positive hit rate for various combinations of validity coefficients, base rates and selection
ratios.
1. TRUE POSITIVES: People who score above the predictor cutoff and above the criterion cutoff; e.g., those who are hired because of their predictor score and who are successful on the criterion.
2. FALSE POSITIVES: People who score above the predictor cutoff, but below the criterion cutoff; e.g., those who are hired because of their predictor scores, but who should not have been hired because they are unsuccessful workers.
3. TRUE NEGATIVES: People who score below the predictor cutoff and below the criterion cutoff; e.g., those who are not hired because of their predictor score and who are unsuccessful on the criterion.
4. FALSE NEGATIVES: People who score below the predictor cutoff, but above the criterion cutoff; e.g., those who are not hired because of their predictor scores, but who should have been hired because they would have been successful workers.
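The four outcomes and the utility calculation can be sketched as follows. The data, both cutoffs, and the treatment of a score exactly at a cutoff as "above" are assumptions for illustration. The base rate is taken here as the proportion of people who succeed on the criterion when everyone is accepted without the predictor, which is one common way to operationalize it.

```python
from collections import Counter

def classify(predictor, criterion, predictor_cutoff, criterion_cutoff):
    """Label one person as TP, FP, TN or FN (a score at the cutoff counts as above)."""
    selected = predictor >= predictor_cutoff
    successful = criterion >= criterion_cutoff
    if selected:
        return "TP" if successful else "FP"
    return "FN" if successful else "TN"

# Hypothetical (predictor score, criterion score) pairs for ten applicants.
people = [(85, 7), (90, 8), (78, 6), (82, 3), (60, 7),
          (55, 2), (65, 4), (70, 6), (88, 9), (50, 3)]

counts = Counter(classify(p, c, predictor_cutoff=75, criterion_cutoff=5)
                 for p, c in people)

# Positive hit rate: true positives divided by all positives.
positive_hit_rate = counts["TP"] / (counts["TP"] + counts["FP"])
# Base rate (assumed definition): proportion successful without using the predictor.
base_rate = (counts["TP"] + counts["FN"]) / len(people)
# Incremental validity: positive hit rate minus base rate.
incremental_validity = positive_hit_rate - base_rate

print(dict(counts), positive_hit_rate, base_rate, round(incremental_validity, 2))
```

Raising the predictor cutoff typically reduces false positives at the cost of more false negatives, so the cutoff should reflect which error is more costly.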
ITEM DIFFICULTY INDEX & ITEM DISCRIMINATION INDEX
1. ITEM DIFFICULTY INDEX: A measure of an item's difficulty level. The item difficulty index is calculated by dividing the number of individuals who answered the item correctly
by the total number of individuals, and ranges in value from 0 (very difficult item) to 1.0 (very easy item). In general, an item difficulty index of .50 is preferred because it
maximizes the differentiation between individuals with high and low ability and helps
ensure a high reliability coefficient.
2. ITEM DISCRIMINATION INDEX: A measure of how effectively an item discriminates between examinees who achieve high and low total scores on the test or on an external criterion measure.
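Both indices can be sketched for a single dichotomously scored item. The card does not give a formula for the discrimination index, so the sketch assumes one common approach: the proportion passing the item in the highest-scoring group minus the proportion passing it in the lowest-scoring group. The 27% group size and the response data are also illustrative assumptions.

```python
def item_difficulty(item_responses):
    """Proportion of examinees answering the item correctly (1 = correct, 0 = incorrect)."""
    return sum(item_responses) / len(item_responses)

def item_discrimination(item_responses, total_scores, fraction=0.27):
    """D = p(upper group) - p(lower group), with groups formed on total test score."""
    n = len(total_scores)
    group_size = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])  # lowest to highest total
    lower, upper = order[:group_size], order[-group_size:]
    p_upper = sum(item_responses[i] for i in upper) / group_size
    p_lower = sum(item_responses[i] for i in lower) / group_size
    return p_upper - p_lower

# Hypothetical data for ten examinees.
total_scores = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
item = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

p = item_difficulty(item)
d = item_discrimination(item, total_scores)
print(p, round(d, 2))
```

An index near 0 would mean the item does not separate high and low scorers, and a negative value would mean low scorers pass the item more often than high scorers.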
LINEAR TRANSFORMATION & NONLINEAR TRANSFORMATION
1. LINEAR TRANSFORMATION: A type of score transformation in which the resulting
distribution maintains the same shape as the raw scores. Examples include z-scores and T-scores.
2. NONLINEAR TRANSFORMATION: A type of score transformation in which the resulting distribution is not the same shape as the raw score distribution. Examples include percentile ranks and normalized z-scores.
Measurement Error (Time/Content Sampling/Rater Differences)
The component of obtained test score variability that is attributable to random factors that are irrelevant to the purposes of testing and that produce inconsistencies in measurement. According to classical test theory, variability in obtained test scores is due to a combination of true score variability and measurement error. For example, when estimating a test's reliability: (1) TIME SAMPLING FACTORS (e.g., random changes in examinees or the test setting due to the passage of time) are a source of measurement error when estimating test-retest and alternate forms reliability; (2) CONTENT SAMPLING FACTORS (i.e., error due to an interaction between different examinees' knowledge and the different content assessed by two forms of a test) are a source of measurement error when assessing alternate forms and internal consistency reliability; and (3) RATER DIFFERENCES are a source of measurement error when estimating inter-rater reliability.
Methods for Estimating Reliability (Test-Retest/Alternate Forms/Split Half-Spearman Brown Prophecy Formula/Alpha-Internal Consistency/Interrater)
1. TEST-RETEST (COEFFICIENT OF STABILITY): Involves giving one test to one group of examinees 2x, then correlating the two sets of scores. Is useful for measures of stable attributes and those not affected by repeated measurement.
2. ALTERNATE FORMS (COEFFICIENT OF EQUIVALENCE/STABILITY): Involves giving 2 forms of a test to one group of examinees and then correlating the two sets of scores. Not useful when the attribute may fluctuate or the scores may be affected by repeated measurement.
3. SPLIT-HALF (COEFFICIENT OF INTERNAL CONSISTENCY): Involves giving a test
once to one group of examinees, splitting it in half, then correlating the scores on the two halves. The result is corrected with the SPEARMAN-BROWN PROPHECY FORMULA, which estimates what the reliability coefficient would have been if it were based on the full test. Split-half estimates are useful when the attribute may fluctuate or may be affected by repeated exposure to the test and when the content is homogeneous.
4. COEFFICIENT ALPHA (COEFFICIENT OF INTERNAL CONSISTENCY): Involves giving
a test once to a group of examinees and then correlating individual test items. When the items are scored dichotomously (right/wrong), KR20 can be used.
5. INTER-RATER: Involves giving a test once, having it scored by two or more raters and then correlating the scores or calculating percent agreement.
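The Spearman-Brown correction mentioned under the split-half method can be sketched in Python as follows. The function name and the example half-test correlation are hypothetical; the formula shown is the general form, with `k = 2` corresponding to stepping a half-test correlation up to the full test length.

```python
def spearman_brown(r, k=2.0):
    """Estimate reliability for a test k times as long as the one that produced r."""
    return k * r / (1 + (k - 1) * r)

half_test_r = 0.60                      # hypothetical correlation between two halves
full_test_r = spearman_brown(half_test_r)
print(round(full_test_r, 2))            # 0.75
```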
Orthogonal and Oblique Rotation
When conducting a factor analysis, a test developer decides, on the basis of his/her theory about the constructs measured by the tests included in the analysis, whether to conduct an orthogonal or oblique rotation: If he/she believes that the constructs are unrelated, he/she would choose an orthogonal rotation, but if he/she believes that the constructs are not entirely independent, he/she would choose an oblique rotation.
1. ORTHOGONAL ROTATION: In a factor analysis, a transformation of the identified factors that yields uncorrelated (orthogonal) factors; i.e., the attribute measured by one factor is independent from the attribute measured by the other factor(s).
2. OBLIQUE ROTATION: In a factor analysis, a transformation of the identified factors that yields correlated (oblique) factors; i.e., the attributes measured by the factors are not independent because some of the variability that is explained by one factor is also explained by the other factor(s).
Reliability and Validity
1. RELIABILITY: The consistency of test scores; i.e., the extent to which a test measures an attribute without being affected by random fluctuations (measurement error) that produce inconsistencies over time, across items or over different forms.
2. VALIDITY: The extent to which a test actually measures what it is intended to measure.
3. THE RELATIONSHIP BETWEEN RELIABILITY AND VALIDITY: Reliability is a necessary, but not a sufficient condition for validity. For criterion-related validity, the
validity coefficient can be no greater than the square root of the product of the
reliabilities of the predictor and criterion.
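The ceiling on the validity coefficient described above can be computed directly. This is a minimal sketch; the function name and the example reliabilities are made up for illustration.

```python
import math

def max_validity(predictor_reliability, criterion_reliability):
    """Upper bound on the validity coefficient: sqrt(r_xx * r_yy)."""
    return math.sqrt(predictor_reliability * criterion_reliability)

# Hypothetical reliabilities of .81 (predictor) and .64 (criterion)
print(round(max_validity(0.81, 0.64), 2))   # 0.72
```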
Reliability Coefficients (Coefficient of Stability/Equivalence/Internal Consistency)
A "reliability coefficient" is the correlation coefficient used to estimate a test's reliability. It ranges in value from 0 to 1.0 (usually, a value of .80 or higher is acceptable) and is interpreted directly as a measure of true score variability; e.g., a reliability coefficient of
.86 means that 86% of variability in obtained test scores reflects true score variability and 14% reflects measurement error.
1. COEFFICIENT OF STABILITY: Yielded when estimating reliability using the test-retest method. A coefficient of stability indicates the degree of stability in scores over time.
2. COEFFICIENT OF EQUIVALENCE: Yielded when estimating reliability using the alternate forms method and the two forms are administered at about the same time. The coefficient of equivalence indicates the consistency of responding to different item
samples.
3. COEFFICIENT OF EQUIVALENCE AND STABILITY: Yielded when estimating reliability using the alternate forms method and a relatively long period of time separates
administration of the forms. The coefficient of equivalence and stability indicates the consistency of responding over time and to different item samples.
4. COEFFICIENT OF INTERNAL CONSISTENCY: Yielded when using split-half reliability
and coefficient alpha.
Reliability of Difference Scores
The "reliability of difference scores" is associated with the interpretation of reliability. A test user computes a difference score when he/she is interested in comparing the performance of an examinee on two different tests or subtests. The reliability of difference scores refers to the reliability of scores that have been calculated by subtracting examinees' scores on one test from their scores on another test. The reliability of difference scores is never greater than the average of the reliabilities of the two tests.
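One common textbook formula for the reliability of difference scores is sketched below. It is not given in the flashcards and it assumes the two tests have equal score variances; the function name and the example values are hypothetical.

```python
def difference_score_reliability(r_xx, r_yy, r_xy):
    """Reliability of X - Y, assuming both tests have equal variances.

    r_xx, r_yy: reliabilities of the two tests
    r_xy: correlation between the two tests
    """
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

# Hypothetical values: reliabilities of .90 and .80, tests correlated at .60
r_dd = difference_score_reliability(0.90, 0.80, 0.60)
print(round(r_dd, 3))   # 0.625, below the average reliability of .85
```

Under this formula, the difference scores become less reliable as the correlation between the two tests increases.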
RESPONSE SET (BIAS) & RESPONSE STYLE
1. RESPONSE SET (BIAS): The tendency to respond to test items in a particular way,
regardless of the nature of the test items. Social desirability and acquiescence are examples of response sets. The FORCED-CHOICE ITEM FORMAT is a method of presenting test items in which an examinee must choose one of two or more equally desirable or undesirable statements.
2. RESPONSE STYLE: A type of response set in which an examinee tends to select the
most socially-desirable response, regardless of the nature of the test item.
STANDARD ERRORS OF ESTIMATE & MEASUREMENT & CONFIDENCE INTERVALS
1. STANDARD ERROR OF ESTIMATE: An index of error when predicting criterion scores from predictor scores. The standard error of estimate is used to construct a CONFIDENCE
INTERVAL around an examinee's predicted criterion score. Its magnitude depends on two
factors: the criterion's standard deviation and the predictor's validity coefficient.
2. STANDARD ERROR OF MEASUREMENT: An estimate of the degree to which a
particular set of measurements obtained in a given situation might be expected to deviate from true values. A common practice when interpreting an examinee's obtained score is to construct a CONFIDENCE INTERVAL around that score.
3. CONFIDENCE INTERVAL: The range of values that, with a given probability (e.g., 68%,
95%), indicates the "true" value of interest; i.e., the range within which an individual's true score is likely to fall. A confidence interval is calculated using the appropriate standard error (STANDARD ERROR OF MEASUREMENT or STANDARD ERROR OF ESTIMATE).
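The two standard errors and the confidence interval built from them can be sketched as follows. The formulas are the usual classical ones (SEM from the test's SD and reliability, SEE from the criterion's SD and the validity coefficient); the function names, the example values and the use of z = 1.96 for an approximate 95% interval are assumptions added here.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - reliability)

def standard_error_of_estimate(sd_criterion, validity):
    """SEE = SD_y * sqrt(1 - r_xy ** 2)."""
    return sd_criterion * math.sqrt(1 - validity ** 2)

def confidence_interval(score, standard_error, z=1.96):
    """Interval of +/- z standard errors around an obtained or predicted score."""
    half_width = z * standard_error
    return (score - half_width, score + half_width)

# Hypothetical test: SD = 15, reliability = .91
sem = standard_error_of_measurement(sd=15, reliability=0.91)      # about 4.5
# Hypothetical criterion: SD = 10, validity coefficient = .60
see = standard_error_of_estimate(sd_criterion=10, validity=0.60)  # about 8.0

low, high = confidence_interval(100, sem)   # approximate 95% interval
print(round(sem, 2), round(see, 2), round(low, 2), round(high, 2))
```

Passing `z=1.0` instead gives the approximate 68% interval.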
Standardization
Standardization has two meanings:
1. The development of uniform (standardized) administration and scoring guidelines for a
test.
2. The administration of the test to a standardization sample for the purpose of
establishing test norms.
Steps in Test Construction
1. SPECIFY THE TEST'S PURPOSE: Identify the examinees to whom the test will be administered, specify the goals of the test and define the attributes (e.g., skills,
knowledge) that must be measured to achieve those goals.
2. ITEM GENERATION: Translate the attribute(s) to be measured by the test into a set of observables (e.g., for a test that will assess learning in a course, systematically analyze
the course content), specify the format of the test (e.g., maximal vs. typical performance, speed vs. power, objective vs. subjective) and select a scoring method.
3. ITEM TRYOUT AND ANALYSIS: Generate (write and/or select) test items. This usually yields more items than will be included in the final version of the test. Item analysis is used to obtain information about each item in order to determine which items to keep. In this process, the items are administered to a sample of examinees similar to those who will be taking the final version of the test. The items are then evaluated in terms of difficulty level, discrimination, etc.
4. STANDARDIZATION
True vs. Obtained Score
1. TRUE SCORE: The score an examinee would obtain on a test if the test could measure the given attribute without error; in other words, if the test were perfectly reliable, an examinee's true score would reflect his/her actual status on the attribute being measured by the test. According to classical test theory, however, there is no such thing as a perfectly reliable test. Therefore, a true score is a hypothetical test score because one can never know its value.
2. OBTAINED SCORE: According to classical test theory, an examinee's obtained score is
equal to his/her true score plus error. Error, or measurement error, refers to any factors unrelated to the attribute being measured by the test that affect an examinee's score on the test.
Alpha and Beta
1. ALPHA (LEVEL OF SIGNIFICANCE): The probability of rejecting a true null hypothesis;
i.e., the probability of making a Type I error. The value of alpha is set by an experimenter before collecting or analyzing the data. In psychological research, alpha is commonly set at either .01 or .05.
2. BETA: The probability of retaining a false null hypothesis; i.e., the probability of making
a Type II error. The exact value of beta is not set by an investigator and cannot be
directly calculated, but the probability of making a Type II error can be indirectly
influenced: A Type II error is more likely when the value of alpha is low, when the sample size is small and when the independent variable is not administered in sufficient intensity.
Bivariate Correlation Techniques (Bivariate Frequency Distribution/Scattergram/
Correlation Coefficient/Regression Analysis/Regression Line/Least Squares Criterion/Regression Equation)
"Bivariate correlational techniques" are used to assess the degree of association between
two variables. Examples include:
1. BIVARIATE FREQUENCY DISTRIBUTION: Distribution used to present the relationship between two variables in a table format. The number in each cell (category) represents
the frequency of observations in that cell.
2. SCATTERGRAM (SCATTERPLOT): A graphic summary of the degree of association between two variables. A wide scatter indicates a low correlation.
3. CORRELATION COEFFICIENT: A numerical index of the strength and direction of the relationship between two variables.
4. REGRESSION ANALYSIS: A statistical analysis used when the goal is to use a
predictor (X) to predict performance on a criterion (Y). An assumption underlying regression
analysis is that there is a linear relationship between the two variables of interest and, therefore, the relationship can be summarized by a straight line. That relationship can be described by a REGRESSION LINE ("line of best fit"), which is identified by the LEAST SQUARES CRITERION, a mathematical process that locates the line so that the amount of error in prediction is minimized. The regression line is defined by a REGRESSION EQUATION, which can be used to predict the most probable values of Y for any known X (score on the predictor).
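The least squares criterion and the resulting regression equation can be illustrated with a short Python sketch. The data and function names are made up; the slope and intercept formulas are the standard ones for a single predictor.

```python
from statistics import mean

def least_squares_line(x, y):
    """Return (intercept, slope) of the line that minimizes squared prediction error."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

def predict(intercept, slope, x_value):
    """Regression equation: predicted Y = intercept + slope * X."""
    return intercept + slope * x_value

x = [1, 2, 3, 4, 5]          # hypothetical predictor scores
y = [2, 4, 5, 4, 5]          # hypothetical criterion scores
a, b = least_squares_line(x, y)
print(round(a, 2), round(b, 2))        # 2.2 0.6
print(round(predict(a, b, 6), 2))      # 5.8
```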
Constant and Variable
1. CONSTANT: Any event, trait or other phenomenon that is limited to one state, level or condition. For example, gender is a constant in a study that includes only male subjects.
2. VARIABLE: Any event, trait or other phenomenon that is capable of varying, or existing in at least two different states, conditions or levels. For example, gender is a variable in a study that includes male and female subjects. (a) CONTINUOUS AND DISCRETE VARIABLES: A continuous
variable is capable of assuming a potentially infinite number of different values within a given range of values. For a discrete variable, only a limited number of values can be generated. (b) MANIPULATED AND ORGANISMIC VARIABLES: A manipulated variable is
an independent variable that is controlled by the experimenter. When an experimenter can identify the levels of the variable and determine who will receive which levels, the variable is a manipulated variable. An organismic (subject) variable is an independent variable that is a characteristic of the subjects (e.g., IQ, height) and, therefore, cannot be
manipulated by the investigator.
Constant Effects (Adding and Subtracting to Constant)
1. ADDING OR SUBTRACTING A CONSTANT TO/FROM EACH SCORE IN A DISTRIBUTION: The measures of central tendency change, but the measures of
variability do not. For the mean, adding or subtracting a constant has the following effect: The mean is equal to the mean of the original distribution of scores plus (or minus) the constant.
2. MULTIPLYING OR DIVIDING EACH SCORE IN A DISTRIBUTION BY A CONSTANT: The measures of central tendency and measures of variability are all affected. Multiplying
or dividing by a constant has the following effects: (a) The mean is equal to the mean of the original distribution of scores multiplied (or divided) by the constant. (b) The standard deviation is equal to the standard deviation of the original set of scores multiplied (or divided) by the constant. (c) The variance is equal to the variance of the original set of scores multiplied (or divided) by the square of the constant.
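These effects are easy to check numerically. The sketch below uses a made-up set of scores and Python's `statistics` module; the constants 10 and 3 are arbitrary.

```python
from statistics import mean, pstdev, pvariance

scores = [2, 4, 6, 8]                    # hypothetical distribution
shifted = [x + 10 for x in scores]       # add a constant
scaled = [x * 3 for x in scores]         # multiply by a constant

# Adding a constant shifts the mean but leaves variability unchanged.
print(mean(scores), mean(shifted))               # 5 and 15
print(pvariance(scores), pvariance(shifted))     # 5 and 5

# Multiplying by a constant scales the mean and SD by the constant
# and the variance by the square of the constant.
print(mean(scaled))                              # 15
print(pstdev(scaled) / pstdev(scores))           # about 3
print(pvariance(scaled) / pvariance(scores))     # about 9
```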
Correlation and Correlational Techniques
1. CORRELATION: An association between two or more variables in which systematic increases in the magnitude of one variable are associated with systematic increases or decreases in the magnitude of the other variable (or variables). When a correlation exists,
it is possible to make predictions about the expected magnitude of one variable on the basis of the known magnitude of the other variable. The stronger the relationship (the further from zero the correlation coefficient), the more confidence in the accuracy of the prediction.
2. CORRELATIONAL TECHNIQUES: Statistical techniques used to assess the strength of
the relationship between two or more variables. Psychologists are often interested in correlation because they want to use one variable as a predictor (estimator) of another variable. When correlational techniques are used for this purpose, the X (independent)
variable is often called the predictor, and the Y (dependent) variable is called the criterion. Correlational techniques are bivariate or multivariate.
Correlation Coefficient (Pearson R, Spearman Rank Order, Biserial, Pt. Biserial)
A "correlation coefficient" is a numerical index of the relationship (association) between two or more variables. The magnitude of the coefficient indicates the strength of the
relationship, and its sign indicates the direction (positive or negative). The selection of a coefficient is based on the scale of measurement of the variables to be correlated.
1. PEARSON r: A correlation in which both variables have been measured on an interval
or ratio scale. The Pearson r ranges in value from -1.0 to +1.0. To use the Pearson r, three assumptions must be met: linearity, unrestricted range of scores and
homoscedasticity.
2. SPEARMAN RANK ORDER: A correlation based on the ranks of the scores on the two variables; i.e., both variables are rank-ordered.
3. BISERIAL: A correlation in which one variable is measured on an interval or ratio scale and the other is dichotomous, or two-valued. For this correlation, the dichotomy is artificial; e.g., favorable/unfavorable.
4. POINT BISERIAL: A correlation in which one variable is measured on an interval or ratio scale and the other is dichotomous, or two-valued. In this case, the dichotomy is true; e.g., male/female.
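As a small illustration of how the choice of coefficient matters, the sketch below computes a Pearson r and a Spearman rank-order coefficient for the same made-up data. The helper names and data are hypothetical. Spearman is computed here as a Pearson r on the ranks, with tied scores given their average rank.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation for interval or ratio data."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (lowest); tied values share their average rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def spearman_rho(x, y):
    """Spearman rank-order correlation: Pearson r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]        # increases with x, but not linearly

print(round(pearson_r(x, y), 3))     # below 1.0 because the relationship is curved
print(round(spearman_rho(x, y), 3))  # 1.0 because the rank orders agree perfectly
```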
Counterbalancing
A method used in repeated measures designs to control for carryover (order) effects (e.g., fatigue, practice) on the dependent variable measure. Counterbalancing involves
administering the treatments to different subjects or groups of subjects in different orders.
Critical Values and Degrees of Freedom
1. CRITICAL VALUE: An inferential statistical test yields a "test statistic" that allows an investigator to determine whether his/her sample data fall in the rejection or retention region of the sampling distribution. This is done by comparing the test statistic to a "critical value." For most statistical tests, if the test statistic equals or exceeds the critical
value, the investigator concludes that the obtained sample value lies in the rejection region of the sampling distribution and rejects the null hypothesis. Conversely, if the test statistic is less than the critical value, the investigator concludes that the sample value lies in the retention region and retains the null hypothesis. The magnitude of the critical value for a particular study is usually based on two factors: the level of significance (alpha) and the degrees of freedom.
2. DEGREES OF FREEDOM: The number of values or categories in a distribution that are "free to vary," given that certain values or categories in the distribution are known or fixed.
Dependent and Independent Variable
1. DEPENDENT VARIABLE (DV): The variable that is observed and measured in a study and believed to be affected in some way by the independent variable.
2. INDEPENDENT VARIABLE (IV): The variable that is manipulated in a research study for the purpose of determining its effects on the dependent variable. The IV is also known as the experimental variable.
Descriptive and Inferential Statistical Methods
1. DESCRIPTIVE STATISTICAL METHODS: Statistical methods used to describe and
summarize data. A researcher might use a table or a graph, for example, to summarize the set of test scores he/she has collected in a research study.
2. INFERENTIAL STATISTICAL METHODS: Statistical methods used to draw conclusions
about relationships between independent and dependent variables. A researcher would
use an analysis of variance (ANOVA) or other inferential statistical test to determine
whether or not the relationship between variables he/she has found in a research study is
statistically significant.
Developmental Research (Cross-Sectional-Cohort Effects/Longitudinal & Cross-Sequential Studies)
"Developmental research" involves assessing changes that occur in variables as a
function of time. Because developmental research often assesses the effects of
organismic variables or other variables that cannot be manipulated by the experimenter, it is classified as quasi-experimental research.
1. CROSS-SECTIONAL STUDIES: Assess the effects of aging and/or developmental changes over time by comparing groups of individuals representing different age groups or
developmental levels at the same point in time. Cohort effects are a possible confound. COHORT EFFECTS are effects of being part of a group that was born at a particular time and, as a result, was exposed to unique experiences. Any observed differences between age groups might be due to these effects rather than to differences in age only.
2. LONGITUDINAL STUDIES: A group of subjects is followed and evaluated over an extended period of time to assess the effects of aging, natural developmental processes or one or more other independent variables on one or more dependent variables over time.
3. CROSS-SEQUENTIAL STUDIES: Assess the effects of aging and/or developmental
changes over time. Cross-sequential studies help overcome the shortcomings of
cross-sectional and longitudinal research by combining both methodologies.
Double and Single-Blind Technique
1. DOUBLE-BLIND TECHNIQUE: An experimental condition used to reduce reactivity by
keeping research participants and the experimenter uninformed about which level of the
independent variable each subject is receiving.
2. SINGLE-BLIND TECHNIQUE: An experimental condition used to reduce reactivity by
keeping research subjects uninformed about the experimental hypothesis and which level of the independent variable they have been assigned to.
Error Variance
"Error" reflects the proportion of the variance in a set of scores (i.e., the dispersion, or
variability) that cannot be attributed to controlled factors. Error variance is increased by
sampling error, experimental error, measurement error, etc.
Problems in Conducting Evaluation Research
(1) When conflicts arise between the needs of the program and the research, the program
takes precedence. (2) It is difficult to operationally define all of a program's objectives.
(3) Program administrators and staff are sometimes reluctant to cooperate with the
evaluators. (4) It is often difficult to identify or obtain an appropriate control (comparison)
group and, therefore, to conduct an experimental study. (5) Because the effects of a
social program can be influenced by a number of relevant and irrelevant factors, it is often
not possible to identify which components of the program are responsible for any observed
effects. (6) Even when systematic evaluation studies are conducted, their results are often not used to modify or extend a program for a variety of reasons including cost prohibitions and resistance by the administration and staff.
Ex Post Facto Research
"After the fact" research in which the experimental treatment (the independent variable) has been applied prior to the onset of the study. Because ex post facto studies do not allow the experimenter to control the assignment of subjects to treatment groups, they are
considered a type of quasi-experimental research.
Experimental and Descriptive Research
1. EXPERIMENTAL RESEARCH: Research conducted to test hypotheses about the
effects of one or more independent variables on one or more dependent variables.
Experimental research is classified as either true experimental or quasi-experimental.
2. DESCRIPTIVE RESEARCH (PROTOCOL ANALYSIS): Research conducted to describe behavior, rather than test hypotheses about behavior. Examples include case studies, observational techniques, surveys and questionnaires and archival research. (a) CASE STUDIES: The in-depth investigation of a single individual, family, organization, etc.
Although case studies are usually classified as descriptive research, they can be conducted as experimental studies (e.g., as in single-subject research). A shortcoming of
case studies is that their results might not be generalizable to other cases. (b) OBSERVATIONAL STUDIES: Studies involving observing behavior in a systematic way, often in a naturalistic context; e.g., naturalistic field studies, participant observation.
EXPERIMENTWISE ERROR RATE
The probability of making at least one Type I error across all of the statistical comparisons in a study. As the number of statistical comparisons in a study increases, the experimentwise error rate also increases.
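The growth of the experimentwise error rate can be sketched with the usual formula 1 - (1 - alpha)^c. This formula is not stated in the flashcards and assumes that the c comparisons are independent; the function name and example values are hypothetical.

```python
def experimentwise_error_rate(alpha, comparisons):
    """Probability of at least one Type I error across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

for c in (1, 5, 10):
    print(c, round(experimentwise_error_rate(0.05, c), 3))
# 1 0.05
# 5 0.226
# 10 0.401
```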
External and Internal Validity
1. EXTERNAL VALIDITY: The degree to which a study's results can be generalized to
other people, settings, conditions, etc. External validity is always limited by internal validity; i.e., if an investigator cannot conclude that there is a causal relationship between variables within the context of the study, then he/she cannot conclude that
there is a relationship for other people or circumstances.
2. INTERNAL VALIDITY: The degree to which a research study allows an investigator to conclude that the observed variability in a dependent variable is due to the independent variable, rather than to other factors.
Extraneous Variable and Controlling Variability
A variable that is irrelevant to the research hypothesis about the relationship between the
independent variable and dependent variable, but has a systematic and potentially
confounding effect on the dependent variable. Methods for controlling variability due to an extraneous variable include: randomization; holding the extraneous variable constant by selecting subjects who are homogeneous with regard to that variable; using a
statistical technique (e.g., the ANCOVA) to remove variability in the dependent variable that is due to the extraneous variable; and: (1) MATCHING: Pairing or grouping subjects on the basis of their status on the extraneous variable and randomly assigning members of each pair or group to a different treatment group so that groups are initially equivalent with regard to the extraneous variable. (2) BLOCKING: Used to control the impact of an extraneous variable when an investigator wants to statistically analyze its effects on the dependent variable. It involves blocking (grouping) subjects with regard to their status on the extraneous variable and then randomly assigning subjects in each block to one of the treatment groups.
Factorial Design (Main Effects and Interaction)
"Factorial design" is the name given to any research design that includes two or more "factors" (independent variables). A factorial design provides more thorough information about the relationships among variables by allowing the investigator to analyze the main effect of each independent variable, as well as the interaction between variables: (a) MAIN EFFECT: The effect of one independent variable on the dependent variable,
disregarding the effects of another independent variable. (b) INTERACTION: The effects of two or more independent variables considered together. An interaction occurs when the effects of one variable are different at different levels of another variable. When a study has a significant interaction, the main effects should be interpreted with caution.
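A made-up 2 x 2 example shows why main effects need cautious interpretation when an interaction is present. The cell means, factor labels and helper function below are hypothetical, equal cell sizes are assumed, and the quantities are descriptive contrasts on the means rather than significance tests.

```python
# Hypothetical cell means on the DV for factors A (a1, a2) and B (b1, b2)
cell_means = {
    ("a1", "b1"): 10.0, ("a1", "b2"): 20.0,
    ("a2", "b1"): 30.0, ("a2", "b2"): 20.0,
}

def marginal_mean(factor_index, level):
    """Mean of the cell means at one level of a factor (equal cell sizes assumed)."""
    values = [m for key, m in cell_means.items() if key[factor_index] == level]
    return sum(values) / len(values)

# Main effects: differences between marginal means
main_effect_a = marginal_mean(0, "a2") - marginal_mean(0, "a1")   # 25 - 15 = 10
main_effect_b = marginal_mean(1, "b2") - marginal_mean(1, "b1")   # 20 - 20 = 0

# Interaction: the effect of B differs across the levels of A
effect_b_at_a1 = cell_means[("a1", "b2")] - cell_means[("a1", "b1")]   # +10
effect_b_at_a2 = cell_means[("a2", "b2")] - cell_means[("a2", "b1")]   # -10
interaction = effect_b_at_a1 - effect_b_at_a2                          # 20

print(main_effect_a, main_effect_b, interaction)   # 10.0 0.0 20.0
```

Here B has no main effect at all, yet it raises scores at a1 and lowers them at a2, so looking only at the main effect of B would be misleading.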
Group (Multi-Subject) Research Designs
Research that includes one or more groups of subjects. For example:
1. BETWEEN GROUPS: This design is used to assess the effects of different levels of an IV by administering each level to a different group of subjects, and then comparing the status or performance of the groups on the DV. An example is a two-group design that
includes two or more IVs (i.e., a factorial design), with each IV having at least two levels.
2. WITHIN SUBJECTS (REPEATED MEASURES): In this design, all levels of the IV are
administered, at different times, to all subjects, so that comparisons of the different levels of the IV are made within subjects, rather than between groups of subjects. An example is a single-group time-series design in which the effects of a treatment (IV) are evaluated by measuring the DV before and after the treatment is applied. In this design, subjects act as their own controls.
3. MIXED: This design combines between groups and within subjects methodologies. An example is a counterbalanced design, which permits comparisons between groups and within subjects. In this design, different levels of the IV are administered to different subjects or groups of subjects in a different order.
HETEROSCEDASTICITY & HOMOSCEDASTICITY
1. HETEROSCEDASTICITY: In a scatterplot, unequal variability of Y scores at different values of X.
2. HOMOSCEDASTICITY: In a scatterplot, equal variability of Y scores at all values of X. It is assumed that there is a homoscedastic relationship between X and Y. Violation of this assumption does not necessarily result in a coefficient that is too low or too high, but produces a coefficient that does not represent the full range of scores.
Hypothesis Testing (Correlation Coefficient)
Correlation coefficients can be tested to determine if they are statistically significant. In this situation, the null hypothesis is that the correlation coefficient in the population is equal to zero. The hypothesis is tested by comparing the obtained coefficient to an
appropriate critical value. If the coefficient exceeds the critical value, the null hypothesis is rejected; if it is less than the critical value, the null hypothesis is retained.
Measures of Central Tendency (Mean, Median, Mode)
The "measures of central tendency" describe data using a single number. The number conveys a maximum amount of information about the data, summarizes the entire set of
observations and is a typical measure of all the observations. The measures of central tendency are:
1. MEAN: The arithmetic average of a set of scores. The mean can be used when scores are measured on an interval or ratio scale. The mean is affected by every score in the distribution; therefore, it can be misleading when a distribution contains one or a few atypical scores.
2. MEDIAN: The middle score in a distribution of scores when the scores have been
ordered from lowest to highest. The median can be used when scores are measured on an ordinal, interval or ratio scale.
3. MODE: The most frequently occurring category or score in a distribution. The mode can be used when scores are measured on a nominal, ordinal, interval or ratio scale.
4. THE RELATIONSHIP OF THE MEAN, MEDIAN AND MODE IN A SKEWED DISTRIBUTION: In a positively skewed distribution, the mean is greater than the median,
which, in turn, is greater than the mode. In a negatively skewed distribution, the mode is
greater than the median, which, in turn, is greater than the mean.
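The ordering of the three measures in a skewed distribution can be checked with Python's `statistics` module. The scores below are made up to produce a positive skew.

```python
from statistics import mean, median, mode

scores = [1, 2, 2, 3, 4, 10]     # hypothetical scores with one atypically high value

print(mode(scores))              # 2
print(median(scores))            # 2.5
print(round(mean(scores), 2))    # 3.67

# Positive skew: mean > median > mode
print(mean(scores) > median(scores) > mode(scores))   # True
```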
Measures of Variability (Range, Variance, and SD)
The "measures of variability" indicate the amount of heterogeneity, or dispersion, within a
set of scores. Examples include:
1. RANGE: Calculated by subtracting the lowest score in a distribution from the highest score. It can be misleading when the distribution contains an atypically high and/or low score.
2. VARIANCE: Calculation of the variance involves all scores in the distribution. It is calculated by dividing the sum of the squared deviation scores (the sum of squares) by N (or N - 1), and provides a measure of the average amount of variability in the distribution. Calculation requires squaring each deviation score; as a result, the variance represents a
unit of measurement that differs from the original unit of measurement.
3. STANDARD DEVIATION: The SD is expressed in the same unit of measurement as the
original score. It is calculated by taking the square root of the variance, and can be interpreted as a measure of variability (i.e., the larger the standard deviation, the greater
the dispersion of scores), which is useful when comparing the variability of two or more distributions. It can also be interpreted in terms of the normal distribution.
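The three measures can be computed as follows. The scores are hypothetical; `pvariance` and `pstdev` divide the sum of squares by N, while `variance` divides by N - 1.

```python
from statistics import pstdev, pvariance, variance

scores = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical scores, mean = 5

score_range = max(scores) - min(scores)   # highest minus lowest score
pop_variance = pvariance(scores)          # sum of squares / N
sample_variance = variance(scores)        # sum of squares / (N - 1)
pop_sd = pstdev(scores)                   # square root of the population variance

print(score_range)                   # 7
print(pop_variance)                  # 4
print(round(sample_variance, 2))     # 4.57
print(pop_sd)                        # 2.0
```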
Multivariate Correlational Techniques (Canonical Correlation/Discriminant Function Analysis/Path Analysis/Multiple Regression Analysis--Multiple Correlation Coefficient (R)--Multiple Regression Equation)
"Multivariate correlational techniques" are used to assess the relationships among three or more variables. Examples include:
1. CANONICAL CORRELATION: Used to assess the degree of association between one
linear combination of variables (predictors) and another linear combination of variables (criteria). Also, it can be used to identify the dimensions that underlie the correlation between two sets of variables.
2. DISCRIMINANT FUNCTION ANALYSIS: Used when there are two or more predictors
and one criterion that is measured on a nominal scale.
3. PATH ANALYSIS: Used to test a pre-formulated theory about the causal relationships among a set of variables. It involves translating the theory into a path diagram, collecting data on the variables and calculating and interpreting path coefficients.
4. MULTIPLE REGRESSION ANALYSIS: A statistical analysis used when the goal is to use two or more predictors to predict status on a criterion that has been measured on an interval or ratio scale. The output of the analysis is a MULTIPLE CORRELATION
COEFFICIENT (R), which is the correlation between the criterion and a linear combination of the predictor variables. Ideally, predictors included in a MULTIPLE REGRESSION
EQUATION will have low correlations with each other and high correlations with the criterion.
Nonparametric and Parametric Tests
1. NONPARAMETRIC TESTS: The inferential statistical tests used when the data to be analyzed represent an ordinal or nominal scale or when the assumptions for a parametric test are not met. Because these tests do not make the same assumptions about the
population distribution(s) as the parametric tests, they are also known as "distribution-free
tests."
2. PARAMETRIC TESTS: The inferential statistical tests used when the data to be analyzed represent an interval or ratio scale and when certain assumptions about the population distribution(s) are met. These assumptions are: scores on the variable of interest are normally distributed and there is homoscedasticity (population variances are equal). These tests are more "powerful" than nonparametric tests.
Nonparametric Test (Single-Sample/Multiple Sample Chi-Square)
1. SINGLE-SAMPLE CHI-SQUARE: Used when a study includes one variable that is
measured on a nominal scale and the data to be analyzed are reported in terms of
frequencies in each category. The single-sample chi-square test compares observed frequencies to expected frequencies to determine if the two distributions of frequencies differ.
2. MULTIPLE-SAMPLE CHI-SQUARE: Used when a study includes two or more variables that are measured on a nominal scale and the data to be analyzed are reported in terms of frequencies in each category. The multiple-sample chi-square test compares observed
frequencies to expected frequencies to determine if the two distributions of frequencies differ.
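The comparison of observed and expected frequencies can be sketched as follows. This is a minimal illustration, not a full test: it computes only the chi-square statistic (no p-value), and the function names and frequency counts are hypothetical. For the multiple-sample case, it assumes expected frequencies are derived from the row and column totals.

```python
def chi_square_single(observed, expected):
    """Chi-square statistic for one nominal variable."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_table(table):
    """Chi-square statistic for a two-variable frequency table.

    Expected frequency for each cell = row total * column total / grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# One variable, three categories, equal expected frequencies.
print(chi_square_single([30, 20, 10], [20, 20, 20]))  # 10.0

# Two variables with two categories each.
print(round(chi_square_table([[10, 20], [20, 10]]), 3))  # 6.667
```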
Nonparametric Tests for Ordinal Data (Mann-Whitney U/Wilcoxon Matched-Pairs Signed-Rank/Kruskal-Wallis)
1. MANN-WHITNEY U: Used when a study includes two independent groups and data on the dependent variable are reported in terms of ranks (ordinal data). The Mann-Whitney U test assesses whether or not the ranks of observations in one group are equivalent to the ranks of observations in the other group. Considered the nonparametric alternative to the t-test for independent samples.
2. WILCOXON MATCHED-PAIRS SIGNED-RANK: Used when a study includes two
correlated groups and the differences between the dependent variable scores of matched pairs are converted into ranks (ordinal data). The Wilcoxon test assesses whether or not there is a difference between the sum of the ranks of the positive differences between scores and the sum of the ranks of the negative differences between scores and determines if there is a significant difference between the two correlated groups. Considered the nonparametric alternative to the t-test for correlated samples.
3. KRUSKAL-WALLIS: Similar to the Mann-Whitney U test, but can be used when a study
includes two or more independent groups. The Kruskal-Wallis test is used to assess the hypothesis that the ranks of observations in the different groups are equivalent.
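To show how ranks enter the Mann-Whitney U test, here is a minimal sketch that pools two groups, ranks the pooled scores (tied scores share the average rank), and computes U from the rank sum. The helper names and the example scores are hypothetical, and the sketch stops at the U statistic without looking up its significance.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(group_a, group_b):
    """Return the smaller of the two U values for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)

print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # 0.0 (no overlap in ranks)
print(mann_whitney_u([1, 4, 5], [2, 3, 6]))  # 4.0 (ranks are interleaved)
```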
Normal Curve (Areas under the normal curve/Kurtosis/Leptokurtic Distribution/Platykurtic Distribution/Skew--Negative and Positive)
A "normal curve" (distribution) is a symmetrical bell-shaped distribution.
1. AREAS UNDER THE NORMAL CURVE: When scores on a variable are normally distributed, a specific proportion of observations falls within certain areas of that distribution that are defined by the standard deviation: About 68% of observations fall between the scores that are plus and minus one SD from the mean; about 95% between the scores that are plus and minus two SDs from the mean; and about 99.7% (often rounded to 99%) between the scores that are plus and minus three SDs from the mean.
2. KURTOSIS: The relative height or flatness of a distribution. (a) LEPTOKURTIC
DISTRIBUTION: More peaked than a normal distribution. The scores are piled up in the central region of the distribution, with relatively few scores in the tails. (b) PLATYKURTIC DISTRIBUTION: Flatter than a normal distribution because scores are more evenly distributed throughout the range of possible scores.
3. SKEW: Describes asymmetrical distributions. (a) NEGATIVE SKEW: Most scores are piled up in the positive (high score) end of the distribution, but a few scores are located in
the distribution's negative tail. (b) POSITIVE SKEW: Scores are piled up in the negative (low score) side of the distribution, but a few scores are located in the positive tail.
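The areas under the normal curve can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `area_within` is hypothetical.

```python
from statistics import NormalDist

def area_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    standard_normal = NormalDist()  # mean 0, SD 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```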
ONE-TAILED TEST & TWO-TAILED TEST
1. ONE-TAILED TEST: Inferential statistical test that places the entire rejection region in only one tail of the sampling distribution. A one-tailed test is used when the alternative hypothesis is directional.
2. TWO-TAILED TEST: Inferential statistical test that divides the region of unlikely values (rejection region) equally between the sampling distribution's two tails. A two-tailed test is used when the alternative hypothesis is nondirectional.
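The difference between the two rejection regions shows up in the critical values. The following sketch assumes a z-test (a normal sampling distribution) and an alpha of .05, neither of which is specified on the card; `critical_z` is a hypothetical helper name.

```python
from statistics import NormalDist

def critical_z(alpha, tails):
    """Critical z value for a one-tailed (tails=1) or two-tailed (tails=2) test."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # A two-tailed test splits alpha equally between the two tails.
    return NormalDist().inv_cdf(1 - alpha / tails)

print(round(critical_z(0.05, 1), 3))  # 1.645
print(round(critical_z(0.05, 2), 3))  # 1.96
```

Because the one-tailed test puts all of alpha in one tail, its critical value is smaller, so a result in the predicted direction reaches the rejection region more easily.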
Parameters and Statistics
1. PARAMETERS: Measurements derived from populations; a.k.a. population values.
2. STATISTICS: Measurements derived from samples; a.k.a. sample values. When
conducting a research study, the investigator does not have access to the entire
population of interest; instead, he/she estimates population values based on obtained sample values. In other words, he/she uses a "sample statistic" to estimate a "population parameter."
Parametric Tests (ANOVAs--ONE-WAY ANOVA/MANOVA/FACTORIAL ANOVA/ANCOVA/RANDOMIZED BLOCK ANOVA/STUDENT'S T-TESTS--SINGLE/CORRELATED SAMPLES/INDEPENDENT SAMPLES)
1. THE ANOVAs: (a) ONE-WAY ANOVA: Used when the study has one independent variable and one dependent variable measured on an interval or ratio scale. The one-way ANOVA compares the means of two or more groups and yields an F-ratio. When the treatment has had an effect, the F-ratio is greater than 1.0. (b) FACTORIAL ANOVA: Used when the study has two or more independent variables. (c) MANOVA: Used when the study has one or more independent variables and two or more dependent variables, measured on an interval or ratio scale. It lowers experimentwise error by analyzing the effects of the independent variable(s) simultaneously on all dependent variables. (d) ANCOVA: Dependent variable scores are adjusted on the basis of scores on an extraneous variable in order to control variability in the dependent variable that is due to that extraneous variable. (e) RANDOMIZED BLOCK FACTORIAL ANOVA: Used when blocking has been used to control an extraneous variable. The main and interaction effects of the extraneous variable are statistically analyzed.
2. STUDENT'S t-TESTS: (a) SINGLE SAMPLE: Compares one sample mean to a known or hypothesized population mean. (b) CORRELATED SAMPLES: Compares two sample means when the subjects in two groups are related; e.g., they are matched on an extraneous variable. (c) INDEPENDENT SAMPLES: Compares two sample means when the subjects in two groups are independent (unrelated).
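As an illustration of the F-ratio produced by a one-way ANOVA, the sketch below divides the between-groups mean square by the within-groups mean square. The function name and the three small groups of scores are hypothetical.

```python
def one_way_f(groups):
    """F-ratio for a one-way ANOVA: MS between groups / MS within groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variability of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variability of the scores around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

scores = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(round(one_way_f(scores), 3))  # 13.0
```

Here the group means (2, 3 and 6) differ much more than the scores within each group do, so the F-ratio is well above 1.0.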
Placebo Control and Experimental Groups
1. PLACEBO CONTROL GROUP: A control (comparison) group that is exposed to the inert
or nonspecific aspects of the treatment; e.g., an attention-only control group.
2. EXPERIMENTAL GROUP: A group that is exposed to the independent variables under
investigation.
Post-Hoc Tests (Scheffe's/Tukey/Fisher's LSD Tests)
Statistical tests used to make pairwise and/or complex comparisons of group means.
Post-hoc tests are applied when the ANOVA yields statistically significant results and an investigator wants to determine specifically which group means differ. For example:
1. SCHEFFE'S TEST: Permits all possible pairwise and complex comparisons between
group means when the groups are of equal size. Scheffe's test is conservative with regard to Type I errors.
2. TUKEY TEST: Used to make pairwise comparisons between group means when the samples are of equal size. The Tukey test is conservative with regard to Type I errors.
3. FISHER'S LSD TEST: Used to compare group means. A problem with the Fisher's test is that it does not control the experimentwise error rate.
Random Error
"Random" error is error that is unpredictable or unsystematic in terms of its effects; i.e., that has different effects on different people.
Random Sampling and Randomization (Random Assignment)
1. RANDOM SAMPLING: The selection of a sample (or samples) from the population in such a way that each member of the population has an equal chance of being included in the sample.
2. RANDOMIZATION (RANDOM ASSIGNMENT): A term used (by most authors) to describe the random assignment of subjects to different treatment groups. Randomization is considered the "hallmark" of true (vs. quasi-) experimental research.
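The two steps are distinct and can be sketched separately: first a sample is drawn from the population, then the sampled subjects are assigned to groups. The function names, the numbered "population" of 100 IDs, and the seeds are hypothetical; the seeds only make the illustration repeatable.

```python
import random

def random_sample(population, n, seed=None):
    """Draw n members without replacement; each has an equal chance."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def random_assignment(subjects, n_groups, seed=None):
    """Shuffle the subjects, then deal them into n_groups treatment groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

population = list(range(100))          # hypothetical subject IDs
sample = random_sample(population, 20, seed=1)
treatment, control = random_assignment(sample, 2, seed=2)
print(len(treatment), len(control))    # 10 10
```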
Rejection and Retention Region
An inferential statistical test yields a value that indicates where the obtained sample statistic falls in the sampling distribution for the parameter predicted by the null hypothesis. The sampling distribution is divided into two regions:
1. REJECTION REGION: The region of a sampling distribution that contains those sample values (e.g., means) that are unlikely to be obtained simply as the result of sampling error. When an inferential statistical test indicates that an obtained sample value falls in the rejection region, the null hypothesis is rejected and the alternative hypothesis is retained. The size of the rejection region is defined by alpha.
2. RETENTION REGION: The region of a sampling distribution that contains those sample values (e.g., means) that are likely to be obtained simply as the result of sampling error. When an inferential statistical test indicates that an obtained sample value falls in the
retention region, the null hypothesis is retained and the alternative hypothesis is rejected. The retention region is equal to one minus alpha.
Reversal (Withdrawal) Design
A type of single-subject design that includes, at a minimum, two baseline phases and one treatment phase (e.g., an ABA or ABAB design). The treatment is withdrawn ("reversed") during the second and subsequent baseline phases. An advantage of the reversal designs over the simple AB design is that they provide additional control over the threats to a
study's internal validity; i.e., when a reversal design is used, if status on the dependent variable returns to the initial baseline (no treatment) level during the second baseline phase, an investigator can be more certain that an observed change in the dependent variable during the treatment phase was actually due to the independent variable, rather than to an historical event or other extraneous factor.
SAMPLING ERROR, SAMPLING DISTRIBUTION OF MEANS, STANDARD ERROR OF MEAN
1. SAMPLING ERROR: A type of random error due to uncontrolled factors. Sampling error is responsible for the fluctuations found between sample values and the corresponding value for the population from which the samples were randomly drawn.
2. SAMPLING DISTRIBUTION OF MEANS: The theoretical distribution of sample means
that would be obtained if an infinite number of equal-size samples were randomly drawn from the population and the mean for each sample were calculated. It is normally-shaped, its mean is equal to the population mean and its standard deviation (STANDARD ERROR OF THE MEAN) is equal to the population standard deviation divided by the square root of the sample size. The sampling distribution of means is used in inferential statistics to determine how likely an obtained sample value is, given the population mean and
standard deviation, the sample size and the level of significance. The CENTRAL LIMIT THEOREM predicts that the sampling distribution of means will approach a normal shape as the sample size increases, regardless of the shape of the population distribution of scores.
3. STANDARD ERROR OF THE MEAN: The standard deviation of the sampling distribution of means, which provides an estimate of SAMPLING ERROR. The standard
error of the mean is calculated by dividing the population standard deviation by the square root of the sample size.
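The standard error formula can be compared against a small simulation. This is only a sketch: an infinite number of samples is replaced by 2,000 samples, and the population is assumed to be uniform on 0 to 1 (SD = sqrt(1/12)), which is deliberately non-normal. The function names and seed are hypothetical.

```python
import math
import random
import statistics

def standard_error(population_sd, n):
    """Standard error of the mean: population SD / square root of sample size."""
    return population_sd / math.sqrt(n)

def simulate_sd_of_means(n, n_samples=2000, seed=0):
    """SD of the means of many equal-size samples from a uniform(0, 1) population."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.random() for _ in range(n)])
        for _ in range(n_samples)
    ]
    return statistics.pstdev(means)

print(standard_error(10, 25))  # 2.0
theoretical = standard_error(math.sqrt(1 / 12), 25)
simulated = simulate_sd_of_means(25)
print(round(theoretical, 3), round(simulated, 3))  # the two should be close
```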
Scales of Measurement (Nominal/Ordinal/Interval/ Ratio Scales)
Each "scale of measurement" involves dividing a set of observations into mutually
exclusive and exhaustive categories, but each provides a different kind of information and allows different mathematical operations to be performed:
1. NOMINAL SCALE: The variable is divided into unordered groups or categories. The data to be described or analyzed are frequency data; e.g., the frequency of observations in each category.
2. ORDINAL SCALE: The variable is divided into ordered categories, scores or levels. An ordinal scale is less mathematically complex than the interval or ratio scales because it does not have equal intervals between scale values or (in most cases) an absolute zero point.
3. INTERVAL SCALE: There are equal intervals between successive points on the scale. Because the intervals are equal, it is possible to perform addition and subtraction.
4. RATIO SCALE: There are equal intervals between successive points and there is an absolute zero point (which indicates an absence of the characteristic being measured). Because there is an absolute zero point, it is possible to multiply and divide scores and to more precisely determine how much more or less of a characteristic a person has than another person.
SHARED VARIABILITY & COEFFICIENT OF DETERMINATION
1. SHARED VARIABILITY: A method of interpreting a correlation coefficient when the correlation is between two different variables. A measure of shared variability is obtained
by squaring the correlation coefficient. The squared correlation (COEFFICIENT OF
DETERMINATION) indicates how much variability is shared by the two variables; i.e., how much variability in one variable is explained by variability in the other variable.
2. COEFFICIENT OF DETERMINATION: An interpretation of the correlation coefficient
that indicates the sources of variability. When this interpretation is used in conjunction with a validity coefficient, the correlation coefficient is squared to obtain a coefficient of determination. When this interpretation is used in conjunction with a reliability coefficient,
the correlation coefficient is not squared to obtain a coefficient of determination.
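For a correlation between two different variables, shared variability is obtained by squaring r. The sketch below computes a Pearson r by hand and squares it; the function name and the five pairs of scores are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

predictor = [1, 2, 3, 4, 5]
criterion = [2, 4, 5, 4, 5]
r = pearson_r(predictor, criterion)
print(round(r, 3))       # 0.775
print(round(r ** 2, 3))  # 0.6
```

In this made-up example, an r of about .77 means that about 60% of the variability in one variable is shared with the other.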
SIMPLE RANDOM SAMPLING, STRATIFIED RANDOM SAMPLING, CLUSTER SAMPLING
1. SIMPLE RANDOM SAMPLING: Every member of the population has an equal chance of being included in the sample and the selection of one member of the population has no effect on the selection of another member. Simple random sampling reduces the
probability that the sample will be biased, especially when the sample is large.
2. STRATIFIED RANDOM SAMPLING: The population as a whole is divided into distinct parts (strata) and each part is drawn from randomly.
3. CLUSTER SAMPLING: Units (clusters) of individuals are selected and then individuals are selected from those units. Cluster sampling is useful when it is not possible to identify
or obtain access to the entire population of interest.
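Stratified random sampling can be sketched as a simple random draw repeated within each stratum. The sketch assumes proportional allocation (the same fraction drawn from every stratum), which the card does not specify; the function name, strata labels and IDs are hypothetical.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Randomly draw the same fraction of members from each stratum.

    strata: dict mapping a stratum name to a list of its members.
    """
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample[name] = rng.sample(members, k)
    return sample

strata = {
    "urban": list(range(0, 60)),     # hypothetical IDs for 60 urban residents
    "rural": list(range(100, 140)),  # hypothetical IDs for 40 rural residents
}
drawn = stratified_sample(strata, 0.1, seed=3)
print({name: len(members) for name, members in drawn.items()})
# {'urban': 6, 'rural': 4}
```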
SINGLE SUBJECT RESEARCH DESIGNS (AB/ABAB/MULTIPLE BASELINE DESIGNS)
In "single subject research designs," the effects of an independent variable on a
dependent variable are assessed using data from only one or a few subjects. Single subject designs include at least one baseline (no treatment) phase and one treatment
phase; therefore, subjects act as their own no-treatment controls. The dependent variable is measured repeatedly at regular intervals throughout the baseline and treatment phases.
1. AB DESIGN: Includes a single baseline (A) phase and a single treatment (B) phase. An AB design does not adequately control history, which can threaten a study's internal validity.
2. ABAB DESIGN: A reversal design in which baseline phases (A) are alternated with treatment phases (B). The ABAB design provides additional control over potential threats to internal validity.
3. MULTIPLE BASELINE DESIGN: Involves sequentially applying a treatment to different baselines (e.g., different behaviors, settings or subjects). A multiple baseline design is
useful when a reversal design would be impractical or unethical.
SOLOMON FOUR GROUP DESIGN
An experimental design used to control (measure) the effects of pretesting on a study's internal and external validity. When using this design, the pretest is treated as an additional independent variable so that its effects on the dependent variable can be statistically analyzed.
STATISTICAL HYPOTHESES (NULL HYPOTHESIS & ALTERNATIVE HYPOTHESIS)
An investigator tests a verbal research hypothesis about the relationship between
variables by simultaneously testing two competing "statistical hypotheses":
1. NULL HYPOTHESIS: States that there is no relationship between the independent and dependent variables, and implies that any observed relationship is simply the result of sampling error.
2. ALTERNATIVE HYPOTHESIS: States that there is a relationship between the
independent and dependent variables. The alternative hypothesis can be directional or nondirectional.
STATISTICAL REGRESSION (REGRESSION TO THE MEAN)
The tendency for extreme scores to be closer to the mean when the measure is
re-administered to the same examinees. Regression to the mean was originally described
by Galton, who found that, in nature, there is a tendency for the offspring of people who have extreme characteristics to be closer to the population mean (e.g., for very tall parents to have children who are a bit shorter).
THREATS TO THE EXTERNAL VALIDITY OF A RESEARCH STUDY
1. PRETEST SENSITIZATION: A pretest can sensitize subjects to the study's purpose, which can alter their reaction to the independent variable. The Solomon four-group design can control this threat.
2. SELECTION-TREATMENT INTERACTION: Subjects may have traits that make them
respond to the independent variable in a particular way (e.g., they are volunteers).
3. REACTIVITY: Subjects may respond to the independent variable in a particular way
because they know they are being observed. Unobtrusive measures, deception or a
single- or double-blind technique can control this threat. Examples of reactivity are: (a) DEMAND CHARACTERISTICS: Cues in the research setting that inform subjects of how they are expected to behave. (b) EXPERIMENTER EXPECTANCY: Bias that an experimenter introduces into a study, due to his/her expectations about the outcome. (c) EVALUATION APPREHENSION: Subjects' concerns about whether or not they are
performing correctly. (d) HAWTHORNE EFFECT
4. PRACTICE, ORDER OR CARRYOVER EFFECTS: When the same subjects are exposed
to multiple levels of an independent variable (i.e., to multiple treatments), the effects of one level can be affected by prior exposure to another level.
THREATS TO THE INTERNAL VALIDITY OF A RESEARCH STUDY
1. MATURATION: A physical or psychological process or event that occurs as the result of the passage of time (e.g., fatigue) and has a systematic effect on the subjects' status/performance on the dependent variable.
2. HISTORY: An event either outside or within the experimental situation that is not relevant to the research hypothesis and has a systematic effect on the subjects' status/performance on the dependent variable.
3. TESTING: Prior exposure to a pretest alters subjects' performance on the posttest.
4. INSTRUMENTATION: Changes in the accuracy or sensitivity of measuring devices or procedures during the course of the study.
5. STATISTICAL REGRESSION: A threat when subjects have been selected because of
their extreme status on the dependent variable.
6. SELECTION: Subjects are not randomly assigned to treatment groups; as a result, there are systematic differences between groups at the beginning of the study.
7. ATTRITION/MORTALITY: Subjects who drop out of one group differ in a relevant way from subjects who drop out of another group.
Transformed Score
A descriptive statistic for individual observations. Interpreting a single score can be difficult unless the score is somehow referenced to the distribution from which it was drawn. Converting a raw score to a transformed score makes it easier to interpret the single score by referencing it to the distribution of scores. PERCENTILE RANKS and
Z-SCORES are two examples of transformed scores.
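Both transformations can be sketched in a few lines. Definitions of percentile rank vary; this sketch assumes the percentage of scores falling below the given score, and it uses the population SD of the distribution for the z-score. The function names and the eight scores are hypothetical.

```python
import statistics

def z_score(x, scores):
    """Distance of x from the mean of the distribution, in SD units."""
    return (x - statistics.fmean(scores)) / statistics.pstdev(scores)

def percentile_rank(x, scores):
    """Percentage of scores in the distribution that fall below x."""
    below = sum(1 for s in scores if s < x)
    return 100 * below / len(scores)

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # mean = 5, SD = 2
print(z_score(9, scores))           # 2.0
print(percentile_rank(7, scores))   # 75.0
```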
TRUE EXPERIMENTAL RESEARCH & QUASI-EXPERIMENTAL RESEARCH
1. TRUE EXPERIMENTAL RESEARCH: Experimental research that provides the
investigator with maximal experimental control. Most important, when conducting a true experimental research study, an investigator can randomly assign subjects to groups,
which makes it easier to determine if observed variability in the dependent variable was actually caused by the different levels of the independent variable.
2. QUASI-EXPERIMENTAL RESEARCH: Experimental research in which an investigator's experimental control is limited, especially his/her ability to assign subjects to groups. This is the case when intact groups must be used, the variable of interest is organismic, or the study includes only one group (which will be compared to itself). Quasi-experimental research does not allow an investigator to conclude that an observed relationship between variables is a causal one. Ex-post facto research and developmental research are usually classified as quasi-experimental.
TYPE I ERROR, TYPE II ERROR & POWER
1. TYPE I ERROR: Decision error that occurs when a true null hypothesis is rejected. The probability of making a Type I error is equal to alpha.
2. TYPE II ERROR: Decision error that occurs when a false null hypothesis is retained. The probability of making a Type II error is equal to beta (which is usually unknown).
3. POWER: An investigator can make a correct decision by either retaining the null
hypothesis when it is true or rejecting the null hypothesis when it is false. An investigator usually wants to make the second type of correct decision and, when a statistical test enables him/her to make this decision, it is said to have "power." Therefore, power is the probability of rejecting a null hypothesis when it is false. Power cannot be directly controlled, but it can be increased by including a large sample, maximizing the effects of the independent variable, increasing the size of alpha and reducing the effects of error.
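How sample size, effect size and alpha influence power can be sketched for one simple case. The example assumes a one-tailed, single-sample z-test with the effect expressed in SD units; that choice, the function name and the input values are illustrative assumptions, not part of the card.

```python
import math
from statistics import NormalDist

def power_one_tailed_z(effect_size, n, alpha=0.05):
    """Probability of rejecting a false null hypothesis (one-tailed z-test).

    effect_size: true mean difference divided by the population SD.
    """
    standard_normal = NormalDist()
    z_critical = standard_normal.inv_cdf(1 - alpha)
    # Under the alternative, the test statistic is shifted by effect * sqrt(n).
    return 1 - standard_normal.cdf(z_critical - effect_size * math.sqrt(n))

base = power_one_tailed_z(0.5, 25)                       # about 0.80
larger_sample = power_one_tailed_z(0.5, 50)
larger_effect = power_one_tailed_z(0.8, 25)
larger_alpha = power_one_tailed_z(0.5, 25, alpha=0.10)
print(round(base, 2), larger_sample > base, larger_effect > base, larger_alpha > base)
```

When the effect size is zero (the null hypothesis is true), the same function returns alpha, which is the probability of a Type I error.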
MULTITRAIT-MULTIMETHOD MATRIX
MULTITRAIT-MULTIMETHOD MATRIX: A systematic way to organize the correlation coefficients obtained when assessing a test's convergent and discriminant validity. The matrix contains four types of correlation coefficients: (a) MONOTRAIT-MONOMETHOD:
The reliability coefficient (the correlation between a test and itself). The monotrait-monomethod coefficient is not directly relevant to a test's convergent and discriminant validity, but it should be large for the matrix to provide useful information. (b) MONOTRAIT-HETEROMETHOD: The correlation between two different measures that measure the same trait. The monotrait-heteromethod coefficient indicates convergent validity when it is large. (c) HETEROTRAIT-MONOMETHOD: The correlation between two similar measures that each measure a different trait. The heterotrait-monomethod
coefficient indicates discriminant validity when it is low. (d)
HETEROTRAIT-HETEROMETHOD: The correlation between two different measures of two different traits. The heterotrait-heteromethod coefficient indicates discriminant validity when it is low.