52 Cards in this Set
- Front
- Back
Artefact |
An artificial pattern caused by deficiencies in the data collection process |
|
Association (alternative name: relationship) |
A pattern that connects two (or more) variables. This pattern would be unlikely to be generated purely by chance. Conversely, there is no relationship when learning the value of one variable would tell you nothing new about the likely value of the other |
|
Bar Chart (alternative names: bar graph, bar plot, column chart) |
A graph used for categorical variables to display the percentages or frequencies falling into each category |
|
Bimodal |
When two peaks are evident in a graph of the distribution of a numeric variable |
|
Box Plot (alternative name: box and whisker plot) |
A graph for displaying the distribution of a numeric variable. It splits the data into quartiles with the box part going from the lower (1st) quartile to the upper (3rd) quartile. A line is drawn at the median |
|
Categorical Variable (alternative names: qualitative variable, factor, class variable) |
A variable whose values are names or codes for different groups (or categories) |
|
Centre (alternative names: average, location) |
The idea of where the "middle" of the set of observations is |
|
Cluster |
A distinct grouping of values that is separated from other groupings of values |
|
Dot Plot |
A graph for displaying the distribution of a numeric variable. Each dot represents a single observation from a set of data. The form we use is a special case, a stacked dot plot |
|
Entities (alternative names: individuals, units, cases) |
The individual "things" we are recording data about |
|
Estimate |
A number calculated from the data; used to estimate an unknown parameter value |
|
False Negative |
The individual has the condition but tests negative for the condition |
|
False Positive |
The individual does not have the condition but tests positive for the condition |
|
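The two error types above can be told apart by comparing the true condition with the test result. The following is a minimal Python sketch; the function name `classify` and the records are made up for illustration.

```python
def classify(has_condition: bool, tests_positive: bool) -> str:
    """Label one test result by comparing it with the individual's true condition."""
    if has_condition and tests_positive:
        return "true positive"
    if has_condition and not tests_positive:
        return "false negative"   # has the condition but tests negative
    if not has_condition and tests_positive:
        return "false positive"   # does not have the condition but tests positive
    return "true negative"


# Hypothetical (has_condition, tests_positive) records, invented for this example.
records = [(True, True), (True, False), (False, True), (False, False), (False, False)]
outcomes = [classify(condition, result) for condition, result in records]
print(outcomes)
```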
Frequency (alternative names: count, tally) |
The number of times a value of a variable, or a category, occurs |
|
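As a small illustration of frequency, the sketch below tallies how often each category occurs in a made-up list and converts one count to a relative frequency. The variable names and data are assumptions for the example only.

```python
from collections import Counter

# Made-up categorical data: one eye colour per entity.
eye_colours = ["brown", "blue", "brown", "green", "brown", "blue"]

# Frequency: the number of times each category occurs.
frequencies = Counter(eye_colours)

# Relative frequency of one category: its count divided by the total count.
relative_brown = frequencies["brown"] / len(eye_colours)

print(frequencies)
print(relative_brown)
```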
Histogram |
A graph made up of vertical rectangles that displays the distribution of a numeric variable. The range of the data is divided into class intervals which form the bases of each rectangle. The height of each rectangle is set so that the area of the rectangle represents the relative frequency with which values fall into that class interval |
|
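The statement that area, not height, represents relative frequency matters when class intervals have unequal widths. Here is a rough Python sketch, assuming made-up data and interval edges, with a hypothetical helper `histogram_heights` that sets each height to relative frequency divided by interval width.

```python
def histogram_heights(values, edges):
    """Return one height per class interval so that height * width = relative frequency."""
    n = len(values)
    last = len(edges) - 2
    heights = []
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        # Intervals are [lo, hi), except the last one, which also includes its upper edge.
        count = sum(1 for v in values if lo <= v < hi or (i == last and v == hi))
        heights.append(count / n / (hi - lo))
    return heights


values = [1, 2, 2, 3, 5, 6, 7, 9]   # made-up data
edges = [0, 4, 6, 10]               # unequal class intervals, chosen for illustration
heights = histogram_heights(values, edges)

# The area of each rectangle recovers the relative frequency of its interval.
areas = [h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights)]
print(heights)
print(areas)
```

With this scaling the rectangle areas add up to 1, whatever interval widths are chosen.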
Interquartile Range (IQR) |
A measure of spread for a distribution of a numeric variable. It gives "the length of the middle half of the data". Calculated as the difference between the upper (3rd) and lower (1st) quartiles |
|
Mean |
A measure of the centre for a distribution of a numeric variable. The total of all values divided by the total number of values |
|
Median |
A measure of the centre for a distribution of a numeric variable. The "middle value". It splits the data in half with half the observations at or above and half at or below |
|
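The mean and median defined above can be computed directly. The sketch below uses Python's standard `statistics` module on made-up values; note how the single large value pulls the mean above the median.

```python
from statistics import mean, median

values = [2, 3, 3, 5, 12]   # made-up data

# Mean: the total of all values divided by the number of values.
mean_by_hand = sum(values) / len(values)
mean_value = mean(values)

# Median: the middle value of the sorted data.
median_value = median(values)

# With an even number of values, the median is the average of the two middle values.
even_median = median([1, 2, 3, 4])

print(mean_value, median_value, even_median)
```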
Missing Value |
No information has been recorded in this cell, i.e. for this variable for this entity |
|
Modality |
The number of peaks (most frequent values) in a distribution (unimodal - 1 peak; bimodal - 2 peaks; multi-modal - many peaks) |
|
Nominal Variable |
A categorical variable in which the categories have no natural order |
|
Numeric Variable (alternative name: quantitative variable) |
A variable for which all of the values are numbers (e.g. from counting or measuring) |
|
Oddities |
Anything in the data that looks strange or odd. Things that make us wonder, "Is that a mistake?" |
|
Ordinal Variable |
A categorical variable in which the categories have a natural order |
|
Outlier(s) |
Value(s) that lie so far away from the bulk of the data that they look odd and make us wonder, "Is that a mistake?" |
|
Overlap |
A visual notion. The degree to which plots extend over common values |
|
Overprinting |
A problem with scatter plots when points sit on top of one another so that we are unable to tell how many points are sitting at a given position. This can lead to very misleading impressions of what the data is saying |
|
Pie Chart |
A graph for displaying the relative frequencies of a categorical variable. A circle is divided into sectors according to the relative frequency of each category |
|
Proportion |
A proportion refers to the fraction of the total that possesses a certain attribute |
|
Quartiles |
Comes from separating a numeric distribution into four groups, each containing equal numbers of values. The lower (1st) quartile is the middle of the lower half of the data and the upper (3rd) quartile is the middle of the upper half of the data |
|
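The "middle of each half" description of quartiles, together with the interquartile range defined earlier, can be sketched in Python as follows. This assumes one common convention: when the number of values is odd, the overall median is left out of both halves. Other conventions give slightly different quartiles. The function name `quartiles` and the data are made up, and at least two values are needed.

```python
from statistics import median


def quartiles(values):
    """Return (lower quartile, median, upper quartile) using the median-of-halves rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]                  # values below the middle
    upper_half = ordered[len(ordered) - half:]   # values above the middle
    return median(lower_half), median(ordered), median(upper_half)


values = [1, 3, 4, 6, 7, 8, 10, 12, 15]   # made-up data
q1, q2, q3 = quartiles(values)
iqr = q3 - q1   # interquartile range: upper quartile minus lower quartile
print(q1, q2, q3, iqr)
```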
Range |
A measure of spread for a distribution of a numeric variable, calculated by: largest value - smallest value |
|
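For completeness, the range formula above is a one-line calculation. The data in this short Python sketch are made up.

```python
values = [4, 8, 15, 16, 23, 42]   # made-up data

# Range: largest value minus smallest value.
value_range = max(values) - min(values)
print(value_range)
```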
Rectangular Data |
Data organised and stored in such a way that each row corresponds to an individual entity and each column corresponds to a property recorded for these entities |
|
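One way to picture rectangular data is as a list of rows that all share the same columns. The Python sketch below uses invented entities and variable names purely for illustration.

```python
# Each row (dictionary) is one entity; each key is a variable recorded for every entity.
rows = [
    {"name": "Ana", "age": 21, "eye_colour": "brown"},
    {"name": "Ben", "age": 34, "eye_colour": "blue"},
    {"name": "Cai", "age": 28, "eye_colour": "brown"},
]

# The columns are the variables shared by all rows.
variables = list(rows[0].keys())

# Reading down one column gives every entity's value for that variable.
ages = [row["age"] for row in rows]

print(variables)
print(ages)
```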
Risk |
A way of expressing the chance that something will happen. Risk is the same as probability, but it is usually used to describe the probability of an adverse event |
|
Risk: Absolute Risk |
The probability or chance a person in a population will have a specified (medical) event. Usually expressed as a percentage |
|
Risk: Relative Risk |
A comparison of the risk of a particular event for two different groups of people |
|
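Relative risk is commonly reported as the ratio of two absolute risks; the card above only says "a comparison", so the ratio form is an assumption here. The Python sketch below uses made-up counts and hypothetical function names.

```python
def absolute_risk(events: int, group_size: int) -> float:
    """Proportion of a group that experienced the event."""
    return events / group_size


def relative_risk(events_a: int, size_a: int, events_b: int, size_b: int) -> float:
    """Risk in group A divided by risk in group B (assumed ratio form)."""
    return absolute_risk(events_a, size_a) / absolute_risk(events_b, size_b)


# Made-up counts: 20 events among 200 people in group A, 10 among 200 in group B.
risk_a = absolute_risk(20, 200)
risk_b = absolute_risk(10, 200)
rr = relative_risk(20, 200, 10, 200)

print(f"Absolute risk A: {risk_a:.0%}, B: {risk_b:.0%}, relative risk: {rr:.1f}")
```

A relative risk of about 2 here means the event is roughly twice as likely in group A as in group B, even though both absolute risks are small.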
Scatter |
In a scatter plot, the extent to which the values of the response variable deviate from the trend |
|
Scatter Plot (alternative name: scatter graph) |
A graph for displaying a pair of numeric variables in which points are plotted on a pair of axes to represent each entity. The coordinates of each point are the values of the two variables for that entity |
|
Shape |
Used to talk about the outline (or profile) of a plot of the distribution of a numeric variable |
|
Side-By-Side Bar Chart |
A bar chart for investigating the relationship between two categorical variables where, for each response (outcome) category in turn, we put all of the bars for the explanatory (predictor) categories together "side-by-side" |
|
Skewed |
The lack of symmetry in a distribution of a numeric variable. Positively skewed is when the data are piled up on the left and the tail extends out to the right. Negatively skewed is when the data are piled up on the right and the tail extends out to the left |
|
Spread (alternative names: variability, variation) |
The idea of the degree to which values of a numeric variable differ from one another (vary) or, visually, are spread out along the axis |
|
Stacked Bar Chart (alternative name: segmented bar chart) |
A graph for displaying the relationship between two categorical variables. Constructed by taking a bar graph for one categorical variable and subdividing each bar according to the percentages of the second categorical variable |
|
Standard Deviation |
Approximately measures the average of the differences (distances) between the observations and the mean |
|
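To see why the standard deviation is only "approximately" the average distance from the mean, the Python sketch below compares it with the plain average of absolute distances on made-up data. It assumes the population form (`statistics.pstdev`, dividing by n); the sample form (`statistics.stdev`, dividing by n - 1) gives a slightly larger value.

```python
from statistics import mean, pstdev

values = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data
centre = mean(values)

# Plain average of the distances between each observation and the mean.
mean_abs_distance = mean([abs(v - centre) for v in values])

# Standard deviation (population form): square root of the average squared distance.
sd = pstdev(values)

print(centre, mean_abs_distance, sd)
```

The two numbers are of similar size but not equal, because squaring gives more weight to observations far from the mean.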
Stem-and-Leaf Plot |
A graph used to display numeric data. It is similar to a histogram but retains most or all of the numerical information |
|
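A text version of such a plot can be built in a few lines. This Python sketch assumes non-negative whole numbers, with the tens as the stem and the units as the leaf; the function name `stem_and_leaf` and the data are made up.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens digit and above) to its sorted leaves (units digits)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)


values = [12, 15, 21, 23, 23, 38, 40]   # made-up data
plot = stem_and_leaf(values)

# Print one row per stem, including empty stems, so the shape resembles a histogram.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Each original value can be read back from its stem and leaf, which is the sense in which the plot retains the numerical information.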
Subset |
Used in this course in its everyday, nontechnical sense - a collection of things that is part of a larger collection |
|
Subsetting |
Dividing the entities in our data set into different groups (subsets) on the basis of their values for one or two subsetting variables. This allows us to make separate graphs of the same type for every subset and present them either as a matrix of tiled graphs, or by playing through them like a movie |
|
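Subsetting on a single variable amounts to grouping entities by their value for that variable. The Python sketch below uses a hypothetical helper `subset_by` and invented rows; each resulting group could then be graphed separately.

```python
from collections import defaultdict


def subset_by(rows, variable):
    """Group rows (entities) by their value for one subsetting variable."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[variable]].append(row)
    return dict(groups)


# Made-up rectangular data: one dictionary per entity.
rows = [
    {"region": "north", "height": 170},
    {"region": "south", "height": 165},
    {"region": "north", "height": 182},
]

subsets = subset_by(rows, "region")
for value, members in subsets.items():
    print(value, [m["height"] for m in members])
```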
Systematic Biases (alternative name: systematic error) |
Consistent biases caused by the way a system or process functions |
|
Tile Density Plot |
A tile density plot looks like a crude scatter plot. In a tile density plot, the scatter-plotting region is divided into a set of rectangular tiles. If there is no data in the area covered by a tile it is coloured white. If there is data in the area covered by the tile it is coloured with the depth of colour (darkness) determined by the number of data points in the area covered by the tile |
|
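The counting step behind a tile density plot can be sketched without any plotting library. In this Python example the function name `tile_counts`, the tile size, and the points are all assumptions made for illustration; a plotting tool would then map each count to a depth of colour.

```python
from collections import Counter


def tile_counts(points, tile_width, tile_height):
    """Count how many (x, y) points fall in each rectangular tile, keyed by (column, row)."""
    return Counter((int(x // tile_width), int(y // tile_height)) for x, y in points)


# Made-up points for a pair of numeric variables.
points = [(0.5, 0.5), (0.7, 0.2), (1.5, 0.5), (2.2, 2.9), (0.1, 0.9)]
counts = tile_counts(points, tile_width=1.0, tile_height=1.0)

# Tiles with no data have a count of 0 and would be coloured white.
print(counts[(0, 0)], counts[(1, 0)], counts[(1, 1)])
```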
Transparency |
A technique used in scatter plots to deal with overprinting in which we make the symbols semi-transparent. Where there is a lot of overprinting, the symbols will be darker and where there are single or few points overprinted, the symbols will be lighter |
|
Trend |
The overall pattern between the two numeric variables displayed in a scatter plot |
|
Variability |
The extent to which we get different values for different individuals (or in some contexts, different values at different times) |
|
Variable |
A property that we record for each entity, e.g. a measurement, or one of a set of group labels (indicating categories) |