Biostatistics and Epidemiology

Basic Statistical Concepts

  • Statistics is a field of study concerned with the collection, organization, summarization, and analysis of data. When the data analyzed are derived from the biological sciences and medicine, we use the term biostatistics [1].

  • Data may come from many sources: medical records, external sources, surveys, experiments.

  • Descriptive Statistics: to describe and summarize the data.

  • Inferential Statistics: to draw conclusions (inferences) from a sample that generalize to a population.

  • Variables: characteristics that take on different values in different persons, places, or things.

    • Quantitative variable: can be measured or counted

      • Discrete variable: the possible values are either finite or countable numbers (e.g. number of patients, length of hospital stay)

      • Continuous variable: the possible values can take any value within a given interval (e.g. height, weight)

    • Qualitative variable (categorical): can be placed in different categories distinguished by some characteristic or attribute (e.g. race, gender)

  • Measurement scales: the first step in any statistical analysis is to determine each variable’s level of measurement [2], which influences the type of statistical analysis that can be performed.

    • Nominal scale: names or categories (e.g. sex, race, ethnicity)

    • Ordinal scale: ranked categories or classifications (e.g. CEAP for chronic venous disease [3], Rutherford for PAD [4])

    • Interval scale: an ordinal scale in which the differences between values are meaningful, but there is no true zero (e.g. temperature in °C, calendar years)

    • Ratio scale: an interval scale with a true, meaningful zero (e.g. age, weight)

Inferential Statistics

  • Uses random samples of data and makes inferences (predictions) about the population.

  • Uses sample data from the population to answer research questions (test hypotheses)

  • Population: complete collection of all elements/subjects to be studied

  • Sample: a subset of elements drawn from the population

  • Methods of sampling: convenience, simple random, systematic, stratified random, and cluster.
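
A minimal sketch of three of these sampling methods in Python, using an invented sampling frame of patient IDs with a hypothetical two-level stratum; the population and sample size are made up for the example.

    import random

    # Hypothetical sampling frame: 1,000 patient IDs with a two-level stratum
    population = list(range(1000))
    stratum = ["male" if i % 2 == 0 else "female" for i in population]
    n = 50  # desired sample size

    # Simple random sampling: every subject has an equal chance of selection
    srs = random.sample(population, n)

    # Systematic sampling: random start, then every k-th subject
    k = len(population) // n
    start = random.randrange(k)
    systematic = population[start::k][:n]

    # Stratified random sampling: a simple random sample within each stratum,
    # proportional to the stratum's share of the population
    stratified = []
    for s in ("male", "female"):
        members = [p for p, lab in zip(population, stratum) if lab == s]
        stratified += random.sample(members, n * len(members) // len(population))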

Descriptive Statistics

  • Results that summarize a given data set, usually a sample of a population

  • Data may be distributed in different ways: skewed to the left, skewed to the right, or sometimes irregular, without any particular shape (See Fig. 28.1)

  • In some cases, the data tend to cluster around a central value (e.g. the mean) with no bias to the left or right, resembling a “Normal Distribution” (See Fig. 28.1d)

  • Descriptive statistics show their results as measures of central tendency (summary) and measures of variability (dispersion) [5]

  • Measures of central tendency

    • Mean: the sum of the scores in the data set divided by the total number of scores. It is strongly affected by extreme values

      \( \mu = \frac{3+3+7+9+10+10}{6} = 7; \qquad \mu = \frac{3+3+7+9+10+100}{6} = 22 \)

    • Median: the midpoint of the ordered dataset, i.e. the \( \frac{n+1}{2} \)th observation. It is not affected by extreme values

      • From the set 2, 5, 7, 16, 84: the midpoint is the \( \frac{5+1}{2} = 3 \)rd observation = 7

    • Mode: the value in the data set that occurs most frequently. Sometimes there is more than one mode

      • From the set 21, 45, 30, 25, 45, 21, 45; the mode = 45

  • Measures of Variability: show the amount of dispersion present in a dataset, i.e. whether the values are close to each other (small dispersion) or widely scattered (greater dispersion)

    • Range: largest value − smallest value

    • Variance: measures dispersion based on the scatter of the values about the mean (the average squared deviation from the mean)

    • Standard deviation (SD): the square root of the variance. In a normal distribution, 68.3%, 95.5%, and 99.7% of the data fall within 1, 2, and 3 SDs of the mean, respectively (Fig. 28.1d).

    • Coefficient of variation: standard deviation divided by the mean, used to compare dispersion between two or more groups.
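
A minimal sketch of these summary measures using Python’s standard statistics module, applied to the example dataset used for the mean above:

    import statistics

    data = [3, 3, 7, 9, 10, 10]

    mean = statistics.mean(data)           # (3 + 3 + 7 + 9 + 10 + 10) / 6 = 7
    median = statistics.median(data)       # midpoint of the ordered values = (7 + 9) / 2 = 8
    mode = statistics.mode(data)           # most frequent value; 3 and 10 tie, mode() returns the first (3)
    value_range = max(data) - min(data)    # 10 - 3 = 7
    variance = statistics.pvariance(data)  # population variance: mean squared deviation from the mean = 9
    sd = statistics.pstdev(data)           # square root of the variance = 3
    cv = sd / mean                         # coefficient of variation = 3/7, comparable across groups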

Fig. 28.1 (a) Distribution bar chart with frequencies skewed to the left. (b) Distribution bar chart with frequencies skewed to the right. (c) Distribution bar chart with frequencies disarranged. (d) Normal distribution of a dataset with the corresponding standard deviations from the mean (68.3%, 95.5%, and 99.7% of the data within 1, 2, and 3 SDs)

Probability

  • Probability is the likelihood (chance) of the occurrence of an event

  • Observational (empirical) probability: calculates probabilities from a sample using relative frequencies

  • The Law of Large Numbers: summary results based on a large number of independent observations (trials) are less susceptible to the effects of variance (random error) than results derived from fewer observations

  • The central limit theorem (CLT) states that the distribution of the means of a large number of independent random samples, each drawn from a population with a finite mean and variance, approaches a normal distribution (usually when the sample size is >30) (See Fig. 28.1d).

  • The normal distribution can be used to model the distribution of many variables that are of interest. This allows us to answer probability questions about these random variables.
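
A small simulation of the CLT, assuming NumPy is available: the means of repeated samples drawn from a right-skewed (exponential) population are approximately normal even though the population itself is not.

    import numpy as np

    rng = np.random.default_rng(0)

    # 10,000 independent samples of size n > 30 from an exponential population
    # with mean 1 (finite mean and variance, strongly right-skewed)
    n = 40
    samples = rng.exponential(scale=1.0, size=(10_000, n))
    sample_means = samples.mean(axis=1)

    # The sample means cluster around the population mean (1.0) with a standard
    # deviation close to sigma / sqrt(n) = 1 / sqrt(40), in a bell shape
    print(sample_means.mean())             # ~1.0
    print(sample_means.std(), 1 / n**0.5)  # both ~0.158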

Estimating Population Parameters

  • Estimation: the process of using the data available from the sample to estimate the unknown value of a population parameter (the population counterpart of a sample statistic)

  • We can compute two types of estimates: a point estimate and an interval estimate.

  • Point estimate: a single value used to estimate a population parameter.

  • Interval estimate: a range of values that is likely to include the parameter being estimated
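
A brief sketch of both kinds of estimate on an invented sample of hospital stays, assuming SciPy is available for the t critical value: the sample mean as a point estimate and a 95% confidence interval as an interval estimate.

    import statistics
    from scipy import stats

    # Hypothetical sample: length of hospital stay in days
    sample = [4, 6, 5, 7, 3, 8, 6, 5, 4, 7]

    mean = statistics.mean(sample)  # point estimate of the population mean
    sem = statistics.stdev(sample) / len(sample) ** 0.5  # standard error of the mean

    # 95% interval estimate using the t distribution with n - 1 degrees of freedom
    t_crit = stats.t.ppf(0.975, df=len(sample) - 1)
    ci = (mean - t_crit * sem, mean + t_crit * sem)
    print(mean, ci)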

Hypothesis Testing

  • Null hypothesis (H0): the hypothesis to be tested; a statement of the status quo (no difference)

  • Alternate hypothesis (HA): the hypothesis that competes with H0; a statement of what we believe is true if the sample data lead to rejection of the null hypothesis

  • Significance level (α level): the probability of rejecting H0 when it is true. A probability of 1 in 20 (0.05) is the conventional cutoff for rejecting the null hypothesis, but the significance level can change according to specific circumstances

  • p-value: the smallest value of α for which H0 can be rejected; in other words, the probability of obtaining a result at least as extreme as the one actually obtained when the null hypothesis is true [6]

  • Confidence interval: shows an estimated range of values which is likely to include an unknown population parameter

    • The narrower the confidence interval, the more precise the estimate

    • A wide interval may indicate that more data should be collected before making assumptions about the parameter

  • Test statistic: a value computed from the sample data that is used in making the decision about rejection of the null hypothesis

    • Is evaluated to see whether it falls within the rejection region

    • Helps to decide whether we reject or fail to reject H0 at the pre-specified α level and draw a conclusion

    • There are different hypothesis tests, which use different test statistics based on the probability model assumed in the null hypothesis. The most common are listed below (a short sketch using two of them follows at the end of this list):

      • Z-test: compares two population means; used when the data are normally distributed and the population variance is known

      • t-test: similar to a Z-test but used when the population variance is unknown; applied to continuous or ordinal scales

      • ANOVA (F statistic): similar to a t-test but can compare the means of more than two groups

      • Chi-square test (χ² statistic): used for categorical variables (counts or frequency data)

  • Type I error (α error): rejecting H0 when it is true. The probability of committing a type I error equals α (the significance level)

  • Type II error (β error): failing to reject H0 when it is false (See Table 28.1)

  • Power: the probability of rejecting H0 when it is false [7], defined as 1 − β. Decreasing α makes it harder to reject the null hypothesis and thus lowers the power (See Fig. 28.2)
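
A brief sketch of two of these tests on invented data, assuming SciPy is available: a two-sample t-test comparing mean length of stay between two groups, and a chi-square test on a 2 × 2 table of counts.

    from scipy import stats

    # t-test: compare mean length of stay (days) between two hypothetical groups
    group_a = [4, 6, 5, 7, 3, 8, 6, 5]
    group_b = [7, 9, 6, 8, 10, 7, 9, 8]
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(t_stat, p_value)  # reject H0 at alpha = 0.05 if p_value < 0.05

    # Chi-square test: 2 x 2 table of counts (rows: exposure, columns: outcome)
    table = [[30, 70],
             [10, 90]]
    chi2, p, dof, expected = stats.chi2_contingency(table)
    print(chi2, p)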

Table 28.1 Hypothesis testing table

                        H0 is true                  H0 is false
  Reject H0             Type I error (α)            Correct decision (power, 1 − β)
  Fail to reject H0     Correct decision (1 − α)    Type II error (β)
Fig. 28.2 Hypothesis testing graph: distributions of the null (H0) and the alternate (HA) hypotheses, with the α (type I error) and β (type II error) regions marked

Evaluating the Relationship Between Variables

  • Correlation: a measure of the strength of association between two continuous variables. The direction and strength of the linear relationship are measured by the correlation coefficient (r).

    • If the variables X and Y have a nonlinear relationship, r will not provide a valid measure of association

    • Correlation does not imply causation

  • Simple linear regression: determines the linear relationship between two continuous variables and finds the equation of the best line that fits through the data (see the sketch after this list)

    • The regression equation is then used to predict the value of the dependent variable Y given the independent variable X. The dependent variable is continuous.

  • Multiple linear regression: introduces two or more predictor variables into the prediction model

    • Multicollinearity: occurs when the predictor variables are so highly intercorrelated that they produce unstable estimates [8]

    • Overfitting: the inclusion of too many variables in the equation can lead to a model that does not predict the outcome well [9]

    • Stepwise regression, forward selection, and backward elimination help with overfitting and multicollinearity problems

  • Logistic Regression: a method for predicting binary outcomes on the basis of one or more predictor variables. The dependent variable is binary (dichotomous). The measure of effect is the odds ratio.

  • Poisson Regression: the dependent variable is a count; suitable for rate data. The measure of effect is the incidence rate ratio.

  • Proportional Hazards Regression: models the relationship between patient survival and a set of independent variables (e.g. age, comorbidity index, BMI). The measure of effect is the hazard ratio.
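
A compact sketch of two of these models on synthetic data, assuming NumPy and scikit-learn are installed; the predictor (an age-like variable) and both outcomes are invented for the example.

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.default_rng(0)

    # Simple linear regression: continuous dependent variable Y, predictor X
    x = rng.uniform(20, 80, size=200).reshape(-1, 1)           # e.g. age
    y = 2.0 + 0.05 * x.ravel() + rng.normal(0, 0.5, size=200)  # continuous outcome
    lin = LinearRegression().fit(x, y)
    print(lin.intercept_, lin.coef_)  # best-fit line: Y = a + b*X

    # Logistic regression: binary dependent variable; exp(coefficient) is the
    # odds ratio per 1-unit increase in X (approximate here, since scikit-learn
    # applies mild L2 regularization by default)
    prob = 1 / (1 + np.exp(-(-4 + 0.08 * x.ravel())))
    outcome = rng.binomial(1, prob)
    logit = LogisticRegression().fit(x, outcome)
    print(np.exp(logit.coef_))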

Epidemiology

  • Definition: the study of the distribution and determinants of health-related states or events in specified populations, and the application of this study to the control of health problems [10]

  • Descriptive Epidemiology: Study of the amount and distribution of disease within a population by person, place, and time.

    • Person variables: age, sex, race, social status

    • Place variables: natural boundaries, urban/rural differences, international comparisons

    • Time variables: secular trends, cyclic trends

  • Types of studies

    • Observational: does not manipulate the exposure and does not randomize subjects

      • Descriptive: useful for generating hypotheses, e.g. case reports, cross-sectional surveys, ecologic studies

      • Analytic: generate and test hypotheses; can suggest causality

        • Case-control: used for rare diseases. Less expensive. Starts from the outcome and looks back for exposures. Prone to selection and information bias

        • Cohort: used for common diseases (outcomes). Starts from the exposure and follows subjects for the development of an outcome. Can calculate incidence and risk ratios. Confounding and loss to follow-up may occur.

    • Interventional (clinical trial)

      • Randomized controlled trials: experimental studies. The researcher manipulates the exposure and randomly assigns subjects to the exposed and unexposed groups. Eliminates selection bias. Strongest proof of cause and effect. Highest cost.

  • Measures of effect: summarize the strength of the association between exposures and outcomes [11]

    • Rate: probability of occurrence of some particular event (outcome) in relation to a population and a measure of time: \( \frac{\text{number of events}}{\text{population at risk}} \times \text{time specification} \)

    • Incidence: \( \frac{\text{number of new cases}}{\text{total population at risk}} \) (over a given period of time)

    • Prevalence: \( \frac{\text{number of existing cases}}{\text{total population}} \) (at a given point in time)

    • Crude Mortality rate: \( \frac{\text{number of all deaths (in a defined period of time)}}{\text{total population in the same period of time}} \)

    • Relative risk: \( \frac{\text{probability of the event in the exposed group}}{\text{probability of the event in the unexposed group}} \)

    • Risk difference (excess risk, or attributable risk): \( \text{Risk}_{\text{exposed}} - \text{Risk}_{\text{unexposed}} \)

    • Number needed to treat (NNT): a measure of the impact of a therapy, estimating the number of patients who need to be treated in order for one patient to benefit

      $$ NNT = \frac{1}{\text{Risk}_{\text{unexposed}} - \text{Risk}_{\text{exposed}}} $$
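
A worked example of these measures for a hypothetical cohort in which a treatment (the “exposure”) halves the risk of the outcome; all counts are invented.

    # Hypothetical cohort: outcome counts in treated (exposed) vs. control groups
    events_exposed, n_exposed = 10, 100      # treated group
    events_unexposed, n_unexposed = 20, 100  # control group

    risk_exposed = events_exposed / n_exposed        # 0.10
    risk_unexposed = events_unexposed / n_unexposed  # 0.20

    relative_risk = risk_exposed / risk_unexposed    # 0.5
    risk_difference = risk_exposed - risk_unexposed  # -0.10 (risk reduction)

    # NNT = 1 / (risk_unexposed - risk_exposed) = 1 / 0.10 = 10:
    # treat 10 patients to prevent one additional outcome
    nnt = 1 / (risk_unexposed - risk_exposed)
    print(relative_risk, risk_difference, nnt)
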
  • Screening Tests: widely used in medicine to assess the likelihood that members of a defined population have a particular disease (see Table 28.2 and the sketch that follows it).

    • Sensitivity: screening test’s ability to correctly identify those individuals who truly have the disease.

      $$ \text{Sensitivity}\,\% = \frac{a}{a+c} \times 100 = \frac{TP}{TP+FN} $$
    • Specificity: screening test’s ability to correctly identify those individuals who truly do not have the disease.

      $$ \text{Specificity}\,\% = \frac{d}{b+d} \times 100 = \frac{TN}{TN+FP} $$
    • Positive Predictive value (PPV): the screening test’s ability to correctly identify those individuals who truly have the disease (true positives) among all individuals whose screening tests are positive [12]. PPV increases with increasing disease prevalence; therefore, high-risk populations are the best targets for screening programs. It is a critical measure of the performance of a diagnostic method, as it reflects the probability that a positive test indicates the underlying condition being tested for.

      $$ \text{PPV}\,\% = \frac{a}{a+b} \times 100 = \frac{TP}{TP+FP} $$
    • Negative Predictive value (NPV): the screening test’s ability to correctly identify those individuals who truly do not have the disease (true negatives) among all individuals whose screening tests are negative [12]. NPV decreases with increasing disease prevalence.

      $$ \text{NPV}\,\% = \frac{d}{c+d} \times 100 = \frac{TN}{TN+FN} $$
Table 28.2 Screening test for a disease

                    Disease present    Disease absent
  Test positive     a (TP)             b (FP)
  Test negative     c (FN)             d (TN)
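
A sketch applying the four formulas above to a hypothetical 2 × 2 screening table (a = TP, b = FP, c = FN, d = TN; all counts invented):

    # Hypothetical screening results for 1,000 subjects
    a, b = 90, 50   # a = true positives,  b = false positives
    c, d = 10, 850  # c = false negatives, d = true negatives

    sensitivity = a / (a + c)  # TP / (TP + FN) = 0.90
    specificity = d / (b + d)  # TN / (TN + FP) ~ 0.944
    ppv = a / (a + b)          # TP / (TP + FP) ~ 0.643
    npv = d / (c + d)          # TN / (TN + FN) ~ 0.988
    print(sensitivity, specificity, ppv, npv)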

Questions and Answers

  1. A surgeon has designed a study in which he will compare the mean length of stay in days after the use of three different endovascular devices for the treatment of type A and type B aorto-iliac disease. The best statistical test is:

     (a) Correlation coefficient

     (b) Chi-square

     (c) ANOVA

     (d) Paired t-test

     (e) Z-test

  2. Specificity is determined by:

     (a) False negatives/(False negatives + True negatives)

     (b) False positives/(False positives + True positives)

     (c) True negatives/(True negatives + False negatives)

     (d) False negatives/(False negatives + True positives)

     (e) True negatives/(True negatives + False positives)

  3. The following type of study includes the outcome variable (e.g. severity of internal carotid occlusion) and tries to estimate the exposure variable:

     (a) Correlation

     (b) Case-control

     (c) Cohort

     (d) Randomized controlled trial

     (e) Logistic regression

  4. A research group is studying the effect on survival after BKA for complicated type II diabetes. They obtained a couple of thousand patients from an established, prospectively collected database. After stepwise elimination of variables, they would like to perform a multivariate analysis. The best statistical test would be:

     (a) Logistic regression

     (b) Cox proportional hazards regression

     (c) Poisson regression

     (d) Simple linear regression

     (e) Meta-analysis

  5. To analyze the data of a new screening tool for detecting skin perfusion of the lower limbs in patients with moderate to severe claudication, the following measure has a direct relationship with the prevalence of the disease in a population:

     (a) Specificity

     (b) Sensitivity

     (c) Odds ratio

     (d) Positive predictive value

     (e) Number needed to treat

Answers: 1 (c), 2 (e), 3 (b), 4 (b), 5 (d)