Abstract
Biostatistics is the collection, organization, summarization, and analysis of data derived from the biological sciences and medicine, whereas epidemiology is the study of the distribution and determinants of health-related states or events in specified populations and the application of this study to the control of health problems.
Physicians are in constant pursuit of knowledge and face the challenge of understanding and critically assessing journal articles, which contain a variety of statistical methods, analyses, and conclusions. To apply recently acquired knowledge to the care of their patients, they need to evaluate the different methodologies reported in medical journals and perceive the weaknesses and strengths of the research literature. In this chapter, we summarize the different definitions of biostatistics, probability, hypothesis testing, the relation between variables, and epidemiology, so that these concepts can reinforce the techniques and definitions that physicians most commonly encounter. The relation between variables is a particularly important concept that helps in understanding the basics of regression and the differences among the outcomes of the predictive variables.
Keywords
- Biostatistics
- Epidemiology
- Inference
- Probability
- Central tendency
- Variability
- Normal distribution
- Hypothesis testing
- Regression
- Screening
Biostatistics and Epidemiology
Basic Statistical Concepts
- Statistics is a field of study concerned with the collection, organization, summarization, and analysis of data. When the data analyzed are derived from the biological sciences and medicine, we use the term biostatistics [1].
- Data may come from many sources: medical records, external sources, surveys, experiments.
- Descriptive statistics: describe and summarize the data.
- Inferential statistics: make inferences that extend from the data to a population.
- Variables: characteristics presented as values in different persons, places, or things.
  - Quantitative variable: can be measured or counted.
    - Discrete variable: the possible values are finite or countable numbers (e.g. number of patients, length of hospital stay).
    - Continuous variable: the possible values can take any value within a particular interval (e.g. height, weight).
  - Qualitative (categorical) variable: can be placed in categories distinguished by some characteristic or attribute (e.g. race, gender).
- Measurement scales: the first step in any statistical analysis is to determine the level of measurement [2], which influences the type of statistical analysis that can be performed.
  - Nominal scale: names or categories (e.g. sex, race, ethnicity).
  - Ordinal scale: ranked categories or classifications (e.g. CEAP for CVD [3], Rutherford for PAD [4]).
  - Interval scale: an ordinal scale in which the differences between units of data are defined, but there is no meaningful zero (e.g. temperature, calendar years).
  - Ratio scale: an interval scale with a meaningful zero (e.g. age, weight).
Inferential Statistics
- Uses random samples of data to make inferences (predictions) about the population.
- Uses sample data from the population to answer research questions (test hypotheses).
- Population: the complete collection of all elements/subjects to be studied.
- Sample: a subset of elements drawn from the population.
- Methods of sampling: convenience, simple random, systematic, stratified random, and cluster.
Descriptive Statistics
- Results that summarize a given data set, usually a sample of a population.
- Data may be distributed in different ways: skewed to the left, skewed to the right, or sometimes disarranged without any particular shape (see Fig. 28.1).
- In some cases, the data tend to cluster around a central value (e.g. the mean) with no bias left or right, resembling a "normal distribution" (see Fig. 28.1d).
- Descriptive statistics present their results as measures of central tendency (summary) and measures of variability (dispersion) [5].
- Measures of central tendency:
  - Mean: the sum of scores in the data set divided by the total number of scores. It is strongly affected by extreme values.
    \( \mu =\frac{3+3+7+9+10+10}{6}=7;\qquad \mu =\frac{3+3+7+9+10+100}{6}=22 \)
  - Median: the midpoint of the arranged data set, i.e. the \( \frac{n+1}{2} \)th observation. It is not affected by extreme values.
    - From the set 2, 5, 7, 16, 84: the midpoint is the \( \frac{5+1}{2}=3 \)rd observation = 7.
  - Mode: the value in the data set that occurs most frequently. Sometimes there is more than one mode.
    - From the set 21, 45, 30, 25, 45, 21, 45: the mode = 45.
- Measures of variability: show the amount of dispersion present in a data set, i.e. whether the values are close to each other (small dispersion) or widely scattered (greater dispersion).
  - Range: largest value − smallest value.
  - Variance: measures dispersion relative to the scatter of the values about the mean.
  - Standard deviation (SD): the square root of the variance. In a normal distribution, it shows the percentage of data that falls within 1, 2, or 3 SDs of the mean (68.3%, 95.5%, and 99.7%, respectively) (Fig. 28.1d).
  - Coefficient of variation: the standard deviation divided by the mean; used to compare dispersion between two or more groups.
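As a quick illustration of these summary measures, here is a minimal sketch using Python's standard `statistics` module on the same six-value data set used for the mean above (using the population variance rather than the sample variance is an assumption made here for clean numbers):

```python
import statistics

data = [3, 3, 7, 9, 10, 10]

mean = statistics.mean(data)           # (3+3+7+9+10+10)/6 = 7
median = statistics.median(data)       # midpoint of the sorted data: (7+9)/2 = 8.0
mode = statistics.mode(data)           # first most-frequent value (3 and 10 tie here)
data_range = max(data) - min(data)     # largest value - smallest value = 7
variance = statistics.pvariance(data)  # population variance about the mean = 9
sd = statistics.pstdev(data)           # square root of the variance = 3.0
cv = sd / mean                         # coefficient of variation, about 0.43
```

Note how replacing the last value 10 with 100 would pull the mean from 7 to 22 while leaving the median almost unchanged, matching the robustness claims above.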
Probability
- Probability is the likelihood (chance) of the occurrence of an event.
- Observational probability calculates probabilities from a sample using relative frequencies.
- Law of large numbers: summary results based on a large number of independent observations (trials) are less susceptible to the effects of variance (random error) than results derived from fewer observations.
- Central limit theorem (CLT): the means of a large number of independent random samples, each drawn from a population with a finite mean and variance, will approach a normal distribution (usually if the sample size is >30) (see Fig. 28.1d).
- The normal distribution can be used to model the distribution of many variables of interest, which allows us to answer probability questions about these random variables.
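The CLT can be seen empirically: even when the underlying population is strongly skewed, the sample means cluster symmetrically around the population mean. A minimal simulation sketch (the exponential population, sample size of 30, and number of samples are arbitrary choices for illustration):

```python
import random
import statistics

random.seed(42)  # fixed seed so the simulation is reproducible

# Population: exponential with mean 1.0 (strongly right-skewed).
# Draw 2000 independent samples of size 30 and record each sample mean.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(30))
    for _ in range(2000)
]

# By the CLT the means cluster around the population mean (1.0) with
# standard deviation close to sigma / sqrt(n) = 1 / sqrt(30), about 0.18.
center = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)
```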
Estimating Population Parameters
- Estimation: the process of using data available from the sample to estimate the unknown value of a population parameter.
- We can compute two types of estimates: a point estimate and an interval estimate.
  - Point estimate: a single value used to estimate a population parameter.
  - Interval estimate: a range of values that includes the parameter being estimated.
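For example, a large-sample 95% interval estimate for a mean combines the point estimate with its standard error. A sketch with hypothetical systolic blood pressure readings (both the data and the use of the z-based multiplier 1.96 are illustrative assumptions; a small sample like this one would strictly call for a t critical value):

```python
import math
import statistics

# Hypothetical sample of systolic blood pressure readings (mmHg)
sample = [128, 135, 121, 140, 132, 126, 138, 130, 124, 136]

n = len(sample)
xbar = statistics.mean(sample)                 # point estimate of the mean
se = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean

# Interval estimate: point estimate +/- 1.96 standard errors (95% level)
lower, upper = xbar - 1.96 * se, xbar + 1.96 * se
```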
Hypothesis Testing
- Null hypothesis (H0): the hypothesis to be tested; a statement of the status quo (no difference).
- Alternative hypothesis (HA): the hypothesis that competes with H0; a statement of what we believe is true if the sample data lead us to reject the null hypothesis.
- Significance level (α level): the probability of rejecting H0 when it is true. A probability of 1 in 20 (0.05) is a conventional cutoff for rejecting the null hypothesis, but the significance level can change according to specific circumstances.
- p-value: the smallest value of α for which H0 can be rejected; in other words, the probability of obtaining a result at least as extreme as the result actually obtained when the null hypothesis is true [6].
- Confidence interval: an estimated range of values that is likely to include the unknown population parameter.
  - The narrower the confidence interval, the more precise it is.
  - A wide interval may indicate that more data should be collected before drawing conclusions about the parameter.
- Test statistic: a value computed from the sample data that is used in deciding whether to reject the null hypothesis.
  - Helps to evaluate whether the test statistic falls within the rejection region.
  - Helps to decide whether we reject or fail to reject H0 at the pre-specified α level and draw a conclusion.
  - Different hypothesis tests use different test statistics based on the probability model assumed under the null hypothesis. The most common are:
    - Z-test: determines differences between two population means; used when the data are normally distributed and the population variance is known.
    - t-test: similar to a Z-test but used when the population variance is unknown; used for continuous or ordinal scales.
    - ANOVA (F statistic): similar to a t-test but can compare the means of more than two groups.
    - Chi-square test (χ2 statistic): used for categorical variables (counts or frequency data).
- Type I error (α error): rejecting H0 when it is true. The probability of committing a type I error equals α (the significance level).
- Type II error (β error): failing to reject the null hypothesis when it is false (see Table 28.1).
- Power: the probability of rejecting H0 when it is false [7], defined as 1 − β. Decreasing α makes it harder to reject the null hypothesis and thus lowers the power (see Fig. 28.2).
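As a concrete instance of a test statistic, here is a minimal Welch two-sample t statistic computed by hand on hypothetical length-of-stay data (the groups and values are invented for illustration; a complete test would also derive degrees of freedom and a p-value from the t distribution):

```python
import math
import statistics

# Hypothetical lengths of stay (days) under two treatments
group_a = [4, 5, 6, 5, 7, 4, 6]
group_b = [7, 8, 6, 9, 7, 8, 9]

na, nb = len(group_a), len(group_b)
mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

# Welch's t: difference in sample means divided by its standard error
t = (mean_a - mean_b) / math.sqrt(var_a / na + var_b / nb)
# Here |t| is about 4.08, which would fall well inside the rejection
# region at alpha = 0.05, so we would reject H0 of equal means.
```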
Evaluating the Relationship Between Variables
- Correlation: a measure of the strength of association between two continuous variables. The direction and strength of the linear relationship are measured by the correlation coefficient (r).
  - If the variables X and Y have a nonlinear relationship, r will not provide a valid measure of association.
  - Correlation does not imply causation.
- Simple linear regression: determines the linear relationship between two continuous variables by finding the equation of the line that best fits the data.
  - The regression equation is then used to predict the value of the dependent variable Y given the independent variable X. The dependent variable is continuous.
- Multiple linear regression: introduces two or more predictor variables into the prediction model.
  - Multicollinearity: occurs when the predictor variables are so highly intercorrelated that they produce instability problems [8].
  - Overfitting: including too many variables in the equation can lead to an equation that predicts the outcome poorly [9].
  - Stepwise regression, forward selection, and backward elimination help with overfitting and multicollinearity problems.
- Logistic regression: a method for predicting binary outcomes on the basis of one or more predictor variables. The dependent variable is binary (dichotomous). The measure of effect is the odds ratio.
- Poisson regression: uses a count dependent variable and is suitable for rate data. The measure of effect is the incidence rate ratio.
- Proportional hazards regression: models the relationship between patient survival and a set of independent variables (e.g. age, comorbidity index, BMI). The measure of effect is the hazard ratio.
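A minimal least-squares sketch makes the "best line" idea concrete. The paired data here are hypothetical (an invented predictor X and continuous outcome Y), and the slope and intercept are computed directly from the standard least-squares formulas:

```python
import statistics

# Hypothetical paired observations: predictor X and continuous outcome Y
x = [0, 5, 10, 15, 20, 25]
y = [1.0, 2.1, 2.9, 4.2, 5.1, 5.9]

mean_x, mean_y = statistics.mean(x), statistics.mean(y)

# Least-squares slope b and intercept a for the line y = a + b*x
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
    / sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x

y_hat = a + b * 30  # regression equation predicting Y for a new X = 30
```

The same slope/intercept pair is what `statistics.linear_regression` (Python 3.10+) would return for these data.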
Epidemiology
- Definition: the study of the distribution and determinants of health-related states or events in specified populations, and the application of this study to the control of health problems [10].
- Descriptive epidemiology: the study of the amount and distribution of disease within a population by person, place, and time.
  - Person variables: age, sex, race, social status.
  - Place variables: natural boundaries, urban/rural differences, international comparisons.
  - Time variables: secular trends, cyclic trends.
- Types of studies:
  - Observational: does not manipulate the exposure and does not randomize subjects.
    - Descriptive: useful for generating hypotheses, e.g. case reports, cross-sectional surveys, ecologic studies.
    - Analytic: generate and test hypotheses; suggest causality.
      - Case-control: used for rare diseases. Less expensive. Starts from the outcome and looks back for exposures. Prone to selection and information bias.
      - Cohort: used for common diseases (outcomes). Starts from the exposure and follows the development of an outcome. Can calculate incidence and risk ratio. Confounding and loss to follow-up may occur.
  - Interventional (clinical trial):
    - Randomized controlled trials: experimental studies. The researcher manipulates the exposure and randomly assigns subjects to the exposed and unexposed groups. Eliminates selection bias. Strongest proof of cause and effect. Highest cost.
- Measures of effect: summarize the strength of the association between exposures and outcomes [11].
  - Rate: the probability of occurrence of a particular event (outcome) in relation to a population and a measure of time.
    \( \frac{\text{number of events}}{\text{population at risk}} \times \text{time specification} \)
  - Incidence: \( \frac{\text{number of new cases}}{\text{total population at risk}} \) (over a given period of time)
  - Prevalence: \( \frac{\text{number of existing cases}}{\text{total population}} \) (at a given point in time)
  - Crude mortality rate: \( \frac{\text{number of all deaths (in a defined period of time)}}{\text{total population in the same period of time}} \)
  - Relative risk: \( \frac{\text{probability of the event in the exposed group}}{\text{probability of the event in the unexposed group}} \)
  - Risk difference (excess risk, or attributable risk): \( \text{Risk}_{\text{exposed}} - \text{Risk}_{\text{unexposed}} \)
  - Number needed to treat (NNT): estimates the impact of a therapy as the number of patients who need to be treated in order to affect one person.
    $$ NNT=\frac{1}{\text{Risk}_{\text{unexposed}}-\text{Risk}_{\text{exposed}}} $$
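Plugging a hypothetical cohort 2×2 table into these formulas (the counts are invented for illustration):

```python
# Hypothetical cohort: outcome events out of 100 subjects per group
events_exposed, n_exposed = 10, 100
events_unexposed, n_unexposed = 25, 100

risk_exposed = events_exposed / n_exposed        # 0.10
risk_unexposed = events_unexposed / n_unexposed  # 0.25

relative_risk = risk_exposed / risk_unexposed    # 0.10 / 0.25 = 0.4
risk_difference = risk_exposed - risk_unexposed  # -0.15 (exposure lowers risk)
nnt = 1 / (risk_unexposed - risk_exposed)        # 1 / 0.15, about 6.7,
                                                 # i.e. treat about 7 patients
```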
- Screening tests: widely used in medicine to assess the likelihood that members of a defined population have a particular disease (see Table 28.2).
  - Sensitivity: the screening test's ability to correctly identify those individuals who truly have the disease.
    $$ \text{Sensitivity}\,\% = \frac{a}{a+c}\times 100 = \frac{TP}{TP+FN} $$
  - Specificity: the screening test's ability to correctly identify those individuals who truly do not have the disease.
    $$ \text{Specificity}\,\% = \frac{d}{b+d}\times 100 = \frac{TN}{TN+FP} $$
  - Positive predictive value (PPV): the screening test's ability to correctly identify those individuals who truly have the disease (true positives) among all individuals whose screening tests are positive [12]. PPV increases with increasing disease prevalence; therefore, high-risk populations are the best targets for screening programs. It is a critical measure of the performance of a diagnostic method, as it reflects the probability that a positive test reflects the underlying condition being tested for.
    $$ PPV\,\% = \frac{a}{a+b}\times 100 = \frac{TP}{TP+FP} $$
  - Negative predictive value (NPV): the screening test's ability to correctly identify those individuals who truly do not have the disease (true negatives) among all individuals whose screening tests are negative [12]. NPV decreases with increasing disease prevalence.
    $$ NPV\,\% = \frac{d}{c+d}\times 100 = \frac{TN}{TN+FN} $$
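All four screening measures follow directly from a 2×2 table. A sketch with an invented table (a = TP, b = FP, c = FN, d = TN):

```python
# Hypothetical screening results for 1000 people
tp, fp = 90, 40   # a: true positives,  b: false positives
fn, tn = 10, 860  # c: false negatives, d: true negatives

sensitivity = tp / (tp + fn)  # a / (a + c) = 90/100, i.e. 0.90
specificity = tn / (tn + fp)  # d / (b + d) = 860/900, about 0.956
ppv = tp / (tp + fp)          # a / (a + b) = 90/130, about 0.692
npv = tn / (tn + fn)          # d / (c + d) = 860/870, about 0.989
```

If the same test were applied to a population with higher disease prevalence (more of the 1000 people diseased), PPV would rise and NPV would fall, as stated above.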
Questions and Answers
1. A surgeon has designed a study in which he will compare the mean length of stay in days after the use of three different endovascular devices for the treatment of type A and type B aorto-iliac disease. The best statistical test is:
   (a) Correlation coefficient
   (b) Chi-square
   (c) ANOVA
   (d) Paired t-test
   (e) Z-test
2. Specificity is determined by:
   (a) False negatives / (False negatives + True negatives)
   (b) False positives / (False positives + True positives)
   (c) True negatives / (True negatives + False negatives)
   (d) False negatives / (False negatives + True positives)
   (e) True negatives / (True negatives + False positives)
3. The following type of study includes the outcome variable (e.g. severity of internal carotid occlusion) and tries to estimate the exposure variable:
   (a) Correlation
   (b) Case-control
   (c) Cohort
   (d) Randomized controlled trial
   (e) Logistic regression
4. A research group is studying the effect on survival after BKA for complicated type II diabetes. They obtained a couple of thousand patients from an established prospectively collected database. After stepwise elimination of variables, they would like to perform a multivariate analysis. The best statistical test would be:
   (a) Logistic regression
   (b) Cox proportional hazards regression
   (c) Poisson regression
   (d) Simple linear regression
   (e) Meta-analysis
5. To analyze the data of a new screening tool for detecting skin perfusion of the lower limbs in patients with moderate to severe claudication, the following statistical measure has a direct relationship with the prevalence of the disease in a population:
   (a) Specificity
   (b) Sensitivity
   (c) Odds ratio
   (d) Positive predictive value
   (e) Number needed to treat

Answers: 1 (c), 2 (e), 3 (b), 4 (b), 5 (d)
References
Daniel WW, Cross CL. Biostatistics: a foundation for analysis in the health sciences. New York: Wiley; 2018.
Mertler CA, Reinhart RV. Advanced and multivariate statistical methods: practical application and interpretation. London: Routledge; 2016.
Eklöf B, Rutherford RB, Bergan JJ, Carpentier PH, Gloviczki P, Kistner RL, et al. Revision of the CEAP classification for chronic venous disorders: consensus statement. J Vasc Surg. 2004;40(6):1248–52.
Hardman RL, Jazaeri O, Yi J, Smith M, Gupta R. Overview of classification systems in peripheral artery disease. Semin Intervent Radiol. 2014;31:378–88.
Marshall G, Jonker L. An introduction to descriptive statistics: a review and practical guide. Radiography. 2010;16(4):e1–7.
Cohen HW. P values: use and misuse in medical literature. Am J Hypertens. 2011;24(1):18–23.
Krzywinski M, Altman N. Points of significance: power and sample size. London: Nature Publishing Group; 2013.
Vatcheva KP, Lee M, McCormick JB, Rahbar MH. Multicollinearity in regression analyses conducted in epidemiologic studies. Epidemiology. 2016;6(2):227.
Ivanescu AE, Li P, George B, Brown AW, Keith SW, Raju D, et al. The importance of prediction model validation and assessment in obesity and nutrition research. Int J Obesity. 2016;40(6):887.
Porta M. A dictionary of epidemiology. Oxford: Oxford University Press; 2014.
Tripepi G, Jager K, Dekker F, Wanner C, Zoccali C. Measures of effect: relative risks, odds ratios, risk difference, and ‘number needed to treat’. Kidney Int. 2007;72(7):789–91.
Trevethan R. Sensitivity, specificity, and predictive values: foundations, pliabilities, and pitfalls in research and practice. Front Public Health. 2017;5:307.
© 2023 Springer Nature Switzerland AG
de Paz, C.C., Murga, A. (2023). Biostatistics. In: Murga, A., Teruya, T.H., Abou-Zamzam Jr, A.M., Bianchi, C. (eds) The Vascular Surgery In-Training Examination Review (VSITE). Springer, Cham. https://doi.org/10.1007/978-3-031-24121-5_28
Print ISBN: 978-3-031-24120-8. Online ISBN: 978-3-031-24121-5.