1 Database Description

1.1 Introduction

Medicaid is the largest health care program for persons with low income in the US and is jointly funded by the federal and individual state governments, while Medicare is solely funded by the federal government and provides health care for the vast majority of elderly persons. Both programs were established in 1965 and are overseen by the Centers for Medicare and Medicaid Services (CMS) of the United States Department of Health and Human Services. The CMS also maintains a database with administrative data from both health care programs, which is the main source for researchers working with Medicaid or Medicare data (Leonard et al. 2017). The administrative data of Medicaid has been used to answer pharmacoepidemiological research questions since the early 1980s (Hennessy et al. 2012a). Administrative data of Medicare has increasingly been used for this purpose since the implementation of prescription drug coverage in 2006 (Hanlon and Donohue 2010).

1.2 Database Characteristics

Medicaid originally covered expenditures for low income pregnant women, families with children or elderly patients as well as chronically disabled patients in all federal states of the US. In 2014, eligibility was extended to all low income patients in the course of the Patient Protection and Affordable Care Act, but the extension is voluntary and varies by state. Medicare covers most patients aged 65 years and above as well as some disabled persons and patients with end-stage renal disease or amyotrophic lateral sclerosis. The Medicare program is divided into four parts. Part A is available to all Medicare enrollees and covers inpatient and hospice/nursing home care. Part B is available for an additional monthly fee and covers outpatient physician visits, services or products. Part C includes Medicare Advantage plans, which can be chosen by Part A or B enrollees and cover additional services, typically paid for by an extra premium each month. However, the claims of Part C are generally not available for research through CMS. Part D was implemented in 2006 as part of the Medicare Modernization Act of 2003 and covers outpatient prescriptions. To be included in Part D, patients must enroll in stand-alone prescription drug plans or Medicare Advantage prescription drug plans, which are administered by private health insurers. As the eligibility criteria for Medicare and Medicaid overlap, patients can be enrolled in both Medicare and Medicaid.

The trend of enrolled patients over time is depicted in Fig. 1. There has been a steady increase in enrolled persons over the last 50 years in both health care programs with a steeper increase in enrollment for Medicaid than for Medicare (Part A or B). Table 1 shows the latest demographic characteristics. Medicaid covered 74.49 million people in 2012 and 58.48 million were enrolled in Medicare in 2017, corresponding to 23.7% and 18.0% of the US population, respectively (U.S. Census Bureau Population Division 2018). The beneficiaries of Medicaid are not representative of the US population, as children, females and non-white persons are overrepresented due to the selective eligibility criteria of the program. For the same reason, white persons, females and seniors are overrepresented in Medicare. Over 10.6 million patients are currently enrolled in both Medicare and Medicaid (Centers for Medicare and Medicaid Services 2020).

Fig. 1 Enrolled persons in Medicaid (Medicaid and CHIP Payment and Access Commission 2019) and Medicare (Centers for Medicare and Medicaid Services 2017, 2019a, b) from 1966 until 2016

Table 1 Demographic characteristics of the Medicaid and Medicare population

The raw administrative data of Medicaid is processed by the CMS and made available for research purposes via research identifiable files (RIFs). For Medicaid, the administrative data needs three to four years to become available and the RIFs currently cover data from 1999 up to 2013 for all federal states and up to 2015 for some of them. The administrative data of Medicare Part A, B, and D is updated more frequently than the Medicaid data and is now available after approximately one year. Currently, the RIFs of Medicare cover the years 1999 to 2017 and 2018 data is expected to be complete at the beginning of 2020 (Part D data is available from 2006 and already includes 2018 data).

1.3 Available Data

Medicaid data is stored in five different data files, which are referred to as Medicaid Analytic eXtract (MAX) files by the CMS:

  • The Personal Summary File contains demographic characteristics of the patients, e.g., date of birth, gender, race/ethnicity, and date of death (without cause of death). It also contains information about the eligibility status (e.g., eligibility group, months of eligibility, dual eligibility for Medicare and Medicaid) and summary measures on use of the health system (e.g., number and duration of hospital stays, total number of prescribed drugs, payments).

  • The Inpatient File contains all hospital stay records for enrollees using inpatient services. Included are hospital admission dates, type and begin/end of services, status at discharge and payments. Diagnoses are coded by the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) and procedures by ICD-9-CM, Healthcare Common Procedure Coding System (HCPCS) level I-III or state-specific codes.

  • The Other Therapy File includes claims from outpatient and inpatient physician visits (including physician specialty), as well as outpatient hospitals, clinics, home health care and hospices with respective dates. Diagnoses and procedures are coded in the same manner as in the Inpatient File. Outpatient laboratory and radiology records are also contained, but lab results are not reported.

  • The Prescription Drug File covers all drug claims of Medicaid, which are coded according to the National Drug Code (NDC) system. Drugs identified by other codes such as HCPCS or state-specific codes are contained in the Other Therapy File. Note that most drugs for patients who are dually eligible for Medicaid and Medicare are contained in the Medicare files. Each drug claim includes the prescription date, prescription fill date, strength, quantity and duration of supply, and the identification number of the prescribing physician. Non-reimbursable drugs administered in hospital and indications for drugs are not contained.

  • The Long Term Care File includes claims from long-term care facilities, i.e., nursing homes, intermediate care facilities, and psychiatric facilities. The file includes admission dates to the facility, dates of services, diagnoses coded in ICD-9-CM, and discharge status of the patient.

The data of Medicare is available via numerous data files, which can be linked via unique subject identifiers:

  • The Master Beneficiary Summary File contains demographic characteristics such as sex, race, region of residence, date and cause (only from 1999 to 2008) of death, monthly enrollment information for Part C and D, and summary measures on costs and uses of services.

  • The Standard Analytic Files (SAFs), also known as Medicare Claims Files, contain claims from institutional and non-institutional health care providers. Institutional data covers inpatient and outpatient data as well as claims from skilled nursing facilities, hospices, and home health agencies. Non-institutional claims cover durable medical equipment and data on physicians and free-standing facilities such as clinical laboratories. In general, diagnoses (coded in ICD-9-CM) and procedures (coded in ICD-9-CM or HCPCS) are included together with the date of service and the amount of reimbursement.

  • The Medicare Provider and Analysis Review (MedPAR) Files contain inpatient hospital and/or skilled nursing facility final action claims with diagnoses and procedures (coded in ICD-9-CM) and the corresponding date of service and reimbursement amount. Contrary to the SAFs, each record represents a complete stay in a hospital or nursing facility and thus might contain multiple claims if they belong to the same stay.

  • Information on prescription claims is available via the Part D Drug Event (PDE) File. This file contains one record per prescription and includes the prescription date, NDC, days of supply and quantity dispensed. As in the Medicaid data, indications for drugs are missing. Further, medication administered in hospital is covered by Part A and is thus not contained in the PDE File. Information on prescription drug plans, pharmacies, drugs and prescribers is available in supplemental files.
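
The difference between claim-level records (as in the SAFs) and stay-level records (as in the MedPAR Files) can be illustrated with a minimal Python sketch. It is not the procedure used by CMS: the field names (bene_id, provider_id, from_date, thru_date) are hypothetical, and the merging rule (consecutive claims of the same beneficiary at the same provider that start at most one day after the previous claim ends) is an assumption made only for illustration.

```python
from datetime import date, timedelta


def collapse_claims_into_stays(claims, max_gap_days=1):
    """Merge claim-level records into stay-level records.

    Assumed rule (not the CMS algorithm): a claim extends the current stay
    if it has the same beneficiary and provider and starts no later than
    max_gap_days after the current stay ends.
    """
    ordered = sorted(
        claims, key=lambda c: (c["bene_id"], c["provider_id"], c["from_date"])
    )
    stays = []
    for claim in ordered:
        last = stays[-1] if stays else None
        same_unit = (
            last is not None
            and last["bene_id"] == claim["bene_id"]
            and last["provider_id"] == claim["provider_id"]
        )
        gap_limit = timedelta(days=max_gap_days)
        if same_unit and claim["from_date"] <= last["discharge"] + gap_limit:
            # Claim belongs to the running stay: extend it.
            last["discharge"] = max(last["discharge"], claim["thru_date"])
            last["n_claims"] += 1
        else:
            # Start a new stay record.
            stays.append({
                "bene_id": claim["bene_id"],
                "provider_id": claim["provider_id"],
                "admission": claim["from_date"],
                "discharge": claim["thru_date"],
                "n_claims": 1,
            })
    return stays


# Invented example data (hypothetical identifiers and dates).
example_claims = [
    {"bene_id": "A", "provider_id": "H1",
     "from_date": date(2010, 1, 11), "thru_date": date(2010, 1, 20)},
    {"bene_id": "A", "provider_id": "H1",
     "from_date": date(2010, 1, 1), "thru_date": date(2010, 1, 10)},
    {"bene_id": "A", "provider_id": "H1",
     "from_date": date(2010, 3, 1), "thru_date": date(2010, 3, 5)},
    {"bene_id": "B", "provider_id": "H1",
     "from_date": date(2010, 1, 5), "thru_date": date(2010, 1, 7)},
]
stays = collapse_claims_into_stays(example_claims)
for stay in stays:
    print(stay)
```

In this invented example the two January claims of beneficiary A form one stay, whereas the March claim starts a new one. Whether a one-day gap is an appropriate threshold depends on the study question.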

Medicaid and Medicare data can be linked to each other but also to other data sources. For example, since laboratory results are absent, a study has been performed to link Medicare data of 10 eastern states with data of a large national laboratory service (Hammill et al. 2015). For studies involving cancer, linkage of Medicare data with the Surveillance, Epidemiology, and End Results (SEER) program of the National Cancer Institute is possible to obtain detailed information on cancer site, stage and histology. The Personal Summary File of Medicaid also contains state-specific case numbers identifying the Medicaid cases to which each individual belongs. Palmsten et al. (2013) used this number to identify mother-infant pairs. The authors showed that linkage is in general feasible, but the percentage of linked deliveries varies greatly by state.
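
As a minimal sketch of deterministic linkage on a shared identifier, the following Python example joins prescription records to a demographic file and reports the share of records that could be linked. All field names (bene_id, sex, birth_year, ndc) and all values are hypothetical placeholders, and the exact-match rule is an assumption; the linkage procedures of the cited studies are more elaborate.

```python
def link_records(beneficiaries, prescriptions, key="bene_id"):
    """Join prescription records to beneficiary records on a shared key.

    Returns (linked, unmatched): linked holds merged records, unmatched
    holds prescriptions without a matching beneficiary.
    """
    by_id = {person[key]: person for person in beneficiaries}
    linked, unmatched = [], []
    for rx in prescriptions:
        person = by_id.get(rx[key])
        if person is None:
            unmatched.append(rx)
        else:
            # Merge demographic and prescription fields into one record.
            linked.append({**person, **rx})
    return linked, unmatched


# Invented example data (hypothetical identifiers and placeholder codes).
beneficiaries = [
    {"bene_id": "A", "sex": "F", "birth_year": 1940},
    {"bene_id": "B", "sex": "M", "birth_year": 1938},
]
prescriptions = [
    {"bene_id": "A", "ndc": "00000-0000-01", "days_supply": 30},
    {"bene_id": "A", "ndc": "00000-0000-02", "days_supply": 90},
    {"bene_id": "C", "ndc": "00000-0000-01", "days_supply": 30},
]

linked, unmatched = link_records(beneficiaries, prescriptions)
linkage_rate = len(linked) / len(prescriptions)
print(f"linked {len(linked)} of {len(prescriptions)} records ({linkage_rate:.0%})")
```

Reporting the linkage rate alongside the linked data makes differences in linkage success, for instance between states, visible.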

1.4 Strengths and Limitations

A major strength of Medicaid and Medicare is their enormous size, which enables studies of rare events even in small subpopulations. The homogeneity induced by the eligibility criteria improves control for confounding, and restriction to subpopulations in which treatment effects might be detectable, in contrast to the general population, might be possible. However, the non-representativeness precludes studies aiming to describe the overall US population. Further, although a small proportion of patients is included in Medicaid without gaps of enrollment (Leonard et al. 2017), enrollment in Medicaid is generally not stable over time, precluding studies of long-term effects of most treatments (Hennessy et al. 2012b). In contrast, members of Medicare generally stay in the program once they have entered, such that studies with long follow-ups are possible.
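
Enrollment stability can be assessed from monthly eligibility information. The following Python sketch assumes that eligibility is available as a sequence of monthly flags (1 = enrolled, 0 = not enrolled); this representation and the function names are hypothetical and do not reflect the actual file layout.

```python
def longest_continuous_enrollment(monthly_flags):
    """Return the length (in months) of the longest uninterrupted enrollment."""
    longest = current = 0
    for enrolled in monthly_flags:
        current = current + 1 if enrolled else 0
        longest = max(longest, current)
    return longest


def has_enrollment_gap(monthly_flags):
    """Return True if a non-enrolled month lies between two enrolled months."""
    enrolled_months = [i for i, flag in enumerate(monthly_flags) if flag]
    if not enrolled_months:
        return False
    span = enrolled_months[-1] - enrolled_months[0] + 1
    return len(enrolled_months) < span


# Invented example: one calendar year with a two-month interruption.
flags_with_gap = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0]
# Invented example: late entry and early exit, but no interruption.
flags_without_gap = [0, 1, 1, 1, 0]

print(longest_continuous_enrollment(flags_with_gap), has_enrollment_gap(flags_with_gap))
print(longest_continuous_enrollment(flags_without_gap), has_enrollment_gap(flags_without_gap))
```

Studies with long follow-up would typically restrict the cohort to persons whose continuous enrollment covers the required baseline and follow-up periods.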

As with all administrative databases, Medicaid and Medicare lack information on important confounders such as lifestyle factors (e.g. smoking, diet) or occupation. Regarding prescriptions, the databases face the common problem that over-the-counter medication is not captured. Additionally, the prescription drug plans of Medicare Part D differ with respect to drug coverage and cost-sharing options, so drug availability differs across plans. Prescription data in Medicaid was shown to be accurate and complete (Leonard et al. 2017), but this might not hold for low cost generics (Choudhry and Shrank 2010).

1.5 Validation

Medicaid data of each state is validated internally by the CMS (Centers for Medicare and Medicaid Services 2016) and anomalies are reported in validation tables. For both Medicare and Medicaid, reimbursement of the health care providers is determined by the recorded procedures, which are thus checked for errors and can be considered as accurate. However, a validation study comparing medical records for surgical procedures of hip fractures with Medicaid claims found some procedures coded for another purpose (Wysowski and Baum 1993).

Outpatient prescription data of Medicaid was validated in the 1980s and found to be accurate (Lessler and Harris 1984). Leonard et al. (2017) recently investigated the quality of prescription claims in Medicaid and noted that 95–99% of the prescription claims were identifiable in a commercially available database of NDCs. Further, the absolute number of prescriptions increased steadily and consistently over time, which suggests completeness of prescription claims. Validation of prescription claims was also performed with Medicare data. For example, Colantonio et al. (2016) compared the prescription claims of lipid lowering drugs in Medicare with self-reported drug use and observed that many beneficiaries reported drug use although no claims existed in Medicare. However, either data source might have caused this discrepancy, e.g. due to recall error in self-reports or due to missing incentives to submit claims to Medicare for reimbursement.

Leonard et al. (2017) further found that hospital data in Medicaid was underrepresented in patients aged 45 years and above, which they attributed to patients with poorer health status who are dually eligible for Medicare and Medicaid. The authors therefore advised additionally considering Medicare data for patients 45 years and older if hospitalization data is needed to answer the study question of interest. Further, although no gross diagnostic miscoding in the in- and outpatient claims of Medicaid occurred, the authors acknowledged that the validity of health outcomes of interest generally remains open.

Medical record validation represents the gold standard of outcome validation and was performed in a variety of studies. For example, Hennessy et al. (2010) reviewed codes of inpatient and emergency department encounters in Medicaid to identify sudden cardiac death and ventricular arrhythmia originating in the outpatient setting and found very good agreement with medical records. Hernandez-Trujillo et al. (2015) performed medical record validation for primary immunodeficiency disease diagnoses. They observed low positive predictive values for individual ICD-9-CM codes and proposed using additional data sources to define disease status. Different methods to retrieve medical records were summarized in a Medicare-based example study validating adverse events of special interest (Wright et al. 2017). Methods other than medical record validation were also applied. For example, Brouwer et al. (2015) compared an algorithm to define myocardial infarction in Medicaid HIV patients with clinical cohort data, and an algorithm to identify chronic kidney disease in older Medicare adults was validated with the Reasons for Geographic and Racial Differences in Stroke (REGARDS) study (Muntner et al. 2015).
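
The positive predictive value (PPV) reported in such validation studies is the proportion of claims-identified cases that are confirmed by the reference standard. The following Python sketch computes the PPV together with a Wilson score confidence interval; the choice of the Wilson interval is an assumption (the cited studies may have used other methods), and the counts are invented for illustration and not taken from any study.

```python
import math


def positive_predictive_value(n_confirmed, n_flagged):
    """PPV = confirmed cases / all cases flagged by the claims algorithm."""
    if n_flagged <= 0:
        raise ValueError("n_flagged must be positive")
    if not 0 <= n_confirmed <= n_flagged:
        raise ValueError("n_confirmed must lie between 0 and n_flagged")
    return n_confirmed / n_flagged


def wilson_interval(n_confirmed, n_flagged, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 for about 95%)."""
    p = positive_predictive_value(n_confirmed, n_flagged)
    n = n_flagged
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Invented counts: 60 cases flagged by a code, 45 confirmed by record review.
ppv = positive_predictive_value(45, 60)
lower, upper = wilson_interval(45, 60)
print(f"PPV = {ppv:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

Note that the PPV depends on the prevalence of the outcome in the validated population, so a value estimated in one population does not necessarily transfer to another.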

1.6 Governance and Ethical Issues

The personal information of Medicaid or Medicare beneficiaries is protected under the Privacy Rule of the Health Insurance Portability and Accountability Act of 1996, which prohibits disclosure of protected health information (PHI) without written consent. Exceptions are granted for research purposes under certain conditions (Office for Civil Rights 2018). A study using PHI via research identifiable files of CMS requires a data use agreement and must be approved by the Privacy Board of CMS (Research Data Assistance Center 2020a, b). The study should assist CMS in monitoring, managing and improving the Medicaid and Medicare programs and the services provided to the beneficiaries. Researchers must demonstrate experience in conducting research with files containing PHI and may only apply for the data files which are necessary to appropriately answer their study question. They further need to define a cohort in advance, since the provision costs depend on the number of patients in the cohort and the data files which are requested. However, there are some public use files containing non-identifiable data on a summary level, which can be requested without a data use agreement and without approval of the Privacy Board.

1.7 Documents and Publications

The Research Data Assistance Center (ResDAC), a contractor funded by the CMS, provides introductory workshops and webinars on Medicaid, Medicare and the corresponding data files. The ResDAC website also includes a detailed description of all information included in the RIFs (https://www.resdac.org/).

Important methodological and applied example studies in pharmacoepidemiology were summarized by Hennessy et al. (2012b): Using routine data of elderly Medicaid beneficiaries of New Jersey, Stürmer et al. (2005) compared conventional confounder adjustment with propensity score adjustment and adjustment using disease risk scores but found no major difference between the three methods. Schneeweiss et al. (2009) proposed an algorithm for automated confounder selection for propensity score models and evaluated its performance in three studies based on elderly Medicare patients. The algorithm resulted in estimates that were all closer to the results of corresponding randomized clinical trials. Roumie et al. (2009) investigated the risk of cardiovascular events for certain non-steroidal anti-inflammatory drugs in Tennessee Medicaid enrollees and observed an increased risk for current users of rofecoxib, valdecoxib and indomethacin compared to non-users in patients without a history of cardiovascular diseases. The relation between adherence to osteoporosis treatment and the risk of fractures in elderly Medicare beneficiaries was analyzed by Patrick et al. (2010), who found a consistent relation between adherence and risk reduction.

Recently conducted studies showed that Medicaid and Medicare data are still used frequently in diverse fields of epidemiological research: Leonard et al. (2018) compared new users of different antidiabetic monotherapies regarding the risk of severe hypoglycaemia in a cohort based on Medicaid beneficiaries from California, Florida, New York, Ohio and Pennsylvania and found the highest rate of serious hypoglycaemia for sulfonylureas. The prevalence of antidiabetic and antilipidemic medications in children and adolescents treated with atypical antipsychotics was estimated by Varghese et al. (2016) in Virginia Medicaid beneficiaries. The authors observed that these medications were more often prescribed in atypical antipsychotic users than in non-users. Ray et al. (2015) compared time-dependent propensity scores with conventional adjustment of time-dependent confounders and inverse-probability-of-treatment (IPT) weighted estimation of parameters in marginal structural models in a cohort of opioid users of the Medicaid population in Tennessee in the absence of confounders on the causal pathway. IPT weighted estimates were shown to be less efficient than the other two in this example. Santos et al. (2016) analyzed the use of cytomegalovirus prophylaxis in kidney transplant recipients of Medicare and found that prophylaxis use was common. In an empirical example with Medicare patients, Gokhale et al. (2016) illustrated the considerable loss of power when steps in the study design reduce sample size to minimize potential bias. Gilbertson et al. (2016) examined the influence of different time windows for baseline confounder assessment on the mortality risk in a simulation study based on haemodialysis patients of Medicare and concluded that the timing of confounders should be taken into account to improve confounder control.

1.8 Administrative Information

The administrative data of Medicaid and Medicare is maintained by CMS, but ResDAC assists with data requests.

Contact details

Organization/affiliation: Research Data Assistance Center

University of Minnesota School of Public Health

Division of Health Policy and Management

420 Delaware Street SE, Mayo D355

Minneapolis, MN 55455

Administrative Contact: resdac@umn.edu

1-888-973-7322

Website: https://www.resdac.org/