Introduction

Image quality assurance (QA) in magnetic resonance imaging (MRI) is often based on phantom tests defined in standards and guidelines [1,2,3,4,5] or by the manufacturer. Quantitative phantom measurements can characterize some aspects of the scanner’s absolute imaging performance, but their relationship with actual clinical image quality is often unclear. The selected imaging sequences may emphasize effects not observed with other techniques. Phantom images are often acquired with robust 2D (e.g. conventional spin echo) sequences, which are prone to different characteristic artefacts than 3D sequences [6]. Additionally, human anatomy presents a far more complex imaging target, including involuntary movement and flow. Thus, scanner performance cannot be entirely predicted by phantom studies alone.

In addition to phantom-based QA, it would be rational to measure image quality directly from the clinical images. However, clinical image quality assessment is mostly based on qualitative observer-based ranking on a Likert scale [7] or similar. This approach is susceptible to intra- and interobserver variation and therefore lacks reproducibility. The grading criteria can differ between (or even within) departments [8]. Quantitative computational methods for clinical image QA would enable clinically relevant and reproducible assessment of MRI hardware and imaging sequence performance. They would also offer the possibility to compare scanners and sequences on a uniform scale.

Only a few quantitative methods have been published for analyzing the quality of clinical images [9,10,11,12,13,14]. Magnotta et al. presented a method for calculating signal-to-noise and contrast-to-noise ratios (CNR) from clinical 2D images. Weng-Tung et al. studied 2D image resolution by applying radiofrequency tagging to images. Mortamet et al. presented two methods using the image volume background to derive QC measures. More recently, Borri et al. used image power spectrum analysis to assess image spatial resolution, Osadebey et al. developed quality measures based on local entropy in the images, and Jang et al. used feature statistics to track distortions in the images. Large cohorts of clinical images have been analysed with respect to image quality in studies by Gedamu et al. and Kruggel et al. [15, 16]. A proposal for a complete pipeline capable of automatic image analysis was presented by Gedamu [17]. Additional methods for the detection of motion artefacts were presented in studies by Gedamu et al., Backhausen et al. and White et al. [18,19,20]. Although these methods targeted a very specific error source, they also measured general image quality.

In this work, we present four methods for assessing the image quality of the 3D fluid-attenuated inversion recovery (FLAIR) MRI sequence from clinical brain images. These methods were bundled into a novel automated pipeline and applied to a large cohort of clinical brain studies. The pipeline was utilized to demonstrate variations and trends in MRI scanner performance and to assess short- and long-term scanner stability. The results obtained with clinical volumes were compared with phantom QA results. Additionally, the effect of motion artefacts on the presented methods was studied.

Materials and methods

Imaging sequence

FLAIR is a valuable MRI technique for the detection of intracranial hemorrhage [21, 22]. The 3D FLAIR sequence used in this study was a turbo spin echo-based sequence involving a radio frequency (RF) inversion pre-pulse and variable-angle refocusing pulses to optimize image contrast [23]. The 3D sequence is nowadays often applied in brain studies instead of more traditional 2D sequences to shorten the imaging protocol while maintaining adequate contrast between brain tissues and increasing imaging resolution. 3D FLAIR sequences are also less prone to cerebrospinal fluid (CSF) flow artefacts than their 2D counterparts [6]. In our department, the 3D FLAIR sequence is the most common imaging sequence used in brain studies.

The sequence can be optimized for signal-to-noise ratio or modulation transfer function (MTF). These properties are somewhat interconnected and only partly adjustable by the user [24]. Thus, the effects of parameter changes and scanner performance on image quality are not entirely predictable. In our study, we used both phantom and clinical head volumes scanned with the 3D FLAIR sequence on a 3 T MRI scanner. The sequence parameters presented in Table 1 were used unless stated otherwise.

Table 1 Sequence parameters of the clinical 3D FLAIR sequence

Image quality analysis

Preprocessing

As a first step of the analysis pipeline, the brain volume was extracted from the original image volume with the Statistical Parametric Mapping toolbox (SPM, http://www.fil.ion.ucl.ac.uk/spm/software/spm12). SPM generated brain tissue probability maps for white matter (WM), grey matter (GM), CSF and other tissue types. In addition, SPM produced a bias-corrected version of the original image volume. The segmentation settings were SPM12 defaults (bias FWHM: 60 mm cutoff, bias regularization: 0.001, number of tissue types: 6). The probability maps were employed to generate an initial brain mask as the union of voxels with a probability of at least 0.35 in the WM or GM map (Fig. 1). From the initial brain mask, all but the largest connected object was removed, and remaining holes were filled to obtain the final brain mask.
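
For illustration, the mask generation can be sketched in a few lines of Python; the helper name and the NumPy/SciPy formulation below are ours, not part of the SPM pipeline:

```python
import numpy as np
from scipy import ndimage

def initial_brain_mask(wm_prob, gm_prob, p_thresh=0.35):
    """Sketch of the mask generation above: union of voxels with at least
    p_thresh probability in the WM or GM map, reduced to the largest
    connected object with remaining holes filled."""
    mask = (wm_prob >= p_thresh) | (gm_prob >= p_thresh)
    labels, n_objects = ndimage.label(mask)
    if n_objects > 1:
        sizes = ndimage.sum(mask, labels, index=range(1, n_objects + 1))
        mask = labels == (1 + int(np.argmax(sizes)))  # keep largest object
    return ndimage.binary_fill_holes(mask)
```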

Fig. 1

An axial slice of typical probability maps for a WM, b GM and c initial brain mask

In addition to the brain mask, a mask for the head volume was generated to separate the signal-producing volume from the background (Fig. 2). First, the whole image volume was thresholded. The threshold level was derived semi-empirically as the value obtained with Otsu’s method [25] divided by four. After thresholding, all but the largest volume was removed from the mask. Next, the remaining structure was dilated with a spherical structuring element with a radius of 10 voxels, after which remaining holes in the object were filled [26].
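
A corresponding sketch of the head-mask generation, assuming scikit-image for the Otsu threshold and the spherical structuring element (the helper name is hypothetical):

```python
import numpy as np
from scipy import ndimage
from skimage.filters import threshold_otsu
from skimage.morphology import ball

def head_mask(volume, dilation_radius=10):
    """Threshold at Otsu's value divided by four, keep the largest
    connected object, dilate with a spherical element and fill holes."""
    mask = volume > threshold_otsu(volume) / 4.0
    labels, n_objects = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, index=range(1, n_objects + 1))
    mask = labels == (1 + int(np.argmax(sizes)))      # largest object only
    mask = ndimage.binary_dilation(mask, structure=ball(dilation_radius))
    return ndimage.binary_fill_holes(mask)
```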

Fig. 2

a An illustration of segmented brain and head volumes. b Illustration of the outer perimeters of the head and brain mask in one transversal image slice

Resolution

The resolution of an imaging system describes its ability to reproduce sharp material interfaces and to distinguish closely spaced features from each other. The former can be quantified by differentiating an edge spread function (ESF) to obtain the line spread function (LSF), whose Fourier transform yields the modulation transfer function (MTF). The MTF describes an imaging system’s contrast response as a function of spatial frequency [27].

A well-defined edge is an essential requirement for MTF measurement. This condition can be easily satisfied using phantoms, but it becomes problematic when clinical images are assessed. In this study, we used the cortical surface as an edge for resolution measurement. The strong contrast and a sharp interface provide a favorable target surface.

In our method, the bias-corrected image volume was first interpolated to isotropic 0.5 × 0.5 × 0.5 mm³ resolution. A tetrahedral mesh was generated from the brain mask with the iso2mesh library [28]. Typically, 50,000 to 95,000 triangular polygons were generated, of which 30,000 randomly selected ones were used in the resolution measurement. Each of these polygons was used to define a cylinder with a diameter of 1 mm and a direction perpendicular to the brain surface (Fig. 3). The grey value and the distance from the mesh were recorded for the voxels inside the cylinder and used to create preliminary ESFs. The location of the edge in each preliminary ESF was refined by 1D convolution with the first derivative of a Gaussian and finding the maximum [29]. The origin of each preliminary profile was then shifted to the center of the detected edge.
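
The edge refinement step can be illustrated as follows; SciPy's `gaussian_filter1d` with `order=1` performs the convolution with the first derivative of a Gaussian (a minimal sketch, with sigma expressed in samples of the interpolated 0.5 mm grid):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def edge_position(esf_profile, sigma=1.0):
    """Refine the edge location in a preliminary ESF: convolve with the
    first derivative of a Gaussian and take the strongest response [29]."""
    response = gaussian_filter1d(np.asarray(esf_profile, float), sigma, order=1)
    return int(np.argmax(np.abs(response)))
```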

Fig. 3

a A sample of vectors perpendicular to the brain surface and b an illustration of the normal vector and corresponding cylinder

Each preliminary ESF was verified to represent an actual edge of the brain surface. The automatic verification included checks that the grey-value difference across the edge was reasonable and that the 1D convolution response was high enough to guarantee a sufficiently steep edge gradient. Also, the detected edge was required to lie within 5 mm of the mesh. The filtered ESFs could then be averaged together in selected directions. In this work, we studied ESFs in three orthogonal directions corresponding to the anatomical volume alignment: anterior–posterior (AP), feet–head (FH) and right–left (RL). The opening angle limiting accepted preliminary ESFs in each direction was 15°. An example of directional ESFs is presented in Fig. 4 together with a point cloud representing a sample of all ESF points.

Fig. 4

a An example of a point cloud consisting of a sample of 5000 points taken from ESFs in the AP direction. b Typical directional ESF profiles in AP, FH and RL directions

The averaged ESFs were then differentiated to obtain LSFs. Before Fourier transformation, the LSFs were Hanning filtered to suppress high frequency components originating from the LSF tails [27]. Typical resulting directional MTFs are presented in Fig. 5. MTF10 and MTF50, the spatial frequencies at which the MTF falls to 10% and 50% of its value at zero spatial frequency, were chosen as resolution measures (Fig. 5).
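
The ESF-to-MTF computation can be sketched as follows, assuming a uniformly resampled averaged ESF (the function names are ours):

```python
import numpy as np

def mtf_from_esf(esf, dx):
    """Differentiate the ESF to an LSF, apply a Hanning window to suppress
    tail noise, and Fourier transform to the normalized MTF [27]."""
    lsf = np.gradient(esf, dx)
    lsf = lsf * np.hanning(lsf.size)         # window before the FFT
    mtf = np.abs(np.fft.rfft(lsf))
    freqs = np.fft.rfftfreq(lsf.size, d=dx)  # cycles per mm
    return freqs, mtf / mtf[0]               # normalize to zero frequency

def mtf_percent(freqs, mtf, level):
    """Spatial frequency where the MTF first falls below `level`
    (0.5 for MTF50, 0.1 for MTF10)."""
    below = np.nonzero(mtf < level)[0]
    return freqs[below[0]] if below.size else freqs[-1]
```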

Fig. 5

Typical directional MTFs in AP, FH and RL directions

Quality index

Instabilities during scanning can produce ambiguity in the image reconstruction. This can result, e.g., from patient movement or imperfect operation of the gradient fields. Signal intensity can then spill to erroneous locations inside the imaging volume. The most evident effect is the introduction of extra signal outside the actual signal-producing volume. Thus, the amount of signal outside the anatomical volume can be used as a figure of merit. The quality index (QI) adopted in this study was presented by Mortamet et al. [11], and it is calculated by

$$ {\text{QI}} = \frac{{N_{\text{artefact}} }}{{N_{\text{BG}} }} $$
(1)

where NBG is the number of background voxels and Nartefact is the number of background voxels labelled as artefactual. Background voxels were defined as the image volume excluding the head volume and a 10-voxel margin at the image volume borders. As proposed by Mortamet et al., voxels were labeled as artefactual by thresholding the background volume at the value corresponding to the peak of the background histogram and eroding and dilating the result with a 3D cross-shaped structuring element [26].
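
A sketch of the QI computation under these definitions (the bin count used to locate the histogram peak is our assumption):

```python
import numpy as np
from scipy import ndimage

def quality_index(volume, head_mask, margin=10, bins=1000):
    """Eq. 1: fraction of background voxels labeled artefactual, following
    the Mortamet et al. scheme described above."""
    bg = ~head_mask
    interior = np.zeros(volume.shape, dtype=bool)
    interior[margin:-margin, margin:-margin, margin:-margin] = True
    bg &= interior                            # exclude 10-voxel border
    hist, edges = np.histogram(volume[bg], bins=bins)
    peak = edges[np.argmax(hist)]             # background-histogram peak
    candidate = bg & (volume > peak)
    cross = ndimage.generate_binary_structure(3, 1)  # 3D cross element
    artefact = ndimage.binary_dilation(
        ndimage.binary_erosion(candidate, cross), cross)
    return artefact.sum() / bg.sum()
```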

Contrast-to-noise ratio

CNR is an image quality parameter that reflects the imaging system’s ability to differentiate noise-contaminated objects by their signal level in the image. Contrast can be defined as the relative difference in the signal strengths of two known objects in the image. The discernibility at different contrast levels is limited by the amount of noise. The CNR parameter in this study was calculated by

$$ {\text{CNR}} = \frac{{{\text{GM}} - {\text{WM}}}}{{\sigma_{\text{BG}} }} $$
(2)

The representative WM and GM grey values were obtained from the model fitting parameters of the SPM software used for tissue segmentation. The software fits a Gaussian mixture model to the brain, and the WM and GM values used are the expectation values of the respective Gaussian components. σBG is the standard deviation of voxel grey values in the original image volume, excluding the head volume and a 10-voxel margin at the image volume borders.
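
Given the fitted GM and WM expectation values, the CNR computation reduces to a few lines (a sketch; the background mask is defined as above):

```python
import numpy as np

def cnr(gm_mean, wm_mean, volume, head_mask, margin=10):
    """Eq. 2: (GM - WM) / sigma_BG, where sigma_BG is the standard
    deviation of background grey values."""
    bg = ~head_mask
    interior = np.zeros(volume.shape, dtype=bool)
    interior[margin:-margin, margin:-margin, margin:-margin] = True
    bg &= interior                       # exclude 10-voxel border
    return (gm_mean - wm_mean) / volume[bg].std()
```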

Image intensity spatial homogeneity

Image intensity inhomogeneity may be induced in MRI by spatial differences in a scanner’s RF transmit (B1) or receive field. Modern MRI scanners have advanced methods for correcting signal inhomogeneities in the images. If substantial signal inhomogeneity is present after these corrections, it may be a sign of a hardware failure.

The level of signal inhomogeneity was studied using the bias correction feature of the SPM package which produced an intensity corrected version of the original image volume. The average bias correction of a voxel, or bias index (BI), was then calculated by

$$ {\text{BI}} = 100\% \times \frac{1}{N}\sum_{i = 1}^{N} \left| {\frac{{\text{bias corrected}}_{i} - {\text{original}}_{i} }{{\text{original}}_{i} }} \right| $$
(3)

where N is the number of voxels included in a spherical volume of 10 cm diameter concentric with the brain mask center of mass, bias corrected is the grey value in the bias-corrected volume, original is the intensity in the original volume and i is the voxel index.
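
A sketch of the BI computation; the spherical ROI is built around the brain-mask centre of mass (the helper name and the zero-division guard are ours):

```python
import numpy as np
from scipy import ndimage

def bias_index(original, corrected, brain_mask, voxel_mm, radius_mm=50.0):
    """Eq. 3: mean absolute relative correction inside a 10 cm diameter
    sphere concentric with the brain-mask centre of mass."""
    com = ndimage.center_of_mass(brain_mask)
    grids = np.ogrid[tuple(slice(0, s) for s in brain_mask.shape)]
    dist2 = sum(((g - c) * v) ** 2 for g, c, v in zip(grids, com, voxel_mm))
    roi = dist2 <= radius_mm ** 2
    orig = original[roi].astype(float)
    corr = corrected[roi].astype(float)
    ok = orig != 0                       # guard against division by zero
    return 100.0 * np.mean(np.abs((corr[ok] - orig[ok]) / orig[ok]))
```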

Resolution measurement validation

The resolution calculation was validated with a standard spherical quality assurance phantom with a diameter of 17 cm and a liquid, signal-producing filling. In the testing, the phantom was scanned with variable isotropic voxel sizes from 0.9 to 1.3 mm in steps of 0.2 mm. The imaging sequence was otherwise identical to the one used for the clinical images.

Additionally, the resolution calculation method was tested by filtering a binary brain mask with a 3D Gaussian filter and studying how the filter parameters affect the resulting MTF10 and MTF50 values. The results were compared with simulated ideal MTF10 and MTF50 values obtained by Gaussian filtration of a 1D step function.
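
The simulated reference can be reproduced with the helpers sketched in the resolution section above (the filter widths below are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

dx = 0.5                                  # mm per sample (interpolated grid)
step = np.zeros(512)
step[256:] = 1.0                          # ideal 1D step edge
for sigma_mm in (0.5, 1.0, 1.5):          # illustrative filter widths
    esf = gaussian_filter1d(step, sigma=sigma_mm / dx)
    freqs, mtf = mtf_from_esf(esf, dx)    # helper sketched above
    print(sigma_mm, mtf_percent(freqs, mtf, 0.1), mtf_percent(freqs, mtf, 0.5))
```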

Clinical head volume analysis

The presented image quality metrics were calculated for a large cohort of clinical head volumes spanning a test period of 9.5 months. In total, 665 head volumes were included. The inclusion criterion was a GM/WM volume ratio between 0.3 and 2.0, intended to filter out volumes with substantial pathologies or inaccurate segmentation. The inclusion criterion was verified visually. The GM and WM volumes were calculated as the sums of the respective probability maps generated by SPM. Fifty-five percent of the patients were female, 45% male, and the median age was 43 (range 13–84) years. The possible presence of foreign objects (such as metallic implants) was not taken into consideration in the analysis. The use of clinical volumes was approved by the department’s scientific committee.

The mean and standard deviation of each parameter were calculated for each day of the period using a 7-day moving window. Each 7-day window included on average 19.4 studies, with a standard deviation of 5.5 studies. During the test period, there were two major scanner breakdowns. The first occurred at the beginning of month three and was caused by a broken RF amplifier. The second occurred at the beginning of month eight and was caused by a break in a gradient coil requiring full replacement of the gradient coil system.
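
With the per-study parameters collected into a table, the moving-window statistics are straightforward to compute; a pandas sketch with hypothetical column names:

```python
import pandas as pd

# `df`: one row per head volume, indexed by scan datetime, one column per
# QA parameter (hypothetical names, e.g. 'MTF50_FH', 'QI', 'CNR', 'BI').
# For each day, all studies within the trailing 7 days are pooled.
stats = df.sort_index().rolling('7D').agg(['mean', 'std'])
```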

Comparison with phantom measurements

During the clinical QA test period, a daily phantom QA program was in place. A cross-sectional image of a manufacturer-provided cylindrical phantom was scanned daily using a head coil, a fixed phantom position and a standard spin echo (SE) sequence. The primary purpose of acquiring the daily QA image was to verify that the scanner was working properly before the first patient of the day. In addition to visual inspection, the image was sent to a QA server for detailed analysis. The calculated QA parameter time series included signal-to-noise ratio (SNR), image intensity uniformity, image ghosting and geometric distortion. SNR was calculated with the single-image method presented by the National Electrical Manufacturers Association (NEMA) [4], ghosting as presented by the International Electrotechnical Commission (IEC) [3] and image intensity uniformity with methods presented by both the IEC [3] and NEMA [5]. Geometric distortion was monitored by measuring the phantom diameter in the horizontal and vertical directions. A full description of the automatic daily phantom QA pipeline is presented by Peltonen et al. [30].

The effect of motion artefact

The effect of patient motion artefact was studied by labeling all head volumes as normal (91%) or affected by patient motion (9%). The labeling was done by an experienced QA specialist (JIP) based on the amount of blurring in a central axial slice. The median, interquartile range and range of all image quality parameters were calculated over all image volumes for three cases: all the data, only the artefact-free volumes and only the artefact-positive volumes. The statistical difference between the groups was assessed with a two-sample Student’s t test.
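
The group comparison itself is a standard two-sample test; a minimal sketch with placeholder data:

```python
import numpy as np
from scipy import stats

# Placeholder per-study MTF50 values (illustrative numbers only, not from
# the study data)
mtf50_clean = np.array([0.42, 0.44, 0.41, 0.43, 0.45])
mtf50_motion = np.array([0.35, 0.33, 0.36, 0.34])
t_stat, p_value = stats.ttest_ind(mtf50_clean, mtf50_motion)  # Student's t
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```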

Results

The image resolution measurement method was validated by imaging a spherical phantom with variable voxel sizes. The effect of voxel size on the measured MTF10 and MTF50 values in the spherical phantom is presented in Fig. 6. Voxel size has a linear relation to the measured resolution variables MTF10 (R² = 0.98) and MTF50 (R² = 0.95). Additionally, a test was done with a single brain mask (Fig. 7), in which the effect of 3D Gaussian filtering on the measured resolution parameters was studied. The MTF10 and MTF50 value response to filtering was close to the ideal response.

Fig. 6

MTF10 and MTF50 values measured by using a spherical phantom with variable isotropic acquisition voxel size

Fig. 7

MTF10 and MTF50 values obtained by using a brain mask filtered with variable sized 3D Gaussian filter

The directional MTF10 and MTF50 values, with 7-day running average and standard deviation, measured from the clinical head scans during the 9.5-month period are presented in Fig. 8. In the FH direction there is a decrease in the resolution values before both scanner breakdowns. Additionally, the MTF10 and MTF50 values in the FH direction are increased after the gradient system breakdown in month eight compared with the values before the breakdown. In the other directions, the effect of the breakdowns is not apparent.

Fig. 8

MTF10 and MTF50 running average and standard deviation (7-day window) in a AP, b FH and c RL directions

The mean and standard deviation of the QI over the time series are presented in Fig. 9. The QI is stable until the scanner gradient breakdown in month eight. After this breakdown, clearly increased QI values are seen, resulting from increased signal outside the head area in the image volume.

Fig. 9

Quality index running average and standard deviation (7-day window)

The CNR mean and standard deviation over the test period are presented in Fig. 10. The value shows a decreasing trend throughout the time series. After the gradient breakdown in month eight, the CNR values were substantially decreased.

Fig. 10

Contrast-to-noise ratio running average and standard deviation (7-day window)

The BI, representing the image intensity inhomogeneity, is presented in Fig. 11 over the test period. The value increased in month seven, well before the gradient breakdown. At that point, a change of baseline was observed.

Fig. 11

Bias index running average and standard deviation (7-day window)

The phantom QA time series for the test period are presented in Fig. 12. There was no evident effect or trend in the phantom QA results before either of the scanner breakdowns. A decrease in the variation of the SNR was seen after the second breakdown.

Fig. 12

Phantom QA results for the test period. a SNR, b image intensity uniformity based on the methods presented by NEMA and IEC, c image ghosting and d phantom diameter in horizontal and vertical direction

Of the presented QA measures, MTF10, MTF50 and CNR had significantly different results between volumes affected and not affected by patient motion (p < 0.05). The effect of patient motion on MTF50 values is presented in Fig. 13.

Fig. 13

Median, interquartile range and 2nd-to-98th percentile range of MTF50 values in three orthogonal directions for studies with motion artefact, studies without motion artefact and all patients

Discussion

Unexpected changes in image quality are important indicators of an MRI scanner’s hardware condition. QA measurements can be used to verify the nominal operation of the scanner or to communicate problems to the manufacturer or service personnel. Aside from detecting errors, it is often important in a patient care setting to know as precisely as possible the date and time when the scanner was verifiably working properly. It is, however, often difficult to determine whether the clinical image quality has degraded during the scanner’s lifetime. Several informative parameters can be obtained using standardized QA phantoms, and the results can be compared with those of acceptance testing or previous quality control measurements. However, such results may not be available, or they may not fully represent the image quality produced by other sequence types or in the clinical situation. With the presented methods, image quality assessment can be made directly from the clinical image data, and the results can be compared retrospectively with any corresponding study. The chosen methods were designed to be robust, easy to interpret and reflective of changes in MRI hardware performance.

The resolution measurement methodology relies heavily on data averaging to suppress the effect of local anomalies. However, the vast number of data points in a 3D image enables the method to provide information on the actual directional clinical image resolution achieved with the scanner. The accuracy of the resolution assessment was verified with simulated test models and phantom imaging. High correspondence to idealized expected values was achieved for the degraded (i.e. simulated) brain mask images. Also, a linear relationship between the set and measured resolution was observed in the phantom acquisitions, verifying the measurement’s feasibility for QA purposes.

The MTF10 and MTF50 values are presented here as indicators of MRI scanner resolution. Both values demonstrated a clear response to changes in scanner hardware. The system breakdowns were seen as a drop in values in the FH direction just before the malfunctions. Gradient-related problems should introduce ambiguity in the spatial frequency components, and thus sensitivity to the gradient-related breakdown is expected. However, a correlation between the RF-system breakdown and changes in the MTF values is not apparent. MTF10 values are generally more sensitive to image artefacts with high frequency components. This is seen in the AP direction, where there is a period of increased MTF10 values and standard deviation in month nine that is absent from the MTF50 values. A similar period is seen in the RL direction in month four, where an equal effect is seen in both MTF10 and MTF50 values.

The measurement of QI is based on the principle that all signal outside the actual anatomical volume results from anomalies in the scanning process. Thus, QI is likely to be sensitive to any problems with scanner hardware or patient co-operation. Of the presented quality parameters, QI showed the strongest response to the change in scanner operation after the second breakdown. QI is not a specific QA measure, since multiple hardware error mechanisms influence the value, including gradient system instability, mechanical vibration of the scanner, eddy currents and RF interference. Although no significant correlation between increased QI values and patient movement was found, it is likely that major patient movement in particular provokes increased QI values. This is likely the reason behind the small peaks seen in the QI time series.

The CNR measurement depends on the variation of the background noise and the contrast between WM and GM. In the time series, we saw a small decreasing trend in the CNR values and a clear drop after the scanner gradient breakdown. The drop after the breakdown was likely caused by increased signal or noise outside the anatomical volume, also seen as increased QI values. The variations in the CNR value likely indicate changes in the scanner’s RF system: increased noise or changes in the achieved flip angles. Furthermore, patient motion blurs the image and decreases contrast, which is presumably behind the relatively high standard deviation of the CNR values in the time series. The standard deviation could be decreased if only head volumes without motion were included.

The BI should be sensitive to inhomogeneities in the RF transmit or receive field affecting the RF excitation flip angle in the target volume. These may be induced by problems in the RF instrumentation. Additionally, the shape of the RF transmission field is affected by the anatomical shape of the patient, which may induce strong inherent variation in the measure. Also, metal implants in the patient cause strong disturbances to the field. A substantial increase in the BI value is seen in the results during the test period, but it is unclear whether it has a direct connection to either of the breakdowns. Generally, RF field inhomogeneity effects are effectively corrected by the scanner’s image intensity normalization algorithms, which may decrease the method’s sensitivity to scanner performance changes. Accessing the normalization algorithm parameters could yield interesting performance information.

The variation of all measurements depends strongly on the stability of the image volume segmentation. For image segmentation, the method relies on the SPM package, which has been utilized and tested comprehensively in multiple studies [31, 32]. If a clinical volume includes severe pathologies, the segmentation algorithm may fail, consequently affecting the measurement results. Thus, volumes with atypical segmentation results should be removed from the analysis. A simple rule based on the GM/WM volume ratio calculated from the segmentation result was used in this study. A more sophisticated set of rules would likely decrease the standard deviation but, at the same time, limit the amount of available data. The inclusion criteria for head volumes should be optimized in future work.

Additionally, the tetrahedral grid placed on a desired interface to track the direction perpendicular to the surface has to adapt to the actual acquisition resolution. The grid has to be dense enough to track the topography of the surface without reacting to voxel-sized features. Multiple grid parameter values can be tested empirically to find a stable region where small changes to the grid settings have minimal effect on the results.

In addition to physiological error sources, the scanner hardware induces inherent variation in the results. For example, the main magnetic field and the RF chain can demonstrate temporal fluctuation. One reason to use a running average over several days is to mitigate these effects along with inter-patient differences.

The presented methods could potentially be utilized as absolute measures to compare the performance of scanners from different vendors and of different sequence types. The sequence should produce regions and boundaries of reasonable contrast that can be segmented reliably. Anatomical regions other than the head could also be considered. The effect of changes in sequence parameters (e.g. protocol optimization) on image quality could be studied quantitatively. It is also possible to build online clinical image quality assurance tools that automatically detect abnormal operation of the scanner and possibly enable pre-emptive maintenance before a noticeable effect on clinical image quality appears.

Similarly, the application of online QA enables the detection of patient-related issues, e.g. movement. Optimally, this type of system could promptly suggest a rescan before the examination is over. The online detection of patient motion artefacts could significantly affect patient care and costs [33]. The effect of patient motion on the presented QA measures was studied by labeling part of the image volumes as including or not including patient motion-related artefacts. The motion artefact had a significant effect on the MTF10, MTF50 and CNR values. In the 3D FLAIR images, patient motion was found to produce blurring rather than the traditional ghosting in the phase-encoding direction. Finding optimal threshold values for patient motion detection requires further research.

The results obtained from clinical volumes were compared with phantom QA results. There were no clear trends in any of the phantom QA parameters, nor visible effects before either of the scanner breakdowns. It is possible that the standard SE sequence is not as demanding in terms of hardware as the 3D FLAIR sequence. Thus, the sensitivity of the phantom QA measurements may not be sufficient to detect the effects seen in clinical image QA. Nevertheless, phantom QA has many advantages. For example, it is difficult to evaluate a scanner’s geometric distortions from patient images.

Recently, machine learning methods have been applied for quality control purposes [34,35,36,37,38]. In the future, novel automated approaches may open interesting possibilities for detecting and labeling scanner-specific image artefacts. All in all, well-defined, robust, specific and quantitative methods are needed for general QA and sequence optimization purposes, regardless of whether they are based on machine learning or more traditional image analysis.

In this study, four methods for quantifying the image quality of clinical 3D FLAIR acquisitions were presented and applied to a large patient cohort. These can be used in QA to monitor the long-term image quality of an MRI scanner and potentially detect malfunctions before complete hardware failure. The methods can also be utilized to measure the effect of changes in sequence parameters or to assess the absolute quality of a single patient study.