Introduction

Many studies have shown that conversion to digital mammography can increase screening sensitivity [1, 2]. Another known consequence is an increased recall rate [2, 3], especially in the first period after implementation [4, 5]. This increase can be explained partially by increased visibility of microcalcifications [3, 4], but differences in the appearance of digital and analogue mammograms may also play a role.

Images acquired using a digital mammography system must be processed before they are suitable for display. Image processing converts the images so that they can be interpreted by radiologists. Because digital mammography is a relatively new technique, it is continuously being developed. Important factors like X-ray spectrum and image processing have not yet been fully optimised. As a consequence, the variations among different commercially available processing algorithms are large. When comparing image processing algorithms, one should concentrate on diagnostic accuracy rather than on the visual appeal of the images. However, because easy and objective methods for measuring processed image quality are lacking, we often have to rely on experts' impressions of image appearance to rate image processing [6–8].

Originally, developments in image processing were mainly driven by the need to decrease the image dynamic range, because the dynamic range of softcopy reading stations was much smaller than that of films displayed on a light box [9, 10]. Although the dynamic range of modern display stations is increasing rapidly, current image processing algorithms still aim for a maximum (optimal) local contrast while decreasing the total image dynamic range. Such contrast optimisation techniques can have a large impact on the appearance of images. These techniques are aimed at increasing the diagnostic accuracy, although they could also influence the perceived suspiciousness of healthy breast tissue. In addition, some studies have shown that differences in image processing can influence both sensitivity and specificity [11–14].

The purpose of this study is to determine the influence that local contrast optimisation has on diagnostic accuracy and the perceived suspiciousness of digital screening mammograms.

Materials and methods

Study dataset

The data for this study were collected from a screening region in the eastern part of the Netherlands from April 2007 up to November 2007. This screening region had converted to digital mammography several months before this period. After digitisation, a temporary increase in recall rate comparable to those described in recent studies [4, 5] was observed. Although the recall rate was not as high as during the first month after conversion (6%), an increased recall rate (3–4%) was observed during the full study period compared with the recall rate for analogue screening (2%). The recall rate dropped back further (2–3%) after the study period.

The dataset for this study contains all recalled cases in the study period for which digitised mammograms of the previous screening round were available (153 studies). Of these cases, 43 were biopsy-confirmed true-positives (TP) and 110 were negative (FP). For each negative, the last non-recalled case that was acquired before it and for which the previous mammograms were also present was added to the dataset. This last group of cases is referred to as the normal cases (N). In total, the dataset contained 263 cases (43 TP, 110 FP and 110 N). The age range of women in the study was 51–86, and the median age was 60. Approval of the institutional review board was not required. Informed consent was obtained from the participants and all cases were anonymised.

All cases were acquired using the General Electric (GE) Senographe Essential (GE Medical Systems, Buc, France) digital mammography system. GE provides two processing algorithms for this system: the standard processing algorithm Tissue Equalisation (TE) and the local contrast optimisation algorithm Premium View (PV), which can be applied as an additional processing step after TE.

All previous mammograms had been routinely digitised using an Array 2905 Laser Film Digitizer (Array Corporation Europe, Roden, the Netherlands) at a resolution of 100 μm, because an earlier study had shown this to be sufficient for comparison of previous mammograms [15]. For views consisting of multiple images (mosaics), only the image containing the largest part of the breast was digitised.

Postprocessing methods

Tissue Equalisation (TE) is a standard General Electric application that corrects for low frequency variations resulting from under- and over-penetration of X-rays (with the latter occurring for example at the breast edge). As a result the image dynamic range is reduced, enabling improved softcopy image display.

Premium View (PV) has been designed more recently to further improve the quality of the information presented to the radiologist for diagnosis as well as the reading speed by optimising the local contrast in breast structures. In short, PV works as follows [16]: low-frequency structures (i.e. large-scale structures) are obtained from the original image by low-pass filtering. High frequency structures (i.e. small-scale structures) are obtained by subtracting the low-pass filtered image from the original image. These low and high frequency images are both processed and weighted individually and then added together. The resulting image exhibits reduced contrast between different tissue types, but enhanced contrast of small scale anatomical architecture.
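The frequency decomposition described above can be sketched on a one-dimensional intensity profile. This is an illustration only: the moving-average filter, the function names and the weights are assumptions made for this sketch, and the actual PV processing [16] is not specified at this level of detail here.

```python
def moving_average(signal, radius):
    """Low-pass filter: mean over a window of up to 2*radius+1 samples.

    The window is clipped at the ends of the signal (an assumption of this sketch).
    """
    n = len(signal)
    smoothed = []
    for i in range(n):
        lo = max(0, i - radius)
        hi = min(n, i + radius + 1)
        window = signal[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def local_contrast_sketch(signal, radius=2, low_weight=0.5, high_weight=2.0):
    """Split a profile into low and high frequencies, weight them, and recombine.

    low_weight < 1 compresses large-scale differences (for example between
    tissue types); high_weight > 1 emphasises small-scale structure.
    Both weights are hypothetical values chosen for illustration.
    """
    low = moving_average(signal, radius)              # large-scale structures
    high = [s - l for s, l in zip(signal, low)]       # small-scale structures
    return [low_weight * l + high_weight * h for l, h in zip(low, high)]


if __name__ == "__main__":
    profile = [10.0, 10.0, 12.0, 10.0, 10.0, 30.0, 30.0, 32.0, 30.0, 30.0]
    print(local_contrast_sketch(profile))
```

With both weights set to 1, the output reproduces the input profile, which shows that the low and high frequency components together contain all of the original information.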

Observer study

Six screening radiologists read two versions of the study set processed with the algorithms TE and PV. All radiologists were familiar with the use of both types of post-processing due to their participation in activities at the national training center for breast cancer screening. Two of them used these types of processing in their daily practice. The two versions of the 263 cases were grouped in ten sessions: 5 sessions with TE processing and 5 sessions with PV; each session with 52 to 53 cases; for each TE session, there was a related PV session containing the same cases. The order of the cases within the sessions was randomised. All cases within a session were processed using the same algorithm. The time between reading two sessions with the same cases was at least one month. Digitised previous mammograms were available at each session. The sets were read independently by each radiologist. Radiologist experience varied and is summarised in Table 1.

Table 1 Radiologist experience at study initiation

The studies were displayed on Hologic SecurView DX diagnostic workstations (Hologic Inc., Danbury, CT, USA). All radiologists were familiar with this system before the study. The radiologists were allowed to use all viewing functionality (e.g. zooming, panning, inverting, adjusting brightness and contrast, hanging protocols) that is normally used while screening.

Radiologists were asked to use a low threshold for reporting lesions and could report up to three findings for each case on a printed form. For each finding the radiologist assigned a suspiciousness score by marking a point on a 10-cm strongly non-linear Visual Analog Scale (VAS) (Fig. 1). The scores were measured automatically after digitising the forms. Case suspiciousness was calculated as the maximum suspiciousness of all findings by a radiologist within a case. This study examines the impression radiologists get of the suspiciousness of cases when these are presented in different ways, while the raw data on which these presentations were based were identical. To emphasise this, case suspiciousness is referred to as perceived case suspiciousness in this study.
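The calculation of case suspiciousness as the maximum over a radiologist's findings can be sketched as follows. The function name is illustrative, and the score of 0 for a case without reported findings is an assumption of this sketch, not something stated in the text.

```python
def case_suspiciousness(finding_scores):
    """Return the maximum suspiciousness over the findings of one case.

    finding_scores holds the VAS scores (up to three per case) reported by
    one radiologist. A case without findings is scored 0.0 here, which is
    an assumption made for this sketch.
    """
    return max(finding_scores, default=0.0)


if __name__ == "__main__":
    print(case_suspiciousness([12.0, 55.5, 3.0]))  # three findings in one case
    print(case_suspiciousness([]))                 # no findings reported
```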

Fig. 1

Visual analogue scale used for scoring suspiciousness of individual findings within each case

Statistical analysis

For each combination of radiologist and processing algorithm, the diagnostic accuracy was measured as the area (A_z) under the maximum likelihood estimated binormal ROC curve [17, 18] based on the suspiciousness score using DBM MRMC (University of Chicago and University of Iowa, version 2.2, June 2008). Significance of the average difference in A_z between both algorithms was tested with the Dorfman-Berbaum-Metz method [19, 20] treating both readers and cases as random samples. The P value was tested against a significance threshold of 0.05.
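To make the notion of area under the ROC curve concrete, the sketch below computes the empirical (non-parametric) area from suspiciousness scores. Note that this is a simpler, swapped-in estimator and not the maximum likelihood binormal fit performed by DBM MRMC; the function name and the example scores are hypothetical.

```python
def empirical_auc(positive_scores, negative_scores):
    """Empirical area under the ROC curve (Mann-Whitney statistic).

    Equals the fraction of (positive, negative) case pairs in which the
    positive case has the higher score; ties count as half a pair.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


if __name__ == "__main__":
    # Hypothetical scores for malignant (TP) and non-malignant (FP + N) cases.
    malignant = [60.0, 30.0]
    non_malignant = [40.0, 20.0]
    print(empirical_auc(malignant, non_malignant))
```

Because the area depends only on the ranking of the scores, it is unaffected by how an individual radiologist maps suspicion onto positions along the scale.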

The exact interpretation of a VAS by individual radiologists is unknown. Therefore, only the order of the suspiciousness scores for each individual radiologist is relevant for analysis; the actual values along the VAS are not. Differences in perceived case suspiciousness were therefore analysed with two-tailed paired sample sign tests using SPSS (version 16.0.1, November 2007; SPSS, Chicago, IL, USA). The P values were tested against a significance threshold of 0.0083 (0.05/6) to compensate for applying the tests for six radiologists separately, according to the Bonferroni method.
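A paired sign test with a Bonferroni-corrected threshold can be sketched as below. This is a minimal exact-binomial version under the assumption that tied pairs are discarded; it is not necessarily the computation SPSS performs (which may use a normal approximation for larger samples), and the function and variable names are illustrative.

```python
from math import comb

# Bonferroni correction for six separate per-radiologist tests.
BONFERRONI_THRESHOLD = 0.05 / 6


def sign_test_p(te_scores, pv_scores):
    """Two-tailed exact sign test on paired scores.

    Only the sign of each paired difference is used, so the result depends
    on the ordering of the scores and not on their actual values.
    Tied pairs are dropped (an assumption of this sketch).
    """
    n_pos = sum(1 for t, p in zip(te_scores, pv_scores) if p > t)
    n_neg = sum(1 for t, p in zip(te_scores, pv_scores) if p < t)
    n = n_pos + n_neg
    if n == 0:
        return 1.0
    k = min(n_pos, n_neg)
    # Probability of k or fewer of the rarer sign under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical case scores for one radiologist: PV higher in every pair.
    te = [5.0, 10.0, 0.0, 20.0, 15.0, 3.0, 8.0, 12.0, 1.0, 30.0]
    pv = [9.0, 14.0, 2.0, 25.0, 18.0, 6.0, 9.0, 20.0, 4.0, 35.0]
    p_value = sign_test_p(te, pv)
    print(p_value, p_value < BONFERRONI_THRESHOLD)
```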

Results

The six radiologists reported 1,565 findings in total for the TE cases and 1,683 for the PV cases. This corresponds to an average of 0.99 and 1.07 findings per case per radiologist respectively. An example of a finding in a normal case that was marked by four radiologists when using PV but by none when using TE is shown in Fig. 2. Suspiciousness scores for this particular finding varied from 0.9% to 39%.

Fig. 2

Example of a finding in a left-sided mediolateral oblique view, reported by four radiologists when using Premium View (PV) only. a Digitised prior. b Tissue equalisation (TE) processed image. c PV processed image with the annotation. d Image resulting from subtracting (TE) from (PV). e Thresholded version of (d). White areas indicate that pixels in the PV image have relatively higher intensity than the related pixels in the TE image, whereas black areas indicate the opposite. It shows that in PV images low frequency trends are suppressed (no noticeable signal decrease in the breast edge in PV compared with TE) whereas higher frequency structures are emphasised (e.g. glandular structures)

Table 2 lists the diagnostic accuracies for the individual radiologists with both processing algorithms. The difference between the mean A_z values for the two algorithms was not significant (TE: 0.909, PV: 0.917, P = 0.46).

Table 2 Diagnostic accuracy scores (A_z) for the ROC analysis

Table 3 lists the results for the sign tests. For all radiologists, the perceived case suspiciousness for the full dataset was higher when using the PV algorithm. For four out of six radiologists this difference was significant. The table also indicates the results for the TP, FP and N subgroups and all negative cases (FP + N). The perceived case suspiciousness was higher with PV than with TE for nearly all combinations of radiologists and subgroups. The only exception was radiologist 5, who rated the FP cases slightly higher with TE. Because of the small numbers of cases in the subgroups, most of the corresponding P values are above the significance threshold.

Table 3 Comparison of perceived case suspiciousness

We assume a simple model in which cases are recalled when they contain a finding that exceeds a certain suspiciousness threshold. At a given threshold, the recall rate can be computed for both processing algorithms. In Fig. 3a the recall rates for TE and PV are compared for every possible recall threshold. The dataset for this study was an enriched set, in which 58% of the cases ((43 TP + 110 FP) / 263 cases) were originally recalled. Our dataset contains all recalled cases from the data collection period, and the recall rate during this period was up to three times the pre-digitisation recall rate. Before digitisation, as few as 19% (58% / 3) of the cases in the dataset might have been recalled. Figure 3b is an excerpt of Fig. 3a showing only this relevant range from 19% (bottom left) to 58% (upper right). For practically every recall threshold in this range, the calculated recall rate is higher for PV than for TE.
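The threshold model and the enrichment arithmetic can be sketched as follows. The case counts come from the study dataset, but the function name and the example scores are hypothetical and serve only to illustrate the computation.

```python
def recall_rate(case_scores, threshold):
    """Fraction of cases whose perceived suspiciousness exceeds the threshold."""
    recalled = sum(1 for score in case_scores if score > threshold)
    return recalled / len(case_scores)


# Enrichment of the study set: originally recalled cases over all cases.
enriched_fraction = (43 + 110) / 263          # about 0.58
# Recall during the study was up to three times the pre-digitisation rate.
pre_digital_fraction = enriched_fraction / 3  # about 0.19

if __name__ == "__main__":
    # Hypothetical perceived case suspiciousness for the same four cases.
    te_scores = [5.0, 20.0, 35.0, 70.0]
    pv_scores = [10.0, 30.0, 55.0, 80.0]
    threshold = 50.0
    print(recall_rate(te_scores, threshold), recall_rate(pv_scores, threshold))
    print(round(enriched_fraction, 2), round(pre_digital_fraction, 2))
```

Sweeping the threshold over all possible values and plotting the two resulting recall rates against each other gives a comparison of the kind shown in Fig. 3a.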

Fig. 3

a Recall rates for equal suspiciousness thresholds with TE and PV. b Excerpt of (a)

Discussion

We evaluated two commercially available image processing algorithms by comparing diagnostic accuracy and perceived case suspiciousness. The diagnostic accuracy was not significantly different. For all case types, the perceived case suspiciousness averaged over all observers was higher when using PV.

The major difference between the processing algorithms used in our study is an additional local contrast optimisation step when PV is applied. PV is aimed at increasing the visibility and suspiciousness of malignant lesions, but in our study the perceived suspiciousness of benign lesions and normal cases increased as well. An effect of local contrast enhancement could be that both normal (dense) structures and abnormal structures appear more suspicious due to their enhanced signal. An additional aspect may be the decreased similarity of the PV images to the digitised previous mammograms. Comparison of current and previous mammograms is very important for breast cancer screening, especially for discerning growing lesions from benign findings already present in the previous mammograms [21]. Preference studies using only malignant lesions may conclude that high contrast images are preferable because of the increased visibility of the lesions, while missing the effect that the algorithm could have on normal cases. In our study the perceived suspiciousness of the normal cases increased even more than that of the malignant cases. Even when diagnostic accuracy is not influenced by the choice of image processing, the image processing may still influence the recall rate. Earlier studies have shown an increase in recall rate during the first months after converting to digital mammography [4, 5]. It was proposed that this temporary increase could have been caused by a learning effect and/or by the previous mammograms being film-screen.

Comparability of current and previous mammograms is not only an issue when converting from analogue to digital mammography. In a recent study, an increase in recall rate was found in a clinical setting after switching from TE to PV [16]. The increase was explained as a training effect, but the necessity of switching off the contrast optimisation for better similarity to archived comparison mammography was also recognised. Future studies should therefore investigate the influence of both the learning effect and the degree of similarity with previous mammograms on diagnosis with respect to the introduction of new postprocessing methods.

In conclusion, this study examines just two out of many possible combinations of appearances of current and previous mammograms. For manufacturers of digital mammography systems, image appearance has become an important means of distinguishing themselves from each other. Previous studies have suggested that algorithms using contrast enhancement techniques may improve diagnostic accuracy [16, 22, 23]. This effect is not convincingly present in our study. Our study suggests that the introduction of new image processing algorithms is likely to influence the recall rate because of changes in perceived case suspiciousness, while diagnostic accuracy may be similar.