Gene expression microarray technology continues to evolve and its use has expanded into all areas of biology. However, the high dimensionality of the data makes analysis a difficult challenge. Evaluating measurements and estimating the significance of the observed differences among samples remain important issues that must be addressed for each technology platform. In this work we use a consecutive sampling method to characterize the dispersion patterns of data generated from Illumina fiberoptic bead-based oligonucleotide arrays.
To describe general properties of the dispersion we used a linear function SD = a + bYmean, approximating the standard deviation across arrays (Ymean is the mean expression of a given consecutive sample). First we examined three levels of variability: 1) same cell culture, same reverse transcription, duplicate hybridizations; 2) same cell culture, reverse transcription replicates; 3) parallel cultures. Each higher level is expected to introduce a new source of variability. We observed minor differences in the constant term: the mean values are 3.5, 3.1 and 3.5, respectively. However, the mean coefficient b increased from 0.045 to 0.147 and 0.133. We compared the coefficients derived from the consecutive sampling to those obtained from the standard deviation of individual gene expressions and found them in good agreement. In the second experiment samples we detected 11 genes with systematically different expressions between the experiment samples treated with glucose oxidase and controls and corroborated the selection using the Mann-Whitney and other tests. We also compared the consecutive sampling and coincidence method to t-test: the average percentage of consistency was above 80 for the former and below 50 for the latter.
Our results indicate that the consecutive sampling method and standard deviation function provide a convenient description of the overall dispersion of Illumina arrays. We observed that the constant term of the standard deviation function is at average approximately the same for duplicate hybridization as for the assays with additional sources of variability. Furthermore, among the genes affected by glucose oxidase treatment we identified 6 genes in oxidative stress pathways and 5 genes involved in DNA repair. Finally, we noted that the consecutive sampling and coincidence test provide, under given conditions, more consistent results than the t-test.
This article was reviewed by Alexander Karpikov (nominated by MarkGerstein), Jordan King and Eugene V. Koonin.
Open peer review
Reviewed by Alexander Karpikov (nominated by Mark Gerstein), Eugene V. Koonin, and Jordan King. For the full reviews, please go to the Reviewers' comments section.
The usefulness of DNA microarray technology in the exploration of gene expression profiles can hardly be overstated. Along with the dramatic increase in microarray publications (a 2.5-fold increase per year since 1997, to >3,000 in 2004) and a broadening in the scope of applications, the methods of analysis of microarray data have grown in variety and sophistication, from simple fold-difference criteria to complex Bayesian procedures and clustering techniques [1-9]. In spite of these advances, evaluating variation and estimating the significance of the observed differences in recorded signals remain a difficult challenge. Existing methods provide various approximations of reality, balancing Type I against Type II error, but none can be considered ideal under all conditions. This is mainly due to the inherent complexity of the problem, but is sometimes due to the use of oversimplified conditions. Mehta et al.  offered an interesting overview of the subject, including the discussion of misconceptions about generality and applicability of some approaches.
Quantities that are taken as a measure of gene expression are affected by number of processes that contribute to variation, resulting in the random and/or pseudorandom component of the signal. Such variation may be separable into the "technical component," caused by the technical factors, such as variability of experimental protocols, autofluorescence and backscatter, laser-molecule interactions, photomultiplier noise etc., and the "sampling" or "physiological component," which depends mainly on the variability caused by the differences between the samples, e.g. differences in the biological state or purity of the sample composition. Distinguishing if a given gene expression intensity value is greater than the background noise or is different between two samples are fundamental issues in microarray analysis.
For the single-color Affymetrix arrays we have two groups of methods aiming at separation of the true signal from the random components: "low level" and "high level." The former approach deals with the fluorescence signals of each individual probe and includes background correction, adjustment for the nonspecific signal and expression summary that yields an approximation of RNA abundance or "gene expression," the latter takes the gene expression as an elementary variable . Low-level analysis can be used only when a relatively large number (say 8 or more) of probes or probe pairs per probe set is available. Moreover, the standard methods, such as dChip [12,13] or RMA [14-16], are not applicable if only duplicates are available and not quite reliable for triplicates (URL address for the RMAExpress is http://stat.berkeley.edu/~bolstad/RMAExpress/RMAExpress.html webcite).
The high level analysis consists of two basic steps: normalization and statistical evaluation of the observed differences. One approach to normalization relies on the "reference genes" (e.g. [17-20]), but genes providing "ubiquitous reference" are hard to find  and they require an additional experimental effort. The other calculates normalization coefficients from the expression values. In case of linear dependence between the measured signal and RNA abundance and balanced over- and under-expressed values, the global normalization is suitable. In case of nonlinearity, LOWESS  or other appropriate correction has to be employed . Statistical significance of the observations is often estimated using standard parametric tests, such as the t-test or ANOVA. However, a certain percentage of the frequency distributions always deviates from the normal distribution and in multiple comparisons of thousands of gene expressions this can lead to a substantial error. Furthermore, number of replicates is usually small and estimated variances often differ largely from the true value. Novak et al. suggested characterization of dispersion patterns of Affymetrix arrays with the method of consecutive sampling , which uses groups of genes with close mean expressions to estimate the standard deviations; similar approach was independently proposed by Baldi and Long  and Kamb and Ramaswami . Two component model including the constant and proportional terms of the standard deviation was introduced by Rocke and Lorenzato  in the context of analytical chemistry and later applied to cDNA and oligonucleotide microarrays ; see also [28,29]. Choe et al.  compared performance of the t-test, modified t-test developed by Tusher et al.  and method of Baldi and Long  and concluded that the last method showed, under given conditions, superior performance. Some other approaches were also suggested and tested. For example, Troyanskaya et al. examined three nonparametric methods, Durbin et al.  proposed a variance-stabilizing transformation and Bilke et al.  used Bayesian approach. Among other publications, the paper by McClinick et al. , e.g., deals with reproducibility of microarray data, Kooperberg et al.  compared several statistical methods and Jarvinen et al.  different microarray platforms.
Many new microarray-based platforms are available and some, which allow parallel analysis of many samples, may be suitable for high throughput analysis. Here we utilized the Illumina GEX Sentrix™ Array Matrix (SAM) system to evaluate gene expression for 632 genes in 96-well format. Our first aim is to characterize expression data and assess various sources of dispersion. We describe the data obtained from replicate hybridizations, reverse transcription reactions, and biological cultures and evaluate the frequency distributions. Subsequently, we compare the dispersion patterns, and assess the contribution of each additional process to variability of data. The second aim is to study systematic differences in gene expression values in the control cell cultures and cell cultures subjected to a particular treatment. We analyze the data from a cell line subjected to a continuous low-dose oxidative stress exposure (~10 μm H2O2) for 24 hrs. In our analysis we use the consecutive sampling method , which quantifies dispersion between two samples by ranking the probe sets according to the mean signal intensity, grouping them in bins containing k consecutive gene pairs, and calculating standard deviations from the difference of expressions (in this study k = 12). We search for the best candidate genes affected by the treatment among the differentially expressed genes, using the consecutive sampling and coincidence test. The results are compared to the t-test and Wilcoxon (Mann and Whitney) nonparametric test. In addition, we examine consistency of the results obtained by the coincidence test and compare to the t-test on normalized data, log-transformed data and data subjected to the variance stabilization transformation, to the method of analysis by Tusher et al.  and to Baldi and Long  CyberT method.
Illumina GEX Sentrix™ Array Matrix utilizes oligonucleotides attached to 3 micron beads immobilized on fiber optic bundles. Each oligonucleotide probe is represented on ~30 beads per array, allowing averaging of many signals for the same oligonucleotide probe and a consequent reduction in signal variation. Each gene typically has two probes and the intensity signal for a given gene (or probe set), is the average of the available probe signals. The consecutive sampling approach was used to characterize the dispersion patterns of the gene expression data obtained from oligonucleotide probes for 632 genes on three groups of samples: 1) RNA samples from five parallel cultures of lymphoblast cell line GM10469 were reverse transcribed and each has a hybridization duplicate (Samples C1–C5); 2) Pooled reference RNA sample has three reverse transcription/cRNA replicates (PR1, PR2, PR3, each with a duplicate hybridization); 3) Parallel cultures of lymphoblast cell line GM12831 were grown either untreated or in the presence of 1 mUnit/ml glucose oxidase (which generated a continuous dose of ~10 μmole H2O2). The RNA from each of these 4 samples (GN1, GN2, GO1, GO2) was reverse transcribed and hybridized to arrays in duplicate. The data from these hybridizations allows evaluation of variation due to hybridization, reverse transcription, and parallel biological cultures. We used a coincidence test to identify differentially expressed genes in treated samples and compared these results with several other approaches.
Normalization and frequency distributions
Prior to the data analysis we calculated the gene expression per probe set by averaging the signals of the available probes. Subsequently, all the datasets were normalized to 100% of the total mean expression across the array. Dispersion plots and running mean plots of the pair-wise comparisons showed, in most cases, deviations from the 45° line at the low expression end; the additive correction constants range from about -5 to +2 (normalized values). Additional files 1 to 4 illustrate the effect of correction on a particular example of the dispersion plots for paired hybridization replicates C5a and C5b (1: Dispersion patters before normalization, 2: Running mean before normalization, 3: Dispersion pattern after normalization, 4: Running mean after normalization). In some cases we noted a saturation effect, which was corrected by power functions.
Format: DOC Size: 20KB Download file
This file can be viewed with: Microsoft Word Viewer
Format: DOC Size: 18KB Download file
This file can be viewed with: Microsoft Word Viewer
Format: DOC Size: 21KB Download file
This file can be viewed with: Microsoft Word Viewer
Characterization of variability: frequency distributions and standard deviation functions
We examined properties of our data by sampling the combined expression values for the first replicate hybridization comprising five parallel cultures of the cell line GM10469 (C1a, C2a, C3a, C4a, C5a) and comparing them to the normal frequency distribution. Figures 1 and 2 show quantile-quantile (Q-Q) plots of the observed values versus the corresponding inverse normal distribution at the low-end of mean intensities, from -2 to 0 and from 0 to 2, respectively. As expected, the distribution has the same character at the positive and negative side of zero. Only about three outlying points are noted in each figure. Figure 3 then shows Q-Q plot of the relative expressions (measured expressions divided by the mean of five samples) in the range of mean expressions from 117 to the maximum of 5432. Here about 396 out of 415 points lie very close to the normal reference, while the remaining 19, corresponding to about 4.5%, deviate from the diagonal. Similar results are obtained when using the set C1b, C2b, C3b, C4b and C5b. Verification of the normality is a quality check, incorporated into the consecutive sampling program (see Methods). For example, in the consecutive sampling of duplicate hybridizations and biological culture replicates, an average 6.8% and 5.0% of the samples failed the Kolmogorov-Smirnov test at the level of 0.05.
Figure 1. Quantile-quantile plot of the frequency distribution. Comparison of the observed expressions with the corresponding inverse normal distribution, combined samples C1a, C2a, C3a, C4a, C5a: range of average expressions from -2 to 0.
Figure 2. Quantile-quantile plot of the frequency distribution. Comparison of the observed expressions with the corresponding inverse normal distribution, combined samples C1a, C2a, C3a, C4a, C5a: range of average expressions from 0 to 2.
Figure 3. Quantile-quantile plot of the frequency distribution. Comparison of the observed expressions with the corresponding inverse normal distribution, combined samples C1a, C2a, C3a, C4a, C5a: range of average expressions from 117 to 5432; figure shows the relative values (expressions divided by the mean of five arrays).
Conformity to the normal distribution is an important property. For a normally distributed population the quality of a given population sample can be assessed by comparing the sample frequency distribution to the normal distribution function. Analysis of frequency distribution of the observations also provides information about the character of random processes. In our particular case we note that in the low range the distribution function of expression values agrees well with the normal distribution, while in the high range the distribution of relative expressions is close to the normal. On the other hand, distribution of the relative expressions in the low-expression region and of the observed values in the high range deviate substantially from the normal form (see Additional files 5 and 6: Quantile-quantile plot of the frequency distribution). This corroborates the proposition that the standard deviation of the random variability consists of two components: a constant term and a directly proportional term, as suggested (e.g. Novak et al. ). It follows further that the standard deviation can be well represented by a linear characteristic function with a constant term. It is also important to note, that the negative observations are meaningful and the observations are symmetrically distributed around zero; the standard deviation of the statistical samples in near-zero region provides the best approximation of the constant term of the standard deviation function. Finally, the normal distribution is a necessary condition for application of the parametric methods, although the normality assumption is rarely, if ever, verified (Pavelka et al.  is a particular exception).
Additional file 5. Supplemental Figure S5, quantile-quantile plot of the frequency distribution. Comparison of the observed expressions with the corresponding inverse normal distribution, combined samples C1a, C2a, C3a, C4a, C5a: range of average expressions from 0.1 to 2.0; figure shows the relative values (expressions divided by the mean of five arrays).
Format: DOC Size: 19KB Download file
This file can be viewed with: Microsoft Word Viewer
Additional file 6. Supplemental Figure S6, quantile-quantile plot of the frequency distribution. Comparison of the observed expressions with the corresponding inverse normal distribution, combined samples C1a, C2a, C3a, C4a, C5a: range of average expressions from 117 to 5432.
Format: DOC Size: 20KB Download file
This file can be viewed with: Microsoft Word Viewer
Variation between duplicate hybridizations
We examined the dispersion in Illumina array data obtained from duplicate hybridizations of RNA samples extracted from 5 parallel cultures of the cell line GM10469 (pairs: C1a vs. C1b, C2a vs. C2b, C3a vs. C3b, C4a vs. C4b, C5a vs. C5b). For each pair we determined the standard deviation of the consecutive samples and fitted the calculated values to the characteristic function
SD = a + b Ymean, (1)
at the logarithmic scale; here Ym is the sample mean (see the Method section, Method of analysis). The coefficient values calculated for the individual pairs and the mean of five pairs are shown in Table 1 (Hybridization 1 vs. hybridization 2) and a typical low dispersion pair C1a vs. C1b is plotted in Figure 4; Figure 5 shows the experimental standard deviations and the regression curve, representing the characteristic standard deviation function (1). For calculation of the characteristic function we exclude the top 10 samples to keep the variability of mean expression within the samples small. Furthermore, at the low end the expression range is limited by requirement that the mean value must be positive.
Figure 4. Example of dispersion in the duplicate hybridization assay. Dispersion plot of the pair C1a versus C1b. Figure shows the experimental points and boundaries of the 0.9 probability interval.
Figure 5. Example of dispersion in the duplicate hybridization assay. Standard deviation calculated from the consecutive samples and the regression curve representing the standard deviation function.
Table 1. Table of the mean coefficients of standard deviation characteristic function. The table shows the average values for parallel hybridization, parallel biological cultures and pooled reference sample. The coefficients a and b are the coefficients of the standard deviation function; CV stands for the coefficient of variation.
To estimate the effect of the samples size we also evaluated the standard deviation function using k = 24. Since on given Illumina arrays the number of probe sets is relatively small, we evaluated the coefficients of standard deviation function in comparisons of the combined samples C1a to C4a versus C1b to C4b; for k = 24 we obtained a = 3.66 and b = 0.044, about 4.4% and 3.4% above the values a = 3.51 and b = 0.043 obtained for k = 12 and shown in Table 1. Since the spread of mean expression values is larger in larger samples, we expect higher dispersion. Furthermore, the intercept and the coefficient of proportionality obtained from the combined series (k = 12) are just about 1 and 8 percent below the means of the individual pair-wise comparisons 3.54 and 0.047, respectively (Table 1). Again, given that variance of the consecutive samples is enhanced by spread of the mean expression values, we expected smaller coefficients of the standard deviation function in combined comparison, which has higher density of the mean expression values. The coefficients of variation for a and b are 0.14 and 0.35, respectively. Of note, in this group the values of the intercept a are quite similar across the five-pair set; however, the coefficient of proportionality b for the pair C5a versus C5b is 0.074, about double of the mean of remaining pairs (0.040).
We also looked at the duplicate hybridizations from three independent reverse transcription reactions of a Pooled Reference RNA sample created in our laboratory (pairs PR1a vs. PR1b, PR2a vs. PR2b, PR3a vs. PR3b). The constant term a ranged from 2.12 to 4.86 (mean a = 3.50) and the proportionality factor b ranged from undetectable to 0.082; PR1 and PR2 samples showed quite low variation for both a and b (typical of other pairs), while the PR3 sample displayed a relatively high constant term a = 4.86 with a flat slope value b ≈ -0.003 (Table 1, Hybridization variation). Figures 6 to 9 show the dispersion plots and standard deviations for the pairs PR1a, b and PR3a, b; the pair PR2 is similar to PR1 (not shown). The plot of the PR3 pair indicates that there is very little change in the SD when the mean intensity increases. It is unknown what might have caused this effect but it may be due to some technical difference in the way these samples were handled.
Figure 6. Example of dispersion in the pooled RNA reference assay: pair PR1a versus PR1b. Experimental points and boundaries of the 0.9 probability interval.
Figure 7. Example of dispersion in the pooled RNA reference assay: pair PR1a versus PR1b. Standard deviation calculated from the consecutive samples and the regression curve representing the standard deviation function.
Figure 8. Example of dispersion in the pooled RNA reference assay: pair PR3a versus PR3b. Experimental points and boundaries of the 0.9 probability interval.
Figure 9. Example of dispersion in the pooled RNA reference assay: pair PR3a versus PR3b. Standard deviation calculated from the consecutive samples and the regression curve representing the standard deviation function.
Hybridization and transcription variation
Variation in the data sets obtained from the replicate analysis of the three independent aliquots of the Pooled Reference RNA sample should be affected by both hybridization and reverse transcription. Analyzing dispersion for all combinations of PR1a, PR2a, and PR3a (Series PRa), and PR1b, PR2b and PR3b (Series PRb) we observe that mean values for the constant a (Series PRa, a = 3.55, Series PRb, a = 3.73) are similar to the mean values from duplicate comparisons. However, the proportionality coefficients (b) are 3–4 fold greater (0.12 and 0.15 versus 0.04) than for duplicate hybridizations (Table 1, Transcription and hybridization variation). Thus, while these values are still low, the reverse transcription reaction clearly introduces meaningful increases in variation.
Analysis of the parallel biological cultures comprised all pair-wise combinations of the data from the 1st hybridization (Series Ca, 5 parallel cultures; C1a-C5a) and all pair-wise combinations of the data from the 2nd hybridization (Series Cb). Mean values of the intercept a were 3.86 and 3.08, close to the means obtained for the between-duplicate hybridization comparisons. However, the means of the coefficients of proportionality b = 0.130 and 0.136 are about three-fold larger than the mean of the between-hybridization coefficients (Table 1, Biological replicates, Series Ca, all pair-wise hyb. and Series Cb, all pair-wise hyb., respectively). Thus, as previously observed for Affymetrix arrays , the intercept appears to reflect a measure of technical variability, associated with variability of technological processes (e.g. efficiency of hybridization and labeling, fluctuation of luminescence) or with some features of the array or array reader instrument (e.g. backscatter, scanning and light detection), and exhibits a certain degree of independence of the sample origin. On the other hand, the proportionality coefficient reflects mainly differences in sample origin and composition
The consecutive sampling analysis used throughout this study derives the standard deviation function from the difference of expressions of pairs of ranked genes. Having five replicates for each series gives us an opportunity to verify whether the standard deviation function obtained from the consecutive samples agrees with the function derived from individual genes. Values of the coefficients obtained from the five individual arrays Ca are a = 3.39 and b = 0.120 (Table 1, Biological replicates, Series Ca, based on individual genes); this is about -12.0% and -7.4% below the consecutive sampling averages, respectively. For the series Cb we get from the individual genes a = 3.00 and b = 0.116 (Table 1, Series Cb); the differences are -2.6% and -14.8%. The mean differences of the coefficients a and b for both series are then -7.3% and -11.1%, respectively. Since the probe sets in consecutive samples belong to populations with different, albeit very similar, means, the higher values of the coefficients obtained from the consecutive sampling are to be expected.
Analysis of the glucose oxidase treated samples
Biological replicates of cell line GM12831 were either untreated (GN1, GN2) or treated with glucose oxidase (GO1, GO2). This low-dose oxidant exposure represents a physiological level of oxidative stress with no apparent induction of toxicity to the cells. After 24 hrs cells were harvested and RNA was extracted. Each of these four samples was hybridized in duplicate to Illumina fiber optic bead arrays (GN1a, GN1b, etc.).
In order to evaluate the quality and consistency of the biological and hybridization replicates, we first performed pair-wise dispersion analysis of the same-type samples (untreated vs. untreated and treated vs. treated). The mean values aavg = 2.7, 2.3 and bavg = 0.059 and 0.062 are very close to the means obtained from duplicate hybridization assays, although in this case the means also include comparisons across biological replicate cultures. For the glucose oxidase treated versus untreated comparisons, we observe that the average constant component aavg = 2.4, is quite similar to values observed for other lymphoblast culture replicates but the proportionality coefficient bincreases about 2-fold to 0.120 (Table 2; for the complete data see 7: Supplemental Table S1 and 8: Supplemental Table S2). To assess how closely the characteristic standard deviation function correlates with the standard deviation values across the range of expressions we determined the correlation coefficient R-square and the standard errors of the coefficients a and b for seven particular cases: individual genes, series Ca and Cb, pair-wise comparison of the combined data Ca versus Cb and pairs GN1a-GO1a, GN1b-GO1b, GN2a-GO2a and GN2b-GO2b. The mean R-square was 0.78, the mean standard errors of the coefficients a and b were 4.3% and 8.9%, respectively, and the probability that the coefficients are zero was less than 0.001 (SigmaStat software). Table 2 also shows the Kα coefficients that bound the interval containing 90% of values (se the Methods section, Method of analysis). Boundaries of the probability intervals are useful in searching for genes with significantly different expression.
Additional file 7. Supplemental Table S1, same-type sample dispersion parameters for the glucose oxidase treatment assay. Coefficients of the standard deviation function a and b and Kα coefficient corresponding to 0.9 probability interval; nt and tr stand for "untreated" and "treated," respectively.
Format: DOC Size: 32KB Download file
This file can be viewed with: Microsoft Word Viewer
Additional file 8. Supplemental Table S2, dispersion parameters for case/control comparisons in the glucose oxidase treatment assay. Coefficients of the standard deviation function a and b and Kα coefficient corresponding to 0.9 probability interval; nt and tr stand for "un-treated" and "treated," respectively.
Format: DOC Size: 33KB Download file
This file can be viewed with: Microsoft Word Viewer
Table 2. Dispersion parameters for case/control comparisons. First two data columns show the coefficients of standard deviation function a and b. The last column labeled Kα shows the coefficient defining boundaries of the probability interval 0.9; CV is the coefficient of variation.
Differences between the treated and non-treated cells are small. Looking at the plot of the average values, just about three genes are substantially above the random dispersion pattern and none below (Figure 10). However, when we examined all 16 pair-wise comparisons, we found seven genes above the 0.9 probability interval in 14 out of 16 cases and only two below. When we reduced the width of the interval to 0.8, we found 11 genes upregulated and three downregulated; the selected genes are shown in Table 3. While this experiment was not designed to provide a definitive biological demonstration of oxidative stress-induced gene expression, it is encouraging that this analysis has identified six genes known to be altered following oxidative stress (HMOX1, NQO1, TFRC, P21, MGST1, CCL5) and five genes clearly related to repair of DNA damage (P21, GADD45, DDB2, XPC, ATF3).
Figure 10. Dispersion plot of the treated versus non-treated averages in the glucose oxidase assay. Figure shows the experimental points and boundaries of the 0.95 probability interval (solid lines). Fourteen points selected by the consecutive sampling and coincidence test are shown as enlarged squares.
Table 3. Differentially expressed genes selected by the consecutive sampling and coincidence method. List of the probe sets selected by the consecutive sampling and coincidence method using the interval 0.8. The fold change is calculated using the threshold of 10, substituting into the denominator function max(10, Yavg), where Yavg is the mean of the under-expressed values. Bold print indicates the probe sets selected also for the interval 0.9. The probe set GI_4755127 printed in italics showed an inconsistent behavior in comparison of the probe 1 versus probe 2.
We can estimate probability of false positives from probability of coincidence in three random trials, assuming zero hypothesis, i.e. assuming that the differences are purely random. Figure 11 shows probability of observing k or more genes in all three trials for the probability interval 0.8; the circles and solid line correspond to the results obtained using binomial distribution and triangles represent the results of Monte Carlo simulations with four hundred runs. Probability of detecting at least two upregulated or downregulated genes is about 14%; for three or more genes it is 3%.
Figure 11. Probability of observing at least k genes in three trials. Figure shows the number of probe sets observed in all three trials, assuming that only random differences exist among the gene expressions. Each trial selects randomly 63 probe sets out of 632, corresponding to the number of the probe sets above (below) the 0.8 probability interval. The circles and solid line correspond to the calculations based on binomial probability and triangles represent the results of Monte Carlo simulations (400 runs).
Corroboration of the selected differentially expressed genes
We used several independent approaches to assess reproducibility of the results of our data analysis. First we ask the question: Having four replicates, what results we would have obtained, if we only had three pairs of samples? To find the answer we selected the genes above the 0.8 and 0.9 intervals in four sets of three samples and counted the genes common to at least seven out of nine possible combinations. The average percentages of the common genes for any pair of the three-sample tests were 83 and 95, respectively. Surprisingly, the percentage did not decrease, when we reduced the width of the interval (Table 4). In the second verification we submitted the selected genes to the t-test and Wilcoxon test. For the t-test we choose the levels of 0.01 and 0.001, which yielded 65 and 21 over-expressed genes and 4 and 1 under-expressed, respectively. Table 5 shows the comparisons: all 14 genes selected by the coincidence satisfied the Wilcoxon test (P = 0.03) and ten and nine over-expressed genes agree with the t-test at the levels of 0.01 and 0.001, respectively. Third, we checked, how are the selected genes distributed on the plot of average values. We counted 16 and 8 genes above and below the 0.95 interval, respectively: all 14 selected genes are included in these two subsets (Table 5).
Table 4. Summary of the test results for 3-sample sets. Results of the reproducibility test. The first and second column list the number of genes identified by the coincidence method for the interval 0.9 and 0.8, respectively, and the third column the number of genes that satisfied the two-tail t-test. The first row shows the mean number of genes that passed the test in complete set (four treated and four untreated samples, 12 coincidences out of 16). The second and third rows, first two columns, give the mean number of genes that passed in seven out of nine comparisons in three-sample sets and the mean of the genes passing concurrently in two particular tests, respectively; the third column shows the mean number of genes that passed the t-test in three-sample sets and the mean of the genes passing concurrently in any two particular tests, respectively, the fourth column corresponds to results obtained for the variance stabilization, fifth for the starred logarithm transformation, sixth for the CyberT method and seventh for the Tusher's calculations. The fourth row shows the ratio of the third row versus the second row in percent.
Table 5. Results of the Wilcoxon test, t-test and the average plot for the selected differentially expressed genes. Verification of the consecutive sampling and coincidence test by comparison to the Wilcoxon test (P = 0.03), t-test at the level of 0.001 and 0.01 and the and plot of average values. The first five rows of data show the numbers of genes satisfying a given test and the last four the number of genes selected by the coincidence test, which also satisfied a particular second criterion
One of the indicators of reliability of the data is consistency of the individual probes within the probe set. Since we have only two probes per a probe set (with one exception), we can only check for consistency of the probe pair behavior. We calculated ratios of the probe set averages for non-treated and treated samples and compared these to the corresponding ratios obtained for individual probes. The difference ranges from -12% to +8%, except for the gene GI_4755127 probe 1, where we got 33% (Table 6). Also, the differences in coefficients of variation of non-treated and treated samples for probes 1 and 2 of the gene GI_4755127 are 0.43 and 0.40, respectively, while the maximum difference for the remaining genes is 0.23. The probe set GI_4755127 was included only among the genes obtained for the 0.8 interval and has the lowest treated-samples average of 13.2.
Table 6. Comparison of the signal of first and second probe. Comparison of the probe 1 versus probe 2, nt and tr stand for "non-treated" and "treated," respectively and CV is the coefficient of variation. The ratio pr1/pr1&2 is the fold difference obtained using probe 1 only versus the fold difference obtained from the probe set; similarly for pr2/pr1&2. The last two columns give the absolute values of the difference of coefficient of variation. The minimum and maximum values in the last two rows were calculated excluding the gene GI_4755127, which shows abnormal behavior (printed in italics). Fourteen genes listed were obtained using the interval 0.8; 11 genes printed in bold are the genes, obtained for the interval 0.9*) Excluding the probe set GI_4755127.
We also compared our candidate genes with the genes selected by the Illumina custom method. 9 (Supplemental Table A3) shows the list of Illumina genes, including the average gene expression, coefficient of variation and differential score; according to the Illumina scoring, the value of 20 corresponds approximately to P = 0.01. In distinction to all other tests, the Illumina method selects approximately the same number of up- and down-regulated genes: 16 and 15, respectively. All the genes selected by the coincidence test are also found on the Illumina list. There is a good agreement between both methods with the coincidence test, apparently, providing a more rigorous criterion. Of note, the Illumina custom method identified a number of additional genes that are good candidates for regulation by oxidative stress including genes in DNA repair, cell cycle and inflammatory response.
Additional file 9. Supplemental Table S3, differentially expressed genes selected by the Illumina custom algorithm. Probe sets selected by the Illumina method. The table shows the gene name and function, mean intensity, coefficient of variation and Illumina differential score; value ± 20 corresponds to P = 0.01. Note that Illumina uses different normalization and, consequently, the mean intensities of Tables 3 and 4 do not agree. Bold print indicates the probe sets selected by the consecutive sampling method and coincidence test.
Format: DOC Size: 73KB Download file
This file can be viewed with: Microsoft Word Viewer
Comparison of the coincidence test to alternative methods
To assess performance of the coincidence method in the context of other methods currently employed, we compared reproducibility of the coincidence results to the standard t-test, t-test on the variance-stabilized data , on the data subjected to "starred logarithm" transformation , CyberT method  and the method suggested by Tusher and coworkers . We used the same procedure as above, i.e. we identified the probe sets satisfying the probability threshold for four subsets of three microarrays and calculated the average agreement between the all combinations of two trials. For each method we chose significance level that produced a similar number of genes as the coincidence test for the probability interval 0.9.
The mean number of samples satisfying the standard t-test for p < 0.0028 was 11.8, marginally below the average of 12.3, obtained for the coincidence test. At average, only 6.3 were common to any pair of the three-sample tests, representing just about 54% agreement (Table 4). This is to compare with 83% and 95%, attained with the coincidence test. The tests applied to the variance-stabilized data and to the data subjected to the starred logarithm transformation yielded similar performance as the standard t-test, namely 53% at p < 0.0025 and 50% at p < 0.003, respectively. The CyberT method and Tusher's approach showed performance similar to the coincidence test. The average agreement for the former was 8.7 out of 12.0, corresponding to 72% at p < 0.0001; for the latter we obtained 8.8 out of 12.5, corresponding to 71% at p < 0.023 (Table 4).
Properties of the dispersion patterns
Analysis of variability of the Illumina replicates has shown that experimental frequency distributions are very close to the normal distribution. However, about 5–10% of the samples deviate from normality and include genes with significantly outlying expressions. This implies that any parametric method should be used with caution. At the low end of the expression range the standard deviation is approximately constant, while at the high end it is proportional to the mean expression. The distribution functions of the observed values are symmetrical with respect to the zero axis and the distributions at the right-hand and left-hand sides are equivalent. We demonstrated that the linear standard deviation function provides a good approximation of the overall variability across the array. The intercept is dominant at the low expression level and reliable characterization of the near-zero variability is needed to determine its magnitude. We noted that the values of intercept were similar in all three sets of comparisons, while the coefficient of proportionality in transcription variation and biological replicates was at average about 2 to 3 fold larger as compared to the hybridization variation.
Approximation of the standard deviation across array, provided by the characteristic function derived from the consecutive sampling, was compared to the standard deviation function, derived from the individual genes: the difference for the coefficients a and b in two independent tests was in the range -3% to -12% and -8% to -14%, respectively. It is understandable that the standard deviation of individual genes is lower, because in the consecutive method we use in each sample elements coming from different populations with small but finite differences in population means. Since the standard deviation increases with the expression mean, this introduces an additional component into the standard deviation estimate.
Differentially expressed genes in lymphoblasts exposed to glucose oxidase
We created a physiological state of oxidative stress by using a low-dose exposure to glucose oxidase. Previous experiments (data not shown) suggested that this dose could induce oxidative stress genes and produce levels of DNA damage that could be repaired with no apparent cellular toxicity. Indeed, among the differentially expressed genes identified between treated and untreated cells (Table 3), we observe six genes known to be altered following oxidative stress (HMOX1, NQO1, TFRC, P21, MGST1, CCL5) and 5 genes clearly related to repair of DNA damage (P21, GADD45, DDB2, XPC, ATF3). Additional work is needed to characterize the biological importance of these small changes in gene expression following low-dose oxidative stress.
In this analysis we examined the frequency distributions of the data in replicate experiments. We demonstrated plausibility of the two-component representation of the standard deviation and showed equivalence of the consecutive sampling method and gene-by-gene evaluation of the standard deviation function. We used the consecutive sampling and coincident test to identify the best candidates among the differentially expressed genes in the samples under oxidative stress; the results were in agreement with the t-test and Wilcoxon statistic and with the Illumina proprietary method. A practical advantage of the consecutive sampling and coincidence approach is that it provides detail information about characteristics of each individual array. Complete pair-wise comparisons can identify the atypical samples and enable the experimenter to decide about their treatment.
The main conclusions can be summarized in the following points:
• Random variability exhibits the Gaussian characteristics; at the low end the frequency distribution of expression values is close to the normal distribution, while at the high end the distribution of relative values is close to normal. The frequency distribution is symmetrical with respect to zero.
• Standard deviation is well approximated by the linear function of the mean gene expression with a constant term.
• Our observations indicate that the change in biological state of the matter is usually reflected in the proportionality coefficient b, while the change in technical parameters is frequently correlated with the coefficient a.
• Consecutive sampling provides good estimator of the characteristic standard deviation function.
• Consecutive sampling and coincidence test yielded, under given conditions, more consistent results than the t-test applied directly to the normalized data or data submitted to the variance stabilization and starred logarithmic transformation; the performance of the CyberT method and Tusher's method was similar to the coincidence test. The coincidence selection as a nonparametric approach provides more robust selection criterion and can be used for assays with only duplicate arrays.
Cell culture conditions
Epstein Barr Virus (EBV)- transformed lymphoblastoid cell lines (LCLs) were grown in RPMI-1640 medium, supplemented with 10% fetal bovine serum (Invitrogen, Carlsbad, CA) and 2 mM L-glutamine (Life Technologies, Gaithersburg, MD) at 37°C in a humidified 5% CO2 atmosphere. Before treatment with glucose oxidase, cells were diluted to a concentration of 2 × 105 cells/ml in fresh RPMI-1640 media (plus 15% FBS) and allowed to grow out for 18 hours to condition the media. After 18 hours, 10 ml of suspended cells were aliquotted into Petri dishes. Glucose oxidase (Molecular Probes) was added to test samples at a final concentration of 1 mUnit/ml, while dilution buffer (1 mM sodium acetate) was added to controls. Test and control samples were incubated at standard conditions for 8 or 24 hours. Parallel cell culture samples (biological replicates) were extracted and reverse transcribed separately. RNA was extracted from test and control samples using RNeasy Midi extraction columns, according to the manufacturer's instructions (Qiagen). The pooled reference used in these experiments was a combination of equal amounts of RNA from six LCLs (GM10469, GM10967, GM11321, GM12909, GM13838, and GM14682, Coriell Cell Repositories, Camden, NJ) and from 3 lymphoid tumor lines (L428 (DSMZ, Braunschweig, Germany), and Jurkat and Raji (ATCC, Manassas, VA)).
Illumina bead-based arrays
The Illumina Gene Expression system was used for direct hybridization of labeled cRNAs to gene-specific 50-mer oligonucleotide probes attached to microbeads. For each sample, 200 ng of total RNA was aliquoted into 1 well of a 96-well plate. Labeled cRNA was produced by a reverse transcription followed by in-vitro transcription according to the manufacturer's instructions (MessageAmp II, Ambion). Duplicate aliquots of each cRNA sample (1 μg cRNA each) were distributed into parallel microwells in a 384 well hybridization plate with buffer, paired with a Sentrix array matrix (SAM), and incubated at 55°C overnight as per the Illumina standard protocol. The following day the SAM was washed, blocked with casein (Pierce, Rockford, IL), and signal was developed with streptavidin-Cy3 using Fluorolink Cy3 (Amersham, Piscataway, NJ) according to the manufacturer's instructions. The SAM was then imaged with the Illumina BeadArray Reader GX.
Method of analysis
The consecutive sampling method  provides a convenient tool for the global characterization of dispersion patterns in pair-wise comparisons. Briefly, the probe sets of a given pair of arrays are ranked according to the mean expression, statistical samples are defined as k consecutive pairs (typically k = 12) and the standard deviations are calculated from the difference of expressions. The estimator of the characteristic standard deviation function SD is then determined by fitting the linear function
SD = a + bYmean
to the experimental points at the logarithmic scale; (Eq. (1), Ym is the sample mean). In order to obtain representative standard deviation at a given expression level, the differences in mean expressions of the genes in a given sample must be small. This poses no problem at the low expression end, but at the high expressions the density of genes within a narrow expression interval is small and a certain number of genes (probe sets) must be excluded.
The consecutive sampling program is written in Basic and uses the Excel platform. After sorting and definition of the samples it calculates the standard deviations and determines the characteristic function using the logarithmic transform and nonlinear regression subprogram. Once it determines the standard deviation function, it calculates the boundaries of chosen probability intervals. The upper and lower limits in the dispersion plot Y2 versus Y1 are defined as
where Kα is a constant corresponding to the probability interval α.
Several "reliability checks" have been incorporated into the consecutive sampling program. First, assuming independent samples we verify the identity
SD(Ydiff) = SD(Y1) + SD(Y2), (4)
where SD(Ydiff) and SD(Yi) are the standard deviations calculated from the expression difference of the ordered sample and from the expression values of the array i, respectively (see Supplementary Material in Ref. ). It provides good verification of variability of the mean within a given sample: here we obtained agreement within about 2%. The second check calculates the average number of samples failing the Kolmogorov-Smirnov normality test (P = 0.05) and the third compares the number of genes beyond the 0.95 probability interval to the number of genes outside the interval corresponding to 1.96 standard deviations (0.95 probability interval of the normal distribution with the same mean and standard deviation). Additional subprograms calculate skewness and kurtosis and assess the symmetry. The program provides the output tables including the verifications, parameters characterizing the dispersion and list of the genes outside specified probability intervals.
The author(s) declare that they have no competing interests.
Experimental samples were prepared and processed by MCM III and DAB in the Environmental Genomics Section, National Institute of Environmental Health Sciences. Data analysis was carried out by JPN at McGill University and Genome Quebec Innovation Centre.
Reviewer's report 1
Alexander Karpikov, MB&B Department, Yale University, New Haven, CT 06520 USA (nominated by MarkGerstein, MB&B Department, Yale University, New Haven, CT 06520 USA).
I read your article and found it quite interesting. I think it has enough material for the publication. My major comment is on the style. I think the article is very difficult to read and its style should be improved prior to the publication.
Author response: In response to the reviewer's concern we abbreviated the text and introduced number of revisions. First, in the Background discussion we stated the purpose of our study and the main points of the individual sections more clearly. Second, we partially reorganized the text and introduced two new subheadings. We tried to be clearer and more specific and to eliminate deviations from the main subject. Throughout the text we systematically tried to indicate the goal of a given particular section and clearly describe the approach used to achieve it.
Reviewer's report 2
Eugene V. Koonin, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
I think this is a useful, carefully performed study on microarray analysis statistics. Although the technology behind the experiments analyzed in this paper is not the most common one, the tests investigated here may have general applicability.
Reviewer's report 3
King Jordan, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD 20894.
In this work, Novak et al. analyzed the statistical properties of gene expression level data generated from the Illumina GEX Sentrix™ microarray platform, which employs a fiberoptic bead-based approach to measuring expression levels. Microarray technology is evolving rapidly and the trend is towards increasingly high density arrays that are capable, in principle, of generating expression profiles for multiple replicates of entire eukaryotic genomes from single chips. The bead based technology employed by Illumina arrays represents an experimental methodology that is qualitatively distinct from that employed by the industry leader Affymetrix. In addition to allowing for increased density and multiple samples per chip, the Illumina bead-based arrays afford several other potential advantages including: i-longer 50 mer probes that presumably increase both selectivity and specificity, ii-low sample and reagent volumes that, along with a high density of features on the array, lead to a relatively low price per sample and a consequently more ambitious experimental scale, and iii-high redundancy, with ~30 beads for each oligo, that can be exploited to increase confidence in specific gene expression levels.
While the move to increasingly high density arrays represents a potential boon to researchers, it also presents fundamental bioinformatics challenges regarding the analysis, visualization and interpretation of expression data. Unfortunately, the bioinformatics technology needed to meaningfully comprehend the results of increasingly high throughput gene expression profiling tends to lag behind the new experimental approaches. It is precisely this challenge that Novak et al. took up in their analysis of the dispersion patterns of expression data generated by Illumina arrays. To evaluate dispersion patterns, the authors implemented the previously developed consecutive sampling method whereby probe sets are ranked according to mean expression and sets with similar means are binned prior to comparison. They compared dispersion patterns from three groups of samples, each of which allows for different sources of variation – hybridization, reverse transcription and biological – to be considered independently. Evaluation of the assumption of normality revealed distinct deviations at the low and high ranges of expression. Based on this pattern, they found that the standard deviation of the variability can be broken down into two components – a constant term and a proportional term. The constant term dominates at low levels of expression, while the proportional term dominates at high levels. Having established the statistical properties of the expression level dispersion, the authors performed a comparison of gene expression from glucose oxidase treated versus un-treated cell lines in order to identify differentially expressed genes that may play a role in the response to oxidative stress. They were able to identify 11 such genes using the consecutive sampling approach, affirming its potential utility.
By way of critique, one may quibble with the statistical techniques employed by the authors, and there are clearly other tacks that could be taken to analyze this kind of data. For instance, the authors compared their consecutive sampling method to standard parametric (t-test) and non-parametric (Wilcoxon) methods as well as to Illumina's own proprietary method. They show that their consecutive sampling method performs comparably to these methods, in terms of identifying similar sets of differentially expressed genes, and also shows more consistency. It would have been nice to see a more systematic comparison of different methods for selecting differentially expressed genes since this is the essence of what investigators usually want to glean from microarray studies. For example there are a number of methods cited in this paper including several non-parametric approaches, a variance-stabilizing transformation and Bayesian approaches that could have been compared with the authors' method of choice. However, the analysis that is presented in the paper is detailed and thorough.
The impact of the work can also be considered to be somewhat mitigated by the fact that the consecutive sampling method introduced is an extension of the authors own previous work on Affymetrix arrays. The scope of this study was also quite small, expression of only 632 genes was analyzed, and one may wonder whether the dispersion properties for a set of that size would hold up for whole-genome data sets. In addition, while the authors do make some attempt to study an actual biological system – exposure of a cell line to low dose oxidative stress – there is relatively little biological insight that can be gleaned from this work. To be fair however, both substantial utility and novelty can be found in the manuscript, pursuant to the fact that it represents the first study of data variability in the Illumina bead-based microarrays. Thus, the authors have made an important, if tentative, contribution towards bridging the gap between emerging microarray experimental technologies and the bioinformatics tools needed to interpret their output.
Author response: First, we would like to thank Dr. Jordan for very thorough and helpful review. In response to his comment we extended comparisons of reproducibility to include the variance stabilization, "starred logarithm" transformation, CyberT method and Tusher's approach. Regarding the comment that the consecutive sampling method was already introduced in the 2002 publication : We would like to mention, that the original paper described only the principle and justification of the approach. On the basis of this principle we developed the method of analysis, incorporating the subroutines calculating the characteristic standard deviation function, boundaries of the probability intervals for selected set of values, tests of normality, calculations of skewness and kurtosis, etc. (c.f. Methods section). Furthermore, in the present study we introduced the application of the consecutive sampling and probability intervals to the differential expression analysis via coincidence test and presented the estimate of the number of coincidences based on probability of coincidences in random trials. We hope that new information is sufficiently noteworthy to make it interesting to readers. Regarding the experimental part of the study, unfortunately, lack of support makes it impossible now to extend the study to include larger arrays or to make a more thorough investigation of the biological properties of the system under consideration.
This research was supported in part by the Intramural Research Program of the NIH and NIEHS. We thank Drs. T. Dickinson and S. Baker (Illumina Inc, San Diego) for cDNA and aRNA preparation and hybridizations. Thanks are also due to Drs. X. Wang and J. Li, NIEHS for useful suggestions and comments to the manuscript. Furthermore, the authors are indebted to Dr. E. Chudin, Illumina, for information regarding the Illumina technology and Illumina methods of analysis and for performing the Illumina custom analysis of the oxidase-treated samples.
Nature Genetics 2004, 16:943-947. Publisher Full Text
Bolstad BM: RMAExpress. [http://www.statberkeley.edu/~bolstad/RMAExpress/RMAExpress.html] webcite
Berger C, Pierce LN, Kruger M, Marcusson EG, Robbins JM, Welcsh P, Welch PJ, Welte K, King MC, Barber JR, Wong-Staal F: Identification of Id4 as a regulator of BRCA1 expression by using a ribozyme-library-based inverse genomics approach.
Cancer Research 1998, 58:5009-5013. PubMed Abstract
Genome Research 2001, 12:292-297. Publisher Full Text
Journal of the American Statistical Association 1988, 83:596-610. Publisher Full Text
Technometrics 1995, 37:176–184. Publisher Full Text
Mansourian R, Mutch DM, Antille N, Aubert J, Fogel P, Le Goff JM, Moulin J, Petrov A, Rytz A, Voegel JJ, Roberts MA: The global error assessment GEA model for the selection of differentially expressed genes in microarray data.
Bioinformatics 2003, 19:966–972. Publisher Full Text