Experimental Design |
|
This series combines nine independent datasets representing a spectrum of human pathologies expected to change gene transcript abundance in whole blood, through changes in either expression or cellular composition. Together, the nine datasets comprise 410 individual whole blood profiles generated from patients with HIV, tuberculosis, sepsis, systemic lupus erythematosus, systemic arthritis, B-cell deficiency and liver transplant. Each dataset also includes healthy controls. Each dataset's expression data was preprocessed independently. First, probes were discarded if they were not present in at least ten percent of the dataset's samples. Then, the sample data for each dataset was normalized using the BeadStudio average normalization algorithm. Once normalized, the signal was floored such that all values less than ten were set to ten. The median signal across all of the dataset's samples was then calculated for each probe. A probe was discarded if no sample differed in signal from the median by thirty or more, or if no sample had a fold change relative to the median that was either greater than or equal to 1.5 or less than or equal to 0.67. Finally, data was transformed to the log2 of the signal divided by the mean. Each of the preprocessed datasets was clustered in parallel using Euclidean distance and Hartigan's K-Means clustering algorithm, a hybrid of hierarchical and K-Means clustering algorithms. The number of clusters (k) was set to thirty, chosen to provide significant power during later module extraction steps. A higher value of k was not used, in order to minimize possibly arbitrary cluster splitting. Taking the nine sets of thirty clusters as input, we constructed a weighted co-cluster graph: a probe-by-probe matrix in which the value of each cell (the weight) is the number of datasets in which probe_i and probe_j are found in the same cluster. In this instance, the values range from zero to nine, inclusive.
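The probe filter and the co-cluster weights described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline actually used: the function names `passes_filter` and `cocluster_weights` are hypothetical, the "difference from the median" is assumed to be an absolute difference, and each clustering is assumed to be available as a mapping from probe identifier to cluster label. Normalization, the presence filter and the final log2 transform are not shown.

```python
from collections import Counter, defaultdict
from itertools import combinations
from statistics import median


def passes_filter(signals, floor=10.0, min_diff=30.0, fc_hi=1.5, fc_lo=0.67):
    """Return True if one probe's normalized signals survive the variability filter.

    Assumption: the difference from the median is taken as an absolute value.
    """
    x = [max(s, floor) for s in signals]          # values below the floor are set to the floor
    med = median(x)                               # per-probe median across samples
    diff_ok = any(abs(s - med) >= min_diff for s in x)
    fc_ok = any(s / med >= fc_hi or s / med <= fc_lo for s in x)
    # A probe is discarded if it fails either criterion, so it must pass both.
    return diff_ok and fc_ok


def cocluster_weights(clusterings):
    """Count, for every probe pair, the number of clusterings that co-cluster them.

    clusterings: one dict per dataset, mapping probe id -> cluster label.
    Returns a Counter keyed by sorted (probe_a, probe_b) tuples; absent pairs count as 0.
    """
    weights = Counter()
    for labels in clusterings:
        members = defaultdict(list)
        for probe, cluster in labels.items():
            members[cluster].append(probe)
        for group in members.values():
            for a, b in combinations(sorted(group), 2):
                weights[(a, b)] += 1
    return weights


# Toy example with three clusterings instead of nine.
toy = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 0, "c": 0},
    {"a": 1, "b": 2},          # "c" was filtered out of this dataset
]
toy_weights = cocluster_weights(toy)
```

Storing only the pairs that co-cluster at least once keeps the graph sparse; a dense probe-by-probe matrix would hold the same information.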
At this point, the goal is to extract sets of probes that are most frequently clustered together, proceeding from the most stringent requirements to the least. To accomplish this, we employ an iterative algorithm. To begin, the maximum clique threshold is initialized to the number of input cluster sets, the paraclique threshold is calculated, and a minimum seed size is chosen (we used ten). The outer loop begins by creating an unweighted graph through application of the maximum clique threshold to the weighted co-cluster graph, such that a probe pair, or edge, is represented in the unweighted graph if and only if the corresponding weight in the co-cluster graph equals or exceeds this threshold. We then begin the inner loop. The first step is to isolate the largest set of probes such that all pairs of probes in the set are connected in the unweighted graph; that is, there is no pair of probes in the set whose weight in the co-cluster graph is smaller than the maximum clique threshold. In graph-theoretic terms, the probes form a maximum clique. If the probe set is smaller than the minimum seed size, we exit the inner loop, reduce the threshold by one, and return to the beginning of the outer loop. Otherwise, the probe set becomes the seed for a module. To allow for the inevitable clustering inaccuracies, we then employ the paraclique algorithm, revisiting the co-cluster graph and adding to the seed any probe that clusters with at least eighty-five percent of the seed's members a number of times equal to or exceeding the paraclique threshold. This final probe set is a module. It is removed from both graphs and named according to the iterations in which it was found (e.g., a module extracted in the first iteration of the outer loop and the second iteration of the inner loop is designated M1.2). The inner loop then begins again with the reduced graphs.
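The nested loops above can be illustrated with the following Python sketch. It is an approximation under stated assumptions rather than the implementation used for this series: `max_clique` and `extract_modules` are hypothetical names, the maximum clique is found with a plain Bron-Kerbosch search (exponential in the worst case, so suitable only for small graphs), the paraclique threshold is passed in as a fixed value because the text does not say how it is calculated, and the outer loop is assumed to stop once the threshold reaches one.

```python
def max_clique(adj):
    """Largest clique via Bron-Kerbosch with pivoting. adj: {node: set of neighbours}."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:                       # r is a maximal clique
            if len(r) > len(best):
                best = list(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(adj), set())
    return set(best)


def extract_modules(weights, n_sets, para_threshold, min_seed=10, fraction=0.85):
    """weights: {(probe_a, probe_b): co-cluster count}. Returns {module name: set of probes}.

    Assumptions: para_threshold is fixed and supplied by the caller;
    the clique threshold is lowered from n_sets down to 1.
    """
    def w(a, b):
        return weights.get((a, b), weights.get((b, a), 0))

    remaining = {p for pair in weights for p in pair}
    modules = {}
    for outer, threshold in enumerate(range(n_sets, 0, -1), start=1):
        inner = 0
        while True:
            # Unweighted graph: keep an edge only if its weight meets the threshold.
            adj = {p: {q for q in remaining if q != p and w(p, q) >= threshold}
                   for p in remaining}
            seed = max_clique(adj)
            if len(seed) < min_seed:
                break                             # lower the threshold and start over
            # Paraclique step: add probes linked to at least `fraction` of the seed.
            extra = {p for p in remaining - seed
                     if sum(w(p, s) >= para_threshold for s in seed) >= fraction * len(seed)}
            inner += 1
            modules[f"M{outer}.{inner}"] = seed | extra
            remaining -= seed | extra             # remove the module from the graph
    return modules


# Toy example: three input cluster sets, minimum seed size of three.
toy_weights = {
    ("a", "b"): 3, ("a", "c"): 3, ("b", "c"): 3,
    ("a", "d"): 2, ("b", "d"): 2, ("c", "d"): 2,
    ("e", "f"): 2, ("e", "g"): 2, ("f", "g"): 2,
    ("e", "h"): 1,
}
toy_modules = extract_modules(toy_weights, n_sets=3, para_threshold=2, min_seed=3)
```

In the toy example, probes a, b and c seed a module at the strictest threshold and d is added by the paraclique step, while e, f and g only form a module after the threshold is lowered. For graphs with thousands of probes, a dedicated maximum clique solver would be needed in place of the simple search shown here.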
Modules with conserved expression across diseases (formed by transcripts that cluster together in all nine datasets) were selected in early rounds, whereas modules with greater disease specificity (formed by transcripts that cluster together in only a subset of the nine datasets) were selected in later rounds. | |
|