PLoS Computational Biology
image
Genome-wide prediction of topoisomerase IIβ binding by architectural factors and chromatin accessibility
DOI 10.1371/journal.pcbi.1007814 , Volume: 17 , Issue: 1
Article Type: research-article, Article History
Abstract

Type II DNA topoisomerases (TOP2) are a double-edged sword. They solve topological problems in the form of supercoiling, knots and tangles that inevitably accompany genome metabolism, but they do so at the cost of transiently cleaving DNA, with the risk that this entails for genome integrity, and the serious consequences for human health, such as neurodegeneration, developmental disorders or predisposition to cancer. A comprehensive analysis of TOP2 distribution throughout the genome is therefore essential for a deep understanding of its function and regulation, and how this can affect genome dynamics and stability. Here, we use machine learning to thoroughly explore genome-wide binding of TOP2B, a vertebrate TOP2 paralog that has been linked to genome organization and cancer-associated translocations. Our analysis shows that TOP2B-DNA binding can be accurately predicted exclusively using information on DNA accessibility and binding of genome-architecture factors. We show that such information is enough to generate virtual maps of TOP2B binding along the genome, which we validate with de novo experimental data. Our results highlight the importance of TOP2B for accessibility and 3D organization of chromatin, and show that computationally predicted TOP2 maps can be accurately obtained using minimal publicly available datasets, opening the door for their use in different organisms, cell types and conditions with experimental and/or clinical relevance.

Martínez-García, García-Torres, Divina, Terrón-Bautista, Delgado-Sainz, Gómez-Vela, Cortés-Ledesma, and Ay: Genome-wide prediction of topoisomerase IIβ binding by architectural factors and chromatin accessibility

Introduction

Type II DNA topoisomerases are unique in their ability to catalyze duplex DNA passage, and therefore the only enzymes capable of dealing with superhelical DNA structures, as well as unknotting and decatenating DNA molecules [1]. This places them in a privileged position to integrate and coordinate many aspects of genome organization and metabolism, and have indeed been related to virtually all aspects of chromatin dynamics [2, 3]. While the genomes of invertebrates and lower eukaryotic systems code for only one type II topoisomerase (TOP2), vertebrates encode two close paralogs (TOP2A and TOP2B) with similar catalytic and structural properties but with different roles [2, 4]. TOP2A is essential for cellular viability [5] and is mainly expressed in dividing cells [6], where it is required for chromosome segregation [2]. TOP2B, in contrast, is not essential at a cellular level [7], but is ubiquitously expressed and required for organismal viability due to roles in the transcriptional regulation of genes required for proper development of the nervous system [2, 6, 8].

Consistent with regulatory functions in transcription, TOP2B significantly associates with promoters and DNase I hypersensitivity sites, as well as actively regulated epigenetic modifications and enhancers [913]. Furthermore, recent studies have linked TOP2B with 3D genome organization due to a strong colocalization with architectural factors such as CTCF and the cohesin complex protein RAD21 at the boundaries of topologically associated domains (TADs), which has been interpreted as TOP2B functions in resolving topological problems derived from the active organization of the genome in the context of the loop extrusion model [11, 12]. Strikingly, TOP2B and RAD21 are spatially organized around CTCF binding sites, forming ordered triple sites that flank the boundaries of TADs in a way that TOP2B is positioned externally and cohesin internally to the domain loop.

Despite these fundamental roles in the organization and dynamics of the genome, TOP2 function can also pose a threat to its integrity [4]. Aberrant activity of TOP2, which can occur spontaneously or as a consequence of cancer chemotherapy, results in the formation of DNA double-strand breaks (DSBs). These are highly cytotoxic lesions that, if inefficiently or aberrantly repaired, can seriously compromise cell survival and genome stability. Indeed, unrepaired or misrepaired DSBs can lead to genome rearrangements associated with neurodegeneration, sterility, developmental disorders and predisposition to cancer [1416]. In particular, transcription-associated TOP2B function at TAD boundaries has been recently linked to oncogenic chromosomal translocations [17, 18]. The specific functions of TOP2B in these processes are, however, still largely unexplored and far from being fully understood. A deep understanding of TOP2B function and regulation in a genome-wide context, and how it can result in chromosome fragility and alterations, will require thorough and unbiased studies, integrating different systems, cellular types and conditions.

In this work, we take advantage of the wealth of high-throughput sequencing data to comprehensively study the informative power of a wide set of chromatin features to predict TOP2B binding. We show that TOP2B localization can be accurately predicted using publicly available sequencing data, and identify architectural and open chromatin factors as the most informative features, in agreement with previously reported data [11, 12]. Moreover, feature selection analysis indicates that TOP2B can be faithfully predicted using only three sequencing datasets. Indeed, models trained with DNase-seq and ChIP-seq of CTCF and RAD21 allow for accurate prediction across different mouse systems, supporting a generalizing association of TOP2B with these features. Finally, to validate our model we generate predictions in mouse thymocytes and human MCF7 cells and compare them with ChIP-seq data obtained in our laboratory. Genome wide predictive tracks accurately mirror experimental TOP2B tracks in both organisms, showing that our model can be used to generate virtual TOP2B signals in potentially any mammalian cell type and condition for which sequencing data of CTCF, RAD21 and DNase I hypersensitivity are available.

Materials and methods

Processing publicly available data

The experimental data used in this study is summarized in S1 Table. When available, we batch-downloaded mm9 and hg19 BAM files from ENCODE [19]. Raw sequencing data was processed by first quality filtering and merging the reads of biological replicates and corresponding input samples (in the case of ChIP-seq experiments). Then we mapped them to the mouse or human genome (mm9 or hg19) using Bowtie 1.2 [20] with option “-m 1” so that reads that mapped only once to the genome were retained. Genome track plots of Fig 1 and S1 Fig were generated using Bioconductor packages Gviz [21] and trackViewer [22].

Chromatin landscape of TOP2B.
Fig 1
A. Genome browser view for TOP2B ChIP-seq signal and other relevant chromatin features in mouse liver. B . Average reads enrichment of selected chromatin features within ±4 kb of TOP2B binding center (solid lines) and random regions (dashed lines) in mouse liver (blue) and MEFs (red). Signals were smoothed using nucleR [50]. Left and right sided ordinate axes correspond to mouse liver and MEFs, respectively. C. Overlap frequencies of chromatin features at TOP2B binding sites as compared to random regions in mouse liver and MEFs. D. Top panel shows relative frequency of several dinucleotides within ±500 bp of TOP2B binding center and random regions. Bottom panel shows average of 3D DNA shape features within ±1 kb of TOP2B binding center and random regions. Profiles corresponding to MEFs and liver are colored as in C.Chromatin landscape of TOP2B.

Peak calling and generation of randomized regions

To generate our predictive models, we identified TOP2B binding sites in the mouse learning systems (liver, MEFs and activated B cells) using MACS2 [23] with option “-q 0.01” and keeping only peaks with a fold change over the control sample greater than 5. Then, peaks overlapping > 20% of their width with non-mappable regions or regions of the ENCODE blacklist [24] were discarded. Since we were interested in highly enriched TOP2B binding sites to ensure robust learning, peaks were called for individual replicates (when available) and only overlapping peaks were kept. For each training system, we generated the same number of random regions in the genome than TOP2B binding sites were identified. Such regions were searched to have 300 bp length and were selected so that they had the same distribution across chromosomes as TOP2B peaks. Again, non-mappable regions and regions of the ENCODE blacklist were discarded. To make sure they did not overlap TOP2B binding sites not detected by MACS, regions with high levels of TOP2B/INPUT ChIP-seq signal were also discarded. In order to account for sequence composition biases, an additional set of random regions was generated so that their distribution of sequence G+C content matched that of TOP2B peaks. We proceeded using an iterative approach. Briefly, the G+C content of TOP2B peaks was first computed and an initial set of random regions was generated for each chromosome. Random regions having similar G+C content than any TOP2B peak were retained while matched TOP2B peaks were not considered in the next iteration. This process was repeated until the number of GC-random regions and TOP2B peaks were equal for each chromosome.

When we processed the experimental ChIP-seq samples generated in our lab, we found that MACS2 greatly underestimated the number of TOP2B binding sites. For this reason, we used HOMER [25] to identify peaks in our samples. Libraries were first subsampled so that both the test and the control samples had the same number of reads. Then, HOMER was called with option “-style factor”. Only peaks with a fold change over the control sample (IgG) greater than 5 and not overlapping non-mappable regions and regions of the ENCODE blacklist were retained.

Distribution of TOP2B binding sites across the genome

TOP2B binding sites and random genomic regions were annotated using two different approaches. On the one hand, peaks were classified relative to the mouse TSSs (transcription start sites) according to the mm9 transcripts annotation of UCSC [26] using ChIPseeker [27]. In addition, we grouped peaks into five classes as described in Matthews and Waxman (2018). Briefly, open chromatin regions (DNase-seq peaks) were identified and classified based on H3K4me1, H3K4me3, and CTCF ChIP-seq signals. Promoters were defined as peaks with > 1.5 H3K4me3/H3K4me1 fold ratio. Enhancers were defined as peaks with < 0.67 H3K4me3/H3K4me1 fold ratio. The remaining DNase peaks were separated into those with a roughly equal H3K4me3/H3K4me1 fold ratio and those with low signal for both histone marks. From the the former set, those overlapping TSSs were classified as weak promoters. Low signal peaks that overlapped CTCF binding sites were classified as insulators. Finally, the remaining low signal peaks not overlapping TSSs were classified as weak enhancers. For a more detailed description, see [28].

Quantification of high-throughput sequencing data

To score sequencing data around TOP2B binding sites, we based on the scoring approach of [29]. For ChIP-seq experiments, reads aligned to a 300 bp window centered on TOP2B binding center were first considered, normalized to the size of the corresponding sequencing library and scaled to the size of the smaller library (among test and input). Then, they were added 1 as a pseudocount (to avoid undefined values in posterior logarithmic transformations) and the ratio of test versus input signal was calculated. Finally, this value was log transformed. RNA-seq, DNase-seq and MNase-seq were scored as the number of reads aligned within the same 300 bp window divided by the library size. CpG methylation was measured as the total percentage of methylated CG for each CpG dinucleotide as described in [30].

DNA shape and sequence features

Three-dimensional DNA shape features at base pair or base pair step resolution were derived from the high-throughput approach implemented in the Bioconductor library DNAshapeR [31]. We used thirteen DNA shape features: helix twist, propeller twist, minor groove width, roll, shift, slide, rise, tilt, shear, stretch, stagger, buckle and opening [32]. For a given DNA sequence of length L, the k-mer feature at each nucleotide position was encoded as a binary vector of length 4k , as described in [33]. A value of 1 represents the occurrence of a particular k-mer starting at that position. The k-mer features for a whole DNA sequence were then encoded by the concatenation of the k-mer feature vectors at each position of the sequence. As a result, we obtained a binary vector of length 4k(L-k+1).

Machine learning

Training data matrix

After scoring the sequencing experiments as well as DNA shape and sequence parameters, we built up a final data matrix with rows representing TOP2B/random sites and columns corresponding to scored features. Since we identified a total of 13, 128 and 8, 413 TOP2B peaks in mouse liver and MEFs respectively, we ended up with model matrices of 32, 766 columns and either 26, 256 (mouse liver) or 16, 826 (MEFs) rows.

Classifiers

The Naive Bayes [34] (NB) classifier applies the Bayes rule to compute the probability of a class label given an instance. For such purpose, the classification model learns, from training data, the conditional probability of each feature given the class label. NB assumes that all features are conditionally independent given the values of the class.

Support Vector Machine [35] (SVM) is a classifier based on the idea of finding a hyperplane that optimally separates the data into two categories by solving an optimization problem. Such hyperplane, is the one that maximizes the margin between classes. In case of non-separable data, it optimizes a weighted combination of the misclassification rate and the distance of the decision boundary to any sample vector. After several tests with different kernels (linear, polynomial, radial basis and sigmoid), we use a linear kernel due to its good performance and simplicity.

Random Forests [36] is a widely applied tree-based supervised machine-learning algorithm designed as a combination of bagging and a random selection of features that has proven to be an effective tool for the statistical analysis of high-dimensional genomic data [37].

Feature selection algorithms

Fast Correlation Based Filter [38] (FCBF) is an efficient heuristic that, based on information theory measures, performs a relevance and redundancy analysis for selecting a good subset of features. Basically, FCBF is a backward search strategy that uses Symmetrical Uncertainty (SU) as goodness function to measure non-linear dependencies between features. In order to identify relevant features, it requires to set a threshold parameter δ so that only values above δ are considered relevant. In this work we set δ = 0 since there is no rule about this parameter tuning and in the datasets under study only a small subset of features have a SU value different to 0.

Scatter Search [39, 40] (SS) is a population-based metaheuristic that uses intensification and diversification mechanisms to generate new solutions. The strategy starts by generating an initial population of diverse solutions from the solution space. Then, it selects a subset of features, called Reference Set (RefSet), of high quality and dispersed solutions from the initial population. In the next step the method selects all subsets consisting of two solutions and combine them in order to generate new solutions that are improved by applying a local search which will yield to local optima. Finally, a static update of the reference set is carried out selecting solutions according to quality and diversity from the union of the original RefSet and the new local optima found. As goodness of feature subsets, the Correlation Feature Selection [40] (CFS) is used.

In this work we use the Correlation Feature Selection [41] (CFS) measure as goodness function. CFS evaluates subsets of features by considering the non-linear correlation Symmetrical Uncertainty function.

Genome wide predictions

Once we identified the most predictive features, a generalized TOP2B binding model was built by training on DNase-seq, RAD21 and CTCF data from mouse liver, MEFs and activated B cells using Random Forests [36]. We first used 5-fold cross-validation to confirm that the model was able to accurately predict TOP2B in the training systems (as shown in Results and Discussion) and then applied the model to the mappable genomes of both the training and the test systems (mouse thymus and human MCF7). First, we split the genomes into bins of 300 bp with sliding windows of 50 bp. Then we scanned each bin with our model and obtained a TOP2B binding probability vector that was used to build up a bedgraph file. Probability track plots were generated using the UCSC genome browser [26].

ChIP-seq

Cells were fixed by adding formaldehyde to a final concentration of 1% and incubated at 37°C for 10 min. Fixation was quenched by adding glycine to a final concentration of 125 mM at cell culture plates. After two washes with cold PBS, in the presence of complete protease inhibitor cocktail (Roche) and PMSF, cell pellet was lysed in two steps using 0.5% NP-40 buffer for nucleus isolation and SDS 1% lysis buffer for nuclear lysis. Sonication was performed using Bioruptor (Diagenode, UCD-200) at high intensity and three cycles of 10 min (30” sonication, 30” pause) and chromatin was clarified by centrifugation (17000xg, 10 min, 4 °C). For IP, 30 μg chromatin and 4 μg antibody (anti-TOP2B, SIGMA-HPA024120) were incubated o/n in IP buffer (0.1% SDS, 1% TX-100, 2 mM EDTA, 20 mM TrisHCl pH8, 150 mM NaCl) at 4°C, and then with 25 μL of pre-blocked (1 mg/ml BSA) Dynabeads protein A and Dynabeads protein G (ThermoFisher) for 4 hours. Beads were then sequentially washed with IP buffer, IP buffer containing 500 mM NaCl and LiCl buffer (0.25 M LiCL, 1% NP40, 1% NaDoc, 20 mM TrisHCl pH8 and 1 mM EDTA). ChIPmentation was carried out as previously described (Schmidl 2015) using Tagment DNA Enzyme provided by the Proteomic Service of CABD (Centro Andaluz de Biología del Desarrollo). DNA was eluted by incubation at 50°C in 100 μL elution buffer (1% SDS, 100 mM NaHCO3) for 30 min followed by an incubation with 200 mM NaCl and 10 μg of Proteinase K (ThermoFisher) to revert cross-linking. DNA was then purified using Qiagen PCR Purification columns. Libraries were amplified for N-1 cycles (being N the optimum Cq determined by qPCR reaction) using NEBNext High-Fidelity Polymerase (M0541, New England Biolabs), purified and size-selected using Sera-Mag Select (GE Healthcare) and sequenced using Illumina NextSeq 500 in a single-end configuration. ChIP-seq datasets corresponding to two replicates and one IgG control sample are available under GEO accession number GSE141528.

Western blot analysis

For protein extraction, cell pellets were resuspended in RIPA buffer supplemented with protease inhibitors and then sonicated, clarified and loaded after quantification in a 4-20% Mini-PROTEAN tris-Glycine Precast Protein Gels (Biorad, 4561096). Gels were electroblotted onto Immobilon-FL Transfer Membrane (Millipore). Western blot was performed following standard protocol for Odyssey CLx analysis (LI-COB BIOSCIENCES, Lincoln, NE) and for both anti-TOP2B antibodies (SIGMA, HPA024120 and NOVUS, NB100-40842) 1:3000 dilution of the antibody was used in Odyssey Blocking Buffer.

Code availability

The code used to train our models and generate genome wide predictions is freely available at GitLab.

Results and discussion

Chromatin landscape of TOP2B binding sites

We started by comprehensively assessing the chromatin landscape of TOP2B binding sites in two mouse systems: embryonic fibroblasts (MEFs) and liver. We collected high-throughput sequencing data from ENCODE [19] and several independent studies [11, 12, 4249] (S1 Table) and observed the expected colocalization with open chromatin sites, RNA polymerase II (Pol2), architectural components and transcription associated histone modifications; genome tracks in both systems are provided as an example (Fig 1A and S1 Fig). Then, we identified enriched TOP2B peaks and studied the position specific distribution of each chromatin feature around the TOP2B binding center (Fig 1B). Several marks were found to be highly enriched in both systems, such as DNase-seq, CTCF and the cohesin complex members STAG1, STAG2 and RAD21. These features also showed a high colocalization frequency with TOP2B (Fig 1C), in line with previous findings [11, 12]. For instance, 89% of TOP2B sites in mouse liver colocalized with DNase-seq peaks, while for randomly selected regions this percentage decreased to 11%. Similarly, 87% of TOP2B peaks overlapped with RAD21 in MEFs as compared to 4% of random regions. These observations agree with a preferential association of TOP2B with accessible chromatin, actively transcribed promoters and regulators of genome architecture [1012], and suggest that chromatin features measured by next generation sequencing could provide valuable information for the prediction of TOP2B binding.

Current experimental data suggest that TOP2B does not bind to a specific DNA motif [11], which agrees with topoisomerases specifically acting where DNA topological problems arise. However, TOP2B associates with DNA regions that are frequently bound by sequence-specific transcription factors [911], so the fact that particular topological structures can be favored by specific properties of DNA sequence is still a possibility. Indeed, a recent study has provided evidence that DNA sequence governs the location of supercoiled DNA [51]. To explore whether sequence composition of TOP2B binding sites can be used in our predictive approach, we analyzed the spatial distribution of DNA sequence dinucleotides within 1kb windows centered on TOP2B peaks. For both murine systems, TOP2B sites showed an increase of GC and GG dinucleotides while AT and AA were depleted (Fig 1D, top). This result agrees with previously reported GC-rich motifs, such as CTCF and ESR1, at TOP2B occupied regions [10, 11].

Another feature that could potentially help in the prediction of TOP2B binding is DNA shape, which has been shown to affect the binding preferences of a number of DNA associated proteins [33, 52, 53]. In this sense, models trained with information on DNA shape have demonstrated significant improvements in transcription factor binding predictions [33]. To further inspect whether the observed sequence preferences of TOP2B sites are accompanied by specific 3D conformational parameters, we derived high-throughput predictions of DNA shape features from TOP2B peaks. Position specific profiles of such features revealed a specific pattern around the TOP2B binding center, characterized by decreased helix twist, increased minor groove width and propeller twist, and a modest enrichment of roll (Fig 1D, bottom). Such conformation suggests helical unwinding as consequence of a decrease in helix twist, which together with the widening of the minor groove could be providing an energetically favorable scenario for TOP2B binding.

Finally, we also interrogated whether CpG methylation is informative for TOP2B binding prediction. CpG methylation is a well-known epigenetic mark that has proven predictive ability for the localization of DNA binding proteins [5456]. We profiled whole genome bisulfite sequencing (WGBS) data around TOP2B binding sites and found decreased CpG methylation as compared to random regions (S2 Fig), showing that this feature is likely to be informative for TOP2B prediction.

A predictive model for TOP2B binding based on chromatin features

In order to assess whether chromatin features could discriminate TOP2B binding sites from the rest of the genome, we applied the computational approach outlined in Fig 2. For both MEFs and liver, we considered peaks identified in the previous section, resized them to 300 bp and generated the same number of random genomic regions (see Materials and methods). Then we scored 15 high-throughput sequencing experiments (S1 Table) together with DNA sequence and shape features within such regions. Regarding DNA sequence, we represented DNA 1-mers, 2-mers and 3-mers for each nucleotide position in the TOP2B binding sites as described in [33] (see Materials and methods). This produced 1, 200 parameters representing 1-mers, 4, 784 parameters representing 2-mers and 19, 072 parameters representing 3-mers. We also included information on 13 DNA shape features using DNAshape method [57], which added other 7, 695 parameters to the model. This parametrization allowed us to measure the predictive ability of DNA sequence and shape to explain TOP2B binding with an unprecedented resolution. As a result, we ended up with model matrices of 32, 766 columns and either 13, 128 (liver) or 8, 413 (MEFs) rows. Finally, binary classifiers were trained and tested using 5-fold cross-validation.

Machine learning schema for the prediction of TOP2B binding.
Fig 2
TOP2B binding sites and random regions were first identified. Then, 15 high-throughput sequencing experiments together with DNA sequence and shape features were scored around such regions, which resulted a data matrix with rows representing TOP2B/random sites and columns representing the scored features. Finally, binary classifiers were trained and tested using 5 fold cross-validation and feature selection was applied to identify the most informative features.Machine learning schema for the prediction of TOP2B binding.

Since TOP2B binding regions display increased G+C content as compared to the average of the mouse genome, we trained one more model in which random regions were selected so that their distribution of sequence G+C content matches that of TOP2B peaks. This allowed us to account for potential biases of the predictions due to differences in G+C content. The final sets of TOP2B, random and GC-corrected random peaks displayed the expected genomic distribution [913], with TOP2B binding sites showing higher co-localization with promoters, enhancers and insulators than randomized regions (S3(A) and S3(B) Fig). Nevertheless, forcing the G+C content of random regions to match that of TOP2B binding sites led to the former being located at GC rich sites, which in turn made them show a moderate co-localization with actively regulated regions. Since TOP2B is expected to localize at such regions, we generated heatmaps to examine the ChIP-seq signal at TOP2B, random, and GC-corrected random peaks and observed enrichment only at the center of TOP2B peaks (S3(C) Fig), indicating that our background regions are adequate for subsequent training.

We employed Support Vector Machine (SVM) [35] and Naive Bayes (NB) [34] to build up the predictive models and used 5-fold cross-validation to estimate the prediction accuracy. We started by training on the whole set of features. Accurate predictions were obtained for the two murine systems using both the GC-corrected and the regular model (Tables 1 and 2; Fig 3 and S4 Fig), indicating that chromatin features can faithfully explain TOP2B binding. As a matter of fact, SVM trained on all features performed considerably better than NB. For instance, while regular models trained with NB achieved accuracies of 71% and 77.4% for MEF and liver, respectively, these values increased to 98% and 97.4% when using SVM. Similarly, accuracies obtained training the GC-corrected models with NB were 74.1% and 76.2% for MEF and liver, respectively, whereas SVM produced accuracies of 96.7% and 93%. Such performance differences are likely to be explained by many features showing dependencies to each other or being poorly informative. This would impair NB performance, which assumes all features are independent. Indeed, removal of not informative features leads to a significant increase of accuracies achieved by the NB algorithm (see next sections).

Chromatin features predict TOP2B binding.
Fig 3
ROC curves and AUC values for Support Vector Machine models trained on the indicated sets of features (for Naive Bayes models, see S4 Fig).Chromatin features predict TOP2B binding.
Table 1
Performance of SVM and NB classifiers in MEF.
SystemCategoryGC correctedRandom
NBSVMNBSVM
MEFAll74.18 ± 1.0896.72 ± 0.1870.94 ± 0.7598.06 ± 0.28
Sequence59.15 ± 0.6155.44 ± 1.1066.87 ± 0.8864.41 ± 0.48
3D shape55.08 ± 0.9254.32 ± 0.8766.23 ± 0.5461.80 ± 0.97
Histone marks84.82 ± 0.7383.90 ± 0.6688.61 ± 0.7087.57 ± 0.74
Architecture90.35 ± 0.6194.98 ± 0.3293.15 ± 0.5896.52 ± 0.39
Open chromatin81.92 ± 0.5883.96 ± 1.4490.48 ± 0.6192.35 ± 0.63
Expression50.36 ± 0.2550.48 ± 0.5450.77 ± 0.2252.52 ± 1.57
Pol260.60 ± 0.6261.48 ± 1.0657.71 ± 0.4656.09 ± 0.92
CpG methylation69.45 ± 0.8873.82 ± 0.8863.98 ± 0.7968.90 ± 1.09
Table 2
Performance of SVM and NB classifiers in mouse liver.
SystemCategoryGC correctedRandom
NBSVMNBSVM
LiverAll76.20 ± 0.9793.00 ± 0.5677.43 ± 0.2097.47 ± 0.13
Sequence62.95 ± 0.7056.64 ± 1.1074.98 ± 0.3172.78 ± 0.94
3D shape57.97 ± 0.6654.00 ± 1.4273.19 ± 0.3869.71 ± 0.46
Histone marks78.03 ± 0.3281.17 ± 0.4778.97 ± 0.4485.34 ± 0.41
Architecture87.45 ± 0.5391.74 ± 0.3789.50 ± 0.5494.26 ± 0.56
Open chromatin71.68 ± 0.4685.32 ± 4.2283.35 ± 0.9292.01 ± 1.57
Expression50.00 ± 0.0150.00 ± 0.0150.04 ± 0.2850.02 ± 0.01
Pol254.92 ± 0.2453.97 ± 0.4252.95 ± 0.7352.63 ± 0.90
CpG methylation71.58 ± 0.3475.31 ± 0.3466.56 ± 0.3669.58 ± 0.44

Chromatin architecture and open chromatin are the most informative features

To explore whether specific molecular events provide different prediction abilities for TOP2B binding, chromatin features were grouped into 8 categories (Fig 2): open chromatin (DNAse-seq, MNase-seq), histone marks (H3K4me1, H3K4me3, H3K9ac, H3K27ac, H3K27me3, H3K36me3), Pol2 binding (POLR2A ChIP-seq), architectural components (ChIP-seq of CTCF and the cohesin complex components RAD21, STAG1 and STAG2), gene expression (RNA-seq), CpG methylation (WGBS), DNA sequence and DNA shape (measured using DNAshapeR [31]). Then, SVM and NB models were constructed for each of such categories.

Different predictive performances were observed for the grouped molecular events, with similar results across systems (Tables 1 and 2; Fig 3 and S4 Fig). Models trained with architectural components achieved the highest prediction accuracies, ranging from 87.4% to 96.5%, with area under ROC curves (AUC) of 0.96—0.99 (Fig 3 and S4 Fig). This is in agreement with previous findings of TOP2B associating with the boundaries of TADs [11, 12]. In the same line, models built using open chromatin factors reflect TOP2B preference to bind accessible DNA, showing accuracies of 71.6%—92.35% and AUC of 0.88—0.99 (Tables 1 and 2; Fig 3 and S4 Fig).

As expected, histone-mark models also obtained reasonably high prediction performances, with accuracies ranging from 78% to 88.6% and AUC of 0.84—0.95 (Tables 1 and 2; Fig 3 and S4 Fig). The set of histone modifications used to feed these models include a combination of promoter, enhancer and gene body associated marks, which represents common genomic regions where TOP2B tends to localize.

In contrast to the above results, gene expression (measured by RNA-seq) appeared to be the least informative feature for TOP2B binding, with accuracies of around 50% and AUC of 0.5—0.54. Interestingly, and despite the observed co-occurrence of TOP2B and Pol2 at gene promoters (Fig 1 and S1 Fig), the model trained with Pol2 binding information also exhibited poor prediction accuracies, ranging from 52.6% to 61.4% and AUC of 0.53—0.67. Although these observations seem to contrast with previous findings of TOP2B being involved in transcriptional regulation [58], we argue that other chromatin features, such as histone marks or DNase-seq may be capturing such association. Given the observed positive correlation of TOP2B-induced DSBs at CTCF and cohesin-bound regions with transcriptional output [17, 18], even architectural factors can be capturing TOP2B association with transcription within our model.

The impact of correcting for GC-bias

According to the performances achieved, the effect of training models using GC-corrected background regions is negligible for the categories analyzed. As expected, this is not the case for DNA sequence and DNA shape models, which obtained decreased prediction accuracies when trained with GC-corrected sites as compared to those achieved using pure random regions. While sequence content exhibited a prediction accuracy of 55.4% and 62.9% and AUC of 0.58—0.66 for GC-corrected models, these values increased to 64.1%—74.9% and AUC of 0.71—0.84 for random models (Tables 1 and 2; Fig 3 and S4 Fig). Similarly, sequence-derived 3D shape models built with GC-corrected regions achieved accuracies from 54% to 57.9% and AUC of 0.56—0.6, whereas pure random models obtained accuracies of 61.8%—73.1% and AUC of 0.67—0.78. Given the remarkable resolution of our model in measuring DNA sequence and sequence-derived 3D DNA shape, these results suggest that sequence information alone is only poorly informative of TOP2B binding, and strengthen the idea of topoisomerases acting where DNA topological problems arise.

The predictive performances of models trained with CpG methylation data were also influenced (although only moderately) by the use of GC-corrected sites as background regions. While pure random models achieved modest accuracies, ranging from 64% to 69.5% and AUC of 0.73—0.76, GC-corrected models obtained moderately higher accuracies of 69.4%—75.3% and increased AUC of 0.78—0.81 (Tables 1 and 2; Fig 3 and S4 Fig). When forcing the G+C content of random regions to match that of TOP2B binding sites, they tend to be located at GC rich sites not occupied by TOP2B. Such regions happen to be hypermethylated as compared with TOP2B sites (S2 Fig), which leads to an increased predictive power of CpG methylation.

Given the modest differences found using random or GC-corrected random regions in model performances, from this point on only GC-corrected random regions were considered.

Three chromatin features explain TOP2B binding genome wide

With the aim of identifying the most relevant features to explain TOP2B binding, we applied feature selection, which reduces the number of attributes in the data matrix while trying to keep (or even improve) the predictive power of the classifier. We used two feature selection algorithms using 5-fold cross-validation: Fast Correlation Based Filter (FCBF) [38] and Scatter Search (SS) [39, 40] (see Materials and methods). In addition to the accuracy and the AUC values, here we also considered the average number of features selected by each strategy and the stability of each algorithm. The stability (or robustness) evaluates the sensitivity to variations of a feature selection algorithm in the dataset. This issue is especially important in high dimensional domains where strategies may yield to different solutions with similar performance and, therefore, this measure enhances the confidence in the analysis of the results.

Both feature selection algorithms succeeded in finding a subset of features that faithfully explained TOP2B binding in both systems, with accuracies ranging from 92.1% to 96.11% (Table 3). As a matter of fact, although in mouse liver the two classification approaches (SVM and NB) achieved the same performance for both feature selection algorithms, this was not the case in MEFs, in which the SVM slightly outperformed NB predictions. Regarding the number of selected features, while FCBF needed averages of 46 and 22 features to optimize the prediction of TOP2B in MEF and liver, respectively, SS only needed 3 features in both systems (S2 Table). It is worth stressing that SS always finds, in every iteration, the same 3 features, achieving a stability score of 1 (Table 3). On the other hand, features selected by FCBF vary from different iterations, which confers a poor robustness score to this strategy.

Table 3
Performance of SVM and NB classifiers using the features selected by FCBF and SS strategies.
SystemAlgorithmNBSVM#featuresStability
MEFFCBF94.60 ± 0.3596.11 ± 0.2846.80 ± 3.900.113
SS93.88 ± 0.3195.51 ± 0.333.00 ± 0.001
LiverFCBF92.62 ± 0.5192.84 ± 1.9122.20 ± 4.380.132
SS92.90 ± 0.4492.10 ± 1.973.00 ± 0.001

The most selected features for each strategy are shown in Fig 4A. For each histogram and feature, the white bar height indicates the frequency of selection and the black bar height is the Symmetrical Uncertainty (SU) value with respect to the class (TOP2B). SU [59] is a widely used technique that allows to measure the relevance between two variables and can be interpreted as a correlation measure to identify non-linear dependencies between features. Due to the mentioned low stability score obtained with the FCBF approach, only 5 out of the average 46 and 3 out of the average 22 features were selected in every iteration for MEFs and liver, respectively (S2 Table). In MEFs, such 5 features were RAD21, DNase-seq, WGBS, MNase and the DNA shape associated translational parameter Shift. In liver, the three selected features corresponded to DNase-seq, STAG2 and WGBS. It is worth mentioning that the only features selected in 4 of the 5 iterations were the DNA shape associated parameters Slide and Shear in MEFs and Helix Twist in liver. The recurrent selection of such DNA shape features in both systems may indicate a modest contribution of these parameters to TOP2B binding. In any case, most of the predictive power was sustained by open chromatin features (DNAse-seq, MNase-seq), cohesin complex associated factors (RAD21, STAG2) and to a lesser extent CpG methylation (WGBS).

Three features accurately predict TOP2B.
Fig 4
A . Top features selected by Fast Correlation Based Filter and Scatter Search algorithms. For each histogram and feature, the white bar height indicates the frequency of selection and the black bar height is the Symmetrical Uncertainty (SU) value with respect to the class (TOP2B). Indexes of DNA shape parameters indicate position within the corresponding parameter vector associated to the 300 bp width of modeled TOP2B binding sites (see Materials and methods). B. Summary of the most selected features by both algorithms. Top and bottom aligned dots indicate selection of a given feature by the corresponding selection algorithm in liver and MEFs, respectively. In the middle, the SU of each feature is displayed. Only the top fifteen features according to their SU are shown, which happen to match in both systems. C . ROC curves and AUC values for Naive Bayes models trained on either MEF, liver or activated B cells and applied to the three systems (for Support Vector Machine and Random Forests models, see S4 Fig). Only DNase-seq, RAD21 and CTCF binding data were used for training.Three features accurately predict TOP2B.

A significantly more robust and consistent selection was achieved when using SS (Fig 4A), where only three features were selected in every iteration for both systems. Furthermore, DNase-seq and RAD21 were both selected for MEFs and liver, which is in line with above results and highlights the influence of accessible chromatin and cohesin factors for TOP2B binding across systems. As a matter of fact, while in MEFs SS selected CTCF at every iteration, this was not the case in liver, where the cohesin complex factor STAG2 was always selected (Fig 4A).

A summary of the features selected by FCBS and SS in both systems is shown in Fig 4B. Top and bottom aligned dots indicate selection of a given feature by the corresponding selection algorithm in liver and MEFs, respectively. In the middle, the relevance (SU) of each feature with respect to TOP2B is displayed. Only the top fifteen features according to their SU are shown, which happen to match in both systems. With the exception of STAG2, the trends of liver and MEFs are very similar, indicating that the specific contributions of genomic features for TOP2B prediction are conserved across systems. Furthermore, the results obtained by the SS algorithm strengthen the importance of accessible chromatin and architectural factors for TOP2B binding, and shows that a small number of these features are enough to achieve accurate genome-wide predictions. Indeed, comparison of models trained using the whole set of chromatin features with models trained only with SS-selected features showed that both sets achieved similar performances (S5 Fig). Therefore, we decided to use only the three SS-selected features for subsequent analyses. It is noteworthy to mention that, since data on STAG2 binding is rather scarce, we focused on DNase-seq, RAD21 and CTCF. Although the latter was only selected in MEFs, it was among the most informative features in mouse liver (Fig 4B; S2 Table), and there is a wealth of ChIP-seq data available for this protein. Furthermore, comparison of DNase-RAD21-CTCF and DNase-RAD21-STAG2 models showed that both approaches achieve similar performances in this tissue (S6 Fig and S3 Table), and consequently, we did not loose predictive power.

Prediction across systems

We next asked to what extent the three selected features could be used to predict TOP2B binding in systems that were not used for training. Besides the two systems considered so far, we included TOP2B binding data from mouse activated B cells (Gene Expression Omnibus GSM2635606). We assessed model predictions by training on one system and testing prediction accuracy on the rest. Besides SVM and NB, this time we also applied Random Forests (RF) [36], which has been widely applied to genomic data with excellent results [37]. RF models performances were comparable to those of SVM and NB (S4 Table), and analysis of feature importances was consistent with the feature selection results (S7 Fig). The achieved predictions showed that cross-system applications of the classifiers did not decrease their performances, which acquired accurate predictions independently of the training system (Fig 4C and S8 Fig; Table 4). For example, when we applied NB models trained on MEF and mouse liver to activated B cells, we obtained accuracies of 98.3% and 97.9%, respectively, compared to an average of 97.6% using the model trained on activated B cells itself. These results support the idea that TOP2B association with DNase-seq, CTCF and RAD21 can be generalized across mouse systems.

Table 4
Errors are only shown for models trained and tested with the same system due to test data partition.
Performance of cross system predictions using models trained with DNase-seq, RAD21 and CTCF.
System (train/test)RFNBSVM
MEF95.03 ± 0.4493.89 ± 0.2395.22 ± 0.28
MEF/Liver91.9789.4392.36
MEF/aB cells98.3198.2698.38
Liver94.90 ± 0.1691.60 ± 0.5192.57 ± 2.20
Liver/MEF89.3389.0493.11
Liver/aB cells96.3897.8798.21
aB cells98.70 ± 0.2297.63 ± 0.4298.28 ± 0.31
aB cells/MEF93.1685.0092.95
aB cells/Liver89.0885.4981.19

A general model of TOP2B binding based on DNase-seq, RAD21 and CTCF

Given the above observations, and to better capture the system specificities of TOP2B binding in a single classifier, we built a model trained with a combination of data from mouse liver, MEFs and activated B cells. We used this model to generate genome wide predictions of TOP2B binding in several systems, including those used for training and two additional systems that served as the validation sets (see next sections). To do so, we split the genome in 300 bp windows, scored DNase-seq, CTCF and RAD21 signals and applied the TOP2B generalizing classifier within such windows, obtaining probability values through the whole genome for each system (S9 Fig).

The fact that only three features are sufficient to predict TOP2B binding genome wide opens the question of whether our model is truly informative, or similar performances could be achieved by simply considering the merged set of DNAse-seq, CTCF and RAD21 peaks. In order to test this, we focused on datasets in mouse liver, due to the excellent quality and reproducibility of the experimental TOP2B ChIPseq data [11]. We generated genome wide predictions and considered regions with a probability > 0.95 as predicted binding sites. Then, we compared our predictions and the set of merged DNase-CTCF-RAD21 peaks regarding their overlap with experimental TOP2B peaks (S10(A) Fig). We identified 68, 163 predicted sites, among which 56% coincided with ChIP-seq peaks. Regarding the 117, 714 DNase-CTCF-RAD21 detected peaks, this percentage decreased to 37%, suggesting that the merged-peak approach could lead to the selection of a large number of false positives. Indeed, while the specific potential false positives of our model (not identified by ChIP or by the merged set) clearly displayed TOP2B signal in heatmaps (S10(B) Fig, bottom), this was greatly reduced in the case of the specific DNase-CTCF-RAD21 set (S10(B) Fig, top). These results highlight the sensitivity of our model and shows that it clearly outperforms a simple DNase-CTCF-RAD21 peak merging by also capturing quantitative TOP2B binding information.

TOP2B probability mirrors experimental binding intensity genome wide

Given the remarkable predictive power of our model, we used the genome wide TOP2B binding probabilities generated in the previous section (S9 Fig) as virtual tracks and represented them in a genome browser [26] together with experimental TOP2B ChIP-seq signals. Virtual tracks faithfully simulated ChIP-seq data at the gene level and accurately captured system binding differences in the training datasets (Fig 5 and S11 Fig). It is worth noting that, even though we modeled TOP2B binding using binary classifiers, the similarities between the experimental signals and our predicted probabilities are quite significant, showing Pearson’s correlation coefficients comparable to those obtained across replicates (S5 Table). Furthermore, signal comparison with the three chromatin features used as predictors showed that the model was also able to apply learned associations of DNAse-seq, RAD21 and CTCF with TOP2B to precisely generate bona fide predictions (Fig 5 and S11 Fig). These results highlight the power of our model and demonstrate that the described approach can be used as a framework for the prediction of genome wide TOP2B signal potentially in any mammalian system for which sequencing data on DNAse I hypersensitivity, RAD21 and CTCF are available. Furthermore, this demonstrates that, besides global TOP2B binding, our model can be used to produce virtual probability tracks with which quantitative TOP2B accumulation can be analyzed at specific genomic loci.

TOP2B predictive and experimental tracks in mouse liver, MEFs and activated B cells (training systems).
Fig 5
From top to bottom, genome browser view of TOP2B ChIP-seq, TOP2B virtual track, CTCF ChIP-seq, RAD21 ChIP-seq and DNase-seq are displayed for each system.TOP2B predictive and experimental tracks in mouse liver, MEFs and activated B cells (training systems).

Validation in mouse thymocytes

To further test the predictive power of our approach, we generated genome wide predictions in mouse thymocytes and used TOP2B ChIP-seq data performed in our laboratory [60] to validate them. In order to test the specificity of our model, we evaluated the ChIP-seq signal generated by two different antibodies (Novus and Santa Cruz). Visual inspection of predictive and experimental tracks in a genome browser revealed that our predictions accurately mirror ChIP-seq signals (Fig 6A and S12 Fig). Furthermore, the predictions seemed to capture the learned associations of DNAse-seq, RAD21 and CTCF with TOP2B (S12 Fig), as observed for the training systems, and showed Pearson’s correlation coefficients comparable to those obtained across replicates (S6 Table). Next, we called peaks using HOMER [25] and identified 2, 570 and 13, 297 TOP2B binding sites for Novus and Santa Cruz, respectively. Then, we applied the model and computed AUC values (Fig 6B). Although moderate predictive power was achieved for individual antibodies (AUC of 0.83 and 0.73 for Novus and Santa Cruz, respectively), when applying the model to common peaks of both experiments we obtained quite accurate predictions, with AUC of 0.96.

Validation of TOP2B model in mouse thymocytes and human MCF7.
Fig 6
A. TOP2B predictive and ChIP-seq tracks of mouse thymocytes in a selected region of the mouse genome. Identified peaks are displayed above the signals. B. ROC curves and AUC values for the prediction of TOP2B peaks detected using Novus and Santa Cruz antibodies. C. Venn diagram showing the overlaps between predicted TOP2B peaks and the two sets of experimental peaks. D. Heatmap representations of TOP2B ChIP-seq reads enrichment within ± 8 kb of several set of peaks: predicted and confirmed by HOMER calls of both antibodies (A1-A2-P), predicted and confirmed only by Novus antibody (A1-P) or Santa Cruz (A2-P), only predicted (P-only), only detected by Novus (A1-only) and only detected by Santa Cruz (A2-only). For illustration purposes, the same number of randomly selected peaks is represented in all the heatmaps. E. TOP2B predictive and ChIP-seq tracks of MCF7 in a selected region of the human genome. Identified peaks are displayed above the signals. F. ROC curves and AUC values for the prediction of TOP2B peaks in two replicates performed in MCF7. G. Venn diagram showing the overlaps between predicted TOP2B peaks and the two sets of experimental peaks. F. Heatmap representations of TOP2B ChIP-seq reads enrichment within ± 8 kb of several set of peaks: predicted and confirmed by HOMER calls of both replicates (R1-R2-P), predicted and confirmed only by the first (R1-P) or the second replicate (R2-P), only predicted (P-only), only detected by the first replicate (R1-only) and only detected by the second replicate (R2-only). As in D, the same number of randomly selected peaks is represented in all the heatmaps.Validation of TOP2B model in mouse thymocytes and human MCF7.

To further compare predictions with our experimental ChIP-seq data in mouse thymus, we performed peak overlap analysis. Again, we filtered our predictive track and kept only those genomic windows with a probability of TOP2B binding above 0.95, which resulted in 46, 836 predicted peaks. Such peaks included 53.4% of experimental ChIP-seq peaks detected using Novus antibody, 30.7% of those detected using Santa Cruz and 72.6% of common peaks (Fig 6C). In addition, our model detected a set of 41, 766 TOP2B peaks that were not identified by HOMER in any of the two ChIP-seq experiments. In order to assess whether we were potentially detecting false positives, we examined ChIP-seq intensity centered at TOP2B peaks using heatmaps (Fig 6D). We displayed TOP2B at common and individually identified peaks, including (Fig 6D): predicted peaks confirmed by both experimental datasets (A1-A2-P), predicted and confirmed only by Novus antibody (A1-P) or Santa Cruz (A2-P), only predicted (P-only), only detected by Novus (A1-only) and only detected by Santa Cruz (A2-only). Interestingly, an enrichment of TOP2B signal was observed at all types of predicted peaks. As expected, A1-A2-P, A1-P and A2-P peaks showed a higher TOP2B enrichment, although peaks only predicted by our model still displayed an evident signal for both antibodies (Fig 6D). Visual inspection of such regions in a genome browser confirmed that, although less enriched than those confirmed by the antibodies, P-only peaks constitute TOP2B binding sites that were not identified by peak calling (S13 Fig). Furthermore, examination of A1-only and A2-only peaks showed that our predictive track displayed TOP2B probability at these regions, but they were not classified as predicted due to the threshold of 0.95 (S13 Fig). In summary, these observations indicate that our predictions are indeed highly sensitive to TOP2B binding even in a mouse system not used for generating the model.

Validation in human cells

In order to test whether our TOP2B binding model can be generalized to other organisms, we generated predictions in the human MCF7 cell line and performed ChIP-seq of TOP2B to validate them. This time we performed two replicates using the same antibody (Sigma), whose specificity was validated for both human and mouse cells (S14 Fig), and allowed us to evaluate to what extent our predictions resemble an experimental replicate. Indeed, visual inspection in a genome browser revealed that TOP2B predictive track accurately mirrors both experimental signals (Fig 6E and S12 and S15 Figs), and Pearson’s correlation coefficients were again comparable to those obtained across replicates (S7 Table). Then, we called HOMER and identified 35, 778 and 27, 966 TOP2B peaks for each replicate, with 10, 310 common peaks (Fig 6G). Our model obtained quite accurate predictions on both sets of peaks, with AUC of 0.87 and 0.91 (Fig 6F). When predicting on the set of common peaks, the model performed even better, with AUC of 0.97.

Again, we performed peak overlap analysis to further compare predictions with our experimental ChIP-seq data. We identified 79, 596 regions with a probability of TOP2B binding above 0.95, including 45.98% of experimental peaks from replicate 1, 56.75% of experimental peaks from replicate 2 and 74.45% of common peaks, showing that our predictions behave as a virtual experimental replicate (Fig 6G). Since our predictions revealed an additional set of 55, 311 peaks that were not detected by HOMER, we assessed whether these corresponded to potential false positives, as described above: predicted and confirmed by HOMER in both replicates (R1-R2-P), predicted and confirmed only by the first (R1-P) or the second replicate (R2-P), only predicted (P-only), only detected by the first replicate (R1-only) and only detected by the second replicate (R2-only). We observed a similar pattern as in our previous analysis in mouse thymocytes, with predicted peaks confirmed by both antibodies showing the highest enrichment of TOP2B reads (Fig 6H). Again, we examined peaks only predicted by our model in a genome browser and confirmed that they corresponded to true binding sites (S15 Fig). On the other hand, R1-only and R2-only peaks were found to be enriched in our probability track, which indicates that they were not detected as predicted peaks because of the defined threshold of 0.95 (S15 Fig). These results confirm that the model is also accurate in predicting TOP2B binding in human cells.

Prediction of TOP2-induced DSBs

Upon etoposide treatment, abortive TOP2 activity results in DSBs at CTCF-RAD21 sites, as measured by the END-seq method [12, 17]. Since TOP2-induced DSBs have important implications for cancer-linked translocations [12, 17, 18, 60, 61], we decided to test the capacity of our model to predict these breaks. Given that many of the datasets provided by Canela et al (2017) correspond to MEFs, we focused our analysis on this cell line and processed an END-seq sample treated with etoposide in addition to the already analyzed TOP2B ChIP-seq dataset (see S1 Table). A genome browser view of a chromatin loop near the Lrp8 gene revealed an enrichment of END-seq and TOP2B ChIP-seq signals at the anchors, which was accurately mirrored by our predictive track (S16(A) Fig). We performed peak calling on the END-seq sample and identified 4, 363 DSBs. Then, we applied our model and computed AUC values, obtaining accurate predictions of END-seq peaks (AUC of 0.91), although still lower than those achieved on the ChIP-seq sample (AUC of 0.99) (S16(B) Fig). We next performed peak overlap analysis to further compare our genome wide predictions with the experimental DSBs. We used our previously generated TOP2B virtual track and filtered regions with a probability greater than 0.95. This resulted in 87, 205 predicted peaks, including 61.2% of experimental DSBs and 83.3% of ChIP-seq peaks (S16(C) Fig). Thus, our model achieves a reasonably high prediction accuracy when applied to TOP2-induced DSBs, but, as expected, is clearly better in predicting TOP2B binding. We argue that other factors beyond binding itself may be implicated in TOP2B activity and subsequent DSB formation.

Conclusion

TOP2B is a crucial enzyme for DNA metabolism whose activity is required for a wide range of cellular processes including replication, transcription and genome organization. The few genome-wide maps of TOP2B that are currently available show high colocalization with several chromatin features, but no work has been devoted to comprehensively study predictive features that define TOP2B binding. By applying feature selection techniques, we show that TOP2B binding can be accurately predicted using only three high-throughput sequencing datasets, DNase-seq and ChIP-seq of RAD21 and CTCF, highlighting the importance of TOP2B in the regulation of genome architecture. Since the above datasets are largely available in public databases, our framework allows the scientific community to predict TOP2B binding in a large set of organisms, cell types, tissues and biological conditions. Finally, we believe that this approach and the generation of virtual probability tracks could be applied to other chromatin binding factors with a similar sequence-independent binding behavior to TOP2B.

Acknowledgements

We thank A. Herrero-Ruiz and C. Gómez-Marín for valuable comments. All the analyses were performed using custom scripts that were run on the High Perfomance Computing cluster provided by the Centro Informático Científico de Andalucía (CICA).

References

VosSM, TretterEM, SchmidtBH, BergerJM. All tangled up: How cells direct, manage and exploit topoisomerase function. Nature Reviews Molecular Cell Biology. 2011; 12(12):827841. doi: 10.1038/nrm3228

NitissJL. DNA topoisomerase II and its growing repertoire of biological functions. Nature Reviews Cancer. 2009;9(5):327337. doi: 10.1038/nrc2608

PommierY, SunY, HuangSyN, NitissJL. Roles of eukaryotic topoisomerases in transcription, replication and genomic stability. Nature Reviews Molecular Cell Biology. 2016;17(11):703721. doi: 10.1038/nrm.2016.111

DeweeseJE, OsheroffN. The DNA cleavage reaction of topoisomerase II: Wolf in sheep’s clothing. Nucleic Acids Research. 2009;37(3):738748. doi: 10.1093/nar/gkn937

AkimitsuN, KamuraK, TonéS, SakaguchiA, KikuchiA, HamamotoH, et al Induction of apoptosis by depletion of DNA topoisomerase IIα in mammalian cells. Biochemical and Biophysical Research Communications. 2003;307(2):301307. doi: 10.1016/S0006-291X(03)01169-0

ThakurelaS, GardingA, JungJ, SchubelerD, BurgerL, TiwariVK. Gene regulation and priming by topoisomerase IIalpha in embryonic stem cells. Nat Commun. 2013;4:2478 doi: 10.1038/ncomms3478

DereuddreS, DelaporteC, Jacquemin-SablonA. Role of topoisomerase IIβ in the resistance of 9-OH-ellipticine- resistant Chinese hamster fibroblasts to topoisomerase II inhibitors. Cancer Research. 1997;57(19):43014308.

CapranicoG, TinelliS, AustinCA, FisherML, ZuninoF. Different patterns of gene expression of topoisomerase II isoforms in differentiated tissues during murine development. BBA—Gene Structure and Expression. 1992;1132(1):4348. doi: 10.1016/0167-4781(92)90050-A

MadabhushiR, GaoF, PfenningAR, PanL, YamakawaS, SeoJ, et al Activity-Induced DNA Breaks Govern the Expression of Neuronal Early-Response Genes. Cell. 2015;161(7):15921605. doi: 10.1016/j.cell.2015.05.032

10 

ManvilleCM, SmithK, SondkaZ, RanceH, CockellS, CowellIG, et al Genome-wide ChIP-seq analysis of human TOP2B occupancy in MCF7 breast cancer epithelial cells. Biology Open. 2015;4(11):14361447. doi: 10.1242/bio.014308

11 

Uusküla-ReimandL, HouH, Samavarchi-TehraniP, RudanMV, LiangM, Medina-RiveraA, et al Topoisomerase II beta interacts with cohesin and CTCF at topological domain borders. Genome Biol. 2016;17(1):122. doi: 10.1186/s13059-016-1043-8

12 

CanelaA, MamanY, JungS, WongN, CallenE, DayA, et al Genome Organization Drives Chromosome Fragility. Cell. 2017;170(3):507521.e18. doi: 10.1016/j.cell.2017.06.034

13 

DellinoGI, PalluzziF, ChiarielloAM, PiccioniR, BiancoS, FuriaL, et al Release of paused RNA polymerase II at specific loci favors DNA double-strand-break formation and promotes cancer translocations. Nature Genetics. 2019; doi: 10.1038/s41588-019-0421-z

14 

CicciaA, ElledgeSJ. The DNA Damage Response: Making It Safe to Play with Knives. Molecular Cell. 2010; 40(2):179204. doi: 10.1016/j.molcel.2010.09.019

15 

PoloSE, JacksonSP. Dynamics of DNA damage response proteins at DNA breaks: A focus on protein modifications. Genes & Development. 2011; 25(5):409433. http://www.genesdev.org/cgi/doi/10.1101/gad.2021311

16 

ChapmanJR, TaylorMRG, BoultonSJ. Playing the End Game: DNA Double-Strand Break Repair Pathway Choice. Molecular Cell. 2012; 47(4):497510. doi: 10.1016/j.molcel.2012.07.029

17 

CanelaA, MamanY, HuangSyN, WutzG, TangW, Zagnoli-VieiraG, et al Topoisomerase II-Induced Chromosome Breakage and Translocation Is Determined by Chromosome Architecture and Transcriptional Activity. Molecular Cell. 2019; doi: 10.1016/j.molcel.2019.04.030

18 

GotheHJ, BouwmanBAM, GusmaoEG, PiccinnoR, PetrosinoG, SayolsS, et al Spatial Chromosome Folding and Active Transcription Drive DNA Fragility and Formation of Oncogenic MLL Translocations. Molecular Cell. 2019; doi: 10.1016/j.molcel.2019.05.015

19 

ConsortiumEP. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):5774. doi: 10.1038/nature11247

20 

LangmeadB, TrapnellC, PopM, SalzbergS. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology. 2009;10(3):R25 doi: 10.1186/gb-2009-10-3-r25

21 

HahneF, IvanekR. Visualizing genomic data using Gviz and bioconductor. In: Methods in Molecular Biology; 2016 doi: 10.1007/978-1-4939-3578-9_16

22 

Ou J, Zhu LJ. trackViewer: a Bioconductor package for interactive and integrative visualization of multi-omics data; 2019.

23 

ZhangY, LiuT, MeyerCA, EeckhouteJ, JohnsonDS, BernsteinBE, et al Model-based Analysis of ChIP-Seq (MACS). Genome Biology. 2008;9(9):R137 doi: 10.1186/gb-2008-9-9-r137

25 

HeinzS, BennerC, SpannN, BertolinoE, LinYC, LasloP, et al Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Molecular Cell. 2010; doi: 10.1016/j.molcel.2010.05.004

26 

KentWJ, SugnetCW, FureyTS, RoskinKM, PringleTH, ZahlerAM, et al The Human Genome Browser at UCSC. Genome Research. 2002; doi: 10.1101/gr.229102

27 

YuG, WangLG, HeQY. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics. 2015;31(14):23822383. doi: 10.1093/bioinformatics/btv145

28 

MatthewsBJ, WaxmanDJ. Computational prediction of CTCF/cohesin-based intra-TAD loops that insulate chromatin contacts and gene expression in mouse liver. eLife. 2018;7:e34077 doi: 10.7554/eLife.34077

29 

ComoglioF, ParoR. Combinatorial Modeling of Chromatin Features Quantitatively Predicts DNA Replication Timing in Drosophila. PLoS Computational Biology. 2014;10(1). doi: 10.1371/journal.pcbi.1003419

30 

ListerR, MukamelEA, NeryJR, UrichM, PuddifootCA, JohnsonND, et al Global epigenomic reconfiguration during mammalian brain development. Science. 2013; doi: 10.1126/science.1237905

31 

ChiuTP, ComoglioF, ZhouT, YangL, ParoR, RohsR. DNAshapeR: An R/Bioconductor package for DNA shape prediction and feature encoding. Bioinformatics. 2016; doi: 10.1093/bioinformatics/btv735

32 

LiJ, SagendorfJM, ChiuTP, PasiM, PerezA, RohsR. Expanding the repertoire of DNA shape features for genome-scale studies of transcription factor binding. Nucleic Acids Research. 2017; doi: 10.1093/nar/gkx1145

33 

ZhouT, ShenN, YangL, AbeN, HortonJ, MannRS, et al Quantitative modeling of transcription factor binding specificities using DNA shape. Proceedings of the National Academy of Sciences. 2015;112(15):46544659. doi: 10.1073/pnas.1422023112

34 

DudaRO, HartPE. Pattern classification and scene analysis. New York, USA: John Wiley & Sons; 1973.

35 

VapnikVN. Statistical Learning Theory. Wiley-Interscience; 1998.

36 

BreimanL. Random Forests. Machine Learning. 2001;45:532. doi: 10.1023/A:1010933404324

37 

ChenX, IshwaranH. Random forests for genomic data analysis. Genomics. 2012; 99(6):3239. doi: 10.1016/j.ygeno.2012.04.003

38 

YuL, LiuH. Efficient Feature Selection via Analysis of Relevance and Redundancy. Journal of Machine Learning Research. 2004;5:12051224.

39 

García-LópezFC, García-TorresM, Moreno-PérezJA, Moreno-VegaJM. Scatter Search for the Feature Selection Problem In: ConejoR, UrretavizcayaM, Pérez-de-laCruz JL, editors. Current Topics in Artificial Intelligence. Berlin, Heidelberg: Springer Berlin Heidelberg; 2004 p. 517525. doi: 10.1007/978-3-540-25945-9_51

40 

García-LópezFC, García-TorresM, Melián-BatistaB, PérezJAM, Moreno-VegaJM. Solving the Feature Selection Problem by a Parallel Scatter Search. European Journal of Operations Research. 2006;169(2):477489. doi: 10.1016/j.ejor.2004.08.010

41 

HallMA. Correlation-based Feature Subset Selection for Machine Learning. University of Waikato; 1998.

42 

LiZ, SchugJ, TutejaG, WhiteP, KaestnerKH. The nucleosome map of the mammalian liver. Nature Structural and Molecular Biology. 2011; doi: 10.1038/nsmb.2060

43 

FaureAJ, SchmidtD, WattS, SchwaliePC, WilsonMD, XuH, et al Cohesin regulates tissue-specific expression by stabilizing highly occupied cis-regulatory modules. Genome Research. 2012; doi: 10.1101/gr.136507.111

44 

RemeseiroS, CuadradoA, Gómez-LãpezG, PisanoDG, LosadaA. A unique role of cohesin-SA1 in gene regulation and development. EMBO Journal. 2012; doi: 10.1038/emboj.2012.60

45 

TeifVB, VainshteinY, Caudron-HergerM, MallmJP, MarthC, HöferT, et al Genome-wide nucleosome positioning during embryonic stem cell development. Nature Structural and Molecular Biology. 2012; doi: 10.1038/nsmb.2419

46 

HonGC, RajagopalN, ShenY, McClearyDF, YueF, DangMD, et al Epigenetic memory at embryonic enhancers identified in DNA methylation maps from adult mouse tissues. Nature Genetics. 2013; doi: 10.1038/ng.2746

47 

KraushaarDC, JinW, MaunakeaA, AbrahamB, HaM, ZhaoK. Genome-wide incorporation dynamics reveal distinct categories of turnover for the histone variant H3.3. Genome Biology. 2013; doi: 10.1186/gb-2013-14-10-r121

48 

YuW, BrionesV, ListerR, McIntoshC, HanY, LeeEY, et al CG hypomethylation in Lsh-/- mouse embryonic fibroblasts is associated with de novo H3K4me1 formation and altered cellular plasticity. Proceedings of the National Academy of Sciences of the United States of America. 2014;111(16):58905895. doi: 10.1073/pnas.1320945111

49 

XuY, GuoW, LiP, ZhangY, ZhaoM, FanZ, et al Long-Range Chromosome Interactions Mediated by Cohesin Shape Circadian Gene Expression. PLoS Genetics. 2016; doi: 10.1371/journal.pgen.1005992

50 

FloresO, OrozcoM. nucleR: A package for non-parametric nucleosome positioning. Bioinformatics. 2011; doi: 10.1093/bioinformatics/btr345

51 

KimSH, GanjiM, KimE, van der TorreJ, AbbondanzieriE, DekkerC. DNA sequence encodes the position of DNA supercoils. eLife. 2018; doi: 10.7554/eLife.36557

52 

MathelierA, XinB, ChiuTP, YangL, RohsR, WassermanWW. DNA Shape Features Improve Transcription Factor Binding Site Predictions In Vivo. Cell Systems. 2016;3(3):278286.e4. doi: 10.1016/j.cels.2016.07.001

53 

RaoS, ChiuTP, KribelbauerJF, MannRS, BussemakerHJ, RohsR. Systematic prediction of DNA shape changes due to CpG methylation explains epigenetic effects on protein-DNA binding. Epigenetics and Chromatin. 2018; doi: 10.1186/s13072-018-0174-4

54 

JonesPA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature Reviews Genetics. 2012;13(7):484492. doi: 10.1038/nrg3230

55 

VinsonC, ChatterjeeR. CG methylation. Epigenomics. 2012;4(6):655663. doi: 10.2217/epi.12.55

56 

RubeHT, LeeW, HejnaM, ChenH, YasuiDH, HessJF, et al Sequence features accurately predict genome-wide MeCP2 binding in vivo. Nature Communications. 2016; doi: 10.1038/ncomms11025

57 

ZhouT, YangL, LuY, DrorI, Dantas MachadoAC, GhaneT, et al DNAshape: a method for the high-throughput prediction of DNA structural features on a genomic scale. Nucleic acids research. 2013;41(Web Server issue). doi: 10.1093/nar/gkt437

58 

TiwariVK, BurgerL, NikoletopoulouV, DeograciasR, ThakurelaS, WirbelauerC, et al Target genes of Topoisomerase II regulate neuronal survival and are defined by their chromatin state. Proceedings of the National Academy of Sciences. 2012;109(16):E934E943. doi: 10.1073/pnas.1119798109

59 

Witten IH, Frank E, Hall MA, Pal CJ. Data Mining: Practical Machine Learning Tools and Techniques; 2016. doi: 10.1016/C2009-0-19715-5

60 

Álvarez-QuilónA, Terrón-BautistaJ, Delgado-SainzI, Serrano-BenítezA, Romero-GranadosR, Martínez-GarcíaPM, et al Endogenous topoisomerase II-mediated DNA breaks drive thymic cancer predisposition linked to ATM deficiency. Nature Communications. 2020; 11(1):910 doi: 10.1038/s41467-020-14638-w

61 

Olmedo-PelayoJ, Rubio-ContrerasD, Gómez-HerrerosF. Canonical non-homologous end-joining promotes genome mutagenesis and translocations induced by transcription-associated DNA topoisomerase 2 activity. Nucleic Acids Research. 2020; 48(16):91479160. doi: 10.1093/nar/gkaa640

3 Jun 2020

Dear Dr. Cortés-Ledesma,

Thank you very much for submitting your manuscript "Genome-wide prediction of topoisomerase IIβ binding by architectural factors and chromatin accessibility" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. Even though we expect you to address each comment, it seems to us to be essential that you particularly address concerns  about the distinction of the contribution of this work from ref 11 and other related literature, comparison to a simple baseline prediction (regions with CTCF, DNaseI and cohesin peaks), antibody validation, availability of the code (GitHub or similar) and data (please include the token within the main manuscript in addition to the cover letter), and clarification about feature selection strategy.  

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Ferhat Ay, Ph.D

Associate Editor

PLOS Computational Biology

Alice McHardy

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors utlise a number of publically available chip seq data sets for TOP2B and also for CTCF, RAD21, STAG1, STAG2, and a number of histone modifications. They use a range of computing methods to compare the genomic locations of TOP2B peaks/enriched regions with the peaks for the other types of proteins (CTCF, RAD21 etc). They also look at chromatin accessibility (DNAase hypersentivity) and various DNA features including CpG islands.

They do find concordance between TOP2B peaks DNAase hypersenitive regions and CTCF and RAD21. However this is not surprising, as in ref 11 Uuskula-Reimand et al, they previously showed and reported in detail that half of CTCF and cohesin (contains RAD21) - part of abstract from Ref 11 copied below

"TOP2B associates with DNase I hypersensitivity sites, allele-specific transcription factor (TF) binding, and evolutionarily conserved TF binding sites on the mouse genome. Approximately half of all CTCF/cohesion-bound regions coincided with TOP2B binding. Base pair resolution ChIP-exo mapping of TOP2B, CTCF, and cohesin sites revealed a striking structural ordering of these proteins along the genome relative to the CTCF motif. These ordered TOP2B-CTCF-cohesin sites flank the boundaries of topologically associating domains (TADs) with TOP2B positioned externally and cohesin internally to the domain loop."

https://www.ncbi.nlm.nih.gov/pubmed/27582050

Although this reference is cited in the introduction of this study, this study fails to fully address how the findings in this manuscript relate to this previous closely related research. The authors should consider revising their Introduction and Discussion to reference more fully the closely related previous literature.

This study also reports the generation two new datasets one from mouse cells and one from human cells to utlise their machine learning programmes to determine if they can predict where the TOP2B peaks/enriched regions will be found. In the venn diagrams in Figure 5B and 5E approx half the predicted TOP2B sites (based on the location of CTCF, RAD21 and DNAse hypersentive sites) overlap with the experimentally determined TOP2B sites. This agrees with the 50% concordance published in ref 11. What might be the reasons for the concordance not being 100%?

This manuscript confirms what was was previously reported for murine cells, using different computer programmes.

Reviewer #2: In this manuscript Martinez-Garcia et al. develop a computational tool to predict the localization of TOPO2 beta genome wide based on input data sets from DNaseI, and Chip-Seq of CTCF and RAD21. The model can be used to predict both mouse and human TOPO2 beta localization. These results are based on the previously observed strong co-localization of TOPO2 beta with CTCF, cohesion and DNaseI accessible sites from several research groups in both mouse and human cells. The models show very good accuracy, with reported ~95% of TOPO2 beta peaks called using the trained sets with the models, and ~90% accuracy using the experimental data sets (MCF7, mouse thymocytes). Overall this is a well-done manuscript. However, given the strong co-localization of TOPO2 beta with the features used to develop the models (DNaseI, CTCF and cohesion), its utility is somewhat diminished. I recommended publication of this manuscript provided the following concerns are addressed.

Major Points

1.) It has been well known that TOPO2 beta co-localizes with DNaseI, CTCF and cohesion binding sites. It seems that much of the TOPO2 binding can be predicted summing the unique peak calls from the three chromatin elements (Fig 1C). What is the overlap frequency of TOPO2 beta peaks to the combined CTCF, DNaseI and Rad21 peaks (removing redundant peak calls – Bedtools Merge function)? Can this simple method achieve worse/similar/better accuracy that the computational methods? Can this analysis be added to Figure 5 (or as a supplement). A similar presentation can be used as in Figure 5 with the merged DNAseI, CTCF and Rad21 peak data set replacing the “Predicted” data set from the authors models. This could serve as good control as to the superiority of the authors models.

2.) It has been known for some time that regions of open chromatin are “sticky”. This has been seen when using IgG controls in ChIP-Seq experiments and when ChIPing proteins which do not associate with chromatin like GFP. This is a aprticulare problem if the washing is not stringent enough. I would like the authors to process an IgG control ChIP-Seq data set for one other their cell lines (at similar read density to the TOPO2 Beta data set), call peaks using similar parameters as used to call TOPO2 Beta peaks and determine the overlap with their TOPO2 Beta peaks and predicted peaks as seen in Figure 5 B, E. Can the analysis be included in the supplement?

3.) It is important to validate antibodies prior to using them for ChIP-Seq experiments. I don’t see that the authors have validated the TOPO2 Beta antibodies used for the ChjIP-Seq experiments. Can a extract of total cellular proteins be resolved on a 4-20% gradient gel and the antibodies used for ChIP-Seq be used to do a Western blot? Is one band observed at the expected molecular weight? Can this be added to the supplement?

4.) I don’t see anywhere in the manuscript that the code to run the model is available. I do see a GEO deposit number. Does this have the code to run the model? I am assuming it is not available to the reviewers as a reviewer link and password is not provided. The code needs to be made freely available. Detailed instructions should be provided on how to run the code, using CTCF, RAD21 and DNAseI inputs, to get the predicted TOPO2 Beta output.

Minor Points

1.) Can a key be added to 1B and 1D? as in figure 1C? It is difficult for the reader to go to the figure legend to determine what the groups are.

2.) Can the ChIP-Seq peak calls and the “predicted” peak cells be added for each track in Figure 6 A and B? Also, supplemental Fig S6, S7 and Fig 1A?

Reviewer #3: Summary:

Martinez-Garcia et al., built several machine learning models to predict the genome-wide binding of topoisomerase II beta (TOP2B) using sequence features, DNA shape, and chromatin properties obtained from publically available data. The authors showed that open chromatin and occupancy of chromatin architecture proteins are consistently the best predictors.Using only DNase I signal, CTCF, and RAD21 binding, the authors were able to accurately predict TOP2B binding. With experimental validations, the authors showed that the predictive power of these three features is conserved across cell types and between human and mouse, achieving performance similar to biological replicates. Furthermore the predicted probability of TOP2B binding highly resembles experimental data showing the potential of using such a model to achieve quantitative predictions.

We find the work by Martinez-Garcia et al. to be interesting and the manuscript well-written and should be of general interest to genome biologists and those with specific interests in DNA topoisomerases and DNA repair. This is a computational biology paper takes published observations that TOP2B correlates well with cohesin and DNA and formalizes them into a predictive model. Understanding TOP2B binding has important implications yet experiments are often limited by the availability of antibodies. Thus, the predictive model proposed in the manuscript is a valuable approach that could be applied by others. The authors evaluated different models and feature selection methods. Reasonable approaches appear to be taken both computationally and experimentally. For example, the authors carefully tested and interpreted the usage of GC-corrected background sets and used two different TOP2B antibodies for experimental validation.

Major comments:

The major concerns I have come from:

1) Apparent lack of availability of data (no reviewer token was given and so I could not check the if appropriate data and metadata was submitted. Small but crucial details such as the specific antibodies (e.g. no part number was provided) used were not included in the methods and there was no way to assess whether this, along with other crucial metadata was included in the GEO submission referred to.

2) Equally important would be to address the lack of code and intermediate files that would allow others, including reviewers, to replicate the findings. Given this paper uses existing and appropriate models to make useful predictions, it would be most important that other computational biologist can easily replicate them and make new predictions.

3) There seems to be no direct acknowledgement and exploration (aside from a citation) of recent papers that generated original data, and made computational models, to reveal and predict the location of TOP2 covalent complexes. Specifically the original END-seq method has been demonstrated to capture TOP2 covalent complexes (Canela et al. 2017 PMID: 28735753 and PMID: 31202577). In Canela et al. 2017 PMID: 28735753 they presented a linear regression model that could predict DNA breaks (TOP2cc’s) using only RAD21. It would be important for the authors to more directly relate their work to this study. Given this current study and the previous study used MEFs there would be specific opportunities to see how well the author’s current model predicts the END-seq (TOP2cc) data from Canela et al. 2017. At the very least some discussion is warranted giving both studies highlight the importance of cohesin in their predictions.

Detailed comments:

1) Method: For processing: “raw reads were merged for biological replicates”. For identification of binding sites: “when replicates were possible, peaks were called for individual replicates and overlapping peaks were kept”. It is thus not clear if biological replicates were merged prior to alignment or not.

2) Method: it is not exactly clear how the additional set of GC-matched controls was generated.

3) Method: the features used clearly violates the assumptions of NB. The authors did not clarify why they were still interested to try out the method.

4) Are the random controls showing similar genomic distribution (distance to TSS for example) as TOP2B peaks?

5) Line 233-234, 15 experiments used for model training is not clearly indicated in table S1. It is thus not clear which datasets were used for the classification shown in Table 1. This is then explained in lines 272-277. However, it will be more clear if it’s explained before getting into prediction results.

6) Interpretation: line 297, “These observations contrast with previous findings of TOP2B being involved in transcriptional regulation [47], and may indicate that other chromatin features, such as histone marks and DNase-seq, are capturing such association..” Would the authors expect difference performance if they used RNAP2 instead of RNA-seq? Could predictions for TOP2B at specific genomic regions (TSS/first exon) be a more appropriate question when RNA-seq is used? More details in the discussion how these results contrast previous results would be welcomed. For instance, the association of TOP2B occupancy with cohesin and DNAse I has been revealed in several studies as has its association with transcribed genes. Canela et al.’s END-seq work PMID: 28735753 clearly showed that without transcription a linear model with RAD21 binding could predict TOP2cc genome wide. To me it seems the results here from Martinez-Garcia et al are congruent with previous work. Discussing this would be relevant. Furthermore, the fact that transcription was relevant for TOP2 mediated DSBs is worth mentioning (using modified END-seq from Canela et al. PMID: 31202577 along with Gothe et al (PMID: 31202576)).

7) Analysis: Fig1D, GG signal peak seems to be biased on one side? Is this driven by a few regions?

8) Analysis: Fig6, it would be good to add a genome-wide correlation measure between prediction probability and real ChIP-seq data.

9) Analysis: The proposed model is trained using TOP2B peaks but the prediction is made on sliding windows across the genome. Could the authors train the model using whole-genome sliding windows or comment on the potential/necessity of doing so. Is it possible to adopt other models to be able to predict the quantitative signal (instead of classification)?

Reviewer #4: In the manuscript, the authors proposed a computational approach to predict TOP2B binding sites using chromatin accessibility and architectural proteins. The authors compared the performance of three classifiers: Naive Bayes, Support Vector Machine and Random Forests on a bunch of different features including: histone marks, Pol2 binding, architectural components, chromatin accessibility, gene expression, DNA shape, DNA sequence and CpG methylation. Next, they conducted feature selection and found that DNase I hypersensitivity, CTCF and cohesin binding are the most important features to predict TOP2B binding sites. Then, they trained a generalized model using these three features in one cell type and applied the model to a different cell type and validated the predictions with ChIP-seq of TOP2B, showing that the generalized model can predict TOP2B binding sites in new cell line and species.

TOP2B plays important roles in DNA metabolism and 3D organization of chromatin. But currently there are few TOP2B ChIP-seq datasets available. Based on the high accuracy of the predictions presented in the manuscript, this approach offers an attractive way to predict TOP2B binding in different cell types and tissues using the public available datasets. The manuscript is well written and is sound and accurate in general.

Comments

1. Why is Random Forests added after feature selection, not at the beginning of the comparing different features and feature selection? RF can also identify the most important features. Would the important features selected by RF be consistent with the ones selected by FCBF and SS?

2. In the “Feature selection algorithms” method section, the authors mention that FCBF uses SU as goodness function in line 124, and then mention CFS measure is used as goodness function in line 138. Is CFS used instead of SU in FCBF? Could the authors provide more details?

3. Could the authors add the comparison of models trained using all features vs three important features to validate that the prediction accuracy is not reduced much.

4. For feature selection, SS selects DNase-seq, RAD21 and STAG2 as important features in liver, while DNase-seq, RAD21 and CTCF in MEF. Could the authors compare the performance of the model trained using DNase-seq, RAD21 and CTCF with the model trained using DNase-seq, RAD21 and STAG2 in liver to support that the performance is similar?

5. The authors compared the predicted peaks with HOMER detected peaks. Did the authors compare this predictive model with other models that can predict TF binding sites (e.g. DRAF)?

6. Did the authors apply this approach to predict other TF binding sites, e.g. TOP2A? How is the performance of that?

7. The authors should provide more description about the machine learning framework. It is not clear what package if any was used. Code and scripts should be provided along with the training data. It would be nice to provide the folds of validation too.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: No: I could not view the new ChiP-seq data sets on GEO, it says they are private and not to be released until Nov 2022.

"Accession "GSE141528" is currently private and is scheduled to be released on Nov 16, 2022."

Are the machine learning scripts/programmes publically available?

If so where can they be accessed?

Reviewer #2: No: I don't see that the code needed for predicting the TOPO2 Beta occupancy is provided. I might be missing it.

Reviewer #3: No: see above comments

Reviewer #4: No: The authors don't provide any description of the code or software used to perform these experiments.

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Reviewer #4: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods


29 Aug 2020

Submitted filename: revision_letter.pdf

10 Oct 2020

Dear Dr. Cortés-Ledesma,

Thank you very much for submitting your manuscript "Genome-wide prediction of topoisomerase IIβ binding by architectural factors and chromatin accessibility" for consideration at PLOS Computational Biology.

In light of the reviews (below this email), two reviewers agreed to the publication of this work. However, reviewer 2 has major concerns remaining and reviewer 4 has some minor issues. Therefore, we cannot accept the work in its current form but if all concerns are satisfactorily addressed in a revised version, we would be happy to consider this work again. Note that we will not be able to offer another round of revisions after this and will have to make an accept/reject decision. Your revised manuscript will be sent to at least one of the reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Ferhat Ay, Ph.D

Associate Editor

PLOS Computational Biology

Alice McHardy

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: OK now apart from a few small issues

1. how many million reads were each MCF7 replicate?

2. some typos - "split" not "splitted" in methods and on line 509

"extent" not "extend" on line 479

Reviewer #2: Reviewer 2

Many of the requested changes were not made to the satisfaction of the reviewer.

Major point 1 - The response to this point is very confusing. I think the authors did a good job of giving the reviewer what they wanted, but the result is very confusing. In supp Fig 10 the authors show that 100% of the R2 peaks overlap with the merged peaks, and ~70% with the predicted peaks. If that is the case why would not 70% of the rows in the Merged-only heat map analysis have the high enrichment scores as seen in the Predicted-only heat map analysis? Similarly for the R1? This heat map analysis does not make sense. The Merged-only heat map shows all very low enrichments, but 70% should be as high as those shown in the Predicted-only heat map.

Major point 2 - The requested analysis was not done. Maybe I was not completely clear. I wanted the authors to down load a IGG pulldown ChIP-Seq data set and use it in their peak calling algorithm and determine the overlap with their Top2 ChIP-Seq data sets and the predicted data sets. As is shown now it is very difficult to know what the authors did. I don't see any discussion of IGG controls in any of the text, methods. What was actually done?

Major point 3 - The validation is wholly inadequate. I requested a 4-20 gradient gel. What is shown is only a band which is about the correct MW for the target (Top2). I would like to see if the antibodies used in the study cross react with other cellular proteins.

Major point 4 - this has been addressed adequately.

Minor point 1 - this has been addressed adequately.

Minor point 2 - 6A- Novus is missing the peak calls, Fig 1A is missing all peak calls, S11, S12, S13, S15, S16 are missing peak calls. It would be best for the reader to see these peaks calls, to understand what is being called in each track and see visually the agreement/disagreement between the tracks.

Reviewer #3: I think the authors addressed my concerns as well as those of the other reviewers.

Reviewer #4: The authors have adequately addressed my comments and concerns. Their model has now been more rigorously tested and the manuscript is greatly improved.

There are some minor comments that need to be addressed:

1) Mislabeled Fig S4 as Fig S3 in multiple places: legend of Figure 3, line 363, line 375

2) Table 4 and S8 Fig: The cross-validation performance in Liver based on SVM (76.15 ± 9.76%) is lower than the cross-system performance (Liver/MEF 93.11% and Liver/aB cells 98.21%). Is the accuracy of the cross-validation performance in Liver based on SVM correct? In Figure S6, it looks that the cross-validation performance of SVM using RAD21, DNase and CTCF features in Liver is good (AUC 0.95).

3) In line 502-504, besides the Fig 5 examples, provide the Pearson's correlation of predicted TOP2B binding vs experimental signals like Table S5, S6 to validate that similarities are globally significant.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Reviewer #4: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods


28 Oct 2020

Submitted filename: renamed_7ea32.pdf

13 Nov 2020

Dear Dr. Cortés-Ledesma,

We are pleased to inform you that your manuscript 'Genome-wide prediction of topoisomerase IIβ binding by architectural factors and chromatin accessibility' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Ferhat Ay, Ph.D

Associate Editor

PLOS Computational Biology

Alice McHardy

Deputy Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #4: The authors have addressed the remaining issues.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #4: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #4: No


12 Jan 2021

PCOMPBIOL-D-20-00455R2

Genome-wide prediction of topoisomerase IIβ binding by architectural factors and chromatin accessibility

Dear Dr Cortés-Ledesma,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Jutka Oroszlan

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
https://www.researchpad.co/tools/openurl?pubtype=article&doi=10.1371/journal.pcbi.1007814&title=Genome-wide prediction of topoisomerase II<i>β</i> binding by architectural factors and chromatin accessibility&author=&keyword=&subject=Research Article,Biology and Life Sciences,Genetics,Genomics,Animal Genomics,Mammalian Genomics,Biology and life sciences,Cell biology,Chromosome biology,Chromatin,Chromatin modification,DNA methylation,Biology and life sciences,Genetics,Epigenetics,Chromatin,Chromatin modification,DNA methylation,Biology and life sciences,Genetics,Gene expression,Chromatin,Chromatin modification,DNA methylation,Biology and life sciences,Genetics,DNA,DNA modification,DNA methylation,Biology and life sciences,Biochemistry,Nucleic acids,DNA,DNA modification,DNA methylation,Biology and life sciences,Genetics,Epigenetics,DNA modification,DNA methylation,Biology and life sciences,Genetics,Gene expression,DNA modification,DNA methylation,Biology and Life Sciences,Cell Biology,Chromosome Biology,Chromatin,Biology and Life Sciences,Genetics,Epigenetics,Chromatin,Biology and Life Sciences,Genetics,Gene Expression,Chromatin,Biology and life sciences,Molecular biology,Molecular biology techniques,Sequencing techniques,DNA sequencing,DNase Seq,Research and analysis methods,Molecular biology techniques,Sequencing techniques,DNA sequencing,DNase Seq,Computer and Information Sciences,Artificial Intelligence,Machine Learning,Support Vector Machines,Biology and Life Sciences,Cell Biology,Signal Transduction,Cell Signaling,Genomic Signal Processing,Biology and Life Sciences,Genetics,Genomics,Human Genomics,Biology and Life Sciences,Cell Biology,Cellular Types,Animal Cells,Immune Cells,Antibody-Producing Cells,B Cells,Biology and Life Sciences,Immunology,Immune Cells,Antibody-Producing Cells,B Cells,Medicine and Health Sciences,Immunology,Immune Cells,Antibody-Producing Cells,B Cells,Biology and Life Sciences,Cell Biology,Cellular Types,Animal Cells,Blood Cells,White Blood Cells,B Cells,Biology and Life Sciences,Cell Biology,Cellular Types,Animal Cells,Immune Cells,White Blood Cells,B Cells,Biology and Life Sciences,Immunology,Immune Cells,White Blood Cells,B Cells,Medicine and Health Sciences,Immunology,Immune Cells,White Blood Cells,B Cells,