
## README
Author: Kris Alavattam
Dates: 2025-10-29, 2025-11-06

<br />

### Topologically associating domains (TADs)
#### Overview
This bundle includes genome-wide insulation scores, boundaries, and TAD annotations from pooled endothelial and cardiomyocyte differentiation Hi-C datasets analyzed in the [*Stem Cell Rep* 2023 study](https://pubmed.ncbi.nlm.nih.gov/36493778/). (Cardiomyocyte data are from [Bertero & Fields et al., *Nat Commun* 2019](https://pubmed.ncbi.nlm.nih.gov/30948719/).)

Analyses were run at 40-kb resolution on autosomes and chromosome X (as the RUES2 stem cell line is female) using the insulation-score method of [Crane et al., *Nature* 2015](https://pubmed.ncbi.nlm.nih.gov/26030525/) as implemented in the [`cworld-dekker` toolkit](https://github.com/dekkerlab/cworld-dekker/tree/master). We also generated 10-kb TAD calls; these weren’t used in the paper but are provided for reference.

##### Files used for analyses in the paper
```text
{endoPooled_D{0,2,6,14},cardioPooled_D{0,2,5,14}}.genome.40000--is520001--nt0.01--ids320001--ss160001--immean.insulation--mbs0--mts0.tads.bed.gz
{endoPooled_D{0,2,6,14},cardioPooled_D{0,2,5,14}}.genome.40000--is520001--nt0.01--ids320001--ss160001--immean.insulation.boundaries.bed.gz
```

<br />

#### Data contents
Each timepoint (`endoPooled_D{0,2,6,14}`, `cardioPooled_D{0,2,5,14}`) includes four gzipped, genome-wide files:
1. TAD domains: `*.tads.bed.gz`
    BED intervals marking each called TAD (`chrom`, `start`, `end`, `name`, `score`). <mark>**This is the primary file for TAD overlap/annotation work.**</mark>

2. TAD boundaries: `*.insulation.boundaries.bed.gz`
    BED intervals at local minima of the normalized insulation profile (i.e., boundary calls). <mark>**This was also used for TAD overlap/annotation work.**</mark>

3. Insulation score track: `*.insulation.bedGraph.gz`
    Genome-wide normalized insulation profile used to call boundaries. This enables visualization of bin-wise insulation values and recomputation of TAD boundaries. Not used in the paper.

4. TAD domain scores: `*.tads.bedGraph.gz`
    bedGraph intervals identical to (**1**), with column 4 giving a “TAD strength score” that summarizes boundary-adjacent insulation behavior. Not used in the paper.

*Note: in these files, chromosome names contain a “chr” prefix, whereas loop files do not contain the prefix. If you plan to integrate TAD and loop coordinates, you’ll need to normalize chromosome naming conventions.*
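
One way to reconcile the naming is to strip the prefix from the TAD files before comparing them with the loop files. The helper below is a minimal sketch, not part of the original workflow: `strip_chr_prefix` is a hypothetical name, and it assumes tab-delimited BED/bedGraph records on stdin.

```bash
# Hypothetical helper (not from the original pipeline): drop a leading "chr"
# from column 1 of tab-delimited BED/bedGraph records read on stdin.
# All other columns are passed through unchanged.
strip_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" } { sub(/^chr/, "", $1); print }'
}

# Example usage (input and output names are placeholders):
#   gzip -dc tads.bed.gz | strip_chr_prefix > tads.nochr.bed
```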

<br />

#### What insulation and TAD strength scores represent
##### Insulation scores
Insulation scores quantify the contact frequency spanning each bin, aggregated within a sliding window (the “insulation square”) around that bin. Lower (more negative) values indicate stronger insulation, i.e., stronger separation between the flanking domains. Values are log₂-transformed and mean-centered; `--im mean` specifies that contacts within the insulation square are aggregated by their mean.
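
To make the normalization concrete: following the description in Crane et al., each bin's raw insulation value is divided by the chromosome-wide mean, and the ratio is log₂-transformed. The sketch below illustrates that arithmetic only; it is not the `cworld-dekker` implementation. `log2_mean_center` is a hypothetical name, and the handling of `NA` bins (excluded from the mean, passed through unchanged) is an assumption.

```bash
# Hypothetical illustration: read one raw (positive) insulation value per line
# for a single chromosome and print log2(value / mean) for each bin.
# Assumption: "NA" bins are excluded from the mean and echoed back as "NA".
log2_mean_center() {
    awk '
        { raw[NR] = $1 }                      # keep every value in input order
        $1 != "NA" { sum += $1; n++ }         # accumulate the mean over non-NA bins
        END {
            if (n == 0) exit                  # nothing to normalize
            mean = sum / n
            for (i = 1; i <= NR; i++) {
                if (raw[i] == "NA") print "NA"
                else printf "%.4f\n", log(raw[i] / mean) / log(2)
            }
        }
    '
}
```

Bins below the chromosome-wide mean come out negative, which is why strongly insulating positions appear as minima in the track.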

##### TAD strength scores
TAD strength scores summarize the relative drop in insulation from the interior of a TAD to its flanking boundaries. Each score is derived with [`insulation2tads.pl`](https://github.com/dekkerlab/cworld-dekker/blob/master/scripts/perl/insulation2tads.pl) from the normalized insulation values across the bins inside each TAD:
- Let `I[0...n-1]` be the normalized insulation values ($\log_2$-mean-centered) across the bins from the left boundary to the right boundary of a TAD.
- Let `I_max = max(I)` (maximum interior normalized insulation).
- Let `I_left = I[0]` (value at the left edge bin) and `I_right = I[n-1]` (value at the right edge bin).
- The TAD score is `score = min( I_max − I_left , I_max − I_right )`

Essentially, this captures the weaker of the two “drops” from the interior peak to each flanking edge, i.e., a boundary-strength-type summary. TADs with too many `NA` bins (>25% of bins) are suppressed by the script and not output. `NA` values are encoded as `NaN` in the bedGraph.
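
The formula above can be checked by hand with a small sketch. This is an illustration of the arithmetic as described here, not a port of `insulation2tads.pl`: `tad_strength` is a hypothetical name, and the input is assumed to contain no `NA` bins.

```bash
# Hypothetical illustration: read the normalized insulation values of one TAD
# (one value per line, left edge bin first, right edge bin last) and print
# min(I_max - I_left, I_max - I_right).
# Assumption: no NA bins in the input.
tad_strength() {
    awk '
        NR == 1 { left = $1 + 0; max = $1 + 0 }   # first bin is the left edge
        {
            if ($1 + 0 > max) max = $1 + 0         # track the interior maximum
            right = $1 + 0                         # last value seen is the right edge
        }
        END {
            if (NR == 0) exit
            dl = max - left
            dr = max - right
            printf "%.4f\n", (dl < dr ? dl : dr)
        }
    '
}

# Example: edges at -0.8 and -0.5, interior peak at 0.4.
# The drops are 1.2 (left) and 0.9 (right), so the score is the smaller one.
#   printf '%s\n' '-0.8' '0.1' '0.4' '0.2' '-0.5' | tad_strength
```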

<br />

#### Parameters, provenance
All tracks and calls were produced from per-chromosome Hi-C matrices (pooled replicates) at 40 kb; the per-chromosome outputs were later concatenated into genome-wide files. Insulation was computed with the following parameters:
- Insulation square (`--is`): 520,001 bp
- Delta span (`--ids`): 320,001 bp
- Smoothing (`--ss`): 160,001 bp
- Insulation mode (`--im`): mean
- Noise threshold (`--nt`): 0.01
- Boundary margin of error (default 0; not expanded in the final outputs)

Representative command (per chromosome and sample):
<details>
<summary><i>Click to view</i></summary>
<br />

```bash
matrix2insulation.pl \
    -i <SAMPLE>_40000_iced_<CHR>_dense_craned.matrix \
    --is 520001 --ids 320001 --ss 160001 --nt 0.01 --im mean \
    -o <OUTDIR>/chr<CHR>/<SAMPLE>.chr<CHR>.40000/<SAMPLE>.chr<CHR>.40000
```
</details>
<br />

TAD assembly from insulation and boundaries:
<details>
<summary><i>Click to view</i></summary>
<br />

```bash
insulation2tads.pl \
    -i <OUT>.insulation.txt \
    -b <OUT>.insulation.boundaries.txt \
    -o <OUT>.insulation \
    --mbs 0 --mts 0
```
</details>
<br />

#### File naming conventions
```txt
<TIMEPOINT>.genome.40000--is520001--nt0.01--ids320001--ss160001--immean.insulation.bedGraph.gz
<TIMEPOINT>.genome.40000--is520001--nt0.01--ids320001--ss160001--immean.insulation.boundaries.bed.gz
<TIMEPOINT>.genome.40000--is520001--nt0.01--ids320001--ss160001--immean.insulation--mbs0--mts0.tads.bed.gz
<TIMEPOINT>.genome.40000--is520001--nt0.01--ids320001--ss160001--immean.insulation--mbs0--mts0.tads.bedGraph.gz
```
where `<TIMEPOINT>` $\in$ `endoPooled_D{0,2,6,14}` or `cardioPooled_D{0,2,5,14}`.
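
To check that a download is complete, the expected file names can be generated from the pattern above. This is a convenience sketch; `list_tad_files` is a hypothetical helper, and it assumes the files sit in the current directory.

```bash
# Hypothetical helper: print the 32 expected TAD-bundle file names
# (8 timepoints x 4 file types), following the naming pattern in this README.
list_tad_files() {
    stem="genome.40000--is520001--nt0.01--ids320001--ss160001--immean.insulation"
    for tp in endoPooled_D0 endoPooled_D2 endoPooled_D6 endoPooled_D14 \
              cardioPooled_D0 cardioPooled_D2 cardioPooled_D5 cardioPooled_D14; do
        printf '%s\n' \
            "${tp}.${stem}.bedGraph.gz" \
            "${tp}.${stem}.boundaries.bed.gz" \
            "${tp}.${stem}--mbs0--mts0.tads.bed.gz" \
            "${tp}.${stem}--mbs0--mts0.tads.bedGraph.gz"
    done
}

# Example usage: report any file that is missing from the current directory.
#   list_tad_files | while read -r f; do [ -e "$f" ] || echo "missing: $f"; done
```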

<br />

#### Miscellaneous
- Visualization: the bedGraph/BED files load in IGV/UCSC. The TAD bedGraph uses `NaN` for missing values; IGV/UCSC should handle these, but this has not been verified, so double check and, if a browser rejects them, strip such rows from the files.
- Coordinate system: 0-based, half-open intervals (in keeping with BED and bedGraph conventions).
- Reproducibility: exact insulation values depend on window geometry, smoothing, and the `--im` setting; use the parameters listed above for replication.
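
If a browser does reject the `NaN` rows, they can be removed with a one-line filter. This is a minimal sketch: `drop_nan_rows` is a hypothetical name, and it assumes a tab-delimited bedGraph with the value in column 4.

```bash
# Hypothetical helper: drop bedGraph rows whose value (column 4) is NaN,
# matched case-insensitively. Rows with fewer than four columns are kept.
drop_nan_rows() {
    awk 'BEGIN { FS = "\t" } tolower($4) != "nan"'
}

# Example usage (input and output names are placeholders):
#   gzip -dc tads.bedGraph.gz | drop_nan_rows | gzip -c > tads.noNaN.bedGraph.gz
```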

<br />

#### Representative workflow code snippets
##### Conversion of sparse matrices to dense matrices (HiC-Pro; per chromosome)
<details>
<summary><i>Click to view</i></summary>
<br />

```bash
${HICPRO_PATH}/utils/sparseToDense.py \
    -b <SAMPLE>/raw/40000/<SAMPLE>_40000_abs.bed \
    --perchr <SAMPLE>/iced/40000/<SAMPLE>_40000_iced.matrix
```
</details>
<br />

##### Conversion of dense matrices to “Crane-formatted matrices”
<details>
<summary><i>Click to view</i></summary>
(AWK call by Giancarlo Bonora.)

```bash
awk -v sample=<SAMPLE> -v rez=40000 -v chr=<CHR> '
    NR == 1 {
        for(i = 1; i <= NF; i++) {
            printf("\tbin%s|%s|chr%s:%d-%d", i, sample, chr, ((i - 1) * rez) + 1, (i * rez) + 1)
        }
        printf("\n")
    }
    {
        printf("bin%s|%s|chr%s:%d-%d\t", NR, sample, chr, ((NR - 1) * rez) + 1, (NR * rez) + 1)
        print $0
    }
    ' \
    <SAMPLE>_40000_iced_<CHR>_dense.matrix \
        > <SAMPLE>_40000_iced_<CHR>_dense_craned.matrix
```
</details>
<br />

##### Boundary computation with [`matrix2insulation.pl`](https://github.com/dekkerlab/cworld-dekker/blob/master/scripts/perl/matrix2insulation.pl) (as above)
<details>
<summary><i>Click to view</i></summary>
<br />

```bash
matrix2insulation.pl \
    -i <SAMPLE>_40000_iced_<CHR>_dense_craned.matrix \
    --is 520001 --ids 320001 --ss 160001 --nt 0.01 --im mean \
    -o <OUTDIR>/chr<CHR>/<SAMPLE>.chr<CHR>.40000/<SAMPLE>.chr<CHR>.40000
```
</details>
<br />

##### TAD assembly with [`insulation2tads.pl`](https://github.com/dekkerlab/cworld-dekker/blob/master/scripts/perl/insulation2tads.pl) (as above)
<details>
<summary><i>Click to view</i></summary>
<br />

```bash
insulation2tads.pl \
    -i <OUT>.insulation.txt \
    -b <OUT>.insulation.boundaries.txt \
    -o <OUT>.insulation \
    --mbs 0 --mts 0
```
</details>
<br />

##### Additional work
Per-chromosome outputs were concatenated to genome-wide tracks and `gzip`-compressed.
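
The concatenation step might look like the sketch below. The per-chromosome naming pattern (`<prefix>.chr<CHR><suffix>`), the chromosome order (1 to 22, then X), and the helper name `concat_chromosomes` are all assumptions for illustration; the sketch also assumes the per-chromosome files carry no header or track lines, which would otherwise need to be dropped first.

```bash
# Hypothetical helper: concatenate per-chromosome files named
# <prefix>.chr<CHR><suffix> in the order 1..22, X and gzip the result to stdout.
# Chromosomes without a matching file are skipped silently.
concat_chromosomes() {
    for chr in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X; do
        f="$1.chr${chr}$2"
        if [ -f "$f" ]; then
            cat "$f"
        fi
    done | gzip -c
}

# Example usage (prefix, suffix, and output name are placeholders):
#   concat_chromosomes out/sample .tads.bed > sample.genome.tads.bed.gz
```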

<br />

### Loops
#### Overview
This bundle includes genome-wide chromatin loop calls (“pairwise point interactions” or “PPIs”) from the pooled endothelial and cardiomyocyte differentiation Hi-C datasets analyzed in [*Stem Cell Rep* 2023](https://pubmed.ncbi.nlm.nih.gov/36493778/). (Cardiomyocyte data are from [Bertero & Fields et al., *Nat Commun* 2019](https://pubmed.ncbi.nlm.nih.gov/30948719/).)

Loops were called with HiCCUPS ([`juicer_tools`](https://github.com/aidenlab/JuicerTools); Juicer v1.9.9, Jcuda 0.8, `--cpu`). Primary analyses were run at 10-kb resolution on autosomes and chromosome X (as the RUES2 cell line is female). We also provide 20-kb, 40-kb, and cross-resolution merged sets for completeness; these supplementary sets, and all `cardioPooled` datasets, were **not** used in the paper.

##### Files used for analyses in the paper
```text
endoPooled_D{0,2,6,14}.postprocessed_pixels_10000.bedpe.gz
```

<br />

#### Data contents
Each timepoint (`endoPooled_D{0,2,6,14}`, `cardioPooled_D{0,2,5,14}`) includes:
1. Postprocessed loops: `*.postprocessed_pixels_{1,2,4}0000.bedpe.gz`
    Primary, analysis-ready loop set (the 10-kb `endoPooled` files were used in the *Stem Cell Rep* analyses). Each line lists a significant interaction pixel (one loop) passing all HiCCUPS filters after FDR control and local postprocessing. (For more details, see the [HiCCUPS wiki](https://github.com/aidenlab/juicer/wiki/HiCCUPS) maintained by the [Aiden Lab](https://aidenlab.org/) as well as the Supplementary Methods in their [Rao and Huntley et al., *Cell* 2014](https://pubmed.ncbi.nlm.nih.gov/25497547/) study.) <mark>**This is the primary file for loop overlap/annotation work.**</mark>

    Each `.bedpe` file follows the standard HiCCUPS schema:
    ```txt
    chrom1	x1	x2	chrom2	y1	y2	name	score	strand1	strand2	color	observed	expectedBL	expectedDonut	expectedH	expectedV	fdrBL	fdrDonut	fdrH	fdrV	numCollapsed	centroid1	centroid2	radius
    ```

    Variable definitions are available in the [HiCCUPS wiki](https://github.com/aidenlab/juicer/wiki/HiCCUPS).

2. `*.enriched_pixels_{1,2,4}0000.bedpe.gz`
    All candidate pixels that pass the enrichment tests (donut/horizontal/vertical/lower-left) before centroid collapsing and final deduplication.

    Useful for inspecting the raw peaks underlying the final loop calls and/or redoing or customizing postprocessing. Not used in the paper.

3. `*.fdr_thresholds_{1,2,4}0000.gz`
    Per-resolution FDR cutoff tables produced by HiCCUPS. These record the enrichment thresholds used for each local background model (donut, horizontal, vertical, lower-left) at the chosen FDR.

    Useful for auditing or reproducing the calling stringency. Not used in the paper.

4. `*.merged_loops.bedpe.gz`
    Non-redundant union of the 10-, 20-, and 40-kb loop calls, with nearby pixels across resolutions merged to a single representative record.

    Provided for completeness and cross-resolution validation. Not used in the paper.

*Note: in these files, chromosome names do not have the “chr” prefix used in the various TAD files. If you plan to integrate TAD and loop coordinates, you’ll need to normalize chromosome naming conventions first.*
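
Going the other way, the prefix can be added to the loop files so that they match the TAD files. A minimal sketch, assuming tab-delimited BEDPE on stdin; `add_chr_prefix_bedpe` is a hypothetical name, and lines starting with `#` (if any are present) are passed through unchanged.

```bash
# Hypothetical helper: add a "chr" prefix to both chromosome columns
# (1 and 4) of tab-delimited BEDPE records read on stdin.
# Lines starting with "#" are printed as-is; names that already
# start with "chr" are left alone.
add_chr_prefix_bedpe() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }
         {
             if ($1 !~ /^chr/) $1 = "chr" $1
             if ($4 !~ /^chr/) $4 = "chr" $4
             print
         }'
}

# Example usage (input and output names are placeholders):
#   gzip -dc loops.bedpe.gz | add_chr_prefix_bedpe > loops.chr.bedpe
```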

<br />

#### Parameters, provenance
Loop detection was executed using the Juicer `hiccups` command as in the following representative command:
<details>
<summary><i>Click to view</i></summary>
<br />

```bash
juicer_tools hiccups \
    --cpu \
    -r 10000,20000,40000 \
    -f 0.1,0.1,0.1 \
    -p 4,2,1 \
    -i 7,5,3 \
    -t 0.02,1.5,1.75,2.0,2.5 \
    <INPUT.hic> \
    <OUTPUT_DIR>
```
</details>
<br />

Postprocessed loops (`postprocessed_pixels_10000.bedpe`) were exported from each `<SAMPLE>.hiccupsOutput` directory and `gzip`-compressed. Although 20-kb, 40-kb, and multi-resolution (loop-merged) sets were generated, only 10-kb loops were used in the *Stem Cell Rep* analyses. (Happy to share the other data upon request.)

Parameter descriptions can be found in the [HiCCUPS wiki](https://github.com/aidenlab/juicer/wiki/HiCCUPS).

HiCCUPS operates on `.hic` files, which are binary matrices produced by Juicer’s pre-processing pipeline (e.g., `juicer_tools pre`). Paul Fields’ and Giancarlo Bonora’s earlier HiC-Pro runs produced `validPairs` text files, which can be converted to `.hic` format via the following:
<details>
<summary><i>Click to view</i></summary>
<br />

```bash
juicer_tools pre -n <HiC-Pro_validPairs> <OUTPUT.hic> <chrom.sizes>
```
</details>
<br />

#### File naming conventions
```txt
<TIMEPOINT>.enriched_pixels_10000.bedpe.gz
<TIMEPOINT>.fdr_thresholds_10000.gz
<TIMEPOINT>.postprocessed_pixels_10000.bedpe.gz
<TIMEPOINT>.merged_loops.bedpe.gz
```
where `<TIMEPOINT>` $\in$ `endoPooled_D{0,2,6,14}` or `cardioPooled_D{0,2,5,14}`.

<br />

#### Miscellaneous
- Visualization: BEDPE loops can be viewed in browsers designed for Hi-C (and related) data&mdash;e.g., Juicebox or HiGlass&mdash;or converted to long-range interaction arcs for use with the UCSC or IGV browsers.
- Coordinate system: 0-based, half-open intervals (per BED/BEDPE conventions).
- Interpretation: Each line (representing an individual loop) corresponds to a statistically enriched contact between two 10-kb bins. FDR values for multiple neighborhood models (lower-left, donut, horizontal, vertical) are included in columns 17&ndash;20 (`fdrBL`, `fdrDonut`, `fdrH`, `fdrV` in the schema above).
- Reproducibility: Exact loop sets depend on bin size, normalization (e.g., `KR` for Knight-Ruiz matrix “balancing”), and HiCCUPS parameters; replication requires the same Juicer build and parameters as above.
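
For stricter downstream sets, loops can be filtered on the per-model FDR columns. The sketch below takes the column positions from the header schema listed under "Data contents" (`fdrBL`, `fdrDonut`, `fdrH`, `fdrV` as fields 17 to 20); `filter_loops_by_fdr` is a hypothetical name, and requiring all four models to pass is an illustrative choice rather than something done in the paper.

```bash
# Hypothetical helper: keep BEDPE loop records whose four HiCCUPS FDR values
# (fields 17-20: fdrBL, fdrDonut, fdrH, fdrV) are all <= the given cutoff.
# Lines starting with "#" are passed through unchanged.
# Usage: filter_loops_by_fdr <max_fdr> < loops.bedpe
filter_loops_by_fdr() {
    awk -v max="$1" 'BEGIN { FS = "\t" }
        /^#/ { print; next }
        ($17 + 0 <= max + 0) && ($18 + 0 <= max + 0) &&
        ($19 + 0 <= max + 0) && ($20 + 0 <= max + 0)'
}

# Example usage (input and output names are placeholders):
#   gzip -dc loops.bedpe.gz | filter_loops_by_fdr 0.05 > loops.fdr05.bedpe
```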

<br />

#### Representative workflow code snippets
Summary of core steps to derive these outputs:
<details>
<summary><i>Click to view</i></summary>
<br />

```bash
#  Convert HiC-Pro validPairs files to Juicer-compatible .hic files
juicer_tools pre -n <HiC-Pro_validPairs> <OUTPUT.hic> <chrom.sizes>

#  Run HiCCUPS
juicer_tools hiccups --cpu \
    -r 10000,20000,40000 \
    -f 0.1,0.1,0.1 \
    -p 4,2,1 \
    -i 7,5,3 \
    -t 0.02,1.5,1.75,2.0,2.5 \
    -k KR \
    <SAMPLE>.hic <SAMPLE>.hiccupsOutput/

#  Compress and rename 10-kb postprocessed loops
gzip -c <SAMPLE>.hiccupsOutput/postprocessed_pixels_10000.bedpe \
    > <SAMPLE>.postprocessed_pixels_10000.bedpe.gz
```
</details>
<br />
