groupBy

Argument	Description
object	A profileplyr object
group	How the ranges will be grouped. If this is a character string, then it must match a column name of the range metadata, and this column will be used for grouping of any exported deepTools matrix. If this is a GRanges, or GRangesList, then the ranges will be subset based on overlap with these GRanges. If this is a list, each element should contain ether 1) a character vector of genes, and ranges will be subset based on overlap with these genes, as determined by the annotations made by annotateRanges() or annotateRanges_great() functions, or 2) a data frame with the gene symbols as the rownames. Any additional columns of this dataframe will be added to the range metadata.
GRanges_names	Relevant for ‘GRanges’ mode. These are the names that will be assigned to the ranges that overlap each GRanges object
levels	This will set the levels of the grouping column set by ‘rowGroupsInUse’ (if the grouping column is not a factor, it will be converted to one). If levels are not provided, they will remain unchanged if the grouping column was already a factor, or will use default leveling (e.g. alphabetical) if grouping column is not already a factor variable.
include_nonoverlapping	Relevant for ‘GRanges’ mode. This should be indicated (default is TRUE). A logical argument, if FALSE the regions from the original deepTools matrix that do not overlap with the user defined regions will be left out of the returned profileplyr object.
separateDuplicated	Relevant for ‘GRanges’ mode. This should be indicated (default is FALSE). A logical argument, if TRUE then regions that overlap multiple inputs toe ‘regions’ argument will be separated and made into their own group. All possible combinations of region overlaps will be tested, so it is not recommended to have more than 3 groups if this option is TRUE. If FALSE, then regions that overlap each individual ‘region’ input will be in the output, and if one region overlaps multiple ‘region’ inputs, then it will be duplicated in the output and will show up in the section for each group.
inherit_groups	A logical whether that groups the exist in the profileplyr object in the ‘object’ argument should be included in the default grouping scheme for the output object of this function. The default is TRUE. If false, only the GRanges or gene list overlap annotation will be used for heatmap grouping.

Example

Options for grouping

The groupBy() function within profileplyr allows for a range of grouping mechanisms
The groupBy() function always requires a profileplyr object and the ‘group’ argument, which will determine how the ranges are to be grouped
There are three options for grouping:
- Switch output grouping columns within existing range metadata
- Group by user-supplied GRanges
- Group by user-supplied gene list

Switch output grouping columns within existing range metadata

If the ‘group’ argument is a character string then it must match a name of one of the columns in mcols(proplyrObject)
In this case the groupBy function will change the column that will be used for grouping a profileplyr object during visualization.
Here we add a column of random numbers to demonstrate both adding a column and then switching to use this column in grouping.

mcols(proplyr_object_subset)$newColumn <- rnorm(n = nrow(proplyr_object_subset))

params(proplyr_object_subset)$rowGroupsInUse

## [1] "sgGroup"

switchGroup <- groupBy(proplyr_object_subset, 
                            group = "newColumn")
params(switchGroup)$rowGroupsInUse

## [1] "newColumn"

Group by user-supplied GRanges

If the ‘group’ argument is a GRanges or a GRangesList then the overlap between each GRanges in this list and the ranges of the profileplyr object will be determined
A column will be added to the range metadata (column name is ‘group_and_overlap’) that shows whether the range overlaps with any of the GRanges
The ‘rowGroupsInUse’ will be changed to this column by default.

Determine GRanges object to be used for overlap analysis

GATA4 is a transcription factor that is known to be important for directing the mesoderm lineage. A previous study had shown that GATA4 was enriched at enhancers in mesoderm differentiated cells compared to the parental HUES64 stem cells.
- bigwig files (two replicates) for the GATA4 ChIPseq experiments in the mesoderm cells were downloaded from the GEO database for this publication

Import bed files as GRanges

Here we will import the bed files as GRanges objects, which is necessary for grouping with the groupBy() function.
Then the high confidence peaks are found between the two replicates by taking the peaks that are present in both replicates

library(rtracklayer)
GATA4_1 <- import.bed("GSM1505644_Gata4_022113_meso.bed.peak.txt.gz")

GATA4_2 <- import.bed("GSM1505645_Gata4_041914_meso.bed.peak.txt.gz")

GATA4_HC <- GATA4_1[GATA4_1 %over% GATA4_2]
seqlevelsStyle(GATA4_HC) <- "UCSC"

use ‘groupBy’ function to indicate overlapping ranges

When a profileplyr object is grouped by GRanges a column is added to the range metadata titled “GR_overlap_names” that indicates which GRanges that range overlaps with (or an indication that there was no overlap). The ‘rowGroupsInUse’ slot of the params is automatically changed to this column, and this is reflected in the heatmap on the next slide.

proplyr_object_subset <- groupBy(proplyr_object_subset,
                                 group = GATA4_HC,
                                 GRanges_names = "GATA4_peaks",
                                 include_nonoverlapping = TRUE)

table(mcols(proplyr_object_subset)$GR_overlap_names)

## 
## GATA4_peaks  no_overlap 
##         124         876

Visualize overlaps as a range heatmap

generateEnrichedHeatmap(proplyr_object_subset,
                        include_group_annotation = TRUE,
                        color_by_sample_group = "chip",
                        ylim = "chip",
                        show_heatmap_legend = rep(FALSE, 6)
                        )

Group by user-supplied gene list

A profileplyr object can also be subset or annotated based on overlap of its gene annotations (from annotateRanges() or annotateRanges_great()) with a list of genes from the user. This argument must be a list(), and the elements of list can either be a character vector of gene symbols or a list of data frames with the gene symbols as the rownames. Additional columns in these data frames will be inculded as columns in the range metadata and can also be used for additional range annotation.

Read in gene lists

First we will read in two genes lists, one for genes going up in the mesoderm compared to the ESCs, and one for the genes that go down (or up in ESC). This analysis was done with DESeq2 using the following counts tables downloaded from ENCODE: * HUES64 ESC: Replicate 1 and Replicate 2 * Mesoderm: Replicate 1 and Replicate 2

library(dplyr)

UP_in_meso_genes <- read.table("UP_in_meso_genes.txt",
                               header = FALSE,
                               stringsAsFactors = FALSE) %>%
  .[,1]

head(UP_in_meso_genes)

## [1] "HAPLN1" "COL3A1" "ANXA1"  "ITGA1"  "LOX"    "COL1A1"

DOWN_in_meso_genes <- read.table("DOWN_in_meso_genes.txt",
                               header = FALSE,
                               stringsAsFactors = FALSE) %>%
  .[,1]

head(UP_in_meso_genes)

## [1] "HAPLN1" "COL3A1" "ANXA1"  "ITGA1"  "LOX"    "COL1A1"

groupBy the gene lists

We combine these gene sets into a single list, and again use the groupBy() function to group the ranges. This will create a column in the range metadata shoing these overlaps, and the’rowgroupsInUse’ will by default be switched to this column. Remember that you can always switch this grouping to any column you’d like with the same groupBy() function, as shown above.

# IMPORTANT: The profileplyr object must have been annotated with either annotateRanges() or annotateRanges_great() before subsetting by a gene list

proplyr_object_subset <- annotateRanges_great(proplyr_object_subset,
                                              species = "hg19")

geneList_for_grouping <- list(meso_UP= UP_in_meso_genes,
                              meso_DOWN = DOWN_in_meso_genes)

proplyr_object_subset <- groupBy(proplyr_object_subset,
                                 group = geneList_for_grouping)

## A column has been added to the range metadata with the column name 'GL_overlap_names' that specifies the GRanges each range overlaps with, but the inherited groups are not included.

mcols(proplyr_object_subset)

## DataFrame with 843 rows and 13 columns
##                 name     score                sgGroup     giID    names
##          <character> <numeric>               <factor> <factor> <factor>
## giID808           NA         0 K27ac_top10_HUES64.bed  giID808       NA
## giID785           NA         0 K27ac_top10_HUES64.bed  giID785       NA
## giID736           NA         0 K27ac_top10_HUES64.bed  giID736       NA
## giID714           NA         0 K27ac_top10_HUES64.bed  giID714       NA
## giID714           NA         0 K27ac_top10_HUES64.bed  giID714       NA
## ...              ...       ...                    ...      ...      ...
## giID983           NA         0 K27ac_top10_HUES64.bed  giID983       NA
## giID983           NA         0 K27ac_top10_HUES64.bed  giID983       NA
## giID999           NA         0 K27ac_top10_HUES64.bed  giID999       NA
## giID999           NA         0 K27ac_top10_HUES64.bed  giID999       NA
## giID1000          NA         0 K27ac_top10_HUES64.bed giID1000       NA
##                    newColumn overlap_matrix GR_overlap_names      name.1
##                    <numeric>       <matrix>         <factor> <character>
## giID808   -0.453881995410842            0:1       no_overlap          NA
## giID785    0.776292813554967            0:1       no_overlap          NA
## giID736    0.325036775903307            0:1      GATA4_peaks         158
## giID714  -0.0157360582609343            0:1       no_overlap          NA
## giID714  -0.0157360582609343            1:0       no_overlap          NA
## ...                      ...            ...              ...         ...
## giID983    0.331649349787874            0:1       no_overlap          NA
## giID983    0.331649349787874            1:0       no_overlap          NA
## giID999   -0.252095261459177            0:1       no_overlap          NA
## giID999   -0.252095261459177            0:1       no_overlap          NA
## giID1000     1.2475850575946            0:1       no_overlap          NA
##            score.1      SYMBOL distanceToTSS GL_overlap_names
##          <numeric> <character>     <numeric>        <ordered>
## giID808         NA       PUSL1          6780        meso_DOWN
## giID785         NA        RER1       -112760        meso_DOWN
## giID736      59.31        RERE        146478        meso_DOWN
## giID714         NA    SLC25A33       -167481        meso_DOWN
## giID714         NA       SPSB1         79121          meso_UP
## ...            ...         ...           ...              ...
## giID983         NA       HCFC1          -690        meso_DOWN
## giID983         NA     TMEM187           170          meso_UP
## giID999         NA      NLGN4Y       -721757        meso_DOWN
## giID999         NA      TMSB4Y         98422        meso_DOWN
## giID1000        NA       KDM5D           105        meso_DOWN

Visualize the gene list groupings

generateEnrichedHeatmap(proplyr_object_subset,
                        include_group_annotation = TRUE,
                        color_by_sample_group = "chip",
                        ylim = "chip",
                        show_heatmap_legend = rep(FALSE, 6)
                        )

Remember: can always export to deepTools

The profileplyr object can always be exported to deepTools using the ‘export_deepToolsMat’ function, however, remember that while the grouping will be preserved in deepTools, the extra annotation columns will not.

export_deepToolsMat(proplyr_object_subset,
                    con = "groupBy_geneList_deeptools",
                    overwrite = TRUE)

## [1] "groupBy_geneList_deeptools"

##system("plotHeatmap -m groupBy_geneList_deeptools.gz -o groupBy_geneList_deeptools.pdf --yMax 50 5 100 50 5 150 --zMax 100 20 180 100 20 180")

igv

Group by user-supplied dataframe of gene lists to incorporate additional information about these genes. Here that column indicates gene expression changes

The following table is simply a DESeq2 results table subset for gene symbol and the stat column indicating direction and extent of gene expression changes. By using a data frame with this column, we can then use this information as an extra annotation column in our range heatmap.

meso_vs_esc_stat <- readRDS("meso_vs_esc_stat.RData")

head(meso_vs_esc_stat)

##        meso_vs_esc_stat
## HAPLN1         44.41412
## COL3A1         43.80409
## ANXA1          40.33583
## ITGA1          35.59385
## LOX            35.55025
## COL1A1         35.05454

use ‘groupBy’ function to add column to the range metadata indicating gene expression changes

After this grouping, a column will be added to the range metadata called “GL_overlap_names”, in addition to the other columns in the user-supplied data frame.

note that in addition to the ‘GL_overlap_names’ column showing overlap (which isn’t important since we used all genes here), there is a column showing the expression changes for all these genes in meso vs esc

proplyr_object_subset <- groupBy(proplyr_object_subset,
                                 group = list(meso_vs_esc_stat), # note that this must be a list, and you could have more than one data frame here
                                 include_nonoverlapping = FALSE)


#it will be interesting to see the gene expression pattern within the clusters we have used before, so we will cluster the ranges after grouping by gene list
set.seed(0)
proplyr_object_subset <- clusterRanges(proplyr_object_subset,
                                       kmeans_k = 3)

tail(colnames(mcols(proplyr_object_subset)))

## [1] "SYMBOL"             "distanceToTSS"      "GL_overlap_names"  
## [4] "GL_overlap_names.1" "meso_vs_esc_stat"   "cluster"

Visualize gene expression changes as an extra annotation column

Here we use the gene expression information to generate a heatmap with the expression of genes in mesoderm vs HUES64 cells for each cluster

generateEnrichedHeatmap(proplyr_object_subset,
                        include_group_annotation = TRUE,
                        extra_annotation_columns = c("meso_vs_esc_stat"),
                        color_by_sample_group = "chip",
                        ylim = "chip",
                        show_heatmap_legend = rep(FALSE, 6)
                        )

Improve color scales by adding a custom scale

You’ll notice in the previous figure that the color scale for the expression data is unbalanced so that 0 is not right in the middle, resulting in most ranges aving a red color. And we want a different color for genes that go up and another color for genes that go down, with genes that don’t change being white. The colors of the extra annotation columns can be changed with the ‘extra_anno_color’ argument, which takes a list containing elements taht correspond to each extra annotation heatmap on the right side of the figure.

The list must be the same length as the number of extra columns, and each element can either be NULL (keep default coloring), a character or numeric vector with two or three colors depending on whether you want a two or three color progression, or a function that calls colorRamp(). Below we will generate a function to make a three color scheme, with 0 in the being white. We also make the max and min values the 98th and 2nd percentiles to avoid the extreme values driving the colors.

library(circlize)

meso_vs_esc_statonly <- mcols(proplyr_object_subset)$meso_vs_esc_stat
q_98 <- quantile(meso_vs_esc_statonly, 0.98)
q_01 <- quantile(meso_vs_esc_statonly, 0.02)

meso_vs_esc_stat_color <- (colorRamp2(c(q_01, 0, q_98),
                             c("blue", "white", "red")))

Use the custom color scale in ‘extra_anno_color’ argument of the ‘generateEnrichedHeatmap’ function

This function will be used for the gene expression column, and we will change the colors of the GATA4 overlap column as well so that no overlap is white.

generateEnrichedHeatmap(proplyr_object_subset,
                        include_group_annotation = TRUE,
                        extra_annotation_columns = c("meso_vs_esc_stat"),
                        color_by_sample_group = "chip",
                        ylim = "chip",
                        show_heatmap_legend = rep(FALSE, 6),
                        extra_anno_color = list(meso_vs_esc_stat_color)
                        )