Introduction


ChIPCor is a testing procedure for evaluating the significance of the overlap between the binding sites of a pair of proteins. By borrowing information from protein binding profiles in public data repositories, the method corrects for background artifacts in ChIP-seq data and for the Simpson's paradox that arises from genome heterogeneity.
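To make the Simpson's paradox concrete, the sketch below builds two toy two by two tables, one per segment type, in which the two proteins bind independently within each segment type. Pooling the bins across segment types nevertheless produces an apparent association, because both proteins bind more often in one segment type. The segment names and all counts are invented for illustration and do not come from ChIPCor or ENCODE data.

```r
# Rows: protein A bound / unbound; columns: protein B bound / unbound.
make_tbl <- function(both, a_only, b_only, neither) {
  matrix(c(both, a_only, b_only, neither), nrow = 2, byrow = TRUE,
         dimnames = list(A = c("bound", "unbound"),
                         B = c("bound", "unbound")))
}

# Odds ratio of a 2x2 table; a value of 1 means no association.
odds_ratio <- function(tbl) {
  (tbl[1, 1] * tbl[2, 2]) / (tbl[1, 2] * tbl[2, 1])
}

# Hypothetical segment types with invented bin counts.
active    <- make_tbl(250, 250, 250, 250)    # both proteins bind often
quiescent <- make_tbl(90, 810, 810, 7290)    # both proteins bind rarely

# Ignoring the segmentation amounts to adding the tables cell by cell.
pooled <- active + quiescent

or_active    <- odds_ratio(active)
or_quiescent <- odds_ratio(quiescent)
or_pooled    <- odds_ratio(pooled)

print(c(active = or_active, quiescent = or_quiescent, pooled = or_pooled))
```

Within each segment type the odds ratio is 1, while the pooled table gives an odds ratio above 2. This is why the overlap is summarized separately for each segment type rather than genome-wide.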


Use ChIPCor


Download and installation

  • Click the following links to download the package:
  • Follow the standard procedure to install the package in R.
  • ChIPCor depends on the Bioconductor package qvalue to convert p-values to q-values.
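A typical installation session might look like the sketch below. The archive path is a placeholder (an assumption, not the real file name) and should be replaced with the file you actually downloaded. The qvalue dependency is installed through BiocManager, the usual Bioconductor installer.

```r
# Install the Bioconductor dependency qvalue (requires network access).
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("qvalue")

# Install ChIPCor from the downloaded source archive.
# "path/to/ChIPCor.tar.gz" is a placeholder, not the real file name.
install.packages("path/to/ChIPCor.tar.gz", repos = NULL, type = "source")

# Load the package to check that the installation worked.
library(ChIPCor)
```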


Prepare configuration file


The user needs to prepare an input list that contains the overlap information for the target protein pair and for a database of protein pairs from public data repositories, all under the same biological condition. Each element of the list holds the names of one pair of proteins and a list of two by two tables describing their overlap within each segment type.
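As a rough picture of such an input, the sketch below builds a toy list with two protein pairs and two segment types. The field names proteins and tables, the protein and segment names, and all counts are hypothetical. The layout ChIPCor actually expects is the one produced by prepareTbls and can be inspected with str(tbls[[1]]).

```r
# Helper that builds one 2x2 co-occurrence table from four bin counts.
cooccurrence <- function(both, first_only, second_only, neither) {
  matrix(c(both, first_only, second_only, neither), nrow = 2, byrow = TRUE,
         dimnames = list(first = c("bound", "unbound"),
                         second = c("bound", "unbound")))
}

# One list element per protein pair; one 2x2 table per segment type.
# Field names, protein names, and counts are invented for illustration.
toy_input <- list(
  list(
    proteins = c("TF1", "TF2"),
    tables = list(
      state1 = cooccurrence(120, 380, 410, 4090),
      state2 = cooccurrence(15, 285, 240, 9460)
    )
  ),
  list(
    proteins = c("TF1", "TF3"),
    tables = list(
      state1 = cooccurrence(60, 440, 190, 4310),
      state2 = cooccurrence(9, 291, 330, 9370)
    )
  )
)

# Inspect the structure of the first protein pair.
str(toy_input[[1]])
```

Every table for a given segment type sums to the number of bins of that type, so the tables of different protein pairs are directly comparable.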

To prepare the list of two by two tables for the protein pairs under a given genome segmentation, take the following steps:

  • Segment the whole genome into bins of equal size (e.g. 1000 base pairs). For instance, given the BED file of the ChromHMM segmentation for cell line K562, the genome can be segmented into 1000 bp bins:
     
    #read in a vector recording the length of each chromosome for hg18
    #(used below as chrlen)
    data(hg18.chrlen)
    #segment the genome into 1000 bp bins and
    #annotate each bin according to its ChromHMM state
    genomeseg <- genomeSeg("wgEncodeBroadHmmK562HMM.bed", winsize=1000, chrlen)
    
  • Similarly, for each protein, each bin can be annotated as 0 or 1 according to the absence or presence of its binding peaks using the function genomeSeg with the same segmentation length (e.g. 1000 base pairs).
  • Once the peak lists for all the proteins have been converted to 0 or 1 vectors, we obtain a table where each row corresponds to a genomic bin and each column represents the genome-wide binding profile of a protein. The sample database table for K562 can be downloaded here.
  • Then, the function prepareTbls constructs a two by two table that summarizes the co-occurrence pattern of binding sites for each pair of proteins within each genome segment type. Each element of the output of prepareTbls corresponds to one protein pair and records the co-occurrence pattern of that pair within each segment type.
    #load the bin-by-protein 0/1 table for K562
    data(K562peaklocmat)
    tbls <- prepareTbls(peakloc.mat, genomeseg)
    #data structure
    str(tbls[[1]])
    
  • Finally, the function ChIPCor can be applied to measure the spatial correlations of protein binding sites.
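The binning and tabulation steps above can be sketched in base R on a toy chromosome. This is only an illustration of the idea, not the implementation of genomeSeg or prepareTbls: the helper bin_peaks, the peak coordinates, and the segment labels are all invented for the example.

```r
winsize <- 1000
chr_len <- 10000
n_bins  <- ceiling(chr_len / winsize)

# Mark a bin 1 if at least one peak overlaps it, else 0.
# Peaks are 1-based closed intervals [start, end] on one toy chromosome.
bin_peaks <- function(starts, ends, n_bins, winsize) {
  hit <- integer(n_bins)
  for (i in seq_along(starts)) {
    first <- (starts[i] - 1) %/% winsize + 1
    last  <- (ends[i] - 1) %/% winsize + 1
    hit[first:last] <- 1L
  }
  hit
}

# Hypothetical peak lists for two proteins.
protA <- bin_peaks(c(150, 2900, 7200), c(400, 3100, 7600), n_bins, winsize)
protB <- bin_peaks(c(200, 5100, 7300), c(350, 5400, 7500), n_bins, winsize)

# Hypothetical segment type of each bin.
seg <- rep(c("active", "quiescent"), times = c(4, 6))

# Fixed factor levels guarantee a full 2x2 table even when a cell is empty.
status <- function(x) {
  factor(x, levels = c(1, 0), labels = c("bound", "unbound"))
}

# One 2x2 co-occurrence table per segment type.
tbls_by_seg <- lapply(split(seq_len(n_bins), seg), function(idx) {
  table(A = status(protA[idx]), B = status(protB[idx]))
})

print(tbls_by_seg)
```

Fixing the factor levels matters in practice: without it, a segment type in which one protein never binds would yield a table with a missing row or column.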


Examples


Using ChromHMM to segment the genome, we applied our testing procedure to transcription factors assayed in the ENCODE project cell line K562:

    data(K562tbls)
    #inspection of the data structure
    str(tbls[[1]])
    #conduct testing
    qval.K562 <- ChIPCor(tbls)
    head(qval.K562)

and GM12878:

    data(GM12878tbls)
    #conduct testing
    qval.GM12878 <- ChIPCor(tbls)
    head(qval.GM12878)
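
Once the q-values are available, a common next step is to keep the protein pairs that pass a chosen threshold. The sketch below assumes, purely for illustration, that the q-values are held in a named numeric vector with one entry per protein pair. The actual return type of ChIPCor may differ, so check it with str() first. The names and values are invented.

```r
# Toy stand-in for a vector of q-values; names and values are invented.
qval_toy <- c("TF1-TF2" = 0.001, "TF1-TF3" = 0.20, "TF2-TF3" = 0.03)

# Keep pairs with q-value below 0.05, most significant first.
sig <- sort(qval_toy[qval_toy < 0.05])
print(sig)
```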