Allele-specific expression (ASE) quantifies the relative expression of two alleles in a diploid individual, and such expression imbalance potentially contributes to phenotypic variation and disease pathophysiology among individuals. Since the two alleles used to measure ASE come from the same cellular environment and genetic background, they can serve as internal control and eliminate the influence of trans-acting genetic and environmental factors. Existing ASE detection methods analyze one individual at a time, therefore not only wasting shared information across individuals, but also posing a challenge for result interpretation across individuals. To overcome this limitation, we developed ASEP, a method that is able to detect gene-level ASE under one condition, as well as, ASE difference between two conditions (e.g., pre- vs post-treatment) in a population. We have demonstrated ASEP’s convincing performance through extensive simulations. Application of ASEP to human kidney and macrophage RNA-seq datasets have further illustrated its ability to uncover ASE genes related to kidney functions and cardiometabolic traits. With the wide application of large-scale transcriptome sequencing in biomedical studies, there is an urgent need to learn a comprehensive picture of ASE in diverse populations. We believe ASEP will be well-suited for this purpose and can guide future ASE studies.
Genome-wide association studies (GWAS) are successful in identifying candidate loci for complex human diseases and traits [1, 2]. Despite the impressive success for disease susceptibility loci discovery, few, if any, results from GWAS have led to the delivery of new therapies . The association peaks from GWAS typically identify a handful of gene candidates, but it is often unclear whether these candidates are expressed in relevant tissues and cell types. Further complicating the picture, we now know that most GWAS signals are probably the result of regulatory variants that impact gene expression, rather than amino acid changes. Data on gene expression from tissues and cell types directly involved in disease are critically important to find causative genes.
A commonly used approach to understand the functional roles of GWAS identified genetic variants is expression quantitative trait loci (eQTL) analysis [4, 5]. The rationale is that, a genetic variant, known as an eQTL, influences the expression level of a gene, and differences in gene expression levels among individuals may lead to different phenotypes. Studies have found that many GWAS identified single nucleotide polymorphisms (SNPs) are significantly enriched for eQTLs, compared to control SNPs matched by allele frequencies . eQTL analysis identifies both cis- and trans-regulatory SNPs, in which cis-eQTLs affect gene expression in an allele-specific manner, with implications on underlying mechanism, whereas trans -eQTLs affect gene expression in an allele independent manner . Although eQTL analysis has successfully uncovered functional variant loci that regulate gene expression, typical eQTL analysis only tells local versus distal association [7, 8]. The lack of explicit information on cis- versus trans - makes it difficult to directly link to the underlying mechanism, and the requirement of a relatively large sample size for eQTL analysis further makes it impractical for studies that involve difficult-to-collect tissues .
To identify cis-regulated gene expression variation, analysis of allele-specific expression (ASE) is required. ASE refers to unequal expression between paternal and maternal alleles of a gene in a diploid individual, driven by cis -regulatory variants located near the gene . The allelic imbalance of gene expression may explain phenotypic variation and disease pathophysiology. Since the two alleles used to measure ASE are expressed in the same cellular environment and genetic background, they can serve as internal control and eliminate the influence of trans-acting genetic and environmental factors. It has been shown that ASE analysis requires 8-fold less samples than eQTL analysis to reach the same power in detecting cis -regulatory SNPs, and is less sensitive to SNPs with low minor allele frequencies (MAFs) compared to eQTL analysis .
To measure ASE, we exploit allelic imbalance by RNA sequencing (RNA-seq), which provides allele-specific read counts at exonic SNPs distinguished by heterozygous sites . Existing methods for ASE detection report evidence of ASE in single individuals, in which the ASE is quantified for each SNP (e.g., QuASAR ), and a gene-level ASE is obtained by integrating effects across SNPs in the same gene for an individual (e.g., MBASED  and GeneiASE ). However, evidence of ASE is often shared across individuals. Failure to accommodate such shared information not only loses power, but also makes it difficult to interpret results across individuals. It is desirable to have a method that simultaneously models both multi-SNP and multi-individual information.
ASE detection across individuals, however, is challenging when only RNA-seq data are available, because the data often include individuals that are either heterozygous or homozygous for the unobserved cis-regulatory SNPs, leading to heterogeneity in ASE. Such heterogeneity complicates the analysis because only heterozygous individuals are informative for ASE, whereas those homozygous individuals have balanced expression. Further, when analyzing multiple SNPs in the same gene, haplotype phase information is needed to separate the paternal and maternal alleles. Although it is possible to infer haplotype phase from DNA genotype data, most studies do not have such data available. Even when phase information is available, cross-individual read count alignment is still needed when performing cross-individual analysis, which is complicated as the cis-regulatory SNP is not observed. Fig 1 illustrates these analytical challenges in cross-individual gene-based ASE analysis.
To properly perform cross-individual gene-based ASE analysis using only the RNA-seq data, we propose ASEP (Allele-Specific Expression analysis in a Population), a generalized linear mixed-effects model based method with subject-specific random effect to account for correlation of multiple SNPs within the same gene. ASEP is able to detect gene-level ASE under one condition and differential ASE between two conditions (e.g., pre- versus post-treatment). Through extensive simulations and analysis of real RNA-seq datasets from human transcriptomic studies, we demonstrate that combining shared ASE information across SNPs and individuals leads to easier interpretation and improved power in identifying genes with ASE. Results from our analysis shed light on the functional roles of GWAS identified genetic variants.
The primary goal of ASEP is to perform gene-based ASE analysis across individuals using only the RNA-seq data. However, the population includes individuals that are either heterozygous or homozygous for the unobserved cis-regulatory SNP, and ASE is present only in those heterozygous individuals. To account for such heterogeneity and simultaneously aggregate multi-individual and multi-SNP information, we develop ASEP, a generalized linear mixed-effects model based method, in which the subject-specific random effect is used to account for correlation of multiple SNPs within the same gene, and sample heterogeneity is modeled by a two-component mixture distribution. Our method can be applied to detect gene-level ASE under one condition and differential ASE between two conditions across individuals in a population.
ASEP utilizes allele-specific read counts across transcribed SNPs of a given gene obtained from RNA-seq. For a given gene g, let rSNP be its unobserved cis-regulatory SNP with alleles R and r, where we assume the R allele leads to increased expression level of the residing haplotype as compared to the r allele. The haplotype with higher expression is denoted as the major haplotype, and the alleles on this haplotype are referred as the major alleles. For individuals homozygous or heterozygous at the rSNP, we denote them as ‘Hom’ or ‘Het’, respectively. Let tSNP be an observed transcribed SNP within the gene of interest detected from the RNA-seq data. Individuals that are homozygous for the tSNPs are excluded from analysis since they do not provide information on allelic expression.
When haplotypes are inferred and properly aligned with the unobserved rSNP alleles across individuals, for one condition ASE analysis, we detect evidence of ASE by testing the existence of a mixture distribution within the samples, i.e., group-level ASE difference between ‘Hom’ and ‘Het’ samples. For paired two-condition analysis, we test for the difference of ASE between two conditions for the ‘Het’ individuals. However, RNA-seq data alone do not provide information on haplotype phase or rSNP . To address these issues, we adopt a pseudo-phasing procedure originally proposed by MBASED , which is a ‘majority voting’ procedure based on allele-specific read counts, to infer the major haplotype for each individual. Details of ASEP can be found in Materials and Methods.
We evaluated the performance of ASEP to detect gene-level ASE as a function of the number of individuals (N), the number of tSNPs (nSNP), sequencing depth (Y), and pre-specified minor allele frequency (MAF) of the unobserved rSNP. When haplotype phase among tSNPs was known, our simulations showed that type I error rate of ASEP was controlled at the 1% level under all scenarios we investigated. As expected, the power increased as the number of individuals, sequencing depth or the number of heterozygous tSNPs increased. Among these three factors, the sequencing depth and the number of tSNPs were more influential as compared to the sample size. With high sequencing depth and more tSNPs, our method had sufficient power to detect an ASE effect of 0.6. Further, increasing the proportion of ‘Het’ individuals in the sample, determined by the MAF of the unobserved rSNP, improved the power under all scenarios considered. The model performed similarly when MAF = 0.3 or 0.5, and outperformed the model when MAF = 0.1 with other factors held constant. This is expected since only 18% of the individuals were heterozygous at the rSNP when MAF = 0.1 under Hardy-Weinberg equilibrium (HWE), whereas more than 40% of the individuals were heterozygous when MAF = 0.3 or 0.5, leading to a much larger effective sample size in ASE detection. However, when the sequencing depth, the number of tSNPs and the MAF of the rSNP were all at low level, increasing sample size resulted in decreased power. This is likely due to higher uncertainty when aligning haplotypes across multiple individuals with increased sample size but with less information on allelic read counts (Fig 2A and S1A Fig).
Next, we examined the performance of ASEP when haplotype phase was unknown under similar scenarios. The type I error rate of ASEP was still well controlled at the 1% level across all scenarios. The power increased as the sequencing depth or the number of tSNPs increased, with the read depth having higher impact on power as compared to the number of tSNPs. Notably, we observed a dramatic power increase as the read depth increased when there were only a few tSNPs in the gene. Increasing sample size only improved the power when at least two of the three above-mentioned factors, sequencing depth, number of tSNPs and MAF, were at moderate to high level. With low level of sequencing depth and MAF of the rSNP, we observed that the power decreased slightly when the sample size increased. This is because we assigned alleles with larger read counts to the major haplotype, thus the estimated ASE level for the ‘Hom’ group deviated more from 0.5 when the number of ‘Hom’ individuals increased with smaller MAF. This led to smaller ASE difference between the ‘Hom’ and ‘Het’ groups and hence decreased the detection power. Similarly, with small number of tSNPs and low level of sequencing depth and MAF, less information on the SNP level read counts was available, which led to increased phasing errors and resulted in decreased detection power (Fig 2B and S1B Fig).
Next, we evaluated the performance of ASEP to detect ASE difference between two conditions. When haplotype phase information was known, the type I error rate of ASEP was well controlled at the 1% level across a variety of settings. Similar to the one condition analysis, when MAF of the rSNP was fixed, the power increased as the number of individuals, the sequencing depth or the number of tSNPs increased. Among these three factors, the sequencing depth, followed by the number of tSNPs and the sample size, had the largest impact on power. Moreover, a specific factor increased the power more when accompanied by the increase of either of the other two factors. Further, increasing MAF of the rSNP also improved the power to detect differential ASE between two conditions. With MAF = 0.3, when any two of the three factors were at high level, ASEP had an adequate power to detect a, as small as, 0.05 ASE difference between two conditions (Fig 3A and S2A Fig).
When the haplotype phase was unknown, the type I error rate was still under control at the 1% level across all scenarios. For power evaluation, we set the ASE difference between two conditions to 0.075 to achieve an adequate power. Overall, the detection power followed similar pattern as what we observed under phase known scenarios. The power increased as the sequencing depth, the sample size or the number of tSNPs increased, with the sequencing depth being the most influential factor among the three as it led to much higher power improvement when the other two factors were fixed. Increasing the number of ‘Het’ individuals also dramatically improved the power. With MAF of 0.3, ASEP had sufficient power to detect a 0.075 ASE difference between two conditions when any two of the three above-mentioned factors were at high level (Fig 3B and S2B Fig).
We applied ASEP to a human kidney RNA-seq dataset generated from an eQTL study by Qiu et al . , which includes 121 tubule compartment samples. Details of sample characteristics, RNA-seq data processing and read mapping were described in the original paper . Allele-specific read counts for SNPs in exonic regions were obtained using WASP , which is robust to mapping bias in the presence of SNPs. For each tSNP, an individual was filtered out if the minor allele count was less than 5, or the total read count was less than 20, or the minor allele count was less than 5% of the total read count. In addition, we only analyzed genes that were expressed in three or more individuals in order to have enough information for parameter estimation.
In total, we analyzed 6,540 genes and detected 304 genes with significant ASE effect after FDR multiple testing adjustment (S1 Table). To validate our findings, we first compared our results with eGenes identified using the same kidney RNA-seq dataset by Qiu et al . . Here an eGene refers to a gene with cis-eQTLs at 5% FDR, where a cis -eQTL is defined as an eQTL located within 1 megabase (Mb) from the transcription start site of the gene . 179 (59%) of our ASE genes were also detected as eGenes by Qiu et al . . We further compared the 304 genes with another eQTL study performed on a different human kidney cortex RNA-seq dataset of 96 samples , and found that 97 (32%) of our ASE genes were detected as eGenes in their analyses, among which 85 were also detected as eGenes by Qiu et al . . Among genes detected as eGenes by both eQTL studies, GSTM3 showed strong evidence of ASE (FDR adjusted P < 0.0003). It has been reported that GSTM3 may function as a tumor suppressor in renal cell carcinoma . Fig 4 shows the estimated SNP-level ASE, i.e., the proportion of major allele read count relative to the total count of both alleles of each SNP, for each individual, sorted by their median of estimated ASE levels among heterozygous individuals for the analyzed tSNPs in the gene. We observed that about one-third of the individuals have estimated ASE level below 0.6, which presumably are individuals that are homozygous for the unobserved cis-regulatory SNP, whereas the rest of the individuals showed strong ASE effect. The reason that the estimated ASE level was greater than 0.5 is due to the ‘majority voting’ phasing procedure used to assign major alleles across SNPs. By aggregating information across individuals, ASEP was able to detect a significant ASE signal for this gene (Fig 4A).
We also detected significant ASE in APOC3 (FDR adjusted P = 0.0003). APOC3 is known to encode protein apolipoprotein C-III, which is highly associated with hypertriglyceridemia and its altered metabolism may lead to dyslipidemia in chronic kidney disease (CKD) . The RNA-seq data suggest that even though each individual only has a few transcribed SNPs in this gene, with consistent signals across individuals, ASEP was able to aggregate ASE information across individuals to facilitate the detection of a population-level ASE effect (Fig 4B).
An additional example is PGAP3, which showed strong evidence of ASE by ASEP (FDR adjusted P < 0.0003). PGAP3 encodes the glycosylphosphatidylinositol (GPI)-specific phospholipase that is crucial for protein sorting and trafficking . A previous study has shown that aged PGAP3 knockout mice developed the phenotype such as enlarged renal glomeruli with deposition of immune complexes and matrix expansion . In this dataset, we observed that many individuals showed small ASE effect at a few transcribed SNPs. However, by leveraging information across multiple SNPs and individuals, ASEP was able to uncover the ASE signal shared across individuals (Fig 4C).
Although 113 ASEP detected ASE genes (37%) were not identified as eGenes in either of the two eQTL studies, many of these genes are related to kidney functions, especially with chronic kidney disease (CKD). For example, SOD3 (FDR adjusted P < 0.0003) is an antioxidant highly expressed in normal kidneys and is protective in CKD progression . SPSB1 (FDR adjusted P < 0.0003) has been found as a novel regulator of the transforming growth factor-β (TGF-β ) signaling pathway , which mediates fibrosis and plays an important role in CKD . Changes in CYP24A1 (FDR adjusted P < 0.0003) expression have been shown to be related with dysfunctional vitamin D metabolism. Vitamin D deficiency may trigger renal osteodystrophy and lead to other complications of renal disease . PIGR (FDR adjusted P < 0.0003) is expressed in renal tubule epithelial cells and is related to innate immune system and IL4-mediated signaling events pathways . LBH (FDR adjusted P < 0.0003) may act in mitogen-activated protein kinase (MAPK) signaling pathway , which is relevant to renal cell function and pathophysiology . APOE (FDR adjusted P = 0.0005) modulates lipoprotein metabolism and is significantly related with CKD progression  (S3 Fig).
Next, we applied ASEP to a paired macrophage RNA-seq dataset generated from 48 healthy individuals (S2 Table). Human peripheral blood mononuclear cell (PBMC) can be cultured and differentiated to macrophages, and polarized in vitro to functionally and molecularly distinct M1-like inflammatory macrophages by IFN-γ and Lipopolysaccharide (LPS), an important and widely-used experimental model to study macrophage biology in homeostasis and diseases [30, 31]. M0 and M1 macrophages from each individual were subject to 2x101 bp paired-end RNA-seq. Reads were aligned to human genome hg19 using STAR 2.6.0a . Reads from each pair were required to map to the same chromosome with distance <500,000 bp. Only uniquely mapped reads were retained for downstream analysis. The RNA-seq data were processed using WASP  to remove possible mapping bias and extract allele-specific read counts.
We first applied ASEP for one condition analysis to M0 and M1 macrophage samples separately to detect ASE genes under each condition. Similar filtering criteria to the human kidney RNA-seq analysis were applied. In total, we analyzed 5,961 genes for the M0 and 5,465 genes for the M1 macrophage samples, with 4,783 genes in both. We identified 2,503 genes with significant ASE (P < 0.05) in M0 and 2,580 genes in M1. Among these genes, 1,223 were detected only in M0, with 408 of them not expressed in M1, and 1,300 genes detected only in M1, with 334 of them not expressed in M0. Additionally, 1,280 genes showed evidence of ASE under both conditions. After multiple testing adjustment with FDR, 2,402 genes remained significant (FDR adjusted P < 0.05) in M0 and 2,489 genes in M1 (S3 and S4 Tables), and 1,223 genes were found to have ASE under both conditions (Fig 5A).
To validate our findings, we first compared our results to an eQTL study based on monocytes from 134 healthy males . The monocytes were stimulated with three prototypical microbial ligands, LPS was used to activate Toll-like receptor 4 (TLR4), muramyl-dipeptide (MDP) to stimulate Nucleotide-binding oligomerization domain-containing protein 2 (NOD2), and 5′-triphosphate RNA to activate retinoic acid-inducible gene I (RIG-I). RNAs from these samples were sequenced at baseline, 90 minutes and 6 hours after stimulation. We found 460 (19%) of the FDR significant ASE genes for M0 macrophages overlapped with the eGenes identified at the baseline, and 493 (20%) of the ASE genes for M1 macrophages overlapped with the eGenes identified at either 90 minutes or 6 hours using one of the three microbial ligands as the stimuli. To examine if the percent overlapping eGenes (pobserved) is more than expected by chance, we performed resampling based enrichment analysis. For the 2,402 ASE genes detected for M0, we randomly sampled 2,402 genes from the remaining 3,559 genes that did not show evidence of ASE, and recorded the percentage of genes (presampled ) overlapping with eGenes in the monocyte eQTL study . We repeated this resampling procedure 10,000 times and the eGene enrichment p-value was calculated as . Similar analysis was performed for ASE genes detected for M1. Both M0 and M1 ASE genes have enrichment p-values less than 0.0001, suggesting the observed overlap with eGenes is more than expected by chance.
Encouraged by these results, we next performed differential ASE analysis between M0 and M1 by selecting the 2,714 candidate genes that were found to show evidence of ASE (P < 0.05) in M0 or M1 from the one condition analysis. Since haplotype phase is unknown, to reduce phasing error, for each gene, we chose the condition with higher estimated ASE effect as the ‘reference’ to phase the data from the other condition. In total, we detected 826 genes showing evidence of differential ASE (P < 0.05), with 582 genes still being significant after multiple testing adjustment (FDR adjusted P < 0.05) (Fig 5B and S5 Table ). We compared the differential ASE genes with response eQTLs identified in the monocyte eQTL study, where a response eQTL was defined as an eQTL with different effect between baseline and stimulated cells . We found 15 (3%) of our differential ASE genes had response eQTLs identified between monocytes at baseline and monocytes stimulated using at least one of the three microbial ligands sequenced at either 90 minutes or 6 hours: TRABD, AGTRAP, TMEM9, IRF5, AAK1, EIF2AK1, GBP3, GLRX, JUP, MBNL1, MCM7, MS4A7, PTGER4, SLFN5, TMEM110. For example, IRF5 has been demonstrated to promote inflammatory macrophage polarization ; GBP3 encodes a protein from the guanylate-binding protein family that is expressed in response to interferons and other pro-inflammatory cytokines and mediates innate immune responses against intracellular pathogens ; SLFN5 belongs to the schlafen family and plays an important role in the regulation of human T cell quiescence .
Since macrophages are important regulators and promoters of many cardiovascular disease programs, we further examined whether the 582 genes showing significant differential ASE overlap with findings from GWAS for cardiovascular disease (CVD), coronary artery disease (CAD) and acute coronary syndrome (ACS) . Among these 582 genes, 323 (56%) overlapped with loci that reached GWAS significance (P<5×10−8) (S6 Table). The differential ASE genes were marginally enriched for GWAS findings of selected traits as compared to those non-differential ASE genes (P = 0.078). For example, CCL3 (FDR adjusted P < 0.00002) encodes the macrophage inflammatory protein-1α that is known as a macrophage-derived inflammatory mediator and plays a well-known role in inflammatory responses . Fig 6 shows the estimated SNP-level ASE difference for each individual, i.e., the difference in major allele proportion of each SNP after haplotype phasing between M1 and M0 for each individual. After sorting individuals by their median of estimated ASE difference across all heterozygous transcribed SNPs, we observed that, the majority individuals have negative ASE difference with a few having positive ASE difference, which might be due to potential phasing error. About one-third of the individuals have median ASE difference around zero, and these individuals are presumably homozygous for the unobserved rSNP. However, since more individuals have negative ASE difference, ASEP was able to detect evidence of differential ASE at the population level by aggregating information across individuals and multiple transcribed SNPs within the gene (Fig 6A).
We also detected differential ASE in CASP8 (FDR adjusted P = 0.0003) in the macrophage data. A previous study has shown that loss of CASP8 expression in macrophages led to onset of a mild systemic inflammatory disease . CASP8 can control the response to TLR activation and macrophage polarization in a RIPK-dependent manner. Our RNA-seq data suggest that even though each SNP may have small effect, by aggregating information across SNPs and individuals, ASEP was able to uncover the consistent differential ASE signal between conditions in the population (Fig 6B).
Additionally, we detected differential ASE for IL1RN (FDR adjusted P < 0.00002), which encodes protein interleukin-1 receptor antagonist (IL-1RA) that modulates a variety of interleukin 1 related immune and inflammatory responses [27, 40]. Although each individual only has a few transcribed SNPs in this gene, by accumulating evidence across multiple individuals, we were able to detect a consistent signal of differential ASE (Fig 6C). Further, we detected differential ASE in ABHD16A (FDR adjusted P < 0.00002). A study has shown that ABHD16A dynamically regulates the metabolism of lysophosphatidylserines (lyso-PS), a class of signaling lipids that regulate (neuro)immunological processes . Although heterozygous sites in ABHD16A varied across individuals and most of them are heterozygous for only one transcribed SNP in this gene, for each SNP the ASE effect was consistently larger in M1 than in M0. By aggregating information across multiple SNPs and individuals, ASEP was able to detect a population-level differential ASE effect (Fig 6D).
Other genes that showed significant differential ASE include those from the cluster of differentiation, e.g., CD226 (FDR adjusted P < 0.00002), CD68 (FDR adjusted P < 0.00002) and CD44 (FDR adjusted P = 0.004)), RCAN1 (FDR adjusted P < 0.00002), TSPO (FDR adjusted P < 0.00002), AKT1 (FDR adjusted P = 0.00008), and PIEZO1 (FDR adjusted P = 0.001) (S4 Fig).
Although 259 of the differential ASE genes did not overlap with GWAS findings, some of them may play a relevant role in inflammation. For example, MAPK14 (FDR adjusted P = 0.0009) (Fig 6E ), is involved in the production of inflammatory mediators, and play an essential role in mediating cellular responses to injurious stress and immune signaling [42, 43]. Other genes of interest include DDX24 (FDR adjusted P < 0.00002), GRK3 (FDR adjusted P < 0.00002), GBP2 (FDR adjusted P = 0.00002), EEF2 (FDR adjusted P = 0.00009) and SLFN5 (FDR adjusted P = 0.005), where SLFN5 was also identified as having response eQTL in the monocyte study  (S4 Fig). For example, DDX24 negatively regulates the RIG-I-Like Receptors (RLR) pathway and type I IFN production, which may in turn negatively regulate the innate immune signaling . A study has shown that GRK3−/− mice exhibit numerous features of human WHIM syndrome, a rare congenital immune deficiency, indicating its potential effects on attenuating inflammatory responses. GBP2, similar to GBP3, belongs to the GBP family, which is mainly induced by IFN-γ and may play an important role in defense against pathogens . EEF2 has been found to be overexpressed in a wide variety of cancers as an antigen that can elicit both humoral and cellular immune responses .
ASE detection is an important step towards the understanding of genetic polymorphisms on gene expression variation. However, existing ASE detection methods mainly focus on individual-based ASE effect. To better utilize shared ASE information across individuals, we proposed ASEP, a novel method that can detect allelic imbalance in gene expression across individuals under one condition, and differential ASE between two conditions using only RNA-seq data. The main advantage of ASEP lies in its ability to leverage information across multiple individuals and SNPs within the same gene. Existing methods, such as MBASED , detect ASE effect through individual-based analysis, which makes it difficult to aggregate information across subjects. GeneiASE  uses Fisher’s meta-analysis method to combine p-values across subjects, however, the resulted p-value is driven by extremely small p-values, which may lead to a significant combined p-value even when ASE is absent in the majority of subjects.
A major challenge for cross-individual ASE analysis based on RNA-seq data alone is due to the difficulty in differentiating ‘Hom’ and ‘Het’ individuals as the underlying cis-rSNP is unobserved in the absence of DNA genotype data. By employing a mixture model, ASEP is able to aggregate ASE effects contributed by those ‘Het’ individuals while accounting for heterogeneity introduced by those ‘Hom’ individuals. As a result, ASEP is not only more powerful, but its results are also easier to interpret compared to traditional ASE tests that consider one individual at a time. Through extensive simulations, we showed that ASEP is sensitive in detecting small ASE effect under a wide range of scenarios. We further demonstrated that the ASE effects uncovered by ASEP are convincing through the analysis of RNA-seq datasets on human kidney and macrophages.
ASEP can be applied when haplotype phase information in the transcribed SNPs is known or unknown. When sequencing depth is high, the haplotype phase reconstruction approach employed by ASEP is able to correctly recover the true major haplotype. For genes with relatively low sequencing depth, correct assignment of haplotype phase will increase the power to detect ASE. Since rSNP is unobserved, paired RNA-seq data are needed for two-condition analysis in order to correctly phase the haplotypes and align them consistently across samples from both conditions. If DNA genotype data and phase information are available, then based on alleles of a candidate regulatory SNP, we not only can differentiate the ‘Het’ individuals but also can easily align ‘major’ haplotypes that reside on the same haplotype with the expression-increasing allele across individuals. This way, our method can be easily modified to detect ASE difference using all available data or even for unpaired samples, such as case-control study, to detect differential ASE between two independent groups. ASEP is a regression-based framework for ASE analysis, which is flexible and can be easily extended to adjust for additional covariates or confounders in the model if necessary.
As a method designed for analysis of bulk RNA-seq data, we cannot tell if the detected ASE is driven by cell-type composition change or cell-type-specific ASE. Therefore, for future study, investigating cell-type-specific ASE will help provide extra information and will be more powerful especially for genes expressed in rare cell types.
In summary, we have developed ASEP, a gene-based ASE detection method by aggregating information across individuals and SNPs within the same gene. ASEP can detect genes with shared ASE effect or differential ASE across individuals in a population, which leads to easier interpretation and improved power as compared to traditional individual-based ASE detection methods. With the wide application of RNA-seq in biomedical studies, more and more samples of the same tissue from different individuals become available to study gene-phenotype correlation. There is an urgent need to learn a comprehensive picture of ASE in the broad population instead of focusing on individual-level effect. We believe ASEP, which, to the best of our knowledge, is the first method for population-based ASE detection, will be well-suited for various ASE studies for human diseases.
We assume that only RNA-seq data are available for ASE analysis. For individual i at a transcribed tSNP j of gene g, let Xij be the read count for the reference allele in the genome, and Yij be the total read count at the SNP. Assume haplotype phase information is known, i.e., the paternal and maternal alleles can be differentiated when there are more than one tSNP of the gene. We further assume the major haplotype, defined as the haplotype resides with the R allele of the rSNP that has higher expression than the other haplotype, is known, and Mij is the read count for the corresponding allele that resides on the major haplotype. Mij is assumed to follow a binominal distribution, Binomial(Yij,Pi), where Pi is the ASE level, representing the underlying transcript frequency of the major haplotype for individual i. When there is no gene-level ASE, Pi = 0.5, and Pi>0.5 otherwise. The allele-specific read counts of the major haplotypes are then aligned across all individuals. To account for correlations across multiple tSNPs within the gene, we employ a generalized linear mixed-effects model:where the random effect, γi, represents the individual-specific true underlying transcript frequency of the major haplotype on a logit scale, and is assumed to follow some unknown distribution denoted as g(γi).
The likelihood of the above model can be written as:where f(∙)represents the probability density function of a binomial distribution. However, the integral does not have a closed form because g(γi) is unknown. We approximate it by a finite mixture over two mass points μ1, μ2 with probabilities π, 1−π, respectively, since ‘Hom’ and ‘Het’ individuals are expected to have different ASE levels under the alternative hypothesis (i.e., existence of ASE). The likelihood can then be written as:
Here, μ1 and μ2 indicate the population-level major allele transcript frequency of individuals that are heterozygous and homozygous for the unobserved rSNP, respectively. Based on our assumption, μ1, which represents the gene-level ASE effect, will deviate from 0, i.e., logit(0.5), whereas μ2 , which represents the situation of no ASE, will be around 0. To avoid imposing any prior distributional assumptions on the random effect, parameters are estimated using the non-parametric maximum likelihood estimation (NPML) approach, an Expectation-Maximization based method developed by Murray Aitkin [46, 47].
To detect gene-level ASE in the population, we test the following hypothesis:
We do not test H0: μ1 = 0 because we prefer to use individuals who are homozygous at the rSNP as an internal control to reduce excessive false positive results due to errors from haplotype phasing and across-individual alignment. We employed the likelihood ratio test statistic , where the likelihood under H1, , is calculated using Eq (1), and the likelihood under H0, , is obtained by fitting a standard generalized linear mixed-effect model assuming a common mean μ for the random effect γi . Since the null distribution of LRT does not follow standard χ2 distribution, we assess the statistical significance of the LRT through a resampling-based procedure. Specifically, for each individual i at tSNP j, we randomly sample Mij from Binomial(Yij,0.5), and calculate the corresponding LRTn using the sampled data. We repeat this procedure Nsim times, and calculate the gene-specific p-value as .
In the above framework, we have assumed the haplotype phase is known and the major allele can be inferred. However, in real studies, the haplotype phase is often unknown and the observed data offer little or no information of which allele is the major allele. In the absence of DNA genotype data, with only the Xij and Yij of the tSNP, it is challenging to infer which alleles at different SNPs reside on the same haplotype. Even when haplotype phase is known, lacking information of the rSNP makes it difficult to align read counts across individuals as we do not know which allele resides on the same haplotype with the R allele. To overcome these challenges, we adapted a pseudo phasing procedure originally employed by MBASED . This procedure uses a ‘majority voting’ approach based on observed read counts. For each individual, when the haplotype phase information is known, we assign the haplotype with larger total reads, obtained by summing up read counts across all SNPs on the same haplotype, as the major haplotype. When haplotype phase is unknown, we assign the allele with larger read counts of each SNP to the major haplotype, and alleles on the inferred major haplotype are treated as major alleles.
The previously described ASE detection procedure for one condition can be naturally extended to detect gene-level ASE difference between two conditions (e.g., conditions A and B) using paired RNA-seq data, where the same individual is sequenced under both conditions. Similar to the one condition analysis, for individual i at tSNP j, let and be the condition-specific reference allele read count, total read depth and major allele read count accordingly for individual i under condition c. is assumed to follow , where is the condition-specific true underlying transcript frequency of the major haplotype. After aligning major alleles across individuals, by introducing a covariate of condition indicator , defined asthe model can be modified as the following: where the random intercept, γi, represents , the individual-specific true underlying transcript frequency of the major haplotype for condition A on a logit scale; and the random slope, Zi, represents , the difference in the transcript frequency between the two conditions on a logit scale. γi and Zi are assumed to jointly follow some unknown distribution denoted as g(γi,Zi).
The likelihood of the above model can be written as:where f(∙)represents the probability density function of the binomial distribution. Similar to one condition analysis, we approximate the unknown distribution g(γi,Zi) by a finite mixture with modified likelihood as:
Here μ1 and μ2 represent the population-level transcript frequency of the major haplotype under condition A, β1 and β2 represent the difference in the transcript frequency between two conditions, for ‘Het’ and ‘Hom’ individuals, respectively. Similarly, the parameters are estimated through the NPML approach [46, 47].
To test gene-level ASE difference between two conditions with the ‘Hom’ individuals as an internal control, we consider the following hypothesis
Same as one condition analysis, the haplotype phase and major haplotype information are often unknown in real studies when only RNA-seq data are available. Therefore, we employ the pseudo phasing procedure to determine the major haplotype and align them across individuals . To ensure that the major haplotypes are identical for the same individual under different conditions, we choose one condition as the ‘reference’, obtain its phasing information, and phase the data from the other condition accordingly. To improve phasing accuracy, following MBASED , the condition with larger ASE effect is used as the ‘reference’.
Again, we consider the likelihood ratio test with as the test statistic. Under H1, the observed data likelihood, , can be approximated using Eq (2). Under H0, there is no ASE difference between the two conditions and the random slope Zi = 0. Therefore, the model reduces to the one condition model and the likelihood, , can be approximated using Eq (1). We assess the significance of the LRT by resampling. To obtain the null distribution of the LRT, for individual i at tSNP j, we resample from , where is the individual-specific estimate of the ASE level assuming no ASE difference between the two conditions. A two-step procedure is employed to obtain . First, for individual i, we combine data from both conditions, and calculate as the transcript frequency of the major haplotype in the pooled sample, where
Because we perform pseudo alignment on the RNA-seq data based on a ‘reference’ condition, after the ‘majority voting’, for ‘Hom’ individuals, , as the pooled major allele frequency, will always be larger than 0.5, which violates the assumption of no ASE effect under both conditions. To make the resampled data represent the null, as a second step, is obtained through a weighted sum as the following:where and are the estimated posterior probabilities that individual i belongs to the ‘Hom’ and ‘Het’ group, respectively. If the individual is estimated as ‘Hom’ individual with high probability, i.e., is large, this mechanism will down weight and make to be close to 0.5. If the individual is estimated as ‘Het’ individual with high probability, will borrow most of the information from and take a value similar as in the pooled sample. Based on the resampled data, LRTn can be obtained accordingly. This procedure is repeated Nsim times, and the p-value is calculated as .
ASEP is implemented as an R package and is freely available on Github (https://github.com/Jiaxin-Fan/ASEP), with detailed tutorial and examples provided.
Without loss of generality, we consider one gene only. To evaluate the performance of ASEP across a wide range of scenarios, we simulated RNA-seq data for N individuals (20 or 50), each with nSNP number of tSNPs (2, 4 or 6). For each individual, we generated the data with a pre-specified minor allele frequency (MAF) of the rSNP (0.1, 0.3 or 0.5), and assigned ‘Hom’ or ‘Het’ based on the genotype of the rSNP. The haplo-genotype data were simulated assuming HWE with assigned haplotype frequencies such that, for each tSNP, MAF = 0.3 with the linkage disequilibrium (LD) coefficient between pairs of tSNPs set at 0.8.
The read count for the major allele of each tSNP was sampled from Binomial(Yij, 0.5) for ‘Hom’ individuals and from Binomial(Yij, Pi) for ‘Het’ individuals across all simulations. For simplicity, we assume Yij = Y for all individuals across all tSNPs, where Y takes two possible values, 50 or 100. For ‘Het’ individuals, when evaluating the type I error rate, we set Pi = 0.5 under both phase known and unknown scenarios. When evaluating power, to account for subject-specific random variation in ASE levels, Pi on the logit scale, was simulated from Normal(logit(P),0.032), where P is the pre-specified ASE effect in the population. We set P = 0.6 under both phase known and unknown situations.
Similar to one condition analysis, for ‘Hom’ individuals, the major allele read count for each tSNP was simulated from for both conditions across all evaluations. For ‘Het’ individuals, the major allele read count was simulated from , where c represents condition (A or B). For simplicity, we assume , where Y is either 50 or 100. When evaluating the type I error rate, we set under both phase known and unknown scenarios. When evaluating the power, and , on the logit scale, were sampled from Normal(logit(PA),0.032) and Normal(logit(PB),0.032), respectively, where PA and PB are the pre-specified condition-specific ASE effect in the population for condition A and condition B. When haplotype phase is known, we set and , and and , otherwise. Condition B was used as the ‘reference’ for pseudo-phasing given its stronger ASE effect.
All of the protocols for this study were approved by the Human Subjects Research Institutional Review Board at the University of Pennsylvania and Columbia University Irving Medical Center. Peripheral blood mononuclear cell (PBMC) collected using BD VACUTAINER Mononuclear Cell Preparation Tube were cultured in macrophage culture media, 20% FBS in RPMI 1640 media with 100 ng/ml human macrophage colony-stimulating factor (M-CSF), for 7 days on BD Primaria tissue culture plate to induce macrophage differentiation [49, 50]. Polarization was obtained in the presence of M-CSF by 18–20 hour incubation with 20 ng/ml interferon-gamma (IFN-γ ) and 100 ng/ml lipopolysaccharide (LPS) for M1-like polarization [49, 50].
RNA samples of M0 and M1 macrophages were extracted using All Prep DNA/RNA/miRNA Universal Kit (Qiagen, Valencia, CA) by batches and the samples were randomly assigned to each batch [49, 50]. The RNA quality and quantity were determined by Agilent 2100 Bioanalyzer (Median RIN = 7.9, n = 96 samples from 48 subjects). With a minimum of 300 ng input RNA, libraries were prepared using the TruSeq RNA Sample Preparation Kit (Illumina, San Diego, CA), followed by 101 bp 60M paired-end sequencing on an Illumina’s HiSeq 2500 at Columbia Genome Center.