genesdev.cshlp.orggenesdev.cshlp.org/.../Supplemental_Methods.docx · Web viewThe 7 BglII digested...
Supplemental Methods
Library preparation:
a. Linker preparation:
Linkers were prepared in a 250 µl reaction mixture by adding 25 µl of each linker
oligo (100 µM) (Suppl. Table S5) and 25 µl of 10X Titanium PCR buffer from Clontech
(Cat#639209). The reaction mixture was mixed and distributed (as 50 µl aliquots) into
five different PCR tubes. These tubes were kept at 94°C for 1 minute in an MJ Research
Thermal Cycler. Then the tubes were transferred to a 100 ml beaker containing 80 ml of water at
80°C, and the water was allowed to cool. Once the water reached room temperature, the
linker was mixed and stored at -80°C.
b. Digestion of infected HEK293T DNA with the restriction enzyme Mse I:
Two independent digests of 3 µg were prepared at different times. The following
procedure was used each time. A reaction cocktail consisting of 35 µl of 10X NEB buffer
4, 35 µl of 10X BSA, and 21 µl of Mse I (NEB) in 224 µl water was mixed and distributed as
90 µl aliquots into three 1.5 ml vials. 1 µg of DNA (10 µl) was added to each vial and the
vials were incubated at 37°C for 16 hr.
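The volumes above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original protocol; the variable and function names are illustrative, and only the volumes are taken from the text.

```python
# Volumes (µl) of the Mse I digestion cocktail, as listed in the text.
cocktail = {"10X NEB buffer 4": 35, "10X BSA": 35, "Mse I": 21, "water": 224}

ALIQUOT_UL = 90   # cocktail dispensed per vial
DNA_UL = 10       # 1 µg of DNA added per vial
N_VIALS = 3

total_ul = sum(cocktail.values())      # volume of cocktail prepared
used_ul = ALIQUOT_UL * N_VIALS         # volume of cocktail dispensed
reaction_ul = ALIQUOT_UL + DNA_UL      # final volume of each digest


def per_reaction_ul(component):
    """Volume of one cocktail component that ends up in each digest."""
    return cocktail[component] * ALIQUOT_UL / total_ul


def final_x(component, stock_x=10):
    """Final concentration (in X) of a 10X stock in each digest."""
    return stock_x * per_reaction_ul(component) / reaction_ul


print(total_ul, used_ul, reaction_ul)
print(per_reaction_ul("Mse I"), final_x("10X NEB buffer 4"))
```

Under these numbers, 315 µl of cocktail is prepared, 270 µl is dispensed, each 100 µl digest receives 6 µl of Mse I, and the buffer and BSA end up at 1X.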
c. Purification of digested DNA fragments:
The three parallel MseI digests were mixed and purified using the Qiagen PCR purification kit
(catalog #28106). In summary, 500 µl of buffer PB with pH indicator was added to 100 µl
of the digested sample and mixed. If the color was yellow, we proceeded. If the color was
not yellow, 10 µl of 3 M sodium acetate, pH 5.0 was added. The sample was then divided
onto three QIAquick columns and centrifuged at 13,000 rpm for 1 min. The flow-through
was discarded and the procedure was repeated to process the entire volume. The columns
were washed with 750 µl PE buffer and the flow-through was discarded. An additional
centrifugation step was done to remove the residual alcohol and the column was dried for
1 minute. DNA was eluted in 30 µl warm water into a 1.5 ml microcentrifuge tube, and warm
water was added to bring the total volume to 90 µl.
d. Linker ligation:
The linker ligation reaction mixture was prepared by mixing 16 µl of NEB ligation buffer
(10X), 8 µl NEB T4 ligase (cat#M0202S)(400 U/µl) and 36 µl water with 28 µl of the
double stranded linker at a concentration of 10 µM. 11 µl of this mixture was distributed
into 7 PCR tubes, 9 µl purified Mse I digested DNA was added to each tube, and the
mixtures were incubated at 16°C for 16 hr in the MJ Research Thermal Cycler and then at
65°C for 10 minutes to inactivate the ligase.
The 3 µg of genomic DNA digested with MseI at a separate time was ligated to linker
using the following conditions. We mixed 20 µl Invitrogen ligation buffer (5X), 2.5 µl
Invitrogen T4 ligase (P/N 100004917) and 17.5 µl double-strand linker at a
concentration of 10 µM. Four tubes, each containing 8 µl of this cocktail, were mixed with 12.5
µl of the purified Mse I digested DNA and the mixtures were incubated at 16°C for 16 hr
in the MJ Research Thermal Cycler and then at 65°C for 10 minutes to inactivate the ligase.
e. BglII digestion of linker ligated product:
For each of the 7 ligation tubes, 20.5 µl of the reaction mixture was mixed with
29.5 µl of a cocktail that was made by mixing 24 µl NEB 3 buffer (10X), 12 µl NEB
BglII (10 U/µl) and 200 µl water.
For each of the 4 ligation tubes from the separate MseI digestions, 20.5 µl of the
ligated reaction mixture was mixed with 29.5 µl of a reaction mixture that was prepared
by mixing 15 µl NEB 3 buffer (10X), 7.5 µl NEB BglII (10 U/µl) and 125 µl water. Each
BglII digestion was incubated at 37°C for 2 hr. After the 2 hr incubation, 10 U of BglII
was added and the reactions were incubated for an additional 2-3 hr.
f. PCR reactions for generating the integration site libraries:
The 7 BglII digested reactions were used as template for PCR. For each barcode
(4, 5, and 6), three master tubes, each having a reaction volume of 200 µl, were prepared
by mixing 20 µl Clontech PCR buffer 10X, 4 µl dNTP, 6 µl of each barcode DNA (10
µM), 4 µl of Titanium Taq (Clontech, cat#639209), 35 µl template and 124 µl of
water. In total, 9 independent PCR master tubes were generated (three different
barcodes, three reaction mixtures). Because each master tube was divided into 4 PCR
reaction tubes, for each barcode, there were 12 PCR reactions of 50 µl. Similarly, from
the 4 tubes from the independent BglII digestions, a total of 4 PCR reaction tubes of 50 µl
were incubated for each of the barcodes 1, 2, 3, 4, and 5 (total of 20 independent PCR
reactions). The PCR cycles were: 94°C for 4 min; 94°C for 15 sec, 65°C for 30 sec, 68°C for 45
sec, go to step 2 for a total of six cycles; 94°C for 15 sec, 60°C for 30 sec, 68°C for 45 sec, go
to step 6 for a total of 24 cycles; 68°C for 10 min; then cool the reaction to 4°C.
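The cycling program and the reaction counts described above can be written out explicitly. This is a minimal sketch for bookkeeping only; the constant names are illustrative, and the total hold time ignores ramp times, which the text does not specify.

```python
# Each step is (temperature in °C, hold time in seconds), transcribed from the text.
INITIAL_DENATURATION = [(94, 240)]
STAGE_1 = [(94, 15), (65, 30), (68, 45)]   # repeated for 6 cycles
STAGE_2 = [(94, 15), (60, 30), (68, 45)]   # repeated for 24 cycles
FINAL_EXTENSION = [(68, 600)]


def expand_program():
    """Return the flat list of holds, excluding the final 4°C hold."""
    return INITIAL_DENATURATION + STAGE_1 * 6 + STAGE_2 * 24 + FINAL_EXTENSION


program = expand_program()
amplification_cycles = 6 + 24
hold_minutes = sum(seconds for _, seconds in program) / 60  # ramp time not included

# Reaction bookkeeping for the first set of libraries (barcodes 4, 5, and 6).
master_tubes = 3 * 3                   # three barcodes x three master tubes
reactions_per_master = 200 // 50       # each 200 µl master tube gives four 50 µl reactions
reactions_per_barcode = 3 * reactions_per_master

print(amplification_cycles, hold_minutes, master_tubes, reactions_per_barcode)
```

With these values the program has 30 amplification cycles and 59 minutes of programmed hold time, and the bookkeeping reproduces the 9 master tubes and 12 reactions per barcode stated in the text.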
g. Extraction and purification of PCR product:
For each barcoded set of PCR reactions, the products were combined and loaded
onto half of a 10 cm 2% TBE agarose gel run at 90 V. DNA fragments between 200 and 500
bp in length were excised and purified using the Qiagen gel extraction kit. The weight of
the gel band was determined, and 300 µl of buffer QG was added per 100 mg weight
of the gel. The mixture was incubated at 50°C until the gel dissolved completely. If the
color was yellow, we proceeded. If the color was not yellow, 10 µl of 3 M sodium
acetate, pH 5.0 was added. 100 µl of isopropanol was added per 100 mg gel before it was
loaded onto the QIAquick column (1 column per 400 mg of the gel), centrifuged, and the
flow-through was discarded. The same procedure was repeated for all of the samples.
Each column was washed with 500 µl QG buffer, and 500 µl buffer PE was added and
the columns rested on the bench for 1 minute. After that, the columns were centrifuged
and the flow-through was discarded. An additional centrifugation step was performed to
remove any residual alcohol and the columns were dried for 1 minute. The PCR product was
eluted by adding 30 µl warm water, which was left on the column for 1 minute, and the eluate was collected by
centrifugation. The eluates were stored at -20°C.
h. Illumina Paired-end Sequencing:
All of the independent PCR reactions were mixed as shown in Table S9 and the
sequences were determined using three different lanes on an Illumina HiSeq2000.
Bioinformatics analysis:
All of the sequence analysis was done using scripts written in Perl or Python. All RefSeq
databases were downloaded from the UCSC website. We used BLAT from the UCSC site
to align the Illumina reads to the human genome (hg19).
a. Alignment of the Illumina reads:
Read 1 contained sequence information for the barcode, the LTR, and the adjacent human
genome sequence, whereas read 2 contained the sequence of the linker and the random
nucleotides that were used to measure independent integration events at individual sites
(to be reported in a separate publication). We selected and retained only those reads
which contained correct sequences for the full barcode and the full LTR in read 1, and the
full linker on read 2. After trimming accessory sequences (barcode and LTR sequences)
from read 1, we obtained 69,286,538 reads that were longer than 14 nucleotides. After
removing duplicates from the trimmed sequences, we had 20,285,651 unique reads.
These reads were aligned to the human genome by BLAT, giving a total of 43,685,085
matches. Only those reads that aligned from the first nucleotide of the genomic sequence
and had e-values less than 10⁻⁵ were selected (total 6,170,903). After removing the
multi-matched reads (reads that matched more than one location in the genome), a total of 961,274
unique integration sites were obtained. Multi-matched reads were removed on the basis
of bit score. The top strand coordinate of the match was used as the coordinate for each
insertion. When the matched sequence was from the negative strand, the coordinate of
the insertion was defined as the value obtained by subtracting 4 from the top
strand coordinate of the matched sequence.
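The selection of uniquely placed reads and the strand rule for the insertion coordinate can be sketched as follows. This is an illustration, not the scripts used in the study. The `Hit` record and its field names are hypothetical, and the way multi-matched reads are recognized here (the two best hits tie on bit score) is an assumption, since the text says only that they were removed on the basis of bit score.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical alignment record; field names are illustrative."""
    read_id: str
    chrom: str
    strand: str        # "+" or "-"
    query_start: int   # 1-based position in the read where the alignment begins
    top_coord: int     # top-strand coordinate of the match, as defined in the text
    evalue: float
    bit_score: float


def unique_integration_sites(hits, max_evalue=1e-5):
    """Return {read_id: (chrom, strand, coordinate)} for uniquely placed reads."""
    by_read = defaultdict(list)
    for hit in hits:
        # Keep alignments that begin at the first genomic nucleotide of the read
        # and have an e-value below the cutoff.
        if hit.query_start == 1 and hit.evalue < max_evalue:
            by_read[hit.read_id].append(hit)

    sites = {}
    for read_id, read_hits in by_read.items():
        read_hits.sort(key=lambda h: h.bit_score, reverse=True)
        # ASSUMPTION: a read is multi-matched when its two best hits tie on bit score.
        if len(read_hits) > 1 and read_hits[0].bit_score == read_hits[1].bit_score:
            continue
        best = read_hits[0]
        # Negative-strand matches are shifted by 4, as described in the text.
        coord = best.top_coord - 4 if best.strand == "-" else best.top_coord
        sites[read_id] = (best.chrom, best.strand, coord)
    return sites
```

A stricter reading, in which any read with more than one passing hit is discarded, would only require replacing the tie test with `len(read_hits) > 1`.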
Customized set of transcription units:
Since HIV-1 integrates into the transcribed sequences of genes and not the promoters, we
analyzed integration sites relative to a customized set of transcription units. The RefSeq
data set of the human genome (hg19) containing coordinates that define 44,525
transcripts was downloaded from the UCSC site. Some of these transcripts either overlap
or are totally internal to others. Therefore, in the process we describe below, we resolved
overlapping transcripts and made a customized set of transcription units for analysis of
gene ontology and cancer genes. Similarly, we used RefSeq transcripts of mm9 to map
HIV-1 integration sites in mouse embryonic fibroblast cells.
To define a custom set of transcription units we used coordinates that provided a
unique set of transcripts. RefSeq transcripts were sorted on the basis of chromosome,
start site and end site, and if transcripts had the same start and stop coordinates one was
retained for the transcription unit set. Adjacent transcripts were compared two at a time.
Internal transcripts were not considered part of the set. If transcripts had the same start
site but one had an internal termination site, or if transcripts had the same termination
site but one had an internal promoter, we excluded the internal transcripts from the set. If
two consecutive transcripts overlapped, we excluded the shorter transcript from the set.
This produced a set of 21,188 unique transcripts and their coordinates were used for the
transcription units to analyze the distribution of HIV-1 integration sites within
transcription units, calculate the fraction of total integration in transcription units ranked
by integration density, perform gene ontology, measure the relative frequency of
integration in cancer genes, and calculate multivariate models for integration density.
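The rules for building the custom set can be sketched as a single pass over the sorted transcripts. This is a minimal illustration under stated assumptions, not the original script: coordinates are treated as half-open intervals, strand is ignored, each transcript is compared with the most recently retained one, and the function name is hypothetical.

```python
def resolve_transcription_units(transcripts):
    """Collapse overlapping transcripts into a non-overlapping set.

    `transcripts` is an iterable of (chrom, start, end) tuples.
    """
    # Sorting a set removes transcripts with identical coordinates and orders
    # the rest by chromosome, start, and end.
    ordered = sorted(set(transcripts))
    kept = []
    current = None
    for tx in ordered:
        if current is None:
            current = tx
            continue
        same_chrom = tx[0] == current[0]
        overlaps = same_chrom and tx[1] < current[2]
        if not overlaps:
            kept.append(current)
            current = tx
        elif (tx[2] - tx[1]) > (current[2] - current[1]):
            # The new transcript is longer, so it replaces the retained one.
            current = tx
        # Otherwise tx is internal or shorter and is dropped.
    if current is not None:
        kept.append(current)
    return kept


example = [
    ("chr1", 100, 500), ("chr1", 100, 500), ("chr1", 150, 300),
    ("chr1", 400, 1000), ("chr1", 2000, 2500), ("chr2", 100, 200),
]
print(resolve_transcription_units(example))
```

Because an internal transcript is always shorter than the transcript that contains it, the single "keep the longer one" comparison covers the internal, shared-start, shared-end, and partial-overlap cases in this sketch.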
To determine the integration levels in transcription units with specific numbers of
introns, we included the transcription units with multiple spliced isoforms in multiple
intron groups based on the numbers of introns of each of its spliced isoforms. Thus, a
transcription unit expressing two spliced isoforms with different intron numbers would be
included in two intron number groups using the coordinates of the transcripts to define
the transcription units. To accomplish this we sorted all RefSeq transcripts into individual
groups based on the number of introns, so that, within a group, all transcripts had the
same number of introns. The coordinates of the transcripts within a group were used as
transcription units and were customized to resolve overlap using the same rules described
above. Adjacent transcripts with the same number of introns were compared. Internal
transcripts were excluded from the dataset and, when transcripts overlapped, the shorter
transcript was excluded. This procedure produced groups of transcription units based on
the number of introns present in the transcripts (the number in each group is shown in
Suppl. Table S6). The same procedure was used to determine integration levels in mouse
transcription units with specific numbers of introns.
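The grouping step can be sketched as below. The input tuple layout, including an exon count per transcript, is a hypothetical representation chosen for illustration; the intron number is taken as the exon count minus one. Overlap within each group would then be resolved with the same rules used for the custom set.

```python
from collections import defaultdict


def group_by_intron_number(transcripts):
    """Map intron count -> sorted list of unique (chrom, start, end) coordinates.

    `transcripts` is an iterable of (chrom, start, end, exon_count) tuples.
    """
    groups = defaultdict(set)
    for chrom, start, end, exon_count in transcripts:
        groups[exon_count - 1].add((chrom, start, end))
    return {n: sorted(coords) for n, coords in sorted(groups.items())}


isoforms = [
    ("chr1", 100, 5000, 4),    # isoform with 3 introns
    ("chr1", 100, 5000, 6),    # isoform of the same unit with 5 introns
    ("chr2", 700, 900, 1),     # intronless transcript
]
print(group_by_intron_number(isoforms))
```

As in the text, a transcription unit with two isoforms that differ in intron number appears in two groups.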
b. Mapping the integration sites with respect to the customized set of transcription units:
We used the customized set of transcription units described above to map the integration
sites. Each transcription unit was divided into 15 equal segments (red bars) and the
integration sites were counted for each segment of the transcription units. For the region
that is either upstream or downstream of transcription units, integration sites were
distributed into bins of 500 bp and counted. Integration sites were mapped upstream of
transcription units if they were nearer the 5’ end of a transcription unit than the 3’ end of
another transcription unit. Conversely, integration sites were mapped downstream of
transcription units if they were closer to the 3’ end than the 5’ end of another
transcription unit.
For determining the distribution of HIV-1 integration sites within intronless
transcription units, we used all intronless transcription units within our custom set and
performed a similar analysis, with the integration sites distributed into 15 equal bins.
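The binning of sites within and around a transcription unit can be sketched as follows. This is an illustration under stated assumptions: coordinates are half-open intervals, segments are numbered 5' to 3' with respect to the strand of the transcription unit, and the function names are hypothetical. The choice of which neighboring transcription unit an intergenic site belongs to (nearest 5' or 3' end) is not implemented here.

```python
def segment_index(pos, start, end, strand="+", n_segments=15):
    """Return the 0-based segment (5' to 3') containing `pos`, or None if outside."""
    if not start <= pos < end:
        return None
    idx = (pos - start) * n_segments // (end - start)
    return idx if strand == "+" else n_segments - 1 - idx


def flank_bin(pos, start, end, strand="+", bin_size=500):
    """Return (side, 0-based bin) for a site outside [start, end), else None.

    Bin 0 covers the first `bin_size` bases next to the transcription unit.
    """
    if start <= pos < end:
        return None
    left = pos < start
    distance = start - pos if left else pos - end + 1
    # The genomic left side is upstream for "+" units and downstream for "-" units.
    side = "upstream" if left == (strand == "+") else "downstream"
    return side, (distance - 1) // bin_size


print(segment_index(100, 0, 1500), segment_index(100, 0, 1500, strand="-"))
print(flank_bin(999, 1000, 2500), flank_bin(2500, 1000, 2500, strand="-"))
```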
c. Correlation of the intron number with integration site density (inserts per kb):
The integration site density for each transcription unit was calculated using the non-
overlapping groups of transcription units that were assembled based on intron number
(described above). For each transcription unit in a group, the total number of integration
sites was determined and the number was divided by the length of the transcription unit
to determine the integration density (integration sites per kb) for that transcription unit.
For each group, the average integration density was determined by adding all the
integration densities and dividing by the number of transcripts in the group.
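The density calculation is simple enough to state directly. The sketch below uses hypothetical function names and made-up example numbers; it shows that the group average is the unweighted mean of the per-unit densities, which differs from pooling all sites over the total length.

```python
def integration_density(n_sites, start, end):
    """Integration sites per kb for one transcription unit."""
    return n_sites / ((end - start) / 1000)


def average_density(units):
    """Mean of per-unit densities; `units` holds (n_sites, start, end) tuples."""
    densities = [integration_density(*unit) for unit in units]
    return sum(densities) / len(densities)


# Two illustrative units: 10 sites over 2 kb and 3 sites over 1.5 kb.
group = [(10, 0, 2000), (3, 1000, 2500)]
print(average_density(group))
```

For the example group, the per-unit densities are 5.0 and 2.0 sites per kb, so the group average is 3.5, whereas pooling would give 13 sites over 3.5 kb.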
d. Correlation of alternatively spliced transcription units with the average integration
density:
RNA-seq analyses were conducted for three independent experiments and the cross-
correlations were very high between the datasets, demonstrating that the RNAseq data
were highly reproducible (Suppl. Fig. S11). The reads were interpreted using Cufflinks,
which defines the spliced species of each transcription unit and the expression of each
transcript as fragments per kb of exons per million (FPKM). The transcription units were
sorted into groups based on their number of spliced products as detected by Cufflinks.
Then within each group, overlapping transcription units were resolved by comparing
adjacent transcription units two at a time. Transcription units that are contained entirely
within other transcription units were excluded. If transcription units had the same start
site but one had internal termination or if transcription units had the same termination site
but one had an internal promoter, we excluded the internal transcription unit. If the two
adjacent transcription units had some overlap we excluded the shorter one. This produced
groups of non-overlapping transcription units that were based on the number of
alternative transcripts. The number of transcription units in each group is listed in Suppl.
Table S5. The integration density (inserts per kb) for each transcription unit was
calculated by dividing the total number of integration sites per transcription unit by the
length of the transcription unit. The average integration density of the transcription units in
a group was calculated by dividing the total integration density of each group by the
number of transcription units in that group.
d. Analysis of HIV-1 integration in genes with 1, 2, and 3 introns compared to genes with
matched sizes that have 10 introns:
Using the groups of transcription units with specific numbers of introns we made
two sets: one group of transcription units contained all the transcription units that had 10
introns and the other group contained all of the transcription units with 1, 2, or 3 introns.
We sorted the transcription units on the basis of length. For each of the transcription units
in the 10-intron group, we found a transcription unit in the 1, 2, or 3 intron group whose
length differed by less than 500 bp. We selected 673 transcription units that contained 10
introns (group A) and size matched it with 673 transcription units having 1, 2 or 3 introns
(group B). The total size of the transcription units in both groups was similar (43 kb for
10 introns and 42.6 kb for genes with 1-3 introns). For both these groups, we determined
the distribution of HIV integration sites in the transcription units by dividing the
transcription units into 15 equal segments and counting integrations in each bin. For the
region that is either upstream or downstream of transcription units, integration sites were
sorted into bins of 500 bp. We also calculated the average integration site density for each
group by using the actual HIV-1 integration sites and the control MRC sites.
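The size matching can be sketched with a two-pointer pass over the length-sorted groups. This is one possible implementation, not necessarily the one used: it assumes each transcription unit in group B is used at most once, and the function name is hypothetical.

```python
def size_match(lengths_a, lengths_b, tolerance=500):
    """Pair each length in group A with an unused length in group B that
    differs by less than `tolerance` bp. Returns a list of (a, b) pairs."""
    a = sorted(lengths_a)
    b = sorted(lengths_b)
    pairs = []
    j = 0
    for length in a:
        # Skip group-B units that are too short for this and every later group-A unit.
        while j < len(b) and b[j] <= length - tolerance:
            j += 1
        if j < len(b) and abs(b[j] - length) < tolerance:
            pairs.append((length, b[j]))
            j += 1  # ASSUMPTION: each group-B unit is used at most once
    return pairs


ten_intron_lengths = [1000, 5000, 20000]
few_intron_lengths = [1200, 1300, 5600, 19900]
print(size_match(ten_intron_lengths, few_intron_lengths))
```

In the example, the 5,000 bp unit has no partner within 500 bp and is left unmatched, so the two matched groups have equal numbers of members.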
e. HIV analysis from published data in mouse fibroblast cells:
Published data (1, 2) were used for these analyses. We used RefSeq of the mm9
compilation of the sequence of the mouse genome from UCSC. We mapped the
integration sites in the mouse genome using the same scripts that were used to analyze
the integration sites in HEK293T cells.
f. Gene ontology and cancer gene analysis:
We used our customized set of transcription units to determine which transcription units
contained the highest number of HIV-1 integration sites. We sorted the transcription units
on the basis of total sites and separately on the basis of inserts per kb. The top 1,000
transcription units were analyzed for gene ontology using DAVID (3, 4). The gene
ontology terms and their associated P values are given in Table 1 and Suppl. Table S3.
Three different cancer gene datasets (5-7) were used to determine the prevalence of
cancer genes in the top 1,000 targeted transcription units using Perl scripts.
h. Matched Random Control:
We made a list of all of the Mse I recognition sites on each chromosome in the human
genome. Next, we made a similar list that contained the positions of all of the Mse I and
Bgl II positions on each chromosome in the human genome. For each HIV integration
site, we determined the distance from the nearest Mse I site. We selected a random Mse I site
from the whole genome. We then determined the coordinate of the MRC site using the
distance of a true HIV-1 integration site from the nearest Mse I site. We randomly chose
whether the MRC site would be upstream or downstream of the MseI site. We generated
a total of 961,176 MRC sites.
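The construction of matched random control sites can be sketched as follows. This is a simplified illustration: it works on a single list of Mse I positions rather than per-chromosome lists, it does not check that the control coordinate stays within chromosome bounds, and the function names are hypothetical.

```python
import bisect
import random


def distance_to_nearest(sorted_sites, pos):
    """Distance from `pos` to the closest entry of a sorted, non-empty list."""
    i = bisect.bisect_left(sorted_sites, pos)
    candidates = sorted_sites[max(i - 1, 0):i + 1]
    return min(abs(pos - site) for site in candidates)


def matched_random_controls(integration_sites, msei_sites, rng):
    """Return one control coordinate per integration site.

    Each control sits at the same distance from a randomly chosen Mse I site
    as the real integration site sits from its nearest Mse I site.
    """
    msei_sorted = sorted(msei_sites)
    controls = []
    for pos in integration_sites:
        distance = distance_to_nearest(msei_sorted, pos)
        anchor = rng.choice(msei_sorted)       # random Mse I site
        direction = rng.choice((-1, 1))        # upstream or downstream, at random
        controls.append(anchor + direction * distance)
    return controls


msei = [100, 400, 900, 1500]
real_sites = [120, 650, 1490]
print(matched_random_controls(real_sites, msei, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the control set reproducible for a given seed.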
i. Mass spectrometry-based proteomics:
To identify the cellular binding partners of LEDGF/p75 and LEDGF/p52, GST,
GST-LEDGF/p75, and GST-LEDGF/p52 were pre-bound to glutathione resin in 200 mM
NaCl, 25 mM Tris (pH 8.0), 0.1% Nonidet P-40, 2 mM β-mercaptoethanol, and 1x
complete protease mixture. Nuclear extracts were prepared from HEK 293T cells using
the NE-PER Nuclear and Cytoplasmic Kit. The nuclear extracts were then allowed to
bind to the resin with the associated LEDGF and control proteins and were washed twice.
The proteins that were bound to the resin were separated by SDS/PAGE and each lane
was subjected to in-gel trypsin digestion, followed by analysis of the peptide fragments
with capillary-liquid chromatography–tandem mass spectrometry (MS/MS) using a
Thermo Finnigan LTQ Orbitrap mass spectrometer equipped with a microspray source.
Splicing factors were identified by cross-referencing the data against the spliceosome
database at http://spliceosomedb.ucsc.edu/ and are reported in Table 2 with total spectral
counts. The complete list of GST-LEDGF/p52 and GST-LEDGF/p75 binding partners is
shown in Suppl. Table S10.
The interactions between GST, GST-LEDGF/p75, and GST-LEDGF/p52 and
select cellular proteins were also analyzed by affinity pull-down assays, and the
interacting proteins were detected by western blotting. GST, GST-LEDGF/p75, and
GST-LEDGF/p52 were prebound to glutathione resin in 200 mM NaCl, 25 mM Tris (pH
8.0), 0.1% Nonidet P-40, 2 mM β-mercaptoethanol, and 1x complete protease mixture.
Nuclear extract was prepared from HEK 293T cells using the NE-PER Nuclear and
Cytoplasmic Kit. The nuclear extracts were then allowed to bind to the resin in 200 mM
NaCl, 25 mM Tris (pH 8.0), 0.1% Nonidet P-40, 2 mM β-mercaptoethanol, and 1x
complete protease mixture and were washed twice with the same buffer. The proteins
bound to the resin were separated by SDS/PAGE and subjected to western blotting with the
following antibodies: anti-ASF/SF2, anti-hnRNP M, and anti-SF3B2 (Abcam ab38017,
Novus Biologicals NB200-315, and Abcam ab56800).
i. Pairwise and multivariate regression models of integration in transcription units:
To evaluate which features of transcription units best predict integration levels we
developed regression models using factors previously shown to predict integration
positions genome-wide (8-11). For each transcription unit in our custom set of 21,188, we
tabulated values for intron density, histone H3K4 trimethylation (ENCODE data for
HEK293T from Washington University GEO sample accession: GSM945288 using raw
signal), transcription level (FPKM from our RNAseq data for HEK293T), DNase I
cleavage sites (ENCODE data for HEK293T peak values from Duke GEO sample
accession: GSM1008573), and percent GC base pairs (Suppl. Table S. The transcription
units were grouped into sets of 100 based on integration density. The values of each
factor were averaged for each group. Natural log values for FPKM, DNase I sites, and
integration density were used because this resulted in higher correlations.
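The grouping and averaging step can be sketched as below. The dictionary keys are hypothetical names for the tabulated factors. Two points are assumptions, because the text does not specify them: the natural log is applied to the group means rather than to individual values before averaging, and a final group with fewer than 100 members is kept.

```python
import math


def grouped_means(units, group_size=100):
    """Sort units by integration density and average each factor within groups.

    `units` is a list of dicts mapping factor name -> value and must include
    the key "integration_density". Returns one dict of means per group.
    """
    ordered = sorted(units, key=lambda u: u["integration_density"])
    groups = []
    for i in range(0, len(ordered), group_size):
        chunk = ordered[i:i + group_size]
        groups.append({k: sum(u[k] for u in chunk) / len(chunk) for k in chunk[0]})
    return groups


def log_transform(group, keys=("fpkm", "dnase", "integration_density")):
    """Replace the selected factors by their natural log (values must be > 0)."""
    return {k: (math.log(v) if k in keys else v) for k, v in group.items()}


toy_units = [
    {"integration_density": 4.0, "intron_density": 40.0},
    {"integration_density": 1.0, "intron_density": 10.0},
    {"integration_density": 3.0, "intron_density": 30.0},
    {"integration_density": 2.0, "intron_density": 20.0},
]
print(grouped_means(toy_units, group_size=2))
```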
An examination of each factor against log(integration density) showed virtually no
correlation of the integration density with the GC-content. A weak, negative correlation
was observed with log(DNase I), and strong correlations were obtained for log(FPKM) and
histone H3K4 trimethylation. The strongest correlation was with intron density. This means
that there is a strong linear relationship between intron density and log(integration
density), and intron density is therefore the best single predictor of integration density (smallest
root-mean-squared error in the prediction).
Pairwise linear regression with log(integration density)
Factor               Pearson's coefficient r
%GC                  0.096
histone H3K4me3      0.886
log(FPKM)            0.808
log(DNase I)         -0.387
intron density       0.903
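For reference, Pearson's r between a factor and log(integration density) can be computed from the group averages as in the sketch below. This is the standard definition written out in plain Python with a hypothetical function name; the input vectors are made-up and are not data from this study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up vectors: a perfectly linear relationship gives r = 1 or r = -1.
factor = [1.0, 2.0, 3.0, 4.0]
response = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(factor, response), pearson_r(factor, response[::-1]))
```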
When all five factors are included in multivariate analysis, the GC-content is ignored.
This is expected due to its small correlation with log(integration density). The
importance of the remaining four factors (as measured by the probability that the effect of
this factor is due to chance) reflects the same ordering as found in the individual
correlations with log(integration density). The intron density is the most important
factor, followed by H3K4me3, log(FPKM), and then log(DNase I). The importance of
this last factor is not significant (p = 0.137), which is not unexpected due to its weak
correlation with log(integration density).
Multivariate regression with all five factors
Factor               P value
Intercept            6.52e-7
H3K4me3              7.89e-6
log(FPKM)            0.00313
log(DNase I)         0.137
intron density       <2e-16
The adjusted r² value and the residual standard error for the multivariate fit when all five
factors are used and when each of the factors is individually removed are given in the table
below. The multivariate fit with all five factors is the same as the fit when the %GC-content
is removed, since it did not contribute to the fit when the other four factors were
present. Similarly, there was very little change in the r² when log(DNase I) was
removed, as expected from its low correlation with log(integration density). In this fit,
the GC-content was also found to be uninformative, so the actual fit used only H3K4me3,
log(FPKM), and intron density. When log(FPKM) was removed from the multivariate
fit, log(DNase I) and GC-content were both found to be uninformative. Therefore, a
linear model using just intron density and H3K4me3 produced an r² of 0.875. This is
somewhat expected, since these two factors had the highest individual correlations with
log(integration density). Finally, removing intron density had, by far, the largest effect
on the multivariate fit, decreasing the r² to 0.812. All of these results are consistent with
the conclusion that intron density is the strongest predictor of log(integration density) of
the five factors considered, followed by H3K4me3.
Multivariate fit with each factor individually removed
Factors              Residual standard error    Adjusted r²
All five             0.450                      0.883
No intron density    0.570                      0.812
No H3K4me3           0.480                      0.866
No log(FPKM)         0.463                      0.875
No log(DNase I)      0.452                      0.881
No GC-content        0.450                      0.883
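The two summary statistics in this table are not defined in the text. Assuming the conventional definitions for an ordinary least-squares fit with an intercept, where n is the number of groups, p is the number of factors retained in the fit, y_i is the observed log(integration density) of group i, and ŷ_i is its fitted value, they would be:

```latex
\bar{r}^{2} = 1 - \left(1 - r^{2}\right)\frac{n - 1}{n - p - 1},
\qquad
\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}{n - p - 1}}
```

Under these definitions, dropping a factor that contributes nothing to the fit leaves both values essentially unchanged, which matches the identical entries for "All five" and "No GC-content".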
References
1. Koh Y, Wu X, Ferris AL, Matreyek KA, Smith SJ, Lee K, KewalRamani VN,
Hughes SH, Engelman A. 2013. Differential effects of human immunodeficiency
virus type 1 capsid and cellular factors nucleoporin 153 and LEDGF/p75 on the
efficiency and specificity of viral DNA integration. J Virol 87:648-658.
2. Wang H, Jurado KA, Wu X, Shun MC, Li X, Ferris AL, Smith SJ, Patel PA,
Fuchs JR, Cherepanov P, Kvaratskhelia M, Hughes SH, Engelman A. 2012.
HRP2 determines the efficiency and specificity of HIV-1 integration in
LEDGF/p75 knockout cells but does not contribute to the antiviral activity of a
potent LEDGF/p75-binding site integrase inhibitor. Nucleic Acids Res 40:11518-
11530.
3. Huang da W, Sherman BT, Lempicki RA. 2009. Bioinformatics enrichment
tools: paths toward the comprehensive functional analysis of large gene lists.
Nucleic Acids Res 37:1-13.
4. Huang da W, Sherman BT, Lempicki RA. 2009. Systematic and integrative
analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc
4:44-57.
5. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Jr., Kinzler
KW. 2013. Cancer genome landscapes. Science 339:1546-1558.
6. Kandoth C, McLellan MD, Vandin F, Ye K, Niu B, Lu C, Xie M, Zhang Q,
McMichael JF, Wyczalkowski MA, Leiserson MD, Miller CA, Welch JS,
Walter MJ, Wendl MC, Ley TJ, Wilson RK, Raphael BJ, Ding L. 2013.
Mutational landscape and significance across 12 major cancer types. Nature
502:333-339.
7. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman
N, Stratton MR. 2004. A census of human cancer genes. Nat Rev Cancer 4:177-
183.
8. Craigie R, Bushman FD. 2012. HIV DNA integration. Cold Spring Harb
Perspect Med 2:a006890.
9. Berry C, Hannenhalli S, Leipzig J, Bushman FD. 2006. Selection of target sites
for mobile DNA integration in the human genome. PLoS Comput Biol 2:e157.
10. Wang GP, Ciuffi A, Leipzig J, Berry CC, Bushman FD. 2007. HIV integration
site selection: analysis by massively parallel pyrosequencing reveals association
with epigenetic modifications. Genome Res 17:1186-1194.
11. Schroder AR, Shinn P, Chen H, Berry C, Ecker JR, Bushman F. 2002. HIV-1
integration in the human genome favors active genes and local hotspots. Cell
110:521-529.