genesdev.cshlp.orggenesdev.cshlp.org/.../Supplemental_Methods.docx · Web viewThe 7 BglII digested...

Supplemental Methods

Library preparation:

a. Linker preparation:

Linkers were prepared in a 250 µl reaction mixture by adding 25 µl of each linker

oligo (100 µM) (Suppl. Table S5) to 25 µl of 10X Titanium PCR buffer from Clontech

(Cat#639209). The reaction mixture was mixed and distributed (as 50 µl aliquots) into

five different PCR tubes. These tubes were kept at 94°C for 1 minute in an MJ Research

Thermal Cycler. Then the tubes were transferred to a 100 ml beaker containing 80 ml of water at

80°C, and the water was allowed to cool. Once the water reached room temperature, the

linker was mixed and stored at -80°C.

b. Digestion of infected HEK293T DNA with the restriction enzyme Mse I:

Two independent digests of 3 µg were prepared at different times. The following

procedure was used each time. A reaction cocktail consisting of 35 µl of 10X NEB buffer

4, 35 µl of 10X BSA, 21 µl of Mse I (NEB) in 224 µl water was mixed and distributed as

90 µl aliquots into three 1.5 ml vials. 1 µg of DNA (10 µl) was added to each vial and the

vials were incubated at 37°C for 16 hr.
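
The volumes above can be checked with a little arithmetic. The following is a minimal sketch, not part of the original protocol, that scales the cocktail down to one 90 µl aliquot and reports the final buffer strength once the 10 µl of DNA is added. The function name `per_reaction` is illustrative only.

```python
def per_reaction(cocktail_ul, aliquot_ul):
    """Scale each cocktail component down to the volume carried by one aliquot."""
    total = sum(cocktail_ul.values())
    return {name: vol * aliquot_ul / total for name, vol in cocktail_ul.items()}

# Mse I digestion cocktail as listed in the text (volumes in µl).
cocktail = {"10X NEB buffer 4": 35, "10X BSA": 35, "Mse I": 21, "water": 224}
aliquot = per_reaction(cocktail, aliquot_ul=90)

# Each vial holds 90 µl of cocktail plus 10 µl (1 µg) of DNA.
final_volume = 90 + 10
buffer_x = 10 * aliquot["10X NEB buffer 4"] / final_volume

print("cocktail total (µl):", sum(cocktail.values()))
print("Mse I per vial (µl):", round(aliquot["Mse I"], 2))
print("final buffer strength (X):", round(buffer_x, 2))
```

Under these numbers the cocktail is enough for three 90 µl aliquots with some excess, and each 100 µl digest ends up at 1X buffer.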

c. Purification of digested DNA fragments:

The three parallel Mse I digests were mixed and purified with the Qiagen QIAquick PCR Purification

Kit (cat#28106). In summary, 500 µl of buffer PB with pH indicator was added to 100 µl

of the digested sample and mixed. If the color was yellow, we proceeded. If the color was

not yellow, 10 µl of 3 M sodium acetate, pH 5.0 was added. The sample was then divided

onto three QIAquick columns and centrifuged at 13,000 rpm for 1 min. The flow-through

was discarded and the procedure was repeated to process the entire volume. The columns

were washed with 750 µl PE buffer and the flow-through was discarded. An additional

centrifugation step was done to remove the residual alcohol and the column was dried for

1 minute. DNA was eluted in 30 µl warm water in a 1.5 ml microcentrifuge tube and warm

water was added to make the total volume 90 µl.

d. Linker ligation:

The linker ligation reaction mixture was prepared by mixing 16 µl of NEB ligation buffer

(10X), 8 µl NEB T4 ligase (cat#M0202S; 400 U/µl) and 36 µl water with 28 µl of the

double stranded linker at a concentration of 10 µM. 11 µl of this mixture was distributed

into 7 PCR tubes, 9 µl purified Mse I digested DNA was added to each tube, and the

mixtures were incubated at 16°C for 16 hr in the MJ Research Thermal Cycler and then at

65°C for 10 minutes to inactivate the ligase.

The 3 µg of genomic DNA digested with MseI at a separate time was ligated to linker

using the following conditions. We mixed 20 µl Invitrogen ligation buffer (5X), 2.5 µl

Invitrogen T4 ligase (P/N 100004917) and 17.5 µl double-strand linker at a

concentration of 10 µM. Four tubes containing 8 µl of this cocktail were mixed with 12.5

µl of the purified Mse I digested DNA and the mixtures were incubated at 16°C for 16 hr

in the MJ Research Thermal Cycler and then at 65°C for 10 minutes to inactivate the ligase.

e. BglII digestion of linker ligated product:

For each of the 7 ligation tubes 20.5 µl of the reaction mixture was mixed with

29.5 µl of a cocktail that was made by mixing 24 µl NEB 3 buffer (10X), 12 µl NEB

BglII (10 U/µl) and 200 µl water.

For each of the 4 ligation tubes from the separate MseI digestions, 20.5 µl of the

ligated reaction mixture was mixed with 29.5 µl of a reaction mixture that was prepared

by mixing 15 µl NEB 3 buffer (10X), 7.5 µl NEB BglII (10 U/µl) and 125 µl water. Each

BglII digestion was incubated at 37°C for 2 hr. After the 2 hr incubation, 10 U of BglII

was added and the reactions were incubated for an additional 2-3 hr.

f. PCR reactions for generating the integration site libraries:

The 7 BglII digested reactions were used as template for PCR. For each barcode

(4, 5, and 6), three master tubes, each having a reaction volume of 200 µl, were prepared

by mixing 20 µl Clontech PCR buffer 10X, 4 µl dNTP, 6 µl of each barcode DNA (10

µM), 4 µl of Titanium Taq (Clontech, cat#639209), 35 µl template and 124 µl of

water. In total, 9 independent PCR master tubes were generated (three different

barcodes, three reaction mixtures). Because each master tube was divided into 4 PCR

reaction tubes, for each barcode, there were 12 PCR reactions of 50 µl. Similarly, from

the 4 tubes from the independent BglII digestions, a total of 4 PCR reaction tubes of 50 µl

were incubated for each of the barcodes 1, 2, 3, 4, and 5 (total of 20 independent PCR

reactions). The PCR cycles were: 94°C for 4 min; 94°C for 15 sec, 65°C for 30 sec, 68°C for 45

sec (go to step 2 for a total of six cycles); 94°C for 15 sec, 60°C for 30 sec, 68°C for 45 sec (go

to step 6 for a total of 24 cycles); 68°C for 10 minutes; then cool the reaction to 4°C.

g. Extraction and purification of PCR product:

For each barcoded set of PCR reactions the products were combined and loaded

onto half of a 10 cm 2% TBE agarose gel run at 90 V. DNA fragments between 200 and 500

bp in length were excised and purified using the Qiagen gel extraction kit. The weight of

the gel band was determined, and 300 µl of buffer QG was added per 100 mg of

gel. The mixture was incubated at 50°C until the gel dissolved completely. If the

color was yellow, we proceeded. If the color was not yellow, 10 µl of 3 M sodium

acetate, pH 5.0 was added. 100 µl of isopropanol was added per 100 mg gel before it was

loaded onto the QIAquick column (1 column per 400 mg of the gel), centrifuged, and the

flow-through was discarded. The same procedure was repeated for all of the samples.

Each column was washed with 500 µl QG buffer, and 500 µl buffer PE was added and

the columns rested on the bench for 1 minute. After that, the columns were centrifuged

and the flow-through was discarded. An additional centrifugation step was performed to

remove any residual alcohol and the columns were dried for 1 minute. The PCR product was

eluted by adding 30 µl warm water, which was left on the column for 1 minute, and the eluate was collected by

centrifugation. The eluates were stored at -20°C.

h. Illumina paired-end sequencing:

All of the independent PCR reactions were mixed as shown in Table S9 and the

sequences were determined using three different lanes on an Illumina HiSeq2000.

Bioinformatics analysis:

All of the sequence analysis was done using scripts written in Perl or Python. All RefSeq

databases were downloaded from the UCSC website. We used BLAT from the UCSC site

for aligning the Illumina reads to the human genome (hg19).

a. Alignment of the Illumina reads:

Read 1 contained sequence information for the barcode, the LTR, and the adjacent human

genome sequence, whereas read 2 contained the sequence of the linker and the random

nucleotides that were used to measure independent integration events at individual sites

(to be reported in a separate publication). We selected and retained only those reads

which contained correct sequences for the full barcode and the full LTR in read 1, and the

full linker on read 2. After trimming accessory sequences (barcode and LTR sequences)

from read 1, we obtained 69,286,538 reads that were longer than 14 nucleotides. After

removing duplicates from the trimmed sequences, we had 20,285,651 unique reads.

These reads were aligned to the human genome by BLAT, giving a total of 43,685,085

matches. Only those reads that aligned from the first nucleotide of the genomic sequence

and had e-values less than 10^-5 were selected (total 6,170,903). After removing the multi-matched

reads (those that matched more than one location in the genome), a total of 961,274

unique integration sites were obtained. Multi-matched reads were removed on the basis

of bit score. The top strand coordinate of the match was used as the coordinate for each

insertion. When the matched sequences were from the negative strand, the coordinate of

the insertion was defined as the value obtained by subtracting 4 from the top

strand coordinate of the matched sequence.
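
The filtering and coordinate rules in this subsection can be sketched as follows. This is an illustration only, not the authors' scripts: the `Hit` record and its field names are hypothetical, `coord` stands for whatever top-strand coordinate of the match the pipeline reports, and the tie rule (drop a read whose two best hits share the same bit score) is an assumed reading of "removed on the basis of bit score".

```python
from collections import defaultdict, namedtuple

# Hypothetical record for one BLAT hit; field names are illustrative.
Hit = namedtuple("Hit", "read chrom strand q_start coord evalue bits")

def integration_sites(hits, max_evalue=1e-5):
    by_read = defaultdict(list)
    for h in hits:
        # Keep hits that start at the first genomic nucleotide and pass the e-value cut.
        if h.q_start == 1 and h.evalue < max_evalue:
            by_read[h.read].append(h)
    sites = set()
    for read_hits in by_read.values():
        read_hits.sort(key=lambda h: h.bits, reverse=True)
        # Assumed rule: a read whose best bit score is tied is multi-matched and dropped.
        if len(read_hits) > 1 and read_hits[0].bits == read_hits[1].bits:
            continue
        best = read_hits[0]
        # Negative-strand matches are shifted by 4, as described in the text.
        coord = best.coord - 4 if best.strand == "-" else best.coord
        sites.add((best.chrom, coord, best.strand))
    return sorted(sites)

# Toy hits, invented for illustration.
hits = [
    Hit("r1", "chr1", "+", 1, 1000, 1e-10, 60.0),
    Hit("r2", "chr2", "-", 1, 5004, 1e-8, 55.0),
    Hit("r3", "chr3", "+", 1, 200, 1e-9, 50.0),
    Hit("r3", "chr7", "+", 1, 900, 1e-9, 50.0),   # tied best score
    Hit("r4", "chr4", "+", 3, 300, 1e-12, 70.0),  # does not start at nucleotide 1
    Hit("r5", "chr5", "+", 1, 400, 1e-3, 20.0),   # fails the e-value cut
]
sites = integration_sites(hits)
print(sites)
```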

Customized set of transcription units:

Since HIV-1 integrates into the transcribed sequences of genes and not the promoters, we

analyzed integration sites relative to a customized set of transcription units. The RefSeq

data set of the human genome (hg19) containing coordinates that define 44,525

transcripts was downloaded from the UCSC site. Some of these transcripts either overlap

or are totally internal to others. Therefore, in the process we describe below, we resolved

overlapping transcripts and made a customized set of transcription units for analysis of

gene ontology and cancer genes. Similarly, we used RefSeq transcripts of mm9 to map

HIV-1 integration sites in mouse embryonic fibroblast cells.

To define a custom set of transcription units we used coordinates that provided a

unique set of transcripts. RefSeq transcripts were sorted on the basis of chromosome,

start site and end site, and if transcripts had the same start and stop coordinates one was

retained for the transcription unit set. Adjacent transcripts were compared two at a time.

Internal transcripts were not considered part of the set. If transcripts had the same start

site but one had an internal termination site, or if transcripts had the same termination

site but one had an internal promoter, we excluded the internal transcripts from the set. If

two consecutive transcripts overlapped, we excluded the shorter transcript from the set.

This produced a set of 21,188 unique transcripts and their coordinates were used for the

transcription units to analyze the distribution of HIV-1 integration sites within

transcription units, calculate the fraction of total integration in transcription units ranked

by integration density, perform gene ontology, measure the relative frequency of

integration in cancer genes, and calculate multivariate models for integration density.
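
The overlap rules in this paragraph all amount to keeping the longer of two adjacent transcripts that touch. A minimal sketch under stated assumptions: transcripts are plain `(chrom, start, end)` tuples, strand is ignored, a single left-to-right pass is used, and a length tie keeps the earlier transcript. None of these details is specified in the text, and `resolve_overlaps` is a hypothetical name.

```python
def resolve_overlaps(transcripts):
    """Return a non-overlapping set of (chrom, start, end) transcription units."""
    units = []
    # set() removes transcripts with identical coordinates; sorting orders by
    # chromosome, then start, then end, so neighbours can be compared in pairs.
    for chrom, start, end in sorted(set(transcripts)):
        if units and units[-1][0] == chrom and start <= units[-1][2]:
            prev = units[-1]
            # Internal transcripts and shorter partial overlaps are both dropped:
            # only a strictly longer transcript replaces its predecessor.
            if (end - start) > (prev[2] - prev[1]):
                units[-1] = (chrom, start, end)
            continue
        units.append((chrom, start, end))
    return units

# Toy coordinates, invented for illustration.
transcripts = [
    ("chr1", 100, 500), ("chr1", 100, 500),  # identical pair
    ("chr1", 100, 300),                      # same start, internal end
    ("chr1", 200, 400),                      # fully internal
    ("chr1", 450, 1200),                     # partial overlap, longer
    ("chr1", 2000, 2500),
    ("chr2", 100, 200),
]
units = resolve_overlaps(transcripts)
print(units)
```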

To determine the integration levels in transcription units with specific numbers of

introns, we included the transcription units with multiple spliced isoforms in multiple

intron groups based on the numbers of introns of each of its spliced isoforms. Thus, a

transcription unit expressing two spliced isoforms with different intron numbers would be

included in two intron number groups using the coordinates of the transcripts to define

the transcription units. To accomplish this we sorted all RefSeq transcripts into individual

groups based on the number of introns, so that, within a group, all transcripts had the

same number of introns. The coordinates of the transcripts within a group were used as

transcription units and were customized to resolve overlap using the same rules described

above. Adjacent transcripts with the same number of introns were compared. Internal

transcripts were excluded from the dataset and, when transcripts overlapped, the shorter

transcript was excluded. This procedure produced groups of transcription units based on

the number of introns present in the transcripts (the number in each group is shown in

Suppl. Table S6). The same procedure was used to determine integration levels in mouse

transcription units with specific numbers of introns.

b. Mapping the integration sites with respect to the customized set of transcription units:

We used the customized set of transcription units described above to map the integration

sites. Each transcription unit was divided into 15 equal segments (red bars) and the

integration sites were counted for each segment of the transcription units. For the region

that is either upstream or downstream of transcription units, integration sites were

distributed into bins of 500 bp and counted. Integration sites were mapped upstream of

transcription units if they were nearer the 5’ end of a transcription unit than the 3’ end of

another transcription unit. Conversely, integration sites were mapped downstream of

transcription units if they were closer to the 3’ end than the 5’ end of another

transcription unit.
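
The binning described above can be illustrated with two small helpers. This is a sketch under assumptions the text does not state: coordinates are half-open intervals, segments are numbered from the 5' end (so the order is reversed for minus-strand units), and flanking distance is counted in base pairs from the edge of the unit. The function names are hypothetical.

```python
def segment_index(site, start, end, strand="+", n_segments=15):
    """0-based segment (5' to 3') of a site inside the unit [start, end)."""
    if not start <= site < end:
        return None
    # Integer arithmetic avoids floating-point surprises at segment boundaries.
    idx = (site - start) * n_segments // (end - start)
    return idx if strand == "+" else n_segments - 1 - idx

def flank_bin(distance_bp, bin_size=500):
    """0-based bin for a site lying distance_bp (>= 1) outside a unit."""
    return (distance_bp - 1) // bin_size

# A 15 kb unit gives 1 kb segments.
print(segment_index(8500, 1000, 16000))       # middle of the unit
print(segment_index(1000, 1000, 16000, "-"))  # 3' end of a minus-strand unit
print(flank_bin(501))                         # second 500 bp bin
```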

For determining the distribution of HIV-1 integration sites within intronless

transcription units, we used all intronless transcription units within our custom set and

performed a similar analysis, with the integration sites distributed into 15 equal bins.

c. Correlation of the intron number with integration site density (inserts per kb):

The integration site density for each transcription unit was calculated using the non-

overlapping groups of transcription units that were assembled based on intron number

(described above). For each transcription unit in a group, the total number of integration

sites was determined and the number was divided by the length of the transcription unit

to determine the integration density (integration sites per kb) for that transcription unit.

For each group, the average integration density was determined by adding all the

integration densities and dividing by the number of transcripts in the group.
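
The density calculation can be sketched as below, assuming for simplicity a single chromosome, a pre-sorted list of integration coordinates, and inclusive unit boundaries. The function names and the toy numbers are illustrative, not taken from the data.

```python
from bisect import bisect_left, bisect_right

def density_per_kb(sorted_sites, start, end):
    """Integration sites per kb within [start, end] for one transcription unit."""
    count = bisect_right(sorted_sites, end) - bisect_left(sorted_sites, start)
    return count / ((end - start) / 1000)

def mean_density(units, sorted_sites):
    """Average of the per-unit densities for one group of transcription units."""
    densities = [density_per_kb(sorted_sites, s, e) for s, e in units]
    return sum(densities) / len(densities)

# Toy example: two 2 kb units with three and one integration sites.
sites_sorted = [100, 1500, 1800, 2500, 9000]
group = [(1000, 3000), (8000, 10000)]
print(mean_density(group, sites_sorted))
```

Note that this averages the per-unit densities, as the text describes, rather than dividing the pooled site count by the pooled length; the two can differ when unit lengths vary widely.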

d. Correlation of alternatively spliced transcription units with the average integration

density:

RNA-seq analyses were conducted for three independent experiments and the cross-

correlations were very high between the datasets, demonstrating that the RNAseq data

were highly reproducible (Suppl. Fig. S11). The reads were interpreted using Cufflinks,

which defines the spliced species of each transcription unit and the expression of each

transcript as fragments per kb of exons per million (FPKM). The transcription units were

sorted into groups based on their number of spliced products as detected by Cufflinks.

Then within each group, overlapping transcription units were resolved by comparing

adjacent transcription units two at a time. Transcription units that are contained entirely

within other transcription units were excluded. If transcription units had the same start

site but one had internal termination or if transcription units had the same termination site

but one had an internal promoter, we excluded the internal transcription unit. If the two

adjacent transcription units had some overlap we excluded the shorter one. This produced

groups of non-overlapping transcription units that were based on the number of

alternative transcripts. The number of transcription units in each group is listed in Suppl.

Table S5. The integration density (inserts per kb) for each transcription unit was

calculated by dividing the total number of integration sites per transcription unit by the

length of the transcription unit. The average integration density of the transcription units in

a group was calculated by dividing the sum of the integration densities in that group by the

number of transcription units in that group.

d. Analysis of HIV-1 integration in genes with 1, 2, and 3 introns compared to genes with

matched sizes that have 10 introns:

Using the groups of transcription units with specific numbers of introns we made

two sets: one group of transcription units contained all the transcription units that had 10

introns and the other group contained all of the transcription units with 1, 2, or 3 introns.

We sorted the transcription units on the basis of length. For each of the transcription units

in the 10-intron group, we found a transcription unit in the 1, 2, or 3 intron group whose

length differed by less than 500 bp. We selected 673 transcription units that contained 10

introns (group A) and size-matched them with 673 transcription units having 1, 2 or 3 introns

(group B). The total size of the transcription units in both groups was similar (43 kb for

10 introns and 42.6 kb for genes with 1-3 introns). For both these groups, we determined

the distribution of HIV integration sites in the transcription units by dividing the

transcription units into 15 equal segments and counting integrations in each bin. For the

region that is either upstream or downstream of transcription units, integration sites were

sorted into bins of 500 bp. We also calculated the average integration site density for each

group by using the actual HIV-1 integration sites and the control MRC sites.
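
One way to perform the size matching is a greedy search over sorted lengths, sketched below. Whether the original analysis matched without replacement, or how it chose among several candidates, is not stated; this sketch assumes each unit in the 1 to 3 intron group is used at most once and that the closest available length is taken. `size_match` is a hypothetical name.

```python
from bisect import bisect_left

def size_match(lengths_a, lengths_b, tolerance=500):
    """Pair each length in A with an unused length in B differing by < tolerance."""
    pool = sorted(lengths_b)
    pairs = []
    for a in sorted(lengths_a):
        i = bisect_left(pool, a)
        # Candidates: the nearest unused lengths on either side of a.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(pool)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(pool[k] - a))
        if abs(pool[j] - a) < tolerance:
            pairs.append((a, pool.pop(j)))
    return pairs

# Toy lengths in bp, invented for illustration.
ten_intron = [10000, 20000, 50000]
few_intron = [9800, 10300, 20600, 49900, 3000]
pairs = size_match(ten_intron, few_intron)
print(pairs)
```

In the toy data the 20,000 bp unit stays unmatched because its nearest partner differs by 600 bp.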

e. HIV analysis from published data in mouse fibroblast cells:

Published data (1, 2) were used for these analyses. We used RefSeq of the mm9

compilation of the sequence of the mouse genome from UCSC. We mapped the

integration sites in the mouse genome using the same scripts that were used to analyze

the integration sites in HEK293T cells.

f. Gene ontology and cancer gene analysis:

We used our customized set of transcription units to determine which transcription units

contained the highest number of HIV-1 integration sites. We sorted the transcription units

on the basis of total sites and separately on the basis of inserts per kb. The top 1,000

transcription units were analyzed for gene ontology using DAVID (3, 4). The gene

ontology terms and their associated P values are given in Table 1 and Suppl. Table S3.

Three different cancer gene datasets (5-7) were used to determine the prevalence of

cancer genes in the top 1,000 targeted transcription units using Perl scripts.

h. Matched Random Control:

We made a list of all of the Mse I recognition sites on each chromosome in the human

genome. Next, we made a similar list that contained the positions of all of the Mse I and

Bgl II positions on each chromosome in the human genome. For each HIV integration

site, we determined the distance from the nearest Mse I site. We selected a random Mse I site

from the whole genome. We then determined the coordinate of the MRC site using the

distance of a true HIV-1 integration site from the nearest Mse I site. We randomly chose

whether the MRC site would be upstream or downstream of the MseI site. We generated

a total of 961,176 MRC sites.
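
The matched random control procedure can be sketched as follows. This is an illustration under assumptions: restriction sites are held as sorted position lists per chromosome, only the Mse I list is used (the text does not say how the combined Mse I and Bgl II list enters the calculation), and no check is made that a control falls within chromosome bounds. All names are hypothetical.

```python
import random
from bisect import bisect_left

def nearest_distance(sorted_sites, pos):
    """Distance from pos to the closest entry of a non-empty sorted list."""
    i = bisect_left(sorted_sites, pos)
    neighbours = sorted_sites[max(i - 1, 0):i + 1]
    return min(abs(pos - s) for s in neighbours)

def matched_random_controls(integrations, mse_sites, rng):
    """integrations: [(chrom, pos)]; mse_sites: {chrom: sorted positions}."""
    all_mse = [(c, p) for c, ps in mse_sites.items() for p in ps]
    controls = []
    for chrom, pos in integrations:
        d = nearest_distance(mse_sites[chrom], pos)
        rand_chrom, rand_pos = rng.choice(all_mse)  # random Mse I site, genome-wide
        direction = rng.choice((-1, 1))             # upstream or downstream
        controls.append((rand_chrom, rand_pos + direction * d))
    return controls

# Toy data, invented for illustration.
mse = {"chr1": [100, 500, 900], "chr2": [50, 400]}
integrations = [("chr1", 530), ("chr2", 10)]
controls = matched_random_controls(integrations, mse, random.Random(0))
print(controls)
```

Each control therefore sits at the same distance from some Mse I site as its real integration site does from its nearest Mse I site, which is what makes the control "matched".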

i. Mass spectrometry-based proteomics:

To identify the cellular binding partners of LEDGF/p75 and LEDGF/p52, GST, GST-

LEDGF/p75, and GST-LEDGF/p52 were pre-bound to glutathione resin in 200 mM

NaCl, 25 mM Tris (pH 8.0), 0.1% Nonidet P-40, 2 mM β-mercaptoethanol, and 1x

complete protease mixture. Nuclear extracts were prepared from HEK 293T cells using

the NE-PER Nuclear and Cytoplasmic Kit. The nuclear extracts were then allowed to

bind to the resin with the associated LEDGF and control proteins and were washed twice.

The proteins that were bound to the resin were separated by SDS/PAGE and each lane

was subjected to in-gel trypsin digestion, followed by analysis of the peptide fragments

with capillary-liquid chromatography–tandem mass spectrometry (MS/MS) using a

Thermo Finnigan LTQ Orbitrap mass spectrometer equipped with a microspray source.

Splicing factors were identified by cross-referencing the data against the spliceosome

database at http://spliceosomedb.ucsc.edu/ and are reported in Table 2 with total spectral

counts. The complete list of GST-LEDGF/p52 and GST-LEDGF/p75 binding partners is

shown in Suppl. Table S10.

The interactions between GST, GST-LEDGF/p75, and GST-LEDGF/p52 and

select cellular proteins were also analyzed by affinity pull-down assays, and the

interacting proteins were detected by western blotting. GST, GST-LEDGF/p75, and

GST-LEDGF/p52 were prebound to glutathione resin in 200 mM NaCl, 25 mM Tris (pH

8.0), 0.1% Nonidet P-40, 2 mM β-mercaptoethanol, and 1x complete protease mixture.

Nuclear extract was prepared from HEK 293T cells using the NE-PER Nuclear and

Cytoplasmic Kit. The nuclear extracts were then allowed to bind to the resin in 200 mM

NaCl, 25 mM Tris (pH 8.0), 0.1% Nonidet P-40, 2 mM β-mercaptoethanol, and 1x

complete protease mixture and were washed twice with the same buffer. The proteins

bound to the resin were separated by SDS/PAGE and subjected to western blotting with the

following antibodies: anti-ASF/SF2, anti-hnRNP M, and anti-SF3B2 (Abcam ab38017,

Novus Biologicals NB200-315, and Abcam ab56800).

j. Pairwise and multivariate regression models of integration in transcription units:

To evaluate which features of transcription units best predict integration levels we

developed regression models using factors previously shown to predict integration

positions genome-wide (8-11). For each transcription unit in our custom set of 21,188, we

tabulated values for intron density, histone H3K4 trimethylation (ENCODE data for

HEK293T from Washington University GEO sample accession: GSM945288 using raw

signal), transcription level (FPKM from our RNAseq data for HEK293T), DNase I

cleavage sites (ENCODE data for HEK293T peak values from Duke GEO sample

accession: GSM1008573), and percent GC base pairs (Suppl. Table S. The transcription

units were grouped into sets of 100 based on integration density. The values of each

factor were averaged for each group. Natural log values for FPKM, DNase I sites, and

integration density were used because this resulted in higher correlations.
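
The grouping and pairwise correlation steps can be sketched in plain Python. Two details are assumptions, since the text does not specify them: the logarithm is taken of each group mean (rather than averaging per-unit logs), and the last group may hold fewer than 100 units. The column names and the toy rows are invented for illustration.

```python
import math

def grouped_means(rows, key, group_size=100):
    """Sort rows by rows[key], split into consecutive groups, average every column."""
    ordered = sorted(rows, key=lambda r: r[key])
    groups = [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]
    return [{k: sum(r[k] for r in g) / len(g) for k in g[0]} for g in groups]

def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy data only: six hypothetical transcription units, grouped in pairs.
rows = [{"density": float(d), "intron_density": 0.5 * math.log(d)} for d in range(1, 7)]
groups = grouped_means(rows, key="density", group_size=2)
r = pearson_r(
    [g["intron_density"] for g in groups],
    [math.log(g["density"]) for g in groups],  # assumed: log of the group mean
)
print(len(groups), round(r, 3))
```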

An examination of each factor against log(integration density) showed virtually no

correlation of the integration density with the GC-content. A weak, negative correlation

was observed with log(DNase I), and strong correlations were obtained for log(FPKM) and

histone H3K4 trimethylation. The strongest correlation was with intron density. This means

that there is a strong linear relationship between intron density and log(integration

density), and intron density is therefore the best single predictor of integration density (smallest

root-mean-squared error in the prediction).

Pairwise linear regression with log(integration density)

Factor                 Pearson's coefficient r
%GC                    0.096
histone H3K4me3        0.886
log(FPKM)              0.808
log(DNase I)           -0.387
intron density         0.903

When all five factors are included in multivariate analysis, the GC-content is ignored.

This is expected due to its small correlation with log(integration density). The

importance of the remaining four factors (as measured by the probability that the effect of

this factor is due to chance) reflects the same ordering as found in the individual

correlations with log(integration density). The intron density is the most important

factor, followed by H3K4me3, log(FPKM), and then log(DNase I). The importance of

this last factor is not significant (p = 0.137), which is not unexpected due to its weak

correlation with log(integration density).

Multivariate regression with all five factors

Factor                 P value
Intercept              6.52e-7
H3K4me3                7.89e-6
log(FPKM)              0.00313
log(DNase I)           0.137
intron density         <2e-16

The adjusted r² value and the residual standard error for the multivariate fit when all five

factors are used and when each of the factors is individually removed are given in the table

below. The multivariate fit with all five factors is the same as the fit when the GC-

content is removed, since it did not contribute to the fit when the other four factors were

present. Similarly, there was very little change in the r² when log(DNase I) was

removed, as expected from its low correlation with log(integration density). In this fit,

the GC-content was also found to be uninformative, so the actual fit only used H3K4me3,

log(FPKM), and intron density. When log(FPKM) was removed from the multivariate

fit, log(DNase I) and GC-content were both found to be uninformative. Therefore, a

linear model using just intron density and H3K4me3 produced an r² of 0.875. This is

somewhat expected since these two factors had the highest individual correlations with

log(integration density). Finally, removing intron density had, by far, the largest effect

on the multivariate fit, decreasing the r² to 0.812. All of these results are consistent with

the conclusion that intron density is the strongest predictor of log(integration density) of

the five factors considered, followed by H3K4me3.

Multivariate fit with each factor individually removed

Factors                Residual standard error    Adjusted r²
All five               0.450                      0.883
No intron density      0.570                      0.812
No H3K4me3             0.480                      0.866
No log(FPKM)           0.463                      0.875
No log(DNase I)        0.452                      0.881
No GC-content          0.450                      0.883
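
For reference, the two summary statistics reported for the multivariate fits are conventionally defined as follows for a linear fit with n observations (here, groups of transcription units), p predictors, coefficient of determination r², and residual sum of squares RSS. These are the standard textbook definitions and are given as an assumption about how the reported values were computed; the text does not name the fitting software or state n.

```latex
\[
  \bar{r}^{2} = 1 - \left(1 - r^{2}\right)\frac{n - 1}{n - p - 1},
  \qquad
  \text{residual standard error} = \sqrt{\frac{\mathrm{RSS}}{n - p - 1}}
\]
```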

References

1. Koh Y, Wu X, Ferris AL, Matreyek KA, Smith SJ, Lee K, KewalRamani VN,

Hughes SH, Engelman A. 2013. Differential effects of human immunodeficiency

virus type 1 capsid and cellular factors nucleoporin 153 and LEDGF/p75 on the

efficiency and specificity of viral DNA integration. J Virol 87:648-658.

2. Wang H, Jurado KA, Wu X, Shun MC, Li X, Ferris AL, Smith SJ, Patel PA,

Fuchs JR, Cherepanov P, Kvaratskhelia M, Hughes SH, Engelman A. 2012.

HRP2 determines the efficiency and specificity of HIV-1 integration in

LEDGF/p75 knockout cells but does not contribute to the antiviral activity of a

potent LEDGF/p75-binding site integrase inhibitor. Nucleic Acids Res 40:11518-

11530.

3. Huang da W, Sherman BT, Lempicki RA. 2009. Bioinformatics enrichment

tools: paths toward the comprehensive functional analysis of large gene lists.

Nucleic Acids Res 37:1-13.

4. Huang da W, Sherman BT, Lempicki RA. 2009. Systematic and integrative

analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc

4:44-57.

5. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Jr., Kinzler

KW. 2013. Cancer genome landscapes. Science 339:1546-1558.

6. Kandoth C, McLellan MD, Vandin F, Ye K, Niu B, Lu C, Xie M, Zhang Q,

McMichael JF, Wyczalkowski MA, Leiserson MD, Miller CA, Welch JS,

Walter MJ, Wendl MC, Ley TJ, Wilson RK, Raphael BJ, Ding L. 2013.

Mutational landscape and significance across 12 major cancer types. Nature

502:333-339.

7. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman

N, Stratton MR. 2004. A census of human cancer genes. Nat Rev Cancer 4:177-

183.

8. Craigie R, Bushman FD. 2012. HIV DNA integration. Cold Spring Harb

Perspect Med 2:a006890.

9. Berry C, Hannenhalli S, Leipzig J, Bushman FD. 2006. Selection of target sites

for mobile DNA integration in the human genome. PLoS Comput Biol 2:e157.

10. Wang GP, Ciuffi A, Leipzig J, Berry CC, Bushman FD. 2007. HIV integration

site selection: analysis by massively parallel pyrosequencing reveals association

with epigenetic modifications. Genome Res 17:1186-1194.

11. Schroder AR, Shinn P, Chen H, Berry C, Ecker JR, Bushman F. 2002. HIV-1

integration in the human genome favors active genes and local hotspots. Cell

110:521-529.
