A-Walk-on-the-W-Side

Taking a walk on the W-side: Comparing Epitopes on HIV-1 with the W-curve & TSP.

Douglas J. Cork (1,2,4), Steven Lembark (3), Bruce K. Brown (1,4), Victoria R. Polonis (1,4), Jerome Kim (1,4), Nelson L. Michael (5)

(1) US Military HIV Research Program (MHRP)/Henry Jackson Foundation (HJF), Rockville, MD; (2) Illinois Institute of Technology, Chicago, IL; (3) Workhorse Computing, Woodhaven, NY; (4) Walter Reed Army Institute for Research, Rockville, MD; (5) Walter Reed Army Institute for Research, Washington, DC

Statistically, HIV-1 is a problem.

● One of the major problems in studying HIV-1 is the apparent randomness of clinical response.

● Tests using clades based on genome sequences show no correlation with immune response.

● Part of the answer may be clades based on smaller, clinically-specific sequences.

● HIV-1 mutates 10,000 times faster than people.

● Existing clades end up including too much white noise to correlate well with anything.

The Structure of HIV-1

● gp120 is the primary focus for immune studies.

● gp120 and gp41 make up the envelope protein, gp160.

Standard Clades vs. Neutralization Data

● Standard clades of HIV-1 are based on phylogenetic trees of the genome.

● They do not correlate well with neutralization data.

● Between-clade and within-clade results have similar variability.

● Antibody and cell studies have low correlation for within-clade results.

● Lack of a correlation prevents developing any broadly neutralizing treatments.

● Today we have to sequence the virus to treat it.

Example: Cross-clade neutralization shows no useful pattern in Peripheral Blood Mononuclear Cell or Pseudovirus Assay studies.

● Bubble plot.

● No real relationship.

Neutralization Heat Map

● Distribution of response to antibody pools lacks any correlation with the standard clades.

HIV-1 Genetics Complicate Analysis

● Genes and proteins are normally reported with respect to a single strain, HXB2.

● Hard to compare local features between strains.

● Need to rediscover them for each study.

● Neutralization data are specific to gp120.

● Variable regions in gp120 leave corresponding locations in different samples off by tens of bases.

● Antibody binding sites (epitopes) are only a few bases long, with a majority in the variable regions.

Another approach: W-curves

● The W-curve is based on chaos and game theory.

● It abstracts a sequence of DNA into a three-dimensional structure.

● Originally designed for visualization, we have now adapted it for machine comparison.

● Geometric analysis of the curves allows for piecewise comparison of the sequences.

The W-curve

● Start with a square at the origin and a discrete Z-axis matching the sequence base numbers.

● Each point moves halfway towards the corner for the next base.

● All curves start at (0,0,0).

● The curve (blue) moves halfway towards “C” then “G” (red lines).
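
The construction above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation. The slides do not say which base sits on which corner, so the `CORNERS` layout, the choice of a square centered on the origin with corners at ±1, and the function name `w_curve` are all assumptions made for illustration.

```python
# Hypothetical corner layout: the slides do not state which base owns which corner.
CORNERS = {"A": (1.0, 1.0), "C": (-1.0, 1.0), "G": (-1.0, -1.0), "T": (1.0, -1.0)}


def w_curve(sequence, corners=CORNERS):
    """Return the W-curve of a DNA string as a list of (x, y, z) points."""
    x, y = 0.0, 0.0
    points = [(x, y, 0)]  # every curve starts at (0, 0, 0)
    for z, base in enumerate(sequence.upper(), start=1):
        cx, cy = corners[base]
        # Move halfway from the current point towards the corner of this base.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y, z))  # z is the 1-based base number
    return points


for point in w_curve("CG"):
    print(point)
```

With this assumed layout, the sequence "CG" moves from (0, 0, 0) to (-0.5, 0.5, 1) and then to (-0.75, -0.25, 2), which mirrors the "halfway towards C, then G" walk described on the slide.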

Autoregression

● Curves converge by base 7 after a SNP at base 3.

● Convergence is quick even after large indels.
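
The convergence follows from the halving rule: once two sequences agree again, each further base cuts the remaining gap between their curves in half. The sketch below illustrates this under assumptions: the corner layout `CORNERS`, the helper names, and the made-up 12-base sequence are hypothetical, and the slides do not state what tolerance counts as "converged".

```python
import math

# Hypothetical corner layout; not taken from the slides.
CORNERS = {"A": (1.0, 1.0), "C": (-1.0, 1.0), "G": (-1.0, -1.0), "T": (1.0, -1.0)}


def w_curve_xy(sequence):
    """Return the (x, y) point reached after each base of the sequence."""
    x, y = 0.0, 0.0
    points = []
    for base in sequence:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def per_base_gap(seq_a, seq_b):
    """Euclidean distance between two curves at each base position."""
    return [
        math.hypot(ax - bx, ay - by)
        for (ax, ay), (bx, by) in zip(w_curve_xy(seq_a), w_curve_xy(seq_b))
    ]


reference = "ACGTACGTACGT"                    # made-up sequence
variant = reference[:2] + "T" + reference[3:]  # SNP at base 3: G -> T
gaps = per_base_gap(reference, variant)

for base_number, gap in enumerate(gaps, start=1):
    print(base_number, gap)
```

The gap is zero before the SNP, jumps at base 3, and then halves at every following base, so four bases later (base 7) it is one sixteenth of its initial size.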

Handling Gaps

● Curves converge as they do after SNPs, but with a phase shift.
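
A small sketch of the phase shift, under the same kind of assumptions as any toy W-curve (hypothetical corner layout, made-up sequence, illustrative helper names): after a one-base deletion, point i of the sample has to be compared with point i + 1 of the reference. Compared that way, the gap again halves with each base.

```python
import math

# Hypothetical corner layout; not taken from the slides.
CORNERS = {"A": (1.0, 1.0), "C": (-1.0, 1.0), "G": (-1.0, -1.0), "T": (1.0, -1.0)}


def w_curve_xy(sequence):
    """Return the (x, y) point reached after each base of the sequence."""
    x, y = 0.0, 0.0
    points = []
    for base in sequence:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def shifted_gaps(reference, sample, start, shift):
    """Distance between sample point i and reference point i + shift, for i >= start."""
    ref = w_curve_xy(reference)
    smp = w_curve_xy(sample)
    return [
        math.hypot(smp[i][0] - ref[i + shift][0], smp[i][1] - ref[i + shift][1])
        for i in range(start, len(smp))
        if i + shift < len(ref)
    ]


reference = "ACGTACGTACGT"                 # made-up sequence
deleted = reference[:2] + reference[3:]    # base 3 removed
gaps = shifted_gaps(reference, deleted, start=2, shift=1)

for gap in gaps:
    print(gap)
```

Comparing the same two curves without the shift would pair up different bases at every position after the deletion, so the shift is what lets the curves be seen to converge.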

Scoring Curves

● Approximating the distance smooths over SNPs.

● Smaller angles reduce the difference; larger angles add to it.

Needle in a Haystack: CD4 Epitope

● The CD4 epitopes occupy only a few, widely dispersed locations on gp120.

● Locating portions of the discontinuous epitope is difficult.

● Variable regions between them change the locations between samples.

● Portions of the epitope within the variable region can be hidden by nearby changes.

Analyzing the 3D Structure

● The advantage to W-curves is that even small features of the gene generate unique geometry.

● Features are easier to identify in 3D than in the 1D CATG-strings.

● By first locating large-scale features, we can search for smaller ones more easily.

● First align extreme points on the curves.

● Then compare regions between them.

● With a library of fragments, we pick the best match.

W-curve Algorithm & Serial Comparison

● Large-scale features guide the search for smaller pieces.

● Conserved regions anchor search.

● After aligning 'peaks' in the curves, we align smaller and less discriminating features.

● A library of W-curve fragments finds best fit with multiple samples.

● Repeatable process allows examining and scoring large numbers of finer features.
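
As a rough illustration of matching a library fragment against a sample, the sketch below slides a fragment's curve along a target curve and keeps the offset with the smallest mean distance. It is a plain sliding-window search swapped in for illustration, not the peak-anchored procedure described above. The corner layout, the `burn_in` parameter (points skipped while the fragment's curve converges onto the target's), the function names, and the sequence are all hypothetical.

```python
import math

# Hypothetical corner layout; not taken from the slides.
CORNERS = {"A": (1.0, 1.0), "C": (-1.0, 1.0), "G": (-1.0, -1.0), "T": (1.0, -1.0)}


def w_curve_xy(sequence):
    """Return the (x, y) point reached after each base of the sequence."""
    x, y = 0.0, 0.0
    points = []
    for base in sequence:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def best_offset(fragment, target, burn_in=8):
    """Return (offset, score) of the best placement of fragment on target."""
    if len(fragment) <= burn_in or len(fragment) > len(target):
        raise ValueError("fragment must be longer than burn_in and fit in target")
    frag = w_curve_xy(fragment)
    tgt = w_curve_xy(target)
    best_off, best_score = None, math.inf
    for off in range(len(target) - len(fragment) + 1):
        # Skip the first burn_in points: the fragment curve starts at the
        # origin and needs a few bases to converge onto the target curve.
        dists = [
            math.hypot(frag[k][0] - tgt[off + k][0], frag[k][1] - tgt[off + k][1])
            for k in range(burn_in, len(frag))
        ]
        score = sum(dists) / len(dists)
        if score < best_score:
            best_off, best_score = off, score
    return best_off, best_score


target = "ACGTTGCAAGCTTAGGCATCCGATAGCTAACGGTCTTACGGAATCCGT"  # made-up sequence
fragment = target[15:35]
offset, score = best_offset(fragment, target)
print(offset, score)
```

Because the curves converge within a handful of bases, the fragment can be scored against any window of the target without knowing what precedes that window, which is what makes a reusable fragment library plausible.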

W-curves of HXB2 genome and gp120

● The curve for HXB2 illustrates the most important features of W-curves.

● Looking at each section of the W-curve you'll notice that each area is different from the others.

● This is what allows us to locate small features: it is easier to discern them in 3D than in a character string.

● This figure also highlights the location of gp120.

A detailed view of gp120

● The next slide shows the first portion of HXB2's env gene: gp120.

● Again, notice that each portion of the curve is distinct from the others.

● The different conserved (C) and variable (V) regions are marked across the bottom of the image.

The CD4 epitope in gp120

● This is where the W-curve really becomes useful: isolating the epitope locations within gp120.

● The highlighted areas show the epitope locations with an additional 3 bases of conformational region before and after (which combines a few of the regions).

● Note that the epitope is dispersed and lives largely in the variable regions.

Clustering With the TSP

● Solutions to the Traveling Salesman Problem can be used to cluster genes.

● The shortest path clusters more-similar sequences.

● The difficulty is in getting clades out of the TSP.

● One approach uses dummy cities with small distances to all other cities.

● Dummies end up in the inter-cluster regions.

● This approach has proven fast & repeatable.

Tour 0 defines the colors for the others.

Clades start to break down in gp41

C5 needs more groups.

Clades break down completely in V4

Further Work on Clusters

● Detection.

● Find algorithm for repeatably assigning the number of dummy cities.

● Comparison.

● Automate detecting “similar” clusters.

● Time-series analysis.

● Watch sample groups for new members.

● Track evolution of drug resistance in clinical trial groups, individual patients.

Ongoing Research

● Our goal is to correlate neutralization outcomes.

● Compare small regions near the epitopes.

● Find DNA that clusters similarly to neutralization data.

● DNA clusters that match the neutralization data are “clinical” clades.

● Biggest issue will be deciding what “similar” is.

● Probably a good application for Fuzzy Logic.

Acknowledgments

● Thanks to the authors of the Brown et al. study.

All of the work we've shown you was done on a computer. Without fieldwork and wet labs, it would be empty. Next time you sit down to crunch some numbers, stop and picture for a moment the process of acquiring them. You'll get a whole new appreciation for your work.
