11. Lecture WS 2006/07

Bioinformatics III

V11: Max-Flow Min-Cut

V11 continues chapter 12 in Gross & Yellen "Graph Theory"

Theorem 12.2.3 [Characterization of Maximum Flow]

Let f be a flow in a network N.

Then f is a maximum flow in network N if and only if there does not exist an

f-augmenting path in N.

Proof: Necessity (⇒) Suppose that f is a maximum flow in network N.

If an f-augmenting path existed, then by Proposition 12.2.1 we could construct a flow f' with val(f') > val(f), contradicting the maximality of f.

Hence there is no f-augmenting path.

Proposition 12.2.1 (Flow Augmentation) Let f be a flow in a network N, and let Q be an f-augmenting path with minimum slack Δ_Q on its arcs. Then the augmented flow f' given by

$$
f'(e) =
\begin{cases}
f(e) + \Delta_Q, & \text{if } e \text{ is a forward arc of } Q \\
f(e) - \Delta_Q, & \text{if } e \text{ is a backward arc of } Q \\
f(e), & \text{otherwise}
\end{cases}
$$

is a feasible flow in network N and val(f') = val(f) + Δ_Q.
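The augmentation step of Proposition 12.2.1 can be sketched in a few lines of Python. This is only an illustration: the function name `augment`, the dictionary representation of flows and capacities, and the `"forward"`/`"backward"` labels are assumptions made here, not notation from Gross & Yellen.

```python
def augment(flow, cap, quasi_path):
    """Return (augmented flow f', minimum slack) for an f-augmenting path.

    flow, cap:   dicts mapping arcs (u, v) to numbers (assumed representation).
    quasi_path:  list of (arc, direction) pairs, where direction is
                 "forward" or "backward" relative to the s-t direction.
    """
    # Slack of a forward arc is cap(e) - f(e); slack of a backward arc is f(e).
    slacks = [
        cap[arc] - flow[arc] if direction == "forward" else flow[arc]
        for arc, direction in quasi_path
    ]
    delta = min(slacks)
    if delta <= 0:
        raise ValueError("quasi-path is not f-augmenting")
    new_flow = dict(flow)
    for arc, direction in quasi_path:
        # Forward arcs gain delta, backward arcs lose delta, all others stay.
        new_flow[arc] += delta if direction == "forward" else -delta
    return new_flow, delta
```

Applied to a quasi-path that contains one backward arc carrying one unit of flow, the function returns a minimum slack of 1 and removes that unit from the backward arc.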

Max-Flow Min-Cut

Sufficiency (⇐) Suppose that there does not exist an f-augmenting path in network N.

Consider the collection of all quasi-paths in network N that begin with source s and have the following property: each forward arc on the quasi-path has positive slack, and each backward arc on the quasi-path has positive flow.

Let V_s be the union of the vertex-sets of these quasi-paths.

Since there is no f-augmenting path, it follows that sink t ∉ V_s.

Let V_t = V_N − V_s.

Then ⟨V_s, V_t⟩ is an s-t cut of network N. Moreover, by definition of the sets V_s and V_t,

$$
f(e) =
\begin{cases}
cap(e), & \text{if } e \in \langle V_s, V_t \rangle \\
0, & \text{if } e \in \langle V_t, V_s \rangle
\end{cases}
$$

Hence, f is a maximum flow, by Corollary 12.1.8. □
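The set V_s used in the sufficiency proof can be computed directly by a search from the source. The sketch below assumes a dictionary representation of capacities and flows; the names `source_side`, `cut_capacity` and `flow_value` are hypothetical helpers introduced only for this illustration.

```python
from collections import deque

def source_side(cap, flow, s):
    """Vertices reachable from s by quasi-paths whose forward arcs have
    positive slack and whose backward arcs carry positive flow."""
    reached = {s}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for (a, b), c in cap.items():
            if a == v and b not in reached and flow[(a, b)] < c:
                # forward arc v -> b with positive slack
                reached.add(b)
                queue.append(b)
            elif b == v and a not in reached and flow[(a, b)] > 0:
                # backward arc a -> v carrying positive flow
                reached.add(a)
                queue.append(a)
    return reached

def cut_capacity(cap, vs):
    """Capacity of the cut <V_s, V_N - V_s>: arcs leaving the set vs."""
    return sum(c for (a, b), c in cap.items() if a in vs and b not in vs)

def flow_value(flow, s):
    """val(f): flow out of the source minus flow into the source."""
    out_s = sum(x for (a, b), x in flow.items() if a == s)
    in_s = sum(x for (a, b), x in flow.items() if b == s)
    return out_s - in_s

# Small assumed example network with a flow that admits no augmenting path.
cap = {("s", "a"): 3, ("a", "t"): 2, ("s", "b"): 1, ("b", "t"): 4}
flow = {("s", "a"): 2, ("a", "t"): 2, ("s", "b"): 1, ("b", "t"): 1}
vs = source_side(cap, flow, "s")
```

If the sink does not end up in the returned set, the cut capacity and the flow value coincide, which is the argument of the proof.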

Max-Flow Min-Cut

Theorem 12.2.4 [Max-Flow Min-Cut] For a given network, the value of a

maximum flow is equal to the capacity of a minimum cut.

Proof: The s-t cut constructed in the proof of Theorem 12.2.3 has capacity equal

to the maximum flow. □

The outline of an algorithm for maximizing the flow in a network emerges from Proposition 12.2.1 and Theorem 12.2.3.

Finding an f-Augmenting Path

The idea is to grow a tree of quasi-paths, each starting at source s.

If the flow on each arc of these quasi-paths can be increased or decreased,

according to whether that arc is forward or backward, then an f-augmenting

path is obtained as soon as the sink t is labelled.

A frontier arc is an arc e directed from a labeled endpoint v to an unlabeled

endpoint w.

For constructing an f-augmenting path, the frontier arc e is allowed to be backward (directed from vertex w to vertex v), and it can be added to the tree as long as it has slack Δ_e > 0.

The discussion of f-augmenting paths culminating in the flow-augmenting

Proposition 12.2.1 provides the basis of a vertex-labeling strategy due to Ford

and Fulkerson that finds an f-augmenting path, when one exists.

Their labelling scheme is essentially basic tree-growing.

Terminology: At any stage during tree-growing for constructing an f-augmenting

path, let e be a frontier arc of tree T, with endpoints v and w.

The arc e is said to be usable if, for the current flow f, either

e is directed from vertex v to vertex w and f(e) < cap(e), or

e is directed from vertex w to vertex v and f(e) > 0.

Frontier arcs e1 and e2 are usable if

f(e1) < cap(e1) and f(e2) > 0

Remark From this vertex-labeling scheme, any of the existing f-augmenting paths

could result. But the efficiency of Algorithm 12.2.1 is based on being able to find

„good“ augmenting paths.

If the arc capacities are irrational numbers, then a procedure using the Ford-Fulkerson labeling scheme might not terminate (strictly speaking, it would then not be an algorithm).

Finding an f-Augmenting Path

Even when flows and capacities are restricted to be integers, problems

concerning efficiency still exist.

E.g., if each flow augmentation were to increase the flow by only one unit, then

the number of augmentations required for maximization would equal the capacity

of a minimum cut.

Such an algorithm would depend on the size of the arc capacities instead of on

the size of the network.

Finding an f-Augmenting Path

Example: For the network shown below, the arc from vertex v to vertex w has

flow capacity 1, while the other arcs have capacity M, which could be made

arbitrarily large.

If the choice of the augmenting flow path at each iteration were to alternate between the directed path ⟨s,v,w,t⟩ and the quasi-path ⟨s,w,v,t⟩, then the flow would increase by only one unit at each iteration.

Thus, it could take as many as 2M iterations to obtain the maximum flow.
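The slow behaviour of this example can be reproduced with a short simulation. The sketch below hard-codes the five arcs described above and the two alternating paths; the function name `alternating_augmentations` and the sign convention (+1 forward, -1 backward) are assumptions of this illustration.

```python
def alternating_augmentations(M):
    """Alternate between <s,v,w,t> and <s,w,v,t> until no slack remains.

    Returns (number of augmentations, value of the final flow).
    """
    cap = {("s", "v"): M, ("s", "w"): M, ("v", "w"): 1,
           ("v", "t"): M, ("w", "t"): M}
    flow = {arc: 0 for arc in cap}
    # Each path is a list of (arc, sign): +1 forward arc, -1 backward arc.
    path_a = [(("s", "v"), +1), (("v", "w"), +1), (("w", "t"), +1)]  # <s,v,w,t>
    path_b = [(("s", "w"), +1), (("v", "w"), -1), (("v", "t"), +1)]  # <s,w,v,t>
    iterations = 0
    while True:
        path = path_a if iterations % 2 == 0 else path_b
        slack = min(cap[a] - flow[a] if sign > 0 else flow[a]
                    for a, sign in path)
        if slack == 0:
            break  # the next path in the alternation is no longer augmenting
        for a, sign in path:
            flow[a] += sign * slack
        iterations += 1
    value = flow[("s", "v")] + flow[("s", "w")]
    return iterations, value
```

Every augmentation is limited by the unit-capacity arc from v to w, so the count grows linearly with M although the network has only four vertices.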

Finding an f-Augmenting Path

Edmonds and Karp avoid these

problems with this algorithm.

It uses breadth-first search

to find an f-augmenting path

with the least number of arcs.
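A minimal sketch of the shortest-augmenting-path idea is given below. It is not a transcription of Algorithm 12.2.2 or 12.2.3 from the textbook: it uses the common residual-capacity formulation, and the function name `edmonds_karp` and the dictionary input format are assumptions.

```python
from collections import defaultdict, deque

def edmonds_karp(cap, s, t):
    """Maximum flow using breadth-first search for shortest augmenting paths.

    cap: dict mapping arcs (u, v) to non-negative capacities.
    Returns (value of the maximum flow, number of augmentations performed).
    """
    residual = defaultdict(int)      # remaining slack per direction
    neighbours = defaultdict(set)
    for (u, v), c in cap.items():
        residual[(u, v)] += c
        neighbours[u].add(v)
        neighbours[v].add(u)         # lets the search traverse arcs backward
    value, augmentations = 0, 0
    while True:
        # Breadth-first search labels vertices in order of distance from s.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in neighbours[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return value, augmentations   # no augmenting path is left
        # Recover the path from t back to s and find its minimum slack.
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        delta = min(residual[arc] for arc in path)
        for u, v in path:
            residual[(u, v)] -= delta
            residual[(v, u)] += delta
        value += delta
        augmentations += 1

# The example network from the previous slide with M = 1000.
M = 1000
example = {("s", "v"): M, ("s", "w"): M, ("v", "w"): 1,
           ("v", "t"): M, ("w", "t"): M}
result = edmonds_karp(example, "s", "t")
```

On this network the shortest augmenting paths have two arcs each, so the search never routes flow through the unit-capacity arc and two augmentations suffice.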

FFEK algorithm: Ford, Fulkerson, Edmonds, and Karp

Algorithm 12.2.3 combines Algorithms 12.2.1 and 12.2.2

Example: The figures illustrate Algorithm 12.2.3.

Shown is the s-t cut with capacity equal to the current flow, establishing optimality.

FFEK algorithm: Ford, Fulkerson, Edmonds, and Karp

At the end of the final iteration, the arc directed from source s to vertex w and the

arc directed from vertex v to sink t are the only frontier arcs of the tree T,

but neither is usable.

These two arcs form the minimum cut ⟨{s,x,y,z,v}, {w,a,b,c,t}⟩.

This illustrates the s-t cut that was constructed in the proof of Theorem 12.2.3.

Determining the connectivity of a graph

In this section, we use the theory of network flows to give constructive proofs of Menger's theorem.

These proofs lead directly to algorithms for determining the edge-connectivity and vertex-connectivity of a graph.

The strategy to prove Menger's theorems is based on properties of certain networks whose arcs all have unit capacity.

These 0-1 networks are constructed from the original graph.

Determining the connectivity of a graph

Lemma 12.3.1. Let N be an s-t network such that

outdegree(s) > indegree(s),

indegree(t) > outdegree(t), and

outdegree(v) = indegree(v) for all other vertices v.

Then, there exists a directed s-t path in network N.

Proof. Let W be a longest directed trail (trail = walk without repeated edges; path = trail

without repeated vertices) in network N that starts at source s, and let z be its terminal

vertex.

If vertex z were not the sink t, then there would be an arc not in trail W that is directed from

z (since indegree(z) = outdegree(z) ).

But this would contradict the maximality of trail W.

Thus, W is a directed trail from source s to sink t.

If W has a repeated vertex, then part of W determines a directed cycle, which can be

deleted from W to obtain a shorter directed s-t trail.

This deletion step can be repeated until no repeated vertices remain, at which point, the

resulting directed trail is an s-t path. □

Determining the connectivity of a graph

Proposition 12.3.2. Let N be an s-t network such that

outdegree(s) − indegree(s) = m = indegree(t) − outdegree(t),

and outdegree(v) = indegree(v) for all vertices v ≠ s,t.

Then there exist m arc-disjoint directed s-t paths in network N.

Proof. If m = 1, then there exists an open eulerian directed trail T from source s to

sink t by Theorem 6.1.3.

Review: An eulerian trail in a graph is a trail that contains every edge of that graph.

Theorem 6.1.3. A connected digraph D has an open eulerian trail from vertex x to vertex y if and only if

indegree(x) + 1 = outdegree(x), indegree(y) = outdegree(y) + 1, and all vertices except x and y have equal

indegree and outdegree.

Theorem 1.5.2. Every open x-y walk W is either an x-y path or can be reduced to an x-y path.

Therefore, trail T is either an s-t directed path or can be reduced to an s-t path.

Determining the connectivity of a graph

By way of induction, assume that the assertion is true for m = k, for some k ≥ 1, and consider a network N for which the condition holds for m = k + 1.

There exists a directed s-t path P by Lemma 12.3.1.

If the arcs of path P are deleted from network N, then the resulting network N - P

satisfies the condition of the proposition for m = k.

By the induction hypothesis, there exist k arc-disjoint directed s-t paths in network

N - P. These k paths together with path P form a collection of k + 1 arc-disjoint

directed s-t paths in network N. □

Basic properties of 0-1 networks

Definition A 0-1 network is a capacitated network whose arc capacities are either 0 or 1.

Proposition 12.3.3. Let N be an s-t network such that cap(e) = 1 for every arc e.

Then the value of a maximum flow in network N equals the maximum number of

arc-disjoint directed s-t paths in N.

Proof: Let f* be a maximum flow in network N, and let r be the maximum number of

arc-disjoint directed s-t paths in N.

Consider the network N* obtained by deleting from N all arcs e for which f*(e) = 0.

Then f*(e) = 1 for all arcs e in network N*.

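It follows from the definition that for every vertex v in network N*,

$$
\sum_{e \in Out(v)} f^*(e) = |Out(v)| = \mathrm{outdegree}(v)
\quad \text{and} \quad
\sum_{e \in In(v)} f^*(e) = |In(v)| = \mathrm{indegree}(v)
$$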

Basic properties of 0-1 networks

Thus by the definition of val(f*) and by the conservation-of-flow property,

outdegree(s) − indegree(s) = val(f*) = indegree(t) − outdegree(t)

and outdegree(v) = indegree(v), for all vertices v ≠ s,t.

By Proposition 12.3.2, there are val(f*) arc-disjoint s-t paths in network N*, and hence also in N, which implies that val(f*) ≤ r.

To obtain the reverse inequality, let {P_1, P_2, ..., P_r} be the largest collection of arc-disjoint directed s-t paths in N, and consider the function f: E_N → R+ defined by

$$
f(e) =
\begin{cases}
1, & \text{if some path } P_i \text{ uses arc } e \\
0, & \text{otherwise}
\end{cases}
$$

Then f is a feasible flow in network N, with val(f) = r.

It follows that val(f*) ≥ r. □

Separating Sets and Cuts

Review from §5.3

Let s and t be distinct vertices in a graph G. An s-t separating edge set in G is a

set of edges whose removal destroys all s-t paths in G.

Thus, an s-t separating edge set in G is a subset of the edge set E_G that contains at least one edge of every s-t path in G.

Definition: Let s and t be distinct vertices in a digraph D.

An s-t separating arc set in D is a set of arcs whose removal destroys all directed

s-t paths in D.

Thus, an s-t separating arc set in D is a subset of the arc set E_D that contains at least one arc of every directed s-t path in digraph D.

Remark: For the degenerate case in which the original graph or digraph has no

s-t paths, the empty set is regarded as an s-t separating set.

Separating Sets and Cuts

Proposition 12.3.4 Let N be an s-t network such that cap(e) = 1 for every arc e. Then the capacity of a minimum s-t cut in network N equals the minimum number of arcs in an s-t separating arc set in N.

Proof: Let K* = ⟨V_s, V_t⟩ be a minimum s-t cut in network N, and let q be the minimum number of arcs in an s-t separating arc set in N.

Since K* is an s-t cut, it is also an s-t separating arc set. Thus cap(K*) ≥ q.

To obtain the reverse inequality, let S be an s-t separating arc set in network N

containing q arcs, and let R be the set of all vertices in N that are reachable from

source s by a directed path that contains no arc from set S.

Then, by the definitions of arc set S and vertex set R, t ∉ R, which means that ⟨R, V_N − R⟩ is an s-t cut.

Moreover, ⟨R, V_N − R⟩ ⊆ S. Therefore

$$
\begin{aligned}
cap(K^*) &\le cap\langle R, V_N - R \rangle && \text{since } K^* \text{ is a minimum } s\text{-}t \text{ cut} \\
&= |\langle R, V_N - R \rangle| && \text{since all capacities are 1} \\
&\le |S| && \text{since } \langle R, V_N - R \rangle \subseteq S \\
&= q
\end{aligned}
$$

which completes the proof. □

Arc and Edge Versions of Menger’s Theorem Revisited

Theorem 12.3.5 [Arc form of Menger's theorem]

Let s and t be distinct vertices in a digraph D. Then the maximum number of arc-disjoint directed s-t paths in D is equal to the minimum number of arcs in an s-t separating set of D.

Proof: Let N be the s-t network obtained by assigning a unit capacity to each arc of digraph D. Then the result follows from Propositions 12.3.3 and 12.3.4, together with the max-flow min-cut theorem. □

Theorem 12.2.4 [Max-Flow Min-Cut] For a given network, the value of a maximum flow is equal to the

capacity of a minimum cut.

Proposition 12.3.3. Let N be an s-t network such that cap(e) = 1 for every arc e. Then the value of a

maximum flow in network N equals the maximum number of arc-disjoint directed s-t paths in N.

Proposition 12.3.4 Let N be an s-t network such that cap(e) = 1 for every arc e. Then the capacity of a

minimum s-t cut in network N equals the minimum number of arcs in an s-t separating arc set in N.
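The arc form of Menger's theorem can be checked on a small digraph by computing both quantities independently. The sketch below is an illustration under stated assumptions: the digraph is given as a list of arcs without parallel arcs, the helper names are hypothetical, and the minimum separating set is found by brute force, which is only practical for tiny examples.

```python
from itertools import combinations

def has_path(arcs, s, t):
    """True if a directed s-t path exists using only the given arcs."""
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for a, b in arcs:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return False

def min_separating_arc_set_size(arcs, s, t):
    """Brute force: fewest arcs whose removal destroys all directed s-t paths."""
    arcs = list(arcs)
    for k in range(len(arcs) + 1):
        for removed in combinations(arcs, k):
            rest = [a for a in arcs if a not in removed]
            if not has_path(rest, s, t):
                return k

def max_arc_disjoint_paths(arcs, s, t):
    """Value of a maximum flow in the unit-capacity network (Prop. 12.3.3)."""
    residual = {}
    for a, b in arcs:
        residual[(a, b)] = residual.get((a, b), 0) + 1
        residual.setdefault((b, a), 0)
    count = 0
    while True:
        # Depth-first search for an augmenting path in the residual network.
        parent, stack = {s: None}, [s]
        while stack and t not in parent:
            u = stack.pop()
            for (a, b), r in residual.items():
                if a == u and r > 0 and b not in parent:
                    parent[b] = u
                    stack.append(b)
        if t not in parent:
            return count
        v = t
        while parent[v] is not None:   # push one unit along the path found
            residual[(parent[v], v)] -= 1
            residual[(v, parent[v])] += 1
            v = parent[v]
        count += 1

# Assumed example digraph.
D = [("s", "a"), ("s", "b"), ("a", "b"), ("a", "t"), ("b", "t")]
```

For the example digraph both functions return the same number, as the theorem states.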

Transcriptional – Gene regulation networks

The machine that transcribes a gene is composed of perhaps 50 proteins, including RNA polymerase, the enzyme that converts DNA code into RNA code.

A crew of transcription factors grabs hold of the DNA just above the gene at a site called the core promoter, while associated activators bind to enhancer regions farther upstream of the gene to rev up transcription.

http://www.berkeley.edu/news/features/1999/12/09_nogales.html

Working as a tightly knit machine, these proteins transcribe a single gene into messenger RNA. The messenger RNA winds its way out of the nucleus to the factories that produce proteins, where it serves as a blueprint for production of a specific protein.

Transcription in E.coli and in Eucaryotes

Procaryotes | Eucaryotes

Genes are grouped into operons. | Genes are not grouped in operons.

mRNA may contain the transcript of several genes (poly-cistronic). | Each mRNA contains only the transcript of a single gene (mono-cistronic).

Transcription and translation are coupled: the transcript is translated already during transcription. | Transcription and translation are NOT coupled: transcription takes place in the nucleus, translation in the cytosol.

Gene regulation takes place by modification of the transcription rate. | Gene regulation via the transcription rate AND by RNA processing, RNA stability etc.

Promoter prediction in E.coli

To analyze E.coli promoters, one may align a set of promoter sequences by the

position that marks the known transcription start site (TSS) and search for

conserved regions in the sequences.

E.coli promoters are found to contain 3 conserved sequence features

- a region approximately 6 bp long with consensus TATAAT at position -10

- a region approximately 6 bp long with consensus TTGACA at position -35

- a distance between these 2 regions of ca. 17 bp that is relatively constant
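A scan for the two consensus hexamers at the stated spacing might look like the sketch below. The function names, the tolerance of one mismatch per box, and the allowed spacer range of 16 to 18 bp are assumptions chosen for illustration; the slide only states a spacing of about 17 bp.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_promoter_candidates(seq, max_mismatch=1, spacers=range(16, 19)):
    """Return (start of -35 box, start of -10 box, spacer length) triples.

    A candidate is a near-match to TTGACA followed, after an allowed spacer,
    by a near-match to TATAAT. Positions are 0-based indices into seq.
    """
    box35, box10 = "TTGACA", "TATAAT"
    hits = []
    for i in range(len(seq) - 6 + 1):
        if hamming(seq[i:i + 6], box35) > max_mismatch:
            continue
        for gap in spacers:
            j = i + 6 + gap
            if j + 6 <= len(seq) and hamming(seq[j:j + 6], box10) <= max_mismatch:
                hits.append((i, j, gap))
    return hits

# Synthetic test sequence: both consensus boxes separated by 17 bp.
toy = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
candidates = find_promoter_candidates(toy)
```

Because such short patterns occur frequently by chance, a scan like this produces many false positives on real genomic sequence and is best seen as a first filter.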

Gene regulatory promoter network

In E.coli, 240 transcription factors have been verified that regulate 3000 genes.

Binding site matrices are available for more than 55 E.coli TFs (Robison et al. 1998).

In S. cerevisiae, genome-wide binding analysis of 106 transcription factors indicates that more than one-third of the promoter regions that were bound by regulators were bound by 2 or more regulators.

Highly connected network of transcriptional regulators.

Feasibility of computational motif search?

Computational identification of transcription factor binding sites is difficult

because they consist of short, degenerate sequences that occur frequently by

chance.

The problem is not easy to define (therefore: it is „complex“) because

- the motif is of unknown size

- the motif might not be well conserved between promoters

- the sequences used to search for the motif do not necessarily represent the

complete promoter

- genes with promoters to be analyzed are in many cases grouped together by a

clustering algorithm which has its own limitations.

Strategy 1

Arrival of microarray gene-expression data.

For a group of genes with a similar expression profile (e.g. those that are activated at the same time in the cell cycle), one may assume that this profile is, at least partly, caused by and reflected in a similar structure of the regions involved in transcription regulation.

Search for common motifs in < 1000 base upstream regions.

So far used: detection of single motifs (representing transcription-factor binding sites) common to the promoter sequences of putatively co-regulated genes.

Better: search for simultaneous occurrence of 2 or more sites at a given distance

interval! Search becomes more sensitive.

Motif identification

A flowchart to illustrate the two different approaches for motif identification. We analyzed 800 bp upstream from the translation start sites of the five genes from the yeast gene family PHO by the publicly available systems MEME (alignment) and RSA (exhaustive search). MEME was run on both strands, one occurrence per sequence mode, and found the known motif ranked as second best. RSA Tools was run with oligo size 6 and noncoding regions as background, as set by the demo mode of the system. The well-conserved heptamer of the motifs used by MEME to build the weight matrix is printed in bold.

Ohler, Niemann Trends Gen 17, 2 (2001)

Strategy 2: Exhaustive motif search in upstream regions

Exploit the finding that relevant motifs are often repeated many times,

possibly with small variations, in the upstream region for the regulatory action to

be effective.

Search upstream region for overrepresented motifs

(1) Group genes based on the overrepresented motifs

(2) Analyze sets of genes that share motifs for coregulation in microarray exp.

(3) Consider overrepresented motifs labelling sets of co-regulated genes as

candidate binding sites.

Cora et al. BMC Bioinformatics 5, 57 (2004)

Exhaustive motif search in upstream regions

[Figure not reproduced in this transcript.]

Cora et al. BMC Bioinformatics 5, 57 (2004)

Position-specific weight matrix

Popular approach when a list of genes that share a TF binding motif is available and a good multiple sequence alignment exists.

Alignment matrix: lists the number of occurrences of each letter at each position of an alignment.

Hertz, Stormo (1999) Bioinformatics 15, 563
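How an alignment matrix is turned into a weight matrix can be sketched as follows. The log-odds formula with a uniform background of 0.25 and a pseudocount of 1 is one common choice and is an assumption here, as are the function names and the four toy binding sites.

```python
import math

def alignment_matrix(sites):
    """Count occurrences of each letter at each position of aligned sites."""
    length = len(sites[0])
    counts = {base: [0] * length for base in "ACGT"}
    for site in sites:
        for pos, base in enumerate(site):
            counts[base][pos] += 1
    return counts

def weight_matrix(counts, n_sites, background=0.25, pseudocount=1.0):
    """Log-odds weights ln( ((count + pseudocount/4) / (N + pseudocount)) / background )."""
    return {
        base: [
            math.log((c + pseudocount / 4) / (n_sites + pseudocount) / background)
            for c in column_counts
        ]
        for base, column_counts in counts.items()
    }

def score(weights, window):
    """Sum of the position-specific weights for one sequence window."""
    return sum(weights[base][pos] for pos, base in enumerate(window))

# Toy set of aligned binding sites (made up for illustration).
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
counts = alignment_matrix(sites)
weights = weight_matrix(counts, len(sites))
```

A window that matches the most frequent letter at every position receives the highest score, while windows made of rarely observed letters score negative.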

Position-specific weight matrix

Examples of matrices used by YRSA

http://forkhead.cgb.ki.se/YRSA/matrixlist.html

Exp. Identification of TF binding site: DNase I Footprinting

A protein bound to a specific DNA sequence will interfere with the digestion of that region by DNase I.

An end-labelled DNA probe is incubated with a protein extract or a purified DNA-binding factor.

The unprotected DNA is then partially digested with DNase I such that on average every DNA molecule is cut once.

Digestion products are then resolved by electrophoresis.

Comparison of the DNase I digestion pattern in the presence and absence of protein will allow the identification of a footprint (protected region).

[Figure: denaturing PAGE of the digestion products, with the footprint marked.]

Gel Shifts / Electro Mobility Shift Assay (EMSA) / Band Shift

A purified protein, or a complex mixture of proteins (e.g. nuclear or cell extract), is incubated with a 32P end-labelled DNA fragment containing the putative protein binding site (from the promoter region).

Reaction products are then analysed on a non-denaturing polyacrylamide gel.

The specificity of the DNA-binding protein for the putative binding site is established by competition experiments using DNA fragments or oligonucleotides containing a binding site for the protein of interest, or other unrelated DNA sequences.

[Figure: gel retardation assay on non-denaturing PAGE, lanes without and with added protein; free DNA probe versus band with retarded mobility due to protein binding.]

http://www.rcsb.org

3D structures of transcription factors

1A02.pdb 1AM9.pdb 1AU7.pdb

1CIT.pdb 1GD2.pdb 1H88.pdb

TFs bind with very different binding modes.

Some are sensitive to DNA conformation.

2 TFs bound!

database for eukaryotic transcription factors: TRANSFAC

BIOBase / TU Braunschweig / GBF

Relational database

6 flat files:

- FACTOR: interaction of TFs
- SITE: their DNA binding site
- GENE: through which they regulate these target genes
- CELL: factor source
- MATRIX: TF nucleotide weight matrices
- CLASS: classification scheme of TFs

Wingender et al. (1998) J Mol Biol 284,241

database for eukaryotic transcription factors: TRANSFAC

BIOBase / TU Braunschweig / GBF

Matys et al. (2003) Nucl Acid Res 31,374

TRANSFAC classification

1 Superclass: basic domains
1.1 Leucine zipper factors (bZIP)
1.2 Helix-loop-helix factors (bHLH)
1.3 bHLH-bZIP
1.4 NF-1
1.5 RF-X
1.6 bHSH

2 Superclass: Zinc-coordinating DNA-binding domains
2.1 Cys4 zinc finger of nuclear receptor type
2.2 diverse Cys4 zinc fingers
2.3 Cys2His2 zinc finger domains
2.4 Cys6 cysteine-zinc cluster
2.5 Zinc fingers of alternating composition

3 Superclass: Helix-turn-helix

4 Superclass: beta-Scaffold Factors with Minor Groove Contacts

5 Superclass: others

http://www.gene-regulation.com/pub/databases/transfac/cl.html

BIOBase / TU Braunschweig / GBF

Matys et al. (2003) Nucl Acid Res 31,374

database for eukaryotic transcription factors: TRANSFAC

Summary

http://www.gene-regulation.com

Large databases available (e.g. TRANSFAC) with information about promoter sites.

Information verified experimentally.

Microarray data allows searching for common motifs of coregulated genes.

Also possible: common GO annotation etc.

TF binding motifs are frequently overrepresented in 1000 bp upstream region.

Clear function of this is unknown.

(Same as in proline-rich recognition sequences.)

Relatively few TFs regulate large number of genes.

Complex regulatory network: see next lecture(s).

additional slides

Arc and Edge Versions of Menger’s Theorem Revisited

Remark The edge form of Menger's theorem for undirected graphs follows directly from the next two assertions concerning the relationship between a graph G and the digraph $\vec{G}$ obtained by replacing each edge e of graph G with a pair of oppositely directed arcs having the same endpoints as edge e.

Each of these assertions follows directly from the definitions.

Assertion 12.3.6. Let s and t be distinct vertices of a graph G, and let $\vec{G}$ be the digraph obtained by replacing each edge e of G with a pair of oppositely directed arcs having the same endpoints as edge e.

Then there is a one-to-one correspondence between the s-t paths in graph G and the directed s-t paths in digraph $\vec{G}$.

Moreover, two s-t paths in graph G are edge-disjoint if and only if their corresponding directed s-t paths in digraph $\vec{G}$ are arc-disjoint.

Assertion 12.3.7. Let s and t be distinct vertices of a graph G, and let $\vec{G}$ be defined as above. Then the minimum number of edges in an s-t separating set of graph G is equal to the minimum number of arcs in an s-t separating arc set of digraph $\vec{G}$.

Arc and Edge Versions of Menger’s Theorem Revisited

Theorem 12.3.8 [Edge form of Menger's theorem]. Let s and t be distinct vertices in a graph G. Then the maximum number of edge-disjoint s-t paths in G equals the minimum number of edges in an s-t separating edge set of graph G.

Proof: This is an immediate consequence of Assertions 12.3.6 and 12.3.7, together with the arc form of Menger's theorem (Theorem 12.3.5).

Review from §5.1: The edge-connectivity $\kappa_e(G)$ is the size of a smallest edge-cut in graph G.

Definition: Let s and t be distinct vertices in a graph G. The local edge-connectivity between vertices s and t, denoted $\kappa_e(s,t)$, is the minimum number of edges in an s-t separating edge set in G.

Determining Edge-Connectivity Using Network Flows

Proposition 12.3.9. The edge-connectivity of a graph G is equal to the minimum of the local edge-connectivities, taken over all pairs of vertices s and t. That is:

$$\kappa_e(G) = \min_{s,t \in V_G} \kappa_e(s,t)$$

Proposition 12.3.9 and Theorem 12.3.8 suggest an algorithm for determining the edge-connectivity $\kappa_e(G)$ of an arbitrary graph G. The algorithm calculates the local edge-connectivity between each pair of vertices in G by solving an appropriate maximum flow problem in the network $\vec{G}$.

In fact, as the next two results show, it is not necessary to calculate the local edge-connectivity between every pair of vertices.

Determining Edge-Connectivity Using Network Flows

Proposition 12.3.10. Let $\langle V_1, V_2 \rangle$ be a partition-cut of minimum cardinality in a graph G, and let $v_1$ and $v_2$ be any vertices in $V_1$ and $V_2$, respectively. Then the edge-connectivity $\kappa_e(G)$ equals the local edge-connectivity $\kappa_e(v_1,v_2)$.

Proof: Suppose that the minimum local edge-connectivity is achieved between vertices x and y. Then $\kappa_e(G) = \kappa_e(x,y)$ by Proposition 12.3.9. Since $\kappa_e(x,y)$ is the minimum over all pairs, it suffices to show that $\kappa_e(v_1,v_2) \le \kappa_e(x,y)$.

Let $\vec{G}$ be the digraph obtained by replacing each edge of graph G with two oppositely directed arcs. Then $\vec{G}$ can be regarded as a $v_1$-$v_2$ capacitated network $\vec{G}_{v_1 v_2}$ and as an x-y capacitated network $\vec{G}_{xy}$, where each arc is assigned unit capacity.

Let $K^*$ be a minimum $v_1$-$v_2$ cut in network $\vec{G}_{v_1 v_2}$.

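It follows that $\mathrm{cap}(K^*) \le \mathrm{cap}\langle V_1, V_2 \rangle$, since the partition-cut $\langle V_1, V_2 \rangle$ corresponds to a $v_1$-$v_2$ cut in network $\vec{G}_{v_1 v_2}$.

Next, let $f^*$ be a maximum flow and $\langle V_x, V_y \rangle$ a minimum x-y cut in the x-y network $\vec{G}_{xy}$, so that $\mathrm{cap}\langle V_x, V_y \rangle = \mathrm{val}(f^*)$. Then

$$
\begin{aligned}
\kappa_e(v_1, v_2) &= \mathrm{cap}(K^*) && \text{(Proposition 12.3.4 and Assertion 12.3.7)} \\
&\le \mathrm{cap}\langle V_1, V_2 \rangle && \text{($\langle V_1, V_2 \rangle$ corresponds to a cut in $\vec{G}_{v_1 v_2}$)} \\
&= |\langle V_1, V_2 \rangle| && \text{(all arcs have unit capacity)} \\
&\le |\langle V_x, V_y \rangle| && \text{($\langle V_x, V_y \rangle$ corresponds to a partition-cut in $G$)} \\
&= \mathrm{cap}\langle V_x, V_y \rangle && \text{(all arcs have unit capacity)} \\
&= \mathrm{val}(f^*) && \text{(max-flow min-cut)} \\
&= \kappa_e(x, y) && \text{(Proposition 12.3.3 and Theorem 12.3.5)}
\end{aligned}
$$

The second inequality holds because $\langle V_1, V_2 \rangle$ has minimum cardinality among the partition-cuts of G.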

Arc and Edge Versions of Menger’s Theorem Revisited

Corollary 12.3.11. Let s be any vertex in a graph G. Then

$$\kappa_e(G) = \min_{t \in V_G - \{s\}} \kappa_e(s,t)$$

Proof: Let $\langle V_1, V_2 \rangle$ be a partition-cut of minimum cardinality, and suppose, without loss of generality, that vertex $s \in V_1$. There must be some vertex $t \in V_2$ (otherwise, $E_G = \emptyset$, and the assertion would be trivially true). By Proposition 12.3.10 it follows that $\kappa_e(G) = \kappa_e(s,t)$. □

The variable $\kappa_e$ used in the next algorithm represents the edge-connectivity of graph G and is initialized with the sufficiently large positive integer $|E_G|$.

Arc and Edge Versions of Menger’s Theorem Revisited

Algorithm 12.3.1 requires O(n) iterations, and since Algorithm 12.2.3 requires $O(n|E|^2)$ computations, the overall complexity of Algorithm 12.3.1 is $O(n^2|E|^2)$.

More efficient algorithms exist.
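The listing of Algorithm 12.3.1 is not reproduced in this transcript. The following is a minimal Python sketch of the idea suggested by Corollary 12.3.11, not the textbook's listing: fix one vertex s and take the minimum of the local edge-connectivities to all other vertices t. The function names and the dict-of-sets graph representation are assumptions made for illustration, and a plain breadth-first augmenting-path search stands in for Algorithm 12.2.3, whose details are not shown here.

```python
from collections import deque


def local_edge_connectivity(adj, s, t):
    """Value of a maximum s-t flow when every edge {u, v} of the simple graph
    `adj` (a symmetric dict of neighbour sets) is replaced by the two
    unit-capacity arcs (u, v) and (v, u)."""
    residual = {u: {v: 1 for v in nbrs} for u, nbrs in adj.items()}
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual network.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow  # no augmenting path left, so the flow is maximum
        # All capacities are integers, so each augmenting path adds one unit.
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= 1
            residual[v][u] += 1
            v = u
        flow += 1


def edge_connectivity(adj):
    """Minimum of the local edge-connectivities from one fixed vertex s."""
    vertices = list(adj)
    if len(vertices) < 2:
        return 0
    s = vertices[0]
    kappa = sum(len(nbrs) for nbrs in adj.values()) // 2  # initial value |E_G|
    for t in vertices[1:]:
        kappa = min(kappa, local_edge_connectivity(adj, s, t))
    return kappa
```

For the 4-cycle `{0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}` the function `edge_connectivity` returns 2, and for the path `{0: {1}, 1: {0, 2}, 2: {1}}` it returns 1. Only n − 1 maximum-flow problems are solved, which is the saving over computing all pairs.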

Using Network Flows to Prove the Vertex Forms of Menger’s Theorem

Construction of Digraph $N_D$ from Digraph D.

Let s and t be any pair of non-adjacent vertices in a digraph D.

The digraph $N_D$ is obtained from digraph D as follows:

- each vertex $x \in V_D - \{s,t\}$ corresponds to two vertices x' and x'' in digraph $N_D$ and an arc directed from x' to x''.

- each arc in digraph D that is directed from vertex s to a vertex $x \in V_D - \{s,t\}$ corresponds to an arc in $N_D$ directed from s to x'.

- each arc in D that is directed from a vertex $x \in V_D - \{s,t\}$ to vertex t corresponds to an arc in $N_D$ directed from x'' to t.

- each arc in D that is directed from a vertex $x \in V_D - \{s,t\}$ to a vertex $y \in V_D - \{s,t\}$ corresponds to an arc in $N_D$ directed from x'' to y'.
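The four rules above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name `build_split_digraph` is hypothetical, the digraph D is given as a list of (tail, head) pairs, and the copies x' and x'' of an internal vertex x are represented by the tuples `(x, "in")` and `(x, "out")`. Arcs entering s or leaving t are skipped, since the construction above does not list them.

```python
def build_split_digraph(arcs, s, t):
    """Return the arcs of the vertex-split digraph for a digraph D.

    `arcs` is a list of (tail, head) pairs; s and t must be non-adjacent.
    An internal vertex x becomes (x, "in") for x' and (x, "out") for x''.
    """
    if (s, t) in arcs or (t, s) in arcs:
        raise ValueError("s and t must be non-adjacent")
    # Internal vertices in order of first appearance.
    seen = dict.fromkeys(v for arc in arcs for v in arc)
    inner = [v for v in seen if v not in (s, t)]
    # Rule 1: one arc from x' to x'' for every internal vertex x.
    split = [((x, "in"), (x, "out")) for x in inner]
    for u, v in arcs:
        if v == s or u == t:
            continue  # not covered by the four rules of the construction
        tail = u if u == s else (u, "out")  # arcs leave x through x''
        head = v if v == t else (v, "in")   # arcs enter y through y'
        split.append((tail, head))          # rules 2, 3 and 4
    return split
```

For example, with arcs s→a, s→b, a→c, b→c and c→t, the result has 3 splitting arcs plus 5 translated arcs. The two directed s-t paths through a and through b both pass through c in D, and in the split digraph both must use the single arc from c' to c''.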

Using Network Flows to Prove the Vertex Forms of Menger’s Theorem

Review from §5.3: Let s and t be a pair of non-adjacent vertices in a graph G (or

digraph D). An s-t separating vertex set in G (or in D) is a set of vertices whose

removal destroys all s-t paths in G (or all directed s-t paths in D).

Thus, an s-t separating vertex set is a set of vertices that contains at least one

internal vertex of every (directed) s-t path.

Definition Two (directed) s-t paths in a digraph D are internally disjoint if they

have no internal vertices in common.

Relationship between digraphs D and $N_D$

Assertion 12.3.12. There is a one-to-one correspondence between directed s-t paths in digraph D and directed s-t paths in digraph $N_D$.

Assertion 12.3.13. Two directed s-t paths in D are internally disjoint if and only if their corresponding directed s-t paths in $N_D$ are arc-disjoint.

Assertion 12.3.14. The maximum number of internally disjoint directed s-t paths in D is equal to the maximum number of arc-disjoint directed s-t paths in $N_D$.

Assertion 12.3.15. The minimum number of vertices in an s-t separating vertex set in digraph D is equal to the minimum number of arcs in an s-t separating arc set in digraph $N_D$.

Relationship between digraphs D and $N_D$

Theorem 12.3.16 [Vertex Form of Menger for Digraphs]

Let s and t be a pair of non-adjacent vertices in a digraph D.

Then the maximum number of internally disjoint directed s-t paths in D is equal to

the minimum number of vertices in an s-t separating vertex set in D.

Proof: This follows from Assertions 12.3.12 through 12.3.15 together with the arc form of Menger's theorem (Theorem 12.3.5).

Theorem 12.3.17 [Vertex Form of Menger for Undirected Graphs].

Let s and t be a pair of non-adjacent vertices in a graph G.

Then the maximum number of internally disjoint s-t paths in G is equal to the

minimum number of vertices in an s-t separating vertex set in G.

Proof: This follows from Theorem 12.3.16 and Assertions 12.3.6 and 12.3.7.

Determining Vertex-Connectivity using Network Flow

Review from §5.3: Let s and t be non-adjacent vertices of a connected graph G. Then the local vertex-connectivity between s and t, denoted $\kappa_v(s,t)$, is the minimum number of vertices in an s-t separating vertex set.

Lemma 5.3.5. Let G be a connected graph containing at least one pair of non-adjacent vertices. Then the vertex-connectivity $\kappa_v(G)$ is the minimum of the local vertex-connectivity $\kappa_v(s,t)$, taken over all pairs of non-adjacent vertices s and t.

The following algorithm, with $O(n|E_G|^3)$ time-complexity, computes the vertex-connectivity of an n-vertex graph by calculating the local vertex-connectivity between various pairs of non-adjacent vertices.

As in Algorithm 12.3.1, it is not necessary to calculate the local vertex-connectivity between each pair.
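The algorithm listing itself is not reproduced in this transcript. As a cross-check on small graphs, the definitions and Lemma 5.3.5 can be evaluated directly. The sketch below is a brute-force reference, not the flow-based algorithm: it tries vertex subsets of increasing size, so its running time is exponential, and it examines every non-adjacent pair. The function names and the dict-of-sets representation are assumptions made for illustration.

```python
from itertools import combinations


def has_path(adj, s, t, removed):
    """Depth-first search for an s-t path that avoids the vertices in `removed`."""
    stack, seen = [s], {s}
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for v in adj[u]:
            if v not in seen and v not in removed:
                seen.add(v)
                stack.append(v)
    return False


def local_vertex_connectivity(adj, s, t):
    """Size of a smallest s-t separating vertex set (s and t non-adjacent)."""
    others = [v for v in adj if v not in (s, t)]
    for k in range(len(others) + 1):
        for removed in combinations(others, k):
            if not has_path(adj, s, t, set(removed)):
                return k
    raise ValueError("s and t are adjacent, so no separating vertex set exists")


def vertex_connectivity(adj):
    """Lemma 5.3.5: minimum over all pairs of non-adjacent vertices."""
    pairs = [(s, t) for s, t in combinations(adj, 2) if t not in adj[s]]
    if not pairs:
        raise ValueError("the graph has no pair of non-adjacent vertices")
    return min(local_vertex_connectivity(adj, s, t) for s, t in pairs)
```

For the 4-cycle `{0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}` the function `vertex_connectivity` returns 2, and for the path `{0: {1}, 1: {0, 2}, 2: {1}}` it returns 1. A flow-based version would replace `local_vertex_connectivity` with a maximum-flow computation on the vertex-split network with unit capacities.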

Determining Vertex-Connectivity using Network Flow