Diffusion pseudotime robustly reconstructs lineage branching · diffusion-like random walks. This...

10
Diffusion pseudotime robustly reconstructs lineage branching Laleh Haghverdi 1 , Maren Büttner 1 , F. Alexander Wolf 1 , Florian Buettner 1,2 , Fabian J. Theis 1,3 1 Helmholtz Zentrum München–German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany. 2 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. 3 Department of Mathematics, Technische Universität München, Munich, Germany. Single-cell gene expression profiles of differentiating cells encode their intrinsic latent temporal order. We describe an efficient way to robustly estimate this order according to a diffusion pseudotime, which measures transitions on all length scales between cells using diffusion-like random walks. This allows us to identify cells that undergo branching decisions or are in metastable states, and thereby genes differentially regulated at these states. Cellular programs are driven by gene-regulatory interactions, which due to inherent stochasticity as well as external influences often exhibit strong heterogeneity and asynchrony in the timing of program execution. Time-resolved bulk transcriptomics averages over these effects and obscures the underlying gene dynamics. Instead, single-cell profiling techniques allow a systematic observation of a single cell's regulatory state 1 , capturing cells at various stages in their respective process 2,3 . To infer gene dynamics and hence the sequence of cellular programs, the collective (‘universal’) process dynamics (Box 1) can be reconstructed by reordering cells according to some measure of expression similarity. This so-called pseudotemporal ordering was initially proposed for bulk expression 4 , and was later extended to single-cell RNA-seq 5 and protein profiles from mass cytometry 6 . Box 1: Universal time In contrast to continuous time observations of a single cell e.g. from time-lapse microscopy, high-throughput snapshot experiments such as single cell RNA-seq or FACS only encode the collective (‘universal’) time dependence of cells, not the stochastic trajectories of single cells. We define universal time as the geodesic distance on the manifold that is associated with the deterministic program underlying the stochastic cellular process. For time-lapse data, universal time can be constructed by estimating the velocity () tangential to this manifold C from local averages of single-cell trajectories. The geodesic distance = !: ! ! ,! ! = ! ! ! ! ! ! ! 1 ! . then quantifies the arc length i.e. the universal time along the manifold, where denotes the local density of samples on a single cell trajectory (see Supplementary Sec. 1). Pseudotimes are proxies for universal time (Supplementary Figs. 1-3). Our proposed DPT approximates universal time better than other pseudotime schemes as it does not involve dimension reduction, and better than diffusion distance 11 as it accounts for random walks on all length scales. Ultimately these approaches aim to fully understand differentiation dynamics as paths on Waddington’s ‘epigenetic landscape’ 7,8 . However so far it is unclear how to identify cells at critical branching decisions as well as quiescent or metastable cells, for which there is no notion of temporal ordering. Moreover, as novel experimental techniques such as droplet . CC-BY-NC 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/041384 doi: bioRxiv preprint

Transcript of Diffusion pseudotime robustly reconstructs lineage branching · diffusion-like random walks. This...

Page 1: Diffusion pseudotime robustly reconstructs lineage branching · diffusion-like random walks. This allows us to identify cells that undergo branching ... Pseudotimes are proxies for

DiffusionpseudotimerobustlyreconstructslineagebranchingLalehHaghverdi1,MarenBüttner1,F.AlexanderWolf1,FlorianBuettner1,2,FabianJ.Theis1,3

1HelmholtzZentrumMünchen–GermanResearchCenterforEnvironmentalHealth,InstituteofComputationalBiology,Neuherberg,Germany.2EuropeanMolecularBiologyLaboratory,EuropeanBioinformaticsInstitute,WellcomeTrustGenomeCampus,Hinxton,Cambridge,UK.3DepartmentofMathematics,TechnischeUniversitätMünchen,Munich,Germany.Single-cell gene expression profiles of differentiating cells encode their intrinsic latenttemporalorder.Wedescribeanefficientwaytorobustlyestimatethisorderaccordingtoadiffusionpseudotime,whichmeasurestransitionsonalllengthscalesbetweencellsusingdiffusion-like random walks. This allows us to identify cells that undergo branchingdecisionsorare inmetastablestates,andtherebygenesdifferentiallyregulatedatthesestates. Cellular programs are driven by gene-regulatory interactions, which due to inherentstochasticity as well as external influences often exhibit strong heterogeneity andasynchronyinthetimingofprogramexecution.Time-resolvedbulktranscriptomicsaveragesovertheseeffectsandobscurestheunderlyinggenedynamics. Instead,single-cellprofilingtechniquesallowasystematicobservationofasinglecell'sregulatorystate1,capturingcellsat various stages in their respective process2,3. To infer gene dynamics and hence thesequenceofcellularprograms, thecollective (‘universal’)processdynamics (Box1)canbereconstructedby reorderingcellsaccording tosomemeasureofexpressionsimilarity.Thisso-calledpseudotemporalorderingwasinitiallyproposedforbulkexpression4,andwaslaterextendedtosingle-cellRNA-seq5andproteinprofilesfrommasscytometry6.Box1:UniversaltimeIncontrasttocontinuoustimeobservationsofasinglecelle.g.fromtime-lapsemicroscopy,high-throughputsnapshotexperimentssuchassinglecellRNA-seqorFACSonlyencodethecollective(‘universal’)timedependenceofcells,notthestochastictrajectoriesofsinglecells.Wedefineuniversal timeas thegeodesicdistanceon themanifold that isassociatedwiththe deterministic program underlying the stochastic cellular process. For time-lapse data,universaltimecanbeconstructedbyestimatingthevelocity𝒗(𝑡)tangentialtothismanifoldCfromlocalaveragesofsingle-celltrajectories.Thegeodesicdistance

𝑠 𝑡 = 𝑑𝑠!: ! ! ,! !

= 𝑑𝑡!!

! 𝒗 𝑡! ≈ 𝑑𝑡!

!

!

1𝜌 𝑡! .

thenquantifiesthearclengthi.e.theuniversaltimealongthemanifold,where𝜌 𝑡 denotesthelocaldensityofsamplesonasinglecelltrajectory(seeSupplementarySec.1).Pseudotimes are proxies for universal time (Supplementary Figs. 1-3). Our proposed DPTapproximatesuniversal timebetter thanotherpseudotimeschemesas itdoesnot involvedimensionreduction,andbetterthandiffusiondistance11asitaccountsforrandomwalksonalllengthscales.Ultimately these approaches aim to fully understanddifferentiationdynamics as paths onWaddington’s‘epigenetic landscape’7,8.Howeversofar it isunclearhowtoidentifycellsatcritical branching decisions aswell as quiescent ormetastable cells, forwhich there is nonotionof temporal ordering.Moreover, asnovel experimental techniques suchasdroplet

.CC-BY-NC 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/041384doi: bioRxiv preprint

Page 2: Diffusion pseudotime robustly reconstructs lineage branching · diffusion-like random walks. This allows us to identify cells that undergo branching ... Pseudotimes are proxies for

sequencing9,10 allow to profile tens of thousands of cells, there is an urgent need forcomputationallyefficient,scalableandrobustalgorithms.To overcome these problems, we introduce ‘diffusion pseudotime’ (DPT). It measuresprogression through branching lineages using a random-walk-based distance in diffusionmapspace11andallowsforbranchingandpseudotimeanalysisonlarge-scaleRNA-seqdatasets.Evenintheabsenceofbranching,DPTissignificantlymorerobustwithrespecttonoiseinlow-densityregionsandcelloutliersthanexistingmethods,whichrelyontheestimationofminimumspanningtrees5orsampling-baseddistances6,12.Diffusion pseudotime is computed in three steps (Fig. 1a and Online Methods). First, atransitionmatrixTthatapproximatesthedynamictransitionsofcellsthroughstagesofthedifferentiation process is determined. The right eigenvectors of T are known as diffusioncomponents and have been used in diffusion maps for visualizing single cell RNA-seqdata13,14. While using only few diffusion components yields interpretable visualizations,important informationmaybelostbyremovingtheothers.Consequently,DPTisbasedonthefullrankTratherthanalowrankapproximation.Inthesecondstep,thedistancedpt(x,y)betweentwocellswithindexxandyiscomputedas

dpt(𝑥,𝑦) = 𝑴 𝑥, . −𝑴 𝑦, . !/!! , 𝑴 = 𝑻! . (1)!

!!!

Here, … !/!! denotes the𝐿! norm weighted by the first left eigenvector𝜑! of T, the

steadystate(OnlineMethods,SupplementarySec.1.3).Insteadoftheprobability(𝑻!)!"fora random walk of fixed length11 t from x to y, in Eq. (1), we compute the accumulatedtransition probability(𝑴)!" of visiting y when starting from x over random walks of alllengths.Thisisdoneusingthemodifiedtransitionmatrix𝑻,whichisdefinedasTwithouttheeigenspace associated with the steady state𝜑!,and therefore describes how the steadystateisapproached.Fixingaknownrootcellrasstartofthedynamicalprocessofinterest,wedefinethepseudotimeofcellxasdpt(x,r).Inthethirdstep,branchingpointsareidentifiedbycomparingtworandomwalksovercells,one starting at the root cell r and the other at its maximally distant cell y, measuringpseudotimeswithrespecttorandy, respectively. Thetwosequencesofpseudotimesareanticorrelateduntil the twowalksmerge inanewbranch,where theybecomecorrelated(Online methods). This criterion robustly identifies branching points as we illustrate forsimulationdataforwhichthegroundtruthisknown(SupplementaryFig.4).ToillustratetheabilityofDPTtoidentifybranchesonrealdata,wereanalyzedasingle-cellqPCRdatasetfocusingonearlyblooddevelopment15,forwhichwehaveshownpreviouslythat diffusion maps allow to visually identify a precursor branch splitting up into twoseparate lineages (cf. Fig.1b).DPTorderedcellsalongtheirdevelopmental trajectoryandidentified cells at thebranchingpoint. The three identifiedbranchesqualitatively coincidewithaprecursorbranchandthereportedblood(branch1)andendothelialbranches(branch2)15.Inparticularweidentifiedcharacteristicpatternsinthedevelopmentalstagesinbloodprogenitors (Fig.1d),namelythehemangioblast-likesequence16 (subsequentup-regulationof Cdh1 to Tal1 and Cdh5) at the precursor branch15 and the endothelial differentiationroute15onbranch2(elevatedlevelsofPecam1,ErgandSox17amongstothers).Further,we

.CC-BY-NC 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/041384doi: bioRxiv preprint

Page 3: Diffusion pseudotime robustly reconstructs lineage branching · diffusion-like random walks. This allows us to identify cells that undergo branching ... Pseudotimes are proxies for

findtheerythroid-likesequenceofEtv2,Tal1,Runx1andGata117atbranch1.Thetemporalresolution obtained by DPT indicates immediate (directly after branching, cf. Ikarosexpression in Fig. 1c) and late transitions (cf. Erg in Fig. 1c) as well as a number ofintermediate regulatory events15 until the onset ofHbb-bH1 expression (cf. Fig. 1d, blacktriangles),hintingatpotentialnovelregulatoryinteractions.

Figure1:Diffusion pseudotime reveals temporal ordering and cellular decisions on the single cell level. (a)Outline of the computational workflow. The diffusion transition matrix𝑻!" is constructed bysuperimposing local kernels at the expression levels of cells x and y (1). The diffusion pseudotimedpt(x,y)approximatesthegeodesicdistanceofxandyonthemappedmanifold(2).Branchingpointsareidentifiedaspointswhereanti-correlateddistancesfrombranchendsbecomecorrelated(3).(b)ApplicationofDPTtosingle-cellqPCRof42genesin3934singlecellsduringearlyhematopoiesis15,sorted from 5 different populations: primitive streak (PS), neural plate (NP), head fold (HF), foursomiteGFPnegative(4SG-),foursomiteGFPpositive(4SG+).DPTidentifiesonebranchingpoint.(c)Exemplary dynamics of genes Erg and Ikaros show qualitatively different behavior in the twobranches,blacklinesdescribeamovingaverageover50adjacentcellsalongtherespectivebranch.(d) Heatmap of gene expression, with cells ordered by diffusion pseudotime and genes orderedaccording to theonsetof firstmajor change inexpression (seeSupplementary Sec. 7.2, smoothedalong50adjacentcells,seeSupplementaryFig.5fornon-smoothedversion).Barsontopindicatethecells’ population (b) and cell density, respectively,with high density regions indicatingmetastablestates. The time series were clustered temporally by the time point of the first transition event(precursorbranch,branch1andnone,respectively,SupplementaryFig.6).Thepiecharts(bottom)showthefractionofcellsmakingupthemetastablestates(blackhorizontalline).

DC1

DC2

0

5

10

15

20

25

30

35

40

45

DC1

DC2

0

10

20

30

40

50

60

70

80

90

100

B1 B2

B1 B2

M

branch 1 branch 2

higher probability

lower probability

x

yx

y

a1 2

x

ydiffusion pseudotime:

scale-free average over random walks

3x

y

d(x,_)

d(y,_)

b d

decis

ion st

age

termina

l bran

ch 1

termina

l bran

ch 2

diffusion pseudotime

population indexcell density

gene expression

gene dynamics

not observed

activationdeactivation

c

root

DPTDPT

precu

rsor s

tage

construction of transition matrix

branching point identification

branch 2branch 1

cell state composition

correlated anti-correlatedorder vs :d(x,_) d(y,_)

z

0 1000 2000 3000order in DPT

-10

-8

-6

-4

-2

0

Ikar

os

0 1000 2000 3000order in DPT

-10

-8

-6

-4

-2

0

Erg

Ikar

os

Erg

DPT order DPT order

sorted populations

.CC-BY-NC 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/041384doi: bioRxiv preprint

Page 4: Diffusion pseudotime robustly reconstructs lineage branching · diffusion-like random walks. This allows us to identify cells that undergo branching ... Pseudotimes are proxies for

DPT identified regions of small time-steps i.e. of high cell density (Fig. 1d, top andSupplementaryFig.7b)alongthedifferentiationprocess.Thesehigh-densityregionsindicatemetastable states, which correspond to biologically meaningful intermediates: We foundfourmetastablestateswithexpressionpatternsofprecursorcells,hemangioblast-likecellsat the decision state, erythroid-like and endothelial-like cells. Notably, both decision andprecursor states consist of cellmixtures from two or three different stages, stressing theasynchrony of developmental stages that could not be resolved without pseudotemporalordering.Toidentifykeydecisiongenes,wequantifiedexpressiondifferencesintheidentifiedstatesof decision versus precursor usingMAST18 (Supplementary Fig. 8a). This resulted inmorethan50%ofchangedgenes(27outof42),includingPecam1andCbfa2t3h,whichareknowntoindicatehematopoieticandendothelialdevelopment16,respectively.Incontrast,only24genesaredifferentiallyexpressedbetweensortedcellsfromheadfoldandprimitivestreak,all changing gradually but preserving bimodal distribution (Supplementary Fig. 8d andSupplementaryTable1).Also,differentialgeneexpressionbetweenHFand4SG-cellsfailstoidentifyendothelialdifferentiationbutbringsuperythroidfactors(Runx1, IkarosandGfi1bamongstothers,seeSupplementaryFig.8eandSupplementaryTable2).Insummarywhencomparing differentially expressed genes between metastable states, we identify moregenes thancomparingdevelopmental stages,andthegeneshave lessbimodalexpression.This shows that the anatomical stages overstate developmental heterogeneity thusdisguisingtheroleofkeyfactors.DPT copes well with large-scale experiments such as scRNA-seq combined with dropletbarcoding10: In the experiment, Klein et al. monitored the transcriptomic profiles andheterogeneity indifferentiationofmouseES cells after LIFwithdrawal (Fig.2a).After cell-cyclenormalization (SupplementaryFigs.9-10),DPTdescribesa singledifferentiationpathfrom which two populations branch off. With increasing pseudotime, we observeupregulatedepiblastmarkers(Krt8/18/19)anddownregulatedpluripotencyfactors(Nanog,Fig. 2b). Clusteringof the geneexpressiondynamics identified four clusterswithdifferenttemporalbehaviors(Fig.2candSupplementaryFig.11)butcoherentbiologicalfunctions(Fig.2d).Earlypseudotimecoincideswithday0cellsexhibitingstrongexpressionofpluripotencyfactors (purple cluster). Then, a small subpopulation (57 cells)mainly consisting of day 2cellsbranchesoff,enrichedinneuron-associatedgenes(Bc1,Lin7b,Snord64,Tagln3,Dtnbp1,Nenf; 6 out of 22, see Supplementary Fig. 12). Subsequent stages are characterized by agradualdecreaseofpluripotencyfactorsandslowriseofbothprimitiveendodermmarkers(yellow cluster) and epiblast markers (orange cluster)19. In late pseudotime, another twobranchingeventgivesrise toapopulationwith increasedprimitiveendodermmarkers (21cells),whereasepiblastmarkergenesrisetwo-tothree-foldintheotherbranch(120cells).Altogether, DPT is able to remove asynchronity of scRNA-seq snapshot data from severaldays,aligningcellsintermsoftheirdegreeofdifferentiation.ToevaluateDPT’sperformancewithoutbranching compared toWanderlustandMonocle,we counted how often a pseudotime puts a cell from a later temporal sorting before anearlierone(measuredbyKendallrankcorrelation𝜏).DPTreconstructsthetemporalordersof ESC differentiation with significantly higher accuracy than Wanderlust (𝜏= 0.78±10-3versus0.71±10-3,respectively,Fig.2eandSupplementaryFig.13).Thisholdstruealsowhen

.CC-BY-NC 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/041384doi: bioRxiv preprint

Page 5: Diffusion pseudotime robustly reconstructs lineage branching · diffusion-like random walks. This allows us to identify cells that undergo branching ... Pseudotimes are proxies for

compared to Monocle in repeated bootstrap runs and on other data sets (Fig. 2f andSupplementaryTable3).

Figure2:Diffusionpseudotime identifiesdifferentiationdynamics indroplet-basedscRNA-seqexperiments10.(a)MouseESCsafter LIFwithdrawalwereharvestedat T=0,2, 4 and7daysandprofiledwith thedropSeqprotocol,giving2717cellswith24175observeduniquetranscripts10.Aftercorrectionforcellcyclevariation,alowdimensionalvisualizationusingdiffusionmapsshowsoverlappingbutdirectedtemporal dynamics. (b) Temporal dynamics of selected genes as in the original publicationreconstructedbyDPT,relativetoexpressionatinitialtimepoint.(c)DPTidentifiesamaintimeaxiswith twominor branching events: an early side branch and late separation of cells enrichedwithmarkersforepiblastsandprimitiveendoderm(top).PopulationandcelldensityareshownasinFig.1d. The heatmapdepicts genedynamics after hierarchical clustering and removal of a fluctuating,mainlyconstantsubgroup (cf.SupplementaryFig.11b):Thedynamicsubgroups (indicatedbycolorbar,right)consistofepiblastmarkerssuchasKrt8/18/19,Sfn,Tagln(orange),gradualdownregulatedpluripotency factors such as Pou5f1 (Oct4), Sox2, Trim28, Nanog (purple) and slow consistentupregulatedprimitiveendodermmarkerssuchasCol4a1/2,Lama1/b1,Serpinh1,Sparc(yellow).(d)Geneontologyenrichmentshowsacellularreorganizationsignature(orange),ametabolicsignatureconsistent for differentiation (purple) and a cell motility signature (yellow). (e) Pseudotimedistribution of cells in the experiments from the four different days, for DPT and Wanderlust.Diffusionpseudotimeorderscellswellalongthe four temporalcategories (Kendall rankcorrelation0.78±10-3),significantlybetterthanpseudotemporalorderingbyWanderlust(Kendalrankcorrelation0.71±10-3,seealsoSupplementaryFig.13).(f)ComparisonsofKendallrankcorrelationonbootstrapsamples (n=100bootstrapruns,downsamplingtomaximal1800cells forWanderlustandDPT,700cellsforMonocleduetoperformanceissues)forthepresentedESCdataset,theqPCRdatasetfromfigure 1 and an scRNA-seq of differentiating myoblasts from the Monocle paper5 show that DPTconsistentlyoutperformstheothertwomethods(2-sidedt-testwithsignificancelevels *p<0.05;**p<0.01;***p<0.001,n.s.notsignificant).

DC1DC2

DC3

T=0d T=2d T=4d T=7d

amESCs

-LIF

e

0

0.2

0.4

0.6

0.8

ESC dropseq

early blood qPCR

myoblast RNA-seq

DPT

Wande

rlust

Monoc

le

Monoc

le

Monoc

le

Wande

rlust

Wande

rlust

DPTDPT

conc

orda

nce

with

tim

e la

bels

f

cdiffusion pseudotime

population indexcell density

0.2 0.3 0.4 0.5 0.6 0.7 0.80.9 1Wanderlust pseudotime

0d

2d

4d

7d

time

lab

els

0.05 0.1 0.5Diffusion pseudotime

0d

2d

4d

7d

time

lab

els

d

b

0 0.1 0.2 0.3 0.4DPT

1

1.5

2

2.5

3

3.5

expressio

n

Pou5f1Dppa5aZfp42Ccnd3NanogEsrrbSox2Otx2ActbKrt8

Krt8

Actb

Otx2

Nanog

diffusion pseudotime (main branch)

expr

essi

on re

lativ

e to

t=0

*** ***

******

** n.s.

.CC-BY-NC 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/041384doi: bioRxiv preprint

Page 6: Diffusion pseudotime robustly reconstructs lineage branching · diffusion-like random walks. This allows us to identify cells that undergo branching ... Pseudotimes are proxies for

Inconclusion,weintroduceDPTasapseudotimemeasurethatovercomesthedeficienciesofexistingapproaches:itisabletodealwithbranchinglineagesandidentifiesmetastableorsteadystates,itisstatisticallyrobust,anditscomputationcanbescaledtolargedatasetswithoutdimensionreduction.ComparedtoWanderlust6,whichhasbeenproposedforthelower-dimensionalmasscytometrydata,wereplacedapproximateandcomputationallycostlysamplingofshortestpathsbytheexactandcomputationallycheapaverageoverrandomwalksineq.(1).ComparedtoMonocle5,whichworksonRNA-seqdatabutonlyafterdimensionreductionandonmediumsamplenumbers,DPT’saverageoverallrandomwalksissignificantlymorerobustthanMonocle’sminimumspanningtreeapproach(Fig.2e).In the future, robust computation of pseudotimes will allow inferring regulatoryrelationships with much higher confidence than based on perturbations alone15, and weexpectDPTtoallowscalingthistogenome-widemodels.Recentlypseudotemporalorderinghas been applied to cell morphology to identify cell cycle states20 – here diffusionpseudotimewouldallowinclusionofbranchingforexampletoidentifycellsswitchingintoaquiescent state as well as comparison to time-lapse microscopy via universal time. Tosummarize, diffusion pseudotime provides a powerful and robust tool to order cellsaccordingtotheirstateondifferentiationtrajectoriesinsingle-celltranscriptomicsstudies.References1. Shalek,A.K.etal.Single-celltranscriptomicsrevealsbimodalityinexpressionand

splicinginimmunecells.Nature498,236–240(2013).2. Moignard,V.etal.Characterizationoftranscriptionalnetworksinbloodstemand

progenitorcellsusinghigh-throughputsingle-cellgeneexpressionanalysis.NatCellBiol15,363–372(2013).

3. Treutlein,B.etal.Reconstructinglineagehierarchiesofthedistallungepitheliumusingsingle-cellRNA-seq.Nature509,371–375(2014).

4. Magwene,P.M.,Lizardi,P.&Kim,J.Reconstructingthetemporalorderingofbiologicalsamplesusingmicroarraydata.Bioinformatics19,842–850(2003).

5. Trapnell,C.etal.Thedynamicsandregulatorsofcellfatedecisionsarerevealedbypseudotemporalorderingofsinglecells.NatBiotechnol32,381–386(2014).

6. Bendall,S.C.etal.Single-celltrajectorydetectionuncoversprogressionandregulatorycoordinationinhumanBcelldevelopment.Cell157,714–725(2014).

7. Islam,S.etal.Characterizationofthesingle-celltranscriptionallandscapebyhighlymultiplexRNA-seq.GenomeResearch21,1160–1167(2011).

8. Trapnell,C.Definingcelltypesandstateswithsingle-cellgenomics.GenomeResearch25,1491–1498(2015).

9. Macosko,E.Z.etal.HighlyParallelGenome-wideExpressionProfilingofIndividualCellsUsingNanoliterDroplets.Cell161,1202–1214(2015).

10. Klein,A.M.etal.Dropletbarcodingforsingle-celltranscriptomicsappliedtoembryonicstemcells.Cell161,1187–1201(2015).

11. Coifman,R.R.etal.Geometricdiffusionsasatoolforharmonicanalysisandstructuredefinitionofdata:diffusionmaps.Proc.Natl.Acad.Sci.U.S.A.102,7426–7431(2005).

12. Qiu,P.etal.Extractingacellularhierarchyfromhigh-dimensionalcytometrydatawithSPADE.NatBiotechnol29,886–891(2011).

13. Haghverdi,L.,Buettner,F.&Theis,F.J.Diffusionmapsforhigh-dimensionalsingle-cellanalysisofdifferentiationdata.Bioinformatics31,2989–2998(2015).

14. Angerer,P.etal.destiny-diffusionmapsforlarge-scalesingle-celldatainR.Bioinformaticsbtv715(2015).doi:10.1093/bioinformatics/btv715

.CC-BY-NC 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/041384doi: bioRxiv preprint

Page 7: Diffusion pseudotime robustly reconstructs lineage branching · diffusion-like random walks. This allows us to identify cells that undergo branching ... Pseudotimes are proxies for

15. Moignard,V.etal.Decodingtheregulatorynetworkofearlyblooddevelopmentfromsingle-cellgeneexpressionmeasurements.NatBiotechnol33,269–276(2015).

16. Huber,T.L.,Kouskoff,V.,Fehling,H.J.,Palis,J.&Keller,G.Haemangioblastcommitmentisinitiatedintheprimitivestreakofthemouseembryo.Nature432,625–630(2004).

17. Costa,G.,Kouskoff,V.&Lacaud,G.OriginofbloodcellsandHSCproductionintheembryo.TrendsinImmunology33,215–223(2012).

18. Finak,G.etal.MAST:aflexiblestatisticalframeworkforassessingtranscriptionalchangesandcharacterizingheterogeneityinsingle-cellRNAsequencingdata.GenomeBiology16,278(2015).

19. Nishikawa,S.I.,Jakt,L.M.&Era,T.Embryonicstem-cellcultureasatoolfordevelopmentalcellbiology.NaturereviewsMolecularcellbiology8,502–507(2007).

20. Gut,G.,Tadmor,M.D.,Pe’er,D.,Pelkmans,L.&Liberali,P.Trajectoriesofcell-cycleprogressionfromfixedcellpopulations.NatMeth12,951–954(2015).

MethodsMethodsandanyassociatedreferencesareavailableintheonlineversionofthepaper.AccessioncodesAMATLABimplementationofDPTisavailableonhttp://www.helmholtz-muenchen.de/icb/dpt.AuthorcontributionsL.H.developedthemethodandthecomputationaltools,performedtheanalysisandwrotethepaper.M.B.contributedtotheanalysisofresults.F.B.andF.A.W.helpedinterprettheresultsandwritethesupplement.F.J.T.conceivedandsupervisedthestudy,contributedtothemethoddevelopmentandwrotethepaperwithhelpfromallco-authors.AcknowledgementsWewouldliketoacknowledgeCarstenMarr,JanHasenauer,MatthiasHeinig,JanKrumsiekandThomasBlasifortheirhelpfuladviceandcommentsonthemanuscript.M.B. is supported by a DFG Fellowship through the Graduate School of QuantitativeBiosciences Munich (QBM). F.A.W. acknowledges support by the “Helmholtz PostdocProgramme”,InitiativeandNetworkingFundoftheHelmholtzAssociation.F.B.issupportedbytheUKMedicalResearchCouncil(MRC)viaaCareerDevelopmentAward(Biostatistics).F.J.T. acknowledges financial support by the German Science Foundation (SFB 1243 andGraduateSchoolQBM)aswellasbytheBavariangovernment(BioSysNet).CompetingfinancialinterestsTheauthorsdeclarenocompetingfinancialinterests.

.CC-BY-NC 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/041384doi: bioRxiv preprint

Page 8: Diffusion pseudotime robustly reconstructs lineage branching · diffusion-like random walks. This allows us to identify cells that undergo branching ... Pseudotimes are proxies for

OnlineMethods

OverviewofDPTalgorithm0) (Initialization)inputsthefollowing:

a. ThenbyGdatamatrixb. One(orseveral)rootcell(s).c. Diffusionmapsoptions“classic”or“locallyscaled”andrespectivelythe

parameters“θ”(kernelwidth)or“κ"(numberofnearestneighboursforadjustingthekernelwidth).

1) ComputesthetransitionmatrixT.2) BuildstheaccumulatedtransitionmatrixMandcomputesdiffusionpseudotimewith

respecttothespecifiedroot.Ifseveralrootsaredefined,DPTaveragesthepseudotimeforeachcellyovertheseroots.

3) DPTiterativelyassignscellstobranchesandsubbranches.DPTgroupsthecellsforeachbranchandprovidesdiffusionpseudotimeforeachgroup.

DiffusionpseudotimeWecalculate thediffusionmaps transitionmatrixT and its rightand lefteigen-vectorsψ!and𝜑! .Itthencomputestheaccumulatedtransitionprobabilitiesoverallnumbersoftimesteps.

M = 𝑻! = (𝐼 − 𝑻)!! − 𝐼 where 𝑻 = T - ψ!𝜑!! . !

!!!

This isdone relative to thesteadystate𝜑! ,whichstoresno informationaboutdynamics.Fixingaknownrootcellxasstartofthedynamicalprocessofinterest,DiffusionpseudotimeofcellyisdefinedasadensityweightedL!norm

dpt(𝑥,𝑦) = 𝑴 𝑥, . −𝑴 𝑦, . ! !!.

FurtherdetailsaregiveninSupplementarySec.3.BranchassignmentWefindthecellywiththemaximaldptdistancefromtheroot(s)xandalsoanothercellzwhichhasmaximaldistancetoxandy:

z = argmax!! dpt 𝑧′, 𝑥 + dpt 𝑧′,𝑦 .

Ifthemanifoldisbranching,thenasdefinedyandzwillprovidecellsattwodifferenttipsoftwobranches.DPT thenobtains twoorderingsOy=dpt(.,y) andOz=dpt(.,z) anddetermines the cutoff celluntilwhichthesequenceoforderedcellsinOx(callthemXi),OyandOzbecomemaximallycorrelatedusingKendall’srankcorrelation.DPTthusassignscellsXitothebranchofx.DPT treats y and z as root of the subbranchesYi and Zi respectively and in a similarwaysearches for new subbranches within each branch. Further details are provided inSupplementarySec.4.

.CC-BY-NC 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/041384doi: bioRxiv preprint

Page 9: Diffusion pseudotime robustly reconstructs lineage branching · diffusion-like random walks. This allows us to identify cells that undergo branching ... Pseudotimes are proxies for

MetastablestatesThe pseudotemporal ordering of cell populations reflects gradual and switch-like changesalongacertainbranch.HighlysimilarcellshavesmalldistanceinthegenespaceandahighprobabilitytobereachedbyarandomwalkasdefinedbythetransitionmatrixT.Then,thedifferenceinpseudotimebetweensuchcellsissmall,i.e.,thedensityofthedistancetotheroot cell dpt(r,.) increases at sites, where highly similar cells are found. In particular,developmental steady states have high densities in the pseudotime measure, but theseaccumulationsitesarenotsufficienttodepictasteadystate.However,theseaccumulationsitesarenotsufficienttodepictasteadystate. DetectingtranscriptionalchangesTo identify the succession of switch-like transcriptional changes revealed by thepseudotemporal order in qPCR data, we computed an approximate derivative of thesmoothed gene expression level along branch 1. A switch-like change is defined as themaximuminthederivative(detailsinSupplementarySec.7.2). DifferentialexpressionanalysisWeemployedatwo-part,generalizedlinearmodelthatallowstoquantifytheproportionofcells expressing a certain gene as well as the mean expression level, a modified Hurdlemodel1. Briefly, the model has two parts: A discrete part to decide whether a gene isexpressedandacontinuouspartusinganormaldistributionifthegeneisexpressed.Then,a likelihood ratio test is used to identify differentially expressed genes (details inSupplementarySecs.5and7.3.andFinaketal1). ESCqPCRDataWe reanalyzed a single-cell qPCR dataset (normalized version with 3934 cells, 42 genes)focusingonearlyblooddevelopment2.Foreachgene,thelimitofdetection(LOD)wastheaverageCtvalueforthelastdilutionatwhichallsixreplicateshadpositiveamplification.TheoverallLODof25forthegenesetwasthemedianCtvalueacrossallgenes.RawCtvaluesand normalized data can be found in Supplementary Table 7 of Moignard et al2. Geneexpressionwassubtractedfromthelimitofdetectionandnormalizedonacell-wisebasistothemean expression of the four housekeeping genes (Eif2b1,Mrpl19,Polr2a andUbc) ineach cell. Cells that did not express all four housekeeping genes were excluded fromsubsequentanalysis,aswerecellsforwhichthemeanofthefourhousekeeperswas±3s.d.from themean of all cells. A dCt value of −14was then assignedwhere a genewas notdetected.85–90%ofsortedcellswereretained for furtheranalysis.Gata2didnotamplifycorrectly andHoxB3was not expressed in any cells, so these factors have been excludedfromtheanalysis.TheanalysesweredoneonthedCtvaluesforalltranscriptionfactorsandmarkergenes,butnothousekeepinggenes.DropSeqdataWereanalyzedasingle-cellRNA-seqdatasetusingthedropSeqprotocol fromKleinetal3.Here,singlecellsalongwithasetofuniquelybarcodedprimerswerecaptureintinydropletsandsequenced.Thecapabilitiesof this techniqueweredemonstratedusinganundirecteddifferentiationprocessofmouseembryonicstemcellsuponleukemiainhibitoryfactor(LIF)withdrawal.Thedata set ispubliclyavailableunder theGEOaccessionnumberGSE65525.Countdatawerenormalizedby librarysizeand log10transformed(seeSupplementarySec.8.1).We corrected for cell-cycle and batch effects using scLVM4 on 2044 highly variable

.CC-BY-NC 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/041384doi: bioRxiv preprint

Page 10: Diffusion pseudotime robustly reconstructs lineage branching · diffusion-like random walks. This allows us to identify cells that undergo branching ... Pseudotimes are proxies for

genes (see Supplementary Table 3 in Klein et al3). Then, diffusionmapwith local densityrescaling (Supplementary Sec. 2) visualizes the temporal order for all cells. Hierarchicalclustering was performed in R (http://www.r-project.org/) using the hclust package onquantile-normalized data (Supplementary Sec. 8.2) and displayed with ComplexHeatmappackage, where the distance was defined as 1 – correlation between all samples(SupplementarySec.8.3).Inaddition,weperformedaranksumstestonthefirstsidebranchtoidentifygenesbeinguniquelydifferentfrominitialpluripotentandlateepiblast-likecells(SupplementarySec.8.4).ConcordanceofpseudotimewithtimelabelsWesubsampledsetsof ~70%ofdataand foreachsetperformedWanderlust5,Monocle6andDPTpseudotimeorderings. SinceMonocledoesnotperformonvery largenumberofcells(>103),wereducedthesubsamplingto700cellswhennecessary.Theconcordanceforeach subsetwas thenmeasured as Kendall tau correlation of each pseudotimewith timelabels of that subset. We then performed a t-test and calculated p-values between thehistogramoftheconcordancemeasureforWanderlustandMonoclecomparedtotheDPTpseudotimes.TheresultisshowninFig.2fofthemaintextandSupplementaryTable3. References

1. Finak,G.etal.MAST:aflexiblestatisticalframeworkforassessingtranscriptionalchangesandcharacterizingheterogeneityinsingle-cellRNAsequencingdata.GenomeBiology16,278(2015).

2. Moignard,V.etal.Decodingtheregulatorynetworkofearlyblooddevelopmentfromsingle-cellgeneexpressionmeasurements.NatBiotechnol33,269–276(2015).

3. Klein,A.M.etal.Dropletbarcodingforsingle-celltranscriptomicsappliedtoembryonicstemcells.Cell161,1187–1201(2015).

4. Buettner,F.etal.Computationalanalysisofcell-to-cellheterogeneityinsingle-cellRNA-sequencingdatarevealshiddensubpopulationsofcells.NatBiotechnol33,155–160(2015).

5. Bendall,S.C.etal.Single-celltrajectorydetectionuncoversprogressionandregulatorycoordinationinhumanBcelldevelopment.Cell157,714–725(2014).

6. Trapnell,C.etal.Thedynamicsandregulatorsofcellfatedecisionsarerevealedbypseudotemporalorderingofsinglecells.NatBiotechnol32,381–386(2014).

.CC-BY-NC 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/041384doi: bioRxiv preprint