prosstt.simulation module

This module contains all the functions that produce simulations: the simulation of expression programs, the coefficients that map expression programs to genes, and different sampling strategies for (pseudotime, branch) pairs.

prosstt.simulation.cover_whole_tree(tree)

Get all the pseudotime/branch pairs that are possible in the lineage tree.

Parameters:tree (Tree) – A lineage tree
Returns:
  • pseudotime (ndarray) – Pseudotime values of all positions in the lineage tree
  • branches (ndarray) – Branch assignments of all positions in the lineage tree
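
To illustrate what "all pseudotime/branch pairs" means, the sketch below enumerates them for a toy tree. The dictionary layout (branch name mapped to a start and end pseudotime) and the helper enumerate_positions are hypothetical; they are not the Tree API and not how cover_whole_tree is implemented.

```python
def enumerate_positions(branch_times):
    """List every (pseudotime, branch) pair of a toy lineage tree.

    branch_times maps a branch name to its (start, end) pseudotime,
    both inclusive. This layout is an assumption made for the example.
    """
    pseudotime, branches = [], []
    for branch, (start, end) in branch_times.items():
        for t in range(start, end + 1):
            pseudotime.append(t)
            branches.append(branch)
    return pseudotime, branches


# Toy bifurcation: branch "A" splits into "B" and "C" after pseudotime 2.
toy_tree = {"A": (0, 2), "B": (3, 5), "C": (3, 5)}
pseudotime, branches = enumerate_positions(toy_tree)
```

Parallel branches share pseudotime values, which is why a position is only identified by the pair and not by the pseudotime alone.
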
prosstt.simulation.diffusion(steps)

Diffusion process with momentum term. Returns a random walk with values usually between 0 and 1.

Parameters:steps (int) – The length of the diffusion process.
Returns:walk – A diffusion process with a specified number of steps.
Return type:float array
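
A minimal sketch of a random walk with a momentum term, to show the general idea. The function momentum_walk, its parameters, and their default values are assumptions made for the example and may differ from what diffusion does internally.

```python
import random


def momentum_walk(steps, momentum=0.9, step_std=0.01, start=0.5, seed=None):
    """Random walk whose velocity carries over between steps.

    All parameter names and defaults are illustrative assumptions.
    Expects steps >= 1.
    """
    rng = random.Random(seed)
    walk = [start]
    velocity = 0.0
    for _ in range(steps - 1):
        # Keep part of the previous velocity, then add Gaussian noise.
        velocity = momentum * velocity + rng.gauss(0.0, step_std)
        walk.append(walk[-1] + velocity)
    return walk


walk = momentum_walk(steps=50, seed=0)
```

The momentum term makes consecutive steps tend to point in the same direction, so the walk looks smoother than one built from independent steps.
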
prosstt.simulation.draw_counts(tree, pseudotime, branches, scalings, alpha, beta)

For all the cells in the lineage tree described by a given pseudotime and branch assignment, sample UMI count values for all genes. Each cell is an expression vector; the combination of all cell vectors builds the expression matrix.

Parameters:
  • tree (Tree) – A lineage tree
  • pseudotime (ndarray) – Pseudotime values for all cells to be sampled
  • branches (ndarray) – Branch assignments for all cells to be sampled
  • scalings (ndarray) – Library size scaling factor for all cells to be sampled
  • alpha (float or ndarray) – Parameter for the count-drawing distribution. Float if it is the same for all genes, else an ndarray
  • beta (float or ndarray) – Parameter for the count-drawing distribution. Float if it is the same for all genes, else an ndarray
Returns:

expr_matrix – Expression matrix of the differentiation

Return type:

ndarray

prosstt.simulation.draw_times(timepoint, no_cells, max_time, var=4)

Draw cell pseudotimes around a certain sample time point, under the assumption that in an asynchronously differentiating population cells are normally distributed around that time point. The variance of the normal distribution controls the speed of differentiation (high spread: transient state/fast differentiation, low spread: bottleneck/slow differentiation).

Parameters:
  • timepoint (int) – The pseudotime point that represents the current mean differentiation stage of the population.
  • no_cells (int) – How many cells to sample.
  • max_time (int) – The duration of the differentiation. All time points that exceed it will be mapped to the end of the differentiation.
  • var (float, optional) – Variance of the normal distribution we use to draw pseudotime points. In the experiment metaphor this parameter controls synchronicity.
Returns:

sample_pt – Pseudotime points around <timepoint>.

Return type:

int array
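
The sampling described above can be sketched as follows. This is a sketch under stated assumptions, not the library code: draw_times_sketch is a hypothetical name, and both the rounding to integers and the clamping at 0 are assumptions (the text only states that values beyond the end of the differentiation are mapped to its end).

```python
import math
import random


def draw_times_sketch(timepoint, no_cells, max_time, var=4, seed=None):
    """Draw integer pseudotimes around timepoint, kept within [0, max_time]."""
    rng = random.Random(seed)
    std = math.sqrt(var)  # random.gauss expects a standard deviation
    sample_pt = []
    for _ in range(no_cells):
        t = int(round(rng.gauss(timepoint, std)))
        # Values past the end map to the end; clamping at 0 is an assumption.
        sample_pt.append(min(max(t, 0), max_time))
    return sample_pt


sample_pt = draw_times_sketch(timepoint=48, no_cells=200, max_time=50, seed=1)
```

With timepoint close to max_time, as in this call, part of the normal distribution falls beyond the end of the differentiation and those cells pile up at max_time.
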

prosstt.simulation.sample_density(tree, no_cells, alpha=0.3, beta=2, scale=True, scale_v=0.7)

Use cell density along the lineage tree to sample pseudotime/branch pairs for the expression matrix.

Parameters:
  • tree (Tree) – A lineage tree
  • no_cells (int) – Number of cells to sample
  • alpha (float or ndarray, optional) – Parameter for the count-drawing distribution. Float if it is the same for all genes, else an ndarray
  • beta (float or ndarray, optional) – Parameter for the count-drawing distribution. Float if it is the same for all genes, else an ndarray
  • scale (bool, optional) – Apply cell-specific library size factor to average gene expression
  • scale_v (float, optional) – Variance for the drawing of scaling factors (library size) for each cell
Returns:

  • expr_matrix (ndarray) – Expression matrix of the differentiation
  • sample_pt (ndarray) – Pseudotime values of the sampled cells
  • branches (ndarray) – The branch to which each simulated cell belongs
  • scalings (ndarray) – Library size scaling factor for each cell
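
Density-based sampling of pseudotime/branch pairs can be pictured as weighted sampling with replacement. In the sketch below, sample_by_density, the list of positions, and the density values are all hypothetical; how sample_density obtains the density from the tree is not shown here.

```python
import random


def sample_by_density(positions, density, no_cells, seed=None):
    """Draw no_cells (pseudotime, branch) pairs with probability
    proportional to the density at each position."""
    rng = random.Random(seed)
    return rng.choices(positions, weights=density, k=no_cells)


positions = [(0, "A"), (1, "A"), (2, "B"), (2, "C")]
density = [1.0, 1.0, 4.0, 0.0]  # hypothetical relative cell densities
sampled = sample_by_density(positions, density, no_cells=100, seed=2)
```

Positions with a higher density are drawn more often, and a position with density 0 is never drawn.
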

prosstt.simulation.sample_pseudotime_series(tree, cells, series_points, point_std, alpha=0.3, beta=2, scale=True, scale_v=0.7)

Simulate the expression matrix of a differentiation if the data came from a time series experiment.

Taking a sample from a culture of differentiating cells returns a mixture of cells at different stages of progress through differentiation (pseudotime). A time series experiment consists of sampling at multiple time points. This is simulated by drawing normally distributed pseudotime values around pseudotime sample points.

Parameters:
  • tree (Tree) – A lineage tree
  • cells (list or int) – If a list, then the number of cells to be sampled from each sample point. If an integer, then the total number of cells to be sampled (will be divided equally among all sample points)
  • series_points (list) – A list of the pseudotime sample points
  • point_std (list or float) – The standard deviation with which to sample around every sample point. Use a list for a different standard deviation at each time point.
  • alpha (float or ndarray, optional) – Parameter for the count-drawing distribution. Float if it is the same for all genes, else an ndarray
  • beta (float or ndarray, optional) – Parameter for the count-drawing distribution. Float if it is the same for all genes, else an ndarray
  • scale (bool, optional) – Apply cell-specific library size factor to average gene expression
  • scale_v (float, optional) – Variance for the drawing of scaling factors (library size) for each cell
Returns:

  • expr_matrix (ndarray) – Expression matrix of the differentiation
  • sample_pt (ndarray) – Pseudotime values of the sampled cells
  • branches (ndarray) – The branch to which each simulated cell belongs
  • scalings (ndarray) – Library size scaling factor for each cell
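
The handling of the cells and point_std arguments can be sketched as below. The helper series_pseudotimes is hypothetical, and dropping the remainder when an integer does not divide evenly is an assumption; the sketch only draws pseudotimes and does not assign branches or counts.

```python
import random


def series_pseudotimes(cells, series_points, point_std, seed=None):
    """Draw pseudotimes around each sample point of a time series."""
    rng = random.Random(seed)
    n_points = len(series_points)
    # An integer is split equally among sample points (remainder dropped,
    # which is an assumption); a list gives the count per sample point.
    if isinstance(cells, int):
        cells = [cells // n_points] * n_points
    # A single number is reused as the standard deviation of every point.
    if isinstance(point_std, (int, float)):
        point_std = [point_std] * n_points
    draws = []
    for n, mean, std in zip(cells, series_points, point_std):
        draws.append([rng.gauss(mean, std) for _ in range(n)])
    return draws


draws = series_pseudotimes(cells=90, series_points=[10, 25, 40],
                           point_std=2.0, seed=3)
```

Passing lists instead, for example cells=[5, 10] and point_std=[1.0, 3.0] with two sample points, gives each sample point its own cell count and spread.
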

prosstt.simulation.sample_whole_tree(tree, n_factor, alpha=0.3, beta=2, scale=True, scale_v=0.7)

Every possible pseudotime/branch pair on the lineage tree is sampled a number of times.

Parameters:
  • tree (Tree) – A lineage tree
  • n_factor (int) – How many times each pseudotime/branch combination can be present
  • alpha (float or ndarray, optional) – Parameter for the count-drawing distribution. Float if it is the same for all genes, else an ndarray
  • beta (float or ndarray, optional) – Parameter for the count-drawing distribution. Float if it is the same for all genes, else an ndarray
  • scale (bool, optional) – Apply cell-specific library size factor to average gene expression
  • scale_v (float, optional) – Variance for the drawing of scaling factors (library size) for each cell
Returns:

  • expr_matrix (ndarray) – Expression matrix of the differentiation
  • sample_pt (ndarray) – Pseudotime values of the sampled cells
  • branches (ndarray) – The branch to which each simulated cell belongs
  • scalings (ndarray) – Library size scaling factor for each cell

prosstt.simulation.sample_whole_tree_restricted(tree, alpha=0.2, beta=3, gene_loc=0.8, gene_s=1)

Bare-bones simulation where the lineage tree is simulated using default parameters. Branches are assigned randomly if multiple are possible.

Parameters:
  • tree (Tree) – A lineage tree
  • alpha (float, optional) – Average alpha value
  • beta (float, optional) – Average beta value
  • gene_loc (float, optional) – Mean of the log-normal distribution of base gene expression values
  • gene_s (float, optional) – Standard deviation of base gene expression value distribution (log-normal)
Returns:

  • expr_matrix (ndarray) – Expression matrix of the differentiation
  • sample_pt (ndarray) – Pseudotime values of the sampled cells
  • scalings (ndarray) – Library size scaling factor for each cell

prosstt.simulation.sim_expr_branch(branch_length, expr_progr, cutoff=0.2, max_loops=100)

Return expr_progr diffusion processes of length branch_length as a matrix W. The output of sim_expr_branch is complementary to _sim_coeff_beta.

W encodes how a group of expr_progr coexpressed genes will behave through differentiation time. This matrix describes one branch of a differentiation tree (between two branch points or between a branch point and an endpoint). W describes the module in terms of relative expression (from 0 to a small positive float, so from a gene not being expressed to a gene being expressed at 2x, 3x of its “normal” level).

After each new diffusion process is added, the function checks whether it correlates with any of the older ones. If the correlation is too high (above cutoff, 0.2 by default), the new diffusion process is replaced with another one, until one is found that does not correlate with any other column of W or no suitable replacement has been found after max_loops (100 by default) tries.

Finding no suitable replacement becomes more probable as the number of components grows. For a high number of components it might be advisable to change the maximum number of loops allowed or the cutoff in order to reduce runtime.

Parameters:
  • branch_length (int) – The length of the branch of the differentiation tree in pseudotime units
  • expr_progr (int) – The number of components/modules of coexpression that describe the differentiation in this branch of the tree
  • cutoff (float, optional) – Correlation above the cut-off will be considered too much. Should be between 0 and 1 but is not explicitly tested
  • max_loops (int, optional) – The maximum number of times the method will try simulating a new diffusion process that doesn’t correlate with any of the previous ones in W before resetting the matrix and starting over
Returns:

W – Output array

Return type:

ndarray
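
The accept/reject loop described above can be sketched as follows. This is an illustration under assumptions, not the library implementation: the helpers pearson, random_walk and uncorrelated_programs are hypothetical, a plain Gaussian random walk stands in for the diffusion process, and comparing the absolute value of the correlation against cutoff is an assumption.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def random_walk(length, rng):
    """Plain Gaussian random walk, a stand-in for the diffusion process."""
    walk, value = [], 0.0
    for _ in range(length):
        value += rng.gauss(0.0, 1.0)
        walk.append(value)
    return walk


def uncorrelated_programs(branch_length, expr_progr, cutoff=0.2,
                          max_loops=100, seed=None):
    """Collect expr_progr walks whose pairwise |correlation| is <= cutoff."""
    rng = random.Random(seed)
    programs = []
    while len(programs) < expr_progr:
        for _ in range(max_loops):
            candidate = random_walk(branch_length, rng)
            if all(abs(pearson(candidate, p)) <= cutoff for p in programs):
                programs.append(candidate)
                break
        else:
            # No acceptable candidate within max_loops tries: start over.
            programs = []
    return programs


W = uncorrelated_programs(branch_length=50, expr_progr=3, cutoff=0.2, seed=4)
```

Each accepted walk corresponds to one column of W. Every additional program has to pass the check against all earlier ones, which is why a stricter cutoff or a larger expr_progr increases the number of rejected candidates and resets.
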

prosstt.simulation.simulate_coefficients(tree, fallback_a=0.04, **kwargs)

H encodes how G genes are expressed by defining their membership to K expression modules (coded in a matrix W). H can be said to encode metagenes, as it contains the information about which genes are coexpressed (genes that belong to or are influenced by the same modules). The influence of a module on a gene is measured by a number between 0 and 1, drawn from a beta distribution (symmetric, if used with default values).

The result of simulate_coefficients is complementary to sim_expr_branch.

Parameters:
  • tree (Tree) – A lineage tree
  • a (float, optional) – Shape parameter of Gamma distribution or first shape parameter of Beta distribution
  • **kwargs (float) – Additional parameter (float b) if Beta distribution is to be used
Returns:

A sparse matrix of the contribution of K expression programs to G genes.
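
To make the relationship between modules and genes concrete, the sketch below draws a K x G coefficient matrix from a beta distribution and combines it with a toy program matrix. The helpers beta_coefficients and relative_expression are hypothetical, and combining W and H by a matrix product is an assumption about how the two complementary outputs fit together, not a statement about the library code.

```python
import random


def beta_coefficients(n_programs, n_genes, a, b, seed=None):
    """K x G matrix of module-to-gene weights drawn from Beta(a, b).

    The distribution is symmetric when a == b.
    """
    rng = random.Random(seed)
    return [[rng.betavariate(a, b) for _ in range(n_genes)]
            for _ in range(n_programs)]


def relative_expression(W, H):
    """Multiply W (T x K) by H (K x G) to get a T x G matrix."""
    n_genes = len(H[0])
    return [[sum(w_tk * H[k][g] for k, w_tk in enumerate(row))
             for g in range(n_genes)]
            for row in W]


# Two expression programs, three genes, symmetric beta distribution.
H = beta_coefficients(n_programs=2, n_genes=3, a=0.04, b=0.04, seed=5)

# Toy program matrix with three pseudotime points and two programs.
W_toy = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
rel = relative_expression(W_toy, H)
```

With shape parameters well below 1, most drawn weights lie close to 0 or close to 1, so each gene is either strongly influenced by a module or barely influenced at all.
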

prosstt.simulation.simulate_lineage(tree, rel_exp_cutoff=8, intra_branch_tol=0.5, inter_branch_tol=0, **kwargs)

Simulate gene expression for each point of the lineage tree (each possible pseudotime/branch combination). The simulation will try to make sure that a) gene expression programs within the same branch don’t correlate too heavily and b) gene expression programs in parallel branches diverge enough.

Parameters:
  • tree (Tree) – A lineage tree
  • rel_exp_cutoff (float, optional) – The log threshold for the maximum average expression before scaling. Recommended values are below 9 (exp(9) is about 8100).
  • intra_branch_tol (float, optional) – The threshold for correlation between expression programs in the same branch
  • inter_branch_tol (float, optional) – The threshold for anticorrelation between relative gene expression in parallel branches
  • **kwargs (various, optional) – Accepts parameters for coefficient simulation; float a if coefficients are generated by a Gamma distribution or floats a, b if the coefficients are generated by a Beta distribution
Returns:

  • rel_means (Series) – Relative mean expression for all genes on every lineage tree branch
  • programs (Series) – Relative expression for all expression programs on every branch of the lineage tree
  • coefficients (ndarray) – Array that contains the contribution weight of each expr. program for each gene