prosstt.sim_utils module¶

This module contains utility functions for the simulations, such as a printed progress bar, functions to perform quality checks, functions to create and assign groups, and functions to pick between alternatives when multiple options are possible.

prosstt.sim_utils.adjust_to_parent(relative_means, current, topology)¶

Adds a vector to the relative means of the current branch such that its first row is equal to the last row of the relative means of the branch that precedes it.

Parameters:	relative_means (Series) – Relative mean expression for all genes on every (currently available) lineage tree branch.
	current (int) – The current branch to be adjusted.
	topology (numpy.ndarray) – The tree topology in array format.
Returns:	res – Adjusted relative expression matrix for the current branch.
Return type:	numpy.ndarray
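
The adjustment described above can be sketched as follows. This is a minimal illustration rather than the prosstt implementation: it assumes each branch's relative means are a NumPy array of shape (time points, genes) stored in a dict keyed by branch, and that every topology row is a [parent, child] pair. The helper name shift_to_parent is hypothetical.

```python
import numpy as np

def shift_to_parent(relative_means, current, topology):
    """Shift a branch so that its first row equals its parent's last row."""
    topology = np.asarray(topology)
    # Assumption: each row of `topology` is [parent, child].
    parent = topology[topology[:, 1] == current][0, 0]
    # One offset per gene (column): parent's last row minus child's first row.
    offset = relative_means[parent][-1] - relative_means[current][0]
    # Broadcasting adds the same vector to every row of the current branch.
    return relative_means[current] + offset

means = {
    0: np.array([[1.0, 2.0], [3.0, 5.0]]),
    1: np.array([[0.0, 0.0], [1.0, -1.0]]),
}
adjusted = shift_to_parent(means, current=1, topology=[[0, 1]])
print(adjusted)
```

Adding a constant vector keeps the shape of the child's expression profile intact and only removes the jump at the branch point.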

prosstt.sim_utils.assign_branches(branch_times, timezone)¶

Assigns the possible branches to every timezone:

        -- T[1]------
-T[0]--|          -- T[3]------
        -- T[2]--|
                  -- T[4]-
timezones:
---0---|----1----|-2-|-3--|-4--

timezone	branch
0	0
1	1,2
2	1,3,4
3	3,4
4	3

A time point in timezone i can belong to any one of the k branches that are possible for that timezone.

Parameters:	branch_times (list of int lists) – The pseudotime at which branches start and end.
	timezone (int array) – Array that contains the timezone information for each pseudotime point.
Returns:	res – A list of the possible branches for each timezone.
Return type:	list of int lists

prosstt.sim_utils.belongs_to(timezone, branch)¶

Checks whether the start and end of a timezone are both contained within the pseudotime span of a branch.

Timezones are constructed such that they don’t go over branch boundaries. This method is used to determine which branches are possible for a timezone.

Parameters:	timezone (int array) – The pseudotime at which the timezone starts and ends.
	branch (int array) – The pseudotime at which the branch starts and ends.
Returns:	Whether the timezone is contained within the branch.
Return type:	bool
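
As an illustration of the containment test, and of how it yields the timezone table shown for assign_branches, here is a small sketch. The pseudotime boundaries are hypothetical values chosen to mirror the diagram, and contained_in is an illustrative helper, not the library function.

```python
def contained_in(timezone, branch):
    """True if the [start, end] of the timezone lies inside the branch."""
    return branch[0] <= timezone[0] and timezone[1] <= branch[1]

# Hypothetical [start, end] pseudotimes for branches T[0]..T[4].
branch_times = [[0, 6], [7, 20], [7, 16], [17, 30], [17, 25]]
# Hypothetical [start, end] pseudotimes for timezones 0..4.
timezones = [[0, 6], [7, 16], [17, 20], [21, 25], [26, 30]]

# For every timezone, collect the branches that fully contain it.
possible = [
    [b for b, times in enumerate(branch_times) if contained_in(tz, times)]
    for tz in timezones
]
for i, branches in enumerate(possible):
    print(i, branches)
```

With these boundaries the result is [[0], [1, 2], [1, 3, 4], [3, 4], [3]], matching the table.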

prosstt.sim_utils.bfs_finder(graph, start)¶

Perform a breadth-first search where the graph is a list of connections.

Parameters:

graph (numpy.ndarray) – A numpy array of shape (N, 2). Every row [a, b] describes a connection from branch a to branch b.
start (int) – The root node from which to start the traversal.

Returns:

output (numpy.ndarray) – The input graph sorted by breadth-first traversal order.
Originally answered by StackOverflow user https://stackoverflow.com/users/2988730 for question https://stackoverflow.com/questions/50589804.
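
A breadth-first ordering of an edge list can be sketched as below. This is not the StackOverflow answer referenced above; it is a simple queue-based version that assumes the graph is a tree (no cycles) and that each row is [parent, child]. The name bfs_order is hypothetical.

```python
from collections import deque

import numpy as np

def bfs_order(graph, start):
    """Return the rows of `graph` in breadth-first order from `start`."""
    graph = np.asarray(graph)
    ordered = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # All edges leaving the current node, in their original order.
        for a, b in graph[graph[:, 0] == node]:
            ordered.append([a, b])
            queue.append(b)
    return np.array(ordered)

edges = np.array([[1, 3], [0, 1], [1, 4], [0, 2]])
print(bfs_order(edges, 0))
```

For this input the edges leaving the root, [0, 1] and [0, 2], come first, followed by [1, 3] and [1, 4].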

prosstt.sim_utils.bifurc_adjust(child, parent)¶

Adjust the child matrix so that its first row equals the last row of the parent matrix.

Parameters:	child – The matrix to be adjusted.
	parent – The matrix to adjust to.

prosstt.sim_utils.breadth_first_branches(tree)¶

Performs a breadth-first traversal of the tree topology.

Parameters:	tree (Tree) – A lineage tree object.
Returns:	bfs – The tree branches in the order of traversal (breadth-first).
Return type:	list

prosstt.sim_utils.calc_relat_means(tree, programs, coefficients)¶

Calculate relative mean expression for a lineage tree given the expression programs and the coefficient matrix that contains the contribution of each expression program to each gene.

Parameters:	tree (Tree) – A lineage tree object.
	programs (Series) – Relative expression for all expression programs on every branch of the lineage tree.
	coefficients (numpy.ndarray) – Array that contains the contribution weight of each expression program for each gene.

prosstt.sim_utils.calc_scalings(cells, scale=True, scale_v=0.7)¶

Obtain library size factors for each cell.

Parameters:	cells (int) – The number of cells.
	scale (bool, optional) – Whether to simulate different scaling factors for each cell.
	scale_v (float, optional) – The standard deviation of the library size distribution (log-normal distribution around 0).
Returns:	scalings – A library size factor for each cell
Return type:	numpy.ndarray
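
One way to draw such factors is to exponentiate normally distributed values, which gives a log-normal sample whose logarithm is centred at 0. The sketch below assumes that scale=False simply returns a factor of 1 for every cell. That behaviour, the rng argument, and the name sample_scalings are assumptions made for illustration.

```python
import numpy as np

def sample_scalings(cells, scale=True, scale_v=0.7, rng=None):
    """Draw one library size factor per cell."""
    rng = np.random.default_rng() if rng is None else rng
    if not scale:
        # Assumption: without scaling, every cell gets a factor of 1.
        return np.ones(cells)
    # exp(normal) is log-normally distributed; the log has mean 0.
    return np.exp(rng.normal(loc=0.0, scale=scale_v, size=cells))

scalings = sample_scalings(5, rng=np.random.default_rng(0))
print(scalings)
```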

prosstt.sim_utils.create_groups(no_programs, no_genes)¶

Returns a list of the groups to which each gene belongs.

Each gene is assigned to two of the no_programs possible groups, chosen by random draws with replacement.

Parameters:	no_programs (int) – Number of modules.
	no_genes (int) – Number of genes.
Returns:	groups – A list of the two modules to which each gene belongs.
Return type:	list of ints
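
Drawing two groups per gene with replacement could look like the following sketch. The output layout (one row per gene, two columns) and the name draw_groups are assumptions; the library may store the result differently.

```python
import numpy as np

def draw_groups(no_programs, no_genes, rng=None):
    """Draw two program indices per gene, with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    # Two independent draws per gene; the same program may be drawn twice.
    return rng.integers(0, no_programs, size=(no_genes, 2))

groups = draw_groups(no_programs=5, no_genes=8, rng=np.random.default_rng(0))
print(groups)
```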

prosstt.sim_utils.diverging_parallel(branches, programs, genes, tol=0.5)¶

Calculate if the expression programs in all pairs of parallel branches are diverging enough to make the branches distinguishable from each other.

Parameters:	branches (list) – A list of pairs of parallel branches.
	programs (Series) – Relative expression for all expression programs on every branch of the lineage tree.
	genes (int) – The number of genes included in the lineage tree.
	tol (float, optional) – The fraction of genes that must have anticorrelated expression patterns over pseudotime for the branches to be considered diverging.
Returns:	diverging – An array of boolean values indicating whether each pair of parallel branches diverges.
Return type:	numpy.ndarray

prosstt.sim_utils.find_parallel(tree, programs, branch)¶

Find all branches that are parallel to the input branch, i.e. that share its parent branch.

Parameters:	tree (Tree) – A lineage tree object.
	programs (Series) – Relative expression for all expression programs on every branch of the lineage tree.
	branch (int) – The branch to examine.
Returns:	A list of branches that are parallel to the input branch, including the branch itself.
Return type:	list

prosstt.sim_utils.flat_order(n)¶

Maps the indices of a flat array of size n(n-1)/2 to positions in the strictly upper triangular part of an n × n matrix.

Parameters:	n (int) – Number of options to combine.
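
To make the mapping concrete, the sketch below lists the n(n-1)/2 positions above the diagonal for n = 4 using NumPy's triu_indices. The row-major ordering shown here is NumPy's; whether flat_order enumerates the pairs in the same order is an assumption.

```python
import numpy as np

n = 4
# k=1 skips the diagonal, leaving n(n-1)/2 positions.
rows, cols = np.triu_indices(n, k=1)
# Flat index i corresponds to the matrix position (rows[i], cols[i]).
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)
```

Each pair is one unordered combination of two distinct options out of n, so there are 6 of them for n = 4.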

prosstt.sim_utils.max_relat_exp(tree, relative_means)¶

Finds maximum relative gene expression for each gene along the lineage tree.

Parameters:	tree (Tree) – A lineage tree object.
	relative_means (Series) – Relative mean expression for all genes on every lineage tree branch.
Returns:	maxes – An array with the maximum relative expression of each gene along the lineage tree.
Return type:	numpy.ndarray

prosstt.sim_utils.pearson_between_programs(genes, prog1, prog2)¶

Calculate the Pearson correlation coefficient between two expression programs for all genes.

Parameters:	genes (int) – The number of genes in the lineage tree.
	prog1 (numpy.ndarray) – The first expression program.
	prog2 (numpy.ndarray) – The second expression program.
Returns:	pearson – The Pearson correlation coefficient for all genes in the two programs.
Return type:	numpy.ndarray
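
A per-gene Pearson coefficient, and the divergence check from diverging_parallel that builds on it, can be sketched as follows. The sketch assumes both inputs are arrays of shape (time points, genes) with no constant columns, and that "diverging" means the fraction of genes with a negative coefficient exceeds the tolerance. Both the decision rule and the name columnwise_pearson are assumptions.

```python
import numpy as np

def columnwise_pearson(prog1, prog2):
    """Pearson r between matching columns of two (time, genes) arrays."""
    a = prog1 - prog1.mean(axis=0)
    b = prog2 - prog2.mean(axis=0)
    denom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return (a * b).sum(axis=0) / denom

t = np.linspace(0.0, 1.0, 20)
prog1 = np.column_stack([t, t, t ** 2])
prog2 = np.column_stack([-t, 2 * t + 1, 1 - t ** 2])

pearson = columnwise_pearson(prog1, prog2)
# Fraction of genes whose expression moves in opposite directions.
anticorrelated = np.mean(pearson < 0)
# 0.5 plays the role of the `tol` parameter.
diverging = anticorrelated > 0.5
print(pearson, diverging)
```

Here the first and third genes are perfectly anticorrelated and the second is perfectly correlated, so two thirds of the genes are anticorrelated and the pair counts as diverging.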

prosstt.sim_utils.pick_branch(tree, pseudotime, timezones, assignments)¶

Picks one of the possible branches for a cell at a given time point.

Parameters:	tree (Tree) – A lineage tree object.
	pseudotime (int) – A pseudotime point.
	timezones (int array) – The pseudotimes at which the timezones start and end.
	assignments (int array) – A list of the possible branches for each timezone.
Returns:	branch – The branch to which the cell belongs.
Return type:	int

prosstt.sim_utils.pick_branches(tree, pseudotime)¶

Randomly pick a corresponding branch for a list of pseudotime values.

Parameters:	tree (Tree) – A lineage tree object.
	pseudotime (list) – A list of pseudotime values.
Returns:	branches – Branch assignments for each pseudotime value.
Return type:	list

prosstt.sim_utils.print_progress(iteration, total, prefix='', suffix='', decimals=1)¶

Call in a loop to create a terminal-friendly text progress bar. Contributed by Greenstick on stackoverflow.com/questions/3173320.

Parameters:	iteration (int) – Current iteration.
	total (int) – Total number of iterations.
	prefix (str, optional) – Prefix string before the progress bar.
	suffix (str, optional) – Suffix string after the progress bar.
	decimals (int, optional) – Number of decimal places shown in the completion percentage.

prosstt.sim_utils.process_timeseries_input(series_points, cells, point_std)¶

Process the input of sample_pseudotime_series to make everything the same shape.

Parameters:

series_points (list) – The pseudotime sample points for the time series experiment
cells (int or list) – Either the total number of cells to be sampled (in which case it is split equally among all sample points) or the number of cells to be sampled at each sample point
point_std (float or list) – Standard deviation of cell density around each sample point. If it is a float, the same value is used for every sample point.

Returns:

series_points (numpy.ndarray) – The pseudotime sample points for the time series experiment
cells (numpy.ndarray) – The cells to be sampled at each sample point of the time series experiment
point_std (numpy.ndarray) – The standard deviation of cell density around each sample point of the time series experiment
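
The shape normalization described above could be sketched like this. How the library handles a total that does not divide evenly is not stated, so the integer division below is an assumption, as is the name broadcast_series_input.

```python
import numpy as np

def broadcast_series_input(series_points, cells, point_std):
    """Expand scalar inputs so that all three arrays have one entry per point."""
    series_points = np.asarray(series_points)
    n = len(series_points)
    if np.isscalar(cells):
        # Assumption: a total is split evenly and any remainder is dropped.
        cells = np.full(n, cells // n)
    else:
        cells = np.asarray(cells)
    if np.isscalar(point_std):
        # A single standard deviation is reused for every sample point.
        point_std = np.full(n, point_std)
    else:
        point_std = np.asarray(point_std)
    return series_points, cells, point_std

points, cells, stds = broadcast_series_input([0, 10, 20], 300, 2.5)
print(points, cells, stds)
```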

prosstt.sim_utils.random_partition(k, iterable)¶

Randomly partitions an iterable into k groups of almost equal size.

Parameters:

k (int) – How many partitions to create.
iterable (array) – The iterable to be partitioned.

Returns:

results (list of int lists) – The k partitions.
Contributed by kennytm on stackoverflow.com/questions/3760752.
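
One common way to obtain such a partition is to shuffle the items and then deal them out with a stride of k. The sketch below uses that approach; it is not necessarily the method from the StackOverflow answer credited above, and the seed argument and the name shuffled_partition are illustrative.

```python
import random

def shuffled_partition(k, iterable, seed=None):
    """Split the items into k random groups whose sizes differ by at most one."""
    items = list(iterable)
    random.Random(seed).shuffle(items)
    # Striding by k deals the shuffled items out like cards.
    return [items[i::k] for i in range(k)]

parts = shuffled_partition(3, range(10), seed=0)
print(parts)
```

With 10 items and k = 3, one group receives 4 items and the other two receive 3 each.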

prosstt.sim_utils.simulate_base_gene_exp(tree, relative_means, abs_max=5000, gene_mean=0.8, gene_std=1)¶

Samples appropriate base expression values for each gene. The criterion applied is that the absolute average gene expression does not surpass a certain threshold.

Parameters:	tree (Tree) – A lineage tree object.
	relative_means (Series) – Relative mean expression for all genes on every lineage tree branch.
	abs_max (int, optional) – Highest allowed value for the absolute average expression of a gene along the lineage tree.
	gene_mean (float, optional) – Average of the log-normal distribution from which the base gene expression values are sampled.
	gene_std (float, optional) – Standard deviation of the log-normal distribution from which the base gene expression values are sampled.
Returns:	base_gene_exp – An array that contains base expression values for each gene
Return type:	numpy.ndarray

prosstt.sim_utils.test_correlation(W, k, cutoff)¶

For a column of a matrix, test if previous columns correlate with it.

Parameters:	W (numpy array) – The matrix to test.
	k (int) – Compare columns from 0 to k-1 with column k.
	cutoff (float) – Correlation above the cutoff will be considered too much. Should be between 0 and 1 but is not explicitly tested.
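
The check can be illustrated with numpy.corrcoef. The documentation above does not state what the function returns or whether negative correlations count, so the boolean return value and the use of the absolute coefficient are assumptions, as is the name correlates_with_previous.

```python
import numpy as np

def correlates_with_previous(W, k, cutoff):
    """True if any of columns 0..k-1 correlates with column k above cutoff."""
    for j in range(k):
        # corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson r.
        r = np.corrcoef(W[:, j], W[:, k])[0, 1]
        # Assumption: strong negative correlation also counts.
        if abs(r) > cutoff:
            return True
    return False

W = np.array([
    [1.0, 2.0, 1.0],
    [2.0, 4.1, -1.0],
    [3.0, 5.9, 1.0],
    [4.0, 8.0, -1.0],
])
print(correlates_with_previous(W, 1, 0.9))
print(correlates_with_previous(W, 2, 0.9))
```

Column 1 is almost a multiple of column 0, so the first call reports a correlation, while column 2 is only weakly related to either earlier column.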