Analytics

analytics.py

unit_vector(vector)

Returns the unit vector of the vector.

Parameters: vector (tuple) – vector
Returns: unit vector
Return type: tuple

flatten(t, my_list=[])

Generator flattening the structure.

Code from: https://gist.github.com/shaxbee/0ada767debf9eefbdb6e
Acknowledgements: Zbigniew Mandziejewicz (shaxbee)

>>> list(flatten([2, [2, (4, 5, [7], [2, [6, 2, 6, [6], 4]], 6)]]))
[2, 2, 4, 5, 7, 2, 6, 2, 6, 6, 4, 6]
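For readers who do not want to follow the gist link, the idea can be sketched as a small recursive generator. This is not the code from the gist: `flatten_sketch` is a hypothetical name, and the sketch assumes that only lists and tuples should be descended into.

```python
def flatten_sketch(structure):
    """Yield the leaves of arbitrarily nested lists/tuples, depth first."""
    for item in structure:
        if isinstance(item, (list, tuple)):
            # Recurse into nested containers and re-yield their leaves.
            yield from flatten_sketch(item)
        else:
            yield item


nested = [2, [2, (4, 5, [7], [2, [6, 2, 6, [6], 4]], 6)]]
flat = list(flatten_sketch(nested))
```

The sketch deliberately takes no `my_list=[]` default, which sidesteps Python's shared mutable default argument pitfall.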

angle_between(v1, v2)

Returns the angle in radians between vectors ‘v1’ and ‘v2’.

Parameters

v1 (tuple) – vector 1
v2 (tuple) – vector 2

Returns

Angle between the two vectors in radians.

Return type: float

Example:

angle = angle_between((1, 0, 0), (0, 1, 0))
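The computation behind this pair of helpers can be sketched with the standard library alone: normalize both vectors, take their dot product, and apply the arc cosine. The names `unit_vector_sketch` and `angle_between_sketch` are hypothetical, and this is an illustration under those assumptions rather than the module's own implementation.

```python
import math


def unit_vector_sketch(vector):
    """Scale a vector to length 1 (raises ZeroDivisionError for a zero vector)."""
    norm = math.sqrt(sum(c * c for c in vector))
    return tuple(c / norm for c in vector)


def angle_between_sketch(v1, v2):
    """Angle in radians between two vectors of equal dimension."""
    u1, u2 = unit_vector_sketch(v1), unit_vector_sketch(v2)
    dot = sum(a * b for a, b in zip(u1, u2))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, dot)))


angle = angle_between_sketch((1, 0, 0), (0, 1, 0))
```

For the two orthogonal vectors above the result is pi/2.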

transform_into_wide_format(data, index, columns, values, extra=[])

This function converts a Pandas DataFrame from long to wide format using the pandas pivot_table() function.

Parameters

data – long-format Pandas DataFrame
index (list) – columns that will be converted into the index
columns (str) – column name whose unique values will become the new column names
values (str) – column to aggregate
extra (list) – additional columns to be kept as columns

Returns

Wide-format pandas DataFrame

Example:

result = transform_into_wide_format(df, index='index', columns='x', values='y', extra='group')
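As a rough illustration of the long-to-wide conversion, the sketch below calls `pandas.DataFrame.pivot_table` directly. The toy column names, the choice to carry the extra column in the pivot index, and `aggfunc="first"` are all assumptions made for the example; the function documented above may handle these details differently.

```python
import pandas as pd

# Toy long-format table: one row per (sample, protein) measurement.
long_df = pd.DataFrame({
    "sample": ["s1", "s1", "s2", "s2"],
    "group": ["ctrl", "ctrl", "case", "case"],
    "protein": ["P1", "P2", "P1", "P2"],
    "intensity": [10.0, 12.0, 11.0, 13.0],
})

# Unique values of "protein" become columns; "group" is kept by pivoting on it too.
wide_df = long_df.pivot_table(
    index=["sample", "group"], columns="protein", values="intensity", aggfunc="first"
).reset_index()
wide_df.columns.name = None  # drop the leftover "protein" axis label
```

The result has one row per sample with the columns `sample`, `group`, `P1` and `P2`.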

transform_into_long_format(data, drop_columns, group, columns=['name', 'y'])

Converts a Pandas DataFrame from wide to long format using the pd.melt() function.

Parameters

data – wide-format Pandas DataFrame
drop_columns (list) – columns to be deleted
group (str or list) – column(s) to use as identifier variables
columns (list) – names to use for 1) the variable column and 2) the value column

Returns

Long-format Pandas DataFrame.

Example:

result = transform_into_long_format(df, drop_columns=['sample', 'subject'], group='group', columns=['name','y'])
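The reverse direction can be sketched with `pandas.DataFrame.melt`. The toy data and the mapping of `columns=['name', 'y']` onto `var_name` and `value_name` are assumptions for illustration, not a statement about the function's internals.

```python
import pandas as pd

# Toy wide-format table: one row per sample, one column per protein.
wide_df = pd.DataFrame({
    "sample": ["s1", "s2"],
    "subject": ["a", "b"],
    "group": ["ctrl", "case"],
    "P1": [10.0, 11.0],
    "P2": [12.0, 13.0],
})

# Drop unwanted columns, keep "group" as identifier, and stack the rest.
long_df = wide_df.drop(columns=["sample", "subject"]).melt(
    id_vars="group", var_name="name", value_name="y"
)
```

Each protein column contributes one row per sample, so the two-by-two block of measurements becomes four rows.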

get_ranking_with_markers(data, drop_columns, group, columns, list_markers, annotation={})

This function creates a long-format dataframe with features and values to be plotted together with disease biomarker annotations.

Parameters

data – wide-format Pandas DataFrame with samples as rows and features as columns
drop_columns (list) – columns to be deleted
group (str) – column to use as identifier variables
columns (list) – names to use for 1) the variable column and 2) the value column
list_markers (list) – list of features from data, known to be markers associated to disease.
annotation (dict) – markers, from list_markers, and associated diseases.

Returns

Long-format pandas DataFrame with group identifiers as rows and columns: ‘name’ (identifier), ‘y’ (LFQ intensity), ‘symbol’ and ‘size’.

Example:

result = get_ranking_with_markers(data, drop_columns=['sample', 'subject'], group='group', columns=['name', 'y'], list_markers=list_markers, annotation={})

extract_number_missing(data, min_valid, drop_cols=['sample'], group='group')

Counts how many valid values exist in each column and filters column labels with more valid values than the minimum threshold defined.

Parameters

data – pandas DataFrame with group as rows and protein identifier as column.
group (str) – column label containing group identifiers. If None, the number of valid values is counted across all samples; otherwise it is counted per unique group identifier.
min_valid (int) – minimum number of valid values to be filtered.
drop_cols (list) – column labels to be dropped.

Returns

List of column labels above the threshold.

Example:

result = extract_number_missing(data, min_valid=3, drop_cols=['sample'], group='group')

extract_percentage_missing(data, missing_max, drop_cols=['sample'], group='group', how='all')

Extracts the ratio of missing/valid values in each column and filters column labels with a lower ratio than the maximum threshold defined.

Parameters

data – pandas dataframe with group as rows and protein identifier as column.
group (str) – column label containing group identifiers. If None, the ratio is calculated across all samples; otherwise it is calculated per unique group identifier.
missing_max (float) – maximum ratio of missing/valid values to be filtered.
drop_cols (list) – column labels to be dropped.
how (str) – defines whether labels with a higher percentage of missing values than the threshold in any group (‘any’) or in all groups (‘all’) should be filtered.

Returns

List of column labels below the threshold.

Example:

result = extract_percentage_missing(data, missing_max=0.3, drop_cols=['sample'], group='group')
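The per-group filtering and the effect of `how` can be sketched as follows. Two points are assumptions rather than documented behaviour: the ratio is taken as the fraction of missing values among all samples of a group, and `columns_below_missing_threshold` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def columns_below_missing_threshold(data, missing_max, drop_cols, group, how="all"):
    """Return feature labels whose per-group missing fraction passes the filter."""
    features = data.drop(columns=drop_cols)
    # Fraction of missing values per feature, computed within each group.
    missing = features.drop(columns=group).isna().groupby(features[group]).mean()
    too_sparse = missing > missing_max
    # 'all': drop only if too sparse in every group; 'any': drop if too sparse in one.
    dropped = too_sparse.all(axis=0) if how == "all" else too_sparse.any(axis=0)
    return missing.columns[~dropped].tolist()


data = pd.DataFrame({
    "sample": ["s1", "s2", "s3", "s4"],
    "group": ["ctrl", "ctrl", "case", "case"],
    "P1": [1.0, 2.0, 3.0, 4.0],                 # complete
    "P2": [np.nan, np.nan, 3.0, 4.0],           # missing in ctrl only
    "P3": [np.nan, np.nan, np.nan, np.nan],     # missing everywhere
})

kept_all = columns_below_missing_threshold(data, 0.3, ["sample"], "group", how="all")
kept_any = columns_below_missing_threshold(data, 0.3, ["sample"], "group", how="any")
```

With `how="all"` the partially observed `P2` survives, while `how="any"` keeps only the complete column `P1`.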

imputation_KNN(data, drop_cols=['group', 'sample', 'subject'], group='group', cutoff=0.6, alone=True)

k-Nearest Neighbors imputation for pandas dataframes with missing data. For more information visit https://github.com/iskandr/fancyimpute/blob/master/fancyimpute/knn.py.

Parameters

data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
group (str) – column label containing group identifiers.
drop_cols (list) – column labels to be dropped. Final dataframe should only have gene/protein/etc identifiers as columns.
cutoff (float) – minimum ratio of missing/valid values required to impute in each column.
alone (boolean) – if True removes all columns with any missing values.

Returns

Pandas dataframe with samples as rows and protein identifiers as columns.

Example:

result = imputation_KNN(data, drop_cols=['group', 'sample', 'subject'], group='group', cutoff=0.6, alone=True)

imputation_mixed_norm_KNN(data, index_cols=['group', 'sample', 'subject'], shift=1.8, nstd=0.3, group='group', cutoff=0.6)

Missing values are replaced in two steps: 1) using k-Nearest Neighbors we impute protein columns with a higher ratio of missing/valid values than the defined cutoff, 2) the remaining missing values are replaced by random numbers that are drawn from a normal distribution.

Parameters

data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
group (str) – column label containing group identifiers.
index_cols (list) – list of column labels to be set as dataframe index.
shift (float) – specifies the amount by which the distribution used for the random numbers is shifted downwards. This is in units of the standard deviation of the valid data.
nstd (float) – defines the width of the Gaussian distribution relative to the standard deviation of measured values. A value of 0.5 would mean that the width of the distribution used for drawing random numbers is half of the standard deviation of the data.
cutoff (float) – minimum ratio of missing/valid values required to impute in each column.

Returns

Pandas dataframe with samples as rows and protein identifiers as columns.

Example:

result = imputation_mixed_norm_KNN(data, index_cols=['group', 'sample', 'subject'], shift = 1.8, nstd = 0.3, group='group', cutoff=0.6)

imputation_normal_distribution(data, index_cols=['group', 'sample', 'subject'], shift=1.8, nstd=0.3)

Missing values will be replaced by random numbers that are drawn from a normal distribution. The imputation is done for each sample (across all proteins) separately. For more information visit http://www.coxdocs.org/doku.php?id=perseus:user:activities:matrixprocessing:imputation:replacemissingfromgaussian.

Parameters

data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
index_cols (list) – list of column labels to be set as dataframe index.
shift (float) – specifies the amount by which the distribution used for the random numbers is shifted downwards. This is in units of the standard deviation of the valid data.
nstd (float) – defines the width of the Gaussian distribution relative to the standard deviation of measured values. A value of 0.5 would mean that the width of the distribution used for drawing random numbers is half of the standard deviation of the data.

Returns

Pandas dataframe with samples as rows and protein identifiers as columns.

Example:

result = imputation_normal_distribution(data, index_cols=['group', 'sample', 'subject'], shift = 1.8, nstd = 0.3)
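The roles of `shift` and `nstd` can be illustrated with a minimal per-sample sketch: missing values in a row are drawn from a normal distribution centred `shift` standard deviations below the row mean, with a width of `nstd` standard deviations. The function name, the `seed` argument and the assumption that the frame contains only numeric feature columns are hypothetical choices for this example.

```python
import numpy as np
import pandas as pd


def impute_downshifted_normal(values, shift=1.8, nstd=0.3, seed=0):
    """Fill NaNs in each row with draws from a down-shifted, narrowed normal."""
    rng = np.random.default_rng(seed)
    imputed = values.astype(float).copy()
    for sample, row in values.iterrows():
        missing_cols = row.index[row.isna()]
        if len(missing_cols) == 0:
            continue
        std = row.std()                    # NaNs are skipped by default
        center = row.mean() - shift * std  # shift the distribution downwards
        draws = rng.normal(center, nstd * std, size=len(missing_cols))
        imputed.loc[sample, missing_cols] = draws
    return imputed


values = pd.DataFrame(
    {"P1": [20.0, 21.0], "P2": [22.0, np.nan], "P3": [24.0, 25.0], "P4": [np.nan, 23.0]},
    index=["s1", "s2"],
)
filled = impute_downshifted_normal(values)
```

Observed values are left untouched. A row with fewer than two valid values has no defined standard deviation, so a real implementation would need to handle that case separately.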

normalize_data_per_group(data, group, method='median')

This function normalizes the data by group using the selected method.

Parameters

data – DataFrame with the data to be normalized (samples x features)
group – Column containing the groups
method (string) – normalization method to choose among: median_polish, median, quantile, linear

Returns

Pandas dataframe.

Example:

result = normalize_data_per_group(data, group='group', method='median')

normalize_data(data, method='median_polish')

This function normalizes the data using the selected method.

Parameters

data – DataFrame with the data to be normalized (samples x features)
method (string) – normalization method to choose among: median_polish, median, quantile, linear

Returns

Pandas dataframe.

Example:

result = normalize_data(data, method='median_polish')

median_normalization(data)

This function normalizes each sample by using its median.

Parameters: data –
Returns: Pandas dataframe.

Example:

result = median_normalization(data)

zscore_normalization(data)

This function normalizes each sample by using its mean and standard deviation (mean=0, std=1).

Parameters: data –
Returns: Pandas dataframe.

Example:

data = pd.DataFrame({'a': [2,5,4,3,3], 'b':[4,4,6,5,3], 'c':[4,14,8,8,9]})
result = zscore_normalization(data)
result

          a         b         c
0 -1.154701  0.577350  0.577350
1 -0.484182 -0.665750  1.149932
2 -1.000000  0.000000  1.000000
3 -0.927173 -0.132453  1.059626
4 -0.577350 -0.577350  1.154701
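The numbers in this example are consistent with a row-wise z-score that uses the sample standard deviation (ddof=1). The sketch below reproduces them with plain pandas; it illustrates the arithmetic and is not taken from the function's source.

```python
import pandas as pd

data = pd.DataFrame({'a': [2, 5, 4, 3, 3], 'b': [4, 4, 6, 5, 3], 'c': [4, 14, 8, 8, 9]})

row_mean = data.mean(axis=1)
row_std = data.std(axis=1)  # pandas default is the sample std (ddof=1)

# Centre and scale every row (sample) independently.
zscores = data.sub(row_mean, axis=0).div(row_std, axis=0)
```

Row 2, for instance, holds the values 4, 6 and 8, with mean 6 and standard deviation 2, which gives -1, 0 and 1.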

median_polish_normalization(data, max_iter=250)

This function iteratively normalizes each sample and each feature to its median until medians converge.

Parameters

data –
max_iter (int) – number of maximum iterations to prevent infinite loop.

Returns

Pandas dataframe.

Example:

result = median_polish_normalization(data, max_iter = 10)
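One common way to implement this kind of iteration is to alternately subtract row medians and column medians until both are close to zero. The sketch below follows that scheme; the subtraction-based update, the tolerance `tol` and the name `median_polish_sketch` are assumptions and may differ from the function documented above.

```python
import numpy as np
import pandas as pd


def median_polish_sketch(data, max_iter=250, tol=1e-8):
    """Alternately remove row and column medians until they vanish."""
    values = data.to_numpy(dtype=float).copy()
    for _ in range(max_iter):
        row_med = np.nanmedian(values, axis=1, keepdims=True)
        values -= row_med
        col_med = np.nanmedian(values, axis=0, keepdims=True)
        values -= col_med
        # Stop once neither sweep changed the matrix noticeably.
        if np.abs(row_med).max() < tol and np.abs(col_med).max() < tol:
            break
    return pd.DataFrame(values, index=data.index, columns=data.columns)


raw = pd.DataFrame([[1.0, 2.0, 3.0], [4.0, 6.0, 9.0], [7.0, 8.0, 10.0]],
                   index=["s1", "s2", "s3"], columns=["P1", "P2", "P3"])
polished = median_polish_sketch(raw, max_iter=10)
```

`max_iter` bounds the loop in the same spirit as the parameter described above, so the sketch terminates even when the medians oscillate.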

quantile_normalization(data)

Applies quantile normalization to each column in pandas dataframe.

Parameters: data – pandas dataframe with features as columns and samples as rows.
Returns: Pandas dataframe

Example:

result = quantile_normalization(data)
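The textbook quantile normalization procedure can be sketched as follows: sort every column, average the sorted values rank by rank, and write those averages back according to each entry's rank. This is a generic illustration with a hypothetical function name; it assumes no missing values and breaks ties by order of appearance.

```python
import numpy as np
import pandas as pd


def quantile_normalize_sketch(data):
    """Give every column the same distribution (the mean of the sorted columns)."""
    values = data.to_numpy(dtype=float)
    # Mean of the k-th smallest value across all columns, for every rank k.
    rank_means = np.sort(values, axis=0).mean(axis=1)
    # 0-based rank of every entry within its column; ties broken by position.
    ranks = data.rank(method="first").astype(int) - 1
    normalized = pd.DataFrame(index=data.index, columns=data.columns, dtype=float)
    for col in data.columns:
        normalized[col] = rank_means[ranks[col].to_numpy()]
    return normalized


df = pd.DataFrame({"A": [5, 2, 3, 4], "B": [4, 1, 4, 2], "C": [3, 4, 6, 8]})
qn = quantile_normalize_sketch(df)
```

After normalization every column contains the same set of values, only in a different order. To equalize samples instead of features, the frame can be transposed before and after the call.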

linear_normalization(data, method='l1', axis=0)

This function scales input data to a unit norm. For more information visit https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html.

Parameters

data – pandas dataframe with samples as rows and features as columns.
method (str) – norm to use to normalize each non-zero sample or non-zero feature (depends on axis).
axis (int) – axis used to normalize the data along. If 1, independently normalize each sample, otherwise (if 0) normalize each feature.

Returns

Pandas dataframe

Example:

result = linear_normalization(data, method = "l1", axis = 0)

remove_group(data)

Removes column with label ‘group’.

Parameters: data – pandas dataframe with one column labelled ‘group’
Returns: Pandas dataframe

Example:

result = remove_group(data)

calculate_coefficient_variation(values)

Compute the coefficient of variation, the ratio of the biased standard deviation to the mean, in percentage. For more information visit https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.variation.html.

Parameters: values (ndarray) – numpy array
Returns: The calculated variation along rows.
Return type: ndarray

Example:

result = calculate_coefficient_variation(values)
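As a worked example of the definition, the sketch below computes the coefficient of variation of a one-dimensional array with NumPy, using the biased standard deviation (ddof=0) and expressing the result as a percentage. The function name is hypothetical and the sketch does not call SciPy.

```python
import numpy as np


def coefficient_of_variation_pct(values):
    """Biased standard deviation divided by the mean, in percent."""
    values = np.asarray(values, dtype=float)
    # np.std defaults to ddof=0, i.e. the biased (population) standard deviation.
    return np.std(values) / np.mean(values) * 100


cv = coefficient_of_variation_pct([2, 4, 4, 4, 5, 5, 7, 9])
```

For these eight values the mean is 5 and the biased standard deviation is 2, so the coefficient of variation is 40 percent.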

get_coefficient_variation(data, drop_columns, group, columns=['name', 'y'])

Extracts the coefficients of variation in each group.

Parameters

data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
drop_columns (list) – column labels to be dropped from the dataframe
group (str) – column label containing group identifiers.
columns (list) – names to use for the variable column(s), and for the value column(s)

Returns

Pandas dataframe with columns ‘name’ (protein identifier), ‘x’ (coefficient of variation), ‘y’ (mean) and ‘group’.

Example:

result = get_coefficient_variation(data, drop_columns=['sample', 'subject'], group='group')

transform_proteomics_edgelist(df, index_cols=['group', 'sample', 'subject'], drop_cols=['sample'], group='group', identifier='identifier', extra_identifier='name', value_col='LFQ_intensity')

Transforms a long format proteomics matrix into a wide format

Parameters

df – long-format pandas dataframe with columns ‘group’, ‘sample’, ‘subject’, ‘identifier’ (protein), ‘name’ (gene) and ‘LFQ_intensity’.
index_cols (list) – column labels to be kept as index identifiers.
drop_cols (list) – column labels to be dropped from the dataframe.
group (str) – column label containing group identifiers.
identifier (str) – column label containing feature identifiers.
extra_identifier (str) – column label containing additional protein identifiers (e.g. gene names).
value_col (str) – column label containing expression values.

Returns

Pandas dataframe with samples as rows and protein identifiers (UniprotID~GeneName) as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

Example:

df = transform_proteomics_edgelist(original, index_cols=['group', 'sample', 'subject'], drop_cols=['sample'], group='group', identifier='identifier', value_col='LFQ_intensity')

get_proteomics_measurements_ready(df, index_cols=['group', 'sample', 'subject'], drop_cols=['sample'], group='group', identifier='identifier', extra_identifier='name', imputation=True, method='distribution', missing_method='percentage', missing_per_group=True, missing_max=0.3, min_valid=1, value_col='LFQ_intensity', shift=1.8, nstd=0.3, knn_cutoff=0.6, normalize=False, normalization_method='median', normalize_group=False)

Processes proteomics data extracted from the database: 1) filters out proteins with a high number of missing values (missing ratio above missing_max, or fewer valid values than min_valid), 2) imputes missing values. For more information on imputation method visit http://www.coxdocs.org/doku.php?id=perseus:user:activities:matrixprocessing:filterrows:filtervalidvaluesrows.

Parameters

df – long-format pandas dataframe with columns ‘group’, ‘sample’, ‘subject’, ‘identifier’ (protein), ‘name’ (gene) and ‘LFQ_intensity’.
index_cols (list) – column labels to be kept as index identifiers.
drop_cols (list) – column labels to be dropped from the dataframe.
group (str) – column label containing group identifiers.
identifier (str) – column label containing feature identifiers.
extra_identifier (str) – column label containing additional protein identifiers (e.g. gene names).
imputation (bool) – if True performs imputation of missing values.
method (str) – method for missing values imputation (‘KNN’, ‘distribution’, or ‘mixed’)
missing_method (str) – defines which expression rows are counted to determine if a column has enough valid values to survive the filtering process (the examples below use ‘percentage’ and ‘at_least_x’).
missing_per_group (bool) – if True filter proteins based on valid values per group; if False filter across all samples.
missing_max (float) – maximum ratio of missing/valid values to be filtered.
min_valid (int) – minimum number of valid values to be filtered.
value_col (str) – column label containing expression values.
shift (float) – when using distribution imputation, the down-shift
nstd (float) – when using distribution imputation, the width of the distribution
knn_cutoff (float) – when using KNN imputation, the minimum percentage of valid values for which to use KNN imputation (i.e. 0.6 -> if 60% valid values use KNN, otherwise MinProb)

Returns

Pandas dataframe with samples as rows and protein identifiers (UniprotID~GeneName) as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

Example 1:

result = get_proteomics_measurements_ready(df, index_cols=['group', 'sample', 'subject'], drop_cols=['sample'], group='group', identifier='identifier', extra_identifier='name', imputation=True, method = 'distribution', missing_method = 'percentage', missing_per_group=True, missing_max = 0.3, value_col='LFQ_intensity')

Example 2:

result = get_proteomics_measurements_ready(df, index_cols=['group', 'sample', 'subject'], drop_cols=['sample'], group='group', identifier='identifier', extra_identifier='name', imputation = True, method = 'mixed', missing_method = 'at_least_x', missing_per_group=False, min_valid=5, value_col='LFQ_intensity')

get_clinical_measurements_ready(df, subject_id='subject', sample_id='biological_sample', group_id='group', columns=['clinical_variable'], values='values', extra=['group'], imputation=True, imputation_method='KNN')

Processes clinical data extracted from the database by converting dataframe to wide-format and imputing missing values.

Parameters

df – long-format pandas dataframe with columns ‘group’, ‘biological_sample’, ‘subject’, ‘clinical_variable’, ‘value’.
subject_id (str) – column label containing subject identifiers.
sample_id (str) – column label containing biological sample identifiers.
group_id (str) – column label containing group identifiers.
columns (list) – column name whose unique values will become the new column names
values (str) – column label containing clinical variable values.
extra (list) – additional column labels to be kept as columns
imputation (bool) – if True performs imputation of missing values.
imputation_method (str) – method for missing values imputation (‘KNN’, ‘distribution’, or ‘mixed’).

Returns

Pandas dataframe with samples as rows and clinical variables as columns (with additional columns ‘group’, ‘subject’ and ‘biological_sample’).

Example:

result = get_clinical_measurements_ready(df, subject_id='subject', sample_id='biological_sample', group_id='group', columns=['clinical_variable'], values='values', extra=['group'], imputation=True, imputation_method='KNN')

get_summary_data_matrix(data)

Returns some statistics on the data matrix provided.

Parameters: data – pandas dataframe.
Returns: dictionary with the type of statistics as key and the statistic as value in the shape of a pandas data frame

Example:

result = get_summary_data_matrix(data)

check_equal_variances(data, drop_cols=['group', 'sample', 'subject'], group_col='group', alpha=0.05)

check_normality(data, drop_cols=['group', 'sample', 'subject'], group_col='group', alpha=0.05)

run_pca(data, drop_cols=['sample', 'subject'], group='group', components=2, dropna=True)

Performs principal component analysis and returns the values of each component for each sample and each protein, and the loadings for each protein. For more information visit https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html.

Parameters

data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
drop_cols (list) – column labels to be dropped from the dataframe.
group (str) – column label containing group identifiers.
components (int) – number of components to keep.
dropna (bool) – if True removes all columns with any missing values.

Returns

Two dictionaries: 1) two pandas dataframes (first one with components values, the second with the components vectors for each protein), 2) xaxis and yaxis titles with components loadings for plotly.

Example:

result = run_pca(data, drop_cols=['sample', 'subject'], group='group', components=2, dropna=True)

run_tsne(data, drop_cols=['sample', 'subject'], group='group', components=2, perplexity=40, n_iter=1000, init='pca', dropna=True)

Performs t-distributed Stochastic Neighbor Embedding analysis. For more information visit https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html.

Parameters

data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
drop_cols (list) – column labels to be dropped from the dataframe.
group (str) – column label containing group identifiers.
components (int) – dimension of the embedded space.
perplexity (int) – related to the number of nearest neighbors that is used in other manifold learning algorithms. Consider selecting a value between 5 and 50.
n_iter (int) – maximum number of iterations for the optimization (at least 250).
init (str) – initialization of embedding (‘random’, ‘pca’ or numpy array of shape n_samples x n_components).
dropna (bool) – if True removes all columns with any missing values.

Returns

Two dictionaries: 1) pandas dataframe with embedding vectors, 2) xaxis and yaxis titles for plotly.

Example:

result = run_tsne(data, drop_cols=['sample', 'subject'], group='group', components=2, perplexity=40, n_iter=1000, init='pca', dropna=True)

run_umap(data, drop_cols=['sample', 'subject'], group='group', n_neighbors=10, min_dist=0.3, metric='cosine', dropna=True)

Performs Uniform Manifold Approximation and Projection. For more information visit https://umap-learn.readthedocs.io.

Parameters

data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
drop_cols (list) – column labels to be dropped from the dataframe.
group (str) – column label containing group identifiers.
n_neighbors (int) – number of neighboring points used in local approximations of manifold structure.
min_dist (float) – controls how tightly the embedding is allowed to compress points together.
metric (str) – metric used to measure distance in the input space.
dropna (bool) – if True removes all columns with any missing values.

Returns

Two dictionaries: 1) pandas dataframe with embedding of the training data in low-dimensional space, 2) xaxis and yaxis titles for plotly.

Example:

result = run_umap(data, drop_cols=['sample', 'subject'], group='group', n_neighbors=10, min_dist=0.3, metric='cosine', dropna=True)

calculate_correlations(x, y, method='pearson')

Calculates a Spearman (nonparametric) or a Pearson (parametric) correlation coefficient and p-value to test for non-correlation.

Parameters

x (ndarray) – array 1
y (ndarray) – array 2
method (str) – chooses which kind of correlation method to run

Returns

Tuple with two floats, correlation coefficient and two-tailed p-value.

Example:

result = calculate_correlations(x, y, method='pearson')
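A dispatcher of this kind can be sketched on top of `scipy.stats.pearsonr` and `scipy.stats.spearmanr`, both of which return a coefficient and a two-sided p-value. Whether the documented function is built exactly this way is an assumption, and `correlation_sketch` is a hypothetical name.

```python
import numpy as np
from scipy import stats


def correlation_sketch(x, y, method="pearson"):
    """Return (coefficient, two-sided p-value) for the chosen method."""
    if method == "pearson":
        coefficient, pvalue = stats.pearsonr(x, y)
    elif method == "spearman":
        coefficient, pvalue = stats.spearmanr(x, y)
    else:
        raise ValueError(f"unknown correlation method: {method!r}")
    return float(coefficient), float(pvalue)


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
r, p = correlation_sketch(x, y)
rho, p_rho = correlation_sketch(x, y, method="spearman")
```

Because `y` increases strictly with `x`, the Spearman coefficient is exactly 1, while the Pearson coefficient is slightly below 1.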

apply_pvalue_correction(pvalues, alpha=0.05, method='bonferroni')

Performs p-value correction using the specified method. For more information visit https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html.

Parameters

pvalues (ndarray) – set of p-values of the individual tests.
alpha (float) – error rate.
method (str) – method of p-value correction:
- bonferroni : one-step correction
- sidak : one-step correction
- holm-sidak : step down method using Sidak adjustments
- holm : step-down method using Bonferroni adjustments
- simes-hochberg : step-up method (independent)
- hommel : closed method based on Simes tests (non-negative)
- fdr_bh : Benjamini/Hochberg (non-negative)
- fdr_by : Benjamini/Yekutieli (negative)
- fdr_tsbh : two stage fdr correction (non-negative)
- fdr_tsbky : two stage fdr correction (non-negative)

Returns

Tuple with two arrays, boolean for rejecting H0 hypothesis and float for adjusted p-value.

Example:

result = apply_pvalue_correction(pvalues, alpha=0.05, method='bonferroni')
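To make the default method concrete, here is a hand-rolled Bonferroni correction in NumPy. It is a sketch of the arithmetic only (each p-value is multiplied by the number of tests and capped at 1), not a call into statsmodels, and `bonferroni_sketch` is a hypothetical name.

```python
import numpy as np


def bonferroni_sketch(pvalues, alpha=0.05):
    """Return (rejected, adjusted) arrays for a one-step Bonferroni correction."""
    pvalues = np.asarray(pvalues, dtype=float)
    # Multiply by the number of tests and cap at 1.
    adjusted = np.minimum(pvalues * pvalues.size, 1.0)
    rejected = adjusted <= alpha
    return rejected, adjusted


rejected, adjusted = bonferroni_sketch([0.001, 0.02, 0.04, 0.5])
```

With four tests only the smallest p-value (0.001, adjusted to 0.004) stays below alpha = 0.05.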

apply_pvalue_fdrcorrection(pvalues, alpha=0.05, method='indep')

Performs p-value correction for false discovery rate. For more information visit https://www.statsmodels.org/devel/generated/statsmodels.stats.multitest.fdrcorrection.html.

Parameters

pvalues (ndarray) – set of p-values of the individual tests.
alpha (float) – error rate.
method (str) – method of p-value correction (‘indep’, ‘negcorr’).

Returns

Tuple with two arrays, boolean for rejecting H0 hypothesis and float for adjusted p-value.

Example:

result = apply_pvalue_fdrcorrection(pvalues, alpha=0.05, method='indep')
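For the ‘indep’ case, the Benjamini-Hochberg adjustment can be written out by hand: sort the p-values, scale the i-th smallest by m/i, and enforce monotonicity from the largest rank downwards. The sketch below is an illustration in NumPy under a hypothetical name, not the statsmodels implementation.

```python
import numpy as np


def benjamini_hochberg_sketch(pvalues, alpha=0.05):
    """Return (rejected, adjusted) arrays for the Benjamini-Hochberg procedure."""
    pvalues = np.asarray(pvalues, dtype=float)
    m = pvalues.size
    order = np.argsort(pvalues)
    # Scale the i-th smallest p-value by m / i.
    scaled = pvalues[order] * m / np.arange(1, m + 1)
    # Each adjusted value is the minimum over itself and all larger ranks.
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(monotone, 1.0)
    return adjusted <= alpha, adjusted


bh_rejected, bh_adjusted = benjamini_hochberg_sketch([0.01, 0.04, 0.03, 0.005])
```

Here the adjusted values are 0.02, 0.04, 0.04 and 0.02 in the original order, so all four hypotheses are rejected at alpha = 0.05.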

apply_pvalue_twostage_fdrcorrection(pvalues, alpha=0.05, method='bh')

Iterated two stage linear step-up procedure with estimation of number of true hypotheses. For more information visit https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.fdrcorrection_twostage.html.

Parameters

pvalues (ndarray) – set of p-values of the individual tests.
alpha (float) – error rate.
method (str) – method of p-value correction (‘bky’, ‘bh’).

Returns

Tuple with two arrays, boolean for rejecting H0 hypothesis and float for adjusted p-value.

Example:

result = apply_pvalue_twostage_fdrcorrection(pvalues, alpha=0.05, method='bh')

apply_pvalue_permutation_fdrcorrection(df, observed_pvalues, group, alpha=0.05, permutations=50)

This function applies multiple hypothesis testing correction using a permutation-based false discovery rate approach.

Parameters

df – pandas dataframe with samples as rows and features as columns.
observed_pvalues – pandas Series with p-values calculated on the originally measured data.
group (str) – name of the column containing group identifiers.
alpha (float) – error rate. Values below alpha are considered significant.
permutations (int) – number of permutations to be applied.

Returns

Pandas dataframe with adjusted p-values and rejected columns.

Example:

result = apply_pvalue_permutation_fdrcorrection(df, observed_pvalues, group='group', alpha=0.05, permutations=50)

get_counts_permutation_fdr(value, random, observed, n, alpha)

Calculates local FDR values (q-values) by computing the fraction of accepted hits from the permuted data over accepted hits from the measured data normalized by the total number of permutations.

Parameters

value (float) – computed p-value on measured data for a feature.
random (ndarray) – p-values computed on the permuted data.
observed – pandas Series with p-values calculated on the originally measured data.
n (int) – number of permutations to be applied.
alpha (float) – error rate. Values below alpha are considered significant.

Returns

Tuple with q-value and boolean for H0 rejected.

Example:

result = get_counts_permutation_fdr(value, random, observed, n=250, alpha=0.05)
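One reading of the description above is sketched below: count the permuted p-values that are at least as small as `value`, divide by the number of permutations, and divide again by the number of measured p-values that are at least as small. The use of `<=`, the cap at 1 and the function name are assumptions; the documented function may count hits differently.

```python
import numpy as np


def permutation_qvalue_sketch(value, random, observed, n, alpha):
    """Return (q-value, rejected) for one feature from permutation p-values."""
    random = np.asarray(random, dtype=float)
    observed = np.asarray(observed, dtype=float)
    # Average number of permuted p-values at least as small as `value`.
    false_hits = np.count_nonzero(random <= value) / n
    # Number of measured p-values at least as small as `value`.
    observed_hits = np.count_nonzero(observed <= value)
    qvalue = min(false_hits / observed_hits, 1.0) if observed_hits else 1.0
    return qvalue, qvalue <= alpha


observed = [0.01, 0.02, 0.5, 0.001]
random = [0.005, 0.2, 0.5, 0.008, 0.3, 0.9]  # p-values pooled over n = 2 permutations

q_mid, rejected_mid = permutation_qvalue_sketch(0.01, random, observed, n=2, alpha=0.05)
q_low, rejected_low = permutation_qvalue_sketch(0.001, random, observed, n=2, alpha=0.05)
```

For `value=0.01`, two permuted p-values fall below it (one per permutation on average) against two measured hits, which gives a q-value of 0.5.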

convertToEdgeList(data, cols)

This function converts a pandas dataframe to an edge list where index becomes the source nodes and columns the target nodes.

Parameters

data – pandas dataframe.
cols (list) – names for dataframe columns.

Returns

Pandas dataframe with columns cols.
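The conversion can be sketched with `reset_index` and `melt`: the index is turned into a source column, and every remaining column label becomes a target node. The helper name and the temporary column name `source` are assumptions for this illustration.

```python
import pandas as pd


def to_edge_list_sketch(data, cols):
    """Turn a matrix-like frame into rows of (source, target, value)."""
    # Move the index into a regular column so it can act as the source node.
    wide = data.rename_axis(index="source").reset_index()
    # Every remaining column label becomes a target node.
    edges = wide.melt(id_vars="source")
    edges.columns = cols
    return edges


matrix = pd.DataFrame([[1.0, 0.2], [0.2, 1.0]], index=["P1", "P2"], columns=["P1", "P2"])
edges = to_edge_list_sketch(matrix, cols=["node1", "node2", "weight"])
```

A two-by-two matrix yields four edges, including the self-edges on the diagonal.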

run_correlation(df, alpha=0.05, subject='subject', group='group', method='pearson', correction='fdr_bh')

This function calculates pairwise correlations for columns in dataframe, and returns them in the shape of an edge list with ‘weight’ as correlation score, and the adjusted p-values.

Parameters

df – pandas dataframe with samples as rows and features as columns.
subject (str) – name of column containing subject identifiers.
group (str) – name of column containing group identifiers.
method (str) – method to use for correlation calculation (‘pearson’, ‘spearman’).
alpha (float) – error rate. Values below alpha are considered significant.
correction (string) – type of correction; see apply_pvalue_correction for methods.

Returns

Pandas dataframe with columns: ‘node1’, ‘node2’, ‘weight’, ‘padj’ and ‘rejected’.

Example:

result = run_correlation(df, alpha=0.05, subject='subject', group='group', method='pearson', correction='fdr_bh')

run_multi_correlation(df_dict, alpha=0.05, subject='subject', on=['subject', 'biological_sample'], group='group', method='pearson', correction='fdr_bh')

This function merges all input dataframes and calculates pairwise correlations for all columns.

Parameters

df_dict (dict) – dictionary of pandas dataframes with samples as rows and features as columns.
subject (str) – name of the column containing subject identifiers.
group (str) – name of the column containing group identifiers.
on (list) – column names to join dataframes on (must be found in all dataframes).
method (str) – method to use for correlation calculation (‘pearson’, ‘spearman’).
alpha (float) – error rate. Values below alpha are considered significant.
correction (string) – type of correction; see apply_pvalue_correction for methods.

Returns

Pandas dataframe with columns: ‘node1’, ‘node2’, ‘weight’, ‘padj’ and ‘rejected’.

Example:

result = run_multi_correlation(df_dict, alpha=0.05, subject='subject', on=['subject', 'biological_sample'] , group='group', method='pearson', correction='fdr_bh')

calculate_rm_correlation(df, x, y, subject)

Computes correlation and p-values between two columns a and b in df.

Parameters

df – pandas dataframe with subjects as rows and two features as columns.
x (str) – feature a name.
y (str) – feature b name.
subject – column name containing the covariate variable.

Returns

Tuple with values for: feature a, feature b, correlation, p-value and degrees of freedom.

Example:

result = calculate_rm_correlation(df, x='feature a', y='feature b', subject='subject')

run_rm_correlation(df, alpha=0.05, subject='subject', correction='fdr_bh')

Computes pairwise repeated measurements correlations for all columns in dataframe, and returns results as an edge list with ‘weight’ as correlation score, p-values, degrees of freedom and adjusted p-values.

Parameters

df – pandas dataframe with samples as rows and features as columns.
subject (str) – name of column containing subject identifiers.
alpha (float) – error rate. Values below alpha are considered significant.
correction (string) – type of correction; see apply_pvalue_correction for methods.

Returns

Pandas dataframe with columns: ‘node1’, ‘node2’, ‘weight’, ‘pvalue’, ‘dof’, ‘padj’ and ‘rejected’.

Example:

result = run_rm_correlation(df, alpha=0.05, subject='subject', correction='fdr_bh')

run_efficient_correlation(data, method='pearson')

Calculates pairwise correlations and returns lower triangle of the matrix with correlation values and p-values.

Parameters

data – pandas dataframe with samples as index and features as columns (numeric data only).
method (str) – method to use for correlation calculation (‘pearson’, ‘spearman’).

Returns

Two numpy arrays: correlation and p-values.

Example:

result = run_efficient_correlation(data, method='pearson')

calculate_ttest_samr(df, labels, n=2, s0=0, paired=False)

Calculates modified T-test using ‘samr’ R package.

Parameters

df – pandas dataframe with group as columns and protein identifier as rows
labels (list) – integers reflecting the group each sample belongs to (e.g. group1 = 1, group2 = 2)
n (int) – number of samples
s0 (float) – exchangeability factor for denominator of test statistic
paired (bool) – True if samples are paired

Returns

Pandas dataframe with columns ‘identifier’, ‘group1’, ‘group2’, ‘mean(group1)’, ‘mean(group2)’, ‘log2FC’, ‘FC’, ‘t-statistics’, ‘p-value’.

Example:

result = calculate_ttest_samr(df, labels, n=2, s0=0.1, paired=False)

calculate_ttest(df, condition1, condition2, paired=False, is_logged=True, non_par=False, tail='two-sided', correction='auto', r=0.707)

Calculates the t-test for the means of independent samples belonging to two different groups. For more information visit https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html.

Parameters

df – pandas dataframe with groups and subjects as rows and protein identifier as column.
condition1 (str) – identifier of first group.
condition2 (str) – identifier of second group.
is_logged (bool) – True if the data is log-transformed.
non_par (bool) – if True, the normality and variance equality assumptions are checked and the non-parametric Mann-Whitney U test is used if they are not met.

Returns

Tuple with t-statistics, two-tailed p-value, mean of first group, mean of second group and logfc.

Example:

result = calculate_ttest(df, 'group1', 'group2')

calculate_THSD(df, column, group='group', alpha=0.05, is_logged=True)[source]¶

Pairwise Tukey-HSD posthoc test using pingouin stats. For more information visit https://pingouin-stats.org/generated/pingouin.pairwise_tukey.html

Parameters

df – pandas dataframe with group and protein identifier as columns
column (str) – column containing the protein identifier
group (str) – column label containing the between factor
alpha (float) – significance level

Returns

Pandas dataframe.

Example:

result = calculate_THSD(df, column='HBG2~P69892', group='group', alpha=0.05)

calculate_pairwise_ttest(df, column, subject='subject', group='group', correction='none', is_logged=True)[source]¶

Performs pairwise t-test using pingouin, as a posthoc test, and calculates fold-changes. For more information visit https://pingouin-stats.org/generated/pingouin.pairwise_ttests.html.

Parameters

df – pandas dataframe with subject and group as rows and protein identifier as column.
column (str) – column label containing the dependent variable
subject (str) – column label containing subject identifiers
group (str) – column label containing the between factor
correction (str) – method used for testing and adjustment of p-values.

Returns

Pandas dataframe with means, standard deviations, test-statistics, degrees of freedom and effect size columns.

Example:

result = calculate_pairwise_ttest(df, 'protein a', subject='subject', group='group', correction='none')

complement_posthoc(posthoc, identifier, is_logged)[source]¶

Calculates fold-changes after posthoc test.

Parameters

posthoc – pandas dataframe from posthoc test. Should have at least columns ‘mean(group1)’ and ‘mean(group2)’.
identifier (str) – feature identifier.

Returns

Pandas dataframe with additional columns ‘identifier’, ‘log2FC’ and ‘FC’.
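
As a rough illustration of how fold-changes can be derived from the two group means, here is a minimal sketch. The function add_fold_changes is hypothetical, it works on a single row given as a dictionary, and the handling of is_logged (difference of means for log2 data, ratio otherwise) is an assumption rather than the confirmed behaviour of complement_posthoc.

```python
import math


def add_fold_changes(row, identifier, is_logged):
    """Return a copy of a posthoc result row with 'identifier', 'log2FC' and 'FC'."""
    mean1 = row["mean(group1)"]
    mean2 = row["mean(group2)"]
    if is_logged:
        # Assumption: means are on a log2 scale, so their difference is the log2 fold-change.
        log2fc = mean1 - mean2
        fc = 2 ** log2fc
    else:
        # Assumption: means are on the original scale, so their ratio is the fold-change.
        fc = mean1 / mean2
        log2fc = math.log2(fc)
    return {**row, "identifier": identifier, "log2FC": log2fc, "FC": fc}


logged = add_fold_changes({"mean(group1)": 12.0, "mean(group2)": 10.0}, "protein a", is_logged=True)
raw = add_fold_changes({"mean(group1)": 8.0, "mean(group2)": 2.0}, "protein a", is_logged=False)
```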

calculate_dabest(df, idx, x, y, paired=False, id_col=None, test='mean_diff')[source]¶

Parameters

df –
idx –
x –
y –
paired –
id_col –
test –

Returns

calculate_anova_samr(df, labels, s0=0)[source]¶

Calculates modified one-way ANOVA using ‘samr’ R package.

Parameters

df – pandas dataframe with group as columns and protein identifier as rows
labels (list) – integers reflecting the group each sample belongs to (e.g. group1 = 1, group2 = 2, group3 = 3)
s0 (float) – exchangeability factor for denominator of test statistic

Returns

Pandas dataframe with protein identifiers and F-statistics.

Example:

result = calculate_anova_samr(df, labels, s0=0.1)

calculate_anova(df, column, group='group')[source]¶

Calculates one-way ANOVA using pingouin.

Parameters

df – pandas dataframe with group as rows and protein identifier as column
column (str) – name of the column in df to run ANOVA on
group (str) – column with group identifiers

Returns

Tuple with t-statistics and p-value.

calculate_repeated_measures_anova(df, column, subject='subject', group='group')[source]¶

One-way and two-way repeated measures ANOVA using pingouin stats.

Parameters

df – pandas dataframe with samples as rows and protein identifier as column. Data must be in long-format for two-way repeated measures.
column (str) – column label containing the dependent variable
subject (str) – column label containing subject identifiers
group (str) – column label containing the within factor

Returns

Tuple with protein identifier, t-statistics and p-value.

Example:

result = calculate_repeated_measures_anova(df, 'protein a', subject='subject', group='group')

get_max_permutations(df, group='group')[source]¶

Get maximum number of permutations according to number of samples.

Parameters

df – pandas dataframe with samples as rows and protein identifiers as columns
group (str) – column with group identifiers

Returns

Maximum number of permutations.

Return type

int
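
One plausible way to count the maximum number of permutations is the number of distinct assignments of the observed group labels to samples (a multinomial coefficient). The sketch below assumes that reading; max_label_permutations is a hypothetical name and the actual function may count differently.

```python
from collections import Counter
from math import factorial


def max_label_permutations(group_labels):
    """Number of distinct ways to assign the observed group labels to the samples."""
    total = factorial(len(group_labels))
    for size in Counter(group_labels).values():
        # Dividing by each group-size factorial removes orderings within a group.
        total //= factorial(size)
    return total


labels = ["group1", "group1", "group1", "group2", "group2", "group2"]
n_permutations = max_label_permutations(labels)
```

With three samples in each of two groups this gives 6! / (3! * 3!) = 20 distinct label assignments.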

check_is_paired(df, subject, group)[source]¶

Check if samples are paired.

Parameters

df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
subject (str) – column with subject identifiers
group (str) – column with group identifiers

Returns

True if paired samples.

Return type

bool
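
A minimal sketch of one way to detect paired samples, assuming that a design counts as paired when at least one subject contributes samples to more than one group. The function is_paired is hypothetical and operates on a list of dictionaries instead of a dataframe.

```python
from collections import defaultdict


def is_paired(samples, subject="subject", group="group"):
    """True if at least one subject has samples in more than one group."""
    groups_per_subject = defaultdict(set)
    for sample in samples:
        groups_per_subject[sample[subject]].add(sample[group])
    return any(len(groups) > 1 for groups in groups_per_subject.values())


paired_samples = [
    {"subject": "S1", "group": "before"},
    {"subject": "S1", "group": "after"},
    {"subject": "S2", "group": "before"},
    {"subject": "S2", "group": "after"},
]
unpaired_samples = [
    {"subject": "S1", "group": "control"},
    {"subject": "S2", "group": "treated"},
]
```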

run_dabest(df, drop_cols=['sample'], subject='subject', group='group', test='mean_diff')[source]¶

Parameters

df –
drop_cols (list) –
subject (str) –
group (str) –
test (str) –

Returns

Pandas dataframe

run_anova(df, alpha=0.05, drop_cols=['sample', 'subject'], subject='subject', group='group', permutations=0, correction='fdr_bh', is_logged=True, non_par=False)[source]¶

Performs a statistical test for each protein in a dataset. Checks the type of the input data (paired, unpaired or repeated measurements) and performs posthoc tests for multiclass data. Multiple hypothesis correction is permutation-based if permutations>0 and Benjamini/Hochberg if permutations=0.

Parameters

df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
subject (str) – column with subject identifiers
group (str) – column with group identifiers
drop_cols (list) – column labels to be dropped from the dataframe
alpha (float) – error rate for multiple hypothesis correction
permutations (int) – number of permutations used to estimate false discovery rates.
non_par (bool) – if True, normality and variance equality assumptions are checked, and the non-parametric Mann-Whitney U test is used if they are not met

Returns

Pandas dataframe with columns ‘identifier’, ‘group1’, ‘group2’, ‘mean(group1)’, ‘mean(group2)’, ‘Log2FC’, ‘std_error’, ‘tail’, ‘t-statistics’, ‘posthoc pvalue’, ‘effsize’, ‘efftype’, ‘FC’, ‘rejected’, ‘F-statistics’, ‘p-value’, ‘correction’, ‘-log10 p-value’, and ‘method’.

Example:

result = run_anova(df, alpha=0.05, drop_cols=["sample",'subject'], subject='subject', group='group', permutations=50)

correct_pairwise_ttest(df, alpha, correction='fdr_bh')[source]¶

run_repeated_measurements_anova(df, alpha=0.05, drop_cols=['sample'], subject='subject', group='group', permutations=50, correction='fdr_bh', is_logged=True)[source]¶

Performs repeated measurements anova and pairwise posthoc tests for each protein in dataframe.

Parameters

df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
subject (str) – column with subject identifiers
group (str) – column with group identifiers
drop_cols (list) – column labels to be dropped from the dataframe
alpha (float) – error rate for multiple hypothesis correction
permutations (int) – number of permutations used to estimate false discovery rates

Returns

Pandas dataframe

Example:

result = run_repeated_measurements_anova(df, alpha=0.05, drop_cols=['sample'], subject='subject', group='group', permutations=50)

format_anova_table(df, aov_results, pairwise_results, pairwise_cols, group, permutations, alpha, correction)[source]¶

Performs p-value correction (permutation-based and FDR) and converts pandas dataframe into final format.

Parameters

df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
aov_results (list[tuple]) – list of tuples with anova results (one tuple per feature).
pairwise_results (list[dataframes]) – list of pandas dataframes with posthoc tests results
group (str) – column with group identifiers
alpha (float) – error rate for multiple hypothesis correction
permutations (int) – number of permutations used to estimate false discovery rates

Returns

Pandas dataframe

run_ttest(df, condition1, condition2, alpha=0.05, drop_cols=['sample'], subject='subject', group='group', paired=False, correction='fdr_bh', permutations=50, is_logged=True, non_par=False)[source]¶

Runs t-test (paired/unpaired) for each protein in dataset and performs permutation-based (if permutations>0) or Benjamini/Hochberg (if permutations=0) multiple hypothesis correction.

Parameters

df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
condition1 (str) – first of two conditions of the independent variable
condition2 (str) – second of two conditions of the independent variable
subject (str) – column with subject identifiers
group (str) – column with group identifiers (independent variable)
drop_cols (list) – column labels to be dropped from the dataframe
paired (bool) – paired or unpaired samples
correction (str) – p-value correction method (see apply_pvalue_correction for the available methods)
alpha (float) – error rate for multiple hypothesis correction
permutations (int) – number of permutations used to estimate false discovery rates.
is_logged (bool) – data is log-transformed
non_par (bool) – if True, normality and variance equality assumptions are checked, and the non-parametric Mann-Whitney U test is used if they are not met

Returns

Pandas dataframe with columns ‘identifier’, ‘group1’, ‘group2’, ‘mean(group1)’, ‘mean(group2)’, ‘std(group1)’, ‘std(group2)’, ‘Log2FC’, ‘FC’, ‘rejected’, ‘T-statistics’, ‘p-value’, ‘correction’, ‘-log10 p-value’, and ‘method’.

Example:

result = run_ttest(df, condition1='group1', condition2='group2', alpha = 0.05, drop_cols=['sample'], subject='subject', group='group', paired=False, correction='fdr_bh', permutations=50)
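
The Benjamini/Hochberg correction applied when permutations=0 can be sketched in plain Python as follows. The function benjamini_hochberg is a hypothetical stand-in for illustration only; it is not the library's apply_pvalue_correction, whose internals are not shown here.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return (rejected, padj) using the Benjamini/Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    padj = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        padj[i] = running_min
    rejected = [p <= alpha for p in padj]
    return rejected, padj


rejected, padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005], alpha=0.05)
```

Each adjusted p-value is the raw p-value scaled by m / rank, capped so that adjusted values never decrease as the raw p-values decrease.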

define_samr_method(df, subject, group, drop_cols)[source]¶

Method to identify the correct problem type to run with SAMR

Parameters

df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
subject (str) – column with subject identifiers
group (str) – column with group identifiers
drop_cols (str) – columns to be dropped

Returns

tuple with the method to be used (One Class, Two class paired, Two class unpaired or Multiclass) and the labels (conditions)

Example:

method, labels = define_samr_method(df, subject, group, drop_cols)

calculate_pvalue_from_tstats(tstat, dfn, dfk)[source]¶

Calculate two-tailed p-values from T- or F-statistics.

Parameters

tstat – T/F distribution
dfn (dict) – degrees of freedom n (values) per protein (keys), i.e. number of observations - number of groups
dfk (dict) – degrees of freedom k (values) per protein (keys), i.e. number of groups - 1

run_samr(df, subject='subject', group='group', drop_cols=['subject', 'sample'], alpha=0.05, s0='null', permutations=250, fc=0, is_logged=True, localfdr=False)[source]¶

Python adaptation of the ‘samr’ R package for statistical tests with permutation-based correction and s0 parameter. For more information visit https://cran.r-project.org/web/packages/samr/samr.pdf. SAMR is only run if R is installed and permutations is greater than 0; otherwise, ANOVA is run instead.

Parameters

df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).
subject (str) – column with subject identifiers
group (str) – column with group identifiers
drop_cols (list) – column labels to be dropped from the dataframe
alpha (float) – error rate for multiple hypothesis correction
s0 (float) – exchangeability factor for denominator of test statistic
permutations (int) – number of permutations used to estimate false discovery rates. If number of permutations is equal to zero, the function will run anova with FDR Benjamini/Hochberg correction.
fc (float) – minimum fold change to define practical significance (needed when computing delta table)

Returns

Pandas dataframe with columns ‘identifier’, ‘group1’, ‘group2’, ‘mean(group1)’, ‘mean(group2)’, ‘Log2FC’, ‘FC’, ‘T-statistics’, ‘p-value’, ‘padj’, ‘correction’, ‘-log10 p-value’, ‘rejected’ and ‘method’

Example:

result = run_samr(df, subject='subject', group='group', drop_cols=['subject', 'sample'], alpha=0.05, s0=1, permutations=250, fc=0)

calculate_discriminant_lines(result)[source]¶

run_fisher(group1, group2, alternative='two-sided')[source]¶

Contingency table:

          annotated   not-annotated
group1    a           b
group2    c           d

group1 = [a, b]
group2 = [c, d]

odds, pvalue = stats.fisher_exact([[a, b], [c, d]])
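
To make the 2x2 layout concrete, the following sketch computes a two-sided Fisher's exact test from first principles using the hypergeometric distribution. The function fisher_exact_two_sided is hypothetical and only illustrates the calculation; the documented function relies on stats.fisher_exact.

```python
from math import comb


def fisher_exact_two_sided(group1, group2):
    """Odds ratio and two-sided p-value for the table [[a, b], [c, d]]."""
    a, b = group1
    c, d = group2
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Unnormalised hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    # Sum over all tables that are at most as likely as the observed one.
    extreme = sum(weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed)
    pvalue = extreme / comb(row1 + row2, col1)
    odds = (a * d) / (b * c) if b * c else float("inf")
    return odds, pvalue


odds, pvalue = fisher_exact_two_sided([8, 2], [1, 5])
```

Comparing the integer weights rather than floating-point probabilities avoids rounding problems when two tables are equally likely.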

run_kolmogorov_smirnov(dist1, dist2, alternative='two-sided')[source]¶

Compute the Kolmogorov-Smirnov statistic on 2 samples. See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html

Parameters

dist1 (list) – sequence of 1-D ndarray (first distribution to compare) drawn from a continuous distribution
dist2 (list) – sequence of 1-D ndarray (second distribution to compare) drawn from a continuous distribution
alternative (str) – defines the alternative hypothesis (‘two-sided’, ‘less’ or ‘greater’; default is ‘two-sided’).

Returns

Tuple with the KS statistic (float) and the two-tailed p-value (float).

Example:

result = run_kolmogorov_smirnov(dist1, dist2, alternative='two-sided')
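
The two-sample KS statistic is the largest absolute difference between the two empirical cumulative distribution functions. A minimal sketch of that calculation, with the hypothetical name ks_statistic and no p-value computation:

```python
from bisect import bisect_right


def ks_statistic(dist1, dist2):
    """Maximum absolute distance between the empirical CDFs of two samples."""
    sorted1, sorted2 = sorted(dist1), sorted(dist2)

    def ecdf(sorted_values, x):
        # Fraction of observations that are <= x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    # The maximum distance is always attained at one of the observed values.
    return max(abs(ecdf(sorted1, x) - ecdf(sorted2, x)) for x in sorted1 + sorted2)


statistic = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```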

run_site_regulation_enrichment(regulation_data, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', regex='(\\w+~.+)_\\w\\d+\\-\\w+', correction='fdr_bh')[source]¶

This function runs a simple enrichment analysis for significantly regulated protein sites in a dataset.

Parameters

regulation_data – pandas dataframe resulting from differential regulation analysis.
annotation – pandas dataframe with annotations for features (columns: ‘annotation’, ‘identifier’ (feature identifiers), and ‘source’).
identifier (str) – name of the column from annotation containing feature identifiers.
groups (list) – column names from regulation_data containing group identifiers.
annotation_col (str) – name of the column from annotation containing annotation terms.
reject_col (str) – name of the column from regulation_data containing boolean for rejected null hypothesis.
group_col (str) – column name for new column in annotation dataframe determining if feature belongs to foreground or background.
method (str) – method used to compute enrichment (only ‘fisher’ is supported currently).
regex (str) – how to extract the annotated identifier from the site identifier

Returns

Pandas dataframe with columns: ‘terms’, ‘identifiers’, ‘foreground’, ‘background’, ‘pvalue’, ‘padj’ and ‘rejected’.

Example:

result = run_site_regulation_enrichment(regulation_data, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', regex=r'(\w+~.+)_\w\d+\-\w+')
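
The default regular expression captures everything before the site suffix as the annotated identifier. The sketch below shows what it extracts; the site identifier 'AKT1~P31749_S473-p' is an assumed example format chosen to match the pattern, and site_to_protein is a hypothetical helper.

```python
import re

# Default pattern from the function signature, written as a raw string.
SITE_REGEX = r"(\w+~.+)_\w\d+\-\w+"


def site_to_protein(site_identifier, regex=SITE_REGEX):
    """Return the first captured group (the protein part) or None if there is no match."""
    match = re.search(regex, site_identifier)
    return match.group(1) if match else None


protein = site_to_protein("AKT1~P31749_S473-p")
unmatched = site_to_protein("AKT1~P31749")
```

Identifiers without a site suffix do not match the pattern, so the helper returns None for them.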

run_regulation_enrichment(regulation_data, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh')[source]¶

This function runs a simple enrichment analysis for significantly regulated features in a dataset.

Parameters

regulation_data – pandas dataframe resulting from differential regulation analysis.
annotation – pandas dataframe with annotations for features (columns: ‘annotation’, ‘identifier’ (feature identifiers), and ‘source’).
identifier (str) – name of the column from annotation containing feature identifiers.
groups (list) – column names from regulation_data containing group identifiers.
annotation_col (str) – name of the column from annotation containing annotation terms.
reject_col (str) – name of the column from regulation_data containing boolean for rejected null hypothesis.
group_col (str) – column name for new column in annotation dataframe determining if feature belongs to foreground or background.
method (str) – method used to compute enrichment (only ‘fisher’ is supported currently).

Returns

Pandas dataframe with columns: ‘terms’, ‘identifiers’, ‘foreground’, ‘background’, ‘pvalue’, ‘padj’ and ‘rejected’.

Example:

result = run_regulation_enrichment(regulation_data, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher')

run_enrichment(data, foreground_id, background_id, annotation_col='annotation', group_col='group', identifier_col='identifier', method='fisher', correction='fdr_bh')[source]¶

Computes enrichment of the foreground relative to a given background, using Fisher’s exact test, and corrects for multiple hypothesis testing.

Parameters

data – pandas dataframe with annotations for dataset features (columns: ‘annotation’, ‘identifier’, ‘source’, ‘group’).
foreground_id (str) – group identifier of features that belong to the foreground.
background_id (str) – group identifier of features that belong to the background.
annotation_col (str) – name of the column containing annotation terms.
group_col (str) – name of column containing the group identifiers.
identifier_col (str) – name of column containing dependent variables identifiers.
method (str) – method used to compute enrichment (only ‘fisher’ is supported currently).

Returns

Pandas dataframe with annotation terms, features, number of foreground/background features in each term, p-values and corrected p-values (columns: ‘terms’, ‘identifiers’, ‘foreground’, ‘background’, ‘pvalue’, ‘padj’ and ‘rejected’).

Example:

result = run_enrichment(data, foreground_id='foreground', background_id='background', annotation_col='annotation', group_col='group', identifier_col='identifier', method='fisher')
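
Before Fisher's exact test can be applied, each annotation term needs a 2x2 table of annotated versus not-annotated features in the foreground and background. The sketch below shows one way to build those counts; contingency_counts is a hypothetical helper working on a list of dictionaries, and the exact counting scheme used by run_enrichment is an assumption.

```python
def contingency_counts(rows, term, foreground_id, background_id):
    """Return ([a, b], [c, d]): annotated and not-annotated feature counts per group."""
    counts = {}
    for group_id in (foreground_id, background_id):
        members = {r["identifier"] for r in rows if r["group"] == group_id}
        annotated = {
            r["identifier"]
            for r in rows
            if r["group"] == group_id and r["annotation"] == term
        }
        counts[group_id] = [len(annotated), len(members - annotated)]
    return counts[foreground_id], counts[background_id]


# Hypothetical annotation table with the documented column names.
rows = [
    {"identifier": "P1", "annotation": "glycolysis", "group": "foreground"},
    {"identifier": "P2", "annotation": "glycolysis", "group": "foreground"},
    {"identifier": "P3", "annotation": "apoptosis", "group": "foreground"},
    {"identifier": "P4", "annotation": "glycolysis", "group": "background"},
    {"identifier": "P5", "annotation": "apoptosis", "group": "background"},
    {"identifier": "P6", "annotation": "apoptosis", "group": "background"},
]
group1, group2 = contingency_counts(rows, "glycolysis", "foreground", "background")
```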

calculate_fold_change(df, condition1, condition2)[source]¶

Calculates fold-changes between two groups for all proteins in a dataframe.

Parameters

df – pandas dataframe with samples as rows and protein identifiers as columns.
condition1 (str) – identifier of first group.
condition2 (str) – identifier of second group.

Returns

Numpy array.

Example:

result = calculate_fold_change(data, 'group1', 'group2')

cohen_d(df, condition1, condition2, ddof=0)[source]¶

Calculates Cohen’s d effect size based on the distance between two means, measured in standard deviations. For more information visit https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanstd.html.

Parameters

df – pandas dataframe with samples as rows and protein identifiers as columns.
condition1 (str) – identifier of first group.
condition2 (str) – identifier of second group.
ddof (int) – means Delta Degrees of Freedom.

Returns

Numpy array.

Example:

result = cohen_d(data, 'group1', 'group2', ddof=0)
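
For reference, the textbook form of Cohen's d divides the difference in means by a pooled standard deviation. The sketch below implements that form for two plain lists; cohen_d_sketch and variance are hypothetical names, and the way ddof enters the library's own calculation may differ.

```python
import math


def variance(values, ddof=0):
    """Variance with Delta Degrees of Freedom (divisor is n - ddof)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - ddof)


def cohen_d_sketch(group1, group2, ddof=0):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    mean_diff = sum(group1) / n1 - sum(group2) / n2
    pooled_var = (
        (n1 - 1) * variance(group1, ddof) + (n2 - 1) * variance(group2, ddof)
    ) / (n1 + n2 - 2)
    return mean_diff / math.sqrt(pooled_var)


d = cohen_d_sketch([2.0, 4.0, 6.0], [1.0, 3.0, 5.0], ddof=1)
```

Here both groups have a sample variance of 4, so the pooled standard deviation is 2 and a mean difference of 1 gives d = 0.5.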

hedges_g(df, condition1, condition2, ddof=0)[source]¶

Calculates Hedges’ g effect size (more accurate for sample sizes below 20 than Cohen’s d). For more information visit https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanstd.html.

Parameters

df – pandas dataframe with samples as rows and protein identifiers as columns.
condition1 (str) – identifier of first group.
condition2 (str) – identifier of second group.
ddof (int) – means Delta Degrees of Freedom.

Returns

Numpy array.

Example:

result = hedges_g(data, 'group1', 'group2', ddof=0)
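
Hedges' g is commonly obtained by multiplying Cohen's d by a small-sample correction factor. The sketch below uses the usual approximation 1 - 3 / (4(n1 + n2) - 9); hedges_g_from_d is a hypothetical helper, and whether the library uses this exact approximation is an assumption.

```python
def hedges_g_from_d(d, n1, n2):
    """Apply the approximate small-sample bias correction to Cohen's d."""
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction


# With d = 0.5 and three samples per group the correction factor is 0.8.
g = hedges_g_from_d(0.5, 3, 3)
```

The correction approaches 1 as the sample sizes grow, so g and d converge for large groups.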

run_mapper(data, lenses=['l2norm'], n_cubes=15, overlap=0.5, n_clusters=3, linkage='complete', affinity='correlation')[source]¶

Parameters

data –
lenses –
n_cubes –
overlap –
n_clusters –
linkage –
affinity –

Returns

run_WGCNA(data, drop_cols_exp, drop_cols_cli, RsquaredCut=0.8, networkType='unsigned', minModuleSize=30, deepSplit=2, pamRespectsDendro=False, merge_modules=True, MEDissThres=0.25, verbose=0, sd_cutoff=0)[source]¶

Runs an automated weighted gene co-expression network analysis (WGCNA), using input proteomics/transcriptomics/genomics and clinical variables data.

Parameters

data (dict) – dictionary of pandas dataframes with processed clinical and experimental datasets
drop_cols_exp (list) – column names to be removed from the experimental dataset.
drop_cols_cli (list) – column names to be removed from the clinical dataset.
RsquaredCut (float) – desired minimum scale free topology fitting index R^2.
networkType (str) – network type (‘unsigned’, ‘signed’, ‘signed hybrid’, ‘distance’).
minModuleSize (int) – minimum module size.
deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.
pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.
merge_modules (bool) – if True, very similar modules are merged.
MEDissThres (float) – maximum dissimilarity (i.e., 1-correlation) that qualifies modules for merging.
verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.

Returns

Tuple with multiple pandas dataframes.

Example:

result = run_WGCNA(data, drop_cols_exp=['subject', 'sample', 'group', 'index'], drop_cols_cli=['subject', 'biological_sample', 'group', 'index'], RsquaredCut=0.8, networkType='unsigned', minModuleSize=30, deepSplit=2, pamRespectsDendro=False, merge_modules=True, MEDissThres=0.25, verbose=0)

most_central_edge(G)[source]¶

Computes the eigenvector centrality for the graph G and finds the highest value.

Parameters: G (graph) – networkx graph
Returns: Highest eigenvector centrality value.
Return type: float

get_louvain_partitions(G, weight)[source]¶

Computes the partition of the graph nodes which maximises the modularity (or tries to) using the Louvain heuristics. For more information visit https://python-louvain.readthedocs.io/en/latest/api.html.

Parameters

G (graph) – networkx graph which is decomposed.
weight (str) – the key in graph to use as weight.

Returns

The partition, with communities numbered from 0 to number of communities.

Return type

dict

get_network_communities(graph, args)[source]¶

Finds communities in a graph using different methods. For more information on the methods visit:

https://networkx.github.io/documentation/latest/reference/algorithms/generated/networkx.algorithms.community.modularity_max.greedy_modularity_communities.html

https://networkx.github.io/documentation/networkx-2.0/reference/algorithms/generated/networkx.algorithms.community.asyn_lpa.asyn_lpa_communities.html

https://networkx.github.io/documentation/latest/reference/algorithms/generated/networkx.algorithms.community.centrality.girvan_newman.html

https://networkx.github.io/documentation/latest/reference/generated/networkx.convert_matrix.to_pandas_adjacency.html

Parameters

graph (graph) – networkx graph
args (dict) – config file arguments

Returns

Dictionary of nodes and which community they belong to (from 0 to number of communities).

get_publications_abstracts(data, publication_col='publication', join_by=['publication', 'Proteins', 'Diseases'], index='PMID')[source]¶

Accesses NCBI PubMed over the WWW and retrieves the abstracts corresponding to a list of one or more PubMed IDs.

Parameters

data – pandas dataframe of diseases and publications linked to a list of proteins (columns: ‘Diseases’, ‘Proteins’, ‘linkout’ and ‘publication’).
publication_col (str) – column label containing PubMed ids.
join_by (list) – column labels to be kept from the input dataframe.
index (str) – column label containing PubMed ids from the NCBI retrieved data.

Returns

Pandas dataframe with publication information and columns ‘PMID’, ‘abstract’, ‘authors’, ‘date’, ‘journal’, ‘keywords’, ‘title’, ‘url’, ‘Proteins’ and ‘Diseases’.

Example:

result = get_publications_abstracts(data, publication_col='publication', join_by=['publication','Proteins','Diseases'], index='PMID')

eta_squared(aov)[source]¶

Calculates the effect size using Eta-squared.

Parameters: aov – pandas dataframe with anova results from statsmodels.
Returns: Pandas dataframe with additional Eta-squared column.

omega_squared(aov)[source]¶

Calculates the effect size using Omega-squared.

Parameters: aov – pandas dataframe with anova results from statsmodels.
Returns: Pandas dataframe with additional Omega-squared column.
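
Both effect sizes follow directly from the sums of squares in an ANOVA table. A minimal sketch for a one-way design, taking the sums of squares and degrees of freedom as plain numbers (the function names are hypothetical, and the documented functions operate on a statsmodels results dataframe instead):

```python
def eta_squared_sketch(ss_effect, ss_error):
    """Proportion of total variance explained by the effect."""
    return ss_effect / (ss_effect + ss_error)


def omega_squared_sketch(ss_effect, ss_error, df_effect, df_error):
    """Less biased effect size estimate that accounts for the error mean square."""
    ms_error = ss_error / df_error
    ss_total = ss_effect + ss_error
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)


eta2 = eta_squared_sketch(ss_effect=20.0, ss_error=80.0)
omega2 = omega_squared_sketch(ss_effect=20.0, ss_error=80.0, df_effect=2, df_error=20)
```

Omega-squared is always somewhat smaller than Eta-squared for the same table, because it subtracts the variance expected by chance.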

run_two_way_anova(df, drop_cols=['sample'], subject='subject', group=['group', 'secondary_group'])[source]¶

Run a 2-way ANOVA when data[‘secondary_group’] is not empty

Parameters

df – processed pandas dataframe with samples as rows, and proteins and groups as columns.
drop_cols (list) – column names to drop from dataframe
subject (str) – column name containing subject identifiers.
group (list) – column names corresponding to independent variable groups

Returns

Two dataframes, anova results and residuals.

Example:

result = run_two_way_anova(data, drop_cols=['sample'], subject='subject', group=['group', 'secondary_group'])

merge_for_polar(regulation_data, regulators, identifier_col='identifier', group_col='group', theta_col='modifier', aggr_func='mean', normalize=True)[source]¶

run_qc_markers_analysis(data, qc_markers, sample_col='sample', group_col='group', drop_cols=['subject'], identifier_col='identifier', qcidentifier_col='identifier', qcclass_col='class')[source]¶

run_snf(df_dict, clusters, distance_metric, K_affinity, mu_affinity)[source]¶

Parameters

df_dict –
clusters –

run_km(data, time_col, event_col, group_col, args={})[source]¶

wgcnaAnalysis.py¶

get_data(data, drop_cols_exp=['subject', 'group', 'sample', 'index'], drop_cols_cli=['subject', 'group', 'biological_sample', 'index'], sd_cutoff=0)[source]¶

This function cleans up and formats experimental and clinical data into similarly shaped dataframes.

Parameters

data (dict) – dictionary with processed clinical and proteomics datasets.
drop_cols_exp (list) – list of columns to drop from processed experimental (proteomics/rna-seq/dna-seq) dataframe.
drop_cols_cli (list) – list of columns to drop from processed clinical dataframe.

Returns

Dictionary with experimental and clinical dataframes (keys are the same as in the input dictionary).

get_dendrogram(df, labels, distfun='euclidean', linkagefun='ward', div_clusters=False, fcluster_method='distance', fcluster_cutoff=15)[source]¶

This function calculates the distance matrix and performs hierarchical cluster analysis on a set of dissimilarities and methods for analyzing it.

Parameters

df – pandas dataframe with samples/subjects as index and features as columns.
labels (list) – labels for the leaves of the tree.
distfun (str) – distance measure to be used (‘euclidean’, ‘maximum’, ‘manhattan’, ‘canberra’, ‘binary’, ‘minkowski’ or ‘jaccard’).
linkagefun (str) – hierarchical/agglomeration method to be used (‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ or ‘ward’).
div_clusters (bool) – dividing dendrogram leaves into clusters (True or False).
fcluster_method (str) – criterion to use in forming flat clusters.
fcluster_cutoff (int) – maximum cophenetic distance between observations in each cluster.

Returns

Dictionary of data structures computed to render the dendrogram. Keys: ‘icoords’, ‘dcoords’, ‘ivl’ and ‘leaves’. If div_clusters is used, it will also return a dictionary of each cluster and respective leaves.

get_clusters_elements(linkage_matrix, fcluster_method, fcluster_cutoff, labels)[source]¶

This function implements the generation of flat clusters from a hierarchical clustering with the same interface as scipy.cluster.hierarchy.fcluster.

Parameters

linkage_matrix (ndarray) – hierarchical clustering encoded with a linkage matrix.
fcluster_method (str) – criterion to use in forming flat clusters (‘inconsistent’, ‘distance’, ‘maxclust’, ‘monocrit’, ‘maxclust_monocrit’).
fcluster_cutoff (float) – maximum cophenetic distance between observations in each cluster.
labels (list) – labels for the leaves of the dendrogram.

Returns

A dictionary where keys are the cluster numbers and values are the dendrogram leaves.
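
The returned structure can be pictured as a grouping of leaf labels by flat cluster number. In the sketch below the cluster assignments are given directly as a list (in practice they would come from the flat clustering step), and clusters_to_leaves is a hypothetical helper.

```python
from collections import defaultdict


def clusters_to_leaves(cluster_labels, leaf_labels):
    """Group dendrogram leaf labels by their flat cluster number."""
    clusters = defaultdict(list)
    for cluster, leaf in zip(cluster_labels, leaf_labels):
        clusters[cluster].append(leaf)
    return dict(clusters)


# Assumed flat cluster assignment for four leaves.
clusters = clusters_to_leaves([1, 1, 2, 1], ["s1", "s2", "s3", "s4"])
```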

filter_df_by_cluster(df, clusters, number)[source]¶

Select only the members of a defined cluster.

Parameters

df – pandas dataframe with samples/subjects as index and features as columns.
clusters (dict) – clusters dictionary from get_dendrogram function if div_clusters option was True.
number (int) – cluster number (key).

Returns

Pandas dataframe with all the features (columns) and samples/subjects belonging to the defined cluster (index).

df_sort_by_dendrogram(df, Z_dendrogram)[source]¶

Reorders pandas dataframe by index and according to the dendrogram list of leaf nodes labels.

Parameters

df – pandas dataframe with the labels to be reordered as index.
Z_dendrogram (dict) – dictionary of data structures computed to render the dendrogram. Keys: ‘icoords’, ‘dcoords’, ‘ivl’ and ‘leaves’.

Returns

Reordered pandas dataframe.

get_percentiles_heatmap(df, Z_dendrogram, bydendro=True, bycols=False)[source]¶

This function transforms the absolute values in each row or column (option ‘bycols’) into relative values.

Parameters

df – pandas dataframe with samples/subjects as index and features as columns.
Z_dendrogram (dict) – dictionary of data structures computed to render the dendrogram. Keys: ‘icoords’, ‘dcoords’, ‘ivl’ and ‘leaves’.
bydendro (bool) – if labels should be ordered according to dendrogram list of leaf nodes labels set to True, otherwise set to False.
bycols (bool) – set to False to calculate relative values across rows (samples), or to True to calculate them across columns (features).

Returns

Pandas dataframe.

get_miss_values_df(data)[source]¶

Processes pandas dataframe so missing values can be plotted in heatmap with specific color.

Parameters: data – pandas dataframe.
Returns: Pandas dataframe with missing values as integer 1, and originally valid values as NaN.
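
The transformation amounts to inverting the mask: missing entries become 1 and valid entries become NaN, so a heatmap only colors the missing cells. A minimal sketch on nested lists, with the hypothetical name missing_values_mask:

```python
import math


def missing_values_mask(rows):
    """Replace missing values with 1 and valid values with NaN."""
    nan = float("nan")

    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return [[1 if is_missing(value) else nan for value in row] for row in rows]


data = [
    [10.5, float("nan"), 3.2],
    [None, 7.1, 8.8],
]
mask = missing_values_mask(data)
```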

paste_matrices(matrix1, matrix2, rows, cols)[source]¶

Takes two matrices with analogous shapes and concatenates each value in matrix 1 with the corresponding one in matrix 2, returning a single pandas dataframe.

Parameters

matrix1 (ndarray) – input 1
matrix2 (ndarray) – input 2

Returns

Pandas dataframe.

cutreeDynamic(distmatrix, linkagefun='average', minModuleSize=50, method='hybrid', deepSplit=2, pamRespectsDendro=False, distfun=None)[source]¶

This function implements the R cutreeDynamic wrapper in Python, providing an access point for methods of adaptive branch pruning of hierarchical clustering dendrograms.

Parameters

data – pandas dataframe.
distfun (str) – distance measure to be used (‘euclidean’, ‘maximum’, ‘manhattan’, ‘canberra’, ‘binary’, ‘minkowski’ or ‘jaccard’).
linkagefun (str) – hierarchical/agglomeration method to be used (‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ or ‘ward’).
minModuleSize (int) – minimum module size.
method (str) – method to use (‘hybrid’ or ‘tree’).
deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.
pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.

Returns

Numpy array of numerical labels giving assignment of objects to modules. Unassigned objects are labeled 0, the largest module has label 1, next largest 2 etc.

build_network(data, softPower=6, networkType='unsigned', linkagefun='average', method='hybrid', minModuleSize=50, deepSplit=2, pamRespectsDendro=False, merge_modules=True, MEDissThres=0.4, verbose=0)[source]¶

Weighted gene network construction and module detection. Calculates co-expression similarity and adjacency, topological overlap matrix (TOM) and clusters features in modules.

Parameters

data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
softPower (int) – soft-thresholding power.
networkType (str) – network type (‘unsigned’, ‘signed’, ‘signed hybrid’, ‘distance’).
linkagefun (str) – hierarchical/agglomeration method to be used (‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ or ‘ward’).
method (str) – method to use (‘hybrid’ or ‘tree’).
minModuleSize (int) – minimum module size.
pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.
merge_modules (bool) – if True, very similar modules are merged.
MEDissThres (float) – maximum dissimilarity (i.e., 1-correlation) that qualifies modules for merging.
verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.
deepSplit (int) – provides rough control over sensitivity to cluster splitting: the higher the value (with the ‘hybrid’ method), or if True (with the ‘tree’ method), the more and smaller the modules.

Returns

Tuple with the TOM dissimilarity as a pandas dataframe and a numpy array with the module color of each experimental feature.
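Example: a rough sketch of the adjacency and TOM steps for an ‘unsigned’ network, using the standard WGCNA definitions rather than this function's R-backed implementation. It assumes Pearson correlation and no missing values; `tom_dissimilarity` and `soft_power` are hypothetical names.

```python
import numpy as np
import pandas as pd


def tom_dissimilarity(data: pd.DataFrame, soft_power: int = 6) -> pd.DataFrame:
    """Unsigned adjacency -> topological overlap -> 1 - TOM (illustrative only)."""
    # Pearson correlation between features (columns); samples are rows
    cor = np.corrcoef(data.to_numpy(dtype=float), rowvar=False)
    # Unsigned adjacency: a_ij = |cor_ij| ** soft_power, with a_ii set to 0
    adj = np.abs(cor) ** soft_power
    np.fill_diagonal(adj, 0.0)
    # l_ij = sum_u a_iu * a_uj (connection strength through shared neighbours)
    shared = adj @ adj
    # k_i = sum_u a_iu (connectivity of feature i)
    k = adj.sum(axis=1)
    # TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij)
    tom = (shared + adj) / (np.minimum.outer(k, k) + 1.0 - adj)
    np.fill_diagonal(tom, 1.0)
    return pd.DataFrame(1.0 - tom, index=data.columns, columns=data.columns)


# Toy data: 20 samples (rows) x 5 features (columns)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(20, 5)), columns=list("abcde"))
diss_tom = tom_dissimilarity(toy)
```

The resulting dissimilarity matrix is what a hierarchical clustering step would take as input for module detection.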

pick_softThreshold(data, RsquaredCut=0.8, networkType='unsigned', verbose=0)[source]¶

Analysis of scale free topology for multiple soft thresholding powers. Aids the user in choosing a proper soft-thresholding power for network construction.

Parameters

data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
RsquaredCut (float) – desired minimum scale free topology fitting index R^2.
networkType (str) – network type (‘unsigned’, ‘signed’, ‘signed hybrid’, ‘distance’).
verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.

Returns

Estimated appropriate soft-thresholding power: the lowest power for which the scale free topology fit R^2 exceeds RsquaredCut.

Return type

int
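Example: a small sketch of the selection rule stated above, given fit indices that have already been computed. The R^2 values below are made up for illustration, and returning None when no power passes the cut is an assumption, not documented behaviour; `lowest_power_above_cut` is a hypothetical name.

```python
def lowest_power_above_cut(fit_by_power, rsquared_cut=0.8):
    """Return the lowest power whose scale-free fit R^2 exceeds the cut."""
    for power in sorted(fit_by_power):
        if fit_by_power[power] > rsquared_cut:
            return power
    # Assumption: signal "no suitable power" with None
    return None


# Illustrative (made-up) R^2 values per candidate power
fits = {1: 0.12, 2: 0.45, 3: 0.71, 4: 0.83, 5: 0.88, 6: 0.91}
chosen = lowest_power_above_cut(fits, rsquared_cut=0.8)
print(chosen)  # 4
```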

identify_module_colors(matrix, linkagefun='average', method='hybrid', minModuleSize=30, deepSplit=2, pamRespectsDendro=False)[source]¶

Identifies co-expression modules and converts the numeric labels into colors.

Parameters

matrix – dissimilarity structure as produced by R.stats dist.
minModuleSize (int) – minimum module size.
deepSplit (int) – provides rough control over sensitivity to cluster splitting: the higher the value (with the ‘hybrid’ method), or if True (with the ‘tree’ method), the more and smaller the modules.
pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.

Returns

Numpy array of strings with module color of each experimental feature.
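Example: a minimal sketch of converting numeric module labels into color names. Using ‘grey’ for label 0 (unassigned) follows the usual WGCNA convention; the short palette and the name `labels_to_colors` are assumptions for illustration, not this function's internals.

```python
# Illustrative palette; label 1 maps to the first entry, label 2 to the second, etc.
PALETTE = ["turquoise", "blue", "brown", "yellow", "green", "red"]


def labels_to_colors(labels, palette=PALETTE, unassigned_color="grey"):
    """Map numeric module labels to color names (0 means unassigned)."""
    colors = []
    for label in labels:
        if label == 0:
            colors.append(unassigned_color)
        elif 1 <= label <= len(palette):
            colors.append(palette[label - 1])
        else:
            raise ValueError(f"no color available for label {label}")
    return colors


print(labels_to_colors([1, 0, 2, 1, 3]))
# ['turquoise', 'grey', 'blue', 'turquoise', 'brown']
```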

calculate_module_eigengenes(data, modColors, softPower=6, dissimilarity=True)[source]¶

Calculates module eigengenes to quantify co-expression similarity of entire modules.

Parameters

data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
softPower (int) – soft-thresholding power.
dissimilarity (bool) – if True, also calculates the dissimilarity of module eigengenes.

Returns

Pandas dataframe with calculated module eigengenes. If dissimilarity is set to True, returns a tuple with two pandas dataframes, the first with the module eigengenes and the second with the eigengenes dissimilarity.
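Example: a sketch of the general idea, assuming the usual WGCNA definition of a module eigengene as the first principal component of the standardized features in a module. It is not this function's implementation (which relies on R); `module_eigengenes` and the ‘ME’ column prefix are illustrative, and features are assumed to be non-constant with no missing values.

```python
import numpy as np
import pandas as pd


def module_eigengenes(data: pd.DataFrame, mod_colors):
    """Return (eigengenes, eigengene dissimilarity) for each module color."""
    colors = np.asarray(mod_colors)
    eigengenes = {}
    for color in np.unique(colors):
        # Features (columns) assigned to this module
        block = data.loc[:, colors == color].to_numpy(dtype=float)
        # Standardize each feature across samples
        block = (block - block.mean(axis=0)) / block.std(axis=0, ddof=1)
        # First left singular vector = first principal component over samples
        u, _, _ = np.linalg.svd(block, full_matrices=False)
        me = u[:, 0]
        # Orient the eigengene so it correlates positively with mean expression
        if np.corrcoef(me, block.mean(axis=1))[0, 1] < 0:
            me = -me
        eigengenes[f"ME{color}"] = me
    mes = pd.DataFrame(eigengenes, index=data.index)
    # Dissimilarity between modules: 1 - correlation of their eigengenes
    return mes, 1.0 - mes.corr()


# Toy data: 12 samples x 5 features, two modules
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(12, 5)), columns=["f1", "f2", "f3", "f4", "f5"])
mes, me_diss = module_eigengenes(toy, ["blue", "blue", "blue", "brown", "brown"])
```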

merge_similar_modules(data, modColors, MEDissThres=0.4, verbose=0)[source]¶

Merges modules in co-expression network that are too close as measured by the correlation of their eigengenes.

Parameters

data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
MEDissThres (float) – maximum dissimilarity (i.e., 1-correlation) that qualifies modules for merging.
verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.

Returns

Tuple containing a pandas dataframe with the eigengenes of the new merged modules, and an array with the module color of each experimental feature.
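Example: a simplified, single-pass sketch of threshold-based merging. It clusters modules by eigengene dissimilarity with average linkage (via SciPy) and merges every group that joins below the threshold. Keeping the color of the largest member module is an assumption, and the sketch does not recompute eigengenes after merging; `merge_close_modules` is a hypothetical name, and the dissimilarity dataframe is assumed to be indexed by module color.

```python
from collections import Counter

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def merge_close_modules(me_diss: pd.DataFrame, mod_colors, med_diss_thres=0.4):
    """Relabel features so that modules closer than the threshold share a color."""
    modules = list(me_diss.index)
    if len(modules) < 2:
        return np.asarray(mod_colors)
    d = me_diss.to_numpy(dtype=float).copy()
    np.fill_diagonal(d, 0.0)
    # Average-linkage tree over modules, cut at the dissimilarity threshold
    tree = linkage(squareform(d, checks=False), method="average")
    groups = fcluster(tree, t=med_diss_thres, criterion="distance")
    sizes = Counter(mod_colors)
    rename = {}
    for group in set(groups):
        members = [m for m, g in zip(modules, groups) if g == group]
        # Assumption: the merged module keeps the color of its largest member
        keep = max(members, key=lambda m: sizes[m])
        rename.update({m: keep for m in members})
    return np.array([rename.get(color, color) for color in mod_colors])


mods = ["blue", "green", "red"]
toy_diss = pd.DataFrame(
    [[0.0, 0.9, 0.1], [0.9, 0.0, 0.8], [0.1, 0.8, 0.0]], index=mods, columns=mods
)
merged = merge_close_modules(toy_diss, ["blue", "blue", "blue", "red", "green", "green"])
print(merged.tolist())  # ['blue', 'blue', 'blue', 'blue', 'green', 'green']
```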

calculate_ModuleTrait_correlation(df_exp, df_traits, MEs)[source]¶

Correlates eigengenes with external traits in order to identify the most significant module-trait associations.

Parameters

df_exp – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
df_traits – pandas dataframe containing clinical data, with samples/subjects as rows and clinical traits as columns.
MEs – pandas dataframe with module eigengenes.

Returns

Tuple with two pandas dataframes: the first with the correlation between all module eigengenes and all clinical traits, the second with the concatenated correlation and p-value used for heatmap annotation.
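Example: a minimal sketch of the two outputs described above, assuming Pearson correlation (via SciPy), samples in the same order in both dataframes, and no missing values. The annotation format (correlation, then p-value in brackets on a new line) and the name `module_trait_correlation` are illustrative assumptions.

```python
import pandas as pd
from scipy import stats


def module_trait_correlation(mes: pd.DataFrame, traits: pd.DataFrame):
    """Return (correlations, annotation text) for every eigengene/trait pair."""
    cor_rows, text_rows = [], []
    for module in mes.columns:
        cor_row, text_row = [], []
        for trait in traits.columns:
            r, p = stats.pearsonr(mes[module], traits[trait])
            cor_row.append(float(r))
            # Text cell for a heatmap: correlation with the p-value underneath
            text_row.append(f"{r:.2f}\n({p:.1e})")
        cor_rows.append(cor_row)
        text_rows.append(text_row)
    cor = pd.DataFrame(cor_rows, index=mes.columns, columns=traits.columns)
    text = pd.DataFrame(text_rows, index=mes.columns, columns=traits.columns)
    return cor, text


# Toy eigengene and trait values for five samples
toy_mes = pd.DataFrame({"MEblue": [0.1, 0.4, -0.2, 0.3, -0.6]})
toy_traits = pd.DataFrame({"age": [50, 61, 43, 55, 30], "bmi": [22.0, 27.5, 24.1, 30.2, 25.3]})
cor, text = module_trait_correlation(toy_mes, toy_traits)
```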

calculate_ModuleMembership(data, MEs)[source]¶

For each module, calculates the correlation between the module eigengene and the feature expression profile, as a quantitative measure of module membership (MM).

Parameters

data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
MEs – pandas dataframe with module eigengenes.

Returns

Tuple with two pandas dataframes, one with module membership correlations and another with p-values.
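Example: a vectorized sketch of module membership, assuming Pearson correlation with a Student t-test p-value, at least three samples, and no missing values. This mirrors the general approach rather than the function's internals; `module_membership` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from scipy import stats


def module_membership(data: pd.DataFrame, mes: pd.DataFrame):
    """Correlate every feature with every eigengene; return (r, p-values)."""
    n = data.shape[0]
    x = data.to_numpy(dtype=float)
    y = mes.to_numpy(dtype=float)
    # Standardize columns so the cross-product gives Pearson correlations
    zx = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    zy = (y - y.mean(axis=0)) / y.std(axis=0, ddof=1)
    r = np.clip(zx.T @ zy / (n - 1), -1.0, 1.0)
    # Student t statistic with n - 2 degrees of freedom, two-sided p-value
    with np.errstate(divide="ignore"):
        t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    p = 2.0 * stats.t.sf(np.abs(t_stat), df=n - 2)
    mm = pd.DataFrame(r, index=data.columns, columns=mes.columns)
    pvalues = pd.DataFrame(p, index=data.columns, columns=mes.columns)
    return mm, pvalues


# Toy data: 15 samples x 4 features; stand-in eigengenes built from feature means
rng = np.random.default_rng(1)
expr = pd.DataFrame(rng.normal(size=(15, 4)), columns=["f1", "f2", "f3", "f4"])
eigen = pd.DataFrame({
    "MEblue": expr[["f1", "f2"]].mean(axis=1),
    "MEbrown": expr[["f3", "f4"]].mean(axis=1),
})
mm, mm_pval = module_membership(expr, eigen)
```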

calculate_FeatureTraitSignificance(df_exp, df_traits)[source]¶

Quantifies associations of individual experimental features with the measured clinical traits, by defining Feature Significance (FS) as the absolute value of the correlation between the feature and the trait.

Parameters

df_exp – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
df_traits – pandas dataframe containing clinical data, with samples/subjects as rows and clinical traits as columns.

Returns

Tuple with two pandas dataframes, one with feature significance correlations and another with p-values.
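Example: a short sketch of the Feature Significance definition given above (absolute correlation between feature and trait), assuming Pearson correlation and dataframes that share the same sample index. Only the correlation part is shown; `feature_significance` is a hypothetical name.

```python
import pandas as pd


def feature_significance(df_exp: pd.DataFrame, df_traits: pd.DataFrame) -> pd.DataFrame:
    """Absolute Pearson correlation of every feature with every trait."""
    return pd.DataFrame({
        trait: df_exp.corrwith(df_traits[trait]).abs()
        for trait in df_traits.columns
    })


toy_exp = pd.DataFrame({"f1": [1, 2, 3, 4], "f2": [4, 3, 2, 1], "f3": [1, 3, 2, 4]})
toy_trait = pd.DataFrame({"trait": [2, 4, 6, 8]})
fs = feature_significance(toy_exp, toy_trait)
# f1 and f2 both reach 1.0 (the sign is dropped); f3 reaches 0.8
```

Sorting a column of the result, for instance with `fs["trait"].sort_values(ascending=False)`, ranks the features by their association with that trait.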

get_FeaturesPerModule(data, modColors, mode='dictionary')[source]¶

Groups all experimental features by the co-expression module they belong to.

Parameters

data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
mode (str) – type of the value returned by the function (‘dictionary’ or ‘dataframe’).

Returns

Depending on selected mode, returns a dictionary or dataframe with module color per experimental feature.
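Example: a minimal sketch of grouping features by module color in the two modes. The exact layout of the real return values is not documented here, so the dictionary shape (color to list of features) and the dataframe column names ‘name’ and ‘modColor’ are assumptions; `features_per_module` is a hypothetical name.

```python
from collections import defaultdict

import pandas as pd


def features_per_module(features, mod_colors, mode="dictionary"):
    """Group feature names by the module color assigned to each of them."""
    if len(features) != len(mod_colors):
        raise ValueError("features and mod_colors must have the same length")
    if mode == "dictionary":
        grouped = defaultdict(list)
        for feature, color in zip(features, mod_colors):
            grouped[color].append(feature)
        return dict(grouped)
    if mode == "dataframe":
        # Assumed layout: one row per feature with its module color
        return pd.DataFrame({"name": list(features), "modColor": list(mod_colors)})
    raise ValueError("mode must be 'dictionary' or 'dataframe'")


names = ["f1", "f2", "f3", "f4"]
colors = ["blue", "grey", "blue", "brown"]
print(features_per_module(names, colors))
# {'blue': ['f1', 'f3'], 'grey': ['f2'], 'brown': ['f4']}
```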

get_ModuleFeatures(data, modColors, modules=[])[source]¶

Groups and returns a list of the experimental features clustered in specific co-expression modules.

Parameters

data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.
modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.
modules (list) – list of module colors of interest.

Returns

List of lists with experimental features in each selected module.

get_EigengenesTrait_correlation(MEs, data)[source]¶

Eigengenes are used as representative profiles of the co-expression modules, and correlation between them is used to quantify module similarity. Clinical traits are added to the eigengenes to see how the traits fit into the eigengene network.

Parameters

MEs – pandas dataframe with module eigengenes.
data – pandas dataframe containing clinical data, with samples/subjects as rows and clinical traits as columns.

Returns

Tuple with two pandas dataframes, one with the module eigengene dissimilarity recalculated with the traits included, and another with all the overall correlations.
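Example: a sketch of adding traits to the eigengenes and recomputing correlation and dissimilarity (1 - correlation). It assumes Pearson correlation and that both dataframes share the same sample index; `eigengene_trait_network` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def eigengene_trait_network(mes: pd.DataFrame, traits: pd.DataFrame):
    """Return (dissimilarity, correlation) over eigengenes and traits together."""
    # Keep only samples present in both dataframes
    combined = pd.concat([mes, traits], axis=1, join="inner")
    cor = combined.corr()
    return 1.0 - cor, cor


# Toy data: 10 samples, two eigengenes and two traits
rng = np.random.default_rng(2)
toy_mes = pd.DataFrame(rng.normal(size=(10, 2)), columns=["MEblue", "MEbrown"])
toy_traits = pd.DataFrame(rng.normal(size=(10, 2)), columns=["age", "bmi"])
net_diss, net_cor = eigengene_trait_network(toy_mes, toy_traits)
```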

kaplan_meierAnalysis.py¶

get_data_ready_for_km(dfs_dict, args)[source]¶

group_data_based_on_marker(df, marker, index_col, how, value)[source]¶

run_km(data, time_col, event_col, group_col, args={})[source]¶

get_km_results(df, group_col, time_col, event_col)[source]¶