Analytics

analytics.py

unit_vector(vector)[source]

Returns the unit vector of the vector.

Parameters

vector (tuple) – vector

Returns

Unit vector (tuple).

flatten(t, my_list=[])[source]

Generator flattening the structure.

Code from: https://gist.github.com/shaxbee/0ada767debf9eefbdb6e

Acknowledgements: Zbigniew Mandziejewicz (shaxbee)

>>> list(flatten([2, [2, (4, 5, [7], [2, [6, 2, 6, [6], 4]], 6)]]))
[2, 2, 4, 5, 7, 2, 6, 2, 6, 6, 4, 6]
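The generator behaviour shown in the doctest can be illustrated with a minimal sketch. The name flatten_sketch is hypothetical and this is not the implementation from the gist; it assumes that only lists and tuples count as nested containers.

```python
def flatten_sketch(items):
    """Yield the leaves of arbitrarily nested lists/tuples, left to right."""
    for item in items:
        if isinstance(item, (list, tuple)):
            # Recurse into nested containers and re-yield their leaves.
            yield from flatten_sketch(item)
        else:
            yield item


nested = [2, [2, (4, 5, [7], [2, [6, 2, 6, [6], 4]], 6)]]
flat = list(flatten_sketch(nested))
```

Because the function is a generator, nothing is computed until the result is consumed, for example with list().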
angle_between(v1, v2)[source]

Returns the angle in radians between vectors ‘v1’ and ‘v2’.

Parameters
  • v1 – first vector

  • v2 – second vector

Returns

Angle between the two vectors in radians (float).

Example:

angle = angle_between((1, 0, 0), (0, 1, 0))
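As a rough illustration of how the two vector helpers relate, the sketch below normalizes both vectors and takes the arccosine of their dot product. The function names ending in _sketch are hypothetical and the code only uses the standard library; the actual implementation may differ.

```python
import math


def unit_vector_sketch(vector):
    """Scale a vector to length 1 (assumes a non-zero vector)."""
    norm = math.sqrt(sum(c * c for c in vector))
    return tuple(c / norm for c in vector)


def angle_between_sketch(v1, v2):
    """Angle in radians between two vectors of equal dimension."""
    u1, u2 = unit_vector_sketch(v1), unit_vector_sketch(v2)
    dot = sum(a * b for a, b in zip(u1, u2))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot)))


angle = angle_between_sketch((1, 0, 0), (0, 1, 0))
```

For the orthogonal vectors used here the result is pi/2.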

transform_into_wide_format(data, index, columns, values, extra=[])[source]

This function converts a Pandas DataFrame from long to wide format using pandas pivot_table() function.

Parameters
  • data – long-format Pandas DataFrame

  • index (list) – columns that will be converted into the index

  • columns (str) – column name whose unique values will become the new column names

  • values (str) – column to aggregate

  • extra (list) – additional columns to be kept as columns

Returns

Wide-format pandas DataFrame

Example:

result = transform_into_wide_format(df, index=['index'], columns='x', values='y', extra=['group'])
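The long-to-wide conversion can be pictured with plain pandas. This is a sketch of the underlying pivot_table() call on a small made-up table, not the function's actual body; the column names and values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical long-format data: one row per (sample, feature) pair.
long_df = pd.DataFrame({
    "sample": ["s1", "s1", "s2", "s2"],
    "group": ["A", "A", "B", "B"],
    "x": ["p1", "p2", "p1", "p2"],
    "y": [1.0, 2.0, 3.0, 4.0],
})

# Unique values of "x" become columns; "y" fills the cells.
wide_df = long_df.pivot_table(index=["sample", "group"], columns="x", values="y").reset_index()
wide_df.columns.name = None  # drop the leftover "x" label on the column axis
```

Note that pivot_table() aggregates duplicates (mean by default), so repeated (index, column) pairs are averaged rather than raising an error.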
transform_into_long_format(data, drop_columns, group, columns=['name', 'y'])[source]

Converts a Pandas DataFrame from wide to long format using the pd.melt() function.

Parameters
  • data – wide-format Pandas DataFrame

  • drop_columns (list) – columns to be deleted

  • group (str or list) – column(s) to use as identifier variables

  • columns (list) – names to use for the 1)variable column, and for the 2)value column

Returns

Long-format Pandas DataFrame.

Example:

result = transform_into_long_format(df, drop_columns=['sample', 'subject'], group='group', columns=['name','y'])
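The reverse, wide-to-long direction can be sketched with pandas melt(). The table below is made up for illustration, and the snippet only approximates what the function is documented to do (drop columns, then melt with the group column as identifier).

```python
import pandas as pd

# Hypothetical wide-format data: one row per sample, one column per feature.
wide_df = pd.DataFrame({
    "sample": ["s1", "s2"],
    "subject": ["u1", "u2"],
    "group": ["A", "B"],
    "p1": [1.0, 3.0],
    "p2": [2.0, 4.0],
})

long_df = (
    wide_df.drop(columns=["sample", "subject"])          # drop_columns
    .melt(id_vars="group", var_name="name", value_name="y")  # group, columns
)
```

Each feature column contributes one row per sample, so the result has (number of samples) x (number of features) rows.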
get_ranking_with_markers(data, drop_columns, group, columns, list_markers, annotation={})[source]

This function creates a long-format dataframe with features and values to be plotted together with disease biomarker annotations.

Parameters
  • data – wide-format Pandas DataFrame with samples as rows and features as columns

  • drop_columns (list) – columns to be deleted

  • group (str) – column to use as identifier variables

  • columns (list) – names to use for the 1)variable column, and for the 2)value column

  • list_markers (list) – list of features from data, known to be markers associated to disease.

  • annotation (dict) – markers, from list_markers, and associated diseases.

Returns

Long-format pandas DataFrame with group identifiers as rows and columns: ‘name’ (identifier), ‘y’ (LFQ intensity), ‘symbol’ and ‘size’.

Example:

result = get_ranking_with_markers(data, drop_columns=['sample', 'subject'], group='group', columns=['name', 'y'], list_markers=list_markers, annotation={})
extract_number_missing(data, min_valid, drop_cols=['sample'], group='group')[source]

Counts how many valid values exist in each column and filters column labels with more valid values than the minimum threshold defined.

Parameters
  • data – pandas DataFrame with group as rows and protein identifier as column.

  • group (str) – column label containing group identifiers. If None, the number of valid values is counted across all samples; otherwise it is counted per unique group identifier.

  • min_valid (int) – minimum number of valid values to be filtered.

  • drop_cols (list) – column labels to be dropped.

Returns

List of column labels above the threshold.

Example:

result = extract_number_missing(data, min_valid=3, drop_cols=['sample'], group='group')
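The counting step can be sketched with pandas on a small made-up table. The comparison used here (at least min_valid valid values, counted across all samples) is an assumption for illustration; the per-group counts are shown without the filtering rule, since how groups are combined is not stated above.

```python
import pandas as pd

nan = float("nan")
# Hypothetical data: samples as rows, a 'group' column plus two protein columns.
data = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "p1": [1.0, 2.0, nan, 4.0],
    "p2": [nan, nan, nan, 1.0],
})
min_valid = 3

# Valid (non-missing) values per column across all samples (the group=None case).
valid_counts = data.drop(columns=["group"]).notna().sum()
kept = valid_counts[valid_counts >= min_valid].index.tolist()

# Valid values per column within each group.
per_group_counts = data.groupby("group").count()
```

Here only p1 reaches three valid values, so it is the only label kept.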
extract_percentage_missing(data, missing_max, drop_cols=['sample'], group='group', how='all')[source]

Extracts the ratio of missing/valid values in each column and filters column labels with a lower ratio than the defined threshold (missing_max).

Parameters
  • data – pandas dataframe with group as rows and protein identifier as column.

  • group (str) – column label containing group identifiers. If None, the ratio is calculated across all samples; otherwise it is calculated per unique group identifier.

  • missing_max (float) – maximum ratio of missing/valid values to be filtered.

  • how (str) – define if labels with a higher percentage of missing values than the threshold in any group (‘any’) or in all groups (‘all’) should be filtered

Returns

List of column labels below the threshold.

Example:

result = extract_percentage_missing(data, missing_max=0.3, drop_cols=['sample'], group='group')

imputation_KNN(data, drop_cols=['group', 'sample', 'subject'], group='group', cutoff=0.6, alone=True)[source]

k-Nearest Neighbors imputation for pandas dataframes with missing data. For more information visit https://github.com/iskandr/fancyimpute/blob/master/fancyimpute/knn.py.

Parameters
  • data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

  • group (str) – column label containing group identifiers.

  • drop_cols (list) – column labels to be dropped. Final dataframe should only have gene/protein/etc identifiers as columns.

  • cutoff (float) – minimum ratio of missing/valid values required to impute in each column.

  • alone (boolean) – if True removes all columns with any missing values.

Returns

Pandas dataframe with samples as rows and protein identifiers as columns.

Example:

result = imputation_KNN(data, drop_cols=['group', 'sample', 'subject'], group='group', cutoff=0.6, alone=True)
imputation_mixed_norm_KNN(data, index_cols=['group', 'sample', 'subject'], shift=1.8, nstd=0.3, group='group', cutoff=0.6)[source]

Missing values are replaced in two steps: 1) using k-Nearest Neighbors we impute protein columns with a higher ratio of missing/valid values than the defined cutoff, 2) the remaining missing values are replaced by random numbers that are drawn from a normal distribution.

Parameters
  • data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

  • group (str) – column label containing group identifiers.

  • index_cols (list) – list of column labels to be set as dataframe index.

  • shift (float) – specifies the amount by which the distribution used for the random numbers is shifted downwards. This is in units of the standard deviation of the valid data.

  • nstd (float) – defines the width of the Gaussian distribution relative to the standard deviation of measured values. A value of 0.5 would mean that the width of the distribution used for drawing random numbers is half of the standard deviation of the data.

  • cutoff (float) – minimum ratio of missing/valid values required to impute in each column.

Returns

Pandas dataframe with samples as rows and protein identifiers as columns.

Example:

result = imputation_mixed_norm_KNN(data, index_cols=['group', 'sample', 'subject'], shift = 1.8, nstd = 0.3, group='group', cutoff=0.6)
imputation_normal_distribution(data, index_cols=['group', 'sample', 'subject'], shift=1.8, nstd=0.3)[source]

Missing values will be replaced by random numbers that are drawn from a normal distribution. The imputation is done for each sample (across all proteins) separately. For more information visit http://www.coxdocs.org/doku.php?id=perseus:user:activities:matrixprocessing:imputation:replacemissingfromgaussian.

Parameters
  • data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

  • index_cols (list) – list of column labels to be set as dataframe index.

  • shift (float) – specifies the amount by which the distribution used for the random numbers is shifted downwards. This is in units of the standard deviation of the valid data.

  • nstd (float) – defines the width of the Gaussian distribution relative to the standard deviation of measured values. A value of 0.5 would mean that the width of the distribution used for drawing random numbers is half of the standard deviation of the data.

Returns

Pandas dataframe with samples as rows and protein identifiers as columns.

Example:

result = imputation_normal_distribution(data, index_cols=['group', 'sample', 'subject'], shift = 1.8, nstd = 0.3)
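The down-shifted Gaussian idea can be sketched for a single sample using only the standard library. The function name, the fixed seed and the exact formulas (centre = mean - shift * std, width = nstd * std, both taken from the valid values of that sample) are assumptions based on the parameter descriptions above, not the library's code.

```python
import math
import random
import statistics


def impute_sample_sketch(row, shift=1.8, nstd=0.3, rng=None):
    """Replace NaNs in one sample with draws from a down-shifted normal."""
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch reproducible
    valid = [v for v in row if not math.isnan(v)]
    mean = statistics.mean(valid)
    std = statistics.stdev(valid)
    centre = mean - shift * std   # shifted downwards in units of std
    width = nstd * std            # narrower than the measured distribution
    return [rng.gauss(centre, width) if math.isnan(v) else v for v in row]


row = [20.0, 22.0, float("nan"), 24.0, float("nan")]
imputed = impute_sample_sketch(row)
```

Measured values are left untouched; only the missing positions receive random draws.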
normalize_data_per_group(data, group, method='median')[source]

This function normalizes the data by group using the selected method

Parameters
  • data – DataFrame with the data to be normalized (samples x features)

  • group – column containing the groups

  • method (string) – normalization method to choose among: median_polish, median, quantile, linear

Returns

Pandas dataframe.

Example:

result = normalize_data_per_group(data, group='group', method='median')
normalize_data(data, method='median_polish')[source]

This function normalizes the data using the selected method

Parameters
  • data – DataFrame with the data to be normalized (samples x features)

  • method (string) – normalization method to choose among: median_polish, median, quantile, linear

Returns

Pandas dataframe.

Example:

result = normalize_data(data, method='median_polish')
median_normalization(data)[source]

This function normalizes each sample by using its median.

Parameters

data

Returns

Pandas dataframe.

Example:

result = median_normalization(data)
zscore_normalization(data)[source]

This function normalizes each sample by using its mean and standard deviation (mean=0, std=1).

Parameters

data

Returns

Pandas dataframe.

Example:

data = pd.DataFrame({'a': [2,5,4,3,3], 'b':[4,4,6,5,3], 'c':[4,14,8,8,9]})
result = zscore_normalization(data)
result

          a         b         c
0 -1.154701  0.577350  0.577350
1 -0.484182 -0.665750  1.149932
2 -1.000000  0.000000  1.000000
3 -0.927173 -0.132453  1.059626
4 -0.577350 -0.577350  1.154701
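The values in the example output are consistent with a row-wise z-score that uses the sample standard deviation (ddof=1). The sketch below reproduces that calculation with the standard library; it is an illustration under that assumption, not the function's implementation.

```python
import statistics

columns = {"a": [2, 5, 4, 3, 3], "b": [4, 4, 6, 5, 3], "c": [4, 14, 8, 8, 9]}
# Each row is one sample with its values for features a, b and c.
rows = list(zip(columns["a"], columns["b"], columns["c"]))


def zscore_row_sketch(row):
    """Centre a sample on its mean and scale by its sample standard deviation."""
    mean = statistics.mean(row)
    std = statistics.stdev(row)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / std for v in row]


zscores = [zscore_row_sketch(row) for row in rows]
```

The third sample, (4, 6, 8), has mean 6 and standard deviation 2, which gives -1, 0 and 1 as in the table.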
median_polish_normalization(data, max_iter=250)[source]

This function iteratively normalizes each sample and each feature to its median until medians converge.

Parameters
  • data

  • max_iter (int) – number of maximum iterations to prevent infinite loop.

Returns

Pandas dataframe.

Example:

result = median_polish_normalization(data, max_iter = 10)
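The iteration described above can be sketched as alternating row and column median subtraction. The use of subtraction, the tolerance and the function name are assumptions made for illustration.

```python
import statistics


def median_polish_sketch(matrix, max_iter=250, tol=1e-9):
    """Alternately remove row and column medians until both are ~0."""
    values = [[float(v) for v in row] for row in matrix]
    for _ in range(max_iter):
        row_medians = [statistics.median(row) for row in values]
        values = [[v - m for v in row] for row, m in zip(values, row_medians)]
        col_medians = [statistics.median(col) for col in zip(*values)]
        values = [[v - m for v, m in zip(row, col_medians)] for row in values]
        # Converged once neither pass had anything left to remove.
        if all(abs(m) < tol for m in row_medians + col_medians):
            break
    return values


polished = median_polish_sketch([[1, 2, 3], [4, 5, 6], [7, 8, 10]])
```

The max_iter bound plays the role described for the parameter above: it stops the loop even if the medians never settle.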
quantile_normalization(data)[source]

Applies quantile normalization to each column in pandas dataframe.

Parameters

data – pandas dataframe with features as columns and samples as rows.

Returns

Pandas dataframe

Example:

result = quantile_normalization(data)
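Quantile normalization replaces each value by the mean of all values that share its rank, so that every vector ends up with the same distribution. The sketch below works on plain lists of equal length and breaks ties by position; which DataFrame axis the function treats as the vector is not assumed here.

```python
def quantile_normalize_sketch(vectors):
    """Give every vector the same distribution: the mean of the sorted vectors."""
    n = len(vectors[0])
    sorted_vectors = [sorted(v) for v in vectors]
    # Mean of the values found at each rank across all vectors.
    rank_means = [sum(sv[i] for sv in sorted_vectors) / len(vectors) for i in range(n)]
    result = []
    for v in vectors:
        order = sorted(range(n), key=lambda i: v[i])  # positions from smallest to largest
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = rank_means[rank]
        result.append(normalized)
    return result


normalized = quantile_normalize_sketch([[5, 2, 3], [4, 1, 6]])
```

After normalization both vectors contain the same set of values, only in different positions.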
linear_normalization(data, method='l1', axis=0)[source]

This function scales input data to a unit norm. For more information visit https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html.

Parameters
  • data – pandas dataframe with samples as rows and features as columns.

  • method (str) – norm to use to normalize each non-zero sample or non-zero feature (depends on axis).

  • axis (int) – axis used to normalize the data along. If 1, independently normalize each sample, otherwise (if 0) normalize each feature.

Returns

Pandas dataframe

Example:

result = linear_normalization(data, method = "l1", axis = 0)
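For the default settings (method='l1', axis=0), scaling to unit norm means dividing each feature column by the sum of its absolute values. The standard-library sketch below illustrates only that case; leaving all-zero columns unchanged is an assumption modelled on how scikit-learn treats zero norms.

```python
def l1_normalize_columns_sketch(matrix):
    """Scale each column so that its absolute values sum to 1."""
    norms = [sum(abs(v) for v in col) for col in zip(*matrix)]
    # Columns with a zero norm are left unchanged (divide by 1 instead).
    norms = [n if n != 0 else 1.0 for n in norms]
    return [[v / n for v, n in zip(row, norms)] for row in matrix]


scaled = l1_normalize_columns_sketch([[1.0, 2.0], [3.0, 6.0]])
```

With axis=1 the same division would be applied per row (sample) instead of per column.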
remove_group(data)[source]

Removes column with label ‘group’.

Parameters

data – pandas dataframe with one column labelled ‘group’

Returns

Pandas dataframe

Example:

result = remove_group(data)
calculate_coefficient_variation(values)[source]

Compute the coefficient of variation, the ratio of the biased standard deviation to the mean, in percentage. For more information visit https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.variation.html.

Parameters

values (ndarray) – numpy array

Returns

The calculated variation along rows.

Return type

ndarray

Example:

result = calculate_coefficient_variation(values)
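The definition above (biased standard deviation divided by the mean, as a percentage) can be written out for a one-dimensional list with the standard library. This is a sketch of the formula only; the function name is hypothetical.

```python
import statistics


def coefficient_variation_sketch(values):
    """Population (biased) standard deviation over the mean, in percent."""
    return statistics.pstdev(values) / statistics.mean(values) * 100


cv = coefficient_variation_sketch([2, 4, 4, 4, 5, 5, 7, 9])
```

For this list the mean is 5 and the biased standard deviation is 2, so the coefficient of variation is about 40 percent.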
get_coefficient_variation(data, drop_columns, group, columns=['name', 'y'])[source]

Extracts the coefficients of variation in each group.

Parameters
  • data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

  • drop_columns (list) – column labels to be dropped from the dataframe

  • group (str) – column label containing group identifiers.

  • columns (list) – names to use for the variable column(s), and for the value column(s)

Returns

Pandas dataframe with columns ‘name’ (protein identifier), ‘x’ (coefficient of variation), ‘y’ (mean) and ‘group’.

Example:

result = get_coefficient_variation(data, drop_columns=['sample', 'subject'], group='group')
transform_proteomics_edgelist(df, index_cols=['group', 'sample', 'subject'], drop_cols=['sample'], group='group', identifier='identifier', extra_identifier='name', value_col='LFQ_intensity')[source]

Transforms a long format proteomics matrix into a wide format

Parameters
  • df – long-format pandas dataframe with columns ‘group’, ‘sample’, ‘subject’, ‘identifier’ (protein), ‘name’ (gene) and ‘LFQ_intensity’.

  • index_cols (list) – column labels to be kept as index identifiers.

  • drop_cols (list) – column labels to be dropped from the dataframe.

  • group (str) – column label containing group identifiers.

  • identifier (str) – column label containing feature identifiers.

  • extra_identifier (str) – column label containing additional protein identifiers (e.g. gene names).

  • value_col (str) – column label containing expression values.

Returns

Pandas dataframe with samples as rows and protein identifiers (UniprotID~GeneName) as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

Example:

df = transform_proteomics_edgelist(original, index_cols=['group', 'sample', 'subject'], drop_cols=['sample'], group='group', identifier='identifier', value_col='LFQ_intensity')

get_proteomics_measurements_ready(df, index_cols=['group', 'sample', 'subject'], drop_cols=['sample'], group='group', identifier='identifier', extra_identifier='name', imputation=True, method='distribution', missing_method='percentage', missing_per_group=True, missing_max=0.3, min_valid=1, value_col='LFQ_intensity', shift=1.8, nstd=0.3, knn_cutoff=0.6, normalize=False, normalization_method='median', normalize_group=False)[source]

Processes proteomics data extracted from the database: 1) filter proteins with high number of missing values (> missing_max or min_valid), 2) impute missing values. For more information on imputation method visit http://www.coxdocs.org/doku.php?id=perseus:user:activities:matrixprocessing:filterrows:filtervalidvaluesrows.

Parameters
  • df – long-format pandas dataframe with columns ‘group’, ‘sample’, ‘subject’, ‘identifier’ (protein), ‘name’ (gene) and ‘LFQ_intensity’.

  • index_cols (list) – column labels to be kept as index identifiers.

  • drop_cols (list) – column labels to be dropped from the dataframe.

  • group (str) – column label containing group identifiers.

  • identifier (str) – column label containing feature identifiers.

  • extra_identifier (str) – column label containing additional protein identifiers (e.g. gene names).

  • imputation (bool) – if True performs imputation of missing values.

  • method (str) – method for missing values imputation (‘KNN’, ‘distribution’, or ‘mixed’)

  • missing_method (str) – defines which expression rows are counted to determine if a column has enough valid values to survive the filtering process.

  • missing_per_group (bool) – if True filter proteins based on valid values per group; if False filter across all samples.

  • missing_max (float) – maximum ratio of missing/valid values to be filtered.

  • min_valid (int) – minimum number of valid values to be filtered.

  • value_col (str) – column label containing expression values.

  • shift (float) – when using distribution imputation, the down-shift

  • nstd (float) – when using distribution imputation, the width of the distribution

  • knn_cutoff (float) – when using KNN imputation, the minimum percentage of valid values for which to use KNN imputation (i.e. 0.6 -> if 60% valid values use KNN, otherwise MinProb)

Returns

Pandas dataframe with samples as rows and protein identifiers (UniprotID~GeneName) as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

Example 1:

result = get_proteomics_measurements_ready(df, index_cols=['group', 'sample', 'subject'], drop_cols=['sample'], group='group', identifier='identifier', extra_identifier='name', imputation=True, method = 'distribution', missing_method = 'percentage', missing_per_group=True, missing_max = 0.3, value_col='LFQ_intensity')

Example 2:

result = get_proteomics_measurements_ready(df, index_cols=['group', 'sample', 'subject'], drop_cols=['sample'], group='group', identifier='identifier', extra_identifier='name', imputation = True, method = 'mixed', missing_method = 'at_least_x', missing_per_group=False, min_valid=5, value_col='LFQ_intensity')
get_clinical_measurements_ready(df, subject_id='subject', sample_id='biological_sample', group_id='group', columns=['clinical_variable'], values='values', extra=['group'], imputation=True, imputation_method='KNN')[source]

Processes clinical data extracted from the database by converting dataframe to wide-format and imputing missing values.

Parameters
  • df – long-format pandas dataframe with columns ‘group’, ‘biological_sample’, ‘subject’, ‘clinical_variable’, ‘value’.

  • subject_id (str) – column label containing subject identifiers.

  • sample_id (str) – column label containing biological sample identifiers.

  • group_id (str) – column label containing group identifiers.

  • columns (list) – column name whose unique values will become the new column names

  • values (str) – column label containing clinical variable values.

  • extra (list) – additional column labels to be kept as columns

  • imputation (bool) – if True performs imputation of missing values.

  • imputation_method (str) – method for missing values imputation (‘KNN’, ‘distribution’, or ‘mixed’).

Returns

Pandas dataframe with samples as rows and clinical variables as columns (with additional columns ‘group’, ‘subject’ and ‘biological_sample’).

Example:

result = get_clinical_measurements_ready(df, subject_id='subject', sample_id='biological_sample', group_id='group', columns=['clinical_variable'], values='values', extra=['group'], imputation=True, imputation_method='KNN')
get_summary_data_matrix(data)[source]

Returns some statistics on the data matrix provided.

Parameters

data – pandas dataframe.

Returns

dictionary with the type of statistics as key and the statistic as value in the shape of a pandas data frame

Example:

result = get_summary_data_matrix(data)
check_equal_variances(data, drop_cols=['group', 'sample', 'subject'], group_col='group', alpha=0.05)[source]
check_normality(data, drop_cols=['group', 'sample', 'subject'], group_col='group', alpha=0.05)[source]
run_pca(data, drop_cols=['sample', 'subject'], group='group', components=2, dropna=True)[source]

Performs principal component analysis and returns the values of each component for each sample and each protein, and the loadings for each protein. For information visit https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html.

Parameters
  • data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

  • drop_cols (list) – column labels to be dropped from the dataframe.

  • group (str) – column label containing group identifiers.

  • components (int) – number of components to keep.

  • dropna (bool) – if True removes all columns with any missing values.

Returns

Two dictionaries: 1) two pandas dataframes (first one with components values, the second with the components vectors for each protein), 2) xaxis and yaxis titles with components loadings for plotly.

Example:

result = run_pca(data, drop_cols=['sample', 'subject'], group='group', components=2, dropna=True)
run_tsne(data, drop_cols=['sample', 'subject'], group='group', components=2, perplexity=40, n_iter=1000, init='pca', dropna=True)[source]

Performs t-distributed Stochastic Neighbor Embedding analysis. For more information visit https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html.

Parameters
  • data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

  • drop_cols (list) – column labels to be dropped from the dataframe.

  • group (str) – column label containing group identifiers.

  • components (int) – dimension of the embedded space.

  • perplexity (int) – related to the number of nearest neighbors that is used in other manifold learning algorithms. Consider selecting a value between 5 and 50.

  • n_iter (int) – maximum number of iterations for the optimization (at least 250).

  • init (str) – initialization of embedding (‘random’, ‘pca’ or numpy array of shape n_samples x n_components).

  • dropna (bool) – if True removes all columns with any missing values.

Returns

Two dictionaries: 1) pandas dataframe with embedding vectors, 2) xaxis and yaxis titles for plotly.

Example:

result = run_tsne(data, drop_cols=['sample', 'subject'], group='group', components=2, perplexity=40, n_iter=1000, init='pca', dropna=True)
run_umap(data, drop_cols=['sample', 'subject'], group='group', n_neighbors=10, min_dist=0.3, metric='cosine', dropna=True)[source]

Performs Uniform Manifold Approximation and Projection. For more information visit https://umap-learn.readthedocs.io.

Parameters
  • data – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

  • drop_cols (list) – column labels to be dropped from the dataframe.

  • group (str) – column label containing group identifiers.

  • n_neighbors (int) – number of neighboring points used in local approximations of manifold structure.

  • min_dist (float) – controls how tightly the embedding is allowed to compress points together.

  • metric (str) – metric used to measure distance in the input space.

  • dropna (bool) – if True removes all columns with any missing values.

Returns

Two dictionaries: 1) pandas dataframe with embedding of the training data in low-dimensional space, 2) xaxis and yaxis titles for plotly.

Example:

result = run_umap(data, drop_cols=['sample', 'subject'], group='group', n_neighbors=10, min_dist=0.3, metric='cosine', dropna=True)
calculate_correlations(x, y, method='pearson')[source]

Calculates a Spearman (nonparametric) or a Pearson (parametric) correlation coefficient and p-value to test for non-correlation.

Parameters
  • x (ndarray) – array 1

  • y (ndarray) – array 2

  • method (str) – chooses which kind of correlation method to run

Returns

Tuple with two floats, correlation coefficient and two-tailed p-value.

Example:

result = calculate_correlations(x, y, method='pearson')
apply_pvalue_correction(pvalues, alpha=0.05, method='bonferroni')[source]

Performs p-value correction using the specified method. For more information visit https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.multipletests.html.

Parameters
  • pvalues (ndarray) – set of p-values of the individual tests.

  • alpha (float) – error rate.

  • method (str) – method of p-value correction:

      - bonferroni : one-step correction
      - sidak : one-step correction
      - holm-sidak : step-down method using Sidak adjustments
      - holm : step-down method using Bonferroni adjustments
      - simes-hochberg : step-up method (independent)
      - hommel : closed method based on Simes tests (non-negative)
      - fdr_bh : Benjamini/Hochberg (non-negative)
      - fdr_by : Benjamini/Yekutieli (negative)
      - fdr_tsbh : two stage fdr correction (non-negative)
      - fdr_tsbky : two stage fdr correction (non-negative)

Returns

Tuple with two arrays: boolean for rejecting the H0 hypothesis and float for adjusted p-value.

Example:

result = apply_pvalue_correction(pvalues, alpha=0.05, method='bonferroni')
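To make the default method concrete, the sketch below applies a Bonferroni correction by hand: each p-value is multiplied by the number of tests (capped at 1) and H0 is rejected when the raw p-value is at most alpha divided by the number of tests. It is a standard-library illustration with a hypothetical function name, not a substitute for the statsmodels call referenced above.

```python
def bonferroni_sketch(pvalues, alpha=0.05):
    """Return (rejected, adjusted) lists for a one-step Bonferroni correction."""
    m = len(pvalues)
    adjusted = [min(p * m, 1.0) for p in pvalues]
    rejected = [p <= alpha / m for p in pvalues]
    return rejected, adjusted


rejected, adjusted = bonferroni_sketch([0.01, 0.04, 0.03, 0.005])
```

With four tests the per-test threshold is 0.0125, so only the first and last hypotheses are rejected.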
apply_pvalue_fdrcorrection(pvalues, alpha=0.05, method='indep')[source]

Performs p-value correction for false discovery rate. For more information visit https://www.statsmodels.org/devel/generated/statsmodels.stats.multitest.fdrcorrection.html.

Parameters
  • pvalues (ndarray) – set of p-values of the individual tests.

  • alpha (float) – error rate.

  • method (str) – method of p-value correction (‘indep’, ‘negcorr’).

Returns

Tuple with two arrays: boolean for rejecting the H0 hypothesis and float for adjusted p-value.

Example:

result = apply_pvalue_fdrcorrection(pvalues, alpha=0.05, method='indep')
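For independent tests, false discovery rate correction is commonly done with the Benjamini-Hochberg step-up procedure. The sketch below implements that procedure with the standard library, as an illustration of what the ‘indep’ option refers to; the function name is hypothetical and the statsmodels implementation is not reproduced here.

```python
def benjamini_hochberg_sketch(pvalues, alpha=0.05):
    """Return (rejected, adjusted) lists using the Benjamini-Hochberg procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    rejected = [q <= alpha for q in adjusted]
    return rejected, adjusted


rejected, adjusted = benjamini_hochberg_sketch([0.01, 0.04, 0.03, 0.005])
```

The adjusted values are returned in the original order of the input, so they can be attached back to the tested features directly.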
apply_pvalue_twostage_fdrcorrection(pvalues, alpha=0.05, method='bh')[source]

Iterated two stage linear step-up procedure with estimation of number of true hypotheses. For more information visit https://www.statsmodels.org/dev/generated/statsmodels.stats.multitest.fdrcorrection_twostage.html.

Parameters
  • pvalues (ndarray) – set of p-values of the individual tests.

  • alpha (float) – error rate.

  • method (str) – method of p-value correction (‘bky’, ‘bh’).

Returns

Tuple with two arrays: boolean for rejecting the H0 hypothesis and float for adjusted p-value.

Example:

result = apply_pvalue_twostage_fdrcorrection(pvalues, alpha=0.05, method='bh')
apply_pvalue_permutation_fdrcorrection(df, observed_pvalues, group, alpha=0.05, permutations=50)[source]

This function applies multiple hypothesis testing correction using a permutation-based false discovery rate approach.

Parameters
  • df – pandas dataframe with samples as rows and features as columns.

  • observed_pvalues – pandas Series with p-values calculated on the originally measured data.

  • group (str) – name of the column containing group identifiers.

  • alpha (float) – error rate. Values below alpha are considered significant.

  • permutations (int) – number of permutations to be applied.

Returns

Pandas dataframe with adjusted p-values and rejected columns.

Example:

result = apply_pvalue_permutation_fdrcorrection(df, observed_pvalues, group='group', alpha=0.05, permutations=50)
get_counts_permutation_fdr(value, random, observed, n, alpha)[source]

Calculates local FDR values (q-values) by computing the fraction of accepted hits from the permuted data over accepted hits from the measured data normalized by the total number of permutations.

Parameters
  • value (float) – computed p-value on measured data for a feature.

  • random (ndarray) – p-values computed on the permuted data.

  • observed – pandas Series with p-values calculated on the originally measured data.

  • n (int) – number of permutations to be applied.

  • alpha (float) – error rate. Values below alpha are considered significant.

Returns

Tuple with q-value and boolean for H0 rejected.

Example:

result = get_counts_permutation_fdr(value, random, observed, n=250, alpha=0.05)
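The ratio described above can be sketched as follows. The use of "less than or equal" when counting accepted hits, the fallback when no observed hit is accepted, and the function name are all assumptions made for illustration.

```python
def counts_permutation_fdr_sketch(value, random_pvalues, observed_pvalues, n, alpha):
    """q-value = (permuted hits / observed hits) / number of permutations."""
    accepted_random = sum(1 for p in random_pvalues if p <= value)
    accepted_observed = sum(1 for p in observed_pvalues if p <= value)
    if accepted_observed == 0:
        qvalue = 1.0  # assumption: nothing accepted means no evidence against H0
    else:
        qvalue = accepted_random / accepted_observed / n
    return qvalue, qvalue <= alpha


qvalue, rejected = counts_permutation_fdr_sketch(
    value=0.01,
    random_pvalues=[0.005, 0.2, 0.5, 0.009],
    observed_pvalues=[0.01, 0.001, 0.3],
    n=2,
    alpha=0.05,
)
```

In this toy input two permuted and two observed p-values fall at or below 0.01, so with two permutations the q-value is 0.5 and H0 is not rejected.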
convertToEdgeList(data, cols)[source]

This function converts a pandas dataframe to an edge list where index becomes the source nodes and columns the target nodes.

Parameters
  • data – pandas dataframe.

  • cols (list) – names for dataframe columns.

Returns

Pandas dataframe with columns cols.
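No example is given for this function, so here is a sketch of one way to obtain such an edge list with pandas: stacking the matrix so that each (index, column) pair becomes a row. The data and the column names passed as cols are made up, and the function may be implemented differently.

```python
import pandas as pd

# Hypothetical square matrix, e.g. pairwise correlation scores.
data = pd.DataFrame(
    [[1.0, 0.5], [0.5, 1.0]],
    index=["a", "b"],
    columns=["a", "b"],
)
cols = ["node1", "node2", "weight"]

# Index labels become source nodes, column labels become target nodes.
edges = data.stack().reset_index()
edges.columns = cols
```

Each cell of the matrix yields one edge, so a 2 x 2 matrix produces four rows.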

run_correlation(df, alpha=0.05, subject='subject', group='group', method='pearson', correction='fdr_bh')[source]

This function calculates pairwise correlations for columns in a dataframe, and returns them in the shape of an edge list with ‘weight’ as correlation score, and the adjusted p-values.

Parameters
  • df – pandas dataframe with samples as rows and features as columns.

  • subject (str) – name of column containing subject identifiers.

  • group (str) – name of column containing group identifiers.

  • method (str) – method to use for correlation calculation (‘pearson’, ‘spearman’).

  • alpha (float) – error rate. Values below alpha are considered significant.

  • correction (string) – type of correction; see apply_pvalue_correction for methods

Returns

Pandas dataframe with columns: ‘node1’, ‘node2’, ‘weight’, ‘padj’ and ‘rejected’.

Example:

result = run_correlation(df, alpha=0.05, subject='subject', group='group', method='pearson', correction='fdr_bh')
run_multi_correlation(df_dict, alpha=0.05, subject='subject', on=['subject', 'biological_sample'], group='group', method='pearson', correction='fdr_bh')[source]

This function merges all input dataframes and calculates pairwise correlations for all columns.

Parameters
  • df_dict (dict) – dictionary of pandas dataframes with samples as rows and features as columns.

  • subject (str) – name of the column containing subject identifiers.

  • group (str) – name of the column containing group identifiers.

  • on (list) – column names to join dataframes on (must be found in all dataframes).

  • method (str) – method to use for correlation calculation (‘pearson’, ‘spearman’).

  • alpha (float) – error rate. Values below alpha are considered significant.

  • correction (string) – type of correction; see apply_pvalue_correction for methods

Returns

Pandas dataframe with columns: ‘node1’, ‘node2’, ‘weight’, ‘padj’ and ‘rejected’.

Example:

result = run_multi_correlation(df_dict, alpha=0.05, subject='subject', on=['subject', 'biological_sample'] , group='group', method='pearson', correction='fdr_bh')
calculate_rm_correlation(df, x, y, subject)[source]

Computes correlation and p-values between two columns a and b in df.

Parameters
  • df – pandas dataframe with subjects as rows and two features as columns.

  • x (str) – feature a name.

  • y (str) – feature b name.

  • subject – column name containing the covariate variable.

Returns

Tuple with values for: feature a, feature b, correlation, p-value and degrees of freedom.

Example:

result = calculate_rm_correlation(df, x='feature a', y='feature b', subject='subject')
run_rm_correlation(df, alpha=0.05, subject='subject', correction='fdr_bh')[source]

Computes pairwise repeated measurements correlations for all columns in dataframe, and returns results as an edge list with ‘weight’ as correlation score, p-values, degrees of freedom and adjusted p-values.

Parameters
  • df – pandas dataframe with samples as rows and features as columns.

  • subject (str) – name of column containing subject identifiers.

  • alpha (float) – error rate. Values below alpha are considered significant.

  • correction (string) – type of correction; see apply_pvalue_correction for methods

Returns

Pandas dataframe with columns: ‘node1’, ‘node2’, ‘weight’, ‘pvalue’, ‘dof’, ‘padj’ and ‘rejected’.

Example:

result = run_rm_correlation(df, alpha=0.05, subject='subject', correction='fdr_bh')
run_efficient_correlation(data, method='pearson')[source]

Calculates pairwise correlations and returns lower triangle of the matrix with correlation values and p-values.

Parameters
  • data – pandas dataframe with samples as index and features as columns (numeric data only).

  • method (str) – method to use for correlation calculation (‘pearson’, ‘spearman’).

Returns

Two numpy arrays: correlation and p-values.

Example:

result = run_efficient_correlation(data, method='pearson')
calculate_ttest_samr(df, labels, n=2, s0=0, paired=False)[source]

Calculates modified T-test using ‘samr’ R package.

Parameters
  • df – pandas dataframe with group as columns and protein identifier as rows

  • labels (list) – integers reflecting the group each sample belongs to (e.g. group1 = 1, group2 = 2)

  • n (int) – number of samples

  • s0 (float) – exchangeability factor for denominator of test statistic

  • paired (bool) – True if samples are paired

Returns

Pandas dataframe with columns ‘identifier’, ‘group1’, ‘group2’, ‘mean(group1)’, ‘mean(group2)’, ‘log2FC’, ‘FC’, ‘t-statistics’, ‘p-value’.

Example:

result = calculate_ttest_samr(df, labels, n=2, s0=0.1, paired=False)
calculate_ttest(df, condition1, condition2, paired=False, is_logged=True, non_par=False, tail='two-sided', correction='auto', r=0.707)[source]

Calculates the t-test for the means of independent samples belonging to two different groups. For more information visit https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html.

Parameters
  • df – pandas dataframe with groups and subjects as rows and protein identifier as column.

  • condition1 (str) – identifier of first group.

  • condition2 (str) – identifier of second group.

  • is_logged (bool) – data is log-transformed

  • non_par (bool) – if True, normality and variance equality assumptions are checked and the non-parametric Mann-Whitney U test is used if they are not met

Returns

Tuple with t-statistics, two-tailed p-value, mean of first group, mean of second group and logfc.

Example:

result = calculate_ttest(df, 'group1', 'group2')
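
The quantities in the returned tuple can be sketched with the standard library. The example below assumes an unpaired, equal-variance t-test and log-transformed data, so the difference of means is used as the log fold change. The function name is hypothetical, and the p-value is omitted because it needs the t-distribution (scipy.stats.ttest_ind, linked above, provides it).

```python
import math
import statistics


def ttest_ind_sketch(group1, group2):
    """Return (t-statistic, mean1, mean2, logfc) for two independent groups.

    Assumes equal variances and log-transformed values (logfc = mean1 - mean2).
    """
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    # Pooled sample variance, weighted by each group's degrees of freedom.
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    t_stat = (mean1 - mean2) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t_stat, mean1, mean2, mean1 - mean2


t_stat, mean1, mean2, logfc = ttest_ind_sketch([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])
```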
calculate_THSD(df, column, group='group', alpha=0.05, is_logged=True)[source]

Pairwise Tukey-HSD posthoc test using pingouin stats. For more information visit https://pingouin-stats.org/generated/pingouin.pairwise_tukey.html

Parameters
  • df – pandas dataframe with group and protein identifier as columns

  • column (str) – column containing the protein identifier

  • group (str) – column label containing the between factor

  • alpha (float) – significance level

Returns

Pandas dataframe.

Example:

result = calculate_THSD(df, column='HBG2~P69892', group='group', alpha=0.05)
calculate_pairwise_ttest(df, column, subject='subject', group='group', correction='none', is_logged=True)[source]

Performs pairwise t-test using pingouin, as a posthoc test, and calculates fold-changes. For more information visit https://pingouin-stats.org/generated/pingouin.pairwise_ttests.html.

Parameters
  • df – pandas dataframe with subject and group as rows and protein identifier as column.

  • column (str) – column label containing the dependent variable

  • subject (str) – column label containing subject identifiers

  • group (str) – column label containing the between factor

  • correction (str) – method used for testing and adjustment of p-values.

Returns

Pandas dataframe with means, standard deviations, test-statistics, degrees of freedom and effect size columns.

Example:

result = calculate_pairwise_ttest(df, 'protein a', subject='subject', group='group', correction='none')
complement_posthoc(posthoc, identifier, is_logged)[source]

Calculates fold-changes after posthoc test.

Parameters
  • posthoc – pandas dataframe from posthoc test. Should have at least columns ‘mean(group1)’ and ‘mean(group2)’.

  • identifier (str) – feature identifier.

Returns

Pandas dataframe with additional columns ‘identifier’, ‘log2FC’ and ‘FC’.

calculate_dabest(df, idx, x, y, paired=False, id_col=None, test='mean_diff')[source]
Parameters
  • df

  • idx

  • x

  • y

  • paired

  • id_col

  • test

Returns

calculate_anova_samr(df, labels, s0=0)[source]

Calculates modified one-way ANOVA using ‘samr’ R package.

Parameters
  • df – pandas dataframe with group as columns and protein identifier as rows

  • labels (list) – integers reflecting the group each sample belongs to (e.g. group1 = 1, group2 = 2, group3 = 3)

  • s0 (float) – exchangeability factor for denominator of test statistic

Returns

Pandas dataframe with protein identifiers and F-statistics.

Example:

result = calculate_anova_samr(df, labels, s0=0.1)
calculate_anova(df, column, group='group')[source]

Calculates one-way ANOVA using pingouin.

Parameters
  • df – pandas dataframe with group as rows and protein identifier as column

  • column (str) – name of the column in df to run ANOVA on

  • group (str) – column with group identifiers

Returns

Tuple with t-statistics and p-value.

calculate_repeated_measures_anova(df, column, subject='subject', group='group')[source]

One-way and two-way repeated measures ANOVA using pingouin stats.

Parameters
  • df – pandas dataframe with samples as rows and protein identifier as column. Data must be in long-format for two-way repeated measures.

  • column (str) – column label containing the dependent variable

  • subject (str) – column label containing subject identifiers

  • group (str) – column label containing the within factor

Returns

Tuple with protein identifier, t-statistics and p-value.

Example:

result = calculate_repeated_measures_anova(df, 'protein a', subject='subject', group='group')
get_max_permutations(df, group='group')[source]

Get maximum number of permutations according to number of samples.

Parameters
  • df – pandas dataframe with samples as rows and protein identifiers as columns

  • group (str) – column with group identifiers

Returns

Maximum number of permutations.

Return type

int
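
One plausible way to count the distinct label permutations, assuming the maximum is the multinomial coefficient N! / (n1! · n2! · ...) over the group sizes, is sketched below. The function name is hypothetical and the formula is an assumption about the intent, not a quotation of the library code.

```python
from collections import Counter
from math import factorial


def max_permutations_sketch(groups):
    """Number of distinct ways to arrange the group labels across samples."""
    total = factorial(len(groups))
    for size in Counter(groups).values():
        # Each division is exact: partial results are multinomial terms.
        total //= factorial(size)
    return total


balanced = max_permutations_sketch(["a", "a", "b", "b"])
unbalanced = max_permutations_sketch(["a", "a", "a", "b", "b", "c"])
```

With few samples this number is small, which caps how many permutations a permutation-based correction can meaningfully use.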

check_is_paired(df, subject, group)[source]

Check if samples are paired.

Parameters
  • df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

  • subject (str) – column with subject identifiers

  • group (str) – column with group identifiers

Returns

True if paired samples.

Return type

bool
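
A minimal sketch of such a check, assuming "paired" means that at least one subject contributes samples to more than one group. Both the criterion and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def is_paired_sketch(records):
    """Return True if any subject appears in more than one group.

    `records` is an iterable of (subject, group) tuples, one per sample.
    """
    groups_per_subject = defaultdict(set)
    for subject, group in records:
        groups_per_subject[subject].add(group)
    return any(len(groups) > 1 for groups in groups_per_subject.values())


paired = is_paired_sketch([("s1", "before"), ("s1", "after"),
                           ("s2", "before"), ("s2", "after")])
unpaired = is_paired_sketch([("s1", "control"), ("s2", "control"),
                             ("s3", "treated")])
```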

run_dabest(df, drop_cols=['sample'], subject='subject', group='group', test='mean_diff')[source]
Parameters
  • df

  • drop_cols (list) –

  • subject (str) –

  • group (str) –

  • test (str) –

Returns

Pandas dataframe

run_anova(df, alpha=0.05, drop_cols=['sample', 'subject'], subject='subject', group='group', permutations=0, correction='fdr_bh', is_logged=True, non_par=False)[source]

Performs a statistical test for each protein in a dataset. Checks the type of input data (paired, unpaired or repeated measurements) and performs posthoc tests for multiclass data. Multiple hypothesis correction is permutation-based if permutations>0 and Benjamini/Hochberg if permutations=0.

Parameters
  • df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

  • subject (str) – column with subject identifiers

  • group (str) – column with group identifiers

  • drop_cols (list) – column labels to be dropped from the dataframe

  • alpha (float) – error rate for multiple hypothesis correction

  • permutations (int) – number of permutations used to estimate false discovery rates.

  • non_par (bool) – if True, normality and variance equality assumptions are checked and the non-parametric Mann-Whitney U test is used if they are not met

Returns

Pandas dataframe with columns ‘identifier’, ‘group1’, ‘group2’, ‘mean(group1)’, ‘mean(group2)’, ‘Log2FC’, ‘std_error’, ‘tail’, ‘t-statistics’, ‘posthoc pvalue’, ‘effsize’, ‘efftype’, ‘FC’, ‘rejected’, ‘F-statistics’, ‘p-value’, ‘correction’, ‘-log10 p-value’, and ‘method’.

Example:

result = run_anova(df, alpha=0.05, drop_cols=["sample",'subject'], subject='subject', group='group', permutations=50)
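
The Benjamini/Hochberg correction used when permutations=0 can be sketched in a few lines of plain Python. The function below is an illustrative stand-in with a hypothetical name, not the library's apply_pvalue_correction.

```python
def benjamini_hochberg_sketch(pvalues, alpha=0.05):
    """Return (rejected, padj) using the Benjamini/Hochberg step-up procedure."""
    n = len(pvalues)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(n), key=lambda i: pvalues[i])
    padj = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        padj[i] = running_min
    rejected = [p <= alpha for p in padj]
    return rejected, padj


rejected, padj = benjamini_hochberg_sketch([0.01, 0.04, 0.03, 0.005])
```

Each adjusted value is the raw p-value scaled by n/rank, then capped by the adjusted value of the next larger p-value so that the ordering is preserved.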
correct_pairwise_ttest(df, alpha, correction='fdr_bh')[source]
run_repeated_measurements_anova(df, alpha=0.05, drop_cols=['sample'], subject='subject', group='group', permutations=50, correction='fdr_bh', is_logged=True)[source]

Performs repeated measurements anova and pairwise posthoc tests for each protein in dataframe.

Parameters
  • df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

  • subject (str) – column with subject identifiers

  • group (str) – column with group identifiers

  • drop_cols (list) – column labels to be dropped from the dataframe

  • alpha (float) – error rate for multiple hypothesis correction

  • permutations (int) – number of permutations used to estimate false discovery rates

Returns

Pandas dataframe

Example:

result = run_repeated_measurements_anova(df, alpha=0.05, drop_cols=['sample'], subject='subject', group='group', permutations=50)
format_anova_table(df, aov_results, pairwise_results, pairwise_cols, group, permutations, alpha, correction)[source]

Performs p-value correction (permutation-based and FDR) and converts pandas dataframe into final format.

Parameters
  • df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

  • aov_results (list[tuple]) – list of tuples with anova results (one tuple per feature).

  • pairwise_results (list[dataframes]) – list of pandas dataframes with posthoc tests results

  • group (str) – column with group identifiers

  • alpha (float) – error rate for multiple hypothesis correction

  • permutations (int) – number of permutations used to estimate false discovery rates

Returns

Pandas dataframe

run_ttest(df, condition1, condition2, alpha=0.05, drop_cols=['sample'], subject='subject', group='group', paired=False, correction='fdr_bh', permutations=50, is_logged=True, non_par=False)[source]

Runs t-test (paired/unpaired) for each protein in dataset and performs permutation-based (if permutations>0) or Benjamini/Hochberg (if permutations=0) multiple hypothesis correction.

Parameters
  • df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

  • condition1 (str) – first of two conditions of the independent variable

  • condition2 (str) – second of two conditions of the independent variable

  • subject (str) – column with subject identifiers

  • group (str) – column with group identifiers (independent variable)

  • drop_cols (list) – column labels to be dropped from the dataframe

  • paired (bool) – paired or unpaired samples

  • correction (str) – method of pvalue correction see apply_pvalue_correction for methods

  • alpha (float) – error rate for multiple hypothesis correction

  • permutations (int) – number of permutations used to estimate false discovery rates.

  • is_logged (bool) – data is log-transformed

  • non_par (bool) – if True, normality and variance equality assumptions are checked and the non-parametric Mann-Whitney U test is used if they are not met

Returns

Pandas dataframe with columns ‘identifier’, ‘group1’, ‘group2’, ‘mean(group1)’, ‘mean(group2)’, ‘std(group1)’, ‘std(group2)’, ‘Log2FC’, ‘FC’, ‘rejected’, ‘T-statistics’, ‘p-value’, ‘correction’, ‘-log10 p-value’, and ‘method’.

Example:

result = run_ttest(df, condition1='group1', condition2='group2', alpha = 0.05, drop_cols=['sample'], subject='subject', group='group', paired=False, correction='fdr_bh', permutations=50)
define_samr_method(df, subject, group, drop_cols)[source]

Method to identify the correct problem type to run with SAMR

Parameters
  • df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

  • subject (str) – column with subject identifiers

  • group (str) – column with group identifiers

  • drop_cols (str) – columns to be dropped

Returns

tuple with the method to be used (One Class, Two class paired, Two class unpaired or Multiclass) and the labels (conditions)

Example:

method, labels = define_samr_method(df, subject, group, drop_cols)
calculate_pvalue_from_tstats(tstat, dfn, dfk)[source]

Calculate two-tailed p-values from T- or F-statistics.

Parameters
  • tstat – T/F distribution

  • dfn (dict) – degrees of freedom (values) per protein (keys), i.e. number of observations - number of groups

  • dfk (dict) – degrees of freedom (values) per protein (keys), i.e. number of groups - 1

run_samr(df, subject='subject', group='group', drop_cols=['subject', 'sample'], alpha=0.05, s0='null', permutations=250, fc=0, is_logged=True, localfdr=False)[source]

Python adaptation of the ‘samr’ R package for statistical tests with permutation-based correction and s0 parameter. For more information visit https://cran.r-project.org/web/packages/samr/samr.pdf. The method only runs if R is installed and permutations is higher than 0; otherwise ANOVA is run instead.

Parameters
  • df – pandas dataframe with samples as rows and protein identifiers as columns (with additional columns ‘group’, ‘sample’ and ‘subject’).

  • subject (str) – column with subject identifiers

  • group (str) – column with group identifiers

  • drop_cols (list) – column labels to be dropped from the dataframe

  • alpha (float) – error rate for multiple hypothesis correction

  • s0 (float) – exchangeability factor for denominator of test statistic

  • permutations (int) – number of permutations used to estimate false discovery rates. If number of permutations is equal to zero, the function will run anova with FDR Benjamini/Hochberg correction.

  • fc (float) – minimum fold change to define practical significance (needed when computing delta table)

Returns

Pandas dataframe with columns ‘identifier’, ‘group1’, ‘group2’, ‘mean(group1)’, ‘mean(group2)’, ‘Log2FC’, ‘FC’, ‘T-statistics’, ‘p-value’, ‘padj’, ‘correction’, ‘-log10 p-value’, ‘rejected’ and ‘method’

Example:

result = run_samr(df, subject='subject', group='group', drop_cols=['subject', 'sample'], alpha=0.05, s0=1, permutations=250, fc=0)
calculate_discriminant_lines(result)[source]
run_fisher(group1, group2, alternative='two-sided')[source]

          annotated   not-annotated
group1        a             b
group2        c             d

group1 = [a, b]
group2 = [c, d]

odds, pvalue = stats.fisher_exact([[a, b], [c, d]])
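
To show what the test computes on such a 2x2 table, here is a standard-library sketch of the two-sided Fisher's exact test based on the hypergeometric distribution. The function name is hypothetical; the documented function delegates to scipy's stats.fisher_exact rather than to code like this.

```python
from math import comb


def fisher_exact_sketch(group1, group2):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]."""
    a, b = group1
    c, d = group2
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    pvalue = sum(p for p in map(prob, range(low, high + 1))
                 if p <= p_observed * (1 + 1e-7))
    odds = (a * d) / (b * c) if b * c else float("inf")
    return odds, min(pvalue, 1.0)


odds, pvalue = fisher_exact_sketch([8, 2], [1, 5])
```

The small relative tolerance on the comparison guards against floating-point ties between equally likely tables.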

run_kolmogorov_smirnov(dist1, dist2, alternative='two-sided')[source]

Compute the Kolmogorov-Smirnov statistic on 2 samples. See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html

Parameters
  • dist1 (list) – sequence of 1-D ndarray (first distribution to compare) drawn from a continuous distribution

  • dist2 (list) – sequence of 1-D ndarray (second distribution to compare) drawn from a continuous distribution

  • alternative (str) – defines the alternative hypothesis: ‘two-sided’ (default), ‘less’ or ‘greater’.

Returns

Tuple with the KS statistic (float) and the p-value (float).

Example:

result = run_kolmogorov_smirnov(dist1, dist2, alternative='two-sided')
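
The two-sample KS statistic is the largest absolute gap between the two empirical cumulative distribution functions. The sketch below computes only that statistic with the standard library; the function name is hypothetical and the p-value is left to scipy's ks_2samp.

```python
from bisect import bisect_right


def ks_statistic_sketch(dist1, dist2):
    """Maximum absolute difference between the two empirical CDFs."""
    s1, s2 = sorted(dist1), sorted(dist2)
    n1, n2 = len(s1), len(s2)
    # The gap can only change at observed values, so checking those suffices.
    return max(abs(bisect_right(s1, x) / n1 - bisect_right(s2, x) / n2)
               for x in s1 + s2)


shifted = ks_statistic_sketch([1, 2, 3, 4], [3, 4, 5, 6])
identical = ks_statistic_sketch([1, 2, 3], [1, 2, 3])
```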
run_site_regulation_enrichment(regulation_data, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', regex='(\\w+~.+)_\\w\\d+\\-\\w+', correction='fdr_bh')[source]

This function runs a simple enrichment analysis for significantly regulated protein sites in a dataset.

Parameters
  • regulation_data – pandas dataframe resulting from differential regulation analysis.

  • annotation – pandas dataframe with annotations for features (columns: ‘annotation’, ‘identifier’ (feature identifiers), and ‘source’).

  • identifier (str) – name of the column from annotation containing feature identifiers.

  • groups (list) – column names from regulation_data containing group identifiers.

  • annotation_col (str) – name of the column from annotation containing annotation terms.

  • reject_col (str) – name of the column from regulation_data containing boolean for rejected null hypothesis.

  • group_col (str) – column name for new column in annotation dataframe determining if feature belongs to foreground or background.

  • method (str) – method used to compute enrichment (only ‘fisher’ is supported currently).

  • regex (str) – how to extract the annotated identifier from the site identifier

Returns

Pandas dataframe with columns: ‘terms’, ‘identifiers’, ‘foreground’, ‘background’, ‘pvalue’, ‘padj’ and ‘rejected’.

Example:

result = run_site_regulation_enrichment(regulation_data, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', regex=r"(\w+~.+)_\w\d+\-\w+")
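
The default regex from the signature captures the part of a site identifier that precedes the site suffix. The sketch below applies it to a made-up identifier; the identifier and the helper name are hypothetical and only illustrate the pattern's behaviour.

```python
import re

# Default pattern taken from the run_site_regulation_enrichment signature.
SITE_PATTERN = re.compile(r"(\w+~.+)_\w\d+\-\w+")


def protein_from_site(site_identifier):
    """Return the captured protein part of a site identifier, or None."""
    match = SITE_PATTERN.match(site_identifier)
    return match.group(1) if match else None


# Hypothetical site identifier, invented for illustration only.
protein = protein_from_site("AKT1~P31749_S473-p")
no_site = protein_from_site("AKT1~P31749")
```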
run_regulation_enrichment(regulation_data, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher', correction='fdr_bh')[source]

This function runs a simple enrichment analysis for significantly regulated features in a dataset.

Parameters
  • regulation_data – pandas dataframe resulting from differential regulation analysis.

  • annotation – pandas dataframe with annotations for features (columns: ‘annotation’, ‘identifier’ (feature identifiers), and ‘source’).

  • identifier (str) – name of the column from annotation containing feature identifiers.

  • groups (list) – column names from regulation_data containing group identifiers.

  • annotation_col (str) – name of the column from annotation containing annotation terms.

  • reject_col (str) – name of the column from regulation_data containing boolean for rejected null hypothesis.

  • group_col (str) – column name for new column in annotation dataframe determining if feature belongs to foreground or background.

  • method (str) – method used to compute enrichment (only ‘fisher’ is supported currently).

Returns

Pandas dataframe with columns: ‘terms’, ‘identifiers’, ‘foreground’, ‘background’, ‘pvalue’, ‘padj’ and ‘rejected’.

Example:

result = run_regulation_enrichment(regulation_data, annotation, identifier='identifier', groups=['group1', 'group2'], annotation_col='annotation', reject_col='rejected', group_col='group', method='fisher')
run_enrichment(data, foreground_id, background_id, annotation_col='annotation', group_col='group', identifier_col='identifier', method='fisher', correction='fdr_bh')[source]

Computes enrichment of the foreground relative to a given background, using Fisher’s exact test, and corrects for multiple hypothesis testing.

Parameters
  • data – pandas dataframe with annotations for dataset features (columns: ‘annotation’, ‘identifier’, ‘source’, ‘group’).

  • foreground_id (str) – group identifier of features that belong to the foreground.

  • background_id (str) – group identifier of features that belong to the background.

  • annotation_col (str) – name of the column containing annotation terms.

  • group_col (str) – name of column containing the group identifiers.

  • identifier_col (str) – name of column containing dependent variables identifiers.

  • method (str) – method used to compute enrichment (only ‘fisher’ is supported currently).

Returns

Pandas dataframe with annotation terms, features, number of foreground/background features in each term, p-values and corrected p-values (columns: ‘terms’, ‘identifiers’, ‘foreground’, ‘background’, ‘pvalue’, ‘padj’ and ‘rejected’).

Example:

result = run_enrichment(data, foreground_id='foreground', background_id='background', annotation_col='annotation', group_col='group', identifier_col='identifier', method='fisher')
calculate_fold_change(df, condition1, condition2)[source]

Calculates fold-changes between two groups for all proteins in a dataframe.

Parameters
  • df – pandas dataframe with samples as rows and protein identifiers as columns.

  • condition1 (str) – identifier of first group.

  • condition2 (str) – identifier of second group.

Returns

Numpy array.

Example:

result = calculate_fold_change(data, 'group1', 'group2')
cohen_d(df, condition1, condition2, ddof=0)[source]

Calculates Cohen’s d effect size based on the distance between two means, measured in standard deviations. For more information visit https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanstd.html.

Parameters
  • df – pandas dataframe with samples as rows and protein identifiers as columns.

  • condition1 (str) – identifier of first group.

  • condition2 (str) – identifier of second group.

  • ddof (int) – Delta Degrees of Freedom used for the standard deviation (the divisor is N - ddof).

Returns

Numpy array.

Example:

result = cohen_d(data, 'group1', 'group2', ddof=0)
hedges_g(df, condition1, condition2, ddof=0)[source]

Calculates Hedges’ g effect size (more accurate for sample sizes below 20 than Cohen’s d). For more information visit https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanstd.html.

Parameters
  • df – pandas dataframe with samples as rows and protein identifiers as columns.

  • condition1 (str) – identifier of first group.

  • condition2 (str) – identifier of second group.

  • ddof (int) – Delta Degrees of Freedom used for the standard deviation (the divisor is N - ddof).

Returns

Numpy array.

Example:

result = hedges_g(data, 'group1', 'group2', ddof=0)
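
For orientation, the two effect sizes can be sketched for a single feature as follows. This version assumes the textbook definitions (pooled sample standard deviation for Cohen's d, and the common small-sample correction factor 1 - 3/(4N - 9) for Hedges' g); the library functions may differ in detail, for example in how ddof is applied. The function names are hypothetical.

```python
import math
import statistics


def cohen_d_sketch(group1, group2):
    """Difference of means divided by the pooled sample standard deviation."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd


def hedges_g_sketch(group1, group2):
    """Cohen's d shrunk by a small-sample bias correction."""
    n = len(group1) + len(group2)
    return cohen_d_sketch(group1, group2) * (1 - 3 / (4 * n - 9))


d = cohen_d_sketch([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
g = hedges_g_sketch([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

The correction factor approaches 1 as the total sample size grows, so the two measures converge for large groups.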
run_mapper(data, lenses=['l2norm'], n_cubes=15, overlap=0.5, n_clusters=3, linkage='complete', affinity='correlation')[source]
Parameters
  • data

  • lenses

  • n_cubes

  • overlap

  • n_clusters

  • linkage

  • affinity

Returns

run_WGCNA(data, drop_cols_exp, drop_cols_cli, RsquaredCut=0.8, networkType='unsigned', minModuleSize=30, deepSplit=2, pamRespectsDendro=False, merge_modules=True, MEDissThres=0.25, verbose=0, sd_cutoff=0)[source]

Runs an automated weighted gene co-expression network analysis (WGCNA), using input proteomics/transcriptomics/genomics and clinical variables data.

Parameters
  • data (dict) – dictionary of pandas dataframes with processed clinical and experimental datasets

  • drop_cols_exp (list) – column names to be removed from the experimental dataset.

  • drop_cols_cli (list) – column names to be removed from the clinical dataset.

  • RsquaredCut (float) – desired minimum scale free topology fitting index R^2.

  • networkType (str) – network type (‘unsigned’, ‘signed’, ‘signed hybrid’, ‘distance’).

  • minModuleSize (int) – minimum module size.

  • deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.

  • pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.

  • merge_modules (bool) – if True, very similar modules are merged.

  • MEDissThres (float) – maximum dissimilarity (i.e., 1-correlation) that qualifies modules for merging.

  • verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.

Returns

Tuple with multiple pandas dataframes.

Example:

result = run_WGCNA(data, drop_cols_exp=['subject', 'sample', 'group', 'index'], drop_cols_cli=['subject', 'biological_sample', 'group', 'index'], RsquaredCut=0.8, networkType='unsigned', minModuleSize=30, deepSplit=2, pamRespectsDendro=False, merge_modules=True, MEDissThres=0.25, verbose=0)
most_central_edge(G)[source]

Computes the eigenvector centrality for the graph G and finds the highest value.

Parameters

G (graph) – networkx graph

Returns

Highest eigenvector centrality value.

Return type

float

get_louvain_partitions(G, weight)[source]

Computes the partition of the graph nodes which maximises the modularity (or attempts to) using the Louvain heuristics. For more information visit https://python-louvain.readthedocs.io/en/latest/api.html.

Parameters
  • G (graph) – networkx graph which is decomposed.

  • weight (str) – the key in graph to use as weight.

Returns

The partition, with communities numbered from 0 to number of communities.

Return type

dict

get_network_communities(graph, args)[source]

Finds communities in a graph using different methods. For more information on the methods visit:

Parameters
  • graph (graph) – networkx graph

  • args (dict) – config file arguments

Returns

Dictionary of nodes and which community they belong to (from 0 to number of communities).

get_publications_abstracts(data, publication_col='publication', join_by=['publication', 'Proteins', 'Diseases'], index='PMID')[source]

Accesses NCBI PubMed over the WWW and retrieves the abstracts corresponding to a list of one or more PubMed IDs.

Parameters
  • data – pandas dataframe of diseases and publications linked to a list of proteins (columns: ‘Diseases’, ‘Proteins’, ‘linkout’ and ‘publication’).

  • publication_col (str) – column label containing PubMed ids.

  • join_by (list) – column labels to be kept from the input dataframe.

  • index (str) – column label containing PubMed ids from the NCBI retrieved data.

Returns

Pandas dataframe with publication information and columns ‘PMID’, ‘abstract’, ‘authors’, ‘date’, ‘journal’, ‘keywords’, ‘title’, ‘url’, ‘Proteins’ and ‘Diseases’.

Example:

result = get_publications_abstracts(data, publication_col='publication', join_by=['publication','Proteins','Diseases'], index='PMID')
eta_squared(aov)[source]

Calculates the effect size using Eta-squared.

Parameters

aov – pandas dataframe with anova results from statsmodels.

Returns

Pandas dataframe with additional Eta-squared column.

omega_squared(aov)[source]

Calculates the effect size using Omega-squared.

Parameters

aov – pandas dataframe with anova results from statsmodels.

Returns

Pandas dataframe with additional Omega-squared column.
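
Both effect sizes follow from the sums of squares in an ANOVA table. The sketch below assumes a one-way design, where the total sum of squares is the effect plus the error term, and takes the values as plain numbers instead of a statsmodels table. The function names are hypothetical.

```python
def eta_squared_sketch(ss_effect, ss_error):
    """Share of the total sum of squares explained by the effect."""
    return ss_effect / (ss_effect + ss_error)


def omega_squared_sketch(ss_effect, ss_error, df_effect, df_error):
    """Less biased variant that subtracts the expected error contribution."""
    ms_error = ss_error / df_error
    ss_total = ss_effect + ss_error
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)


eta = eta_squared_sketch(ss_effect=20.0, ss_error=80.0)
omega = omega_squared_sketch(ss_effect=20.0, ss_error=80.0,
                             df_effect=2, df_error=40)
```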

run_two_way_anova(df, drop_cols=['sample'], subject='subject', group=['group', 'secondary_group'])[source]

Run a 2-way ANOVA when data[‘secondary_group’] is not empty

Parameters
  • df – processed pandas dataframe with samples as rows, and proteins and groups as columns.

  • drop_cols (list) – column names to drop from dataframe

  • subject (str) – column name containing subject identifiers.

  • group (list) – column names corresponding to independent variable groups

Returns

Two dataframes, anova results and residuals.

Example:

result = run_two_way_anova(data, drop_cols=['sample'], subject='subject', group=['group', 'secondary_group'])
merge_for_polar(regulation_data, regulators, identifier_col='identifier', group_col='group', theta_col='modifier', aggr_func='mean', normalize=True)[source]
run_qc_markers_analysis(data, qc_markers, sample_col='sample', group_col='group', drop_cols=['subject'], identifier_col='identifier', qcidentifier_col='identifier', qcclass_col='class')[source]
run_snf(df_dict, clusters, distance_metric, K_affinity, mu_affinity)[source]
Parameters
  • df_dict

  • clusters

run_km(data, time_col, event_col, group_col, args={})[source]

wgcnaAnalysis.py

get_data(data, drop_cols_exp=['subject', 'group', 'sample', 'index'], drop_cols_cli=['subject', 'group', 'biological_sample', 'index'], sd_cutoff=0)[source]

This function cleans up and formats experimental and clinical data into similarly shaped dataframes.

Parameters
  • data (dict) – dictionary with processed clinical and proteomics datasets.

  • drop_cols_exp (list) – list of columns to drop from processed experimental (proteomics/rna-seq/dna-seq) dataframe.

  • drop_cols_cli (list) – list of columns to drop from processed clinical dataframe.

Returns

Dictionary with experimental and clinical dataframes (keys are the same as in the input dictionary).

get_dendrogram(df, labels, distfun='euclidean', linkagefun='ward', div_clusters=False, fcluster_method='distance', fcluster_cutoff=15)[source]

This function calculates the distance matrix and performs hierarchical cluster analysis on the resulting set of dissimilarities.

Parameters
  • df – pandas dataframe with samples/subjects as index and features as columns.

  • labels (list) – labels for the leaves of the tree.

  • distfun (str) – distance measure to be used (‘euclidean’, ‘maximum’, ‘manhattan’, ‘canberra’, ‘binary’, ‘minkowski’ or ‘jaccard’).

  • linkagefun (str) – hierarchical/agglomeration method to be used (‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ or ‘ward’).

  • div_clusters (bool) – dividing dendrogram leaves into clusters (True or False).

  • fcluster_method (str) – criterion to use in forming flat clusters.

  • fcluster_cutoff (int) – maximum cophenetic distance between observations in each cluster.

Returns

Dictionary of data structures computed to render the dendrogram. Keys: ‘icoords’, ‘dcoords’, ‘ivl’ and ‘leaves’. If div_clusters is used, it will also return a dictionary of each cluster and respective leaves.

get_clusters_elements(linkage_matrix, fcluster_method, fcluster_cutoff, labels)[source]

This function implements the generation of flat clusters from a hierarchical clustering with the same interface as scipy.cluster.hierarchy.fcluster.

Parameters
  • linkage_matrix (ndarray) – hierarchical clustering encoded with a linkage matrix.

  • fcluster_method (str) – criterion to use in forming flat clusters (‘inconsistent’, ‘distance’, ‘maxclust’, ‘monocrit’, ‘maxclust_monocrit’).

  • fcluster_cutoff (float) – maximum cophenetic distance between observations in each cluster.

  • labels (list) – labels for the leaves of the dendrogram.

Returns

A dictionary where keys are the cluster numbers and values are the dendrogram leaves.
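
The shape of the returned dictionary can be illustrated without scipy. The sketch below takes a list of flat cluster numbers, such as the array scipy.cluster.hierarchy.fcluster would produce, and groups the leaf labels by cluster. The assignments shown are invented and the function name is hypothetical.

```python
from collections import defaultdict


def clusters_to_dict(assignments, labels):
    """Map each cluster number to the list of leaf labels assigned to it."""
    clusters = defaultdict(list)
    for cluster_id, label in zip(assignments, labels):
        clusters[cluster_id].append(label)
    return dict(clusters)


# Invented flat-cluster assignments for four dendrogram leaves.
clusters = clusters_to_dict([1, 2, 1, 2], ["s1", "s2", "s3", "s4"])
```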

filter_df_by_cluster(df, clusters, number)[source]

Select only the members of a defined cluster.

Parameters
  • df – pandas dataframe with samples/subjects as index and features as columns.

  • clusters (dict) – clusters dictionary from get_dendrogram function if div_clusters option was True.

  • number (int) – cluster number (key).

Returns

Pandas dataframe with all the features (columns) and samples/subjects belonging to the defined cluster (index).

df_sort_by_dendrogram(df, Z_dendrogram)[source]

Reorders pandas dataframe by index and according to the dendrogram list of leaf nodes labels.

Parameters
  • df – pandas dataframe with the labels to be reordered as index.

  • Z_dendrogram (dict) – dictionary of data structures computed to render the dendrogram. Keys: ‘icoords’, ‘dcoords’, ‘ivl’ and ‘leaves’.

Returns

Reordered pandas dataframe.

get_percentiles_heatmap(df, Z_dendrogram, bydendro=True, bycols=False)[source]

This function transforms the absolute values in each row or column (option ‘bycols’) into relative values.

Parameters
  • df – pandas dataframe with samples/subjects as index and features as columns.

  • Z_dendrogram (dict) – dictionary of data structures computed to render the dendrogram. Keys: ‘icoords’, ‘dcoords’, ‘ivl’ and ‘leaves’.

  • bydendro (bool) – set to True to order labels according to the dendrogram list of leaf node labels, otherwise set to False.

  • bycols (bool) – set to False to calculate relative values across rows (samples), or to True to calculate them across columns (features).

Returns

Pandas dataframe.

get_miss_values_df(data)[source]

Processes pandas dataframe so missing values can be plotted in heatmap with specific color.

Parameters

data – pandas dataframe.

Returns

Pandas dataframe with missing values as integer 1, and originally valid values as NaN.

paste_matrices(matrix1, matrix2, rows, cols)[source]

Takes two matrices with analogous shapes and concatenates each value in matrix 1 with the corresponding one in matrix 2, returning a single pandas dataframe.

Parameters
  • matrix1 (ndarray) – input 1

  • matrix2 (ndarray) – input 2

Returns

Pandas dataframe.
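
The element-wise pasting can be pictured with nested lists. The sketch below joins each value of the first matrix with the matching value of the second into one string per cell, keyed by row and column labels. The output format, the nested-dictionary return type and the function name are assumptions chosen for illustration; the documented function returns a pandas dataframe.

```python
def paste_matrices_sketch(matrix1, matrix2, rows, cols, fmt="{:.2f} ({:.3f})"):
    """Combine two equally shaped matrices into {row: {col: 'v1 (v2)'}}."""
    return {
        row: {col: fmt.format(v1, v2)
              for col, v1, v2 in zip(cols, values1, values2)}
        for row, values1, values2 in zip(rows, matrix1, matrix2)
    }


# Invented correlation and p-value matrices for two modules and two traits.
pasted = paste_matrices_sketch([[0.8, -0.2], [0.1, 0.5]],
                               [[0.01, 0.6], [0.9, 0.04]],
                               rows=["module1", "module2"],
                               cols=["trait1", "trait2"])
```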

cutreeDynamic(distmatrix, linkagefun='average', minModuleSize=50, method='hybrid', deepSplit=2, pamRespectsDendro=False, distfun=None)[source]

This function implements the R cutreeDynamic wrapper in Python, providing an access point for methods of adaptive branch pruning of hierarchical clustering dendrograms.

Parameters
  • data – pandas dataframe.

  • distfun (str) – distance measure to be used (‘euclidean’, ‘maximum’, ‘manhattan’, ‘canberra’, ‘binary’, ‘minkowski’ or ‘jaccard’).

  • linkagefun (str) – hierarchical/agglomeration method to be used (‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ or ‘ward’).

  • minModuleSize (int) – minimum module size.

  • method (str) – method to use (‘hybrid’ or ‘tree’).

  • deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.

  • pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.

Returns

Numpy array of numerical labels giving assignment of objects to modules. Unassigned objects are labeled 0, the largest module has label 1, next largest 2 etc.

build_network(data, softPower=6, networkType='unsigned', linkagefun='average', method='hybrid', minModuleSize=50, deepSplit=2, pamRespectsDendro=False, merge_modules=True, MEDissThres=0.4, verbose=0)[source]

Weighted gene network construction and module detection. Calculates co-expression similarity and adjacency, topological overlap matrix (TOM) and clusters features in modules.

Parameters
  • data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.

  • softPower (int) – soft-thresholding power.

  • networkType (str) – network type (‘unsigned’, ‘signed’, ‘signed hybrid’, ‘distance’).

  • linkagefun (str) – hierarchical/agglomeration method to be used (‘single’, ‘complete’, ‘average’, ‘weighted’, ‘centroid’, ‘median’ or ‘ward’).

  • method (str) – method to use (‘hybrid’ or ‘tree’).

  • minModuleSize (int) – minimum module size.

  • pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.

  • merge_modules (bool) – if True, very similar modules are merged.

  • MEDissThres (float) – maximum dissimilarity (i.e., 1-correlation) that qualifies modules for merging.

  • verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.

  • deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.

Returns

Tuple with TOM dissimilarity pandas dataframe, numpy array with module colors per experimental feature.
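
Example (a sketch of the adjacency and TOM steps named above, assuming the standard WGCNA definitions for an unsigned network; build_network itself may delegate these steps to R). The helper names unsigned_adjacency and tom_dissimilarity are hypothetical:

```python
import numpy as np
import pandas as pd


def unsigned_adjacency(data, soft_power=6):
    """Unsigned adjacency: |Pearson correlation| raised to soft_power."""
    corr = np.corrcoef(data.to_numpy(), rowvar=False)  # features are columns
    return np.abs(corr) ** soft_power


def tom_dissimilarity(adjacency):
    """Return 1 - topological overlap for an unsigned adjacency matrix."""
    a = adjacency.copy()
    np.fill_diagonal(a, 0.0)            # ignore self-connections
    shared = a @ a                      # sum over u of a_iu * a_uj
    k = a.sum(axis=0)                   # connectivity of each feature
    k_min = np.minimum.outer(k, k)      # min(k_i, k_j) for every pair
    tom = (shared + a) / (k_min + 1.0 - a)
    np.fill_diagonal(tom, 1.0)          # a feature fully overlaps with itself
    return 1.0 - tom


# Random illustrative data: 20 samples (rows) by 5 features (columns).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(20, 5)), columns=list("abcde"))
adjacency = unsigned_adjacency(data, soft_power=6)
diss_tom = pd.DataFrame(
    tom_dissimilarity(adjacency), index=data.columns, columns=data.columns
)
```

The resulting diss_tom is a symmetric feature-by-feature dataframe with zeros on the diagonal, the kind of matrix that a hierarchical clustering step can then cut into modules.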

pick_softThreshold(data, RsquaredCut=0.8, networkType='unsigned', verbose=0)[source]

Analysis of scale free topology for multiple soft thresholding powers. Aids the user in choosing a proper soft-thresholding power for network construction.

Parameters
  • data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.

  • RsquaredCut (float) – desired minimum scale free topology fitting index R^2.

  • networkType (str) – network type (‘unsigned’, ‘signed’, ‘signed hybrid’, ‘distance’).

  • verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.

Returns

Estimated appropriate soft-thresholding power: the lowest power for which the scale free topology fit R^2 exceeds RsquaredCut.

Return type

int
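
Example (a sketch of the selection rule only, assuming the fit index R^2 has already been computed for each candidate power). The function name lowest_power_above_cut and the R^2 values are made up for illustration; returning None when no power qualifies is a choice of this sketch, not documented behaviour:

```python
def lowest_power_above_cut(fit_by_power, rsquared_cut=0.8):
    """Return the lowest power whose scale-free fit R^2 exceeds the cut."""
    for power in sorted(fit_by_power):
        if fit_by_power[power] > rsquared_cut:
            return power
    return None  # no candidate power reached the requested fit


# Hypothetical fit indices, keyed by soft-thresholding power.
example_fits = {1: 0.12, 2: 0.45, 3: 0.71, 4: 0.83, 5: 0.88, 6: 0.86}
chosen_power = lowest_power_above_cut(example_fits, rsquared_cut=0.8)
```

With these illustrative values the rule selects power 4, the first power whose fit exceeds 0.8, even though power 5 has a higher R^2.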

identify_module_colors(matrix, linkagefun='average', method='hybrid', minModuleSize=30, deepSplit=2, pamRespectsDendro=False)[source]

Identifies co-expression modules and converts the numeric labels into colors.

Parameters
  • matrix – dissimilarity structure as produced by R.stats dist.

  • minModuleSize (int) – minimum module size.

  • deepSplit (int) – provides a rough control over sensitivity to cluster splitting, the higher the value (with ‘hybrid’ method) or if True (with ‘tree’ method), the more and smaller modules.

  • pamRespectsDendro (bool) – only used for method ‘hybrid’. Objects and small modules will only be assigned to modules that belong to the same branch in the dendrogram structure.

Returns

Numpy array of strings with module color of each experimental feature.

calculate_module_eigengenes(data, modColors, softPower=6, dissimilarity=True)[source]

Calculates module eigengenes to quantify co-expression similarity of entire modules.

Parameters
  • data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.

  • modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.

  • softPower (int) – soft-thresholding power.

  • dissimilarity (bool) – if True, also calculates the dissimilarity of module eigengenes.

Returns

Pandas dataframe with calculated module eigengenes. If dissimilarity is set to True, returns a tuple with two pandas dataframes, the first with the module eigengenes and the second with the eigengenes dissimilarity.
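
Example (a sketch, assuming the usual WGCNA definition of a module eigengene as the first principal component of the standardized features in a module, and of eigengene dissimilarity as 1 - correlation). The function name module_eigengenes_sketch is hypothetical, and softPower is not used here:

```python
import numpy as np
import pandas as pd


def module_eigengenes_sketch(data, mod_colors):
    """Return (eigengenes, eigengene dissimilarity) for each module color."""
    mod_colors = np.asarray(mod_colors)
    eigengenes = {}
    for color in np.unique(mod_colors):
        block = data.loc[:, mod_colors == color]
        # Standardize each feature, then take the first left singular vector.
        z = (block - block.mean()) / block.std(ddof=1)
        u, _, _ = np.linalg.svd(z.to_numpy(), full_matrices=False)
        me = u[:, 0]
        # Orient the eigengene to correlate positively with mean expression.
        if np.corrcoef(me, z.mean(axis=1).to_numpy())[0, 1] < 0:
            me = -me
        eigengenes["ME" + str(color)] = me
    mes = pd.DataFrame(eigengenes, index=data.index)
    return mes, 1 - mes.corr()


# Random illustrative data: 12 samples by 6 features in two modules.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(12, 6)), columns=[f"f{i}" for i in range(6)])
colors = np.array(["blue"] * 3 + ["red"] * 3)
mes, me_diss = module_eigengenes_sketch(data, colors)
```

Here mes has one column per module (MEblue, MEred) and one row per sample, and me_diss is the module-by-module dissimilarity that a merging step can compare against a threshold.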

merge_similar_modules(data, modColors, MEDissThres=0.4, verbose=0)[source]

Merges modules in co-expression network that are too close as measured by the correlation of their eigengenes.

Parameters
  • data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.

  • modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.

  • verbose (int) – integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.

  • MEDissThres (float) – maximum dissimilarity (i.e., 1-correlation) that qualifies modules for merging.

Returns

Tuple containing pandas dataframe with eigengenes of the new merged modules, and array with module colors of each experimental feature.
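
Example (a simplified sketch of the merging criterion, not the library's procedure). It groups modules whose eigengene dissimilarity, 1 - correlation, falls below the threshold by linking qualifying pairs directly; this is a swapped-in simplification, whereas WGCNA-style merging typically clusters the eigengenes hierarchically and cuts the tree at the threshold. The function name merge_close_modules and the eigengene values are hypothetical:

```python
import numpy as np
import pandas as pd


def merge_close_modules(mes, me_diss_thres=0.4):
    """Map each module to the label of the group it is merged into."""
    diss = 1 - mes.corr()
    names = list(mes.columns)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links until reaching the representative module.
        while parent[name] != name:
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if diss.loc[a, b] < me_diss_thres:
                parent[find(b)] = find(a)   # link the two groups
    return {name: find(name) for name in names}


# Illustrative eigengenes: MEred closely tracks MEblue, MEgreen does not.
base = np.arange(10.0)
alternating = np.array([1.0, -1.0] * 5)
mes = pd.DataFrame({
    "MEblue": base,
    "MEred": base + 0.1 * alternating,
    "MEgreen": alternating,
})
merged = merge_close_modules(mes, me_diss_thres=0.4)
```

In this example MEred is mapped onto MEblue because their dissimilarity is far below 0.4, while MEgreen keeps its own label.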

calculate_ModuleTrait_correlation(df_exp, df_traits, MEs)[source]

Correlates eigengenes with external traits in order to identify the most significant module-trait associations.

Parameters
  • df_exp – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.

  • df_traits – pandas dataframe containing clinical data, with samples/subjects as rows and clinical traits as columns.

  • MEs – pandas dataframe with module eigengenes.

Returns

Tuple with two pandas dataframes: first, the correlation between all module eigengenes and all clinical traits; second, a dataframe with concatenated correlation and p-value used for heatmap annotation.

calculate_ModuleMembership(data, MEs)[source]

For each module, calculates the correlation of the module eigengene and the feature expression profile (quantitative measure of module membership (MM)).

Parameters
  • data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.

  • MEs – pandas dataframe with module eigengenes.

Returns

Tuple with two pandas dataframes, one with module membership correlations and another with p-values.

calculate_FeatureTraitSignificance(df_exp, df_traits)[source]

Quantifies associations of individual experimental features with the measured clinical traits, by defining Feature Significance (FS) as the absolute value of the correlation between the feature and the trait.

Parameters
  • df_exp – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.

  • df_traits – pandas dataframe containing clinical data, with samples/subjects as rows and clinical traits as columns.

Returns

Tuple with two pandas dataframes, one with feature significance correlations and another with p-values.
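
Example (a sketch of the Feature Significance definition given above, FS = |correlation(feature, trait)|, using Pearson correlation from pandas). The function name feature_significance_sketch is hypothetical, the p-value dataframe is omitted, and feature and trait column names are assumed not to overlap:

```python
import pandas as pd


def feature_significance_sketch(df_exp, df_traits):
    """Absolute Pearson correlation of every feature with every trait."""
    # Align both frames on the sample index, then correlate all columns.
    combined = pd.concat([df_exp, df_traits], axis=1)
    corr = combined.corr().loc[df_exp.columns, df_traits.columns]
    return corr.abs()


# Illustrative data: four samples, two features and one clinical trait.
samples = ["s1", "s2", "s3", "s4"]
df_exp = pd.DataFrame(
    {"f1": [1.0, 2.0, 3.0, 4.0], "f2": [4.0, 3.0, 2.0, 1.0]}, index=samples
)
df_traits = pd.DataFrame({"age": [10.0, 20.0, 30.0, 40.0]}, index=samples)
fs = feature_significance_sketch(df_exp, df_traits)
```

Both features obtain a significance of 1 for the trait here: f1 is perfectly correlated with age and f2 perfectly anti-correlated, and the absolute value discards the sign.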

get_FeaturesPerModule(data, modColors, mode='dictionary')[source]

Groups all experimental features by the co-expression module they belong to.

Parameters
  • data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.

  • modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.

  • mode (str) – type of the value returned by the function (‘dictionary’ or ‘dataframe’).

Returns

Depending on selected mode, returns a dictionary or dataframe with module color per experimental feature.
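
Example (a sketch of the grouping, assuming that modColors lists one color per column of data, in column order). The function name features_per_module_sketch, the column labels 'name' and 'modColor', and the exact shape of the returned objects are assumptions made for this illustration:

```python
import pandas as pd


def features_per_module_sketch(data, mod_colors, mode="dictionary"):
    """Group feature names (columns of data) by their module color."""
    table = pd.DataFrame({"name": data.columns, "modColor": list(mod_colors)})
    if mode == "dataframe":
        return table
    # One dictionary entry per module color, listing its features.
    return {
        color: list(group["name"])
        for color, group in table.groupby("modColor")
    }


# Illustrative data: three features assigned to two modules.
data = pd.DataFrame({"f1": [1, 2], "f2": [3, 4], "f3": [5, 6]})
colors = ["blue", "red", "blue"]
by_module = features_per_module_sketch(data, colors)
as_table = features_per_module_sketch(data, colors, mode="dataframe")
```

The dictionary form is convenient for looking up all features of one module, while the dataframe form keeps one row per feature.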

get_ModuleFeatures(data, modColors, modules=[])[source]

Groups and returns a list of the experimental features clustered in specific co-expression modules.

Parameters
  • data – pandas dataframe containing experimental data, with samples/subjects as rows and features as columns.

  • modColors (ndarray) – array (numeric, character or a factor) attributing module colors to each feature in the experimental dataframe.

  • modules (list) – list of module colors of interest.

Returns

List of lists with experimental features in each selected module.

get_EigengenesTrait_correlation(MEs, data)[source]

Eigengenes are used as representative profiles of the co-expression modules, and correlation between them is used to quantify module similarity. Clinical traits are added to the eigengenes to see how the traits fit into the eigengene network.

Parameters
  • MEs – pandas dataframe with module eigengenes.

  • data – pandas dataframe containing clinical data, with samples/subjects as rows and clinical traits as columns.

Returns

Tuple with two pandas dataframes, one with the module eigengene dissimilarity recalculated with the traits included, and another with all the overall correlations.

kaplan_meierAnalysis.py

get_data_ready_for_km(dfs_dict, args)[source]
group_data_based_on_marker(df, marker, index_col, how, value)[source]
run_km(data, time_col, event_col, group_col, args={})[source]
get_km_results(df, group_col, time_col, event_col)[source]