Graph Database Builder (graphdb_builder)
builder_utils.py
-
parse_contents(contents, filename)
Reads binary string files and returns a Pandas DataFrame.
-
export_contents(data, dataDir, filename)
Exports a Pandas DataFrame to file, with UTF-8 encoding.
-
write_relationships(relationships, header, outputfile)
Reads a set of relationships and saves them to a file.
-
write_entities(entities, header, outputfile)
Reads a set of entities and saves them to a file.
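As a rough illustration of what the two writer functions above do, the sketch below writes a header row followed by one row per tuple. The helper name `write_rows`, the tab delimiter, the sorting step and the example column names are all assumptions made for this example; the library's own implementation may differ.

```python
import csv
import os
import tempfile


def write_rows(rows, header, outputfile):
    """Write a header line and one line per row to a tab-separated file.

    Hypothetical helper: the real functions may use a different
    delimiter, quoting or row order.
    """
    with open(outputfile, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        # Sorting makes the output deterministic when rows is a set.
        writer.writerows(sorted(rows))


# Example: a set of (start, end, type) relationship tuples (made-up values).
relationships = {
    ("ID1", "TERM1", "ASSOCIATED_WITH"),
    ("ID2", "TERM2", "ASSOCIATED_WITH"),
}
header = ["START_ID", "END_ID", "TYPE"]
outputfile = os.path.join(tempfile.mkdtemp(), "relationships.tsv")
write_rows(relationships, header, outputfile)
```

The same helper would work for entities, since both cases reduce to a header plus a collection of tuples.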
-
get_config(config_name, data_type='databases')
Reads YAML configuration file and converts it into a Python dictionary.
- Parameters
- Returns
Dictionary.
Note
Use this function to obtain configuration for individual database/ontology parsers.
-
expand_cols(data, col, sep=';')
Expands the rows of a dataframe by splitting the values of the specified column (col) on the separator (sep).
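The row-expansion idea can be sketched in plain Python. This example works on a list of dictionaries rather than a DataFrame, and the name `expand_rows` is hypothetical; it only illustrates the splitting logic, not how expand_cols is implemented.

```python
def expand_rows(rows, col, sep=";"):
    """Return a new list with one row per value found in row[col]."""
    expanded = []
    for row in rows:
        for value in str(row[col]).split(sep):
            new_row = dict(row)  # copy so the input rows are not modified
            new_row[col] = value.strip()
            expanded.append(new_row)
    return expanded


# Made-up rows: the first one holds two values in the "genes" column.
rows = [
    {"protein": "P1", "genes": "A;B"},
    {"protein": "P2", "genes": "C"},
]
expanded = expand_rows(rows, "genes")
```

Each input row with n separated values yields n output rows that share all other columns.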
-
setup_config(data_type='databases')
Reads YAML configuration file and converts it into a Python dictionary.
- Parameters
data_type – configuration type (‘databases’, ‘ontologies’, ‘experiments’ or ‘builder’).
- Returns
Dictionary.
Note
This function should be used to obtain the configuration for databases_controller.py, ontologies_controller.py, experiments_controller.py and builder.py.
-
get_full_path_directories()
Reads Builder YAML configuration file and returns the full path of all directories.
- Returns
Dictionary.
-
list_ftp_directory(ftp_url, user='', password='')
Lists all files present in a folder on an FTP server.
-
download_PRIDE_data(pxd_id, file_name, to='.', user='', password='', date_field='publicationDate')
Downloads a project file from the PRIDE repository.
- Parameters
pxd_id (str) – PRIDE project identifier (e.g. PXD013599).
file_name (str) – name of the file to download.
to (str) – local directory where the file should be downloaded.
user (str) – username to access biomedical database server if required.
password (str) – password to access biomedical database server if required.
date_field (str) – projects deposited in PRIDE are searched by date, either submissionDate or publicationDate (default).
-
downloadDB(databaseURL, directory=None, file_name=None, user='', password='', avoid_wget=False)
Downloads the raw files from a biomedical database server when a link is provided.
- Parameters
databaseURL (str) – link to access biomedical database server.
file_name (str or None) – name of the file to download. If None, ‘databaseURL’ must contain the filename after the last ‘/’.
user (str) – username to access biomedical database server if required.
password (str) – password to access biomedical database server if required.
-
searchPubmed(searchFields, sortby='relevance', num='10', resultsFormat='json')
Searches PubMed database for MeSH terms and other additional fields (‘searchFields’), sorts them by relevance and returns the top ‘num’.
-
is_number(s)
Checks whether the given input is a float. Returns True if it is, and False otherwise.
- Parameters
s – input
- Returns
Boolean.
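One common way to write such a check is to attempt a float conversion and catch the failure. The sketch below assumes that approach; the name `looks_like_number` is hypothetical and the actual is_number may be implemented differently.

```python
def looks_like_number(s):
    """Return True if s can be converted to a float, False otherwise."""
    try:
        float(s)
        return True
    except (TypeError, ValueError):
        # TypeError covers inputs such as None or lists,
        # ValueError covers strings such as "abc".
        return False
```

With this approach, strings such as "1e3", "nan" and "inf" also count as numbers, which may or may not be desirable depending on the data being parsed.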
-
getMedlineAbstracts(idList)
Accesses NCBI over the WWW and returns Medline data as a handle object, which is parsed and converted to a Pandas DataFrame.
-
listDirectoryFiles(directory)
Lists all files in a specified directory.
- Parameters
directory (str) – path to folder.
- Returns
List of file names.
-
listDirectoryFolders(directory)
Lists all directories in a specified directory.
- Parameters
directory (str) – path to folder.
- Returns
List of folder names.
-
listDirectoryFoldersNotEmpty(directory)
Lists all non-empty directories in a specified directory.
- Parameters
directory (str) – path to folder.
- Returns
List of folder names.
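The directory-listing helpers above can be approximated with the standard library. The function name `list_folders`, the `only_non_empty` flag and the choice to skip hidden entries are assumptions made for this sketch, not documented behaviour of the library.

```python
import os
import tempfile


def list_folders(directory, only_non_empty=False):
    """Return the sorted names of the sub-folders of directory."""
    if not os.path.isdir(directory):
        return []
    folders = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if name.startswith(".") or not os.path.isdir(path):
            continue  # skip hidden entries and regular files
        if only_non_empty and not os.listdir(path):
            continue  # skip folders without any content
        folders.append(name)
    return sorted(folders)


# Build a small throwaway tree: one empty folder, one folder with a file.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "empty"))
os.mkdir(os.path.join(root, "full"))
with open(os.path.join(root, "full", "data.txt"), "w") as handle:
    handle.write("x")

all_folders = list_folders(root)
non_empty = list_folders(root, only_non_empty=True)
```

Returning an empty list for a missing directory keeps callers free of existence checks, at the cost of hiding a mistyped path.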
-
checkDirectory(directory)
Checks if the given directory exists and, if not, creates it.
- Parameters
directory (str) – path to folder.
-
flatten(t)
Generator flattening the structure.
Code from: https://gist.github.com/shaxbee/0ada767debf9eefbdb6e
Acknowledgements: Zbigniew Mandziejewicz (shaxbee)
>>> list(flatten([2, [2, (4, 5, [7], [2, [6, 2, 6, [6], 4]], 6)]]))
[2, 2, 4, 5, 7, 2, 6, 2, 6, 6, 4, 6]
-
pretty_print(data)
Provides a capability to “pretty-print” arbitrary Python data structures in a form that can be used as input to the interpreter. For more information visit https://docs.python.org/2/library/pprint.html.
- Parameters
data – python object.
-
convertOBOtoNet(ontologyFile)
Takes an .obo file and returns a NetworkX graph representation of the ontology that can hold multiple edges between two nodes.
- Parameters
ontologyFile (str) – path to ontology file.
- Returns
NetworkX graph.
-
getCurrentTime()
Returns current date (Year-Month-Day) and time (Hour-Minute-Second).
- Returns
Two strings: date and time.
-
convert_bytes(num)
Converts a size in bytes to a larger unit (MB, GB, etc.).
- Parameters
num – float, integer or pandas.Series.
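A typical byte-conversion routine divides by 1024 until the value fits the current unit. The sketch below assumes binary multiples, one decimal place and scalar input only (the pandas.Series case is not covered); `human_readable_size` is a hypothetical name and the output format of convert_bytes may differ.

```python
def human_readable_size(num):
    """Format a byte count using binary multiples (1 KB = 1024 bytes)."""
    for unit in ("bytes", "KB", "MB", "GB", "TB"):
        if num < 1024.0:
            return f"{num:.1f} {unit}"
        num /= 1024.0  # move on to the next larger unit
    return f"{num:.1f} PB"
```

For example, under these assumptions a value of 2048 is reported in KB and a value of 5 * 1024**3 in GB.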
-
buildStats(count, otype, name, dataset, filename, updated_on=None)
Returns a tuple with all the information needed to build a stats file.
- Parameters
- Returns
Tuple with date, time, database name, file where entities/relationships are stored, file size, number of entities/relationships imported, type and label.
-
unrar(filepath, to)
Decompresses a RAR file.
- Parameters
filepath (str) – path to RAR file.
to (str) – where to extract all files.
-
compress_directory(folder_to_backup, dest_folder, file_name)
Compresses a folder to .tar.gz to create a data backup archive file.
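A backup archive of this kind can be produced with the standard tarfile module. This is a sketch under assumptions: the name `make_backup` is hypothetical, `file_name` is assumed to be given without extension, and the library itself may build the archive differently (for example by calling an external tool).

```python
import os
import tarfile
import tempfile


def make_backup(folder_to_backup, dest_folder, file_name):
    """Archive folder_to_backup as dest_folder/file_name.tar.gz."""
    os.makedirs(dest_folder, exist_ok=True)
    archive_path = os.path.join(dest_folder, file_name + ".tar.gz")
    # normpath drops a trailing slash so basename is never empty.
    top_level = os.path.basename(os.path.normpath(folder_to_backup))
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname stores only the folder name, not the absolute path.
        archive.add(folder_to_backup, arcname=top_level)
    return archive_path


# Throwaway source folder with a single file.
source = tempfile.mkdtemp()
with open(os.path.join(source, "notes.txt"), "w") as handle:
    handle.write("backup me")
archive_path = make_backup(source, tempfile.mkdtemp(), "backup")
```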
-
read_gzipped_file(filepath)
Opens an underlying process to access a gzip file through the creation of a new pipe to the child.
- Parameters
filepath (str) – path to gzip file.
- Returns
A bytes sequence that specifies the standard output.
-
parse_fasta(file_handler)
Uses Biopython's SeqIO to read a FASTA file.
- Parameters
file_handler (file_handler) – opened FASTA file.
- Returns
Iterator of sequence objects.
-
batch_iterator(iterator, batch_size)
Returns lists of length batch_size.
This can be used on any iterator, for example to batch up SeqRecord objects from Bio.SeqIO.parse(…), or to batch Alignment objects from Bio.AlignIO.parse(…), or simply lines from a file handle.
This is a generator function, and it returns lists of the entries from the supplied iterator. Each list will have batch_size entries, although the final list may be shorter.
- Parameters
iterator (iterator) – batch to be extracted
batch_size (integer) – size of the batch
- Returns
List with the batch elements, of size batch_size.
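The batching behaviour described above, including the shorter final list, can be sketched as a small generator. The name `batch_iterator_sketch` is used to make clear that this is an illustration, not necessarily the library's code.

```python
def batch_iterator_sketch(iterator, batch_size):
    """Yield lists of up to batch_size entries taken from iterator."""
    batch = []
    for entry in iterator:
        batch.append(entry)
        if len(batch) == batch_size:
            yield batch
            batch = []  # start a fresh list for the next batch
    if batch:
        # Remaining entries form a final, shorter batch.
        yield batch


batches = list(batch_iterator_sketch(iter(range(7)), 3))
```

Because the function is a generator, only one batch is held in memory at a time, which matters when iterating over large sequence files.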
mapping.py
-
reset_mapping(entity)
Checks if mapping.tsv file exists and removes it.
- Parameters
entity (str) – entity label as defined in databases_config.yml
-
mark_complete_mapping(entity)
Checks if mapping.tsv file exists and renames it to complete_mapping.tsv.
- Parameters
entity (str) – entity label as defined in databases_config.yml
-
getMappingFromOntology(ontology, source=None)
Converts a .tsv file with the complete list of ontology identifiers and aliases to a dictionary with aliases as keys and ontology identifiers as values.
-
getMappingForEntity(entity)
Converts a .tsv file with the complete list of entity identifiers and aliases to a dictionary with aliases as keys and entity identifiers as values.
- Parameters
entity (str) – entity label as defined in databases_config.yml.
- Returns
Dictionary of aliases (keys) and entity identifiers (value).
-
getMultipleMappingForEntity(entity)
Converts a .tsv file with the complete list of entity identifiers and aliases to a dictionary with aliases to other databases as keys and entity identifiers as values.
- Parameters
entity (str) – entity label as defined in databases_config.yml.
- Returns
Dictionary of aliases (keys) and set of unique entity identifiers (values).
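The difference between the single and the multiple mapping can be illustrated as follows. The three-column layout (identifier, source, alias), the function names and the sample rows are assumptions made for this sketch; the real mapping files and functions may be organised differently.

```python
import csv
import io
from collections import defaultdict


def read_single_mapping(handle):
    """Map each alias to one identifier (later rows overwrite earlier ones)."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # ignore incomplete lines
        identifier, _source, alias = row[:3]
        mapping[alias] = identifier
    return mapping


def read_multiple_mapping(handle):
    """Map each alias to the set of all identifiers that use it."""
    mapping = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue
        identifier, _source, alias = row[:3]
        mapping[alias].add(identifier)
    return dict(mapping)


# Made-up content: the alias "alpha" is shared by two identifiers.
example = "ID1\tsourceA\talpha\nID2\tsourceA\tbeta\nID3\tsourceB\talpha\n"
single = read_single_mapping(io.StringIO(example))
multiple = read_multiple_mapping(io.StringIO(example))
```

The set-valued variant keeps ambiguous aliases visible, whereas the single-valued variant silently keeps only the last identifier seen.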
-
get_STRING_mapping_url(db='STRING')
Gets the URL for downloading the mapping file from either STRING or STITCH.
- Parameters
db (str) – which database to get the URL from: STRING or STITCH.
- Returns
URL from which to download the mapping file.
-
getSTRINGMapping(source='BLAST_UniProt_AC', download=True, db='STRING')
Parses database (db) and extracts relationships between identifiers to other databases (source).
- Parameters
- Returns
Dictionary of database identifiers (keys) and set of unique aliases to other databases (values).
-
buildMappingFromOBO(oboFile, ontology)
Parses and extracts ontology identifiers, names and synonyms from the raw file, and writes all the information to a .tsv file.
- Parameters
oboFile (str) – path to ontology raw file.
ontology (str) – ontology database acronym as defined in ontologies_config.yml.
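Extracting identifiers, names and synonyms from OBO text can be sketched with a small line-based parser. The function name `parse_obo_terms` and the sample stanzas are hypothetical, and the sketch only handles the `id`, `name` and `synonym` tags of `[Term]` stanzas; it does not write the .tsv file and is not the library's parser.

```python
import io


def parse_obo_terms(handle):
    """Yield (identifier, name, synonyms) for each [Term] stanza."""
    term = None
    for raw_line in handle:
        line = raw_line.strip()
        if line.startswith("["):
            # A new stanza starts: emit the previous term, if any.
            if term is not None and term["id"]:
                yield term["id"], term["name"], term["synonyms"]
            # Only [Term] stanzas are collected; others are skipped.
            term = {"id": None, "name": None, "synonyms": []} if line == "[Term]" else None
        elif term is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key == "id":
                term["id"] = value
            elif key == "name":
                term["name"] = value
            elif key == "synonym":
                # The synonym text is enclosed in double quotes.
                parts = value.split('"')
                if len(parts) >= 3:
                    term["synonyms"].append(parts[1])
    if term is not None and term["id"]:
        yield term["id"], term["name"], term["synonyms"]


# Made-up OBO content with two terms and one non-term stanza.
example = """format-version: 1.2

[Term]
id: EX:0001
name: example term
synonym: "sample term" EXACT []
synonym: "demo term" RELATED []

[Typedef]
id: part_of
name: part of

[Term]
id: EX:0002
name: second term
"""
terms = list(parse_obo_terms(io.StringIO(example)))
```

Each yielded tuple could then be turned into one line per name or synonym in a tab-separated mapping file.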