swissprot: added scripts to get swissprot dataset (0b72ba08) · Commits · Mateusz Pawlik / ted-datasets

bolzano-address-trees/download_prepare.sh

+1 −1

Original line number	Diff line number	Diff line
		@@ -16,7 +16,7 @@ wget https://dbresearch.uni-salzburg.at/projects/pq-gram-ordered-labeled-trees/b
		unzip bolzano-address-trees.zip -d original_data

		# remove zip file
		- rm -rf unzip bolzano-address-trees.zip
		+ rm -rf bolzano-address-trees.zip

		# change to unzipped folder
		cd original_data

0 → 100644

+48 −0

		# Swissprot

		## Short description

		Annotated and non-redundant protein sequence database. Can be downloaded in a large XML file.

		## Conversion details

		- Each tag becomes a node labeled with the tag's name.
		- Tag nesting is converted to parent-child relationships.
		- A tag's content becomes a child node of the tag, with the content as its label.
		- Each attribute, a (key, value) pair, becomes two nodes (parent and child) labeled with the key and the value, respectively.
		- Attribute-nodes are children of the corresponding tag-node.
		- Attribute-nodes are ordered by their keys.
		- Attribute-nodes come before the content-node.
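
The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's ``swissprot_to_bracket.py``: it uses the standard library's ``xml.etree.ElementTree`` instead of lxml, and the function names ``escape`` and ``to_bracket`` are hypothetical. It assumes that braces inside labels are backslash-escaped, that child tags follow the content-node, and that text trailing a closing tag is ignored. None of these three points is stated in the rules above.

```python
import xml.etree.ElementTree as ET


def escape(label):
    # Assumption: braces in labels are backslash-escaped so they are
    # not mistaken for tree structure.
    return label.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def to_bracket(elem):
    """Convert an element and its subtree to bracket notation."""
    parts = ["{", escape(elem.tag)]
    # Attribute-nodes come first, ordered by key; each is a key node
    # with a single child holding the value.
    for key in sorted(elem.attrib):
        parts.append("{" + escape(key) + "{" + escape(elem.attrib[key]) + "}}")
    # The content-node follows the attribute-nodes.
    text = (elem.text or "").strip()
    if text:
        parts.append("{" + escape(text) + "}")
    # Nested tags become child subtrees (placed last: an assumption).
    for child in elem:
        parts.append(to_bracket(child))
    parts.append("}")
    return "".join(parts)


if __name__ == "__main__":
    sample = '<entry created="2001" accession="P1"><name>ABC</name></entry>'
    print(to_bracket(ET.fromstring(sample)))
```

For the sample element, the attributes ``accession`` and ``created`` appear in key order before the nested ``name`` tag.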

		## Dependencies

		- lxml (XML processor for Python)
		http://lxml.de/installation.html
		- Python3
		https://www.python.org/downloads/
		- wget
		- gzip

		## Steps

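		The data source is hardcoded in ``download_prepare.sh`` and ``swissprot_to_bracket.py``. The download URL points to UniProt's ``current_release``, so the downloaded data changes between releases; pin a specific release if repeatability matters.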

		Execute to download all necessary files.
		```bash
		./download_prepare.sh
		```

		Execute to convert the raw data file into bracket notation. Takes some time.
		```bash
		python swissprot_to_bracket.py
		```

		Sample 100,000 random trees from the dataset. The join is performed on this subset only.
		```bash
		python random_lines.py 100000 swissprot.bracket > swissprot_random_100k.bracket
		```
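
If the bracket file is too large to load into memory at once, the sampling step could instead stream over the file. The sketch below uses reservoir sampling (Algorithm R), which is a different technique from the one in ``random_lines.py``; the function name ``reservoir_sample`` is hypothetical.

```python
import random


def reservoir_sample(lines, k, rng=random):
    """Return k items chosen uniformly at random from an iterable.

    Only k items are held in memory at any time. If the iterable has
    fewer than k items, all of them are returned.
    """
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = line
    return sample
```

It could be used as ``reservoir_sample(open("swissprot.bracket"), 100000)``. Unlike ``random.sample``, it does not raise an error when fewer than ``k`` lines are available.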

		Execute to sort the dataset by tree size.
		```bash
		./sort_dataset.sh swissprot_random_100k.bracket
		```
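
The sorting criterion is simple enough to show in Python as well. This is a rough equivalent of ``sort_dataset.sh``, assuming one tree per line and one ``{`` per node; the names ``tree_size`` and ``sort_by_size`` are hypothetical. Trees of equal size keep their input order here, which may differ from the order ``sort -n`` produces for ties.

```python
def tree_size(line):
    # Same measure as the shell script: one "{" per node.
    return line.count("{")


def sort_by_size(lines):
    """Return the trees ordered by ascending node count (stable sort)."""
    return sorted(lines, key=tree_size)


if __name__ == "__main__":
    for tree in sort_by_size(["{a{b}{c}}", "{a}", "{a{b}}"]):
        print(tree)
```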

0 → 100755

+24 −0

		#!/bin/bash
		# file: download_prepare.sh
		#
		# Program: Downloads and prepares data containing the swissprot dataset.
		#
		# Author: Thomas Huetter

		# create target folder (if missing) and change into it; stop if that fails
		mkdir -p data
		cd data || exit 1

		# download the data files
		wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz

		# gunzip the data to uniprot_sprot.xml
		gunzip uniprot_sprot.xml.gz

		# gunzip replaces the archive with uniprot_sprot.xml; remove any leftover gzip file
		rm -f uniprot_sprot.xml.gz

		# next steps can be seen in README.md
		# convert file to bracket notation

		# select random lines with random_lines.py and sort them with sort_dataset.sh
		No newline at end of file

0 → 100644

+13 −0

		import random
		import sys

		number = int(sys.argv[1])
		filename = sys.argv[2]

		# read all lines into memory
		with open(filename) as f:
		    lines = f.readlines()

		# pick `number` distinct line indices uniformly at random
		linestoprint = random.sample(range(len(lines)), number)

		for ln in linestoprint:
		    print(lines[ln], end='')

0 → 100755

+4 −0

		#!/bin/bash

		# sort by number of nodes (equivalent to number of "{")
		cat $1 | awk '{print gsub("{","{"), $0}' | sort -n | cut -d' ' -f2- > "${1%.bracket}_sorted.bracket"