Commit 0b72ba08 authored by Thomas Huetter

swissprot: added scripts to get swissprot dataset

parent 4e0ee533
......@@ -16,7 +16,7 @@ wget
unzip -d original_data
# remove zip file
rm -rf unzip
rm -rf
# change to unzipped folder
cd original_data
# Swissprot
## Short description
Annotated and non-redundant protein sequence database. It can be downloaded as a single large XML file.
## Conversion details
- Each tag is a node with the tag's name as its label.
- Tag nesting is converted to parent-child relationships.
- A tag's *content* becomes a child node of the tag, with the *content* as its label.
- Each attribute, a (*key*, *value*) pair, becomes two nodes (parent and child) labeled *key* and *value*, respectively.
- Attribute-nodes are children of the corresponding tag-node.
- Attribute-nodes are ordered by their key values.
- Attribute-nodes come before content-node.
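
The rules above can be illustrated with a minimal sketch. It is not the conversion script from this commit: it uses the standard-library `xml.etree.ElementTree` instead of the lxml SAX handler, the function name `to_bracket` and the sample XML are hypothetical, and it makes simplifying assumptions (text that follows a child element is ignored, namespaced attribute keys are not shortened, and braces inside labels are not escaped).

```python
import xml.etree.ElementTree as ET


def to_bracket(element):
    """Convert one XML element into bracket notation (illustrative sketch)."""
    # The tag name becomes the node label; drop a "{uri}" namespace prefix.
    label = element.tag.split("}", 1)[-1]
    parts = ["{" + label]
    # Attribute nodes: ordered by key, key as parent and value as child.
    for key, value in sorted(element.attrib.items()):
        parts.append("{" + key + "{" + value + "}}")
    # The content node comes after the attribute nodes.
    text = (element.text or "").strip()
    if text:
        parts.append("{" + text + "}")
    # Tag nesting becomes parent-child relationships.
    for child in element:
        parts.append(to_bracket(child))
    parts.append("}")
    return "".join(parts)


# Hypothetical input, not taken from the Swissprot data.
sample = '<entry version="2" dataset="Swiss-Prot"><name>TEST</name></entry>'
print(to_bracket(ET.fromstring(sample)))
# {entry{dataset{Swiss-Prot}}{version{2}}{name{TEST}}}
```

Note how `dataset` precedes `version` in the output even though it comes second in the input, because attribute nodes are ordered by key.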
## Dependencies
- **lxml** (XML processor for Python)
- **Python3**
- **wget**
- **gzip**
## Steps
**For repeatability, it downloads a specific version of the data (hardcoded in ```` and ````).**
Execute to download all necessary files.
Execute to convert the raw data file into bracket notation. **Takes some time.**
Sample the dataset. **We perform a join on a subset only.**
python 100000 swissprot.bracket > swissprot_random_100k.bracket
Execute to sort the dataset by tree size.
./ swissprot_random_100k.bracket
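
As a rough illustration of the sorting step, the sketch below orders bracket-notation trees by size in Python. It assumes, as the sorting script in this commit does, that the number of nodes equals the number of `{` characters on a line. The function names `tree_size` and `sort_by_tree_size` are hypothetical, and trees of equal size may come out in a different order than with the shell pipeline.

```python
def tree_size(line):
    """Number of nodes in a bracket-notation tree: one per opening brace."""
    return line.count("{")


def sort_by_tree_size(lines):
    """Return the trees ordered from fewest to most nodes."""
    return sorted(lines, key=tree_size)


# Hypothetical trees, not taken from the Swissprot data.
trees = ["{a{b}{c}}", "{a}", "{a{b}}"]
print(sort_by_tree_size(trees))
# ['{a}', '{a{b}}', '{a{b}{c}}']
```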
# file:
# Program: Downloads and prepares data containing the swissprot dataset.
# Author: Thomas Huetter
# create target folder and change into it
mkdir data
cd data
# download the data files
# gunzip the data to uniprot_sprot.xml
gunzip uniprot_sprot.xml.gz
# remove gzip file
rm -rf uniprot_sprot.xml.gz
# next steps can be seen in
# convert file to bracket notation
# select random line with and sort them with
\ No newline at end of file
import random
import sys
number = int(sys.argv[1])
filename = sys.argv[2]
# read all trees (one per line)
with open(filename) as f:
    lines = f.readlines()
# pick the requested number of distinct line indices at random
linestoprint = random.sample(range(len(lines)), number)
for ln in linestoprint:
    print(lines[ln], end='')
# prefix each tree with its node count | sort by number of nodes (equivalent to number of "{") | strip the count
cat "$1" | awk '{print gsub("{","{"), $0}' | sort -n | cut -d' ' -f2- > "${1%.bracket}_sorted.bracket"
from lxml import etree
import lxml.sax
from xml.sax.handler import ContentHandler

# This class implements the SAX-like events for converting XML elements into
# bracket-notation nodes and labels.
class SwissprotContentHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self.bracket = ""

    # Open tag.
    def startElementNS(self, name, qname, attributes):
        uri, localname = name
        self.bracket += "{" + localname
        d = dict(attributes)
        # Sort the attributes by their keys.
        for key, value in sorted(d.items(), key=lambda element: element[0][1]):
            self.bracket += "{" + key[1] + "{" + value + "}}"

    # Close tag.
    def endElementNS(self, name, qname):
        self.bracket += "}"

    # Tag content.
    def characters(self, data):
        self.bracket += "{" + data + "}"

print("--- Loading Swissprot dataset.")
swissprot_parser = etree.XMLParser(load_dtd=False, remove_blank_text=True)
swissprot_data_tree = etree.parse('uniprot_sprot.xml', swissprot_parser)
root = swissprot_data_tree.getroot()

# Output file.
swissprot_bracket = open('swissprot.bracket', 'w')

print("--- Processing each child of Swissprot's root.")
tree_id = 0
for child in root:
    tree_id += 1
    # Printing simple progress.
    if tree_id % 100000 == 0:
        print("- Tree %s" % (tree_id))
    handler = SwissprotContentHandler()
    lxml.sax.saxify(child, handler)
    swissprot_bracket.write(handler.bracket + "\n")

print("--- Closing output files.")
swissprot_bracket.close()
\ No newline at end of file