Commit 0b72ba08 authored by Thomas Huetter

swissprot: added scripts to get swissprot dataset

parent 4e0ee533
@@ -16,7 +16,7 @@ wget https://dbresearch.uni-salzburg.at/projects/pq-gram-ordered-labeled-trees/b
 unzip bolzano-address-trees.zip -d original_data
 # remove zip file
-rm -rf unzip bolzano-address-trees.zip
+rm -rf bolzano-address-trees.zip
 # change to unzipped folder
 cd original_data
# Swissprot
## Short description
An annotated, non-redundant protein sequence database. It can be downloaded as a single large XML file.
## Conversion details
- Each tag is a node with the tag's name as its label.
- Tag nesting is converted to parent-child relationships.
- A tag's *content* is a child node of the tag with *content* as its label.
- Each attribute, a (*key*, *value*) pair, becomes two nodes (parent and child) labeled *key* and *value*, respectively.
- Attribute-nodes are children of the corresponding tag-node.
- Attribute-nodes are ordered by their keys.
- Attribute-nodes come before the content-node.
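
The rules above can be illustrated on a tiny input. The following is a minimal sketch, not the conversion script itself: it uses the standard library's `xml.etree.ElementTree` instead of lxml, the helper name `to_bracket` is hypothetical, and the sample XML values are made up. It assumes tags without namespaces and ignores text that follows a closing tag.

```python
import xml.etree.ElementTree as ET

def to_bracket(elem):
    """Return the bracket notation of `elem` (hypothetical helper)."""
    # The tag becomes a node labeled with the tag's name.
    out = "{" + elem.tag
    # Each attribute becomes a key node with a value child, ordered by key.
    for key, value in sorted(elem.attrib.items()):
        out += "{" + key + "{" + value + "}}"
    # The content node comes after the attribute nodes.
    text = (elem.text or "").strip()
    if text:
        out += "{" + text + "}"
    # Nested tags become children of the current node.
    for child in elem:
        out += to_bracket(child)
    return out + "}"

# Made-up sample input, for illustration only.
sample = '<entry dataset="Swiss-Prot" created="2000-01-01"><name>EXAMPLE_ENTRY</name></entry>'
print(to_bracket(ET.fromstring(sample)))
# {entry{created{2000-01-01}}{dataset{Swiss-Prot}}{name{EXAMPLE_ENTRY}}}
```

Note that `created` precedes `dataset` in the output although it comes second in the input, because attribute-nodes are ordered by key.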
## Dependencies
- **lxml** (XML processor for Python)
http://lxml.de/installation.html
- **Python3**
https://www.python.org/downloads/
- **wget**
- **gzip**
## Steps
**The data source is hardcoded: the download URL in ``download_prepare.sh`` and the input file name in ``swissprot_to_bracket.py``. The URL points to UniProt's ``current_release``, so the downloaded data changes with each release.**
Execute to download all necessary files.
```bash
./download_prepare.sh
```
Execute to convert the raw data file into bracket notation. **Takes some time.**
```bash
python swissprot_to_bracket.py
```
Sample the dataset. **We perform a join on a subset only.**
```bash
python random_lines.py 100000 swissprot.bracket > swissprot_random_100k.bracket
```
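
The sampling step draws a different subset on every run, because ``random_lines.py`` does not fix a seed. If a repeatable sample is wanted, one option is a seeded generator. This is a minimal sketch under that assumption; the function `sample_lines` and its `seed` parameter are hypothetical and not part of the scripts in this commit.

```python
import random

def sample_lines(lines, number, seed=None):
    """Return `number` distinct lines; a fixed seed gives the same sample each run."""
    # A private generator keeps the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(lines, number)

# Made-up bracket-notation lines, for illustration only.
demo = ["{a}\n", "{a{b}}\n", "{a{b}{c}}\n", "{a{b{c}}}\n"]
first = sample_lines(demo, 2, seed=42)
second = sample_lines(demo, 2, seed=42)
print(first == second)  # True: same seed, same sample
```

As in ``random_lines.py``, asking for more lines than the file contains raises a `ValueError`.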
Execute to sort the dataset by tree size.
```bash
./sort_dataset.sh swissprot_random_100k.bracket
```
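
Tree size here means the number of nodes, which in bracket notation equals the number of `{` characters on a line. The sketch below shows the same idea in Python; the function names are hypothetical, and the order of trees with equal size may differ from the output of ``sort_dataset.sh``.

```python
def tree_size(bracket_line):
    """Number of nodes in a tree, counted as opening brackets."""
    return bracket_line.count("{")

def sort_by_tree_size(lines):
    """Return the lines ordered by ascending tree size (stable for ties)."""
    return sorted(lines, key=tree_size)

# Made-up trees with 3, 1 and 2 nodes, for illustration only.
trees = ["{a{b}{c}}", "{a}", "{a{b}}"]
print(sort_by_tree_size(trees))
# ['{a}', '{a{b}}', '{a{b}{c}}']
```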
#!/bin/bash
# file: prepare_data.sh
#
# Program: Downloads and prepares data containing the swissprot dataset.
#
# Author: Thomas Huetter
# create target folder (if missing) and change into it
mkdir -p data
cd data || exit 1
# download the data files
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz
# gunzip the data to uniprot_sprot.xml (gunzip removes the .gz file on success)
gunzip uniprot_sprot.xml.gz
# remove gzip file if it is still present
rm -f uniprot_sprot.xml.gz
# next steps can be seen in README.md
# convert file to bracket notation
# select random line with random_lines.py and sort them with sort_dataset.sh
\ No newline at end of file
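import random
import sys

# Usage: python random_lines.py <number> <filename>
number = int(sys.argv[1])
filename = sys.argv[2]

# Read all lines into memory.
with open(filename) as f:
    lines = f.readlines()

# Pick `number` distinct line indices; fails if the file has fewer lines.
linestoprint = random.sample(range(len(lines)), number)
for ln in linestoprint:
    print(lines[ln], end='')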
#!/bin/bash
# sort by number of nodes (equivalent to number of "{")
awk '{print gsub("{","{"), $0}' "$1" | sort -n | cut -d' ' -f2- > "${1%.bracket}_sorted.bracket"
from lxml import etree
import lxml.sax
from xml.sax.handler import ContentHandler

# This class implements the SAX-like events for converting XML elements into
# bracket-notation nodes and labels.
class SwissprotContentHandler(ContentHandler):
    def __init__(self):
        self.bn = ""

    # Open tag.
    def startElementNS(self, name, qname, attributes):
        uri, localname = name
        self.bn += "{" + localname
        d = dict(attributes)
        # Sort the attributes by their keys.
        for key, value in sorted(d.items(), key = lambda element : element[0][1]):
            self.bn += "{" + key[1] + "{" + value + "}}"

    # Close tag.
    def endElementNS(self, name, qname):
        self.bn += "}"

    # Tag content.
    def characters(self, data):
        self.bn += "{" + data + "}"

print("--- Loading Swissprot dataset.")
swissprot_parser = etree.XMLParser(load_dtd=False, remove_blank_text=True)
swissprot_data_tree = etree.parse('uniprot_sprot.xml', swissprot_parser)
root = swissprot_data_tree.getroot()

# Output files.
swissprot_bracket = open('swissprot.bracket', 'w')

print("--- Processing each child of Swissprot's root.")
tree_id = 0
for child in root:
    tree_id += 1
    # Printing simple progress.
    if tree_id % 100000 == 0:
        print("- Tree %s" % (tree_id))
    handler = SwissprotContentHandler()
    lxml.sax.saxify(child, handler)
    swissprot_bracket.write(handler.bn + "\n")

print("--- Closing output files.")
swissprot_bracket.close()
\ No newline at end of file