Commit 0b72ba08 authored by Thomas Huetter

swissprot: added scripts to get swissprot dataset

parent 4e0ee533
+1 −1
@@ -16,7 +16,7 @@ wget https://dbresearch.uni-salzburg.at/projects/pq-gram-ordered-labeled-trees/b
unzip bolzano-address-trees.zip -d original_data

# remove zip file
-rm -rf unzip bolzano-address-trees.zip
+rm -rf bolzano-address-trees.zip

# change to unzipped folder
cd original_data

swissprot/README.md

0 → 100644
+48 −0
# Swissprot

## Short description

An annotated, non-redundant protein sequence database. It can be downloaded as a single large XML file.

## Conversion details

- Each tag is a node with the tag's name as its label.
- Tag nesting is converted to parent-child relationships.
- A tag's *content* becomes a child node of the tag, with the *content* as its label.
- Each attribute, a (*key*, *value*) pair, becomes two nodes (parent and child) labeled *key* and *value*, respectively.
- Attribute-nodes are children of the corresponding tag-node.
- Attribute-nodes are ordered by key.
- Attribute-nodes come before the content-node.
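
The rules above can be illustrated with a minimal sketch. It is not the conversion script from this repository: it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies, and the names `escape` and `to_bracket` are hypothetical. It also assumes that braces inside labels are backslash-escaped, that nested tags come after the content-node, and that XML namespaces and tail text can be ignored.

```python
import xml.etree.ElementTree as ET


def escape(label: str) -> str:
    # Assumption: braces (and backslashes) in labels are backslash-escaped
    # so that they are not mistaken for tree structure.
    return label.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def to_bracket(elem: ET.Element) -> str:
    # The tag itself is a node labeled with the tag's name.
    parts = ["{", escape(elem.tag)]
    # Attribute-nodes come first, ordered by key; the value is the key's child.
    for key in sorted(elem.attrib):
        parts.append("{" + escape(key) + "{" + escape(elem.attrib[key]) + "}}")
    # The content-node follows the attribute-nodes (whitespace-only text is skipped).
    text = (elem.text or "").strip()
    if text:
        parts.append("{" + escape(text) + "}")
    # Nested tags become child nodes (placed after the content-node: an assumption).
    for child in elem:
        parts.append(to_bracket(child))
    parts.append("}")
    return "".join(parts)


xml = '<entry version="3" dataset="Swiss-Prot"><name>TEST</name></entry>'
print(to_bracket(ET.fromstring(xml)))
# prints {entry{dataset{Swiss-Prot}}{version{3}}{name{TEST}}}
```

Note that `ElementTree` reports namespaced tags as `{namespace}tag`, so a real conversion would likely need to strip or otherwise handle the namespace prefix.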

## Dependencies

- **lxml** (XML processor for Python)
  http://lxml.de/installation.html
- **Python3**
  https://www.python.org/downloads/
- **wget**
- **gzip**

## Steps

**For repeatability, the scripts download a specific version of the data (hardcoded in ``download_prepare.sh`` and ``swissprot_to_bracket.py``).**

Execute to download all necessary files.
```bash
./download_prepare.sh
```

Execute to convert the raw data file into bracket notation. **Takes some time.**
```bash
python swissprot_to_bracket.py
```

Sample the dataset. **We perform a join on a subset only.**
```bash
python random_lines.py 100000 swissprot.bracket > swissprot_random_100k.bracket
```
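
If the bracket file is too large to hold in memory, reservoir sampling is one possible alternative to reading all lines first. The sketch below is a different technique from the one used in `random_lines.py`, and the function name `reservoir_sample` is hypothetical. It keeps at most `k` lines in memory and returns a uniform random sample of the input.

```python
import random


def reservoir_sample(lines, k, rng=random):
    """Return k items drawn uniformly at random from an iterable (Algorithm R)."""
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample


# Small demonstration on an in-memory list; a file object works the same way.
demo = ["{a}\n", "{a{b}}\n", "{a{b}{c}}\n", "{x}\n"]
for line in reservoir_sample(demo, 2):
    print(line, end="")
```

If the input has fewer than `k` lines, the sketch returns all of them instead of raising an error.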

Execute to sort the dataset by tree size.
```bash
./sort_dataset.sh swissprot_random_100k.bracket
```
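
For illustration, the same ordering can be sketched in Python. This is not part of the repository; `sort_by_tree_size` is a hypothetical name, and the sketch assumes that every `{` in a line opens a node (no braces inside labels).

```python
def sort_by_tree_size(lines):
    # In bracket notation each node opens with "{", so counting "{"
    # gives the number of nodes in the tree.
    return sorted(lines, key=lambda line: line.count("{"))


trees = ["{a{b}{c}}", "{a}", "{a{b}}"]
print(sort_by_tree_size(trees))
# prints ['{a}', '{a{b}}', '{a{b}{c}}']
```

Python's `sorted` is stable, so trees of equal size keep their input order; the order of equal-sized trees may therefore differ from the output of `sort -n`.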
+24 −0
#!/bin/bash
# file: prepare_data.sh
#
# Program: Downloads and prepares data containing the swissprot dataset.
#
# Author: Thomas Huetter

# create target folder and change into it
mkdir data
cd data

# download the data files
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz

# gunzip the data to uniprot_sprot.xml
gunzip uniprot_sprot.xml.gz

# remove gzip file (gunzip normally deletes it already; -f makes this a no-op then)
rm -rf uniprot_sprot.xml.gz

# the next steps are described in README.md:
# convert the file to bracket notation,
# select random lines with random_lines.py, and sort them with sort_dataset.sh
 No newline at end of file
+13 −0
import random
import sys

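# number of lines to sample and the input file, taken from the command line
number = int(sys.argv[1])
filename = sys.argv[2]

# read the whole file into memory
with open(filename) as f:
    lines = f.readlines()

# draw `number` distinct line indices (raises ValueError if the file has fewer lines)
linestoprint = random.sample(range(len(lines)), number)

# print the sampled lines in the sampled (random) order
for ln in linestoprint:
    print(lines[ln], end='')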
+4 −0
#!/bin/bash

# sort by number of nodes (equivalent to the number of "{")
awk '{print gsub("{","{"), $0}' "$1" | sort -n | cut -d' ' -f2- > "${1%.bracket}_sorted.bracket"