Commit 4ec3f4f7 authored by Mateusz Pawlik

Working on swissprot.

parent 167f0f65
+5 −11
@@ -25,25 +25,19 @@ https://www.uniprot.org/downloads
- **wget**
- **gzip**

## RAM requirements

To be measured.

## Steps

**The script downloads the current version of the data.**

Execute to download all necessary files.
Execute the following to download and prepare the dataset.
```bash
./download_prepare.sh
```

Execute to convert the raw data file into bracket notation. **Takes some time.**
```bash
python swissprot_to_bracket.py
```

Execute to sort the dataset by tree size.
```bash
./sort_dataset.sh swissprot.bracket
```
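
To illustrate what sorting by tree size means, here is a minimal Python sketch. It assumes that each line holds one tree in bracket notation, that tree size is the number of nodes, and that braces inside labels are escaped with a backslash. The names `tree_size` and `sort_by_tree_size` are hypothetical; the actual behavior of `sort_dataset.sh` is not shown in this commit.

```python
def tree_size(bracket_line: str) -> int:
    """Count the nodes of one tree: every unescaped '{' opens one node.

    Assumption: a backslash escapes the character that follows it.
    """
    size = 0
    escaped = False
    for ch in bracket_line:
        if escaped:
            escaped = False      # skip the escaped character
        elif ch == "\\":
            escaped = True       # next character is part of a label
        elif ch == "{":
            size += 1
    return size


def sort_by_tree_size(lines):
    """Return the trees ordered from smallest to largest (stable sort)."""
    return sorted(lines, key=tree_size)


trees = ["{a{b}{c}}", "{x}", "{p{q}}"]
sorted_trees = sort_by_tree_size(trees)
print(sorted_trees)
```

Because `sorted` is stable, trees of equal size keep their original relative order.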

## Estimated time

To be measured.
+29 −15
#!/bin/bash
# file: prepare_data.sh

# The MIT License (MIT)
# Copyright (c) 2017 Thomas Hütter, Mateusz Pawlik.
# 
# Program: Downloads and prepares data containing the swissprot dataset.
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# Author: Thomas Huetter

# create target folder and change into it
mkdir data
cd data
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

# download the data files
# Download the data files.
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz

# gunzip the data to uniprot_sprot.xml
# Extract the dataset.
gunzip uniprot_sprot.xml.gz

# remove gzip file
rm -rf uniprot_sprot.xml.gz 
# Convert XML to bracket notation.
./swissprot_to_bracket.py

# next steps can be seen in README.md
# convert file to bracket notation
# Sort the dataset.
./../utilities/sort_dataset.sh swissprot.bracket > swissprot_sorted.bracket

# select random line with random_lines.py and sort them with sort_dataset.sh
 No newline at end of file
# Tidy up.
rm *xml*
rm swissprot.bracket
+29 −1
#!/usr/bin/env python3

# The MIT License (MIT)
# Copyright (c) 2017 Thomas Hütter, Mateusz Pawlik.
# 
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

from lxml import etree
import lxml.sax
from xml.sax.handler import ContentHandler

# This script converts Swissprot from XML to bracket notation.

# NOTE: Filenames are hardcoded in this script.

# This class implements the SAX-like events for converting XML elements into
# bracket-notation nodes and labels.
class SwissprotContentHandler(ContentHandler):
@@ -53,3 +80,4 @@ for child in root:
    swissprot_bracket.write(handler.bn + "\n")

print("--- Closing output files.")
swissprot_bracket.close()
 No newline at end of file
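
The body of `SwissprotContentHandler` is not visible in this diff. As a rough illustration of the technique, the following sketch converts a small XML string into bracket notation with SAX events. It is not the author's implementation: it uses the standard-library `xml.sax` parser instead of `lxml.sax`, ignores attributes, and assumes that braces and backslashes in labels are escaped with a backslash. The names `BracketHandler` and `escape_label` are hypothetical; only the attribute name `bn` is taken from the script above.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def escape_label(label: str) -> str:
    """Escape characters that have a structural meaning (assumed convention)."""
    return label.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


class BracketHandler(ContentHandler):
    """Builds a bracket-notation string in `self.bn` from SAX events."""

    def __init__(self):
        super().__init__()
        self.bn = ""
        self._text = []  # buffers text, since characters() may fire in chunks

    def _flush_text(self):
        # Emit buffered text as a leaf node, skipping whitespace-only text.
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self.bn += "{" + escape_label(text) + "}"

    def startElement(self, name, attrs):
        self._flush_text()
        self.bn += "{" + escape_label(name)  # open a node labeled with the tag

    def endElement(self, name):
        self._flush_text()
        self.bn += "}"  # close the node

    def characters(self, content):
        self._text.append(content)


handler = BracketHandler()
xml.sax.parseString(
    b"<entry><name>P12345</name><gene>abc</gene></entry>", handler
)
print(handler.bn)  # {entry{name{P12345}}{gene{abc}}}
```

Each element becomes a node labeled with its tag name, and non-empty text becomes a child leaf, so one XML subtree maps to one bracket-notation line.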