Commit 4ec3f4f7 authored by Mateusz Pawlik

Working on swissprot.

parent 167f0f65
@@ -25,25 +25,19 @@ https://www.uniprot.org/downloads
- **wget**
- **gzip**
## RAM requirements
To be measured.
## Steps
**It downloads the current version of the data.**
Execute to download all necessary files.
Execute the following to download and prepare the dataset.
```bash
./download_prepare.sh
```
Execute to convert the raw data file into bracket notation. **Takes some time.**
```bash
python swissprot_to_bracket.py
```
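To illustrate what the conversion step produces, here is a minimal sketch that turns a small XML string into bracket notation, where each node is written as `{label{child}...}`. It uses the standard library's `xml.sax` instead of `lxml`. The class name `BracketHandler`, the treatment of text content as leaf nodes, and the backslash escaping of braces are assumptions made for illustration; the actual `swissprot_to_bracket.py` may handle attributes and text differently.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def escape(label):
    # Assumption: braces and backslashes inside labels are backslash-escaped.
    return label.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


class BracketHandler(ContentHandler):
    """Hypothetical handler: builds a bracket-notation string from SAX events."""

    def __init__(self):
        super().__init__()
        self.bn = ""

    def startElement(self, name, attrs):
        # Open a node and write the element name as its label.
        self.bn += "{" + escape(name)

    def characters(self, content):
        # Assumption: non-whitespace text becomes a leaf child node.
        text = content.strip()
        if text:
            self.bn += "{" + escape(text) + "}"

    def endElement(self, name):
        # Close the node opened in startElement.
        self.bn += "}"


def xml_to_bracket(xml_text):
    handler = BracketHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.bn


if __name__ == "__main__":
    print(xml_to_bracket("<entry><name>P12345</name><gene/></entry>"))
```

Each XML element becomes one nested pair of braces, so a whole entry ends up on a single line of the output file.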
Execute to sort the dataset by tree size.
```bash
./sort_dataset.sh swissprot.bracket
```
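The sorting step can be sketched in Python as follows. `sort_dataset.sh` is not shown in this commit, so this is only an approximation under two assumptions: tree size means the number of nodes, and a node is introduced by every opening brace that is not backslash-escaped. The function names `tree_size` and `sort_by_size` are hypothetical.

```python
def tree_size(line):
    # Count unescaped opening braces; each one starts a node.
    size = 0
    escaped = False
    for ch in line:
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "{":
            size += 1
    return size


def sort_by_size(lines):
    # sorted() is stable, so trees of equal size keep their input order.
    return sorted(lines, key=tree_size)


if __name__ == "__main__":
    trees = ["{a{b}{c}}", "{a}", "{a{b}}"]
    for tree in sort_by_size(trees):
        print(tree)
```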
## Estimated time
To be measured.
#!/bin/bash
# file: prepare_data.sh
#
# Program: Downloads and prepares data containing the swissprot dataset.
#
# Author: Thomas Huetter
# create target folder and change into it
mkdir data
cd data
# The MIT License (MIT)
# Copyright (c) 2017 Thomas Hütter, Mateusz Pawlik.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# download the data files
# Download the data files.
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz
# gunzip the data to uniprot_sprot.xml
# Extract the dataset.
gunzip uniprot_sprot.xml.gz
# remove gzip file
rm -rf uniprot_sprot.xml.gz
# Convert XML to bracket notation.
./swissprot_to_bracket.py
# next steps can be seen in README.md
# convert file to bracket notation
# Sort the dataset.
./../utilities/sort_dataset.sh swissprot.bracket > swissprot_sorted.bracket
# select random line with random_lines.py and sort them with sort_dataset.sh
\ No newline at end of file
# Tidy up.
rm *xml*
rm swissprot.bracket
#!/usr/bin/env python3
# The MIT License (MIT)
# Copyright (c) 2017 Thomas Hütter, Mateusz Pawlik.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
from lxml import etree
import lxml.sax
from xml.sax.handler import ContentHandler
# This script converts the Swissprot dataset from XML to bracket notation.
# NOTE: Filenames are hardcoded in this script.
# This class implements the SAX-like events for converting XML elements into
# bracket-notation nodes and labels.
class SwissprotContentHandler(ContentHandler):
@@ -52,4 +79,5 @@ for child in root:
lxml.sax.saxify(child, handler)
swissprot_bracket.write(handler.bn + "\n")
print("--- Closing output files.")
\ No newline at end of file
print("--- Closing output files.")
swissprot_bracket.close()
\ No newline at end of file
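The change above adds an explicit `close()` for the output file. As a possible alternative, and assuming the same one-tree-per-line output, a `with` block closes the file even if the conversion raises an exception. The function `write_trees` and its arguments are hypothetical and not part of the script.

```python
import os
import tempfile


def write_trees(trees, path):
    # The with-block closes the file on normal exit and on exceptions.
    with open(path, "w", encoding="utf-8") as out:
        for tree in trees:
            out.write(tree + "\n")
    # Report whether the file handle was closed by the with-block.
    return out.closed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "swissprot.bracket")
        print(write_trees(["{a{b}}", "{c}"], target))
```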