Mateusz Pawlik / ted-datasets
Commit 4ec3f4f7, authored Oct 23, 2018 by Mateusz Pawlik
Working on swissprot.
Parent: 167f0f65
Changed files: 3
swissprot/README.md
...
...
@@ -25,25 +25,19 @@ https://www.uniprot.org/downloads
- **wget**
- **gzip**
## RAM requirements
To be measured.
## Steps
**It downloads the current version of the data.**
Execute to download all necessary files.
Execute the following to download and prepare the dataset.
```bash
./download_prepare.sh
```
Execute to convert the raw data file into bracket notation.
**Takes some time.**
```bash
python swissprot_to_bracket.py
```
Execute to sort the dataset by tree size.
```bash
./sort_dataset.sh swissprot.bracket
```
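To illustrate what sorting by tree size could look like, here is a minimal Python sketch. It is not the `sort_dataset.sh` script, whose contents are not shown in this commit. It assumes one tree per line, that tree size means the number of nodes, and that each node opens with an unescaped `{`. The function names `tree_size` and `sort_by_tree_size` are hypothetical.

```python
def tree_size(line):
    """Count nodes in one bracket-notation tree.

    Assumption: every node starts with an unescaped '{', and a backslash
    escapes the character that follows it.
    """
    size = 0
    escaped = False
    for ch in line:
        if escaped:
            escaped = False      # skip the escaped character
        elif ch == "\\":
            escaped = True       # next character is part of a label
        elif ch == "{":
            size += 1            # an unescaped '{' opens a new node
    return size


def sort_by_tree_size(lines):
    """Return the non-empty trees ordered from smallest to largest."""
    trees = [ln.strip() for ln in lines if ln.strip()]
    # sorted() is stable, so trees of equal size keep their input order.
    return sorted(trees, key=tree_size)


if __name__ == "__main__":
    for tree in sort_by_tree_size(["{a{b}{c}}", "{a}", "{a{b}}"]):
        print(tree_size(tree), tree)
```

Counting opening braces avoids parsing the tree, so each line is processed in a single pass.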
## Estimated time
To be measured.
swissprot/download_prepare.sh
#!/bin/bash
# file: prepare_data.sh
# The MIT License (MIT)
# Copyright (c) 2017 Thomas Hütter, Mateusz Pawlik.
#
# Program: Downloads and prepares data containing the swissprot dataset.
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# Author: Thomas Huetter
# create target folder and change into it
mkdir data
cd data
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#
# download the data files
# Download the data files.
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz
# gunzip the data to uniprot_sprot.xml
# Extract the dataset.
gunzip uniprot_sprot.xml.gz
# remove gzip file
rm -rf uniprot_sprot.xml.gz
# Convert XML to bracket notation.
./swissprot_to_bracket.py
# next steps can be seen in README.md
# convert file to bracket notation
# Sort the dataset.
./../utilities/sort_dataset.sh swissprot.bracket > swissprot_sorted.bracket
# select random line with random_lines.py and sort them with sort_dataset.sh
\ No newline at end of file
# Tidy up.
rm *xml*
rm swissprot.bracket
swissprot/swissprot_to_bracket.py
100644 → 100755
#!/usr/bin/env python3
# The MIT License (MIT)
# Copyright (c) 2017 Thomas Hütter, Mateusz Pawlik.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
from lxml import etree
import lxml.sax
from xml.sax.handler import ContentHandler
# This script converts DBLP from XML to bracket notation.
# NOTE: Filenames are hardcoded in this script.
# This class implements the sax-like events for converting XML elemnts into
# bracket-notation nodes and labels.
class SwissprotContentHandler(ContentHandler):
...
...
@@ -53,3 +80,4 @@ for child in root:
swissprot_bracket.write(handler.bn + "\n")
print("--- Closing output files.")
swissprot_bracket.close()
\ No newline at end of file
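The body of `SwissprotContentHandler` is elided in this diff, so the following is only a sketch of how a SAX-style handler could turn XML elements into bracket notation. It uses the standard-library `xml.sax` parser rather than `lxml.sax` so that it runs without extra dependencies. The class name `BracketHandler`, the helper `xml_to_bracket`, the backslash escaping rule, and the choice to ignore attributes are all assumptions. Only the attribute name `bn` is taken from the diff.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class BracketHandler(ContentHandler):
    """Hypothetical handler that builds a bracket-notation string in `bn`."""

    def __init__(self):
        super().__init__()
        self.bn = ""       # resulting bracket notation, as in `handler.bn`
        self._text = []    # buffered character data between tags

    @staticmethod
    def escape(label):
        # Assumption: braces and backslashes in labels are backslash-escaped.
        return (label.replace("\\", "\\\\")
                     .replace("{", "\\{")
                     .replace("}", "\\}"))

    def _flush_text(self):
        # characters() may deliver one text run in several chunks, so the
        # chunks are joined before a text node is emitted.
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self.bn += "{" + self.escape(text) + "}"

    def startElement(self, name, attrs):
        self._flush_text()
        self.bn += "{" + self.escape(name)   # open a node labelled by the tag

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        self._flush_text()
        self.bn += "}"                       # close the node


def xml_to_bracket(xml_text):
    """Convert one XML document (a str) into a bracket-notation string."""
    handler = BracketHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.bn


if __name__ == "__main__":
    print(xml_to_bracket("<entry><name>P12345</name><gene>abc</gene></entry>"))
```

Each element becomes a node labelled with its tag name, and non-empty text becomes a leaf child, so every line written to the output file holds one tree.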