swissprot/README.md +5 −11

@@ -25,25 +25,19 @@ https://www.uniprot.org/downloads
- **wget**
- **gzip**

## RAM requirements

To be measured.

## Steps

**It downloads the current version of the data.**

Execute to download all necessary files.
Execute the following to download and prepare the dataset.

```bash
./download_prepare.sh
```

Execute to convert the raw data file into bracket notation. **Takes some time.**

```bash
python swissprot_to_bracket.py
```

Execute to sort the dataset by tree size.

```bash
./sort_dataset.sh swissprot.bracket
```

## Estimated time

To be measured.

swissprot/download_prepare.sh +29 −15

#!/bin/bash
# file: prepare_data.sh
# The MIT License (MIT)
# Copyright (c) 2017 Thomas Hütter, Mateusz Pawlik.
#
# Program: Downloads and prepares data containing the swissprot dataset.
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# Author: Thomas Huetter
# create target folder and change into it
mkdir data
cd data
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# download the data files
# Download the data files.
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz
# gunzip the data to uniprot_sprot.xml
# Extract the dataset.
gunzip uniprot_sprot.xml.gz
# remove gzip file
rm -rf uniprot_sprot.xml.gz
# Convert XML to bracket notation.
./swissprot_to_bracket.py
# next steps can be seen in README.md
# convert file to bracket notation
# Sort the dataset.
./../utilities/sort_dataset.sh swissprot.bracket > swissprot_sorted.bracket
# select random line with random_lines.py and sort them with sort_dataset.sh
\ No newline at end of file
# Tidy up.
rm *xml*
rm swissprot.bracket

swissprot/swissprot_to_bracket.py 100644 → 100755 +29 −1

#!/usr/bin/env python3
# The MIT License (MIT)
# Copyright (c) 2017 Thomas Hütter, Mateusz Pawlik.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
from lxml import etree
import lxml.sax
from xml.sax.handler import ContentHandler
# This script converts Swissprot from XML to bracket notation.
# NOTE: Filenames are hardcoded in this script.
# This class implements the SAX-like events for converting XML elements into
# bracket-notation nodes and labels.
class SwissprotContentHandler(ContentHandler):

@@ -53,3 +80,4 @@ for child in root:
    swissprot_bracket.write(handler.bn + "\n")
print("--- Closing output files.")
swissprot_bracket.close()
\ No newline at end of file
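The body of `SwissprotContentHandler` is collapsed in the diff, so the following is only a minimal sketch of how SAX-like events can be turned into a bracket-notation string. It is not the script's actual implementation. It uses the standard-library `xml.sax` parser instead of `lxml.sax` to stay self-contained, and the names `BracketHandler`, `escape`, and `xml_to_bracket` are hypothetical. The nesting of attributes as `{name{value}}` children and the backslash escaping of braces are assumptions as well; only the attribute name `bn` is taken from the diff.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def escape(label):
    # Assumption: backslashes and braces inside labels are backslash-escaped.
    return label.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


class BracketHandler(ContentHandler):
    """Hypothetical handler that builds one bracket-notation string in `bn`."""

    def __init__(self):
        super().__init__()
        self.bn = ""
        self._text = []  # character data may arrive in several chunks

    def _flush_text(self):
        # Emit buffered text as a leaf node, ignoring pure whitespace.
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self.bn += "{" + escape(text) + "}"

    def startElement(self, name, attrs):
        self._flush_text()
        self.bn += "{" + escape(name)  # open a node labeled with the tag
        for key in attrs.getNames():
            # Assumption: each attribute becomes a child {name{value}}.
            self.bn += "{" + escape(key) + "{" + escape(attrs.getValue(key)) + "}}"

    def endElement(self, name):
        self._flush_text()
        self.bn += "}"  # close the node opened in startElement

    def characters(self, content):
        self._text.append(content)


def xml_to_bracket(xml_text):
    handler = BracketHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.bn


if __name__ == "__main__":
    print(xml_to_bracket('<entry id="1"><name>P1</name><seq/></entry>'))
```

Buffering character data until the next element boundary avoids emitting several leaf nodes when the parser delivers one text run in multiple `characters` calls.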
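The diff calls `sort_dataset.sh` to sort the dataset by tree size, but it does not show that script. The sketch below illustrates one way such a sort could work, under stated assumptions: each line holds one tree in bracket notation, tree size means the number of nodes, every unescaped `{` opens exactly one node, and a backslash escapes the following character. The function names `tree_size` and `sort_by_size` are hypothetical.

```python
def tree_size(bracket):
    """Count nodes in a bracket-notation tree (one per unescaped '{')."""
    size = 0
    i = 0
    while i < len(bracket):
        ch = bracket[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            size += 1
        i += 1
    return size


def sort_by_size(lines):
    # sorted() is stable, so trees of equal size keep their input order.
    return sorted(lines, key=tree_size)


if __name__ == "__main__":
    trees = ["{a{b}{c}}", "{x}", "{p{q}}"]
    for tree in sort_by_size(trees):
        print(tree_size(tree), tree)
```

For a file as large as the full Swiss-Prot dataset, computing the size once per line and sorting `(size, line)` pairs, or using an external sort, may be preferable to holding everything in memory.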