# Swissprot

Swiss-Prot is an annotated, non-redundant protein sequence database. It can be downloaded as a single large XML file.

## Source

https://www.uniprot.org/downloads

## Conversion details

- Each tag becomes a node labeled with the tag's name.
- Tag nesting is converted to parent-child relationships.
- A tag's *content* becomes a child node of that tag, labeled with the *content*.
- Each attribute, a (*key*, *value*) pair, becomes two nodes (parent and child) labeled *key* and *value*, respectively.
- Attribute-nodes are children of the corresponding tag-node.
- Attribute-nodes are ordered by their keys.
- Attribute-nodes come before the content-node.
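
The rules above can be illustrated with a minimal sketch. It is not the code in `swissprot_to_bracket.py`: it uses the standard-library `xml.etree.ElementTree` instead of lxml, and it assumes a bracket notation of the form `{label{child}{child}}` with backslash-escaped braces. Stripping the XML namespace from tag names and placing the content-node before nested tags are also assumptions, since the rules do not specify them. The helper names `escape`, `local_name`, and `to_bracket` are hypothetical.

```python
import xml.etree.ElementTree as ET


def escape(label):
    # Assumption: braces delimit nodes, so braces and backslashes
    # inside a label are escaped with a backslash.
    return label.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def local_name(tag):
    # ElementTree reports namespaced tags as "{uri}name".
    # Assumption: only the local name is used as the node label.
    return tag.rsplit("}", 1)[-1]


def to_bracket(elem):
    parts = ["{", escape(local_name(elem.tag))]
    # Attribute-nodes come first, sorted by key: {key{value}}.
    for key in sorted(elem.attrib):
        parts.append("{" + escape(key) + "{" + escape(elem.attrib[key]) + "}}")
    # The content-node follows the attribute-nodes.
    text = (elem.text or "").strip()
    if text:
        parts.append("{" + escape(text) + "}")
    # Nested tags become child nodes (placement after content is an assumption).
    for child in elem:
        parts.append(to_bracket(child))
    parts.append("}")
    return "".join(parts)


xml = '<entry dataset="Swiss-Prot" created="2000-01-01"><name>TEST</name></entry>'
print(to_bracket(ET.fromstring(xml)))
# prints {entry{created{2000-01-01}}{dataset{Swiss-Prot}}{name{TEST}}}
```

Note that `created` precedes `dataset` in the output because attribute-nodes are sorted by key, regardless of their order in the XML.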

## Dependencies

- **lxml** (XML processor for Python)
  http://lxml.de/installation.html
- **Python3**
  https://www.python.org/downloads/
- **wget**
- **gzip**

## Steps

**Note: the download script fetches the current version of the data.**

Run the following to download all necessary files.
```bash
./download_prepare.sh
```

Run the following to convert the raw data file into bracket notation. **Takes some time.**
```bash
python swissprot_to_bracket.py
```

Run the following to sort the dataset by tree size.
```bash
./sort_dataset.sh swissprot.bracket
```
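
To make "tree size" concrete, here is a rough Python sketch of one way to sort such a file. It does not reproduce `sort_dataset.sh`; it assumes one tree per line, that tree size means the number of nodes, and that every unescaped `{` opens a node. The function names `tree_size` and `sort_by_size` are hypothetical.

```python
def tree_size(line):
    # Count opening braces that are not escaped with a backslash;
    # under the assumptions above, each one starts a node.
    size = 0
    escaped = False
    for ch in line:
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "{":
            size += 1
    return size


def sort_by_size(lines):
    # sorted() is stable, so trees of equal size keep their original order.
    return sorted(lines, key=tree_size)


trees = ["{a{b}{c}}", "{a}", "{a{b}}"]
print(sort_by_size(trees))
# prints ['{a}', '{a{b}}', '{a{b}{c}}']
```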