# Swissprot

An annotated, non-redundant protein sequence database. It can be downloaded as a single large XML file.

## Source

https://www.uniprot.org/downloads

## Conversion details

- Each tag becomes a node labeled with the tag's name.
- Tag nesting is converted to parent-child relationships.
- A tag's *content* becomes a child node of the tag, with the *content* as its label.
- Each attribute, a (*key*, *value*) pair, becomes two nodes (parent and child) labeled *key* and *value*, respectively.
- Attribute nodes are children of the corresponding tag node.
- Attribute nodes are ordered by key.
- Attribute nodes come before the content node.
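
The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `swissprot_to_bracket.py`: it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies. The function names `escape` and `to_bracket`, the backslash escaping of braces, and the placement of nested tags after the content node are assumptions. XML namespaces and text that follows a nested tag are ignored.

```python
import xml.etree.ElementTree as ET


def escape(label: str) -> str:
    # Assumption: braces inside a label are backslash-escaped so they
    # cannot be confused with the structural braces of the notation.
    return label.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def to_bracket(elem: ET.Element) -> str:
    # The tag itself is a node labeled with the tag's name.
    parts = ["{", escape(elem.tag)]
    # Attribute nodes come first, ordered by key; the value is the key's child.
    for key in sorted(elem.attrib):
        parts.append("{" + escape(key) + "{" + escape(elem.attrib[key]) + "}}")
    # The content node follows the attribute nodes.
    text = (elem.text or "").strip()
    if text:
        parts.append("{" + escape(text) + "}")
    # Nested tags become child nodes (assumed here to come after the content).
    for child in elem:
        parts.append(to_bracket(child))
    parts.append("}")
    return "".join(parts)


sample = (
    '<entry dataset="Swiss-Prot" created="1986-07-21">'
    "<accession>P12345</accession>"
    "</entry>"
)
print(to_bracket(ET.fromstring(sample)))
# → {entry{created{1986-07-21}}{dataset{Swiss-Prot}}{accession{P12345}}}
```

Note that `created` precedes `dataset` in the output because attribute nodes are sorted by key, regardless of their order in the XML.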

## Dependencies

- **lxml** (XML processor for Python)
  http://lxml.de/installation.html
- **Python3**
  https://www.python.org/downloads/
- **wget**
- **gzip**

## Steps

**The download script fetches the current version of the data.**

Run the following script to download all necessary files.
```bash
./download_prepare.sh
```

Run the following script to convert the raw data file into bracket notation. **This takes some time.**
```bash
python3 swissprot_to_bracket.py
```

Run the following script to sort the dataset by tree size.
```bash
./sort_dataset.sh swissprot.bracket
```
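
For intuition about what sorting by tree size means, here is a rough Python sketch. It does not reproduce `sort_dataset.sh`, whose internals are not described here. It assumes one tree per line, that tree size is the number of nodes (one per unescaped opening brace), that literal braces in labels are backslash-escaped, and that the order is ascending. The names `tree_size` and `sort_by_size` are hypothetical.

```python
def tree_size(bracket: str) -> int:
    # Count nodes: each unescaped "{" opens exactly one node.
    size = 0
    escaped = False
    for ch in bracket:
        if escaped:
            escaped = False  # this character was escaped; skip it
        elif ch == "\\":
            escaped = True
        elif ch == "{":
            size += 1
    return size


def sort_by_size(trees):
    # Stable sort: trees of equal size keep their original relative order.
    return sorted(trees, key=tree_size)


trees = ["{a{b}{c}}", "{a}", "{a{b}}"]
print(sort_by_size(trees))
# → ['{a}', '{a{b}}', '{a{b}{c}}']
```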

## Estimated time

To be measured.