# Swissprot

## Short description

An annotated, non-redundant protein sequence database. It can be downloaded as a large XML file.

## Conversion details

- Each tag becomes a node with the tag's name as its label.
- Tag nesting is converted to parent-child relationships.
- A tag's *content* becomes a child node of the tag, with the *content* as its label.
- Each attribute, a (*key*, *value*) pair, becomes two nodes (parent and child) labeled *key* and *value*, respectively.
- Attribute-nodes are children of the corresponding tag-node.
- Attribute-nodes are ordered by their key values.
- Attribute-nodes come before content-node.
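
The rules above can be illustrated with a minimal sketch. It is not the repository's `swissprot_to_bracket.py`: it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies. The function names `escape` and `to_bracket` are hypothetical, and two details are assumptions: that braces inside labels are backslash-escaped, and that the content-node is placed before child tags.

```python
import xml.etree.ElementTree as ET


def escape(label: str) -> str:
    # Assumption: backslashes and braces inside labels are backslash-escaped.
    return label.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def to_bracket(elem: ET.Element) -> str:
    # The tag becomes a node labeled with the tag's name.
    parts = ["{", escape(elem.tag)]
    # Attribute-nodes come first, ordered by key; each key node has
    # a single child node labeled with the attribute's value.
    for key in sorted(elem.attrib):
        parts.append("{" + escape(key) + "{" + escape(elem.attrib[key]) + "}}")
    # The content-node follows the attribute-nodes (whitespace-only text is skipped).
    text = (elem.text or "").strip()
    if text:
        parts.append("{" + escape(text) + "}")
    # Nested tags become children (placing them after the content is an assumption).
    for child in elem:
        parts.append(to_bracket(child))
    parts.append("}")
    return "".join(parts)


root = ET.fromstring('<entry id="7" dataset="sp"><name>ABC</name></entry>')
print(to_bracket(root))  # {entry{dataset{sp}}{id{7}}{name{ABC}}}
```

With ElementTree, namespaced tags appear as `{uri}tag`, so a real run on namespaced XML would likely strip the namespace before building labels.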

## Dependencies

- **lxml** (XML processor for Python)
  http://lxml.de/installation.html
- **Python3**
  https://www.python.org/downloads/
- **wget**
- **gzip**

## Steps

**For repeatability, the scripts use a specific version of the data (hardcoded in ``download_prepare.sh`` and ``swissprot_to_bracket.py``).**

Run the following to download all necessary files.
```bash
./download_prepare.sh
```

Run the following to convert the raw data file into bracket notation. **This takes some time.**
```bash
python swissprot_to_bracket.py
```

Sample 100,000 random lines from the dataset. **We perform the join on a subset only.**
```bash
python random_lines.py 100000 swissprot.bracket > swissprot_random_100k.bracket
```
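
The contents of `random_lines.py` are not shown here. As a rough sketch of how such a sampling step could work, the following uses reservoir sampling, which picks `k` lines uniformly at random in a single pass without loading the whole file. The function name `sample_lines` and the `seed` parameter are hypothetical and are not taken from the repository.

```python
import random


def sample_lines(lines, k, seed=None):
    """Return k items chosen uniformly at random from an iterable (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir


# Usage sketch: sample 3 of 10 in-memory "lines"; a file object works the same way.
sample = sample_lines((f"tree-{n}" for n in range(10)), 3, seed=1)
print(len(sample))  # 3
```

Passing a fixed seed makes the sample reproducible, which fits the repeatability goal stated above.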

Run the following to sort the sampled dataset by tree size.
```bash
./sort_dataset.sh swissprot_random_100k.bracket
```
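
The contents of `sort_dataset.sh` are not shown here. The sketch below illustrates one way to order trees by size, assuming that each line holds one tree in bracket notation, that tree size means the number of nodes, and that braces inside labels are backslash-escaped. The names `tree_size` and `sort_by_size` are hypothetical.

```python
def tree_size(bracket: str) -> int:
    """Count nodes as the number of unescaped opening braces."""
    size = 0
    i = 0
    while i < len(bracket):
        if bracket[i] == "\\":
            # Skip the escaped character (assumed escaping convention).
            i += 2
            continue
        if bracket[i] == "{":
            size += 1
        i += 1
    return size


def sort_by_size(trees):
    # sorted() is stable, so trees of equal size keep their input order.
    return sorted(trees, key=tree_size)


print(sort_by_size(["{a{b}{c}}", "{a}", "{a{b}}"]))
# ['{a}', '{a{b}}', '{a{b}{c}}']
```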