Mateusz Pawlik / ted-datasets
Commit 8c94e595, authored Oct 24, 2018 by Mateusz Pawlik
Documenting.
Parent: 0c22efd2
README.md
@@ -15,11 +15,11 @@ Currently we support the following datasest:
 - **Swissprot** - Protein sequence data in XML.
 The details about each dataset can be found in the README files in the
-datasets subdirectories.
+corresponding subdirectories.
 ## Statistics
-The `statistics` subdirectory contains scripts to summirize the data. The table
+The `statistics` subdirectory contains scripts to summarize the data. The table
 below shows the statistics of our datasets.
 Dataset | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
@@ -34,6 +34,19 @@ Swissprot | 556196 | 862 | 101 | 48286 | 1
 Each dataset and its corresponding scripts belong to a separate directory with a name identifying the dataset.
+There is a script `download_prepare.sh` for each dataset, that downloads the
+raw data, converts to bracket notation, and sorts by tree size.
+The `utilities` directory holds tools common for multiple datasets.
+## RAM requirements and runtime estimates
+For each dataset, if necessary, we list the required RAM memory and estimated
+runtime. The RAM requirements go up to **60GB** for the Swissprot dataset.
+The script `utilities/sort_dataset.sh` has a hard-coded 10GB buffer size.
+However, it executes on machines with less RAM too.
 ## Expected output
 Each output dataset must satisfy the following requirements:
...
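The added README text says each `download_prepare.sh` converts the raw data to bracket notation and sorts it by tree size, and the statistics table reports tree counts and sizes. As a rough illustration only, the sketch below shows one way those two steps could look in Python. It is not the repository's code. It assumes curly-brace bracket notation such as `{a{b}{c}}`, one tree per string, with a backslash escaping braces inside labels, and it assumes ascending sort order. The function names `tree_size`, `sort_by_size`, and `summarize` are hypothetical.

```python
from statistics import mean


def tree_size(tree: str) -> int:
    """Count the nodes of a bracket-notation tree such as {a{b}{c}}.

    Assumption: every node opens with an unescaped '{', and a backslash
    escapes the character that follows it inside a label.
    """
    size = 0
    escaped = False
    for ch in tree:
        if escaped:
            escaped = False  # escaped character belongs to a label
        elif ch == "\\":
            escaped = True
        elif ch == "{":
            size += 1
    return size


def sort_by_size(trees):
    """Return the trees ordered by ascending node count (stable sort)."""
    return sorted(trees, key=tree_size)


def summarize(trees):
    """Compute the size-related columns of a statistics table."""
    sizes = [tree_size(t) for t in trees]
    return {
        "trees": len(sizes),
        "avg_size": mean(sizes),
        "min_size": min(sizes),
        "max_size": max(sizes),
    }


if __name__ == "__main__":
    sample = ["{a{b}{c{d}}}", "{x}", "{p{q\\{}}"]
    print(sort_by_size(sample))
    print(summarize(sample))
```

This version holds all trees in memory, so it only suits small samples. For datasets on the scale listed in the table, an external sort with a bounded buffer would be the more realistic choice.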