Mateusz Pawlik / ted-datasets
Commit 8c94e595, authored Oct 24, 2018 by Mateusz Pawlik
Documenting.
parent 0c22efd2
Showing 1 changed file with 15 additions and 2 deletions
README.md
@@ -15,11 +15,11 @@ Currently we support the following datasest:
 - **Swissprot** - Protein sequence data in XML.

 The details about each dataset can be found in the README files in the
-datasets subdirectories.
+corresponding subdirectories.

 ## Statistics

-The `statistics` subdirectory contains scripts to summirize the data. The table
+The `statistics` subdirectory contains scripts to summarize the data. The table
 below shows the statistics of our datasets.

 Dataset | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
@@ -34,6 +34,19 @@ Swissprot | 556196 | 862 | 101 | 48286 | 1
 Each dataset and its corresponding scripts belong to a separate directory with a name identifying the dataset.

+There is a script `download_prepare.sh` for each dataset, that downloads the
+raw data, converts to bracket notation, and sorts by tree size.
+
+The `utilities` directory holds tools common for multiple datasets.
+
+## RAM requirements and runtime estimates
+
+For each dataset, if necessary, we list the required RAM memory and estimated
+runtime. The RAM requirements go up to **60GB** for the Swissprot dataset.
+
+The script `utilities/sort_dataset.sh` has a hard-coded 10GB buffer size.
+However, it executes on machines with less RAM too.
+
 ## Expected output

 Each output dataset must satisfy the following requirements:
...
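
The README text in this diff says each `download_prepare.sh` converts the raw data to bracket notation and sorts it by tree size. As a rough illustration of that last step only, here is a minimal Python sketch. It is not the repository's script. It assumes the curly-brace form of bracket notation (for example `{a{b}{c}}`), assumes braces inside labels are escaped with a backslash, and assumes ascending order; the function names `tree_size` and `sort_by_tree_size` are hypothetical.

```python
def tree_size(bracket_tree: str) -> int:
    """Count the nodes of a tree in bracket notation.

    Assumption: every node contributes exactly one unescaped '{',
    and a backslash escapes the character that follows it.
    """
    size = 0
    escaped = False
    for ch in bracket_tree:
        if escaped:
            # This character belongs to a label; do not count it.
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "{":
            size += 1
    return size


def sort_by_tree_size(trees):
    """Return the trees ordered by ascending node count (stable sort)."""
    return sorted(trees, key=tree_size)


if __name__ == "__main__":
    sample = ["{a{b}{c{d}}}", "{x}", "{p{q}}"]
    for tree in sort_by_tree_size(sample):
        print(tree_size(tree), tree)
```

This sketch holds all trees in memory, which is fine for small samples. For inputs the size of Swissprot, a disk-backed sort with a bounded buffer is the more realistic choice, which is presumably why `utilities/sort_dataset.sh` configures a buffer size.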