Mateusz Pawlik / ted-datasets
Commit be9379a5, authored Oct 22, 2018 by Mateusz Pawlik
Iteration over README files. Tried to unify.
Parent: 0412c8e9
Showing 6 changed files with 56 additions and 17 deletions
- README.md (+3 -0)
- bolzano-address-trees/README.md (+28 -0)
- dblp/README.md (+9 -3)
- python_ast/README.md (+6 -3)
- sentiment/README.md (+4 -2)
- swissprot/README.md (+6 -9)
README.md
...
...
@@ -19,6 +19,9 @@ datasets subdirectories.
## Statistics
The `statistics` subdirectory contains scripts to summarize the data. The table
below shows the statistics of our datasets.
Dataset | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
----------|-----------------|----------------|---------------|---------------|--------------------------
Bolzano | 299 | 166 | 2 | 2105 | 592
...
...
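The columns of the statistics table can be illustrated with a minimal Python sketch. This is not the code from the `statistics` subdirectory; the function names `tree_labels` and `dataset_statistics` are hypothetical. It assumes one tree per line in bracket notation (for example `{a{b}{c}}`), that tree size means the number of nodes, and that braces inside labels are escaped with a backslash.

```python
def tree_labels(line):
    """Return the node labels of one bracket-notation tree, e.g. {a{b}{c}}."""
    labels = []
    current = None  # characters of the label being read, or None outside a label
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            # Assumption: a backslash escapes the next character inside a label.
            if current is not None:
                current.append(line[i + 1])
            i += 2
            continue
        if ch == "{":
            # A new node starts; the label of the enclosing node (if any) ends here.
            if current is not None:
                labels.append("".join(current))
            current = []
        elif ch == "}":
            # A node closes; if its label is still open, it ends here.
            if current is not None:
                labels.append("".join(current))
                current = None
        elif current is not None:
            current.append(ch)
        i += 1
    return labels


def dataset_statistics(lines):
    """Compute the table columns for an iterable of bracket-notation lines."""
    sizes = []
    distinct = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        labels = tree_labels(line)
        sizes.append(len(labels))  # tree size = number of nodes
        distinct.update(labels)
    return {
        "trees": len(sizes),
        "avg_size": sum(sizes) / len(sizes) if sizes else 0,
        "min_size": min(sizes) if sizes else 0,
        "max_size": max(sizes) if sizes else 0,
        "distinct_labels": len(distinct),
    }
```

Calling `dataset_statistics(open("some_dataset.bracket"))` on a hypothetical file would yield one row of such a table.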
bolzano-address-trees/README.md
0 → 100644
# Bolzano
Residential addresses in the city of Bolzano. A tree represents all addresses
of a street (root: street name, root-leaf path: address).
## Source
https://dbresearch.uni-salzburg.at/projects/pq-gram-ordered-labeled-trees/description.php
## Conversion details
The original trees are in the bracket notation. We only remove the tree
identifiers.
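As an illustration of the identifier removal, here is a small Python sketch. It is not the logic of `download_prepare.sh`, which is not shown here. It assumes that the identifier precedes the tree on the same line and that the tree starts at the first `{`; the function name `strip_tree_identifier` is hypothetical.

```python
def strip_tree_identifier(line):
    """Drop everything before the first '{' of a line (assumed to be the tree id)."""
    start = line.find("{")
    if start == -1:
        # No tree on this line; return an empty string.
        return ""
    return line[start:].rstrip("\n")
```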
## Dependencies
- **wget**
- **unzip**
- **iconv**
- **sed**
- **awk**
## Steps
Execute the following to download and prepare the dataset.
```bash
./download_prepare.sh
```
dblp/README.md
# DBLP
## Short description
Computer science bibliography. Can be downloaded in a large XML file.
## Source
https://dblp2.uni-trier.de/faq/How+can+I+download+the+whole+dblp+dataset.html
## Conversion details
We include all entries except “www” elements, which are small
subtrees that all match for threshold ≥ 3, blowing up the join result by
billions of pairs.
- Each tag is a node with the tag's name as a label.
- Tag nesting is converted to parent-child relationships.
- A tag's *content* is the tag's child node with *content* as label.
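The three conversion rules above can be sketched in Python as follows. This is an illustration only, not the repository's conversion script: the function names are hypothetical, XML attributes and tail text are ignored, and escaping braces in labels with a backslash is an assumption.

```python
import xml.etree.ElementTree as ET


def escape_label(text):
    """Escape characters that would clash with bracket notation (assumed convention)."""
    return text.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def element_to_bracket(elem):
    """Convert one XML element and its descendants to bracket notation."""
    # Rule 1: the tag becomes a node labeled with the tag's name.
    parts = ["{", escape_label(elem.tag)]
    # Rule 3: non-empty text content becomes a child node labeled with the content.
    text = (elem.text or "").strip()
    if text:
        parts.append("{" + escape_label(text) + "}")
    # Rule 2: nested tags become child nodes.
    for child in elem:
        parts.append(element_to_bracket(child))
    parts.append("}")
    return "".join(parts)


def xml_to_bracket(xml_string):
    """Parse an XML string and return its tree in bracket notation."""
    return element_to_bracket(ET.fromstring(xml_string))
```

For a file as large as the full DBLP dump, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than parsing one string in memory.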
...
...
@@ -64,5 +70,5 @@ awk '!/{www{key{homepages/' dblp_sorted.bracket > dblp_no_www_sorted.bracket
## Troubleshooting
- Encoding
### Encoding
In case of an encoding error, follow the steps on this webpage:
[https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian](https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian).
python_ast/README.md
# Python Abstract Syntax Trees
## Short description
Abstract Syntax trees of 150k Python collected and curated by SRI Group of ETH
Zuerich.
Abstract Syntax trees of 150k Python programs found on Github. Data provided by SRI Group of ETH Zuerich.
## Source
https://www.sri.inf.ethz.ch/py150
## Conversion details
...
...
@@ -13,7 +16,7 @@ Abstract Syntax trees of 150k Python programs found on Github. Data provided by
- **Python3** https://www.python.org/downloads/
- **argparse**
- **argparse** module of Python3
- **wget**
- **wget**
- **tar**
...
...
sentiment/README.md
# Sentiment
## Short description
Syntax trees of movie ratings from the Stanford NLP Group.
Syntax trees of movie ratings.
## Source
https://nlp.stanford.edu/sentiment/
## Conversion details
...
...
swissprot/README.md
# Swissprot
## Short description
Annotated and non-redundant protein sequence database. Can be downloaded in a large XML file.
## Source
https://www.uniprot.org/downloads
## Conversion details
- Each tag is a node with the tag's name as a label.
...
...
@@ -25,7 +27,7 @@ Annotated and non-redundant protein sequence database. Can be downloaded in a la
## Steps
**For repeatability, it downloads a specific version of the data (hardcoded in ``download_prepare.sh`` and ``dblp-to-bracket.py``).**
**It downloads the current version of the data.**
Execute to download all necessary files.
```bash
...
...
@@ -37,12 +39,7 @@ Execute to convert the raw data file into bracket notation. **Takes some time.**
python swissprot_to_bracket.py
```
Sample the dataset. **We perform a join on a subset only.**
```bash
python random_lines.py 100000 swissprot.bracket > swissprot_random_100k.bracket
```
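For illustration, drawing a fixed number of random lines could look like the Python sketch below. The implementation of `random_lines.py` is not shown in this diff, so this is a hypothetical equivalent rather than that script; the function name `sample_lines` and the optional `seed` parameter are assumptions.

```python
import random


def sample_lines(lines, k, seed=None):
    """Return k lines drawn uniformly at random without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    if k >= len(lines):
        # Fewer lines than requested: return a copy of all of them.
        return list(lines)
    return rng.sample(lines, k)
```

Passing a fixed `seed` would make the sampled subset reproducible across runs.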
Execute to sort the dataset by tree size.
```bash
./sort_dataset.sh swissprot_random_100k.bracket
./sort_dataset.sh swissprot.bracket
```
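Sorting by tree size can be sketched in Python as follows. This is not the content of `sort_dataset.sh`; it assumes one tree per line in bracket notation, that tree size is the number of nodes (one per unescaped `{`), that braces in labels are backslash-escaped, and that the order is ascending. The function names are hypothetical.

```python
def tree_size(line):
    """Count the nodes of a bracket-notation tree: one per unescaped '{'."""
    size = 0
    i = 0
    while i < len(line):
        if line[i] == "\\":
            # Assumption: a backslash escapes the next character, so skip both.
            i += 2
            continue
        if line[i] == "{":
            size += 1
        i += 1
    return size


def sort_by_tree_size(lines):
    """Return the lines sorted by ascending tree size (stable for equal sizes)."""
    return sorted(lines, key=tree_size)
```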