Commit be9379a5 authored by Mateusz Pawlik

Iteration over README files. Tried to unify.

parent 0412c8e9
......@@ -19,6 +19,9 @@ datasets subdirectories.
## Statistics
The `statistics` subdirectory contains scripts to summarize the data. The table
below shows the statistics of our datasets.
Dataset | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
----------|-----------------|----------------|---------------|---------------|--------------------------
Bolzano | 299 | 166 | 2 | 2105 | 592
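
The columns of the table can be recomputed from a file with one bracket-notation tree per line. The following is only a minimal sketch and not the script shipped in `statistics`; the function names `parse_labels` and `summarize` are hypothetical, and it assumes that braces inside labels are escaped with a backslash.

```python
from statistics import mean


def parse_labels(tree):
    """Return the node labels of one bracket-notation tree in preorder.

    Assumption: a literal brace inside a label is escaped as \\{ or \\}.
    """
    labels = []
    i, n = 0, len(tree)
    while i < n:
        if tree[i] == "{":
            # An opening brace starts a node; its label runs until the
            # next unescaped brace.
            j = i + 1
            buf = []
            while j < n and tree[j] not in "{}":
                if tree[j] == "\\" and j + 1 < n:
                    buf.append(tree[j + 1])
                    j += 2
                else:
                    buf.append(tree[j])
                    j += 1
            labels.append("".join(buf))
            i = j
        else:
            i += 1
    return labels


def summarize(trees):
    """Compute the table columns for an iterable of bracket-notation trees."""
    sizes = []
    distinct = set()
    for tree in trees:
        labels = parse_labels(tree.strip())
        if not labels:
            continue  # skip blank lines
        sizes.append(len(labels))
        distinct.update(labels)
    if not sizes:
        raise ValueError("no trees found")
    return {
        "trees": len(sizes),
        "avg_size": mean(sizes),
        "min_size": min(sizes),
        "max_size": max(sizes),
        "distinct_labels": len(distinct),
    }
```

For a real dataset, `summarize(open(path, encoding="utf-8"))` would process the file line by line.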
......
# Bolzano
Residential addresses in the city of Bolzano. A tree represents all addresses
of a street (root: street name, root-leaf path: address).
## Source
https://dbresearch.uni-salzburg.at/projects/pq-gram-ordered-labeled-trees/description.php
## Conversion details
The original trees are in the bracket notation. We only remove the tree
identifiers.
## Dependencies
- **wget**
- **unzip**
- **iconv**
- **sed**
- **awk**
## Steps
Execute the following to download and prepare the dataset.
```bash
./download_prepare.sh
```
# DBLP
## Short description
Computer science bibliography. Can be downloaded as a single large XML file.
## Source
https://dblp2.uni-trier.de/faq/How+can+I+download+the+whole+dblp+dataset.html
## Conversion details
We include all entries except “www” elements. These are small
subtrees that all match one another for thresholds ≥ 3, which blows up the join
result by billions of pairs.
- Each tag is a node with the tag's name as its label.
- Tag nesting is converted to parent-child relationships.
- A tag's *content* becomes a child node of that tag, with the *content* as its label.
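
The three rules above can be illustrated with a small sketch. It is not the conversion script from this repository: the names `escape` and `to_bracket` are hypothetical, the handling of XML attributes is not covered by the rules shown here and is therefore left out, and escaping braces with a backslash is an assumption.

```python
import xml.etree.ElementTree as ET


def escape(label):
    """Escape characters that would clash with the bracket syntax (assumed convention)."""
    return label.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def to_bracket(elem):
    """Convert an XML element and its descendants to bracket notation."""
    # Rule 1: the tag becomes a node labeled with the tag's name.
    parts = ["{", escape(elem.tag)]
    # Rule 3: non-empty text content becomes a child node labeled with the text.
    text = (elem.text or "").strip()
    if text:
        parts.append("{" + escape(text) + "}")
    # Rule 2: nested tags become children of the current node.
    for child in elem:
        parts.append(to_bracket(child))
    parts.append("}")
    return "".join(parts)


if __name__ == "__main__":
    # Made-up record, used only to show the shape of the output.
    record = ET.fromstring("<article><title>Trees</title><year>2019</year></article>")
    print(to_bracket(record))
```

Text that follows a closing child tag (`elem.tail`) is ignored in this sketch. For a file as large as the full dump, an incremental parser such as `xml.etree.ElementTree.iterparse` would likely be needed instead of loading everything at once.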
......@@ -64,5 +70,5 @@ awk '!/{www{key{homepages/' dblp_sorted.bracket > dblp_no_www_sorted.bracket
## Troubleshooting
- Encoding
### Encoding
In case of an encoding error, follow the steps on this webpage: [https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian](https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian).
# Python Abstract Syntax Trees
## Short description
Abstract Syntax trees of 150k Python collected and curated by SRI Group of ETH
Zuerich.
Abstract Syntax trees of 150k Python programs found on Github. Data provided by SRI Group of ETH Zuerich.
## Source
https://www.sri.inf.ethz.ch/py150
## Conversion details
......@@ -13,7 +16,7 @@ Abstract Syntax trees of 150k Python programs found on Github. Data provided by
- **Python3**
https://www.python.org/downloads/
- **argparse**
- **argparse** module of Python3
- **wget**
- **wget**
- **tar**
......
# Sentiment
## Short description
Syntax trees of movie ratings from the Stanford NLP Group.
Syntax trees of movie ratings.
## Source
https://nlp.stanford.edu/sentiment/
## Conversion details
......
# Swissprot
## Short description
Annotated and non-redundant protein sequence database. Can be downloaded as a single large XML file.
## Source
https://www.uniprot.org/downloads
## Conversion details
- Each tag is a node with the tag's name as its label.
......@@ -25,7 +27,7 @@ Annotated and non-redundant protein sequence database. Can be downloaded in a la
## Steps
**For repeatability, it downloads a specific version of the data (hardcoded in ``download_prepare.sh`` and ``dblp-to-bracket.py``).**
**It downloads the current version of the data.**
Execute to download all necessary files.
```bash
......@@ -37,12 +39,7 @@ Execute to convert the raw data file into bracket notation. **Takes some time.**
python swissprot_to_bracket.py
```
Sample the dataset. **We perform a join on a subset only.**
```bash
python random_lines.py 100000 swissprot.bracket > swissprot_random_100k.bracket
```
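
How `random_lines.py` draws its sample is not shown here. One way to pick a fixed number of lines uniformly from a file that is too large to hold in memory is reservoir sampling. The sketch below illustrates that technique under this assumption; `sample_lines` and its `seed` parameter are hypothetical.

```python
import random


def sample_lines(lines, k, seed=None):
    """Return k items chosen uniformly at random from an iterable (reservoir sampling).

    If the input has fewer than k items, all of them are returned.
    """
    rng = random.Random(seed)
    reservoir = []
    for index, line in enumerate(lines):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Keep the new item with probability k / (index + 1) by
            # overwriting a uniformly chosen slot.
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = line
    return reservoir
```

The input is read once and only `k` lines are kept in memory, so the function can be fed an open file object directly.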
Execute to sort the dataset by tree size.
```bash
./sort_dataset.sh swissprot_random_100k.bracket
./sort_dataset.sh swissprot.bracket
```
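
The contents of `sort_dataset.sh` are not shown here. As a rough illustration of sorting by tree size, the sketch below assumes that the size of a tree is its number of nodes, that each node contributes exactly one unescaped `{`, and that smaller trees come first. The names `tree_size` and `sort_by_size` are hypothetical.

```python
def tree_size(tree):
    """Count the nodes of a bracket-notation tree (one unescaped '{' per node)."""
    size = 0
    escaped = False
    for ch in tree:
        if escaped:
            # The character after a backslash is part of a label.
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "{":
            size += 1
    return size


def sort_by_size(trees):
    """Return the non-empty trees ordered by ascending node count."""
    cleaned = (t.strip() for t in trees)
    # sorted() is stable, so trees of equal size keep their input order.
    return sorted((t for t in cleaned if t), key=tree_size)
```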