README.md +3 −0

@@ -19,6 +19,9 @@ datasets subdirectories.

## Statistics

The `statistics` subdirectory contains scripts to summarize the data. The table below shows the statistics of our datasets.

Dataset | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
----------|-----------------|----------------|---------------|---------------|--------------------------
Bolzano | 299 | 166 | 2 | 2105 | 592
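The columns of the statistics table can be computed directly from bracket notation. The following is a minimal sketch, not the repository's actual `statistics` script: it assumes one tree per line in the form `{label{child}{child}}` and assumes labels contain no escaped braces. The function name `tree_stats` and the example trees are hypothetical.

```python
import re


def tree_stats(lines):
    """Summarize trees given in bracket notation, one tree per line.

    Assumption: labels never contain '{' or '}' (no escaping is handled).
    """
    sizes = []
    labels = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Every '{' opens exactly one node, so the count is the tree size.
        sizes.append(line.count("{"))
        # A label is the text between a '{' and the next brace.
        labels.update(re.findall(r"\{([^{}]*)", line))
    return {
        "trees": len(sizes),
        "avg_size": sum(sizes) / len(sizes) if sizes else 0,
        "min_size": min(sizes) if sizes else 0,
        "max_size": max(sizes) if sizes else 0,
        "distinct_labels": len(labels),
    }


# Two tiny hypothetical trees: sizes 3 and 2, labels a, b, c, d.
example = ["{a{b}{c}}", "{a{d}}"]
stats = tree_stats(example)
print(stats)
```

Counting opening braces avoids building the trees in memory, which may matter for large inputs; a real script would also need to handle escaped braces if the datasets contain them.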
bolzano-address-trees/README.md 0 → 100644 +28 −0

# Bolzano

Residential addresses in the city of Bolzano. A tree represents all addresses of a street (root: street name, root-to-leaf path: address).

## Source

https://dbresearch.uni-salzburg.at/projects/pq-gram-ordered-labeled-trees/description.php

## Conversion details

The original trees are in bracket notation. We only remove the tree identifiers.

## Dependencies

- **wget**
- **unzip**
- **iconv**
- **sed**
- **awk**

## Steps

Execute the following to download and prepare the dataset.

```bash
./download_prepare.sh
```
dblp/README.md +9 −3

# DBLP

## Short description

Computer science bibliography. Can be downloaded as a single large XML file.

## Source

https://dblp2.uni-trier.de/faq/How+can+I+download+the+whole+dblp+dataset.html

## Conversion details

We include all entries except “www” elements, which are small subtrees that all match for threshold ≥ 3, blowing up the join result by billions of pairs.

- Each tag is a node with the tag's name as its label.
- Tag nesting is converted to parent-child relationships.
- A tag's *content* is the tag's child node with *content* as its label.

@@ -64,5 +70,5 @@ awk '!/{www{key{homepages/' dblp_sorted.bracket > dblp_no_www_sorted.bracket

## Troubleshooting

- Encoding
### Encoding

In case of an encoding error, follow the steps on this webpage: [https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian](https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian).
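The three conversion rules listed under "Conversion details" for DBLP, together with the `www` filter, can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the repository's converter: the function name `to_bracket` and the inline XML sample are hypothetical, attributes are emitted as `{name{value}}` children (an assumption inferred from the `{www{key{homepages/` pattern in the `awk` command), and braces inside text are not escaped.

```python
import xml.etree.ElementTree as ET


def to_bracket(elem):
    """Convert an XML element to bracket notation.

    - The tag name becomes the node label.
    - Nested tags become child nodes.
    - Text content becomes a child node labeled with that text.
    Assumption: attributes become {name{value}} children.
    """
    parts = ["{", elem.tag]
    for name, value in elem.attrib.items():
        parts.append("{" + name + "{" + value + "}}")
    text = (elem.text or "").strip()
    if text:
        parts.append("{" + text + "}")
    for child in elem:
        parts.append(to_bracket(child))
    parts.append("}")
    return "".join(parts)


# Tiny hypothetical sample; the real DBLP dump is far larger and uses a DTD.
xml_doc = """<dblp>
<article key="journals/x/A1"><author>A. Author</author><title>Some Title</title></article>
<www key="homepages/a"><author>A. Author</author></www>
</dblp>"""

root = ET.fromstring(xml_doc)
trees = [to_bracket(entry) for entry in root]

# Same effect as the awk filter: drop trees containing the www homepage prefix.
kept = [t for t in trees if "{www{key{homepages/" not in t]
for tree in kept:
    print(tree)
```

For the full dataset, a streaming parser such as `ET.iterparse` would likely be needed instead of `ET.fromstring`, since the sketch loads the whole document into memory.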
python_ast/README.md +6 −3

# Python Abstract Syntax Trees

## Short description

Abstract Syntax trees of 150k Python collected and curated by SRI Group of ETH Zuerich.
Abstract Syntax trees of 150k Python programs found on Github. Data provided by SRI Group of ETH Zuerich.

## Source

https://www.sri.inf.ethz.ch/py150

## Conversion details

@@ -13,7 +16,7 @@ Abstract Syntax trees of 150k Python programs found on Github. Data provided by

- **Python3** https://www.python.org/downloads/
- **argparse**
- **argparse** module of Python3
- **wget**
- **wget**
- **tar**
sentiment/README.md +4 −2

# Sentiment

## Short description

Syntax trees of movie ratings from the Stanford NLP Group.
Syntax trees of movie ratings.

## Source

https://nlp.stanford.edu/sentiment/

## Conversion details