Commit be9379a5 authored by Mateusz Pawlik

Iteration over README files. Tried to unify.

parent 0412c8e9
+3 −0
@@ -19,6 +19,9 @@ datasets subdirectories.


## Statistics


The `statistics` subdirectory contains scripts to summarize the data. The table
below shows the statistics of our datasets.

Dataset   | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
----------|-----------------|----------------|---------------|---------------|--------------------------
Bolzano   | 299             | 166            | 2             | 2105          | 592
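
The scripts in `statistics` are not shown in this diff, so the following is only a minimal sketch of how the table columns could be computed from a file of bracket-notation trees, not the repository's implementation. It assumes one tree per line, that every unescaped `{` opens a node, and that a backslash escapes the character after it. The function name `dataset_statistics` and the dictionary keys are made up for illustration.

```python
import re
from statistics import mean

# Assumption: a node starts at every unescaped "{" and its label runs up to
# the next unescaped brace; a backslash escapes the character that follows.
NODE = re.compile(r"\{((?:\\.|[^{}\\])*)")


def dataset_statistics(lines):
    """Compute the table columns for an iterable of bracket-notation trees."""
    sizes = []
    labels = set()
    for line in lines:
        tree_labels = NODE.findall(line.strip())
        if not tree_labels:  # blank line or no tree on this line
            continue
        sizes.append(len(tree_labels))  # tree size = number of nodes
        labels.update(tree_labels)
    if not sizes:
        raise ValueError("no trees found")
    return {
        "number_of_trees": len(sizes),
        "avg_tree_size": mean(sizes),
        "min_tree_size": min(sizes),
        "max_tree_size": max(sizes),
        "distinct_labels": len(labels),
    }


# Small in-memory example; a real run would pass an open file instead.
print(dataset_statistics(["{a{b}{c}}", "{a}", "{x{y{z}}{a}}"]))
```

Counting opening brackets avoids building the trees in memory, which may matter for the larger datasets.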
+28 −0
# Bolzano

Residential addresses in the city of Bolzano. A tree represents all addresses
of a street (root: street name, root-leaf path: address).

## Source

https://dbresearch.uni-salzburg.at/projects/pq-gram-ordered-labeled-trees/description.php

## Conversion details

The original trees are already in bracket notation; we only remove the tree
identifiers.
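
The exact layout of the identifiers is not described here. As a rough sketch, assuming each input line consists of an identifier followed by the tree, and that the tree starts at the first `{`, the removal could look like this. The helper name `strip_tree_identifier` and the `17:` prefix in the example are hypothetical.

```python
def strip_tree_identifier(line):
    """Return the tree part of a line, dropping whatever precedes the first "{".

    Assumption: the identifier is the text before the first opening bracket.
    """
    start = line.find("{")
    if start == -1:
        raise ValueError("line contains no bracket-notation tree")
    return line[start:].strip()


# Hypothetical input line: identifier "17:" followed by a tree.
print(strip_tree_identifier("17:{street{1}{2{a}}}\n"))
```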

## Dependencies

- **wget**
- **unzip**
- **iconv**
- **sed**
- **awk**

## Steps

Execute the following to download and prepare the dataset.
```bash
./download_prepare.sh
```
+9 −3
# DBLP


## Short description

Computer science bibliography. It can be downloaded as one large XML file.


## Source

https://dblp2.uni-trier.de/faq/How+can+I+download+the+whole+dblp+dataset.html

## Conversion details


We include all entries except “www” elements. These are small subtrees that
all match one another for thresholds ≥ 3, which inflates the join result by
billions of pairs.
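
The hunk header further down shows the `awk` filter that removes these entries from the sorted file. For readers who prefer Python, an equivalent filter could be sketched as follows, assuming one tree per line. The name `drop_www_trees` is made up for illustration; the marker string is copied from the `awk` pattern.

```python
# Marker copied from the awk pattern '!/{www{key{homepages/'.
WWW_MARKER = "{www{key{homepages/"


def drop_www_trees(lines):
    """Keep only the lines (trees) that do not contain the "www" marker."""
    return [line for line in lines if WWW_MARKER not in line]


# Small in-memory example with one "www" entry and one regular entry.
trees = [
    "{www{key{homepages/x/y}}{author{Ann}}}",
    "{article{author{Ann}}{title{T}}}",
]
print(drop_www_trees(trees))
```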

- Each tag is a node with the tag's name as its label.
- Tag nesting is converted to parent-child relationships.
- A tag's *content* is the tag's child node with *content* as its label.
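
The three rules above can be sketched with Python's standard `xml.etree.ElementTree` module. This is an illustration under stated assumptions, not the repository's converter: the content node is placed before the child tags, braces and backslashes in labels are escaped with a backslash, and attributes and text after a closing tag are ignored. The `awk` pattern in the hunk header below suggests that attributes become nodes as well, but that rule is not visible in this hunk, so the sketch leaves it out. The names `escape` and `to_bracket` are hypothetical.

```python
import xml.etree.ElementTree as ET


def escape(label):
    """Escape backslashes and braces (assumed escaping convention)."""
    return label.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def to_bracket(element):
    """Convert an XML element to bracket notation using the three rules above."""
    parts = ["{", escape(element.tag)]  # rule 1: tag name is the node label
    text = (element.text or "").strip()
    if text:
        parts.append("{" + escape(text) + "}")  # rule 3: content is a child node
    for child in element:
        parts.append(to_bracket(child))  # rule 2: nesting becomes parent-child
    parts.append("}")
    return "".join(parts)


xml = "<article><author>Ann</author><title>T</title></article>"
print(to_bracket(ET.fromstring(xml)))
```

The recursion follows the element nesting directly, so each XML tag yields exactly one bracket pair.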
@@ -64,5 +70,5 @@ awk '!/{www{key{homepages/' dblp_sorted.bracket > dblp_no_www_sorted.bracket


## Troubleshooting


### Encoding
In case of an encoding error, follow the steps on this webpage: [https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian](https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian).
+6 −3
# Python Abstract Syntax Trees


## Short description
Abstract syntax trees of 150k Python programs, collected and curated by the SRI
Group of ETH Zuerich.


## Source

https://www.sri.inf.ethz.ch/py150


## Conversion details


@@ -13,7 +16,7 @@ Abstract Syntax trees of 150k Python programs found on Github. Data provided by


- **Python3**
  https://www.python.org/downloads/
- **argparse** module of Python3
- **wget**
- **tar**
+4 −2
# Sentiment


## Short description
Syntax trees of movie ratings from the Stanford NLP Group.


## Source

https://nlp.stanford.edu/sentiment/


## Conversion details

