# Datasets for tree edit distance experiments

This repository contains all resources to acquire datasets for experimenting
on tree edit distance algorithms.

**We do not store the datasets**, only the scripts to obtain and prepare them.

## Datasets description

Currently we support the following datasets:

- **Bolzano** - Residential addresses in the city of Bolzano.
- **DBLP** - Bibliographic XML data.
- **Python** - Abstract syntax trees of Python source code in JSON.
- **Sentiment** - Semantic trees of movie reviews in the PennTreeBank format.
- **Swissprot** - Protein sequence data in XML.

The details about each dataset can be found in the README files in the
corresponding subdirectories.

## Statistics

The `statistics` subdirectory contains scripts to summarize the data. The table
below shows the statistics of our datasets.

Dataset   | Number of trees | Avg. tree size | Min tree size | Max tree size | Number of distinct labels
----------|-----------------|----------------|---------------|---------------|-------------------------- 
Bolzano   | 299             | 166            | 2             | 2105          | 592
DBLP      | 3934134         | 25             | 8             | 2986          | 14664605
Python    | 150000          | 946            | 1             | 46481         | 3523697
Sentiment | 9645            | 37             | 3             | 103           | 19470
Swissprot | 556196          | 862            | 101           | 48286         | 11439467
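
As a rough illustration of how the columns above can be derived, the following Python sketch computes them from trees in bracket notation. It is not the code from the `statistics` subdirectory. It assumes the curly-brace form of bracket notation (for example `{a{b}{c}}`), assumes that labels contain no unescaped braces, and uses a hypothetical function name, `dataset_statistics`.

```python
import re

# Assumption: a node is written as {label child1 child2 ...} and labels
# contain no unescaped braces. Each match captures one node's label.
NODE_LABEL = re.compile(r"\{([^{}]*)")


def dataset_statistics(lines):
    """Summarize an iterable of bracket-notation trees, one tree per line."""
    sizes = []
    labels = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        node_labels = NODE_LABEL.findall(line)
        sizes.append(len(node_labels))  # tree size = number of nodes
        labels.update(node_labels)
    if not sizes:
        raise ValueError("no trees found")
    return {
        "trees": len(sizes),
        "avg_size": sum(sizes) / len(sizes),
        "min_size": min(sizes),
        "max_size": max(sizes),
        "distinct_labels": len(labels),
    }


if __name__ == "__main__":
    sample = ["{a}", "{a{b}{c}}", "{a{b{d}}{c}}"]
    print(dataset_statistics(sample))
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so the dataset is streamed rather than loaded at once. The set of distinct labels still grows with the data, which matters for datasets with millions of labels.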

## Repository organisation

Each dataset and its corresponding scripts belong to a separate directory with a name identifying the dataset.

Each dataset has a `download_prepare.sh` script that downloads the raw data,
converts it to bracket notation, and sorts the trees by size.
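
The conversion step can be pictured with a small Python sketch. It is not the repository's converter: it assumes each tree is already available as nested `(label, children)` tuples, assumes the curly-brace form of bracket notation, and does not escape braces inside labels. The names `to_bracket` and `tree_size` are hypothetical.

```python
def to_bracket(node):
    """Serialize a (label, children) tuple as a bracket-notation string."""
    label, children = node
    return "{" + label + "".join(to_bracket(child) for child in children) + "}"


def tree_size(node):
    """Count the nodes of a (label, children) tree."""
    _, children = node
    return 1 + sum(tree_size(child) for child in children)


if __name__ == "__main__":
    trees = [
        ("a", [("b", [("d", [])]), ("c", [])]),
        ("x", []),
    ]
    # Sort by size (ascending here, which is an assumption),
    # then emit one tree per line.
    for tree in sorted(trees, key=tree_size):
        print(to_bracket(tree))
```

Both helpers are recursive, so very deep trees could exceed Python's default recursion limit; an iterative traversal would be needed for such inputs.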

The `utilities` directory holds tools shared by multiple datasets.

## RAM requirements and runtime estimates

Where relevant, we list the required RAM and the estimated runtime for each
dataset. The RAM requirements go up to **60GB** for the Swissprot dataset.

The script `utilities/sort_dataset.sh` has a hard-coded 10GB buffer size.
However, it also runs on machines with less RAM.

## Expected output

Each output dataset must satisfy the following requirements:
- The output dataset must be a single text file with one tree per line.
- The trees must be in bracket notation.
- The trees must be sorted by size.
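
The requirements above can be checked mechanically. The following Python sketch is one possible checker, not part of the repository. It assumes the curly-brace form of bracket notation (for example `{a{b}{c}}`), assumes that labels contain no unescaped braces, and takes the sort direction as a parameter because this README does not state it. The name `check_dataset` is hypothetical.

```python
def check_dataset(lines, descending=False):
    """Return a list of (line_number, message) problems; empty if none found."""
    problems = []
    previous_size = None
    for number, line in enumerate(lines, start=1):
        tree = line.rstrip("\n")
        if not (tree.startswith("{") and tree.endswith("}")):
            problems.append((number, "not a bracket-notation tree"))
            continue
        depth = 0
        balanced = True
        for position, char in enumerate(tree):
            if char == "{":
                depth += 1
            elif char == "}":
                depth -= 1
                # Depth may reach zero only at the very last character;
                # otherwise the line holds several trees or a stray brace.
                if depth < 0 or (depth == 0 and position != len(tree) - 1):
                    balanced = False
                    break
        if not balanced or depth != 0:
            problems.append((number, "unbalanced or multiple trees"))
            continue
        size = tree.count("{")  # assumption: one opening brace per node
        if previous_size is not None:
            if descending:
                out_of_order = size > previous_size
            else:
                out_of_order = size < previous_size
            if out_of_order:
                problems.append((number, "not sorted by size"))
        previous_size = size
    return problems
```

For example, `check_dataset(open(path))` with a hypothetical `path` streams the file line by line, and an empty result means no violation was detected under the stated assumptions.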

## TODO

- [ ] Explicitly break ties in sorting: first size, then bracket notation strings lexicographically.
  Example: `sort -k1,1nr -k2 data`
- [ ] Publish snapshots of raw data.
- [ ] Add where the raw data should be fetched from: original source, our snapshot.
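
The tie-breaking rule from the first item can be sketched in Python as a two-part sort key. This is an illustration only: it assumes the curly-brace form of bracket notation with no unescaped braces in labels, sorts by ascending size (an assumption), and uses a hypothetical function name, `sort_trees`.

```python
def sort_trees(trees):
    """Sort bracket-notation strings by size, breaking ties lexicographically."""
    # Assumption: tree size equals the number of opening braces.
    return sorted(trees, key=lambda tree: (tree.count("{"), tree))


if __name__ == "__main__":
    for tree in sort_trees(["{b{a}}", "{c}", "{a{b}}", "{a}"]):
        print(tree)
```

Python compares strings by code point, whereas the ordering produced by `sort` depends on the locale, so the two may order some labels differently.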