# DBLP

DBLP is a computer science bibliography. The whole dataset can be downloaded as a single large XML file.

## Source

https://dblp2.uni-trier.de/faq/How+can+I+download+the+whole+dblp+dataset.html

## Conversion details

We include all entries except “www” elements. These are small
subtrees that all match each other for thresholds ≥ 3, which blows up the
join result by billions of pairs.

- Each tag becomes a node labeled with the tag's name.
- Tag nesting is converted to parent-child relationships.
- A tag's *content* becomes a child node of the tag, with *content* as its label.
- Each attribute, a (*key*, *value*) pair, becomes two nodes (parent and child) labeled *key* and *value*, respectively.
- Attribute nodes are children of the corresponding tag node.
- Attribute nodes are ordered by their keys.
- Attribute nodes come before the content node.
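
The rules above can be illustrated with a minimal sketch. It is not the conversion script itself: it uses the standard-library `xml.etree.ElementTree` instead of lxml, the sample record is made up, and the backslash escaping of braces as well as the placement of nested tags after the content node are assumptions.

```python
import xml.etree.ElementTree as ET


def escape(label):
    # Assumption: braces and backslashes inside labels are backslash-escaped.
    return label.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def to_bracket(elem):
    # The tag itself is a node labeled with the tag's name.
    parts = ["{", escape(elem.tag)]
    # Attribute nodes come first, ordered by key; each key node has one value child.
    for key in sorted(elem.attrib):
        parts.append("{" + escape(key) + "{" + escape(elem.attrib[key]) + "}}")
    # The content node follows (skipping whitespace-only text is an assumption).
    text = (elem.text or "").strip()
    if text:
        parts.append("{" + escape(text) + "}")
    # Nested tags become child nodes (placing them after the content is an assumption).
    for child in elem:
        parts.append(to_bracket(child))
    parts.append("}")
    return "".join(parts)


# Hypothetical record, only for illustration.
SAMPLE = (
    '<article mdate="2017-05-28" key="journals/x/Doe17">'
    "<author>Jane Doe</author><year>2017</year></article>"
)
print(to_bracket(ET.fromstring(SAMPLE)))
# → {article{key{journals/x/Doe17}}{mdate{2017-05-28}}{author{Jane Doe}}{year{2017}}}
```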

## Dependencies

- **lxml** (XML processor for Python)
  http://lxml.de/installation.html
- **Python3**
  https://www.python.org/downloads/
- **wget**
- **gzip**

## Steps

**For repeatability, the scripts download a specific version of the data (hardcoded in ``download.sh`` and ``dblp-to-bracket.py``).**

Execute to download all necessary files.
```bash
./download.sh
```

Execute to convert the raw data file into bracket notation. **Takes around 10 minutes on an i5 laptop.**
```bash
python dblp_to_bracket.py
```

**Execute to get a subset only.**
```bash
python random_lines.py 100000 dblp.bracket > dblp_random_100k.bracket
```
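
How `random_lines.py` picks its lines is not described here. One common way to draw a fixed number of lines from a large file in a single pass is reservoir sampling, sketched below; the function name `sample_lines` and the `seed` parameter are hypothetical and not taken from the script.

```python
import random


def sample_lines(lines, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir


print(len(sample_lines(range(1000), 5, seed=42)))  # → 5
```

Because the input is consumed as a stream, the same function would work on an open file object without loading the whole dataset into memory.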

Execute to sort the dataset by tree size.
```bash
./sort_dataset.sh dblp.bracket
```
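
As an illustration of what sorting by tree size means for bracket notation, the sketch below counts one node per unescaped opening brace and sorts trees by that count. It does not reproduce `sort_dataset.sh`; the ascending order and the backslash escaping of braces are assumptions.

```python
def tree_size(line):
    """Count the nodes of a tree in bracket notation (one per unescaped '{')."""
    size = 0
    escaped = False
    for ch in line:
        if escaped:
            # The previous character was a backslash; skip this one.
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "{":
            size += 1
    return size


trees = ["{a{b}{c}}", "{a}", "{a{b}}"]
print(sorted(trees, key=tree_size))
# → ['{a}', '{a{b}}', '{a{b}{c}}']
```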

Execute to remove the homepage entries.
```bash
sed '/{www{key{homepages/d' dblp_sorted.bracket > dblp_no_www_sorted.bracket
```
or
```bash
awk '!/{www{key{homepages/' dblp_sorted.bracket > dblp_no_www_sorted.bracket
```

**(Optional)** Execute to delete all downloaded files. It leaves only the output dataset files.
```bash
./tidy-up.sh
```

## Troubleshooting

### Encoding
In case of an encoding error, follow the steps on this webpage (in German): [https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian](https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian).