Commit 0bd99fec authored by Mateusz Pawlik

dblp: Added command for removing homepage entries.

parent 857a4f43
@@ -37,14 +37,19 @@ Execute to convert the raw data file into bracket notation. **Takes around 10 mi
python dblp_to_bracket.py
```
Sample the dataset. **We perform a join on a subset only.**
**Execute to get a subset only.**
```bash
python random_lines.py 100000 dblp.bracket > dblp_random_100k.bracket
```
Execute to sort the dataset by tree size.
```bash
./sort_dataset.sh dblp_random_100k.bracket
./sort_dataset.sh dblp.bracket
```
Execute to remove the homepage entries.
```bash
sed '/{www{key{/d' dblp_sorted.bracket > dblp_no_www_sorted.bracket
```
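To check what the `sed` filter above removes, the same pattern can be tried on a small file first. This is a minimal sketch: `sample.bracket`, `sample_no_www.bracket`, and the three trees in them are made up for illustration and are not real DBLP records. Note that the pattern matches `{www{key{` anywhere on a line, not only at its start.
```bash
# Hypothetical three-line sample in bracket notation (not real DBLP data).
printf '%s\n' \
  '{article{key{journals/x/A1}}{title{T1}}}' \
  '{www{key{homepages/a/B}}{title{Home Page}}}' \
  '{inproceedings{key{conf/y/C2}}{title{T2}}}' > sample.bracket

# Count the lines that the sed pattern would delete.
grep -c '{www{key{' sample.bracket

# Delete them with the same sed expression and count the remaining lines.
sed '/{www{key{/d' sample.bracket > sample_no_www.bracket
wc -l < sample_no_www.bracket
```
On this sample, one line matches and two lines remain. Running the same `grep -c` on `dblp_sorted.bracket` before the removal should show how many homepage entries will be dropped.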
**(Optional)** Execute to delete all downloaded files. It leaves only the output dataset files.
......