Commit 0bd99fec authored by Mateusz Pawlik

dblp: Added command for removing homepage entries.

parent 857a4f43
+7 −2
@@ -37,14 +37,19 @@ Execute to convert the raw data file into bracket notation. **Takes around 10 mi
 python dblp_to_bracket.py
 ```
 
-Sample the dataset. **We perform a join on a subset only.**
+**Execute to get a subset only.**
 ```bash
 python random_lines.py 100000 dblp.bracket > dblp_random_100k.bracket
 ```
 
 Execute to sort the dataset by tree size.
 ```bash
-./sort_dataset.sh dblp_random_100k.bracket
+./sort_dataset.sh dblp.bracket
+```
+
+Execute to remove the homepage entries.
+```bash
+sed '/{www{key{/d' dblp_sorted.bracket > dblp_no_www_sorted.bracket
 ```
 
 **(Optional)** Execute to delete all downloaded files. It leaves only the output dataset files.
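
The homepage-removal step added in this commit can be tried on a small sample before running it on the full sorted dataset. The following is a minimal sketch, assuming that homepage records are exactly the lines containing the literal text `{www{key{`, as the `sed` pattern above implies. The sample records and the file names `sample.bracket` and `sample_no_www.bracket` are hypothetical and much shorter than real dblp entries.

```bash
# Hypothetical three-line sample in bracket notation (not real dblp data).
cat > sample.bracket <<'EOF'
{article{key{journals/x/1}}{title{A}}}
{www{key{homepages/y/2}}{title{Home Page}}}
{inproceedings{key{conf/z/3}}{title{B}}}
EOF

# Delete every line that contains "{www{key{".
# In a basic regular expression an unescaped "{" is a literal character.
sed '/{www{key{/d' sample.bracket > sample_no_www.bracket

# Two of the three sample lines should remain.
grep -c '' sample_no_www.bracket

# Count leftover homepage entries; grep exits non-zero when the count is 0.
grep -cF '{www{key{' sample_no_www.bracket || true
```

Because `sed` writes to a new file, the sorted input stays untouched and the filter can be rerun if the pattern needs adjusting.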