Commit 0bd99fec authored by Mateusz Pawlik

dblp: Added command for removing homepage entries.

parent 857a4f43
@@ -37,14 +37,19 @@ Execute to convert the raw data file into bracket notation. **Takes around 10 mi
 python dblp_to_bracket.py
 ```
-Sample the dataset. **We perform a join on a subset only.**
+**Execute to get a subset only.**
 ```bash
 python random_lines.py 100000 dblp.bracket > dblp_random_100k.bracket
 ```
 Execute to sort the dataset by tree size.
 ```bash
-./sort_dataset.sh dblp_random_100k.bracket
+./sort_dataset.sh dblp.bracket
+```
+Execute to remove the homepage entries.
+```bash
+sed '/{www{key{/d' dblp_sorted.bracket > dblp_no_www_sorted.bracket
 ```
 **(Optional)** Execute to delete all downloaded files. It leaves only the output dataset files.
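The `sed` command added in this commit deletes every line that contains the literal text `{www{key{`. The following is a minimal sketch of that behavior on a tiny sample file. The file names `sample_sorted.bracket` and `sample_no_www_sorted.bracket` and the tree contents are made up for illustration and are not taken from the real dataset; only the `sed` pattern comes from the README.

```shell
# Build a three-line sample file. The trees are hypothetical and only
# imitate one-tree-per-line bracket notation.
cat > sample_sorted.bracket <<'EOF'
{article{key{journals/x/A1}}{title{T1}}}
{www{key{homepages/a/B}}{title{Home Page}}}
{inproceedings{key{conf/y/C2}}{title{T2}}}
EOF

# Same pattern as in the README: drop every line containing "{www{key{".
# In a basic regular expression an unescaped "{" is an ordinary character.
sed '/{www{key{/d' sample_sorted.bracket > sample_no_www_sorted.bracket

# Print what is left; the www line should be gone and the other two kept.
cat sample_no_www_sorted.bracket
```

Because the pattern is not anchored, it matches anywhere in a line, so any tree that contains `{www{key{` at any depth would be removed, not only trees that start with it.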