dblp/README.md +7 −2

@@ -37,14 +37,19 @@ Execute to convert the raw data file into bracket notation. **Takes around 10 mi
 python dblp_to_bracket.py
 ```
 
-Sample the dataset. **We perform a join on a subset only.**
+**Execute to get a subset only.**
 ```bash
 python random_lines.py 100000 dblp.bracket > dblp_random_100k.bracket
 ```
 
 Execute to sort the dataset by tree size.
 ```bash
-./sort_dataset.sh dblp_random_100k.bracket
+./sort_dataset.sh dblp.bracket
+```
+
+Execute to remove the homepage entries.
+```bash
+sed '/{www{key{/d' dblp_sorted.bracket > dblp_no_www_sorted.bracket
 ```
 
 **(Optional)** Execute to delete all downloaded files. It leaves only the output dataset files.
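For readers without `sed`, the homepage-removal step added in this change can be approximated in Python. This is a minimal sketch, not part of the repository: it assumes the sorted file holds one tree per line and that homepage records are exactly the lines containing the literal text `{www{key{`, which is what the `sed` pattern matches. The names `HOMEPAGE_MARKER` and `drop_homepage_entries` are hypothetical.

```python
import sys

# Literal substring matched by the sed pattern '/{www{key{/d'.
HOMEPAGE_MARKER = "{www{key{"


def drop_homepage_entries(lines):
    """Return only the lines that do not contain the homepage marker."""
    return [line for line in lines if HOMEPAGE_MARKER not in line]


def main(argv):
    # Hypothetical command line: script.py INPUT OUTPUT
    if len(argv) != 3:
        return
    with open(argv[1], encoding="utf-8") as src:
        kept = drop_homepage_entries(src)
    with open(argv[2], "w", encoding="utf-8") as dst:
        dst.writelines(kept)


if __name__ == "__main__":
    main(sys.argv)
```

Saved under a name of your choice, it would be called with `dblp_sorted.bracket` as input and `dblp_no_www_sorted.bracket` as output, mirroring the `sed` command. Unlike `sed`, this sketch reads the whole file into memory before writing.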