Commit f5ccdceb authored by Mateusz Pawlik

Documenting DBLP: RAM and runtime estimates.

parent 791e1282
+11 −35
@@ -28,51 +28,27 @@ billions of pair.
  https://www.python.org/downloads/
- **wget**
- **gzip**
- **awk**

## Steps

**For repeatability, it downloads a specific version of the data (hardcoded in ``download.sh`` and ``dblp-to-bracket.py``).**

Execute to download all necessary files.
```bash
./download.sh
```
## RAM requirements

Execute to convert the raw data file into bracket notation. **Takes around 10 minutes on an i5 laptop.**
```bash
python dblp_to_bracket.py
```
Processing the DBLP dataset currently requires **16GB** of RAM.

**Execute to get a subset only.**
```bash
python random_lines.py 100000 dblp.bracket > dblp_random_100k.bracket
```
## Steps

Execute to sort the dataset by tree size.
```bash
./sort_dataset.sh dblp.bracket
```
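
As a rough illustration of what sorting by tree size means for bracket notation, the sketch below orders lines by their number of nodes. It is not the implementation of ``sort_dataset.sh``, whose internals are not shown here. The helper names `tree_size` and `sort_by_tree_size` are hypothetical, and the node count assumes that every node opens with exactly one `{` and that labels contain no brace characters.

```python
def tree_size(bracket_line):
    """Return the node count of one tree in bracket notation.

    Assumption: each node contributes exactly one '{' and labels
    never contain a brace, so counting '{' counts the nodes.
    """
    return bracket_line.count("{")


def sort_by_tree_size(lines):
    """Return the trees ordered from smallest to largest.

    sorted() is stable, so trees of equal size keep their input order.
    """
    return sorted(lines, key=tree_size)


if __name__ == "__main__":
    example = ["{a{b}{c}}", "{a}", "{a{b}}"]
    for tree in sort_by_tree_size(example):
        print(tree_size(tree), tree)
```

This version loads all lines into memory, which may matter for the full dataset given the RAM requirement noted in this README.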
**For repeatability, it downloads a specific version of the data (hardcoded in
``download.sh`` and ``dblp-to-bracket.py``).**

Execute to remove the homepage entries.
Execute the following to download and prepare the dataset.
```bash
sed '/{www{key{homepages/d' dblp_sorted.bracket > dblp_no_www_sorted.bracket
```
or
```bash
awk '!/{www{key{homepages/' dblp_sorted.bracket > dblp_no_www_sorted.bracket
./download_prepare.sh
```
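
For reference, the homepage filter that the ``sed`` and ``awk`` one-liners perform can be sketched in Python as follows. This is a minimal equivalent under the assumption that a plain substring match on `{www{key{homepages` is sufficient. The function name `drop_homepage_entries` is hypothetical, and nothing here indicates that ``download_prepare.sh`` does it this way.

```python
import sys

# Marker taken from the sed/awk commands: lines containing it are homepage entries.
HOMEPAGE_MARKER = "{www{key{homepages"


def drop_homepage_entries(lines):
    """Yield every line that does not contain the homepage marker."""
    for line in lines:
        if HOMEPAGE_MARKER not in line:
            yield line


if __name__ == "__main__":
    # Hypothetical usage: python filter_homepages.py < in.bracket > out.bracket
    sys.stdout.writelines(drop_homepage_entries(sys.stdin))
```

Because the function is a generator, it processes the file line by line and does not hold the dataset in memory.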

## Estimated time

**(Optional)** Execute to delete all downloaded files. It leaves only the output dataset files.
```bash
./tidy-up.sh
```
On an Intel Xeon 2.40GHz CPU, it takes around **15min**.

## Troubleshooting

### Encoding
In case of an encoding error, follow the steps on this webpage: [https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian](https://www.thomas-krenn.com/de/wiki/Perl_warning_Setting_locale_failed_unter_Debian).
 No newline at end of file

## Estimated time

Partially listed in Steps. Total time to be measured.
 No newline at end of file