Commit 74ecde05 authored by Thomas Huetter

updated readmes; added python asts

parent 42c89843

python_ast/README.md

0 → 100644
+29 −0
# Python Abstract Syntax Trees

## Short description

Abstract syntax trees of 150k Python programs found on GitHub. Data provided by the SRI Group of ETH Zurich.

## Conversion details

- The `type` attribute of each JSON node becomes the label of a tree node.
- If a JSON node has the optional `value` attribute, its value is inserted as the first child of that type node.
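
The two rules above can be illustrated with a minimal sketch. It assumes the flat-list JSON layout that `parse_json.py` reads (each node is a dict with `type`, optionally `value`, and `children` given as list indices). The function name `to_bracket` and the sample line are hypothetical and not part of the repository.

```python
import json

# Escape the characters that have a special meaning in bracket notation.
ESCAPES = str.maketrans({"{": r"\{", "}": r"\}", "\\": r"\\"})

def to_bracket(nodes, index=0):
    """Return the bracket notation of the subtree rooted at nodes[index]."""
    node = nodes[index]
    parts = ["{" + node["type"]]
    if "value" in node:
        # The optional value becomes the first child of the type node.
        parts.append("{" + node["value"].strip().translate(ESCAPES) + "}")
    for child in node.get("children", []):
        parts.append(to_bracket(nodes, child))
    parts.append("}")
    return "".join(parts)

# Hypothetical one-line AST, made up for illustration only.
line = ('[{"type": "Module", "children": [1]}, '
        '{"type": "Expr", "children": [2]}, '
        '{"type": "Num", "value": "1"}]')
example = to_bracket(json.loads(line))
print(example)  # {Module{Expr{Num{1}}}}
```

Unlike `parse_json.py`, this sketch builds a string instead of printing piece by piece, which makes the result easier to inspect.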

## Dependencies

- **Python3**
  https://www.python.org/downloads/
- **argparse**
- **wget**
- **tar**
- **awk**
- **sort**
- **cut**

## Steps

Run the following script to download all necessary files, convert them into bracket notation, and sort the trees by size.
```bash
./download_prepare.sh
```
 No newline at end of file
+25 −0
#!/bin/bash
# file: download_prepare.sh
#
# Program: Downloads and prepares data containing abstract syntax trees 
# from python programs. (https://www.sri.inf.ethz.ch/py150)
#
# Author: Thomas Huetter

# download abstract syntax trees
wget http://files.srl.inf.ethz.ch/data/py150.tar.gz

# extract abstract syntax trees
tar -xzf py150.tar.gz

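# change to extracted directory; stop if extraction did not create it
cd py150 || exit 1

# convert the training set to bracket notation
# (python3, since parse_json.py relies on Python 3 print syntax)
python3 ../parse_json.py --inputfile python100k_train.json > python_ast.bracket

# convert the evaluation set and append it to the same file
python3 ../parse_json.py --inputfile python50k_eval.json >> python_ast.bracket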

# sort the trees ascending by their size
../sort_dataset.sh python_ast.bracket
 No newline at end of file
+37 −0
#!/usr/bin/env python3
'''
    File name: parse_json.py
    Author: Thomas Huetter
    Program: Reads Python and JavaScript Abstract Syntax Trees
             and converts them into bracket notation.
             (https://www.sri.inf.ethz.ch/py150)
'''
import argparse
import json

# characters that must be escaped inside bracket notation labels
ESCAPES = str.maketrans({"{": r"\{", "}": r"\}", "\\": r"\\"})

# recursively traverses the tree and converts it step by step
# into bracket notation
def print_tree(json_tree, index):
  node = json_tree[index]
  print('{' + node['type'], end='')
  if 'value' in node:
    # strip line breaks and surrounding whitespace, then escape the value
    value = node['value'].replace('\r', '').replace('\n', '').strip()
    print('{' + value.translate(ESCAPES) + '}', end='')
  if 'children' in node:
    for child in node['children']:
      print_tree(json_tree, child)
  print('}', end='')

# parse input arguments
parser = argparse.ArgumentParser()
parser.add_argument("--inputfile", type=str,
  help="path to input file containing line-separated JSON ASTs")

args = parser.parse_args()

with open(args.inputfile) as f:
  for line in f:
    json_tree = json.loads(line)
    print_tree(json_tree, 0)
    print()
 No newline at end of file
+4 −0
#!/bin/bash

# prefix each tree with its number of "{" (used as the number of nodes),
# sort numerically by that prefix, then strip the prefix again
awk '{print gsub("{","{"), $0}' "$1" | sort -n | cut -d' ' -f2- > "${1%.bracket}_sorted.bracket"
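
The size-based sort can also be sketched in Python. This is an illustrative alternative, not the script used by the repository. The helper names `count_nodes` and `sort_by_size` are hypothetical. One deliberate difference: the sketch skips escaped braces (`\{`) inside labels, whereas counting every `{` also counts those.

```python
def count_nodes(tree):
    """Count opening braces that are not escaped with a backslash."""
    count = 0
    escaped = False
    for ch in tree:
        if escaped:
            # The previous character was a backslash; skip this one.
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "{":
            count += 1
    return count

def sort_by_size(trees):
    """Return the trees in ascending order of node count (stable sort)."""
    return sorted(trees, key=count_nodes)

# Made-up trees for illustration; the last one has an escaped brace in a label.
trees = ["{a{b}{c}}", "{a}", r"{a{x\{y}}"]
sorted_trees = sort_by_size(trees)
print(sorted_trees)
```

Trees with the same node count keep their input order here, which may differ from the tie-breaking of `sort -n`.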

sentiment/README.md

0 → 100644
+25 −0
# Sentiment

## Short description

Syntax trees of movie ratings.

## Conversion details

- Replace '(' and ')' by '{' and '}'.
- Combine dev.txt and train.txt.
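
A minimal sketch of the two conversion steps above, assuming each input line holds one parenthesized tree. The function names and the sample lines are hypothetical, and the sketch does not handle braces or backslashes that might already occur in the input.

```python
# Map parentheses to the braces used by bracket notation.
PAREN_TO_BRACE = str.maketrans({"(": "{", ")": "}"})

def convert_line(line):
    """Replace '(' and ')' by '{' and '}' in a single tree."""
    return line.strip().translate(PAREN_TO_BRACE)

def combine(*line_groups):
    """Convert every non-empty line of each group and concatenate the results."""
    return [convert_line(line)
            for group in line_groups
            for line in group
            if line.strip()]

# Made-up stand-ins for the contents of dev.txt and train.txt.
dev_lines = ["(3 (2 It) (4 works))\n"]
train_lines = ["(1 (2 Too) (1 slow))\n", "\n"]
combined = combine(dev_lines, train_lines)
print(combined)
```

In practice the two groups would be the opened files `dev.txt` and `train.txt`, since iterating a file object yields its lines.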

## Dependencies

- **wget**
- **unzip**
- **iconv**
- **sed**
- **awk**

## Steps

Run the following script to download all necessary files and convert them into bracket notation.
```bash
./download_prepare.sh
```
 No newline at end of file