Commit 74ecde05 authored by Thomas Huetter

updated readmes; added python asts

parent 42c89843
# Python Abstract Syntax Trees
## Short description
Abstract syntax trees of 150k Python programs found on GitHub. Data provided by the SRI Group of ETH Zurich.
## Conversion details
- The `type` of each JSON node becomes a node in the tree.
- If a node has the optional `value` attribute, its value is inserted as the first child of the corresponding type node.
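
As an illustration of these two rules, here is a minimal sketch that converts a tiny hand-made node list into bracket notation. The input layout (a list of nodes with `type`, an optional `value`, and optional `children` indices) mirrors what `parse_json.py` reads. The function name `to_bracket` and the sample tree are made up for this example, and escaping of special characters in values is left out for brevity.

```python
def to_bracket(nodes, index=0):
    """Return the bracket notation of the subtree rooted at nodes[index]."""
    node = nodes[index]
    parts = ['{', node['type']]
    # An optional value becomes the first child of its type node.
    if 'value' in node:
        parts.append('{' + node['value'] + '}')
    # Children are referenced by their index in the node list.
    for child in node.get('children', []):
        parts.append(to_bracket(nodes, child))
    parts.append('}')
    return ''.join(parts)


# Hypothetical three-node tree, not taken from the dataset.
sample = [
    {'type': 'Module', 'children': [1]},
    {'type': 'Expr', 'children': [2]},
    {'type': 'Num', 'value': '7'},
]
print(to_bracket(sample))  # {Module{Expr{Num{7}}}}
```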
## Dependencies
- **Python3**
https://www.python.org/downloads/
- **argparse**
- **wget**
- **tar**
- **awk**
- **sort**
- **cut**
## Steps
Run the following script to download all necessary files, convert them into bracket notation, and sort them.
```bash
./download_prepare.sh
```
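
The sorting step orders the trees by their number of nodes, which the helper script approximates by counting `{` characters per line. A minimal Python sketch of the same idea follows. The function names are hypothetical. Unlike the shell pipeline, this version skips braces escaped with a backslash, which is an assumption about the desired behavior rather than what the script does.

```python
def count_nodes(tree):
    """Count unescaped '{' characters in one bracket-notation line."""
    count = 0
    escaped = False
    for ch in tree:
        if escaped:
            escaped = False      # this character was escaped, skip it
        elif ch == '\\':
            escaped = True       # the next character is escaped
        elif ch == '{':
            count += 1
    return count


def sort_by_size(trees):
    """Return the trees sorted ascending by node count (stable sort)."""
    return sorted(trees, key=count_nodes)


# Made-up sample trees; the third one contains an escaped brace as a label.
trees = ['{a{b}{c}}', '{a}', r'{a{\{}}', '{a{b}}']
print(sort_by_size(trees))
```

Because `sorted` is stable, trees of equal size keep their original relative order.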
#!/bin/bash
# file: download_prepare.sh
#
# Program: Downloads and prepares data containing abstract syntax trees
# from python programs. (https://www.sri.inf.ethz.ch/py150)
#
# Author: Thomas Huetter
# download abstract syntax trees
wget http://files.srl.inf.ethz.ch/data/py150.tar.gz
# extract abstract syntax trees
tar -xzf py150.tar.gz
# change to extracted directory
cd py150
# convert the training set to bracket notation (parse_json.py requires Python 3)
python3 ../parse_json.py --inputfile python100k_train.json > python_ast.bracket
# convert the evaluation set and append it to the same file
python3 ../parse_json.py --inputfile python50k_eval.json >> python_ast.bracket
# sort the trees ascending by their size
../sort_dataset.sh python_ast.bracket
#!/usr/bin/env python
'''
File name: parse_json.py
Author: Thomas Huetter
Program: Reads Python and Javascript Abstract Syntax Trees
and converts them into bracket notation.
(https://www.sri.inf.ethz.ch/py150)
'''
import argparse
import json

# recursively traverses the tree and converts it step by step
# into bracket notation
def print_tree(json_tree, index):
    print('{' + json_tree[index]['type'], end='')
    if 'value' in json_tree[index]:
        # drop line breaks and surrounding whitespace, then escape
        # braces and backslashes so they are not read as tree structure
        value = json_tree[index]['value'].replace('\r', '').replace('\n', '').strip()
        value = value.translate(str.maketrans({"{": r"\{", "}": r"\}", "\\": r"\\"}))
        print('{' + value + '}', end='')
    if 'children' in json_tree[index]:
        for child in json_tree[index]['children']:
            print_tree(json_tree, child)
    print('}', end='')

# parse input arguments
parser = argparse.ArgumentParser()
parser.add_argument("--inputfile", type=str,
                    help="path to input file containing line-separated JSON ASTs")
args = parser.parse_args()

# each input line holds one tree; print one bracket-notation tree per line
with open(args.inputfile) as f:
    for line in f:
        json_tree = json.loads(line)
        print_tree(json_tree, 0)
        print()
#!/bin/bash
# sort by number of nodes (equivalent to number of "{")
awk '{print gsub("{","{"), $0}' "$1" | sort -n | cut -d' ' -f2- > "${1%.bracket}_sorted.bracket"
# Sentiment
## Short description
Syntax trees of movie ratings.
## Conversion details
- Replace '(' and ')' by '{' and '}'.
- Combine dev.txt and train.txt.
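
A minimal sketch of these two conversion steps in Python, assuming each input line holds one parenthesized tree and that parentheses never occur inside labels (an assumption, not something verified against the dataset). The helper names and sample lines are hypothetical, and the preparation script itself relies on the shell tools listed under Dependencies rather than on Python.

```python
# Translation table mapping parentheses to curly braces.
PAREN_TO_BRACE = str.maketrans('()', '{}')


def convert_line(line):
    """Replace '(' and ')' by '{' and '}' in a single tree."""
    return line.translate(PAREN_TO_BRACE)


def combine(*line_groups):
    """Concatenate the converted lines of several inputs, e.g. dev and train."""
    return [convert_line(line) for group in line_groups for line in group]


# Made-up sample lines standing in for dev.txt and train.txt.
dev = ['(3 (2 good) (2 film))']
train = ['(1 (2 dull) (2 plot))']
print(combine(dev, train))
```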
## Dependencies
- **wget**
- **unzip**
- **iconv**
- **sed**
- **awk**
## Steps
Run the following script to download all necessary files and convert them into bracket notation.
```bash
./download_prepare.sh
```
# X-Mark
NOT FINISHED.