Masood Ghayoomi
PerTreeBank

The Persian Treebank (PerTreeBank)

PerTreeBank is an HPSG-based treebank for Persian, developed by Masood Ghayoomi within the CLaRK system. The treebank is built on the freely available Bijankhan Corpus, which is part-of-speech tagged.

PerTreeBank currently contains 1028 trees. The aim is to increase the number of sentences so that the treebank offers more comprehensive syntactic analyses of Persian. PerTreeBank is also available as a dependency treebank, called the Dependency Persian Treebank (DepPerTreeBank).

Both PerTreeBank and DepPerTreeBank are licensed under the LGPL for Linguistic Resources and are freely available for academic research purposes only. The data sets may not be used for any financial or military purposes.

Information about the features of PerTreeBank can be found here.

PerTreeBank is accessible here.
DepPerTreeBank in the CoNLL 2006 format is accessible here.
DepPerTreeBank in the CoNLL 2009 format is accessible here.
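
For readers unfamiliar with the CoNLL 2006 (CoNLL-X) format, the following is a minimal Python sketch of how such a file could be read. It assumes the standard ten tab-separated columns of the shared-task format, with a blank line between sentences. The function name `read_conll06` and the two-token sample are hypothetical and invented for illustration; the sample is not taken from DepPerTreeBank, and its tags are placeholders rather than the treebank's actual tag set.

```python
from typing import Dict, Iterable, Iterator, List

# Column names of the CoNLL-X (CoNLL 2006) shared-task format.
CONLL06_FIELDS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]


def read_conll06(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries.

    Hypothetical helper: assumes tab-separated columns and blank-line
    sentence boundaries, as in the CoNLL-X format description.
    """
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLL06_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        sentence.append(dict(zip(CONLL06_FIELDS, cols)))
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


# Toy sample, invented for illustration; NOT real DepPerTreeBank data.
SAMPLE = (
    "1\tword1\t_\tN\tN\t_\t2\tSBJ\t_\t_\n"
    "2\tword2\t_\tV\tV\t_\t0\tROOT\t_\t_\n"
    "\n"
)

sentences = list(read_conll06(SAMPLE.splitlines()))
for token in sentences[0]:
    # HEAD is the ID of the governing token; 0 marks the root.
    print(token["id"], token["form"], token["head"], token["deprel"])
```

To read a downloaded file instead of the sample, pass an open file object to `read_conll06`, for example `open(path, encoding="utf-8")` with `path` pointing at your local copy. The CoNLL 2009 release uses a different column layout, so this sketch does not apply to it unchanged.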

- Make sure that you have downloaded the latest release of PerTreeBank. The syntactic analyses or the morphosyntactic information in the data set may change to improve its quality, so please check this page regularly for the latest release.
- If you find any mistake or error in the data, please report it to Masood Ghayoomi. Such reports are highly appreciated.

News about the data:

Mar 22, 2014: uploading the Dependency PerTreeBank
Jul 12, 2013: binarizing coordination phrases; removing the 'nid' empty node and adding it as a feature of the mother node; correcting errors
Jul 26, 2012: adding the "Pragmatic" node; correcting errors
Jan 19, 2012: determining present perfect in the tags
Jan 18, 2012: renaming V nodes with respect to their types (simple, auxiliary, past participle, and infinitive forms) to reduce ambiguity; updating the DTD
Jan 04, 2012: adding discourse information for some sentences; providing the relevant DTD of the treebank
Dec 16, 2011: increasing the number of trees to 1012
Oct 20, 2011: release of the data
