ACOPOST - A COllection of POS Taggers

Quite hard to classify...

Welcome to the home page of ACOPOST, a free and open source collection of part-of-speech taggers. In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up each word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children as the identification of words as nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, with a set of descriptive tags.

Download now (C source and Perl/Python scripts, .tar.gz)

News

Current status (2012-11-22): The version patched for 64-bit systems is ready in Git. The bugs in t3 and met related to large and/or noisy lexicons seem to have been fixed. The maintenance team has been expanded to three members. We have made it compile and work on Mac OS X, and have created autoconf/automake scripts, as well as an RPM spec file. We are close to being able to release a new version. Until then, users interested in the new version should clone the Git repository. For more information on the project, please write to me (Tiago).
2010-03-04: Created a Git repository for the project, including this web page. Access information is given at https://sourceforge.net/projects/acopost/develop; the code can also be browsed online.
2010-03-04: Project changes: Tiago Tresoldi is the new maintainer; besides a new home page, the programs are being adapted to 64-bit systems and code is being cleaned.
2007-05-13: Tiago Tresoldi released his own patched version of ACOPOST, 1.8.6, which compiled with gcc versions 3 and 4, on his Hermes project page.
2002-09-23: Renamed ICOPOST to ACOPOST and moved the package to the Sourceforge repository of open source projects. Released version 1.8.4, which contained a preliminary user's guide. The project was put on hold because Ingo Schröder (the original maintainer) no longer had the time to maintain the package.
2002-04-24: Released version 1.8.3 beta, containing an additional tagger based on example-based techniques.
2001-08-21: Released version 0.9.0 (first public release).
2001-07-14: First public talk about ICOPOST.
2001-06-08: Web page started.

What is ACOPOST about?

Part-of-speech (POS) tagging is the task of assigning grammatical classes to words in a natural language sentence. It's important because subsequent processing stages (such as parsing or sentence translation) become easier if the word class of each word is available.

Here's an English example of a tagged sentence taken from the Wall Street Journal section of the Penn Treebank:

Word	Part-of-speech tag
Measures	NNS
of	IN
manufacturing	VBG
activity	NN
fell	VBD
more	RBR
than	IN
the	DT
overall	JJ
measures	NNS
.	.
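
As a small illustration, the tagged sentence above can be held in memory as a list of (word, tag) pairs. This is only a sketch in Python: the names `tagged_sentence`, `tag_counts` and `words_with_tag` are made up for this example and do not reflect ACOPOST's own file format or API.

```python
from collections import Counter

# The example sentence as (word, tag) pairs, copied from the table above.
tagged_sentence = [
    ("Measures", "NNS"),
    ("of", "IN"),
    ("manufacturing", "VBG"),
    ("activity", "NN"),
    ("fell", "VBD"),
    ("more", "RBR"),
    ("than", "IN"),
    ("the", "DT"),
    ("overall", "JJ"),
    ("measures", "NNS"),
    (".", "."),
]


def tag_counts(sentence):
    """Count how often each tag occurs in a list of (word, tag) pairs."""
    return Counter(tag for _, tag in sentence)


def words_with_tag(sentence, tag):
    """Return the words that carry the given tag, in sentence order."""
    return [word for word, t in sentence if t == tag]


counts = tag_counts(tagged_sentence)
plural_nouns = words_with_tag(tagged_sentence, "NNS")
```

A tagger's job is to produce the second element of each pair when only the words are given.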

ACOPOST is a set of freely available POS taggers modeled after well-known techniques. The programs are written in C (aiming for extreme portability and code correctness/safety) and run under various UNIX flavors (and probably even under Windows). ACOPOST currently consists of four taggers which are based on different frameworks:

Maximum Entropy Tagger MET: This tagger uses an iterative procedure to successively improve parameters for a set of features that help to distinguish between relevant contexts. It's based on a framework suggested by Ratnaparkhi [1998].
Trigram Tagger T3: This kind of tagger is based on Hidden Markov Models (HMM) where the states are tag pairs that emit words, i.e., it's based on transitional and lexical probabilities. The technique has been suggested by Rabiner [1990] and the implementation is influenced by Brants [2000].
Error-driven Transformation-based Tagger TBT: Transformation rules, which change the currently assigned tag depending on triggering context conditions, are learned from an annotated corpus. The general approach, as well as its application to POS tagging, has been proposed by Brill [1993].
Example-based tagger ET: Example-based models (also called memory-based, instance-based or distance-based) rest on the assumption that cognitive behavior can be achieved by looking at past experiences that resemble the current problem rather than learning and applying abstract rules. They have been suggested for NLP by Daelemans et al. [1996].
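
To make the HMM idea behind T3 a little more concrete, here is a minimal Viterbi decoding sketch in Python. It is not ACOPOST's implementation: it simplifies the model to bigrams (states are single tags rather than tag pairs), the function names are hypothetical, and all probabilities are invented toy numbers chosen only so the example runs.

```python
import math


def logp(p, floor):
    """Log of a probability, with a floor so that log(0) never occurs."""
    return math.log(max(p, floor))


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a bigram HMM.

    start_p[t]    : P(t at sentence start)
    trans_p[s][t] : P(t | previous tag s)  (transitional probability)
    emit_p[t][w]  : P(w | t)               (lexical probability)
    """
    if not words:
        return []
    # best[i][t]: log-probability of the best path that ends in tag t at word i.
    best = [{
        t: logp(start_p.get(t, 0.0), floor)
        + logp(emit_p[t].get(words[0], 0.0), floor)
        for t in tags
    }]
    back = []  # back[i][t]: best predecessor of tag t at word i + 1
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev, score = max(
                ((s, best[-1][s] + logp(trans_p[s].get(t, 0.0), floor))
                 for s in tags),
                key=lambda pair: pair[1],
            )
            scores[t] = score + logp(emit_p[t].get(word, 0.0), floor)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy model (assumed numbers, not estimated from any corpus).
TAGS = ["DT", "NN", "VBD"]
START_P = {"DT": 0.8, "NN": 0.15, "VBD": 0.05}
TRANS_P = {
    "DT": {"DT": 0.05, "NN": 0.9, "VBD": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VBD": 0.6},
    "VBD": {"DT": 0.5, "NN": 0.3, "VBD": 0.2},
}
EMIT_P = {
    "DT": {"the": 0.9},
    "NN": {"activity": 0.5, "fell": 0.05},
    "VBD": {"fell": 0.6},
}

result = viterbi(["the", "activity", "fell"], TAGS, START_P, TRANS_P, EMIT_P)
print(result)  # ['DT', 'NN', 'VBD']
```

A trigram tagger follows the same dynamic-programming scheme, but each state is a pair of tags, so the transition table conditions on the two preceding tags. Working in log space avoids numerical underflow on long sentences.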

A detailed description, an extensive evaluation and new suggestions can be found in an accompanying technical report [Schröder 2002].

Further information

The project page at Sourceforge can be reached at http://sourceforge.net/projects/acopost/ where the latest releases can be found.

Mailing lists are available for announcements, for developers and for users at http://sourceforge.net/p/acopost/mailman/.

References

Thorsten Brants. 2000. TnT - a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), Seattle, WA, USA.

Eric Brill. 1993. Automatic grammar induction and parsing free text: A transformation-based approach. In Proceedings of the 31st Annual Meeting of the ACL.

Walter Daelemans, Jakub Zavrel, Peter Berck & Steven Gillis. 1996. MBT: A memory-based part of speech tagger-generator. In Eva Ejerhed & Ido Dagan, eds., Proceedings of the Fourth Workshop on Very Large Corpora, pages 14-27.

Ingo Schröder. 2002. A Case Study in Part-of-Speech tagging Using the ICOPOST Toolkit. Technical report FBI-HH-M-314/02. Department of Computer Science, University of Hamburg.

Lawrence R. Rabiner. 1990. A tutorial on hidden Markov models and selected applications in speech recognition. In Alex Waibel & Kai-Fu Lee, eds., Readings in Speech Recognition. Morgan Kaufmann, San Mateo, CA, USA, pages 267-290. See also Errata.

Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania.

For contact: tresoldi at gmail dot com