Welcome to the home page of ACOPOST, a free and open source collection of part-of-speech taggers. In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up each word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children as the identification of words as nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms that associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags.
Download now (C source and Perl/Python scripts, .tar.gz)
Part-of-speech (POS) tagging is the task of assigning grammatical classes to words in a natural language sentence. It's important because subsequent processing stages (such as parsing or sentence translation) become easier if the word class of each word is available.
Here's an English example of a tagged sentence taken from the Wall Street Journal portion of the Penn Treebank:
Word | Part-of-speech tag |
---|---|
Measures | NNS |
of | IN |
manufacturing | VBG |
activity | NN |
fell | VBD |
more | RBR |
than | IN |
the | DT |
overall | JJ |
measures | NNS |
. | . |
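
As a rough illustration of what a tagger consumes and produces, the sentence above can be held as a list of (word, tag) pairs. The sketch below is only an assumption about a convenient in-memory representation: the function names are hypothetical, and the `word/TAG` notation it prints is a common convention, not necessarily the file format that the ACOPOST programs expect.

```python
from collections import Counter

# The example sentence from the table above, as (word, tag) pairs.
tagged_sentence = [
    ("Measures", "NNS"), ("of", "IN"), ("manufacturing", "VBG"),
    ("activity", "NN"), ("fell", "VBD"), ("more", "RBR"),
    ("than", "IN"), ("the", "DT"), ("overall", "JJ"),
    ("measures", "NNS"), (".", "."),
]


def split_words_and_tags(pairs):
    """Separate a tagged sentence into the tagger's input (words)
    and its expected output (one tag per word)."""
    words = [word for word, _ in pairs]
    tags = [tag for _, tag in pairs]
    return words, tags


def to_slash_format(pairs):
    """Render pairs as word/TAG tokens (illustrative notation only)."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


def tag_counts(pairs):
    """Count how often each tag occurs in the sentence."""
    return Counter(tag for _, tag in pairs)


words, tags = split_words_and_tags(tagged_sentence)
print(to_slash_format(tagged_sentence))
print(tag_counts(tagged_sentence).most_common(2))
```

A tagger receives only `words` and has to predict `tags`; comparing its prediction with the reference tags position by position gives the usual per-word accuracy.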
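ACOPOST is a set of freely available POS taggers modeled after well-known techniques. The programs are written in C (aiming for extreme portability and code correctness/safety) and run under various UNIX flavors (and probably even under Windows). ACOPOST currently consists of four taggers, each based on a different framework: maximum entropy [Ratnaparkhi 1998], trigram hidden Markov models [Rabiner 1990, Brants 2000], error-driven transformation-based learning [Brill 1993] and example-based (memory-based) learning [Daelemans et al. 1996].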
A detailed description, an extensive evaluation and new suggestions can be found in an accompanying technical report [Schröder 2002].
The project page at Sourceforge can be reached at http://sourceforge.net/projects/acopost/ where the latest releases can be found.
Mailing lists are available for announcements, for developers and for users at http://sourceforge.net/p/acopost/mailman/.
Thorsten Brants. 2000. TnT - a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), Seattle, WA, USA.
Eric Brill. 1993. Automatic grammar induction and parsing free text: A transformation-based approach. In Proceedings of the 31st Annual Meeting of the ACL.
Walter Daelemans, Jakub Zavrel, Peter Berck & Steven Gillis. 1996. MBT: A memory-based part of speech tagger-generator. In Eva Ejerhed & Ido Dagan, ed., Proceedings of the Fourth Workshop on Very Large Corpora, pages 14-27.
Ingo Schröder. 2002. A Case Study in Part-of-Speech tagging Using the ICOPOST Toolkit. Technical report FBI-HH-M-314/02. Department of Computer Science, University of Hamburg.
Lawrence R. Rabiner. 1990. A tutorial on hidden Markov models and selected applications in speech recognition. In Alex Waibel & Kai-Fu Lee, ed., Readings in Speech Recognition. Morgan Kaufmann, San Mateo, CA, USA, pages 267-290. See also Errata.
Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania.