Lexical Information induced from Lexicalised PCFGs
[Note that you have access to examples only. The full resources are freely available for education, research and other non-commercial purposes. Please contact me to obtain access to the complete sources.]

Head-Lexicalised Probabilistic Context-Free Grammars (HeadLex-PCFGs) are a lexicalised extension of PCFGs that incorporates lexical heads into the grammar rules, cf. Charniak (1997) and Carroll and Rooth (1998). The core of a HeadLex-PCFG is a context-free grammar with head-marking on the children. The parameters of the probabilistic version of the context-free grammar (for the unlexicalised PCFG, for a lexicalisation bootstrapping step, and for the lexicalised HeadLex-PCFG) are then estimated in an unsupervised training procedure, using the Expectation-Maximization algorithm (Baum, 1972). The algorithm iteratively improves the model parameters by alternately assessing frequencies and estimating probabilities.
We used the statistical parser LoPar to perform
the parameter training. The trained grammar model provides
lexicalised rules and syntax-semantics head-head
co-occurrences, as an empirical resource for inducing
quantitative lexical properties at the syntax-semantics
interface. The lexical information can be used for lexical
acquisition and for modelling linguistic phenomena. This page
provides lexical information for German and for English.
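The alternation between assessing frequencies and estimating probabilities can be illustrated with a minimal sketch. It is not LoPar's implementation: it assumes that the candidate parses of each sentence are listed explicitly, which is feasible only for toy data (parsers compute the same expectations over a parse chart instead). The four rules, the two-sentence corpus and the function name em_step are hypothetical.

```python
from collections import defaultdict
from math import prod

# Hypothetical toy rules, written as (left-hand side, right-hand side).
VP_PP = ("VP", "V NP PP")   # PP attached to the verb phrase
VP_NP = ("VP", "V NP")
NP_PP = ("NP", "N PP")      # PP attached to the noun phrase
NP_N = ("NP", "N")


def em_step(corpus, probs):
    """One EM iteration over a corpus of explicitly listed candidate parses.

    corpus: list of sentences; each sentence is a list of candidate parses;
            each parse is a list of the rules it uses.
    probs:  dict mapping each rule to its current probability.
    """
    # Assess frequencies: every parse contributes its rules, weighted by the
    # probability of that parse relative to its competitors.
    expected = defaultdict(float)
    for parses in corpus:
        weights = [prod(probs[rule] for rule in parse) for parse in parses]
        total = sum(weights)
        if total == 0.0:
            continue  # no parse is possible under the current model
        for parse, weight in zip(parses, weights):
            for rule in parse:
                expected[rule] += weight / total

    # Estimate probabilities: relative frequency among rules that share
    # the same left-hand side.
    lhs_totals = defaultdict(float)
    for (lhs, _rhs), freq in expected.items():
        lhs_totals[lhs] += freq
    new_probs = {}
    for rule, old in probs.items():
        total = lhs_totals.get(rule[0], 0.0)
        new_probs[rule] = expected.get(rule, 0.0) / total if total > 0 else old
    return new_probs


corpus = [
    [[VP_PP, NP_N], [VP_NP, NP_PP]],  # ambiguous sentence: two candidate parses
    [[VP_PP, NP_N]],                  # unambiguous sentence
]
probs = {VP_PP: 0.5, VP_NP: 0.5, NP_PP: 0.5, NP_N: 0.5}
for iteration in range(5):
    probs = em_step(corpus, probs)
    print(iteration + 1, round(probs[VP_PP], 3), round(probs[NP_PP], 3))
```

In this toy setting the unambiguous sentence provides evidence for the verb attachment, so each iteration shifts more of the ambiguous sentence's probability mass towards the parse that uses the same rules.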
References:
Eugene Charniak
Statistical Parsing with a Context-Free Grammar and Word Statistics
Proceedings of the 14th National Conference on Artificial Intelligence. Menlo Park, CA, 1997.
Glenn Carroll, Mats Rooth
Valence Induction with a Head-Lexicalized PCFG
Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing. Granada, Spain, 1998.
Helmut Schmid
LoPar: Design and Implementation
Arbeitspapiere des Sonderforschungsbereichs 340 Linguistic Theory and the Foundations of Computational Linguistics, No. 149. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 2000.
Sabine Schulte im Walde, Helmut Schmid, Mats Rooth, Stefan Riezler, Detlef Prescher
Statistical Grammar Models and Lexicon Acquisition [pdf/bib]
In: Christian Rohrer, Antje Rossdeutscher and Hans Kamp (eds)
Linguistic Form and its Computation. CSLI Publications, Stanford, CA, 2001.
Lexical Acquisition from the Huge German Corpus (HGC)
The German HeadLex-PCFG was trained on 35 million words
of the Huge German Corpus (HGC), a collection of newspaper
corpora from the 1990s. We provide the unlexicalised and
lexicalised grammar files for parsing, empirical word
frequencies, and lexical data on various linguistic
phenomena.
Grammar Sources (for Parsing):
- README
- unlexicalised parameters (linux)
- lexicalised parameters (linux)
- unlexicalised parameters (solaris)
- lexicalised parameters (solaris)
- frequencies of lemmas [example: random choice of 100 entries]
- frequencies of lemma-word combinations [example]
- frequencies of lemma-tag combinations [example]
- frequencies of full word forms [example]
- frequencies of word-tag combinations [example]
- frequencies of word-tag-lemma triples [example]
- verb frequencies [example: random choice of 100 entries]
... for verbs:
- frequency and probability distributions over 38 subcategorisation frame types [examples: achten; beginnen]
- ditto; over 183 subcategorisation frame types including pp specification [examples: achten; beginnen]
- frequencies of nominal fillers in subcategorised arguments, with reference to verb-frame-slot combination [examples: intransitive subjects for anfangen; direct objects for essen]
- frequencies of fillers in subcategorised prepositional phrases [examples: achten auf; wohnen in]
- frequencies of verb fillers in subcategorised finite clauses [examples: behaupten; zeigen]
- frequencies of verb fillers in subcategorised non-finite clauses [examples: anfangen; versuchen]
- frequencies of prepositional adjuncts to clauses [examples: abbrechen; enden]
- verb active vs. passive alternation (frequencies and probabilities) [examples]
- verb auxiliary alternation haben/sein (frequencies and probabilities) [examples]
... for nouns:
- frequency and probability distributions of adjective modifiers [examples: Anfang; Schule]
- frequency and probability distributions of genitive noun modifiers [examples: Anfang; Symbol]
- frequency and probability distributions of prepositional noun modifiers [examples: Angebot; Markt]
- frequency distributions of subcategorising verb-frame-slot combinations [examples: Buch as direct object; Bürger as intransitive subject]
- proper name tuples, with LLH and frequency values [top 50 entries]
... for adjectives:
- frequency and probability distributions of adjective modifiers [examples: demokratisch; englisch]
- frequency and probability distributions of adverbial modifiers [examples: gut; klein]
... for prepositions:
- frequency and probability distributions of subcategorised nouns [examples: case: acc, prep: durch; case: dat, prep: entsprechend]
Reference:
Sabine Schulte im Walde
Experiments on the Automatic Induction of German Semantic Verb Classes [pdf/bib]
PhD Thesis. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, June 2003. Published as AIMS Report 9(2).
[Chapter 3]
Lexical Acquisition from the British National Corpus (BNC)
The English HeadLex-PCFG was trained on approx. half of the BNC, 50 million words. The resulting grammar model was applied to obtain Viterbi parses for the whole corpus, 117 million words. From the Viterbi parses we then extracted lexical information about verbs, subcategorisation frames and arguments.
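A Viterbi parse is the most probable parse of a sentence under the trained grammar model. The following minimal sketch shows the idea for an unlexicalised PCFG in Chomsky normal form, using CKY-style dynamic programming. It is not the LoPar implementation and does not use the BNC grammar: the toy grammar, its probabilities and the function name viterbi_cky are assumptions made for illustration only.

```python
import math


def viterbi_cky(words, lexical, binary, start="S"):
    """Return (log probability, tree) of the most probable parse, or None.

    lexical: dict mapping (category, word) to a rule probability.
    binary:  dict mapping (parent, left child, right child) to a rule probability.
    """
    n = len(words)
    if n == 0:
        return None
    # chart[(i, j)] maps a category to its best (score, tree) for words[i:j].
    chart = {}
    for i, word in enumerate(words):
        cell = {}
        for (cat, entry), p in lexical.items():
            if entry == word:
                cell[cat] = (math.log(p), (cat, word))
        chart[(i, i + 1)] = cell
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):  # split point between the two children
                left, right = chart[(i, k)], chart[(k, j)]
                for (parent, lc, rc), p in binary.items():
                    if lc in left and rc in right:
                        score = math.log(p) + left[lc][0] + right[rc][0]
                        # Keep only the best analysis per category and span.
                        if parent not in cell or score > cell[parent][0]:
                            cell[parent] = (score, (parent, left[lc][1], right[rc][1]))
            chart[(i, j)] = cell
    return chart[(0, n)].get(start)


# Hypothetical toy grammar; probabilities sum to 1 per left-hand side.
binary = {
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 0.6,
    ("VP", "VP", "PP"): 0.4,
    ("NP", "NP", "PP"): 0.2,
    ("PP", "P", "NP"): 1.0,
}
lexical = {
    ("NP", "she"): 0.3,
    ("NP", "fish"): 0.3,
    ("NP", "forks"): 0.2,
    ("V", "eats"): 1.0,
    ("P", "with"): 1.0,
}

result = viterbi_cky("she eats fish with forks".split(), lexical, binary)
if result is not None:
    log_prob, tree = result
    print(math.exp(log_prob), tree)
```

With these toy probabilities the prepositional phrase attaches to the verb phrase rather than to the noun, because that analysis has the higher product of rule probabilities. Lexical information such as verbs, frames and argument heads can then be read off the selected tree.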
Parses:
- verbs with arguments and adjuncts, as extracted from Viterbi parses, one line per clause, active vs. passive mode, categories and heads, lemmatised version [example: random choice of 500 cases]
- 12,238 verb-particle combinations and their corpus frequencies [example: random choice of 50 combinations]
- 74 subcategorisation frames and their corpus frequencies
- 7,444 subcategorisation frames including pp-specifications, and their corpus frequencies
- joint frequencies of verb-particle combinations and subcategorisation frames [example: give]
- ditto; including pp-specifications [example: give]
- argument heads for verb-frame-slot combinations, including pp-specifications [examples: direct objects for read; pp:in for live; adjectives for feel]
Reference:
Sabine Schulte im Walde
Automatic Semantic Classification of Verbs According to their Alternation Behaviour [pdf/bib]
Diplomarbeit. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 1998.
[mainly: Section 2.1 and Appendix A]