SSiW




Resources

This site provides two kinds of resources: 1. lexical information induced from lexicalised PCFGs, and 2. human associations to German verbs. Each resource is explained below, including examples and references.

Please note that you have access to the examples only. The full resources are freely available for education, research and other non-commercial purposes. Please contact me to obtain access to the complete sources.

  1. Lexical information induced from lexicalised PCFGs


  2. Human associations to German verbs



1. Lexical Information induced from Lexicalised PCFGs

Head-Lexicalised Probabilistic Context-Free Grammars (HeadLex-PCFGs) are a lexicalised extension of PCFGs that incorporates lexical heads into the grammar rules, cf. Charniak (1997) and Carroll and Rooth (1998). The core of a HeadLex-PCFG is a context-free grammar with head-marking on the children. The parameters of the probabilistic version of the context-free grammar (for the unlexicalised PCFG, for a lexicalisation bootstrapping step, and for the lexicalised HeadLex-PCFG) are then estimated in an unsupervised training procedure, using the Expectation-Maximization algorithm (Baum, 1972). The algorithm iteratively improves the model parameters by alternating between two steps: assessing frequencies under the current model, and re-estimating probabilities from those frequencies.
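The alternation between assessing frequencies and estimating probabilities can be illustrated on a toy example. The following is a minimal sketch, not the training procedure used for the resources on this page: it assumes a hypothetical three-sentence corpus in which every sentence comes with an explicitly enumerated list of candidate parses, and each parse is simply the list of unlexicalised rules it uses. The rule names, the corpus, and the function `em` are all illustrative assumptions. A real parser would work on far larger grammars and would not enumerate parses one by one.

```python
from collections import defaultdict
from math import prod

# Hypothetical toy data (not taken from the actual grammars).
# A parse is the list of (lhs, rhs) rules used in its derivation.
VERB_ATTACH = [("VP", "V NP PP"), ("NP", "N")]
NOUN_ATTACH = [("VP", "V NP"), ("NP", "NP PP"), ("NP", "N")]
PLAIN = [("VP", "V NP"), ("NP", "N")]

corpus = [
    [VERB_ATTACH, NOUN_ATTACH],  # ambiguous sentence: two candidate parses
    [PLAIN],                     # unambiguous sentence
    [VERB_ATTACH],               # unambiguous sentence
]


def em(corpus, iterations=20):
    """Estimate rule probabilities with EM over enumerated parses."""
    rules = {rule for parses in corpus for parse in parses for rule in parse}
    by_lhs = defaultdict(list)
    for rule in sorted(rules):
        by_lhs[rule[0]].append(rule)

    # Start from a uniform distribution over the rules of each left-hand side.
    probs = {rule: 1.0 / len(by_lhs[rule[0]]) for rule in rules}

    for _ in range(iterations):
        # Step 1 (frequencies): expected rule counts under the current model.
        counts = defaultdict(float)
        for parses in corpus:
            weights = [prod(probs[r] for r in parse) for parse in parses]
            total = sum(weights)
            if total == 0.0:
                continue
            for parse, weight in zip(parses, weights):
                for rule in parse:
                    counts[rule] += weight / total

        # Step 2 (probabilities): relative frequencies per left-hand side.
        for group in by_lhs.values():
            norm = sum(counts[r] for r in group)
            if norm > 0.0:
                for rule in group:
                    probs[rule] = counts[rule] / norm
    return probs


probs = em(corpus)
for rule in sorted(probs):
    print(rule, round(probs[rule], 4))
```

In this toy setting the unambiguous sentences pull probability mass towards the rules they use, so the ambiguous sentence is increasingly resolved in favour of the verb-attachment parse.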

We used the statistical parser LoPar (Schmid, 2000) to perform the parameter training. The trained grammar model provides lexicalised rules and syntax-semantics head-head co-occurrences, which serve as an empirical resource for inducing quantitative lexical properties at the syntax-semantics interface. The lexical information can be used for lexical acquisition and for modelling linguistic phenomena. This page provides lexical information for German and for English.

References:

Eugene Charniak
Statistical Parsing with a Context-Free Grammar and Word Statistics
Proceedings of the 14th National Conference on Artificial Intelligence. Menlo Park, CA, 1997.

Glenn Carroll, Mats Rooth
Valence Induction with a Head-Lexicalized PCFG
Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing. Granada, Spain, 1998.

Helmut Schmid
LoPar: Design and Implementation
Arbeitspapiere des Sonderforschungsbereichs 340 Linguistic Theory and the Foundations of Computational Linguistics, No. 149. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 2000.

Sabine Schulte im Walde, Helmut Schmid, Mats Rooth, Stefan Riezler, Detlef Prescher
Statistical Grammar Models and Lexicon Acquisition [ps.gz/bib]
In: Christian Rohrer, Antje Rossdeutscher and Hans Kamp (eds)
Linguistic Form and its Computation. CSLI Publications, Stanford, CA, 2001.



Lexical Acquisition from the Huge German Corpus (HGC)

The German HeadLex-PCFG was trained on 35 million words of the Huge German Corpus (HGC), a collection of newspaper corpora from the 1990s. We provide the unlexicalised and lexicalised grammar files for parsing, empirical word frequencies, and lexical data on various linguistic phenomena.

Grammar Sources (for Parsing):

Statistical Corpus Frequencies:

Lexical Data:

  ... for verbs:
  ... for nouns:
  ... for adjectives:
  ... for prepositions:

Reference:

Sabine Schulte im Walde
Experiments on the Automatic Induction of German Semantic Verb Classes [ps.gz/bib]
PhD Thesis. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, June 2003. Published as AIMS Report 9(2).
[Chapter 3]



Lexical Acquisition from the British National Corpus (BNC)

The English HeadLex-PCFG was trained on approx. half of the BNC, 50 million words. The resulting grammar model was applied to obtain Viterbi parses for the whole corpus, 117 million words. From the Viterbi parses we then extracted lexical information about verbs, subcategorisation frames and arguments.

Parses:

  • verbs with arguments and adjuncts, as extracted from the Viterbi parses: one line per clause, active vs. passive mode, categories and heads, lemmatised version
    [example: random choice of 500 cases]

Lexical Data:

Reference:

Sabine Schulte im Walde
Automatic Semantic Classification of Verbs According to their Alternation Behaviour [ps.gz/bib]
Diplomarbeit. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 1998.
[mainly: Section 2.1 and Appendix A]



2. Human Associations to German Verbs

The data collection was performed as a web experiment, which asked native speakers to provide associations to German verbs.

Material: 330 verbs were selected for the experiment. They were drawn from a variety of semantic classes, including verbs of self-motion (e.g. gehen 'walk', schwimmen 'swim'), transfer of possession (e.g. kaufen 'buy', kriegen 'receive'), cause (e.g. verbrennen 'burn', reduzieren 'reduce'), experiencing (e.g. hassen 'hate', überraschen 'surprise'), communication (e.g. reden 'talk', beneiden 'envy'), etc. The stimulus verbs were divided randomly into 6 separate experimental lists of 55 verbs each. The lists were balanced for class affiliation and frequency ranges (0, 100, 500, 1000, 5000), such that each list contained verbs from each broadly defined semantic class and had an equivalent overall verb frequency distribution.
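One way to obtain such balanced lists is stratified random assignment. The sketch below is an assumption about how a split of this kind could be done, not a description of the procedure actually used: verbs are grouped by semantic class and frequency band, shuffled within each group, and dealt out to the lists in turn. The function `split_balanced`, the placeholder verbs, and the class and band labels are all hypothetical.

```python
import random
from collections import defaultdict


def split_balanced(verbs, n_lists=6, seed=0):
    """Deal verbs into n_lists lists, balanced over (class, band) strata.

    verbs: iterable of (verb, semantic_class, frequency_band) triples.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for verb, semantic_class, band in verbs:
        strata[(semantic_class, band)].append(verb)

    lists = [[] for _ in range(n_lists)]
    next_list = 0
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)  # random order within each stratum
        for verb in members:
            lists[next_list].append(verb)
            # Carry the position over between strata so list sizes stay equal.
            next_list = (next_list + 1) % n_lists
    return lists


# Hypothetical placeholder stimuli: 330 verbs, 5 classes, 5 frequency bands.
bands = [0, 100, 500, 1000, 5000]
verbs = [(f"verb{i}", f"class{i % 5}", bands[(i // 5) % 5]) for i in range(330)]

experimental_lists = split_balanced(verbs)
print([len(lst) for lst in experimental_lists])
```

Because the verbs of each stratum are handed out one list at a time, the number of verbs from any one stratum differs by at most one between lists.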

Data: 299 native German speakers participated in the experiment, between 44 and 54 for each data set. In total, we collected 79,480 associates distributed over 39,254 different response types. Each trial elicited an average of 5.16 associate responses, with a range of 0-16. We pre-processed all data sets in the following way: for each stimulus word, we quantified over all responses in the experiment, disregarding the order in which associates were provided and, for noun stimuli, the presentation type of the questionnaire. The result is a frequency distribution for each stimulus word, providing a frequency for each response type. The responses were not distinguished according to polysemic senses of the stimuli.
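The pre-processing step described above amounts to counting response types per stimulus across all participants. The following minimal sketch shows the idea; the trial records, participant identifiers, and responses are invented for illustration and are not taken from the actual data, and the function name `association_norms` is likewise an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical trial records: (participant, stimulus, responses in given order).
trials = [
    ("p1", "klagen", ["Gericht", "jammern"]),
    ("p2", "klagen", ["jammern", "Anwalt", "Gericht"]),
    ("p3", "klagen", ["Gericht"]),
    ("p1", "gehen", ["laufen"]),
]


def association_norms(trials):
    """Return one frequency distribution over response types per stimulus."""
    norms = defaultdict(Counter)
    for _participant, stimulus, responses in trials:
        # The order of the responses within a trial is deliberately ignored.
        norms[stimulus].update(responses)
    return norms


norms = association_norms(trials)
for response, frequency in norms["klagen"].most_common():
    print(response, frequency)
```

Each stimulus thus maps to a table of response types with their frequencies, summed over all participants who saw that stimulus.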

Example: klagen

Related Resources:
- Alissa Melinger and Andrea Weber collected associations to German nouns. Their Database of Noun Associations for German can be accessed online.
- Annamaria Guida collected associations to Italian verbs in the same way as we collected the associations to German verbs.
- Daniela Marzo collected associations to French verbs in the same way as we collected the associations to German verbs.

References:

Sabine Schulte im Walde, Alissa Melinger
An In-Depth Look into the Co-Occurrence Distribution of Semantic Associates [doi/bib; pre-print version: pdf; errata: pdf]
Italian Journal of Linguistics 20(1):89-128, 2008. Special Issue on "From Context to Meaning: Distributional Models of the Lexicon in Linguistics and Cognitive Science".

Sabine Schulte im Walde, Alissa Melinger, Michael Roth, Andrea Weber
An Empirical Characterisation of Response Types in German Association Norms [doi/bib; pre-print version: pdf]
Research on Language and Computation 6(2):205-238, 2008.

Annamaria Guida
The Representation of Verb Meaning within Lexical Semantic Memory: Evidence from Word Associations
Master's Thesis. Università degli Studi di Pisa, 2007.