Further thanks go to Victor Lavrenko, who contributed greatly to the clarification of my ideas with timely feedback, and to Burkhard Schafer, who advised in the early stages of this work and demonstrated with enthusiasm that academia is full of exciting opportunity.
A raised glass goes to my peers and colleagues at the University of Massachusetts, Amherst, and the University of Edinburgh, who made life as a PhD student so thoroughly enjoyable. Their comments, suggestions and tips filled small gaps in my knowledge when I needed a hand, and lightened my burden in more ways than one.
Of my colleagues, particular thanks go to Michael Bendersky for the code to extract collection statistics from an Indri index; to Jeff Dalton for fine dining and an invitation to pitch in on experiments; and to Sam Huston for backing up his interest in this work with the addition of PhRank queries to his weight optimization experiments, enabling comparison with a highly effective weighted sequential dependence model.
A warm hand goes to Dan, Alison and Roger, who cheerfully worked miracles with compute resources on my behalf. And extra special thanks are reserved for David Fisher, whose witty and insightful responses to questioning made engineering and thesis drafting positively fun. An unwitting contribution to this dissertation was made by Richard Johansson and Pierre Nugues, whose joint semantic-syntactic parser features in this research.
I thank the creators for their generosity in making these resources available.
Finally, in the realms of personal support, I would not have made it through without my partner, Gregor, whose understanding and encouragement were pivotal to my sanity and success. I am also thankful to friends who taught me, again and again, to have courage, curiosity and confidence. And to my mum and dad, who encouraged me to keep going and believed in me every step of the way. This is for you.

The contribution of the candidate and the other authors to this work has been explicitly indicated below. The candidate confirms that appropriate credit has been given within the thesis where reference has been made to the work of others.
Work from published papers under joint authorship appears as follows:

1. Chap 6: Experiments using human annotated training data, and supporting introductory material appeared at ACL. All work is credited to the candidate. Tamsin Maxwell, Jon Oberlander and W. Tamsin Maxwell and W.

This assumption is clearly unrealistic since documents are not bags of words, but treating them like bags of words simplifies engineering in IR systems.
The word independence assumption avoids difficulties with estimation of word dependence probabilities and any need to develop a single weighting scheme for WORDS and PHRASES (single weighting schemes tend to systematically favour phrases; Gao et al.). In practice, an assumption of word independence is also quite effective. It works relatively well because the meaning of phrases can often be interpreted as a function of the meanings of their component words.
Compositional semantics defines the meaning of a phrase to be a function of the meanings of its parts and the way they are put together. The combination of probabilities, or scores, assigned to documents on the basis of individual words is used to produce a ranking over documents. The independence assumption for IR has three basic shortcomings. First, words are not independent in reality; their context determines whether they are more or less likely to occur.
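The scoring under word independence described above can be illustrated with a minimal sketch. It assumes a unigram query-likelihood model with Jelinek-Mercer smoothing, chosen here only for illustration and not taken from this thesis; the names `score`, `rank` and the parameter `lam` are hypothetical. Each query word contributes an independent probability, and the combined score ranks the documents.

```python
import math
from collections import Counter


def score(query, doc, collection_counts, collection_len, lam=0.5):
    """Log-likelihood of the query given the document, treating words as independent.

    Assumption: Jelinek-Mercer smoothing mixes the document and collection
    estimates with weight `lam` (an illustrative choice, not the thesis model).
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    total = 0.0
    for word in query:
        p_doc = doc_counts[word] / doc_len if doc_len else 0.0
        p_col = collection_counts[word] / collection_len
        p = lam * p_doc + (1.0 - lam) * p_col
        if p == 0.0:
            # Word never seen in the collection: the query cannot be generated.
            return float("-inf")
        # Independence: per-word log probabilities are simply summed.
        total += math.log(p)
    return total


def rank(query, docs, lam=0.5):
    """Rank documents (name -> list of tokens) by descending query likelihood."""
    collection_counts = Counter(w for tokens in docs.values() for w in tokens)
    collection_len = sum(collection_counts.values())
    scored = [
        (name, score(query, tokens, collection_counts, collection_len, lam))
        for name, tokens in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because only word counts enter the score, a document containing "white house" and one containing "house painted white" receive the same score for the query "white house" when their lengths match. This is the kind of ambiguity that word associations are intended to resolve.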
Users are not interested in retrieving documents containing individual words, but documents containing particular senses of words and concepts (Krovetz). Models that incorporate word associations are thought to retrieve documents with more PRECISION than models that assume word independence because they more closely specify document content. Word associations constrain language context and thereby help to address certain challenges of semantic interpretation including word ambiguity and content specification. Yet in reality, relevance is contingent on both the semantic overlap between a request and a query and the interpretation of documents.
Unfortunately, it is not easy to infer the semantic overlap of a request and a query because the number of semantic interpretations in any formal analysis is likely to be exponential (Blackburn and Bos). Even a request with one word can have multiple meanings. In practice, this argument does not require existing definitions to be thrown out; it simply highlights the potential importance of semantics in IR for assessing the utility of specific word associations and words.
The selection of desirable word associations from a request demands consideration of two criteria: the accuracy with which associations capture language semantics, and the ability of those associations to discriminate relevant documents according to a standard definition of relevance. In other words, there is a need for both semantic representation the static interpretation of request semantics, in this case limited by selection of word associations, see Section 4.
This distinction natu-

4 There are many definitions of relevance that consider additional or alternative factors such as user preferences, prior knowledge, uncertainty about an underlying information need, differences in task definition, document like-ability, and whether similar documents have already been judged by a user for relevance (Lavrenko; Mizzaro).

As applied in IR, the approaches differ largely in their efficiency and the degree to which they mark the details of relationships.
Statistical methods leave word relationships unspecified, while the reverse is usually true for linguistic methods.
Systems that use linguistic processing tend to focus on accurate description of language complexity. An advantage of a linguistic approach is that identified word associations can probably be assumed to represent the semantics of the request. In contrast, a statistical approach aims to capture patterns in data rather than semantics. Statistical retrieval models fit language data well and are shown to be highly effective.
Probability theory gives a rich view of language structure and use (Manning), and a mathematical paradigm lends itself to the probabilistic detection of spurious word dependencies. Mathematical approaches are also often highly efficient, and scale well to real world IR systems. A substantial amount of research on statistical models concentrates on improving practical implementations. A statistical approach to IR is often considered preferable to one inspired by linguistics. Yet despite many benefits, a statistical approach has two major disadvantages. First, it accounts for observed data, but does not require the resulting model to be interpretable by humans or bear an obvious relation to accurate linguistic generalization.
This can make it difficult to recognise and correct systematic retrieval errors. It also discards an opportunity for interactive query tuning that can improve search performance. Interpretable queries generated by linguistically inspired approaches facilitate amendment and help users to recover when a system fails to retrieve desired documents. They can be particularly useful in domains such as law where search transparency is vital. More importantly, it is not immediately obvious how to focus a mathematical approach to optimally select informative word associations.
This difficulty is evidenced by more than 50 years of experimentation with word association: if brute force were sufficient, the solution would be clear by now. Machine learning provides an efficient framework for learning word associations, but machine learning algorithms are not always guaranteed to find the optimal solution.
Moreover, they can be confused by uninformative, unreliable or redundant features. As has been pointed out, the common weakness for learnt approaches is the lack of guidance on how to select features (Zhai). The wrong selection can reduce the separability of relevant and irrelevant documents, and make finding a good solution less probable. The critical step is selection of features that constitute the most profitable bias for learning.
Linguistic features enter the frame because they can supply a profitable bias for statistical learning at the same time as they provide some basis for semantic interpretation.
Non-statistical, rule-based processes operating at a small scale, such as syntactic interactions between individual words, produce patterns in language and thus make useful features. This forms a basis for modeling language in large scale IR systems, and means that systems built up from linguistic interactions at the sentence or word level can perform exceptionally well.
In this way, linguistics is related to IR as theory to evidence. It can guide development, provide a principled way to understand the consequences of feature selection, and facilitate insight into when and how word associations are likely to aid retrieval. Just like any other feature source, linguistics can supply a misleading bias for learning if it does not describe key aspects of language with respect to a particular task. The application of appropriate linguistic features is made difficult by the fact that there are many competing linguistic theories.
In addition, natural language processing (NLP) techniques can be complex and incur a substantial processing cost, making them impractical for large-scale applications. It can also be argued that linguists collect evidence to determine the principles governing production and understanding of language, while researchers in IR collect evidence to uncover the principles that govern document relevance. As a consequence, linguistics does not necessarily reveal anything of practical importance about document relevance. The problem seems to be that in many cases retrieval gains made using language processing components, such as part-of-speech tagging and shallow parsing (chunking), are offset by significant negative effects.
This results in minimal positive, or even negative, overall impact when compared to approaches that do not use any linguistic or domain knowledge (Brants; Lewis and Jones; Song and Croft). Phrase structure grammars emphasize, and are designed for, those aspects of language that adhere to a principle of compositionality: that the meaning of a phrase is a function of the meaning of its parts and the way they are put together syntactically.
In addition, discrete category assignments are prone to error (see Chapter 3).
Statistical techniques can overcome specific limitations of phrase structure theory and are highly effective (Bendersky and Croft; Lease et al.). Dependency theory provides an alternative interpretation of language structure that is nonetheless compatible with phrase structure theory.