Research Topics
Clustering of Distributed Word Representations and its Applicability for Enterprise Search
The hypothesis that the meaning of a word can be exhaustively described through its context is commonly attributed to the linguists Harris and Firth in the 1950s. As a logical consequence, the similarity of words can be expressed as a function of the similarity of their contexts. Using simplified neural network models, modern computers learn a reduced, distributed representation of words as vectors in a feature space. This group of techniques, also referred to as neural word embeddings, is a popular approach to natural language modelling in state-of-the-art research.
Knowledge management tools for enterprises aim to support their users by preparing and presenting information in a way that helps them cope with an otherwise unmanageable and rapidly growing volume of data. With an enhanced understanding of semantic relationships, such systems become more adaptable to the apparent irregularities of natural language and can take on even larger responsibilities. Similar to recent developments in web search, they can provide recommendations to refine, complement or disambiguate a query. A typical example from general vocabulary is the German word Erde, which can refer either to the Earth as a planet or to earth as soil. Based on semantic understanding, one good option for a system is to return results for the word sense that is more probable in the given context and to link to the other sense in the style of “Did you mean…?”. For this purpose, however, it is necessary to find a suitable distinguishing label such as “Erde (Planet)” or “Erde (Boden)”. If the system is aware of a term's position in a simplified taxonomy, it can furthermore offer UI controls for extending, shifting or refining the search area. Compared to other approaches, word embeddings hold out the prospect of low manual effort combined with good performance and certain customisation options, which is particularly attractive for the field of enterprise search.
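The “Did you mean…?” behaviour described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the three-dimensional vectors are invented toy values rather than learned embeddings, and the names `EMBEDDINGS`, `SENSES` and `rank_senses` are hypothetical. Each sense of Erde is represented by the mean vector of a few related words, and the query context is compared against both senses by cosine similarity.

```python
import math

# Invented toy embeddings (assumption); a real model would learn vectors
# with hundreds of dimensions from a large German corpus.
EMBEDDINGS = {
    "Planet": [0.9, 0.1, 0.0],
    "Sonne": [0.8, 0.2, 0.1],
    "Umlaufbahn": [0.9, 0.0, 0.1],
    "Boden": [0.1, 0.9, 0.1],
    "Garten": [0.0, 0.8, 0.2],
    "Pflanze": [0.1, 0.8, 0.3],
}

# Hypothetical sense inventory: each label is described by related words.
SENSES = {
    "Erde (Planet)": ["Planet", "Sonne", "Umlaufbahn"],
    "Erde (Boden)": ["Boden", "Garten", "Pflanze"],
}


def mean_vector(words):
    """Average the embeddings of all known words in the list."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known words to average")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_senses(context_words):
    """Return sense labels ordered from most to least similar to the context."""
    context = mean_vector(context_words)
    scored = [
        (cosine(context, mean_vector(members)), label)
        for label, members in SENSES.items()
    ]
    return [label for _, label in sorted(scored, reverse=True)]


primary, alternative = rank_senses(["Garten", "Pflanze"])
print(f"Showing results for {primary}. Did you mean {alternative}?")
```

In this sketch the top-ranked sense would drive the result list, while the second one would be offered as the alternative link.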
In cooperation with the local software company interface projects, the results of subsequent term clustering on German word embeddings shall be explored. The expected outcome is insight into how well these capture the semantic relationships described above. Although recent findings indicate remarkable improvements of distributed models over more traditional approaches in many tasks, there is scope for further research, especially with respect to the German language and the enterprise search infrastructure.
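To make the idea of term clustering on word embeddings more concrete, the following sketch groups a handful of German terms. The text does not state which clustering algorithm is to be used, so plain k-means is assumed here purely for illustration. The two-dimensional vectors are invented toy values, and `TERMS`, `kmeans` and `cluster_terms` are hypothetical names.

```python
import math

# Invented two-dimensional toy vectors (assumption), standing in for
# learned German word embeddings.
TERMS = {
    "Planet": (0.9, 0.1),
    "Sonne": (0.8, 0.2),
    "Mond": (0.85, 0.15),
    "Boden": (0.1, 0.9),
    "Garten": (0.2, 0.8),
    "Pflanze": (0.15, 0.85),
}


def nearest(point, centroids):
    """Index of the centroid with the smallest Euclidean distance to point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Plain k-means with fixed starting centroids and a fixed iteration count."""
    centroids = list(centroids)
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for point in points:
            groups[nearest(point, centroids)].append(point)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(column) / len(group) for column in zip(*group))
            if group else centroids[i]
            for i, group in enumerate(groups)
        ]
    return centroids


def cluster_terms(terms, seeds):
    """Cluster terms, using the vectors of the seed words as starting centroids."""
    centroids = kmeans(list(terms.values()), [terms[s] for s in seeds])
    clusters = [[] for _ in centroids]
    for word, vector in terms.items():
        clusters[nearest(vector, centroids)].append(word)
    return clusters


clusters = cluster_terms(TERMS, seeds=["Planet", "Boden"])
print(clusters)
```

Seeding the centroids with known words keeps the toy example deterministic; with real embeddings, the number of clusters and the initialisation would be open design choices.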
Supervisor: Birgit Demuth