Theses
On the relationship of space and content of volunteered geographic information
Type of thesis
Dissertation
Authors
- Hahmann, Stefan
Supervisor
- Prof. Dr.-Ing. habil. Dipl.-Phys. Dirk Burghardt
Abstract
Over the last ten years the World Wide Web has developed significantly, evolving into the so-called “Web 2.0”. The most important feature of this new stage of the WWW is the participation of users in generating content. This trend facilitates the formation of user communities that collaborate on diverse projects in which they collect and publish information. Prominent examples of such projects are the online encyclopedia “Wikipedia”, the microblogging platform “Twitter”, the photo platform “Flickr” and the database of topographic information “OpenStreetMap”.
User-generated content that is directly or indirectly geospatially referenced is often termed more specifically “volunteered geographic information”. The geospatial reference of this information is established either directly, by coordinates given as meta-information, or indirectly, through georeferencing of toponyms or addresses contained in the information.
Volunteered geographic information is particularly suited for research, as it can be accessed at low or even no cost. Furthermore, it reflects a variety of human decisions that are linked to geographic space. In this thesis, the relationship between space and content of volunteered geographic information is investigated from two different perspectives.
The first part of this thesis addresses the question of what share of information exhibits a relationship between space and content such that the information can be located in geospace. In this context, the assumption that about 80% of all information has a reference to space is well known within the community of geographic information system users. Since the 1980s it has served as a marketing tool throughout the geoinformation sector, although no empirical evidence has been available. This thesis contributes to filling this research gap.
To validate the ‘80% hypothesis’, two approaches are presented. The first approach is based on a corpus of information that is as representative as possible of world knowledge. For this purpose the German-language edition of Wikipedia was selected. This corpus is modeled as a network of information in which the articles are the nodes and the cross-references are the edges of a directed graph. This network allows a graduated definition of geospatial reference, implemented by computing the distance from each article to the closest article in the network that has spatial coordinates assigned. In parallel, a survey-based approach is developed in which participants assign pieces of information to one of the categories “direct geospatial reference”, “indirect geospatial reference” and “no geospatial reference”. A synthesis of both approaches leads to an empirically justified figure for the “80% assertion”. The investigation finds that, for the Wikipedia corpus, 27% of the information may be categorized as directly geospatially referenced and 30% as indirectly geospatially referenced.
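The network distance described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis. It assumes that distance is counted as the number of links followed from an article until a georeferenced article is reached, and it uses a multi-source breadth-first search on the reversed link graph. The function name, the article names and the links are hypothetical and chosen only for illustration.

```python
from collections import deque


def distance_to_nearest_georeferenced(links, georeferenced):
    """Return {article: number of links to the closest georeferenced article}.

    links: dict mapping an article to the articles it links to (directed edges).
    georeferenced: set of articles that carry spatial coordinates.
    Articles that cannot reach any georeferenced article are left out.
    Assumption: distance is measured along outgoing links.
    """
    # Reverse the edges so that one search started at all georeferenced
    # articles finds every article that can reach them.
    reverse = {}
    for source, targets in links.items():
        reverse.setdefault(source, set())
        for target in targets:
            reverse.setdefault(target, set()).add(source)

    distance = {article: 0 for article in georeferenced}
    queue = deque(distance)
    while queue:
        article = queue.popleft()
        for predecessor in reverse.get(article, ()):
            if predecessor not in distance:
                distance[predecessor] = distance[article] + 1
                queue.append(predecessor)
    return distance


# Hypothetical toy network, not taken from the thesis corpus.
links = {
    "Dresden": ["Saxony"],
    "Semperoper": ["Dresden"],
    "Opera": ["Semperoper", "Music"],
    "Music": ["Opera"],
    "Algebra": ["Mathematics"],
    "Mathematics": ["Algebra"],
}
georeferenced = {"Dresden", "Semperoper"}
distances = distance_to_nearest_georeferenced(links, georeferenced)
```

In this toy network, "Opera" links directly to a georeferenced article and "Music" reaches one in two steps, while "Algebra" and "Mathematics" never reach one and are therefore absent from the result. A graduated definition could then treat distance 0 as a direct reference and small positive distances as indirect references; where such a threshold lies is a modeling decision that this sketch does not settle.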
The second part of the thesis investigates to what extent volunteered geographic information produced on mobile devices is related to the locations where it is published. For this purpose, a collection of microblogging texts produced on mobile devices serves as the research corpus. Microblogging texts are short texts that are published via the World Wide Web. For this type of information, the relationship between the content of the information and its position is less obvious than, for example, for topographic information or photo descriptions.
The analysis of microblogging texts offers new possibilities for market and opinion research, for the monitoring of natural events and human activities, and for decision support in disaster management. Spatial analysis of the texts may add extra value; for some of these applications it is in fact a necessary condition. For this reason, the relationship between the published content and the locations where it is generated is of interest.
This thesis describes methods that support the investigation of this relationship. In the presented approach, classified Points of Interest serve as a model of the environment. To investigate the correlation between these points and the microblogging texts, manual classification and natural language processing are used to classify the texts according to their relevance to the respective feature classes. Subsequently, it is tested whether the share of relevant texts in the proximity of objects of the tested classes is above average. The results show that the strength of the correlation between location and content depends on the tested feature class. While for the feature classes ‘train station’, ‘airport’ and ‘restaurant’ a significant dependency of the share of relevant texts on the distance to the respective objects can be observed, this is not confirmed for objects of other feature classes, such as ‘cinema’ and ‘supermarket’. However, as prior research describing investigations at small cartographic scale has detected correlations between space and content of microblogging texts, it can be concluded that the strength of the correlation between space and content of microblogging texts depends on scale and topic.
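The abstract does not name the statistical test used to check whether the share of relevant texts near objects of a class is above average. As one possible illustration, the sketch below applies a one-sided two-proportion z-test, which is a substitute technique chosen here and not necessarily the method of the thesis. The function name and all counts are hypothetical.

```python
import math


def share_above_average(near_relevant, near_total, far_relevant, far_total):
    """One-sided two-proportion z-test (illustrative assumption, see lead-in).

    Compares the share of relevant texts close to objects of a feature class
    with the share among all remaining texts.
    Returns (share_near, share_far, z, p_value), where p_value is the
    probability of a z at least this large if both shares were equal.
    """
    share_near = near_relevant / near_total
    share_far = far_relevant / far_total
    # Pooled share under the null hypothesis of equal shares.
    pooled = (near_relevant + far_relevant) / (near_total + far_total)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / near_total + 1 / far_total))
    z = (share_near - share_far) / std_error
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return share_near, share_far, z, p_value


# Made-up counts for illustration only: 30 of 200 texts near train stations
# are relevant to trains, compared with 50 of 1000 texts elsewhere.
share_near, share_far, z, p_value = share_above_average(30, 200, 50, 1000)
```

A small p-value would indicate that the share near the objects is higher than elsewhere for that feature class. Repeating the comparison for several distance bands would show how the share changes with distance, which is the kind of dependency the abstract reports for ‘train station’, ‘airport’ and ‘restaurant’.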
Assigned research areas
- Mobile cartography
- Automatic generalization
- Geoweb services
Assigned research projects
- Cartographic communication of user-generated content by means of mobile maps
Keywords
Volunteered Geographic Information, VGI, User Generated Content, UGC, Geographical information science, Wikipedia, Twitter, OpenStreetMap, Networks, Geospatial reference, Geographic information retrieval, machine learning, natural language processing
Reporting year
2014