Weblogs for market research: finding more relevant opinion documents using system fusion

Published date: 25 September 2009
Pages: 873-888
DOI: https://doi.org/10.1108/14684520911001882
Authors: Deanna Osman, John Yearwood, Peter Vamplew
Subject matter: Information & knowledge management, Library & information science
Deanna Osman, John Yearwood and Peter Vamplew
School of Information Technology and Mathematical Sciences,
University of Ballarat, Ballarat, Australia
Abstract
Purpose – The purpose of this paper is to examine the usefulness of fusion as a means of improving
the precision of automated opinion detection.
Design/methodology/approach – Five system fusion methods are proposed and tested using runs
submitted by the Text REtrieval Conference (TREC) Blog06 participants as input. The methods
include a voting method, an inverse rank method (IRM), a linear-normalised score method and two
weighted methods that use a weighted IRM score to rank the document.
Findings – Mean average precision (MAP) is used as an indicator of the performance of the runs in
this study. The best system fusion method achieves a 55.5 percent higher MAP result compared with
the highest MAP result of any individual run submitted by the Blog06 participants. This equates to an
increase in detection of 2,398 relevant opinion documents (21 percent).
Practical implications – System fusion can be used to improve upon the results achieved by
existing individual opinion detection systems. Alternatively, multiple opinion detection
approaches can be combined into one system and fusion used to combine the results to build in
diversity. Diversity within fusion inputs can increase the improvements achieved by fusion methods.
The improved output from a diverse opinion detection system will then contain a higher number of
relevant documents and reduce the incidence of high-ranking non-relevant documents and
low-ranking relevant documents.
Originality/value – The fusion methods proposed in this study demonstrate that simple fusion of
opinion detection systems can improve performance.
Keywords Market research, Internet, Communication technologies
Paper type Technical paper
Introduction
The number of people regularly accessing the internet is reported to have grown by
244.7 percent worldwide between 2000 and 2007 (Internet World Stats, 2007). One area
recording a high level of growth on the internet is weblogs (blogs). In December 2007 a
blog tracking company, Technorati (2007), reported that it was monitoring 112.8
million blogs worldwide, up from 4.2 million in October 2004 (Rosenbloom, 2004).
Along with the growth in the number of blogs on the internet, there is a growth in
interest in the content of blogs, particularly opinions within blogs. The majority of blog
authors surveyed by Lenhart and Fox (2006) indicated that their reason for blogging is
to share their knowledge, skills and life experiences. Often bloggers will express their
opinions about products, events and people affecting their lives.
These unsolicited opinions could prove invaluable for market research by
organisations who wish to gauge reactions to products and services. For example,
negative opinions about a competitor’s product may provide a competitive edge for a
new design, or governments could search blogs for data in qualitative research
regarding new policies or upcoming elections. Small businesses, which do not have a
large “market research” budget, could gain access to millions of people who potentially
have an opinion relating to them.
Online Information Review, Vol. 33 No. 5, 2009, pp. 873-888. © Emerald Group Publishing Limited, ISSN 1468-4527. Refereed article received 23 August 2008; approved for publication 20 December 2008.
Most users will not read all documents returned by a search engine. Jansen et al. (2000)
found that 58 percent of users do not read more than the first page of a list of relevant
documents. Therefore, the aim of this research is not only to create a list of documents
with a higher proportion of relevant opinion documents, but also to float the documents
with relevant opinions to the top of the list and the remaining documents to the bottom of
the list. The resulting list will include a higher number of documents expressing a
relevant opinion on the topic, which can then be used as a list for a search engine or as
input into an automated opinion analysis system. An automated analysis system would
require a set of documents with a high proportion of relevant opinion documents to enable
the system to quantify positive and negative opinions toward the topic.
In 2005 and 2006, the Text REtrieval Conference (TREC) created a blog document
collection (Blog06), comprising 3.2 million blog posts and comments. The tasks in 2006 for
the Blog06 collection included an “Opinion Retrieval Task” where participants retrieved
blogs expressing an opinion on each of 50 given topics. Participants could submit up to
five runs (a “run” is a ranked list of relevant documents submitted to TREC by the
participants), which included retrieved documents expressing an opinion on a given topic
(Ounis et al., 2006). A total of 56 runs, each consisting of the top 1,000 opinion-bearing
documents for each of the 50 topics,
were submitted by the 14 Blog06 participants. Of these, the top 100 documents from 27
runs were combined with the top ten documents from the remainder of the runs to create a
list of blog documents to be assessed by TREC assessors (Ounis et al., 2006). The runs
submitted in 2006 were used as input runs for the study discussed in this report.
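The pooling step described above can be sketched in a few lines. This is only an illustration of depth-based pooling under stated assumptions, not the procedure TREC actually ran: the function name `build_pool`, the `Run` type and the dictionary layout are hypothetical, and only the depths of 100 and ten come from the text.

```python
from typing import Dict, List, Set

# Hypothetical layout: topic id -> ranked document ids, best first.
Run = Dict[str, List[str]]


def build_pool(deep_runs: List[Run], shallow_runs: List[Run],
               deep_depth: int = 100, shallow_depth: int = 10) -> Dict[str, Set[str]]:
    """Union the top-ranked documents of every run, topic by topic.

    Runs in deep_runs contribute their top deep_depth documents and runs in
    shallow_runs contribute their top shallow_depth documents.
    """
    pool: Dict[str, Set[str]] = {}
    for runs, depth in ((deep_runs, deep_depth), (shallow_runs, shallow_depth)):
        for run in runs:
            for topic, ranked_docs in run.items():
                # A set removes documents retrieved by more than one run.
                pool.setdefault(topic, set()).update(ranked_docs[:depth])
    return pool
```

Because the pool is a set per topic, a document retrieved by several runs is assessed only once.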
Once the assessments were available, mean average precision (MAP) was calculated
for each run (NIST, 2005), measuring the precision of retrieval of documents relevant to
the given topic (irrespective of whether an opinion existed on the topic within the
document) and the retrieval of documents expressing an opinion on the given topic. MAP
has previously been used to measure relevance in TREC corpora; however, in Blog06
there were two measures: relevant documents and relevant opinion documents. The
MAP results were published in Ounis et al. (2006). MAP is a standard reporting measure
for TREC corpora. Precision (P) is calculated using formula (1a) and formula (1b)
calculates average precision (AP), where N is the total number of documents in the run
for the topic – in the Blog06 corpus, there was a maximum of 1,000 documents in each
run for each topic. MAP is the mean of the AP values for the 50 topics:
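\[
R_i = \begin{cases} 0 & \text{if document } i \text{ is not relevant} \\ 1 & \text{if document } i \text{ is relevant} \end{cases}
\qquad
P_i = \frac{\sum_{j=1}^{i} R_j}{i} \tag{1a}
\]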
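As a rough illustration of how these quantities fit together, the sketch below computes the precision of formula (1a), an average precision per topic and the mean over topics. Formula (1b) is not reproduced in this excerpt, so the AP normalisation used here (dividing by the number of relevant documents for the topic, as is conventional in TREC evaluation) is an assumption rather than the authors' stated definition. All function and parameter names are illustrative.

```python
from typing import Dict, Sequence


def precision_at(relevance: Sequence[int], i: int) -> float:
    """P_i as in formula (1a): relevant documents in the top i, divided by i."""
    return sum(relevance[:i]) / i


def average_precision(relevance: Sequence[int], total_relevant: int) -> float:
    """Sum P_i over the ranks i that hold a relevant document, then divide
    by the number of relevant documents for the topic.

    ASSUMPTION: conventional TREC normalisation; formula (1b) is not shown
    in this excerpt and may differ.
    """
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for i, r in enumerate(relevance, start=1):
        if r:
            hits += 1
            precision_sum += hits / i  # equals P_i at this rank
    return precision_sum / total_relevant


def mean_average_precision(relevance_by_topic: Dict[str, Sequence[int]],
                           relevant_counts: Dict[str, int]) -> float:
    """MAP: the mean of the per-topic AP values."""
    if not relevance_by_topic:
        return 0.0
    ap_values = [average_precision(rel, relevant_counts[topic])
                 for topic, rel in relevance_by_topic.items()]
    return sum(ap_values) / len(ap_values)
```

Here each `relevance` sequence holds the R_i values (0 or 1) of one run for one topic in rank order, so a run with 1,000 documents per topic yields sequences of length 1,000.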