Sustainable Questions
27 August 2012
Bart de Goede
Universiteit van Amsterdam
Determining the expiration date of answers
Supervisors: Maarten de Rijke, Anne Schuth
Outline
• Introduction to CQA
• Problem statement
• Approach
• Cluster similar questions
• Compare answers in clusters
• Classify sustainable clusters
• Discussion and conclusion
Community Question Answering
• Community of users asking and answering questions
• Natural language
• Formally, a service that involves:
1) A method for a person to present his/her information need in natural language,
2) a place where other people can respond to that information need and
3) a community built around such a service based on participation. (Shah et al., 2009)
Community Question Answering
• CQA-services have many answered questions
• CQA-retrieval aims to find answered questions similar to the question a user posts
• However, not all questions may be readily reused:
• Who designed the Eiffel Tower? Alexandre Gustave Eiffel.
• Who is the prime minister of the UK? Now: David Cameron. Before: Gordon Brown.
Community Question Answering
• Some questions are sustainable and can readily be reused, others are not
• A question is sustainable if the answer to that question is independent of the point in time the question is asked
• So, if the answer to semantically similar questions over time does not change, the questions are considered sustainable
Problem statement
RQ1: What are the distinguishing properties of sustainable questions?
RQ2: Can we measure these properties of sustainability?
RQ3: Can we tell sustainable and non-sustainable questions apart based on these properties?
Research questions
Approach: What makes a question sustainable?
1. Cluster semantically similar questions
2. Compare answers in each cluster
3. Classify clusters as sustainable
Cluster semantically similar questions
• Questions are semantically similar if they would be satisfied by the same information when asked at the same time
• However, questions tend to be
• very short
• phrased in different ways
• noisy
• littered with function words
Cluster semantically similar questions
• Latent Semantic Analysis (LSA; Deerwester et al., 1990) or Latent Dirichlet Allocation (LDA; Blei et al., 2003)
• topic modeling techniques
• cosine distance between topic vectors
• Locality Sensitive Hashing (LSH; Charikar, 2002)
• Used for near-duplicate detection
• Intuition: near-duplicates are very likely to be similar
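The LSH variant above can be illustrated with a small sketch of Charikar-style random-hyperplane hashing: each question vector is reduced to an n-bit signature, and near-duplicate questions end up with signatures differing in few bits. The dimensionality, bit count and random vectors below are illustrative assumptions, not the configuration used in the thesis.

```python
# Sketch of random-hyperplane LSH (Charikar, 2002) for near-duplicate
# question detection; vector dimensionality and bit count are assumptions.
import numpy as np

rng = np.random.default_rng(42)

def lsh_signature(vec: np.ndarray, planes: np.ndarray) -> int:
    """Pack the sign of each hyperplane projection into an integer signature."""
    bits = (planes @ vec) >= 0
    sig = 0
    for b in bits:
        sig = (sig << 1) | int(b)
    return sig

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")

dim, n_bits = 100, 32            # e.g. 32-bit signatures, as in Table 2
planes = rng.standard_normal((n_bits, dim))

q1 = rng.standard_normal(dim)
q2 = q1 + 0.01 * rng.standard_normal(dim)   # near-duplicate of q1

# Near-duplicates agree on almost all bits.
print(hamming(lsh_signature(q1, planes), lsh_signature(q2, planes)))
```

Increasing the number of bits (16, 24, 32, 40 in Table 2) trades recall for precision: longer signatures split the space into more, smaller buckets.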
Cluster semantically similar questions
• Manually labeled set of 559 question pairs
• Calculate accuracy on samples of Yahoo! Answers Comprehensive Questions and Answers version 1.0
                    sample size
algorithm       10K      100K     all
LDA             0.435    0.500    -
LSA             0.706    0.638    -
LSH 16 bits     0.472    0.484    0.500
LSH 24 bits     0.465    0.502    0.495
LSH 32 bits     0.512    0.514    0.509
LSH 40 bits     0.523    0.537    0.542

Table 2: Accuracy of several question clustering methods. Missing values represent experiments that never terminated.
In order to overcome vocabulary mismatch (different words with the same meaning are used in either the question or the answer) and different ways of spelling, and to improve overall matching on a semantic level, we use the semantic linking system of Meij et al. [24], developed to determine concepts in tweets.

This system approaches a similar problem: finding out what short pieces of text are about. In this operationalisation, pages on Wikipedia are considered as concepts. Subsequently, a model is trained to estimate the probability that a concept c is the target of a hyperlink (in Wikipedia) with an anchor text containing an n-gram q. Given a question or an answer, we obtain the set of concepts that are likely to be linked to by occurrence of n-grams in that question or answer in Wikipedia anchor texts, as well as the score (a sum of the probabilities of all n-grams in the piece of text linking to that specific concept).

Using these concepts as document vectors, and their scores as the values in those vectors, rather than tf-idf on the bag-of-words representation of questions and answers, we hope to obtain a less noisy vector space (change between answers is based solely on concepts present in the text), and to diminish the influence of spelling and vocabulary mismatch. We will refer to documents processed this way as 'semanticized'.
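A minimal sketch of comparing such 'semanticized' concept vectors follows. The link-probability table, the token-level matching and the example values are all toy assumptions; the real system of Meij et al. learns these probabilities from Wikipedia anchor texts and handles n-grams longer than single tokens.

```python
# Hedged sketch of the 'semanticized' representation: each document becomes a
# sparse vector over Wikipedia concepts, with the summed link probabilities of
# its n-grams as values. LINK_PROB holds invented toy probabilities.
import math

# P(concept is the link target | anchor text contains this n-gram) -- toy values.
LINK_PROB = {
    ("eiffel", "Eiffel Tower"): 0.8,
    ("tower", "Eiffel Tower"): 0.3,
    ("paris", "Paris"): 0.9,
}

def semanticize(tokens):
    """Map a token list to a concept vector: concept -> summed link probability."""
    vec = {}
    for tok in tokens:
        for (ngram, concept), p in LINK_PROB.items():
            if tok == ngram:
                vec[concept] = vec.get(concept, 0.0) + p
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

a = semanticize("who built the eiffel tower".split())
b = semanticize("eiffel tower paris".split())
print(cosine(a, b))
```

Because both documents map onto the same concept ("Eiffel Tower") despite different surface words, the similarity stays high, which is the point of the semanticized space.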
We view the clustering of questions as a preprocessing step and therefore take it as part of the experimental setup. We explore three approaches to finding similar questions: latent semantic analysis [8], latent Dirichlet allocation [4] and locality sensitive hashing [6].
From the output of each clustering method on the 10K dataset, we sampled 559 pairs of questions and manually labeled 205 as correctly clustered together and 354 as wrongly clustered together. We used the combined set of labels (randomly sampling 205 questions from the wrongly-clustered set) to arrive at the results in Table 2; for each labeled pair of questions we observe whether the algorithm was correct in either putting both questions in the same cluster or keeping them separate.
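The pairwise evaluation just described can be sketched as follows; the question ids, cluster assignments and pair labels are invented for illustration.

```python
# Sketch of the pairwise clustering evaluation: a clustering is correct on a
# labeled pair if it groups a should-be-together pair, or separates a
# should-be-apart pair. All ids and labels below are made up.

def pair_accuracy(pairs, cluster_of):
    """pairs: [(q1, q2, should_cluster)]; cluster_of: question id -> cluster id."""
    correct = 0
    for q1, q2, should_cluster in pairs:
        same = cluster_of[q1] == cluster_of[q2]
        correct += same == should_cluster
    return correct / len(pairs)

cluster_of = {"q1": 0, "q2": 0, "q3": 1, "q4": 2}
pairs = [
    ("q1", "q2", True),    # correctly clustered together
    ("q1", "q3", True),    # wrongly kept apart
    ("q3", "q4", False),   # correctly kept apart
    ("q2", "q4", False),   # correctly kept apart
]
print(pair_accuracy(pairs, cluster_of))  # 3 of 4 pairs handled correctly -> 0.75
```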
Based on these accuracy results we decided on using LSA as our clustering approach for the remainder of our experiments. We also decided on taking the sample of 10K documents as the basis for our analysis.
While we consider clustering of similar questions as a preprocessing step for our approaches to sustainability, we cannot ignore the fact that obtaining a reasonable clustering performance is important for our sustainability estimation. Therefore, we opt to manually label data for further investigation, as our clustering methods performed rather poorly.

We manually divided the 904 clusters in the output of our LSA clustering approach on the aforementioned subset of 10K questions into three classes: 752 all clusters, 143 clusters with similar questions and 7 clusters with sustainable questions.
[Figure 5: Kernel density estimation of the average cosine distance (i.e. change rate) between answers labeled as best according to either the user or the community. x-axis: average cosine distance; y-axis: density; curves for the all, similar and sustainable classes.]
The clusters in the similar class are only required to have similar questions (questions asking for the same information), regardless of the answers; these clusters can thus be either sustainable or unsustainable. Additionally, the clusters in the sustainable class are required to have answers that do not change over time. Note that this definition implies that the sustainable class is a subset of the similar class, which is a subset of the all class.
Subsequently, for each cluster we compute cosine distances between chronologically sorted best answers, as described in section 3.2.1. For each set of distances (per cluster), we compute the average, standard deviation, average change per day, standard deviation on the cumulative distances, as well as the slope and sum of squared errors of a linearly fitted function on the cumulative distances.
Also, we compute for each cluster the time between the moment a question was posted and the answer labeled as best answer, and the time until the last answer that question received, as described in section 3.2.2. For each set of these distances in time, we compute the average, standard deviation, standard deviation on the cumulative distances in time, as well as the slope and sum of squared errors of a linearly fitted function on the cumulative distances in days.
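The per-cluster change features above can be sketched as follows; the distance values are invented, and the exact feature definitions in the thesis may differ slightly.

```python
# Sketch of the per-cluster change features: given cosine distances between
# chronologically sorted best answers, compute the average and standard
# deviation of the distances, plus the slope and sum of squared errors of a
# line fitted to the cumulative distances. Input values are illustrative.
import numpy as np

def change_features(distances):
    d = np.asarray(distances, dtype=float)
    cum = np.cumsum(d)
    x = np.arange(len(cum))
    slope, intercept = np.polyfit(x, cum, 1)   # linear fit on cumulative distances
    sse = float(np.sum((cum - (slope * x + intercept)) ** 2))
    return {
        "avg": float(d.mean()),
        "std": float(d.std()),
        "cum_std": float(cum.std()),
        "slope": slope,
        "sse": sse,
    }

# A stable (sustainable-looking) cluster: best answers barely change over time.
print(change_features([0.05, 0.04, 0.06, 0.05]))
```

A flat, low slope with small SSE suggests answers that stay the same; a steep or erratic cumulative curve suggests a drifting answer.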
4.2 Results
Figure 5 shows a kernel density estimation plot⁹ of the average cosine distance between the best answers for each class of clusters. Although there seems to be some evidence that this metric distinguishes similar and sustainable clusters from regular clusters, it is not that strong.
Figure 6 also shows a kernel density estimation plot, for the average cosine distance between the semanticized vector representations of the best answers for each class of clusters. As the remarkably similar plots for the distances between tf-idf vectors and semanticized vector representations in the single cluster example in section 3.2.1 already suggested, semanticizing answers does not seem to be an improvement on the traditional tf-idf bag-of-words representation.
Also, considering time to answer as a distinctive property of sustainable questions does not yield decisive results. Figure 7 shows a kernel density estimation of the average time in days between the posting of a question and the question being marked as resolved.

⁹We use kernel density estimation because it models the density of data points at a value. In this way, a fairer comparison between the instances of our three classes can be made; we have far fewer sustainable than similar questions [37].
Compare answers in each cluster
• Answers to similar questions that do not change over time indicate sustainable questions
• Output of LSA contained 904 clusters:
• 9 clusters considered sustainable
• 143 clusters considered similar
• 756 clusters considered all
• Compute properties of question-answer pairs (change, time, number of answers, etc.)
Compare answers in each cluster
[Figure 4: As in Figure 3, the cumulative cosine distance between semanticized representations of answers, with a linear fitted line, for a single cluster. However, here the timing of the answers is taken into account. x-axis: Jan 2006 to Jun 2006; y-axis: cumulative cosine distance.]
Our intuition is that sustainable questions are more likely to solicit answers longer after they were posted than non-sustainable questions; many questions are answered straight away and disappear from the timeline quickly, whereas some questions keep getting attention, and have therefore not expired (yet).
4. EXPERIMENTS
Our experiments are aimed at answering the following research questions. What are the distinguishing properties of sustainable questions? Can we measure these properties of sustainability? Can we tell sustainable and unsustainable questions apart based on these properties?
4.1 Experimental Setup
Yahoo! Answers is a question answering community website, where users can ask and answer questions. Users are encouraged to answer questions by the rewarding of points, with accompanying ranks and earnable badges.
4.1.1 Data
All our experiments are run on the Yahoo! Answers Comprehensive Questions and Answers version 1.0⁶ dataset. This dataset consists of 4.5M questions, often with multiple answers, of which we used 3.2M.⁷

We have sampled two sets from the training set in order to develop, test and obtain clusters of similar questions; given the available resources we were not able to perform all clustering methods (discussed in section 3.1) on the complete training dataset. Table 1 shows some descriptive statistics of the two subsets and the complete set, indicating that on a superficial level, the distributions of questions do not differ much. However, we do note that the number of different languages grows with the size of the dataset.

Also, we can see that questions and answers tend to be short. Although the information need is conveyed with a richer representation

⁶http://webscope.sandbox.yahoo.com/catalog.php?datatype=l
⁷For validation purposes, we sorted the set by date, then split the set (80% training set, 20% test set), and held 10% of the training set back as a dev-test set. However, due to time constraints we never used the held-back data.
                                          sample size
Statistic                               10K      100K     all
Number of questions                     10K      100K     3.2M
Average number of answers/question      7.1      7.1      7.1
Std. dev. number of answers/question    7.4      7.2      8.1
Average number of characters/question   175.0    176.7    177.3
Std. dev. of characters/question        204.2    200.0    201.7
Median of characters/question           103      104      105
Average number of characters/answer     332.8    336.5    336.0
Std. dev. of characters/answer          507.6    503.7    499.6
Median of characters/answer             168      175      177
Average number of sentences/question    2.8      2.8      2.9
Std. dev. number of sentences/question  2.7      2.6      2.6
Median number of sentences/question     2        2        2
Average number of sentences/answer      3.9      3.9      3.9
Std. dev. number of sentences/answer    6.3      5.2      5.1
Median number of sentences/answer       2        2        2
Question languages                      6        12       28
Main categories                         163      176      179
Categories                              869      1744     2853
Sub categories                          677      1245     1539

Table 1: Descriptive statistics of the Yahoo! Answers dataset. The average number of answers is per question; the average question and answer length is in characters (spaces included). Languages and categories are counts of unique occurrences.
in natural language, we must also consider that our 'documents' are far sparser than would be the case in a traditional retrieval setting, which can negatively influence our efforts to estimate similarity between questions.

Additionally, we extracted several attributes of each question, such as the date the question was posted, when it was resolved, when the last answer was solicited (and how much time passed between asking and answering), how many answers were given, what the best answer was (either chosen by the asker, or voted for by the community), and to which category it was assigned.
4.1.2 Preprocessing
In order to perform the change rate measures as described in section 3.2.1, we employ two strategies to model the answer space. First, for each cluster of questions in the output (clustering methods are discussed in section 3.1), we perform case and accent folding and simple tokenisation on both questions and answers. We then model the answers as a tf-idf vector space [23].⁸

However, while this is a traditional, well-tested and well-described approach in information retrieval, the vector space tends to be very sparse. As Table 1 indicates, questions and answers tend to be very short. There are some elaborate answers in the corpus, skewing the average, but more than half of the questions and answers are represented by two sentences.

In addition, Yahoo! Answers community members express their questions and answers in natural language, creating an abundance of different spellings, synonyms and complexity. This results in an even sparser vector space, as different spellings of the same word (including typographical errors) result in separate features in the vector space, while the same semantic meaning is intended.

⁸We used the implementation from the scikit-learn package; http://scikit-learn.org/stable/index.html.
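A minimal sketch of this tf-idf setup follows, using the scikit-learn package the footnote mentions. The example answers are invented, and the exact vectorizer settings used in the thesis are not specified; `lowercase` and `strip_accents` here merely mirror the described case and accent folding.

```python
# Hedged sketch of the tf-idf answer space: case and accent folding, simple
# tokenisation, and cosine distances between answer vectors. Example texts
# are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

answers = [
    "Gustave Eiffel designed the Eiffel Tower.",
    "The Eiffel Tower was designed by Gustave Eiffel.",
    "The prime minister of the UK is Gordon Brown.",
]

vectorizer = TfidfVectorizer(lowercase=True, strip_accents="unicode")
tfidf = vectorizer.fit_transform(answers)

# Chronologically adjacent answers that say the same thing are close;
# an answer about a different topic is far away.
dist = cosine_distances(tfidf)
print(dist[0, 1], dist[0, 2])
```

The sparsity problem described above shows up directly here: any spelling variant of "Eiffel" would become a separate column in `tfidf`, pushing paraphrases apart.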
[Figure 6: Kernel density estimation of the average cosine distance (i.e. change rate) between semanticized answers labeled as best according to either the user or the community. x-axis: average cosine distance; y-axis: density; curves for the all, similar and sustainable classes.]
Almost all questions are answered within days of posting, although similar and sustainable question clusters seem to incorporate more questions that take longer to be answered satisfactorily than regular clusters. However, the distinction is not that clear.
[Figure 7: Kernel density estimation of the average time in days between the posting of a question and the question being marked as resolved. x-axis: days between question posted and best answer; y-axis: density; curves for the all, similar and sustainable classes.]
When we consider the time between the moment of posting a question and the moment that question receives its final answer, we see that questions we deem sustainable keep receiving answers far longer than 'regular' or even similar questions. Figure 8 shows a kernel density estimation plot¹⁰ for the time between the posting of a question and the reception of its last answer. It should be noted that the set of sustainable clusters is a subset of the set of similar clusters, and that the set of similar clusters is a subset of the set of all clusters. This explains the second local maximum in the 'all clusters' line.

¹⁰This is why the plot covers negative values for time as well.

4.3 Analysis
When comparing a kernel density estimation of the average cosine distance between the best answers to the questions in a cluster
[Figure 8: Kernel density estimation of the average time in days between the posting of a question and the last answer a question received. x-axis: days between question posted and last answer; y-axis: density; curves for the all, similar and sustainable classes.]
(shown in Figure 5) with a kernel density estimation of the average time in days between posting a question and that question receiving its last answer (shown in Figure 8), we see that the time between the posting of a question and its last answer is highly indicative of sustainability: the longer a question solicits answers, the higher the probability that the question is sustainable.
In addition, from the simple properties (average, standard deviation, slope, SSE; detailed in Section 4.1.2) of clusters, we constructed five feature sets, as listed in Table 3. These correspond to the approaches discussed in Section 3.2: change per question (i.e. the amount of change between sequential questions), change per question normalised for time, and the change over time for semanticized representations of questions, as well as the time between the asking and answering of questions (both between asking and the labeling of the best answer, and between asking and the reception of the last answer). Also, we used a combination of the 'change over time' and 'time to answer' sets.
feature set                       accuracy
change per question               66.9%
change over time                  86.0%
semanticized change over time     75.3%
time to answer                    89.3%
change/time combination           91.5%

Table 3: Accuracy of different feature sets. 'Change' feature sets typically contain the average, (cumulative) standard deviation, slope and SSE of change rates (detailed in Section 3.2.1). 'Time to answer' contains the time in days between asking and answering a question (detailed in Section 3.2.2). 'Combination' contains features from both 'change over time' and 'time to answer' (detailed in Section 4.1.2).
When training a simple tree classifier¹¹ using the properties in each property set as features (on re-sampled data to balance the classes), we find that the combination of both change and time features obtains a classification accuracy of 91.5% in stratified 10-fold cross-validation, indicating that very simple properties, such as the time between a question and its last answer and the cosine distance between the answers over time, allow for a reasonable distinction between sustainable and non-sustainable questions.

¹¹We use the WEKA [14] implementation of C4.5 by Quinlan [27].
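A hedged sketch of this classification step: the thesis uses WEKA's C4.5, for which scikit-learn's CART decision tree stands in here, evaluated with stratified 10-fold cross-validation. All data below is synthetic and invented purely to illustrate the pipeline, not to reproduce the reported 91.5%.

```python
# Sketch of the classification step: a decision tree (CART as a stand-in for
# C4.5) on change/time features, with stratified 10-fold cross-validation.
# The feature distributions are invented for illustration.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200

# Synthetic clusters: sustainable ones have low answer change and a long
# time-to-last-answer; non-sustainable ones the opposite.
y = rng.integers(0, 2, n)                                        # 1 = sustainable
change = np.where(y == 1, 0.1, 0.8) + 0.1 * rng.standard_normal(n)
days_to_last = np.where(y == 1, 300, 20) + 30 * rng.standard_normal(n)
X = np.column_stack([change, days_to_last])

clf = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10))
print(scores.mean())
```

Stratification matters here for the same reason it mattered in the paper: the sustainable class is tiny, so each fold must preserve the class ratio for the accuracy estimate to be fair.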
Classify clusters as sustainable
• Construct feature sets (change, change over time, time to answer)
• Train a classifier* on re-sampled data
• Accuracy in stratified 10-fold cross-validation: 91.5% for the combined change and time features
*We use the WEKA (Hall et al., 2009) implementation of C4.5 by Quinlan (1993)
Conclusions
• Explored a new problem concerning sustainability and reusability of questions in a CQA setting
• Sustainability can be reasonably estimated by simple question properties, where time is most descriptive (RQ1)
• These properties can be obtained easily, including from data from other CQA services (RQ2)
• Using a simple classifier, these properties can be used to distinguish sustainable from non-sustainable questions (RQ3)
Future work
• Scaling (the considered sample is 3% of the training set)
• Clustering:
• on answers (twice as long as questions)
• both (where do clusters of answers and questions ‘agree’?)
• retrieval approach
• Evaluation; does factoring in sustainability have a positive effect on precision?
Questions?
References
• D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet Allocation. The Journal of Machine Learning Research, 3:993–1022, March 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=944919.944937.
• M. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 380–388. ACM, 2002.
• S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6): 391–407, 1990.
• M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten. The WEKA data mining software: an update. SIGKDD, 11(1):10–18, 2009.
• J. Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann, 1993.
• C. Shah, S. Oh, and J. Oh. Research agenda for social Q&A. Library & Information Science Research, 31(4):205–209, 2009.
Data descriptives
Jan 2006 Feb 2006 Mar 2006 Apr 2006 May 2006 Jun 2006−2
−1
0
1
2
3
4
5
6
7
8
Cum
ulat
ive
cosi
nedi
stan
ce
Linear fitted lineCumulative cosine distance
Figure 4: As in Figure 3, the cumulative cosine distance be-
tween semanticized representations of answers with linear fit-
ted line for a single cluster. However, here the timing of the
answers is taken in to account.
ition is that sustainable questions are more likely to solicit answerslonger after they were posted than non-sustainable questions; manyquestions are answered straightaway and disappear in the timelinequickly, whereas some questions keep getting attention, and aretherefore not expired (yet).
4. EXPERIMENTS
Our experiments are aimed at answering the following researchquestions. What are the distinguishing properties of sustainablequestions? Can we measure these properties of sustainability? Canwe tell sustainable and unsustainable questions apart based on theseproperties?
4.1 Experimental Setup
Yahoo! Answers is a community question answering website, where users can ask and answer questions. Users are encouraged to answer questions by awarding points, with accompanying ranks and earnable badges.
4.1.1 Data
All our experiments are run on the Yahoo! Answers Comprehensive Questions and Answers version 1.0 dataset.6 This data set consists of 4.5M questions, often with multiple answers, of which we used 3.2M.7
We have sampled two sets from the training set in order to develop, test and obtain clusters of similar questions; given the available resources we were not able to perform all clustering methods (discussed in section 3.1) on the complete training data set. Table 1 shows some descriptive statistics of the two subsets and the complete set, indicating that on a superficial level, the distributions of questions do not differ much. However, we do note that the number of different languages grows with the size of the data set.
Also, we can see that questions and answers tend to be short. Although the information need is conveyed with a richer representation
6 http://webscope.sandbox.yahoo.com/catalog.php?datatype=l
7 For validation purposes, we sorted the set by date, then split it (80% training set, 20% test set), and held 10% of the training set back as a dev-test set. However, due to time constraints we never used the held-back data.
                                          sample size
Statistic                                10K      100K     all
Number of questions                      10K      100K     3.2M
Average number of answers/question       7.1      7.1      7.1
Std. dev. number of answers/question     7.4      7.2      8.1
Average number of characters/question    175.0    176.7    177.3
Std. dev. of characters/question         204.2    200.0    201.7
Median of characters/question            103      104      105
Average number of characters/answer      332.8    336.5    336.0
Std. dev. of characters/answer           507.6    503.7    499.6
Median of characters/answer              168      175      177
Average number of sentences/question     2.8      2.8      2.9
Std. dev. number of sentences/question   2.7      2.6      2.6
Median number of sentences/question      2        2        2
Average number of sentences/answer       3.9      3.9      3.9
Std. dev. number of sentences/answer     6.3      5.2      5.1
Median number of sentences/answer        2        2        2
Question languages                       6        12       28
Main categories                          163      176      179
Categories                               869      1744     2853
Sub categories                           677      1245     1539

Table 1: Descriptive statistics of the Yahoo! Answers data set. The average number of answers is per question; the average question and answer length is in characters (spaces included). Languages and categories are counts of unique occurrences.
in natural language, we must also consider that our ‘documents’ are far sparser than would be the case in a traditional retrieval setting, which can negatively influence our efforts to estimate similarity between questions.
Additionally, we extracted several attributes of each question, such as the date the question was posted, when it was resolved, when the last answer was solicited (and how much time passed between asking and answering), how many answers were given, what the best answer was (either chosen by the asker, or voted for by the community), and to which category it was assigned.
4.1.2 Preprocessing
In order to perform the change rate measures described in section 3.2.1, we employ two strategies to model the answer space. First, for each cluster of questions in the output (clustering methods are discussed in section 3.1), we perform case and accent folding and simple tokenisation on both questions and answers. We then model the answers as a tf-idf vector space [23].8

However, while this is a traditional, well-tested and well-described approach in information retrieval, the vector space tends to be very sparse. As Table 1 indicates, questions and answers tend to be very short. There are some elaborate answers in the corpus, skewing the average, but more than half of the questions and answers are represented by two sentences.

In addition, Yahoo! Answers community members express their questions and answers in natural language, creating an abundance of different spellings, synonyms and complexity. This results in an even sparser vector space, as different spellings of the same word (including typographical errors) result in separate features in the vector space, while the same semantic meaning is intended.

8 We used the implementation from the scikit-learn package; http://scikit-learn.org/stable/index.html.
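As a sketch, the preprocessing step above could be expressed with scikit-learn (the package the footnote refers to). The example answers are invented and the exact vectorizer settings are not documented in the text, so the parameters are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented cluster of best answers; the real answers come from the
# Yahoo! Answers dump.
answers = [
    "Gordon Brown is the prime minister of the UK.",
    "David Cameron became prime minister in 2010.",
    "It is David Cameron.",
]

# Case and accent folding plus simple tokenisation, then a tf-idf
# vector space over the answers in the cluster.
vectorizer = TfidfVectorizer(lowercase=True, strip_accents="unicode",
                             token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(answers)
print(X.shape[0])  # one row per answer: 3
```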
Cluster properties
             sample size
algorithm    10K    100K   all
LDA          0.435  0.500  –
LSA          0.706  0.638  –
LSH 16 bits  0.472  0.484  0.500
LSH 24 bits  0.465  0.502  0.495
LSH 32 bits  0.512  0.514  0.509
LSH 40 bits  0.523  0.537  0.542

Table 2: Accuracy of several question clustering methods. Missing values represent experiments that never terminated.
In order to overcome vocabulary mismatch (different words with the same meaning are used in either the question or the answer) and different ways of spelling, and to improve overall matching on a semantic level, we use the semantic linking system of Meij et al. [24], developed to determine concepts in tweets.

This system approaches a similar problem: finding out what short pieces of text are about. In its operationalisation, pages on Wikipedia are considered concepts. Subsequently, a model is trained to estimate the probability that a concept c is the target of a hyperlink (in Wikipedia) with an anchor text containing an n-gram q. Given a question or an answer, we obtain the set of concepts that are likely to be linked to by occurrence of n-grams from that question or answer in Wikipedia anchor texts, as well as a score (the sum of the probabilities of all n-grams in the piece of text linking to that specific concept).

Using these concepts as document vectors, and their scores as the values in those vectors, rather than tf-idf on the bag-of-words representation of questions and answers, we hope to obtain a less noisy vector space (change between answers is based solely on concepts present in the text), and to diminish the influence of spelling and vocabulary mismatch. We will refer to documents processed this way as ‘semanticized’.
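Measuring change between semanticized answers then reduces to cosine distances between sparse concept-score vectors. A minimal sketch; the concept names and scores are invented (the real scores come from the semantic linking system):

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two sparse
    vectors represented as {feature: weight} dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Invented 'semanticized' answers: Wikipedia concepts with linking
# scores instead of tf-idf terms.
answer_2006 = {"Gordon_Brown": 0.9, "Prime_Minister_of_the_UK": 0.7}
answer_2010 = {"David_Cameron": 0.8, "Prime_Minister_of_the_UK": 0.7}
print(cosine_distance(answer_2006, answer_2006))  # identical answers: ~0
print(cosine_distance(answer_2006, answer_2010))  # changed answer: > 0
```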
We view the clustering of questions as a preprocessing step and therefore take it as part of the experimental setup. We explore three approaches to finding similar questions: latent semantic analysis [8], latent Dirichlet allocation [4] and locality sensitive hashing [6].
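To illustrate the LSA route, one could project a tf-idf matrix of questions onto a few latent dimensions with scikit-learn's TruncatedSVD and compare questions in the reduced space. The toy questions, the number of components and the similarity comparison are all assumptions, since the text does not detail the clustering procedure itself:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

questions = [
    "who is the prime minister of the uk",
    "who is the current uk prime minister",
    "who designed the eiffel tower",
    "which engineer designed the eiffel tower",
]
X = TfidfVectorizer().fit_transform(questions)

# LSA: project the tf-idf space onto a small number of latent
# dimensions, then compare questions in that reduced space.
lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)

S = cosine_similarity(Z)
# Paraphrases should end up closer than unrelated questions.
print(S[0, 1] > S[0, 2])
```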
From the output of each clustering method on the 10K dataset,
we sampled 559 pairs of questions and manually labeled 205 as cor-
rectly clustered together and 354 as wrongly clustered together. We
used the combined set of labels (randomly sampling 205 questions
from the wrongly-clustered set) to arrive at the results in Table 2;
for each labeled pair of questions we observe whether the algorithm
was correct in either putting both questions in the same cluster or
keeping them separate.
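The pairwise evaluation described above can be sketched as a small function; the toy clustering and judgments are invented:

```python
def pairwise_accuracy(cluster_of, labeled_pairs):
    """Accuracy of a clustering against labeled question pairs.

    cluster_of:    dict mapping question id -> cluster id
    labeled_pairs: list of (q1, q2, should_be_together) judgments
    """
    correct = 0
    for q1, q2, together in labeled_pairs:
        predicted = cluster_of[q1] == cluster_of[q2]
        if predicted == together:
            correct += 1
    return correct / len(labeled_pairs)

# Invented toy clustering and judgments.
clusters = {"q1": 0, "q2": 0, "q3": 1}
pairs = [("q1", "q2", True), ("q1", "q3", False), ("q2", "q3", True)]
print(round(pairwise_accuracy(clusters, pairs), 3))  # 2 of 3 correct: 0.667
```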
Based on these accuracy results we decided to use LSA as our clustering approach for the remainder of our experiments. We also decided to take the sample of 10K documents as the basis for our analysis.
While we consider clustering of similar questions as a preprocessing step for our approaches to sustainability, we cannot ignore the fact that obtaining a reasonable clustering performance is important for our sustainability estimation. Therefore, we opt to manually label data for further investigation, as our clustering methods performed rather poorly.
We manually divided the 904 clusters in the output of our LSA clustering approach on the aforementioned subset of 10K questions into three classes: 752 ‘all’ clusters, 143 clusters with similar questions and 7 clusters with sustainable questions.
Figure 5: Kernel density estimation of the average cosine distance (i.e. change rate) between answers labeled as best according to either the user or the community, for the ‘all’, ‘similar’ and ‘sustainable’ classes.
The clusters in the similar class are only required to have similar questions (questions asking for the same information), regardless of the answers; these clusters can thus be either sustainable or unsustainable. Additionally, the clusters in the sustainable class are required to have answers that do not change over time. Note that this definition implies that the sustainable class is a subset of the similar class, which in turn is a subset of the all class.
Subsequently, for each cluster we compute cosine distances be-
tween chronologically sorted best answers, as described in section
3.2.1. For each set of distances (per cluster), we compute the aver-
age, standard deviation, average change per day, standard deviation
on the cumulative distances, as well as the slope and sum of squared
errors of a linearly fitted function on the cumulative distances.
Also, we compute for each cluster the time between the moment a question was posted and the moment its answer labeled as best answer was given, and the time until the last answer that question received, as described in section 3.2.2. For each set of these distances in time, we compute the average, standard deviation, standard deviation on the cumulative distances in time, as well as the slope and sum of squared errors of a linearly fitted function on the cumulative distances in days.
4.2 Results
Figure 5 shows a kernel density estimation9 plot of the average cosine distance between the best answers for each class of clusters. Although there seems to be some evidence that this metric distinguishes similar and sustainable clusters from regular clusters, it is not that strong.
Figure 6 also shows a kernel density estimation plot, for the average cosine distance between the semanticized vector representations of the best answers for each class of clusters. As the remarkably similar plots for the distances between tf-idf vectors and semanticized vector representations in the single cluster example in section 3.2.1 already suggested, semanticizing answers does not seem to be an improvement on the traditional tf-idf bag-of-words representation.
Also, considering time to answer as a distinctive property of
sustainable questions does not yield decisive results. Figure 7 shows
a kernel density estimation of the average time in days between the
9 We use kernel density estimation because it models the density of data points at a value. In this way, a fairer comparison between the instances of our three classes can be made; we have far fewer sustainable than similar questions [37].
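A sketch of such a kernel density estimate, using scipy's gaussian_kde on invented per-cluster averages (the class sizes mirror the 143 similar vs. 7 sustainable labeled clusters):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Invented per-cluster averages for two classes of very different
# sizes, mirroring the 143 similar vs. 7 sustainable clusters.
similar = np.random.default_rng(0).normal(0.6, 0.2, size=143)
sustainable = np.random.default_rng(1).normal(0.3, 0.1, size=7)

# A KDE turns each sample into a comparable density curve,
# regardless of how many clusters the class contains.
xs = np.linspace(-0.5, 1.5, 200)
density_similar = gaussian_kde(similar)(xs)
density_sustainable = gaussian_kde(sustainable)(xs)
print(density_similar.shape)  # (200,)
```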
Figure 6: Kernel density estimation of the average cosine distance (i.e. change rate) between semanticized answers labeled as best according to either the user or the community.
posting of a question and that question being marked as resolved. Almost all questions are answered within days of posting, although similar and sustainable question clusters seem to incorporate more questions that take longer to be answered satisfactorily than regular clusters. However, the distinction is not that clear.
Figure 7: Kernel density estimation of the average time in days between posting of a question and the question being marked as resolved.
When we consider the time between the moment of posting a question and the moment that question receives its final answer, we see that questions we deem sustainable keep receiving answers far longer than ‘regular’ or even similar questions. Figure 8 shows a kernel density estimation10 plot for the time between the posting of a question and the reception of its last answer. It should be noted that the set of sustainable clusters is a subset of the set of similar clusters, and that the set of similar clusters is a subset of the set of all clusters. This explains the second local maximum in the ‘all clusters’ line.
4.3 Analysis
When comparing a kernel density estimation of the average cosine distance between the best answers to the questions in a cluster

10 This is why the plot covers negative values for time as well.
Figure 8: Kernel density estimation of the average time in days between posting of a question and the last answer a question received.
(shown in Figure 5) with a kernel density estimation of the average time in days between posting a question and that question receiving its last answer (shown in Figure 8), we see that the time between the posting of a question and receiving its last answer is very indicative of sustainability: the longer a question solicits answers, the higher the probability that said question is sustainable.

In addition, from the simple properties (average, standard deviation, slope, SSE; detailed in Section 4.1.2) of clusters, we constructed five feature sets, as listed in Table 3. These correspond to approaches discussed in Section 3.2: change per question (i.e. the amount of change between sequential questions), change per question normalised for time, and the change over time for semanticized representations of questions, as well as the time between asking and answering of questions (both between asking and labeling of the best answer, and between asking and reception of the last answer). Also, we used a combination of the ‘change over time’ and ‘time to answer’ sets.
feature set                      accuracy
change per question              66.9%
change over time                 86.0%
semanticized change over time    75.3%
time to answer                   89.3%
change/time combination          91.5%

Table 3: Accuracy of different feature sets. ‘Change’ feature sets typically contain average, (cumulative) standard deviation, slope and SSE of change rates (detailed in Section 3.2.1). ‘Time to answer’ contains the time in days between asking and answering a question (detailed in Section 3.2.2). ‘Combination’ contains features from both ‘change over time’ and ‘time to answer’ (detailed in Section 4.1.2).
When training a simple tree classifier11 using the properties in each property set as features (on re-sampled data to balance the classes), we find that the combination of both change and time features is capable of obtaining a classification accuracy of 91.5% in stratified 10-fold cross-validation, indicating that very simple properties, such as the time between a question and its last answer and the cosine distance between the answers over time, allow for a reasonable distinction between sustainable and non-sustainable questions.

11 We use the WEKA [14] implementation of C4.5 by Quinlan [27].
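WEKA's C4.5 was used for the result above; as an approximate sketch in Python, a CART decision tree with stratified 10-fold cross-validation plays the same role. The feature values and their class separation are invented:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
# Invented balanced features per cluster, e.g. [average distance,
# slope, SSE, days until last answer] after re-sampling.
X = np.vstack([rng.normal(0.0, 1.0, (50, 4)),
               rng.normal(1.5, 1.0, (50, 4))])
y = np.array([0] * 50 + [1] * 50)  # 0 = non-sustainable, 1 = sustainable

# C4.5 lives in WEKA; scikit-learn's CART decision tree is a close
# stand-in for this sketch.
clf = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10))
print(len(scores), scores.mean() > 0.5)
```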
We sort the list of hashes generated from our question input and then pass over the list, computing the Hamming distance [15] between consecutive items. The first hash initialises a cluster. Then, we proceed to the second hash. If the current hash differs from the previous hash in three or fewer positions, it is appended to that cluster; otherwise a new cluster is initialised. In doing so, clustering in linear time can be achieved, although at the loss of some precision. Successful applications to detecting near-duplicate pages for web crawling [22] and to first story detection on Twitter have been reported [26].
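The single pass over the sorted hashes can be sketched as follows; the 16-bit toy hashes are invented, and the threshold of three bits follows the description above:

```python
def hamming(a, b):
    """Number of differing bit positions between two hashes."""
    return bin(a ^ b).count("1")

def cluster_hashes(hashes, max_distance=3):
    """One pass over the sorted hashes: append to the current cluster
    while the Hamming distance to the previous hash stays within
    max_distance, otherwise start a new cluster."""
    clusters = []
    previous = None
    for h in sorted(hashes):
        if previous is not None and hamming(previous, h) <= max_distance:
            clusters[-1].append(h)
        else:
            clusters.append([h])
        previous = h
    return clusters

# Toy 16-bit hashes: two near-duplicates and one distant hash.
hashes = [0b1010101010101010, 0b1010101010101011, 0b0101010101010101]
print(len(cluster_hashes(hashes)))  # 2 clusters
```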
3.2 Measuring Sustainability
The operationalisation of our definition of sustainable questions,
as described in section 2.4, implies that we first have to identify
similar questions; in order to estimate to what extent the answers
to very similar or the same questions change over time, we need to
find sets of such similar questions.
3.2.1 Change rate of answers
For each cluster of questions we create a tf-idf vector space of the answers labeled as ‘best answer’ by either the question asker or the community. Subsequently, we fit a linear function on the cumulative cosine distances between the answers (as shown in Figure 1), as well as on the cumulative cosine distances between answers over time (shown in Figure 2). Figure 1 shows a rather constant change in answers over time, whereas Figure 2 displays differences in the speed of change between different answers over time, suggesting that time might be an important factor in determining the evolution of answers to questions. Figures 1 and 2 show the same cluster.
Figure 1: Cumulative cosine distance between vector representations of answers with a linear fitted line for a single cluster. For the 9 best answers in this cluster, the theoretical maximum of the cumulative distance is 8.
The idea behind this approach is that the slope of this linear function provides an indication of how fast the answers to a set of similar questions change. Also, the sum of squared errors of this function given the dataset might provide clues to periodicity; if the answers to similar questions exhibit large amounts of change in short periods of time, that might indicate that the subject of these questions is subject to periodic changes (for example, the answer to ‘who is the world champion soccer’ is expected to change suddenly at periodic time intervals). Additionally, we compute the standard deviation of the set of distances.
We also represented our set of answers as semanticized vectors. This approach is described in more depth in section 4.1.2.
Figure 2: As in Figure 1, the cumulative cosine distance between vector representations of answers with a linear fitted line for a single cluster. However, here the timing of the answers (Jan–Jun 2006) is taken into account.
Figure 3: Cumulative cosine distance between semanticized representations of answers with a linear fitted line for a single cluster. For the 9 best answers in this cluster, the theoretical maximum of the cumulative distance is 8.
Figure 3 and Figure 4 display the results of the same approach applied to the vector representations of the semanticized answers. Both figures represent the same cluster of questions shown in Figures 1 and 2, and show remarkably similar graphs for both the change and the change-over-time approach; only small differences, such as at the second question, can be observed.
3.2.2 Speed of response
Another property that might be indicative of the sustainability of a question is the time it takes for a question to be answered. The intuition here is that the probability of a sustainable question soliciting answers over a longer period of time would be higher, as the question would still be relevant.

For each cluster, we computed the average time in days for a question to be resolved (i.e. the time between the posting of a question and the posting of the best answer). Also, we computed the standard deviation of the answering time, as well as the total number of days questions in a cluster had to ‘wait’ for their best answer.
Figure 4: As in Figure 3, the cumulative cosine distance between semanticized representations of answers with a linear fitted line for a single cluster. However, here the timing of the answers (Jan–Jun 2006) is taken into account.

In addition, we computed the average time in days between the posting of a question and the last answer it received. The intuition is that sustainable questions are more likely to solicit answers longer after they were posted than non-sustainable questions; many questions are answered straight away and disappear from the timeline quickly, whereas some questions keep getting attention, and are therefore not expired (yet).
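The per-cluster timing statistics described in this section reduce to simple date arithmetic; a sketch with invented timestamps:

```python
from datetime import datetime
from statistics import mean, stdev

# Invented (posted, last_answer) timestamps for the questions in
# one cluster.
timestamps = [
    (datetime(2006, 1, 3), datetime(2006, 1, 5)),
    (datetime(2006, 2, 1), datetime(2006, 4, 20)),
    (datetime(2006, 3, 7), datetime(2006, 3, 8)),
]
days = [(last - posted).days for posted, last in timestamps]

# Average, spread and total 'waiting' time in days for the cluster.
print(mean(days), round(stdev(days), 1), sum(days))
```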