
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 45–55, New Orleans, Louisiana, June 5, 2018. ©2018 Association for Computational Linguistics

Estimating Linguistic Complexity for Science Texts

Farah Nadeem and Mari Ostendorf
Dept. of Electrical Engineering
University of Washington
{farahn,ostendor}@uw.edu

Abstract

Evaluation of text difficulty is important both for downstream tasks like text simplification, and for supporting educators in classrooms. Existing work on automated text complexity analysis uses linear models with engineered knowledge-driven features as inputs. While this offers interpretability, these models have lower accuracy for shorter texts. Traditional readability metrics have the additional drawback of not generalizing to informational texts such as science. We propose a neural approach, training on science and other informational texts, to mitigate both problems. Our results show that neural methods outperform knowledge-based linear models for short texts, and have the capacity to generalize to genres not present in the training data.

1 Introduction

A typical classroom presents a diverse set of students in terms of their reading comprehension skills, particularly in the case of English language learners (ELLs). Supporting these students often requires educators to estimate the accessibility of instructional texts. To address this need, several automated systems have been developed to estimate text difficulty, including readability metrics like Lexile (Stenner et al., 1988), the end-to-end system TextEvaluator (Sheehan et al., 2013), and linear models (Vajjala and Meurers, 2014; Petersen and Ostendorf, 2009; Schwarm and Ostendorf, 2005). These systems leverage knowledge-based features to train regression or classification models. Most systems are trained on literary and generic texts, since analysis of text difficulty is usually tied to language teaching. Existing approaches for automated text complexity analysis pose two issues: 1) systems using knowledge-based features typically work better for longer texts (Vajjala and Meurers, 2014), and 2) complexity estimates are less accurate for informational texts such as science (Sheehan et al., 2013). In the context of science, technology and engineering (STEM) education, both problems are significant. Teachers in these areas have less expertise in identifying appropriate reading material for students than language teachers do, and shorter texts become important when dealing with assessment questions and identifying the most difficult parts of instructional texts to modify for supporting students who are ELLs.

Our work specifically looks at ways to address these two problems. First, we propose recurrent neural network (RNN) architectures for estimating linguistic complexity, using text as input without feature engineering. Second, we specifically train on science and other informational texts, using the grade level of a text as a proxy for linguistic complexity and dividing grades K-12 into 6 groups. We explore four different RNN architectures in order to identify aspects of text that contribute more to complexity, with a novel structure introduced to account for cross-sentence context. Experimental results show that when specifically trained on informational texts, RNNs can accurately predict text difficulty for shorter science texts. The models also generalize to other types of texts, but perform slightly worse than feature-based regression models on a mix of genres for texts longer than 100 words. We use attention with all models, both to improve accuracy and as a tool to visualize important elements of text contributing to linguistic complexity. The key contributions of the work include new neural network architectures for characterizing documents and experimental results demonstrating good performance for predicting the reading level of short science texts.

The rest of the paper is organized as follows: section 2 looks at existing work on automated readability analysis and introduces the RNN architectures we build on for this work. Section 3 lays out the data sources, section 4 covers the proposed models, and section 5 presents results. Discussion and concluding remarks follow in sections 6 and 7.

2 Background

Studies have shown that the language difficulty of instructional materials and assessment questions impacts student performance, particularly for language learners (Hickendorff, 2013; Abedi and Lord, 2001; Abedi, 2006). This has led to extensive work on readability analysis, some of which is explored here. The second part of this section looks at work that leverages RNNs in automatic text classification tasks and the use of attention with RNNs.

2.1 Automated Readability Analysis

Traditional reading metrics including Flesch-Kincaid (Kincaid et al., 1975) and the Coleman-Liau index (Coleman and Liau, 1975) are often used to assess a text for difficulty. These metrics utilize surface features such as average length of sentences and words, or word lists (Chall and Dale, 1995). The development of automated text analysis systems has made it possible to leverage additional linguistic features, as well as conventional reading metrics, to estimate text complexity quantified as reading level. NLP tools can be used to extract a variety of lexical, syntactic and discourse features from text, which can then be used with traditional features as input to models for predicting reading level. Some of the models include statistical language models (Collins-Thompson and Callan, 2004), support vector machine classifiers (Schwarm and Ostendorf, 2005; Petersen and Ostendorf, 2009), and logistic regression (Feng et al., 2010). Text coherence has also been explored as a predictor of difficulty level in (Graesser et al., 2004), with an extended feature set that includes syntactic complexity and discourse in addition to coherence (Graesser et al., 2011).

A study conducted in (Nelson et al., 2012) indicates that metrics that incorporate a large set of linguistic features perform better at predicting text difficulty level; the metrics were specifically tested on the Common Core Standards (CCS) texts.[1]

Features from second language acquisition complexity measures were used in (Vajjala and Meurers, 2012) to improve readability assessment. This feature set was further extended to include morphological, semantic and psycholinguistic features to build a readability analyzer for shorter texts (Vajjala and Meurers, 2014). A tool specifically built for text complexity analysis for teaching and assessing is the TextEvaluator™. While knowledge-based features offer interpretability, a drawback is that if the text being analyzed is short, the feature vector is sparse, and prediction accuracy drops (Vajjala and Meurers, 2014). This is particularly true for assessment questions, which are shorter than the samples most models are trained on.

[1] http://www.corestandards.org/

Generally, for any text classification task, the type of text used for training the model is important in terms of how well it performs; training on more representative text tends to improve performance. The work in (Sheehan et al., 2013) shows that traditional readability measures underestimate the reading level of literary texts, and overestimate that of informational texts, such as history, science and mathematics articles. This is due, in part, to the vocabulary specific to the genre. Science texts have longer words, though they may be easier to infer from context. Literary texts, on the other hand, might have simpler words, but more complicated sentence structure. The work demonstrated that more accurate grade level estimates can be obtained by two-stage classification: i) classify the text as either literary, informational, or mixed, and then ii) use a genre-dependent analyzer to estimate the level. In an analysis of how well a model trained on news and informational articles generalizes to the categories in CCS, the work in (Vajjala and Meurers, 2014) shows better performance on the informational genre than literary texts. Training on more representative text, however, requires genre-specific annotated data.

2.2 Text Classification with RNNs

Recurrent neural networks (RNNs) are adept at learning text representations, as demonstrated by language modeling (Mikolov et al., 2010) and text classification tasks (Yogatama et al., 2017). Additional RNN structures have been proposed for improved representation, including tree LSTMs (Tai et al., 2015). In addition, hierarchical models have been proposed to better represent document structure (Yang et al., 2016).

Attention mechanisms were introduced to improve neural machine translation tasks (Bahdanau et al., 2014), and have also been shown to improve the performance of text classification (Yang et al., 2016). In machine translation, attention is computed over the source sequence when predicting the words in the target sequence. This "context" attention is based on a score computed between the target hidden state $h_t$ and a subset of the source hidden states $h_s$. The score can be computed in several ways, of which a general form is $\mathrm{score}(h_t, h_s) = h_t^T W_\alpha h_s$ (Luong et al., 2015).

Attention has also been used for a variety of other language processing tasks. In particular, for text classification, attention weights are learned that target the final classification decision. This approach is referred to as "self attention" in (Lin et al., 2017), but will be referred to here as "task attention." The hierarchical RNN in (Yang et al., 2016) uses task attention mechanisms at both word and sentence levels. Since our work builds on this model, it is described in further detail in section 4. In addition, we propose extensions of the hierarchical RNN that leverage attention in different ways, including combining the concept of context attention from machine translation with task attention to capture the interdependence of adjoining sentences in a document.
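As a concrete illustration, the bilinear score above and the resulting context vector can be computed in a few lines of NumPy. This is a generic sketch, not the authors' implementation; the dimensions and the matrix W_alpha are placeholders for learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy dimensions: one target hidden state, a source sequence of 5 states.
rng = np.random.default_rng(0)
h_t = rng.standard_normal(4)           # target hidden state
H_s = rng.standard_normal((5, 4))      # source hidden states, one row per position
W_alpha = rng.standard_normal((4, 4))  # bilinear attention parameter (placeholder)

scores = H_s @ W_alpha.T @ h_t         # score(h_t, h_s) = h_t^T W_alpha h_s, per position
alpha = softmax(scores)                # attention weights over the source sequence
context = alpha @ H_s                  # attention-weighted context vector
```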

3 Data

For our work we consider grade level as a proxy for linguistic complexity. Within a grade level, there is variability across the different genres that students are expected to learn. Since there is no publicly available data set for estimating grade level and text difficulty aimed at informational texts, we created a corpus using online science, history and social studies textbooks. The textbooks are written either for specific grades or for a grade range, e.g. grades 6-8. There are a total of 44 science textbooks and 11 history and social studies textbooks, distributed evenly across grades K-12. Given the distribution of textbooks for each grade level, we decided to classify into one of six grade bands: K-1, 2-3, 4-5, 6-8, 9-10 and 11-12. Because of our interest in working with short texts, we split the books into paragraphs, using the end-of-line character as the delimiter.[2] In addition to the textbooks, we also used the WeeBit corpus (Vajjala and Meurers, 2012) for training, again split into paragraphs.

[2] In splitting the text into paragraphs, we are implicitly assuming that all paragraphs have the same linguistic complexity as the textbook, which is probably not the case. Thus, there will be noise in both the training and test data, so some variation in the predicted levels is to be expected.

Grade Level   All chapters   Test set chapters
K-1           25             -
2-3           22             2
4-5           53             9
6-8           165            12
9-10          48             5
11-12         28             3

Table 1: Chapter-based test data split

We have three different sources of test data: i) the CCS appendix B texts, ii) a subset of the online texts that we collected,[3] and iii) a collection of science assessment items.

The CCS appendix B data is of interest because it has been extensively used for evaluating linguistic complexity models, e.g. in (Sheehan et al., 2013; Vajjala and Meurers, 2014). It includes both informational and literary texts. We use document-level samples from the CCS data for comparison to prior work, and paragraph-level samples to provide a more direct comparison to the informational test data we created.

For the informational texts, we selected chapters from multiple open source texts. Since we had so few texts at the K-1 level, the test data only included texts from higher grade levels, as shown in table 1. The paragraphs in these chapters were randomly assigned to test and validation sets.

To assess the models on stand-alone texts, we assembled a corpus of science assessment questions from (Khot et al., 2015; Clark et al., 2018), AI2 Science Questions Mercury,[4] and AI2 Science Questions v2.1 (October 2017).[5] This test set includes 5470 questions for grades 6-8 from sources including standardized state and national tests. The average length of a question is 49 words.

For training, two data configurations were used. When testing on the CCS data and the science assessment questions, there is no concern about overlap between training and test data, so all text can be used for training. We held out 10% of this data for analysis, and the remaining text is used for the D1 training configuration. Data statistics are given in table 2.

[3] Available at https://tinyurl.com/yc59hlgj.
[4] http://data.allenai.org/ai2-science-questions-mercury/
[5] http://data.allenai.org/ai2-science-questions/


Grade Level   Train Samples   Mean Length
K-1           739             24.42
2-3           723             62.05
4-5           4570            63.82
6-8           15940           74.79
9-10          3051            68.24
11-12         2301            75.28

Table 2: Training data (D1) with mean length of text in words

About 20% of the training samples (5152) are from WeeBit, spread across grades 2-12. For testing on all three sets, we defined a training configuration D2 that did not include any text from chapters overlapping with the test data, so the training set is somewhat smaller than for D1, except for grades K-1. The same WeeBit training data was included in both cases.

For the elementary grade levels, we have much less data than for middle school, and for high school, we have substantial training data with coarser labels (grades 9-12). To work around both issues, we first used all training samples to train the RNN to predict one of four labels (grades K-3, 4-5, 6-8 and 9-12). We then used the training data with fine labels to train to predict one of six labels. This approach was more effective than alternating the training.
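A minimal sketch of this two-pass, coarse-to-fine schedule; the band indices, label mapping, and the train() helper are hypothetical stand-ins for the actual training loop:

```python
# Assumed band indices: 0=K-1, 1=2-3, 2=4-5, 3=6-8, 4=9-10, 5=11-12.
# The first pass collapses them to four coarse labels: K-3, 4-5, 6-8, 9-12.
COARSE = {0: 0, 1: 0, 2: 1, 3: 2, 4: 3, 5: 3}

def coarse_to_fine_training(model, texts, fine_labels, train):
    # Pass 1: all samples, including grades 9-12 data with coarse labels only.
    coarse_labels = [COARSE[y] for y in fine_labels]
    train(model, texts, coarse_labels, num_classes=4)
    # Pass 2: continue training with the six fine-grained grade bands.
    train(model, texts, fine_labels, num_classes=6)
```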

4 Models for Estimating Linguistic Complexity

This section introduces the four RNN structures for linguistic complexity estimation, including: a sequential RNN with task attention, a hierarchical attention network, and two proposed extensions of the hierarchical model using multi-head attention and attention over bidirectional context. In all cases, the resulting document vector is used in a final stage of ordinal regression to predict linguistic complexity. All systems are trained in an end-to-end fashion.

4.1 Sequential RNN

The basic RNN model we consider is a sequential RNN with task attention, where the entire text in a paragraph or document is taken as a sequence. For a document $t_i$ with $K$ words $w_{ik}$, $k \in \{1, 2, \ldots, K\}$, a bidirectional GRU is used to learn a representation $h_{ik}$ for each word, using a forward run from $w_{i1}$ to $w_{iK}$ and a backward run from $w_{iK}$ to $w_{i1}$:

$\overrightarrow{h}_{ik} = \overrightarrow{\mathrm{GRU}}(w_{ik})$   (1)

$\overleftarrow{h}_{ik} = \overleftarrow{\mathrm{GRU}}(w_{ik})$   (2)

$h_{ik} = [\overrightarrow{h}_{ik}, \overleftarrow{h}_{ik}]$   (3)
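For readers who want the mechanics of equations 1-3 spelled out, here is a toy NumPy re-implementation of a bidirectional GRU encoder; it follows the standard GRU update of Cho et al. (2014), with random parameters standing in for learned ones and layer normalization omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru_params(d_in, d, rng):
    g = lambda *shape: 0.1 * rng.standard_normal(shape)
    return {"Wz": g(d, d_in), "Uz": g(d, d), "bz": np.zeros(d),
            "Wr": g(d, d_in), "Ur": g(d, d), "br": np.zeros(d),
            "Wh": g(d, d_in), "Uh": g(d, d), "bh": np.zeros(d)}

def gru_step(x, h, P):
    # Update gate z, reset gate r, candidate state h_tilde.
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h + P["bz"])
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h + P["br"])
    h_tilde = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h) + P["bh"])
    return (1 - z) * h + z * h_tilde

def bi_gru(embeddings, P_fwd, P_bwd, d):
    # Forward run over w_1..w_K and backward run over w_K..w_1 (eqs. 1-2),
    # concatenated per word (eq. 3). embeddings: list of input vectors.
    fwd, bwd, h = [], [], np.zeros(d)
    for x in embeddings:
        h = gru_step(x, h, P_fwd)
        fwd.append(h)
    h = np.zeros(d)
    for x in reversed(list(embeddings)):
        h = gru_step(x, h, P_bwd)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```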

Attention weights $\alpha_{ik}$ are computed over the entire sequence and used to compute the document representation $v_i^{seq}$:

$u_{ik} = \tanh(W_s h_{ik} + b_s)$   (4)

$\alpha_{ik} = \dfrac{\exp(u_{ik}^T u_s)}{\sum_k \exp(u_{ik}^T u_s)}$   (5)

$v_i^{seq} = \sum_k \alpha_{ik} h_{ik}$   (6)
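Equations 4-6 amount to a learned weighted average over the word states; a NumPy sketch, where W_s, b_s and the context vector u_s are stand-ins for trained parameters:

```python
import numpy as np

def task_attention_pool(H, W_s, b_s, u_s):
    # H: (K, 2d) matrix of biGRU word states h_ik, one row per word.
    U = np.tanh(H @ W_s.T + b_s)          # eq. 4: projected states u_ik
    scores = U @ u_s                      # similarity with the context vector
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()           # eq. 5: softmax attention weights
    return alpha @ H, alpha               # eq. 6: document vector and weights
```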

The document vector is used to predict reading level. Since the grade levels are ordered categorical labels, we implement ordinal regression using the proportional odds model (McCullagh, 1980). For the reading level labels $j \in \{1, 2, \ldots, J\}$, the cumulative probability is modeled as

$P(y \le j \mid v_i^{seq}) = \sigma(\beta_j - w_{ord}^T v_i^{seq})$,   (7)

where $\sigma(\cdot)$ is the sigmoid function, and $\beta_j$ and $w_{ord}$ are estimated during training by minimizing the negative log-likelihood

$L_{ord} = -\sum_i \log\left(\sigma(\beta_{j(i)} - w_{ord}^T v_i^{seq}) - \sigma(\beta_{j(i)-1} - w_{ord}^T v_i^{seq})\right)$.   (8)
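The proportional-odds negative log-likelihood of equations 7-8 translates directly into code; a sketch, assuming the cutpoints beta are kept non-decreasing and padded with ±infinity so that P(y ≤ 0) = 0 and P(y ≤ J) = 1:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ordinal_nll(V, y, w_ord, beta):
    # V: (N, D) document vectors; y: (N,) integer labels in {1..J};
    # beta: (J-1,) ordered cutpoints; w_ord: (D,) regression weights.
    y = np.asarray(y)
    b = np.concatenate([[-np.inf], beta, [np.inf]])
    z = V @ w_ord
    p = sigmoid(b[y] - z) - sigmoid(b[y - 1] - z)  # eq. 8: P(y = j(i) | v_i)
    return -np.sum(np.log(p + 1e-12))
```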

4.2 Hierarchical RNN

While a sequential RNN has the capacity to capture discourse across sentences, it does not capture document structure. Therefore, we also explored the hierarchical attention network for text classification from (Yang et al., 2016). The model builds a vector representation $v_i$ for each document $t_i$ with $L$ sentences $s_l$, $l \in \{1, 2, \ldots, L\}$, each with $T_l$ words $w_{lt}$, $t \in \{1, 2, \ldots, T_l\}$. The first level of the hierarchy takes words as input and learns a representation $h_{lt}$ for each word using a bidirectional GRU. Task attention at the word level $\alpha_{lt}$ highlights words important for the classification task, and is computed using the word-level context vector $u_w$. The word representations are then averaged using attention weights to form a sentence representation $s_l$:

$\alpha_{lt} = \dfrac{\exp(u_{lt}^T u_w)}{\sum_t \exp(u_{lt}^T u_w)}$   (9)

$s_l = \sum_t \alpha_{lt} h_{lt}$,   (10)


where $u_{lt} = \tanh(W_w h_{lt} + b_w)$ is a projection of the target hidden state for learning word-level attention. The second level of the hierarchy takes the sentence vectors as input and learns a representation $h_l$ for them using a bidirectional GRU. Using a method similar to the word-level attention, a document representation $v_i$ is created using sentence-level task attention $\alpha_l$, which is computed using the sentence-level context vector $u_s$:

$\alpha_l = \dfrac{\exp(u_l^T u_s)}{\sum_l \exp(u_l^T u_s)}$   (11)

$v_i = \sum_l \alpha_l h_l$,   (12)

where $u_l = \tanh(W_s h_l + b_s)$ is analogous to $u_{lt}$ at the sentence level. The word- and sentence-level context vectors, $u_w$ and $u_s$, as well as $W_w$, $W_s$, $b_w$ and $b_s$, are learned during training.
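Composing equations 9-12, the hierarchy simply applies the same encode-then-pool step twice; a sketch under the same assumptions as before, reusing the bi_gru and task_attention_pool helpers from the earlier sketches:

```python
import numpy as np

def han_document_vector(sentences, word_P, sent_P, word_attn, sent_attn, d_w, d_s):
    # sentences: list of (T_l, d_in) word-embedding matrices, one per sentence.
    # word_P / sent_P: (forward, backward) GRU parameter pairs for each level.
    sent_vecs = []
    for emb in sentences:
        H = np.array(bi_gru(emb, word_P[0], word_P[1], d_w))     # word states h_lt
        s_l, _ = task_attention_pool(H, *word_attn)              # eqs. 9-10
        sent_vecs.append(s_l)
    Hs = np.array(bi_gru(sent_vecs, sent_P[0], sent_P[1], d_s))  # sentence states h_l
    v_i, _ = task_attention_pool(Hs, *sent_attn)                 # eqs. 11-12
    return v_i
```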

4.3 Multi-Head Attention

Work has shown that having multiple attention heads improves neural machine translation tasks (Vaswani et al., 2017). To capture multiple aspects contributing to text complexity, we learn two sets of word-level task attention over the word-level GRU output. These two sets of sentence vectors feed into separate sentence-level GRUs to give us two document vectors by averaging using task attention weights at the sentence level. The document vectors are then concatenated to form the document representation. The multi-head attention RNN is shown in figure 1.
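Under the same assumptions as the earlier sketches, the two-head variant can be expressed as two independent attention/sentence-GRU branches over shared word states, concatenated at the end:

```python
import numpy as np

def two_head_document_vector(word_states, heads, d_s):
    # word_states: list of (T_l, 2d) biGRU word-state matrices, one per sentence,
    # shared by both heads. heads: two (word_attn, sent_P, sent_attn) tuples.
    parts = []
    for word_attn, sent_P, sent_attn in heads:
        sent_vecs = [task_attention_pool(H, *word_attn)[0] for H in word_states]
        Hs = np.array(bi_gru(sent_vecs, sent_P[0], sent_P[1], d_s))
        v, _ = task_attention_pool(Hs, *sent_attn)
        parts.append(v)
    return np.concatenate(parts)  # concatenated two-head document representation
```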

4.4 Hierarchical RNN with Bidirectional Context

The hierarchical model is designed for representing document structure; however, the sentences within a document are encoded independently. To capture information across sentences, we extend the concept of context attention used in machine translation, using it to learn context vectors for adjoining sentences. We extend the hierarchical RNN by introducing bidirectional context with attention. Using the word-level GRU output, a "look-back" context vector $c_{l-1}(w_{lt})$ is calculated using context attention over the preceding sentence, and a "look-ahead" context vector $c_{l+1}(w_{lt})$ using context attention over the following sentence, for each word in the current sentence:

$\alpha_{(l-1)t'}(w_{lt}) = \dfrac{\exp(\mathrm{score}(h_{lt}, h_{(l-1)t'}))}{\sum_{t'} \exp(\mathrm{score}(h_{lt}, h_{(l-1)t'}))}$   (13)

$c_{l-1}(w_{lt}) = \sum_{t'} \alpha_{(l-1)t'}(w_{lt}) h_{(l-1)t'}$   (14)

$\alpha_{(l+1)t'}(w_{lt}) = \dfrac{\exp(\mathrm{score}(h_{lt}, h_{(l+1)t'}))}{\sum_{t'} \exp(\mathrm{score}(h_{lt}, h_{(l+1)t'}))}$   (15)

$c_{l+1}(w_{lt}) = \sum_{t'} \alpha_{(l+1)t'}(w_{lt}) h_{(l+1)t'}$   (16)

where $\mathrm{score}(h_{lt}, h_{kt'}) = h_{lt} W_\alpha h_{kt'}^T$ and a single $W_\alpha$ is used for computing the score in both directions. The context vectors are concatenated with the hidden state to form the new hidden state $h'_{lt}$:

$h'_{lt} = [c_{l-1}(w_{lt}), h_{lt}, c_{l+1}(w_{lt})]$   (17)

The rest of the structure is the same as the hierarchical RNN, using equations 9-12 with $h'_{lt}$ in place of $h_{lt}$. Figure 2 shows the structure for calculating the "look-back" context.
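A sketch of the context-attention step in equations 13-17, again in NumPy, with a single shared W_alpha as in the text; handling of the first and last sentences (which lack one neighbor) is left out for brevity:

```python
import numpy as np

def context_vectors(H_cur, H_nbr, W_alpha):
    # One context vector per word of the current sentence, attending over a
    # neighboring sentence: score(h_lt, h_kt') = h_lt W_alpha h_kt'^T.
    scores = H_cur @ W_alpha @ H_nbr.T                # (T_l, T_nbr)
    scores = scores - scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # eqs. 13 and 15
    return alpha @ H_nbr                              # eqs. 14 and 16

def bca_hidden_states(H_prev, H_cur, H_next, W_alpha):
    # eq. 17: [look-back context, hidden state, look-ahead context] per word.
    c_back = context_vectors(H_cur, H_prev, W_alpha)
    c_ahead = context_vectors(H_cur, H_next, W_alpha)
    return np.concatenate([c_back, H_cur, c_ahead], axis=1)
```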

4.5 Implementation Details

The implementation is done via the Tensorflow library (Abadi et al., 2016).[6] All RNNs use GRUs (Cho et al., 2014) with layer normalization (Ba et al., 2016), trained using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001. Regularization was done via dropout. The validation set was used for hyper-parameter tuning, with a grid search over dropout rate, number of epochs, and hidden dimension of the GRU cells. Good results for all four architectures are obtained with a batch size of 10, a dropout rate of 0.5-0.7, a cell size of 75-250 for the word-level GRU, and a cell size of 40-75 for the sentence-level GRU. For the RNN, we also trained a version with a larger word-level hidden layer cell size of 600.

Pre-trained Glove embeddings[7] are used for all models (Pennington et al., 2014), using a vocabulary size of 65000-75000.[8] The out-of-vocabulary (OOV) percentage on the CCS test set was 3%, and on the informational test set was 0.5%. All OOV words were mapped to an 'UNK' token. The text was lower-cased, and split into sentences for the hierarchical models using the natural language toolkit (NLTK) (Loper and Bird, 2002).
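A sketch of the preprocessing described here (lower-casing, NLTK sentence splitting, OOV mapping); the vocab argument is assumed to be the Glove-derived vocabulary, and NLTK's punkt models must be downloaded once:

```python
from nltk.tokenize import sent_tokenize, word_tokenize
# import nltk; nltk.download("punkt")  # one-time setup

def preprocess(text, vocab):
    # Lower-case, split into sentences for the hierarchical models,
    # and map out-of-vocabulary words to a shared UNK token.
    sentences = sent_tokenize(text.lower())
    return [[w if w in vocab else "UNK" for w in word_tokenize(s)]
            for s in sentences]
```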

5 Results and Analysis

We test our models on the two science test sets, as well as on the CCS appendix B document-level texts and a paragraph-level version of these texts.

[6] The code and trained models are available at https://github.com/Farahn/Liguistic-Complexity.
[7] http://nlp.stanford.edu/data/glove.840B.300d.zip
[8] In-vocabulary words not present in Glove had randomly initialized word embeddings.


Figure 1: RNN with Multi-Head Attention

Figure 2: RNN with Bidirectional Context and Attention

We also evaluated the best performing model on the middle school science questions data set. Since both the true reading level and the predicted levels are ordered variables, we use Spearman's rank correlation as the evaluation metric, to capture the monotonic relation between the predictions and the true levels.
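The metric itself is a one-call computation, e.g. with SciPy; the toy label arrays below are illustrative, not values from the paper:

```python
from scipy.stats import spearmanr

predicted = [3, 2, 4, 1, 3, 5]   # predicted grade-band indices (illustrative)
true_level = [3, 3, 4, 1, 2, 5]  # gold grade-band indices (illustrative)
rho, p_value = spearmanr(predicted, true_level)
print(f"Spearman rank correlation: {rho:.2f}")
```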

As a baseline, we use the WeeBit linear regression system (Vajjala and Meurers, 2014). The WeeBit system uses knowledge-based features as input to a linear regression model to predict reading level as a number between 1 and 5.5, which maps to text appropriate for readers 7-16 years of age. The feature set includes parts-of-speech (e.g. density of different parts-of-speech), lexical (e.g. measurement of lexical variation), syntactic (e.g. the number of verb phrases), morphological (e.g. ratio of transitive verbs to total words) and psycholinguistic (e.g. age of acquisition) features. There are no features related to discourse, thus it is possible to compute features for sentence-level texts. The system was trained on a subset of the data that our system was trained on, so it is at a disadvantage; we did not have the capability to retrain the system.

5.1 Results by Genre

Results for the different models:

• sequential RNN with self attention (RNN),

• large sequential RNN with self attention (RNN 600),

• hierarchical RNN with attention at the word and sentence level (HAN),

• hierarchical RNN with bidirectional context and attention (BCA), and

• multi-head attention (MHA)

are shown in table 3, together with the results for the WeeBit system, which has state-of-the-art results on the CCS documents. For the CCS data, both the D1 and D2 training configurations are used for the neural models; only D2 is used for the informational test set. For all of these models the hidden layer dimension for the word level was between 125 and 250. We also trained a sequential RNN with a larger hidden layer dimension of 600.

The HAN does better for document-level samples than a sequential RNN; the converse is true for paragraph-level texts. The RNN with a larger hidden layer dimension performs better for longer texts, while the performance of the smaller-dimension RNN deteriorates with increasing text length. The BCA model seems to generalize to longer documents and new genres better than the other neural networks.

Test Set                  Model  Samples  WeeBit  RNN   RNN 600  HAN   BCA   MHA
CCS Document              D1     168      0.69    0.28  0.43     0.47  0.55  0.42
CCS Paragraphs            D1     1532     0.36    0.30  0.25     0.29  0.32  0.28
CCS Document              D2     168      0.69    0.34  0.38     0.43  0.48  0.43
CCS Paragraphs            D2     1532     0.36    0.27  0.26     0.24  0.30  0.29
Informational Paragraphs  D2     1361     0.22    0.51  0.60     0.60  0.62  0.60

Table 3: Results (Spearman rank correlation)

Figure 3: Error distribution for the CCS documents, BCA (D1)

Figure 3 shows the error distribution for BCA (D1) in terms of distance from the true level, broken down by genre on the 168 CCS documents. The category of informational texts is often over-predicted, which we hypothesize is due in part to specific articles related to United States history and the constitution; the only training data for our models with that subject is in the grades 6-8 and 9-12 categories. The performance for literary and mixed texts, on the other hand, is roughly unbiased; this shows that the model is better at generalizing to non-informational texts, even when there are no literary text samples in the training data.

5.2 Results by Length

Figures 4 and 5 show the performance of our models and the WeeBit model as a function of document length, both on the informational paragraphs test set and the CCS paragraph-level test set. The results indicate that for shorter texts, particularly under 100 words, neural models tend to do better. Even for a mixture of genres, the model with bidirectional context performs better than the feature-based regression model, as shown in figure 5.

It is likely that the WeeBit results on shorter texts would improve if it were trained on the same training set that is used for the neural models. However, we hypothesize that the feature-based approach is less well suited for shorter documents because the feature vector will be more sparse.

Figure 4: Performance vs. text length for informational paragraphs, BCA (D2)

Comparing the CCS document- and paragraph-level test sets, the average percentage of features that are zero-valued is 28% for document-level texts and 44% for paragraph-level texts. The most sparse vectors are 40% and 81% zero-valued for document- and paragraph-level texts, respectively.

5.3 Results for Science Assessment Questions

Finally, we apply both the baseline WeeBit system and our best model (BCA trained on D1) to the set of 5470 grade 6-8 science questions. The results are shown in figures 6 and 7, where the grade 6-8 category (ages 11-14) corresponds to predicted level 3 for BCA and predicted level 4 for WeeBit. The results indicate that the BCA predictions are better aligned with human rankings than the baseline. As expected, grade 6 questions are more likely to be predicted as less difficult than grade 8 questions.

5.4 Attention Visualization

Attention can help provide insight into what the model is learning. In the analyses here, all attention values are normalized by dividing by the highest attention value in the sentence/document, to account for different sequence lengths.
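This normalization is just a division by the per-span maximum; a one-function sketch:

```python
import numpy as np

def normalize_attention(alpha):
    # Scale weights so the largest weight in the sentence/document is 1,
    # making values comparable across spans of different lengths.
    alpha = np.asarray(alpha, dtype=float)
    return alpha / alpha.max()
```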

Figure 5: Performance vs. maximum text length for CCS paragraphs, BCA (D1)

Figure 8 shows the word-level attention for the BCA and HAN for a sample text from the science assessment questions test set. (Attention weights in the figure are smoothed to reflect the fact that a word vector from a biLSTM reflects the word's context.) The results show that attention weights are more sparse for HAN than for BCA. At the sentence level (not shown here), the BCA sentence weights tend to be more uniformly distributed, whereas HAN weights are again more selective.

Another aspect of the attention is that a word does not have the same attention level for all occurrences in a document. We look at maximum and minimum values of attention as a function of word frequency for each grade band, shown in figure 9 for grade 6-8 science assessment questions.

Figure 6: BCA predicted levels for middle school science assessment questions

The pattern is similar for each grade band in the validation and test sets. The minimum attention values assigned to a word drop with increasing word frequency, while the maximum values increase. This suggests that the attention weights are more confident for more frequent words, such as of. Words like fusion and m/s get high maximum attention values, despite not being as high frequency as words like of and the. This may indicate that they are likely to contribute to linguistic complexity. The fact that transformation has a high minimum is also likely an indicator of its importance. For HAN, without bidirectional context, a similar visualization shows that while the trend is similar, the attention weights typically tend to be lower, both for minimum and maximum values.

We find that sentence-end tokens (period, exclamation mark and question mark) have high average attention weights, ranging from 0.54 to 0.81, while sentence-internal punctuation (comma, colon and semicolon) gets slightly lower weights, ranging from 0.20 to 0.47. The trend is similar for all grades. These high attention values might be due to punctuation serving as a proxy for sentence structure. It is interesting to note that the question mark gets a higher minimum attention value than the period, despite being high frequency. It may be that questions carry information that is particularly relevant to informational text difficulty.

6 Discussion

Our work differs from existing models that estimate text difficulty since we do not use engineered features. There are advantages and disadvantages to both approaches, which we briefly discuss here. Models using engineered features based on research on language acquisition offer interpretability and insight into which specific linguistic features are contributing to text difficulty. An additional advantage of using engineered features in a regression or classification model is that less training data is required.

Figure 7: WeeBit predicted levels for middle school science assessment questions

Figure 8: Word-level attention visualization for BCA (top) and HAN (bottom) for a middle school science assessment question

Figure 9: Maximum and minimum values of attention as a function of word count for BCA

However, given both the evolving theories in language acquisition and the large number of variables that impact second language acquisition, the methodologies used in language acquisition research have certain limitations. For example, the number of variables that can be considered in a study is practically limited, the sample population is often small, and the choice of qualitative vs. quantitative methodologies can influence outcomes (more details in (Larsen-Freeman and Long, 2014; Mitchell et al., 2013)). These limitations can carry into the feature engineering process. Using a model with text as input ensures that these constraints are not inherently part of the model; the performance of the system is not limited by the features provided. Of course, performance is limited by the training data, both in terms of the cost of collection and any biases inherent in the data. In addition, with advances in neural architectures such as attention modeling, there may be opportunities for identifying specific aspects of texts that are particularly difficult, though research in this direction is still in early stages.

7 Conclusion

In summary, this work explored different neural architectures for linguistic complexity analysis, to mitigate issues with the accuracy of systems based on engineered features. Experimental results show that it is possible to achieve high accuracy on texts shorter than 100 words using RNNs with attention. Using hierarchical structure improves results, particularly with attention models that leverage bidirectional sentence context. Testing on a mix of genres shows that the best neural model can generalize to subjects beyond what it is trained on, though it performs slightly worse than a feature-based regression model on texts longer than 100 words. More training data from other genres will likely reduce the performance gap. Analysis of attention weights can provide insights into which phrases/sentences are important, both at the aggregate and sample level. Developing new methods for analysis of attention may be useful both for improving model performance and for providing more interpretable results for educators.

Two aspects not considered in this work are explicit representation of syntax and discourse structure. Syntax can be incorporated by concatenating word and dependency embeddings at the token level. Our BCA model was designed to capture cross-sentence coherence and coordination, but it may be useful to extend the hierarchy for longer documents and/or introduce explicit models of the types of discourse features used in Coh-Metrix (Graesser et al., 2004).

Acknowledgments

We thank Dr. Meurers, Professor at the University of Tübingen, and Dr. Vajjala-Balakrishna, Assistant Professor at Iowa State University, for sharing the WeeBit training corpus, their trained readability assessment model, and the Common Core test corpus.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

Jamal Abedi. 2006. Psychometric issues in the ELL assessment and special education eligibility. Teachers College Record, 108(11):2282.

Jamal Abedi and Carol Lord. 2001. The language factor in mathematics tests. Applied Measurement in Education, 14(3):219–234.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Jeanne Sternlicht Chall and Edgar Dale. 1995. Readability revisited: The new Dale-Chall readability formula. Brookline Books.

Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.

Meri Coleman and Ta Lin Liau. 1975. A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2):283.

Kevyn Collins-Thompson and James P. Callan. 2004. A language modeling approach to predicting reading difficulty. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2004, Boston, Massachusetts, USA, May 2-7, 2004, pages 193–200.

Lijun Feng, Martin Jansche, Matt Huenerfauth, and Noemie Elhadad. 2010. A comparison of features for automatic readability assessment. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 276–284. Association for Computational Linguistics.

Arthur C. Graesser, Danielle S. McNamara, and Jonna M. Kulikowich. 2011. Coh-Metrix. Educational Researcher, 40(5):223–234.

Arthur C. Graesser, Danielle S. McNamara, Max M. Louwerse, and Zhiqiang Cai. 2004. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, 36(2):193–202.

Marian Hickendorff. 2013. The language factor in elementary mathematics assessments: Computational skills and applied problem solving in a multidimensional IRT framework. Applied Measurement in Education, 26(4):253–278.

Tushar Khot, Niranjan Balasubramanian, Eric Gribkoff, Ashish Sabharwal, Peter Clark, and Oren Etzioni. 2015. Markov logic networks for natural language question answering. arXiv preprint arXiv:1507.03045.

J. Peter Kincaid, Robert P. Fishburne Jr., Richard L. Rogers, and Brad S. Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for Navy enlisted personnel. Technical report, Naval Technical Training Command, Millington TN, Research Branch.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Diane Larsen-Freeman and Michael H. Long. 2014. An introduction to second language acquisition research. Routledge.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In Proc. ICLR.

Edward Loper and Steven Bird. 2002. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, pages 63–70. Association for Computational Linguistics.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Peter McCullagh. 1980. Regression models for ordinal data. Journal of the Royal Statistical Society, Series B (Methodological), pages 109–142.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Rosamond Mitchell, Florence Myles, and Emma Marsden. 2013. Second language learning theories. Routledge.

Jessica Nelson, Charles Perfetti, David Liben, and Meredith Liben. 2012. Measures of text difficulty: Testing their predictive value for grade levels and student performance. Council of Chief State School Officers, Washington, DC.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Sarah E. Petersen and Mari Ostendorf. 2009. A machine learning approach to reading level assessment. Computer Speech & Language, 23(1):89–106.

Sarah E. Schwarm and Mari Ostendorf. 2005. Reading level assessment using support vector machines and statistical language models. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 523–530. Association for Computational Linguistics.

Kathleen M. Sheehan, Michael Flor, and Diane Napolitano. 2013. A two-stage approach for generating unbiased estimates of text complexity. In Proceedings of the Workshop on Natural Language Processing for Improving Textual Accessibility, pages 49–58.

A.J. Stenner, Ivan Horabin, Dean R. Smith, and Malbert Smith. 1988. The Lexile framework. Durham, NC: MetaMetrics.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.

Sowmya Vajjala and Detmar Meurers. 2012. On improving the accuracy of readability classification using insights from second language acquisition. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 163–173. Association for Computational Linguistics.

Sowmya Vajjala and Detmar Meurers. 2014. Readability assessment for text simplification: From analysing documents to identifying sentential simplifications. ITL-International Journal of Applied Linguistics, 165(2):194–222.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. 2016. Hierarchical attention networks for document classification. In HLT-NAACL, pages 1480–1489.

Dani Yogatama, Chris Dyer, Wang Ling, and Phil Blunsom. 2017. Generative and discriminative text classification with recurrent neural networks. arXiv preprint arXiv:1703.01898.