Data Science: Predict Bill Passage with Topics Only
Data Science: Predict Success of Legislation with Topics Only
Natural Language Processing with Sunlight Foundation Open States API!
Pauline Chow, Fall 2016
What policies and laws relate to your well being?
When I first asked this question, I was working in and interested in transportation policy, especially walking and bicycling.
Why Is Analyzing Elements of Successful Legislation Important?
• Increase transparency of the various levels of decision making: federal, state, and local
• Effectively understand trends in public policy, in order to educate and influence
• Distill the legislative process into a logical system of features
• Extract and identify relationships of decision makers with communities, topics, and laws
Data Science Steps for Predicting Legislative Results
1. Collect Data from Sunlight Foundation API and other open data sources
2. Clean the text of legislative bills obtained via web scraping, including removing HTML and stop words and separating out the target variable (i.e., bill passage)
3. Extract features from text in Python
4. Build topics from text using Latent Dirichlet Allocation (LDA), a probabilistic approach
5. Implement supervised learning models
6. Analyze results
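The cleaning step (step 2) could look roughly like the sketch below. It is a minimal illustration, not the author's pipeline: the function name `clean_bill_text`, the tiny `STOP_WORDS` set, and the sample sentence are all hypothetical, and a regular expression is only a rough stand-in for a real HTML parser.

```python
import html
import re

# Hypothetical, deliberately tiny stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "for", "be", "shall"}

def clean_bill_text(raw_html):
    """Return lowercase word tokens with HTML tags and stop words removed."""
    no_tags = re.sub(r"<[^>]+>", " ", raw_html)  # drop anything that looks like a tag
    text = html.unescape(no_tags).lower()         # decode entities such as &amp;
    tokens = re.findall(r"[a-z]+", text)          # keep alphabetic tokens only
    return [t for t in tokens if t not in STOP_WORDS]

# Hypothetical snippet of scraped bill text.
sample = "<p>The owner of an <b>underground</b> storage tank shall pay a fee.</p>"
print(clean_bill_text(sample))
```

The target variable (passed or failed) would be stored separately from the text so that it cannot leak into the features.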
1. Collect: As the initial step in building predictive models, the insights here reflect features from the text of California bills from 2009 to 2014
3. How to Extract Features from Text?
Sick  of  Having  to  go  2  different  hut  buy  pizza  sunglass
1     1   1       2   1   1  1          1    1    1      1
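The token counts above are a bag-of-words representation. A minimal sketch using only the standard library is shown here; the helper name `bag_of_words` is an assumption, and unlike the slide it does not reduce words to base forms such as "hut" or "sunglass".

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into alphanumeric tokens, and count each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

counts = bag_of_words("Sick of having to go to two huts for pizza and sunglasses")
for word, n in counts.items():
    print(f"{word:<12}{n}")
```

Word order is discarded; only the number of times each token occurs is kept, which is why "to" carries a count of 2.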
4. Build: What is the Latent Dirichlet Allocation (LDA) Topic Model?
• Finds hidden semantic structure, aka context, where topics are clusters of similar words: P(word | context)
• Each document is a mixture of topics, and each topic is a mixture of words and phrases, each weighted by a probability
• Tune parameters: number of words in each topic, mixture within each topic, threshold for frequency and probability
• For example: Topic A (documents 1, 2, 5): breakfast 30%, pizza 10%, smoothie 5%
1 Sick of having to go to two huts for pizza and sunglasses
2 I ate a cold pizza and spinach smoothie for breakfast.
3 I wear my sunglasses at night so I can see
4 Sometimes I get really sick when I go on roller coasters
5 Coffee only for breakfast because coffee is for closers
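To make the idea of documents as mixtures of topics concrete, here is a small collapsed Gibbs sampler for LDA written with the standard library only. It is a teaching sketch under stated assumptions, not the approach used in the deck (which cites Gensim): the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the choice of two topics, and the hand-tokenized versions of the five sentences are all illustrative.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-document topic mixtures
    and per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]       # tokens per (document, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                     # tokens per topic
    assign = []
    for d, doc in enumerate(docs):                     # random initial topic per token
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]                       # remove the token's current topic
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [                            # P(topic | all other assignments)
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z                       # record the newly sampled topic
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
             for row, doc in zip(doc_topic, docs)]
    return theta, topic_word

# Hand-tokenized versions of the five example sentences (an assumption).
docs = [
    ["sick", "hut", "pizza", "sunglasses"],
    ["eat", "cold", "pizza", "spinach", "smoothie", "breakfast"],
    ["wear", "sunglasses", "night", "see"],
    ["sick", "roller", "coaster"],
    ["coffee", "breakfast", "coffee", "closer"],
]
theta, topic_word = lda_gibbs(docs)
for number, mixture in enumerate(theta, start=1):
    print(number, [round(p, 2) for p in mixture])
```

Each printed row is one document's topic mixture and sums to 1. With only five short sentences the topics themselves are not meaningful; the point is the shape of the output.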
4. Build Word2Vec: Find Similarities
• Extract relationships in unstructured text
• Leverage context of documents and LDA’s probabilistic models
• Hierarchical structure of probabilities
• Derive meaning from cleaned vector of words and phrases
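Once words are represented as vectors, "finding similarities" usually means comparing those vectors, commonly with cosine similarity. The sketch below shows only that comparison step. The three-dimensional vectors are made-up placeholders, not trained Word2Vec embeddings, and `most_similar` is a hypothetical helper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors for illustration only; real embeddings have many more dimensions.
vectors = {
    "bicycle": [0.9, 0.1, 0.0],
    "bus": [0.8, 0.2, 0.1],
    "parole": [0.0, 0.1, 0.9],
}

def most_similar(word, vectors):
    """Return the other word whose vector has the highest cosine similarity."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(most_similar("bicycle", vectors))
```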
5. Implement Logistic Regression: CA Bills Over Time
1. Model predicts failed legislation better than successful legislation
2. Predictive results did not differ significantly between models with 50 and 100 topics
3. Precision = TP / (TP + FP)
4. Recall = TP / (TP + FN)
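The precision and recall definitions can be computed directly from true and predicted labels. This is a minimal sketch; the function name `precision_recall` and the eight toy labels are hypothetical and are not taken from the bill data.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision = TP / (TP + FP) and recall = TP / (TP + FN) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels for illustration only.
y_true = ["passed", "passed", "passed", "failed", "failed", "failed", "failed", "failed"]
y_pred = ["passed", "failed", "failed", "failed", "failed", "failed", "failed", "passed"]

print(precision_recall(y_true, y_pred, positive="passed"))
print(precision_recall(y_true, y_pred, positive="failed"))
```

Treating each class in turn as the positive class gives the per-class rows of a classification report.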
Model Classification Report: All Topics Over Time
target       precision  recall  f1-score  support
failed       0.70       0.98    0.82      3118
passed       0.47       0.04    0.07      1360
avg / total  0.63       0.69    0.59      4478
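As a consistency check on the report, the f1-score column and the "avg / total" row can be recomputed from the per-class precision, recall, and support. The sketch assumes the averages are weighted by support, which matches the reported figures; the helper names are hypothetical.

```python
# Per-class values copied from the classification report.
report = {
    "failed": {"precision": 0.70, "recall": 0.98, "support": 3118},
    "passed": {"precision": 0.47, "recall": 0.04, "support": 1360},
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def weighted_average(values, supports):
    """Average of per-class values weighted by the number of examples per class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)

supports = [row["support"] for row in report.values()]
f1_by_class = {name: f1_score(row["precision"], row["recall"]) for name, row in report.items()}
avg_precision = weighted_average([row["precision"] for row in report.values()], supports)
avg_recall = weighted_average([row["recall"] for row in report.values()], supports)
avg_f1 = weighted_average(list(f1_by_class.values()), supports)

for name, value in f1_by_class.items():
    print(name, round(value, 2))
print("avg / total", round(avg_precision, 2), round(avg_recall, 2), round(avg_f1, 2))
```

The very low recall for passed bills (0.04) is what pulls the weighted f1-score down to 0.59, even though 69% of predictions are correct overall.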
6. Analyze California Bills (100 Topic Models)
• Bills contain an average of 6.57 topics, ranging from 2 to 16.
• Passage rate by topic ranged from 18% to 36%, averaging 28% for all bills in the database
• Most frequent topics of legislation relate to local government funding/taxes/leadership initiatives, health care, education, budget and taxes, and court system
• Highest and lowest passage rate topics are reviewed in the next few slides
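Summary statistics like these can be derived from a table of bills, each with a passage flag and a list of topics. The sketch below uses four made-up records and a hypothetical helper `passage_rate_by_topic`; it only illustrates the bookkeeping and does not reproduce the reported numbers.

```python
from collections import defaultdict

# Hypothetical records for illustration; not actual bills.
bills = [
    {"passed": True, "topics": [11, 48, 70]},
    {"passed": False, "topics": [11, 48]},
    {"passed": False, "topics": [48, 79]},
    {"passed": True, "topics": [11, 74]},
]

def passage_rate_by_topic(bills):
    """Share of bills containing each topic that passed."""
    seen = defaultdict(int)
    passed = defaultdict(int)
    for bill in bills:
        for topic in set(bill["topics"]):  # count each topic once per bill
            seen[topic] += 1
            if bill["passed"]:
                passed[topic] += 1
    return {topic: passed[topic] / seen[topic] for topic in seen}

rates = passage_rate_by_topic(bills)
average_topics = sum(len(b["topics"]) for b in bills) / len(bills)
print(average_topics)
print(sorted(rates.items()))
```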
6. Analyze: Distribution of Topics in California Bills
Top 10 Topics by Frequency
Topic #s Frequency
topic 48 13569
topic 11 11838
topic 51 8024
topic 73 4913
topic 6 3675
topic 63 2782
topic 1 2615
topic 22 1879
topic 64 1726
topic 45 1663
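A frequency table like this one is a count of how often each topic number appears across bills. As a small sketch, the same count is run here on the "All Topics" lists of the eight sample bills shown later in this deck; the variable names are assumptions, and eight bills are far too few to reproduce the full ranking.

```python
from collections import Counter

# "All Topics" lists copied from the sample bill tables in this deck.
sample_bill_topics = {
    "AB291": [11, 45, 48, 70],
    "AB1286": [11, 45, 74],
    "AB1969": [1, 11, 47, 48],
    "AB2488": [1, 11, 48, 49, 73, 94],
    "SB1219": [28, 45, 48, 57],
    "AB1405": [6, 11, 20, 37, 48, 51, 66],
    "AB220": [48, 51, 63, 76],
    "AB863": [11, 30, 48, 51, 62],
}

topic_frequency = Counter(t for topics in sample_bill_topics.values() for t in topics)
average_topics = sum(len(t) for t in sample_bill_topics.values()) / len(sample_bill_topics)

print(topic_frequency.most_common(2))
print(average_topics)
```

Even in this tiny sample, topics 48 and 11 come out on top, in line with the full frequency table.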
6. What Topics Support Bill Passage?
Rank  Topic #  Odds Ratio  LDA Topics
1     70       453.981     0.023*tank + 0.019*underground + 0.015*transferor + 0.011*lie + 0.010*decennial + 0.008*storage + 0.008*cotenant + 0.008*stanford + 0.006*petroleum + 0.006*orphan
2     74       32.797      0.020*contribution + 0.014*calendar + 0.010*canyon + 0.009*lincoln + 0.009*shoulder + 0.007*stenographer + 0.006*inflation + 0.005*dispatcher + 0.005*vine + 0.005*boyer
3     47       32.695      0.024*cemetery + 0.011*mexican + 0.010*interment + 0.007*salton + 0.006*elsinore + 0.006*tuberculosis + 0.005*burial + 0.004*bacteria + 0.004*creek + 0.004*coliform
4     42       28.312      0.008*hoover + 0.003*tricare + 0.002*shower + 0.002*crutch + 0.002*contractholders + 0.002*bath + 0.001*dme + 0.001*durable + 0.000*hcpcs + 0.000*wheelchair
5     21       25.041      0.021*wyland + 0.015*reorganization + 0.014*brown + 0.012*gordon + 0.010*presidential + 0.008*ford + 0.008*gerald + 0.008*battalion + 0.007*mitochondrial + 0.007*remembrance
6     71       9.608       0.064*andwhereas + 0.024*awareness + 0.020*week + 0.014*whereas + 0.013*violence + 0.012*woman + 0.010*disease + 0.010*resolution + 0.010*month + 0.009*furtherresolved
7     65       8.798       0.029*pipeline + 0.027*ronald + 0.022*sea + 0.013*coastal + 0.012*marine + 0.009*rise + 0.008*reagan + 0.008*thomas + 0.007*climate + 0.007*arctic
8     34       8.183       0.030*candidate + 0.011*teen + 0.011*precinct + 0.011*nomination + 0.011*poll + 0.009*freeway + 0.009*say + 0.009*dating + 0.008*teenager + 0.007*sca
9     58       5.086       0.022*autism + 0.014*nursing + 0.014*therapist + 0.013*mr + 0.013*calderon + 0.011*backpack + 0.009*credentialing + 0.008*therapy + 0.008*acupuncture + 0.008*marriage
10    44       4.899       0.021*scientist + 0.017*negrete + 0.016*factfinding + 0.015*hepatitis + 0.015*mcleod + 0.011*maternity + 0.009*interdistrict + 0.009*knuckle + 0.008*liver + 0.005*infected
15    94       2.103       0.023*bicycle + 0.017*bus + 0.013*midwife + 0.011*deployed + 0.010*roadway + 0.009*smog + 0.008*schoolbus + 0.007*safer + 0.005*polluter + 0.005*overtaking
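One common way to obtain an odds ratio per topic is to exponentiate the coefficient that a logistic regression assigns to that topic's feature. The slides do not state how the odds ratios were computed, so this is an assumption. The sketch fits a small logistic regression by plain gradient descent on made-up data with two hypothetical topic indicators; `fit_logistic` and all values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi                      # gradient of the log loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Made-up data: columns are [has_topic_a, has_topic_b]; y is 1 if the bill passed.
X = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [0, 1], [0, 0], [0, 0]]
y = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

weights, intercept = fit_logistic(X, y)
odds_ratios = [math.exp(wj) for wj in weights]  # odds ratio = exp(coefficient)
print([round(o, 2) for o in odds_ratios])
```

An odds ratio above 1 means bills containing the topic have higher odds of passing than bills without it; a value below 1 means lower odds.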
Sample CA Bills Containing “Strong” Topics
Bill Status  Bill Session, ID (Link)  Bill Title  Topic #  All Topics
Passed  2011-2012-0 AB291  Underground storage tanks: petroleum: charges.  70  11, 45, 48, 70
Passed  2013-2014-0 AB1286  Personal income tax: voluntary contributions: California Breast Cancer Research Fund  74  11, 45, 74
Passed  2009-2010-0 AB1969  Elsinore Valley Cemetery  47  1, 11, 47, 48
Passed  2011-2012-0 AB2488  Vehicles: buses: length limitations  94  1, 11, 48, 49, 73, 94
6. What Topics Have Weak Bill Passage?
Rank  Topic #  Odds Ratio  Topics
-10   79       0.000498    0.057*emission + 0.050*greenhouse + 0.041*gas + 0.019*warming + 0.017*global + 0.016*climate + 0.014*air + 0.013*carbon + 0.013*reduction + 0.013*solution
-9    92       0.000156    0.038*trafficking + 0.012*duress + 0.010*menace + 0.010*fiduciary + 0.010*ammunition + 0.009*chvez + 0.009*human + 0.008*achadjian + 0.008*wilk + 0.006*bigelow
-8    76       0.000150    0.071*inmate + 0.065*parole + 0.026*parolee + 0.023*prison + 0.021*correction + 0.019*rehabilitation + 0.010*released + 0.009*recidivism + 0.008*reentry + 0.007*journalist
-7    28       0.000113    0.036*bag + 0.024*plastic + 0.015*carryout + 0.011*positioning + 0.007*tends + 0.007*electorate + 0.006*store + 0.006*deliberately + 0.006*undetermined + 0.005*el
-6    50       0.000063    0.010*romero + 0.009*antipsychotic + 0.008*medication + 0.006*dementia + 0.006*detachable + 0.005*salvage + 0.004*dietary + 0.004*psychotropic + 0.004*repurchase + 0.003*diminishes
-5    24       0.000053    0.023*baby + 0.006*depression + 0.005*paratransit + 0.005*stewardship + 0.004*producer + 0.004*perinatal + 0.003*unwanted + 0.003*obstetrics + 0.003*sleep + 0.002*calhome
-4    8        0.000035    0.054*gang + 0.013*immunity + 0.010*tort + 0.008*rifle + 0.007*european + 0.007*magazine + 0.007*pervasive + 0.005*deadly + 0.005*mentally + 0.005*disordered
-3    81       0.000013    0.014*interpreter + 0.007*excellence + 0.006*digitized + 0.005*reelected + 0.005*easy + 0.004*fluency + 0.004*biodegradable + 0.002*willfulness + 0.002*annoyance + 0.002*disincentive
-2    10       0.000004    0.022*budget + 0.013*muratsuchi + 0.013*sawyer + 0.013*mullin + 0.012*bloom + 0.012*nazarian + 0.012*daly + 0.012*campos + 0.011*rodriguez + 0.010*dababneh
-1    62       0.000001    0.005*consummated + 0.004*nonconsenting + 0.003*nonsupervisory + 0.001*peculiar + 0.001*culminating + 0.000*overdue + 0.000*reputation + 0.000*unimpeded + 0.000*foster + 0.000*licentious
Sample CA Bills Containing “Weak” Topics
Bill Status  Bill Session, ID (Link)  Bill Title  Topic #  All Topics
Passed  2011-2012-0 SB1219  Recycling Plastic Bags  28  28, 45, 48, 57
Passed  2013-2014-0 AB1405  Subversive Organization Registration Law: repeal  92  6, 11, 20, 37, 48, 51, 66
Passed  2011-2012-0 AB220  Interstate Compact for Juveniles.  76  48, 51, 63, 76
Passed  2009-2010-0 AB863  Public utilities: municipal districts: civil service exemptions.  62  11, 30, 48, 51, 62
Next Steps for Legislative Predictions
• Add time context for bills in terms of legislative session, chamber, and major political events
• Add features about the bill, sponsors, districts, political context, duration, committees, and public comments
• Include exploratory data analysis from bill and legislator data
• Tune model to apply predictions to current bills
Relevant Citations
• Gensim: Topic Modeling for Humans, by Radim Řehůřek, open-source Python package
• Wallach, H. M. (n.d.). Topic Modeling: Beyond Bag-of-Words. Retrieved from poster link
• Gerrish, S. M., & Blei, D. M. (2011). Predicting legislative roll calls from text. In Proc. of ICML. Retrieved from article link
• Rong, X. (2016, June 5). Word2vec Parameter Learning Explained. arXiv:1411.2738 [cs.CL]. Retrieved from article link
• Unsplash for stock photo