Text modeling with R, Python, and Spark
frank-evans
Transcript of Text modeling with R, Python, and Spark
Data Set
• 70 years of the State of the Union address
• 1945 (Truman) - 2015 (Obama)
• Avg. length: ~6,700 words
• Longest: ~34,000 words
• Shortest: ~2,000 words
• Total: 467,000 words
• Raw data: 2.4 MB
Config Wrangle Model Cluster Visualize
Raw text:
America has enjoyed twenty-two months of uninterrupted economic recovery. But recovery is not enough. If we are to prevail in the long run, we must expand the long-run strength of our economy.
After lowercasing and removing punctuation and stop words:
america enjoyed twenty-two months uninterrupted economic recovery recovery not enough prevail long run expand long-run strength economy
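The cleanup shown on this slide can be sketched in a few lines of Python. This is only an illustration, not the code used in the talk: the stop-word list below is a hypothetical one, just large enough to cover the example sentence, and the function name wrangle is made up for this sketch. Hyphenated words are kept whole, as they are on the slide.

```python
import re

# Hypothetical stop-word list (an assumption): only the words needed
# for the example sentence. A real list would be much longer.
STOP_WORDS = {
    "has", "of", "but", "is", "if", "we",
    "are", "to", "in", "the", "must", "our",
}

# A token is a run of letters, optionally joined by internal hyphens,
# so "twenty-two" and "long-run" survive as single tokens.
TOKEN_RE = re.compile(r"[a-z]+(?:-[a-z]+)*")


def wrangle(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    tokens = TOKEN_RE.findall(text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


raw = (
    "America has enjoyed twenty-two months of uninterrupted economic "
    "recovery. But recovery is not enough. If we are to prevail in the "
    "long run, we must expand the long-run strength of our economy."
)
print(wrangle(raw))
```

Note that "not" is deliberately left out of the stop-word list, matching the slide, where it is kept in the cleaned text.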
Config Wrangle Model Extract Visualize
Raw text:
America has enjoyed twenty-two months of uninterrupted economic recovery. But recovery is not enough. If we are to prevail in the long run, we must expand the long-run strength of our economy.
After stop-word removal and reducing words to base forms (enjoyed → enjoy, months → month, economic → economy):
america enjoy twenty-two month uninterrupted economy recovery recovery not enough prevail long run expand long-run strength economy
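The slides do not say which tool produced these base forms, so the following is only a minimal sketch of the idea using a lookup table. The table BASE_FORMS and the function to_base_forms are hypothetical names, and the table holds just the three substitutions visible on the slide; a real pipeline would more likely use a stemmer or lemmatizer from a library.

```python
# Hypothetical lookup table (an assumption): only the three
# substitutions that are visible in the slide's example.
BASE_FORMS = {
    "enjoyed": "enjoy",
    "months": "month",
    "economic": "economy",
}


def to_base_forms(cleaned_text):
    """Replace each token with its base form when the table has one."""
    tokens = cleaned_text.split()
    return " ".join(BASE_FORMS.get(t, t) for t in tokens)


cleaned = (
    "america enjoyed twenty-two months uninterrupted economic recovery "
    "recovery not enough prevail long run expand long-run strength economy"
)
print(to_base_forms(cleaned))
```

Mapping "economic" and "economy" to the same token is the point of this step: both spellings then count toward one term when the text is modeled.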
Data Set (Congress loves to talk)
• 20 years of Congressional Hearings (1995 - 2015)
• 19,381 documents (about 1,000 a year)
• Avg. length: ~32,000 words (5x SOTU)
• Longest: ~900,000 words (length of all 7 HP books)
• Shortest: ~50 words
• Total: 613 million words (1,300x SOTU)
• Raw data: 3.8 GB
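The comparisons on this slide can be checked from the numbers given for the two data sets. The short sketch below uses only the slide figures; the one added assumption is that the State of the Union set holds one address per year, that is, 70 speeches.

```python
# Figures as stated on the slides.
SOTU_TOTAL_WORDS = 467_000
SOTU_SPEECHES = 70              # assumption: one address per year
SOTU_AVG_WORDS = 6_700
HEARING_TOTAL_WORDS = 613_000_000
HEARING_DOCS = 19_381
HEARING_YEARS = 20

# Averages derived from the totals.
sotu_avg = SOTU_TOTAL_WORDS / SOTU_SPEECHES
hearing_avg = HEARING_TOTAL_WORDS / HEARING_DOCS
docs_per_year = HEARING_DOCS / HEARING_YEARS

# How much larger the hearings corpus is than the SOTU corpus.
avg_ratio = hearing_avg / SOTU_AVG_WORDS
total_ratio = HEARING_TOTAL_WORDS / SOTU_TOTAL_WORDS

print(f"SOTU average length:    {sotu_avg:,.0f} words")
print(f"Hearing average length: {hearing_avg:,.0f} words")
print(f"Hearings per year:      {docs_per_year:,.0f}")
print(f"Average length ratio:   {avg_ratio:.1f}x")
print(f"Total words ratio:      {total_ratio:,.0f}x")
```

The results round to the values on the slides: about 6,700 and 32,000 words on average, roughly 1,000 hearings a year, and ratios near 5x and 1,300x.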