
Post on 18-Mar-2018


How to Teach a Computer to Read the Internet

An Open-Source Project on GitHub

Structured, Semi-Structured, Unstructured

RSS, XML, JSON

Wiki, Twitter

Meta Tags

News, Translations

Poetry

1. parse what you expect

2. see what else you get

3. repeat real fast

Methodology

1. parse what you expect

there are facts that have keys made of words and values that aren’t keys

2. see what else you get

we found 19,498 keys

one key was partners

3. repeat real fast

the run started at 18:02:32

parsed 87,325,460,601 bytes
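The three steps above can be sketched in a few lines. This is a minimal illustration, not the project's parser: the `key = value` line format, the `EXPECTED` set, the sample lines, and the function name `survey` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical fact format: a key made of words, "=", then a value.
FACT = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s*=\s*(.+?)\s*$")

# Keys we expect to see (assumed for illustration).
EXPECTED = {"name", "founded"}

def survey(lines):
    """Step 1: parse the facts we expect.
    Step 2: count every key seen and keep the lines that did not parse."""
    facts, keys, leftovers = [], Counter(), []
    for line in lines:
        m = FACT.match(line)
        if not m:
            leftovers.append(line)
            continue
        key, value = m.group(1).lower(), m.group(2)
        keys[key] += 1
        if key in EXPECTED:
            facts.append((key, value))
    return facts, keys, leftovers

# Made-up sample input.
sample = ["name = Example Co", "partners = many", "free text here", "founded = 1999"]
facts, keys, leftovers = survey(sample)

# Step 3: look at what else turned up, adjust EXPECTED, and run again.
unexpected = sorted(k for k in keys if k not in EXPECTED)
```

Here `unexpected` holds keys such as `partners` that the first pass did not anticipate, and `leftovers` holds the lines that were not facts at all; both feed the next iteration.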

Diagram labels: key, value, upper, lower, familiar

Real Fast Iterating

Real Fast Parsing

The quick brown fox

b e b e b e b e

Diagram labels: fox\0, yybuf, yythunk, yytext, yybegin, yyend, yylimit, yybuflen, yypos, noun, adj
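The b/e marks under "The quick brown fox" suggest recording a begin and end offset for each word in the buffer rather than copying text while scanning. The sketch below shows that idea only; it is not the generated parser's code, and the function name `mark_words` is made up for the example.

```python
import re

def mark_words(buf):
    """Return a (begin, end) offset pair for each word in buf.
    No text is copied while scanning; only offsets are kept."""
    return [m.span() for m in re.finditer(r"\S+", buf)]

buf = "The quick brown fox"
spans = mark_words(buf)

# Text is copied out of the buffer only when a word is actually needed.
begin, end = spans[-1]
last = buf[begin:end]
```

For this buffer the sketch yields four offset pairs, one per word, and `last` is the final word.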

Real Fast Data

50 min for every byte

5 min for useful prefix

5 sec for last sampling

(30GB of wikitext)
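As a rough check on these figures, the full pass implies a throughput of about 10 MB per second. The sketch assumes 1 GB means 10^9 bytes, and the prefix estimate further assumes the 5-minute run reads at the same rate, which the slide does not state.

```python
def bytes_per_second(total_bytes, seconds):
    """Average throughput of a run, in whole bytes per second."""
    return total_bytes // seconds

TOTAL = 30 * 10**9                       # 30 GB of wikitext, assuming 1 GB = 10^9 bytes
rate = bytes_per_second(TOTAL, 50 * 60)  # full pass: 50 minutes

# Assumption: a 5-minute prefix run reads at the same average rate.
prefix_bytes = rate * 5 * 60
```

Under those assumptions `rate` is 10,000,000 bytes per second and a 5-minute prefix covers about 3 GB, a tenth of the data.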

Compiler Writer:

Parsing Explorer:

Diagram labels: grammar, text

programs

wikitext

regex text substitutions

extra code for nesting

primary = identifier !'=' | '(' expression ')'

nesting as in yacc

lookahead as in regex
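The rule above can be read as a small recursive-descent function. This is a hand-written sketch, not output from a parser generator. The slide does not define `identifier` or `expression`, so the `IDENT` pattern, the whitespace skipping, and the choice to treat `expression` as another `primary` are assumptions for illustration.

```python
import re

# Assumed shape of an identifier; the slide does not define it.
IDENT = re.compile(r"[A-Za-z_]\w*")

def skip_ws(s, i):
    """Advance past whitespace (an assumption; the rule itself has none)."""
    while i < len(s) and s[i].isspace():
        i += 1
    return i

def primary(s, i=0):
    """primary = identifier !'=' | '(' expression ')'
    Returns (tree, next_index) on success, or None on failure."""
    i = skip_ws(s, i)
    m = IDENT.match(s, i)
    if m:
        j = skip_ws(s, m.end())
        # Lookahead: succeed only if '=' does NOT follow; nothing is consumed.
        if not s.startswith("=", j):
            return m.group(), m.end()
    # Ordered choice: the first alternative failed, so try the second
    # from the same starting position.
    if s.startswith("(", i):
        inner = expression(s, i + 1)   # nesting: the rule recurses
        if inner:
            tree, j = inner
            j = skip_ws(s, j)
            if s.startswith(")", j):
                return ("group", tree), j + 1
    return None

def expression(s, i):
    # Assumption: stand-in definition so the sketch is self-contained.
    return primary(s, i)
```

Calling `primary("x")` succeeds, `primary("x = 1")` fails because the lookahead sees `=`, and `primary("((y))")` succeeds by recursing through both pairs of parentheses.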

Piumarta 2007

Ford 2004

read this

dev.AboutUs.org

http://c2.com/~ward/sao/TechIgnite-v1/