Data deluge/Adatáradat

description

"Data is the new oil," as the saying goes. Recent developments in IT have made it possible to collect, store, and analyze large amounts of data. Halevy, Norvig, and Pereira argue [1] that, given a large enough data set, naive algorithms outperform highly sophisticated ones. Bender and Good [2], in turn, suggest that we have to review our theories about language in light of the unprecedented amount of available empirical data. This approach parallels the so-called probabilistic linguistics research program [3]. Using the Internet as a source of data is exciting and challenging: information is usually encoded in text files, and we have to employ natural language processing techniques to extract it. To cope with the sheer size of today's data sets, we have to adapt our algorithms to modern parallel and distributed processing systems.

[1] Alon Halevy, Peter Norvig, and Fernando Pereira. 2009. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, March/April 2009.
[2] Emily M. Bender and Jeff Good. 2010. A Grand Challenge for Linguistics: Scaling Up and Integrating Models. White paper contributed to NSF's SBE 2020 initiative. http://www.nsf.gov/sbe/sbe_2020/submission_detail.cfm?upld_id=81 (accessed 06.06.2012)
[3] Rens Bod, Jennifer Hay, and Stefanie Jannedy (eds.). 2003. Probabilistic Linguistics. MIT Press.

Transcript of Data deluge/Adatáradat

Page 1: Data deluge/Adatáradat

Data deluge

“The hard part is not solving problems, but how to pose them.”

Varju Zoltan

Weblib Kft.

2012-06-23

Varju Zoltan (Weblib Kft.) Adataradat 2012-06-23 1 / 6

Page 2: Data deluge/Adatáradat

From search to the data deluge

Dean - Ghemawat: MapReduce: Simplified Data Processing on Large Clusters

Halevy - Norvig - Pereira: The Unreasonable Effectiveness of Data

Hadoop

NoSQL (Couchbase, MongoDB, etc.)

statistics - data mining - machine learning - data science

Page 4: Data deluge/Adatáradat

Will big data solve everything?

On sufficiently large data sets, simple n-gram models outperform their more sophisticated counterparts.

In linguistics, the generative school and the probabilistic approach are competing with each other.

Bender - Good: A Grand Challenge for Linguistics: Scaling Up and Integrating Models

We need to radically rethink our existing theories.
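
To make the "simple n-gram model" on this slide concrete, here is a minimal sketch of a bigram model estimated by plain counting. It is only an illustration under stated assumptions: the toy corpus, the function names, and the `<s>` / `</s>` boundary markers are hypothetical and not taken from the talk, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Hypothetical boundary markers so sentence starts and ends are modeled too.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts


def bigram_probability(counts, prev, cur):
    """Maximum-likelihood estimate of P(cur | prev); 0.0 for unseen contexts."""
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0


# Toy corpus, invented for illustration only.
corpus = [
    "the data is large",
    "the data is noisy",
    "the model is simple",
]
counts = train_bigram_counts(corpus)
print(bigram_probability(counts, "the", "data"))
```

The model is nothing more than relative frequencies, which is why its quality depends so heavily on how much text it is counted from.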

Page 6: Data deluge/Adatáradat

Old problems in a new guise

“In 1998, Merrill Lynch cited estimates that as much as 80% of all potentially usable business information originates in unstructured form.”

— http://en.wikipedia.org/wiki/Unstructured_data

How can we extract information from unstructured data?

Reformulating text mining and text processing problems as MapReduce questions (Lin and Dyer: Data-Intensive Text Processing with MapReduce)
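
As an illustration of what recasting a text processing problem as a MapReduce question can look like, the sketch below counts words in three phases: map, shuffle, reduce. It is a single-process simulation in plain Python and does not use Hadoop or any real cluster API. The function names and sample documents are assumptions made for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every token in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as a framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


# Sample input, invented for illustration only.
documents = ["big data big clusters", "data deluge"]
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(word_counts)
```

Because each mapper call sees only one document and each reducer call sees only one key, both phases could in principle be spread over many machines, which is the property that makes this formulation attractive for large text collections.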

Page 7: Data deluge/Adatáradat

Solutions in the Hadoop ecosystem

Mahout http://mahout.apache.org/ - scalable algorithms for machine learning on Hadoop

Integration with analytics tools (e.g. R): Cloudera, Greenplum, Revolution Analytics

Radoop http://signup.radoop.eu/ - offers solutions built on the RapidMiner visual analytics environment

InfoHarvester http://weblib.hu/termekeink/infoharvester - deals specifically with unstructured data; a focused crawler for collecting the data, with integrated analytics and text mining solutions

Page 9: Data deluge/Adatáradat

Thank you for your attention

Kereso Vilag http://kereses.blog.hu/

Szamitogepes nyelveszet http://szamitogepesnyelveszet.blogspot.com/

Twitter: @zoltanvarju

Email: [email protected]
