Kathleen Fisher AT&T Labs Research PADS: A System for Managing Ad Hoc Data.


Kathleen Fisher, AT&T Labs Research

PADS: A System for Managing Ad Hoc Data

AT&T

Data, data, everywhere!

Incredible amounts of data stored in well-behaved formats:

• Databases
• XML

Tools:

• Schema
• Browsers
• Query languages
• Standards
• Libraries
• Books, documentation
• Conversion tools
• Vendor support
• Consultants…

… but not all data is well-behaved!

Vast amounts of chaotic ad hoc data:

Tools:

• Perl?
• Awk?
• C?

Ad Hoc Data in Genetics

format-version: 1.0
date: 11:11:2005 14:24
auto-generated-by: DAG-Edit 1.419 rev 3
default-namespace: gene_ontology
subsetdef: goslim_goa "GOA and proteome slim"

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824,PMID:11389764, SGD:mcc]
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

www.geneontology.org

Ad Hoc Data in Biology: Newick Format

((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700, seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201, weasel:18.87953):2.09460):3.87382,dog:25.46154);

(Bovine:0.69395,(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939,Mouse:1.21460):0.10;

(Bovine:0.69395,(Hylobates:0.36079,(Pongo:0.33636,(G._Gorilla:0.17147, (P._paniscus:0.19268,H._sapiens:0.11927):0.08386):0.06124):0.15057):0.54939, Rodent:1.21460);

Ad Hoc Data in Business

HA00000000START OF TEST CYCLE
aA00000001BXYZ U1AB0000040000100B0000004200
HL00000002START OF OPEN INTEREST
d 00000003FZYX G1AB0000030000300000
HM00000004END OF OPEN INTEREST
HE00000005START OF SUMMARY
f 00000006NYZX B1QB00052000120000070000B000050000000520000 00490000005100+00000100B00000005300000052500000535000
HF00000007END OF SUMMARY
k 00000008LYXW B1KB0000065G0000009900100000001000020000
HB00000009END OF TEST CYCLE

www.opradata.com

Ad Hoc Data in Finance

Date: 3/21/2005 1:00PM PACIFIC Investor's Business Daily ®
Stock List Name: DAVE

Stock Company Price Price Volume EPS RS
Symbol Name Price Change % Change % Change Rating Rating

AET Aetna Inc 73.68 -0.22 0% 31% 64 93
GE General Electric Co 36.01 0.13 0% -8% 59 56
HD Home Depot Inc 37.99 -0.89 -2% 63% 84 38
IBM Intl Business Machines 89.51 0.23 0% -13% 66 35
INTC Intel Corp 23.50 0.09 0% -47% 39 33

Data provided by William O'Neil + Co., Inc. © 2005. All Rights Reserved.
Investor's Business Daily is a registered trademark of Investor's Business Daily, Inc.
Reproduction or redistribution other than for personal use is prohibited.
All prices are delayed at least 20 minutes.

www.investors.com

Ad Hoc Binary Data: DNS packets

00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872 ...............r
00000010: 6573 6561 7263 6803 6174 7403 636f 6d00 esearch.att.com.
00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027 ...............'
00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465 .ns1...hostmaste
00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400 r..wd.I.........
00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e 6...............
00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00 ......linux.....
00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c ............mail
00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000 man.............
00000090: 0487 cf1a 16c0 0c00 0200 0100 000e 1000 ................
000000a0: 0603 6e73 30c0 0cc0 0c00 0200 0100 000e ..ns0...........
000000b0: 1000 02c0 2e03 5f67 63c0 0c00 2100 0100 ......_gc...!...
000000c0: 0002 5800 1d00 0000 640c c404 7068 7973 ..X.....d...phys
000000d0: 0872 6573 6561 7263 6803 6174 7403 636f .research.att.co

Ad Hoc Data from AT&T

Name & Use | Representation | Size
Web server logs (CLF): Measure web workloads | Fixed-column ASCII records | 12 GB/week
Sirius data: Monitor service activation | Variable-width ASCII records | 2.2 GB/week
Call detail: Detect fraud | Fixed-width binary records | ~7 GB/day
Altair data: Track billing process | Various Cobol data formats | ~4000 files/day
Regulus data: Monitor IP network | ASCII | 15 sources, ~15 GB/day
Netflow: Monitor IP network | Data-dependent number of fixed-width binary records | >1 Gigabit/second

Technical Challenges of Ad Hoc Data

• Data arrives “as is.”
• Documentation is often out-of-date or nonexistent.
  – Hijacked fields.
  – Undocumented “missing value” representations.
• Data is buggy.
  – Missing data, human error, malfunctioning machines, race conditions on log entries, “extra” data, …
  – Processing must detect relevant errors and respond in application-specific ways.
  – Errors are sometimes the most interesting portion of the data.
• Data sources often have high volume.
  – Data may not fit into main memory.

Existing Approaches

• Lex/Yacc
  – Over- and under-kill.
• Perl/C
  – Code brittle with respect to changes in input format.
  – If written, error-detection code swamps main-line computation. If not written, errors can corrupt “good” data.
  – Everything has to be coded by hand.
  – Analysis often ends up interwoven with parsing.
• Data description languages (PacketTypes, DataScript)
  – Binary data
  – Focus on correct data.

qr/^(\d+)\|(?:[^|]*\|){12}(?:[^|]*\|[^|]*\|)*$STATE\|/;
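To make the brittleness point concrete, here is a minimal Python sketch of the same pattern shape as the Perl regex on this slide. The record layout (an id, 12 header fields, then state/timestamp pairs) and the sample values are assumptions made for illustration; the slide does not describe the feed.

```python
import re

def state_pattern(state: str) -> "re.Pattern[str]":
    # Same shape as the Perl pattern on the slide, with $STATE interpolated.
    return re.compile(
        r"^(\d+)\|(?:[^|]*\|){12}(?:[^|]*\|[^|]*\|)*" + re.escape(state) + r"\|"
    )

# Hypothetical record: id, 12 header fields, then (state, timestamp) pairs.
good = "42|" + "h|" * 12 + "NEW|100|SHIPPED|200|"
# The same record after the feed silently gains one extra header field.
shifted = "42|" + "h|" * 13 + "NEW|100|SHIPPED|200|"

found = state_pattern("SHIPPED").match(good)      # matches, id captured
missed = state_pattern("SHIPPED").match(shifted)  # no match, and no error raised
spurious = state_pattern("100").match(shifted)    # a timestamp now looks like a state
```

After a one-field format change the pattern neither fails loudly nor keeps working: real states are missed and a timestamp is accepted as a state, which is the kind of silent corruption the bullets describe.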

Our Approach: PADS

Data expert writes declarative description of data source:
  – Physical format information
  – Semantic constraints

Many data consumers use description and parser.
• Description serves as living documentation.
• Parser exhaustively detects errors without cluttering user code.
• From description, we generate auxiliary tools.

PLDI 2005

PADS Architecture


PADS Language

• Provides rich and extensible set of base types.
  – Pint8, Puint8, …  // -123, 44
  – Pstring(:'|':)  // hello |
    Pstring_FW(:3:)  // catdog
    Pstring_ME(:"/a*/":)  // aaaaaab
  – Pdate, Ptime, Pip, …
• Provides type constructors to describe data source structure:
  – Pstruct, Parray, Punion, Ptypedef, Penum
• Allows arbitrary predicates to describe expected properties.

Type-based model: types indicate how to process associated data.
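As a rough illustration of "types indicate how to process associated data", the three string base types above can be read as three small readers. This is a Python sketch with hypothetical function names, not the PADS library; each reader returns the value and the new input position.

```python
import re

def read_terminated(data: str, pos: int, term: str):
    """Like Pstring(:term:): read up to, but not including, the terminator."""
    end = data.find(term, pos)
    if end < 0:
        raise ValueError("terminator not found")
    return data[pos:end], end

def read_fixed_width(data: str, pos: int, width: int):
    """Like Pstring_FW(:width:): read exactly `width` characters."""
    if pos + width > len(data):
        raise ValueError("input too short")
    return data[pos:pos + width], pos + width

def read_matching(data: str, pos: int, pattern: str):
    """Like Pstring_ME(:"/pattern/":): read the text matched at `pos`."""
    m = re.compile(pattern).match(data, pos)
    if m is None:
        raise ValueError("pattern did not match")
    return m.group(0), m.end()
```

For example, `read_fixed_width("catdog", 0, 3)` yields `"cat"` and leaves the position at `"dog"`, mirroring the `catdog` comment on the slide.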

Running Example: Web Server Logs

• Common Log Format from Web Protocols and Practice.
• Fields:
  – IP address of remote host
  – Remote identity (usually ‘-’ to indicate name not collected)
  – Authenticated user (usually ‘-’ to indicate name not collected)
  – Time associated with request
  – Request (request method, request-uri, and protocol version)
  – Response code
  – Content length

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

Example: Pstruct

Precord Pstruct http_weblog {
  host client;                /- Client requesting service
  ' ';
  auth_id remoteID;           /- Remote identity
  ' ';
  auth_id auth;               /- Name of authenticated user
  " [";
  Pdate(:']':) date;          /- Timestamp of request
  "] ";
  http_request request;       /- Request
  ' ';
  Puint16_FW(:3:) response;   /- 3-digit response code
  ' ';
  Puint32 contentLength;      /- Bytes in response
};

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

Example: Parray

Parray Phostname {
  Pstring_SE(:"/[. ]/":) [] : Psep('.') && Pterm(Pnosep);
};

Array declarations allow the user to specify:

• Size (fixed, lower-bounded, upper-bounded, unbounded)
• Psep, Pterm, and termination predicates
• Constraints over sequence of array elements

Array terminates upon reaching EOF, reaching terminator, reaching maximum size, or satisfying termination predicate.

www.cnn.com - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
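A minimal Python sketch of how the Phostname declaration can be read: elements stop at '.' or ' ', elements are separated by '.', and the array ends when no separator follows. The function name is hypothetical and this is an approximation of the semantics, not generated PADS code.

```python
import re

STOP = re.compile(r"[. ]")  # stop expression from Pstring_SE(:"/[. ]/":)

def parse_hostname(data: str, pos: int = 0):
    """Return (labels, new_pos) for a dot-separated host name starting at pos."""
    labels = []
    while True:
        m = STOP.search(data, pos)
        end = m.start() if m else len(data)
        labels.append(data[pos:end])
        pos = end
        if pos < len(data) and data[pos] == ".":
            pos += 1   # Psep('.'): consume the separator and read another element
        else:
            break      # Pterm(Pnosep): no separator, so the array is complete
    return labels, pos
```

Applied to the sample record, the reader stops at the first space and leaves the rest of the line for the following fields.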

Example: Punion

Punion auth_id {
  Pchar unavailable : unavailable == '-';
  Pstring(:' ':) id;
};

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

• Union declarations allow the user to describe variations.
• Implementation tries branches in order.
• Stops when it finds a branch whose constraints are all true.
• Switched unions jump to a particular branch based on a selector.
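The "tries branches in order" rule can be sketched in Python as an ordered choice over the two auth_id branches. The function name and the tagged-tuple result are assumptions for illustration only.

```python
def parse_auth_id(data: str, pos: int = 0):
    """Try each branch of auth_id in declaration order; return ((tag, value), new_pos)."""
    # Branch 1: Pchar unavailable : unavailable == '-'
    if pos < len(data) and data[pos] == "-":
        return ("unavailable", "-"), pos + 1
    # Branch 2: Pstring(:' ':) id  (a string terminated by a space)
    end = data.find(" ", pos)
    if end < 0:
        raise ValueError("no branch of auth_id matched")
    return ("id", data[pos:end]), end
```

Because the first branch whose constraint holds wins, a leading '-' is always reported as "unavailable" and never as a one-character id.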

Example: Ptypedef

Ptypedef Puint16_FW(:3:) response : response x => { 100 <= x && x < 600 };

Typedefs allow the user to add constraints to existing types.

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

Example: Penum

Penum method {
  GET, PUT, POST, HEAD, DELETE,
  LINK,     /- Unused after HTTP 1.0
  UNLINK    /- Unused after HTTP 1.0
};

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

Enumerations are strings on disk, 32-bit integers in memory.
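The "strings on disk, integers in memory" mapping can be sketched with a Python IntEnum. The numeric values below follow C-style numbering from zero, which is an assumption; the slide does not state the values PADS assigns.

```python
from enum import IntEnum

class Method(IntEnum):
    # Assumed numbering: declaration order starting at 0.
    GET = 0
    PUT = 1
    POST = 2
    HEAD = 3
    DELETE = 4
    LINK = 5
    UNLINK = 6

def parse_method(data: str, pos: int = 0):
    """Match one enumeration literal at pos; return (Method, new_pos)."""
    for method in Method:
        if data.startswith(method.name, pos):
            return method, pos + len(method.name)
    raise ValueError("no enumeration literal matches")
```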

Example: User Constraints

int chkVersion(http_v version, method meth) { … };

Pstruct http_request {
  '\"';
  method meth;
  ' ';
  Pstring(:' ':) req_uri;
  ' ';
  http_v version : chkVersion(version, meth);
  '\"';
};

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
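A Python sketch of a field constraint that depends on an earlier field. The check mirrors the chkVersion body given in the full Common Log Format description later in this deck; the regular expression and the dictionary result are simplifications assumed for illustration.

```python
import re

# Simplified stand-in for the http_request struct: "METHOD URI HTTP/major.minor"
REQUEST = re.compile(r'"(\w+) ([^ ]*) HTTP/(\d+)\.(\d+)"')

def chk_version(major: int, minor: int, method: str) -> bool:
    """Mirror of the deck's chkVersion, returning a bool instead of an int."""
    if major == 1 and minor == 1:
        return True
    if method in ("LINK", "UNLINK"):
        return False
    return True

def parse_request(text: str) -> dict:
    m = REQUEST.fullmatch(text)
    if m is None:
        raise ValueError("not a quoted request")
    method, uri = m.group(1), m.group(2)
    major, minor = int(m.group(3)), int(m.group(4))
    # The constraint sees both the version and the previously parsed method.
    return {"meth": method, "req_uri": uri, "version": (major, minor),
            "ok": chk_version(major, minor, method)}
```

The parse still yields a representation when the constraint fails; the failure is reported separately (here as `ok`), so the caller decides how to respond.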

Dependencies

• “Early” data often affects parsing of later data:
  – Lengths of sequences
  – Branches of switched unions
• To accommodate this usage, we allow PADS types to be parameterized:

Punion packets_t (: Puint8 which, Puint8 length :) {
  Pswitch (which) {
    Pcase 1: header_t header;
    Pcase 2: body_t body;
    Pcase 3: trailer_t trailer;
    Pdefault: Pstring_FW(: length :) unknown;
  }
};
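A Python sketch of the switched, parameterized union: the selector picks the branch directly and the length parameter sizes the default branch. The deck does not define header_t, body_t, or trailer_t, so the fixed-width readers below are hypothetical stand-ins.

```python
def read_fixed(width: int):
    """Build a fixed-width reader (hypothetical stand-in for header_t and friends)."""
    def reader(data: str, pos: int):
        if pos + width > len(data):
            raise ValueError("input too short")
        return data[pos:pos + width], pos + width
    return reader

# Selector value -> (branch name, reader); the widths are invented for the example.
BRANCHES = {
    1: ("header", read_fixed(4)),
    2: ("body", read_fixed(8)),
    3: ("trailer", read_fixed(2)),
}

def parse_packet(which: int, length: int, data: str, pos: int = 0):
    """Jump to the branch chosen by `which`; fall back to a `length`-wide string."""
    if which in BRANCHES:
        name, reader = BRANCHES[which]
    else:
        name, reader = "unknown", read_fixed(length)  # Pdefault branch
    value, pos = reader(data, pos)
    return (name, value), pos
```

Unlike an ordinary union, no branch is tried and abandoned: values parsed earlier (`which`, `length`) determine how the later bytes are read.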

Common Log Format in PADS

Parray Phostname {
  Pstring_SE(:"/[. ]/":) [] : Psep('.') && Pterm(Pnosep);
};

Punion host {
  Pip ip;            /- 135.207.23.32
  Phostname host;    /- www.research.att.com
};

Punion auth_id {
  Pchar unauthorized : unauthorized == '-';
  Pstring(:' ':) id;
};

Penum method { GET, PUT, POST, HEAD, DELETE, LINK, UNLINK };

Pstruct version {
  "HTTP/";
  Puint8 major;
  '.';
  Puint8 minor;
};

int chkVersion(version v, method m) {
  if ((v.major == 1) && (v.minor == 1)) return 1;
  if ((m == LINK) || (m == UNLINK)) return 0;
  return 1;
};

Pstruct request {
  '\"';
  method meth;
  ' ';
  Pstring(:' ':) req_uri;
  ' ';
  version version : chkVersion(version, meth);
  '\"';
};

Ptypedef Puint16_FW(:3:) response : response x => { 100 <= x && x < 600 };

Punion length {
  Pchar unavailable : unavailable == '-';
  Puint32 len;
};

Precord Pstruct entry {
  host client;
  ' ';
  auth_id remoteID;
  ' ';
  auth_id auth;
  " [";
  Pdate(:']':) date;
  "] ";
  request request;
  ' ';
  response response;
  ' ';
  length length;
};

Psource Parray clf {
  entry [];
}

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

PADS Parsing

Perror_t entry_read(P_t *pdc, entry_m* mask, entry_pd* pd, entry* rep);

Invariant: If mask is “check and set” and parse descriptor reports no errors, then the in-memory representation is correct.
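The invariant can be illustrated with a small Python sketch for the response field alone. The Mask and ParseDesc names are simplified, hypothetical stand-ins for the generated mask and parse-descriptor types, and the real library offers finer-grained control than shown here.

```python
from dataclasses import dataclass
from enum import Enum

class Mask(Enum):
    SET = 1            # fill in the representation, skip constraint checks
    CHECK_AND_SET = 2  # fill in the representation and check constraints

@dataclass
class ParseDesc:
    nerr: int = 0      # number of errors detected while parsing

def response_read(text: str, mask: Mask):
    """Parse a 3-digit response code; return (parse descriptor, representation)."""
    pd = ParseDesc()
    if not (len(text) == 3 and text.isascii() and text.isdigit()):
        pd.nerr += 1   # syntactic error: nothing sensible to store
        return pd, None
    rep = int(text)
    if mask is Mask.CHECK_AND_SET and not (100 <= rep < 600):
        pd.nerr += 1   # semantic error: the typedef constraint is violated
    return pd, rep
```

With `CHECK_AND_SET`, a descriptor reporting zero errors implies the value satisfies the constraint; with `SET`, the value 999 is returned with no error, so the caller can no longer rely on it.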


Using the Generated Code

P_t *p;
entry rep;
entry_pd pd;
entry_m mask;

P_open(&p, 0, 0);
P_io_fopen(p, "clf/data/2004.11.11");
entry_m_init(p, &mask, P_CheckAndSet);
...
while (!P_io_at_eof(p)) {
  entry_read(p, &mask, &pd, &rep);
  if (pd.nerr > 0) {
    entry_write2io(p, ERR_FILE, &pd, &rep);
  } else {
    cnvIPAddress(&rep);
    if (entry_verify(&rep)) {
      entry_write2io(p, CLEAN_FILE, &pd, &rep);
    } else {
      error(2, "Data transform failed.");
    }
  }
}
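The control flow of this loop, reading a record, inspecting the error count, and routing the record to a clean or error stream, can be sketched in Python. The regular expression is a loose stand-in for the generated entry parser, and the transformation and verification steps of the original are omitted; all names are hypothetical.

```python
import re
from dataclasses import dataclass

# Loose approximation of one Common Log Format record.
CLF = re.compile(r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)$')

@dataclass
class ParseDesc:
    nerr: int = 0

def entry_read(line: str):
    """Return (parse descriptor, representation) for one record."""
    m = CLF.match(line)
    if m is None:
        return ParseDesc(nerr=1), None
    return ParseDesc(), m.groups()

def split_feed(lines):
    """Route each record to the clean or error list based on pd.nerr."""
    clean, errors = [], []
    for line in lines:
        pd, _rep = entry_read(line)
        (errors if pd.nerr > 0 else clean).append(line)
    return clean, errors
```

The application code never inspects raw bytes: it only looks at the descriptor, which is what keeps error handling out of the main-line logic.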

Leverage!

• Convert PADS description into a collection of tools:
  – Accumulators
  – Histograms
  – Clustering tool
  – Formatters
  – Translator into XML, with corresponding XML Schema.
  – XQueries using Galax’s data interface
  – Ad hoc data management console
  – …
• Long term goal: Provide a compelling suite of tools to overcome inertia of a new language and system.

Accumulators

• Statistical profile of data:

<top>.length : uint32
good: 53544 bad: 3824 pcnt-bad: 6.666
min: 35 max: 248591 avg: 4090.234
top 10 values out of 1000 distinct values:
tracked 99.552% of values
 val: 3082 count: 1254 %-of-good: 2.342
 val: 170 count: 1148 %-of-good: 2.144
 . . .
. . . . . . . . . . . . . . . . . . . . . .
 SUMMING count: 9655 %-of-good: 18.032

• Bird’s eye view of 4000 daily feeds.
• Used to vet data (and debug PADS descriptions).

Not all lengths were legal!
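A minimal Python sketch of what an accumulator for one integer field might compute. The function and key names are hypothetical; the pcnt-bad formula, bad / (good + bad), is inferred from the report above, where 3824 / (53544 + 3824) rounds to 6.666%.

```python
from collections import Counter

def accumulate(samples, top_n: int = 10) -> dict:
    """Summarise (value, ok) pairs, where ok says whether the field parsed cleanly."""
    samples = list(samples)
    good = [value for value, ok in samples if ok]
    bad = len(samples) - len(good)
    return {
        "good": len(good),
        "bad": bad,
        "pcnt_bad": 100.0 * bad / len(samples) if samples else 0.0,
        "min": min(good) if good else None,
        "max": max(good) if good else None,
        "avg": sum(good) / len(good) if good else None,
        "distinct": len(set(good)),
        "top": Counter(good).most_common(top_n),  # most frequent good values
    }
```

A real accumulator would bound the number of distinct values it tracks, as the "tracked 99.552% of values" line suggests; this sketch keeps everything in memory.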

Pretty Printer

• Customizable program to reformat data:

207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/[email protected]/confirm HTTP/1.0" 200 941

207.136.97.49|-|-|10/16/97:01:46:51|GET|/tk/p.txt|1|0|200|30
tj62.aol.com|-|-|10/16/97:21:32:22|POST|/scpt/[email protected]/confirm|1|0|200|941

  – Normalize time zones
  – Normalize delimiters
  – Drop unnecessary values
  – Filter/repair errors

• Users can override pretty printing on a per-type basis.
• Used to normalize monitoring data before loading into a relational database.
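The normalization shown above (times converted to UTC, fields joined with '|', request split into method, URI, and version numbers) can be reproduced with a short Python sketch. The regular expression and function name are assumptions; this is not the generated formatter.

```python
import re
from datetime import datetime, timezone

CLF = re.compile(
    r'(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) HTTP/(\d+)\.(\d+)" (\d{3}) (\d+|-)$'
)

def format_entry(line: str) -> str:
    """Rewrite one Common Log Format record as '|'-delimited fields in UTC."""
    m = CLF.match(line)
    if m is None:
        raise ValueError("record does not match the expected layout")
    client, remote, auth, date, meth, uri, major, minor, resp, length = m.groups()
    # Parse the local timestamp with its offset, then normalize to UTC.
    when = datetime.strptime(date, "%d/%b/%Y:%H:%M:%S %z").astimezone(timezone.utc)
    stamp = when.strftime("%m/%d/%y:%H:%M:%S")
    return "|".join([client, remote, auth, stamp, meth, uri, major, minor, resp, length])
```

Running it on the first sample record gives the first output line of the slide: 18:46:51 at offset -0700 on 15 October becomes 01:46:51 on 16 October.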

XML Conversion

• PADS compiler generates canonical XML Schema:

<xs:complexType name="entry">
  <xs:sequence>
    <xs:element name="client" type="client"/>
    <xs:element name="remoteID" type="auth_id"/>
    …
    <xs:element name="date" type="p:val_Pdate"/>
    …
    <xs:element name="length" type="length"/>
    <xs:element name="pd" type="entry_pd" minOccurs="0" maxOccurs="1"/>
  </xs:sequence>
</xs:complexType>

• Converter maps ad hoc data to XML conforming to schema.

XQuery Integration

• XQueries can run over ad hoc data with PADS descriptions without converting data to XML.
• PADS compiler generates a description-specific instance of the Galax data API, conforming to the generated XML Schema.

$pads/Psource/elt[date/rep >= xs:dateTime("2004-10-01:00:00:00")
                  and date/rep < xs:dateTime("2004-11-01:00:00:00")]

LaunchPADS: GUI for ad hoc data

SIGMOD 2006 Demo

Formal Theory

• A core data description calculus (DDC)
  – Based on dependent type theory
  – Simple, orthogonal, composable types
  – Types transduce external data source to internal representation.
• Encodings of high-level DDLs in low-level DDC

[Figure: PADS, PacketTypes, and DataScript encoded in DDC]

POPL 2006

Future Research Directions

• Design
  – How can we specify error-aware data transformations?
  – Can we infer a data transformation between two descriptions?
  – How can we express application-specific information?
  – Can we automatically generate conforming data?
  – Can we integrate with a data visualization system?
• Implementation
  – How can we specialize generated libraries to incorporate application-specific information?
  – How can we optimize XQueries over PADS data sources?
• Theory
  – What is the expressiveness of PADS vs. context-free grammars?
  – How do we add parametric polymorphism to PADS?
• Engineering
  – How do we build the system to make it easy to add new base types?
  – New libraries and tools? New language bindings?

Try it!

• Available for download with a CPL license.

• Demo of accumulators, format program, and XML conversion.

• Send us feedback!

www.padsproj.org

(Growing!) PADS Team:

Kathleen Fisher (AT&T)

Mary Fernandez (AT&T)

Joel Gottlieb (AT&T)

David Walker (Princeton)

Yitzhak Mandelbaum (Princeton)

Mark Daly (Princeton)

Robert Gruber (Google)

Martin Strauss (Michigan)

Xuan Zheng (Michigan)