Phasic Systems - Dr. Geoffrey Malafsky
Transcript of Phasic Systems - Dr. Geoffrey Malafsky
Hadoop Powered Corporate Data
How to Produce and Manage Meaningful Data and Analytics
Dr. Geoffrey Malafsky
Phasic Systems Inc.
Governance
Warehouse
Analytics
NoSQL Streaming
BI Integration
Architecture
Modeling
Big Data Hadoop Velocity, Volume, Variety
Veracity
What does this really mean for my corporate data?
Disruption
Organizational Issues
Technology Issues
Business Issues
Are we discovering new knowledge?
Are we analyzing business and operations for decisions, audit, compliance, consolidation?
Are we fulfilling required reports?
Veracity, Meaningful: Does it matter?

Topic              Should   Does
BI                 Yes      Sometimes
Required Reports   Yes      Sometimes
Audit              Yes      Yes
Compliance         Yes      Yes
Consolidation      Yes      Sometimes
Marketing          Yes      Sometimes
Financial          Yes      Yes, but…
Decision Making    Yes      Yes, but…
TechLab by InsideAnalysis
Normalizing Corporate Small Data With Hadoop and Data Science
By Dr. Geoffrey P Malafsky
In part one of this discussion series (Hadoop for Small Data), I introduced the idea that Small Data is the mission-critical data management challenge. To
reiterate, Small Data is “corporate structured data that is the fuel of its main activities, and whose problems with accuracy and trustworthiness are past
the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision
making, applications, reports, and Business Intelligence.”
I am excluding what I call stochastic data use cases, which can succeed even if there is error in the source data and uncertainty in the results, since the
business objective is getting trends or making general associations. Most Big Data examples are of this type. In stark contrast are deterministic use cases,
which I am focusing on here and in the next TechLab in September, where the ramifications of wrong results are severely negative. This is the realm of
executive decision making, accounting, risk management, regulatory compliance, and security, to name a few.
Corporate Small Data is structured data that is the fuel of its main activities.
Data Normalization combines subject matter knowledge, governance, business rules, and raw data to make it meaningful.
Hadoop was created to handle extraordinarily large and constantly changing data sets. It is a very well-engineered software framework and set of tools for distributed storage and cluster computing. But can it help solve the intractable challenges with key corporate data?
The Challenge of Corporate Small Data
multiple sources
multiple definitions
multiple copies
variable structures
different data values
hidden conflicts in data definitions
which to use
different model types & standards
more storage
more data flows
many DW & marts
different ETL
complex dependencies
conflicting business rules
analyses restricted by inconsistencies
An example of embedded errors that defy traditional tools and methods. Two authoritative data systems have many occurrences of conflicts, errors, and quantitative discrepancies. Finding these has been too difficult with common tools. But using a small Hadoop cluster (this is Corporate Data, not Big Data) allows us to iteratively detect, learn, and adjust. Once a discrepancy is detected, investigated, and understood, we can get the one answer needed from the business to correct it.
136666505 adese genc petrol
136666505 amy lily chung
136666505 anderson erin ruth
136666505 andrew william knef
136666505 anduaga-arias laura
136666505 angelica m. de la cruz
136666505 anthony o'brien, 330531-5100194
136666505 batac belle
136666505 bottesini beth ms.
136666505 bouck shannon
136666505 bunn amy b.
136666505 carlene clark
136666505 cho, boong haeng
136666505 choe, sun young
136666505 christina michajlyszyn
136666505 christopher cannon
136666505 christopher l. booth
136666505 chun, kil mo
136666505 conflict + transition consultancies
136666505 cozzone elaine
136666505 deborah p. carney
136666505 denihan patricia joann
136666505 dong sook mcgeorge, 690525-2716816
136666505 dorene d.lukewalton,pharm d.
136666505 dr. terry a. klein
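The listing above shows a single identifier attached to many unrelated names. A minimal sketch of how such conflicts might be surfaced is to group records by identifier and flag any identifier with more than one distinct name. The function names, the cleanup rule, and the second identifier in the sample are assumptions made for illustration, not the tooling used in the talk.

```python
from collections import defaultdict


def names_per_id(records):
    """Map each identifier to the set of distinct names recorded against it."""
    grouped = defaultdict(set)
    for ident, name in records:
        # Light cleanup (assumed): trim and lowercase so trivial variants collapse.
        grouped[ident].add(name.strip().lower())
    return grouped


def conflicting_ids(records, max_names=1):
    """Return identifiers that carry more than max_names distinct names."""
    return {
        ident: sorted(names)
        for ident, names in names_per_id(records).items()
        if len(names) > max_names
    }


# Three rows copied from the listing, plus one hypothetical clean identifier.
sample = [
    ("136666505", "batac belle"),
    ("136666505", "carlene clark"),
    ("136666505", "conflict + transition consultancies"),
    ("000000001", "example vendor"),    # hypothetical
    ("000000001", "Example Vendor "),   # hypothetical, identical after cleanup
]

conflicts = conflicting_ids(sample)
for ident, names in conflicts.items():
    print(ident, len(names), names)
```

On a cluster the same grouping would be expressed as a distributed aggregation; the logic is unchanged.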
[Figure: Proportion of DUNS Matched by Transform Type. Y-axis: Percent of DUNS with >=50% names matched (0 to 100). X-axis: transform type (WhiteSpace, Transpose, Acronym, NoiseWord, LowSim, Punctuation). Series: FPDS, FPDS-WAWF, FPDS-WAWF-GDUNS.]
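The transform types named in the chart (WhiteSpace, Transpose, NoiseWord, Punctuation) suggest name-cleanup steps applied before names are compared. The sketch below is one possible reading of those labels, using only the Python standard library. The transform definitions, the noise-word list, and the 0.9 threshold are all assumptions; the slides do not define them.

```python
import re
from difflib import SequenceMatcher

# Assumed noise words; the source does not list the ones actually used.
NOISE_WORDS = {"inc", "llc", "corp", "co", "ltd", "the"}


def strip_punctuation(name):
    """Replace anything that is not a word character or whitespace with a space."""
    return re.sub(r"[^\w\s]", " ", name)


def collapse_whitespace(name):
    """Trim the ends and collapse internal runs of whitespace to one space."""
    return " ".join(name.split())


def drop_noise_words(name):
    """Remove tokens that carry little identifying information."""
    return " ".join(t for t in name.split() if t not in NOISE_WORDS)


def sort_tokens(name):
    """Order tokens alphabetically so transposed names compare equal."""
    return " ".join(sorted(name.split()))


def normalize(name):
    """Apply the assumed transforms in a fixed order."""
    name = strip_punctuation(name.lower())
    name = collapse_whitespace(name)
    name = drop_noise_words(name)
    return sort_tokens(name)


def similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(a, b, threshold=0.9):
    """Treat two names as the same entity when similarity meets the threshold."""
    return similarity(a, b) >= threshold


print(is_match("Phasic Systems, Inc.", "phasic   systems inc"))
print(is_match("bouck shannon", "Shannon Bouck"))
print(is_match("carlene clark", "batac belle"))
```

Running each transform separately and recording which one first produces a match would give the per-transform proportions that a chart like this reports.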
Requirements for Data Analytics
1. Data must be understood
2. The right definitions must apply at the right time for the right user
3. Data’s lineage and provenance must be clear
4. Data integrity must be preserved
5. Data must be accurate, consistent, complete, timely, unique and valid
6. Data and system access must be secure
7. Data must be provided in multiple arrangements to meet different user needs and analytical processing requirements
8. Data must be prepared and tracked to support meaningful analysis for different user needs
9. Data processing must be flexible to adapt to new knowledge and discoveries on data already being used
10. Data must be normalized using authoritative or best known sets of codes, lookup values, and source adjudication knowledge and rules
11. High speed, low maintenance techniques and tools are needed to be cost and time effective
12. Lifecycle audits and data maintenance must be performed including maintaining and documenting data from raw source to intermediate transformed to full normalized
13. Common data models must be used that align, correct, and semantically unify data from multiple sources to enforce meaningful and consistent analysis
An Example of Hidden Business Rules and Logic
• If (DELIVERY_ORDER=NULL) v_piid = CONTRACT else v_piid = DELIVERY_ORDER
• If (x1='0') v_modification_number = '0' else v_modification_number = x2
• where x1: if (ACO_MOD=NULL) x1 = x3 else x1 = ACO_MOD
• where x3: if (PCO_MOD=NULL) x3 = '0' else x3 = PCO_MOD
• where x2: if (x4=NULL) x2 = '0' else x2 = x4
• where x4: x4 = LTRIM(x5)
• where x5: x5 = x1
• essentially this first tries to use ACO_MOD, and if this is NULL then it tries to use PCO_MOD, and sets the value to '0' if both are NULL
• If (DELIVERY_ORDER=NULL) v_idv_piid = y1 else v_idv_piid = CONTRACT
• where y1: y1 = REF_PROC_INSTRUMENT with all '-' characters removed
key business logic as buried in a database stored procedure (condensed)
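Once rules like these are extracted from the stored procedure, they can be written as a small, testable function. The following is a minimal sketch, assuming each record arrives as a dictionary keyed by the column names on the slide and that NULL is represented as None. The function name derive_keys and the sample values are hypothetical.

```python
def derive_keys(row):
    """Apply the condensed rules from the slide to one record (dict of columns)."""
    delivery_order = row.get("DELIVERY_ORDER")
    contract = row.get("CONTRACT")
    aco_mod = row.get("ACO_MOD")
    pco_mod = row.get("PCO_MOD")
    ref_instrument = row.get("REF_PROC_INSTRUMENT")

    # v_piid: use the delivery order when present, otherwise the contract.
    v_piid = contract if delivery_order is None else delivery_order

    # x3, x1: prefer ACO_MOD, fall back to PCO_MOD, then to '0'.
    x3 = "0" if pco_mod is None else pco_mod
    x1 = x3 if aco_mod is None else aco_mod

    # x5, x4, x2: LTRIM removes leading spaces. x1 is never None here, so the
    # slide's NULL check on x4 cannot fire in this Python rendering. A database
    # that treats an empty string as NULL could behave differently.
    v_modification_number = "0" if x1 == "0" else x1.lstrip(" ")

    # y1: REF_PROC_INSTRUMENT with all '-' characters removed.
    y1 = None if ref_instrument is None else ref_instrument.replace("-", "")
    v_idv_piid = y1 if delivery_order is None else contract

    return {
        "v_piid": v_piid,
        "v_modification_number": v_modification_number,
        "v_idv_piid": v_idv_piid,
    }


# Hypothetical record with no delivery order and only PCO_MOD populated.
example = derive_keys({
    "CONTRACT": "C1",
    "DELIVERY_ORDER": None,
    "ACO_MOD": None,
    "PCO_MOD": "  P3",
    "REF_PROC_INSTRUMENT": "AB-12-3",
})
print(example)
```

Keeping the logic in one visible function, rather than nested inside a procedure, makes it possible to review the rule with the business and to rerun it as understanding of the data changes.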
Flexible, Fast, Adaptive, Multi-Tool Data Analytics Environment
[Figure: FPDS Hadoop Query Times, Text Field (secs). Y-axis: 0 to 400 seconds. Query engines: Hive, Impala, SQLServer. Series: Text, Parquet, Parquet Partitioned.]
Parallel Jobs in Hadoop