A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs
Ph.D. Showcase, Dept. of Computer Science
Sasi Kumar Pitchaimalai, Ph.D. Candidate
Database Systems Group, Department of Computer Science, University of Houston
Advisor: Dr. Carlos Ordonez
Motivation
• Naïve Bayes Classifier (NB)
– One of the most popular and important classifiers in machine learning
– Robust, powerful, fast to compute, and easy to understand
• Programming Inside a DBMS
– SQL can easily handle complex computations
– UDFs can use arrays and operate in memory
Data Mining Inside a DBMS
• Avoids exporting the data outside the DBMS
– Major overhead
– Data security
• Scales linearly with large data sets
• Exploits parallelism provided by a DBMS
• Uses optimized queries with simple database operations
• Objective: Push computations involving large data sets inside the DBMS
Bayesian Classifier Based On K-Means (BKM)
• A generalization of Naïve Bayes (NB)
• The algorithm
– Initialization: Randomly initialize k clusters per class from the data set.
– E-Step: Compute Euclidean distances, find the nearest cluster, and then compute sufficient statistics.
– M-Step: Recompute cluster centers and radii. Check convergence.
• The E-Step and M-Step are repeated until the model converges, i.e., the clusters no longer move.
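The steps above can be sketched in plain Python, outside any DBMS. This is a minimal illustration under stated assumptions, not the authors' SQL/UDF implementation: the names `kmeans`, `train_bkm`, and `classify` are hypothetical, radii are taken to be per-dimension variances, and the decision rule (pick the class of the cluster with the highest weighted diagonal-Gaussian score) is an assumed reading of "Bayesian classifier based on K-means".

```python
import math
import random

def sq_dist(x, c):
    """Squared Euclidean distance between a point and a center."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def kmeans(points, k, seed=0, max_iter=100):
    """K-means on one class; returns (centers, variances, counts)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # Initialization
    d = len(points[0])
    for _ in range(max_iter):
        # E-step: nearest cluster per point, then sufficient statistics.
        n = [0] * k
        lin = [[0.0] * d for _ in range(k)]    # sum of x per cluster
        quad = [[0.0] * d for _ in range(k)]   # sum of x^2 per cluster
        for x in points:
            j = min(range(k), key=lambda c: sq_dist(x, centers[c]))
            n[j] += 1
            for h in range(d):
                lin[j][h] += x[h]
                quad[j][h] += x[h] * x[h]
        # M-step: recompute centers; an empty cluster keeps its old center.
        new_centers = [
            [lin[j][h] / n[j] for h in range(d)] if n[j] else centers[j]
            for j in range(k)
        ]
        moved = new_centers != centers
        centers = new_centers
        if not moved:  # converged: clusters did not move
            break
    # Radii as per-dimension variances, floored to avoid division by zero.
    variances = [
        [max(quad[j][h] / n[j] - centers[j][h] ** 2, 1e-6) if n[j] else 1.0
         for h in range(d)]
        for j in range(k)
    ]
    return centers, variances, n

def train_bkm(X, y, k, seed=0):
    """Run K-means separately inside each class (class decomposition)."""
    model = []
    for label in sorted(set(y)):
        pts = [x for x, lab in zip(X, y) if lab == label]
        centers, variances, counts = kmeans(pts, k, seed)
        for c, v, n in zip(centers, variances, counts):
            if n:  # skip empty clusters
                model.append((label, c, v, n / len(X)))
    return model

def classify(model, x):
    """Assumed rule: class of the cluster with the best weighted log-score."""
    def log_score(cluster):
        _, c, v, w = cluster
        ll = sum(-0.5 * math.log(2 * math.pi * vh) - (xh - ch) ** 2 / (2 * vh)
                 for xh, ch, vh in zip(x, c, v))
        return math.log(w) + ll
    return max(model, key=log_score)[0]

# Toy data (hypothetical): two well-separated classes, k = 2 clusters each.
X = [(0.0, 0.1), (0.2, 1.0), (1.0, 0.0), (0.9, 1.1),
     (10.0, 10.2), (10.1, 11.0), (11.0, 10.1), (10.9, 10.9)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
model = train_bkm(X, y, k=2)
predictions = [classify(model, x) for x in X]
```

With k = 1 cluster per class the model reduces to one Gaussian per class, which is one way to read the statement that BKM generalizes Naïve Bayes.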
BKM: Finding the clusters per class
Database Optimizations
• Five different query optimization techniques for distance computation were introduced.
• User Defined Functions (UDFs) – Computing distance and nearest cluster in a single UDF.
• Using CASE statement instead of aggregations.
• Sufficient Statistics of the clusters were computed in a single table scan.
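Two of these ideas can be illustrated with Python's built-in `sqlite3` module standing in for the DBMS. This is only a sketch under assumptions: the table `X`, its columns, the two hard-coded centroids, and the statistic names `n`, `l`, `q` (count, linear sum, quadratic sum) are hypothetical and are not the authors' queries. A `CASE` expression picks the nearest of k = 2 clusters from squared distances, and one statement over `X` accumulates the per-cluster sufficient statistics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
conn.executemany(
    "INSERT INTO X VALUES (?, ?, ?)",
    [(1, 0.0, 0.0), (2, 1.0, 0.0), (3, 0.0, 1.0), (4, 9.0, 9.0), (5, 10.0, 10.0)],
)

# Hypothetical current centroids for k = 2 clusters.
c1 = (0.0, 0.0)
c2 = (10.0, 10.0)

query = """
SELECT j,
       COUNT(*)     AS n,
       SUM(x1)      AS l1, SUM(x2)      AS l2,
       SUM(x1 * x1) AS q1, SUM(x2 * x2) AS q2
FROM (
    -- CASE selects the nearest cluster without a MIN aggregation or a join.
    SELECT x1, x2,
           CASE WHEN d1 <= d2 THEN 1 ELSE 2 END AS j
    FROM (
        -- Squared Euclidean distance to each centroid, one column per cluster.
        SELECT x1, x2,
               (x1 - :c11) * (x1 - :c11) + (x2 - :c12) * (x2 - :c12) AS d1,
               (x1 - :c21) * (x1 - :c21) + (x2 - :c22) * (x2 - :c22) AS d2
        FROM X
    )
)
GROUP BY j
ORDER BY j
"""
params = {"c11": c1[0], "c12": c1[1], "c21": c2[0], "c22": c2[1]}
stats = conn.execute(query, params).fetchall()

# New centers follow directly from the statistics: C = L / N.
centers = {j: (l1 / n, l2 / n) for j, n, l1, l2, q1, q2 in stats}
```

Square roots are skipped because comparing squared distances gives the same nearest cluster. A real UDF-based version would compute the distance and the nearest cluster inside one function call, which this sketch does not attempt.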
Comparing Accuracy – NB Vs BKM Vs DT
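| Data Set | Algorithm | Global | Class-0 | Class-1 |
|---|---|---|---|---|
| pima | NB | 76% | 80% | 68% |
| pima | BKM | 76% | 87% | 53% |
| pima | DT | 68% | 76% | 53% |
| Spam | NB | 70% | 87% | 45% |
| Spam | BKM | 73% | 91% | 43% |
| Spam | DT | 80% | 85% | 72% |
| Bscale | NB | 50% | 51% | 30% |
| Bscale | BKM | 59% | 59% | 60% |
| Bscale | DT | 89% | 96% | 0% |
| Wbcancer | NB | 93% | 91% | 95% |
| Wbcancer | BKM | 93% | 84% | 97% |
| Wbcancer | DT | 95% | 94% | 96% |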
• Global accuracy: BKM is better than NB and worse than DT (Decision Tree) in most cases.
• Class breakdown accuracy: BKM is better than NB except in 2 cases, indicating that class decomposition is a positive step towards increasing NB accuracy. DT performs poorly here, and worst on Bscale, where its Class-1 accuracy is 0%.
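To see why the class breakdown matters, the short sketch below computes global and per-class accuracy on a toy set of labels. The helper `accuracy_breakdown` and the labels are hypothetical, and it assumes the Class-0 and Class-1 columns report the fraction of each class that was classified correctly.

```python
def accuracy_breakdown(y_true, y_pred):
    """Return (global accuracy, {class: accuracy within that class})."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    per_class = {}
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        per_class[label] = sum(y_pred[i] == label for i in idx) / len(idx)
    return correct / len(y_true), per_class

# Hypothetical imbalanced data: a classifier that always predicts class 0.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10
global_acc, per_class = accuracy_breakdown(y_true, y_pred)
```

Here the global accuracy is 0.8 even though class 1 is never predicted correctly, the same pattern as a high global figure paired with a 0% class.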
BKM Scalability: Varying n, d, k
Time per iteration. Defaults: d = 4, k = 4, n = 100k
Comparing DBMS with MapReduce
MapReduce: A distributed, non-transactional, high-performance framework for data-intensive processing.
Incremental Mining
• A UDF performs incremental data mining, exploiting data parallelism
• Minimizes the number of scans (1 to 3) over the data set
• Provides an approximation of the model before the complete data set has been scanned
• Requires thread-safe sharing of the model without affecting performance
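The requirements above can be sketched with Python threads. This is a hedged illustration, not the UDF described on the slide: `SharedModel`, `scan_partition`, and the batch size are hypothetical, and the "model" is reduced to running sufficient statistics (count, linear sum, quadratic sum) from which a mean and variance can be read at any time during a single scan.

```python
import threading

class SharedModel:
    """Running sufficient statistics (n, L, Q) guarded by a lock."""

    def __init__(self, d):
        self._lock = threading.Lock()
        self.d = d
        self.n = 0
        self.L = [0.0] * d   # per-dimension sum of x
        self.Q = [0.0] * d   # per-dimension sum of x^2

    def merge(self, n, L, Q):
        """Fold one batch's partial statistics into the shared model."""
        with self._lock:
            self.n += n
            for h in range(self.d):
                self.L[h] += L[h]
                self.Q[h] += Q[h]

    def snapshot(self):
        """Approximate (n, mean, variance) from the rows seen so far."""
        with self._lock:
            n, L, Q = self.n, list(self.L), list(self.Q)
        if n == 0:
            return 0, None, None
        mean = [l / n for l in L]
        var = [q / n - m * m for q, m in zip(Q, mean)]
        return n, mean, var

def scan_partition(rows, model, batch_size=2):
    """One worker: scan its partition once, locking only once per batch."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        L = [sum(r[h] for r in batch) for h in range(model.d)]
        Q = [sum(r[h] * r[h] for r in batch) for h in range(model.d)]
        model.merge(len(batch), L, Q)

# Hypothetical data split into two partitions, one thread per partition.
partitions = [[(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)], [(7.0, 8.0)]]
model = SharedModel(d=2)
workers = [threading.Thread(target=scan_partition, args=(p, model))
           for p in partitions]
for w in workers:
    w.start()
for w in workers:
    w.join()
n, mean, var = model.snapshot()
```

Calling `snapshot()` while the workers are still running returns an estimate based on the rows merged so far. Accumulating each batch locally and taking the lock once per batch is one way to keep contention low.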
Papers
• Carlos Ordonez, Sasi K. Pitchaimalai: One-pass data mining algorithms in a DBMS with UDFs. SIGMOD Conference 2011: 1217-1220
• Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado: Comparing SQL and MapReduce to compute Naïve Bayes in a Single Table Scan. CloudDB, CIKM 2010
• Carlos Ordonez, Sasi K. Pitchaimalai: Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling. DKE 2010
• Carlos Ordonez, Sasi K. Pitchaimalai: Bayesian Classifiers Programmed in SQL. TKDE 2008
• Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado: Efficient Distance Computation Using SQL Queries and UDFs. ICDM 2008