Data Mining La Plata 11 Nov 2002 - CENSUS
Transcript of Data Mining La Plata 11 Nov 2002 - CENSUS
Census Data Analysis & Data Mining
Data Mining
María del Rosario Bruera
IBM Scholars Program
Questions and answers
Questions:
· What is the value of the customers?
· Which customers are most likely to defect?
· Which products are sold together? …
Answers:
· They are in the user's data
· Special tools are needed to find them
Business Intelligence
· “It is an umbrella covering a set of concepts and methodologies whose mission is to improve business decision-making by relying on facts and on systems that work with facts”
Howard Dresner, Gartner Group
B.I.: resources and tools
· Data sources: warehouses, data marts, etc.
· Data management tools
· Extraction and query tools
· Modeling tools (Data Mining)
What is Data Mining? (1997)
· “Data Mining” is the process of exploring and analyzing data, automatically or semi-automatically, to obtain meaningful patterns and business rules.
· Michael Berry, Gordon Linoff, Data Mining for marketing, sales and customer support, Wiley, USA
Reflections (2000)
· ... we like the notion that the patterns must be meaningful …
· ... If there is one thing we reject, it is the phrase “by automatic or semi-automatic means”: not because it is untrue (without automation it is impossible to mine large amounts of data), but because we feel that too much emphasis has been placed on automation and not enough on the exploration and analysis stages
· ... Data Mining is a process ...
What Data Mining is NOT
· It is not a product bought off the shelf, but a discipline that must be mastered.
· It is not an instant solution to business problems.
· It is not an end in itself, but a process that helps find solutions to business problems.
Pillars of the Data Mining process
· Data
· Algorithms and techniques
· Modeling practices
Disciplines that come together
· Artificial Intelligence
· Statistics
· Decision support technologies (OLTP)
· Hardware and software technologies
Historical perspective
Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data
Stages of the Data Mining process
· Identify the business problem
· Transform the data into information
· Act on the results
· Measure the results of the actions
The Mining Process
Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data
The Data Analyst
· Is the link between the information technology areas and the business areas
· Translates information requirements into questions suitable for analysis with the mining tools.
· Feeds new data cleaning and data validation criteria back into the company's Data Warehouse.
The Data Analyst
[Diagram: Information technology / Business users]
The Data Analyst
[Diagram labels: Business goals, Task Discovery, Data Warehouse, Data Discovery, Mining Data, Data Cleaning, Data Transformation, Data Analysis, Data Modelling, Insights, Knowledge Discovery]
Required skills
· Data manipulation (SQL)
· Knowledge of mining and exploratory analysis techniques
· Ability to communicate and interpret business problems
· Creativity
Data Mining Team
Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data
Project costs
Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data
Origin of the data
· Relational databases
· Data Warehouses
· Data Marts and OLAP
· Other formats: Excel, ASCII files, surveys, census data, etc.
Types of data sources
· Transactional, e.g. operations made with a credit card
· Relational, e.g. the structure of the products offered by the Bank
· Demographic, e.g. characteristics of the household
The shape of the data for Data Mining
· The data are organized as a flat table made up of rows and columns.
· Rows: the unit of analysis. For example: an account, a ticket
· Columns: the attributes of each unit of analysis. For example: frequency of credit card use
Characteristics of data tables for Data Mining
· All the data must be in a single table or “view” of the database
· Each row must correspond to an instance that is relevant to the business
· Columns with no variability must be ignored
· Columns with a unique value for each case must be ignored (e.g. account number)
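The last two rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `usable_columns`, the column names and the sample rows are hypothetical, and the "unique per case" test is a heuristic meant for identifier-like columns (a continuous measurement may also happen to be unique in every row).

```python
def usable_columns(rows):
    """Return the column names worth keeping for mining.

    rows: list of dicts, one dict per unit of analysis (hypothetical layout).
    A column is dropped if it has no variability or if it takes a
    different value in every row (identifier-like, e.g. account number).
    """
    n_rows = len(rows)
    keep = []
    for name in rows[0]:
        distinct = {row[name] for row in rows}
        if len(distinct) == 1:        # no variability
            continue
        if len(distinct) == n_rows:   # unique value for each case
            continue
        keep.append(name)
    return keep


# Hypothetical sample table
rows = [
    {"account_no": 1001, "country": "AR", "card_uses": 12},
    {"account_no": 1002, "country": "AR", "card_uses": 3},
    {"account_no": 1003, "country": "AR", "card_uses": 12},
]
print(usable_columns(rows))  # ['card_uses']
```

Here `country` is dropped because it never varies and `account_no` because it is different for every case.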
Data quality
· The success of Data Mining activities is directly related to the QUALITY of the data.
· Missing data (“missings”) and out-of-range data (“outliers”) must be identified.
· It is often necessary to pre-process the data before passing them to the analysis model. Pre-processing may include transformations, reductions or combinations of the data.
· The semantics of the data should guide the choice of a suitable representation, and the merits of the chosen representation weigh directly on the quality of the model and of the subsequent results.
Problems with the data
· Too much data
  - corrupt or noisy data
  - redundant data (requires factorization)
  - irrelevant data
  - excessive volume of data (sampling)
· Too little data
  - missing attributes (missings)
  - missing values
  - small volume of data
· Fractured data
  - incompatible data
  - multiple data sources
Data preparation
[Diagram: Select, Transform, Mine, Assimilate; labels: Transformed Data, Extracted Information, Assimilated Information]
Data Warehouse
· Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management decisions
  Bill Inmon
· A copy of transaction data specifically structured for query and analysis.
  Ralph Kimball
Data Marts
· Technically, a data mart is a subset of the DW oriented to a specific business purpose: marketing, finance, production, etc.
· The term is also used for alternatives to a corporate DW that are smaller and have lower cost and implementation time.
Data warehouse architecture
[Diagram labels: Operational and external data; DW; Metadata; Report, Query, EIS; OLAP; Data Mining]
Tools for exploiting the DW
· Visualization tools
· Reporting
· OLAP
· Data Mining
OLAP
· On-Line Analytical Processing
· OLAP tools build multidimensional views of the DW to optimize performance
· They are supported by DW management engines that allow these “cubes” to be built
· They are useful and powerful tools for accessing databases and Data Warehouses and obtaining information “reports”.
· OLAP technology complements Data Mining activities and goes beyond what SQL can do
Data Mining and OLAP
· Reporting, OLAP and query tools work well for building descriptive and retrospective models that confirm or reject the user's prior hypotheses
· Data Mining tools find non-obvious patterns in the large volumes of information in the DW and propose predictive models
What is Statistics
· It is the discipline that extracts general information from specific data.
· It is the study of stability in variation
· It is the art of examining, summarizing and drawing conclusions from data.
Data Mining and Statistics
· Statistical methods are at the heart of many data mining techniques.
· Many of these techniques were originally designed for confirmatory purposes.
· Exploratory statistics emerged with the contributions of J. Tukey
· In Data Mining no a priori assumptions are made about the nature of the variables or the relationships between them (normality, linearity, etc.)
· For Data Mining, statistical algorithms are adapted to process large volumes of data
Data Mining and AI
· Artificial Intelligence enters Data Mining through artificial neural networks
· They are used to build non-linear predictive models that learn through training and that resemble models of networks of biological neurons.
Neural networks
· Neural networks are suited to predictive problems.
· A problem appropriate for a neural network has three characteristics:
  - The INPUTS are clearly understood
  - The OUTPUT is clearly understood
  - There are enough examples (experience) to train the network
Neural models
· The neural network does not produce explicit rules that describe the model
· A neural model is only as good as the data set used to train the network
· The model is static and must be explicitly updated, by adding recent examples and re-training the network, to keep it valid and useful
· Neural models can tackle a wide variety of problems and produce good results even in complex domains with continuous and categorical variables
· They are appropriate for classification and prediction tasks when the results of the model matter more than understanding how the model works.
Customer Relationship Management
· It is the process that manages the relationship between the company and its customers
· For it to succeed, the customers' consumption and behavior patterns must be identified
Data Mining - CRM
· Data Mining is used to systematize the search for predictors of customer behavior during campaign design
· It is also applied to measure campaign results and feed them back into the CRM
Typical Data Mining problems
· Classification
· Estimation
· Prediction
· Grouping based on association rules
· Clustering
· Description and visualization, etc.
Clustering problem
Group customers by their indicators R (Recency), F (Frequency), M (Amount), etc. into segments of homogeneous behavior. Result: Heavy, Medium, Light customers, etc.
- …% of billing is concentrated in the Heavy cluster (…% of the customers).
- Heavy customers are married, with children, self-employed, with an income above $…
Classification problem
Classify a new customer, according to their sociodemographic profile, as a potential Heavy, Medium or Light customer.
Estimation problem
Estimate the consumption of a given category of items by a group of customers in the next quarter.
Estimate the potential LTV (Life Time Value) of a new customer
Prediction problem
Predict customer abandonment (churning, attrition)
- For a mobile phone company
- For an AFJP (pension fund administrator)
- For a credit card
Association problem
Find the rules that determine cross-traffic between products for the customers of a Bank. For example: “When a customer becomes active in Savings Account, the next product in which they become active is Personal Loans. This pattern occurs in …% of the cases.”
Visualization problem
- Use geolocation software (GIS) to represent the distribution of customers within the catchment area of a retailer's branches.
Common problems
· Characterizing customer profiles to define Up-selling and Cross-selling actions
· Campaign tracking and response / non-response prediction
· Credit card consumption basket and fraud prevention
· Abandonment prediction models
· Mileage and loyalty programs
· Consolidating in-house databases with external sources
· Web mining and analysis of traffic and use of e-business resources
· Defining sampling frames for market research and customer satisfaction surveys.
Choosing the model for Data Mining
· Main objectives of the Data Mining process:
  - prediction
  - description
· The method to use depends on the objectives of the analysis, but also on the quality and quantity of the available data
Source: Mining your Own Business Data Using DB2 Intelligent Miner for Data
How to select a potential DM application
Practical considerations:
· Potentially significant impact (cost / benefit ratio)
· There is no other alternative
· There is institutional support
· There are no legal impediments to using the information
Technical considerations:
· Sufficient availability of data
· Relevance of the attributes
· Low levels of noise in the data
· A stated confidence level for the results
· Existing prior knowledge
Evaluating the models
· How well does the model fit?
· Is its description of the observed data correct?
· How much confidence can be placed in its predictions?
· How understandable is the model?
The measures
· The agreement of a predictive model with reality is measured by the error rate, that is, the percentage of cases that were classified or predicted incorrectly.
· For this purpose, validation and testing data are kept, and the model must be applied to them periodically as a control.
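As a small illustration of the error rate described above, the sketch below compares actual and predicted classes on a validation set. The function name `error_rate` and the sample labels are hypothetical, chosen only to mirror the Heavy / Medium / Light example used earlier in the talk.

```python
def error_rate(actual, predicted):
    """Fraction of cases whose predicted class differs from the actual one."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Hypothetical validation sample
actual = ["Heavy", "Light", "Medium", "Light", "Heavy"]
predicted = ["Heavy", "Light", "Light", "Light", "Medium"]
print(error_rate(actual, predicted))  # 0.4 (2 wrong out of 5)
```

Recomputing this figure periodically on fresh validation data is one simple way to perform the control mentioned above.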
The measures
· For descriptive models, a good rule is one that provides the most understandable information with the shortest rule “length”.
· Ultimately, the most important measure of effectiveness is the return on investment
A successful project
· A single project leader
· A multidisciplinary team made up of people from the IT and business areas
· The business units are involved from the start
· The IT area is involved from the start
· A small pilot project that shows the advantages of Data Mining
The new technologies
Web Mining
· It is the discovery of meaningful patterns from the analysis of the structure, contents and usage of the Web
Web Mining Taxonomy
[Diagram: Web Mining divides into Web content, Web structure and Web usage]
Web mining results
· …% of the visitors who access www.ibm.com/redbooks go on to access www.ibm.com/software/data/iminer/fordata/
· Entry and exit points
· Link analysis and sequential patterns of page links
· Segmentation of e-commerce customers
· Product basket
· etc., etc., etc.
Text Mining
· These are new tools for extracting information from “unstructured” documents, and for organizing, segmenting and indexing them.
Text Mining problems
· Automatic routing of emails according to their content
· Automatic classification of the documents of an intranet
· Searching for information in documents in several languages at the same time.
· Content analysis of Web pages
· Organization of Web search services
· Extraction of summary concepts from documents on the same subject.
Conclusions
Why Data Mining
· Data Mining is an effective tool for answering complex Business Intelligence questions
· The available tools automate part of the task of finding the behavior patterns hidden in the data
· But …
What cannot be automated (yet)
· Choosing the business problems that are candidates for Data Mining tasks
· Identifying and collecting the data that contain the information sought
· Massaging and treating the data so that patterns can be searched for
· Designing and computing derived variables
· The action plan that, building on the model results, produces the ROI
· Measuring the success of the actions taken on the basis of the results provided by Data Mining
Conclusions
· Make Data Mining a part of your business project.
· Include Data Mining in the “culture” of your organization.
Examples with DB2 Intelligent Miner for Data
Techniques used
· Clustering (segmentation)
· Product basket
· Decision tree
· Neural network as a predictive model
What is “clustering”?
· It is the partition of the set of individuals into subsets that are as homogeneous as possible.
· The goal is to maximize the similarity of the individuals within a cluster and to maximize the differences between clusters.
Applications of the technique
· Database segmentation
· Fraud detection
· Defect detection
Objectives
· Determine the optimal number of clusters
· Assign each individual to a single cluster
· Assess the impact of the variables on the formation of the cluster
· Understand the “profile” of each cluster
Similarity measures
· Categorical variables (nominal and ordinal scales): they are similar if they are equal.
· Numeric variables (metric scales): the algorithm determines their difference expressed in units of standard deviations.
Similarity example
Name            Sex         Marital status   Place       Similarity
Juan            M           C                Cap.Fed     0.33
Maria           F           C                GBA         0.33
Not evaluated   Different   Equal            Different
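The 0.33 in the table above is consistent with counting the share of categorical attributes on which the two records agree (1 of 3). The sketch below reproduces that figure under this assumption; it is not necessarily how Intelligent Miner computes similarity, and the function and field names are hypothetical.

```python
def categorical_similarity(a, b, fields):
    """Share of the given categorical fields on which records a and b are equal."""
    matches = sum(1 for f in fields if a[f] == b[f])
    return matches / len(fields)


# The name is not evaluated, so it is left out of the compared fields.
juan = {"sex": "M", "marital_status": "C", "place": "Cap.Fed"}
maria = {"sex": "F", "marital_status": "C", "place": "GBA"}

sim = categorical_similarity(juan, maria, ["sex", "marital_status", "place"])
print(round(sim, 2))  # 0.33
```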
Condorcet criterion
· It is a similarity measure that ranges between 0 and 1
· Value 0: the individuals are placed in the clusters at random
· Value 1: all the individuals in each cluster are identical and there are no individuals with those characteristics outside that cluster.
· Usual minimum Condorcet value: …
The problem
Segment the customer database of a credit card, based on the customers' consumption indicators, in order to identify the highest-value segment.
The available data
· From the database of the customers' transactions over the last year, the following variables are obtained:
  - Frequency of card use: computed as the mean number of days between transactions.
  - Average monthly balance of transactions, in $
  - Average amount per transaction
  - Number of services paid by automatic debit
  - Sociodemographic data: sex, age, marital status, occupation, children
Data preparation
· Define the unit of analysis: account or card?
· Define what a transaction is, e.g. how are adjustments (negative amounts) treated?
· Define derived variables: in the frequency, how do automatic debits count?
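To make these definitional choices concrete, the sketch below derives two of the variables mentioned in the talk (frequency as mean days between transactions, and average amount per transaction) for a single account. The data layout, the function name `derived_variables` and the decision to exclude negative amounts are assumptions made only for illustration; the talk leaves those decisions open.

```python
from datetime import date


def derived_variables(transactions):
    """Compute derived variables for one account.

    transactions: list of (date, amount) tuples (hypothetical layout).
    Assumption: adjustments (non-positive amounts) are not counted
    as transactions.
    """
    purchases = sorted(t for t in transactions if t[1] > 0)
    dates = [d for d, _ in purchases]
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else None
    avg_ticket = (
        sum(amount for _, amount in purchases) / len(purchases)
        if purchases else None
    )
    return {"mean_days_between": mean_gap, "avg_ticket": avg_ticket}


# Hypothetical account history
txns = [
    (date(2002, 1, 5), 100.0),
    (date(2002, 1, 15), 50.0),
    (date(2002, 1, 20), -20.0),  # adjustment, ignored under the assumption above
    (date(2002, 2, 4), 150.0),
]
print(derived_variables(txns))
# {'mean_days_between': 15.0, 'avg_ticket': 100.0}
```

Counting the adjustment as a transaction would change both figures, which is why the definition has to be fixed before the variables are computed.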
Data preparation
· Describe the variables to be included in the model in order to:
  - Compute measures of position and dispersion
  - Identify asymmetric distributions
  - Identify missings
  - Identify incorrect or out-of-range values
  - Identify outliers
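A minimal sketch of such a description for one numeric variable, using only the Python standard library. The 1.5 × IQR fence for outliers and the mean-versus-median check for asymmetry are common rules of thumb chosen here as assumptions; the talk does not prescribe specific criteria, and the sample balances are made up.

```python
import statistics


def describe(values):
    """Summarize one numeric variable; None marks a missing value."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    median = statistics.median(present)
    q1, _, q3 = statistics.quantiles(present, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # rule-of-thumb fences
    return {
        "missing": len(values) - len(present),
        "mean": mean,
        "median": median,
        "stdev": statistics.stdev(present),
        "right_skewed": mean > median,              # crude asymmetry hint
        "outliers": [v for v in present if v < low or v > high],
    }


# Hypothetical monthly balances, with two missing values and one extreme case
balances = [120, 135, None, 110, 140, 125, 130, 2500, None, 115]
summary = describe(balances)
print(summary["missing"], summary["outliers"], summary["right_skewed"])
# 2 [2500] True
```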
General descriptives
[Figure: statistics for Cluster 0, 100.00% of population. Variables shown: scios, edad, estado_civil (Divorciado/Viudo, Casado, Soltero), ocup (Cuenta Propia, Relacion dependencia, No trabaja), sexo (Femenino, Masculino), hijos (No, Si), avgtckt, frecu, pesos]
Segmentation criteria
· The variables describing consumption behavior are taken as “active” variables.
· The sociodemographic attributes are taken as supplementary variables.
Credit Card
[Figure: clustering result with three clusters. Cluster 1: 55%, cluster 2: 27%, cluster 0: 18% of population. For each cluster the panel shows the distributions of scios, frecu, pesos, avgtckt, [edad], [estado_civil], [ocup], [sexo] and [hijos]]
Credit Card: Cluster 2, 27.21% of population
· They have 4 or more automatic debits
· Married
· Self-employed
· Male
· Balance >>>
· With children
· Frequent use
· Age 40-45
· Ticket >>>
Pareto
[Chart: % of accounts versus % of total balance for Cluster 0, Cluster 1 and Cluster 2]
Decision trees
· They are techniques used for prediction and classification.
· The result is a set of “rules” that explain the behavior of one variable (TARGET) in relation to others (PREDICTORS).
· In this example they are used to “explain” the clusters.
Algorithms
· CHAID (Chi-Squared Automatic Interaction Detection)
· CRT (Classification and Regression Tree)
· C4.5, Quest and others
· Intelligent Miner uses a variant of CRT
Behavior tree
If the customer has 4 or more automatic debits and a balance > $727, then the probability of belonging to cluster 2 is 99%
Sociodemographic tree
Market Basket Analysis
· The problem: find the association rules that organize the orders of extra “toppings” at a pizzeria, from the analysis of a set of … sales tickets.
The Data Mining table
· Ticket id
· Product code:
  - Mushrooms
  - Pepperoni
  - Cheese
  - Beer
  - Soft drink
  - Other drink
Purpose of MBA
· Generate rules of the form:
  IF condition THEN result
· Example:
  IF product A and product C THEN product B
Types of rules
· Useful / actionable: rules that carry good-quality information and can be translated into business actions.
· Trivial: rules already known in the business because they occur so often
· Inexplicable: arbitrary curiosities with no practical application
Problems of MBA
· Having many items in the analysis set increases computation time exponentially
· Criteria must be defined for selecting the best rules
· Building a product taxonomy is important
How good is a rule?
· Measures that rate a rule:
  - Support
  - Confidence
  - Lift (Improvement)
Support
· It is the number (%) of transactions in which the rule is found:
  - E.g.: “If A then B” is present in 4000 of 10000 transactions.
  - Support (A/B): 40%
Confidence
· Number (%) of transactions that contain the rule, relative to the number of transactions that contain the conditional clause
  - E.g.: in the previous case, if A is present in 6000 transactions (60%)
  - Confidence (A/B) = 40% / 60% = 66.7%
Improvement (Lift)
· Predictive power of the rule:
  - Improvement = p(A/B) / (p(A) * p(B)), where p(A/B) is the support of the rule
  - E.g.: p(A/B) = 40%; p(A) = 60%; p(B) = 30%
    Improv (A/B) = 40% / (60% * 30%) = 2.22
    (illustrative figures only: in real data p(B) cannot be lower than the support of the rule)
  - Greater than 1: the rule has predictive value
Worked example
Basic data
Mushrooms   Pepperoni   Cheese   Count
Yes         Yes         Yes        100
Yes         Yes         No         400
Yes         No          Yes        300
Yes         No          No         100
No          Yes         Yes        200
No          Yes         No         150
No          No          Yes        200
No          No          No         550
TOTAL                             2000
Rules
Rule                               Count   Support   Confidence   Lift
Mushrooms                            900      0.45
Pepperoni                            850      0.43
Cheese                               800      0.40
Mushrooms --> Pepperoni              500      0.25         0.56   1.31
Mushrooms --> Cheese                 400      0.20         0.44   1.11
Cheese --> Pepperoni                 300      0.15         0.38   0.88
Mushrooms + Pepperoni --> Cheese     100      0.05         0.20   0.50
Mushrooms + Cheese --> Pepperoni     100      0.05         0.25   0.59
Cheese + Pepperoni --> Mushrooms     100      0.05         0.33   0.74
Significant rules: the two-item rules. The three-item rules (support 0.05) can be discarded for low support.
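The rule measures can be recomputed directly from the basic data of the worked example. The sketch below is one way to do it; the item names are English renderings of the slide labels, and the helper names `count_with` and `rule_metrics` are hypothetical. Lift is computed as confidence divided by the frequency of the rule head, which is algebraically the same as support / (p(body) * p(head)).

```python
# Ticket counts from the basic data table: (mushrooms, pepperoni, cheese) -> count
ITEMS = ("mushrooms", "pepperoni", "cheese")
COUNTS = {
    (True, True, True): 100,
    (True, True, False): 400,
    (True, False, True): 300,
    (True, False, False): 100,
    (False, True, True): 200,
    (False, True, False): 150,
    (False, False, True): 200,
    (False, False, False): 550,
}
TOTAL = sum(COUNTS.values())  # 2000 tickets


def count_with(items):
    """Number of tickets that contain every item in `items`."""
    idx = [ITEMS.index(item) for item in items]
    return sum(n for combo, n in COUNTS.items() if all(combo[i] for i in idx))


def rule_metrics(body, head):
    """Support, confidence and lift of the rule body --> head."""
    n_rule = count_with(body + head)
    support = n_rule / TOTAL
    confidence = n_rule / count_with(body)
    lift = confidence / (count_with(head) / TOTAL)
    return support, confidence, lift


for body, head in [
    (("mushrooms",), ("pepperoni",)),
    (("cheese", "pepperoni"), ("mushrooms",)),
]:
    s, c, l = rule_metrics(body, head)
    print(" + ".join(body), "-->", head[0], round(s, 2), round(c, 2), round(l, 2))
# mushrooms --> pepperoni 0.25 0.56 1.31
# cheese + pepperoni --> mushrooms 0.05 0.33 0.74
```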
Another MBA example
· The association is sought between the pizza toppings and the drinks.
· Rule graphs make it possible to visually identify rules with good support, confidence and lift
Rules
Support (%)   Confidence (%)   Type   Lift     Rule body                             Rule head
 3.1746       80.0000          +      1.7800   [Mushrooms]+[Other drink]             ==> [Pepperoni]
16.6667       81.8200          +      1.7200   [Beer]+[Pepperoni]                    ==> [Mushrooms]
13.0688       78.4100          +      1.6500   [Beer]+[Cheese]                       ==> [Mushrooms]
16.6667       63.0000          .      1.5400   [Mushrooms]+[Pepperoni]               ==> [Beer]
29.8413       72.8700          +      1.5300   [Beer]                                ==> [Mushrooms]
29.8413       62.6700          +      1.5300   [Mushrooms]                           ==> [Beer]
13.0688       61.7500          .      1.5100   [Mushrooms]+[Cheese]                  ==> [Beer]
 9.0476       57.0000          +      1.4000   [Pepperoni]+[Cheese]                  ==> [Soft drink]
 3.0159       57.0000          .      1.3900   [Mushrooms]+[Pepperoni]+[Cheese]      ==> [Beer]
 6.9312       56.9600          .      1.3500   [Mushrooms]+[Soft drink]              ==> [Cheese]
 9.0476       56.4400          .      1.3300   [Soft drink]+[Pepperoni]              ==> [Cheese]
Web Mining
· The problem: analyze the transactions and the profile of the users of the Web site of an online retailer.
Models applied
· Association of visited pages: product basket
· User profile: demographic clustering
· Potential buyers: decision tree
Page association
[Table: association rules between visited pages]
Clustering result: Business cluster
· High revenue
· Low COMMUNICATION
· Low FUN
· High AGE
· High rate in REGION 6 = Frankfurt
· Most are male
Clustering result: Fun cluster
· 10% of all users
· Low revenue
· High COMMUNICATION
· High FUN
· Low AGE
· High rate in REGION 5 = Cologne
· Most are female
Classification result
IF the interest in INFORMATION is very low (nearly 0) AND in COMMUNICATION high (with at least an access rate of 5) THEN the visitor will probably not buy (95.5%).
Click sequence
· Interpretation: in 17.2% of all transactions the user goes to GOURMET.html and then sends out two emails.
· Interpretation: in 56.9% of all transactions the user first goes to SPORTS.html, then uses the chat as a communication medium, and finally focuses his attention on Fashion.
· Interpretation: in 25.9% of all transactions the user first goes to womens-fashion.html, then sends a postcard, and goes back to womens-fashion.html again.
Early detection of arrears
· The problem: identify in advance the customers most likely to fall into arrears, so that preventive collection and recovery actions can be taken early.
The possible solutions
· Rules to identify the customer segments most prone to arrears
· Delinquency risk scoring
Applicable models
· For the rules: classification tree.
· For the scoring: neural model
Tree, 50/50 sample
Delinquent customers
[Figure: Mora 60 días, Región 90-98, 9.39% of population. Variables shown: MORA60, VIP CUSTOMER (Y/N), LATE FEES PAID, 30 DAYS (N/Y), OVER CREDIT LIMIT (N/Y), CREDIT SCORE, CUSTOMER AGE, CREDIT LIMIT, INCOME, MEMBER (MONTHS), # PURCHASES / WEEK, CASH LIMIT]
Non-delinquent customers
[Figure: Mora 60 días, Región 0-2, 6.81% of population. Variables shown: MORA60, LATE FEES PAID, 30 DAYS (N/Y), VIP CUSTOMER (Y), OVER CREDIT LIMIT (N/Y), CUSTOMER AGE, CREDIT SCORE, MEMBER (MONTHS), INCOME, CASH LIMIT, CREDIT LIMIT, # PURCHASES / WEEK]
Verification
[Chart: predicted scoring (vertical axis, from about -0.2 to 1.2) by actual arrears (SI / NO); group sizes N = 258 and N = 2947]
The scoring predicted by the network is clearly differentiated between delinquent customers and payers
References
· Data Mining Techniques for Marketing, Sales and Customer Support, Michael Berry, Gordon Linoff, Wiley, USA
· Data Mining with Neural Networks, Joseph Bigus, McGraw-Hill, USA
· Data Mining: a hands-on approach for business professionals, Robert Groth, Prentice Hall, USA
· Mastering Data Mining, Michael Berry, Gordon Linoff, Wiley, USA
· Data Preparation for Data Mining, Dorian Pyle, Morgan Kaufmann Publishers Inc., San Francisco, USA
· Análisis Multivariante, Hair, Anderson, Tatham, Black, Prentice Hall, Madrid
· Building Data Mining Applications for CRM, A. Berson, S. Smith, K. Thearling, McGraw-Hill
· IBM:
  - www.ibm.com/software/data/iminer/fordata/
  - www.dmg.org
  - www.ibm.com/redbooks
· The Data Mine: www.the-data-mine.com
· KDD Mine: www.kdnuggets.com
· chb@census.com.ar