Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis
![Page 1: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/1.jpg)
Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis
Jean-Pierre Norguet
![Page 2: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/2.jpg)
Web Communication
[Diagram: the visitor sends a request to the Web site and receives a response; the transaction is recorded in the log file.]
• Web transaction = request + response
• Meta-data in Web logs:
– Request date and time
– Page reference (URI)
– Referral URI
– Client machine information
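The four meta-data fields listed above can be pulled out of a single log line. The sketch below assumes the Apache Combined Log Format, which the slides do not specify; the sample line, `LOG_PATTERN`, and `parse_log_line` are illustrative names, not part of the original material.

```python
import re
from datetime import datetime

# Assumption: logs follow the Combined Log Format (host, identity, user,
# timestamp, request line, status, size, referrer, user agent).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return the four meta-data fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {
        "time": datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "uri": match["uri"],                          # page reference
        "referrer": match["referrer"],                # referral URI
        "client": (match["host"], match["agent"]),    # client machine information
    }

# Hypothetical log line for illustration.
sample = ('192.0.2.1 - - [10/Oct/2005:13:55:36 +0200] '
          '"GET /cgi/search HTTP/1.1" 200 2326 '
          '"http://www.ulb.ac.be/" "Mozilla/5.0"')
record = parse_log_line(sample)
```

Note that the log line records only the page reference, not the content that was served, which is the gap the later slides address.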
![Page 3: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/3.jpg)
Web Analytics Process
[Diagram: visitors access the Web site, which produces log files; a Web analytics tool turns the log files into reports; the Web designer uses the reports to update the Web site.]
![Page 4: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/4.jpg)
Web Analytics Tools
• Results
– Page views
– Number of visitors
– Throughput
– Traffic
• Exploitation
– Self-promotion
– Sales planning
– Technical resizing
– Structure optimization
Low semantics → low-level decisions
![Page 5: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/5.jpg)
Organization Structure
[Diagram: organization hierarchy, from top to bottom: organization manager, Web site chief editor, sub-editors. Label: Web analytics tools.]
![Page 6: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/6.jpg)
Web Analytics Results
• Low semantics → low intuitiveness
• Too many results
![Page 7: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/7.jpg)
Page Ref. Ambiguity (1)
Address: http://www.ulb.ac.be/cgi/search
![Page 8: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/8.jpg)
Page Ref. Ambiguity (2)
Address: http://www.ulb.ac.be/cgi/search
![Page 9: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/9.jpg)
Page Volatility
Address: http://www.ulb.ac.be/cgi/search
![Page 10: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/10.jpg)
Page Synonymy (1)
![Page 11: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/11.jpg)
Page Synonymy (2)
![Page 12: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/12.jpg)
Page Polysemy
![Page 13: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/13.jpg)
Page Temporality (1)
![Page 14: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/14.jpg)
Page Temporality (2)
![Page 15: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/15.jpg)
Problems Summary
• Low semantics → low intuitiveness
• Too many results
• Page reference ambiguity
• Page synonymy
• Page polysemy
• Page temporality
• Page volatility
![Page 16: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/16.jpg)
Our solution
• Summarized and conceptual results for:
– Chief editors
– Organization managers
• Generic solution, independent from:
– Web site content
– Web site language
– Web site technology
→ Analyze output text content
![Page 17: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/17.jpg)
Output Page Collection
• Mining points in Web environment:
1. Web logs (+ content journal)
2. Web server
3. Network wire
4. On-screen Web page
[Diagram: path from Web server through router and Internet to the visitor's browser, with four collection points: 1. Web log files, 2. Server monitoring, 3. Network monitoring, 4. Client-side.]
![Page 18: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/18.jpg)
Lexical Analysis
• Output page mining → Web pages
• Unformatting → text
• Tokenization → terms
• Stopwords removal
• Stemming
• Term selection → index terms
• Occurrence counting → audience metrics
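The pipeline above can be sketched with the Python standard library. This is a minimal illustration, not the WASA implementation: the stopword list is a tiny hypothetical sample, `crude_stem` is a naive suffix stripper standing in for a real stemmer, and term selection is reduced to an optional whitelist.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "for"}

class TextExtractor(HTMLParser):
    """Unformatting: keep the text nodes and drop the markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def crude_stem(term):
    """Hypothetical suffix stripper; a real system would use a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def term_counts(html, index_terms=None):
    extractor = TextExtractor()
    extractor.feed(html)                                   # unformatting
    text = " ".join(extractor.chunks)
    tokens = re.findall(r"[a-z]+", text.lower())           # tokenization
    stems = [crude_stem(t) for t in tokens if t not in STOPWORDS]
    if index_terms is not None:                            # term selection
        stems = [s for s in stems if s in index_terms]
    return Counter(stems)                                  # occurrence counting

page = ("<html><body><h1>Mining Web logs</h1>"
        "<p>The logs are mined for terms.</p></body></html>")
counts = term_counts(page)
```

Here "logs" appears twice and both occurrences are counted under the stem "log", while "The", "are", and "for" are dropped as stopwords.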
![Page 19: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/19.jpg)
Term-Based Metrics
• Term occurrence counting in pages:
– Consultation: output pages
– Presence: online pages
– Interest
![Page 20: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/20.jpg)
Term-Based Metrics
• Term-based metrics:
– Consultation
– Presence
– Interest
• Limitations:
– Too many terms
– Term synonymy
– Term polysemy
→ Ontology-based term grouping
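A minimal sketch of the three term-based metrics follows. It rests on assumptions that the slide text does not state outright: presence counts term occurrences in the pages that are online, consultation counts occurrences in the pages actually served (each view counted), and interest is the ratio of consultation to presence. Pages are assumed static, so the served content equals the online content. All names and sample data are hypothetical.

```python
from collections import Counter

def term_metrics(online_pages, page_views):
    """online_pages: {uri: Counter of terms}; page_views: {uri: times served}."""
    presence = Counter()
    consultation = Counter()
    for uri, terms in online_pages.items():
        views = page_views.get(uri, 0)
        for term, occurrences in terms.items():
            presence[term] += occurrences              # term is online
            consultation[term] += occurrences * views  # term was delivered
    # Assumed definition: interest = consultation / presence.
    interest = {t: consultation[t] / presence[t] for t in presence if presence[t]}
    return consultation, presence, interest

# Hypothetical site with two pages.
online_pages = {
    "/fruit": Counter(apple=2, strawberry=1),
    "/veg": Counter(potato=3, apple=1),
}
page_views = {"/fruit": 4, "/veg": 1}
consultation, presence, interest = term_metrics(online_pages, page_views)
```

With one entry per term, the limitations listed above show up quickly: the result grows with the vocabulary, and synonyms are counted separately.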
![Page 21: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/21.jpg)
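Hierarchical Aggregation
• Consultation
• Presence
[Diagram: ontology tree with root Food, children Fruit (Apple, Strawberry) and Vegetable (Carrot, Potato), each node annotated with its consultation and presence counts. Node values as extracted: 22, 644, 162, 324, 11, 84, 44.]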
![Page 22: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/22.jpg)
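Hierarchical Aggregation
• Consultation
• Presence
• Interest (x2)
[Diagram: ontology tree with root Food, children Fruit (Apple, Strawberry) and Vegetable (Carrot, Potato), each node annotated with consultation, presence, and interest values. Node values as extracted: 14101.44, 644, 1616, 11210, 11.232, 32488, 12721, 6.0437, 8422, 4411.]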
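The aggregation can be sketched as a roll-up along parent links: each concept's own count is added to the concept itself and to all of its ancestors. The tree is the Food example from the slide; the counts are hypothetical (not the slide's values), and computing interest from the aggregated totals is an assumption.

```python
# Ontology from the slide, stored as child -> parent links.
PARENT = {
    "Apple": "Fruit", "Strawberry": "Fruit",
    "Carrot": "Vegetable", "Potato": "Vegetable",
    "Fruit": "Food", "Vegetable": "Food",
    "Food": None,
}

def aggregate(own_counts, parent=PARENT):
    """Add each concept's own count to itself and to every ancestor."""
    totals = {concept: 0 for concept in parent}
    for concept, count in own_counts.items():
        node = concept
        while node is not None:
            totals[node] += count
            node = parent[node]
    return totals

# Hypothetical per-concept counts.
consultation = aggregate({"Apple": 8, "Strawberry": 4, "Carrot": 2, "Potato": 6, "Fruit": 1})
presence = aggregate({"Apple": 4, "Strawberry": 4, "Carrot": 1, "Potato": 2, "Fruit": 1})

# Assumption: interest is the ratio of the aggregated values.
interest = {c: consultation[c] / presence[c] for c in PARENT if presence[c]}
```

In this example Fruit totals 13 consultations for a presence of 9, and Food totals 21 for a presence of 12, giving an interest of 1.75 at the root.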
![Page 23: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/23.jpg)
![Page 24: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/24.jpg)
Data model
• Ontology term hierarchy
• Number of occurrences: by day, by term
• List of days (possibly aggregated)
[Diagram: three tables. DailyTermOccurrences (day : DATETIME, term : VARCHAR, consultation : INT, presence : INT); OntologyElement (term : VARCHAR, parentTerm : VARCHAR); Day (day : VARCHAR, label : VARCHAR). DailyTermOccurrences references Day by day and OntologyElement by term; OntologyElement references itself by parentTerm.]
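The three tables can be tried out with an in-memory SQLite database. Table and column names and types follow the slide; the primary keys, the foreign-key clauses, the sample rows, and the recursive query are assumptions added for illustration (the experiments themselves used MySQL and SQL Server, not SQLite).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE OntologyElement (
    term       VARCHAR PRIMARY KEY,
    parentTerm VARCHAR REFERENCES OntologyElement(term)
);
CREATE TABLE Day (
    day   VARCHAR PRIMARY KEY,
    label VARCHAR
);
CREATE TABLE DailyTermOccurrences (
    day          DATETIME REFERENCES Day(day),
    term         VARCHAR REFERENCES OntologyElement(term),
    consultation INT,
    presence     INT,
    PRIMARY KEY (day, term)  -- assumed key: one row per day and term
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO OntologyElement VALUES (?, ?)",
                 [("Food", None), ("Fruit", "Food"), ("Apple", "Fruit")])
conn.execute("INSERT INTO Day VALUES ('2005-10-10', '10 Oct 2005')")
conn.executemany("INSERT INTO DailyTermOccurrences VALUES (?, ?, ?, ?)",
                 [("2005-10-10", "Apple", 8, 4), ("2005-10-10", "Fruit", 1, 1)])

# Roll up a concept by walking the parent-child hierarchy with a recursive CTE.
query = """
WITH RECURSIVE subtree(term) AS (
    SELECT ?
    UNION ALL
    SELECT e.term FROM OntologyElement AS e
    JOIN subtree AS s ON e.parentTerm = s.term
)
SELECT SUM(d.consultation), SUM(d.presence)
FROM DailyTermOccurrences AS d
WHERE d.term IN (SELECT term FROM subtree)
"""
totals = conn.execute(query, ("Food",)).fetchone()
```

The self-reference through `parentTerm` is what lets the hierarchy be of any depth without changing the schema.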
![Page 25: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/25.jpg)
OLAP Model
• Parent-child ontology dimension
• Time dimension
• Measures
[Diagram: cube of term-based metrics with dimensions Time and Ontology and measures Consultation, Presence, Interest.]
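The idea of a cube with a time dimension and an ontology dimension can be illustrated without an OLAP server. The sketch below rolls daily facts up to months; the fact rows, the `roll_up` helper, and the choice to sum presence across days are assumptions, and interest is derived from the aggregated measures rather than stored.

```python
from collections import defaultdict

# Hypothetical fact rows: (day, concept, consultation, presence).
facts = [
    ("2005-10-10", "Fruit", 8, 4),
    ("2005-10-11", "Fruit", 4, 4),
    ("2005-11-02", "Fruit", 6, 3),
    ("2005-10-10", "Vegetable", 2, 2),
]

def roll_up(facts, time_level=lambda day: day[:7]):
    """Aggregate the measures per (time bucket, concept); default bucket is the month."""
    cube = defaultdict(lambda: [0, 0])
    for day, concept, consultation, presence in facts:
        cell = cube[(time_level(day), concept)]
        cell[0] += consultation
        cell[1] += presence
    return {
        key: {"consultation": c, "presence": p, "interest": c / p if p else None}
        for key, (c, p) in cube.items()
    }

monthly = roll_up(facts)
yearly = roll_up(facts, time_level=lambda day: day[:4])
```

Changing `time_level` switches the granularity of the time dimension, which corresponds to drilling up or down in an OLAP client.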
![Page 26: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/26.jpg)
Case Study
• Web site: cs.ulb.ac.be
– 1,500 pages
– 100 page views/day
– Knowledge domain: computer science
• Ontology: ACM classification
– Knowledge domain: computer science
– 11 top domains
– 3 levels
– 1230 terms
![Page 27: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/27.jpg)
Experimental setting
• WASA prototype
• SQL Server OLAP Analysis Service
[Diagram: three machines (Web server, stats server, SQL server). Components: visitors, HTTP server, logs, content journal, WASA, MySQL, MyODBC, SQL Server, OLAP, Excel.]
![Page 28: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/28.jpg)
Concept-Based Metrics
• Y: top ontology domains
• X: consultation, presence, interest
![Page 29: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/29.jpg)
Results
![Page 30: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/30.jpg)
Exploitation Process
[Diagram: roles and actions around WASA. Roles: WASA administrator, chief editor, sub-editors, organization manager. Actions: configure and run, define concepts, view reports, redefine writing tasks, update Web site content, redefine Web communication objectives, manage organization.]
![Page 31: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/31.jpg)
Summary
• Web analytics
• Output page mining
• Lexical analysis
• Concept-based metrics with OLAP
• Experiments
• Conclusion & future work
![Page 32: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/32.jpg)
Conclusion
• Most Web sites supported
• Approach validated by experiments
• Topic-based metrics are intuitive
• Exploitation at higher decision levels
• Limitation: ontology availability
• Future work: ontology enrichment, integration into Web analytics tools
![Page 33: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/33.jpg)
Thank you for your attention
![Page 34: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/34.jpg)
Q & A
![Page 35: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/35.jpg)
Content Journaling
• Web logs + content journal
• (+) Easy to set up
• (+) Minimal storage and computation
• (-) Dynamic pages
[Diagram: path from Web server through router and Internet to the visitor's browser, with collection point 1. Web log files highlighted.]
![Page 36: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/36.jpg)
Server Monitoring
• Web server plugin
• (+) Dynamic pages
• (+) Fast
• (-) Risky
[Diagram: path from Web server through router and Internet to the visitor's browser, with collection point 2. Server monitoring highlighted.]
![Page 37: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/37.jpg)
Network Monitoring
• TCP/IP packet sniffing
• (+) Independent from Web server
• (-) Ethernet only
• (-) Encrypted content
• (-) CPU-intensive
[Diagram: path from Web server through router and Internet to the visitor's browser, with collection point 3. Network monitoring highlighted.]
![Page 38: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/38.jpg)
Client-Side Collection
• Page-embedded program
1. Parses page
2. Sends content to mining server
• (+) Distributed workload
• (+) Supports client-side XML/XSL
• (-) Visibility and vulnerability
[Diagram: path from Web server through router and Internet to the visitor's browser, with collection point 4. Client-side highlighted.]
![Page 39: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/39.jpg)
Output Page Collection
• Collection methods alone or in combination → any Web site output is collectable
1. Implemented: WASA-CJ
2. Implemented: Sourceforge mod_trace_output
[Diagram: path from Web server through router and Internet to the visitor's browser, with four collection points: 1. Web log files, 2. Server monitoring, 3. Network monitoring, 4. Client-side.]
![Page 40: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/40.jpg)
Experiments
• Experimental settings
• Visualization
• Ontology coverage
• Validation
• Scalability
![Page 41: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/41.jpg)
Experimental setting
• WASA prototype
• SQL Server OLAP Analysis Service
[Diagram: three machines (Web server, stats server, SQL server). Components: visitors, HTTP server, logs, content journal, WASA, MySQL, MyODBC, SQL Server, OLAP, Excel.]
![Page 42: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/42.jpg)
EUROVOC Thesaurus
• European Commission thesaurus
• Knowledge domain: EC-related domains
• 21 top domains
• 8 levels
• 6650 terms
![Page 43: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/43.jpg)
Eurovoc Example
• 04 Politics
• 08 International Relations
• 10 European Communities
• 12 Law
• 16 Economics
• 20 Trade
• 24 Finance
• 28 Social Questions
• 32 Education and Communications
• 36 Science
• 40 Business and Competition
• 44 Employment and Working Conditions
• 48 Transport
• 52 Environment
• 56 Agriculture, Forestry and Fisheries
• 60 Agri-Foodstuffs
• 64 Production, Technology and Research
• 66 Energy
• 68 Industry
• 72 Geography
• 76 International Organisations
28 SOCIAL QUESTIONS
• 2806 family
• 2811 migration
• 2816 demography and population
• 2821 social framework
• 2826 social affairs
• 2831 culture and religion
– arts
– cultural policy
– culture
– acculturation
– civilization
– cultural difference
– cultural identity
• RT: protection of minorities (1236)
• RT: socio-cultural group (2821)
– cultural pluralism
– popular culture
– regional culture
– religion
• 2836 social protection
• 2841 health
• 2846 construction and town planning
![Page 44: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/44.jpg)
Ontology Coverage
• Definition: the percentage of ontology terms that appear in the Web site
• ACM classification: 15%
• Eurovoc: 0.75%
• Characterizes the meaning of the metrics
→ Ontology enrichment with terms of the Web site
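The definition above translates directly into a set intersection. This is a minimal sketch with hypothetical term sets; it assumes both sides have already gone through the same lexical normalization (case folding, stemming), which the slide does not detail.

```python
def ontology_coverage(ontology_terms, site_terms):
    """Percentage of ontology terms that appear at least once in the Web site."""
    ontology_terms = set(ontology_terms)
    if not ontology_terms:
        return 0.0
    found = ontology_terms & set(site_terms)
    return 100.0 * len(found) / len(ontology_terms)

# Hypothetical example: two of the four ontology terms occur on the site.
ontology = {"apple", "strawberry", "carrot", "potato"}
site = {"apple", "potato", "banana"}
coverage = ontology_coverage(ontology, site)
```

Site terms that are absent from the ontology ("banana" here) do not affect coverage; they are the candidates for ontology enrichment.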
![Page 45: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/45.jpg)
Collaborative Enrichment
[Diagram: participants and roles. EZ: chief editor, organization manager, sub-editor. Sub-editors: MM, EM, SS, PS, JPN, JMD, JTS. JMD: webmaster. JPN: WASA administrator.]
![Page 46: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/46.jpg)
Methodology Steps
• Editor browses his pages
• Select new terms
• Find enrichment point in the ontology
• Insert terms into ontology
• Editor sends ontology to chief editor
• Chief editor commits the inserts
![Page 47: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/47.jpg)
Results
![Page 48: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/48.jpg)
Validation
• Comparison with WebTrends
• Personal Web site
• Optimized custom ontology of 1250 terms
• Top concepts match the page directories → results should be comparable
![Page 49: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/49.jpg)
Results
[Figure: two result panels, labeled Urchin and WASA.]
![Page 50: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/50.jpg)
Scalability: Case Study
• Web site: www.ulb.ac.be
– 800,000 pages
– 100,000 page views
– Knowledge domain: broad
• Ontology: Eurovoc
– Knowledge domain: broad (EC’s interests)
– 21 top domains
– 8 levels
– 6650 terms
• Run = 15 hours, linear dependency → reasonable and applicable to any Web site
![Page 51: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/51.jpg)
Experiments
• Experimental settings
• Visualization
• Ontology coverage
• Validation
• Scalability
![Page 52: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/52.jpg)
Ontologies
• Specification of a conceptualisation
• Controlled vocabulary of terms and relations
• An ontology defines the concepts and relations needed to share, reuse, and represent domain knowledge
• Example:
[Diagram: concepts Food, Fruit, Vegetable, Apple, Strawberry, with a relation labeled "good mix".]
![Page 53: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/53.jpg)
Ontology Restriction
• Ontology concept hierarchy
[Diagram: the example ontology (Food, Fruit, Vegetable, Apple, Strawberry) shown twice, with the "good mix" relation label.]
![Page 54: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/54.jpg)
Contents
• Context & motivations
• Output page mining
• Lexical analysis
• Concept-based metrics with OLAP
• Experiments
• Exploitation
• Conclusion & future work
![Page 55: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/55.jpg)
Context
• Web emergence
• Web communication analysis
• Maintenance needs effective decisions
• Highest organization levels
• Summarized and conceptual results
• Web analytics tools inappropriate
![Page 56: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/56.jpg)
Exploitation Process
[Diagram: roles and actions around WASA. Roles: WASA administrator, chief editor, sub-editors, organization manager. Actions: configure and run, define concepts, view reports, redefine writing tasks, update Web site content, redefine Web communication objectives, manage organization.]
![Page 57: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/57.jpg)
Metric Exploitation
• High interest
– Search pages about the topic
– Rank pages by consultation
– Optimize pages
• Low interest
– Search pages about the topic
– Rank pages by presence
– Question the topic: important/not important
– Drive traffic to the pages/delete pages
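The first two steps of each branch (find the pages about a topic, then rank them) can be sketched as follows. The page records, the `pages_to_review` helper, and the interest threshold of 1.0 that separates "high" from "low" are all hypothetical; the slides do not give a cut-off, and the final editorial decision stays with the editors.

```python
def pages_to_review(topic, interest, pages, threshold=1.0):
    """Return the ranking metric and the pages about the topic, best first.

    High interest: rank by consultation (pages worth optimizing).
    Low interest: rank by presence (pages to promote or delete).
    """
    about_topic = [page for page in pages if topic in page["terms"]]
    metric = "consultation" if interest >= threshold else "presence"
    ranked = sorted(about_topic, key=lambda page: page[metric], reverse=True)
    return metric, ranked

# Hypothetical page records.
pages = [
    {"uri": "/a", "terms": {"apple"}, "consultation": 5, "presence": 1},
    {"uri": "/b", "terms": {"apple", "potato"}, "consultation": 2, "presence": 3},
    {"uri": "/c", "terms": {"potato"}, "consultation": 9, "presence": 9},
]
high_metric, high_ranked = pages_to_review("apple", 2.5, pages)
low_metric, low_ranked = pages_to_review("apple", 0.4, pages)
```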
![Page 58: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis](https://reader036.fdocuments.in/reader036/viewer/2022062722/56813afa550346895da389d3/html5/thumbnails/58.jpg)
Future Work
• Concept visualisation in semantic space
• Automated taxonomy enrichment
• Additional OLAP dimensions