Supervisor: Mr. Phan Trường Lâm
Students: Vũ Nhật Linh, Lê Quang Hoàn, Nguyễn Duy Quyền, Hoàng Nam, Nguyễn Thế Anh
Capstone Project Documents Management
Team information
Agenda
Introduction
Project plan
System Requirement Specifications
System Analysis and Design
Testing
Deployment and User Guide
Summary
Demo and Q&A
Introduction
Initial Idea
Literature Review of Existing System
Proposal & Product
Initial Idea
We decided to develop a new system that integrates:
• Collect documents
• Organize these documents
• Extract keywords
• Ranking
• Searching
Literature Review of Existing System
Methods that these websites use to build their systems:
Big database
Search
Ranking and highlight return results
Compare documents to detect plagiarism
Literature Review
Achievements of the existing systems
• Attractive
• Easy to use
• Speed & Reliability
• Quality Results
• Ensuring Security
Awareness
Limitations of the existing systems
• Costs
• Privacy
Proposal
• Collect and manage Capstone projects
• Support looking up Capstone projects
• Avoid repeating and copying ideas
• Rank results
• Refer to other materials
• Friendly interface like Google
• Cheaper to build
• Free to use
• Public for everyone
• Inside and outside the University
Product
Web application
Mobile application (in the future)
Project Plan
Development environment
Process
Project organization
Project schedule
Risk management
Development Environment
Hardware
• 1 GB of RAM, 100 GB of hard disk, Core 2 Duo 2.0 GHz
• 2 GB of RAM, 100 GB of hard disk, Core 2 Duo 2.0 GHz
Software
Process
Follow Waterfall model
Project organization
Controlling and Monitoring
• Meeting
• Assign task
• Tracking task
• Issue resolve
• Review task
• Report
Project organization
Communication control
Online activity
• Email
• Chat
• Phone
Offline activity
• Kick-Off project
• Team building
Project organization
Project Schedule
Overall plan
Risk Management
People risk
Estimation risk
Technology risk
Requirement risk
Schedule risk
System Requirement Specifications
User Requirements
System Requirements
Non-functional requirements
User Requirements
Lecturers and Students:
• Search project documents.
• Download documents.
Librarians:
• Edit profile.
• Search documents.
• Add/Edit/Delete document.
• Add/Edit/Delete category.
Administrator:
• Edit profile.
• Add/Edit/Delete account.
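The role lists above can be read as a simple role-to-permission table. The sketch below is only an illustration: the permission names and the `is_allowed` helper are hypothetical and are not taken from the project's code.

```python
# Hypothetical permission names derived from the user requirements above.
ROLE_PERMISSIONS = {
    "lecturer": {"search_documents", "download_documents"},
    "student": {"search_documents", "download_documents"},
    "librarian": {
        "edit_profile",
        "search_documents",
        "manage_documents",   # add/edit/delete document
        "manage_categories",  # add/edit/delete category
    },
    "administrator": {"edit_profile", "manage_accounts"},
}


def is_allowed(role, permission):
    """Return True if the given role holds the given permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())
```

Unknown roles get an empty permission set, so they are denied by default.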
User Requirements
Other requirements
• Search results will be ranked.
• Each document has the following information:
Name
Author
Supervisor
Category
Description
User Requirements
• Input files:
Keyword file
Abstract file
Full document file
Other materials
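One possible way to hold the document information and input files listed above is a small record type. This is a sketch under assumptions: the class name `CapstoneDocument` and all field names are hypothetical, and the real system stores this data in its SQL Server database.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CapstoneDocument:
    # Descriptive information required for every document.
    name: str
    author: str
    supervisor: str
    category: str
    description: str
    # Paths to the uploaded input files (hypothetical representation).
    keyword_file: Optional[str] = None
    abstract_file: Optional[str] = None
    full_document_file: Optional[str] = None
    other_materials: List[str] = field(default_factory=list)
```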
System Requirements
The server communicates with client computers over HTTP and uses standard protocols to complete service-based interactions.
Configuration
Server:
• Windows Server 2008 operating system
• .NET Framework 3.5
• SQL Server 2008
• IIS 7
Client:
• Web browser
Non-functional Requirements
Usability
Availability
Security
Reliability
Performance
Maintainability
System Analysis and Design
Architectural design
Detail design
Database design
Coding convention
Extract Keyword algorithm
Ranking
Architectural design
Overall architecture
MVC architecture design pattern
Detail design
CProDMS Component Diagram
Database design
Entity diagram
Coding convention
Follow:
• Microsoft .NET Library Standards
• FxCop rules and Code Analysis for Managed Code Warnings
Extract Keyword Algorithm
Introduction
Study Algorithm
Evaluation
Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information
(Yutaka Matsuo and Mitsuru Ishizuka, Dec. 10, 2003)
Algorithm – What is a keyword?
Position
Meaning
Frequency
Keyword
Algorithm – Step by step
Preprocessing: Discard stop words → Stem → Extract frequency
Processing: Select frequent terms → Expected probability → Calculate χ'² value
Output
Algorithm – Studying
Example:
Original Text
Information is the most powerful weapon in the modern society. Every day we are overflowed with a huge amount of data in form of electronic newspaper articles, emails, web pages and search results. Often, information we receive is incomplete, such that further search activities are required to enable correct interpretation and usage of this information.
Step 1: Discarded Stop Words
Information powerful weapon modern society day overflowed huge amount data electronic newspaper articles emails web pages search results Often information receive incomplete such further search activities required enable correct interpretation usage information
Step 2: Stemmed Words (using Porter Stemming Algorithm)
Informat power weapon modern societi day overflow huge amoun data electronic newspaper articl email web page search result Often informat receive incomplet such further search activ requir enable correct interpret usag informat
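The preprocessing steps in this example can be sketched as follows. This is a minimal illustration, not the project's implementation: the stop-word list is a deliberately tiny, hypothetical one, and stemming is passed in as a function because a full Porter stemmer is too long to show here (the default leaves words unchanged).

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list; a real system needs a fuller one.
STOP_WORDS = {
    "is", "the", "most", "in", "every", "we", "are", "with", "a",
    "of", "form", "and", "that", "to", "this",
}


def preprocess(text, stem=lambda word: word):
    """Lowercase, tokenize, discard stop words, then apply the stem function."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def term_frequencies(terms):
    """Count how often each preprocessed term occurs."""
    return Counter(terms)
```

For example, `preprocess("Information is the most powerful weapon in the modern society.")` keeps only `information`, `powerful`, `weapon`, `modern` and `society`.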
Algorithm – Studying
Select frequent terms
The top ten frequent terms (denoted as G) and their probability of occurrence, normalized so that the sum is 1.
According to our study, the number of keywords is about 10% of the number of terms in the document, and no more than 30 terms.
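Selecting the frequent terms G and normalizing their probabilities could look like the sketch below. The function name `frequent_terms` and the `top_n` parameter are assumptions for illustration.

```python
from collections import Counter


def frequent_terms(terms, top_n=10):
    """Return the top_n most frequent terms with probabilities that sum to 1."""
    top = Counter(terms).most_common(top_n)
    total = sum(count for _, count in top)
    return {term: count / total for term, count in top}
```

The probabilities are normalized over the selected terms only, so they sum to 1 within G.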
Algorithm – Studying
Co-occurrence and Importance
Two terms in a sentence are considered to co-occur once.
Example:
The imitation game could then be played with the machine in question and the mimicking digital computer and the interrogator would be unable to distinguish them.
“imitation” and “digital computer” have one co-occurrence.
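Counting co-occurrences per sentence might be sketched like this, assuming each sentence has already been reduced to a list of terms. The function name is hypothetical, and pairs are stored in sorted order so that (a, b) and (b, a) share one counter.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count, for each pair of distinct terms, the sentences they share."""
    counts = Counter()
    for sentence in sentences:
        # set() ensures a pair co-occurs at most once per sentence.
        for a, b in combinations(sorted(set(sentence)), 2):
            counts[(a, b)] += 1
    return counts
```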
Algorithm – Studying
Co-occurrence and Importance
Algorithm – Studying
Co-occurrence and Importance
The degree of bias of co-occurrence can be used as an indicator of term importance.
Algorithm – Studying
The statistical value of χ² is defined as
\[ \chi^2(w) = \sum_{g \in G} \frac{(\mathrm{freq}(w, g) - n_w p_g)^2}{n_w p_g} \]
where:
p_g: unconditional probability of a frequent term g ∈ G (the expected probability)
n_w: the total number of co-occurrences of term w and the frequent terms G
freq(w, g): frequency of co-occurrence of term w and term g
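A minimal sketch of the χ² computation, assuming the form given in the cited Matsuo and Ishizuka paper: the sum over frequent terms g of (freq(w, g) − n_w·p_g)² / (n_w·p_g). The function and argument names are hypothetical, and every expected value n_w·p_g is assumed to be positive.

```python
def chi_square(w, frequent_probs, cooc, n_w):
    """Chi-square bias of term w against the frequent terms.

    frequent_probs maps each frequent term g to p_g,
    cooc maps (w, g) pairs to freq(w, g), and n_w is as defined above.
    """
    total = 0.0
    for g, p_g in frequent_probs.items():
        expected = n_w * p_g            # expected co-occurrence count
        observed = cooc.get((w, g), 0)  # observed co-occurrence count
        total += (observed - expected) ** 2 / expected
    return total
```

A term that co-occurs with the frequent terms exactly as expected scores 0; the more biased its co-occurrence, the higher the score.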
Algorithm – Studying
We consider the length of each sentence and revise our definitions:
p_g: (the sum of the total number of terms in sentences where g appears) divided by (the total number of terms in the document)
n_w: the total number of terms in the sentences where w appears, including w
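The revised, sentence-length-aware definitions can be computed directly from tokenized sentences. This sketch assumes each sentence is a list of terms; the function name is hypothetical.

```python
def sentence_aware_stats(sentences, frequent, w):
    """Return ({g: p_g}, n_w) using the sentence-length-based definitions."""
    total_terms = sum(len(s) for s in sentences)
    # p_g: terms in sentences containing g, divided by all terms in the document.
    p = {
        g: sum(len(s) for s in sentences if g in s) / total_terms
        for g in frequent
    }
    # n_w: terms in the sentences where w appears, including w itself.
    n_w = sum(len(s) for s in sentences if w in s)
    return p, n_w
```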
Algorithm – Studying
Algorithm – Studying
We use the following function to measure the robustness of bias values. It subtracts the maximal term from the χ² value:
\[ \chi'^2(w) = \chi^2(w) - \max_{g \in G} \frac{(\mathrm{freq}(w, g) - n_w p_g)^2}{n_w p_g} \]
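The robust variant can be sketched by computing each per-term contribution (freq(w, g) − n_w·p_g)² / (n_w·p_g) and dropping the largest one. Names are hypothetical, and n_w·p_g is assumed to be positive for every frequent term.

```python
def robust_chi_square(w, frequent_probs, cooc, n_w):
    """Chi-square of term w with the single largest contribution removed."""
    contributions = [
        (cooc.get((w, g), 0) - n_w * p_g) ** 2 / (n_w * p_g)
        for g, p_g in frequent_probs.items()
    ]
    if not contributions:
        return 0.0
    return sum(contributions) - max(contributions)
```

Removing the maximal term keeps a word from ranking high only because it co-occurs strongly with one particular frequent term.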
Algorithm – Studying
To improve the extracted keywords, we cluster terms.
Two major approaches (Hofmann & Puzicha 1998) are:
Similarity-based clustering: if terms w1 and w2 have a similar distribution of co-occurrence with other terms, w1 and w2 are considered to be in the same cluster.
Pairwise clustering: if terms w1 and w2 co-occur frequently, w1 and w2 are considered to be in the same cluster.
E.g.: Monday is a day in a week. Tuesday is a day in a week. Wednesday is a day in a week.
Algorithm – Studying
Similarity-based clustering centers upon Red Circles
Pairwise clustering focuses on Green Circles
Algorithm – Studying
Similarity-based clustering
Cluster a pair of terms whose Jensen-Shannon divergence is
Where:
and:
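The slide's exact formula and threshold are not preserved here, so the sketch below uses the standard Jensen-Shannon divergence between two co-occurrence distributions, each given as a dictionary of probabilities. The function names are hypothetical, and any clustering threshold would have to be supplied separately.

```python
import math


def _kl(p, m):
    """Kullback-Leibler divergence KL(p || m), skipping zero-probability entries."""
    return sum(p[k] * math.log(p[k] / m[k]) for k in p if p[k] > 0)


def js_divergence(p, q):
    """Standard Jensen-Shannon divergence (natural log) between p and q."""
    keys = set(p) | set(q)
    p_full = {k: p.get(k, 0.0) for k in keys}
    q_full = {k: q.get(k, 0.0) for k in keys}
    # m is the midpoint distribution; it is positive wherever p or q is.
    m = {k: 0.5 * (p_full[k] + q_full[k]) for k in keys}
    return 0.5 * _kl(p_full, m) + 0.5 * _kl(q_full, m)
```

The value is 0 for identical distributions and log 2 for distributions with no overlap.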
Algorithm – Studying
Pairwise clustering
Cluster a pair of terms whose mutual information is
Where:
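The slide's formula and threshold are not preserved, so this sketch assumes the standard pointwise mutual information, log(P(w1, w2) / (P(w1)·P(w2))), with probabilities estimated from sentence-level counts. The function name and the sentence-based estimate are assumptions.

```python
import math


def mutual_information(sentences, w1, w2):
    """Pointwise mutual information of w1 and w2 from sentence counts."""
    n = len(sentences)
    c1 = sum(1 for s in sentences if w1 in s)
    c2 = sum(1 for s in sentences if w2 in s)
    c12 = sum(1 for s in sentences if w1 in s and w2 in s)
    if c12 == 0:
        # Terms that never share a sentence get the lowest possible score.
        return float("-inf")
    # P(w1, w2) / (P(w1) * P(w2)) simplifies to n * c12 / (c1 * c2).
    return math.log(n * c12 / (c1 * c2))
```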
Algorithm – Studying
Algorithm – Evaluation
Precision: ratio of correct keywords to the number of extracted keywords
Coverage: ratio of indispensable keywords in the list to all the indispensable terms
Frequency index: average frequency of the keywords in the list
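The three evaluation measures can be written down directly from their definitions. In this sketch the function names are hypothetical, and the sets of correct and indispensable keywords are assumed to come from human judges.

```python
def precision(extracted, correct):
    """Share of extracted keywords that are judged correct."""
    if not extracted:
        return 0.0
    return sum(1 for k in extracted if k in correct) / len(extracted)


def coverage(extracted, indispensable):
    """Share of indispensable terms that appear in the extracted list."""
    if not indispensable:
        return 0.0
    return sum(1 for k in indispensable if k in extracted) / len(indispensable)


def frequency_index(extracted, frequencies):
    """Average document frequency of the extracted keywords."""
    if not extracted:
        return 0.0
    return sum(frequencies.get(k, 0) for k in extracted) / len(extracted)
```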
Ranking – Why?
Ranking Result
Ranking
Use the rank formula for a term in a collection of documents (from "Automatic Keyword Extraction for Database Search"; first examiner: Prof. Dr. techn. Dipl.-Ing. Wolfgang Nejdl; second examiner: Prof. Dr. Heribert Vollmer; supervisor: MSc. Dipl.-Inf. Elena Demidova):
R(t) = Fd(t) * log(1 + N/N(t)) (1)
where:
R(t): rank of term t in the whole collection
Fd(t): frequency of term t in the given document
N: total number of documents in the collection
N(t): total number of documents that contain term t
Ranking formula:
Rank = d * Rd(t) / R(t) (2)
=> Rank = d * Rd(t) / (Fd(t) * log(1 + N/N(t))) (3)
where:
d: reliability coefficient
Rd(t): rank of term t in the document, as extracted by the Extract Service
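Formulas (1) to (3) translate into a few lines of code. This is a sketch under assumptions: the function and parameter names are hypothetical, and the natural logarithm is assumed because the slides do not state the base.

```python
import math


def collection_rank(freq_in_doc, total_docs, docs_with_term):
    """Formula (1): R(t) = Fd(t) * log(1 + N / N(t))."""
    return freq_in_doc * math.log(1 + total_docs / docs_with_term)


def rank(d, rank_in_doc, freq_in_doc, total_docs, docs_with_term):
    """Formulas (2) and (3): Rank = d * Rd(t) / R(t)."""
    return d * rank_in_doc / collection_rank(freq_in_doc, total_docs, docs_with_term)
```

For instance, a term that occurs twice in a document and appears in all 10 documents of a collection has R(t) = 2 · log 2.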
Searching
Testing
V-model
Testing
Test result
No  Tester  Module code           Pass  Fail  Untested  N/A  Number of test cases
1   AnhNT   Master Page           18    0     0         0    18
2   AnhNT   Home Page             12    0     0         0    12
3   AnhNT   Search Result         5     0     0         0    5
4   AnhNT   User Account          69    0     0         0    69
5   AnhNT   Error Page            8     0     0         0    8
6   NamH    Category              36    0     0         0    36
7   NamH    Document              47    0     0         0    47
8   NamH    Authenticated         81    0     0         0    81
9   NamH    User Document Detail  9     0     0         0    9
Sub total                         285   0     0         0    285
Test coverage: 100.00 %
Test successful coverage: 100.00 %
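The two percentages can be reproduced from the sub-total row. The slides do not define them, so this sketch assumes that test coverage is the share of applicable cases that were executed (pass + fail) and that successful coverage is the share of applicable cases that passed; the function name is hypothetical.

```python
def coverage_percentages(passed, failed, untested, not_applicable):
    """Return (test coverage %, successful coverage %) under the assumed definitions."""
    total = passed + failed + untested + not_applicable
    applicable = total - not_applicable  # N/A cases are excluded
    if applicable == 0:
        return 0.0, 0.0
    executed = passed + failed
    return 100 * executed / applicable, 100 * passed / applicable
```

With the sub-total row (285 passed, 0 failed, 0 untested, 0 N/A), both values come out at 100 %.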
Deployment
Package Source Code
Client side
Server side
User guide
Summary
Strong points
• Enthusiasm
• Creativity
• Ability to cope with change
Weak points
• Lack of technical skills
• Lack of management skills
Lessons learned
• Improve technical & management skills
• Release the product on time despite limited time and resources
• Improve communication & problem-solving skills
Demo & Q&A