Project in Distributed Search Engine Sampler (044167)
Dec 2010
Agenda
Problem Definition & Solution
Why bother?
Implementation Techniques
Problems (& Solutions)
HLD (High-Level Design)
Main Communication Data-Flows
UI snapshots
Demo Example: Distributed webpage downloading and parsing
Performance Analysis
Problem Definition
Main Problem:
Executing a large set of small computational tasks takes a long time on a single machine.
Tasks are homogeneous and unrelated.
Execution is serial.
Execution order is not significant.
Solution
Distribute the tasks over more than one machine and exploit the computational power of remote machines.
Why bother?
Several solutions already exist, such as:
Condor
  Complex syntax
  One task per run
  Not developer-friendly
MPI
  Requires an understanding of networking
  Executing and synchronizing tasks is the user's responsibility
And more…
Why bother? (cont.)
Implementing a new solution:
  User-friendly API (ease of use)
  Transparent to the user
  Dynamic system management
  Task-generic
  Easy to convert from serial to parallel
Implementation Techniques
Java:
  Object-oriented
  Cross-platform
Java RMI (Remote Method Invocation):
  Easy to use
  Transparent networking mechanism
Problems
Firewall
Load Balancing
Auto-Update
Efficiency
Execution Transparency
Fault Tolerance
Firewall
Problem:
A firewall can exist between the user machine and the remote machines (Executers).
Solution:
One-sided connection (initiated from the user side)
Firewall (cont.)
[Diagram: Client Machine and Executers' Machines separated by a Firewall; arrows labeled Send Task, Send Result and Get Results.]
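The one-sided connection idea can be illustrated with a small in-memory sketch. It is not the project's networking layer: the class `PullOnlyExecuter` and its methods `sendTask` and `getResults` are hypothetical names, and tasks are plain strings. The point is only that every call is made by the client, so the executer never has to open a connection back through the firewall.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical executer endpoint: it only answers calls and never calls the client back. */
final class PullOnlyExecuter {
    // Results wait here until the client asks for them.
    private final Queue<String> readyResults = new ArrayDeque<>();

    /** Called by the client: submit a task (a plain string in this sketch). */
    synchronized void sendTask(String task) {
        readyResults.add("done:" + task); // stand-in for the real work
    }

    /** Called by the client: drain whatever results are ready at this moment. */
    synchronized List<String> getResults() {
        List<String> out = new ArrayList<>(readyResults);
        readyResults.clear();
        return out;
    }
}
```

In a real deployment both methods would be remote calls issued from the client side, which is why polling for results replaces any callback from the executer.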
Load Balancing (Scheduling)
Problem:
Tasks can be submitted to the system asynchronously. Load balancing is needed for efficiency.
Solution:
Distribute tasks according to remote-machine load, task weight and priority.
Round-robin based.
Prevents starvation.
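One way to combine round-robin with load awareness is sketched below. The slides do not show the actual scheduler, so everything here is an assumption: the class `RoundRobinScheduler`, the per-executer `capacity`, and the rule "skip an executer whose load plus the task weight would exceed its capacity". Priority handling is left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: round-robin assignment that skips overloaded executers. */
final class RoundRobinScheduler {
    /** Minimal executer record; name, capacity and load are illustrative fields. */
    static final class ExecuterSlot {
        final String name;
        final int capacity; // maximum total task weight this executer accepts
        int load;           // total weight currently assigned
        ExecuterSlot(String name, int capacity) { this.name = name; this.capacity = capacity; }
    }

    private final List<ExecuterSlot> slots = new ArrayList<>();
    private int next = 0; // index where the next round-robin scan starts

    void addExecuter(String name, int capacity) { slots.add(new ExecuterSlot(name, capacity)); }

    /** Returns the executer chosen for a task of the given weight, or null if all are full. */
    String assign(int weight) {
        for (int i = 0; i < slots.size(); i++) {
            ExecuterSlot s = slots.get((next + i) % slots.size());
            if (s.load + weight <= s.capacity) {
                s.load += weight;
                next = (next + i + 1) % slots.size(); // continue after the chosen slot
                return s.name;
            }
        }
        return null; // caller keeps the task queued until some load is released
    }

    /** Called when a task of the given weight finishes on the named executer. */
    void release(String name, int weight) {
        for (ExecuterSlot s : slots) {
            if (s.name.equals(name)) { s.load = Math.max(0, s.load - weight); }
        }
    }
}
```

Because the scan always resumes after the last chosen executer, no executer with spare capacity is passed over indefinitely, which is one simple reading of "prevents starvation".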
Auto-Update
Problem:
In a regular system, when the end user needs to update the tasks and the code that executes them (to fix bugs or change what the tasks do), they must go over all remote machines (Executers), which may be far away, and update the code on the machines themselves.
Solution:
Support updating the code from the System Manager machine, which is always near the end user.
An RMI connection problem arises.
Efficiency
Problem:
Executing a large number of small tasks one by one can be inefficient because of per-task overhead.
For each task:
  Scheduling
  Sending RMI messages
Solution:
Chunking: grouping smaller units of information into larger coordinated units.
Send a collection of tasks to the same remote machine (Executer).
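A minimal sketch of chunking, assuming tasks are held in a list and a fixed chunk size is chosen by the user (the demo later passes a chunk size of 10 to the client). The class name `Chunker` and the method `chunk` are illustrative and not part of the system's API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a task list into fixed-size chunks. */
final class Chunker {
    /** Returns consecutive sublists of at most chunkSize items; the last one may be shorter. */
    static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            // Copy the view so each chunk can be sent independently.
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }
}
```

With a chunk size of 10, 150 tasks become 15 messages instead of 150, so the scheduling and RMI overhead is paid once per chunk rather than once per task.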
Execution Transparency
Problem:
The same I/O operation must produce the same output in both the serial and the distributed system.
It should feel like running locally:
  Output stream
  Exceptions
  More…
Solution:
Simulate these operations.
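For the output stream, one possible way to simulate local execution is to capture what the task prints on the executer and ship that text back with the result, so the client can print it locally. The sketch below shows only the capture step, using the standard `System.setOut`; the class `OutputCapture` is a hypothetical name and the slides do not say how the system implements redirection.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

/** Hypothetical helper: runs a job with System.out redirected into a buffer. */
final class OutputCapture {
    /** Runs the job and returns everything it printed to System.out. */
    static String runCaptured(Runnable job) {
        PrintStream original = System.out;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream capture = new PrintStream(buffer, true);
        System.setOut(capture);
        try {
            job.run();
        } finally {
            capture.flush();
            System.setOut(original); // always restore the real stream
        }
        return buffer.toString();
    }
}
```

Note that `System.setOut` is process-wide, so this simple version assumes one task runs at a time per JVM.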
![Page 16: Project in Distributed Search Engine Sampler (044167)](https://reader030.fdocuments.in/reader030/viewer/2022012420/61750c4e70889b4d644dee9e/html5/thumbnails/16.jpg)
Fault Tolerance
Problem:
Remote machines may disconnect from the network at any time.
Tasks executing on the machine will be lost.
Solution:
Failure detector.
Save the machine's state until it connects again, then resume its work.
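The slides do not describe how the failure detector works. A common approach, shown here purely as an assumption, is a heartbeat timeout: each executer reports in periodically, and one that has been silent for longer than a timeout is suspected to have failed. The class `HeartbeatFailureDetector` and its methods are hypothetical, and timestamps are passed in explicitly to keep the sketch deterministic.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical heartbeat-based failure detector (an assumed design, not the project's). */
final class HeartbeatFailureDetector {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatFailureDetector(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    /** Records that the executer was alive at the given time. */
    void heartbeat(String executerId, long nowMillis) {
        lastSeen.put(executerId, nowMillis);
    }

    /** True if the executer was never seen or has been silent longer than the timeout. */
    boolean isSuspected(String executerId, long nowMillis) {
        Long seen = lastSeen.get(executerId);
        return seen == null || nowMillis - seen > timeoutMillis;
    }
}
```

A manager could keep the tasks assigned to a suspected executer and hand them back when a new heartbeat arrives, matching the "save the state, then resume" idea above.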
High-Level Design
[Diagram components: System Manager, Client, Executer, Task, Result, UI]
High-Level Design (cont.)
[Class diagram: Client, Task, Result, Executer, SystemManager and UI / CMD, linked by one-to-one and one-to-many associations; ConcreteTask, ConcreteResult and ConcreteExecuter are the concrete components.]
• Concrete Components (Provided by user)
Common Components
Item
  Base communication item for the system: Task, Result, Chunk, etc.
Task
  The type of task the user wants to execute.
Result
  The result of the Task execution.
Chunk
  Holds a bundle of Items for network optimization.
  Used for efficiency.
Common Components (cont.)
Synchronized Counter Mechanism
  Used to determine whether the user is idle or not.
Log Tracing
  A log tracer resides on each remote object.
  Used mainly for debugging and I/O redirection.
Networking Layer
  Responsible for communication.
  Takes the existence of a firewall into account.
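The synchronized counter can be pictured as a thread-safe count of outstanding tasks: it goes up when a task is submitted and down when a result comes back, and the user is idle when it reaches zero. This interpretation, the class `PendingCounter` and its method names are assumptions; the slides only state the counter's purpose.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical thread-safe counter of tasks that were submitted but not yet answered. */
final class PendingCounter {
    private final AtomicInteger pending = new AtomicInteger();

    /** Call when a task is handed to the system. */
    void taskSubmitted() { pending.incrementAndGet(); }

    /** Call when the matching result has been received. */
    void resultReceived() { pending.decrementAndGet(); }

    /** Number of tasks still in flight. */
    int count() { return pending.get(); }

    /** The user is considered idle when nothing is in flight. */
    boolean isIdle() { return pending.get() == 0; }
}
```

`AtomicInteger` keeps the updates safe when the submitting thread and a result-collecting thread touch the counter concurrently.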
Executer Component
Main Functionality
  Resides on the remote machine, waiting for tasks to execute.
Task Executer
  Holds the concrete task Executer provided by the user.
Results Organization
  Prepares results for Clients.
Client
Main Functionality
  Resides on the user machine.
  Provides the implementation of the user API.
Results Collector
  Polls prepared results from Executers.
System Manager
Main Functionality
  Match-making between Clients and Executers (scheduling).
  Holds lists of the connected Executers and Clients.
  Manages common system operations:
    Auto-Update
    Clean Exit
    Etc.
Failure Detection Mechanism
Main Communication Diagram
Client System Data-Flow
Executer System Data-Flow
Main User Scenario
[Diagram: User Interface, System Manager, Client, Firewall, and Executers 1 to 3, exchanging Updates, Task1/Task2 and Result1/Result2.]
User Interface
User Interface: Executers Tab
User Interface: Executer Trace Log
User Interface: Update Tab
Demo Example: Distributed webpage downloading and parsing
Created in order to test:
  System performance
  Ease of use
From serial code to distributed code using the system API
Tested on Windows (XP) and Linux (SUSE)
Before
public class DownloadFiles {
    private final String rootDir_;
    private final DocIndexRepository downloadedDocs_;

    private void run(String inputFileName) throws Exception {
        BufferedReader input = FileHandler.openFileReader(inputFileName);
        String url;
        // Stop at end of input instead of looping forever.
        while ((url = input.readLine()) != null) {
            try {
                // This line needs to be distributed so that it can run
                // on more than one machine.
                String text = downloadAndParseFile(url);
                String fullFileName = createOutputFile(rootDir_);
                writeResultToFile(text, fullFileName);
                outputFileStream.close();
            } catch (Exception e) {
                System.out.println(e.getMessage());
            }
        }
    }
}
Changes to be done
Implement:
  Task class
  Result class
  Executer class (downloadAndParseFile function)
Modify the code using the API
After
Task Code:
import diSys.Common.Item;

public class DownloadTask extends Item {
    public String url;

    public DownloadTask() {
        super(0);
        this.url = "";
    }

    public DownloadTask(long id, String url) {
        super(id);
        this.url = url;
    }

    @Override
    public String toString() {
        return "Task ID: " + this.getId();
    }
}

Result Code:
import diSys.Common.Item;

public class DownloadResult extends Item {
    public String text;
    public String url;

    public DownloadResult(long id) {
        super(id);
    }
}
After (cont.)
Executer Code:
public class DownloadExecuter implements IExecutor<DownloadTask, DownloadResult> {
    public DownloadResult run(DownloadTask task) throws Exception {
        // DownloadResult has no no-arg constructor; reuse the task ID
        // so the result can be matched to its task.
        DownloadResult res = new DownloadResult(task.getId());
        // Output will be redirected (appears on the Client machine)
        System.out.println("trying to download url: " + task.url);
        // Do the job...
        res.text = downloadAndParseFile(task.url);
        res.url = task.url;
        System.out.println("url: " + task.url + " downloaded!");
        return res;
    }
}
After (cont.)
Modified Client Code:
public class DownloadClient {
    private final String rootDir_;
    private final DocIndexRepository downloadedDocs_;

    public static void main(String[] args) throws Exception {
        // Initialization (connecting to the System Manager)
        RemoteClient<DownloadTask, DownloadResult> client =
            new RemoteClient<DownloadTask, DownloadResult>("localhost", 5000 /*port*/, 10 /*chunk size*/);
        // Start the Client
        client.Start();
        BufferedReader input = FileHandler.openFileReader(inputFileName);
        String url;
        long id = 0; // task IDs, required by the DownloadTask(long, String) constructor
        while ((url = input.readLine()) != null) {
            client.addTask(new DownloadTask(id++, url)); // Submit tasks to the system
            // String text = downloadAndParseFile(url);
        }
        // While there are tasks in execution, try to get results
        while (client.GetTaskNum() > 0) {
            try {
                // May throw an exception (if one was thrown in the Executer code)
                DownloadResult dr = client.GetResult();
                String fullFileName = createOutputFile(rootDir_);
                writeResultToFile(dr.text, fullFileName);
                outputFileStream.close();
            } catch (Exception e) {
                System.out.println(e.getMessage());
            }
        }
        client.Stop();
    }
}
Performance Analysis
Study on a sample application: distributed webpage downloading and parsing
The system was tested with the following workload: the Task downloads a web page given its URL, and the Result is the text extracted from the page's HTML using the external HtmlParser library.
Benchmark #1
Performance (total time) of executing 150 Tasks on 1, 2, 4 and 8 Executers running on 2 different Windows machines.
Benchmark #2
Performance (total time) of executing 150 Tasks on 3, 6, 9 and 12 Executers running on 3 machines (2 Windows machines and 1 Linux machine).