Valentín Carela-Español Pere Barlet-Ros Josep Solé-Pareta {vcarela, pbarlet, pareta}@ac.upc

Centre de Comunicacions Avançades de Banda Ampla (CCABA) Universitat Politècnica de Catalunya (UPC) Identification of Network Applications based on Machine Learning Techniques COST-TMA Meeting, Samos 2008 Valentín Carela-Español Pere Barlet-Ros Josep Solé-Pareta {vcarela, pbarlet, pareta}@ac.upc.edu


Centre de Comunicacions Avançades de Banda Ampla (CCABA)

Universitat Politècnica de Catalunya (UPC)

Identification of Network Applications based on Machine Learning Techniques

COST-TMA Meeting, Samos 2008

Valentín Carela-Español, Pere Barlet-Ros, Josep Solé-Pareta

{vcarela, pbarlet, pareta}@ac.upc.edu


Outline

• Scenario and objectives
• Existing solutions
   • Well-known ports
   • Payload based (pattern matching)
   • Machine Learning
      – Supervised
      – Unsupervised
• Proposed method
• Results
• Conclusions and Future work


Scenario and objectives

Scenario:
• SMARTxAC: Traffic Monitoring and Analysis System for the Anella Científica
• Real-time classification
• Independent from packet contents
• High-speed link

Objectives:
• Development of a ML technique to identify applications in SMARTxAC
• Automate the ML training phase
• Adapt our solution to Netflow
• Study how sampling affects it


Existing Solutions

Well-known ports
+ Computationally lightweight
- Very low accuracy

Payload based (pattern matching)
+ High accuracy
- Packet contents are required
- Computationally expensive
- Content encryption
- Privacy legislation

Consequence: Not feasible solutions
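To illustrate why the well-known ports approach is lightweight but inaccurate, here is a minimal sketch. The port table and the function name `classify_by_port` are assumptions made for illustration; they are not taken from SMARTxAC.

```python
# Hypothetical port table: a small subset of commonly registered ports.
WELL_KNOWN_PORTS = {
    25: "smtp",
    53: "dns",
    80: "http",
    443: "https",
}


def classify_by_port(sport: int, dport: int) -> str:
    """Return the application registered for either port, else 'unknown'."""
    # Check the destination port first, then fall back to the source port.
    for port in (dport, sport):
        app = WELL_KNOWN_PORTS.get(port)
        if app is not None:
            return app
    return "unknown"
```

The cost is at most two dictionary lookups per flow. Any application running on a dynamic or non-standard port is either left as "unknown" or mislabeled, which is where the low accuracy comes from.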


Existing Solutions: Machine Learning Techniques

- Difficult training phase
+ Packet contents are not required
+ High accuracy
+ Computationally viable

Two main possibilities:

Supervised methods:
+ Better accuracy for expected classes
- Need a complete pre-labeled dataset
- Difficult to detect when retraining is needed
- No detection of new classes

Unsupervised methods:
+ Do not need a fully labeled dataset
+ Automatic detection of new classes
+ Better accuracy for new classes


Proposed method

Supervised identification based on the C4.5 algorithm
• Developed by Ross Quinlan as an extension of ID3
• Based on the construction of a classification tree

Training set
• Actual traffic flows
• Pairs <flow features, application>
• Feature vector contains relevant characteristics of traffic flows
• Application is identified using L7-filter
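C4.5 chooses the attribute test at each tree node by gain ratio (information gain divided by split information). The sketch below computes that criterion for one binary threshold split over pairs <flow features, application>. It is only an illustration of the split criterion, not a full C4.5 implementation. The function names and the sample values are made up for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(samples, feature, threshold):
    """Gain ratio of the binary split `feature <= threshold`.

    `samples` is a list of (features_dict, application) pairs.
    """
    labels = [app for _, app in samples]
    left = [app for feats, app in samples if feats[feature] <= threshold]
    right = [app for feats, app in samples if feats[feature] > threshold]
    if not left or not right:
        return 0.0  # the split does not separate anything
    total = len(samples)
    weights = [len(left) / total, len(right) / total]
    info_gain = entropy(labels) - (weights[0] * entropy(left) + weights[1] * entropy(right))
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


# Made-up training pairs <flow features, application>, for illustration only.
training_set = [
    ({"dport": 80, "bytes_out": 400}, "http"),
    ({"dport": 80, "bytes_out": 350}, "http"),
    ({"dport": 53, "bytes_out": 60}, "dns"),
    ({"dport": 53, "bytes_out": 70}, "dns"),
]
```

In this toy set, the split `dport <= 53` separates the two applications perfectly, so it gets the highest possible gain ratio. A tree builder would evaluate candidate splits this way and recurse on each side.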


Machine Learning process

1) Collection of the training set
   • Representative flows of the environment to be monitored

2) Automatic flow classification → application class
   • Pattern matching using L7-filter
   • It can be simplified if an artificial training set is used in 1)

3) Feature extraction from the training flows

4) Construction of a C4.5 classification tree
   • E.g. using Weka

5) Deployment of the tree obtained in 4) in the monitoring system

6) Retraining of the system
   • Starting from phase 1)
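Step 3 can be sketched as follows. The feature names are the ones reported in the results of this talk (dport, bytes_out, avg_out_size, sport, avg_in_size, push_in). Their exact definitions are not given in the slides, so the ones used here are assumptions: byte totals and mean packet sizes per direction, and a count of incoming packets with the TCP PSH flag set. The `Packet` record and `extract_features` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    size: int        # packet length in bytes
    outgoing: bool   # True if sent by the flow initiator (assumed convention)
    push: bool       # True if the TCP PSH flag is set


def extract_features(sport, dport, packets):
    """Build a feature vector for one flow from its packets."""
    out_sizes = [p.size for p in packets if p.outgoing]
    in_sizes = [p.size for p in packets if not p.outgoing]
    return {
        "sport": sport,
        "dport": dport,
        "bytes_out": sum(out_sizes),
        # Averages default to 0.0 when a direction carried no packets.
        "avg_out_size": sum(out_sizes) / len(out_sizes) if out_sizes else 0.0,
        "avg_in_size": sum(in_sizes) / len(in_sizes) if in_sizes else 0.0,
        "push_in": sum(1 for p in packets if p.push and not p.outgoing),
    }
```

Each resulting dictionary, paired with the label obtained in step 2, forms one training example for step 4.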


Accuracy


Netflow Accuracy


Accuracy


Features Accuracy

· Best Normal Feature Subset: dport, bytes_out, avg_out_size, sport, avg_in_size, push_in

· Best Netflow Feature Subset: dport, bytes, push


How does sampling affect it?
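A minimal sketch of packet sampling and its effect on a flow feature such as the byte count. It assumes independent random sampling of each packet with probability `rate`, which may differ from how a given router samples. The function names are hypothetical.

```python
import random


def sample_packets(packet_sizes, rate, rng):
    """Keep each packet independently with probability `rate`."""
    return [size for size in packet_sizes if rng.random() < rate]


def estimate_bytes(sampled_sizes, rate):
    """Scale the sampled byte count by 1/rate to estimate the original total."""
    return sum(sampled_sizes) / rate


# Example: a 1000-packet flow sampled at 1%.
rng = random.Random(0)
sampled = sample_packets([100] * 1000, 0.01, rng)
estimate = estimate_bytes(sampled, 0.01)
```

At low sampling rates a flow may keep only a handful of packets, or none at all, so per-flow features estimated this way can be noisy even though the scaling is unbiased on average.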


Conclusions and Future Work

Machine learning techniques are a good solution to identify applications

Identification in sampled scenarios is still an open problem

Future work:
• Find a more accurate automatic system to label the dataset
• Build early decision trees to identify the flow as soon as possible
• Find features that achieve higher accuracy and are more resilient to sampling
• Test with traces from other networks to check the generality of the solution


Thank you for your attention

Questions?