Handy Tools and Frameworks … for our projects and work.

Page 1: Handy Tools and Frameworks … for our projects and work.

Handy Tools and Frameworks

… for our projects and work

Page 2: Handy Tools and Frameworks … for our projects and work.

Tools …

• Apache Lucene
• WEKA
• itpp
• Misc
  – *tex
  – imagemagick, inkscape
  – graphviz, gnuplot, gs

Page 3: Handy Tools and Frameworks … for our projects and work.

APACHE LUCENE

Tech presentation by Pavel Patz
http://lucene.apache.org/java/docs/index.html

Page 4: Handy Tools and Frameworks … for our projects and work.

Apache Lucene

• Apache Lucene is a free/open source information retrieval software library
• Named after Doug Cutting’s wife’s middle name (also her maternal grandmother’s first name)!

• Also one of the most powerful open-source indexers / search engines

• Library for Java, C (with Perl and Python bindings), C++, Objective C, Delphi, Ruby, PHP, Common Lisp and C# (yep, even .NET)

• Fast and efficient solution
  – Over 20 MB/minute on a Pentium M 1.5 GHz
  – Index size 20-30% the size of indexed text

• Widely adopted solution
  – Wikipedia (and MediaWiki as well), E.ON, Beagle, Strigi (Desktop search), isoHunt, Eclipse, Jira, Digg (it!), abclinuxu.cz, BlogScope, CNET, European Bioinformatics Institute, etc.

Page 5: Handy Tools and Frameworks … for our projects and work.

Apache Lucene – Text processing

• Stemmers
  – Remove suffixes to find the root of a word
  – Vs lemmatizers, which map a word to its dictionary form

• Create index storage a.k.a. Directory
  – In a database, in RAM, on the file system
• Create Analyzer
  – We need to separate tokens, find roots and exclude stop words somehow
• Create IndexWriter
  – Based on Directory and Analyzer
• For each “record” (file, row in table…) create a Document and store it
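The indexing steps above can be sketched without Lucene itself. This is a minimal Python sketch, not Lucene's API: `stem`, `analyze` and `ToyIndexWriter` are hypothetical names, the suffix list is a crude stand-in for a real stemmer, and the "directory" is just a dictionary held in RAM.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to"}
SUFFIXES = ("ing", "ed", "es", "s")  # crude suffix list, NOT a real stemmer


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Toy analyzer: lowercase, split into tokens, drop stop words, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


class ToyIndexWriter:
    """Keeps an inverted index (term -> document ids) in RAM."""

    def __init__(self, analyzer):
        self.analyzer = analyzer
        self.postings = defaultdict(set)  # the in-RAM "directory"
        self.stored = {}                  # stored content by document id

    def add_document(self, doc_id, content):
        self.stored[doc_id] = content
        for term in self.analyzer(content):
            self.postings[term].add(doc_id)


writer = ToyIndexWriter(analyze)
records = {1: "Indexing the files", 2: "A search engine indexes rows in tables"}
for doc_id, text in records.items():
    writer.add_document(doc_id, text)

print(analyze("Indexing the files"))
print(sorted(writer.postings["index"]))
```

Note that the stemmer output ("fil", "tabl") need not be a dictionary word, which is the practical difference from a lemmatizer. Both "Indexing" and "indexes" end up under the same term, so both records are found for it.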

Page 6: Handy Tools and Frameworks … for our projects and work.

Apache Lucene – Directory & Indexing

• Directory consists of documents
• Document consists of fields
  – Like ID, content, timestamps – whatever you want to store
• Fields
  – Can be stored, compressed (useful for long strings), or not stored
    • Content of stored fields can be retrieved directly from the search result.
  – Content can be indexed as
    • Tokenized
    • Not tokenized (for instance brand names – “Faster Runner”)
    • Indexed without norms (= no index-time boosting or length normalization)
    • Not indexed (but can be stored)
• Indexing
  – Each document and / or field can have its “boost” value
  – Scoring (“hitpoints”) of results is based on many factors; the boost value multiplies the score of the document / field.
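How a boost acts as a multiplier can be shown with a toy score. This sketch is an assumption for illustration only: the `score` function and its term-count formula are hypothetical and far simpler than Lucene's real scoring, which combines many more factors.

```python
def score(doc, term, field_boosts, doc_boost=1.0):
    """Toy score: per-field term count times field boost, times document boost."""
    total = 0.0
    for field, text in doc.items():
        term_count = text.lower().split().count(term)
        total += term_count * field_boosts.get(field, 1.0)
    return total * doc_boost


doc = {
    "title": "lucene basics",
    "content": "lucene indexes text and lucene searches text",
}

plain = score(doc, "lucene", field_boosts={})
boosted_title = score(doc, "lucene", field_boosts={"title": 3.0})
boosted_doc = score(doc, "lucene", field_boosts={}, doc_boost=2.0)
print(plain, boosted_title, boosted_doc)
```

A field boost only scales the contribution of that field, while a document boost scales the whole result.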

Page 7: Handy Tools and Frameworks … for our projects and work.

Apache Lucene – Search

• We have an index. So open it!
  – Use IndexSearcher – use a singleton for better performance
• Prepare Query
  – Lucene has a simple query language
  – We should use the same analyzer for querying as for indexing
  – We can search in fields, boost parts of the query, make Boolean queries etc.
• Execute Query
• Enjoy results
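Why the query should go through the same analyzer as the indexed text is easiest to see on a toy index. The sketch below is plain Python, not Lucene: `analyze`, `raw_split` and `search_all` are hypothetical helpers, and the query is a simple Boolean AND over all terms.

```python
import re


def analyze(text):
    """Lowercase and split on non-alphanumerics (stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def raw_split(text):
    """A different 'analyzer' that keeps case and punctuation."""
    return text.split()


docs = {1: "Faster Runner shoes", 2: "Runner's guide to Prague"}

# Build the index with `analyze`.
index = {}
for doc_id, text in docs.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)


def search_all(query, analyzer):
    """Boolean AND query: ids of documents containing every query term."""
    terms = analyzer(query)
    if not terms:
        return set()
    hits = set(docs)
    for term in terms:
        hits &= index.get(term, set())
    return hits


print(search_all("Faster RUNNER", analyze))    # same analyzer as at index time
print(search_all("Faster RUNNER", raw_split))  # mismatched analyzer
```

With the matching analyzer the query terms become "faster" and "runner" and document 1 is found; with the mismatched one the terms keep their capitals, match nothing in the index, and the result is empty.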

Page 8: Handy Tools and Frameworks … for our projects and work.

WEKA
http://www.cs.waikato.ac.nz/ml/weka/

Page 9: Handy Tools and Frameworks … for our projects and work.

Weka 3: Data Mining Software in Java

• collection of machine learning algorithms for data mining tasks

• Library AND environment in one
• Tools for data pre-processing, classification, regression, clustering, association rules, and visualization

Page 10: Handy Tools and Frameworks … for our projects and work.

WEKA Tools

• Collection of machine learning algorithms for data mining tasks

• Library AND environment
• Tools for data pre-processing, classification, regression, clustering, association rules, and visualization
• Own data format (ARFF)
  – Text oriented, easily editable
• Many algorithms (classifiers, preprocessors)
• Many parameters
  – Possible to set in the GUI or in API
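Because ARFF is plain text, a file can be produced by any script. The following is a minimal sketch that assumes only the basic ARFF layout (`@relation`, `@attribute`, `@data` followed by comma-separated rows); the `to_arff` helper and the tiny "weather" data set are made up for illustration.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind), where kind is the string "numeric"
    or a list of nominal values.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append("@attribute " + name + " " + kind)
        else:
            lines.append("@attribute " + name + " {" + ",".join(kind) + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


arff_text = to_arff(
    "weather",
    [("temperature", "numeric"), ("play", ["yes", "no"])],
    [(21.5, "yes"), (30.0, "no")],
)
print(arff_text)
```

The resulting text can be saved with an `.arff` extension and opened in the WEKA GUI or edited by hand.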

Page 11: Handy Tools and Frameworks … for our projects and work.

WEKA Modules

• The WEKA GUI consists of several parts
  – Explorer
    • Data analysis, visualisation, model management
  – Knowledge flow
    • Streaming data processing
  – Experimenter
    • Parameterized tests, statistics, performance evaluation, significance tests
  – CLI
    • Command line!

Page 12: Handy Tools and Frameworks … for our projects and work.

ITPP
http://itpp.sourceforge.net/

Page 13: Handy Tools and Frameworks … for our projects and work.

ITPP Intro

• Do you Matlab?
  – Nope? But there are a number of examples in *.m
  – … and the API is actually nice
• You can IT++
  – C++ library of mathematical, signal processing and communication classes and functions
  – IT++ makes extensive use of existing open-source or commercial libraries for increased functionality, speed and accuracy. In particular BLAS, LAPACK and FFTW
  – IT++ should work on GNU/Linux, Sun Solaris, Microsoft Windows (with Cygwin, MinGW/MSYS or Microsoft Visual C++) and Mac OS X operating systems

Page 14: Handy Tools and Frameworks … for our projects and work.

ITPP Features

• Basic mathematical features
  – templated vector and matrix classes
  – sparse vectors and matrix classes
  – elementary functions on vectors and matrices
  – statistics classes and functions
  – matrix decompositions such as eigenvalue, Cholesky, LU, Schur, SVD, and QR
  – solving linear systems of equations (including over- and underdetermined)
  – random number generation (Mersenne Twister generator)
  – binary and Galois types (scalars, vectors and matrices)
  – integration of 1-dimensional functions
  – unconstrained nonlinear optimization (Quasi-Newton search)
• Signal processing
  – filter functions and classes
  – frequency domain filtering
  – FFT, DFT, DCT, and Hadamard transforms
  – time and frequency domain windows
  – evaluating and finding roots of polynomials (and inverse operations)
  – filter design functions
  – fast independent component analysis (fast ICA)
• Communications
  – modulators (BPSK, PSK, PAM, QAM)
  – vector modulators (e.g. for OFDM and MIMO)
  – OFDM and CDMA modulators
  – pulse shaping filters (including RC and RRC)
  – binary symmetric (BSC) and additive white Gaussian noise (AWGN) channels
  – multipath fading channels (both frequency-flat and frequency-selective)
  – COST 207, COST 257, and ITU channel models
  – Hamming, extended Golay, and CRC codes
  – BCH and Reed-Solomon codes
  – convolutional and punctured convolutional codes
  – recursive convolutional codes, turbo codes, interleavers
• Protocol simulation
  – event-based simulation classes
  – signals and slots for simplified syntax
  – TCP clients and servers, selective repeat ARQ
  – queue classes, packet generators
• Source coding
  – Scalar Quantizer (SQ) and Vector Quantizer (VQ) classes and functions for training of these
  – LPC, LSF, and cepstrum parameter calculation for speech processing
  – Gaussian Mixture Modeling
  – reading and saving several different audiofile formats
  – reading and saving images in PNM format
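To give a feel for what the communications classes are typically used for, here is a toy bit-error-rate simulation of BPSK over an AWGN channel. It is a plain Python sketch with the standard library only and does not use or mirror the IT++ API; the function names are hypothetical, and unit-energy symbols are assumed.

```python
import math
import random


def simulate_bpsk_ber(ebn0_db, n_bits, seed=0):
    """Estimate the bit error rate of BPSK over AWGN by simulation."""
    rng = random.Random(seed)
    ebn0 = 10 ** (ebn0_db / 10)
    sigma = math.sqrt(1 / (2 * ebn0))  # noise std dev for unit-energy symbols
    errors = 0
    for _ in range(n_bits):
        bit = rng.getrandbits(1)
        symbol = 1.0 if bit == 0 else -1.0       # BPSK mapping: 0 -> +1, 1 -> -1
        received = symbol + rng.gauss(0.0, sigma)
        decided = 0 if received >= 0 else 1      # hard decision at threshold 0
        errors += decided != bit
    return errors / n_bits


def theoretical_ber(ebn0_db):
    """Closed-form BPSK bit error rate: 0.5 * erfc(sqrt(Eb/N0))."""
    ebn0 = 10 ** (ebn0_db / 10)
    return 0.5 * math.erfc(math.sqrt(ebn0))


simulated = simulate_bpsk_ber(4.0, 20000)
print(simulated, theoretical_ber(4.0))
```

The simulated value should land close to the closed-form one; the gap shrinks as the number of simulated bits grows.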

Page 15: Handy Tools and Frameworks … for our projects and work.

Building ITPP

• Cygwin & Linux
• Autotools
  – ./configure [--without-blas --without-lapack --without-fft]
  – make -j
  – make -j install

Page 16: Handy Tools and Frameworks … for our projects and work.

MISC.

Page 17: Handy Tools and Frameworks … for our projects and work.

HeuristicLab

Page 18: Handy Tools and Frameworks … for our projects and work.

((pdf)La | xe)TeX

• TeX makes beautiful PDFs
  – Looks professional, math, graphics
  – Typesetting can be done like SW development
  – Portable, vector-oriented, blah blah
  – Scriptable

Page 19: Handy Tools and Frameworks … for our projects and work.

Beautiful figures

• ImageMagick
  – Converts many formats (e.g. to pdf)
• GraphViz
  – Create graphs from text files
  – Many layouts
  – Ps, pdf, svg outputs
  – Java/.NET alternative
• GNUPlot
  – Non-graph plots
  – Many flavors of graphs (pie charts, etc.)
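Since GraphViz reads plain text, a graph description can be generated by a script. This is a minimal sketch that emits a directed graph in the DOT language; the `to_dot` helper and the example edges are hypothetical.

```python
def to_dot(name, edges):
    """Render a list of (source, target) pairs as a DOT digraph."""
    lines = [f"digraph {name} {{"]
    for source, target in edges:
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines) + "\n"


dot_text = to_dot(
    "pipeline",
    [("Directory", "IndexWriter"), ("Analyzer", "IndexWriter"), ("IndexWriter", "Index")],
)
print(dot_text)
```

Saved as `pipeline.dot`, the text can then be rendered with, for example, `dot -Tpdf pipeline.dot -o pipeline.pdf`.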

Page 20: Handy Tools and Frameworks … for our projects and work.

All together

• Put it all together
  – Test data
  – Test program
  – Text output of results (gnuplot, graphviz)
  – Prepared source for report (latex)

= On-demand generated seminar projects ;)
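The "text output of results" step can be as small as the sketch below: one function writes a whitespace-separated data file that gnuplot can read, another writes a LaTeX `tabular` for the report. The helper names, column names and numbers are made up for illustration; they are not results from any real experiment.

```python
def to_gnuplot_data(header, results):
    """Whitespace-separated columns; a leading '#' marks a comment line."""
    lines = ["# " + " ".join(header)]
    for x, y in results:
        lines.append(f"{x} {y}")
    return "\n".join(lines) + "\n"


def to_latex_table(header, results):
    """Two-column LaTeX tabular with the same rows."""
    lines = [r"\begin{tabular}{rr}", r"\hline", " & ".join(header) + r" \\", r"\hline"]
    for x, y in results:
        lines.append(f"{x} & {y} " + r"\\")
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines) + "\n"


# Made-up placeholder measurements: (input size, seconds).
results = [(100, 0.5), (200, 1.25)]
header = ["size", "seconds"]

data_text = to_gnuplot_data(header, results)
table_text = to_latex_table(header, results)
print(data_text)
print(table_text)
```

If the data is saved as `results.dat`, a gnuplot command such as `plot "results.dat" using 1:2 with linespoints` draws it, and the table text can be pulled into the report with `\input`.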