Handy Tools and Frameworks … for our projects and work.

Post on 27-Dec-2015

Handy Tools and Frameworks

… for our projects and work

Tools …

• Apache Lucene
• WEKA
• itpp
• Misc
  – *tex
  – imagemagick, inkscape
  – graphviz, gnuplot, gs

APACHE LUCENE

Tech presentation by Pavel Patz
http://lucene.apache.org/java/docs/index.html

Apache Lucene

• Apache Lucene is a free, open-source information retrieval software library
• The name is the middle name of Doug Cutting’s wife (and her maternal grandmother’s first name)!
• One of the most widely used open-source indexing / search engine libraries
• Written in Java, with ports to C (with Perl and Python bindings), C++, Objective C, Delphi, Ruby, PHP, Common Lisp and C# (yep, even .NET)

• Fast and efficient solution
  – Over 20 MB/minute on a Pentium M 1.5 GHz
  – Index size roughly 20-30% of the size of the indexed text

• Widely adopted solution
  – Wikipedia (and MediaWiki as well), E.ON, Beagle, Strigi (desktop search), isoHunt, Eclipse, Jira, Digg (it!), abclinuxu.cz, BlogScope, CNET, European Bioinformatics Institute, etc.

Apache Lucene – Text processing

• Stemmers
  – remove suffixes to find the root of a word
  – compare with lemmatizers, which map a word to its dictionary form
• Create index storage, a.k.a. Directory
  – in a database, in RAM, or on the file system
• Create an Analyzer
  – we need some way to separate tokens, find roots, and exclude stop words
• Create an IndexWriter
  – based on the Directory and the Analyzer
• For each “record” (file, row in a table, …) create a Document and store it
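The analyze-then-index steps above can be sketched without any library. The following is a toy illustration in plain Java, not the Lucene API: the class name, the stop-word list and the three-suffix "stemmer" are assumptions made up for this example, and an in-memory map stands in for the Directory.

```java
import java.util.*;

/** Toy illustration of the analyze-then-index pipeline; NOT the Lucene API. */
class ToyIndexer {
    // Hypothetical stop-word list, chosen only for this example.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "of", "and", "is");

    /** Analyzer step: split into tokens, lowercase, drop stop words, strip a suffix. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            terms.add(stem(token));
        }
        return terms;
    }

    /** Very naive stemmer: removes one common suffix; real stemmers use many more rules. */
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // "Directory" stand-in: in-memory inverted index, term -> ids of documents containing it.
    final Map<String, Set<Integer>> postings = new TreeMap<>();
    // Stored field: the original content, retrievable by document id.
    final List<String> storedContent = new ArrayList<>();

    /** "IndexWriter" step: analyze one record and add it to the index; returns the document id. */
    int addDocument(String content) {
        int id = storedContent.size();
        storedContent.add(content);
        for (String term : analyze(content)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
        }
        return id;
    }

    public static void main(String[] args) {
        ToyIndexer index = new ToyIndexer();
        index.addDocument("Indexing the files");
        index.addDocument("A file of indexed rows");
        System.out.println(index.postings); // prints {file=[0, 1], index=[0, 1], row=[1]}
    }
}
```

Both records end up under the same terms ("file", "index") because the analyzer reduces the inflected forms to a common root before they are written to the index.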

Apache Lucene – Directory & Indexing

• A Directory consists of documents
• A Document consists of fields
  – such as ID, content, timestamps: whatever you want to store
• Fields
  – can be stored, compressed (useful for long strings), or not stored
    • the content of stored fields can be retrieved directly from the search result
  – content can be indexed as
    • tokenized
    • not tokenized (for instance brand names such as “Faster Runner”)
    • indexed without norms (= no index-time boost or field-length normalization)
    • not indexed (but can be stored)
• Indexing
  – each document and / or field can have its own “boost” value
  – the score of each hit is computed from many factors; the boost value multiplies the score of the document / field
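To make the effect of a boost concrete, here is a deliberately simplified scoring sketch. It is not Lucene's scoring formula, which combines many more factors; it only assumes a plain term-frequency score so that the multiplying role of field and document boosts is visible. All names and boost values are hypothetical.

```java
import java.util.*;

/** Toy scoring sketch: boosts multiply a simple term-frequency score. NOT Lucene's formula. */
class ToyBoostScoring {
    /** Counts how often the term occurs among whitespace-separated, lowercased tokens. */
    static int termFrequency(String text, String term) {
        int count = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.equals(term)) count++;
        }
        return count;
    }

    /** Sum over fields of (term frequency * field boost), multiplied by the document boost. */
    static double score(Map<String, String> fields, Map<String, Double> fieldBoosts,
                        double documentBoost, String term) {
        double sum = 0.0;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            double boost = fieldBoosts.getOrDefault(field.getKey(), 1.0);
            sum += termFrequency(field.getValue(), term) * boost;
        }
        return sum * documentBoost;
    }

    public static void main(String[] args) {
        Map<String, String> doc = Map.of("title", "lucene search", "body", "search search engine");
        Map<String, Double> boosts = Map.of("title", 2.0); // assumption: title hits count double
        System.out.println(score(doc, boosts, 1.0, "search")); // prints 4.0
        System.out.println(score(doc, boosts, 1.5, "search")); // prints 6.0
    }
}
```

With the title boosted to 2.0, one hit in the title weighs as much as two hits in the body, and the document boost then scales the whole result.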

Apache Lucene – Search

• We have an index. So open it!
  – Use IndexSearcher; share a single instance (singleton) for better performance
• Prepare the Query
  – Lucene has a simple query language
  – We should use the same analyzer for querying as for indexing
  – We can search in specific fields, boost parts of the query, build Boolean queries, etc.
• Execute the Query
• Enjoy the results
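Why the query should go through the same analyzer as the indexed text can be shown with a small self-contained sketch. This is again plain Java rather than the Lucene API; the class, the one-rule analyzer and the AND-only query are assumptions for illustration.

```java
import java.util.*;

/** Toy search sketch over an in-memory inverted index; NOT the Lucene API. */
class ToySearch {
    /** Shared analyzer: lowercase, split on non-alphanumerics, strip a trailing "s". */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            terms.add(token.length() > 3 && token.endsWith("s")
                    ? token.substring(0, token.length() - 1) : token);
        }
        return terms;
    }

    final Map<String, Set<Integer>> postings = new HashMap<>();
    int nextId = 0;

    /** Indexes one record through the shared analyzer and returns its id. */
    int add(String content) {
        int id = nextId++;
        for (String term : analyze(content)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
        }
        return id;
    }

    /** Boolean AND query: ids of documents containing every given term (terms used as-is). */
    Set<Integer> matchAll(List<String> terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> ids = postings.getOrDefault(term, Set.of());
            if (result == null) result = new TreeSet<>(ids);
            else result.retainAll(ids);
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        ToySearch index = new ToySearch();
        index.add("Search engines");       // id 0
        index.add("Desktop search tools"); // id 1
        // Query passed through the same analyzer as the indexed text.
        System.out.println(index.matchAll(analyze("Search Tools")));    // prints [1]
        // Raw query terms were never indexed in that form, so nothing matches.
        System.out.println(index.matchAll(List.of("Search", "Tools"))); // prints []
    }
}
```

The index only ever contains analyzed terms ("search", "tool"), so a query that skips the analyzer looks up strings that do not exist in it.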

WEKA
http://www.cs.waikato.ac.nz/ml/weka/

Weka 3: Data Mining Software in Java

• Collection of machine learning algorithms for data mining tasks
• Library AND environment in one
• Tools for data pre-processing, classification, regression, clustering, association rules, and visualization

WEKA Tools

• Collection of machine learning algorithms for data mining tasks
• Library AND environment
• Tools for data pre-processing, classification, regression, clustering, association rules, and visualization
• Own data format (ARFF)
  – Text-oriented, easily editable
• Many algorithms (classifiers, preprocessors)
• Many parameters
  – Can be set in the GUI or through the API
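Because ARFF is plain text, a file can be produced with nothing more than string concatenation. The sketch below writes the basic ARFF layout (an @relation line, @attribute declarations, then @data rows) in plain Java without using the WEKA library; the relation name, attributes and values are made up for illustration.

```java
import java.util.*;

/** Builds a small ARFF document as plain text; relation and attribute names are made up. */
class ArffSketch {
    /**
     * @param relation       name placed after @relation
     * @param attributeDecls one "name type" declaration per attribute, e.g. "temperature numeric"
     * @param rows           one list of values per instance, in attribute order
     */
    static String buildArff(String relation, List<String> attributeDecls, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String decl : attributeDecls) {
            sb.append("@attribute ").append(decl).append("\n");
        }
        sb.append("\n@data\n");
        for (List<String> row : rows) {
            sb.append(String.join(",", row)).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String arff = buildArff(
                "weather",
                List.of("temperature numeric", "play {yes,no}"),
                List.of(List.of("85", "no"), List.of("70", "yes")));
        System.out.print(arff);
    }
}
```

The output is a short text file with one comma-separated line per instance, which is what makes the format easy to edit by hand.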

WEKA Modules

• The WEKA GUI consists of several parts
  – Explorer
    • Data analysis, visualisation, model management
  – Knowledge Flow
    • Streaming data processing
  – Experimenter
    • Parameterized tests, statistics, performance evaluation, significance tests
  – CLI
    • Command line!

ITPP
http://itpp.sourceforge.net/

ITPP Intro

• Do you Matlab?
  – Nope? But there are a number of examples in *.m
  – … and the API is actually nice
• You can IT++
  – C++ library of mathematical, signal processing and communication classes and functions
  – IT++ makes extensive use of existing open-source or commercial libraries for increased functionality, speed and accuracy, in particular BLAS, LAPACK and FFTW
  – IT++ should work on GNU/Linux, Sun Solaris, Microsoft Windows (with Cygwin, MinGW/MSYS or Microsoft Visual C++) and Mac OS X operating systems

ITPP Features

• Basic mathematical features
  – templated vector and matrix classes
  – sparse vector and matrix classes
  – elementary functions on vectors and matrices
  – statistics classes and functions
  – matrix decompositions such as eigenvalue, Cholesky, LU, Schur, SVD, and QR
  – solving linear systems of equations (including over- and underdetermined)
  – random number generation (Mersenne Twister generator)
  – binary and Galois types (scalars, vectors and matrices)
  – integration of 1-dimensional functions
  – unconstrained nonlinear optimization (Quasi-Newton search)

• Signal processing
  – filter functions and classes
  – frequency domain filtering
  – FFT, DFT, DCT, and Hadamard transforms
  – time and frequency domain windows
  – evaluating and finding roots of polynomials (and inverse operations)
  – filter design functions
  – fast independent component analysis (fast ICA)

• Communications
  – modulators (BPSK, PSK, PAM, QAM)
  – vector modulators (e.g. for OFDM and MIMO)
  – OFDM and CDMA modulators
  – pulse shaping filters (including RC and RRC)
  – binary symmetric (BSC) and additive white Gaussian noise (AWGN) channels
  – multipath fading channels (both frequency-flat and frequency-selective)
  – COST 207, COST 257, and ITU channel models
  – Hamming, extended Golay, and CRC codes
  – BCH and Reed-Solomon codes
  – convolutional and punctured convolutional codes
  – recursive convolutional codes, turbo codes, interleavers

• Protocol simulation
  – event-based simulation classes
  – signal and slots for simplified syntax
  – TCP clients and servers, selective repeat ARQ
  – queue classes, packet generators
• Source coding
  – Scalar Quantizer (SQ) and Vector Quantizer (VQ) classes and functions for training of these
  – LPC, LSF, and cepstrum parameter calculation for speech processing
  – Gaussian Mixture Modeling
  – reading and saving several different audio file formats
  – reading and saving images in PNM format

Building ITPP

• Cygwin & Linux
• Autotools
  – ./configure [--without-blas --without-lapack --without-fft]
  – make -j
  – make -j install

MISC.

HeuristicLab

((pdf)La | xe)TeX

• TeX makes beautiful PDFs
  – looks professional; math, graphics
  – typesetting can be done like SW development
  – portable, vector-oriented, blah blah
  – scriptable

Beautiful figures

• ImageMagick
  – Converts many formats (e.g. to pdf)
• GraphViz
  – Creates graphs from text files
  – Many layouts
  – PS, PDF, SVG outputs
  – Java/.NET alternative
• GNUPlot
  – Non-graph plots
  – Many flavors of graphs (pie charts, etc.)

All together

• Put it all together
  – Test data
  – Test program
  – Text output of results (gnuplot, graphviz)
  – Prepared source for report (latex)

= On-demand generated seminar projects ;)
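The "text output of results" step could look roughly like the sketch below: a test program writes its numbers as a whitespace-separated data file that gnuplot can plot, and a graph as a DOT description that GraphViz can lay out. The class, method names and sample values are hypothetical; only the two text formats are taken as given.

```java
/** Emits results as text that gnuplot and GraphViz can read; names and numbers are made up. */
class ResultExport {
    /** gnuplot data: a comment header, then one "x y" pair per line. */
    static String toGnuplotData(double[] x, double[] y) {
        StringBuilder sb = new StringBuilder("# x y\n");
        for (int i = 0; i < x.length; i++) {
            sb.append(x[i]).append(' ').append(y[i]).append('\n');
        }
        return sb.toString();
    }

    /** DOT description of a directed graph given as {from, to} pairs. */
    static String toDot(String name, String[][] edges) {
        StringBuilder sb = new StringBuilder("digraph " + name + " {\n");
        for (String[] edge : edges) {
            sb.append("  \"").append(edge[0]).append("\" -> \"").append(edge[1]).append("\";\n");
        }
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(toGnuplotData(new double[] {1.0, 2.0}, new double[] {0.5, 0.25}));
        System.out.print(toDot("pipeline", new String[][] {
                {"test data", "test program"},
                {"test program", "report"}}));
    }
}
```

Saved to files, the first output can be plotted from gnuplot and the second rendered with GraphViz's dot tool, and both results can then be included from the prepared LaTeX source.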