A Spiky Study of Tamilnadu Crime Data Using Datamining

Abstract

Newspapers, television, the web and other news sources make it evident that the occurrence of crime and terrorism in India is increasing year by year. Rising crime and terrorism are a serious threat to the country's peace and livelihood, and they drain the country's resources. The increase therefore needs to be controlled, and in the long run eradicated, before it gradually depletes those resources. Crime occurrences and terrorist attacks are recorded by police departments countrywide. This huge volume of records needs to be analyzed thoroughly to reveal the frequency of crime occurrence, the types of crime, the types of terrorist attack and other factors. The outcome of the analysis should be interpreted, and the conclusions submitted to senior police officials as suggestions and recommendations. Analyzing this volume of data manually is cumbersome, so this research proposes the use of data mining techniques and tools. Because research covering the whole of India would be a very large task, this study is restricted to Tamilnadu. The research addresses two problems related to crime analysis. The first part of the paper deals with data clustering: six clustering techniques (k-Means Clustering, Hierarchical Clustering, DBScan Clustering, Density Based Clustering, OPTICS and the EM Algorithm) are presented and compared in order to identify the most suitable algorithm. The second part deals with an intelligent crime analysis and recording system designed to overcome problems that appear mainly in the Tamilnadu police department. It is a GIS-based system comprising data mining techniques such as hotspot detection, crime clock, crime comparison, crime pattern visualization, outbreak detection and nearest police station detection. Salient features of the proposed system include a rich environment for crime data analysis and a simplified environment for location-based data analysis. It facilitates the detailed identification of various types of crime and assists police personnel in controlling and preventing such incidents efficiently. The conclusions of the study will be recommended to the Tamilnadu police department as suggestions for reducing the crime level.

Terms: Data clustering, K-Means Clustering, Hierarchical Clustering, DBScan Clustering, Density Based Clustering, OPTICS, EM Algorithm, Crime Analysis, Crime Investigation, Data Mining

Introduction

The primary goal of crime data mining is to identify crime trends and patterns or series. Mining crime data provides timely and pertinent information about crime patterns. It also provides trend analysis that helps law enforcement personnel plan and deploy resources for the prevention and suppression of criminal activities. Combining historical data with current data can sometimes unearth new clues, which helps in solving pending crimes and aids the investigation process. In crime data analysis, statistical examinations are performed on the frequency of specific crimes in order to evaluate the security of property and persons. This involves careful analysis of the time, location and type of crime committed at a particular place, so that appropriate steps can be taken to reduce crime. Through research and documentation of crimes, and their categorization by type of offense, location and time, patterns and trends gradually emerge and lead to preventive solutions. A further objective of crime data mining is to evaluate the probability of a crime and to assess risks. This involves analyzing data on observed behavior and modeling it in order to determine the likelihood of its recurrence. The probability of a crime or attack occurring is estimated from documented historical data such as crime reports. For example, a security professional may examine the documented statistics on car thefts for a building over a one-year period.

Clustering is a data mining technique that groups similar data into one cluster and dissimilar data into different clusters. It can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. In other words, clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters) whose members are similar in some way. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters. A clustering algorithm partitions a data set into several groups such that the similarity within a group is larger than the similarity among groups. Moreover, most of the data collected in many problems seem to have inherent properties that lend themselves to natural groupings. Finding these groupings, or trying to categorize the data, is not a simple task for humans unless the data is of low dimensionality (two or three dimensions at maximum). Clustering algorithms are used extensively not only to organize and categorize data but also for data compression and model construction. Another reason for clustering is to discover relevant knowledge in data. Data clusters are created to meet specific requirements that cannot be met using any of the categorical levels; one can combine data subjects as a temporary group to form a data cluster.

Figure: Disk structure: (A) track, (B) geometrical sector, (C) track sector, (D) cluster.

The common approach of all the clustering techniques presented here is to find cluster centers that will represent each cluster. A cluster center is a way to tell where the heart of each cluster is located, so that later when presented with an input vector, the system can tell which cluster this vector belongs to by measuring a similarity metric between the input vector and all the cluster centers, and determining which cluster is the nearest or most similar one.
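The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance as the similarity metric; the center coordinates are made up for the example and do not come from the study.

```python
import math

def nearest_center(point, centers):
    """Return the index of the cluster center closest to `point`."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))

# Hypothetical cluster centers (illustrative coordinates only).
centers = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]

# A new input vector is assigned to the cluster whose center is nearest.
assigned = nearest_center((8.5, 9.0), centers)
```

Any other similarity metric could be substituted for `math.dist` without changing the structure of the lookup.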

Some clustering techniques rely on knowing the number of clusters a priori. In that case, the algorithm tries to partition the data into the given number of clusters. K-means and Fuzzy C-means clustering are of this type.
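As an illustration of partitioning data into a number of clusters fixed in advance, the following is a minimal sketch of the standard k-means iteration (Lloyd's algorithm). It assumes the first k points are used as initial centers, which keeps the example deterministic; it is not the WEKA implementation used in the study, and the sample points are invented.

```python
import math

def k_means(points, k, iterations=20):
    """Partition `points` into `k` clusters with Lloyd's algorithm."""
    # Assumption: the first k points serve as initial centers (for illustration).
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members)
                                   for axis in zip(*members))
    return centers, labels

# Invented sample data with two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centers, labels = k_means(points, k=2)
```

The value of `k` must be supplied by the caller, which is exactly the a priori knowledge the paragraph above refers to.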

The grouping step can be performed in a number of ways. The output clustering (or clusterings) can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters).
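The difference between hard and fuzzy output can be shown with a short sketch. It assumes the membership formula commonly used in Fuzzy C-means, with fuzzifier `m` (here 2.0, an illustrative choice) and Euclidean distance; the centers and the query point are invented.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy C-means style membership of `point` in each cluster."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that cluster.
    if 0.0 in dists:
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

def hard_label(memberships):
    """Collapse fuzzy memberships into a hard assignment."""
    return memberships.index(max(memberships))

# Invented centers and query point.
centers = [(0.0, 0.0), (4.0, 0.0)]
u = fuzzy_memberships((1.0, 0.0), centers)
label = hard_label(u)
```

The fuzzy output keeps a degree of membership for every cluster, while the hard output keeps only the index of the strongest one.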

Aim of the Research

1. To classify the different types of crimes.

2. To identify the most suitable algorithm among the different clustering algorithms.

3. To represent graphically hotspot detection, crime clock, crime comparison, crime pattern visualization, outbreak detection and nearest police station detection.

Need and significance

Many police departments around the world lack good and efficient crime recording and analysis systems. Vast geographical diversity and the complexity of crime patterns make the recording and analysis of crime data even more difficult. According to the Tamilnadu police department, it has faced these problems for many years. It needs a good and efficient system to control and prevent various crime incidents.

Significance

The research addresses two problems related to crime analysis. The first part of this paper deals with data clustering: six clustering techniques (k-Means Clustering, Hierarchical Clustering, DBScan Clustering, Density Based Clustering, OPTICS and the EM Algorithm) are presented and compared in order to identify the most suitable algorithm. The second part deals with an intelligent crime analysis and recording system designed to overcome problems that appear mainly in the Tamilnadu police department. It is a GIS-based system comprising data mining techniques such as hotspot detection, crime clock, crime comparison, crime pattern visualization, outbreak detection and nearest police station detection. Salient features of the proposed system include a rich environment for crime data analysis and a simplified environment for location-based data analysis. It facilitates the detailed identification of various types of crime and assists police personnel in controlling and preventing such incidents efficiently. The conclusions of the study will be recommended to the Tamilnadu police department as suggestions for reducing the crime level.

Data Analysis

1. Crime record reports for 2000-2012 were collected from the State Crime Records Bureau, Tamil Nadu, Chennai 600 028.

2. SPSS 16.0 software was used to produce the statistical report.

3. The clustering techniques were implemented and analyzed using the clustering tool WEKA. The performance of the six techniques is presented and compared.

4. Database: MySQL database and PostGIS/PostgreSQL database.

Limitations

Research covering the whole of India would be a very large task, so this study is restricted to Tamilnadu.

DATAMINING TECHNIQUES

Crime analysis is carried out as a collection of steps: hotspot detection, crime clock, crime comparison, crime pattern visualization, outbreak detection and nearest police station detection. Each of these steps has been automated as a tool in the Tamilnadu Net crime analysis system. Police personnel can therefore use different tools at different times according to the situation at hand, and decisions can be taken in a fast and well-organized manner. This section describes each of these analysis tools in detail.

1. Hotspot Detection

Cluster analysis is the process of identifying groups in a dataset such that the data inside each group share specific similarities while the relationships among groups are minimal. Cluster analysis is therefore used to identify clusters of crime spots, which reveal hotspots with high crime density. The clustering algorithm of the system first accepts the area to be investigated as input. According to the user's inputs, the algorithm measures the Euclidean distances between all pairs of data points within the defined area. It then groups the data points into the most suitable number of clusters using the nearest neighbor concept and the calculated Euclidean distances. Finally, the coordinates of the cluster centers are identified and the number of crime points inside each cluster is returned. Depending on the count returned with each coordinate, the cluster is assigned a color darkness and a radius that reflect its magnitude.
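The steps above can be sketched as follows. This is an illustrative approximation, not the system's actual algorithm: it assumes a rectangular investigation area and links points that lie within a hypothetical distance `max_gap` of each other (a simple nearest-neighbor chaining rule). The sample coordinates are invented.

```python
import math

def detect_hotspots(points, area, max_gap):
    """Group crime points inside `area` into hotspots.

    `area` is (min_x, min_y, max_x, max_y). `max_gap` is a hypothetical
    linking distance: points within it are treated as neighbors.
    Returns (center, count) pairs, largest hotspot first.
    """
    min_x, min_y, max_x, max_y = area
    inside = [p for p in points
              if min_x <= p[0] <= max_x and min_y <= p[1] <= max_y]
    unvisited = set(range(len(inside)))
    hotspots = []
    while unvisited:
        # Grow one cluster outward from an arbitrary seed point.
        frontier = [unvisited.pop()]
        members = []
        while frontier:
            i = frontier.pop()
            members.append(inside[i])
            near = [j for j in unvisited
                    if math.dist(inside[i], inside[j]) <= max_gap]
            unvisited.difference_update(near)
            frontier.extend(near)
        center = tuple(sum(axis) / len(members) for axis in zip(*members))
        hotspots.append((center, len(members)))
    # Larger hotspots first, so they can be drawn darker and wider.
    return sorted(hotspots, key=lambda h: h[1], reverse=True)

# Invented crime locations; the last one falls outside the area.
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (5.0, 5.0), (5.1, 5.2), (20.0, 20.0)]
hotspots = detect_hotspots(points, area=(0, 0, 10, 10), max_gap=1.0)
```

The returned count per center is the value a map layer could translate into color darkness and radius.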

2. Crime Clock

A crime clock represents the number of crime incidents that took place within the 24 hours of a day. It is drawn as a bar chart: the 24-hour clock is represented by 24 bars, and the height of each bar gives the number of crime incidents in that hour. Three extra bars represent incidents without an exact time. The day bar represents incidents that took place during the daytime, the night bar represents incidents that took place during the night, and the unknown bar represents incidents that cannot be assigned to any time period.
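The bar heights of such a chart can be computed with a short sketch. The input encoding is an assumption made for this example: each incident is an integer hour (0 to 23), the string "day" or "night" when only the part of the day is known, or None when no time was recorded.

```python
def crime_clock(incidents):
    """Count incidents for 24 hourly bars plus 'day', 'night' and 'unknown'."""
    bars = {hour: 0 for hour in range(24)}
    bars.update(day=0, night=0, unknown=0)
    for when in incidents:
        if isinstance(when, int) and 0 <= when <= 23:
            bars[when] += 1          # exact hour known
        elif when in ("day", "night"):
            bars[when] += 1          # only the part of the day is known
        else:
            bars["unknown"] += 1     # no usable time information
    return bars

# Invented incident times.
incidents = [23, 23, 2, "night", "day", None, 14]
clock = crime_clock(incidents)
```

The resulting dictionary has 27 entries, one per bar, and can be passed directly to any charting library.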

3. Crime Comparison

Comparing different types of crime is important for understanding how a particular crime is growing relative to the others. A pie chart is used for this purpose; it shows the percentage share of each crime type and gives the analyst the freedom to compare the different types directly.
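The percentages behind such a chart follow directly from the incident counts. A minimal sketch, using invented crime type labels:

```python
from collections import Counter

def crime_shares(crime_types):
    """Return each crime type's percentage share of all recorded incidents."""
    counts = Counter(crime_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: 100.0 * n / total for kind, n in counts.items()}

# Invented records, one label per incident.
records = ["theft", "burglary", "theft", "robbery",
           "theft", "burglary", "theft", "theft"]
shares = crime_shares(records)
```

Each value in `shares` corresponds to the size of one slice of the chart.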

4. Crime Pattern Visualization

In statistics, signal processing, econometrics and mathematical finance, a time series is a sequence of data points, typically measured at successive times spaced at uniform intervals. Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. A time series plot is used to represent the changes in the frequency of crime occurrence. The Y-axis represents the frequency of crimes and the X-axis represents time.
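Building such a series from raw incident dates can be sketched as follows. The monthly interval is an assumption chosen for the example; months without incidents are included with a count of zero so that the spacing stays uniform. The dates are invented.

```python
from collections import Counter
from datetime import date

def monthly_counts(dates):
    """Return (month_start, count) pairs at uniform monthly spacing."""
    if not dates:
        return []
    counts = Counter((d.year, d.month) for d in dates)
    year, month = min(counts)
    last = max(counts)
    series = []
    while (year, month) <= last:
        # Months with no incidents still get a point, with count 0.
        series.append((date(year, month, 1), counts.get((year, month), 0)))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return series

# Invented incident dates.
incident_dates = [date(2011, 11, 3), date(2011, 11, 20), date(2012, 1, 5),
                  date(2012, 2, 14), date(2012, 2, 28), date(2012, 2, 9)]
series = monthly_counts(incident_dates)
```

The first element of each pair goes on the X-axis and the second on the Y-axis.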

5. Outbreaks Detection

A crime outbreak is the occurrence of crime incidents in excess of what would normally be expected in a defined geographical area or time period. The crime outbreak detection tool is an agent system that monitors the number of crimes in different regions. If the number of crimes rises beyond the expected level, the system sends an alert to all the relevant police stations. Initially, the user defines a reference time frame, and the system calculates the average and the standard deviation of the number of crimes per day for each cluster. If, in a particular cluster, the number of crimes within a day is greater than a threshold derived from these two values, the system prompts an alert.
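The check can be sketched in Python. The alert threshold used here, the mean plus `k` standard deviations, is an assumption made purely for illustration, and `k` is a hypothetical parameter; the cluster names and counts are invented.

```python
from statistics import mean, pstdev

def outbreak_alerts(reference_counts, today_counts, k=2.0):
    """Return clusters whose count today exceeds mean + k * std.

    `reference_counts` maps a cluster to its daily counts in the reference
    time frame. `k` is a hypothetical multiplier chosen for illustration.
    """
    alerts = []
    for cluster, history in reference_counts.items():
        threshold = mean(history) + k * pstdev(history)
        if today_counts.get(cluster, 0) > threshold:
            alerts.append(cluster)
    return alerts

# Invented daily counts for two clusters.
reference = {"north": [2, 3, 2, 3, 2, 3], "south": [5, 6, 5, 6, 5, 6]}
today = {"north": 7, "south": 6}
alerts = outbreak_alerts(reference, today)
```

A larger `k` makes the tool less sensitive, so fewer days are flagged as outbreaks.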

6. Nearest Police Station Detection

The J48 decision tree is a predictive machine-learning model that decides the target value (dependent variable) of a new sample based on the attribute values of the available data. In an emergency, such as following a suspect in pursuit, it is very important to know what police support is available around the current location. To achieve this, a nearest police station detection tool has been integrated, built with the J48 classification algorithm. First, the J48 algorithm is trained on about 150 data points per 400 km² area. Each data point consists of a coordinate and its nearest police station. The algorithm was trained several times to adapt the coordinates to the predefined classes (police stations). When the user clicks on a point on the map, that coordinate is analyzed by the algorithm and the most suitable class for the coordinate is returned.
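The sketch below is not J48. It shows a brute-force nearest-station lookup based on great-circle distance, which is one way the labeled training rows (coordinate, nearest police station) described above could be produced and against which a trained classifier could be checked. The station names and coordinates are hypothetical, using approximate city locations for illustration only.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def nearest_station(point, stations):
    """Return the name of the station closest to `point`."""
    return min(stations, key=lambda name: haversine_km(point, stations[name]))

# Hypothetical stations (approximate city coordinates, illustration only).
stations = {
    "Chennai": (13.08, 80.27),
    "Madurai": (9.93, 78.12),
    "Coimbatore": (11.02, 76.96),
}

# Each row pairs a coordinate with its class label, as in the training data.
sample_points = [(12.9, 80.1), (10.0, 78.0), (11.1, 77.0)]
training_rows = [(p, nearest_station(p, stations)) for p in sample_points]
```

A decision tree trained on rows like these learns rectangular regions of the map, so its answers near region boundaries may differ from the exact lookup.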

System Architecture

The crime data and analysis system was built using the following software tools. All packages are free or open-source software. Java 6 is an object-oriented language. Eclipse J2EE version 3.4 is the Java 2 Enterprise Edition version of the Java integrated development environment. Apache Tomcat 6.0 is an open-source web application server. The Google Maps API offers a 2D mapping interface with a robust overlay capability. The PostgreSQL database is used in conjunction with PostGIS 1.3.2, which adds support for geometry and geospatial queries. WEKA is a data mining tool with a collection of machine learning algorithms.

The purpose of this course project is to develop a web application capable of searching and visualizing crime report data. The first major aspect of the project was extracting the data from XML data files into text format and storing it in the database. The next major step was applying the mining algorithms to the data to extract meaningful patterns. The final step was creating a web-based front end (visualization) that interacts with the data stored at the back end.

The model of the Tamilnadu Net is composed of a MySQL database, a PostGIS/PostgreSQL database and a map layers container. The Tamilnadu Net analysis tools communicate with the two databases, MySQL and PostGIS/PostgreSQL, while the GeoServer communicates with the map layers and the PostGIS/PostgreSQL database. When the user requests a map, the system communicates with the OpenLayers API. In turn, the API communicates with the GeoServer to resolve the WMS and WFS requests and provides a layered view of maps to the user. The OpenLayers API uses Google Maps as the base layer, while the GeoExt API helps the OpenLayers API present this information in a graphically rich environment.
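To make the map request path more concrete, the following sketch builds a WMS 1.1.1 GetMap request of the kind a map client sends to a WMS server such as GeoServer. The host, the layer name `crime:hotspots` and the bounding box are hypothetical values chosen for illustration; only the parameter names come from the WMS standard.

```python
from urllib.parse import urlencode

def wms_getmap_url(base_url, layer, bbox, width=512, height=512):
    """Build a WMS 1.1.1 GetMap request URL for one layer."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",
        "SRS": "EPSG:4326",
        # Bounding box order: min_lon, min_lat, max_lon, max_lat.
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return base_url + "?" + urlencode(params)

# Hypothetical server address, layer name and bounding box.
url = wms_getmap_url("http://localhost:8080/geoserver/wms",
                     "crime:hotspots", (76.0, 8.0, 80.5, 13.6))
```

The server answers such a request with a rendered image of the layer, which the client overlays on the base map.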

Conclusion

The project is a good starting point for applying data mining to real-world examples. It has given us insight into various techniques, not only in the field of data mining but also in database utilization, visualization and related areas. A few points about the project itself deserve consideration. Data quality is an extremely important aspect, and we realized during the implementation that more time should have been spent checking how sound our data was. This would have had no effect on the work done, but it would have yielded much more useful information about the data. Although the problem of parsing crime reports was not tackled in this work, we realize how important and how challenging it can be. From the variation we have seen among the different datasets, we believe that some form of standardization should be enforced among police departments in order to make automatic parsing of crime reports more reliable. One more issue to consider is the choice of open-source data mining tools: although WEKA is a very useful option, other tools exist that are more robust and feature-rich. Using such tools would provide a more open and feature-rich application.

