220CT Coursework
Question #1: Database Design (This task is worth 20 marks)
The International Space Station (ISS) is a habitable artificial satellite in low Earth orbit. It is the ninth space station to be inhabited by crews following previous orbital stations that were launched by the US, the former Soviet Union and later Russia. The ISS is intended to be a laboratory, observatory and factory in space as well as to provide transportation, maintenance, and act as a staging base for possible future missions to the Moon, Mars and beyond. In order to support the crew and overall operation of ISS the space agencies in charge of running the station conduct regular missions to launch spacecraft carrying payloads of essential or replacement equipment up to ISS. A payload inventory, see table below, is recorded of each mission, consisting of the space agency leading the mission and the equipment payload to be sent up to ISS. The overall weight of the payload is also determined in order to calculate the fuel needed for orbital insertion of the spacecraft to successfully rendezvous with ISS.
Mission No. | Agcy No. | Lead Agency | Country | Mission Date | Equipment | Qty | Item Weight | Total Weight
ISS-2237 | 178 | JAXA | Japan | 14/12/2013 | Potable water dispenser | 2 | 100kg | 211kg
         |     |      |       |            | Flexible air duct | 6 | 0.5kg |
         |     |      |       |            | Small storage Rack | 4 | 2kg |
ISS-3664 | 526 | ESA | EU | 16/01/2014 | Bio filter | 6 | 0.20kg | 1.20kg
ISS-2356 | 167 | NASA | USA | 12/02/2014 | Small storage Rack | 3 | 2kg | 69kg
         |     |      |     |            | Battery pack | 2 | 5kg |
         |     |      |     |            | Urine transfer tubing | 2 | 1.5kg |
         |     |      |     |            | O2 scrubber | 1 | 50kg |
ISS-1234 | 032 | Roskosmos | Russia | 16/04/2014 | Small storage Rack | 1 | 2kg | 2.5kg
         |     |           |        |            | Flexible air duct | 2 | 0.5kg |
Currently there is no database being used for managing the payload inventory information in the
table above.
This task is split up into two parts:
1. In its current form this is a traditional database. Decide whether to keep it that way, and explain your decision.
For this example I would keep the database in a traditional relational form, because it is a small dataset that can be split into tables, such as Mission, Agency and Equipment, using normalization. The database does not need a flexible schema, so SQL's rigid schema is suitable here (Albodour 2015). Furthermore, this example does not require fast response times, so a traditional relational database can be used.
2. Design the database for the information above. (Hint- relationships? Tables? Data?) And then
Implement the DB using the method of your choice (SQL, MongoDB, CassandraDB or
GraphDB).
Normalization
1NF
The original database was not in 1NF because ‘Equipment’ had multiple values. Therefore the
solution was to create an ‘Equipment’ table with a copy of the key from the un-normalised table.
Mission No. (*) | Equipment | Qty | Item Weight
ISS-2237 | Potable water dispenser | 2 | 100kg
ISS-2237 | Flexible air duct | 6 | 0.5kg
ISS-2237 | Small storage Rack | 4 | 2kg
ISS-3664 | Bio filter | 6 | 0.20kg
ISS-2356 | Small storage Rack | 3 | 2kg
ISS-2356 | Battery pack | 2 | 5kg
ISS-2356 | Urine transfer tubing | 2 | 1.5kg
ISS-2356 | O2 scrubber | 1 | 50kg
ISS-1234 | Small storage Rack | 1 | 2kg
ISS-1234 | Flexible air duct | 2 | 0.5kg
Mission No. | Agcy No. | Lead Agency | Country | Mission Date | Total Weight
ISS-2237 | 178 | JAXA | Japan | 14/12/2013 | 211kg
ISS-3664 | 526 | ESA | EU | 16/01/2014 | 1.20kg
ISS-2356 | 167 | NASA | USA | 12/02/2014 | 69kg
ISS-1234 | 032 | Roskosmos | Russia | 16/04/2014 | 2.5kg
2NF
To transform the data in the 1NF into 2NF, any non-key attributes that only depend on part of the
table key have to be put into a new table. Therefore a new table must be created to hold the
equipment and item weight.
3NF
In order to transform from 2NF to 3NF, any non-key attributes that depend on other non-key attributes rather than on the table key must be removed and placed in a new table. Therefore an agency table must be created. The equipment table has no single unique attribute, so its key is the combination of two foreign keys that link it to the inventory and mission tables.
Mission No. (*) | Equipment (*) | Qty
ISS-2237 | Potable water dispenser | 2
ISS-2237 | Flexible air duct | 6
ISS-2237 | Small storage Rack | 4
ISS-3664 | Bio filter | 6
ISS-2356 | Small storage Rack | 3
ISS-2356 | Battery pack | 2
ISS-2356 | Urine transfer tubing | 2
ISS-2356 | O2 scrubber | 1
ISS-1234 | Small storage Rack | 1
ISS-1234 | Flexible air duct | 2
Equipment | Item Weight
Potable water dispenser | 100kg
Flexible air duct | 0.5kg
Small storage Rack | 2kg
Bio filter | 0.20kg
Battery pack | 5kg
Urine transfer tubing | 1.5kg
O2 scrubber | 50kg
Mission No. | Agcy No. (*) | Mission Date | Total Weight
ISS-2237 | 178 | 14/12/2013 | 211kg
ISS-3664 | 526 | 16/01/2014 | 1.20kg
ISS-2356 | 167 | 12/02/2014 | 69kg
ISS-1234 | 032 | 16/04/2014 | 2.5kg
E-R Diagram
The entities in the E-R diagram are mission, agency, equipment and inventory. The relationships
between the entities are:
An agency must have at least one mission, and since further missions may occur later it can have many.
A mission must include at least one equipment item and can include many.
Equipment must have at least one item from the inventory and can have many.
Agcy No. | Lead Agency | Country
178 | JAXA | Japan
526 | ESA | EU
167 | NASA | USA
032 | Roskosmos | Russia
[E-R diagram: the entities MISSION, AGENCY, EQUIPMENT and INVENTORY, connected by the relationships Has, Includes, Included_on and Part_of.]
Implementing in SQL
CREATE TABLE mission(
mis_no VARCHAR2(8) PRIMARY KEY,
agency_no NUMBER(3) NOT NULL,
mis_date DATE NOT NULL,
total_weight NUMBER(5,2) NOT NULL);
CREATE TABLE inventory(
item VARCHAR2(23) PRIMARY KEY,
item_weight NUMBER(5,2) NOT NULL);
CREATE TABLE equipment(
mis_no VARCHAR2(8) NOT NULL,
item VARCHAR2(23) NOT NULL,
quantity NUMBER(1) NOT NULL,
CONSTRAINT EQUIPMENT_PK PRIMARY KEY (mis_no, item));
CREATE TABLE agency(
agency_no NUMBER(3) PRIMARY KEY,
lead_agency VARCHAR2(9) NOT NULL,
country VARCHAR2(6) NOT NULL);
ALTER TABLE mission
ADD CONSTRAINT AGENCY_NO_FK FOREIGN KEY (AGENCY_NO) REFERENCES agency (AGENCY_NO);
ALTER TABLE equipment
ADD CONSTRAINT MIS_NO_FK FOREIGN KEY (MIS_NO) REFERENCES mission (MIS_NO);
ALTER TABLE equipment
ADD CONSTRAINT ITEM_FK FOREIGN KEY (ITEM) REFERENCES inventory (ITEM);
INSERT INTO agency VALUES (178, 'JAXA', 'Japan');
INSERT INTO agency VALUES (526, 'ESA', 'EU');
INSERT INTO agency VALUES (167, 'NASA', 'USA');
INSERT INTO agency VALUES (032, 'Roskosmos', 'Russia');
INSERT INTO mission VALUES ('ISS-2237', 178, '14-Dec-13', 211);
INSERT INTO mission VALUES ('ISS-3664', 526, '16-Jan-14', 1.20);
INSERT INTO mission VALUES ('ISS-2356', 167, '12-Feb-14', 69);
INSERT INTO mission VALUES ('ISS-1234', 032, '16-Apr-14', 2.5);
INSERT INTO inventory VALUES ('Potable water dispenser', 100);
INSERT INTO inventory VALUES ('Flexible air duct', 0.5);
INSERT INTO inventory VALUES ('Small storage rack', 2);
INSERT INTO inventory VALUES ('Bio filter', 0.20);
INSERT INTO inventory VALUES ('Battery Pack', 5);
INSERT INTO inventory VALUES ('Urine transfer tubing', 1.5);
INSERT INTO inventory VALUES ('O2 Scrubber', 50);
INSERT INTO equipment VALUES ('ISS-2237', 'Potable water dispenser', 2);
INSERT INTO equipment VALUES ('ISS-2237', 'Flexible air duct', 6);
INSERT INTO equipment VALUES ('ISS-2237', 'Small storage rack', 4);
INSERT INTO equipment VALUES ('ISS-3664', 'Bio filter', 6);
INSERT INTO equipment VALUES ('ISS-2356', 'Small storage rack', 3);
INSERT INTO equipment VALUES ('ISS-2356', 'Battery Pack', 2);
INSERT INTO equipment VALUES ('ISS-2356', 'Urine transfer tubing', 2);
INSERT INTO equipment VALUES ('ISS-2356', 'O2 Scrubber', 1);
INSERT INTO equipment VALUES ('ISS-1234', 'Small storage rack', 1);
INSERT INTO equipment VALUES ('ISS-1234', 'Flexible air duct', 2);
Query Examples
1. Produce a list of missions in descending order of their total weight
2. Produce a list of all the missions that have a total weight less than 69kg
3. Produce a list of all the missions carried out between 01-Dec-2013 and 20-Feb-2014
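The three queries can be sketched as follows. This is an illustrative version using Python's built-in sqlite3 module rather than the Oracle dialect above, so the types (TEXT/REAL) and ISO-8601 date strings differ from the VARCHAR2/NUMBER/DATE schema; only the mission table is recreated here, since the queries touch nothing else.

```python
import sqlite3

# In-memory SQLite sketch of the mission table (SQLite types differ from
# Oracle's VARCHAR2/NUMBER; dates are stored as ISO-8601 strings).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE mission(
    mis_no TEXT PRIMARY KEY,
    agency_no INTEGER NOT NULL,
    mis_date TEXT NOT NULL,
    total_weight REAL NOT NULL);
INSERT INTO mission VALUES
    ('ISS-2237', 178, '2013-12-14', 211),
    ('ISS-3664', 526, '2014-01-16', 1.20),
    ('ISS-2356', 167, '2014-02-12', 69),
    ('ISS-1234',  32, '2014-04-16', 2.5);
""")

# 1. Missions in descending order of their total weight
q1 = con.execute(
    "SELECT mis_no, total_weight FROM mission ORDER BY total_weight DESC"
).fetchall()

# 2. Missions with a total weight less than 69kg
q2 = con.execute(
    "SELECT mis_no FROM mission WHERE total_weight < 69"
).fetchall()

# 3. Missions carried out between 01-Dec-2013 and 20-Feb-2014
q3 = con.execute(
    "SELECT mis_no FROM mission "
    "WHERE mis_date BETWEEN '2013-12-01' AND '2014-02-20'"
).fetchall()

print(q1)  # heaviest first: ISS-2237 (211kg)
print(q2)
print(q3)
```

Because the dates are ISO-8601 strings, plain string comparison in BETWEEN gives correct chronological ordering without any date functions.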
Question 3 – A data mining system for a bank (This task is worth 25 Marks)
A bank has been collecting a great deal of data on their customers and have heard that use of data
mining could increase their competitiveness. They would like you to create a brief report that
includes the following.
i. What data mining is and an appropriate application for the bank.
One definition for data mining is “the nontrivial process of identifying valid, novel,
potentially useful and ultimately understandable patterns in data” (Fayyad, Piatetsky-
Shapiro & Smyth 1996). Another definition for data mining is the “process of analysing
data from different perspectives and summarizing” it into beneficial information (Frand
n.d). Data mining takes large data sets and discovers patterns to make the data into
something understandable, which can be used to generate new business for
organizations. Data mining would be beneficial for the bank because it could be used to detect fraud, assess credit risk in applications, and tailor specific products towards customers (Moin & Ahmed n.d).
ii. How you would go about creating the system using the data mining lifecycle below.
Problem Definition
The aim is to maintain customer loyalty and increase the number of customers by determining which products should be advertised to which customers.
Data Gathering and Preparation
The bank should use the following attributes: the person’s current status of their
checking account, their credit history, savings account, employment history, job, personal
status, age and purpose. From this the bank can create a case table for mining. A data
sample is not required because there are only 1000 instances in the database. However if
there were over 2000 instances then a data sample could be used because a
representative random sample is more efficient to mine, which is therefore more cost-
effective and the results produced are similar to those produced by an entire database
(University of Nevada 2003).
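If the dataset did grow past that threshold, drawing a representative random sample is straightforward with the standard library. A minimal sketch with hypothetical customer records (the field names are placeholders, not the bank's real schema):

```python
import random

# Hypothetical case table: 2000 customer records keyed by an id.
customers = [{"id": i, "age": random.randint(18, 80)} for i in range(2000)]

# Draw a 10% simple random sample without replacement; fixing the seed
# makes the same sample reproducible across mining runs.
random.seed(42)
sample = random.sample(customers, k=len(customers) // 10)

print(len(sample))  # 200 records
```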
Model Building and Evaluation
Data mining is required to analyse patterns in the customer’s transactions, loans and if a
specific insurance or new bank account is set up by a certain age range. For example
customers who are in their 30s or above are expected to be buying or own a house, and
would therefore require home insurance. This data could be used to advertise home
insurance to these customers. A model could be created using clusters for age which
could be used to determine what services should be targeted at specific ages, such as
student accounts for young people. Data mining could be used to determine if a
customer would require a loan, especially if they are self-employed as they may need to
buy supplies or are looking to expand their workforce. Also data mining could be used to
show which bank accounts are more popular and these findings could be used to entice
new customers. Furthermore to increase the number of customers, data mining could be
used to find out why customers switched to another bank and change their own bank
offers to attract new customers (Bhasin 2006). Forecasting could also be used to predict
if a customer is going to transfer to another bank by looking at the customer’s previous
transactions and if they are no longer putting money into their savings account.
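The age-clustering idea above can be sketched with a minimal one-dimensional k-means. This is an illustrative toy, not the bank's actual model: the ages are made up, and a real system would use a mining tool rather than hand-rolled clustering.

```python
# Minimal 1-D k-means: group hypothetical customer ages into clusters,
# which could then be matched to age-targeted services (e.g. student
# accounts for the youngest cluster).

def kmeans_1d(values, k, iters=20):
    # Spread the initial centroids evenly across the observed range.
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid as the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

ages = [18, 19, 21, 22, 23, 35, 36, 38, 41, 62, 65, 67, 70]
centroids, clusters = kmeans_1d(ages, k=3)
print(sorted(round(c) for c in centroids))  # three age-band centres
```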
Use Knowledge
From the results, a report would be produced to outline the findings of the model. This
could then be used to increase the competitiveness of the bank because they would be
able to market specific services towards specific customers based on the patterns found
from data mining.
iii. Review the small amount of data (credit.arff) collected so far by the bank, to see whether you feel that they are collecting the right data for the task of assessing credit worthiness.
Applications are evaluated based on the five C's of credit:
1. Character – assesses the individual's willingness and ability to repay the loan. Therefore the attributes credit history, checking account, savings account and job will need to be analysed.
2. Capital – assesses the individual's investment in a business or project. The attributes property and housing will need to be assessed.
3. Capacity – assesses the individual's ability to repay the loan with their current financial means. An individual's savings account, checking account and credit history will need to be evaluated.
4. Conditions – measures the overall economic environment against the individual's ability to repay loans.
5. Collateral – the security pledged in the event that the individual cannot repay the loan (M&T Bank 2015).
The data collected so far by the bank is the right data for assessing credit worthiness
because the person’s current status of their checking account, their credit history,
savings account, employment history, job and purpose can be used to determine
whether an individual should be allowed a loan or if they should be worthy of a
favourable rate (Investopedia n.d). However, whether applicants are foreign workers or have a telephone is not relevant to assessing credit worthiness, so those attributes should be ignored. The attributes personal status and sex, and age, cannot be used to determine whether an individual should be given credit, because the Federal Trade Commission enforces the Equal Credit Opportunity Act, which prohibits discrimination on the basis of sex, marital status and age when determining credit worthiness (Federal Trade Commission 2013).
iv. The use of a data mining model such as a multilayer perceptron or decision tree to determine a person's credit worthiness. Note, you will need to use a data mining tool like WEKA to create your model and use the credit.arff data to train and test this model.
Decision Tree with All Attributes
Summary
Correctly Classified Instances 705 70.5%
Incorrectly Classified Instances 295 29.5%
Total Number of Instances 1000
Confusion Matrix

  a   b   <-- classified as
588 112 |  a = good
183 117 |  b = bad

Therefore based on this model 771 people are credit worthy and 229 people are not.
Multilayer Perceptron with All Attributes
Summary
Correctly Classified Instances 715 71.5%
Incorrectly Classified Instances 285 28.5%
Total Number of Instances 1000
Confusion Matrix

  a   b   <-- classified as
561 139 |  a = good
146 154 |  b = bad

Therefore based on this model 707 people are credit worthy and 293 people are not.
Decision Tree with Foreign Worker, Telephone, Personal Status and Age Attributes Removed
Summary
Correctly Classified Instances 711 71.1%
Incorrectly Classified Instances 289 28.9%
Total Number of Instances 1000
Confusion Matrix

  a   b   <-- classified as
579 121 |  a = good
168 132 |  b = bad

Overall based on this model 747 people are credit worthy and 253 people are not.
Multilayer Perceptron with Foreign Worker, Telephone, Personal Status and Age Attributes Removed
Summary
Correctly Classified Instances 710 71%
Incorrectly Classified Instances 290 29%
Total Number of Instances 1000
Confusion Matrix

  a   b   <-- classified as
569 131 |  a = good
159 141 |  b = bad

Overall 728 people are credit worthy and 272 people are not credit worthy.
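Each WEKA summary above follows arithmetically from its confusion matrix: rows are actual classes, columns are predicted classes. A quick sketch using the reduced-attribute decision tree's matrix shows where 711, 71.1%, 747 and 253 come from:

```python
# Confusion matrix of the reduced-attribute decision tree, keyed as
# (actual class, predicted class) -> instance count.
matrix = {("good", "good"): 579, ("good", "bad"): 121,
          ("bad", "good"): 168, ("bad", "bad"): 132}

total = sum(matrix.values())                          # 1000 instances
correct = matrix[("good", "good")] + matrix[("bad", "bad")]
accuracy = correct / total

# "Credit worthy" counts are column sums: everyone *classified* as good.
predicted_good = matrix[("good", "good")] + matrix[("bad", "good")]
predicted_bad = total - predicted_good

print(correct, accuracy, predicted_good, predicted_bad)  # 711 0.711 747 253
```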
Actual Credit Worthiness Results
I chose to model the data with both a decision tree and a multilayer perceptron so that the two models could be compared. A decision tree is a structure that represents sets of decisions, whereas a multilayer perceptron is a neural network, a non-linear predictive model that learns through training (Frand n.d). Despite the multilayer perceptron with all attributes having the highest classification accuracy, we are not allowed to use that model to determine credit worthiness due to the Equal Credit Opportunity Act. Therefore, comparing the decision tree and the multilayer perceptron with those attributes removed, I believe the best model for determining a person's credit worthiness is the decision tree. The visualisation provided by the decision tree clearly shows the decision pathways followed when calculating a person's credit worthiness. From the results of the decision tree with the attributes removed, 747 people are credit worthy, because they are classified as good by the model, leaving 253 people who are not credit worthy. This decision tree classified 71.1% of instances correctly, compared to 71% for the multilayer perceptron. Comparing the decision tree's results with the actual credit worthiness results shows that 47 instances were incorrectly classified as credit worthy, because the actual number of people who are not credit worthy is 300 rather than 253. However, I would not recommend using this model to determine credit worthiness in the real world, because with only 71.1% of instances correctly classified it is not a highly accurate model.
Question 4: Big Data Idea (30 marks)
Aim and Objectives
Aim
The aim of this project was to explore the crimes carried out in San Francisco between 6th January
2003 and 17th November 2015, in order to gain a clear understanding of how crime has changed over
the years.
Objectives
Collect the data from the City and County of San Francisco
Create a dataset using Excel
Analyse the data
Visualisation of the data using Excel and Google Fusion Tables
Results and findings from the analysis
Future developments
Background
The history of crime is one of San Francisco's tourist attractions. Alcatraz, a federal prison from 1934 to 1963, is a popular tourist attraction. It held notorious convicts such as Al "Scarface" Capone and Robert "Birdman of Alcatraz" Stroud (San Francisco Travel n.d) (History.com 2009). Alcatraz never had a confirmed prisoner escape; however, three prisoners, Clarence and John Anglin and Frank Morris, managed to construct a raft and set sail but were never found, and were therefore presumed drowned (SF Gate 2013).
San Francisco has one of the highest crime rates in America; the overall crime rate is 114% higher than the national average (Area Vibes 2013). Violent crimes and property crimes are major contributors to San Francisco's overall crime rate (Neighbourhood Scout n.d). The crime rate in San Francisco has risen in recent years whilst the number of arrests has declined, and the number of police staff has also decreased (SF Examiner 2015).
Reason why I picked this Project
The analysis of the San Francisco dataset will provide an understanding of the crimes committed
between 2009 and 2015. This analysis could then be used in future forecasting to identify trends in
how crime has changed or which crime is most active in certain areas. The results could then be used
to target specific areas to lower the crime rate.
(Google Maps, 2015)
Acquiring the Data
I gathered the dataset from SF OpenData and was able to view the data in Excel. The dataset contains 1,842,719 crime incidents between 1 January 2003 and 17 November 2015. The original dataset included the attributes location and PdId, which I removed because I would not be using them for analysis. Once this was completed I added filters to every column, allowing efficient filtering for a specific result, such as a specific category. For example, filtering for robbery would return all the instances with robbery as their category. I then filtered the data to only include records from 2009 onwards, because this is the period I want to analyse.
(Original dataset above and the edited dataset is below)
Description of Attributes
IncidntNum – a unique incident number
Category – one of: Arson, Assault, Bad Checks, Bribery, Burglary, Disorderly Conduct, Driving Under the Influence, Drug/Narcotic, Drunkenness, Embezzlement, Extortion, Family Offenses, Forgery/Counterfeiting, Fraud, Gambling, Kidnapping, Larceny/Theft, Liquor Laws, Loitering, Missing Person, Non-Criminal, Other Offenses, Pornography/Obscene Mat, Prostitution, Recovered Vehicle, Robbery, Runaway, Secondary Codes, Sex Offenses Forcible, Sex Offenses Non Forcible, Stolen Property, Suicide, Suspicious OCC, Trea, Trespass, Vandalism, Vehicle Theft, Warrants, Weapon Laws
Descript – a description of the crime
DayOfWeek – Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday
Date – written in DD-MM-YYYY format
Time – written in HH:MM format
PdDistrict – name of the Police Department district; one of: Bayview, Central, Ingleside, Mission, Northern, Park, Richmond, Southern, Taraval, Tenderloin
Resolution – how the crime incident was resolved; one of: Arrest, Booked; Arrest, Cited; Cleared – Contact juvenile for more info; Complainant refuses to prosecute; District attorney refuses to prosecute; Exceptional clearance; Juvenile Admonished; Juvenile Booked; Juvenile Cited; Juvenile Diverted; Located; None; Not prosecuted; Prosecuted by outside agency; Prosecuted for lesser offense; Psychopathic case; Unfounded
Address – approximate street address where the crime incident took place
X - longitude
Y – latitude
Analysis with Visualisation
Total Number of Each Crime from 2009-2015
I decided to first look at the total number of each crime in the 6 years and represent the results as a
bar graph, because it provides a simple visualisation of which crimes are most prevalent in San
Francisco. In order to get the total number of incidents for each crime, I used the function =COUNTIF
(range, criteria) in Excel. For this function the range was B2:B1024989 and the criteria was each
specific category e.g. =COUNTIF (B2:B1024989, “Assault”) would give me the total of 89,144.
From this analysis the top three crimes can be identified, which are larceny/theft, other offences and
non-criminal incidents.
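The same tally can be reproduced outside Excel. A sketch of the =COUNTIF logic in Python, using a handful of illustrative category values rather than the full 1,024,988-row column:

```python
from collections import Counter

# Equivalent of Excel's =COUNTIF(range, criteria) applied to every
# category at once: tally the Category column and sort descending.
# (Illustrative rows only, not the real dataset.)
categories = ["Larceny/Theft", "Assault", "Larceny/Theft", "Non-Criminal",
              "Other Offenses", "Larceny/Theft", "Assault"]

totals = Counter(categories)
for category, count in totals.most_common():
    print(category, count)

print(totals["Assault"])  # 2, like =COUNTIF(B2:B8, "Assault")
```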
[Bar chart: Total Number of Each Crime in the 6 Years – type of crime against number of incidents (0–250,000); the longest bars are Larceny/Theft (225,898), Other Offenses (148,270), Non-Criminal (119,425) and Assault (89,144).]
Percentage of Each Resolution from 2009-2015
After looking at the total number of each crime, I chose to look at the total resolutions over the 6 years, because this provides an understanding of how successful the police departments and the courts were at sentencing offenders. I chose to represent the information as a pie chart because the resolutions can be shown as percentages, which I believe is the best way to compare the different resolutions, as you can easily see which resolutions are the most common.
This pie chart shows that the most common successful resolutions are arrested and booked or arrested and cited. However, 60.9% of all incidents over the 6 years had no resolution, which indicates either a lack of evidence for prosecution or that the offender was let off with a warning.
[Pie chart: Percentage of each Resolution over the 6 Years – None 60.90%; Arrest, Booked 23.40%; Arrest, Cited 7.86%; the remaining resolutions each account for 2% or less.]
Crime Comparison
I have chosen the years 2009, 2012 and 2015 to compare each type of crime, to see if there is a trend. The scatter graph also displays which crimes occur most in a specific year; for example, in 2015 there was a significant rise in larceny/theft compared to 2012 and 2009. However, we must take into consideration that the data for 2015 only runs up to 17 November 2015 and is therefore not a complete representation of the whole year.
The category numbers are as follows:
1. Arson; 2. Assault; 3. Bad Checks; 4. Bribery; 5. Burglary; 6. Disorderly Conduct; 7. Driving Under the Influence; 8. Drug/Narcotic; 9. Drunkenness; 10. Embezzlement; 11. Extortion; 12. Family Offenses; 13. Forgery/Counterfeiting; 14. Fraud; 15. Gambling; 16. Kidnapping; 17. Larceny/Theft; 18. Liquor Laws; 19. Loitering; 20. Missing Person; 21. Non-Criminal; 22. Other Offenses; 23. Pornography/Obscene Mat; 24. Prostitution; 25. Recovered Vehicle; 26. Robbery; 27. Runaway; 28. Secondary Codes; 29. Sex Offenses Forcible; 30. Sex Offenses Non Forcible; 31. Stolen Property; 32. Suicide; 33. Suspicious OCC; 34. Trea; 35. Trespass; 36. Vandalism; 37. Vehicle Theft; 38. Warrants; 39. Weapon Laws.
[Scatter graph: Crime Comparison – number of incidents (0–40,000) for category numbers 1–39, with separate series for 2009, 2012 and 2015.]
Resolution Comparison
I decided to represent the resolution comparison of 2009, 2012 and 2015 as a bar graph, because the
number of instances for each type of resolution, for each year are side by side which enables quick
comparison between the results.
From the graph above, the top three resolutions in all three years are arrested and booked, arrested and cited, and no resolution. However, there has been a 25.6% rise in the number of incidents with no resolution: in 2015 there were 96,669 incidents with no resolution compared to 76,947 in 2009.
[Bar chart: Resolution Comparison – number of incidents (0–100,000) per resolution, with bars for 2009, 2012 and 2015 side by side; the None bars dominate, rising from 76,947 in 2009 to 96,669 in 2015.]
Crimes with the Most Change
I used line graphs to show the change in the number of incidents for particular crimes, because they show the rise and fall over the years. I chose to represent the change in three crimes, drug/narcotic, larceny/theft and other offenses, by selecting the crimes with the biggest differences in the scatter graph across the three years.
From the graph above there is a clear indication that drug and narcotic crimes are on the decline, with a 68.9% decrease from 2009 to 2015. The number of larceny/theft incidents has increased considerably over the 6 years: between 2009 and 2012 there was a 21% increase, and between 2012 and 2015 a further 20.7% increase, which suggests that the number of larceny/theft incidents is growing at a steady rate.
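The percentage changes quoted above can be checked with a small helper, using the incident counts read off the line graphs:

```python
# Percentage change between two counts; negative means a decline.
def pct_change(old, new):
    return (new - old) / old * 100

drug_decline = pct_change(11950, 3715)   # drug/narcotic, 2009 -> 2015
theft_rise_1 = pct_change(25584, 30973)  # larceny/theft, 2009 -> 2012
theft_rise_2 = pct_change(30973, 37393)  # larceny/theft, 2012 -> 2015

print(round(drug_decline, 1), round(theft_rise_1), round(theft_rise_2, 1))
```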
[Line graphs of number of incidents against year: Change in Drug/Narcotic – 11,950 (2009), 6,447 (2012), 3,715 (2015); Change in Larceny/Theft – 25,584 (2009), 30,973 (2012), 37,393 (2015).]
The number of other offense incidents declined greatly between 2009 and 2012, whilst between 2012 and 2015 there was only a reduction of 1,056 incidents. However, we must take into account that the 2015 incidents only run up to 17 November, and incidents after this date will change the result.
Crime in San Francisco in 2015
I imported the dataset into Google Fusion Tables and modified the X and Y attributes to be a location
data type, where Y represents the latitude and X represents the longitude (Google 2015) (Google
2015a). Then I filtered the crime incidents to just show incidents between 01/01/2015 and
17/11/2015.
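The 2015 date filter applied before mapping can be sketched as follows, assuming the dataset's DD-MM-YYYY date strings; the incident numbers here are made up for illustration:

```python
from datetime import date

# Keep only incidents between 01/01/2015 and 17/11/2015 (inclusive).
# Rows are (IncidntNum, Date) pairs in the dataset's DD-MM-YYYY format.
rows = [("150001", "05-03-2015"), ("140330", "22-08-2014"),
        ("150777", "17-11-2015"), ("151999", "30-11-2015")]

start, end = date(2015, 1, 1), date(2015, 11, 17)

def parse(d):
    # Convert a DD-MM-YYYY string into a comparable date object.
    day, month, year = map(int, d.split("-"))
    return date(year, month, day)

filtered = [r for r in rows if start <= parse(r[1]) <= end]
print([r[0] for r in filtered])  # ['150001', '150777']
```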
The map below represents all the crimes carried out in 2015 and each red dot is the location where
each incident took place. This provides a geographical view of where crime is most prevalent by the
density of the red dots and where crime is less prevalent where the red dots are more distributed.
[Line graph of number of incidents against year: Change in Other Offenses Incidents – 24,690 (2009), 18,646 (2012), 17,590 (2015).]
I decided to look at larceny/theft as a sample of what the map could be used for. You can filter the results further by selecting a specific police department that responded to the crime, which indicates the location where the crime occurred. Below is the map that represents all the larceny/theft incidents that took place in 2015.
Then I filtered the result to show the police department that responded to the most incidents, which
was the Northern Police Department. This map shows that the crimes are carried out in very similar
locations.
Next I filtered the result to show the police department that responded to the fewest incidents,
which was the Ingleside Police Department. The map below indicates that there is no distinct pattern
where the crimes take place, because the red dots are distributed and are not clustered in a specific
area.
Results and Findings
In conclusion larceny/theft, other offenses and non-criminal crimes are the most common crimes
carried out in San Francisco, with larceny/theft being a major contributor because there have been
225,898 incidents within the 6 years. Also 60.9% of all incidents did not have a resolution which
suggests the offender was let off with a warning. Therefore some offenders may repeat a crime
because they are not deterred from carrying out a crime, which could have had an impact on the
total incident numbers for each crime category. This could be analysed if further data was provided.
Overall, the visualisations of the changes between 2009, 2012 and 2015 clearly indicate that some types of crime, such as drug/narcotic and other offenses incidents, are on the decline. On the other hand, crimes such as larceny/theft are on the rise, with a 20.7% increase between 2012 and 2015.
The map visualisation of all crimes carried out in 2015 shows a high crime rate in the north east of San Francisco, particularly in the area policed by the Northern Police Department, whereas the western areas of San Francisco have a lower crime rate.
Future Developments
The mapping of each crime could be used to target specific areas to raise awareness in the
community and to increase the number of police staff. These areas could also have additional
surveillance in order to reduce the crime rate. Also the results could be filtered further to show the
crimes with the same M.O. using the incident numbers. This could be used to find patterns in where
an offender feels comfortable in carrying out a crime and be used in criminal profiling. Furthermore
the mapping of each crime could be used when people are deciding where to live because the crime
rate would influence their decision.
List of References
Albodour, R. (2015) MongoDB Part 1 [online lecture] module 220CT, 16 November 2015. Coventry:
Coventry University. Available from < https://prezi.com/3ax4trxxq4z6/mongodb-part-
1/?utm_campaign=share&utm_medium=copy> [20 November 2015]
Area Vibes (2013) San Francisco, CA Crime Rates and Statistics [online] available from
<http://www.areavibes.com/san+francisco-ca/crime/> [1 December 2015]
Bhasin, M. L (2006) ‘Data Mining: A Competitive Tool in the Banking and Retail Industries’, The
Chartered Accountant [online] available from
<https://www.academia.edu/17141409/Data_Mining_A_Competitive_Tool_in_the_Banking>
[21 November 2015]
Fayyad. U, Piatetsky-Shapiro, G & Smyth, P (1996) ‘From Data Mining to Knowledge Discovery in
Databases’, AI Magazine [online] available from
<https://www.aaai.org/ojs/index.php/aimagazine/article/viewFile/1230/1131>
[21 November 2015]
Federal Trade Commission (2013) Your Equal Opportunity Rights [online] available from
<http://www.consumer.ftc.gov/articles/0347-your-equal-credit-opportunity-rights>
[22 November 2015]
Frand. J (n.d) Data Mining [online] available from
<http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/dataminin
g.htm> [22 November 2015]
Google (2015) About Fusion Tables [online] available from
<https://support.google.com/fusiontables/answer/2571232> [30 November 2015]
Google (2015a) Create a Map: Fusion Tables [online] available from
<https://support.google.com/fusiontables/answer/2527132?hl=en&topic=2573107&ctx=topi
c#mapsample> [30 November 2015]
Google Maps (2015) San Francisco [online] available from
<https://www.google.co.uk/maps/place/San+Francisco,+CA,+USA/@37.7576171,-
122.5776844,11z/data=!3m1!4b1!4m2!3m1!1s0x80859a6d00690021:0x4a501367f076adff>
[3 December 2015]
History.com (2009) Alcatraz Island [online] available from
<http://www.history.com/topics/alcatraz> [1 December 2015]
Investopedia (n.d) Creditworthiness [online] available from
<http://www.investopedia.com/terms/c/credit-worthiness.asp> [22 November 2015]
M&T Bank (2015) The 5C’S of Credit [online] available from
<https://www.mtb.com/business/businessresourcecenter/Pages/FiveC.aspx>
[22 November 2015]
Moin, K. I & Ahmed, Q. B (n.d) ‘Use of Data Mining in Banking’, International Journal of Engineering
Research and Applications [online] available from
<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.416.7821&rep=rep1&type=pdf>
[21 November 2015]
Neighbourhood Scout (n.d) San Francisco Crime [online] available from
<http://www.neighborhoodscout.com/ca/san-francisco/crime/> [2 December 2015]
San Francisco Travel (n.d) Alcatraz [online] available from
<http://www.sanfrancisco.travel/alcatraz> [1 December 2015]
SF Examiner (2015) San Francisco Crime Rate Jumps Despite Fewer Arrests [online] available from
<http://www.sfexaminer.com/sf-crime-rate-jumps-despite-fewer-arrests/>
[2 December 2015]
SF Gate (2013) The 16 Most Infamous Crimes in Bay Area History [online] available from
<http://www.sfgate.com/crime/slideshow/The-16-most-infamous-crimes-in-Bay-Area-
history-72881/photo-3048055.php> [1 December 2015]
SF OpenData (2015) SFPD Incidents – From 1 January 2003 [online] available from
<https://data.sfgov.org/data?category=&dept=&search=sfpd%20incidents&type=datasets>
[28 November 2015]
University of Nevada (2003) Preparing Data for Data Mining [online] available from
<http://www.cabnr.unr.edu/gf/dm/chap02.pdf> [21 November 2015]