Assignment: VAST Mini-Challenge 2

Part 2 out of 4

Yong Kai Lim https://limyongkai.netlify.app/ (Singapore Management University)
07-23-2021

Continuing the Investigation from Part 1….

Tasks and Questions:

1. Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?

The following packages are loaded for data preparation and visualisation.

packages = c('tidyverse', 'lubridate', 'MASS',
             'ggplot2', 'cdparcoord', 'ggiraph', 'plotly', 
             'geosphere', 'sf','rgeos', 'crosstalk',
             'raster', 'tmap','visNetwork','ggraph','tidygraph',
             'ggalluvial')

for(p in packages){
  if(!require(p, character.only=T)){
    install.packages(p)
  }
  library(p, character.only=T)
}

The credit card and loyalty card datasets were loaded and the structure was checked.

glimpse(cc)
Rows: 1,490
Columns: 4
$ timestamp  <chr> "1/6/2014 7:28", "1/6/2014 7:34", "1/6/2014 7:35"~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
glimpse(loyalty)
Rows: 1,392
Columns: 4
$ timestamp  <chr> "01/06/2014", "01/06/2014", "01/06/2014", "01/06/~
$ location   <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price      <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~

Customer would usually use credit card (cc) with their loyalty card, hence joining both data allows the tagging of cc to loyalty card number. A suitable left join on CC data with loyalty data using timestamp, location and price will be performed. However, both timestamp field are in character format instead of datetime format. The following adjustment will be performed:

## 1. Create column "datetime" in datetime format "YYYY-dd-mm HH:MM:SS"
## 2. Create column "date" in date format "YYYY-dd-mm"
## 3. Change encoding of locations name
cc <- as_tibble(lapply(cc, iconv, to="ASCII//TRANSLIT"))
cc <- cc %>% mutate(datetime = mdy_hm(timestamp), date = date(datetime),
                    price = as.numeric(price), last4ccnum=as.factor(last4ccnum)) 

## 1. Create column "date" in date format "YYYY-dd-mm"
## 2. Change encoding of locations name
loyalty <- as_tibble(lapply(loyalty, iconv, to="ASCII//TRANSLIT"))
loyalty <- loyalty %>% mutate(date = date(mdy(timestamp)), price=as.numeric(price))

glimpse(cc)
Rows: 1,490
Columns: 6
$ timestamp  <chr> "1/6/2014 7:28", "1/6/2014 7:34", "1/6/2014 7:35"~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <fct> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
$ datetime   <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ date       <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
glimpse(loyalty)
Rows: 1,392
Columns: 5
$ timestamp  <chr> "01/06/2014", "01/06/2014", "01/06/2014", "01/06/~
$ location   <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price      <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~
$ date       <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~

Prior to joining both data, a quick glance of the aggregated summary statistics in table 1 showed that there are more credit card transaction as compared to loyalty card transaction for each day. This could imply that employees did not use their loyalty card when they perform a transaction with their credit card and a perfect join of the two dataset was not possible. A left join of cc and loyalty dataset by location, date and price was performed.

## Summary statistics for cc and loyalty transaction per day
cc_t<-merge((cc %>% group_by(date) %>% summarize(cc_count = n())), 
      (loyalty %>% group_by(date) %>% summarize(loyalty_count = n())), 
      by="date") %>% mutate(diff = cc_count-loyalty_count)
knitr::kable(cc_t, "simple",
             caption="Summary statistics for cc and loyalty transaction per day")
Table 1: Summary statistics for cc and loyalty transaction per day
date cc_count loyalty_count diff
2014-01-06 128 119 9
2014-01-07 130 122 8
2014-01-08 129 122 7
2014-01-09 133 118 15
2014-01-10 116 103 13
2014-01-11 61 51 10
2014-01-12 55 54 1
2014-01-13 121 117 4
2014-01-14 128 123 5
2014-01-15 126 122 4
2014-01-16 131 123 8
2014-01-17 113 108 5
2014-01-18 70 67 3
2014-01-19 49 43 6
## Left join cc with loyalty data
trans <- left_join(cc, loyalty, by=c("location", "date", "price")) %>%
  dplyr::select(-c(timestamp.x, timestamp.y, datetime))
glimpse(trans)
Rows: 1,496
Columns: 5
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <fct> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
$ date       <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ loyaltynum <chr> "L8566", NA, "L8148", "L5553", "L3800", "L2247", ~

The trans data mostly tagged a unique “last4ccnum” to a unique “loyaltynum”. However, the number of rows increase from 1490 to 1496, implying that multiple matches occur. It is most likely because there were 6 transaction in the loyalty data with the same location, date and price value from different loyaltynum card.

To investigate the multiple tagging of each unique cc number or unique loyalty card number, the data was transformed and visualise using an interactive parallel coordinate graph in Figure 1. Clicking on either vertical axis “last4ccnum” or “loyaltynum” highlights only the matching lines.

bind_rows(
  trans %>% na.omit() %>% 
    group_by(last4ccnum)%>% filter(n_distinct(loyaltynum)>1),
  trans %>% na.omit() %>%
    group_by(loyaltynum) %>% filter(n_distinct(last4ccnum)>1)
) %>% distinct() %>% mutate(last4ccnum = as.character(last4ccnum)) %>%
  dplyr::select(last4ccnum,loyaltynum) %>%
  discparcoord(k=1000, 
               interactive=TRUE, 
               name="Multiple tags of CC and loyalty number")

Figure 1: Parallel Coordinate plot of CC with multiple tags to Loyalty card number

Selecting credit card number ending 8332, 7889, 5921, 5368, 4948 and 4795 revealed that those credit card were tagged to two different unique loyalty card number and one of them has low transaction count which was represented by the dark brown line. Drilling down on the 6 credit card numbers in the trans data, the matching row had only 1 transactions. This imply that there were two loyalty card transactions that recorded the same date, location and price, resulting in a one to many join that fulfilled all conditions. Hence, these 6 rows of transaction were the difference in row count from the original cc data and the trans data.

Credit card number 1286 was tagged to loyalty number L3288 and L3572 with 15 and 13 transactions respectively. On the other hand, loyalty number L3288 is also tagged to a unique cc number 9241 with 13 transactions. A possible deduction would be the owner of cc 9241 loyalty card is L3288 and owner of cc 1286 loyalty card is L3572. However, the owner of cc 1286 often paid and use L3288 loyalty card. This could suggest close relationship between owners of cc 1286 and 9241.

Loyalty number L6267 wass tagged to cc number 6899 and 6691 with 23 and 20 transaction respectively. On the other hand, both cc 6899 and 6691 had only one unique tag to the loyalty card. Possible deduction could be that the owner of credit card number 6899 and 6691 is the same person using loyalty card L6267. Another deduction would be loyalty number L6267 is shared among the owners of cc 6899 and 6691. If the latter deduction is correct, this could suggest close relationship between owners of cc 6899 and 6691.

With these information, a new dataset card_tag was created to tag the owners of their cc and loyalty card numbers together. However, there were 409 transactions in dataset trans that were not tagged.

## Tag owners of credit card to loyalty card number
card_tag <- trans %>% 
  na.omit() %>%
  group_by(last4ccnum, loyaltynum )%>%
  summarize(count_d = n()) %>%
  filter(count_d > 1) %>%
  filter(!(last4ccnum == 1286 & loyaltynum =="L3288")) %>%
  dplyr::select(-(count_d))

The 409 cc transactions that were not tagged were analysed by mapping the cc and loyalty card. Thereafter, a left join of non-tagged transactions to the loyalty data by field “date”, “location” and loyaltynum" was performed. From Figure 2, it was observed that most of the difference in cc card price and loyalty price converges to “20”, “40”, “60” and “80” dollars. A possible deduction based on the price difference in denomination of “20” could suggest some form of discount or rebate mentioned in the background.

A deliberate shortfall is not possible as those transactions were evenly spread across different days and locations. Furthermore, as the occurrence of the difference in price exist for multiple cc and loyalty card, it was not possible that the shortfall were targeted towards specific owners or at specific locations.

## Non matching cc and loyalty card transaction
non_match_cc <- anti_join(cc, (trans %>% na.omit())) %>% left_join(card_tag)
## Non matching loyalty card and cc transaction
non_match_loy <- anti_join(loyalty, (trans%>%na.omit()))
## All non matching transaction
non_match_trans <- left_join(non_match_cc, 
                             non_match_loy, 
                             by=c("location", "date", "loyaltynum" )) %>% 
  na.omit() %>% 
  mutate(diff=price.x-price.y) %>% 
  filter(diff>=0)

## Remove outliers, select columns and visualise using parallel coordinate plot
non_match_trans %>% 
  filter(!(diff==boxplot(non_match_trans$diff, plot = FALSE)$out)) %>% 
  dplyr::select(last4ccnum,loyaltynum,location,price.x,price.y, diff) %>%
  rename(price_cc = price.x, price_loyalty = price.y) %>%
  mutate(last4ccnum = as.character(last4ccnum)) %>%
  discparcoord(k=1000, 
               interactive=TRUE, 
               name="Non-matching transactions by cc and loyalty number")

Figure 2: Parallel Coordinate plot of CC to Loyalty card number with discount

There was a subset of cc transactions that are not tagged to any loyalty card transactions. Possible deductions could be that owners forgot their loyalty card when making the transactions or there might be suspicious activities in these transactions where owners deliberately avoided using their loyalty card. This subset of transactions was visualise with a boxplot in Figure 3. The boxplot displayed one extreme outlier at Frydos Autosupply n’ More. Hovering over the red outlier circle indicates that the owner of cc 9551 spent 10,000 dollars in that transaction whereas the median price is 134.9 at Frydos Autosupply n’ More. This transaction was extremely suspicious because of the extreme outlier spending and the owner did not use his/her loyalty card despite being such a high amount transactional value.

## Transactions match equally from cc and loyalty card
match_cc <- left_join((left_join(cc, card_tag)), 
                      loyalty, by=c("location","date","price")) %>% 
  na.omit() %>% 
  group_by(last4ccnum, loyaltynum.y) %>% filter(n()>1) %>%
  dplyr::select(-(timestamp.y)) %>%
  rename(timestamp = timestamp.x, 
         loyaltynum_owner = loyaltynum.x, 
         loyaltynum_trans = loyaltynum.y) %>%
  mutate(trans_match = 1)

## Transactions match with difference in 20 dollars denomination
match_cc_dis <- anti_join(cc, match_cc, by=c("date","location","price")) %>% 
  left_join((non_match_trans %>% filter(diff %in% c(20, 40, 60, 80))), 
            by=c("location", "last4ccnum","date","price"="price.x")) %>% 
  na.omit() %>%
  dplyr::select(-timestamp.x, -datetime.y, -timestamp.y) %>%
  rename(datetime = datetime.x, 
         loyaltynum_trans = loyaltynum, 
         price_loy = price.y) %>%
  mutate(trans_match = 1)

## Transactions with cc transactions but not match to loyalty card
no_loy_trans <- anti_join(cc, match_cc, by=c("date","location","price")) %>%
  anti_join(match_cc_dis, by=c("date","location","price")) %>%
  mutate(trans_match = 0)

## Tagging all information on transactions from cc and loyalty to final_trans
final_trans <- bind_rows(match_cc, match_cc_dis, no_loy_trans)

## Determine median price per location
median_price <- no_loy_trans %>% 
           group_by(location) %>% 
           summarize(med=median(price))
## Data transformation for boxplot plotting
no_loy_trans_1 <- no_loy_trans %>% 
  left_join(median_price, by=c("location"))
## Boxplot function
boxplot1 <- ggplot(no_loy_trans_1, aes(x=location, y=price, text=paste("Median:", med))) +
  geom_boxplot(outlier.color="red",outlier.fill="red") + 
  geom_point(alpha=0) + scale_y_log10() + coord_flip() +
  ggtitle("Boxplot of CC transaction NOT tagged to loyalty card") +
  theme(axis.title=element_blank(),
        plot.title=element_text(size=16, face="bold")) +
  xlab("Price")
boxplot_p1<-ggplotly(boxplot1, width_svg = 7, height_svg = 7)
boxplot_p1$x$data[[1]]$hoverinfo <- "none"
# overrides black outline of outliers
boxplot_p1$x$data[[1]]$marker$line$color = "red"
# overrides black extreme outlier color
boxplot_p1$x$data[[1]]$marker$outliercolor = "red"
# overrides black not as extreme outlier color
boxplot_p1$x$data[[1]]$marker$color = "red"
boxplot_p1

Figure 3: Boxplot of cc transaction without loyalty card

To determine the most popular location in Abila, the visualisation in Figure 4 shows the frequency of the transactions and the transaction prices for each location. The plot Number of transactions per day by location shows which location had the highest number of transaction each day separated by time period and the weekends are shaded in grey. The plot Boxplot of transaction prices per location shows the prices for each location. Log transformation was performed on the boxplot x-axis(Price). The following insights are inferred from the plot.

1. Transactions occurring only on weekdays morning.

The 3 location seems to be coffee shops based on their location name or logo and Brew’ve Been Served is the most popular location among them. Based on the locations, price and timestamp of the transactions, a possible deduction would be these coffee shops serves take-out coffee and are located in between employees home and GAStech. The median price of each transactions were similar for all 3 locations at around 12 dollars. From the map, Coffee Cameleon is the nearest to GAStech but Brew’ve Been Served has more transactions. making Brew’ve Been Served the most popular morning coffee take-out choice among the employees.

2. Transactions occurring only on weekdays afternoon.

Based on the location name or logo, these 4 location seems to be food and beverage outlets. The median price for these locations range from 12 to 15 dollars. A possible deductions could be these location only operates on weekday lunch time and serves drinks such as coffee as they have similar price range as the take-out coffee mentioned previously.

3. Transactions occurring daily during the afternoon or night period.

The 6 locations has transactions from both afternoon and night time period on all days with a median price of 28 to 32 dollars. A possible deduction based on the location names, logo and transaction trend indicates that these are also food and beverage outlets. However, the higher median price and frequent transaction during both afternoon and night period might suggest that these are restaurants that serves full meals for lunch and dinner.

4. Higher value transactions on weekdays only.

These locations has higher median price compared to the others. The company name and logo suggests that the locations are customer or supplier of GAStech. As the bulk of transaction are on the weekday, a possible deduction would be these locations are related to work. The higher median price value could be due to the purchase raw materials which translate to much higher price transacted on weekdays only.

5. Suspicious transaction.

In the boxplot, there is an extreme outlier of a 10,000 dollars while the median price was only 149 dollars. This particular transaction was flagged out in our previous analysis of cc transaction that were not tagged to loyalty card. As individuals are more likely to use loyalty card in conjunction with the loyalty card, the scenario for this transaction further exacerbated the suspicion.

There were frequent transactions performed at Kronos Mart during the midnight period on Monday and both Sundays. The 5 transactions in during midnight is not common and it only occurs only at one specific location. These 5 transactions performed were not tagged to a loyalty card as well. This raises suspicion on the cc owner.

In the boxplot, there was an extreme outlier of 1,239.41 dollars while the median was only 211.47 dollars. It was six times the median price which might be a suspicious transactions. However, looking at the frequency of transactions at Albert’s Fine Clothing, it seems like a common place to buy clothing. Possible deduction was the person was buying lots of clothing for his family or friends, amounting to a much higher price than usual.

## Data manipulation to add more factors
final_trans_1 <- final_trans %>% ungroup() %>%
  mutate(day = as.factor(wday(date)),
         wkday = ifelse(day == "6" | day =="7", "weekend", "weekday"),
         time_bin = case_when(
              hour(datetime)>=0 & hour(datetime)<6 ~ "Midnight",
              hour(datetime)>=6 & hour(datetime)<12 ~ "Morning",
              hour(datetime)>=12 & hour(datetime) <18 ~ "Afternoon",
              hour(datetime)>=18 ~ "Night"),
          time_bin = factor(time_bin, 
                      levels = c("Midnight", "Morning", "Afternoon", "Night"))
        )

## Data transformation to plot Bar graph for transaction frequency
freq<- final_trans_1 %>% 
  group_by(location, date, time_bin) %>% summarize(co=n())
freq_location <- ggplot(freq, aes(x=date, y=co, fill=time_bin, 
  tooltip= paste(co, " transactions at ",location, " on ", date, time_bin))) +
  geom_col_interactive() + 
  annotate(geom="rect", xmin=ymd(20140111)-.5, xmax=ymd(20140113)-.5, 
           ymin=-Inf, ymax=Inf, fill='dark grey' , alpha=0.5) +
  annotate(geom="rect", xmin=ymd(20140118)-.5, xmax=ymd(20140120)-.5, 
           ymin=-Inf, ymax=Inf, fill='dark grey' , alpha=0.5) +
  facet_wrap(~location) +
  ggtitle("Number of transactions per day by location") +
  xlab("Date") + ylab("Number of transactions") +
  labs(fill="Time period") +
  theme(plot.title=element_text(size=20,face="bold"),
        axis.title=element_text(size=14,face="bold"),
        strip.text = element_text(size = 6),
        axis.text=element_text(size=6),
        axis.text.x=element_text(angle=45, hjust=1),
        legend.position="bottom") 

# Find median price per location
median_price_final <- final_trans_1 %>% 
           group_by(location) %>% 
           summarize(med=median(price))
## Data transformation for boxplot plotting
final_trans_1 <- final_trans_1 %>% 
  left_join(median_price_final, by=c("location"))

## Boxplot plotting
boxplot <- ggplot(final_trans_1, aes(x=location, y=price, text=paste("Median:", med))) +
  geom_boxplot(outlier.color="red",outlier.fill="red") + 
  geom_point(alpha=0) + scale_y_log10() + coord_flip() +
  ggtitle("Boxplot of transaction prices per location") +
  theme(axis.title=element_blank(),
        plot.title=element_text(size=20, face="bold"))
boxplot_p<-ggplotly(boxplot)
boxplot_p$x$data[[1]]$hoverinfo <- "none"
# overrides black outline of outliers
boxplot_p$x$data[[1]]$marker$line$color = "red"
# overrides black extreme outlier color
boxplot_p$x$data[[1]]$marker$outliercolor = "red"
# overrides black not as extreme outlier color
boxplot_p$x$data[[1]]$marker$color = "red"

## Plot Interactive Bar chart and Boxplot
girafe(ggobj=freq_location, width_svg = 7, height_svg = 7)

Figure 4: Visualize transactions history

boxplot_p

Figure 4: Visualize transactions history

2. Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?

The GPS dataset has rows of GPS coordinates that were logged every few seconds. This signifies that the car was moving and logging different GPS coordinates. The data was transformed to only keep the stationary GPS coordinate for each car by determining rows where the time lag between subsequent GPS log by each car id was more than 5 minutes. 5 minutes was selected because the waiting time at a traffic lights is around 3 to 5 minutes hence the upper bound was chosen to eliminate situations where the stationary GPS coordinates were due stoppage at traffic lights.

2.1 The first anomaly to be investigated is the high transaction price of 10,000 dollars performed at Frydos Autosupply n’ More on 13/01/2014 night by cc 9951.

Based on the location name and logo, it is highly likely to be a mechanic repair shop for vehicle. The transaction without a matching loyalty card transaction made it more suspicious. The transaction records for cc 9951 was extracted and observed for 13/01/2014 in table 2.

There were 5 transactions made and 3 of them did not match the loyalty card transaction data. This eliminates the possibility of the owner forgetting to bring his/her loyalty card for that particular day. There were two transactions made with a time difference of 10 minutes and one of them did not use the loyalty card during both afternoon and night time period each. To further analyse the transactions, the gps log data was visualise on Abila map.

## Transactions on 13/01/2014 at "Frydos Autosupply n' More"
knitr::kable(final_trans_1 %>% 
      filter(last4ccnum==9551 & date == dmy(13012014)) %>%
      dplyr::select(datetime,location,price,last4ccnum,trans_match)%>% 
      arrange(datetime), "simple",
      caption="Table of transaction for cc 9951 on 13/01/2014") 
Table 2: Table of transaction for cc 9951 on 13/01/2014
datetime location price last4ccnum trans_match
2014-01-13 06:04:00 Daily Dealz 2.01 9551 0
2014-01-13 13:18:00 U-Pump 55.25 9551 0
2014-01-13 13:28:00 Hippokampos 30.51 9551 1
2014-01-13 19:20:00 Frydos Autosupply n’ More 10000.00 9551 0
2014-01-13 19:30:00 Ouzeri Elian 28.75 9551 1

Figure 5 shows all cars GPS travel lines for 13/01/2014. From Figure 4 frequency plot for each location, we observe that there was only 2 transaction performed at U-Pump throughout the 2 weeks data. Hovering around the GPS lines right on top of U-Pump reveals that only car id 24 visited the location. Since U-Pump is a petrol kiosk, we can confidently say that car id 24 owner used cc 9951 to make a transaction at “U-Pump”.

Car id 24 GPS line was marked in red and the stationary GPS coordinates were marked as blue dots. These blue dots will represent the GPS coordinates where the car was stationary at the particular location.

Hovering over the blue dot near U-Pump on the map shows the car stopping at 12:35:15 and leaving at 13:22:01. This matches the transaction performed at U-Pump at 13:18:00.

Thereafter, the car left U-Pump at 13:22:01 and arrived back in GAStech at 13:27:14. Hence, the transaction at 13:28:00 at Hippokampos was not possible.

In the evening, the car GPS showed that it left GAStech at 17:57:01 and stop around Ipsilantou Avenue at 18:00:31 and subsequently drove off at 19:29:01. The 10,000 dollars transaction at Frydos Autosupply n’ More was performed at 19:20:00 which fits the car gps timeline. Although the car did not stop directly at Frydos Autosupply n’ More, the distance is around 500 metres and it is possible for the owner to walk on foot to make the 10,000 dollars transactions.

Thereafter, the car started driving at 19:29:01 to the north and stop at 19:31:35. This eliminates the possibility of the transaction at 19:30:00 at Ouzeri Elian.

The combination of transaction data of cc 9551 records with car id 24 does not fit perfectly. An observation of the two possible transaction made on cc 9551 by car id 24 owner did not have a loyalty card transaction record matched. Similarly, the other two impossible transactions were both matched to a loyalty card transaction. The trend further confirmed that the transactions made on cc 9551 is extremely suspicious. Probable deduction would be cc 9551 does not belong to car id 24 while the real owner of cc 9551 was someone else who used it during the day too.

## Load Map and SHP file
bgmap <- raster("datasets/MC2-tourist.tif")

## Transform the structure of GPS data for Map
gps <- gps %>% mutate(timestamp=mdy_hms(Timestamp),id=as_factor(id))
gps_sf <- st_as_sf(gps, coords=c("long","lat"), crs=4326)
gps_stop <- gps_sf %>% group_by(id) %>% arrange(timestamp) %>%
  mutate(start_diff= as.numeric(timestamp - lag(timestamp,default=first(timestamp)))/60,
         stop_diff= as.numeric(lead(timestamp)-timestamp)/60,
         date = as.Date(timestamp)) %>% 
  filter(start_diff>5 | stop_diff >5) %>% 
  mutate(start_vec=ifelse(start_diff>5,1,0), stop_vec=ifelse(stop_diff>5,1,0))

## Convert to LINE string for 13/01/2014
gps_path_all <- gps_sf %>% 
  filter(as.Date(gps_sf$timestamp) == dmy(13012014)) %>% 
  group_by(id) %>% 
  summarize(m = mean(timestamp), do_union=FALSE) %>% st_cast("LINESTRING")
## Convert to LINE string for 13/01/2014 and car id 24
gps_path_24 <- gps_sf %>% 
  filter(as.Date(gps_sf$timestamp) == dmy(13012014), id==24) %>% 
  group_by(id) %>% 
  summarize(m = mean(timestamp), do_union=FALSE) %>% st_cast("LINESTRING")
## Filter stop points for 13/01/2014 and car id 24
gps_24_points <- gps_stop %>% filter(id ==24 & date == dmy(13012014))
 
## Plot interactive map
tmap_mode("view")
map1<-tm_shape(bgmap) +
  tm_rgb(bgmap, r=1, g=2, b=3, alpha=NA, saturation=1, 
         interpolate=TRUE, max.value=255) +
  tm_shape(gps_path_all)+
  tm_lines() +
  tm_shape(gps_path_24) +
  tm_lines(col ="red") +
  tm_shape(gps_24_points)+
  tm_dots(col="blue", shape=30)
tmap_leaflet(map1)

Figure 5: GPS data for 13/01/2014

2.2 The second anomaly were the early morning transactions records at Kronos Mart from Figure 4 frequency plot. Table 3 below displays all transactions records at Kronos Mart. Five out of the ten transactions were performed in the wee hours around 3am on three different days and three out of the five occurred on 19/01/2014. These few transactions were particularly unusual and further investigation was conducted.

## Transactions on 13/01/2014 at "Frydos Autosupply n' More"
knitr::kable(final_trans_1 %>% 
      filter(location == "Kronos Mart") %>%
      dplyr::select(datetime,location,price,last4ccnum,trans_match)%>% 
      arrange(datetime), "simple",
      caption="Table of transaction for cc 9951 on 13/01/2014") 
Table 3: Table of transaction for cc 9951 on 13/01/2014
datetime location price last4ccnum trans_match
2014-01-10 09:30:00 Kronos Mart 203.91 7688 0
2014-01-12 03:39:00 Kronos Mart 277.26 8156 0
2014-01-13 03:00:00 Kronos Mart 147.30 5407 0
2014-01-13 08:01:00 Kronos Mart 159.06 6816 0
2014-01-14 08:20:00 Kronos Mart 58.85 6899 0
2014-01-16 07:30:00 Kronos Mart 298.83 7108 0
2014-01-17 08:08:00 Kronos Mart 286.24 1415 0
2014-01-19 03:13:00 Kronos Mart 87.66 3484 0
2014-01-19 03:45:00 Kronos Mart 194.51 9551 0
2014-01-19 03:48:00 Kronos Mart 150.36 8332 0

The GPS records for 19/01/2014 were visualised to investigate the transactions. From Figure 6, there was no GPS data that passed by nor stop in the vicinity of Kronos Mart on 19/01/2014. The closest stop location was at ROBERTS AND SONS at 13:20:06 to 14:23:01 by car id 30 represented by the blue dot. The timing of the transaction does not coincide with the cc transaction timing.

Hence, possible deduction could be that cc owners of 3484, 9551 and 8332 stays within walking distance to Kronos Mart, therefore eliminating the need to drive their employee car to the location. Another possibility is that the owners of the cc used their own personal vehicles to get there, resulting in no GPS record for employees issued vehicles. Coincidentally, cc 9551 also appeared in these transaction, which warrants additional investigation.

## Map geometry for 19012014
gps_path2 <- gps_sf %>% 
  filter(as.Date(gps_sf$timestamp) == dmy(19012014) & id !=29) %>% 
  group_by(id) %>% 
  summarize(m = mean(timestamp), do_union=FALSE) %>% st_cast("LINESTRING")
gps_points2 <- gps_stop %>% filter(date == dmy(19012014))

## Plot interactive map
tmap_mode("view")
map2<-tm_shape(bgmap) +
  tm_rgb(bgmap, r=1, g=2, b=3, alpha=NA, saturation=1, 
         interpolate=TRUE, max.value=255) +
  tm_shape(gps_path2)+
  tm_lines() +
  tm_shape(gps_points2)+
  tm_dots(col="blue", shape=30)
tmap_leaflet(map2)

Figure 6: GPS data for 19/01/2014

2.3 Lastly, we will cross-check and validate the GPS data with the frequency of transactions at each location. We will first validate the weekday movement. From the earlier section, there were three groups of transaction data: weekday morning transaction only, weekday afternoon transactions only and high value transactions on weekdays only. The map with GPS movement on 07/01/2014 was visualise in Figure 7 as there were transactions performed on that day at all of the locations.

The car GPS stationary coordinates in blue dots for Coffee Cameleon and Hallowed Grounds fits the transaction data. However, the blue dots directly on Brew’ve Been Served logo in the map shows that the timing of the stationary coordinates were mainly in the afternoon or evening. This does not match the transaction timing at Brew’ve Been Served. However, looking slightly south near the main road of Ipsilantou Avenue, there were multiple GPS stationary coordinates in the morning and they fit the transaction timing at Brew’ve Been Served. This might be due to the misrepresentation of the location logo on the map.

There are 4 locations that are in this group. Based on the 4 locations name and logo, they seems to be similar to the earlier group consisting of coffee shops.

Table 4 shows the 13 transactions at the 4 locations on 07/01/2014. A common trend observed was the exact same timestamp of 12:00 on all 13 transactions. However, looking at the GPS stationary positions at those location, the GPS stationary coordinates timestamp were in the morning before 09:00 where employees would presumably visit before heading to GAStech for work.

The 4 locations were spread around Abila, and the occurrence of mismatch GPS stationary timestamp were consistent. A possible deduction could be due to faulty Point of Sales (POS) machines at those locations. Alternatively, it might be possible that they are using the same type of POS machine that performed batch processing instead of real-time processing for cc transactions which process at 12:00 daily.

## Transactions on 13/01/2014 at "Frydos Autosupply n' More"
knitr::kable(final_trans_1 %>% 
      filter((location == "Jack's Magical Beans" |
             location == "Brewed Awakenings" |
             location == "Coffee Shack" |
             location == "Bean There Done That") &
             date == dmy(07012014)) %>%
      dplyr::select(datetime,location,price,last4ccnum,trans_match, price_loy)%>% 
      arrange(datetime), "simple",
      caption="Table of transaction the 4 locations on 07/01/2014") 
Table 4: Table of transaction the 4 locations on 07/01/2014
datetime location price last4ccnum trans_match price_loy
2014-01-07 12:00:00 Coffee Shack 16.63 7117 1 NA
2014-01-07 12:00:00 Brewed Awakenings 6.72 8332 1 NA
2014-01-07 12:00:00 Bean There Done That 8.03 1321 1 NA
2014-01-07 12:00:00 Jack’s Magical Beans 18.77 9241 1 NA
2014-01-07 12:00:00 Jack’s Magical Beans 19.61 8156 1 NA
2014-01-07 12:00:00 Bean There Done That 51.25 1415 1 11.25
2014-01-07 12:00:00 Jack’s Magical Beans 23.68 6899 1 3.68
2014-01-07 12:00:00 Brewed Awakenings 64.84 3853 1 4.84
2014-01-07 12:00:00 Brewed Awakenings 71.59 2540 1 11.59
2014-01-07 12:00:00 Bean There Done That 53.89 1877 1 13.89
2014-01-07 12:00:00 Bean There Done That 46.25 6895 1 6.25
2014-01-07 12:00:00 Jack’s Magical Beans 69.84 2463 1 9.84
2014-01-07 12:00:00 Brewed Awakenings 12.17 7688 0 NA

Based on the 7 locations name and logo, they are likely to be industrial Places of Interest. Observation from the stationary GPS represented by the blue dots at these locations revealed that only truck drivers with car id 100 and above visited these locations. The stationary GPS timestamp also matches the cc transaction timestamp. Hence, a possible deduction is that these 7 locations are businesses that are close partners with GAStech and the payment were made by the lorry truck driver during the weekdays. This will align with the fact that lorry driver vehicles only operates on weekday working hours.

## Map geometry for 07012014
gps_path3 <- gps_sf %>% 
  filter(as.Date(gps_sf$timestamp) == dmy(07012014)) %>% 
  group_by(id) %>% 
  summarize(m = mean(timestamp), do_union=FALSE) %>% st_cast("LINESTRING")
gps_points3 <- gps_stop %>% filter(date == dmy(07012014)) 

## Plot interactive map
tmap_mode("view")
map3<-tm_shape(bgmap) +
  tm_rgb(bgmap, r=1, g=2, b=3, alpha=NA, saturation=1, 
         interpolate=TRUE, max.value=255) +
  tm_shape(gps_path3)+
  tm_lines() +
  tm_shape(gps_points3)+
  tm_dots(col="blue", shape=30)
tmap_leaflet(map3)

Figure 7: GPS data for 07/01/2014

To be continued with Part 3….

Citation

For attribution, please cite this work as

Lim (2021, July 23). Yong Kai: Assignment: VAST Mini-Challenge 2. Retrieved from https://limyongkai.netlify.app/posts/2021-07-23-vastmc2part2/

BibTeX citation

@misc{lim2021assignment:,
  author = {Lim, Yong Kai},
  title = {Yong Kai: Assignment: VAST Mini-Challenge 2},
  url = {https://limyongkai.netlify.app/posts/2021-07-23-vastmc2part2/},
  year = {2021}
}