
Monday, March 25, 2013

Something from "Hedgehogging" -- trap of randomness

So I am reading a very nice book, "Hedgehogging", which tells stories about hedge funds and the people who run them.

As I come from a statistics background, I am automatically sensitive to numbers (and their uncertainty), to probability, and to the combination of the two. In Chapter 8, I was impressed by an example that the author cites from "Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets".

The example says:
You are an excellent investor, able to gain an additional 15% return above the US bond. The variability (standard deviation) of your return in a year is 10%. Assuming your return rate follows a normal distribution, you then have about a 93% chance of gaining more than the US bond in a year.

The author then cites a table from that book which illustrates the relationship between the probability that you gain more than the US bond and the time frame over which you check your return. However, the results in the table from the original book are simulation based. We can actually calculate the exact numbers, so I could not help computing them and updating the table as follows --


If you compare the results in this table with the original table, you will see only very tiny differences.

Here is the formula for the probability that you will gain more than the US bond, given that you gain 15% more than the US bond per year on average and the standard deviation of your annual return is 10%.

Let X_T be your excess return over a time span T, and let N be the number of periods of length T in one year. For example, N=3 means T is four months (one "season" in this example); N=365 means T is one day. Assuming the X_T's are independent and normally distributed, the excess return over one period has mean mu/N and standard deviation sigma/sqrt(N), i.e. X_T ~ N(mu/N, sigma/sqrt(N)), where mu is 15% and sigma is 10%. Thus, the probability that you gain more than the US bond over one period is

P(X_T > 0) = P_norm( (mu/N) / (sigma/sqrt(N)) ) = P_norm( mu / (sigma*sqrt(N)) )

where P_norm is the cumulative distribution function of the standard normal distribution.

The following is simple R code to calculate the above:

u = 0.15   # on average you earn 15% more than the US bond per year
sig = 0.1  # standard deviation of your annual return

# probability of beating the US bond over a period that is 1/N of a year
prob_noloss <- function(u, sig, N) pnorm(u/(sig*sqrt(N)))

# probability you gain more than the US bond per year
prob_noloss(u, sig, N=1)

# probability you gain more than the US bond per four-month period ("season")
prob_noloss(u, sig, N=3)

# probability you gain more than the US bond per month
prob_noloss(u, sig, N=12)

# probability you gain more than the US bond per day
prob_noloss(u, sig, N=365)

# probability you gain more than the US bond per hour
prob_noloss(u, sig, N=365*24)

# probability you gain more than the US bond per minute
prob_noloss(u, sig, N=365*24*60)
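
For reference, here is a quick way to reproduce the whole table in one call using the prob_noloss() function above; the period names are just my own shorthand, and the rounded values in the comment are what the formula gives.

# number of periods per year for each checking frequency
periods <- c(year=1, four_months=3, month=12, day=365, hour=365*24, minute=365*24*60)

# probability of beating the US bond over each period, rounded to 4 digits
round(sapply(periods, function(N) prob_noloss(u, sig, N)), 4)
#        year four_months   month     day    hour  minute
#      0.9332      0.8068  0.6675  0.5313  0.5064  0.5008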


This example may look simple, and the calculation rests on strong assumptions, but the way the authors (of both books) think about it is valuable.

As discussed in the Hedgehogging book, suppose you are a fund manager. When your clients look at your annual return, they have a high probability of feeling happy and thus strong confidence in keeping their money in your fund. But if they are able to look at your return every day, their happiness decreases, so some (or many) of them may pull their money out of your fund, which will make you nervous, worried, maybe angry, and may push you into irrational decisions.

Actually, this chain effect does not apply only to fund managers. It is the same for individual investors, because we are the managers of "our own fund". That is why I was quite impressed when the author moved from a simple statistical example to a common phenomenon in the market -- falling into the trap of randomness.

However, it seems the author (of Fooled by Randomness) would not forbid managers or investors from looking at their daily returns. Instead, he suggests that we can know our performance on a given trading day but should keep calm when making decisions and not be driven by our mood, which I feel is true but too general to act on.

I am still reading this nice "Hedgehogging" book. I hope to learn more from it and will post again when I find something else interesting.





Friday, March 15, 2013

A little more code to apply the geocoding function just posted

Nothing new here. I just wrote a little code to geocode a large file of addresses. The R code is as follows.


###########################################
# check and regeocoding NETS address
# using Google Map API
###########################################

setwd("C:/Geocoding")
source("C:/Geocoding/GIS_Google_Map_API.R") # code listed in the previous blog

address <- read.csv("addresses.csv",header=T)

GeoResult <- c()

for(i in 1:dim(address)[1]){
    addr_item = address[i,]
    Addr <- addr_item$Addr
    # build each piece with sep="" so no stray spaces end up in the full address
    Addr <- ifelse(is.na(Addr), "", paste(Addr, ", ", sep="") )
    City <- addr_item$City
    City <- ifelse(is.na(City), "", paste(City, ", ", sep="") )
    State <- addr_item$State
    State <- ifelse(is.na(State), "", paste(State, ", ", sep="") )
    Zip <- addr_item$ZIP
    Zip <- ifelse(is.na(Zip), "", Zip )
    Zip4 <-addr_item$Plus4
    Zip4 <- ifelse(is.na(Zip4), "", paste("-",Zip4, sep="") )
    full_addr <- paste(Addr,City,State,Zip,Zip4, sep="")
    georesult <- gGeoCode(full_addr)
    if(georesult$status == "OK"){
        GeoResult_i <- cbind( rep(i,length(georesult$lat)) ,
                            rep(full_addr,length(georesult$lat)),
                            georesult$lat,
                            georesult$lng,
                            georesult$formatted_address,
                            georesult$location_type,
                            rep(georesult$status,length(georesult$lat))
                          )
    }
    if(georesult$status != "OK") GeoResult_i <- c(i, full_addr, NA, NA, NA, NA, georesult$status )

    GeoResult <- rbind(GeoResult, GeoResult_i)
    Sys.sleep(0.5) # pause a little between requests or else Google will treat them as too many queries at a time
}

colnames(GeoResult) <- c("item", "full_addr", "lat", "lng", "formatted_address", "location_type", "status")

write.csv(GeoResult,"GeoResult.csv")
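
After the results are written out, a quick sanity check is to tabulate the status and location_type columns; Google's location_type values run from ROOFTOP (most precise) through RANGE_INTERPOLATED and GEOMETRIC_CENTER to APPROXIMATE. A minimal sketch, assuming GeoResult from the loop above is still in memory:

GeoResult <- as.data.frame(GeoResult, stringsAsFactors = FALSE)

# how many addresses failed to geocode (status other than "OK")
table(GeoResult$status)

# rough accuracy breakdown: ROOFTOP is the most precise match type
table(GeoResult$location_type)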

Geocoding using Google Map API via R with examples

Although I've taken an ArcGIS course before, I still like to use R as much as possible to complete daily tasks (the reason is obvious...). Recently an issue about geocoding and location-based analysis came up and caught my attention. Basically, the task is to geocode business addresses and check their accuracy.

People around me mostly use ArcGIS or other expensive tools like TeleAtlas to do geocoding. As I am not a core GIS person, I would rather use my usual tool -- R -- to do the task. Here is the code I learned from the blog post "Calling Google Maps API from R". The code that blog offers works pretty well, so I borrowed it for future reference and made some modifications and examples based on my work experience.

For more information about the Google Maps API, refer to Google's official documentation: https://developers.google.com/maps/documentation/geocoding/

#####################################################################
#  R code to call Google Map API 
#  Source codes: http://svnwang.blogspot.com/
#  Reference: http://statisfaction.wordpress.com/2011/10/05/calling-google-maps-api-from-r/
#####################################################################
library(XML) # use install.packages("XML") if you haven't installed the XML package before

getDocNodeVal=function(doc, path)
{
   sapply(getNodeSet(doc, path), function(el) xmlValue(el))
}

gGeoCode=function(str)
{
  library(XML)
  # encode the address so spaces and other characters are URL-safe, then build the request URL
  u=paste('http://maps.google.com/maps/api/geocode/xml?sensor=false&address=', URLencode(str), sep="")
  doc = xmlTreeParse(u, useInternal=TRUE)

  lat=getDocNodeVal(doc, "/GeocodeResponse/result/geometry/location/lat")
  lng=getDocNodeVal(doc, "/GeocodeResponse/result/geometry/location/lng")
  formatted_address =getDocNodeVal(doc, "/GeocodeResponse/result/formatted_address")
  location_type = getDocNodeVal(doc, "/GeocodeResponse/result/geometry/location_type")
  status = getDocNodeVal(doc, "/GeocodeResponse/status")
  
  list(lat = lat,  # latitude 
       lng = lng,  # longitude
       formatted_address=formatted_address,  # full addresses suggested by google
       location_type = location_type, # geocoding accuracy
       status=status # status of geocoding: OK or ZERO_RESULTS
       )
}


#Example 1: you have a right address to geocode
str1 = "11 Wall St, New York, NY"
gGeoCode(str1)

#Example 2: your address is too general,
#           but Google can still map it by approximation.
str2 = "Wall St, New York, NY"
gGeoCode(str2)

#Comments:
#as you can see from the result, there can be multiple matched pairs of coordinates and formatted addresses.
#in practice, you need to choose the one that best fits your need (see the sketch after Example 4).

#Example 3: your address is not accurate, but google can guess what it is.
str3= "11 Wall Rd, New York" #actually, it should be "Wall St"
gGeoCode(str3)

#Example 4: your address is too bad, far from accurate, thus cannot be geocoded
str4= "1021 Watl P1lz"
gGeoCode(str4)
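
As noted in the comments after Example 2, gGeoCode() can return several candidate matches. Here is a minimal sketch (my own convention, not from the original blog) that simply keeps the first candidate, which Google usually lists as the best match:

res <- gGeoCode(str2)

if(res$status == "OK"){
    # keep only the first (top-ranked) candidate as a one-row data frame
    best <- data.frame(lat = as.numeric(res$lat[1]),
                       lng = as.numeric(res$lng[1]),
                       formatted_address = res$formatted_address[1],
                       location_type = res$location_type[1],
                       stringsAsFactors = FALSE)
    print(best)
}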







Monday, November 19, 2012

Get ForEx data using quantmod R package

The first step of every analysis is getting enough data.

I am interested in the foreign exchange market and curious about patterns in exchange-rate changes; therefore, I tried to find a convenient way to obtain ForEx data.

Thanks to the "quantmod" package developed by some kind people, I can now download forex data that is at least good enough for some preliminary analysis, if not the perfect data I hope to have.

What follows is what I learned from an example on the "quantmod" website.


# for example, I need forex data for the USD against CAD price from 2004 to the current date

require(quantmod)


getSymbols("USD/CAD",src="oanda",from="2012-01-01")

forex12 <- USDCAD
getSymbols("USD/CAD",src="oanda",from="2011-01-01",to="2011-12-31")
forex11 <- USDCAD
getSymbols("USD/CAD",src="oanda",from="2010-01-01",to="2010-12-31")
forex10 <- USDCAD
getSymbols("USD/CAD",src="oanda",from="2009-01-01",to="2009-12-31")
forex09 <- USDCAD
getSymbols("USD/CAD",src="oanda",from="2008-01-01",to="2008-12-31")
forex08 <- USDCAD
getSymbols("USD/CAD",src="oanda",from="2007-01-01",to="2007-12-31")
forex07 <- USDCAD
getSymbols("USD/CAD",src="oanda",from="2006-01-01",to="2006-12-31")
forex06 <- USDCAD
getSymbols("USD/CAD",src="oanda",from="2005-01-01",to="2005-12-31")
forex05 <- USDCAD
getSymbols("USD/CAD",src="oanda",from="2004-01-01",to="2004-12-31")
forex04 <- USDCAD

forex <-rbind(forex04,forex05,forex06,forex07,forex08,forex09,forex10,forex11,forex12)


# make a rough plot to see the daily change

chartSeries(forex,theme = chartTheme("white"))
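
The repeated getSymbols() calls above can also be written as a short loop. Below is a minimal sketch, assuming your quantmod version supports auto.assign = FALSE (so each yearly series is returned instead of being assigned to USDCAD) and that oanda simply returns whatever part of the requested range is available:

years <- 2004:2012
forex_list <- lapply(years, function(y) {
    getSymbols("USD/CAD", src = "oanda",
               from = paste(y, "-01-01", sep=""),
               to   = paste(y, "-12-31", sep=""),
               auto.assign = FALSE)
})

# stack the yearly chunks into one series and plot it
forex <- do.call(rbind, forex_list)
chartSeries(forex, theme = chartTheme("white"))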



This downloading method works well, but I still feel the data is not enough to support some analyses, particularly short-term trading strategies, so I really need to figure out whether there is any way to get data at 5-minute or hourly resolution instead of a daily average.


Monday, November 12, 2012

Illustrate the crime trends across years

One day I was asked to help illustrate and compare the crime trends for 4 communities in Chicago: Beverly Hill, Washington Heights, Mount Greenwood and Morgan Park. The crime count data is from the Police Department of Chicago. It gives all crimes that happened in each community from 2001 to 2010.

Personally, I felt it would not be hard to make a quick plot, but it is worth putting in some time to make this small task look nicer. Therefore, I decided to use ggplot2 to make the plot in R.

I hope someone who wants to make similar plots can eventually find my code and procedure useful. If anyone has other interesting ideas to make the plots nicer or more informative, please leave a comment :)



library(ggplot2) # import the ggplot2 library
dat <- read.csv("community_crime.csv", header=T)
dat$community <- factor(dat$community,
                        labels=c("Beverly Hill","Washington Heights","Mount Greenwood","Morgan Park"))

X11(width = 18, height = 12) # open a large plotting window

ggplot(dat, aes(x=year, y=any_1000, color=community)) +
    geom_point(shape=19, size=I(8), alpha=0.3) +
    scale_colour_hue(l=50) +               # use a slightly darker palette than normal
    geom_smooth(method="lm",               # add linear regression lines
                se=FALSE,                  # don't add the shaded confidence region
                fullrange=T, size=1.4) +   # extend the regression lines
    ylab("all crime count per 1000 population") +
    xlab("Year") +
    ggtitle("All Crime Count per 1000 Population") # set the plot title
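
If the figure needs to be saved to a file rather than only shown in the X11 window, ggsave() writes out the most recent ggplot; the filename and dimensions below are just examples.

# save the last ggplot to a PNG file (width and height in inches)
ggsave("crime_trends_by_community.png", width = 12, height = 8)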