The currency market is a highly complex, decentralized system of exchanges connecting brokers with the interbank market where the big banks operate. Electronic Broking Services (EBS) and Reuters are the two big interbank platforms with most of the market share. The minimum ticket size for doing business on these two platforms is one million units, and they are the major providers of liquidity in the interbank market. This is where the big high frequency trading (HFT) players operate. The one-million-unit minimum ticket size creates a business opportunity for other players to provide services to the retail segment of the market through ECNs. Unlike EBS and Reuters, these ECNs provide full anonymity to the parties.
EBS is the main venue where the USD, EUR, GBP, JPY and CHF crosses trade in the interbank market. EBS provides two data feeds: one gives a snapshot of the order book every 250 milliseconds, while the premium EBS Live feed gives a snapshot every 100 milliseconds. Keep in mind that we retail traders never get to see the order book. That privilege is reserved for the big players, which gives them a big advantage over us. Big players who can afford the one-million minimum ticket size trade on EBS; Reuters works in a similar way. This should give you an idea of how fast the forex market moves: it moves in milliseconds, while the smallest time bar on MT4 is 1 minute.
If you have been trading and know a little programming in MQL4, you can see the ticks coming in. The problem is that MT4 does not save this tick data. To keep it, you would need to collect the tick data from MT4 and save it in a CSV file; in a few hours you can have a few million rows. Saving tick data is done by hedge funds and other big players, but we don't need to do that, and it can be costly. What we need is a method that extracts the same information we would extract from the tick data. I will show you how to do that in this post. But before we proceed, I want to give you an idea of the importance of tick data: it carries a lot of information, and missing that information means more losing trades.
If you have been trading for some time, you must have been using time bars. Have you ever thought of using tick bars and range bars? Maybe you haven't given it much thought. A time bar is formed after a fixed period of time, for example 1 hour, 30 minutes, 15 minutes, 1 day, 1 week or 1 month. A tick bar, on the other hand, is formed after a fixed number of trades, for example 100, 200, 400 or 1000 trades. A tick is recorded whenever a trade takes place. The currency market is an over-the-counter market with no central exchange, meaning it is a segmented market, with different banks forming the segments. So there is no central clearing authority like in the futures or stock markets, where full volume information is available. What this means is that we don't have trade volume information; we only get the tick data.
Tick Bars
A tick is recorded whenever a trade takes place, but in forex we don't know the volume of the trade; we are only given the number of trades that took place in a certain time period. Time bars are created after a fixed time period, like 1 hour or 4 hours. There can be many trades in one time bar and very few in another. This is important to understand because it is related to the concept of volatility. Volatility is price movement: if the price movement range is large we have high volatility, and if it is small we have low volatility. With traditional time-chart candlesticks, some candlesticks contain few trades and others contain many trades, meaning many ticks. So we get clusters of low volatility followed suddenly by clusters of high volatility.
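To see this variation concretely, here is a small R sketch. The tick counts are simulated (a Poisson draw with occasional bursts stands in for the MT4 Volume column, which counts ticks per 1-minute bar), so the numbers are illustrative only:

```r
# Simulated ticks per 1-minute bar: quiet minutes plus occasional bursts.
# This stands in for the Volume column of real MT4 1-minute data.
set.seed(42)
ticks_per_bar <- rpois(1000, lambda = 50) *
  sample(c(1, 5), 1000, replace = TRUE, prob = c(0.8, 0.2))
summary(ticks_per_bar)                         # wide spread of tick counts
range_ratio <- max(ticks_per_bar) / min(ticks_per_bar)
range_ratio                                    # busiest bar vs. quietest bar
```

Two time bars of identical duration can thus carry very different amounts of trading activity, which is exactly the information a time chart throws away.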
# Extract prices and compute statistics
data1 <- read.csv("D:/Shared/MarketData/GBPUSD#1.csv", header=FALSE)
colnames(data1) <- c("Date", "Time", "Open", "High", "Low", "Close", "Volume")
mean_prices <- round(mean(data1$Close), 2)
sd_prices <- round(sd(data1$Close), 2)

# Plot the histogram along with a legend
hist(data1$Close, breaks = 100, prob=T, cex.main = 0.9)
abline(v = mean_prices, lwd = 2)
legend("topright", cex = 0.8, border = NULL, bty = "n",
       paste("mean=", mean_prices, "; sd=", sd_prices))
As you can see in the plot, the price distribution is multimodal with varying mean and varying standard deviation. What this means is that the price distribution is not stationary; time bars have volatility that varies from bar to bar. Another problem is that time-bar returns have heavy tails: the return distribution is not normal but heavy tailed and skewed.
Above is the histogram of the GBPUSD price. As you can see, the price distribution is multimodal with many peaks. Dealing with a multimodal distribution is not easy, as it is more complex than the usual normal (Gaussian) distribution. The plot only represents price over a certain period of time; if we choose a different period we will get a different histogram, meaning the mean and standard deviation of the price distribution keep varying. In short, the price distribution is not stationary. Quants find it easier to deal with log returns, so let's plot the QQ plot for the GBPUSD 30-minute time bars and see how skewed the log returns are.
data1 <- read.csv("D:/Shared/MarketData/GBPUSD#30.csv", header=FALSE)
colnames(data1) <- c("Date", "Time", "Open", "High", "Low", "Close", "Volume")
library(quantmod)
data2 <- as.xts(data1[,-(1:2)],
                as.POSIXct(paste(data1[,1], data1[,2]), format='%Y.%m.%d %H:%M'))
# create log returns column
data2$LR <- diff(log(data2$Close))
data2 <- na.omit(data2)
qqnorm(data2$LR)
qqline(data2$LR)

> tail(data1)
           Date  Time    Open    High     Low   Close Volume
3513 2019.03.12 05:30 1.32135 1.32195 1.32125 1.32186    328
3514 2019.03.12 06:00 1.32185 1.32194 1.32154 1.32165    330
3515 2019.03.12 06:30 1.32166 1.32168 1.32112 1.32138    363
3516 2019.03.12 07:00 1.32145 1.32147 1.32025 1.32035    394
3517 2019.03.12 07:30 1.32036 1.32098 1.32013 1.32042    509
3518 2019.03.12 08:00 1.32041 1.32118 1.32032 1.32116    383
As you can see in the plot below, the log returns are highly skewed. A QQ plot is made by plotting the empirical distribution against the distribution we want to use to model it. The normal distribution has nice properties and is easy to deal with mathematically, but as the QQ plot below shows, the log returns are not normally distributed: they have heavy tails on both sides. This makes statistical analysis of time-bar log returns difficult.
Since the log returns of time bars are highly skewed and not normally distributed, we cannot make predictions on them using standard statistical concepts like stationarity and ergodicity.
Stationarity
Stationarity means that the data generating process does not change with time. When the whole probability distribution doesn't change with time, we call it strict stationarity. In practice strict stationarity is difficult to achieve. What we can achieve is a process with a constant mean and constant variance, meaning the first two moments of the distribution don't change with time; this is called covariance stationarity. How do we achieve that? By analyzing the returns series instead of the price series. More precisely, we take log returns, since the price distribution is modeled as log normal (which, in fact, it is not).
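A minimal sketch of what covariance stationarity means in practice, using a simulated random-walk price series in place of real data: the price level drifts from window to window, but the log returns keep a roughly constant mean and variance.

```r
# Simulated random-walk prices stand in for a real Close series.
set.seed(1)
price <- 1.30 * exp(cumsum(rnorm(2000, mean = 0, sd = 0.0005)))
lr <- diff(log(price))                       # log returns
half1 <- lr[1:1000]
half2 <- lr[1001:1999]
c(mean(half1), mean(half2))                  # roughly equal means
c(sd(half1), sd(half2))                      # roughly equal variances
c(sd(price[1:1000]), sd(price[1001:2000]))   # price sd differs across windows
```

The first two moments of the return series barely move between the two halves, while the raw price series does not share that property.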
Why do we worry about stationarity? If we want to predict the future, the future needs to look like the past. If it does, we can predict it; if it does not, we cannot. So we want the data to be stationary for better prediction. As said, strict stationarity means the data generating process does not change with time. For example, if we measure the heights of humans and plot them, we will find them roughly normally distributed. Why? If you have taken a course in statistics, you must have been taught the Central Limit Theorem (CLT): when we add a large number of independent, identically distributed random variables, we get a normal distribution.
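A quick CLT sketch in R, illustrative only: each observation is the sum of 30 iid uniform draws, and the resulting distribution is close to a symmetric bell even though a single uniform is flat.

```r
# Sums of 30 iid uniforms look approximately normal (CLT in action).
set.seed(7)
sums <- replicate(5000, sum(runif(30)))
hist(sums, breaks = 50, prob = TRUE, main = "Sum of 30 uniforms")
curve(dnorm(x, mean(sums), sd(sums)), add = TRUE, lwd = 2)
# sample skewness: close to 0 for a symmetric, normal-looking distribution
skew <- mean((sums - mean(sums))^3) / sd(sums)^3
round(skew, 2)
```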
Human heights are produced by a process that has not changed for a very long time, so we get a stationary distribution. But in the financial markets we are dealing with a totally different beast that keeps changing with time. If we want to understand the financial market, we need to understand its microstructure. The smallest price movement in the market is known as a tick. The financial market operates on a millisecond, even microsecond, timeframe, a fact we traders often ignore. Let's now construct tick bars using the 1-minute GBPUSD data that we can easily download from the MT4 History Center.
> data1 <- read.csv("D:/Shared/MarketData/GBPUSD#1.csv", header=FALSE)
> colnames(data1) <- c("Date", "Time", "Open", "High", "Low", "Close", "Volume")
> data3 = data.frame(matrix(NA, nrow=2000, ncol=7))
> colnames(data3) <- c("Date", "Time", "Open", "High", "Low", "Close", "Volume")
> tickVolume = 0.0
> i = 0
> j = 1
> for (k in 1:nrow(data1)) {
+   tickVolume = tickVolume + data1[k,7]
+   if (tickVolume > 985) {
+     i = i + 1
+     data3[i, 7] = tickVolume
+     data3[i, 1] = as.character(data1[j,1])
+     data3[i, 2] = as.character(data1[j,2])
+     data3[i, 3] = data1[j,3]
+     data3[i, 4] = max(data1[j:k,4])
+     data3[i, 5] = min(data1[j:k,5])
+     data3[i, 6] = data1[k,6]
+     tickVolume = 0.0
+     j = k + 1
+   }
+ }
> data3 <- na.omit(data3)
> # close the final, partial bar
> x1 <- nrow(data3) + 1
> data3[x1, 1] <- as.character(data1[j,1])
> data3[x1, 2] <- as.character(data1[j,2])
> data3[x1, 3] <- data1[j,3]
> data3[x1, 4] <- max(data1[(j:nrow(data1)),4])
> data3[x1, 5] <- min(data1[(j:nrow(data1)),5])
> data3[x1, 6] <- data1[(nrow(data1)),6]
> data3[x1, 7] <- sum(data1[(j:nrow(data1)),7])
> dim(data3)
[1] 978   7
> library(quantmod)
> data4 <- as.xts(data3[,-(1:2)],
+                 as.POSIXct(paste(data3[,1], data3[,2]), format='%Y.%m.%d %H:%M'))
> # plot the candle chart
> candleChart(data4, theme='white', type='candles', subset='last 2 days', TA=NULL)
> summary(data3$Volume[-nrow(data3)])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    986     994    1004    1009    1019    1124 
> data4$LR <- diff(log(data4$Close))
> data4 <- na.omit(data4)
> qqnorm(data4$LR)
> qqline(data4$LR)
> # plot the chart with MACD and RSI
> chartSeries(data4, theme='white', type='candles', subset='last 2 days',
+             TA="addMACD(); addRSI(14)")
Now, as you can see, we have improved somewhat. I used 1-minute GBPUSD data to construct tick bars of roughly 1000 ticks, yet the log returns are still skewed with heavy tails, as shown in the QQ plot below. Compare this QQ plot with the QQ plot of the raw log returns: it is better, but we still have outliers in both tails, so the distribution is still not normal. A QQ plot shows how much the empirical distribution deviates from the target distribution, which in our case is the normal distribution. This is what I have done. MT4 does not provide price data below 1 minute; it provides tick data but does not save it. If we want tick data, we need to save it ourselves. Many hedge funds and professional traders do that, but I don't think we need to. I will show you an easy way that frees you from the need to save tick data.
First we use 1-minute OHLCV data to create tick bars. Yes, we can do that. If you check, the M60 timeframe has on average around 1000 ticks per bar. So this is what I do: I use 1-minute GBPUSD OHLCV data and use the Volume column, which counts ticks, to create tick bars containing approximately 1000 ticks each. The R code above does exactly that. We can use 1-minute OHLCV data to create tick bars of 100, 200, 500, 1000 or 2000 ticks, whatever suits our need. Since we start from 1-minute data there will be some variation: sometimes a bar will have 980 ticks, sometimes 1050 ticks, and so on. But the difference will not be much.
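The loop above can be packaged into a small reusable function. This is a sketch, not the exact code used for the plots; the column names (Open, High, Low, Close, Volume) match the earlier code, and the tiny OHLCV data frame at the bottom is synthetic:

```r
# Aggregate 1-minute OHLCV bars into tick bars: accumulate the Volume
# (tick count) column and close a bar once it reaches the threshold.
make_tick_bars <- function(df, threshold = 1000) {
  bars <- list(); j <- 1; vol <- 0
  for (k in seq_len(nrow(df))) {
    vol <- vol + df$Volume[k]
    if (vol >= threshold) {
      bars[[length(bars) + 1]] <- data.frame(
        Open = df$Open[j], High = max(df$High[j:k]),
        Low = min(df$Low[j:k]), Close = df$Close[k], Volume = vol)
      vol <- 0; j <- k + 1
    }
  }
  do.call(rbind, bars)
}

# Tiny synthetic example: six 1-minute bars, a few ticks each
ohlcv <- data.frame(Open   = c(1.30, 1.31, 1.32, 1.31, 1.30, 1.29),
                    High   = c(1.31, 1.32, 1.33, 1.32, 1.31, 1.30),
                    Low    = c(1.29, 1.30, 1.31, 1.30, 1.29, 1.28),
                    Close  = c(1.31, 1.32, 1.31, 1.30, 1.29, 1.30),
                    Volume = c(3, 4, 2, 5, 3, 4))
make_tick_bars(ohlcv, threshold = 6)   # three tick bars of 7 ticks each
```

The same function works on the real 1-minute GBPUSD data frame with `threshold = 1000`.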
Above is the QQ plot for the 1000-tick bars. You can see it is better than the raw log-return QQ plot, where the tails were very heavy. Some heaviness remains, but the QQ plot approximates the normal distribution quite well, and the skewness is less than that of the GBPUSD M30 time-bar log returns. This does not mean the tick-bar log returns are normally distributed; keep that in mind. Time bars have equal time intervals; tick bars don't. A tick bar can take more or less time than the standard time bar. We can also create candlestick charts from tick bars. Below is the candlestick plot of the tick bars we calculated above.
We can also add technical analysis indicators to the tick-bar candlestick charts. Below is the same candlestick chart of tick bars as above, but with the MACD and RSI indicators added. Candlesticks on tick bars form the same patterns as those on standard time-bar charts. If you look at the chart below, you can easily see that we can use candlestick patterns to predict the next price move; the RSI oscillator is showing nice divergence patterns.
As you can see, we can use the usual technical analysis on tick bars. Many pro traders combine ordinary candlestick charts with tick-bar candlestick charts like the one above when making trading decisions. Tick bars are one way of looking at OHLCV data: they take the volume ticks into account, while time bars ignore that information. So tick bars carry different information than time bars. There is another way to look at OHLCV data.
Range Bars
Range bars are another method of sampling the price data. The idea is to create a bar whenever price moves a fixed number of pips. For example, with 10-pip range bars, if the price moves 10 pips up we create an up bar, and if it moves 10 pips down we create a down bar. Range bars remove a lot of the clutter we see on time charts, where some bars barely move while others make a big move. MT4 does not provide range bars by default; you could code an indicator to draw them, but here we will use R and once again resample 1-minute GBPUSD OHLCV data to create range bars.
> data1 <- read.csv("D:/Shared/MarketData/GBPUSD#1.csv", header=FALSE)
> colnames(data1) <- c("Date", "Time", "Open", "High", "Low", "Close", "Volume")
> data3 = data.frame(matrix(NA, nrow=1000, ncol=7))
> colnames(data3) <- c("Date", "Time", "Open", "High", "Low", "Close", "Volume")
> pips = 50
> i = 0
> j = 1
> l = 1
> for (k in 2:nrow(data1)) {
+   rangeBar = 10^4 * abs(data1[k,6] - data1[l,6])
+   if (rangeBar >= pips) {
+     i = i + 1
+     data3[i, 1] = as.character(data1[j,1])
+     data3[i, 2] = as.character(data1[j,2])
+     data3[i, 3] = data1[j,3]
+     data3[i, 4] = max(data1[j:k,4])
+     data3[i, 5] = min(data1[j:k,5])
+     data3[i, 6] = data1[k,6]
+     data3[i, 7] = sum(data1[j:k,7])
+     l = k
+     j = k + 1
+   }
+ }
> head(data3)
        Date  Time    Open    High     Low   Close Volume
1 2019.01.25 07:29 1.31325 1.31380 1.30804 1.30815   6867
2 2019.01.25 10:18 1.30814 1.31319 1.30610 1.31318  20650
3 2019.01.25 16:52 1.31319 1.31848 1.31153 1.31845  12369
4 2019.01.25 20:27 1.31846 1.32163 1.31306 1.31311  62801
5 2019.01.29 09:46 1.31319 1.31890 1.31285 1.31857  19520
6 2019.01.29 15:44 1.31841 1.31986 1.31242 1.31331  10423
> data3 <- na.omit(data3)
> summary(10^4 * diff(data3$Close))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-57.700 -51.300 -50.000  -1.952  50.800  55.600 
> # close the final, partial bar
> x1 <- nrow(data3) + 1
> data3[x1, 1] <- as.character(data1[j,1])
> data3[x1, 2] <- as.character(data1[j,2])
> data3[x1, 3] <- data1[j,3]
> data3[x1, 4] <- max(data1[(j:nrow(data1)),4])
> data3[x1, 5] <- min(data1[(j:nrow(data1)),5])
> data3[x1, 6] <- data1[(nrow(data1)),6]
> data3[x1, 7] <- sum(data1[(j:nrow(data1)),7])
> dim(data3)
[1] 31  7
> tail(data3)
         Date  Time    Open    High     Low   Close Volume
26 2019.02.19 16:49 1.29756 1.30280 1.29741 1.30261   4722
27 2019.02.19 17:55 1.30258 1.30773 1.30110 1.30766  49870
28 2019.02.20 18:11 1.30768 1.31083 1.30253 1.30253  26040
29 2019.02.21 08:35 1.30261 1.30754 1.30252 1.30754   5167
30 2019.02.21 10:20 1.30753 1.30940 1.30242 1.30249  51077
31 2019.02.22 10:31 1.30250 1.30392 1.30078 1.30160   6127
> library(quantmod)
> data4 <- as.xts(data3[,-(1:2)],
+                 as.POSIXct(paste(data3[,1], data3[,2]), format='%Y.%m.%d %H:%M'))
> # plot the candle chart
> candleChart(data4, theme='white', type='candles', subset='last 15 days', TA=NULL)
Above I have created range bars with a range of 50 pips. I could just as easily have created 10-pip range bars: it depends on the trading style. For scalping we should create 10-pip range bars, and for swing trading 50-pip range bars. Each time price moves 50 pips up or down, a new range bar is made. In the code above I checked the size of each bar: the median move is 50 pips, the maximum 55.6 pips and the minimum -57.7 pips, so the bars are roughly the same size. If we had used tick data we could have built more precise range bars, but these are good enough for our analysis. We can also do technical analysis using the quantmod package. Below is the plot of the range bars.
As you can see, all bars have the same length. Some bars have wicks; these occur when price moved but retraced before completing the 50-pip range. Sometimes we get a range bar with a wick of something like 40 pips. Below is the range-bar plot along with the RSI underneath it. You can see a divergence pattern forming. These patterns look very clear when you draw charts with historical data, but when price is moving in real time they are not that reliable; keep this in mind. Range bars can also be analyzed with logistic regression, quadratic discriminant analysis and support vector machines; more on this in a future post.
Reading NinjaTrader Tick Data in R
So far I had used 1-minute OHLCV data downloaded from MT4; I could also have downloaded tick data. This time I will use NinjaTrader instead. Many forex traders are stuck with MT4, but NinjaTrader 8 is a powerful trading platform that many traders ignore. Let's download it and get familiar with it. I connected NinjaTrader to an FXCM demo account and used it to download GBPUSD tick data. I will use this tick data to construct tick bars and see if we get what we want. The downloaded file had more than 200,000 rows, but it turned out that a large part of the data was corrupted, so reading this tick data will also teach us how to deal with bad data.
> # read tick data
> d1 <- read.table("D:/Shared/MarketData/GBPUSD.Last.txt", as.is = TRUE)
> # the third column packs several fields separated by semicolons
> d2 <- read.table(text = d1[[3]], sep = ";")
> data1 <- cbind(d1[1:2], d2)
> colnames(data1) <- c("Date", "Time", "Milliseconds",
+                      "Last", "Bid", "Ask", "Volume")
> head(data1)
      Date   Time Milliseconds    Last     Bid     Ask Volume
1 20190218 190000       920000 1.29248 1.29248 1.29283      4
2 20190218 190000      4420000 1.29248 1.29248 1.29283      2
3 20190218 190001      1550000 1.29248 1.29248 1.29283      2
4 20190218 190001      9420000 1.29248 1.29248 1.29289     10
5 20190218 190002      4920000 1.29248 1.29248 1.29285      9
6 20190218 190003       410000 1.29249 1.29249 1.29285     29
> tail(data1)
           Date   Time Milliseconds    Last     Bid     Ask Volume
207040 20190221 151748      1930000 1.30582 1.30582 1.30588      2
207041 20190221 151748      9300000 1.30583 1.30583 1.30588      1
207042 20190221 151749      1590000 1.30583 1.30583 1.30588      2
207043 20190221 151751       490000 1.30584 1.30584 1.30587      2
207044 20190221 151751      1960000 1.30584 1.30584 1.30588      1
207045 20190221 151752      6250000 1.30582 1.30582 1.30587      5
> dim(data1)
[1] 207045      7
The tick data file I downloaded from NinjaTrader was in a very unfamiliar format, but R is powerful and I was able to read it using the few tricks shown in the code above. As you can see, the data has roughly 200,000 rows covering just a few days, so you can see how much tick data is created in each trading session. The file is a text file, quite different in structure from the OHLCV CSV files we download from the MT4 History Center. Did you notice the column named Milliseconds? Most tick data is time-stamped in milliseconds: in each second there can be many updates of the price. R can easily deal with millisecond data.
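As a small illustration of how R handles sub-second timestamps, here is a sketch using base R's %OS format specifier; the timestamp strings are made up to mimic the combined date/time/millisecond values in this data:

```r
# Parse timestamps with fractional seconds using %OS.
options(digits.secs = 3)   # print milliseconds when showing POSIXct values
ts  <- as.POSIXct("20190218 190000.920", format = "%Y%m%d %H%M%OS", tz = "UTC")
ts2 <- as.POSIXct("20190218 190001.154", format = "%Y%m%d %H%M%OS", tz = "UTC")
format(ts, "%Y-%m-%d %H:%M:%OS3")   # date-time shown with milliseconds
# sub-second spacing between ticks is preserved:
round(as.numeric(difftime(ts2, ts, units = "secs")), 3)   # 0.234 seconds
```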
> library(quantmod)
> options("digits.secs" = 7)   # show milliseconds
> data2 <- data1
> # combine Date, Time and Milliseconds into one DateTime column
> data2$DateTime <- paste(data2[,2], data2[,3], sep=".")
> data2$DateTime <- paste(data2[,1], data2[,8])
> data2 <- data2[, c(8,4,5,6,7)]
> head(data2)
                  DateTime    Last     Bid     Ask Volume
1   20190218 190000.920000 1.29248 1.29248 1.29283      4
2  20190218 190000.4420000 1.29248 1.29248 1.29283      2
3  20190218 190001.1550000 1.29248 1.29248 1.29283      2
4  20190218 190001.9420000 1.29248 1.29248 1.29289     10
5  20190218 190002.4920000 1.29248 1.29248 1.29285      9
6   20190218 190003.410000 1.29249 1.29249 1.29285     29
> tail(data2, 10)
                       DateTime    Last     Bid     Ask Volume
207036 20190221 151744.6050000 1.30582 1.30582 1.30586      1
207037 20190221 151746.2260000 1.30583 1.30583 1.30588      3
207038 20190221 151746.7730000 1.30583 1.30583 1.30587     23
207039  20190221 151747.580000 1.30583 1.30583 1.30587      2
207040 20190221 151748.1930000 1.30582 1.30582 1.30588      2
207041 20190221 151748.9300000 1.30583 1.30583 1.30588      1
207042 20190221 151749.1590000 1.30583 1.30583 1.30588      2
207043  20190221 151751.490000 1.30584 1.30584 1.30587      2
207044 20190221 151751.1960000 1.30584 1.30584 1.30588      1
207045 20190221 151752.6250000 1.30582 1.30582 1.30587      5
> str(data2)
'data.frame':   207045 obs. of  5 variables:
 $ DateTime: chr  "20190218 190000.920000" "20190218 190000.4420000" ...
 $ Last    : num  1.29 1.29 1.29 1.29 1.29 ...
 $ Bid     : num  1.29 1.29 1.29 1.29 1.29 ...
 $ Ask     : num  1.29 1.29 1.29 1.29 1.29 ...
 $ Volume  : int  4 2 2 10 9 29 10 6 8 10 ...
> sum(is.na(data2))
[1] 0
> dim(data2)
[1] 207045      5
> data2 <- na.omit(data2)
> dim(data2)
[1] 207045      5
> sum(is.na(data2[ ,1]))
[1] 0
Data preprocessing, also known as data cleaning, is very important; it is one of the most important lessons you learn in data science. Above I checked for rows with missing data but couldn't find any. I have combined the first three columns into a DateTime column; now I need to convert this column into a POSIXct object, which I do below.
> data2[,1] <- as.POSIXct(data2[,1], format="%Y%m%d %H%M%OS")
> # rows whose timestamp failed to parse are now NA
> newDF <- subset(data2, is.na(data2$DateTime))
> head(newDF)
      DateTime    Last     Bid     Ask Volume
12850     <NA> 1.29229 1.29229 1.29233      2
12851     <NA> 1.29226 1.29226 1.29230      9
12852     <NA> 1.29228 1.29228 1.29230      2
12853     <NA> 1.29228 1.29228 1.29231      1
12854     <NA> 1.29228 1.29228 1.29231      3
12855     <NA> 1.29220 1.29220 1.29224      9
> tail(newDF)
       DateTime    Last     Bid     Ask Volume
183612     <NA> 1.30664 1.30664 1.30670     21
183613     <NA> 1.30665 1.30665 1.30668     26
183614     <NA> 1.30666 1.30666 1.30668      6
183615     <NA> 1.30667 1.30667 1.30667      3
183616     <NA> 1.30665 1.30665 1.30669     13
183617     <NA> 1.30665 1.30665 1.30667      5
> data2 <- na.omit(data2)
> # inspect the raw rows that failed to parse: truncated Time values
> data1[12850:12855, ]
           Date Time Milliseconds    Last     Bid     Ask Volume
12850  20190219    0        60000 1.29229 1.29229 1.29233      2
12851  20190219    0      4880000 1.29226 1.29226 1.29230      9
12852  20190219    0      6580000 1.29228 1.29228 1.29230      2
12853  20190219    1      2880000 1.29228 1.29228 1.29231      1
12854  20190219    2       660000 1.29228 1.29228 1.29231      3
12855  20190219    3      2610000 1.29220 1.29220 1.29224      9
> data1[183612:183617, ]
           Date  Time Milliseconds    Last     Bid     Ask Volume
183612 20190221 95957      1110000 1.30664 1.30664 1.30670     21
183613 20190221 95957      6840000 1.30665 1.30665 1.30668     26
183614 20190221 95958      2440000 1.30666 1.30666 1.30668      6
183615 20190221 95958      6810000 1.30667 1.30667 1.30667      3
183616 20190221 95959      3350000 1.30665 1.30665 1.30669     13
183617 20190221 95959      8300000 1.30665 1.30665 1.30667      5
As you can see above, we do have bad rows: many rows failed to parse and have a missing timestamp. We need to remove these rows from our data, so I remove them.
> data2 <- data2[-(1:184999), ]
> data3 <- xts(data2[ ,-1],
+              order.by = as.POSIXct(data2[ ,1], format="%Y%m%d %H%M%OS"))
> head(data3)
                           Last     Bid     Ask Volume
2019-02-21 10:19:02.483 1.30641 1.30641 1.30643     11
2019-02-21 10:19:03.269 1.30641 1.30641 1.30644      6
2019-02-21 10:19:03.619 1.30640 1.30640 1.30644      4
2019-02-21 10:19:04.279 1.30641 1.30641 1.30642      1
2019-02-21 10:19:04.457 1.30641 1.30641 1.30643      1
2019-02-21 10:19:05.729 1.30641 1.30641 1.30642      1
> str(data3)
An 'xts' object on 2019-02-21 10:19:02.483/2019-02-21 15:17:52.625 containing:
  Data: num [1:22046, 1:4] 1.31 1.31 1.31 1.31 1.31 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:4] "Last" "Bid" "Ask" "Volume"
  Indexed by objects of class: [POSIXct,POSIXt] TZ:
  xts Attributes: NULL
> dim(data3)
[1] 22046     4
Finally we have cleaned the bad data and made an xts object. There were a lot of rows with bad tick data, and I removed them; now I have about 22,000 rows of good data converted into an xts object. As you can see above, the timestamp is now in milliseconds: we have three decimal places for each second. Below I convert this tick data into tick bars.
> data3$TickBars <- NA
> tickVolume <- 0
> for (k in 1:nrow(data3)) {
+   # accumulate tick counts from the Volume column (column 4 of data3);
+   # note: the original loop summed data1[k,4] (the Last price) by mistake
+   tickVolume <- tickVolume + as.numeric(data3[k, 4])
+   if (tickVolume >= 1000) {
+     data3[k, 5] <- tickVolume   # this row closes a 1000-tick bar
+     tickVolume <- 0
+   }
+ }
> data4 <- na.omit(data3)
> data4$LR <- diff(log(data4$Last))
> data4 <- na.omit(data4)
> qqnorm(data4$LR)
> qqline(data4$LR)
> # the row with the largest number of ticks
> data2[which(data2$Volume == max(data2$Volume)), ]
                       DateTime    Last     Bid     Ask Volume
191866 20190221 120322.9720000 1.30684 1.30684 1.30691    164
It is always a good idea to check the tick data. If we do, we find a row with 164 ticks. Having 164 ticks in a single row is a sign that something unusual was happening at that time. Let's investigate what was going on around it.
> data2[191860:191870, ]
                       DateTime    Last     Bid     Ask Volume
191860 20190221 120319.1070000 1.30626 1.30626 1.30632     40
191861 20190221 120319.6860000 1.30604 1.30604 1.30609     55
191862 20190221 120320.2140000 1.30603 1.30603 1.30607     25
191863 20190221 120320.7790000 1.30609 1.30609 1.30616     51
191864 20190221 120321.3360000 1.30615 1.30615 1.30617     28
191865 20190221 120321.8890000 1.30636 1.30636 1.30640     73
191866 20190221 120322.9720000 1.30684 1.30684 1.30691    164
191867 20190221 120323.5140000 1.30671 1.30671 1.30675     66
191868  20190221 120324.670000 1.30642 1.30642 1.30647     67
191869 20190221 120324.6350000 1.30644 1.30644 1.30647     31
191870 20190221 120325.1790000 1.30642 1.30642 1.30645     29
You can see the high volume above: each row has 40, 50 or more ticks. Below is a plot of the time difference in seconds between the trades.
Normal tick activity is not that high. Below is a histogram of the tick activity.
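The two plots referenced here can be produced along these lines; this is a sketch in which simulated exponential inter-tick gaps stand in for the parsed DateTime column of the real data:

```r
# Simulated tick arrival times: exponential gaps stand in for real data.
set.seed(3)
gaps  <- rexp(5000, rate = 2)                     # seconds between ticks
times <- as.POSIXct("2019-02-21 12:00:00", tz = "UTC") + cumsum(gaps)
dt <- as.numeric(diff(times))                     # inter-trade times, seconds
plot(dt, type = "h", ylab = "Seconds between ticks")
hist(dt, breaks = 50, main = "Tick activity")     # most gaps are very short
summary(dt)
```

Bursts of activity show up as clusters of near-zero gaps, which is exactly what the high-volume rows above produce in the real data.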
Six banks were fined $3 billion for currency market manipulation. Is this a sign of manipulation? We have no idea. Below is the normal tick activity, where each row has 1-5 ticks.
> tail(data2, 10)
                       DateTime    Last     Bid     Ask Volume
207036 20190221 151744.6050000 1.30582 1.30582 1.30586      1
207037 20190221 151746.2260000 1.30583 1.30583 1.30588      3
207038 20190221 151746.7730000 1.30583 1.30583 1.30587     23
207039  20190221 151747.580000 1.30583 1.30583 1.30587      2
207040 20190221 151748.1930000 1.30582 1.30582 1.30588      2
207041 20190221 151748.9300000 1.30583 1.30583 1.30588      1
207042 20190221 151749.1590000 1.30583 1.30583 1.30588      2
207043  20190221 151751.490000 1.30584 1.30584 1.30587      2
207044 20190221 151751.1960000 1.30584 1.30584 1.30588      1
207045 20190221 151752.6250000 1.30582 1.30582 1.30587      5
As you can see, most rows carry fewer than 5 ticks, so there may be a problem in the data. Now below is the QQ plot of the tick data. Again there are outliers in both tails. If we removed these outliers we would have a normal distribution, but I would not recommend removing them.
Reading NinjaTrader Tick Data Using Tidyverse
The tidyverse is a collection of R packages that is fast and powerful. It uses the tibble, which differs a bit from the traditional R data frame. I had read the tick data using a standard R data frame; let's read it once again using the more modern tidyverse, just for educational purposes.
> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.1.0     v purrr   0.2.5
v tibble  2.0.1     v dplyr   0.7.8
v tidyr   0.8.2     v stringr 1.3.1
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::first()  masks xts::first()
x dplyr::lag()    masks stats::lag()
x dplyr::last()   masks xts::last()
> d1 <- read_delim("D:/Shared/MarketData/GBPUSD.Last.txt",
+                  delim=" ", col_names=FALSE)
Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_character()
)
> d2 <- read_delim(d1[[3]], delim=";", col_names=FALSE)
> data1 <- cbind(d1[, 1:2], d2)
> colnames(data1) <- c("Date", "Time", "Milliseconds",
+                      "Last", "Bid", "Ask", "Volume")
> data2 <- data1
> data2$DateTime <- paste(data2[,2], data2[,3], sep=".")
> data2$DateTime <- paste(data2[,1], data2[,8])
> data2 <- data2[, c(8,4,5,6,7)]
> head(data2)
                  DateTime    Last     Bid     Ask Volume
1 20190218 190000.0920000 1.29248 1.29248 1.29283      4
2 20190218 190000.4420000 1.29248 1.29248 1.29283      2
3 20190218 190001.1550000 1.29248 1.29248 1.29283      2
4 20190218 190001.9420000 1.29248 1.29248 1.29289     10
5 20190218 190002.4920000 1.29248 1.29248 1.29285      9
6 20190218 190003.0410000 1.29249 1.29249 1.29285     29
> str(data2)
'data.frame':   207045 obs. of  5 variables:
 $ DateTime: chr  "20190218 190000.0920000" "20190218 190000.4420000" ...
 $ Last    : num  1.29 1.29 1.29 1.29 1.29 ...
 $ Bid     : num  1.29 1.29 1.29 1.29 1.29 ...
 $ Ask     : num  1.29 1.29 1.29 1.29 1.29 ...
 $ Volume  : num  4 2 2 10 9 29 10 6 8 10 ...
> library(lubridate)
Attaching package: 'lubridate'
The following object is masked from 'package:base':
    date
> data2$DateTime <- ymd_hms(data2$DateTime)
Warning message:
66157 failed to parse.
> head(data2)
                 DateTime    Last     Bid     Ask Volume
1 2019-02-18 19:00:00.092 1.29248 1.29248 1.29283      4
2 2019-02-18 19:00:00.441 1.29248 1.29248 1.29283      2
3 2019-02-18 19:00:01.154 1.29248 1.29248 1.29283      2
4 2019-02-18 19:00:01.941 1.29248 1.29248 1.29289     10
5 2019-02-18 19:00:02.492 1.29248 1.29248 1.29285      9
6 2019-02-18 19:00:03.040 1.29249 1.29249 1.29285     29
> dim(data2)
[1] 207045      5
You can see the warning message above saying that 66157 rows failed to parse because the data was not in the proper format. In this post I showed how we can easily work with tick data; tick bars have better statistical properties than time bars.
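As a footnote, rows that fail to parse simply become NA and can be dropped. Here is a base-R sketch of that idea (mirroring what ymd_hms does) with made-up timestamp strings, one deliberately malformed:

```r
# Failed timestamp parses become NA; filter them out before analysis.
raw <- c("20190218 190000.92", "bad_timestamp", "20190218 190001.15")
parsed <- as.POSIXct(raw, format = "%Y%m%d %H%M%OS", tz = "UTC")
sum(is.na(parsed))              # one row failed to parse
clean <- parsed[!is.na(parsed)] # keep only the valid timestamps
length(clean)                   # two rows survive
```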