How To Read Tick Data & Create Tick Bars & Range Bars in R?

The currency market is a highly complex, decentralized system of exchanges connecting brokers with the interbank market where the big banks operate. Electronic Broking Services (EBS) and Reuters are the two big interbank platforms that hold most of the market share. The minimum ticket size required for doing business on these two platforms is 1 million units, and they are the major providers of liquidity in the interbank market. It is here that the big high frequency trading (HFT) players operate. The minimum one-million ticket size is a business opportunity for other players, who provide services through ECNs to the retail segment of the market. Unlike EBS and Reuters, these ECNs provide full anonymity to the parties. Did you read the post on how to use state space models in trading?

EBS is the main venue where the USD, EUR, GBP, JPY and CHF crosses trade in the interbank market. EBS provides two data feeds. One provides a snapshot of the order book every 250 milliseconds. The other is the premium EBS Live feed, which provides a snapshot of the order book every 100 milliseconds. Keep in mind that we retail traders never get the chance to see the order book. That privilege is reserved for the big players, which gives them a big advantage over us. Big players who can afford the minimum one-million ticket size trade on EBS, and Reuters works similarly. This should give you an idea of how fast the forex market moves: it moves in milliseconds, while the smallest time bar we have on MT4 is 1 minute. Read this post on NZDUSD long trade with 6 pip SL and 250 pip TP.

If you have been trading and know a little programming in MQL4, you can see the ticks coming in. The problem is that MT4 does not save this tick data. To keep it, you would need to collect the ticks from MT4 and save them in a CSV file; in a few hours you can have a few million rows. Hedge funds and other big players do save tick data, but we don't need to do that, and it can be costly. What we need is a method that extracts the same information from readily available data that we would otherwise extract from the tick data. I will show you how to do that in this post. But before we proceed, I want to give you an idea of the importance of tick data: it carries a lot of information, and missing that information means more losing trades. Learn how to do object oriented programming in MQL5.

If you have been trading for some time, you must have been using time bars. Have you ever thought of using tick bars and range bars? Maybe you haven't given them much thought. A time bar is formed after a fixed period of time, for example 1 hour, 30 minutes, 15 minutes, 1 day, 1 week or 1 month. A tick bar, on the other hand, is formed after a fixed number of trades, for example 100, 200, 400 or 1000 trades. A tick is recorded whenever a trade takes place. Now, the currency market is an over-the-counter market with no central exchange, meaning it is a segmented market with different banks forming the segments. So there is no central clearing authority like the one in the futures market or the stock market, where full volume information is available. What this means is that we don't have trade volume information in forex; we only get the tick data. Learn how to trade with the Aroon Indicator. Before we get to tick bars, the sketch below shows, for contrast, how ordinary time bars are resampled in R.
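Here is a minimal sketch of time bar resampling, using the same MT4 CSV layout and the quantmod/xts packages that appear throughout this post; the file path is the same assumed location used in the code further below.

# A minimal sketch: resample M1 OHLCV data into H1 time bars.
# Assumes the MT4 CSV layout used throughout this post.
library(quantmod)
data1 <- read.csv("D:/Shared/MarketData/GBPUSD#1.csv", header = FALSE)
colnames(data1) <- c("Date", "Time", "Open", "High", "Low", "Close", "Volume")
m1 <- as.xts(data1[, -(1:2)],
             as.POSIXct(paste(data1[, 1], data1[, 2]),
                        format = '%Y.%m.%d %H:%M'))
# to.period() folds every hour of M1 bars into one OHLCV bar
h1 <- to.period(m1, period = "hours", k = 1)
head(h1)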

Tick Bars

A tick is recorded whenever a trade takes place, but in forex we don't know the volume of the trade; we only know the number of trades that took place in a certain time period. Time bars are created after a fixed time period like 1 hour or 4 hours, and a single time bar can contain many trades or very few. This is important to understand because it relates to the concept of volatility. Volatility is price movement: if the price movement range is large we have high volatility, and if it is small we have low volatility. On a traditional time chart, some candlesticks contain few trades while others contain many trades, meaning many ticks. So we get periods of clustered low volatility followed suddenly by clustered high volatility.

# Extract prices and compute statistics
data1 <- read.csv("D:/Shared/MarketData/GBPUSD#1.csv",
                  header = FALSE)
colnames(data1) <- c("Date", "Time", "Open", "High", "Low",
                     "Close", "Volume")
mean_prices <- round(mean(data1$Close), 2)
sd_prices <- round(sd(data1$Close), 2)
# Plot the histogram along with a legend
hist(data1$Close, breaks = 100, prob = TRUE, cex.main = 0.9)
abline(v = mean_prices, lwd = 2)
legend("topright", cex = 0.8, border = NULL, bty = "n",
       paste("mean=", mean_prices, "; sd=", sd_prices))

As you can see in the plot, the price distribution is multimodal with varying mean and varying standard deviation. What this means is that the price distribution is not stationary: time bars have volatility that varies from bar to bar. Another problem we face is that time bar returns have heavy tails. The return distribution of time bars is not normal; it is heavy tailed and skewed.

GBPUSD Price Distribution

Above is the histogram of the GBPUSD price. As you can see, the price distribution is multimodal with many peaks. Dealing with a multimodal distribution is not easy, as it is more complex than the usual normal (Gaussian) distribution. The plot above represents price over one particular period of time; if we choose a different period we will get a different histogram, meaning the mean and standard deviation of the price distribution keep varying. In short, the price distribution is not stationary. Quants always find it easier to deal with log returns, so let's plot the QQ plot for the GBPUSD M30 time bars and see how skewed the log returns are.

# Read the GBPUSD M30 OHLCV data exported from MT4
data1 <- read.csv("D:/Shared/MarketData/GBPUSD#30.csv",
                  header = FALSE)
colnames(data1) <- c("Date", "Time", "Open", "High", "Low",
                     "Close", "Volume")
library(quantmod)
# Convert to an xts object indexed by date and time
data2 <- as.xts(data1[, -(1:2)],
                as.POSIXct(paste(data1[, 1], data1[, 2]),
                           format = '%Y.%m.%d %H:%M'))
# Create the log returns column and drop the leading NA
data2$LR <- diff(log(data2$Close))
data2 <- na.omit(data2)
# QQ plot of the log returns against the normal distribution
qqnorm(data2$LR)
qqline(data2$LR)
> tail(data1)
           Date  Time    Open    High     Low   Close Volume
3513 2019.03.12 05:30 1.32135 1.32195 1.32125 1.32186    328
3514 2019.03.12 06:00 1.32185 1.32194 1.32154 1.32165    330
3515 2019.03.12 06:30 1.32166 1.32168 1.32112 1.32138    363
3516 2019.03.12 07:00 1.32145 1.32147 1.32025 1.32035    394
3517 2019.03.12 07:30 1.32036 1.32098 1.32013 1.32042    509
3518 2019.03.12 08:00 1.32041 1.32118 1.32032 1.32116    383

As you can see in the plot below, the log returns are highly skewed. A QQ plot is made by plotting the empirical distribution against the distribution that we want to use to model it. The normal distribution has nice properties and is easy to deal with mathematically, but as the QQ plot below shows, the log returns are not normally distributed: they have heavy tails on both sides. This makes statistical analysis of time bar log returns difficult.

GBPUSD Log Returns Distribution

Since the log returns of time bars are highly skewed and not normally distributed, we cannot make predictions on them using standard statistical concepts like stationarity and ergodicity. Read the post on how to use Random Forest in algorithmic trading. To put numbers on what the QQ plot shows, we can compute the sample skewness and kurtosis, as in the sketch below.
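A minimal sketch for quantifying the heavy tails, assuming the data2$LR column built above; the moments package is my choice here, not something used elsewhere in this post.

# A minimal sketch: quantify the non-normality seen in the QQ plot.
# Assumes data2$LR (M30 log returns) from above and the moments package.
library(moments)
lr <- as.numeric(data2$LR)
skewness(lr)    # roughly 0 for a normal distribution
kurtosis(lr)    # roughly 3 for a normal distribution; heavy tails push it up
jarque.test(lr) # Jarque-Bera normality test based on these two moments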

Stationarity

Stationarity means that the data generating process does not change with time. When the entire probability distribution doesn't change with time, we call it strict stationarity. In practice strict stationarity is difficult to achieve. What we can achieve is a process with constant mean and constant variance, meaning the first two moments of the distribution don't change with time; we call this covariance stationarity. How do we achieve that? By analyzing the returns series instead of the price series, or rather the log returns, since the price distribution is modeled as log normal (which in fact it is not).

Why do we worry about stationarity? If we want to predict the future, we need the future to look like the past: if it does, we can predict it, and if it doesn't, we cannot. So we want the data to be stationary for better prediction. As said, strict stationarity means that the data generating process does not change with time. For example, if we measure the heights of humans and plot them, we find them roughly normally distributed. Why? If you have taken a course in statistics you must have been taught the Central Limit Theorem (CLT): when we add a large number of independent and identically distributed random variables, the sum is approximately normally distributed. A rough way to check covariance stationarity is shown in the sketch below.
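One way to eyeball covariance stationarity is to compute the mean and standard deviation of the log returns over rolling windows and see whether they stay roughly constant. A minimal sketch, assuming the data2 xts object with the LR column built earlier; the 200-bar window is an arbitrary choice.

# A minimal sketch: rolling first two moments of the M30 log returns.
# Assumes the data2 xts object from above; zoo is loaded by quantmod.
library(zoo)
roll_mean <- rollapply(data2$LR, width = 200, FUN = mean)
roll_sd <- rollapply(data2$LR, width = 200, FUN = sd)
# For a covariance stationary series both lines would hover around
# constant levels instead of drifting
plot(merge(roll_mean, roll_sd))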

Human heights are produced by a process that has changed very little over a very long time, so we get a stationary distribution. But in the financial markets we are dealing with a totally different beast that keeps changing with time. If we want to understand the financial market we need to understand its microstructure. The smallest price movement in the market is known as a tick. The financial market operates on the millisecond, or rather the microsecond, timeframe, a fact that we traders often ignore. Let's now construct tick bars using the 1 minute GBPUSD data that we can easily download from the MT4 History Center.

> data1 <- read.csv("D:/Shared/MarketData/GBPUSD#1.csv", 
+ header=FALSE) 
> colnames(data1) <- c("Date", "Time", "Open", "High", "Low", 
+ "Close", "Volume") 
> data3=data.frame(matrix(NA, nrow=2000, ncol=7))
> colnames(data3) <- c("Date", "Time", "Open", "High", "Low", 
+ "Close", "Volume") 
> tickVolume=0.0
> i=0
> j=1
> for( k in 1:nrow(data1)){
+ tickVolume = tickVolume + data1[k,7]
+ if (tickVolume > 985){
+ i=i+1
+ data3[i , 7]=tickVolume
+ data3[i, 1]=as.character(data1[j,1])
+ data3[i, 2]=as.character(data1[j,2])
+ data3[i, 3]=data1[j,3]
+ data3[i, 4]=max(data1[j:k,4])
+ data3[i, 5]=min(data1[j:k,5])
+ data3[i, 6]=data1[k,6]
+ tickVolume=0.0
+ j=k+1
+ #print(k)
+ }
+ }
>
> #head(data3)
> data3 <-na.omit(data3) 
> x1 <-nrow(data3)+1 
> data3[x1, 1] <- as.character(data1[j,1]) 
> data3[x1, 2] <- as.character(data1[j,2]) 
> data3[x1, 3] <- data1[j,3] 
> data3[x1, 4] <- max(data1[(j:nrow(data1)),4]) 
> data3[x1, 5] <- min(data1[(j:nrow(data1)),5]) 
> data3[x1, 6] <- data1[(nrow(data1)),6] 
> data3[x1, 7] <- sum(data1[(j:nrow(data1)),7]) 
> dim(data3)
[1] 978   7
> #head(data1)
> #tail(data3)
> #tail(data1)
> library(quantmod)
> data4 <- as.xts(data3[,-(1:2)], 
+ as.POSIXct(paste(data3[,1],data3[,2]), 
+ format='%Y.%m.%d %H:%M'))
> #plot the candle chart
> candleChart(data4,theme='white', type='candles',
+             subset='last 2 days', TA=NULL)
> summary(data3$Volume[-nrow(data3)])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    986     994    1004    1009    1019    1124
> data4$LR <- diff(log(data4$Close)) 
> data4 <- na.omit(data4)
> qqnorm(data4$LR)
> qqline(data4$LR)
> #plot the chart with MACD and RSI
> chartSeries(data4, theme='white', type='candles',
+             subset='last 2 days',
+             TA="addMACD(); addRSI(14)")

Now, as you can see, we have improved somewhat. Since I have used 1 minute GBPUSD data to construct tick bars of roughly 1000 ticks each, the log returns are still skewed with heavy tails, as shown in the QQ plot below. Compare it with the QQ plot of the raw time bar log returns: it is better, but we still have outliers in both tails, so the distribution is still not normal. A QQ plot shows how much the empirical distribution deviates from the target distribution, which is the normal distribution in our case. Here is what I have done. MT4 does not provide price data below 1 minute; it does provide ticks, but it does not save them, so if we want tick data we need to save it ourselves. Many hedge funds and professional traders do exactly that. I don't think we need to, and I will show you an easy way that frees you from the need to save tick data.

First we use 1 minute OHLCV data to create tick bars. Yes, we can do that. If you check, the M60 (hourly) bars have on average about 1000 ticks of volume. So this is what I do: I use the 1 minute GBPUSD OHLCV data and its volume column to create tick bars that have approximately 1000 ticks in them. The R code above does that. We can use 1 minute OHLCV data to create tick bars of 100, 200, 500, 1000 or 2000 ticks, whatever suits our need. Since we start from 1 minute data we will have some variation, meaning sometimes a bar will have 980 ticks and sometimes 1050 ticks, but the difference will not be much. The loop can also be wrapped into a reusable function, sketched below.
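A sketch of that function; make_tick_bars is my own name, not a package function, and it assumes the Date, Time, Open, High, Low, Close, Volume column layout used throughout this post.

# A minimal sketch: the tick-bar loop above as a reusable function.
# make_tick_bars() is a made-up name; data must have the MT4 column
# layout used in this post.
make_tick_bars <- function(data, ticks = 1000) {
  bars <- data.frame()
  tickVolume <- 0
  j <- 1
  for (k in 1:nrow(data)) {
    tickVolume <- tickVolume + data[k, 7]
    if (tickVolume >= ticks) {
      bars <- rbind(bars, data.frame(
        Date = as.character(data[j, 1]), Time = as.character(data[j, 2]),
        Open = data[j, 3], High = max(data[j:k, 4]),
        Low = min(data[j:k, 5]), Close = data[k, 6], Volume = tickVolume))
      tickVolume <- 0
      j <- k + 1
    }
  }
  bars
}
# e.g. 500-tick bars: bars500 <- make_tick_bars(data1, ticks = 500)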

GBPUSD Tick Bar Log Returns Distribution

Above is the QQ plot for the 1000-tick bars. You can see it is better than the raw log returns QQ plot, where the tails were very heavy. Some heaviness is still there, but the QQ plot now approximates the normal distribution quite well, and the skewness is less than that of the GBPUSD M30 time bar log returns. This does not mean that tick bar log returns are normally distributed; you need to keep that in mind. Also remember that time bars have equal time intervals while tick bars don't: a tick bar can take more or less time to form than a standard time bar. We can also create candlestick charts out of tick bars. Below is the candlestick plot of the tick bars that we calculated above.

GBPUSD Tick Chart
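As noted above, tick bars are not equally spaced in time, and it is instructive to check how long each bar took to form. A quick sketch, assuming the data4 xts object of tick bars created above.

# A minimal sketch: how many minutes does each ~1000-tick bar take?
# Assumes the data4 xts tick-bar object created above.
bar_minutes <- as.numeric(diff(index(data4)), units = "mins")
summary(bar_minutes)
hist(bar_minutes, breaks = 50, main = "Minutes per tick bar",
     xlab = "Minutes")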

Learn how to identify market reversals on daily and weekly charts. We can also add technical analysis indicators to tick bar candlestick charts. Below is the same candlestick chart of tick bars as above, but with the MACD and RSI indicators added to the plot. Candlesticks on tick bars form the same patterns as those on standard time bar charts. If you look at the chart below, you can easily see that we can use candlestick patterns to predict the next price move, and the RSI oscillator is showing nice divergence patterns.

GBPUSD Tick Chart

You can see that we can use the usual technical analysis when using tick bars. Many pro traders combine ordinary candlestick charts with candlestick charts of tick bars like the one above when making their trading decisions. Tick bars are one way of looking at OHLCV data: they take the volume ticks into account, while time bars ignore that information, so tick bars carry different information than time bars. There is another way to look at OHLCV data.

Range Bars

Range bars are another method of sampling the price data. The idea is to create a bar whenever price moves a fixed number of pips. For example, with a 10 pip range, we create an up bar when price moves 10 pips up and a down bar when price moves 10 pips down. Range bars remove a lot of the clutter we see on time charts, where some bars barely move while others make a big move. MT4 does not provide range bars by default; you can code an indicator that draws them, but here we will use R and once again build range bars from the 1 minute GBPUSD OHLCV data.

> data1 <- read.csv("D:/Shared/MarketData/GBPUSD#1.csv", 
+ header=FALSE) 
> colnames(data1) <- c("Date", "Time", "Open", "High", "Low", 
+ "Close", "Volume") 
> #tail(data1)
> #dim(data1)
> #str(data1)
> data3=data.frame(matrix(NA, nrow=1000, ncol=7))
> colnames(data3) <- c("Date", "Time", "Open", "High", "Low", 
+ "Close", "Volume") 
> pips=50
> i=0
> j=1
> l=1
> for( k in 2:nrow(data1)){
+ rangeBar=10^4*abs(data1[k ,6]-data1[l,6])
+ if (rangeBar >= pips){
+ i=i+1
+ data3[i, 1]=as.character(data1[j,1])
+ data3[i, 2]=as.character(data1[j,2])
+ data3[i, 3]=data1[j,3]
+ data3[i, 4]=max(data1[j:k,4])
+ data3[i, 5]=min(data1[j:k,5])
+ data3[i, 6]=data1[k,6]
+ data3[i, 7]=sum(data1[j:k,7])
+ l=k
+ j=k+1
+ #print(k)
+ }
+ }
>
> head(data3)
        Date  Time    Open    High     Low   Close Volume
1 2019.01.25 07:29 1.31325 1.31380 1.30804 1.30815   6867
2 2019.01.25 10:18 1.30814 1.31319 1.30610 1.31318  20650
3 2019.01.25 16:52 1.31319 1.31848 1.31153 1.31845  12369
4 2019.01.25 20:27 1.31846 1.32163 1.31306 1.31311  62801
5 2019.01.29 09:46 1.31319 1.31890 1.31285 1.31857  19520
6 2019.01.29 15:44 1.31841 1.31986 1.31242 1.31331  10423
> data3 <-na.omit(data3) 
> summary(10^4*diff(data3$Close))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-57.700 -51.300 -50.000  -1.952  50.800  55.600
> x1 <-nrow(data3)+1 
> data3[x1, 1] <- as.character(data1[j,1]) 
> data3[x1, 2] <- as.character(data1[j,2]) 
> data3[x1, 3] <- data1[j,3] 
> data3[x1, 4] <- max(data1[(j:nrow(data1)),4]) 
> data3[x1, 5] <- min(data1[(j:nrow(data1)),5]) 
> data3[x1, 6] <- data1[(nrow(data1)),6] 
> data3[x1, 7] <- sum(data1[(j:nrow(data1)),7]) 
> dim(data3)
[1] 31  7
> tail(data3)
         Date  Time    Open    High     Low   Close Volume
26 2019.02.19 16:49 1.29756 1.30280 1.29741 1.30261   4722
27 2019.02.19 17:55 1.30258 1.30773 1.30110 1.30766  49870
28 2019.02.20 18:11 1.30768 1.31083 1.30253 1.30253  26040
29 2019.02.21 08:35 1.30261 1.30754 1.30252 1.30754   5167
30 2019.02.21 10:20 1.30753 1.30940 1.30242 1.30249  51077
31 2019.02.22 10:31 1.30250 1.30392 1.30078 1.30160   6127
> library(quantmod)
> data4 <- as.xts(data3[,-(1:2)], 
+ as.POSIXct(paste(data3[,1],data3[,2]), 
+ format='%Y.%m.%d %H:%M'))
> #plot the candle chart
> candleChart(data4,theme='white', type='candles',
+             subset='last 15 days', TA=NULL)

Above I have created range bars with a range of 50 pips. I could just as easily have created 10 pip range bars; it depends. If we want to scalp, we should create 10 pip range bars, and if we want to swing trade, we should create 50 pip range bars. Each time price moves 50 pips up or down, a new range bar is made. In the code above I checked the size of each bar via the close-to-close differences: the median is 50 pips, the maximum 55.6 pips and the minimum -57.7 pips (down bars show as negative), so the bars are roughly the same size. If we had used tick data we could have built more precise range bars, but these are good enough for our analysis. We can also do technical analysis using the quantmod package. Below is the plot of the range bars.

GBPUSD Range Bars

As you can see, all bars have roughly the same length. Some bars do have wicks; these appear when price moved but retraced before completing the 50 pip range, so sometimes we get a wick of something like 40 pips. Below the range bar plot there is also a chart with RSI underneath it (the sketch after this paragraph shows how such a chart can be drawn), and you can see a divergence pattern forming. These patterns look very clear when you draw charts with historical data, but when price is moving in real time they are not that reliable; you should keep this in mind. Range bars can also be analyzed with logistic regression, quadratic discriminant analysis and support vector machines; more on this in a future post. The Kalman filter is used a lot in quantitative finance; if you want to learn about Kalman filters, take a look at my Kalman Filter for Traders course.
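The plotting code for that RSI chart is not shown above; here is a sketch in the same style as the earlier tick-bar chart, assuming the data4 range-bar xts object created above.

# A minimal sketch: range-bar candles with RSI(14) underneath.
# Assumes the data4 range-bar xts object created above.
chartSeries(data4, theme = 'white', type = 'candles',
            subset = 'last 15 days', TA = "addRSI(14)")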

GBPUSD Range Bars

Reading NinjaTrader Tick Data in R

So far I have used 1 minute OHLCV data downloaded from MT4. I could also have downloaded tick data, but this time I will use NinjaTrader. Many forex traders are stuck with MT4; take a look at the new NinjaTrader 8. NinjaTrader is a powerful trading platform that gets ignored by many traders, so let's download it and get familiar with it. I have connected NinjaTrader to an FXCM demo account and used it to download GBPUSD tick data, which we will use to construct tick bars and see if we get what we want. The downloaded file had more than 200,000 rows, but it turned out that a large part of the data was corrupted, so reading this tick data will also teach us how to deal with bad data.

> #read tick data
> d1 <-read.table("D:/Shared/MarketData/GBPUSD.Last.txt", 
+ as.is = TRUE) 
> #d1 <-read.table("D:/Shared/MarketData/EURUSD.Last.txt", 
> #as.is = TRUE)
> #head(d1)
> #tail(d1[[3]])
> d2 <- read.table(text=d1[[3]], sep = ";") 
> #tail(d2)
> data1 <- cbind(d1[1:2], d2) 
> #tail(data1)
> colnames(data1) <- c("Date", "Time","Milliseconds", 
+ "Last", "Bid", "Ask", "Volume") 
> head(data1)
      Date   Time Milliseconds    Last     Bid     Ask Volume
1 20190218 190000       920000 1.29248 1.29248 1.29283      4
2 20190218 190000      4420000 1.29248 1.29248 1.29283      2
3 20190218 190001      1550000 1.29248 1.29248 1.29283      2
4 20190218 190001      9420000 1.29248 1.29248 1.29289     10
5 20190218 190002      4920000 1.29248 1.29248 1.29285      9
6 20190218 190003       410000 1.29249 1.29249 1.29285     29
> tail(data1)
           Date   Time Milliseconds    Last     Bid     Ask Volume
207040 20190221 151748      1930000 1.30582 1.30582 1.30588      2
207041 20190221 151748      9300000 1.30583 1.30583 1.30588      1
207042 20190221 151749      1590000 1.30583 1.30583 1.30588      2
207043 20190221 151751       490000 1.30584 1.30584 1.30587      2
207044 20190221 151751      1960000 1.30584 1.30584 1.30588      1
207045 20190221 151752      6250000 1.30582 1.30582 1.30587      5
> dim(data1)
[1] 207045      7

The tick data file that I downloaded from NinjaTrader was in a very unfamiliar format, but R is powerful and I was able to read the data as shown above, using a few tricks given in the R code. As you can see, the data has roughly 200,000 rows covering only a few days of trading, so you can see how much tick data is created. The file is a text file, quite different in structure from the OHLCV CSV files that we download from the MT4 History Center. Did you notice that we have a column named Milliseconds? Most tick data is timestamped in milliseconds; in each second there can be many updates of the price. R can easily deal with millisecond data.

> library(quantmod)
> options("digits.secs"=7); #Show milliseconds
> data2 <-data1 
> data2$DateTime <- paste(data2[,2],data2[,3], sep=".") 
> data2$DateTime <- paste(data2[,1], data2[,8]) 
> head(data2)
      Date   Time Milliseconds    Last     Bid     Ask Volume
1 20190218 190000       920000 1.29248 1.29248 1.29283      4
2 20190218 190000      4420000 1.29248 1.29248 1.29283      2
3 20190218 190001      1550000 1.29248 1.29248 1.29283      2
4 20190218 190001      9420000 1.29248 1.29248 1.29289     10
5 20190218 190002      4920000 1.29248 1.29248 1.29285      9
6 20190218 190003       410000 1.29249 1.29249 1.29285     29
                 DateTime
1  20190218 190000.920000
2 20190218 190000.4420000
3 20190218 190001.1550000
4 20190218 190001.9420000
5 20190218 190002.4920000
6  20190218 190003.410000
> data2 <- data2[,c(8,4,5,6,7)] 
> head(data2)
                 DateTime    Last     Bid     Ask Volume
1  20190218 190000.920000 1.29248 1.29248 1.29283      4
2 20190218 190000.4420000 1.29248 1.29248 1.29283      2
3 20190218 190001.1550000 1.29248 1.29248 1.29283      2
4 20190218 190001.9420000 1.29248 1.29248 1.29289     10
5 20190218 190002.4920000 1.29248 1.29248 1.29285      9
6  20190218 190003.410000 1.29249 1.29249 1.29285     29
> tail(data2,10)
                      DateTime    Last     Bid     Ask Volume
207036 20190221 151744.6050000 1.30582 1.30582 1.30586      1
207037 20190221 151746.2260000 1.30583 1.30583 1.30588      3
207038 20190221 151746.7730000 1.30583 1.30583 1.30587     23
207039  20190221 151747.580000 1.30583 1.30583 1.30587      2
207040 20190221 151748.1930000 1.30582 1.30582 1.30588      2
207041 20190221 151748.9300000 1.30583 1.30583 1.30588      1
207042 20190221 151749.1590000 1.30583 1.30583 1.30588      2
207043  20190221 151751.490000 1.30584 1.30584 1.30587      2
207044 20190221 151751.1960000 1.30584 1.30584 1.30588      1
207045 20190221 151752.6250000 1.30582 1.30582 1.30587      5
> str(data2)
'data.frame':   207045 obs. of  5 variables:
 $ DateTime: chr  "20190218 190000.920000" "20190218 190000.4420000" 
"20190218 190001.1550000" "20190218 190001.9420000" ...
 $ Last    : num  1.29 1.29 1.29 1.29 1.29 ...
 $ Bid     : num  1.29 1.29 1.29 1.29 1.29 ...
 $ Ask     : num  1.29 1.29 1.29 1.29 1.29 ...
 $ Volume  : int  4 2 2 10 9 29 10 6 8 10 ...
> sum(is.na(data2))
[1] 0
> dim(data2)
[1] 207045      5
> data2 <-na.omit(data2) 
> dim(data2)
[1] 207045      5
> sum(is.na(data2[ ,1]))
[1] 0

Data preprocessing, also known as data cleaning, is very important; it is the most important lesson you learn in doing data science. Above I checked for bad data, meaning rows with missing values, and couldn't find any yet. I have combined the date, time and milliseconds columns into a single DateTime column, which I now need to convert into a POSIXct object, as I do below. Learn how to use Fibonacci in trading.

> data2[,1] <-as.POSIXct(data2[,1], format="%Y%m%d %H%M%OS") 
> newDF <- subset(data2, is.na(data2$DateTime))
> head(newDF)
      DateTime    Last     Bid     Ask Volume
12850      1.29229 1.29229 1.29233      2
12851      1.29226 1.29226 1.29230      9
12852      1.29228 1.29228 1.29230      2
12853      1.29228 1.29228 1.29231      1
12854      1.29228 1.29228 1.29231      3
12855      1.29220 1.29220 1.29224      9
> tail(newDF)
       DateTime    Last     Bid     Ask Volume
183612      1.30664 1.30664 1.30670     21
183613      1.30665 1.30665 1.30668     26
183614      1.30666 1.30666 1.30668      6
183615      1.30667 1.30667 1.30667      3
183616      1.30665 1.30665 1.30669     13
183617      1.30665 1.30665 1.30667      5
> data2 <- na.omit(data2) 
> data1[12850:12855, ]
          Date Time Milliseconds    Last     Bid     Ask Volume
12850 20190219    0        60000 1.29229 1.29229 1.29233      2
12851 20190219    0      4880000 1.29226 1.29226 1.29230      9
12852 20190219    0      6580000 1.29228 1.29228 1.29230      2
12853 20190219    1      2880000 1.29228 1.29228 1.29231      1
12854 20190219    2       660000 1.29228 1.29228 1.29231      3
12855 20190219    3      2610000 1.29220 1.29220 1.29224      9
> data1[183612:183617, ]
           Date  Time Milliseconds    Last     Bid     Ask Volume
183612 20190221 95957      1110000 1.30664 1.30664 1.30670     21
183613 20190221 95957      6840000 1.30665 1.30665 1.30668     26
183614 20190221 95958      2440000 1.30666 1.30666 1.30668      6
183615 20190221 95958      6810000 1.30667 1.30667 1.30667      3
183616 20190221 95959      3350000 1.30665 1.30665 1.30669     13
183617 20190221 95959      8300000 1.30665 1.30665 1.30667      5

As you can see above, we have bad rows: many rows have a broken timestamp. Look at the raw rows printed above; around midnight the Time field reads 0, 1, 2 instead of 000000, 000001, 000002, because the leading zeros were dropped when the column was read as a number, so the "%H%M%OS" format fails to parse and produces NA. We need to remove these rows from our data, which I do below. Do you know that financial markets are being controlled by central banks? Markets versus central banks: who will you bet on?

> data2 <- data2[-(1:184999), ] 
> data3 <- xts(data2[ ,-1], order.by=as.POSIXct(data2[ ,1], 
+ format="%Y%m%d %H%M%OS")) 
> head(data3)
                           Last     Bid     Ask Volume
2019-02-21 10:19:02.483 1.30641 1.30641 1.30643     11
2019-02-21 10:19:03.269 1.30641 1.30641 1.30644      6
2019-02-21 10:19:03.619 1.30640 1.30640 1.30644      4
2019-02-21 10:19:04.279 1.30641 1.30641 1.30642      1
2019-02-21 10:19:04.457 1.30641 1.30641 1.30643      1
2019-02-21 10:19:05.729 1.30641 1.30641 1.30642      1
> str(data3)
An 'xts' object on 2019-02-21 10:19:02.483/2019-02-21 15:17:52.625 containing:
  Data: num [1:22046, 1:4] 1.31 1.31 1.31 1.31 1.31 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:4] "Last" "Bid" "Ask" "Volume"
  Indexed by objects of class: [POSIXct,POSIXt] TZ:
  xts Attributes:
 NULL
> dim(data3)
[1] 22046     4

Finally we have cleaned the bad data and made an xts object. There were a lot of rows with bad tick data; after removing them I have about 22,000 rows of good data converted into xts. If you look above, the timestamp is now in milliseconds: we have three decimal places for each second. Watch this 2 hour documentary and learn how hedge funds are made. Below I convert this tick data into tick bars.

> data3$TickBars=NA
> tickVolume=0
> i=0
> j=1
> for( k in 1:nrow(data3)){
+ tickVolume = tickVolume + data3[k,4]
+ if (tickVolume >=1000){
+ data3[k , 5]=tickVolume
+ tickVolume=0.0
+ j=k+1
+ #print(k)
+ }
+ }
>
> data4 <- na.omit(data3) 
> tail(data4)
                           Last     Bid     Ask Volume TickBars
2019-02-21 15:12:07.246 1.30582 1.30582 1.30588      4 100.6360
2019-02-21 15:12:56.355 1.30571 1.30571 1.30577      7 100.6251
2019-02-21 15:13:53.295 1.30591 1.30591 1.30596     12 100.6262
2019-02-21 15:14:43.684 1.30594 1.30594 1.30597      7 100.6247
2019-02-21 15:15:50.970 1.30584 1.30584 1.30587      1 100.6237
2019-02-21 15:16:53.489 1.30571 1.30571 1.30576      2 100.6301
> dim(data4)
[1] 282   5
> summary(data4$TickBars)[ ,2]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  100.6   100.6   100.8   100.8   100.8   100.8
> data4$LR <- diff(log(data4$Last))
> data4 <- na.omit(data4)
> qqnorm(data4$LR)
> qqline(data4$LR)
> data2[which(data2$Volume==max(data2$Volume)), ]
                      DateTime    Last     Bid     Ask Volume
191866 20190221 120322.9720000 1.30684 1.30684 1.30691    164

It is always a good idea to check the tick data. If we do, we find a row with 164 ticks, a sign that something unusual was happening at that time. Let's investigate what was going on around this timestamp. Learn how a 23 year old trader makes $700K in 1 year.

> data2[191860:191870, ]
                      DateTime    Last     Bid     Ask Volume
191860 20190221 120319.1070000 1.30626 1.30626 1.30632     40
191861 20190221 120319.6860000 1.30604 1.30604 1.30609     55
191862 20190221 120320.2140000 1.30603 1.30603 1.30607     25
191863 20190221 120320.7790000 1.30609 1.30609 1.30616     51
191864 20190221 120321.3360000 1.30615 1.30615 1.30617     28
191865 20190221 120321.8890000 1.30636 1.30636 1.30640     73
191866 20190221 120322.9720000 1.30684 1.30684 1.30691    164
191867 20190221 120323.5140000 1.30671 1.30671 1.30675     66
191868  20190221 120324.670000 1.30642 1.30642 1.30647     67
191869 20190221 120324.6350000 1.30644 1.30644 1.30647     31
191870 20190221 120325.1790000 1.30642 1.30642 1.30645     29

You can see the high volume above: each row has 40, 50 or more ticks. Below is the plot of the time difference in seconds between the trades (the sketch that follows shows how such a plot might be produced).
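The code for these two plots is not shown; here is one way they might have been produced, assuming the data3 xts object of cleaned ticks created above.

# A minimal sketch: seconds between consecutive ticks.
# Assumes the data3 xts object of cleaned tick data from above.
dt <- as.numeric(diff(index(data3)), units = "secs")
plot(dt, type = "h", main = "Seconds between ticks", ylab = "Seconds")
# and the histogram shown a little further below:
hist(dt, breaks = 100, main = "Seconds between ticks", xlab = "Seconds")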

GBPUSD Tick Difference

Normal tick activity is not that high. Below is a histogram of the tick activity.

GBPUSD Tick Difference Histogram

Six banks were fined $3 billion for currency market manipulation. Is this burst of activity a sign of manipulation? We have no idea. Below is the normal tick activity, where each row has 1-5 ticks.

> tail(data2,10)
                      DateTime    Last     Bid     Ask Volume
207036 20190221 151744.6050000 1.30582 1.30582 1.30586      1
207037 20190221 151746.2260000 1.30583 1.30583 1.30588      3
207038 20190221 151746.7730000 1.30583 1.30583 1.30587     23
207039  20190221 151747.580000 1.30583 1.30583 1.30587      2
207040 20190221 151748.1930000 1.30582 1.30582 1.30588      2
207041 20190221 151748.9300000 1.30583 1.30583 1.30588      1
207042 20190221 151749.1590000 1.30583 1.30583 1.30588      2
207043  20190221 151751.490000 1.30584 1.30584 1.30587      2
207044 20190221 151751.1960000 1.30584 1.30584 1.30588      1
207045 20190221 151752.6250000 1.30582 1.30582 1.30587      5

As you can see, most rows carry fewer than 5 ticks, so there can be a problem in the data. Watch these videos on how to exploit high frequency traders. Below is the QQ plot of the tick bar log returns; once again there are outliers in both tails. If we removed these outliers we would have a normal distribution. Can we remove them? I would not recommend that.

GBPUSD Log Return Distribution

Reading NinjaTrader Tick Data Using Tidyverse

The tidyverse is a collection of R packages that is fast and powerful. It uses the tibble, which is a bit different from the traditional R data frame. I read the tick data above using a standard R data frame; let's read it once again using the more advanced tidyverse, just for educational purposes.

> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.1.0     v purrr   0.2.5
v tibble  2.0.1     v dplyr   0.7.8
v tidyr   0.8.2     v stringr 1.3.1
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::first()  masks xts::first()
x dplyr::lag()    masks stats::lag()
x dplyr::last()   masks xts::last()
> d1 <- read_delim("D:/Shared/MarketData/GBPUSD.Last.txt", 
+ delim=" ", col_names=FALSE) 
Parsed with column specification: 
cols( 
X1 = col_double(), 
X2 = col_double(), 
X3 = col_character() ) 
> #dim(d1)
> #head(d1)
> #head(d1[[3]])
> d2 <- read_delim(d1[[3]], delim=";", col_names=FALSE) 
> #dim(d2)
> #head(d2)
> data1 <- cbind(d1[, 1:2], d2) 
> colnames(data1) <- c("Date", "Time","Milliseconds", 
+ "Last", "Bid", "Ask", "Volume") 
> head(data1)
      Date   Time Milliseconds    Last     Bid     Ask Volume
1 20190218 190000      0920000 1.29248 1.29248 1.29283      4
2 20190218 190000      4420000 1.29248 1.29248 1.29283      2
3 20190218 190001      1550000 1.29248 1.29248 1.29283      2
4 20190218 190001      9420000 1.29248 1.29248 1.29289     10
5 20190218 190002      4920000 1.29248 1.29248 1.29285      9
6 20190218 190003      0410000 1.29249 1.29249 1.29285     29
> data2 <-data1 
> data2$DateTime <- paste(data2[,2],data2[,3], sep=".") 
> data2$DateTime <- paste(data2[,1], data2[,8]) 
> head(data2)
      Date   Time Milliseconds    Last     Bid     Ask Volume
1 20190218 190000      0920000 1.29248 1.29248 1.29283      4
2 20190218 190000      4420000 1.29248 1.29248 1.29283      2
3 20190218 190001      1550000 1.29248 1.29248 1.29283      2
4 20190218 190001      9420000 1.29248 1.29248 1.29289     10
5 20190218 190002      4920000 1.29248 1.29248 1.29285      9
6 20190218 190003      0410000 1.29249 1.29249 1.29285     29
                 DateTime
1 20190218 190000.0920000
2 20190218 190000.4420000
3 20190218 190001.1550000
4 20190218 190001.9420000
5 20190218 190002.4920000
6 20190218 190003.0410000
> data2 <- data2[,c(8,4,5,6,7)] 
> head(data2)
                 DateTime    Last     Bid     Ask Volume
1 20190218 190000.0920000 1.29248 1.29248 1.29283      4
2 20190218 190000.4420000 1.29248 1.29248 1.29283      2
3 20190218 190001.1550000 1.29248 1.29248 1.29283      2
4 20190218 190001.9420000 1.29248 1.29248 1.29289     10
5 20190218 190002.4920000 1.29248 1.29248 1.29285      9
6 20190218 190003.0410000 1.29249 1.29249 1.29285     29
> str(data2)
'data.frame':   207045 obs. of  5 variables:
 $ DateTime: chr  "20190218 190000.0920000" "20190218 190000.4420000"
"20190218 190001.1550000" "20190218 190001.9420000" ...
 $ Last    : num  1.29 1.29 1.29 1.29 1.29 ...
 $ Bid     : num  1.29 1.29 1.29 1.29 1.29 ...
 $ Ask     : num  1.29 1.29 1.29 1.29 1.29 ...
 $ Volume  : num  4 2 2 10 9 29 10 6 8 10 ...
> library(lubridate)

Attaching package: 'lubridate'

The following object is masked from 'package:base':

    date

> data2$DateTime <-ymd_hms(data2$DateTime) 
Warning message: 66157 failed to parse. 
> head(data2)
                 DateTime    Last     Bid     Ask Volume
1 2019-02-18 19:00:00.092 1.29248 1.29248 1.29283      4
2 2019-02-18 19:00:00.441 1.29248 1.29248 1.29283      2
3 2019-02-18 19:00:01.154 1.29248 1.29248 1.29283      2
4 2019-02-18 19:00:01.941 1.29248 1.29248 1.29289     10
5 2019-02-18 19:00:02.492 1.29248 1.29248 1.29285      9
6 2019-02-18 19:00:03.040 1.29249 1.29249 1.29285     29
> dim(data2)
[1] 207045      5

Now you can see the warning message above saying that 66157 rows failed to parse because the data was not in the proper format; this is the same midnight leading-zero problem we saw earlier, and the sketch below shows one way it could be fixed. Watch this 50 minute documentary on how HFT is done. In this post I showed how we can easily deal with tick data. Tick bars have better statistical properties than time bars.
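A sketch of one possible fix for those parse failures: zero-pad the Time and Milliseconds fields before pasting them together, so rows around midnight (where Time reads as 3 instead of 000003) parse cleanly. This assumes the Date, Time and Milliseconds columns read above.

# A minimal sketch: zero-pad Time and Milliseconds before parsing.
# Assumes the Date/Time/Milliseconds columns read above.
data2 <- data1
data2$DateTime <- paste0(data2$Date, " ",
                         sprintf("%06d", as.integer(data2$Time)), ".",
                         sprintf("%07d", as.integer(data2$Milliseconds)))
data2$DateTime <- as.POSIXct(data2$DateTime, format = "%Y%m%d %H%M%OS")
sum(is.na(data2$DateTime)) # ideally zero failed parses now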