Reference

Types

Arrays

The various constructor functions (vector, matrix and array) all yield the same array type, producing respectively 1-dimensional, 2-dimensional and >=3-dimensional arrays. In particular, a scalar is a 1-dimensional array of size 1. The following types are arrays (note in particular the absence of an integer type):

The constructors are similar to R:


1             # vector of size 1
vector(mode="double", 3)

c(1, 2, 3, 4, 5)
### [1] 1 2 3 4 5
c(TRUE, FALSE, FALSE, TRUE)
### [1] TRUE  FALSE FALSE TRUE 

matrix(1:9, 3, 3, dimnames=list(c("i","ii","iii"), c("one","two","three")))

array(1:8, c(2,2,2), dimnames=list(NULL,NULL,c("one","two")))

One significant difference from R is that arrays (as well as time-series) can be persistent. This is controlled by the file argument, which gives the name of the directory that will be created to hold the memory-mapped files associated with the array:

a <- array(1:2e6, c(1e6,2), file="/tmp/memory_mapped_array_directory")

The str function displays the memory-mapped directory if any:

str(a)

### displays:
### double - ord [1:1000000, 1:2] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 ...
### - mmap file = /tmp/memory_mapped_array_directory

Another feature of arrays is that ztsdb keeps track of the ordering of each column. str indicates that all columns are ordered by printing "ord". The function is.ordered can also be used to determine this programmatically.

is.ordered(1:10)        # TRUE
is.ordered(10:1)        # FALSE
str(1:10)
### double - ord [1:10] 1 2 3 4 5 6 7 8 9 10
### - malloc-based
str(10:1)
### double [1:10] 10 9 8 7 6 5 4 3 2 1
### - malloc-based

Date/time representation

ztsdb has four built-in date and time related types. They are time, interval, duration and period.

time

A time is a specific point in time with nanosecond precision:

timepoint1 <- |.2009-01-01 13:12:00.000000001 America/New_York.|
timepoint2 <- as.time("2009-01-01 13:12:00.000000001 America/New_York")

It is encoded as the nanosecond offset since 1970-01-01 UTC. This means the time range is approximately from year 1386 to year 2554. It does not have an associated time zone, but can be displayed in any desired time zone with the print function. It follows the POSIX convention, in particular it does not have the notion of leap seconds.
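
As a rough illustration of this encoding, the following sketch (written in Python rather than ztsdb, and only a model of the description above) converts the example timestamp into a nanosecond offset from 1970-01-01 UTC. The helper name to_nanos is hypothetical, and a fixed UTC-5 offset stands in for America/New_York in January.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_nanos(dt, extra_nanos=0):
    """Nanoseconds between the epoch and `dt` (hypothetical helper).

    `dt` must be timezone-aware; `extra_nanos` carries the sub-microsecond
    part that Python's datetime cannot represent.
    """
    delta = dt - EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000 + extra_nanos

# Assumption: New York is at UTC-5 (EST) on this date.
est = timezone(timedelta(hours=-5))
local = datetime(2009, 1, 1, 13, 12, 0, tzinfo=est)

nanos = to_nanos(local, extra_nanos=1)
print(nanos)

# The same instant expressed in UTC yields the same offset: the time zone
# only matters for parsing and display, not for the stored value.
assert nanos == to_nanos(local.astimezone(timezone.utc), extra_nanos=1)
```

The stored value is a single integer, which is why a time carries no time zone of its own.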

interval

An interval is represented by two points in time, and its start and its end can each be either closed or open. In the string constructor, a closed start (or end) is indicated with the '+' sign and means that the start (or end) is included in the interval. An open start (or end) is indicated with the '-' sign and means that it is not included in the interval. The constructor version that takes times accepts two additional arguments, sopen and eopen, which, when true, indicate that the start or the end of the interval, respectively, is open. By default an interval has a closed start and an open end.

ival <- |+2009-01-01 13:12:00 America/New_York -> 2009-02-01 15:11:03 America/New_York-|
as.interval("-2009-01-01 13:12:00 America/New_York -> 2009-02-01 15:11:03 America/New_York+")

start <- |.2009-01-01 13:12:00 America/New_York.|
one_hour <- as.duration("01:00:00")
end  <- start + one_hour

### all the following produce the same interval:
interval(start, end)                   # by default sopen=F,eopen=T
interval(start, end, sopen=F, eopen=T)
interval(start, duration=one_hour)

It is encoded as two time values with additional flags indicating whether the start and the end of the interval are open or closed. Accessors are defined in order to access its components:

ival <- |+2009-01-01 13:12:00 America/New_York -> 2009-02-01 15:11:03 America/New_York-|

interval.start(ival)    # |.2009-01-01 13:12:00 America/New_York.|
interval.end(ival)      # |.2009-02-01 15:11:03 America/New_York.|
interval.sopen(ival)    # FALSE
interval.eopen(ival)    # TRUE
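
To make the meaning of the open and closed flags concrete, here is a small Python model (not ztsdb code) of an interval with a membership test. The class Interval and its contains method are hypothetical names for this sketch, and times are simplified to plain integers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """Toy model of an interval; times are plain integers (e.g. nanoseconds)."""
    start: int
    end: int
    sopen: bool = False   # closed start by default, as described above
    eopen: bool = True    # open end by default

    def contains(self, t: int) -> bool:
        # An open bound excludes the boundary value, a closed bound includes it.
        after_start = t > self.start if self.sopen else t >= self.start
        before_end = t < self.end if self.eopen else t <= self.end
        return after_start and before_end

default = Interval(10, 20)                           # like |+10 -> 20-|
flipped = Interval(10, 20, sopen=True, eopen=False)  # like |-10 -> 20+|

print(default.contains(10), default.contains(20))    # True False
print(flipped.contains(10), flipped.contains(20))    # False True
```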

duration

A duration is a count of nanoseconds, which may be negative:

one_second <- as.duration(1e9)
one_second <- as.duration("00:00:01")
one_hour   <- as.duration("01:00:00")
one_nanosecond   <- as.duration("00:00:00.000_000_001")

period

period represents the calendar or "business" view of a duration with the concepts of month and day. The exact duration of a period is unknown until it is anchored to a point in time and associated with a time-zone.

A period is composed of two parts: a months/days part and a duration. These two components may have opposite signs.

Note that for convenience reasons the constructor syntax allows specifying years and weeks, but they are converted to their representation in months/days.

### constructor from string:
one_month_one_day <- as.period("1m1d")
one_day_minus_12_hours <- as.period("1d/-12:00:00")
one_of_everything <- as.period("1y1m1w1d/01:01:01.000_000_001")

### constructor from double and duration arguments:
one_month_one_day <- period(months=1, days=1)
one_day_minus_12_hours <- period(days=1, duration=-as.duration("12:00:00"))

Accessors for a period's components are provided:

ones <- as.period("1y1m1w1d/01:01:01.000_000_001")

period.month(ones)
period.day(ones)
period.duration(ones)
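
The reason a period has no fixed length until it is anchored can be seen with a few lines of Python (a model only, not ztsdb code). The helper add_months is hypothetical and handles only days that exist in the target month. Time-zone effects such as daylight saving time are not modelled here.

```python
from datetime import datetime

def add_months(t, months):
    """Add calendar months to a naive datetime (hypothetical helper).

    Only handles days that exist in the target month, which is enough here.
    """
    total = t.year * 12 + (t.month - 1) + months
    return t.replace(year=total // 12, month=total % 12 + 1)

# The same "one month" period covers a different number of days
# depending on the point in time it is anchored to.
for anchor in (datetime(2009, 1, 1, 13, 12), datetime(2009, 2, 1, 13, 12)):
    elapsed = add_months(anchor, 1) - anchor
    print(anchor.date(), "+ 1 month =", elapsed.days, "days")
# 2009-01-01 + 1 month = 31 days
# 2009-02-01 + 1 month = 28 days
```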

Operations

Arithmetic operations

Arithmetic operations are straightforward. As in R, one can use either the infix notation or the functional notation, the latter by enclosing the operator in backquotes.

1 + 1 == `+`(1, 1)        # TRUE

Arithmetic operations on temporal types

A duration can be added, subtracted, multiplied or divided, and the result is a duration.

one_second <- as.duration("00:00:01")
one_second + one_second
### [1] 00:00:02
three_seconds = 3 * one_second
three_seconds / 3
### [1] 00:00:01

A duration can be added to or subtracted from a time or an interval, and the result remains respectively a time or an interval.

one_second <- as.duration("00:00:01")
timepoint <- |.2009-01-01 13:12:00.000000001 America/New_York.|
timepoint + one_second
### [1] 2009-01-01 13:12:01.000000001 EST

A period can be added to or subtracted from a time or an interval, and the result remains respectively a time or an interval.

The functional way of specifying the operators (`+` and `-`) is needed for the addition/subtraction of a period because an additional time-zone argument must be specified. For example:

one_month <- as.period("1m")
`+`(|.2009-01-01 13:12:00 America/New_York.|, one_month, "America/New_York")
### [1] 2009-02-01 13:12:00 EST

Subsetting and subassignment

Subsetting and subassignment work as in R. An index can be a logical vector or a double vector or a character vector. Additionally a time vector can be subsetted by either a time or an interval vector, and an interval vector can be subsetted by an interval vector (see also Date/time intersection).

a[1:10, 1]
a[1:10, c(TRUE,FALSE)]
a[,]
a[1:10, ] <- a[1:20]

Set operations

The set operations intersect, union and setdiff are defined for both time and interval vectors. If these vectors are not ordered, they are sorted before the operation is carried out, so these operations carry a performance penalty on unordered vectors.

Additionally, each of these set operations has a counterpart function intersect.idx, union.idx and setdiff.idx that, instead of computing a new set, returns the index of the set.

Intersection

time[interval]
interval[time]
interval[interval]
ts[interval]
ts[time]

or alternatively:

intersect(time, interval)
intersect(interval, interval)

Union

union(interval, interval)    # gives back the minimal interval set
union(time, time)
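
The "minimal interval set" returned by the union of two interval vectors can be pictured with the classic merge of overlapping intervals. The Python sketch below is only a model: it assumes every interval has a closed start and an open end, represents each as a (start, end) pair of integers, and assumes that touching intervals are merged. The function name union_intervals is hypothetical.

```python
def union_intervals(a, b):
    """Merge two lists of half-open (start, end) pairs into a minimal set."""
    merged = []
    # Sorting first mirrors the note above: unordered input must be sorted.
    for start, end in sorted(a + b):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(union_intervals([(1, 5), (10, 12)], [(4, 8), (12, 15), (20, 21)]))
# [(1, 8), (10, 15), (20, 21)]
```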

Difference

setdiff(time, time)
setdiff(interval, interval)
setdiff(time, interval)

Calendar operations

Rounding is defined for time and interval, with the following set of constants: "second", "minute", "hour", "day", "week", "month", "quarter", "year". For all constants that require the computation to take into account daylight saving time, the time zone argument tz is required. For example:

round(time, "day", tz="Europe/London")
round(time, "minute")
round(interval, "month", tz="America/New_York")
round(interval, "second")

Conversion of calendar periods to an integer value is defined for time objects:

dayweek(Sys.time(), "America/New_York")      # 0 to 6 (0 is Sunday)
daymonth(Sys.time(), "America/New_York")     # 1 to 31
dayyear(Sys.time(), "America/New_York")      # 1 to 366
month(Sys.time(), "America/New_York")        # 1 to 12
year(Sys.time(), "America/New_York")

The function dist computes the distance in calendar periods between two times. It returns a double that indicates the number of units of the chosen calendar period:

dist(|.2009-01-01 13:12:00 America/New_York.|, |.2009-01-01 13:12:00 America/New_York.|, "day", tz)

Generating sequences

ztsdb provides a seq function similar to the one in R. Temporal sequences (of either time or interval) can be created with a by argument that is either a duration or a period. When it is a period, the tz argument must be specified in order to associate a time zone with the operations.

one_day <- as.period("1d")
seq(from=|.2009-01-01 13:12:00 America/New_York.|,
    to=  |.2016-01-01 13:12:00 America/New_York.|,
    by=one_day, tz="America/New_York")

one_second <- as.duration("00:00:01")
seq(from=|.2009-01-01 13:12:00 America/New_York.|,
    to=  |.2009-01-02 13:12:00 America/New_York.|,
    by=one_second)

seq(from=|+2009-01-01 13:00:00 America/New_York -> 2009-01-01 15:00:00 America/New_York-|,
    to=  |+2010-01-01 13:00:00 America/New_York -> 2010-01-01 15:00:00 America/New_York-|,
    by=one_day, tz="America/New_York")

CSV read/write

As in R, the functions read.csv and write.csv are provided. ztsdb does not adhere strictly to RFC 4180. In particular, we use (and expect) CR and not CRLF. And although we allow quoted elements, we don't allow the separator to appear in a string. We believe these functions to be mostly useful for time-series, where this limitation has little impact.

Rolling functions

These functions "roll" over each column, calculating a value over a given window of observations. They can be used on either a double or a zts. Their signature is:

function(x, window, nvalid=window)

x is the double or zts, window is an integer that defines the number of observations on which the operation is performed, and nvalid is the number of non-NaN observations needed to consider a result valid. For example, a window of 10 and an nvalid of 5 mean that if 5 or more non-NaN observations exist in the window, the result is computed; otherwise it is set to NaN. The functions are rollmean, rollmin, rollmax, rollvar, rollcov. See an example of the usage of rollcov here.
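
The interplay between the window and the validity threshold can be sketched in Python (a model, not ztsdb code). It assumes that the window at position i covers the observations ending at i and is truncated at the start of the column; the actual placement of the window in ztsdb is not stated above.

```python
import math

def rollmean(x, window, nvalid=None):
    """Trailing rolling mean over a list of floats (toy model).

    Assumption: the window at position i covers the `window` observations
    ending at i, truncated at the start of the column.
    """
    if nvalid is None:
        nvalid = window
    out = []
    for i in range(len(x)):
        chunk = x[max(0, i - window + 1): i + 1]
        valid = [v for v in chunk if not math.isnan(v)]
        # Compute only when enough non-NaN observations are present.
        if valid and len(valid) >= nvalid:
            out.append(sum(valid) / len(valid))
        else:
            out.append(math.nan)
    return out

nan = math.nan
print(rollmean([1.0, nan, 3.0, 4.0, nan, nan], window=3, nvalid=2))
# [nan, nan, 2.0, 3.5, 3.5, nan]
```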

Array and zts transformation

These functions transform a double or a zts column-wise.

locf

Last observation carried forward. A non-NaN observation is carried forward to fill in a NaN observation if the non-NaN and NaN observations are in the same window specified by n. The signature of this function is:

function(x, n)
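
As an illustration, here is a Python model of this behaviour on a single column (not ztsdb code). It interprets n as the maximum number of positions a value may be carried forward, which is only one possible reading of "window".

```python
import math

def locf(x, n):
    """Carry the last non-NaN value forward by at most n positions (assumed)."""
    out = []
    last_value, last_pos = math.nan, None
    for i, v in enumerate(x):
        if not math.isnan(v):
            last_value, last_pos = v, i
            out.append(v)
        elif last_pos is not None and i - last_pos <= n:
            out.append(last_value)   # still within reach of the last value
        else:
            out.append(math.nan)     # too far from any observation
    return out

nan = math.nan
print(locf([1.0, nan, nan, nan, 5.0, nan], n=2))
# [1.0, 1.0, 1.0, nan, 5.0, 5.0]
```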

move

Moves all observations down or up depending on the value of n. A positive n moves them down while a negative n moves them up. NaN is assigned to the positions that are vacated by the move (at the beginning or at the end of the columns, depending on the direction of the move). The signature is:

function(x, n)

rotate

Works like the move function, but the observations wrap around, so no NaN is produced. The signature is:

function(x, n)
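
The difference between move and rotate can be shown on a single column with a short Python model (not ztsdb code). It assumes that "down" means towards higher indices.

```python
import math

def move(x, n):
    """Shift values down (n > 0) or up (n < 0), padding with NaN."""
    size = len(x)
    if n >= 0:
        k = min(n, size)
        return [math.nan] * k + x[:size - k]
    k = min(-n, size)
    return x[k:] + [math.nan] * k

def rotate(x, n):
    """Like move, but values pushed off one end re-enter at the other."""
    if not x:
        return []
    k = n % len(x)
    return x[len(x) - k:] + x[:len(x) - k]

col = [1.0, 2.0, 3.0, 4.0]
print(move(col, 1))     # [nan, 1.0, 2.0, 3.0]
print(move(col, -1))    # [2.0, 3.0, 4.0, nan]
print(rotate(col, 1))   # [4.0, 1.0, 2.0, 3.0]
print(rotate(col, -1))  # [2.0, 3.0, 4.0, 1.0]
```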

rev

Reverses, still column-wise, an array or a list. The signature is:

function(x)

Cumulative functions

These functions accumulate values. The rev parameter controls the direction of the accumulation. The functions of this group are: cumsum, cumprod, cumdiv, cummax, cummin. Their signature is:

function(x, rev=FALSE)
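
A minimal Python model of the rev parameter (not ztsdb code) is given below. It assumes that with rev set, the accumulation starts from the last element while each result stays at the position of its input.

```python
def cumsum(x, rev=False):
    """Running sum; with rev=True the accumulation starts from the last element."""
    values = list(reversed(x)) if rev else list(x)
    out, total = [], 0.0
    for v in values:
        total += v
        out.append(total)
    # Put the results back in the original positional order.
    return list(reversed(out)) if rev else out

print(cumsum([1.0, 2.0, 3.0]))            # [1.0, 3.0, 6.0]
print(cumsum([1.0, 2.0, 3.0], rev=True))  # [6.0, 5.0, 3.0]
```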

Aggregate functions

The functions sum and prod, which provide respectively the sum and the product of all elements of a vector, matrix or array, work as in R.

Time-series

zts, the time series type, is composed of a time vector and a double array. The length of the time index is the same as the first dimension of the double array. This means that each time element is associated with a "horizontal" slice of the double array. This first dimension has the same special time subsetting capabilities as the time type.

Creation

A time series is created from a time vector and a corresponding double array (i.e. the length of the first dimension of the array is the same as the length of the time vector). Note that, like arrays, time series can have an arbitrary number of dimensions. And just like arrays, a time series can be memory-mapped by supplying the optional argument file, which indicates where the memory-mapped files will be written.

idx <- c(|.2015-03-09 06:38:01 America/New_York.|,
         |.2015-03-09 06:38:02 America/New_York.|,
         |.2015-03-09 06:38:03 America/New_York.|)
data <- 1:6
z <- zts(idx, data, dim=c(3, 2), dimnames=list(NULL, c("one", "two")), file="memory_mapped_dir")

###                                    one two
###  2015-03-09 06:38:01.000000000 EDT 1   4  
###  2015-03-09 06:38:02.000000000 EDT 2   5  
###  2015-03-09 06:38:03.000000000 EDT 3   6  

Note that the dim argument can be omitted in the case of a two-dimensional time-series, as the size can be calculated from the length of the vector.

Accessors

zts is an aggregate type and its components can be accessed with the following functions:

zts.idx(z)
zts.data(z)

Operations on time-series

Subset and subassign operations are defined similarly to double, and the first dimension follows the indexation semantics of a time vector.

ivl <- |+2015-03-09 06:38:01 America/New_York -> 2015-03-09 06:38:02 America/New_York+|
z[ivl,]
###                                    one two
###  2015-03-09 06:38:01.000000000 EDT 1   4  
###  2015-03-09 06:38:02.000000000 EDT 2   5  

Arithmetic operations are defined as for double:

z + z
###                                    one two
###  2015-03-09 06:38:01.000000000 EDT 2   8  
###  2015-03-09 06:38:02.000000000 EDT 4   10 
###  2015-03-09 06:38:03.000000000 EDT 6   12 

The bind family of functions is also defined, but note that a zts index must remain strictly sorted (and consequently with unique values).

Align operations

The function align has the following signature:

align(from, to, start=as.duration(0), end=as.duration(0), method="closest", tz=NULL)

It aligns the observations of the zts from onto the vector of time to, effectively yielding a new time-series that has the vector to as time index.

The arguments start and end define an interval which will be used to pick a value out of from. The alignment algorithm is the following: for each time t in to, define the interval i [t - start; t + end[ (note that start is closed whereas end is open, i.e. end is not part of the interval). For each i so defined, pick a value out of from that is computed over the values of from that fall in that interval.

start and end can either be a duration or a period. If one of the two is a period then tz needs to be defined in order to give meaning to the interval.

The argument method controls which value will be picked out of from for a given value of to and can have the values:

Here is a visualization of align(t1, t2, -one_hour, "closest"):

[Figure: align closest]

Here is a visualization of align(t1, t2, -one_hour, "count"):

[Figure: align count]

### create a zts for the example:
one_second <- as.duration("00:00:01")
idx <- seq(|.2015-01-01 12:00:00 America/New_York.|,
           |.2015-02-01 12:00:00 America/New_York.|,
           by=one_second)
data <- 0:(length(idx)-1)
z <- zts(idx, data)

### create a vector of time onto which z will be aligned:
one_hour <- as.duration("01:00:00")
to <- seq(|.2015-01-01 12:00:00 America/New_York.|,
          |.2015-02-01 00:00:00 America/New_York.|,
          by=one_hour)

align(z, to, -one_hour, method="count")     # the values of this zts will be 3600
align(z, to, -one_hour, method="closest")   # the values of this zts will be 0, 3600, 7200, ...

Additionally, the function align.idx is provided and has the signature:

align.idx(from, to, start=as.duration(0), end=as.duration(0), tz=NULL)

This function makes a "closest" align and instead of returning a time-series, it returns the index of the values in from.

op.zts operation

op.zts performs arithmetic operations between two time series and has the following signature, where x and y are time-series and op is a string:

op.zts(x, y, op)

Each entry in the left time-series operand defines an interval from the previous entry, and the value associated with this interval is applied to all the observations in the right time-series operand that fall in the interval. Note that the interval is closed at the beginning and open at the end. The available values for op are "*", "/", "+", "-".

Here is a visualization of op.zts(t1, t2, "*"):

[Figure: op.zts]

Connecting and querying

Connection

A connection is a handle to a remote ztsdb instance. The underlying protocol of a connection is TCP. It is created like this:

c1 <- connection(host="127.0.0.1", port=19001)

The connection object is created only if the connection to the remote instance is successfully established.

With a connection it is possible to run any code remotely using the ? (query) operator:

c1 ? 1                               # evaluate 1 remotely
c1 ? 1 + 1                           # evaluate 1+1 remotely
c1 ? a <<- array(1:27, c(3,3,3))     # create 'a' remotely in the global environment
c1 ? a                               # get 'a'
c1 ? a[1, 2, 1]
c1 ? a[,1,1]
c1 ? { b <- 2; a * b }               # create 'b' in the remote context environment
                                     # and send back 'a * b'

Escape operator

It is also possible to escape code with the ++ operator, so that it is evaluated locally before being sent remotely as part of the query:

la <- 2
c1 ? ++la * a         # take 'la' locally, send it over to the remote instance,
                      # multiply it by the remote 'a' and send result back
c1 ? ++{ lb <- 2; lc <- 3; lb * lc } * a       # 6 * 'a' where 6 is evaluated locally

More complicated schemes are possible, such as defining remote handles, remote escapes, etc.

Synchronous and asynchronous queries

A query is immediately dispatched to the remote instance for interpretation. Locally, a future is created as a placeholder for the result of the query. The execution of the code then continues until the value of the future is needed when it is used in an expression (or needs to be returned as the result of a query). This means that it is possible to control if a query is synchronous or asynchronous respectively by using or not using the result of the query.

### synchronous:
a <- (c1 ? x) + (c2 ? y)   # sync'd by the '+'; the queries go out in parallel to 'c1' and 'c2'

### asynchronous:
{ c1 ? x; c2 ? y; NULL }   # the result of 'c1' and 'c2' are never used
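
The future mechanism is analogous to futures in other languages. The Python sketch below (a model using the standard concurrent.futures module, not ztsdb code) dispatches two stand-in queries in parallel and synchronizes only when their results are used. The function remote_query is hypothetical and simply simulates a round trip.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def remote_query(value, delay=0.1):
    """Stand-in for a query to a remote instance (hypothetical)."""
    time.sleep(delay)
    return value

with ThreadPoolExecutor(max_workers=2) as pool:
    # Both queries are dispatched immediately and run in parallel.
    fx = pool.submit(remote_query, 1)
    fy = pool.submit(remote_query, 2)

    # Using the results forces a wait on both futures, like the '+' above.
    a = fx.result() + fy.result()
    print(a)   # 3

    # "Asynchronous" use: dispatch and carry on without reading the result
    # (this executor still waits for the task when the block exits).
    pool.submit(remote_query, 99)
```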

Timers

It is possible to repeat the execution of code at regular intervals. A timer creates a new interpretation context, and when the timer is destroyed the interpretation context is torn down too. To avoid this, a timer can be declared in global scope.

A timer has the following signature:

function(duration, loop, once=NULL, loop_max=0)

loop_max indicates the number of repetitions. A value of 0 indicates infinite repetitions. The once argument takes an expression that is evaluated only once. It is useful for example for creating local variables and setting up the job that will be done by the loop code. The latter is an expression that will be run at each timer expiry.

Timers are useful for a large variety of tasks: data distribution and backup, data transformation, etc. The following example calculates mean-minutes and stores them into a time-series available for querying.

Built-in functions

This source file has a list of the built-in functions together with their signatures and the allowable parameter types. For most of these functions, the functionality and parameters are the same as in R.

Environments and assignments

Environments

ztsdb has a notion of environment, but environments are not a first-class type as they are in R. Another difference is that ztsdb has dynamic scoping (see Scoping).

The environment hierarchy is the following: "base" <- "global" <- ... <- "current"

Managing environment content

The functions that help manage an environment's content are assign, get, ls and rm (and its synonym remove); they work roughly as in R. For convenience, the function lsg is also provided: it is the same as ls except that name defaults to "global", so it lists the content of the global environment by default.

The signatures are:

assign(x, value, envir="current", inherits=FALSE)
get(x, envir="current", inherits=FALSE)
ls(name="current")
lsg(name="global")
rm(..., list=character(), envir="current", inherits=FALSE)

Assignments

Simple assign

The simple assign operator (<-) always assigns to a variable in the current environment. This means that a variable created by the simple assign will never be visible to another interpretation context. If a variable is created in a function, then the variable is local to this function.

Special assign

The special assign operator (<<-) works as in R. It looks in the current environment for the variable, and if the variable is not found it moves up to the parent environment, and so on. If it reaches the "global" environment without finding a pre-existing variable, it creates a new variable there and makes the assignment.

Caution must be exercised when using the special assign to declare global variables, because it might result in an assignment unwittingly occurring in a child of the "global" environment. A safer way to achieve global assignment is to use the assign function and specify the parameter envir as "global":

a <<- 123                          # dangerous if 'a' is already defined in a child environment
assign("a", 123, envir="global")   # safe
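
The pitfall can be modelled in Python (not ztsdb code) with a chain of dictionaries standing in for environments. The function special_assign and the environment names are hypothetical; the "base" environment is left out for brevity.

```python
def special_assign(envs, name, value):
    """Model of `<<-`: `envs` is ordered from current up to global."""
    for env in envs:
        if name in env:
            env[name] = value      # first environment that already has it wins
            return
    envs[-1][name] = value         # not found anywhere: create in global

global_env = {}
caller_env = {"a": 1}              # an intermediate environment that defines 'a'
current_env = {}
chain = [current_env, caller_env, global_env]

special_assign(chain, "a", 123)
print(caller_env, global_env)      # {'a': 123} {}

global_env["a"] = 123              # models assign("a", 123, envir="global")
print(global_env)                  # {'a': 123}
```

In the first call the assignment lands in the intermediate environment and the global environment is left untouched, which is the situation the caution above warns about.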

Errors

Compared to R, ztsdb has a simplified mechanism for handling errors and there is no concept of warning. Errors can only be captured via .Last.error.

ztsdb implements, like in R, a try/catch mechanism which can be used like this:

### the following returns -1
error_string <- "invalid type for binary operator (double + string)"
a <- tryCatch(1 + "a", if (.Last.error==error_string) -1)
a   # 'a' has value -1

Permanence

Permanent objects are arrays and zts that are declared with the file parameter (see Arrays). The objects are then memory-mapped to a set of files in the directory indicated by file. To allow a deterministic file state, the function msync is provided. It has the signature:

function(x, async=FALSE)

The async parameter determines if the operation is asynchronous or not.