Blog

rsync as R package

In this article we present our R package rsync, which serves as an interface between R and the popular Linux command line tool rsync.

Originally rsync is an open source tool for efficiently synchronizing files. Published by Paul Mackerras and Andrew Tridgell under the GNU General Public License, it allows users of Unix systems to synchronize local and remote files between two locations. Apart from copying files, a big advantage of the rsync algorithm is to only transfer differences within the files, if possible. This allows high speeds and avoids redundancies.

Rsync enjoys great popularity, e.g. when creating backups as well as transferring or mirroring data. The tool supports the synchronization of links, rights, groups and devices.

Figure 1: Possible rsync scenarios covered by the rsync R package. Top: local to local. Bottom: local to remote.

Figure 1 visualizes the usage scenarios of rsync. We distinguish two cases. First, the case between two local directories (top half of the graph) and second, the case between a local directory and a remote rsync daemon (bottom half of the graph).

Rsync is only available for Unix systems and is controlled from the command line. Unfortunately we have to put Windows users at ease here. Only via a small detour, e.g. with cygwin, Windows users can work around this issue.

An example command to synchronize the x.csv file between the local directory and an rsync daemon might look like this:

RSYNC_PASSWORD="1234" rsync -ltx /home/UserXY/Documents/x.csv rsync://user@example.de/someFolder"

Since we implement many processes in R, we thought that an R package could take some work from us here. The R package handles the communication with the command line program rsync. The same command in R now looks like this:

rsync::sendFile(con, fileName = 'x.csv')

The R interface to rsync has several advantages:

  • Intuitive operation for users with R experience
  • Interaction with rsync directly from within R
  • Typical actions can be performed without command line skills

Installation

To make it easier for our R package to synchronize, the underlying rsync tool must be installed first. On most systems it is part of the installation already.

Then install our rsync package by running the following line in R:

# install.packages("devtools")
devtools::install_github("INWTlab/rsync")

Functionality

At the beginning of an R session, a connection between two endpoints has to be established. To have a reproducible example, we demonstrate the features with two local folders. You can create a rsync configuration using:

library("rsync")
dir.create("destination")
dir.create("source")
dest <- rsync(dest = "destination", src = "source")
dest
## Rsync server: 
##   dest: /home/userxy/projects/rsync/destination
##   src: /home/userxy/projects/rsync/source
##   password: NULL 
## Directory in destination:
## [1] name         lastModified size        
## <0 rows> (or 0-length row.names)

In the case of an rsync daemon you can also supply a password. The way you think about transactions is that we have a destination folder with which we want to interact. All methods provided by this package will always operate on the destination. It will not change the source, in most cases. An exception is sendObject, that will also create a file in source.

x <- 1
y <- 2
sendObject(dest, x)
sendObject(dest, y)
## Rsync server: 
##   dest: /home/userxy/projects/rsync/destination
##   src: /home/userxy/projects/rsync/source
##   password: NULL 
## Directory in destination:
##      name        lastModified size
## 1 x.Rdata 2019-02-28 12:30:51   61
## 2 y.Rdata 2019-02-28 12:30:51   60

We can see that we have added two new files. These two files now exist in the source directory and the destination. The following examples illustrate the core features of the package:

removeAllFiles(dest) # will not change source
## Rsync server: 
##   dest: /home/userxy/projects/rsync/destination
##   src: /home/userxy/projects/rsync/source
##   password: NULL 
## Directory in destination:
## [1] name         lastModified size        
## <0 rows> (or 0-length row.names)
sendFile(dest, "x.Rdata") # so we can still send the files
## Rsync server: 
##   dest: /home/userxy/projects/rsync/destination
##   src: /home/userxy/projects/rsync/source
##   password: NULL 
## Directory in destination:
##      name        lastModified size
## 1 x.Rdata 2019-02-28 12:30:51   61

loadData() also retrieves files from con$from, except that the information is loaded directly into the R workspace. dataName is the equivalent of entryName of the previous case and allows loading files of the following formats: .Rdata, .csv, .json.

For more information please check out the README and documentation of the the rsync package.

Feedback

You are welcome to help improve this R package, just create an issue on Github.