Thursday, July 14, 2016

One more reason to use Feather

It has been quiet here for some time. Not because I have nothing to blog about, but because I have had no time to do it properly. One of those almost forgotten posts was about the Feather package (R) / module (Python).

Feather is a fast, lightweight binary format for data frames, with R and Python implementations. The original RStudio announcement is here. And sure, the speed improvement is impressive. See my numbers for saving ~100 million probabilities:

# haplotype probs: 192 animals x 8 x 64000 markers
> format(object.size(probs), units="Mb")
[1] "750 Mb"

# saveRDS or save needs almost a minute to write probs to disk
> system.time(saveRDS(dprobs, file="DO192_probs.rds"))
   user  system elapsed
 50.701   0.574  51.678

# write_feather needs 6-7 seconds
> system.time(write_feather(dprobs, "DO192_probs.feather"))
   user  system elapsed
  1.344   1.051   6.272

Feather is even better if you compare it to traditional text formats like CSV. As David Smith explains in his blog, one reason is that traditional formats are row-oriented while R's internal storage is column-oriented.

Diagram credit: Hadley Wickham
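
To make the column-oriented point concrete, here is a small base R sketch. The toy data frame and its column names are made up for illustration. It shows that a data frame is a list of column vectors, while a CSV writer has to interleave those vectors row by row.

```r
# Toy data frame; the names and values are invented for illustration.
df <- data.frame(animal = c("a1", "a2", "a3"),
                 prob   = c(0.10, 0.55, 0.35),
                 stringsAsFactors = FALSE)

is.list(df)         # a data frame is stored as a list ...
length(df)          # ... with one element per column, not per row
is.atomic(df$prob)  # each column is a single atomic vector

# Writing CSV means walking across all columns for every row:
csv_lines <- capture.output(write.csv(df, row.names = FALSE))
csv_lines
```

A columnar format can write each of those vectors more or less as it sits in memory, which is one plausible reason for the gap in the timings above.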

I have one more reason to use Feather. If you have a dataset with many columns (e.g. one per gene in the human or mouse genome) and you need fast access to just one column (e.g. in a Shiny app), then Feather is ideal because its columns are automatically indexed: a single column can be read by name without loading the rest of the file.

> system.time(read_feather("DO192_probs.feather", columns = "19_48310898"))
   user  system elapsed
  0.068   0.000   0.069
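
If you want to try the single-column read without a 750 Mb file, here is a minimal sketch on toy data. It assumes the feather package is installed. The table size and the column names (which only mimic the chromosome_position pattern) are invented for illustration.

```r
library(feather)

# Toy stand-in for the real data: 100 rows x 1000 columns of random numbers.
# The column names are made up; they only mimic the "chr_position" pattern.
set.seed(1)
n_rows <- 100
n_cols <- 1000
toy <- as.data.frame(matrix(runif(n_rows * n_cols), nrow = n_rows))
names(toy) <- paste0("1_", seq_len(n_cols))

# Write the whole table once to a temporary file.
path <- tempfile(fileext = ".feather")
write_feather(toy, path)

# Ask for a single column by name; only that column is returned.
one <- read_feather(path, columns = "1_500")

dim(one)
all.equal(one[["1_500"]], toy[["1_500"]])
```

In a Shiny app the write step would happen once, offline, and each user request would only call read_feather() with the column it needs.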

Sure, there are other solutions, like rhdf5 or RSQLite, but Feather is the easiest to use, at least for me, at least in R. See David Smith (Microsoft R) for more details: http://blog.revolutionanalytics.com/2016/05/feather-package.html
