fread: performance difference with/without specifying colClasses #6105

@scipima

Hello,
I noticed a large performance difference in fread() with and without specifying colClasses.
If I specify the column classes for the toy dataset below, reading takes considerably longer than without specifying any classes (a median of roughly 2.5 s versus about 12.6 ms in the results below).
There is a similar question on StackOverflow (currently unanswered), but there it is not entirely clear whether the slowdown is due to the other calculations in the script or to fread() itself.
So I created the script below, which deals exclusively with fread().
In one case I also set the keys when reading in the data, but that does not seem to make any difference.

# Minimal reproducible example; please be sure to set verbose=TRUE where possible!

set.seed(1234)
vote_id <- sample(x = 100000:200000, size = 20000, replace = FALSE)
pers_id <- sample(x = 100000:200000, size = 900, replace = FALSE)
dates <- sample(x = seq(as.Date('2010/01/01'), as.Date('2011/01/01'), by="day"), size = 250, replace = FALSE)
result <- sample(x = -1:1, size = 1000000, replace = TRUE)
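# shorter vectors are recycled to the 1,000,000 rows of `result`;
# pers_id (900 values) does not divide evenly, hence the recycling warning in the results
big_dt <- data.table::data.table(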
  vote_id = vote_id,
  pers_id = pers_id,
  dates = dates,
  result = result)
# write data
data.table::fwrite(x = big_dt, "big_dt.csv")
# benchmark fread
res = bench::mark(
  fread_classes_keys = data.table::fread(
    file = "big_dt.csv",
    colClasses = list(
      integer = c("vote_id", "pers_id", "result"),
      Date = c("dates") ), 
    key = c("dates", "vote_id", "pers_id")),
  fread_classes = data.table::fread(
    file = "big_dt.csv",
    colClasses = list(
      integer = c("vote_id", "pers_id", "result"),
      Date = c("dates") )),
  fread = data.table::fread(
    file = "big_dt.csv"),
  check = FALSE, iterations = 10)
print(res)
sessionInfo()

Results

> source("/home/scipima/Documents/github/ep_rollcall/test_scripts/fread_test.R", encoding = "UTF-8")
# A tibble: 3 × 13
  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>  <bch:> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 fread_clas…  2.54s   2.55s     0.390   159.3MB    0.780    10    20      25.6s
2 fread_clas…  2.51s   2.51s     0.397   146.9MB    0.794    10    20      25.2s
3 fread       8.75ms 12.59ms    83.4      16.8MB    8.34     10     1    119.9ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
Warning messages:
1: Item 2 has 900 rows but longest item has 1000000; recycled with remainder. 
2: Some expressions had a GC in every iteration; so filtering is disabled. 
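
One possible way to narrow this down is to time the string-to-Date conversion on its own and compare it with a default fread() of the same column. This is only a sketch: it assumes, without having checked fread's internals, that a column requested as Date is first read as character and then converted with as.Date(). The names date_strings, tmp and dt_default are illustrative and not part of the script above.

```r
# Sketch: isolate the cost of converting ISO date strings to Date.
# Assumption (unverified): colClasses = "Date" involves a character -> as.Date() step.
set.seed(1234)
days <- seq(as.Date("2010-01-01"), as.Date("2011-01-01"), by = "day")
date_strings <- as.character(sample(days, size = 1e6, replace = TRUE))

# Time the character -> Date conversion on its own
t_convert <- system.time(converted <- as.Date(date_strings))
print(t_convert)

# For comparison: read the same column with fread() defaults
tmp <- tempfile(fileext = ".csv")
data.table::fwrite(data.table::data.table(dates = date_strings), tmp)
t_default <- system.time(dt_default <- data.table::fread(tmp))
print(t_default)

# Inspect which class fread() chose without colClasses
print(class(dt_default$dates))
unlink(tmp)
```

If the as.Date() timing alone is close to the 2.5 s seen above, the slowdown would point at the Date conversion rather than at the parsing of the file itself.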

Output of sessionInfo()

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Linux Mint 21.3

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=nl_BE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=nl_BE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=nl_BE.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Brussels
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_4.4.0    magrittr_2.0.3    bench_1.1.3       cli_3.6.2        
 [5] tools_4.4.0       pillar_1.9.0      glue_1.7.0        tibble_3.2.1     
 [9] utf8_1.2.4        fansi_1.0.6       vctrs_0.6.5       data.table_1.15.0
[13] jsonlite_1.8.8    lifecycle_1.0.4   pkgconfig_2.0.3   rlang_1.1.3      
[17] profmem_0.6.0    
