Closed
Description
Hello,
I noticed a large performance difference in fread() with and without the colClasses argument.
If I specify the classes for the toy dataset below, execution takes considerably longer than when no classes are specified: a median of about 2.5 s versus about 13 ms in the benchmark below.
There is a similar question on StackOverflow (currently unanswered), but there it is not entirely clear whether the slowdown comes from the other calculations in the script or from fread() itself.
So I created the script below, which deals exclusively with fread().
In one case I also specify the keys when reading in the data, but that does not seem to make any difference.
# Minimal reproducible example; please be sure to set verbose=TRUE where possible!
set.seed(1234)
vote_id <- sample(x = 100000:200000, size = 20000, replace = FALSE)
pers_id <- sample(x = 100000:200000, size = 900, replace = FALSE)
dates <- sample(x = seq(as.Date('2010/01/01'), as.Date('2011/01/01'), by="day"), size = 250, replace = FALSE)
result <- sample(x = -1:1, size = 1000000, replace = TRUE)
big_dt <- data.table::data.table(
  vote_id = vote_id,
  pers_id = pers_id,
  dates = dates,
  result = result)
# write data
data.table::fwrite(x = big_dt, "big_dt.csv")
# benchmark fread
res = bench::mark(
  fread_classes_keys = data.table::fread(
    file = "big_dt.csv",
    colClasses = list(
      integer = c("vote_id", "pers_id", "result"),
      Date = c("dates") ),
    key = c("dates", "vote_id", "pers_id")),
  fread_classes = data.table::fread(
    file = "big_dt.csv",
    colClasses = list(
      integer = c("vote_id", "pers_id", "result"),
      Date = c("dates") )),
  fread = data.table::fread(
    file = "big_dt.csv"),
  check = FALSE, iterations = 10)
print(res)
sessionInfo()
Results
> source("/home/scipima/Documents/github/ep_rollcall/test_scripts/fread_test.R", encoding = "UTF-8")
# A tibble: 3 × 13
  expression      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>   <bch:> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 fread_clas…   2.54s   2.55s     0.390   159.3MB    0.780    10    20      25.6s
2 fread_clas…   2.51s   2.51s     0.397   146.9MB    0.794    10    20      25.2s
3 fread        8.75ms 12.59ms    83.4      16.8MB    8.34     10     1    119.9ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
Warning messages:
1: Item 2 has 900 rows but longest item has 1000000; recycled with remainder.
2: Some expressions had a GC in every iteration; so filtering is disabled.
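The first warning comes from the toy data itself: the four vectors have different lengths (20000, 900, 250 and 1000000), so data.table() recycles the shorter ones. A minimal sketch of one way to avoid it, assuming the intent is a fixed pool of ids and dates spread over one million rows, is to draw the pools first and then sample from them to the full length. The names `n`, `vote_pool`, `pers_pool` and `date_pool` are illustrative and not part of the script above.

```r
library(data.table)

set.seed(1234)
n <- 1000000L  # target number of rows (matches the length of `result` above)

# Draw the pools of distinct values, with the same sizes as in the script above.
vote_pool <- sample(x = 100000:200000, size = 20000, replace = FALSE)
pers_pool <- sample(x = 100000:200000, size = 900, replace = FALSE)
date_pool <- sample(x = seq(as.Date("2010-01-01"), as.Date("2011-01-01"), by = "day"),
                    size = 250, replace = FALSE)

# Sample every column to the same length, so no recycling is needed.
big_dt <- data.table(
  vote_id = sample(vote_pool, n, replace = TRUE),
  pers_id = sample(pers_pool, n, replace = TRUE),
  dates   = sample(date_pool, n, replace = TRUE),
  result  = sample(-1:1, n, replace = TRUE)
)

print(dim(big_dt))
```

This changes how the ids and dates are distributed across rows compared with the recycled version, so timings from it are not directly comparable with the table above.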
Output of sessionInfo()
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Linux Mint 21.3
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=nl_BE.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=nl_BE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=nl_BE.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Brussels
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.4.0 magrittr_2.0.3 bench_1.1.3 cli_3.6.2
[5] tools_4.4.0 pillar_1.9.0 glue_1.7.0 tibble_3.2.1
[9] utf8_1.2.4 fansi_1.0.6 vctrs_0.6.5 data.table_1.15.0
[13] jsonlite_1.8.8 lifecycle_1.0.4 pkgconfig_2.0.3 rlang_1.1.3
[17] profmem_0.6.0
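One possible way to narrow this down is to time each colClasses entry on its own, to see whether the integer columns or the Date column account for the difference. The sketch below does that on a smaller stand-in file; it assumes nothing about the cause. The table `toy`, the temporary file `tmp` and the size `n` are hypothetical and chosen only to keep the run short.

```r
library(data.table)

# Small stand-in for big_dt.csv (hypothetical sizes and names).
set.seed(1234)
n <- 100000L
toy <- data.table(
  vote_id = sample(100000:200000, n, replace = TRUE),
  dates   = sample(seq(as.Date("2010-01-01"), as.Date("2011-01-01"), by = "day"),
                   n, replace = TRUE)
)
tmp <- tempfile(fileext = ".csv")
fwrite(toy, tmp)

# Time three variants: no classes, integer class only, Date class only.
t_none <- system.time(d_none <- fread(tmp))[["elapsed"]]
t_int  <- system.time(
  d_int <- fread(tmp, colClasses = list(integer = "vote_id"))
)[["elapsed"]]
t_date <- system.time(
  d_date <- fread(tmp, colClasses = list(Date = "dates"))
)[["elapsed"]]

print(c(none = t_none, integer_only = t_int, date_only = t_date))

# Show which class the dates column ends up with in each variant.
print(sapply(list(none = d_none, integer_only = d_int, date_only = d_date),
             function(d) class(d$dates)[1]))
```

If only one of the timings stands out, that points at the corresponding class as the place to look; single system.time() runs are noisy, so the variants could be passed to bench::mark() in the same way as in the script above for a steadier comparison.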