
Conversation

@ngxson
Collaborator

@ngxson ngxson commented Dec 21, 2025

Fix #18234

This includes quite a lot of changes, but at a high level:

  • Access to server_context_impl class members is now highly restricted. Only a few pointers like vocab, model, mctx are exposed.
  • Any static data (e.g. model name, context size, etc.) must now be rendered into server_context_meta. This is to prevent any access to non-thread-safe data inside server_context_impl.
  • From server_routes, the HTTP layer can only access the pointers mentioned above (vocab, model, mctx). Any other data MUST be passed through server_context_meta.

As a consequence:

  • /models and /v1/models can no longer be accessed during model loading. Doing so is NOT thread-safe and could cause a data race.
  • However, /models and /v1/models can now be accessed while the server is sleeping, because they no longer access server_context_impl directly.

This also includes some other fixes described in #18263 (comment) to make things safer.

cc @ServeurpersoCom, I'd appreciate it if you could do some testing, thanks!

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 21, 2025

Stress testing to reproduce the race condition, with and without this PR:

(root|~) cat race.sh
#!/bin/bash
# Stress test to detect data race #18234

set -o pipefail

readonly BASE_URL="https://www.serveurperso.com/ia/webui"
readonly MODEL_A="MoE-Qwen3-30B-A3B-Instruct-2507"
readonly MODEL_B="MoE-Qwen3-30B-A3B-Thinking-2507"
readonly ITERATIONS=100
readonly PARALLEL_REQUESTS=20

# Colors for old-school logging
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

log() { echo -e "${GREEN}[$(date '+%H:%M:%S.%3N')]${NC} $*"; }
err() { echo -e "${RED}[$(date '+%H:%M:%S.%3N')] ERROR:${NC} $*" >&2; }
warn() { echo -e "${YELLOW}[$(date '+%H:%M:%S.%3N')] WARN:${NC} $*"; }

# Heavy test payload to force cache_prompt processing
generate_payload() {
    python3 -c "print('Test '*500)"
}

# /v1/models request - main target of the race
hammer_models_endpoint() {
    local req_id=$1
    local start=$(date +%s%N)

    local response=$(curl -s -w "\n%{http_code}" \
        "${BASE_URL}/v1/models" \
        -H "Content-Type: application/json" \
        --connect-timeout 2 \
        --max-time 5 2>&1)

    local http_code=$(echo "$response" | tail -1)
    local body=$(echo "$response" | head -n -1)
    local duration=$((($(date +%s%N) - start) / 1000000))

    if [[ "$http_code" != "200" ]]; then
        err "REQ-${req_id}: /v1/models failed: HTTP $http_code (${duration}ms)"
        echo "$body" | head -5
        return 1
    fi

    # Check JSON consistency
    if ! echo "$body" | jq -e '.data | length' >/dev/null 2>&1; then
        err "REQ-${req_id}: Invalid JSON response"
        return 1
    fi

    log "REQ-${req_id}: /v1/models OK (${duration}ms)"
}

# Completion request to force state transitions
hammer_completion() {
    local req_id=$1
    local model=$2
    local payload=$(generate_payload)

    local start=$(date +%s%N)
    local response=$(curl -s -w "\n%{http_code}" -N \
        "${BASE_URL}/v1/chat/completions" \
        -H "Content-Type: application/json" \
        -d '{
            "model": "'"$model"'",
            "messages": [{"role": "user", "content": "'"$payload"'"}],
            "stream": true,
            "max_tokens": 10,
            "cache_prompt": false
        }' \
        --connect-timeout 2 \
        --max-time 10 2>&1)

    local http_code=$(echo "$response" | tail -1)
    local duration=$((($(date +%s%N) - start) / 1000000))

    if [[ "$http_code" != "200" ]]; then
        err "REQ-${req_id}: Completion failed: HTTP $http_code (${duration}ms, model: $model)"
        return 1
    fi

    log "REQ-${req_id}: Completion OK (${duration}ms, model: $model)"
}

# Mixed assault: models + completions in parallel
parallel_assault() {
    local wave=$1
    local pids=()
    local failures=0

    warn "WAVE $wave: Launching $PARALLEL_REQUESTS parallel requests..."

    # Launch in parallel
    for i in $(seq 1 $PARALLEL_REQUESTS); do
        local req_id="${wave}-${i}"

        # Alternate between models and completions
        if (( i % 3 == 0 )); then
            hammer_models_endpoint "$req_id" &
        elif (( i % 2 == 0 )); then
            hammer_completion "$req_id" "$MODEL_A" &
        else
            hammer_completion "$req_id" "$MODEL_B" &
        fi

        pids+=($!)
    done

    # Wait for all requests and count failures
    for pid in "${pids[@]}"; do
        if ! wait "$pid"; then
            ((failures++))
        fi
    done

    if (( failures > 0 )); then
        err "WAVE $wave: $failures/$PARALLEL_REQUESTS requests failed"
        return 1
    else
        log "WAVE $wave: ALL $PARALLEL_REQUESTS requests succeeded"
        return 0
    fi
}

# Race test during a model swap
race_during_swap() {
    warn "Testing race during model swap..."

    # Trigger a swap to MODEL_B
    hammer_completion "SWAP-1" "$MODEL_B" &
    local swap_pid=$!

    # Hammer /v1/models during the swap
    sleep 0.1
    for i in {1..10}; do
        hammer_models_endpoint "SWAP-${i}" &
    done

    wait
}

# Main stress test
main() {
    log "=== llama.cpp Data Race Hunter ==="
    log "Target: $BASE_URL"
    log "Models: $MODEL_A, $MODEL_B"
    log "Parallel requests: $PARALLEL_REQUESTS"
    log "Iterations: $ITERATIONS"
    echo

    # Check that the server responds
    if ! curl -s --connect-timeout 2 "${BASE_URL}/v1/models" >/dev/null; then
        err "Server unreachable at $BASE_URL"
        exit 1
    fi

    local total_failures=0
    local start_time=$(date +%s)

    # Phase 1: Assault in waves
    for wave in $(seq 1 $ITERATIONS); do
        if ! parallel_assault "$wave"; then
            ((total_failures++))
        fi

        # Small delay to observe the transitions
        sleep 0.05
    done

    # Phase 2: Race pendant swap
    warn "=== Phase 2: Model swap race test ==="
    race_during_swap

    # Final report
    local duration=$(($(date +%s) - start_time))
    echo
    log "=== Test completed in ${duration}s ==="

    if (( total_failures > 0 )); then
        err "FAILED: $total_failures/$ITERATIONS waves had failures"
        err "Data race likely present - check server logs"
        exit 1
    else
        log "SUCCESS: All $ITERATIONS waves passed"
        log "No obvious race detected (but check server logs for assertions/crashes)"
        exit 0
    fi
}

main "$@"
(root|~)

Without the PR merged, results:

(root|~) ./race.sh
[21:30:46.093] === llama.cpp Data Race Hunter ===
[21:30:46.093] Target: https://www.serveurperso.com/ia/webui
[21:30:46.094] Models: MoE-Qwen3-30B-A3B-Instruct-2507, MoE-Qwen3-30B-A3B-Thinking-2507
[21:30:46.095] Parallel requests: 20
[21:30:46.095] Iterations: 100

[21:30:46.318] WARN: WAVE 1: Launching 20 parallel requests...
[21:30:46.551] ERROR: REQ-1-3: /v1/models failed: HTTP 503 (231ms)
[21:30:46.551] ERROR: REQ-1-9: /v1/models failed: HTTP 503 (231ms)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
[21:30:46.559] ERROR: REQ-1-15: /v1/models failed: HTTP 503 (238ms)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
[21:30:46.560] ERROR: REQ-1-18: /v1/models failed: HTTP 503 (239ms)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
[21:30:46.563] ERROR: REQ-1-5: Completion failed: HTTP 503 (238ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:30:46.567] ERROR: REQ-1-6: /v1/models failed: HTTP 503 (246ms)
[21:30:46.567] ERROR: REQ-1-12: /v1/models failed: HTTP 503 (246ms)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
[21:30:46.568] ERROR: REQ-1-4: Completion failed: HTTP 503 (243ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:30:46.723] ERROR: REQ-1-16: Completion failed: HTTP 503 (398ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:30:46.771] ERROR: REQ-1-11: Completion failed: HTTP 503 (445ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:30:46.775] ERROR: REQ-1-10: Completion failed: HTTP 503 (449ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:30:46.777] ERROR: REQ-1-1: Completion failed: HTTP 503 (452ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:30:46.780] ERROR: REQ-1-7: Completion failed: HTTP 503 (454ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:30:46.782] ERROR: REQ-1-8: Completion failed: HTTP 503 (456ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:30:46.785] ERROR: REQ-1-14: Completion failed: HTTP 503 (459ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:30:46.785] ERROR: REQ-1-2: Completion failed: HTTP 503 (459ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:30:46.786] ERROR: REQ-1-20: Completion failed: HTTP 503 (460ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:30:46.791] ERROR: REQ-1-17: Completion failed: HTTP 503 (465ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:30:46.797] ERROR: REQ-1-13: Completion failed: HTTP 503 (471ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:30:46.950] ERROR: REQ-1-19: Completion failed: HTTP 503 (623ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:30:46.951] ERROR: WAVE 1: 20/20 requests failed
[21:30:47.002] WARN: WAVE 2: Launching 20 parallel requests...
^C
(root|~)

The script seems to be working; perhaps a little too heavy-handed. We'll see what happens with the PR!
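As an aside, the status/body split the script relies on (appending the HTTP code on its own line via curl's `-w "\n%{http_code}"`, then peeling it off with `tail`/`head`) can be checked without a live server; a minimal sketch on a fake response (GNU `head -n -1` assumed):

```shell
#!/bin/bash
# Sketch: the same parse logic race.sh applies to curl output,
# exercised on a fabricated response instead of a live server.
parse_response() {
    local response="$1"
    local http_code=$(echo "$response" | tail -1)      # last line: status code
    local body=$(echo "$response" | head -n -1)        # everything before it
    echo "code=$http_code body_lines=$(echo "$body" | wc -l)"
}

# Simulate what curl -s -w "\n%{http_code}" produces: body, newline, code
fake=$(printf '{"data":[]}\n200')
parse_response "$fake"   # -> code=200 body_lines=1
```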

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 21, 2025

With this PR

I'm surprised by the improvement, because I really overloaded the server with this LLM-made script. I think I need to narrow things down, but it's good proof the PR makes things way more reliable! There is some strange behavior remaining, like the HTTP 000 errors:

(root|~) ./race.sh
[21:40:00.674] === llama.cpp Data Race Hunter ===
[21:40:00.674] Target: https://www.serveurperso.com/ia/webui
[21:40:00.675] Models: MoE-Qwen3-30B-A3B-Instruct-2507, MoE-Qwen3-30B-A3B-Thinking-2507
[21:40:00.676] Parallel requests: 20
[21:40:00.676] Iterations: 100

[21:40:00.923] WARN: WAVE 1: Launching 20 parallel requests...
[21:40:01.176] REQ-1-3: /v1/models OK (240ms)
[21:40:01.179] REQ-1-18: /v1/models OK (239ms)
[21:40:01.180] REQ-1-12: /v1/models OK (239ms)
[21:40:01.180] REQ-1-9: /v1/models OK (239ms)
[21:40:01.181] REQ-1-15: /v1/models OK (239ms)
[21:40:01.182] REQ-1-6: /v1/models OK (239ms)
[21:40:10.787] ERROR: REQ-1-14: Completion failed: HTTP 500 (9849ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:10.788] ERROR: REQ-1-4: Completion failed: HTTP 500 (9849ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:10.788] ERROR: REQ-1-10: Completion failed: HTTP 500 (9849ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:10.788] ERROR: REQ-1-20: Completion failed: HTTP 500 (9849ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:10.788] ERROR: REQ-1-8: Completion failed: HTTP 500 (9849ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:10.788] ERROR: REQ-1-16: Completion failed: HTTP 500 (9849ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:10.788] ERROR: REQ-1-2: Completion failed: HTTP 500 (9849ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:10.945] ERROR: REQ-1-19: Completion failed: HTTP 000 (10006ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:10.945] ERROR: REQ-1-11: Completion failed: HTTP 000 (10007ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:10.945] ERROR: REQ-1-7: Completion failed: HTTP 000 (10007ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:10.945] ERROR: REQ-1-5: Completion failed: HTTP 000 (10007ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:10.945] ERROR: REQ-1-13: Completion failed: HTTP 000 (10007ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:10.946] ERROR: REQ-1-17: Completion failed: HTTP 000 (10007ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:10.946] ERROR: REQ-1-1: Completion failed: HTTP 000 (10007ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:10.946] ERROR: WAVE 1: 14/20 requests failed
[21:40:10.997] WARN: WAVE 2: Launching 20 parallel requests...
[21:40:11.250] REQ-2-12: /v1/models OK (239ms)
[21:40:11.250] REQ-2-15: /v1/models OK (239ms)
[21:40:11.257] REQ-2-18: /v1/models OK (246ms)
[21:40:11.263] REQ-2-9: /v1/models OK (252ms)
[21:40:11.265] REQ-2-6: /v1/models OK (255ms)
[21:40:11.266] REQ-2-3: /v1/models OK (255ms)
[21:40:11.401] ERROR: REQ-2-1: Completion failed: HTTP 500 (396ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:11.401] ERROR: REQ-2-5: Completion failed: HTTP 500 (396ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:11.401] ERROR: REQ-2-7: Completion failed: HTTP 500 (395ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:11.401] ERROR: REQ-2-11: Completion failed: HTTP 500 (396ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:11.401] ERROR: REQ-2-13: Completion failed: HTTP 500 (395ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:11.401] ERROR: REQ-2-19: Completion failed: HTTP 500 (394ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:11.401] ERROR: REQ-2-17: Completion failed: HTTP 500 (395ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:13.523] REQ-2-2: Completion OK (2518ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:13.608] REQ-2-10: Completion OK (2603ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:13.608] REQ-2-14: Completion OK (2604ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:13.609] REQ-2-16: Completion OK (2602ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:13.869] REQ-2-20: Completion OK (2863ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:13.879] REQ-2-8: Completion OK (2873ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:13.879] REQ-2-4: Completion OK (2873ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:13.880] ERROR: WAVE 2: 7/20 requests failed
[21:40:13.931] WARN: WAVE 3: Launching 20 parallel requests...
[21:40:14.179] REQ-3-6: /v1/models OK (234ms)
[21:40:14.179] REQ-3-12: /v1/models OK (235ms)
[21:40:14.180] REQ-3-3: /v1/models OK (236ms)
[21:40:14.182] REQ-3-15: /v1/models OK (237ms)
[21:40:14.188] REQ-3-9: /v1/models OK (241ms)
[21:40:14.196] REQ-3-18: /v1/models OK (249ms)
[21:40:14.280] REQ-3-4: Completion OK (341ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:14.879] ERROR: REQ-3-14: Completion failed: HTTP 500 (939ms, model: MoE-Qwen3-30B-A3B-Instruct-2507)
[21:40:14.879] ERROR: REQ-3-17: Completion failed: HTTP 500 (939ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:14.879] ERROR: REQ-3-1: Completion failed: HTTP 500 (941ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:14.880] ERROR: REQ-3-5: Completion failed: HTTP 500 (941ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:14.880] ERROR: REQ-3-11: Completion failed: HTTP 500 (939ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:14.880] ERROR: REQ-3-7: Completion failed: HTTP 500 (941ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:15.516] ERROR: REQ-3-19: Completion failed: HTTP 500 (1576ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
[21:40:15.516] ERROR: REQ-3-13: Completion failed: HTTP 500 (1576ms, model: MoE-Qwen3-30B-A3B-Thinking-2507)
^C
(root|~)

@ServeurpersoCom
Collaborator

I'm trying to narrow down the HTTP 000 case

@ngxson
Collaborator Author

ngxson commented Dec 21, 2025

Quite interesting, I think this script can be useful for testing changes related to batching too.

btw, looking at your report "No PR results", I suppose the 503 errors were because the test doesn't wait until the server starts, right?
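If that's the cause, a small readiness gate at the top of main() would rule it out. A minimal sketch (the probe command is a parameter; using the server's /health endpoint and localhost:8080 as the probe is an assumption about the deployment):

```shell
#!/bin/bash
# Sketch: poll a probe command until it succeeds or a deadline passes.
# Usage: wait_ready <timeout_seconds> <command...>
wait_ready() {
    local deadline=$(( $(date +%s) + $1 )); shift
    until "$@" >/dev/null 2>&1; do
        # give up once the deadline is reached
        (( $(date +%s) >= deadline )) && return 1
        sleep 0.2
    done
}

# Example probe (URL and endpoint are assumptions about the deployment):
# wait_ready 30 curl -sf "http://localhost:8080/health" || exit 1
```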

@ngxson
Collaborator Author

ngxson commented Dec 21, 2025

The HTTP 000 case has a timeout of exactly 10 seconds, which seems quite suspicious; probably a curl timeout, with the error code defaulting to 000?

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 21, 2025

quite interesting, I think this script can be useful to test changes related to batching too
btw, looking at your report "No PR results", I suppose the 503 error was because the test don't wait until server starts, right?

The server was already started. I think the 503 errors in the "No PR results" run are the data race your PR fixes!
Your PR works perfectly for the initial data concurrency problem; in any case it does not prevent my server from working AND it improves reliability, even against a script coded in a brute-force/DoS cyberattack style.

@ServeurpersoCom
Collaborator

The HTTP 000 case has a timeout of exactly 10 seconds which seems a quite suspicious, probably curl timeout and the error code is defaulted to 000?

So yes, the HTTP 000 at 10s is definitely a curl timeout with the error code defaulting to 000. The HTTP 000 at 2s is, I think, a reverse proxy timeout.
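Consistent with that reading: `%{http_code}` comes out as 000 whenever curl fails before any HTTP status line arrives, and curl's own exit code then carries the real cause (7 = connection refused, 28 = timed out). This can be reproduced locally against a closed port, no server needed:

```shell
#!/bin/bash
# Sketch: curl writes 000 for %{http_code} when no status line was received;
# the exit code distinguishes the failure (7 = refused, 28 = timed out).
# Port 1 on loopback is assumed to have nothing listening.
code=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 1 "http://127.0.0.1:1/")
rc=$?
echo "http_code=$code curl_exit=$rc"   # expect http_code=000, typically rc=7
```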

@ServeurpersoCom
Collaborator

Better script:

#!/bin/bash
# Stress test
# Tests concurrent access to /v1/models and /v1/chat/completions
# to verify thread-safety of server_context_meta

set -o pipefail

readonly BASE_URL="https://www.serveurperso.com/ia/webui"
readonly MODEL_A="MoE-Qwen3-30B-A3B-Instruct-2507"
readonly MODEL_B="MoE-Qwen3-30B-A3B-Thinking-2507"
readonly ITERATIONS=100
readonly PARALLEL_REQUESTS=20

# Timeout configuration
# /v1/models should respond quickly (metadata read-only)
readonly MODELS_TIMEOUT=5
# Completions can take longer under load
readonly COMPLETION_TIMEOUT=15

# Colors for logging
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
CYAN='\033[0;36m'
NC='\033[0m'

log() { echo -e "${GREEN}[$(date '+%H:%M:%S.%3N')]${NC} $*"; }
err() { echo -e "${RED}[$(date '+%H:%M:%S.%3N')] ERROR:${NC} $*" >&2; }
warn() { echo -e "${YELLOW}[$(date '+%H:%M:%S.%3N')] WARN:${NC} $*"; }
info() { echo -e "${CYAN}[$(date '+%H:%M:%S.%3N')] INFO:${NC} $*"; }

# Generate heavy payload to force cache_prompt processing
generate_payload() {
    python3 -c "print('Test '*500)"
}

# Test /v1/models endpoint - main target for data race detection
# This endpoint reads server metadata and was vulnerable to concurrent access
hammer_models_endpoint() {
    local req_id=$1
    local start=$(date +%s%N)

    local response=$(curl -s -w "\n%{http_code}" \
        "${BASE_URL}/v1/models" \
        -H "Content-Type: application/json" \
        --connect-timeout 2 \
        --max-time $MODELS_TIMEOUT 2>&1)

    local http_code=$(echo "$response" | tail -1)
    local body=$(echo "$response" | head -n -1)
    local duration=$((($(date +%s%N) - start) / 1000000))

    if [[ "$http_code" != "200" ]]; then
        # HTTP 000 with duration < MODELS_TIMEOUT indicates server-side timeout (potential bug)
        # HTTP 000 with duration >= MODELS_TIMEOUT is curl timeout (expected under extreme load)
        if [[ "$http_code" == "000" ]] && (( duration < MODELS_TIMEOUT * 1000 )); then
            err "REQ-${req_id}: /v1/models SERVER TIMEOUT: HTTP $http_code (${duration}ms) - server closed connection early!"
        else
            err "REQ-${req_id}: /v1/models failed: HTTP $http_code (${duration}ms)"
        fi
        echo "$body" | head -5
        return 1
    fi

    # Verify JSON integrity - data race can cause corrupted responses
    if ! echo "$body" | jq -e '.data | length' >/dev/null 2>&1; then
        err "REQ-${req_id}: Invalid JSON response (possible data race!)"
        return 1
    fi

    log "REQ-${req_id}: /v1/models OK (${duration}ms)"
}

# Test completion endpoint to trigger server state transitions
# Heavy payload forces queue pressure and slot allocation
hammer_completion() {
    local req_id=$1
    local model=$2
    local payload=$(generate_payload)

    local start=$(date +%s%N)
    local response=$(curl -s -w "\n%{http_code}" -N \
        "${BASE_URL}/v1/chat/completions" \
        -H "Content-Type: application/json" \
        -d '{
            "model": "'"$model"'",
            "messages": [{"role": "user", "content": "'"$payload"'"}],
            "stream": true,
            "max_tokens": 10,
            "cache_prompt": false
        }' \
        --connect-timeout 2 \
        --max-time $COMPLETION_TIMEOUT 2>&1)

    local http_code=$(echo "$response" | tail -1)
    local duration=$((($(date +%s%N) - start) / 1000000))

    if [[ "$http_code" != "200" ]]; then
        # HTTP 500 is expected when queue is full (fail-fast behavior)
        # HTTP 000 with early timeout indicates server issue
        if [[ "$http_code" == "500" ]]; then
            warn "REQ-${req_id}: Completion queue full: HTTP $http_code (${duration}ms, model: $model) - expected under load"
        elif [[ "$http_code" == "000" ]] && (( duration < COMPLETION_TIMEOUT * 1000 )); then
            err "REQ-${req_id}: Completion SERVER TIMEOUT: HTTP $http_code (${duration}ms, model: $model) - server closed connection early!"
        else
            err "REQ-${req_id}: Completion failed: HTTP $http_code (${duration}ms, model: $model)"
        fi
        return 1
    fi

    log "REQ-${req_id}: Completion OK (${duration}ms, model: $model)"
}

# Parallel assault wave - mix of /v1/models and completions
# This simulates real-world concurrent access patterns
parallel_assault() {
    local wave=$1
    local pids=()
    local failures=0

    info "WAVE $wave: Launching $PARALLEL_REQUESTS parallel requests (mixed /v1/models + completions)"

    # Launch requests in parallel
    # Pattern: 1/3 are /v1/models, 2/3 are completions alternating between MODEL_A and MODEL_B
    for i in $(seq 1 $PARALLEL_REQUESTS); do
        local req_id="${wave}-${i}"

        if (( i % 3 == 0 )); then
            # Test /v1/models endpoint (data race target)
            hammer_models_endpoint "$req_id" &
        elif (( i % 2 == 0 )); then
            # Test completions with MODEL_A
            hammer_completion "$req_id" "$MODEL_A" &
        else
            # Test completions with MODEL_B (forces model switching in multi-model setup)
            hammer_completion "$req_id" "$MODEL_B" &
        fi

        pids+=($!)
    done

    # Wait for all requests and count failures
    for pid in "${pids[@]}"; do
        if ! wait "$pid"; then
            ((failures++))
        fi
    done

    if (( failures > 0 )); then
        err "WAVE $wave: $failures/$PARALLEL_REQUESTS requests failed"
        return 1
    else
        log "WAVE $wave: ALL $PARALLEL_REQUESTS requests succeeded"
        return 0
    fi
}

# Test race condition during model context changes
# This triggers server state transitions while hammering /v1/models
race_during_model_transition() {
    info "Phase 2: Testing /v1/models stability during model transitions"

    # Trigger model activity with MODEL_B
    hammer_completion "TRANSITION-1" "$MODEL_B" &
    local trigger_pid=$!

    # Immediately hammer /v1/models while server handles the completion
    sleep 0.1
    for i in {1..10}; do
        hammer_models_endpoint "TRANSITION-${i}" &
    done

    wait
}

# Main stress test
main() {
    log "=== llama.cpp Data Race Stress Test ==="
    log "Purpose: Detect thread-safety issues in server_context metadata access"
    log "Target: $BASE_URL"
    log "Models: $MODEL_A, $MODEL_B"
    log "Parallel requests per wave: $PARALLEL_REQUESTS"
    log "Total waves: $ITERATIONS"
    log "Timeouts: /v1/models=${MODELS_TIMEOUT}s, completions=${COMPLETION_TIMEOUT}s"
    echo

    # Pre-flight check
    info "Checking server availability..."
    if ! curl -s --connect-timeout 2 "${BASE_URL}/v1/models" >/dev/null; then
        err "Server unreachable at $BASE_URL"
        exit 1
    fi
    log "Server is reachable"
    echo

    local total_failures=0
    local start_time=$(date +%s)

    # Phase 1: Sustained parallel assault
    info "=== Phase 1: Sustained parallel assault ($ITERATIONS waves) ==="
    info "Each wave: $((PARALLEL_REQUESTS / 3)) /v1/models + $((PARALLEL_REQUESTS - PARALLEL_REQUESTS / 3)) completions"
    echo

    for wave in $(seq 1 $ITERATIONS); do
        if ! parallel_assault "$wave"; then
            ((total_failures++))
        fi

        # Small delay to observe server state transitions
        sleep 0.05
    done

    echo
    # Phase 2: Race during transitions
    warn "=== Phase 2: Testing during model transitions ==="
    race_during_model_transition

    # Final report
    local duration=$(($(date +%s) - start_time))
    echo
    log "=== Test completed in ${duration}s ==="
    log "Total waves: $ITERATIONS"
    log "Failed waves: $total_failures"
    echo

    if (( total_failures > 0 )); then
        err "FAILED: $total_failures/$ITERATIONS waves had failures"
        err "Possible data race detected - check server logs for:"
        err "  - ThreadSanitizer warnings (if compiled with -fsanitize=thread)"
        err "  - Crashes or assertion failures"
        err "  - Corrupted JSON responses"
        err "  - SERVER TIMEOUT messages (connection closed before curl timeout)"
        exit 1
    else
        log "SUCCESS: All $ITERATIONS waves passed"
        log "No data race detected in this test run"
        log "For comprehensive validation, run with ThreadSanitizer enabled"
        exit 0
    fi
}

main "$@"
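The per-PID wait pattern both versions use to count failures (collecting each `$!` into an array, then `wait`-ing every PID individually so each job's exit status is observed) can be verified in isolation with dummy jobs:

```shell
#!/bin/bash
# Sketch: count failed background jobs the same way parallel_assault does,
# using true/false as stand-ins for the curl workers.
count_failures() {
    local pids=() failures=0
    for job in "$@"; do
        $job &               # launch each dummy job in the background
        pids+=($!)
    done
    for pid in "${pids[@]}"; do
        wait "$pid" || ((failures++))   # wait per PID to get its exit status
    done
    echo "$failures"
}

count_failures true false true false false   # -> 3
```

A bare `wait` (as in race_during_model_transition) would discard the individual statuses, which is why the script waits on each PID separately.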

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Dec 21, 2025

I've already restarted the Windows runner. I'll have to test it on my Windows machine! I'm trying a server-queue.cpp/h patch.

@ngxson
Copy link
Collaborator Author

ngxson commented Dec 21, 2025

I try a server-queue.cpp/h patch

If that was for the GGML_ASSERT(idx < states.size()) error, I hope the last commit will fix it

@ServeurpersoCom
Collaborator

prompt processing progress, n_tokens = 1355, batch.n_tokens = 1355, progress = 1.000000
[59717] slot update_slots: id 3 | task 0 | prompt done, n_tokens = 1355, batch.n_tokens = 1355
srv operator(): http client error: Failed to read connection
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv operator(): instance name=Dense-Devstral-Small-2-24B-Instruct-2512 exited with status 1
srv log_server_r: request: GET /v1/models 127.0.0.1 200

From my smartphone; I'll check tomorrow morning.

@ngxson
Collaborator Author

ngxson commented Dec 21, 2025

I spotted some more bugs on the way, fixed in the last commit(s):

  1. index is required by server_response_reader but is defaulted to -1, which causes some crashes. The hotfix is to default it to 0, but the proper fix (left as a TODO) is to get rid of the index altogether.
  2. The server_http_req object was deleted too soon. In 121c7e7, I fixed this by tying its lifecycle to the response object. This should mimic the exact behavior of httplib's res and req objects.

Edit: hmm, the LoRA endpoints can also cause a data race, the same way /models did. They need to be fixed too.

@ngxson
Collaborator Author

ngxson commented Dec 22, 2025

So it turns out the endpoints for lora hot-swap can also cause a data race, as they read data directly off server_context. I refactored a large part of the lora handling to make it safe.

@ServeurpersoCom
Copy link
Collaborator

Compared to what I did from the phone (where the server wasn't working), after these 5 commits everything works: the latest DoS script no longer shows any HTTP 000 errors, and the server recovers after a while. It seems more robust.


Successfully merging this pull request may close these issues.

server: (bug) data race on /v1/models and LoRA endpoints
