Skip to content

Conversation

@m1a2st
Copy link
Collaborator

@m1a2st m1a2st commented Dec 19, 2025

In Transaction Version 2, strict epoch validation (markerEpoch > currentEpoch) causes hanging transactions in two scenarios:

  1. Coordinator recovery: When reloading PREPARE_COMMIT/ABORT from
    transaction log, retried markers are rejected with
    InvalidProducerEpochException because they use the same epoch
  2. Network retry: When marker write succeeds but response is lost,
    coordinator retries are rejected for the same reason

Both cases leave transactions permanently hanging in PREPARE state,
causing clients to fail with CONCURRENT_TRANSACTIONS.

Detect idempotent marker retries in
ProducerStateManager.checkProducerEpoch() by checking three
conditions:

  1. Transaction Version ≥ 2
  2. markerEpoch == currentEpoch (same epoch)
  3. currentTxnFirstOffset is empty (transaction already completed)

When all conditions are met, treat the marker as a successful idempotent
retry instead of throwing an error.

@github-actions github-actions bot added triage PRs from the community core Kafka Broker storage Pull requests that target the storage module clients labels Dec 19, 2025
@m1a2st m1a2st changed the title KAFKA-19999 Transaction coordinator livelock caused by invalid producer epoch KAFKA-19999 Transaction coordinator livelock caused by invalid producer epoch [WIP] Dec 19, 2025
@m1a2st
Copy link
Collaborator Author

m1a2st commented Dec 19, 2025

Test the tests/kafkatest/tests/core/transactions_upgrade_test.py::TransactionsUpgradeTest.test_transactions_upgrade test all pass.

test_id: kafkatest.tests.core.transactions_upgrade_test.TransactionsUpgradeTest.test_transactions_upgrade.from_kafka_version=3.3.2.metadata_quorum=ISOLATED_KRAFT.group_protocol=None
status: PASS
run time: 6 minutes 47.209 seconds

test_id: kafkatest.tests.core.transactions_upgrade_test.TransactionsUpgradeTest.test_transactions_upgrade.from_kafka_version=4.1.1.metadata_quorum=ISOLATED_KRAFT.group_protocol=None
status: PASS
run time: 8 minutes 3.934 seconds

test_id: kafkatest.tests.core.transactions_upgrade_test.TransactionsUpgradeTest.test_transactions_upgrade.from_kafka_version=3.4.1.metadata_quorum=ISOLATED_KRAFT.group_protocol=None
status: PASS
run time: 7 minutes 44.717 seconds

test_id: kafkatest.tests.core.transactions_upgrade_test.TransactionsUpgradeTest.test_transactions_upgrade.from_kafka_version=3.5.2.metadata_quorum=ISOLATED_KRAFT.group_protocol=None
status: PASS
run time: 7 minutes 1.438 seconds

test_id: kafkatest.tests.core.transactions_upgrade_test.TransactionsUpgradeTest.test_transactions_upgrade.from_kafka_version=3.6.2.metadata_quorum=ISOLATED_KRAFT.group_protocol=None
status: PASS
run time: 6 minutes 44.217 seconds

test_id: kafkatest.tests.core.transactions_upgrade_test.TransactionsUpgradeTest.test_transactions_upgrade.from_kafka_version=3.7.2.metadata_quorum=ISOLATED_KRAFT.group_protocol=None
status: PASS
run time: 7 minutes 44.566 seconds

test_id: kafkatest.tests.core.transactions_upgrade_test.TransactionsUpgradeTest.test_transactions_upgrade.from_kafka_version=3.9.1.metadata_quorum=ISOLATED_KRAFT.group_protocol=None
status: PASS
run time: 7 minutes 49.853 seconds

test_id: kafkatest.tests.core.transactions_upgrade_test.TransactionsUpgradeTest.test_transactions_upgrade.from_kafka_version=3.8.1.metadata_quorum=ISOLATED_KRAFT.group_protocol=None
status: PASS
run time: 7 minutes 11.152 seconds

test_id: kafkatest.tests.core.transactions_upgrade_test.TransactionsUpgradeTest.test_transactions_upgrade.from_kafka_version=4.0.1.metadata_quorum=ISOLATED_KRAFT.group_protocol=None
status: PASS
run time: 7 minutes 43.566 seconds

@chia7712 chia7712 requested a review from satishd December 19, 2025 17:25
Copy link
Member

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@m1a2st thanks for this quick fix.

@chia7712 chia7712 requested review from jolshan and removed request for satishd December 20, 2025 01:54
@github-actions github-actions bot removed the triage PRs from the community label Dec 20, 2025
@m1a2st m1a2st changed the title KAFKA-19999 Transaction coordinator livelock caused by invalid producer epoch [WIP] KAFKA-19999 Transaction coordinator livelock caused by invalid producer epoch Dec 20, 2025
Errors.NOT_LEADER_OR_FOLLOWER
case error =>
error
if (IdempotentTransactionMarkerException.isInstanceOf(exception))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please also add comments

// TV2 Idempotent Retry Detection:
// When markerEpoch == currentEpoch and no transaction is ongoing, this is a retry
// of a marker that was already successfully written. Common scenarios:
// 1. Coordinator recovery: reloading PREPARE_COMMIT/ABORT from transaction log
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: from the transaction log

public List<RecordError> recordErrors;
public String errorMessage;
public ProduceResponseData.LeaderIdAndEpoch currentLeader;
private Throwable exception;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception field might be feeling lonely since it was not invited to the hashCode, equals, and toString party

// Verify initial state
ProducerStateEntry initialEntry = getLastEntryOrElseThrownByProducerId(stateManager, producerId);
assertEquals(0, initialEntry.producerEpoch());
assertEquals(2, initialEntry.lastSeq());
assertFalse(initialEntry.isEmpty()); // Has batch metadata

ProducerAppendInfo appendInfo = stateManager.prepareUpdate(producerId, AppendOrigin.CLIENT);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you mind removing those unrelated changes?

@m1a2st
Copy link
Collaborator Author

m1a2st commented Dec 21, 2025

Thanks for @chia7712 review, I have addressed all comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

clients core Kafka Broker storage Pull requests that target the storage module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants