@weizhouapache
Member

Description

This PR fixes #5208 and #3613.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

@weizhouapache
Member Author

@blueorangutan package

@blueorangutan

@weizhouapache a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✖️ el8 ✔️ debian. SL-JID 883

@weizhouapache
Member Author

@blueorangutan test matrix

@blueorangutan

@weizhouapache a Trillian-Jenkins matrix job (centos7 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-1660)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 34271 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5319-t1660-xenserver-71.zip
Intermittent failure detected: /marvin/tests/smoke/test_host_maintenance.py
Smoke tests completed. 87 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@blueorangutan

Trillian test result (tid-1662)
Environment: vmware-65u2 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42699 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5319-t1662-vmware-65u2.zip
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Intermittent failure detected: /marvin/tests/smoke/test_network.py
Intermittent failure detected: /marvin/tests/smoke/test_resource_accounting.py
Smoke tests completed. 85 look OK, 2 have error(s)
Only failed tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| ContextSuite context=TestRouterRules>:teardown | Error | 148.55 | test_network.py |
| test_03_deploy_and_scale_kubernetes_cluster | Error | 167.71 | test_kubernetes_clusters.py |
| test_07_deploy_kubernetes_ha_cluster | Failure | 0.04 | test_kubernetes_clusters.py |
| test_08_deploy_and_upgrade_kubernetes_ha_cluster | Failure | 0.04 | test_kubernetes_clusters.py |
| test_09_delete_kubernetes_ha_cluster | Failure | 0.04 | test_kubernetes_clusters.py |
| ContextSuite context=TestKubernetesCluster>:teardown | Error | 46.10 | test_kubernetes_clusters.py |

@blueorangutan

Trillian test result (tid-1661)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 50758 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5319-t1661-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_internal_lb.py
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Intermittent failure detected: /marvin/tests/smoke/test_routers_network_ops.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 84 look OK, 3 have error(s)
Only failed tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true | Failure | 313.46 | test_routers_network_ops.py |
| test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false | Failure | 311.95 | test_routers_network_ops.py |
| test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Failure | 427.85 | test_vpc_redundant.py |
| test_05_rvpc_multi_tiers | Failure | 476.03 | test_vpc_redundant.py |
| test_disable_oobm_ha_state_ineligible | Error | 1511.55 | test_hostha_kvm.py |

@weizhouapache
Member Author

@blueorangutan test

@blueorangutan

@weizhouapache a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-1675)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33326 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5319-t1675-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Smoke tests completed. 87 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@weizhouapache marked this pull request as ready for review August 18, 2021 11:13
Contributor

@nvazquez left a comment

LGTM

Contributor

@davidjumani left a comment

Verified.
#5208 is fixed: brought up ~15 VMs with no issues.
#3613: on reload, VMs still get their IP.

self.delete_leases()

self.delete_leases()
self.write_hosts()
Member

@weizhouapache can you explain what this line does? Isn't deleting leases going to cause any regression?

Member Author

@rhtyd
The method delete_leases was mainly introduced by #3351. It removes entries from the dnsmasq.leases file for VMs which have been removed (no impact on existing VMs).

In my opinion, it will not cause any regression. The only issue I see is that execution time will increase by a few milliseconds when a VM is stopped/started (delete_leases is currently not triggered in that case).

def delete_leases(self):
    macs_dhcphosts = []
    try:
        logging.info("Attempting to delete entries from dnsmasq.leases file for VMs which are not on dhcphosts file")
        for host in open(DHCP_HOSTS):
            macs_dhcphosts.append(host.split(',')[0])
        removed = 0
        for leaseline in open(LEASES):
            lease = leaseline.split(' ')
            mac = lease[1]
            ip = lease[2]
            if mac not in macs_dhcphosts:
                cmd = "dhcp_release $(ip route get %s | grep eth | head -1 | awk '{print $3}') %s %s" % (ip, ip, mac)
                logging.info(cmd)
                CsHelper.execute(cmd)
                removed = removed + 1
                self.del_host(ip)
        logging.info("Deleted %s entries from dnsmasq.leases file" % str(removed))
    except Exception as e:
        logging.error("Caught error while trying to delete entries from dnsmasq.leases file: %s" % e)
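
The lease-selection step of this method can be sketched in isolation. This is only an illustration, not code from the PR: `find_stale_leases` is a hypothetical helper, and the sample lines assume dhcphosts entries that start with the MAC address and lease lines of the form `expiry mac ip hostname client-id`, matching the field indexes used above.

```python
def find_stale_leases(dhcphosts_lines, lease_lines):
    """Return (mac, ip) pairs that appear in the leases but not in dhcphosts."""
    # The MAC address is assumed to be the first comma-separated field.
    known_macs = {line.split(',')[0].strip() for line in dhcphosts_lines if line.strip()}
    stale = []
    for line in lease_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mac, ip = fields[1], fields[2]
        if mac not in known_macs:
            stale.append((mac, ip))
    return stale


# Made-up sample data: vm-2 has a lease but no dhcphosts entry.
dhcphosts = ["02:00:00:00:00:01,10.1.1.10,vm-1,infinite\n"]
leases = [
    "1629300000 02:00:00:00:00:01 10.1.1.10 vm-1 *\n",
    "1629300000 02:00:00:00:00:02 10.1.1.11 vm-2 *\n",
]
stale = find_stale_leases(dhcphosts, leases)
print(stale)
```

Only the pairs returned here would be released; entries for VMs still listed in dhcphosts are left alone, which is why existing VMs are not affected.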

Member Author

@rhtyd anyway, I added a new commit to address your comment.

Leases will now be deleted only when one of the config files has changed.
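
One way to picture "delete leases only when a config file changed" is sketched below. It is an illustration under assumptions, not the change made in the commit: `write_if_changed` and `apply_dhcp_config` are hypothetical names, and the cleanup is passed in as a callback standing in for `delete_leases`.

```python
import os
import tempfile


def write_if_changed(path, content):
    """Write content to path only if it differs; return True when the file changed."""
    try:
        with open(path) as f:
            if f.read() == content:
                return False
    except FileNotFoundError:
        pass  # a missing file counts as a change
    with open(path, "w") as f:
        f.write(content)
    return True


def apply_dhcp_config(files, on_change):
    """Write every config file, then run on_change once if any of them changed."""
    # Build the full list first so every file is written, not just the first changed one.
    results = [write_if_changed(path, content) for path, content in files.items()]
    changed = any(results)
    if changed:
        on_change()
    return changed


calls = []
workdir = tempfile.mkdtemp()
config = {os.path.join(workdir, "dhcphosts.txt"): "02:00:00:00:00:01,10.1.1.10,vm-1\n"}

first = apply_dhcp_config(config, lambda: calls.append("delete_leases"))
second = apply_dhcp_config(config, lambda: calls.append("delete_leases"))
print(first, second, calls)
```

The second call writes identical content, so the callback is skipped; a plain stop/start that leaves the files unchanged would not pay the cleanup cost.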

Member

Great, thanks.

@yadvr added this to the 4.15.2.0 milestone Aug 19, 2021
@weizhouapache
Member Author

@blueorangutan package

@blueorangutan

@weizhouapache a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian. SL-JID 931

@weizhouapache
Member Author

@blueorangutan test

@blueorangutan

@weizhouapache a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@weizhouapache linked an issue Aug 19, 2021 that may be closed by this pull request
@blueorangutan

Trillian test result (tid-1720)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 32843 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5319-t1720-kvm-centos7.zip
Smoke tests completed. 87 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

Member

@yadvr left a comment

LGTM - did not test it.

@yadvr
Member

yadvr commented Aug 20, 2021

Suggested tests if not already done - please check/confirm @weizhouapache cc @nvazquez

Monitor the dnsmasq service and check the expected outcome for the following cases, in both an isolated network and a VPC tier (you may think of more cases):

  • Deploy two VMs, stop one VM (check dnsmasq status or cloud.log), start the stopped VM, stop the VM and destroy it
  • Deploy two VMs, destroy one VM, check dnsmasq entries/logs
  • Deploy two VMs, create a new network, plug a secondary NIC and repeat the steps above (to check dnsmasq behaviour when VM NICs are on secondary networks)
  • Deploy two VMs on two different VPC tiers, stop one VM, destroy another, start another VM in a new tier, start a VM in an existing tier

@weizhouapache
Member Author

@rhtyd @nvazquez I will test the scenarios.

@weizhouapache
Member Author

@rhtyd, cc @nvazquez
I have tested all the scenarios above; they work fine.

@nvazquez merged commit 16e4de0 into apache:4.15 Aug 25, 2021
@weizhouapache deleted the 4.15-vr-reload-dnsmasq branch December 9, 2022 08:45
Development

Successfully merging this pull request may close these issues.

VR: dnsmasq service start failed

5 participants