wolfman1956 Posted April 15, 2020

Good day community members,

As this is my first submission to this forum, and as I'm new to Kazoo and CouchDB, I will try to describe our situation as clearly and concisely as possible. Please feel free to ask for any additional information that is missing and would be relevant.

Situation: We are setting up a KAZOO pre-production cluster composed of 2 main KAZOO servers, 2 Kamailio servers, 2 FreeSWITCH servers, 2 RabbitMQ servers and a 6-node CouchDB cluster divided into 2 groups (3 primary nodes and 3 secondary nodes). Over time we got the environment up and running, except for one event for which we can't pinpoint a cause. When we run the command string (described further down), the load appears to complete correctly, but we see occasional "timeout" warnings at random points in the script's execution. We are now trying to find out what causes these warnings (full details and script output are further down).
Environment:
- 2 Kazoo servers with HAProxy (1 primary, 1 secondary)
- 2 FreeSWITCH servers
- 2 Kamailio servers
- 2 RabbitMQ servers
- 6 CouchDB nodes
  o 3 primaries
  o 3 secondaries

Attached are the files we investigated (7 in all), config files plus a few log extracts:
- default.ini
- local.ini
- vm.args
- prlimit command output
- tcpdump extract with "timeout" strings
- successful run output
- unsuccessful run output

Command string: The command string we run from one of the 2 masters is:

sup crossbar_maintenance init_apps /var/www/html/monster-ui/apps http://our.master1.server:8000/v2

Script execution: During the Kazoo/CouchDB startup process we see warnings with timeouts or failed saves on ".png" files while loading, yet the process still ends with an "OK" status. A successful run takes around 6 seconds; a run with "timeout" warnings takes anywhere from 25 seconds up to 1 minute and 40 seconds, but still ends with an "OK" status.

Sample run with "timeout" events:

[root@vlprodkazookapps01 ~]# time sup crossbar_maintenance init_apps /var/www/html/monster-ui/apps http://ourkazooserver.kazoo.ourdomain.ca:8000/v2
trying to init app from /var/www/html/monster-ui/apps/pbxs
app pbxs already loaded in system
not updating api_url, it is unchanged
no metadata changes for app 6f12f101f866c972d04a43f960ff41b7
saved PBXconnector_app.png to 6f12f101f866c972d04a43f960ff41b7
saved pbxconnector1.png to 6f12f101f866c972d04a43f960ff41b7
saved pbxconnector2.png to 6f12f101f866c972d04a43f960ff41b7
trying to init app from /var/www/html/monster-ui/apps/webhooks
app webhooks already loaded in system
not updating api_url, it is unchanged
no metadata changes for app e6e5774256edd2201f838865baa25ef3
saved WebHooks_app.png to e6e5774256edd2201f838865baa25ef3
saved webhooks1.png to e6e5774256edd2201f838865baa25ef3
saved webhooks2.png to e6e5774256edd2201f838865baa25ef3
trying to init app from /var/www/html/monster-ui/apps/fax
app fax already loaded in system
not updating api_url, it is unchanged
no metadata changes for app e21e143c33312aa3495558d7936242ec
failed to save Fax_app.png to e21e143c33312aa3495558d7936242ec: timeout
failed to save OutboundFaxes.png to e21e143c33312aa3495558d7936242ec: timeout
trying to init app from /var/www/html/monster-ui/apps/csv-onboarding
app csv-onboarding already loaded in system
not updating api_url, it is unchanged
no metadata changes for app bdca179bf1754c18f5d739dc91c859b6
failed to find icon in bdca179bf1754c18f5d739dc91c859b6
failed to find screenshots in bdca179bf1754c18f5d739dc91c859b6
trying to init app from /var/www/html/monster-ui/apps/accounts
app accounts already loaded in system
not updating api_url, it is unchanged
no metadata changes for app 1b7093aa6cbf7b50708027d03a106a12
saved Accounts_app.png to 1b7093aa6cbf7b50708027d03a106a12
saved Account-AvailableApps.png to 1b7093aa6cbf7b50708027d03a106a12
saved Account-Limits.png to 1b7093aa6cbf7b50708027d03a106a12
saved AccountOverview.png to 1b7093aa6cbf7b50708027d03a106a12
trying to init app from /var/www/html/monster-ui/apps/callflows
app callflows already loaded in system
not updating api_url, it is unchanged
no metadata changes for app 184e7c891e2b9c64e8f4e8eec5b00a7b
saved Callflows_app.png to 184e7c891e2b9c64e8f4e8eec5b00a7b
saved callflows_1.png to 184e7c891e2b9c64e8f4e8eec5b00a7b
saved callflows_2.png to 184e7c891e2b9c64e8f4e8eec5b00a7b
saved callflows_3.png to 184e7c891e2b9c64e8f4e8eec5b00a7b
trying to init app from /var/www/html/monster-ui/apps/voip
app voip already loaded in system
not updating api_url, it is unchanged
no metadata changes for app e9cc159ebcb93062e51c5f38604ec77f
failed to save SmartPBX_app.png to e9cc159ebcb93062e51c5f38604ec77f: timeout
failed to save smartpbx1.png to e9cc159ebcb93062e51c5f38604ec77f: timeout
saved smartpbx2.png to e9cc159ebcb93062e51c5f38604ec77f
saved smartpbx3.png to e9cc159ebcb93062e51c5f38604ec77f
saved smartpbx4.png to e9cc159ebcb93062e51c5f38604ec77f
saved smartpbx5.png to e9cc159ebcb93062e51c5f38604ec77f
trying to init app from /var/www/html/monster-ui/apps/numbers
app numbers already loaded in system
not updating api_url, it is unchanged
no metadata changes for app 8637570ad7b41bc7f3906882507094db
saved Numbers_app.png to 8637570ad7b41bc7f3906882507094db
saved numbers1.png to 8637570ad7b41bc7f3906882507094db
saved numbers2.png to 8637570ad7b41bc7f3906882507094db
trying to init app from /var/www/html/monster-ui/apps/voicemails
app voicemails already loaded in system
not updating api_url, it is unchanged
no metadata changes for app 48630c8173c977fdc460ae93e96010ec
saved Voicemail_app.png to 48630c8173c977fdc460ae93e96010ec
saved PlayVoicemail.png to 48630c8173c977fdc460ae93e96010ec
saved SelectedVoicemails.png to 48630c8173c977fdc460ae93e96010ec
ok

real 1m25.705s
user 0m0.425s
sys 0m0.105s

Extract from the tcpdump log captured while the script was running and a "timeout" event occurred:

{"_id":"ecallmgr","_rev":"19-64144d828aa26ea806e165f4089bf4cd","default":{"balance_crawler_enabled":false,"max_channel_cleanup_timeout_ms":60000,"acls":{"kamailio@vlprodkazoosbc02.kazoo.dsritelecom.ca":{"type":"allow","network-list-name":"authoritative","cidr":["192.168.11.11/32"],"ports":[5060,7000]},"kamailio@vlprodkazoosbc01.kazoo.dsritelecom.ca":{"type":"allow","network-list-name":"authoritative","cidr":["192.168.11.10/32"],"ports":[5060,7000]}},"max_channel_uptime_s":0,"fs_nodes":[],"fs_cmds_wait_ms":5000,"freeswitch_context":"context_2","call_routing_bindings":["context_2"],"text_routing_bindings":["context_2"],"event_stream_idle_alert":0,"tcp_packet_type":2,"publish_conference_event":["conference-create","conference-destroy","lock","unlock","add-member","del-member","stop-talking","start-talking","mute-member","unmute-member","deaf-member","undeaf-member"],"fs_node_uptime_s":600,"fs_cmds":[{"load":"mod_sofia"},{"reloadacl":""}],"acl_request_timeout_ms":2000,"acl_request_timeout_fudge_ms":100,"capabilities":[{"module":"mod_conference","is_loaded":false,"capability":"conference"},{"module":"mod_channel_move","is_loaded":false,"capability":"channel_move"},{"module":"mod_http_cache","is_loaded":false,"capability":"http_cache"},{"module":"mod_dptools","is_loaded":false,"capability":"dialplan"},{"module":"mod_sofia","is_loaded":false,"capability":"sip"},{"module":"mod_spandsp","is_loaded":false,"capability":"fax"},{"module":"mod_flite","is_loaded":false,"capability":"tts"},{"module":"mod_freetdm","is_loaded":false,"capability":"freetdm"},{"module":"mod_skypopen","is_loaded":false,"capability":"skype"},{"module":"mod_dingaling","is_loaded":false,"capability":"xmpp"},{"module":"mod_skinny","is_loaded":false,"capability":"skinny"},{"module":"mod_sms","is_loaded":false,"capability":"sms"}]},"pvt_account_id":"system_config","pvt_account_db":"system_config","pvt_created":63748236928,"pvt_modified":63750144111,"pvt_type":"config","pvt_node":"ecallmgr@vlprodkazookapps02.kazoo.dsritelecom.ca","pvt_document_hash":"cd56ed686a602c7692a1a523803883d7","ecallmgr@vlprodkazookapps01.kazoo.dsritelecom.ca":{"fs_nodes":["freeswitch@vlprodkazoomedia01.kazoo.dsritelecom.ca"]},"ecallmgr@vlprodkazookapps02.kazoo.dsritelecom.ca":{"fs_nodes":["freeswitch@vlprodkazoomedia02.kazoo.dsritelecom.ca"]}}

As we are to roll this environment into production testing soon, this remaining issue needs to be resolved first.

Best regards and thank you for any leads we can look into.

Attachments: default.ini, local.ini, prlimit.txt, vm.args, tcpdump_extract_timeout_sample.txt, successful_run_output.txt, unsuccessful_run_output.txt
2600Hz Employees mc_ Posted April 15, 2020

Adding the file SmartPBX_app.png to doc e9cc159ebcb93062e51c5f38604ec77f resulted in a timeout. I would check the local HAProxy logs to see what they have to say about DB requests to Couch for that ID, then check the actual Couch servers for that ID as well. There may be something lurking in those logs.

And welcome!!!
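The log check suggested above can be sketched as a plain substring search. This is only an illustration: it assumes the HAProxy and CouchDB log lines contain the document ID as plain text (for example in the request path), which the thread does not confirm, and the log file paths are whatever your hosts use. The function name `lines_mentioning` is made up for this sketch.

```python
import sys


def lines_mentioning(lines, doc_id, attachment=None):
    """Yield (line_number, text) for lines containing doc_id.

    If attachment is given, the line must mention it as well.
    Assumption: the ID appears verbatim in the log line.
    """
    for number, line in enumerate(lines, start=1):
        if doc_id not in line:
            continue
        if attachment is not None and attachment not in line:
            continue
        yield number, line.rstrip("\n")


def main(argv):
    # Hypothetical usage: find_doc.py DOC_ID LOG_FILE [LOG_FILE ...]
    if len(argv) < 3:
        print("usage: find_doc.py DOC_ID LOG_FILE [LOG_FILE ...]")
        return 1
    doc_id, paths = argv[1], argv[2:]
    for path in paths:
        with open(path, errors="replace") as handle:
            for number, text in lines_mentioning(handle, doc_id):
                print(f"{path}:{number}: {text}")
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Running it with e9cc159ebcb93062e51c5f38604ec77f against the HAProxy log on the Kazoo server and then against each CouchDB node's log would show whether the failed attachment requests reached Couch at all.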
wolfman1956 Posted April 16, 2020

Thank you very much mc_. We will pull that info and go through the suggested verifications. Stay tuned...

Rgds,
Serge
wolfman1956 Posted April 27, 2020 (edited)

Good day mc_,

CORRECTION: The Word document post was deleted and the data is now provided in the proper format. See next post.

Rgds.

(Edited April 27, 2020 by wolfman1956: invalid document format)
2600Hz Employees mc_ Posted April 27, 2020

@wolfman1956 you'll have better luck using plaintext; I don't open random MS docs posted to weird forums about telecom stuff...
wolfman1956 Posted April 27, 2020

No problem, duly noted. I'll familiarize myself with gists, but in the meantime here's a text version. Word was used mainly to color-outline keywords.

Attachment: logs_couchdb-haproxy_intermittent_timeouts.txt
wolfman1956 Posted April 27, 2020

Gist created: https://gist.github.com/wolfman1956/8aa3d76f7898cb4e882a002adea3e3a2
2600Hz Employees mc_ Posted April 27, 2020

Yup, nothing too helpful in that log. You may not have captured the logs of the timeout. In any case, keep an eye on it and hopefully you can capture it closer to live next time.
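One way to capture it "closer to live" is to record exact start and stop timestamps around each run, so the HAProxy, CouchDB and tcpdump captures can be narrowed to that window. The following is a minimal sketch, not a Kazoo tool: `run_timed` and `INIT_APPS_CMD` are names invented here, the command is the one quoted in the first post, and the timeout detection assumes failures print as lines ending in ": timeout", as in the sample run.

```python
import shutil
import subprocess
import time
from datetime import datetime, timezone

# Command as quoted in the first post; adjust the path and URL to your setup.
INIT_APPS_CMD = [
    "sup", "crossbar_maintenance", "init_apps",
    "/var/www/html/monster-ui/apps", "http://our.master1.server:8000/v2",
]


def run_timed(cmd):
    """Run cmd and return UTC start/stop times, duration, and timeout lines."""
    started = datetime.now(timezone.utc)
    t0 = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    duration_s = time.monotonic() - t0
    finished = datetime.now(timezone.utc)
    # Assumption: a failed save is reported as "...: timeout" on its own line.
    timeouts = [line.strip() for line in proc.stdout.splitlines()
                if line.strip().endswith(": timeout")]
    return {
        "started": started,
        "finished": finished,
        "duration_s": duration_s,
        "returncode": proc.returncode,
        "timeouts": timeouts,
    }


if __name__ == "__main__":
    if shutil.which("sup"):
        result = run_timed(INIT_APPS_CMD)
        print(result["started"].isoformat(), "->", result["finished"].isoformat())
        print(f"{result['duration_s']:.1f}s, {len(result['timeouts'])} timeout(s)")
        for line in result["timeouts"]:
            print("  ", line)
    else:
        print("sup not found on PATH; run this on a Kazoo apps server")
```

The printed UTC window can then be matched against the timestamps in the other logs, provided the clocks on all hosts are synchronized.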
wolfman1956 Posted April 28, 2020 (edited)

Good day mc_,

Actually, we timed it so that the captures were already running when we started our script, and we kept a very close eye on the start/stop times. The script ends successfully even with timeouts; what we observed are run times ranging from 5-6 seconds (no timeouts) up to 1m35s (with 2-3 timeouts). When timeouts occur, the output indicates that a few graphics files could not be either found or saved, as seen in the unsuccessful_run_output.txt file submitted earlier.

Addendum: our environment is a test one, so we can (re-)start or stop anything, anytime, for our testing purposes.

Question: is there anything we can pull from the several timeout statuses logged in the tcpdump file?

(Edited April 28, 2020 by wolfman1956: added question)
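To compare a clean run with one that hit timeouts, the init_apps output can be summarized per app. The sketch below is an illustration only: it assumes the raw terminal output with one message per line, in the shapes shown in the sample run from the first post, and `summarize_init_apps` is a name made up here.

```python
import re
from collections import OrderedDict

# Line shapes copied from the init_apps output quoted in this thread.
INIT_RE = re.compile(r"^trying to init app from (\S+)$")
SAVED_RE = re.compile(r"^saved (\S+) to ([0-9a-f]{32})$")
FAILED_RE = re.compile(r"^failed to save (\S+) to ([0-9a-f]{32}): (.+)$")


def summarize_init_apps(output):
    """Group saved and failed attachments by app directory.

    Returns an ordered mapping: app path -> {"saved": [...], "failed": [...]},
    where each failed entry is a (file name, reason) tuple.
    """
    summary = OrderedDict()
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        match = INIT_RE.match(line)
        if match:
            current = match.group(1)
            summary[current] = {"saved": [], "failed": []}
            continue
        if current is None:
            continue  # skip anything before the first app, e.g. the prompt line
        match = SAVED_RE.match(line)
        if match:
            summary[current]["saved"].append(match.group(1))
            continue
        match = FAILED_RE.match(line)
        if match:
            summary[current]["failed"].append((match.group(1), match.group(3)))
    return summary


if __name__ == "__main__":
    # A few lines taken from the sample run in the first post.
    sample = """\
trying to init app from /var/www/html/monster-ui/apps/fax
failed to save Fax_app.png to e21e143c33312aa3495558d7936242ec: timeout
failed to save OutboundFaxes.png to e21e143c33312aa3495558d7936242ec: timeout
trying to init app from /var/www/html/monster-ui/apps/numbers
saved Numbers_app.png to 8637570ad7b41bc7f3906882507094db
"""
    for app, result in summarize_init_apps(sample).items():
        print(app, len(result["saved"]), "saved,", len(result["failed"]), "failed")
```

Feeding it the contents of successful_run_output.txt and unsuccessful_run_output.txt would show whether the same apps and files fail each time or whether the failures move around, which is what "random" would predict.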
wolfman1956 Posted May 7, 2020

UPDATE: We created another Kazoo clustered environment on a DELL platform, built identically to the initial NUTANIX cluster, and the timeout issue is no longer present. That proves the hardware is the culprit, and a ticket will be opened with the vendor.

Thanks for your help in this matter.