
Kazoo startup causes various intermittent timeouts



Good day community members,

As this is my first submission to this forum, I will do my best to describe the situation we are currently running into. I am new to Kazoo and CouchDB, so I will try to be as clear and concise as possible. Please feel free to ask for any additional information that is missing and considered relevant.

 

Situation:

We are setting up a KAZOO pre-production cluster composed of 2 main KAZOO servers, 2 Kamailio servers, 2 FreeSWITCH servers, 2 RabbitMQ servers and a 6-node CouchDB cluster divided into 2 groups (3 primary nodes and 3 secondary nodes). Over time we got the environment up and running, except for one event for which we cannot pinpoint a cause. When we run the command string (described further down), the load appears to complete successfully, but we see occasional warnings at random points of the script's execution. We are now looking into what causes these various “timeout” warnings (full details and script output are also found further down).

 

Environment:

 

-        2 Kazoo servers with HAProxy (1 x primary, 1 x secondary)

-        2 FreeSWITCH servers

-        2 Kamailio servers

-        2 RabbitMQ servers

-        6 CouchDB nodes

o   3 primaries

o   3 secondaries

 

Attached are the files we investigated (7 in all), config files plus a few log extracts:

- default.ini

- local.ini

- vm.args

- prlimit command output

- tcpdump extract with "timeout" strings

- successful run output

- unsuccessful run output

 

Command string:

The command string, which we run from one of the 2 masters, is:

 

sup crossbar_maintenance init_apps /var/www/html/monster-ui/apps http://our.master1.server:8000/v2

 


 

Script execution:

 

Observations from the Kazoo/CouchDB startup process: we see timeout warnings or failed saves on “.png” files while loading, yet the process still ends with an “OK” status.

(A successful run takes around 6 seconds. A run with "timeout" warnings takes anywhere from 25 seconds up to 1 minute and 40 seconds, but still ends with an "OK" status.)

 

Sample run with "timeout" events:

[root@vlprodkazookapps01 ~]# time sup crossbar_maintenance init_apps /var/www/html/monster-ui/apps http://ourkazooserver.kazoo.ourdomain.ca:8000/v2
trying to init app from /var/www/html/monster-ui/apps/pbxs
 app pbxs already loaded in system
 not updating api_url, it is unchanged
 no metadata changes for app 6f12f101f866c972d04a43f960ff41b7
   saved PBXconnector_app.png to 6f12f101f866c972d04a43f960ff41b7
   saved pbxconnector1.png to 6f12f101f866c972d04a43f960ff41b7
   saved pbxconnector2.png to 6f12f101f866c972d04a43f960ff41b7
trying to init app from /var/www/html/monster-ui/apps/webhooks
 app webhooks already loaded in system
 not updating api_url, it is unchanged
 no metadata changes for app e6e5774256edd2201f838865baa25ef3
   saved WebHooks_app.png to e6e5774256edd2201f838865baa25ef3
   saved webhooks1.png to e6e5774256edd2201f838865baa25ef3
   saved webhooks2.png to e6e5774256edd2201f838865baa25ef3
trying to init app from /var/www/html/monster-ui/apps/fax
 app fax already loaded in system
 not updating api_url, it is unchanged
 no metadata changes for app e21e143c33312aa3495558d7936242ec
   failed to save Fax_app.png to e21e143c33312aa3495558d7936242ec: timeout
   failed to save OutboundFaxes.png to e21e143c33312aa3495558d7936242ec: timeout
trying to init app from /var/www/html/monster-ui/apps/csv-onboarding
 app csv-onboarding already loaded in system
 not updating api_url, it is unchanged
 no metadata changes for app bdca179bf1754c18f5d739dc91c859b6
  failed to find icon in bdca179bf1754c18f5d739dc91c859b6
  failed to find screenshots in bdca179bf1754c18f5d739dc91c859b6
trying to init app from /var/www/html/monster-ui/apps/accounts
 app accounts already loaded in system
 not updating api_url, it is unchanged
 no metadata changes for app 1b7093aa6cbf7b50708027d03a106a12
   saved Accounts_app.png to 1b7093aa6cbf7b50708027d03a106a12
   saved Account-AvailableApps.png to 1b7093aa6cbf7b50708027d03a106a12
   saved Account-Limits.png to 1b7093aa6cbf7b50708027d03a106a12
   saved AccountOverview.png to 1b7093aa6cbf7b50708027d03a106a12
trying to init app from /var/www/html/monster-ui/apps/callflows
 app callflows already loaded in system
 not updating api_url, it is unchanged
 no metadata changes for app 184e7c891e2b9c64e8f4e8eec5b00a7b
   saved Callflows_app.png to 184e7c891e2b9c64e8f4e8eec5b00a7b
   saved callflows_1.png to 184e7c891e2b9c64e8f4e8eec5b00a7b
   saved callflows_2.png to 184e7c891e2b9c64e8f4e8eec5b00a7b
   saved callflows_3.png to 184e7c891e2b9c64e8f4e8eec5b00a7b
trying to init app from /var/www/html/monster-ui/apps/voip
 app voip already loaded in system
 not updating api_url, it is unchanged
 no metadata changes for app e9cc159ebcb93062e51c5f38604ec77f
   failed to save SmartPBX_app.png to e9cc159ebcb93062e51c5f38604ec77f: timeout
   failed to save smartpbx1.png to e9cc159ebcb93062e51c5f38604ec77f: timeout
   saved smartpbx2.png to e9cc159ebcb93062e51c5f38604ec77f
   saved smartpbx3.png to e9cc159ebcb93062e51c5f38604ec77f
   saved smartpbx4.png to e9cc159ebcb93062e51c5f38604ec77f
   saved smartpbx5.png to e9cc159ebcb93062e51c5f38604ec77f
trying to init app from /var/www/html/monster-ui/apps/numbers
 app numbers already loaded in system
 not updating api_url, it is unchanged
 no metadata changes for app 8637570ad7b41bc7f3906882507094db
   saved Numbers_app.png to 8637570ad7b41bc7f3906882507094db
   saved numbers1.png to 8637570ad7b41bc7f3906882507094db
   saved numbers2.png to 8637570ad7b41bc7f3906882507094db
trying to init app from /var/www/html/monster-ui/apps/voicemails
 app voicemails already loaded in system
 not updating api_url, it is unchanged
 no metadata changes for app 48630c8173c977fdc460ae93e96010ec
   saved Voicemail_app.png to 48630c8173c977fdc460ae93e96010ec
   saved PlayVoicemail.png to 48630c8173c977fdc460ae93e96010ec
   saved SelectedVoicemails.png to 48630c8173c977fdc460ae93e96010ec
ok

real    1m25.705s
user    0m0.425s
sys     0m0.105s

 

Extract from the tcpdump log captured while running the script when a "timeout" event occurred:

 

{"_id":"ecallmgr","_rev":"19-64144d828aa26ea806e165f4089bf4cd","default":{"balance_crawler_enabled":false,"max_channel_cleanup_timeout_ms":60000,"acls":{"kamailio@vlprodkazoosbc02.kazoo.dsritelecom.ca":{"type":"allow","network-list-name":"authoritative","cidr":["192.168.11.11/32"],"ports":[5060,7000]},"kamailio@vlprodkazoosbc01.kazoo.dsritelecom.ca":{"type":"allow","network-list-name":"authoritative","cidr":["192.168.11.10/32"],"ports":[5060,7000]}},"max_channel_uptime_s":0,"fs_nodes":[],"fs_cmds_wait_ms":5000,"freeswitch_context":"context_2","call_routing_bindings":["context_2"],"text_routing_bindings":["context_2"],"event_stream_idle_alert":0,"tcp_packet_type":2,"publish_conference_event":["conference-create","conference-destroy","lock","unlock","add-member","del-member","stop-talking","start-talking","mute-member","unmute-member","deaf-member","undeaf-member"],"fs_node_uptime_s":600,"fs_cmds":[{"load":"mod_sofia"},{"reloadacl":""}],"acl_request_timeout_ms":2000,"acl_request_timeout_fudge_ms":100,"capabilities":[{"module":"mod_conference","is_loaded":false,"capability":"conference"},{"module":"mod_channel_move","is_loaded":false,"capability":"channel_move"},{"module":"mod_http_cache","is_loaded":false,"capability":"http_cache"},{"module":"mod_dptools","is_loaded":false,"capability":"dialplan"},{"module":"mod_sofia","is_loaded":false,"capability":"sip"},{"module":"mod_spandsp","is_loaded":false,"capability":"fax"},{"module":"mod_flite","is_loaded":false,"capability":"tts"},{"module":"mod_freetdm","is_loaded":false,"capability":"freetdm"},{"module":"mod_skypopen","is_loaded":false,"capability":"skype"},{"module":"mod_dingaling","is_loaded":false,"capability":"xmpp"},{"module":"mod_skinny","is_loaded":false,"capability":"skinny"},{"module":"mod_sms","is_loaded":false,"capability":"sms"}]},"pvt_account_id":"system_config","pvt_account_db":"system_config","pvt_created":63748236928,"pvt_modified":63750144111,"pvt_type":"config","pvt_node":"ecallmgr@vlprodkazookapps02.kazoo.dsritelecom.ca","pvt_document_hash":"cd56ed686a602c7692a1a523803883d7","ecallmgr@vlprodkazookapps01.kazoo.dsritelecom.ca":{"fs_nodes":["freeswitch@vlprodkazoomedia01.kazoo.dsritelecom.ca"]},"ecallmgr@vlprodkazookapps02.kazoo.dsritelecom.ca":{"fs_nodes":["freeswitch@vlprodkazoomedia02.kazoo.dsritelecom.ca"]}}

 

As we are to roll this environment into production testing soon, this remaining issue needs to be resolved beforehand.

 

Best regards and thank you for any leads we can look into.

default.ini local.ini prlimit.txt vm.args tcpdump_extract_timeout_sample.txt successful_run_output.txt unsuccessful_run_output.txt


  • 2600Hz Employees

Adding the file SmartPBX_app.png to doc e9cc159ebcb93062e51c5f38604ec77f resulted in a timeout. I would check the local HAProxy logs to see what they say about DB requests to Couch for that ID, then check the actual Couch servers for that ID as well. There may be something lurking in those logs.
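
A minimal sketch of that log check in Python, assuming the logs are plain text files you can read. The helper name `lines_mentioning` is hypothetical, and the log locations are deliberately left to the caller, since they depend on how HAProxy and CouchDB logging are configured on each host.

```python
from typing import Iterable, Iterator, Tuple


def lines_mentioning(doc_id: str, lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every log line that contains doc_id."""
    for number, line in enumerate(lines, start=1):
        if doc_id in line:
            yield number, line.rstrip("\n")


# Example use (the path is an assumption; substitute your own log file):
#
#   with open("/path/to/haproxy.log", errors="replace") as handle:
#       for number, line in lines_mentioning(
#           "e9cc159ebcb93062e51c5f38604ec77f", handle
#       ):
#           print(number, line)
```

Running the same search against the HAProxy log and each Couch node's log, for the window in which the timeout was reported, should show whether the request reached Couch at all and how long it took.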

And welcome!!! :)


  • 2 weeks later...

Good day mc_,

 

Actually, we timed it so that the captures were running when we started our script, keeping a very close eye on the start/stop times. The script seems to end successfully even with timeouts: we observed run times from 5-6 seconds (no timeouts) up to 1m35s (with 2-3 timeouts). The timeouts indicate that a few graphical files could not be either found or saved, as seen in the unsuccessful_run_output.txt file submitted earlier.
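
For reference, the timing procedure described above can be sketched in Python roughly as follows. This is only an illustration: `timed_run` and `repeat_runs` are hypothetical helpers, the command is passed in by the caller, and a timeout is detected simply by looking for output lines ending in ": timeout", as in the sample run.

```python
import subprocess
import time
from typing import Dict, List, Sequence


def timed_run(command: Sequence[str]) -> Dict[str, object]:
    """Run command once; record wall-clock start, duration and timeout lines."""
    started_at = time.strftime("%Y-%m-%d %H:%M:%S")  # to line up with captures
    start = time.perf_counter()
    result = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    timeouts = [
        line.strip()
        for line in result.stdout.splitlines()
        if line.rstrip().endswith(": timeout")
    ]
    return {
        "started_at": started_at,
        "elapsed_s": elapsed,
        "returncode": result.returncode,
        "timeouts": timeouts,
    }


def repeat_runs(command: Sequence[str], runs: int) -> List[Dict[str, object]]:
    """Run the same command several times and collect one record per run."""
    return [timed_run(command) for _ in range(runs)]
```

Calling `repeat_runs` with the `sup crossbar_maintenance init_apps ...` arguments split into a list gives one record per run, and the `started_at` values can then be matched against the tcpdump timestamps.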

Addendum: our environment is a test one, so we can (re-)start or stop anything, anytime for our testing purposes.

Question: is there anything we can pull from the several timeout statuses logged in the tcpdump file?

Edited by wolfman1956
Added question. (see edit history)

  • 2 weeks later...

UPDATE: we created another Kazoo clustered environment on a DELL platform, built identically to the initial NUTANIX cluster, and the timeout issue is no longer present. This indicates that the hardware is the culprit, and a ticket will be opened with the vendor.

 

Thanks for your help in this matter.
