2600hz Hosted RTP Media Servers

Logicwrath · October 9, 2017

I started doing some testing this morning. The way I do the testing is I register the Yealink phone with the US-CENTRAL proxy using 2600's DNS and then call a test call flow I created which only plays a 500hz sine wave tone. I like using the sine wave tone because it really makes it easy to spot packet loss. In a perfect world the tone would not break.

I am very excited to announce that as of this morning all tests are now using the Chicago RTP media 8.45.x.x.

The RTP status page in the Yealink phone does update while the call is in progress and there is a refresh button. You can refresh it to look for packet loss as the call is in progress.

I also went into my firewall and looked for connections from the phone to confirm it is working correctly. All of my tests since switching to US-CENTRAL today appear to have used 8.45.x.x. The bottom Dst. Address was likely prior to my proxy change. You can see the history of my tests through the connection tracking list below. I am not sure what that 162.242.x.x connection is. I am clearly not prioritizing it with my address lists as it got PRIO 7. Is this a 2600hz service?

image.png.17b0e28dc7c76bddb4db0808d90a7acd.png

Darren Schreiber · October 9, 2017

The 162.242.163.176 address is just the proxy. The Central ones are hosted in a separate datacenter and don't really need prioritization frankly, the SIP packets should be fast enough.

I've never been able to get the RTP Status page on the Yealink to show properly until end of call, even on latest FW, but if it works for you that's great.

Logicwrath · October 26, 2017

I was thinking about this today. Was the configuration changed so that if ORD were to get overloaded that the calls would shed to EWR instead of SJC at this point? I understand moving the calls, I am wondering what the current logic is on overload.

I believe that SRV priorities for ORD are ORD -> EWR -> SJC for SRV/NAPTR.

Darren Schreiber · October 26, 2017

The configuration in each zone is as follows:

1) Always round-robin the local zone (Kamailio list 1)

2) If all servers in the local-zone fail, fail to the alternate next-closest zone (Kamailio list 2)

Each individual server has the ability to process a call request or return a 5XX to say "I'm overloaded". If the 5XX's are received by all servers in the first list, we'll proceed to the second list. This happens "silently" which is the issue we ran into previously. Each individual media server is configured to allow a set number of calls (that we believe is it's load-tested limit). In this case the limit was being exceeded and the failover was silently happening (correct behavior) but without our knowledge so we did not respond and add more servers when we should have.

We've corrected both issues (notification and lack of available servers). Actually the ultimate fix ended up being that we had more capacity on the existing servers and just needed to raise the ceiling (so we've subsequently torn down the additional servers as well).

Darren Schreiber · October 26, 2017

(Also, the above have had no verifiable impact on call quality throughout - I'm just answering your question. The actual call quality issues were traced to a peering and bandwidth overload issue which we resolved separately with Level3. So all of the above is probably not even related to your issue, though I know you are convinced otherwise).

Logicwrath · October 26, 2017

I am trying to confirm that the next closest zone in list 2 as you put it will be EWR and not SJC. I just want to make sure that SJC is the choice of last resort for Central.

Darren Schreiber · October 26, 2017

I believe it will try either if ORD is full. The theory is simple. If there are 3 datacenters, then each can handle up to 66% capacity of the platform. If we lose ORD and it's at 66% then we want to take 33% and send to one remaining location and 33% to the other until the ORD servers are resolved.

That's the automated strategy.

If a failure is actively occurring we want the automation to kick-in and then we can analyze "better" routing strategies as we see what's transpiring if the automation is not the ideal setup.

Logicwrath · October 26, 2017

I would like to register my preference to follow the SRV priorities of ORD -> EWR -> SJC.

In Michigan latency is near identical to ORD and EWR. Latency to SJC is double.

Darren Schreiber · October 26, 2017

Noted, however, we have customers farther west where SJC is closer :-)

In reality, we'll usually manage these failures as a one-off as needed. I think you're concern is mostly because the pro-longed audio issues in the beginning of the month, which was an anomaly for our service (we haven't had severe audio issues on our network in over two years that lasted more than maybe 30 minutes at most).

Tuly · October 27, 2017

@Logicwrath Even if you are in Chicago and sending the calls to SJC, call quality should still be very good,

In our testing we had all clients from east use SJC for a long period of time, and call quality was reasonably OK, (I believe there are very fast battery operated upgraded huge tunnels under ground for the call to travel😊)

Edited October 27, 2017 by Tuly (see edit history)

Darren Schreiber · October 27, 2017

So we're clear, my answer isn't just a guess, it's a VoIP norm.

https://www.voip-info.org/wiki/view/QoS : "Callers usually notice roundtrip voice delays of 250ms or more. ITU-T G.114 recommends a maximum of a 150 ms one-way latency. Since this includes the entire voice path, part of which may be on the public Internet, your own network should have transit latencies of considerably less than 150 ms."

We generally try to keep the latency as low as possible, because the statement "this includes the entire voice path" above is really critical and as soon as you throw a mobile cell tower connected by microwave to it's home base in the mix, latency can go >400ms (if you travel to, say, central Washington and use your cell phone to talk to a VoIP user, you'll hear a much bigger delay than if you call that same person while they're in Seattle, for example).

That said, if the choice is "my call is not going to go through at all" vs "there might be a slightly higher delay for an hour or so but at least it's not down", the choice is obvious.

The SRV records basically assume we're adding no more than 50ms during a failover scenario, way less than 150 or 250ms.

So while I don't recommend having all your clients in EWR hit SJC "for a long period of time", it certainly is not a big deal for a small amount of time and frankly they'll probably never notice. Hence why I really don't think your testing and assessment of horrible voice quality when hitting a specific zone in the US is valid, I think that was more of an issue because of the voice path issues we were having with Level3 in early October.

Sign In

2600hz Hosted RTP Media Servers

Recommended Posts

Logicwrath

Darren Schreiber

Logicwrath

Darren Schreiber

Darren Schreiber

Logicwrath

Darren Schreiber

Logicwrath

Darren Schreiber

Tuly

Darren Schreiber

Join the conversation