*Cliff notes at end
It was a cold and windy night as I stood.. wait, wrong story.
Actually it began *LAST* Wednesday 1/4, when we began prepping to finish the second half of our fiber project. This required us to add additional DS1 ports to our router by changing out the existing 4 port DS1 card for an 8 port version. In the process we upgraded the software on the router. I decided to go ahead and work through and get some administrative work done prior to the swap at 1am and ended up working 7am-1:30am. The swap and upgrade went well and we called it a night. There were no support calls or complaints made Thursday or Friday so the assumption was made to call it good. Saturday I received a complaint about throughput & latency from a customer who is off a remote DSLAM being fed via one of the DS1's we upgraded. I checked the network over and all looked well so I referred the customer to tech support. Sunday, same thing, I poked a little more into the router & DSLAM and the config was right, no dropped cells, nada grr. I called and spoke with the other admin, *since he would be leaving Monday and not returning until Sunday*, leaving me alone this week. He came to the same conclusion and decided that I should call the router support line first thing Monday and consult with them.
Monday morning I called the support line, tossed out a few ideas, we went over the configs again, called the engineer of the DS1 cards to see if there was a chipset difference - nope. We discovered that I had 10-25% packet loss out to the remote DSLAM - not good. We got the software developers on the phone to find out if there was a code change to the drivers for the DS1 card - there was, they had added QoS to the ATM layer. I made the decision to end the support call and allow them to break for lunch while I drove down to be on site with the router, since it is over an hour's drive away and my ear hurt from being on the phone for 4 hours.. After arriving on-site I re-established the tech call and also brought in an engineer from the company that makes the DSLAM. The engineers talked for a while and determined that the problem was not at the ATM layer as there were no cells being dropped (I told them this upfront from my checking over the weekend) and that the loss was on the IP layer. It was about 3pm at that point and I decided that it would be best to go ahead and downgrade the software on the router to the version we were running prior. *Engineers can be a strange bunch when they are presented with a problem they can't explain, they were excited and would have poked & prodded the router for hours. When I made the decision to change versions I could hear the disappointment in their voices, but I had customers getting crappy service that I had to take care of.* To downgrade the software you have to reboot the router, not a big deal, it takes about 2 mins to completely reboot. We get the new/old software installed and reboot the router.... After rebooting I check the interfaces:
DS1 #1-4 - up & up (line up, protocol up)
OC3 down & down - UTOH!
GREAT! Now instead of having the 30-40 customers with poor service, I have 300+ with NO SERVICE at 2:30pm! We go over everything, check the configs (we had made a few changes while working on the DS1's), everything looked good, I pull & clean the fiber, still nothing. I look over at the DSLAM and notice there are *NO* ALARMS, but the OC3 is down! I pull the fiber again and the DSLAM alarms. The CO tech starts calling because the alarm dialer is calling him (it's now about 5:30pm), I tell him what's happened, and he instructs me to loop the DSLAM. He gets into the DSLAM and says everything is ok on his end - it's with the router. We try pulling the DS1 card out, try changing software multiple times, etc. I pull out another patch cable and by-pass the existing cables - still nada except the DSLAM is alarming again.. I reconnect the DSLAM, and loop up the router and the OC3 is still down & down. On a hunch I try the other patch cable and the OC3 comes UP! I reconnect the router and the OC3 goes down, but the alarm on the DSLAM goes away, odd.. It was about 8:30pm at that point, so I called the CO tech back at home and had him come out and bring the fiber tester/light meter. An hour later he gets there and we check the cables, all good except the one spare patch cable. We check the router's OC3 card - good. Check the DSLAM - good. He gets in the DSLAM and resets the card, the OC3 comes UP! WOOT! Short lived though, 90 seconds later it dies, but there is no alarm on his DSLAM. The DSLAM has hot-redundant OC3 cards so we pull the fiber splitters and plug directly into one of the cards, same result: up for a short time then down, then back up, over and over. Try the other card, OC3 comes up and stays up! We plug everything back in and he forces the other card into failure to prevent it from swapping cards.
Nice, what a coincidence, I reboot my router and the DSLAM fails, we spent 8 hours fighting the router for the CO tech to fix it in 20 minutes..
With the OC3 working we go back to our original fight - the DS1's. With fighting the OC3 we had re-upped the software and were still seeing the loss on the DS1's. The engineer just so happened to catch what was occurring, we were seeing CRC errors on the IP packets. We went ahead and downgraded the software and rebooted again (the CO tech decided to stick around for this part). Everything comes up like it should and everything looks good, YEA!! It's 11:30pm and I head home.. It's raining - I sure hope there is no lightning........
Tuesday morning - I got to sleep about 1:30am listening to the thunder... 2:30am I am awoken by the song 'Happy' by Mudvayne - a fitting choice for my network alarm... Start poking around, yep I have an entire network segment down, and hey look that's upstream from the same segment I was working on earlier, so those same customers are down again. *Normally not a huge deal as our network is a true ring and will re-route the opposite direction, but we are changing upstream connections so we have multiple I-net gateways for now, so we have some static routes in place to make things work..* Crap, I get dressed and head to the office and grab my spare core router and head to the tower. The lightning is pretty bad so I wait a while in my van and at about 4:30am the lightning lessens enough that I feel safe working out at the site. I open the box, the router is on but not responding to the keyboard, also there are no link lights on the redlines. I reboot the router, still no links. I replace the NIC, get links, traffic starts flowing. I check the AP, nothing. So the AP is fried. I close stuff up and call the tower climbers and leave a message to get ahold of me. I get back to the van (I can't park close to this tower because it is in a low spot and floods) and fall asleep. At 7:30am I am awoken by someone tapping on the glass... Why, hello Mr. Officer.. We talk for a bit, I tell him what is going on etc. All the while I notice he is looking at me funny, hrm... He tells me someone called because they saw me in the van not moving for an extended length of time and they thought I had committed suicide.. I admit I may have looked dead with my head back, mouth gaping open, with drool running down my chin, but I was very much alive, just exhausted.. We both chuckled, but mid-chuckle my cell phone rings, I look at the number and begin to cry (not really, but I sure felt like it), it was my alarm dialer.
I answer to the sound of the mechanical voice, "Hello...This.is.Telephone.Number.2120.The.Time.Is.7.57.AM.The.Power.is.Off". The officer and I part ways, and I head to another one of our towers to make sure the generator has started etc. En route the office calls and says there are a ton of messages on the recorders from customers - well duh - their i-net was out. I get to the tower, generator is running - good. I check the propane level - crap, only 47% capacity. I check my generator logs, last month showed 70% - hrm. Since it is still raining, I take a little hand cleaner from the van and fill an empty cup with rain water and check the lines. I find a leak at the fitting on the generator, apparently the vibration had loosened the fitting - tighten it up and leak is gone - call MFA to schedule a refill. I spend the rest of the day fielding tech calls because people can't seem to understand 'reboot your computer' and get home about 8pm. At 10pm the CO tech calls and reminds me that he is doing the OC12 upgrade for the fiber project that night at 1am and I need to be up in case the engineers need remote access. He assures me that all will be done by 1:30am..
Wednesday morning - I tape my eyelids open and wait for a call from the CO tech, at 1:30am I call him to check status and he says they're having some minor issues, but he would like me to stay up 'in case'. Around 2:30am, I get a call, I glance at the caller-id, it's the CO tech calling from his cell phone, I think to myself he must be on his way home, I was wrong. He had lost dial-tone, not good. He had also lost all special circuits - meaning DS1's, really not good as their VPN is over a DS1 circuit, how was I supposed to get them remote access with no circuit? Well phone service outranks internet service and as an added bonus the same customers who were down because of the router failure and lightning strike were going to go down again. They love me down there, I can feel it by their stares and finger pointing. I have him pull the router's ethernet cable out and stick it into their VPN router. I set up DHCP on the router upstream of that point and their router pulls an address, I set up NAT on their router and the engineers have access. Around 4am everything is back working enough that we can put things back the way they were. I check my network and everything is flowing. I pass out around 5:30am. My alarm goes off at 7:30am, I hit the snooze button and fall back asleep. My home phone rings at 7:40am, it's the VP reminding me I need to attend a meeting at the business office to discuss the upcoming FTTH project at 8am. I take a quick shower and head to the meeting. Around 8:30am they usher me out of the meeting because the people on the conference call can hear me snoring in the background. I stumble to the area where the office ladies' cubicles are and once again fall asleep. They wake me up soon after and tell me that it's not a good business image for me to be sleeping where customers can see me, so I get up and go to the kitchen. I get about a good half hour in when I am again woken and reminded that the tower climbers will be there to replace the AP smoked Tuesday.
I go to our office, set up the new AP and drive back to the tower location. This time I parked where I wouldn't be seen and tried to catch a few more Z's. No sooner had I nodded off than the climbers showed up. The change out was uneventful and by about 1pm we were done and all the customers were working again. I pack up and start heading back to our office. I get a few miles from the tower when I am once again called to return to the end of our network because several customers for another ISP have not been able to get IP's since the outage Monday (that router is an ADSL aggregator point, multiple ISP's connect to it). Once there I check the configuration, check DHCP service, etc. everything is okay. I check their ethernet port, got a link, I check their SDSL modem, all good. I call them and tell them it is fine on our end, but I was not seeing any traffic on any of their PVC's. They ask me to stay there while they send a tech out (2 hours away mind you). I pull out my emergency kit (I keep one at each CO) and go to sleep. He gets there about 6pm, he verifies for himself that everything is fine. We then went over to their POP - I was a little upset at this point and insisted I go with him as I wanted to know what the problem was. Their POP is in the back of the local drug store... Turns out during the outage Monday one of the employees took it upon themselves to work on their equipment and had unplugged several ethernet cables.. grr. Luckily the rest of the night was uneventful and I got to sleep somewhere around 10pm.
Thursday Morning - YEA!! I get to sleep! At 7:30am my alarm goes off, get up, shower and head to the office. Pick up a list of support calls that had built up over the past few days and start knocking them out. I finish up around 6pm and head home. Still somewhat lethargic I call it a night around 10pm and go to bed. I am just getting to sleep when I hear that oh-so-dreaded song 'Happy'. I get up and look to see what is wrong. Our old upstream connection is down (gee I wonder why we're leaving them), not a huge deal, only dialup and our servers on that network right now... Hrm, scratches head, our servers... *ponder*, servers... CRAP our DNS servers are still on that network! All of my customers are without DNS resolution! I call their emergency number, I explain to the lady on the other end that we are having a service-affecting outage.. She then tells me they don't take internet issues until 10am, umm I don't think so lady.. I was absolutely furious and I went for the jugular vein. I not so calmly explained to her again that I was neither a residential customer nor a business customer, and that she must contact their NOC immediately. I hung up with her and started setting up a caching DNS server on our new network. I got bind installed and configured, got into the routers and natted all DNS traffic to the temporary DNS server. No sooner had I got that done than I saw the other network come back up and a few minutes later my phone rang, the admin said they had a router go bad... I went back to bed and listened to the rain on the window - I sure hope there's no lightning....
Friday morning - Have you ever had a song invade your dream? You know, you're kinda half-awake - half-asleep, you can hear it but you can't understand why it's there? "In this hole.." No. please god, no.. "That is me..." ARGH! I roll over and look at the alarm clock, 3:15am - nice, real nice. I check the network, hmm, only the 900 AP is down. I get dressed and head to the office, grab the spare AP and go sit at the building until the lightning lets up a bit. I finally get brave enough to work on the equipment. Luckily a simple reboot brought the AP back to life, so I head back home and get to sleep about 4:30am. Alarm goes off at 7:30, shower and head to the office. The VP calls and wants my input on some things so I pack up and drive to their office. I guess I had the look of death because everyone just kinda stared at me. I sit down and we begin talking. Mid sentence she stops and asks me if I'm alright, and says that I look like I haven't gotten any sleep lately.... The rest of the day was uneventful. I got home about 6pm, passed out about 8pm and slept until 7am.
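(A stopgap resolver like the one Thursday night is only useful if it actually answers, so a quick check from a customer-side box is worth having. What follows is a rough sketch and not what was run that night: the function names and the 192.0.2.53 address are made up for illustration, and it only sends a single plain A-record query over UDP.)

```python
import socket
import struct


def build_query(name: str, query_id: int) -> bytes:
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: id, flags (0x0100 = recursion desired), 1 question, 0 other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A, QCLASS 1 = IN
    return header + qname + struct.pack("!HH", 1, 1)


def resolver_answers(server: str, name: str, timeout: float = 2.0) -> bool:
    """Return True if `server` replies with a matching id, no error, and an answer."""
    query_id = 0x1234  # fixed id is fine for a one-off manual check
    query = build_query(name, query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # timeouts and unreachable hosts both land here
            return False
    if len(reply) < 12:
        return False
    reply_id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    rcode = flags & 0x000F  # low four bits of the flags word: 0 means no error
    return reply_id == query_id and rcode == 0 and ancount > 0


if __name__ == "__main__":
    # 192.0.2.53 is a documentation-range placeholder, not a real resolver
    print(resolver_answers("192.0.2.53", "example.com"))
```

(Run it against the temporary server's address from a box behind the NAT rule; False means no usable answer came back within the timeout.)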
First thing I did this morning was to change my network alarm song to 'Invoke The Suffering' by Bloodgasm.
*cliffs: