I got a watchdog email shortly before we left for the April Adventure weekend (email: 4/20 6:45P) saying the datafile and camera pic were older than the 1260 sec threshold. I couldn’t get it running the morning before we left, so I didn’t start real recovery efforts until 4/24.
SYSTEM THUMB DRIVE
I couldn’t ping or ssh to the box, so I power cycled it. (Yeah, that’s a little harsh for a Linux system – but I didn’t have many choices.) Nothing. I looked around for backups, and found one – from 6 years ago!
I uncabled the box (power, 485 net, ethernet) and brought it out so I could plug into the 3.3V serial console pigtail I’d installed early on. It didn’t boot as expected, so I tried to look at the system thumb drive. The main Win10 PC, jimsdellmini (Ubuntu) and Gparted on the red kitchen laptop all failed to detect a thumb drive. They did see the USB device (it showed up in lsusb), but no storage drive.
I found a very similar 4GB Microcenter drive (with silver laptop Clonezilla on it) and in desperation decided to try to swap the flash chips to try to recover the files, hoping it was the board or controller chip rather than the flash chip that was bad.
Here are the boards, and the back sides with the flash chips. Oh no – they’re BGA! Looks like the boards were set up both to take a second flash chip and to take either BGA or leaded (TSSOP?) chips. Thanks for these using BGAs, Murphy. Not.
All work below was with air:3 and heater:8 and a smallish (0.198″OD) tip on the Aoyue Int 906. I worked with the chips and boards on a firebrick.
First I unsoldered the good chip from its card with hot air. Took a couple of minutes, but seemed successful, and gave me a little hands on practice with the hot air.
Then with vague hopes that the bad drive might just have a bad solder joint, I reheated that board, hoping to reflow the bad chip in place on its original card. When I tried that in jimsdellmini, it still did not show up as a memory device (though it did show up in lsusb). Nice try.
Then I unsoldered the bad chip from its card. I put the good chip on the bad card – no prep, no reballing, no cleanup. I reheated it, purely guessing at the time. I watched closely, but never saw it move or recenter.
When I plugged the bad card with the good flash into jimsdellmini, the OS found the device and I could see the files from the flash (old clonezilla files). Plugging it into the Win10 PC, that could see it and its files as well. Wow – I just successfully resoldered a BGA chip!
Encouraged, I cleaned off the bad chip with flux and solder wick, and ran a ball of (leaded) solder across it, providing what looked (to eyes that had never seen one) like an appropriate reballing. I cleaned the previously good card with flux/wick, put the reballed bad chip on it, then heated it to reflow. Watched, but never saw it move. Did wiggle the board a little a couple of times to try to help it find home. After I let it cool, the chip seemed well adhered, so presumably at least some solder melted, though I can’t be at all certain all the balls did. But I’m hoping the extra prep increased the chances of good soldering compared to the first (good chip, bad board) try – which worked with no extra prep.
When I put the good board/bad flash in jimsdellmini, lsusb could see it, but it didn’t appear as a file system. The Win10 PC could see the device, but no file system either. Nice try. 🙁
Despite efforts to take good notes, either I screwed up multiple times or the USB ID moved with the flash chip (which is very unlikely). Boo on me. But I eventually did enough retries comparing lsusb output to convince myself I had done what I intended (like not putting the bad chip back on the bad board), and that it really didn’t work.
In any event, I’ve done all the due diligence I know how to do, and so can completely give up on trying to resurrect the files from the old working drive. Bummer, but with closure.
OK – I’m now confident I’ll never get the files back from the original thumb drive. Time to move on. I had a backup tar from 10/6/11 (!), and with the notes here untarred it to a random 8GB drive from the thumb drive box. It booted, and thanks to the serial console, I got it sort of running. I think the ftp password to the GoDaddy host had changed, and the IP of the AT&T first router had changed, and probably some other stuff.
But its (6 year!) old 485pollB.pl talked to the existing HA nodes, got data from them, and pushed it to the GoDaddy host. We’re back on the air!
Well, sort of. All the camera stuff was implemented after that backup, so had no hope of working. And the graphs didn’t update like they used to. I feared the data format had been updated somehow, and the parser on the web host was looking for something that wasn’t there any more/yet (and that I had zero recollection of).
Several weeks later (6/20/17), I dug around trying to figure out why the graphs didn’t update. Between looking at the code and finding and reading the note on “Home Page Speedup” I slowly realized the poll perl script on the Pogo not only ftp’d update data, but invoked (with wget) the graphs.php script on the host that actually updates the graphs. After a few stumbles I put that into a new 485pollC.pl and got it working. Now the graphs update automatically again – yay!
First on the agenda is a workable backup method. I don’t remember whether I had a backup script on the old drive, but at least there’s a good start in the project notes from the last painful rebuild.
I copied and touched up that script so just running “gobackup” in perl/backup will (presumably) make a backup tar file and put it on the main PC as F:\pogobackup.MMDDYYYY.tar. If I can just remember to run that when I make a change (or maybe put it in a cron job)! To that end, I put a banner in the motd saying “IF YOU TOUCH ANYTHING HERE, UPDATE THE BACKUP!” And after 485pollC.pl was actually running graphs.php, I kicked off a backup. 🙂 A backup file appeared on the main PC, but I haven’t tried to put it on a new thumb drive.
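For illustration only, here is a minimal Python sketch of the date-stamped tar idea. It is not the gobackup script (which isn’t shown here); the function names and the choice of which folder to archive are assumptions. Only the pogobackup.MMDDYYYY.tar naming pattern comes from the notes above.

```python
import tarfile
from datetime import date
from pathlib import Path


def backup_name(day: date) -> str:
    """Build a file name in the pogobackup.MMDDYYYY.tar pattern."""
    return f"pogobackup.{day.strftime('%m%d%Y')}.tar"


def make_backup(source_dir: Path, dest_dir: Path, day: date) -> Path:
    """Tar up source_dir into dest_dir and return the path of the new archive.

    source_dir and dest_dir are hypothetical; the real script's inputs
    and destination may differ.
    """
    target = dest_dir / backup_name(day)
    with tarfile.open(target, "w") as tar:
        # arcname keeps paths inside the archive relative to the source folder
        tar.add(source_dir, arcname=source_dir.name)
    return target
```

Calling make_backup(Path("perl"), Path("/mnt/mainpc"), date.today()) would write one archive per day under that (made-up) destination. Putting the date in the name means a rerun on a later day doesn’t overwrite the previous backup.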
Maybe make a watchdog that checks sums on some main files and sends an email backup reminder if they’ve changed? What files? Main perl files, hosts, rc stuff? Should all data like credentials be hosted in a single file and read by the scripts? Here’s a first crack at a list:
FILES TO CHECK:
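The checksum watchdog idea could be sketched roughly as below. This is a guess at one way to do it, not something that exists on the Pogo: the function names, the JSON manifest and the choice of SHA-256 are all assumptions, and the email reminder itself is left out.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of one file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(watched: List[Path], manifest: Path) -> List[str]:
    """List watched files whose digest differs from the saved manifest.

    A file that is missing from the manifest counts as changed.
    """
    saved = json.loads(manifest.read_text()) if manifest.exists() else {}
    return [str(p) for p in watched if saved.get(str(p)) != file_digest(p)]


def save_manifest(watched: List[Path], manifest: Path) -> None:
    """Record current digests; meant to be run right after a backup."""
    digests = {str(p): file_digest(p) for p in watched}
    manifest.write_text(json.dumps(digests, indent=2))
```

The intended flow would be: run save_manifest at the end of each backup, and have a cron job call changed_files and send the reminder whenever the list is non-empty.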
Update 7/15/17: Woo hoo! I think I got a first hack of the camera working again! A first problem was that the main disk on jimsdellmini – which hosts reading from the camera – was completely full (and I couldn’t figure out what it was full of). I dug some more and removed several old kernels and the associated stuff in /lib/modules. Now I can at least work on the machine. Hmm – that machine’s not backed up at all. But it’s still running, and the old scripts are still there. I plugged in an external hard disk, but it didn’t seem to see it. I plugged in an old 1GB thumb drive and got it mounting to /media/CameraDrive. Steps 0,1.
Looks like ~/perl/gosnaps reads from the camera, gets snapshots and puts them in /media/CameraDrive, then calls putpic to send latest to the Pogo. Putpic ftps the latest to <ftproot>/pics with its timestamped file name, removes old latest.jpg, and copies the latest also to that filename. Of course it needs an ftpd running on the rebuilt Pogo – which it wasn’t. Got that running, but putpic doesn’t supply any credentials. Added user jim to the Pogo, and some flag somewhere allowed auto logging-in of ftp. I made a symbolic link in /srv/http to /srv/ftp/pics/latest.jpg, and I could see it in a browser looking at the Pogo. It works! Steps 2,3 done.
The main 485pollC.pl script already has stuff to implement regular (5 min) ftps to the GoDaddy host, so I copied one of those blocks of code, and after way more playing with the autoftp.pl it uses than I cared to do, got that set up to push the latest pic to the real host. And that seems to work, too! Of course, the nice go485 script that used to pkill the old 485poll and restart it somehow stopped working, presumably with one of those kernel updates. I can’t yet see how to pkill it, so must manually kill 2 processes and then manually restart it. But at least go485 now prints out instructions on what to do. And I even kicked off a backup with the new 485pollD.pl in it! Oh, but /etc/rc.local still started 485pollC.pl. Rats. But at least now doing the additional backup to capture that was easy. I suppose I should untar it to a new flash drive and test it. One more thing on the list. But now I get pictures! Steps 4,5,6A,6B done.
And the terrible displays recently on the home page turned out to be 2 instances of 485pollC.pl running. (The old go485 script hadn’t killed the old one.) Fixed that. Wanted to clean out the data in datafile.csv, but decided to just let good data accumulate and fix itself in 5 days. Cleaning the intermixed data was harder than I hoped. 🙁
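One common guard against two copies of a poller running at once is an exclusive lock file taken at startup. The sketch below is in Python rather than Perl and is not how go485 or 485pollC.pl work; the function name and lock path are made up, and it assumes a Unix-like system where fcntl.flock is available.

```python
import fcntl


def acquire_single_instance_lock(lock_path: str):
    """Return an open lock file, or None if another holder already has it."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes flock fail right away instead of waiting for the lock
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    # The caller must keep this handle open for the life of the process;
    # the lock goes away when the handle is closed or the process exits.
    return handle
```

A script would call this first thing with a path like /tmp/485poll.lock (hypothetical) and exit if it gets None back. Since the lock is released when the process dies, a crash doesn’t leave a stale lock behind.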
But I did hack in an init in the power outage section of graphs.php that should stop the long-standing problem of displaying one phantom outage every time I’d trim datafile.csv. Seems to work now.
Update 7/19/17: Looks like the woo hoo on getting the camera stuff working was premature. Yeah, it worked, but several hours later it stopped, with error messages on jimsdellmini about no space on device. Yeah, the thumb drive is only 1 Gig – but c’mon, it can’t be full yet! And df showed only 4% used! Tried touch abc – no space on device. Same, as root – no space on device. Deleted the 171 pics, and it all started working again.
Turns out FAT16 (which the 1GB thumb drive was formatted with) limits the number of entries in the root directory, and long file names use up several entries apiece, so the limit is hit much sooner. I guess “snap07-19-23-59.jpg” qualifies as a long name.
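The arithmetic behind that limit can be sketched as follows. The 512-entry root directory is the usual FAT16 value but is an assumption about this particular drive; the 13 characters per long-name entry is how VFAT long file names are stored. The function names are made up.

```python
import math

ROOT_DIR_ENTRIES = 512      # typical FAT16 root directory size (assumed for this drive)
CHARS_PER_LFN_ENTRY = 13    # each long-name entry holds up to 13 characters


def entries_per_file(name: str) -> int:
    """Directory entries used by one file that is stored with a long name."""
    long_name_entries = math.ceil(len(name) / CHARS_PER_LFN_ENTRY)
    return long_name_entries + 1  # plus the regular 8.3 entry


def max_files_in_root(name: str) -> int:
    """How many files with names the length of `name` fit in the root directory."""
    return ROOT_DIR_ENTRIES // entries_per_file(name)


per_file = entries_per_file("snap07-19-23-59.jpg")
limit = max_files_in_root("snap07-19-23-59.jpg")
print(per_file, "entries per pic,", limit, "pics before the root directory fills")
```

Under those assumptions each 19-character snapshot name takes three entries, which puts the ceiling in the same neighborhood as the 171 pics that were on the drive when it stopped accepting files. Files in a subdirectory aren’t subject to this fixed limit.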
I suppose I could have tweaked a couple of scripts and stashed the image files in a subdirectory, but I just repartitioned the thumb drive with fdisk and reformatted it ext2. Works fine days later. There’s also a daily cron job that clears out files older than 24 hours, so it should work for plenty long enough. (Turned out to be older than 30 days.)
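For reference, an age-based purge like that cron job can be sketched in a few lines of Python. This is not the actual cron entry (which uses find); the function name is made up, and the cutoff here is a plain “older than N days” test that does not reproduce find’s whole-day rounding for -mtime.

```python
import time
from pathlib import Path
from typing import List, Optional


def purge_old_files(folder: Path, max_age_days: float,
                    now: Optional[float] = None) -> List[Path]:
    """Delete regular files in folder last modified more than max_age_days ago.

    Returns the paths that were removed. Subdirectories are left alone.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    removed = []
    for path in folder.iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Something like purge_old_files(Path("/media/CameraDrive"), 30) run once a day would keep roughly the last 30 days of snapshots.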
Update 8/18/17: The random 8GB thumb drive I rebuilt as a Pogo system disk failed. This time I schlepped the new HP laptop over with a 3.3V USB-serial cable and plugged it in before I did anything else. Nothing. I power cycled it and got a shell from flash – no USB boot.
Put the thumb drive in the main Win10 PC, and with ext2explore.exe copied the very latest 485pollD.pl, which hadn’t been saved to a backup tar yet. Yay! Everything else should be OK in the latest tar file. I hope it works – I’ve still never actually tested it.
Tried fsck on the thumb drive in jimsdellmini, and it gave a zillion errors. Let it run – twice – but couldn’t read the drive when remounted. Took 2 tries, but deleted partition table and recreated with fdisk, then put new ext2 fs on with mkfs. Copied latest tar over, but oopsed and untarred the 2011 tar instead. Took a LONG time. Discovered what happened when I tried to replace 485pollD.pl – and it wasn’t there. Carefully did rm -rf on the thumb drive (another LONG time), then started to untar the right file. SLOW. Gave up with ctrl-C. Jimsdellmini wasn’t happy with that, and never really unmounted the drive.
Put in a brand new 16GB red Microcenter USB3 drive, fdisk’d a new partition table, new mkfs, and untarred the right file. Much faster, as expected. Plugged it into the Pogo, and with laptop still connected, watched it boot. Success!
No house pic, tho. Finally killed a days old ftp process on jimsdellmini. A few minutes later, a good pic had propagated all the way to the GoDaddy host. Yay! Looks like all the scripts on the Pogo do in fact start on boot – and the backup tar works!
I hope the brand new thumb drive lasts longer than the 8GB fold-flat credit-card drive from some patent law freebie I used last time. And why do these things always happen the night before I have a long drive to a weekend?
Update 8/30/17: I’ve been fighting on and off with being out of storage on the “camera drive” on jimsdellmini. It’s been running that 1GB thumb drive for a while, and just failed again. Thankfully the “house picture old” watchdog fired and sent email. Turns out once again the drive was full (but really, this time). There’s a cron job purging files matching a find -mtime +30 (days), but that wasn’t aggressive enough for this tiny drive. After I manually deleted several days’ worth of pics, it started working again. How many days’ worth will the drive hold? It looks like nite pics (B/W) are ~26KB, while day pics are more like 110KB. Doing an ls | wc on /media/CameraDrive gave 10586 files, while df showed 658048K used. So that averages about 62KB/pic. At 24*30 pics/day, we burn about 45MB/day, so the 1GB drive should hold around 22 days of pics. I changed the cron job to -mtime +15, so it should take care of itself now, and run around 30% free. We’ll see.
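The back-of-envelope numbers above can be checked with a short Python sketch. The inputs are the figures from the ls and df output; treating the 1GB drive as 1000 MB of usable space and 1 MB as 1000 KB are simplifying assumptions, and the function names are made up.

```python
def average_pic_kb(used_kb: float, file_count: int) -> float:
    """Average size of one picture in KB."""
    return used_kb / file_count


def daily_burn_mb(pics_per_day: int, avg_kb: float) -> float:
    """Storage consumed per day in MB (1 MB taken as 1000 KB)."""
    return pics_per_day * avg_kb / 1000


def days_of_storage(drive_mb: float, burn_mb_per_day: float) -> float:
    """How many days of pictures fit on the drive."""
    return drive_mb / burn_mb_per_day


avg = average_pic_kb(658048, 10586)   # from df and ls | wc
burn = daily_burn_mb(24 * 30, avg)    # 30 pics an hour, around the clock
days = days_of_storage(1000, burn)    # assumes about 1000 MB usable
print(f"{avg:.0f} KB/pic, {burn:.0f} MB/day, {days:.0f} days")
```

With those inputs the sketch lands on the same ballpark as the text: about 62KB per pic, about 45MB a day, and around 22 days of pics.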
STILL TO DO
- I’d really like to move the camera stuff to the Pogo, if I can. The old (and newly current) system uses jimsdellmini to get images from the camera and push them to the Pogo, which eventually pushes them to the host. I think the two reasons I originally used another machine instead of hosting all that on the Pogo were a) a concern that the somewhat compute-intensive image processing burst might interfere with the serial polling of the HA nodes, and b) not wanting to have the HA system bouncing while I tried to get it all working. But if I integrated the camera stuff into the main poll script rather than as a separate process, I could avoid interfering with the serial polls. And I’d get jimsdellmini out of the critical path and more usable for other stuff.
- And of course implementing a more rational database – like RRD? – has been on the list for a long time, but that’s a pretty big deal.
- It would often be helpful to have an easy way to add annotations to any of the graphs/data sets to comment on special things that happened. That will probably wait until a more final database is implemented, but who knows? Update: I could make each power outage line a link to a (password protected) page that would prompt for a comment. Prepending exactly the posted outage line (timestamp and duration) to the comment, maybe with ‘;’ delimiter, the comment line would be appended to a comments file. When printing the outage lines, I could grep out any matching lines from the comments file. Maybe.
- And I’d really like the water meter reader to work again. I even looked at it a few weeks ago and could see that the optics could still be aligned on the red spinning arrow, providing enough brightness variation that I’m surprised the phototran couldn’t see it. Maybe it’s just a drift/reset the bias thing. Update: Looks like the city’s going to replace all the water meters, so the clever-but-not-clever-enough-to-keep-working optical telescope will soon be completely obsolete. More here.
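The outage-annotation idea from the list above (prepend the exact outage line, a ‘;’ delimiter, then the comment, and grep for matches when printing) could be sketched like this. It is in Python rather than the PHP the host actually runs, the function names are made up, and the outage line format in the usage note is invented for illustration.

```python
from pathlib import Path
from typing import List

DELIM = ";"


def add_comment(comments_file: Path, outage_line: str, comment: str) -> None:
    """Append 'outage_line;comment' as one line of the comments file."""
    with comments_file.open("a", encoding="utf-8") as fh:
        fh.write(f"{outage_line}{DELIM}{comment}\n")


def comments_for(comments_file: Path, outage_line: str) -> List[str]:
    """Return every comment stored against exactly this outage line."""
    if not comments_file.exists():
        return []
    prefix = outage_line + DELIM
    return [
        line[len(prefix):]
        for line in comments_file.read_text(encoding="utf-8").splitlines()
        if line.startswith(prefix)
    ]
```

For example, add_comment(Path("comments.txt"), "07/15/17 14:02 0:37", "storm took out the block") followed by comments_for(Path("comments.txt"), "07/15/17 14:02 0:37") would return that one comment. Matching on the whole prefix including the delimiter keeps an outage line that happens to be the start of a longer one from picking up the wrong comments. A comment containing a newline would need escaping first.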