Intermittent inability to boot into Trisquel

5 replies
strypey
Offline
Joined: 05/14/2015

Something is going wrong with my poor old AA1 (click on my name for full specs). It started with a few intermittent problems with the file manager, first being unable to access a shared folder from my wife's MacBook that I accessed fine yesterday, and then, after a couple of reboots, being unable to open the file manager at all. So I tried another reboot, and then things got really weird.

At first it wouldn't boot from the hard drive at all. Then, after a couple of attempts, I did manage to get to GRUB, but it wouldn't boot normally from the Flidas partition. When I tried booting in recovery mode, it did all sorts of weird stuff, and when I dropped to a root shell, it refused to restart or power off. When I tried booting from the older Belenos partition, I did manage to boot and log in, but as soon as I tried to open the file manager, it crashed and burned. In both cases there were lots of weird error messages about input/output.

I worried that something had died on the hardware level, maybe something had corrupted the SSD. But then it successfully booted from that USB I did an experimental install of Flidas onto, and accessed the files on the partition I mount as /home. So then I started suspecting that maybe my RAM is failing. Either that, or there is some kind of file system corruption, but if so it must go deep to affect both OS partitions, and sometimes even GRUB.

I just tried again a couple of times to boot into either of the OS partitions on the SSD to get more details on the errors, but it couldn't seem to find the hard drive again. On a hunch, I picked up the laptop and gave it a shake, and now voila! It's working again, seemingly as normal. But now I have no idea how long it will be before the problem manifests again.

Any suggestions for how to diagnose and fix this problem? Specifically:
* how to either identify or rule out that it's a hardware problem
* how to isolate exactly what piece of hardware or software is causing the problem
* how to fix it

A couple of theories:
* I have had it refuse to boot before, not long after I installed the SSD. I opened the hard drive cavity, and found that the SSD had come off its plug, plugged it back in, and voila! I wonder if intermittent input/output issues could be caused by it coming slightly loose. If so, maybe I could wedge something into the hard drive cavity to hold the SSD in place?

* It has been very wet and cold here recently, especially in the room where I often keep the laptop overnight, and very hot and humid over the summer. I suspect it's much more humid in China than back home. I have noticed what looks like some blue copper rust on the case screws that I don't remember seeing before we came here. Is there any chance this increased wetness could be causing intermittent hardware issues?

Please don't suggest "buy a new computer"! I would love to, and I know I'm going to have to sooner or later (this laptop is more than 8 years old and only 32-bit). But I can't afford it right now. Also, even if I could, I still like to do everything practical to keep computer equipment in use, and out of landfill, for as long as humanly possible.

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

You can check the SMART data of the disks. To do that with a graphical interface:

  1. install the "gnome-disk-utility" package, e.g. through the "Synaptic Package Manager" in the "Control Center";
  2. run "Disks" from the "Control Center";
  3. click on your disk, on the left-hand side of the window;
  4. click on "SMART Data & Self-Tests..." in the "gear" menu.

From there, you can start a self-test too.
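
For those who prefer a terminal, the same attributes can be read with "smartctl" from the smartmontools package, e.g. `sudo smartctl -A /dev/sda` (the device name is an assumption). Below is a minimal Python sketch that picks a few attributes out of that kind of table. The sample text is made up for illustration, not taken from the laptop in question, and the watch-list of attribute names is only a guess at what matters here: names vary between vendors and not every drive reports all of them.

```python
# Sketch: pull a few SMART attributes out of `smartctl -A` style output.
# SAMPLE is invented for illustration; real output would come from e.g.
#   sudo smartctl -A /dev/sda   (device name is an assumption)

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4321
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""

# Hypothetical watch-list; adjust to whatever the drive actually reports.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "UDMA_CRC_Error_Count"}


def parse_attributes(text):
    """Return {attribute_name: raw_value_string} for each table row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = " ".join(fields[9:])
    return attrs


def nonzero_watched(attrs):
    """Return watched attributes whose raw value starts with a non-zero integer."""
    flagged = {}
    for name, raw in attrs.items():
        first = raw.split()[0]
        if name in WATCHED and first.isdigit() and int(first) != 0:
            flagged[name] = int(first)
    return flagged


if __name__ == "__main__":
    for name, value in nonzero_watched(parse_attributes(SAMPLE)).items():
        print(f"{name}: raw value {value}")
```

A raw UDMA_CRC_Error_Count that keeps growing is usually read as a problem on the link between the drive and the controller (cable or connector) rather than in the drive's storage itself, so it is worth comparing before and after re-seating the drive.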

To test the RAM:

  1. install the "memtest86+" package, e.g. through the "Synaptic Package Manager" in the "Control Center";
  2. reboot;
  3. in the menu of the bootloader, choose the newly-added MemTest86+ entry;
  4. let it run for a night.

If, in the morning, the screen is red, then your RAM is defective.

strypey
Offline
Joined: 05/14/2015

Thanks. Given the nature of the problems I described, I can't reliably do this using the installed OS on the SSD. I will try it with the OS on the USB.

Magic Banana

I am a member!

Offline
Joined: 07/24/2010

You can install "gnome-disk-utility" in a Trisquel live session and then test the disk with it. For MemTest86+, you would need another GNU/Linux live ISO (not Trisquel's) that already includes it: it is to be started from the menu of the bootloader. The disk seems to be the culprit, though.

I forgot to write: your #1 priority is to back up the users' data (if not already done).

strypey
Offline
Joined: 05/14/2015

Magic Banana:
> "The disk seems to be the culprit, though."

Yes, I suspect so. I searched "ata1" and "32" on a Searx instance and found this:

'ata* Comreset failed error =-16 (or32) on booting'
https://ubuntuforums.org/showthread.php?t=2286314

I'm really hoping the SSD has just come loose again. Sadly, I gave away a lot of my tools before I left for China, and neither of the screwdrivers I have will open the panel covering the SSD :( Once I get hold of a jeweler's screwdriver, and re-seat the drive (wedging something in to keep it there this time), I will let you know if that fixes the issues.

strypey
Offline
Joined: 05/14/2015

This morning I'm getting more weird errors. When I tried to shut down using the GUI menu, I got a black screen with a series of numbers, followed by:

ata1: COMRESET failed (err: 32)

... or ...

EXT4-fs error (device sda5): ext4_find_entry:1439: inode #710028: comm gmain: reading directory lblock 0

Sometimes the same thing but with Cron or dbus instead of gmain. I believe sda5 is my home partition.
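
To see whether the errors cluster on one port or one partition, the kernel messages could be saved to a file (for instance with `dmesg > kernel.log`; the filename is just an example) and tallied. Here is a rough Python sketch, assuming messages shaped like the two quoted above. The regular expressions are my own guess at the format and may need adjusting for what the kernel actually prints.

```python
import re
from collections import Counter

# Sample lines modelled on the messages quoted above.
SAMPLE_LOG = """\
ata1: COMRESET failed (err: 32)
EXT4-fs error (device sda5): ext4_find_entry:1439: inode #710028: comm gmain: reading directory lblock 0
EXT4-fs error (device sda5): ext4_find_entry:1439: inode #710028: comm cron: reading directory lblock 0
ata1: COMRESET failed (err: 32)
"""

# Not anchored to the line start, so a leading dmesg timestamp is tolerated.
ATA_RE = re.compile(r"\b(ata\d+(?:\.\d+)?): COMRESET failed")
EXT4_RE = re.compile(r"EXT4-fs error \(device (\w+)\).*?comm ([^:]+):")


def tally(log_text):
    """Count COMRESET failures per ATA port and EXT4 errors per (device, process)."""
    ata = Counter()
    ext4 = Counter()
    for line in log_text.splitlines():
        m = ATA_RE.search(line)
        if m:
            ata[m.group(1)] += 1
            continue
        m = EXT4_RE.search(line)
        if m:
            ext4[(m.group(1), m.group(2).strip())] += 1
    return ata, ext4


if __name__ == "__main__":
    ata, ext4 = tally(SAMPLE_LOG)
    for port, count in ata.items():
        print(f"{port}: {count} COMRESET failure(s)")
    for (device, process), count in ext4.items():
        print(f"{device} ({process}): {count} EXT4 error(s)")
```

If every error names the same port, and the affected partitions all sit on that one drive, that would be consistent with a problem at the drive or its connector rather than damage limited to a single file system.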