[PC-BSD Dev] Wild idea to speed up boot process

Claudio L. claudio at hpgcc3.org
Sat Aug 10 10:49:56 PDT 2013


On 08/10/2013 13:15, Radio młodych bandytów wrote:
> On 10/08/2013 18:07, Claudio L. wrote:
>> On 08/09/2013 14:38, Radio młodych bandytów wrote:
>>> Some thoughts:
>>> 1. W/out UnionFS hacking it's unsafe in case of power failure
>> Yes, I guess it would have to be resolved by using a sync mode where a
>> written file gets written on both the upper and lower layers. That would
>> make it easier to keep things in sync.
>> Another idea would be to simply mark the files that are modified as
>> "dirty" on the ram disk, and write them only to the lower layer. On next
>> boot, any files that are "dirty" would be re-read from the root
>> filesystem.
> OK.
>>> 2. It would lose ZFS end-to-end checksumming
>> Not necessarily: since the ram disk backup can be on the same ZFS root
>> filesystem, checksumming would be handled by the underlying ZFS.
> One of us (or both) doesn't understand something. I meant that once data
> gets read from ZFS to RAM, there's a correctness check, but later it
> stays in RAM and can get corrupted.

As far as I understand, ZFS cannot protect you from data corruption in 
RAM (nothing can!). If your RAM is corrupting data, that data can be 
corrupted right before you write it to a file, or after it is copied to 
the buffers but before the checksum is calculated, etc.
So if data is being corrupted in RAM, we are done; there's nothing we can do.

However, if you'd like to be safe, nothing prevents you from using ZFS as 
the filesystem for the ram disk. I'd probably prefer something much 
simpler, with near-zero overhead: just a simple hash table with the file 
names, the offset of the data in RAM, the size and, if you want, a 
checksum as well.
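A minimal sketch of that "simple hash table" index, assuming the ram disk is one contiguous buffer in memory. The class and method names are hypothetical (this is not an existing FreeBSD or PC-BSD interface); a Python dict plays the role of the hash table, and zlib's CRC-32 stands in for the optional checksum.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    offset: int  # where the file's bytes start inside the RAM buffer
    size: int    # length of the file in bytes
    crc: int     # CRC-32 of the contents (the optional checksum)


class RamDiskIndex:
    """Hypothetical ram disk: one contiguous buffer plus a path -> Entry table."""

    def __init__(self) -> None:
        self.buf = bytearray()
        self.table: Dict[str, Entry] = {}

    def add(self, path: str, data: bytes) -> None:
        # Append the data to the end of the buffer and record where it landed.
        self.table[path] = Entry(len(self.buf), len(data), zlib.crc32(data))
        self.buf += data

    def read(self, path: str) -> Optional[bytes]:
        entry = self.table.get(path)
        if entry is None:
            return None  # not cached: caller falls back to the root filesystem
        data = bytes(self.buf[entry.offset:entry.offset + entry.size])
        if zlib.crc32(data) != entry.crc:
            return None  # corrupted copy: treat it as a miss
        return data


rd = RamDiskIndex()
rd.add("/etc/rc.conf", b'hostname="pcbsd"\n')
print(rd.read("/etc/rc.conf"))
```

In this sketch a miss or a checksum mismatch returns None, meaning "read it from the real filesystem instead". Re-adding a path leaves the old bytes in the buffer as dead space, so a real implementation would need some form of compaction.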

>>> 3. Obviously, it wastes RAM.
>> Yes, it would double the ram usage during boot, then it would be
>> released. But since the ramdisk can be uncached, that would be memory
>> that would otherwise be used by the zfs cache anyway. And if the ram is
>> available, why not use it?
>> The idea is to release all ram after booting is finished.
> OK.
>>> 4. man mount_unionfs says: "THIS FILE SYSTEM TYPE IS NOT YET FULLY
>>> SUPPORTED (READ: IT DOESN'T WORK)". Major work is needed. Though there's
>>> also unionfs in FUSE, it may or may not work.
>> Yes, I saw that, and it doesn't inspire much confidence. But I think the
>> functionality that would be required is not exactly that of the existing
>> unionfs, so we wouldn't be able to use the existing unionfs anyway.
>>
> I think we're getting too far (viability issues are more important than
> implementation ones), but modifying unionfs seems much easier than
> writing it from scratch.
Right now this is just a mental exercise to see if there are any major 
roadblocks that say "it cannot be done, period" before I attempt 
anything. To know if it can be done, we need at least a rough idea of 
how it would have to be implemented and what it would take in terms of 
man-hours and complexity.
Your comments are very useful to clarify details.


>>> 5. We should measure the bottlenecks of the current procedure before
>>> trying to improve it. Is it going to be faster at all?
>> A lot of people already did that. Just look at the speedups you get from
>> booting on a normal hard disk w/SSD caching (Momentus XT?), which is
>> basically the same I'm proposing: a persistent cache that provides high
>> IOPS on top of a standard filesystem layer.
> This is very different. It doesn't have to read data from disk before
> use and if data is heavily fragmented, this is a major cost.
> We have the following cases:
> A) data is stored sequentially and the boot process accesses it sequentially
> B) data is stored sequentially, accessed randomly
> C) data is stored randomly, accessed randomly
> On the 1st system install, it's most likely C, but maybe we could
> manipulate it to B
> After lots of use and system update, it will end up in C and I see no
> way to prevent it.
> A is for completeness, I don't think that it happens.
> The cache that you propose improves only B. Momentus XT / persistent
> L2ARC - B and C and possibly A too.
> Still, this is theory. If you've seen numbers that you can share, please do.
Yes, it's different in the sense that it requires the data to be read 
sequentially first, so it's not as good as an SSD cache. Still, an HDD 
can read sequentially at around 120 MB/s (much faster on RAID), but under 
random access that drops to a couple of MB/s. On an SSD it doesn't drop 
nearly as much, so the gains there will be much smaller.
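To put the HDD figures in perspective, here is a back-of-the-envelope estimate. The 200 MB boot working set is an assumption chosen for illustration, not a measurement; 120 MB/s is the sequential figure quoted above, and "a couple MB/s" is taken as 2 MB/s.

```python
def read_seconds(total_mb: float, throughput_mb_s: float) -> float:
    """Time to read total_mb at a sustained throughput of throughput_mb_s."""
    return total_mb / throughput_mb_s


BOOT_SET_MB = 200     # hypothetical amount of data touched during boot
SEQUENTIAL_MB_S = 120 # HDD sequential throughput quoted above
RANDOM_MB_S = 2       # "a couple MB/s" under random access

seq = read_seconds(BOOT_SET_MB, SEQUENTIAL_MB_S)
rnd = read_seconds(BOOT_SET_MB, RANDOM_MB_S)
print(f"sequential: {seq:.1f} s, random: {rnd:.1f} s, ratio: {rnd / seq:.0f}x")
# prints "sequential: 1.7 s, random: 100.0 s, ratio: 60x"
```

The ratio is just 120 / 2, so it does not depend on the assumed boot-set size; the absolute times do. The estimate ignores the cost of populating the ram disk on the first boot and the (small) cost of the lookups in RAM.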
Actually, the cache I'm proposing converts both cases B and C into a 
single sequential read from the "slow" device and lets the random access 
happen in RAM only, so it would help in both cases.
On the initial boot, as files are accessed randomly, the ram disk gets 
populated, and after that first boot it is stored sequentially. Then on 
the next boot, a single sequential read of the ram disk is all we need.
Granted, we should somehow make sure this ram disk file doesn't get 
fragmented...
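A rough sketch of that first-boot / next-boot cycle, under stated assumptions: the image layout and the function names are hypothetical, `read_file` is whatever reads a file from the root filesystem, and the `dirty` set stands in for the "dirty" marking suggested earlier in the thread. Files are packed in first-access order, so the next boot can pull the whole image in one sequential read and parse it in memory.

```python
import struct
from typing import Callable, Dict, List, Set

# Hypothetical image layout (an assumption, not an existing format):
#   [u32 entry count], then per entry: [u16 path len][path][u32 data len][data]


def build_image(access_order: List[str], read_file: Callable[[str], bytes]) -> bytes:
    """After the first boot: pack every file touched, in first-access order."""
    seen: Set[str] = set()
    parts = []
    for path in access_order:
        if path in seen:
            continue  # only the first access determines the layout
        seen.add(path)
        data = read_file(path)
        encoded = path.encode()
        parts.append(struct.pack("<H", len(encoded)) + encoded
                     + struct.pack("<I", len(data)) + data)
    return struct.pack("<I", len(parts)) + b"".join(parts)


def load_image(image: bytes, dirty: Set[str]) -> Dict[str, bytes]:
    """On the next boot: parse an image that was read in one sequential I/O.

    Files marked dirty since the image was written are skipped, so they
    will be re-read from the root filesystem instead.
    """
    (count,) = struct.unpack_from("<I", image, 0)
    pos = 4
    cache: Dict[str, bytes] = {}
    for _ in range(count):
        (plen,) = struct.unpack_from("<H", image, pos)
        pos += 2
        path = image[pos:pos + plen].decode()
        pos += plen
        (dlen,) = struct.unpack_from("<I", image, pos)
        pos += 4
        if path not in dirty:
            cache[path] = image[pos:pos + dlen]
        pos += dlen
    return cache


files = {"/boot/kernel": b"K" * 8, "/etc/rc.conf": b"rc", "/sbin/init": b"init"}
img = build_image(["/boot/kernel", "/sbin/init", "/boot/kernel", "/etc/rc.conf"],
                  files.__getitem__)
print(sorted(load_image(img, dirty={"/etc/rc.conf"})))
```

How the dirty set itself would survive a power failure is left open here, as it is in the discussion; the sketch also does nothing to keep the image file itself unfragmented on disk.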
Claudio

