OpenSolaris-derived ZFS NAS/SAN (OmniOS, OpenIndiana, Solaris and napp-it)

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,093
OK, the ashift values are fine: consistent within the datapool and at 12.
(A mismatch can happen if you add another vdev to a pool.)
The ashift=0 values are not clear to me but seem uncritical, as they do not affect data or disks.
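To verify the ashift per vdev yourself, zdb can print the cached pool configuration; the pool name "tank" below is a placeholder, so adjust it to your pool:

```shell
# Print the cached pool config and filter the ashift lines.
# One line per vdev is expected, ideally all showing ashift: 12.
# "tank" is a placeholder pool name.
zdb -C tank | grep ashift
```

Running `zdb` without arguments dumps the config of all imported pools, which works as well.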

Performance:
- I assume you have enabled sync on the SSD pool. This results in poor but secure write performance. To validate, disable sync and retry a write. (On VM storage you should enable sync, or your guest filesystems can become corrupted by a crash during write.)
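A minimal sketch of that test, assuming a filesystem named tank/vmstore (adjust names to your setup):

```shell
# Check the current sync setting, disable it for a test run,
# then re-enable it afterwards. All names are placeholders.
zfs get sync tank/vmstore
zfs set sync=disabled tank/vmstore
# ... repeat your write test here and compare the numbers ...
zfs set sync=standard tank/vmstore   # or sync=always for VM storage
```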

To benchmark pool performance use menu Pool > Benchmark.
This is a series of large and small read/write IOs with sync enabled vs sync disabled.

Check:
Is your vnic a vmxnet3 or an e1000? The first is faster.
Have you run System > Appliance tuning with defaults?
This will improve NFS and IP performance between storage and ESXi.

An SSD pool with an additional Slog is "suboptimal", as every write must be done twice: once fast via the RAM-based write cache and once on every write commit. As I see it, your pool SSDs are desktop models and only the log is an enterprise model with powerloss protection. In such a case the Slog improves write security a little, but a small risk remains: the pool itself has no powerloss protection, so there is no guarantee that the last writes are safe after a crash.

Best regarding security and performance would be a mirror of SSDs with powerloss protection, with sync enabled and without an extra Slog.
 
Last edited:

docjay

n00b
Joined
Dec 1, 2014
Messages
14
I'm running OmniOS, v11 r151042b. All of my VMs use vmxnet3 except OmniOS, as I cannot make it see the NIC when I choose vmxnet3 from my host. I have to flip it back to e1000 for it to recognize a NIC. Know any tricks for this? I've read your manuals and didn't see anything special. I also have VMware tools installed; they might need to be updated.

I did notice that I had only one NIC on my vSwitch, probably the default setup. It had defaulted to 100 Mbit. I've added two more NICs; they autodetected to 100 Mbit as well, but I forced them to 1000/full. I also have a 10G MikroTik switch on the way and will attempt again to get OmniOS to see the vmxnet3 NIC.

I have also removed the SLOG and disabled sync on my SSDMirror. Still working on your other suggestions.

Thank you very much for your guidance and your work on napp-it!

UPDATE: I now have a 10GbE switch in place, connected to a 10G SFP+ port on my host. My machine ID changed, so my license needs to be updated; I will send an email about this. I've added the new NIC to my vSwitch0 in ESXi.
 
Last edited:

chune

Weaksauce
Joined
Nov 2, 2013
Messages
71
If you have a drive fail in a pool that is 80% full, then start a resilver operation only to realize you can purge some ancient snaps that free up an additional 40% of the array: will the resilver operation see that dynamically, or will it continue to resilver all of the 80%-full array and not see the freed space until after the drive replacement is complete?
 

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,093
A drive resilver reads all data that has a metadata reference. If the amount of referenced data shrinks, the resilver should finish faster. With modern ZFS and sorted resilvering, the difference may not be huge.
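If you want to free the space while the resilver runs, destroying old snapshots and watching progress could look like this (snapshot, filesystem and pool names are placeholders):

```shell
# List snapshots sorted by space used, destroy an old one,
# then check how far the resilver has progressed.
# All names below are placeholders.
zfs list -t snapshot -o name,used -s used tank
zfs destroy tank/data@old-2015
zpool status -v tank
```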
 

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,093
The multithreaded Solaris/ZFS integrated SMB server allows file/folder-based, NTFS-alike ACLs but also share-based ACLs. These are ACLs on a share control file /pool/filesystem/.zfs/shares/filesystem. This file is created when you activate a share and deleted when you disable a share. Share ACLs are therefore not persistent.

In current napp-it 22.dev, share ACLs are preserved as ZFS properties. When you re-enable a share you can now restore the last share ACL or set basic settings like everyone=full, modify or read.
 
Last edited:

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,093
Save energy on your napp-it backupserver

Energy costs have multiplied since last year. This is a real problem for a backup server that is up 24/7 when you only want to back up your storage server once or a few times a day, especially as incremental ongoing ZFS replications finish within minutes.

A money-saving solution is to remotely power up the backup server via IPMI, sync the filesystems via ZFS replication, and power the backup server off when the replications are finished. For this I have created a script for a napp-it 'other job' on your storage server to simplify and automate this.
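The basic idea can be sketched with ipmitool; host, user and password below are placeholders, and the replication step stands in for your napp-it job:

```shell
# Wake the backup server out-of-band via its IPMI/BMC.
ipmitool -I lanplus -H backup-bmc -U admin -P secret chassis power on

# ... wait until the OS is up, then run the ZFS replication job ...

# Ask the OS for a clean shutdown once replication has finished.
ipmitool -I lanplus -H backup-bmc -U admin -P secret chassis power soft
```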

Details, see https://forums.servethehome.com/ind...laris-news-tips-and-tricks.38240/#post-357328
 

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,093
OpenIndiana Hipster 2022.10 is here

https://www.openindiana.org/2022/12/04/openindiana-hipster-2022-10-is-here/
OpenIndiana is an Illumos distribution and more or less the successor of OpenSolaris. It comes in a desktop edition with a MATE GUI, browser, email and office apps, a text edition similar to OmniOS bloody, and a minimal distribution. Usually you install the desktop or text edition; minimal lacks essential tools.

While OpenIndiana Hipster tracks ongoing Illumos (every pkg update gives you the newest Illumos, so it is quite a reference Illumos installation), there are annual snapshots that give beginners a tested starting point. This is the main difference to OmniOS, where stability with dedicated stable repositories is the main concern.

During setup, select your keyboard but keep language=en when using napp-it.
 

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,093
Use case and performance considerations for an OmniOS/OpenIndiana/Solaris based ZFS server
This is what I am asked quite often.

If you simply want the best performance, durability and security: order a server with a very new CPU with a frequency > 3 GHz and 6 cores or more, 256 GB RAM and huge flash-only storage with 2 x 12G multipath SAS (10 dwpd) or NVMe in a multi-mirror setup, with datacenter-quality powerloss protection to protect data on a power loss during writes or background garbage collection. Do not forget to order twice, as you need a backup at a second location, at least for a disaster like fire, theft or ransomware.

Maybe you can follow this simple suggestion; more likely you are searching for a compromise between price, performance and capacity for a given use scenario. Be aware that when you define two of the three parameters, the third is a result of your choice, e.g. low price + high capacity = low performance.

As your main concern should be a workable solution, do not start with a price restriction but with your use case and the performance it needs (low, medium, high, extreme). With a few users and mainly office documents, your performance need is low; even a small server with a 1.5 GHz dual-core CPU, 4-8 GB RAM and a mirror of two SSDs or HDs can be good enough. Add some external USB disks for a rolling daily backup and you are ready.

If you are a media firm with many users that want to edit multitrack 4k video from ZFS storage, you need an extreme solution regarding pool performance (> 2 GB/s sequential read/write), network (at minimum multiple 10G) and capacity according to your needs. Maybe you will come to the conclusion to prefer local NVMe for hot data and a medium-class disk-based storage for shared file access and versioning only. Do not forget to add a disaster backup solution.

After you have defined the performance class/use case (low, medium, high, extreme), select the needed components.


CPU
For lower performance needs and 1G networks, you can skip this; even a cheap dual/quad-core CPU is good enough. If your performance need is high or extreme, with high throughput in a 10G network, or when you need encryption, ZFS is quite CPU hungry, as you can see in https://www.napp-it.org/doc/downloads/epyc_performance.pdf. If you have the choice, prefer higher frequency over more cores. If you need sync write (VM storage or databases), avoid encryption, as encrypted small sync writes are always very slow, and add an Slog for disk-based pools.

RAM
Solaris-based ZFS systems are very resource efficient due to the deep integration of iSCSI, NFS and SMB into the Solaris kernel, which was developed around ZFS from the beginning. A 64-bit Solaris-based OS itself needs less than 3 GB to be stable with any pool size. Use at least 4-8 GB RAM to allow some caching for low to medium needs with only a few users.

memory.png

As ZFS uses most of the RAM (unless dynamically demanded by other processes) for ultrafast read/write caching, you may want to add more RAM. By default Open-ZFS uses 10% of RAM for write caching. As a rule of thumb you should collect all small writes < 128K in the RAM-based write cache, as smaller writes are slower or very slow. Since only half of the write cache is usable while the other half is being flushed to disk, you want at least 256K of write cache, which you already have with 4 GB RAM in a single-user scenario. The RAM needed for write caching scales with the number of users that write concurrently, so add around 0.5 GB RAM per active concurrent user.
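As a rough illustration of these rules of thumb (10% of RAM as write cache, half of it usable, plus ~0.5 GB per concurrent writer; the numbers are estimates, not guarantees):

```shell
# Usable write cache in MB for a given RAM size in GB:
# Open-ZFS defaults to ~10% of RAM as write cache, of which about
# half is usable while the other half is being flushed to disk.
usable_writecache_mb() {
  echo $(( $1 * 1024 / 10 / 2 ))
}

# Rule-of-thumb RAM in GB: ~4 GB base plus 0.5 GB per active writer.
ram_for_writers_gb() {
  echo $(( 4 + $1 / 2 ))
}

usable_writecache_mb 4    # 4 GB RAM -> 204 MB usable write cache
ram_for_writers_gb 10     # 10 concurrent writers -> 9 GB RAM
```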

Oracle Solaris with native ZFS works differently. The RAM-based write cache holds the last 5s of writes and can consume up to 1/8 of total RAM. In general this leads to similar RAM needs as OI/OmniOS with Open-ZFS. On a faster 10G network with a max write of 1 GB/s this means 8 GB RAM minimum, plus the RAM wanted for read caching.


Most of the remaining RAM is used for ultrafast RAM-based read caching (Arc). The read cache works only for small IO, with a read-last/read-most optimization. Large files are not cached at all. Cache hits are therefore for metadata and small random IO. Check napp-it menu System > Basic Statistic > Arc after some time of storage usage. Unless you have a use scenario with many users, many small files and high volatility (e.g. a larger mailserver), the cache hit rate should be > 80% and the metadata hit rate > 90%. If the results are lower, add more RAM or use high-performance storage like NVMe, where caching is not as important.

arc.png

If you read about 1 GB RAM per TB of storage, forget it. It is a myth unless you activate RAM-based realtime dedup (not recommended at all; if dedup is needed, use fast NVMe as a special vdev mirror for the dedup table). The needed RAM size depends on the number of users and files, or the wanted cache hit rate, not on pool size.

L2Arc
L2Arc is an SSD or, at best, an NVMe that extends the RAM-based Arc. L2Arc is not as fast as RAM but can increase cache size when more RAM is not an option, or help when the server is rebooted more often, as L2Arc is persistent. As L2Arc needs RAM to organize it, do not use more than say 5x RAM as L2Arc. Additionally you can enable read-ahead on L2Arc, which may improve sequential reads a little (add "set zfs:l2arc_noprefetch=0" to /etc/system or use napp-it System > Tuning).
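The corresponding settings as commands; the device name c0t5d0 and pool name tank are placeholders:

```shell
# Add an SSD/NVMe as L2Arc cache device to the pool.
zpool add tank cache c0t5d0

# Enable read-ahead into L2Arc (illumos /etc/system tuning,
# active after reboot); napp-it System > Tuning does the same.
echo "set zfs:l2arc_noprefetch=0" >> /etc/system
```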


Disk types
RAM can improve ZFS performance a lot via read/write caching. For larger sequential writes and reads, or many small IOs, only raw storage performance counts. If you look at disk specs, the two most important values are the sequential transfer rate for large transfers and the iops rating that counts when you read or write small datablocks.

Mechanical disks
On mechanical disks you find values of around 200-300 MB/s max sequential transfer rate and around 100 iops. As a Copy-on-Write filesystem, ZFS is not optimized for a single-user/single-datastream load; it spreads data quite evenly over the pool for the best multiuser/multithread performance. It is therefore affected by fragmentation, with many smaller datablocks spread over the whole pool, where performance is limited more by iops than by sequential values. In average use you will often see no more than 100-150 MB/s per disk. When you enable sync write on a single mechanical disk, write performance is no better than say 10 MB/s due to the low iops rating.

Desktop Sata SSD
can achieve around 500 MB/s (6G Sata) and a few thousand iops. Often the iops values from the specs are only valid for a short time, until performance drops to a fraction under steady writes.

Enterprise SSDs
can hold their performance and offer powerloss protection (PLP). Without PLP, the last writes are not safe on a power outage during write, nor is data on disk during background operations like firmware-based garbage collection that keeps SSD performance high.

Enterprise SSDs are often available as 6G Sata or 2 x 12G multipath SAS. When you have a SAS HBA, prefer 12G SAS models due to the higher performance (up to 4x faster than 6G Sata) and because SAS is full duplex while Sata is only half duplex, with more robust signalling and up to 10m cable length (Sata: 1m). The best SAS SSDs achieve up to 2 GB/s transfer rate and over 300k iops on steady 4k writes. SAS is also a way to easily build a storage with more than 100 hotplug disks with the help of SAS expanders.

NVMe is the fastest option for storage. The best, like the Intel Optane 5800X, are rated at 1.6M iops and 6.4 GB/s transfer rate. In general, desktop NVMe lack powerloss protection and cannot hold their write iops on steady writes, so prefer datacenter models with PLP. While NVMe is ultrafast, it is not as easy to use many of them, as each wants a 4x PCIe lane connection (PCIe card, M.2 or OCuLink/U.2 connector). For larger capacity, SAS storage is often nearly as fast and easier to implement, especially when hotplug is wanted. NVMe is perfect for a second, smaller high-performance pool for databases/VMs, or to tune a ZFS pool: an Slog for faster sync writes on disk-based pools, a persistent L2Arc, or a special vdev mirror.


ZFS Pool Layout

ZFS groups disks into a vdev and stripes several vdevs into a pool to improve performance or reliability. While a ZFS pool from a single disk without redundancy rates as described above, a vdev from several disks can behave better.

Raid-0 pool (ZFS always stripes data over vdevs in a raid-0)
You can create a pool from a single disk (a basic vdev) or a mirror/raid-Z vdev and add more vdevs to create a raid-0 configuration. From the math, overall read/write performance is number of vdevs x performance of a single vdev, as each must only process 1/n of the data. Real-world performance is not a full factor n but more like 1.5 to 1.8x for two vdevs, depending on disks and disk caches, and the per-vdev gain decreases with more vdevs. Keep this in mind when you want to decide whether ZFS performance is "as expected".
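A small sketch of this estimate; the efficiency percentage (say 75-90%) is the hedge for the real-world loss mentioned above, not a measured value:

```shell
# Estimated pool throughput: vdevs x per-vdev MB/s, scaled down by an
# efficiency percentage, since real-world striping does not scale
# perfectly linearly with the number of vdevs.
est_throughput_mb() {
  vdevs=$1; per_vdev_mb=$2; eff_pct=$3
  echo $(( vdevs * per_vdev_mb * eff_pct / 100 ))
}

est_throughput_mb 2 150 80   # two 150 MB/s hd vdevs at 80% -> 240 MB/s
```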

A pool from a single n-way mirror vdev
You can mirror two or more disks to create a mirror vdev. Mostly you mirror to improve data security, as the write performance of an n-way mirror equals a single disk (a write is done when it is on all disks). As ZFS can read from all disks simultaneously, read performance and read iops scale with n. When a single disk rates at 100 MB/s and 100 iops, a 3-way mirror can give up to 300 MB/s and 300 iops. If you run a napp-it Pool > Benchmark with a single-stream read benchmark vs a five-stream one, you can see the effect. In a 3-way mirror, any two disks can fail without data loss.

A pool from multiple n-way mirror vdevs
Some years ago a ZFS pool from many striped mirror vdevs was the preferred method for faster pools. Nowadays I would use mirrors only when one mirror is enough, or when an easy extension to a later Raid-10 setup, e.g. from 4 disks, is planned. If you really need performance, use SSD/NVMe, as they are by far superior.

A pool from a single Z1 vdev
A Z1 vdev is good to combine up to say 4 disks. Such a 4-disk Z1 vdev gives the capacity of 3 disks. One disk of the vdev is allowed to fail without data loss. Unlike other raid types like raid-5, a read error in a degraded Z1 does not mean a lost pool but only a reported damaged file affected by the read error. This is why Z1 is much better than, and named differently from, raid-5. Sequential read/write performance of such a vdev is similar to a 3-disk raid-0, but iops are only like a single disk (all heads must be in position prior to an IO).

A pool from a single Z2 vdev
A Z2 vdev is good to combine say 5-10 disks. A 7-disk Z2 vdev gives the capacity of 5 disks. Any two disks of the vdev are allowed to fail without data loss. Unlike other raid types like raid-6, a read error in a fully degraded Z2 does not mean a lost pool but only a reported damaged file affected by the read error. This is why Z2 is much better than, and named differently from, raid-6. Sequential read/write performance of such a vdev is similar to a 5-disk raid-0, but iops are only like a single disk (all heads must be in position prior to an IO).

A pool from a single Z3 vdev
A Z3 vdev is good to combine say 11-20 disks. A 13-disk Z3 vdev gives the capacity of 10 disks. Any three disks of the vdev are allowed to fail without data loss. There is no equivalent to Z3 in traditional raid. Sequential read/write performance of such a vdev is similar to a 10-disk raid-0, but iops are only like a single disk (all heads must be in position prior to an IO).
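The capacity arithmetic behind the three raid-Z levels is simply disks minus parity:

```shell
# Usable capacity of a raid-Z vdev: (disks - parity) * disk size.
# This ignores metadata/padding overhead, so treat it as an upper bound.
raidz_capacity_tb() {
  disks=$1; parity=$2; disk_tb=$3
  echo $(( (disks - parity) * disk_tb ))
}

raidz_capacity_tb 4 1 4    # 4-disk Z1 of 4 TB disks -> 12 TB
raidz_capacity_tb 7 2 4    # 7-disk Z2  -> 20 TB
raidz_capacity_tb 13 3 4   # 13-disk Z3 -> 40 TB
```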


A pool from multiple raid Z[1-3] vdevs
Such a pool stripes the vdevs, which means sequential performance and iops scale with the number of vdevs (not linearly, similar to the raid-0 degression with more disks).


Many small disks vs less larger disks
Many small disks can be faster, but they are more power hungry, and as the performance improvement is not linear while the failure rate scales with the number of parts, I would always prefer fewer but larger disks. The same goes for the number of vdevs: prefer a pool made of fewer vdevs. If you have a pool of say 100 disks and an annual failure rate of 5%, you have 5 bad disks per year. If you assume a resilver time of 5 days per disk, you can expect 3-4 weeks per year in which a resilver is running, with a noticeable performance degradation.


Special vdev
Some high-end storages offer tiering, where active or performance-sensitive files can be placed on a faster part of an array. ZFS does not offer traditional tiering, but you can place critical data on a faster vdev of a ZFS pool based on its physical size (small IO), its type (dedup or metadata), or the recsize setting of a filesystem. The main advantage is that you do not need to copy files around, so this is often a superior approach, as the really slow data is mostly data with a small physical file or block size. As a lost vdev means a lost pool, always use special vdevs as an n-way mirror. Use the same ashift as all other vdevs (mostly ashift=12 for 4k physical disks) to allow a special vdev to be removed.

To use a special vdev, use menu Pools > Extend and select a mirror (best a fast SSD/NVMe mirror with PLP) with type=special. Allocations in the special class are dedicated to specific block types. By default this includes all metadata, the indirect blocks of user data, and any deduplication tables. The class can also be provisioned to accept small file blocks. This means you can force all data of a certain filesystem onto the special vdev when you set the ZFS property "special_small_blocks", e.g. special_small_blocks=128K for a filesystem with a recsize setting smaller or equal. In such a case all small IO and some critical filesystems are on the faster vdev, the others on the regular pool. If you add another special vdev mirror, load is distributed over both vdevs. If a special vdev is too full, data is stored on the other, slower vdevs.
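On the command line the same setup could look like this; the napp-it menu does the equivalent, and the device, pool and filesystem names here are placeholders:

```shell
# Add a mirrored special vdev with the same ashift as the rest of the
# pool so that it stays removable.
zpool add -o ashift=12 tank special mirror c1t0d0 c1t1d0

# Route all blocks <= 128K of this filesystem to the special vdev
# (with a recordsize <= 128K this covers all of its data).
zfs set special_small_blocks=128K tank/vms
```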

Slog
With ZFS, all writes always go to the RAM-based write cache (there may be a direct IO option in a future ZFS) and are written as fast large transfers with a delay. On a crash during write, the content of the write cache is lost (up to several MB). Filesystems on VM storage or databases may get corrupted. If you cannot allow such a data loss, you can enable sync write for a filesystem. This forces any write commit immediately to a faster ZIL area of the pool, or to a fast dedicated Slog device that can be much faster than the pool's ZIL area, and additionally, in a second step, as a regular cached write. Every bit you write is written twice: once directly and once collected in the write cache. This can never be as fast as a regular write via the write cache, so an Slog is not a performance option but a security option for when you want acceptable sync write performance. The Slog is never read except after a power outage, to redo missing writes on the next reboot, similar to the BBU protection of a hardware raid. Add an Slog only when you need sync write, and buy the best you can afford regarding low latency, high endurance and 4k write iops. The Slog can be quite small (min 10 GB). Widely used are the Intel datacenter Optanes.
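Adding an Slog is a one-liner; mirroring it is optional but avoids losing sync-write protection if the device dies. Device and pool names below are placeholders:

```shell
# Attach a fast, PLP-protected device as dedicated Slog.
zpool add tank log c2t0d0

# Or mirrored, so a failing Slog cannot cost the last sync writes:
# zpool add tank log mirror c2t0d0 c2t1d0
```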

Tuning
Beside the above "physical" options you have a few tuning options. For faster 10G+ networks you can increase TCP buffers or NFS settings in menu System > Tuning. Another option is jumbo frames, which you can set in menu System > Network Eth, e.g. to a "payload" of 9000. Do not forget to set all switches to the highest possible MTU value, or at least to 9126 (to include IP headers).
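On illumos the MTU can also be set with dladm; the link name vmxnet3s0 is a placeholder, and the link usually has to be unplumbed before the property can be changed:

```shell
# Raise the link MTU to 9000 for jumbo frames, then verify.
dladm set-linkprop -p mtu=9000 vmxnet3s0
dladm show-linkprop -p mtu vmxnet3s0
```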

Another setting is the ZFS recsize. For VM storage with filesystems on it, I would set it to 32K or 64K (not lower, as ZFS becomes inefficient then). For media data a higher value of 512K or 1M may be faster.

more, https://www.napp-it.org/doc/downloads/napp-it_build_examples.pdf
 

arryo

n00b
Joined
May 23, 2012
Messages
58
I have a problem accessing ZFS data via the local network. The speed seems very slow even though I have a 1G ethernet port out of my all-in-one system (ESXi boots from a separate USB stick, OmniOS is on an SSD, and the ZFS data is on HDDs). The pool is not filled all the way, and whenever I play a 4K movie stored on ZFS (around 80-90 Mbps) via wired LAN, it stutters. When I change to wireless, somehow the situation gets better. What should I do to make the connection better?
 

_Gea

2[H]4U
Joined
Dec 5, 2010
Messages
4,093
You should first check pool performance in menu Pool > Benchmarks. This is a series of read/write tests with sync vs async. If performance is as expected, test network performance via iperf (server in menu Services, client in System > Network Eth). To rule out cable/switch problems, compare with direct cabling. If you use special settings like jumbo frames, disable them.
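If you prefer the command line over the napp-it menus, a quick iperf3 round trip looks like this (hostname is a placeholder):

```shell
# On the storage server: start the iperf3 server.
iperf3 -s

# On the client machine: run 4 parallel streams for 10 seconds.
iperf3 -c nas.local -t 10 -P 4
```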
 