Imagine for the moment that you have a set of very large data files in CSV format. They might be log files from some process, or Census data, or other big data. Also imagine that your desktop workstation doesn't have the capacity to easily process the whole file set locally, either because some individual file is larger than your desktop RAM, or because the aggregate storage is bigger than your local SSD, or (most likely for me) the time to download everything is material and you don't want to waste bandwidth.
How do you structure file processing so that you divide it effectively between network resources and local resources? One enduring principle is to discard unneeded data as close to the source as you can. If your table has a million rows but you only need a particular thousand of them, do that filtering step first, before you sort or otherwise transform the data. Here are a couple of options for CSV processing that are helpful in cutting a problem down to size.
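As a toy illustration with plain Unix tools (ignoring, for the moment, the CSV quoting problems discussed below; trees.csv and the value Poor here are just stand-ins):
grep Poor trees.csv | sort
sort trees.csv | grep Poor
The first pipeline sorts only the matching rows; the second sorts the whole file and then throws most of that work away.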
If your data lives in S3 or in Minio's S3-compatible storage, you can use S3 Select. With the minio-mc package this looks like:
mc sql --query 'SELECT BOTANICAL,CONDITION from S3object' myminio/csv/trees.csv
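The nice part is that the query runs server-side, so you can push a WHERE clause to the object store and only pull down the rows you actually want, something like (with 'Poor' again just a stand-in value):
mc sql --query "SELECT BOTANICAL,CONDITION from S3object WHERE CONDITION = 'Poor'" myminio/csv/trees.csv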
A toolset that's limited to CSV files but quite fast is csvcut and csvgrep, from the csvkit toolkit. These are particularly interesting where the CSV file has embedded nonsense like commas inside quoted fields: csvkit does the right thing by default, whereas a naive awk approach would be complicated because you'd have to parse CSV in awk yourself, which is possible but gets messy fast.
csvcut -c BOTANICAL,CONDITION trees.csv
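csvgrep handles the row-filtering half of the job, and the two pipe together nicely (once more, 'Poor' is just a placeholder value):
csvgrep -c CONDITION -m Poor trees.csv | csvcut -c BOTANICAL,CONDITION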
q is part of the Debian package python-q-text-as-data, and it offers a full SQLite SQL query interface to CSV or TSV data. This means that, beyond the simple SELECT operations available from the S3 command, you can do joins, GROUP BY, and all of that other SQL fun that I should have learned when I was younger but didn't, because these sorts of tools didn't exist then.
q -H -d, 'SELECT BOTANICAL,CONDITION from trees.csv'
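For instance, a quick GROUP BY to count rows per CONDITION value, which none of the simpler tools above will do for you, looks something like:
q -H -d, 'SELECT CONDITION, COUNT(*) from trees.csv GROUP BY CONDITION'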
All of these return the same results. Time on a 57,000-record file on my system was in the 2 to 6 second range (plenty fast for interactive work), with the csvcut operation the fastest, and the performance of mc sql dependent on the speed of my network connection, since I was running the Minio server remotely. Which to use is a matter of taste, and perhaps familiarity, and certainly has a lot to do with where your data already lives.