Text Processing Pipeline with sed, awk, cut, and sort
Build text processing pipelines to transform, extract, and aggregate data from CSVs, TSVs, and structured text files using standard Unix tools.
Prerequisites
- Bash with GNU coreutils
- Structured text files (CSV, TSV, logs)
Steps
Extract columns from CSV/TSV data
Use cut or awk to extract specific columns from delimited files.
Use awk -F',' '{print $1, $3}' for more control over output formatting and delimiters.
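For example, given a hypothetical users.csv (the file name and sample data are illustrative), cut selects fields by position while awk also lets you choose the output delimiter:

```shell
# Sample data (hypothetical, for illustration only)
cat > users.csv <<'EOF'
name,age,city
alice,30,austin
bob,25,boston
EOF

# cut: extract columns 1 and 3 by position, keeping the comma delimiter
cut -d',' -f1,3 users.csv

# awk: same columns, but emit them tab-separated via OFS
awk -F',' -v OFS='\t' '{print $1, $3}' users.csv
```

cut is simpler for fixed positional extraction; awk wins as soon as you need reordering, computed fields, or a different output delimiter.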
Transform text with sed substitutions
Use sed to find and replace patterns, remove lines, or reformat text.
Use -E for extended regex (no need to escape parentheses). Use -i.bak to edit in place with a backup.
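A short sketch of those sed patterns, run against a hypothetical notes.txt:

```shell
# Sample data (hypothetical)
cat > notes.txt <<'EOF'
TODO: fix the parser
done: write tests
TODO: update docs
EOF

# Substitute a pattern on every line (g = all occurrences per line)
sed 's/TODO/PENDING/g' notes.txt

# Delete lines matching a pattern
sed '/^done:/d' notes.txt

# -E: extended regex, so ( ) | + need no backslashes
sed -E 's/(TODO|done): //' notes.txt

# Edit in place, keeping the original as notes.txt.bak (GNU sed syntax)
sed -i.bak 's/TODO/PENDING/' notes.txt
```

Note that -i.bak behaves slightly differently on BSD/macOS sed, where the backup suffix must be a separate argument (sed -i .bak ...).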
Sort and deduplicate data
Sort by specific columns and remove duplicate entries.
-t sets the field delimiter; -k2,2 restricts the sort key to column 2 (start field and end field); the n suffix in -k1,1n makes that key compare numerically rather than lexically.
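Applied to a hypothetical scores.csv (sample data is illustrative):

```shell
# Sample data (hypothetical); note the duplicate last row
cat > scores.csv <<'EOF'
3,carol,88
1,alice,95
2,bob,95
1,alice,95
EOF

# Sort numerically by column 1
sort -t',' -k1,1n scores.csv

# Sort by column 2, then drop adjacent duplicate lines
sort -t',' -k2,2 scores.csv | uniq

# sort -u sorts and deduplicates whole lines in one step
sort -t',' -u scores.csv
```

Remember that uniq only removes adjacent duplicates, which is why it is almost always preceded by sort.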
Aggregate and compute with awk
Calculate sums, averages, and counts from columnar data.
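A minimal sketch over a hypothetical tab-separated sales.tsv, showing a sum, an average, and a per-group total:

```shell
# Sample data (hypothetical): region<TAB>amount
printf 'north\t100\nsouth\t250\nnorth\t50\n' > sales.tsv

# Sum column 2 across all rows
awk -F'\t' '{sum += $2} END {print sum}' sales.tsv   # → 400

# Average of column 2
awk -F'\t' '{sum += $2; n++} END {print sum / n}' sales.tsv

# Per-group totals keyed on column 1, using an associative array
awk -F'\t' '{totals[$1] += $2} END {for (k in totals) print k, totals[k]}' sales.tsv
```

The END block runs once after all input is consumed, which is where accumulated totals get printed; note that awk's for (k in totals) does not guarantee key order, so pipe through sort if order matters.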
Build a multi-stage pipeline
Chain tools together to filter, transform, and summarize data in one command.
Read pipelines left to right: extract IPs, sort them, count unique occurrences, sort by count descending, show top 10.
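That description maps one-to-one onto a pipeline; here is a sketch against a hypothetical access.log where the IP is the first whitespace-separated field:

```shell
# Sample log (hypothetical)
cat > access.log <<'EOF'
10.0.0.1 GET /index.html
10.0.0.2 GET /about.html
10.0.0.1 GET /index.html
10.0.0.1 POST /login
10.0.0.3 GET /index.html
EOF

# extract IPs | sort them | count unique occurrences | sort by count desc | top 10
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10
```

The intermediate sort is required because uniq -c only counts adjacent identical lines; the second sort -rn orders by the numeric count that uniq -c prepends.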
Full Script
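The original full script is not shown here; the following is one possible sketch that strings the steps above together, using hypothetical file and column names:

```shell
#!/usr/bin/env bash
# Sketch only: combines extract -> transform -> sort/dedup -> aggregate
# over a hypothetical orders.csv with columns id,region,amount.
set -euo pipefail

cat > orders.csv <<'EOF'
id,region,amount
1,north,100
2,south,250
3,north,50
4,south,250
EOF

# 1. Extract region and amount, skipping the header row
tail -n +2 orders.csv | cut -d',' -f2,3 > region_amount.csv

# 2. Transform: uppercase the region name (\U is GNU sed only)
sed -E 's/^([a-z]+)/\U\1/' region_amount.csv > normalized.csv

# 3. Sort by region and drop duplicate rows
sort -t',' -k1,1 normalized.csv | uniq > deduped.csv

# 4. Aggregate: total amount per region
awk -F',' '{t[$1] += $2} END {for (r in t) print r, t[r]}' deduped.csv | sort
```

Each stage writes an intermediate file for clarity; in practice the four stages would usually be chained into a single pipeline.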