The number of lines for each character by percentage of the series

It would seem that I have far too much time on my hands. After the post about a Star Trek "test", I started wondering if there could be any data to back it up and... well, here we go. Each table lists only the characters who speak more than 1% of the lines in their series:

Those Old Scientists

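| Name             | Total Lines | Percentage of Lines |
| ---------------- | :---------: | ------------------: |
| KIRK             |        8257 |               32.89 |
| SPOCK            |        3985 |               15.87 |
| MCCOY            |        2334 |                 9.3 |
| SCOTT            |         912 |                3.63 |
| SULU             |         634 |                2.53 |
| UHURA            |         575 |                2.29 |
| CHEKOV           |         417 |                1.66 |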

The Next Generation

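| Name             | Total Lines | Percentage of Lines |
| ---------------- | :---------: | ------------------: |
| PICARD           |       11175 |               20.16 |
| RIKER            |        6453 |               11.64 |
| DATA             |        5599 |                10.1 |
| LAFORGE          |        3843 |                6.93 |
| WORF             |        3402 |                6.14 |
| TROI             |        2992 |                 5.4 |
| CRUSHER          |        2833 |                5.11 |
| WESLEY           |        1285 |                2.32 |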

Deep Space Nine

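| Name             | Total Lines | Percentage of Lines |
| ---------------- | :---------: | ------------------: |
| SISKO            |        8073 |                13.0 |
| KIRA             |        5112 |                8.23 |
| BASHIR           |        4836 |                7.79 |
| O'BRIEN          |        4540 |                7.31 |
| ODO              |        4509 |                7.26 |
| QUARK            |        4331 |                6.98 |
| DAX              |        3559 |                5.73 |
| WORF             |        1976 |                3.18 |
| JAKE             |        1434 |                2.31 |
| GARAK            |        1420 |                2.29 |
| NOG              |        1247 |                2.01 |
| ROM              |        1172 |                1.89 |
| DUKAT            |        1091 |                1.76 |
| EZRI             |         953 |                1.53 |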

Voyager

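| Name             | Total Lines | Percentage of Lines |
| ---------------- | :---------: | ------------------: |
| JANEWAY          |       10238 |                17.7 |
| CHAKOTAY         |        5066 |                8.76 |
| EMH              |        4823 |                8.34 |
| PARIS            |        4416 |                7.63 |
| TUVOK            |        3993 |                 6.9 |
| KIM              |        3801 |                6.57 |
| TORRES           |        3733 |                6.45 |
| SEVEN            |        3527 |                 6.1 |
| NEELIX           |        2887 |                4.99 |
| KES              |        1189 |                2.06 |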

Enterprise

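| Name             | Total Lines | Percentage of Lines |
| ---------------- | :---------: | ------------------: |
| ARCHER           |        6959 |               24.52 |
| T'POL            |        3715 |               13.09 |
| TUCKER           |        3610 |               12.72 |
| REED             |        2083 |                7.34 |
| PHLOX            |        1621 |                5.71 |
| HOSHI            |        1313 |                4.63 |
| TRAVIS           |        1087 |                3.83 |
| SHRAN            |         358 |                1.26 |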

Discovery

Important Note: As the source material is incomplete for Discovery, the following table only includes line counts from seasons 1 and 4 along with a single episode of season 2.

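| Name             | Total Lines | Percentage of Lines |
| ---------------- | :---------: | ------------------: |
| BURNHAM          |        2162 |               22.92 |
| SARU             |         773 |                 8.2 |
| BOOK             |         586 |                6.21 |
| STAMETS          |         513 |                5.44 |
| TILLY            |         488 |                5.17 |
| LORCA            |         471 |                4.99 |
| TARKA            |         313 |                3.32 |
| TYLER            |         300 |                3.18 |
| GEORGIOU         |         279 |                2.96 |
| CULBER           |         267 |                2.83 |
| RILLAK           |         205 |                2.17 |
| DETMER           |         186 |                1.97 |
| OWOSEKUN         |         169 |                1.79 |
| ADIRA            |         154 |                1.63 |
| COMPUTER         |         152 |                1.61 |
| ZORA             |         151 |                 1.6 |
| VANCE            |         101 |                1.07 |
| CORNWELL         |         101 |                1.07 |
| SAREK            |         100 |                1.06 |
| T'RINA           |          96 |                1.02 |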

If anyone is interested, here's the (rather hurried, don't judge me) Python used:

#!/usr/bin/env python3

#
# This script assumes that you've already downloaded all the episode lines from
# the fantastic chakoteya.net:
#
# wget --accept=html,htm --relative --wait=2 --include-directories=/STDisco17/ http://www.chakoteya.net/STDisco17/episodes.html -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Enterprise/ http://www.chakoteya.net/Enterprise/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Voyager/ http://www.chakoteya.net/Voyager/episode_listing.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/DS9/ http://www.chakoteya.net/DS9/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/NextGen/ http://www.chakoteya.net/NextGen/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/StarTrek/ http://www.chakoteya.net/StarTrek/episodes.htm -m
#
# Then you'll probably have to convert the following files to UTF-8 as they
# differ from the rest:
#
# * Voyager/709.htm
# * Voyager/515.htm
# * Voyager/416.htm
# * Enterprise/41.htm
#

import re
from collections import defaultdict
from pathlib import Path

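# Episode pages are named by number, e.g. "709.htm"; this skips the index pages.
EPISODE_REGEX = re.compile(r"^\d+\.html?$")
# A spoken line starts with the speaker in capitals (apostrophes allowed, as in
# O'BRIEN or T'POL), then a colon and a space. Names containing spaces or digits
# are not matched.
LINE_REGEX = re.compile(r"^(?P<name>[A-Z']+): ")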

EPISODES = Path("www.chakoteya.net")
DISCO = EPISODES / "STDisco17"
ENT = EPISODES / "Enterprise"
TNG = EPISODES / "NextGen"
TOS = EPISODES / "StarTrek"
DS9 = EPISODES / "DS9"
VOY = EPISODES / "Voyager"

NAMES = {
    TOS.name: "Those Old Scientists",
    TNG.name: "The Next Generation",
    DS9.name: "Deep Space Nine",
    VOY.name: "Voyager",
    ENT.name: "Enterprise",
    DISCO.name: "Discovery",
}


class CharacterLines:
    def __init__(self, path: Path) -> None:
        self.path = path
        self.line_count = defaultdict(int)

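    def collect(self) -> None:
        # Count one line per match, keyed by the speaker's name.
        for episode in self.path.glob("*.htm*"):
            if EPISODE_REGEX.match(episode.name):
                # The files are expected to be UTF-8 (see the note at the top).
                for line in episode.read_text(encoding="utf-8").split("\n"):
                    if m := LINE_REGEX.match(line):
                        self.line_count[m.group("name")] += 1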

    @property
    def as_tabular_data(self) -> tuple[tuple[str, int, float], ...]:
        total = sum(self.line_count.values())
        rows = []
        for name, count in self.line_count.items():
            percentage = round(count * 100 / total, 2)
            # Only keep characters with more than 1% of the series' lines.
            if percentage > 1:
                rows.append((str(name), count, percentage))
        # Highest percentage first.
        return tuple(reversed(sorted(rows, key=lambda row: row[2])))

    def render(self) -> None:
        print(f"\n\n# {NAMES[self.path.name]}\n")
        print("| Name             | Total Lines | Percentage of Lines |")
        print("| ---------------- | :---------: | ------------------: |")
        for character, total, pct in self.as_tabular_data:
            print(f"| {character:16} | {total:11} | {pct:19} |")


if __name__ == "__main__":
    for series in (TOS, TNG, DS9, VOY, ENT, DISCO):
        counter = CharacterLines(series)
        counter.collect()
        counter.render()
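
The header comment in the script says a handful of files probably need converting to UTF-8 by hand. Here's a rough sketch of how that step might be automated. The `to_utf8` helper is hypothetical, and the `cp1252` source encoding is purely an assumption (nothing above says what encoding those files actually use), so eyeball a converted file before trusting it.

```python
from pathlib import Path

# Files the script's header comment lists as needing conversion.
NON_UTF8_FILES = (
    "Voyager/709.htm",
    "Voyager/515.htm",
    "Voyager/416.htm",
    "Enterprise/41.htm",
)


def to_utf8(path: Path, source_encoding: str = "cp1252") -> bool:
    """Rewrite `path` as UTF-8 and return True if the file was changed.

    `source_encoding` is an assumption, not something the site documents.
    """
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")  # Already valid UTF-8: leave the file alone.
        return False
    except UnicodeDecodeError:
        pass
    # errors="replace" swaps in U+FFFD for bytes the assumed encoding can't map.
    text = raw.decode(source_encoding, errors="replace")
    path.write_bytes(text.encode("utf-8"))
    return True


if __name__ == "__main__":
    root = Path("www.chakoteya.net")
    for name in NON_UTF8_FILES:
        target = root / name
        if target.exists():
            status = "converted" if to_utf8(target) else "already UTF-8"
            print(f"{name}: {status}")
```

Run it once after the `wget` downloads and before the counting script. Files that already decode as UTF-8 are left untouched, so running it a second time should be harmless.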