14.5. File Read

14.5.1. Rationale

  • Works with both relative and absolute paths

  • Fails when the directory containing the file cannot be accessed

  • Fails when the file cannot be accessed

  • Uses context manager

  • The mode parameter of the open() function is optional (defaults to mode='rt')
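
The failure cases listed above can be sketched as follows. This is a minimal illustration, not part of the original material: the helper name `read_text` and the temporary paths are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def read_text(path):
    """Return the file content, or None when the file cannot be opened."""
    try:
        with open(path) as file:  # no mode given, so mode='rt' is used
            return file.read()
    except (FileNotFoundError, PermissionError) as error:
        print(f'Cannot read file: {type(error).__name__}')
        return None


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / 'myfile.txt'                  # file that exists
    existing.write_text('hello\n')
    missing = Path(tmp) / 'no-such-dir' / 'myfile.txt'   # directory does not exist

    content_ok = read_text(existing)
    content_missing = read_text(missing)
```

Whether to catch the exception or let it propagate depends on the program; returning None here is only one possible choice.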

14.5.2. Read From File

  • Always remember to close the file

    >>> FILE = r'/tmp/myfile.txt'
    >>>
    >>> file = open(FILE)
    >>> data = file.read()
    >>> file.close()
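
If read() raises an exception, the close() call in the example above is never reached. A common way to guard against that without a context manager is try/finally. The sketch below assumes a scratch file created with tempfile.mkstemp(); that path is generated for the example only.

```python
import os
import tempfile

# Create a scratch file to read from (generated path, used only in this example)
descriptor, path = tempfile.mkstemp(suffix='.txt')
os.write(descriptor, b'hello\n')
os.close(descriptor)

file = open(path)
try:
    data = file.read()
finally:
    file.close()          # runs even if read() raises

was_closed = file.closed  # attribute reporting whether the file is closed
os.remove(path)
```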
    

14.5.3. Read Using Context Manager

  • Context managers use the with ... as ...: syntax

  • The file is closed automatically when the block exits (on dedent), even if an exception is raised inside it

  • Using a context manager is the best practice

  • More information in Protocol Context Manager

    >>> FILE = r'/tmp/myfile.txt'
    >>>
    >>> with open(FILE) as file:
    ...     data = file.read()
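
The automatic closing can be checked with the file.closed attribute. The following sketch uses a temporary file and a deliberately raised ValueError; both are assumptions introduced for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'myfile.txt'
    path.write_text('hello\n')

    # Normal exit from the block
    with open(path) as file:
        data = file.read()
    closed_after_block = file.closed

    # Exit caused by an exception inside the block
    try:
        with open(path) as file:
            raise ValueError('simulated failure inside the block')
    except ValueError:
        pass
    closed_after_error = file.closed
```

In both cases the file ends up closed, which is the main reason the with statement is preferred over calling close() by hand.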
    

14.5.4. Read File at Once

  • Note that the whole file must fit into memory

    >>> FILE = r'/tmp/myfile.txt'
    >>>
    >>> with open(FILE) as file:
    ...     data = file.read()
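
When a file may be too large to fit into memory, one option is to pass a size to read() and process the file in chunks. This is a sketch under assumed values: CHUNK_SIZE is deliberately tiny and the file content is made up for the example.

```python
import tempfile
from pathlib import Path

CHUNK_SIZE = 4  # characters per read; unrealistically small, for demonstration

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'myfile.txt'
    path.write_text('hello world\n')

    chunks = []
    with open(path) as file:
        # read(size) returns an empty string at the end of the file
        while chunk := file.read(CHUNK_SIZE):
            chunks.append(chunk)
```

Only one chunk is held in memory at a time if each chunk is processed immediately instead of being appended to a list.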
    

14.5.5. Read File as List of Lines

  • Note that the whole file must fit into memory

    >>> FILE = r'/tmp/myfile.txt'
    >>>
    >>> with open(FILE) as file:
    ...     data = file.readlines()
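
Each string returned by readlines() keeps its trailing newline character. The sketch below compares it with read().splitlines(), which drops the line endings. The file content is an assumption made for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'myfile.txt'
    path.write_text('sepal_length\nsepal_width\nspecies\n')

    with open(path) as file:
        with_newlines = file.readlines()             # newline kept on each item

    with open(path) as file:
        without_newlines = file.read().splitlines()  # newline removed

    # Equivalent clean-up applied to the readlines() result
    stripped = [line.rstrip('\n') for line in with_newlines]
```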
    

Read selected lines from file (slice [1:30] returns indexes 1-29, that is, the 2nd through the 30th line):

>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE) as file:
...     lines = file.readlines()[1:30]

Iterate over selected lines (slice [1:30]) from file:

>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE) as file:
...     for line in file.readlines()[1:30]:
...         print(line)
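
Slicing the result of readlines() still loads the whole file first. A possible alternative is itertools.islice(), which stops reading once the requested range is consumed. In this sketch the 100-line file is generated only for the demonstration.

```python
import tempfile
from itertools import islice
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'myfile.txt'
    path.write_text(''.join(f'line {i}\n' for i in range(100)))

    # Lazy version: reads only up to index 29
    with open(path) as file:
        lines = list(islice(file, 1, 30))

    # Eager version from the examples above, for comparison
    with open(path) as file:
        sliced = file.readlines()[1:30]

    same = lines == sliced
```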

Read the whole file, split it into lines, and separate the header from the content:

>>> FILE = r'/tmp/myfile.txt'
>>>
>>> with open(FILE) as file:
...     header, *content = file.readlines()
...
...     for line in content:
...         print(line)
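
The star expression assigns the first line to header and collects all remaining lines into a list. The sketch below shows the result on an in-memory file; io.StringIO and the sample content are assumptions used so that the example does not depend on a file on disk.

```python
import io

# In-memory stand-in for an opened text file (hypothetical content)
file = io.StringIO('name,age\nMark,40\nMelissa,35\n')

header, *content = file.readlines()
```

Here header is a single string and content is a list of strings. Each of them still ends with a newline, so print(line) outputs an extra blank line unless the line is stripped first.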

14.5.6. Reading File as Generator

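  • Iterate over the file object to read it line by line

  • In these examples, file behaves like a generator (strictly speaking, it is a lazy iterator): lines are read on demand, so the whole file does not have to fit into memory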

    >>> FILE = r'/tmp/myfile.txt'
    >>>
    >>> with open(FILE) as file:
    ...     for line in file:
    ...         print(line)
    
    >>> FILE = r'/tmp/myfile.txt'
    >>>
    >>> with open(FILE) as file:
    ...     header = file.readline()
    ...
    ...     for line in file:
    ...         print(line)
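
That the file object is its own iterator can be verified directly. The following sketch assumes a small temporary file with made-up content and uses next() to pull the header before the loop consumes the rest.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'myfile.txt'
    path.write_text('a,b\n1,2\n3,4\n')

    with open(path) as file:
        is_own_iterator = iter(file) is file   # no separate iterator object
        header = next(file)                    # consumes the first line
        rest = [line.rstrip('\n') for line in file]  # continues from line two
```

Because iteration continues from the current position, reading the header first and looping afterwards never yields the header twice.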
    

14.5.7. Examples

>>> def isnumeric(x):
...     try:
...         float(x)
...         return True
...     except ValueError:
...         return False
>>>
>>>
>>> def clean(line):
...     line = line.strip().split(',')
...     line = map(lambda x: float(x) if isnumeric(x) else x, line)
...     return tuple(line)
>>>
>>>
>>> with open(FILE) as file:
...     header = clean(file.readline())
...
...     for line in file:
...         line = clean(line)
...         print(line)

Sum all numbers in a file (assuming each line holds a single number):

>>> total = 0
>>>
>>> with open(FILE) as file:
...     for line in file:
...         total += float(line)
>>>
>>> print(total)
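
The isnumeric() and clean() helpers from the examples above can be tried without touching the disk. This sketch repeats both functions so that it is self-contained and feeds them an in-memory file; io.StringIO and the sample rows are assumptions made for the demonstration.

```python
import io


def isnumeric(x):
    try:
        float(x)
        return True
    except ValueError:
        return False


def clean(line):
    # Split a CSV line and convert numeric fields to float
    line = line.strip().split(',')
    line = map(lambda x: float(x) if isnumeric(x) else x, line)
    return tuple(line)


# In-memory stand-in for an opened text file (hypothetical content)
file = io.StringIO(
    'sepal_length,sepal_width,species\n'
    '5.4,3.9,setosa\n'
    '5.9,3.0,virginica\n'
)

header = clean(file.readline())
rows = [clean(line) for line in file]
```

Note that float() also accepts strings such as 'inf' and 'nan', so a text field with one of those values would be converted to a float by this approach.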

14.5.8. Assignments

Code 14.9. Solution
"""
* Assignment: File Read Str
* Required: yes
* Complexity: easy
* Lines of code: 2 lines
* Time: 3 min

English:
    1. Write `DATA` to file `FILE`
    2. Read `FILE` to `result: str`
    3. Print `result`
    4. Run doctests - all must succeed

Polish:
    1. Zapisz `DATA` do pliku `FILE`
    2. Wczytaj `FILE` do `result: str`
    3. Wypisz `result`
    4. Uruchom doctesty - wszystkie muszą się powieść

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> assert type(result) is str
    >>> assert result == DATA

    >>> result
    'hello'

    >>> from os import remove; remove(FILE)
"""

FILE = '_temporary.txt'
DATA = 'hello'

with open(FILE, mode='wt') as file:
    file.write(DATA)

# str: FILE content
result = ...

Code 14.10. Solution
"""
* Assignment: File Read Multiline
* Required: yes
* Complexity: easy
* Lines of code: 3 lines
* Time: 3 min

English:
    1. Write `DATA` to file `FILE`
    2. Read `FILE` to `result: list[str]`
    3. Print `result`
    4. Run doctests - all must succeed

Polish:
    1. Zapisz `DATA` do pliku `FILE`
    2. Wczytaj `FILE` do `result: list[str]`
    3. Wypisz `result`
    4. Uruchom doctesty - wszystkie muszą się powieść

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> assert type(result) is list
    >>> assert all(type(x) is str for x in result)

    >>> result
    ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

    >>> from os import remove; remove(FILE)
"""

FILE = '_temporary.txt'
DATA = 'sepal_length\nsepal_width\npetal_length\npetal_width\nspecies\n'

with open(FILE, mode='wt') as file:
    file.write(DATA)

Code 14.11. Solution
"""
* Assignment: File Read CSV
* Required: yes
* Complexity: easy
* Lines of code: 15 lines
* Time: 8 min

English:
    1. Write `DATA` to file `FILE`
    2. Read `FILE`
    3. Separate header from data
    4. Write header (first line) to `header`
    5. Read file and for each line:
        a. Strip whitespace
        b. Split line by comma `,`
        c. Convert measurements to `tuple[float]`
        d. Append measurements to `features`
        e. Append species name to `labels`
    6. Print `header`, `features` and `labels`
    7. Run doctests - all must succeed

Polish:
    1. Zapisz `DATA` do pliku `FILE`
    2. Wczytaj `FILE`
    3. Odseparuj nagłówek od danych
    4. Zapisz nagłówek (pierwsza linia) do `header`
    5. Zaczytaj plik i dla każdej linii:
        a. Usuń białe znaki z początku i końca linii
        b. Podziel linię po przecinku `,`
        c. Przekonwertuj pomiary do `tuple[float]`
        d. Dodaj pomiary do `features`
        e. Dodaj gatunek do `labels`
    6. Wyświetl `header`, `features` i `labels`
    7. Uruchom doctesty - wszystkie muszą się powieść

Hints:
    * `tuple(float(x) for x in X)`

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> header
    ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
    >>> features  # doctest: +NORMALIZE_WHITESPACE
    [(5.4, 3.9, 1.3, 0.4),
     (5.9, 3.0, 5.1, 1.8),
     (6.0, 3.4, 4.5, 1.6),
     (7.3, 2.9, 6.3, 1.8),
     (5.6, 2.5, 3.9, 1.1),
     (5.4, 3.9, 1.3, 0.4)]
    >>> labels
    ['setosa', 'virginica', 'versicolor', 'virginica', 'versicolor', 'setosa']
    >>> from os import remove; remove(FILE)
"""

FILE = '_temporary.csv'

DATA = """sepal_length,sepal_width,petal_length,petal_width,species
5.4,3.9,1.3,0.4,setosa
5.9,3.0,5.1,1.8,virginica
6.0,3.4,4.5,1.6,versicolor
7.3,2.9,6.3,1.8,virginica
5.6,2.5,3.9,1.1,versicolor
5.4,3.9,1.3,0.4,setosa
"""

header = []
features = []
labels = []

with open(FILE, mode='w') as file:
    file.write(DATA)

Code 14.12. Solution
"""
* Assignment: File Read Dict
* Required: no
* Complexity: medium
* Lines of code: 10 lines
* Time: 8 min

English:
    1. Write `DATA` to file `FILE`
    2. Read `FILE` and for each line:
        a. Remove leading and trailing whitespace
        b. Skip line if it is empty
        c. Split line by whitespace
        d. Separate IP address and hostnames
        e. Append IP address and hostnames to `result`
    3. Merge hostnames for the same IP
    4. Run doctests - all must succeed

Polish:
    1. Zapisz `DATA` do pliku `FILE`
    2. Wczytaj `FILE` i dla każdej linii:
        a. Usuń białe znaki na początku i końcu linii
        b. Pomiń linię, jeżeli jest pusta
        c. Podziel linię po białych znakach
        d. Odseparuj adres IP i nazwy hostów
        e. Dodaj adres IP i nazwy hostów do `result`
    3. Scal nazwy hostów dla tego samego IP
    4. Uruchom doctesty - wszystkie muszą się powieść

Hints:
    * `str.isspace()`
    * `str.split()`

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> result  # doctest: +NORMALIZE_WHITESPACE
    {'127.0.0.1': ['localhost'],
     '10.13.37.1': ['nasa.gov', 'esa.int', 'roscosmos.ru'],
     '255.255.255.255': ['broadcasthost'],
     '::1': ['localhost']}
    >>> from os import remove; remove(FILE)
"""

FILE = '_temporary.txt'

DATA = """127.0.0.1       localhost
10.13.37.1      nasa.gov esa.int roscosmos.ru
255.255.255.255 broadcasthost
::1             localhost
"""

with open(FILE, mode='w') as file:
    file.write(DATA)

result = {}

Code 14.13. Solution
"""
* Assignment: File Read List of Dicts
* Required: no
* Complexity: hard
* Lines of code: 19 lines
* Time: 21 min

English:
    1. Read file and for each line:
        a. Skip line if it's empty, is whitespace or starts with comment `#`
        b. Remove leading and trailing whitespace
        c. Split line by whitespace
        d. Separate IP address and hostnames
        e. Use a one-line `if` to check whether dot `.` is in the IP address
        f. If a dot is present then the protocol is IPv4, otherwise IPv6
        g. Append IP address and hostnames to `result`
    2. Merge hostnames for the same IP
    3. Run doctests - all must succeed

Polish:
    1. Przeczytaj plik i dla każdej linii:
        a. Pomiń linię jeżeli jest pusta, jest białym znakiem
           lub zaczyna się od komentarza `#`
        b. Usuń białe znaki na początku i końcu linii
        c. Podziel linię po białych znakach
        d. Odseparuj adres IP i nazwy hostów
        e. Wykorzystaj jednolinikowego `if` do sprawdzenia czy jest
           kropka `.` w adresie IP
        f. Jeżeli jest obecna to protokół  jest IPv4,
           w przeciwnym przypadku IPv6
        g. Dodaj adres IP i nazwy hostów do `result`
    2. Scal nazwy hostów dla tego samego IP
    3. Uruchom doctesty - wszystkie muszą się powieść

Hints:
    * `str.split()` - without an argument
    * `len(line) == 0`
    * `line.startswith('#')`
    * `protocol = 'IPv4' if '.' in ip else 'IPv6'`

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> result  # doctest: +NORMALIZE_WHITESPACE
    [{'ip': '127.0.0.1', 'hostnames': ['localhost', 'astromatt'], 'protocol': 'IPv4'},
     {'ip': '10.13.37.1', 'hostnames': ['nasa.gov', 'esa.int', 'roscosmos.ru'], 'protocol': 'IPv4'},
     {'ip': '255.255.255.255', 'hostnames': ['broadcasthost'], 'protocol': 'IPv4'},
     {'ip': '::1', 'hostnames': ['localhost'], 'protocol': 'IPv6'}]
    >>> from os import remove; remove(FILE)
"""

FILE = '_temporary.txt'

DATA = """
##
# `/etc/hosts` structure:
#   - IPv4 or IPv6
#   - Hostnames
 ##

127.0.0.1       localhost
127.0.0.1       astromatt
10.13.37.1      nasa.gov esa.int roscosmos.ru
255.255.255.255 broadcasthost
::1             localhost
"""

with open(FILE, mode='w') as file:
    file.write(DATA)

result: list

Code 14.14. Solution
"""
* Assignment: File Read Passwd
* Required: no
* Complexity: hard
* Lines of code: 100 lines
* Time: 55 min

English:
    1. Save listings content to files:
        a. `etc-passwd.txt`
        b. `etc-shadow.txt`
        c. `etc-group.txt`
    2. Copy also comments and empty lines
    3. Parse files and convert them to `result: list[dict]`
    4. Return list of users with `UID` greater than or equal to 1000
    5. User dict should contain data collected from all files
    6. Run doctests - all must succeed

Polish:
    1. Zapisz treści listingów do plików:
        a. `etc-passwd.txt`
        b. `etc-shadow.txt`
        c. `etc-group.txt`
    2. Skopiuj również komentarze i puste linie
    3. Sparsuj pliki i przedstaw je w formacie `result: list[dict]`
    4. Zwróć listę użytkowników, których `UID` jest większy lub równy 1000
    5. Dict użytkownika powinien zawierać dane z wszystkich plików
    6. Uruchom doctesty - wszystkie muszą się powieść

Hints:
    * `from datetime import date`
    * `date.fromtimestamp(timestamp: int)`

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> result  # doctest: +NORMALIZE_WHITESPACE
    [{'username': 'watney',
      'uid': 1000,
      'gid': 1000,
      'home': '/home/watney',
      'shell': '/bin/bash',
      'algorithm': None,
      'password': None,
      'groups': ['astronauts', 'mars'],
      'last_changed': datetime.date(2015, 4, 25),
      'locked': True},
     {'username': 'twardowski',
      'uid': 1001,
      'gid': 1001,
      'home': '/home/twardowski',
      'shell': '/bin/bash',
      'algorithm': 'SHA-512',
      'password': 'tgfvvFWJJ5FKmoXiP5rXWOjwoEBOEoAuBi3EphRbJqqjWYvhEM2wa67L9XgQ7W591FxUNklkDIQsk4kijuhE50',
      'groups': ['astronauts', 'sysadmin', 'moon'],
      'last_changed': datetime.date(2015, 7, 16),
      'locked': False},
     {'username': 'ivanovic',
      'uid': 1002,
      'gid': 1002,
      'home': '/home/ivanovic',
      'shell': '/bin/bash',
      'algorithm': 'MD5',
      'password': 'SWlkjRWexrXYgc98F.',
      'groups': ['astronauts', 'sysadmin'],
      'last_changed': datetime.date(2005, 2, 11),
      'locked': False}]
"""

from datetime import date
from os.path import dirname, join


BASE_DIR = dirname(__file__)
FILE_GROUP = join(BASE_DIR, '../data/etc-group.txt')
FILE_SHADOW = join(BASE_DIR, '../data/etc-shadow.txt')
FILE_PASSWD = join(BASE_DIR, '../data/etc-passwd.txt')

SECOND = 1
MINUTE = 60 * SECOND
HOUR = 60 * MINUTE
DAY = 24 * HOUR

ALGORITHMS = {
    '1': 'MD5',
    '2a': 'Blowfish',
    '2y': 'Blowfish',
    '5': 'SHA-256',
    '6': 'SHA-512',
}

result: list
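
Two building blocks of this assignment can be sketched in isolation: converting the "days since 1970-01-01" field to a date, and splitting the password field by `$`. The helper names `days_to_date` and `split_password` are hypothetical and not part of the assignment. The sketch uses `timedelta` instead of the hinted `date.fromtimestamp()`, because the latter depends on the local time zone.

```python
from datetime import date, timedelta

ALGORITHMS = {
    '1': 'MD5',
    '2a': 'Blowfish',
    '2y': 'Blowfish',
    '5': 'SHA-256',
    '6': 'SHA-512',
}

EPOCH = date(1970, 1, 1)


def days_to_date(days):
    """Convert a 'days since 1970-01-01' field to a date (hypothetical helper)."""
    return EPOCH + timedelta(days=int(days))


def split_password(field):
    """Return (algorithm name, hash); (None, None) when no hash is set."""
    if '$' not in field:
        return None, None
    # Field layout: <prefix>$<algorithm>$<salt>$<hash>
    _, code, salt, digest = field.split('$')
    return ALGORITHMS.get(code), digest


changed = days_to_date('16550')
no_password = split_password('!!')
md5_password = split_password('$1$.QKDPc5E$SWlkjRWexrXYgc98F.')
```

The sample values come from the `/etc/shadow` listing in this section; how the locked flag and group membership are derived is left to the assignment.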

Code 14.15. /etc/passwd
##
# `/etc/passwd` structure:
#   - Username
#   - Password: `x` indicates that shadow passwords are used
#   - UID: User ID number
#   - GID: User's group ID number
#   - GECOS: Full name of the user
#   - Home directory
#   - Login shell
##

root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
nobody:x:99:99:Nobody:/:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
watney:x:1000:1000:Mark Watney:/home/watney:/bin/bash
twardowski:x:1001:1001:Jan Twardowski:/home/twardowski:/bin/bash
ivanovic:x:1002:1002:Ivan Ivanovic:/home/ivanovic:/bin/bash

Code 14.16. /etc/shadow
##
# `/etc/shadow` structure
#   - Username: from `/etc/passwd`
#   - Password
#   - Last Password Change: Days since 1970-01-01
#   - Minimum days between password changes: 0 - changed at any time
#   - Password validity: Days after which password must be changed, 99999 - many, many years
#   - Warning threshold: Days to warn user of an expiring password, 7 - full week
#   - Account inactive: Days after password expires and account is disabled
#   - Time since account is disabled: Days since 1970-01-01
#   - A reserved field for possible future use
#
# Password field (split by `$`):
#   - algorithm
#   - salt
#   - password hash
#
# Password algorithms:
#   - `1` - MD5
#   - `2a` - Blowfish
#   - `2y` - Blowfish
#   - `5` - SHA-256
#   - `6` - SHA-512
#
# Password special chars:
#   - ` ` (blank entry) - password is not required to log in
#   - `*` (asterisk) - account is disabled, cannot be unlocked, no password has ever been set
#   - `!` (exclamation mark) - account is locked, can be unlocked, no password has ever been set
#   - `!<password_hash>` - account is locked, can be unlocked, but password is set
#   - `!!` (two exclamation marks) - account created, waiting for initial password to be set by admin
##

root:$6$Ke02nYgo.9v0SF4p$hjztYvo/M4buqO4oBX8KZTftjCn6fE4cV5o/I95QPekeQpITwFTRbDUBYBLIUx2mhorQoj9bLN8v.w6btE9xy1:16431:0:99999:7:::
adm:$6$5H0QpwprRiJQR19Y$bXGOh7dIfOWpUb/Tuqr7yQVCqL3UkrJns9.7msfvMg4ZO/PsFC5Tbt32PXAw9qRFEBs1254aLimFeNM8YsYOv.:16431:0:99999:7:::
watney:!!:16550::::::
twardowski:$6$P9zn0KwR$tgfvvFWJJ5FKmoXiP5rXWOjwoEBOEoAuBi3EphRbJqqjWYvhEM2wa67L9XgQ7W591FxUNklkDIQsk4kijuhE50:16632:0:99999:7:::
ivanovic:$1$.QKDPc5E$SWlkjRWexrXYgc98F.:12825:0:90:5:30:13096:

Code 14.17. /etc/group
##
# `/etc/group` structure
#   - Group Name: from `/etc/passwd`
#   - Group Password: `x` indicates that shadow passwords are used
#   - GID: Group ID
#   - Members: usernames from `/etc/passwd`
##

root::0:root
other::1:
bin::2:root,bin,daemon
sys::3:root,bin,sys,adm
adm::4:root,adm,daemon
mail::6:root
astronauts::10:twardowski,watney,ivanovic
daemon::12:root,daemon
sysadmin::14:twardowski,ivanovic
mars::1000:watney
moon::1001:twardowski
nobody::60001:
noaccess::60002:
nogroup::65534: