2.2. Pandas Read CSV

2.2.1. Rationale

  • File paths works also with URLs

2.2.3. Content

import pandas as pd


DATA = 'https://python.astrotech.io/_static/iris-clean.csv'
df = pd.read_csv(DATA)

df.head(3)
#    sepal_length  sepal_width  petal_length  petal_width     species
# 0           5.4          3.9           1.3          0.4      setosa
# 1           5.9          3.0           5.1          1.8   virginica
# 2           6.0          3.4           4.5          1.6  versicolor

2.2.4. Rename Columns

import pandas as pd


DATA = 'https://python.astrotech.io/_static/iris-dirty.csv'
COLUMNS =  ['sepal_length', 'sepal_width',
            'petal_length', 'petal_width', 'species']

df = pd.read_csv(DATA)
df.head(3)
#      150    4  setosa  versicolor  virginica
# 0    5.4  3.9     1.3         0.4          0
# 1    5.9  3.0     5.1         1.8          2
# 2    6.0  3.4     4.5         1.6          1

df = pd.read_csv(url, skiprows=1, names=COLUMNS)
df.head(3)
#    sepal_length  sepal_width  petal_length  petal_width  species
# 0           5.4          3.9           1.3          0.4        0
# 1           5.9          3.0           5.1          1.8        2
# 2           6.0          3.4           4.5          1.6        1

df['species'].replace({
    0: 'setosa',
    1: 'versicolor',
    2: 'virginica',
}, inplace=True)
#    sepal_length  sepal_width  petal_length  petal_width  species
# 0           5.4          3.9           1.3          0.4        setosa
# 1           5.9          3.0           5.1          1.8        virginica
# 2           6.0          3.4           4.5          1.6        versicolor

2.2.5. Compressed

  • If the extension is .gz, .bz2, .zip, and .xz, the corresponding compression method is automatically selected

df = pd.read_csv('myfile.gz', compression='infer')

2.2.6. Use Case

DATA = 'https://python.astrotech.io/_static/iris-dirty.csv'

COLUMNS =  ['sepal_length', 'sepal_width',
            'petal_length', 'petal_width', 'species']


species = pd.read_csv(DATA, nrows=0)
species = dict(enumerate(species.columns[2:]))

df = pd.read_csv(DATA, names=COLUMNS, skiprows=1)
df['species'].replace(species, inplace=True)

2.2.7. Assignments

Code 2.44. Solution
"""
* Assignment: Pandas Read CSV Dates
* Complexity: easy
* Lines of code: 1 lines
* Time: 3 min

English:
    1. Read data from `DATA` to `result: pd.DataFrame`
    2. Parse dates in "Mission Date" column
    3. Run doctests - all must succeed

Polish:
    1. Wczytaj dane z `DATA` do `result: pd.DataFrame`
    2. Sparsuj daty w kolumnie "Mission Date"
    3. Uruchom doctesty - wszystkie muszą się powieść

Hints:
    * `parse_dates`

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> type(result) is pd.DataFrame
    True
    >>> len(result) > 0
    True
    >>> result
       id First Name   Last Name Mission Date
    0   1        Jan  Twardowski   1988-01-05
    1   2       Mark      Watney   1969-07-21
    2   3       Ivan   Ivanovich   1961-04-12
    3   4    Melissa       Lewis   1970-01-01
    4   5       Alex       Vogel   1968-12-25
"""

import pandas as pd


DATA = 'https://python.astrotech.io/_static/martian-en.csv'


# pd.DataFrame: read DATA and parse dates in "Mission Date" column
result = ...


Code 2.45. Solution
"""
* Assignment: Pandas Read CSV Replace
* Complexity: easy
* Lines of code: 5 lines
* Time: 8 min

English:
    1. Read data from `DATA` to `result: pd.DataFrame`
    2. Use provided column names in `COLUMNS`
    3. Read labels from the first row
    4. Replace data in `label` column with values extracted above
    5. Define `result: pd.DataFrame` with 25 first rows
    6. Run doctests - all must succeed

Polish:
    1. Wczytaj dane z `DATA` do `result: pd.DataFrame`
    2. Użyj podanych w `COLUMNS` nazw kolumn
    3. Wczytaj nazwy labeli z pierwszego wiersza
    4. Podmień dane w kolumnie `label` na wartości wyciągnięte powyżej
    5. Zdefiniuj `result: pd.DataFrame` z 25 pierwszymi wierszami
    6. Uruchom doctesty - wszystkie muszą się powieść

Hints:
    * `hader = pd.read_csv(url, nrows=0).columns`
    * `cancer_types = dict(enumerate(header[2:]))`
    * `df['label'].replace({'from': 'to'}, inplace=True)`

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> type(result) is pd.DataFrame
    True
    >>> len(result) == 25
    True

    >>> result.loc[[0,1,2,3,4,5], ['mean radius', 'mean texture', 'label']]
       mean radius  mean texture      label
    0        17.99         10.38  malignant
    1        20.57         17.77  malignant
    2        19.69         21.25  malignant
    3        11.42         20.38  malignant
    4        20.29         14.34  malignant
    5        12.45         15.70  malignant

    >>> result['label'].value_counts()
    malignant    22
    benign        3
    Name: label, dtype: int64
"""

import pandas as pd


DATA = 'https://python.astrotech.io/_static/breast-cancer.csv'

COLUMNS = ['mean radius', 'mean texture', 'mean perimeter', 'mean area',
           'mean smoothness', 'mean compactness', 'mean concavity',
           'mean concave points', 'mean symmetry', 'mean fractal dimension',
           'radius error', 'texture error', 'perimeter error', 'area error',
           'smoothness error', 'compactness error', 'concavity error',
           'concave points error', 'symmetry error',
           'fractal dimension error', 'worst radius', 'worst texture',
           'worst perimeter', 'worst area', 'worst smoothness',
           'worst compactness', 'worst concavity', 'worst concave points',
           'worst symmetry', 'worst fractal dimension', 'label']


# pd.DataFrame: read DATA, substitute column names, and labels, select 20 rows
result = ...