Commonly Used Libraries in Science

Like other languages such as R, which specializes in statistics and ships with many packages, Python also has a very broad collection of libraries and packages for numerical processing, statistics, and more.

The page https://devopedia.org/python-for-scientific-computing gives an overview of some of them.

The best known are:

NumPy for numerical processing
Pandas for data analysis
Matplotlib for plotting
Seaborn for more advanced plotting
Scikit-learn for machine learning
...

NumPy

Because everything in Python is an object, plain Python data is heavy in memory.
Every value, even a single integer, is stored as a full object: besides the number itself it carries bookkeeping such as a reference count and a pointer to its type. In CPython a small integer typically takes 28 bytes on a 64-bit build, and the other built-in types carry a similar overhead.

NumPy (Numerical Python) is a module that stores data in compact, typed buffers and runs its operations in compiled code, so array and matrix operations are much faster than the equivalent Python loops.

The main data type NumPy provides is ndarray (an n-dimensional array).
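To make the memory argument concrete, the following sketch compares a Python list of integers with an ndarray holding the same values. The variable names are illustrative, and the exact list size depends on the Python version and platform, so only the array size is predictable.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# The list stores pointers; each int is a separate object with its own header.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)
# The array stores the raw values back to back: 8 bytes per int64.
array_bytes = np_array.nbytes

print(f"list of ints: about {list_bytes} bytes")
print(f"ndarray int64: {array_bytes} bytes")
```

Choosing a smaller dtype such as int8 would shrink the array further, at the cost of a narrower value range.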

Definition of ndarray

An ndarray behaves much like a Python list, but is optimized for both space and time.

Its main characteristics are:

All elements have the same type: an array holds only integers, only strings, and so on.
The shape attribute is a tuple giving the number of elements along each dimension of the ndarray.
The element type is given by the dtype attribute, e.g. int8 (8-bit integer).
It supports slicing and indexed access.
Elements can be modified in place.
The size is fixed: elements cannot be added or removed in place; doing so requires building a new ndarray.
Statistical operations such as mean, median, and more are available.
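A minimal sketch of the characteristics listed above; the array names are made up for illustration.

```python
import numpy as np

a = np.array([1, 2, 3], dtype=np.int8)
print(a.shape, a.dtype)   # shape tuple and element type

# Mixed inputs are coerced to one common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# np.append does not grow `a`; it returns a new, longer array.
b = np.append(a, 4)
print(a.size, b.size)
```

Because every "resize" allocates a new array, it is usually better to build the full array once (or collect values in a list first) than to append inside a loop.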

Basic creation of an ndarray

Creating a one-dimensional ndarray

In [1]:

import numpy as np

array1d = np.array([1, 2, 3, 4, 5])
array1d.shape

Out[1]:

(5,)

Creating a two-dimensional ndarray

In [2]:

import numpy as np

array2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array2d.shape

Out[2]:

(3, 3)

Creating a three-dimensional ndarray

In [3]:

import numpy as np

array3d = np.array([[[1, 2]], [[3, 4]], [[6, 4]]])
array3d.shape

Out[3]:

(3, 1, 2)

Creating a 3x3 ndarray of zeros

In [4]:

import numpy as np

zeros = np.zeros((3, 3))
zeros.shape

Out[4]:

(3, 3)

Operations on ndarrays

Indexing

Array 1D

In [5]:

import numpy as np

array1d = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=np.int8)

print(f'{array1d[0]=}')
print(f'{array1d[2]=}')
print(f'{array1d[1]=}')
print(f'{array1d[-1]=}')
print(f'{array1d[[1, 3]]=}')  # Index with a list to pick several elements

array1d[0]=1
array1d[2]=3
array1d[1]=2
array1d[-1]=9
array1d[[1, 3]]=array([2, 4], dtype=int8)

Array 2D

In [6]:

import numpy as np

array2d = np.array([[1, 2], [3, 4]], dtype=np.int8)

print(f'{array2d[0]=}')     # Row 0
print(f'{array2d[1]=}')     # Row 1
print(f'{array2d[0][0]=}')  # Row 0, column 0
print(f'{array2d[:, 0]=}')  # Column 0 of every row

array2d[0]=array([1, 2], dtype=int8)
array2d[1]=array([3, 4], dtype=int8)
array2d[0][0]=1
array2d[:, 0]=array([1, 3], dtype=int8)
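Two further indexing forms are worth knowing alongside the ones above: a single `[row, column]` index, and a boolean mask. The sketch below uses a made-up array named `grid`.

```python
import numpy as np

grid = np.array([[1, 2], [3, 4]], dtype=np.int8)

# A single [row, column] index reaches the same element as grid[0][0]
# without building the intermediate row array.
print(grid[0, 0] == grid[0][0])

# Boolean mask: keep only the elements that satisfy a condition.
mask = grid > 2
print(mask)
print(grid[mask])
```

The masked result is always one-dimensional, because the selected elements need not form a rectangle.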

Slicing

Array 1D

In [7]:

import numpy as np

array1d = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=np.int8)

print(f'{array1d[0:]=}')
print(f'{array1d[1:]=}')
print(f'{array1d[1:2]=}')
print(f'{array1d[0:2]=}')

array1d[0:]=array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int8)
array1d[1:]=array([2, 3, 4, 5, 6, 7, 8, 9], dtype=int8)
array1d[1:2]=array([2], dtype=int8)
array1d[0:2]=array([1, 2], dtype=int8)

Array 2D

In [8]:

import numpy as np

array2d = np.array([[1, 2, 3, 4, 5, 6, 7, 8],
                    [9, 10, 11, 12, 13, 14, 15, 16],
                    [1, 2, 3, 4, 5, 6, 7, 8],
                    [19, 110, 111, 112, 113, 114, 115, 116]], dtype=np.int8)

print(f'{array2d[1::2, 0:4]=}')  # First 4 columns of the odd-indexed rows (1 and 3)

array2d[1::2, 0:4]=array([[  9,  10,  11,  12],
       [ 19, 110, 111, 112]], dtype=int8)
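One detail that often surprises newcomers: a basic slice of an ndarray is a view onto the same memory, not a copy as with Python lists. The sketch below illustrates this with hypothetical variable names.

```python
import numpy as np

base = np.arange(1, 10, dtype=np.int8)   # values 1 to 9

view = base[0:3]                 # basic slice: shares memory with `base`
view[0] = 99
print(base[0])                   # the change is visible through `base`

independent = base[0:3].copy()   # explicit copy: separate memory
independent[1] = -1
print(base[1])                   # `base` is not affected this time
```

Call `.copy()` whenever the slice must be modified without touching the original array.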

Modifying elements

In [9]:

import numpy as np

array1d = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=np.int8)

array1d[0] = 9

array1d

Out[9]:

array([9, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int8)

Arithmetic operations

In [10]:

import numpy as np

array1d = np.array([1, 2, 3], dtype=np.int8)
array2d = np.array([[0, 0, 0],
                    [1, 1, 1]], dtype=np.int8)

print(array2d + 1)

[[1 1 1]
 [2 2 2]]

In [11]:

print(array2d * 1)

[[0 0 0]
 [1 1 1]]

In [12]:

print(array2d * array1d)

[[0 0 0]
 [1 2 3]]
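The last product works because of broadcasting: the array of shape (3,) is virtually repeated along the missing leading axis to match shape (2, 3). The following sketch makes that step explicit with `np.broadcast_to`; the names `row` and `table` are illustrative.

```python
import numpy as np

row = np.array([1, 2, 3], dtype=np.int8)                  # shape (3,)
table = np.array([[0, 0, 0], [1, 1, 1]], dtype=np.int8)   # shape (2, 3)

# Broadcasting stretches `row` along the missing leading axis.
stretched = np.broadcast_to(row, table.shape)
print(stretched)
print(np.array_equal(table * row, table * stretched))

# Shapes that cannot be aligned raise an error instead of guessing.
try:
    table * np.array([1, 2], dtype=np.int8)               # (2, 3) against (2,)
except ValueError as err:
    print("incompatible shapes:", err)
```

Shapes are compared from the last axis backwards; two axes are compatible when they are equal or one of them is 1.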

Linear algebra

Product of two matrices

In [13]:

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
np.dot(A, B)

Out[13]:

array([[19, 22],
       [43, 50]])
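For two-dimensional arrays the `@` operator gives the same matrix product as `np.dot`, and it is easy to confuse with `*`, which multiplies element by element. A short sketch with the same matrices:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A @ B)   # matrix product, same result as np.dot(A, B) for 2-D arrays
print(A * B)   # element-wise product, a different operation
```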

Computing the inverse of a matrix

In [14]:

import numpy as np

A = np.array([[1, 2], [3, 4]])
A_inv = np.linalg.inv(A)

print(A)
print(A_inv)
print(A.dot(A_inv))  # Identity matrix, up to floating-point rounding

[[1 2]
 [3 4]]
[[-2.   1. ]
 [ 1.5 -0.5]]
[[1.0000000e+00 0.0000000e+00]
 [8.8817842e-16 1.0000000e+00]]

Solving a system of linear equations
2x + 3y = 5
x - y = 1

In [15]:

import numpy as np

coefficients = np.array([[2, 3], [1, -1]])
constants = np.array([5, 1])
np.linalg.solve(coefficients, constants)

Out[15]:

array([1.6, 0.6])
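A quick way to gain confidence in the result is to substitute it back into the system. The sketch below does that, and also shows what happens when the system has no unique solution; the singular matrix is a made-up example.

```python
import numpy as np

coefficients = np.array([[2, 3], [1, -1]])
constants = np.array([5, 1])

solution = np.linalg.solve(coefficients, constants)

# Substitute the solution back; allclose tolerates floating-point rounding.
print(np.allclose(coefficients @ solution, constants))

# A singular matrix (second row is twice the first) has no unique solution.
try:
    np.linalg.solve(np.array([[1, 2], [2, 4]]), np.array([1, 2]))
except np.linalg.LinAlgError as err:
    print("cannot solve:", err)
```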

Difference between using NumPy and not using it

In [16]:

import time
import numpy as np

# Matrix size
n = 200

# Build two matrices and multiply them without NumPy
start_time = time.time()

matrix_a = [[i + j for j in range(n)] for i in range(n)]
matrix_b = [[i - j for j in range(n)] for i in range(n)]

result = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        for k in range(n):
            result[i][j] += matrix_a[i][k] * matrix_b[k][j]

end_time = time.time()
execution_time = end_time - start_time
print("Execution time in seconds without NumPy:", execution_time)

# Build two random matrices and multiply them with NumPy
start_time = time.time()

array_a = np.random.rand(n, n)
array_b = np.random.rand(n, n)

result_array = np.dot(array_a, array_b)

end_time = time.time()
execution_time = end_time - start_time
print("Execution time in seconds with NumPy:", execution_time)

Execution time in seconds without NumPy: 3.33677339553833
Execution time in seconds with NumPy: 0.0037114620208740234

In this run NumPy is roughly 900 times faster. The two versions multiply different matrices (deterministic integers versus random floats), so the comparison is only indicative, and exact timings depend on the machine.

Statistics

In [17]:

import numpy as np

array = np.random.rand(4, 5)

print('Mean of all values: ', array.mean())

Mean of all values:  0.6026659401474472

In [18]:

print('Mean of each row: ', array.mean(axis = 1))

Mean of each row:  [0.62138473 0.73264987 0.53778201 0.51884715]

In [19]:

print('Mean of each column: ', array.mean(axis = 0))

Mean of each column:  [0.72060379 0.58694439 0.5962877  0.56353339 0.54596043]

In [20]:

print('Median of all values: ', np.median(array))

Median of all values:  0.6267550427898875

In [21]:

print('Maximum of all values: ', np.max(array))

Maximum of all values:  0.9729526340616049

In [22]:

print('Minimum of all values: ', np.min(array))

Minimum of all values:  0.14986822185497506

In [23]:

print('Standard deviation of all values: ', np.std(array))

Standard deviation of all values:  0.22150210369951986
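The numbers above change on every run because the array is random. The sketch below shows two related points under stated assumptions: a seeded generator (`np.random.default_rng`) makes the data reproducible, and the `nan*` variants of the statistics skip missing values. The array name `data` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded generator: same numbers on every run
data = rng.random((4, 5))
data[0, 0] = np.nan                   # introduce one missing value

print(np.mean(data))                  # a single NaN turns the plain mean into NaN
print(np.nanmean(data))               # mean over the remaining 19 values
print(np.nanmean(data, axis=0))       # one mean per column, NaN ignored
```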

Exercises

Create an array of ones with dimensions (20, 30, 10)
Insert some NaNs into that array using indexing and slicing
Compute the mean, median, min and max. Hint: look up nanmean, ...
Create several 2x2 matrices, add them together, multiply them by a number, ...

Pandas

Pandas is a Python library widely used for data analysis, data cleaning and manipulation, and similar tasks.

In NumPy, the central element on which all the functionality is built is the ndarray.
Pandas likewise has two core elements: DataFrame and Series.

Series:
- A Series is a one-dimensional array in the style of an ndarray whose elements share one type: integers, floats, dates, ...
- A Series organizes data as a labeled list. It is indexed either by numbers or by other labels such as text or dates.

DataFrame:
- A DataFrame is, in essence, a set of Series placed side by side, each one identified by a column label (usually).
- All the Series in a DataFrame must have the same number of elements, and the elements in one row share that row's index label.

In other words, a DataFrame is a table with rows and columns.
Each row has an index label that gives access to all the elements of that row. These labels can be numbers, strings or dates.
The columns are Series identified by column names, which are usually text.
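The relationship between Series and DataFrame described above can be sketched as follows. The names and values are made up for illustration; `pd.concat` is one of several ways to combine Series into a table.

```python
import pandas as pd

# A Series indexed by text labels instead of positions.
ages = pd.Series([25, 30, 35], index=["Juan", "María", "Ana"], name="Edad")
print(ages["María"])          # label-based access

# Two Series with the same index become the columns of a DataFrame.
scores = pd.Series([8.2, 7.5, 9.0], index=ages.index, name="Puntuación")
table = pd.concat([ages, scores], axis=1)

print(table.loc["Ana"])       # one row, selected by its index label
print(type(table["Edad"]))    # each column is itself a Series
```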

Creating a DataFrame from a dictionary

In [24]:

import pandas as pd
import numpy as np

data = {
    'Fecha': pd.date_range('2023-01-01', periods = 5),
    'Nombre': ['Juan', 'María', 'Pedro', 'Ana', 'Luisa'],
    'Edad': [25, 30, np.nan, 35, 40],
    'Puntuación': [8.2, 7.5, 6.9, np.nan, 9.0]
}

df = pd.DataFrame(data)
df

Out[24]:

	Fecha	Nombre	Edad	Puntuación
0	2023-01-01	Juan	25.0	8.2
1	2023-01-02	María	30.0	7.5
2	2023-01-03	Pedro	NaN	6.9
3	2023-01-04	Ana	35.0	NaN
4	2023-01-05	Luisa	40.0	9.0

Creating a DataFrame from a CSV, ...

In [25]:

import pandas as pd

df = pd.read_csv(filepath_or_buffer = r'C:\Users\sergi\Documents\repos\python_course\data\pandas_data.csv', sep = ',', index_col = False)
df

Out[25]:

	Fecha	Nombre	Edad	Puntuación
0	2023-01-01	Juan	25.0	8.2
1	2023-01-02	María	30.0	7.5
2	2023-01-03	Pedro	NaN	6.9
3	2023-01-04	Ana	35.0	NaN
4	2023-01-05	Luisa	40.0	9.0
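The path above only exists on the author's machine. As a self-contained sketch, `read_csv` also accepts any file-like object, so an in-memory string can stand in for the file. The CSV text below is hypothetical and merely mirrors the first rows of the table; `parse_dates` is an optional extra that converts the column to real dates.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = """Fecha,Nombre,Edad,Puntuación
2023-01-01,Juan,25,8.2
2023-01-02,María,30,7.5
2023-01-03,Pedro,,6.9
"""

df_csv = pd.read_csv(io.StringIO(csv_text), sep=",", parse_dates=["Fecha"])
print(df_csv.dtypes)
print(df_csv)
```

The empty field in Pedro's row is read as NaN, which is why the Edad column becomes a float column.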

Data analysis

Initial overview

We will use a .csv file with information about the passengers of the Titanic: whether they survived, which class they travelled in, and so on.

In [26]:

import pandas as pd

df = pd.read_csv(r'C:\Users\sergi\Documents\repos\python_course\data\pandas\Titanic.csv')

df.head() # Show the first rows

Out[26]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

Index and columns

In [27]:

print(f'Index: {df.index=}')
print(f'Columns: {list(df.columns)=}')

Index: df.index=RangeIndex(start=0, stop=891, step=1)
Columns: list(df.columns)=['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

General information

In [28]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Basic statistics

In [29]:

df.describe()

Out[29]:

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

Missing values

In [30]:

# Iterating over a DataFrame yields its column names
for series in df:
    no_nan_values = df[series].count()   # count() ignores NaN
    total_values = len(df[series])
    print(f'Column {series}: Non NAN values {no_nan_values}/{total_values}, {round((no_nan_values/total_values) * 100, 4)}% of data')

Column PassengerId: Non NAN values 891/891, 100.0% of data
Column Survived: Non NAN values 891/891, 100.0% of data
Column Pclass: Non NAN values 891/891, 100.0% of data
Column Name: Non NAN values 891/891, 100.0% of data
Column Sex: Non NAN values 891/891, 100.0% of data
Column Age: Non NAN values 714/891, 80.1347% of data
Column SibSp: Non NAN values 891/891, 100.0% of data
Column Parch: Non NAN values 891/891, 100.0% of data
Column Ticket: Non NAN values 891/891, 100.0% of data
Column Fare: Non NAN values 891/891, 100.0% of data
Column Cabin: Non NAN values 204/891, 22.8956% of data
Column Embarked: Non NAN values 889/891, 99.7755% of data
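The same report can be obtained without an explicit loop. The sketch below uses `isna()` and `notna()` on a small made-up frame (the values are not taken from the Titanic file), so it runs on its own.

```python
import numpy as np
import pandas as pd

# Small stand-in frame, invented for this sketch, with a few missing values.
people = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

missing = people.isna().sum()                 # NaN count per column
non_null_pct = people.notna().mean() * 100    # share of usable values per column

print(missing)
print(non_null_pct.round(2))
```

Taking the mean of a boolean column works because True counts as 1 and False as 0.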

Dropping one or more columns

In [31]:

df.drop(columns = 'PassengerId')

Out[31]:

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...
886	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 11 columns

Dropping columns that contain NaNs

In [32]:

df.dropna(axis = 1)

Out[32]:

	PassengerId	Survived	Pclass	Name	Sex	SibSp	Parch	Ticket	Fare
0	1	0	3	Braund, Mr. Owen Harris	male	1	0	A/5 21171	7.2500
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	1	0	PC 17599	71.2833
2	3	1	3	Heikkinen, Miss. Laina	female	0	0	STON/O2. 3101282	7.9250
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	1	0	113803	53.1000
4	5	0	3	Allen, Mr. William Henry	male	0	0	373450	8.0500
...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	0	0	211536	13.0000
887	888	1	1	Graham, Miss. Margaret Edith	female	0	0	112053	30.0000
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	1	2	W./C. 6607	23.4500
889	890	1	1	Behr, Mr. Karl Howell	male	0	0	111369	30.0000
890	891	0	3	Dooley, Mr. Patrick	male	0	0	370376	7.7500

891 rows × 9 columns

Dropping rows that contain NaNs

In [33]:

df.dropna(axis = 0)

Out[33]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S
...	...	...	...	...	...	...	...	...	...	...	...	...
871	872	1	1	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	1	1	11751	52.5542	D35	S
872	873	0	1	Carlsson, Mr. Frans Olof	male	33.0	0	0	695	5.0000	B51 B53 B55	S
879	880	1	1	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	0	1	11767	83.1583	C50	C
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C

183 rows × 12 columns

Handling duplicate values

Here drop_duplicates() returns the same 891 rows, so the data contains no duplicate rows.

In [34]:

df.drop_duplicates()

Out[34]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns

Filling NaNs with a specific value

In [35]:

# df['Age'].fillna(value = -11)
# df.fillna(value = -11)
df.Age.fillna(value = -11).describe()

Out[35]:

count    891.000000
mean      21.614108
std       20.809470
min      -11.000000
25%        6.000000
50%       24.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64

Filling NaNs with a statistic

In [36]:

df.Age.fillna(value = df.Age.mean())

Out[36]:

0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...    
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: Age, Length: 891, dtype: float64
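Note that `fillna` returns a new object and leaves the original untouched, so the result has to be assigned to keep it. The sketch below shows this on a small made-up Series and uses the median as the fill value, which is one possible choice among several.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 26.0, np.nan], name="Age")

# The median is less sensitive to outliers than the mean.
filled = ages.fillna(ages.median())

print(ages.isna().sum())   # the original still has its NaNs
print(filled.tolist())
```

In a DataFrame the usual pattern is `df['Age'] = df['Age'].fillna(...)`.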

Correlation analysis

In [37]:

df.drop(columns = 'PassengerId').corr(method = 'pearson', numeric_only=True)

Out[37]:

	Survived	Pclass	Age	SibSp	Parch	Fare
Survived	1.000000	-0.338481	-0.077221	-0.035322	0.081629	0.257307
Pclass	-0.338481	1.000000	-0.369226	0.083081	0.018443	-0.549500
Age	-0.077221	-0.369226	1.000000	-0.308247	-0.189119	0.096067
SibSp	-0.035322	0.083081	-0.308247	1.000000	0.414838	0.159651
Parch	0.081629	0.018443	-0.189119	0.414838	1.000000	0.216225
Fare	0.257307	-0.549500	0.096067	0.159651	0.216225	1.000000

In [38]:

df.drop(columns = 'PassengerId').corr(method = 'kendall', numeric_only=True)

Out[38]:

	Survived	Pclass	Age	SibSp	Parch	Fare
Survived	1.000000	-0.323533	-0.043385	0.085915	0.133933	0.266229
Pclass	-0.323533	1.000000	-0.286081	-0.039552	-0.021019	-0.573531
Age	-0.043385	-0.286081	1.000000	-0.142746	-0.200112	0.093249
SibSp	0.085915	-0.039552	-0.142746	1.000000	0.425241	0.358262
Parch	0.133933	-0.021019	-0.200112	0.425241	1.000000	0.330360
Fare	0.266229	-0.573531	0.093249	0.358262	0.330360	1.000000

In [39]:

df.drop(columns = 'PassengerId').corr(method = 'spearman', numeric_only=True)

Out[39]:

	Survived	Pclass	Age	SibSp	Parch	Fare
Survived	1.000000	-0.339668	-0.052565	0.088879	0.138266	0.323736
Pclass	-0.339668	1.000000	-0.361666	-0.043019	-0.022801	-0.688032
Age	-0.052565	-0.361666	1.000000	-0.182061	-0.254212	0.135051
SibSp	0.088879	-0.043019	-0.182061	1.000000	0.450014	0.447113
Parch	0.138266	-0.022801	-0.254212	0.450014	1.000000	0.410074
Fare	0.323736	-0.688032	0.135051	0.447113	0.410074	1.000000
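The three methods answer slightly different questions: Pearson measures linear association, while Spearman and Kendall work on ranks and therefore capture any monotonic relationship. As a rough illustration on made-up data, the sketch below computes Spearman by hand as the Pearson correlation of the ranks.

```python
import pandas as pd

# Made-up data: y grows with x, but not linearly.
curve = pd.DataFrame({"x": range(1, 11)})
curve["y"] = curve["x"] ** 3

pearson = curve["x"].corr(curve["y"])                  # linear association (default method)
spearman = curve["x"].rank().corr(curve["y"].rank())   # Pearson on ranks = Spearman

print(f"{pearson=:.4f}")
print(f"{spearman=:.4f}")
```

Spearman is exactly 1 here because the ordering of x and y is identical, while Pearson stays below 1 because the points do not lie on a straight line.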

Basic plots

Line plot of Age

In [40]:

df.Age.plot()

Out[40]:

<Axes: >

Distribution of class and ticket price

In [41]:

df.plot(x = 'Pclass', y = 'Fare', kind='scatter')

Out[41]:

<Axes: xlabel='Pclass', ylabel='Fare'>

Distribution plots

In [42]:

df.Age.plot(kind = 'kde')

Out[42]:

<Axes: ylabel='Density'>

In [43]:

df.Age.plot(kind = 'hist', bins = 20)

Out[43]:

<Axes: ylabel='Frequency'>

In [44]:

df.Pclass.plot(kind='hist')

Out[44]:

<Axes: ylabel='Frequency'>

In [45]:

df.Pclass.plot(kind='kde')

Out[45]:

<Axes: ylabel='Density'>

Bar charts

Number of survivors by sex

In [46]:

df.groupby('Sex').Survived.sum().plot(kind='bar')

Out[46]:

<Axes: xlabel='Sex'>

Extract the unique values of the Sex and Pclass columns

In [47]:

print(f"Genders: {list(df['Sex'].unique())}")
print(f"Classes: {list(df['Pclass'].unique())}")

Genders: ['male', 'female']
Classes: [3, 1, 2]
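When the question is not only which values occur but also how often, `nunique` and `value_counts` are natural companions to `unique`. A sketch on a few made-up rows that reuse the Titanic column names:

```python
import pandas as pd

# Made-up rows, same column names as the Titanic data.
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Pclass": [3, 1, 3, 1, 3],
})

print(passengers["Sex"].nunique())           # number of distinct values
print(passengers["Pclass"].value_counts())   # rows per distinct value, most frequent first
```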

Exercises

Use the file births.csv in the data/pandas folder

Create a DataFrame from that file
Study its unique values, non-null values, duplicates, ...
If data is missing (NaNs), handle it as you see fit: drop rows, drop columns, fill in values, ...
Create a new column called date that joins the values of the year, month and day columns in the format year-month-day
Create a new DataFrame from the previous one (including the new column), drop the gender column, and create a new column with the number of births on each date regardless of gender.
Create a line plot from the original DataFrame showing how births evolved from 1969 to 2008. Create one plot per gender.
Do the same without distinguishing by gender, using the new DataFrame

Matplotlib

Matplotlib is a Python library built on top of NumPy for generating plots.
Its plotting interface is modeled on MATLAB's and, like MATLAB, it is widely used in science and engineering.

Line plot

In [48]:

import matplotlib.pyplot as plt
import numpy as np

# Example data
x = np.linspace(0, 4 * np.pi, 1000)
y_sin = np.sin(x)
y_2cos = np.cos(x) * 2

# Create the line plot
plt.plot(x, y_sin, label = 'sin(x)')
plt.plot(x, y_2cos, label = '2cos(x)')

# Customize the plot
plt.xlabel('x')
plt.ylabel('sin(x) and 2cos(x)')
plt.title('Line plot')
plt.legend()

# Show the plot
plt.show()
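The example above uses the implicit `plt.*` interface. Matplotlib also offers an explicit, object-oriented interface in which a Figure and its Axes are handled directly. The sketch below redraws the same curves that way and writes the image to an in-memory buffer; the "Agg" backend and the buffer are choices made here so that the code runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 4 * np.pi, 200)

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, 2 * np.cos(x), label="2cos(x)")
ax.set_xlabel("x")
ax.set_title("Line plot (object-oriented API)")
ax.legend()

# Save as PNG; a file name such as "plot.png" works in place of the buffer.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
plt.close(fig)
```

The explicit interface tends to scale better once a figure contains several subplots, because every call names the Axes it acts on.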

Scatter plot

In [49]:

import matplotlib.pyplot as plt
import numpy as np

# Example data
x = np.random.randn(1000)   # normally distributed
y = np.random.rand(1000)    # uniformly distributed in [0, 1)

# Create the scatter plot
plt.scatter(x, y)

# Customize the plot
plt.xlabel('Normally distributed random points')
plt.ylabel('Uniformly distributed random points')
plt.title('Scatter plot')

# Show the plot
plt.show()

Bar chart

In [50]:

# data from https://allisonhorst.github.io/palmerpenguins/

import matplotlib.pyplot as plt
import numpy as np

species = ("Adelie", "Chinstrap", "Gentoo")
penguin_means = {
    'Bill Depth': (18.35, 18.43, 14.98),
    'Bill Length': (38.79, 48.83, 47.50),
    'Flipper Length': (189.95, 195.82, 217.19),
}

x = np.arange(len(species))  # the label locations
width = 0.25  # the width of the bars

fig, ax = plt.subplots(layout = 'constrained')

for idx, (attribute, measurement) in enumerate(penguin_means.items()):
    offset = width * idx
    rects = ax.bar(x + offset, measurement, width, label = attribute)
    ax.bar_label(rects, padding = 3)

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Length (mm)')
ax.set_title('Penguin attributes by species')
ax.set_xticks(x + width, species)
ax.legend(loc = 'upper left', ncols = 3)
ax.set_ylim(0, 250)

plt.show()

Contours

In [51]:

import matplotlib.pyplot as plt
import numpy as np

plt.style.use('_mpl-gallery-nogrid')

# make data
X, Y = np.meshgrid(np.linspace(-3, 3, 256), np.linspace(-3, 3, 256))
Z = (1 - X/2 + X**5 + Y**3) * np.exp(-X**2 - Y**2)
levels = np.linspace(Z.min(), Z.max(), 7)

# plot
fig, ax = plt.subplots()
fig.set_figwidth(4)
fig.set_figheight(4)
contour = ax.contourf(X, Y, Z, levels = levels)

plt.colorbar(contour)
plt.show()

3D plots

In [52]:

import matplotlib.pyplot as plt
import numpy as np

from matplotlib import cm

plt.style.use('_mpl-gallery-nogrid')

# make data
X, Y = np.meshgrid(np.linspace(-3, 3, 256), np.linspace(-3, 3, 256))
Z = (1 - X/2 + X**5 + Y**3) * np.exp(-X**2 - Y**2)

# plot
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
fig.set_figwidth(4)
fig.set_figheight(4)
surface = ax.plot_surface(X, Y, Z, cmap = cm.Blues)

plt.colorbar(surface)

plt.show()

Plots with world maps¶

In [53]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from mpl_toolkits.basemap import Basemap

cities = pd.read_csv(r'C:\Users\sergi\Documents\repos\python_course\data\california_cities.csv')

# Extract the data we're interested in
lat = cities['latd'].values
lon = cities['longd'].values
population = cities['population_total'].values
area = cities['area_total_km2'].values

# 1. Draw the map background
fig = plt.figure(figsize = (8, 8))
m = Basemap(projection = 'lcc', resolution='h', 
            lat_0 = 37.5, lon_0 = -119,
            width = 1E6, height = 1.2E6)
m.shadedrelief() # Add a shaded-relief background
m.drawcoastlines(color = 'gray') # Draw the coastlines
m.drawcountries(color = 'gray') # Draw country borders
m.drawstates(color = 'gray') # Draw US state borders

# 2. scatter city data, with color reflecting population
# and size reflecting area
m.scatter(lon, lat, latlon=True,
          c = np.log10(population), s = area,
          cmap = 'Reds', alpha = 0.5) 

# 3. create colorbar and legend
plt.colorbar(label = r'$\log_{10}({\rm population})$') # Add the colorbar
plt.clim(3, 7) # Limit the color scale to the range 3 to 7

# make legend with dummy points
for a in [100, 300, 500]: # Add the legend entries
    plt.scatter([], [], c = 'k', alpha = 0.5, s = a, label = str(a) + ' km$^2$')
plt.legend(scatterpoints = 1, frameon = False, labelspacing = 1, loc = 'lower left')
plt.title('California Cities Area')
plt.xlabel('Longitude')
plt.ylabel('Latitude')

Out[53]:

Text(0, 0.5, 'Latitude')

Exercises¶

Take a quick look at the following links:

Plot types: https://matplotlib.org/stable/plot_types/index.html
Examples: https://matplotlib.org/stable/gallery/index.html#

Exercises:

Plot a time series
Plot the histogram of a continuous variable and of a discrete one
Pick an example that interests you and try to reproduce it with your own data
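
As a possible starting point for the first two exercises, here is a minimal sketch with Matplotlib. All the data is synthetic and hypothetical (a random walk standing in for a time series, normally distributed values as the continuous variable, simulated die rolls as the discrete one), so swap in your own data.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the figure is reproducible

# Synthetic data (hypothetical, for illustration only)
days = np.arange("2024-01-01", "2024-04-10", dtype="datetime64[D]")
series = np.cumsum(rng.normal(size=days.size))        # daily random walk
continuous = rng.normal(loc=170, scale=10, size=500)  # continuous variable
discrete = rng.integers(1, 7, size=500)               # die rolls, 1 to 6

fig, (ax_ts, ax_cont, ax_disc) = plt.subplots(1, 3, figsize=(12, 3))

# Time series: dates on the x-axis, values on the y-axis
ax_ts.plot(days, series)
ax_ts.set_title("Time series")
ax_ts.tick_params(axis="x", labelrotation=45)

# Continuous variable: let hist split the range into 30 equal-width bins
ax_cont.hist(continuous, bins=30)
ax_cont.set_title("Continuous variable")

# Discrete variable: bin edges at 0.5, 1.5, ..., 6.5 center one bin on each integer
ax_disc.hist(discrete, bins=np.arange(0.5, 7.5, 1))
ax_disc.set_title("Discrete variable")

fig.tight_layout()
```

In a notebook the figure is displayed when the cell finishes; in a script, call `plt.show()` at the end. For the discrete histogram, placing the bin edges halfway between integers avoids two values sharing a bar.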

Seaborn¶

Seaborn is a library built on top of Matplotlib for producing more descriptive plots for statistical analysis.

Hexagonal-bin joint plot for dense data¶

In [54]:

import numpy as np
import seaborn as sns

sns.set_theme(style = 'ticks')

x = np.random.randn(1000000)
y = np.random.rand(1000000)

sns.jointplot(x = x, y = y, kind = 'scatter', color = '#4CB391', s = 0.1)
sns.jointplot(x = x, y = y, kind = 'hex', color = '#4CB391')

Out[54]:

<seaborn.axisgrid.JointGrid at 0x266c89576a0>

In [1]:

import seaborn as sns

df = sns.load_dataset('titanic')
df = df.drop(columns = ['adult_male', 'alone'])
corr = df.corr(numeric_only = True)

cmap = sns.diverging_palette(230, 20, as_cmap = True)
sns.heatmap(corr, cmap = cmap, linewidths = .5, annot = True)

Out[1]:

<Axes: >

Density and distribution plots¶

In [7]:

import seaborn as sns

sns.set_theme(style = 'ticks')

plot = sns.pairplot(sns.load_dataset('penguins'), hue = 'species')

In [2]:

import seaborn as sns

corr = sns.load_dataset('penguins').corr(numeric_only = True)

cmap = sns.diverging_palette(230, 20, as_cmap = True)
sns.heatmap(corr, cmap = cmap, linewidths = .5, annot = True)

Out[2]:

<Axes: >

Bar plots¶

In [6]:

sns.catplot(data = df, x = 'class', y = 'survived', col = 'sex', kind = 'bar')

Out[6]:

<seaborn.axisgrid.FacetGrid at 0x15b822eecb0>

Violin plots¶

In [3]:

sns.violinplot(data = df, x = 'alive', y = 'age', split = True, hue = 'sex')

Out[3]:

<Axes: xlabel='alive', ylabel='age'>

In [4]:

sns.catplot(data = df, x = 'class', y = 'age', col = 'sex', kind = 'violin')

Out[4]:

<seaborn.axisgrid.FacetGrid at 0x15b7feddf90>

Box plots¶

In [5]:

sns.catplot(data = df, x = 'alive', y = 'age', col = 'class', hue = 'sex', kind = 'box')

Out[5]:

<seaborn.axisgrid.FacetGrid at 0x15b8223e9e0>

Exercises¶

Pick a dataset of your own or from the Internet and try plotting it with functions such as catplot, violinplot, heatmap, pairplot, jointplot, ...

Check the documentation for more interesting plots.