{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"exercise": "task",
"tags": [
"task"
]
},
"source": [
"# Data Analysis and Plotting in Python with Pandas\n",
"\n",
" _Carolin Penke, J\u00fclich Supercomputing Centre, Forschungszentrum J\u00fclich, 23 October 2024_\n",
" _Based on material by Andreas Herten_"
]
},
{
"cell_type": "markdown",
"metadata": {
"exercise": "onlytask",
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"**Version: Tasks**"
]
},
{
"cell_type": "markdown",
"metadata": {
"exercise": "task",
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"## Task Outline\n",
"\n",
"* [Task 1](#task1)\n",
"* [Task 2](#task2)\n",
"* [Task 3](#task3)\n",
"* [Task 4](#task4)\n",
"* [Task 5](#task5)\n",
"* [Task 6](#task6)\n",
"* [Task 7](#task7)\n",
"* [Task 7B](#task7b)\n",
"* [Task 8](#task8)\n",
"* [Task 8B](#task8b)"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {
"exercise": "task",
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {
"exercise": "task",
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Task 1\n",
"\n",
"TASK\n",
"\n",
"* Create a data frame with\n",
"  - 6 names of dinosaurs,\n",
"  - their favourite prime number,\n",
"  - and their favourite color.\n",
"* Play around with the frame\n",
"* Tell me when you're done using the status icon in BigBlueButton: \ud83d\udc4d"
]
},
{
"cell_type": "markdown",
"metadata": {
"exercise": "nopresentation",
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Jupyter Notebook 101:\n",
"\n",
"* Execute cell: `shift+enter`\n",
"* New cell above current cell (in command mode): `a`\n",
"* New cell below current cell (in command mode): `b`"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {
"exercise": "task",
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"happy_dinos = {\n",
" \"Dinosaur Name\": [],\n",
" \"Favourite Prime\": [],\n",
" \"Favourite Color\": []\n",
"}\n",
"#df_dinos = "
]
},
{
"cell_type": "markdown",
"metadata": {
"exercise": "task",
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Task 2\n",
"\n",
"TASK\n",
"\n",
"* Read `data-nest.csv` into a `DataFrame`; call it `df`  \n",
"  *(Data was produced with [JUBE](http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/JUBE/_node.html))*\n",
"* Get to know it and play around with it a bit\n",
"* Tell me when you're done using the status icon in BigBlueButton: \ud83d\udc4d"
]
},
{
"cell_type": "code",
"execution_count": 151,
"metadata": {
"exercise": "task"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"id,Nodes,Tasks/Node,Threads/Task,Runtime Program / s,Scale,Plastic,Avg. Neuron Build Time / s,Min. Edge Build Time / s,Max. Edge Build Time / s,Min. Init. Time / s,Max. Init. Time / s,Presim. Time / s,Sim. Time / s,Virt. Memory (Sum) / kB,Local Spike Counter (Sum),Average Rate (Sum),Number of Neurons,Number of Connections,Min. Delay,Max. Delay\r\n",
"5,1,2,4,420.42,10,true,0.29,88.12,88.18,1.14,1.20,17.26,311.52,46560664.00,825499,7.48,112500,1265738500,1.5,1.5\r\n",
"5,1,4,4,200.84,10,true,0.15,46.03,46.34,0.70,1.01,7.87,142.97,46903088.00,802865,7.03,112500,1265738500,1.5,1.5\r\n",
"5,1,2,8,202.15,10,true,0.28,47.98,48.48,0.70,1.20,7.95,142.81,47699384.00,802865,7.03,112500,1265738500,1.5,1.5\r\n",
"5,1,4,8,89.57,10,true,0.15,20.41,23.21,0.23,3.04,3.19,60.31,46813040.00,821491,7.23,112500,1265738500,1.5,1.5\r\n",
"5,2,2,4,164.16,10,true,0.20,40.03,41.09,0.52,1.58,6.08,114.88,46937216.00,802865,7.03,112500,1265738500,1.5,1.5\r\n",
"5,2,4,4,77.68,10,true,0.13,20.93,21.22,0.16,0.46,3.12,52.05,47362064.00,821491,7.23,112500,1265738500,1.5,1.5\r\n",
"5,2,2,8,79.60,10,true,0.20,21.63,21.91,0.19,0.47,2.98,53.12,46847168.00,821491,7.23,112500,1265738500,1.5,1.5\r\n",
"5,2,4,8,37.20,10,true,0.13,10.08,11.60,0.10,1.63,1.24,23.29,47065232.00,818198,7.33,112500,1265738500,1.5,1.5\r\n",
"5,3,2,4,96.51,10,true,0.15,26.54,27.41,0.36,1.22,3.33,64.28,52256880.00,813743,7.27,112500,1265738500,1.5,1.5\r\n"
]
}
],
"source": [
"!head data-nest.csv"
]
},
{
"cell_type": "markdown",
"metadata": {
"exercise": "task",
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Task 3\n",
"\n",
"TASK\n",
"\n",
"* Add a column called `Threads` to the Nest data frame from Task 2; it holds the total number of threads across all nodes (i.e. the product of threads per task, tasks per node, and nodes)\n",
"* Tell me when you're done using the status icon in BigBlueButton: \ud83d\udc4d"
]
},
{
"cell_type": "code",
"execution_count": 182,
"metadata": {
"exercise": "task",
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {
"exercise": "task",
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Task 4\n",
"\n",
"TASK\n",
"\n",
"* Sort the Nest data frame by threads\n",
"* Plot `\"Presim. Time / s\"` and `\"Sim. Time / s\"` of our data frame `df` as a function of threads\n",
"* Use a dashed, red line for `\"Presim. Time / s\"`, a blue line for `\"Sim. Time / s\"` (see [API description](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot))\n",
"* Don't forget to label your axes and to add a legend _(1st rule of plotting)_\n",
"* Tell me when you're done using the status icon in BigBlueButton: \ud83d\udc4d"
]
},
{
"cell_type": "markdown",
"metadata": {
"exercise": "task",
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Task 5\n",
"\n",
"TASK\n",
"\n",
"Use the Nest data frame `df` to:\n",
"\n",
"1. Make the threads the index of the data frame (`.set_index()`)\n",
"2. Plot `\"Presim. Time / s\"` and `\"Sim. Time / s\"` individually\n",
"3. Plot them onto one common canvas!\n",
"4. Give them the same line colors and styles as before\n",
"5. Add a legend and any missing axis labels\n",
"6. Tell me when you're done using the status icon in BigBlueButton: \ud83d\udc4d"
]
},
{
"cell_type": "markdown",
"metadata": {
"exercise": "task",
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Task 6\n",
"\n",
"TASK\n",
"\n",
"* To your `df` Nest data frame, add a column with the unaccounted time (`Unaccounted Time / s`): the program runtime minus the average neuron build time, minimal edge build time, minimal initialization time, presimulation time, and simulation time.  \n",
"(*I know this is technically not super correct, but it will do for our example.*)\n",
"* Plot a stacked bar plot of all these columns (except for program runtime) over the threads\n",
"* Tell me when you're done using the status icon in BigBlueButton: \ud83d\udc4d"
]
},
{
"cell_type": "markdown",
"metadata": {
"exercise": "task",
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Task 7\n",
"\n",
"TASK\n",
"\n",
"* Create a pivot table based on the Nest `df` data frame\n",
"* Let the `x` axis show the number of nodes; display the values of the simulation time `\"Sim. Time / s\"` for the tasks per node and threads per task configurations\n",
"* Plot the result as a bar plot\n",
"* Tell me when you're done using the status icon in BigBlueButton: \ud83d\udc4d"
]
},
{
"cell_type": "markdown",
"metadata": {
"exercise": "task",
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Task 7B (like Bonus)\n",
"\n",
"TASK\n",
"\n",
"- Same pivot table as before (that is, `x` with nodes, and columns for Tasks/Node and Threads/Task)\n",
"- But now, use `Sim. Time / s` and `Presim. Time / s` as values to show\n",
"- Show them as a **stack** of those two values inside the pivot table\n",
"- Use pandas' functionality as much as possible!"
]
},
{
"cell_type": "markdown",
"metadata": {
"exercise": "task",
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Task 8 (Super Bonus)\n",
"\n",
"TASK\n",
"\n",
"* Create a bar chart of the top 10 actors (on `x`) and the average ratings of their top movies (on `y`) based on IMDb data (only include actors who appear in at least two movies)"
]
},
{
"cell_type": "markdown",
"metadata": {
"exercise": "task",
"slideshow": {
"slide_type": "fragment"
},
"tags": []
},
"source": [
"* IMDb provides data sets at [datasets.imdbws.com](https://datasets.imdbws.com)\n",
"* They can be loaded directly (replace `dataset.tsv.gz` with the file you need):\n",
"```python\n",
"pd.read_table('https://datasets.imdbws.com/dataset.tsv.gz', sep=\"\\t\", low_memory=False, na_values=[\"\\\\N\",\"nan\"])\n",
"```\n",
"* Needed:\n",
"  * `name.basics.tsv.gz` (for names of actors and movies they are known for)\n",
"  * `title.ratings.tsv.gz` (for ratings of titles)\n",
"* Strategy _suggestions_:\n",
"  * Use `df.apply()` with a custom function\n",
"  * Custom function: Compute the average rating and determine if this entry is eligible for plotting (this _can_ be done at once, but does not need to be)\n",
"  * Average rating: Look up the title IDs listed in `knownForTitles` in the titles data frame"
]
},
{
"cell_type": "markdown",
"metadata": {
"exercise": "task",
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Task 8B (Bonuseption)\n",
"\n",
"TASK\n",
"\n",
"The following are ideas for independent sub-tasks, each of which can be done on its own:\n",
"\n",
"* In addition to Task 8, restrict the top titles to those with more than 10000 votes\n",
"* For the 30 top-rated actors, plot rating vs. age\n",
"* For the 30 top-rated actors, plot rating vs. average runtime of the known-for titles (using `title.basics.tsv.gz`)"
]
},
{
"cell_type": "markdown",
"metadata": {
"exercise": "task"
},
"source": [
"Feedback to a.herten@fz-juelich.de\n",
"\n",
"_Next slide: Further reading_"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
},
"toc-autonumbering": false,
"toc-showcode": false,
"toc-showmarkdowntxt": false,
"toc-showtags": true
},
"nbformat": 4,
"nbformat_minor": 4
}