{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "task",
    "tags": [
     "task"
    ]
   },
   "source": [
    "# Data Analysis and Plotting in Python with Pandas\n",
    "\n",
    " _Carolin Penke, J\u00fclich Supercomputing Centre, Forschungszentrum J\u00fclich, 23 October 2024_\n",
    " _Based on material by Andreas Herten_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "onlytask",
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "**Version: Tasks**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "task",
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "## Task Outline\n",
    "\n",
    "* [Task 1](#task1)\n",
    "* [Task 2](#task2)\n",
    "* [Task 3](#task3)\n",
    "* [Task 4](#task4)\n",
    "* [Task 5](#task5)\n",
    "* [Task 6](#task6)\n",
    "* [Task 7](#task7)\n",
    "* [Task 7B](#task7b)\n",
    "* [Task 8](#task8)\n",
    "* [Task 8B](#task8b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {
    "exercise": "task",
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "task",
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Task 1\n",
    "<a name=\"task1\"></a>\n",
    "<span class=\"task\" style=\"padding: 2px 8px; color: white; background-color: #b9d25f; float: right; text-weight: bolder;\">TASK</em></span>\n",
    "\n",
    "* Create data frame with\n",
    "    - 6 names of dinosaurs, \n",
    "    - their favourite prime number, \n",
    "    - and their favorite color.\n",
    "* Play around with the frame\n",
    "* Tell me when you're done with status icon in BigBlueButton: \ud83d\udc4d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "nopresentation",
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Jupyter Notebook 101:\n",
    "\n",
    "* Execute cell: `shift+enter`\n",
    "* New cell in front of current cell: `a`\n",
    "* New cell after current cell: `b`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 143,
   "metadata": {
    "exercise": "task",
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "happy_dinos = {\n",
    "    \"Dinosaur Name\": [],\n",
    "    \"Favourite Prime\": [],\n",
    "    \"Favourite Color\": []\n",
    "}\n",
    "#df_dinos = "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "task",
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Task 2\n",
    "<a name=\"task2\"></a>\n",
    "<span class=\"task\" style=\"padding: 2px 8px; color: white; background-color: #b9d25f; float: right; text-weight: bolder;\">TASK</em></span>\n",
    "\n",
    "* Read in `data-nest.csv` to `DataFrame`; call it `df`  \n",
    "  *(Data was produced with [JUBE](http://www.fz-juelich.de/ias/jsc/EN/Expertise/Support/Software/JUBE/_node.html))*\n",
    "* Get to know it and play a bit with it\n",
    "* Tell me when you're done with status icon in BigBlueButton: \ud83d\udc4d"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 151,
   "metadata": {
    "exercise": "task"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "id,Nodes,Tasks/Node,Threads/Task,Runtime Program / s,Scale,Plastic,Avg. Neuron Build Time / s,Min. Edge Build Time / s,Max. Edge Build Time / s,Min. Init. Time / s,Max. Init. Time / s,Presim. Time / s,Sim. Time / s,Virt. Memory (Sum) / kB,Local Spike Counter (Sum),Average Rate (Sum),Number of Neurons,Number of Connections,Min. Delay,Max. Delay\r\n",
      "5,1,2,4,420.42,10,true,0.29,88.12,88.18,1.14,1.20,17.26,311.52,46560664.00,825499,7.48,112500,1265738500,1.5,1.5\r\n",
      "5,1,4,4,200.84,10,true,0.15,46.03,46.34,0.70,1.01,7.87,142.97,46903088.00,802865,7.03,112500,1265738500,1.5,1.5\r\n",
      "5,1,2,8,202.15,10,true,0.28,47.98,48.48,0.70,1.20,7.95,142.81,47699384.00,802865,7.03,112500,1265738500,1.5,1.5\r\n",
      "5,1,4,8,89.57,10,true,0.15,20.41,23.21,0.23,3.04,3.19,60.31,46813040.00,821491,7.23,112500,1265738500,1.5,1.5\r\n",
      "5,2,2,4,164.16,10,true,0.20,40.03,41.09,0.52,1.58,6.08,114.88,46937216.00,802865,7.03,112500,1265738500,1.5,1.5\r\n",
      "5,2,4,4,77.68,10,true,0.13,20.93,21.22,0.16,0.46,3.12,52.05,47362064.00,821491,7.23,112500,1265738500,1.5,1.5\r\n",
      "5,2,2,8,79.60,10,true,0.20,21.63,21.91,0.19,0.47,2.98,53.12,46847168.00,821491,7.23,112500,1265738500,1.5,1.5\r\n",
      "5,2,4,8,37.20,10,true,0.13,10.08,11.60,0.10,1.63,1.24,23.29,47065232.00,818198,7.33,112500,1265738500,1.5,1.5\r\n",
      "5,3,2,4,96.51,10,true,0.15,26.54,27.41,0.36,1.22,3.33,64.28,52256880.00,813743,7.27,112500,1265738500,1.5,1.5\r\n"
     ]
    }
   ],
   "source": [
    "!head data-nest.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "task",
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "## Task 3\n",
    "<a name=\"task3\"></a>\n",
    "<span class=\"task\" style=\"padding: 2px 8px; color: white; background-color: #b9d25f; float: right; text-weight: bolder;\">TASK</em></span>\n",
    "\n",
    "* Add a column to the Nest data frame form Task 2 called `Threads` which is the total number of threads across all nodes (i.e. the product of threads per task and tasks per node and nodes)\n",
    "* Tell me when you're done with status icon in BigBlueButton: \ud83d\udc4d"
   ]
  },
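  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "nopresentation",
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "A minimal sketch of how a derived column can be computed as an element-wise product of existing columns. The frame `df_toy` and its values are made up for illustration and are not taken from `data-nest.csv`; only the column names mirror the CSV header.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data; column names mirror the Nest CSV header\n",
    "df_toy = pd.DataFrame({\n",
    "    \"Nodes\": [1, 2],\n",
    "    \"Tasks/Node\": [2, 4],\n",
    "    \"Threads/Task\": [4, 8],\n",
    "})\n",
    "# Column arithmetic is element-wise, so this multiplies row by row\n",
    "df_toy[\"Threads\"] = df_toy[\"Nodes\"] * df_toy[\"Tasks/Node\"] * df_toy[\"Threads/Task\"]\n",
    "```\n",
    "\n",
    "The same pattern should carry over to `df` once the CSV has been read in."
   ]
  },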
  {
   "cell_type": "code",
   "execution_count": 182,
   "metadata": {
    "exercise": "task",
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "task",
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Task 4\n",
    "<a name=\"task4\"></a>\n",
    "<span class=\"task\" style=\"padding: 2px 8px; color: white; background-color: #b9d25f; float: right; text-weight: bolder;\">TASK</em></span>\n",
    "\n",
    "\n",
    "* Sort the Nest data frame by threads\n",
    "* Plot `\"Presim. Time / s\"` and `\"Sim. Time / s\"` of our data frame `df` as a function of threads\n",
    "* Use a dashed, red line for `\"Presim. Time / s\"`, a blue line for `\"Sim. Time / s\"` (see [API description](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot))\n",
    "* Don't forget to label your axes and to add a legend _(1st rule of plotting)_\n",
    "* Tell me when you're done with status icon in BigBlueButton: \ud83d\udc4d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "task",
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Task 5\n",
    "<a name=\"task5\"></a>\n",
    "<span class=\"task\" style=\"padding: 2px 8px; color: white; background-color: #b9d25f; float: right; text-weight: bolder;\">TASK</em></span>\n",
    "\n",
    "Use the Nest data frame `df` to:\n",
    "\n",
    "1. Make threads index of the data frame (`.set_index()`)\n",
    "2. Plot `\"Presim. Time / s\"` and `\"Sim. Time / s`\" individually\n",
    "3. Plot them onto one common canvas!\n",
    "4. Make them have the same line colors and styles as before\n",
    "5. Add a legend, add missing axes labels\n",
    "6. Tell me when you're done with status icon in BigBlueButton: \ud83d\udc4d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "task",
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Task 6\n",
    "<a name=\"task6\"></a>\n",
    "<span class=\"task\" style=\"padding: 2px 8px; color: white; background-color: #b9d25f; float: right; text-weight: bolder;\">TASK</em></span>\n",
    "\n",
    "* To your `df` Nest data frame, add a column with the unaccounted time (`Unaccounted Time / s`), which is the difference of program runtime, average neuron build time, minimal edge build time, minimal initialization time, presimulation time, and simulation time.  \n",
    "(*I know this is technically not super correct, but it will do for our example.*)\n",
    "* Plot a stacked bar plot of all these columns (except for program runtime) over the threads\n",
    "* Tell me when you're done with status icon in BigBlueButton: \ud83d\udc4d"
   ]
  },
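  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "nopresentation",
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "A minimal sketch of the two ingredients, a remainder column and a stacked bar plot, on a small made-up frame. The frame `df_toy`, its column names, and its numbers are hypothetical and only illustrate the pattern.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical timings in seconds, indexed by thread count\n",
    "df_toy = pd.DataFrame(\n",
    "    {\"Total / s\": [10.0, 6.0], \"Build / s\": [3.0, 2.0], \"Sim / s\": [5.0, 3.0]},\n",
    "    index=pd.Index([8, 16], name=\"Threads\"),\n",
    ")\n",
    "parts = [\"Build / s\", \"Sim / s\"]\n",
    "# Remainder = total minus the row-wise sum of the accounted parts\n",
    "df_toy[\"Unaccounted / s\"] = df_toy[\"Total / s\"] - df_toy[parts].sum(axis=1)\n",
    "# One bar per index entry; the selected columns are stacked on top of each other\n",
    "ax = df_toy[parts + [\"Unaccounted / s\"]].plot(kind=\"bar\", stacked=True)\n",
    "ax.set_ylabel(\"Time / s\")\n",
    "```\n",
    "\n",
    "Selecting the columns explicitly before calling `.plot()` keeps the total out of the stack."
   ]
  },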
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "task",
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Task 7\n",
    "<a name=\"task7\"></a>\n",
    "<span class=\"task\" style=\"padding: 2px 8px; color: white; background-color: #b9d25f; float: right; text-weight: bolder;\">TASK</em></span>\n",
    "\n",
    "* Create a pivot table based on the Nest `df` data frame\n",
    "* Let the `x` axis show the number of nodes; display the values of the simulation time `\"Sim. Time / s\"` for the tasks per node and threads per task configurations\n",
    "* Please plot a bar plot\n",
    "* Tell me when you're done with status icon in BigBlueButton: \ud83d\udc4d"
   ]
  },
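  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "nopresentation",
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "A minimal sketch of `pivot_table()` on a small made-up frame; `df_toy` and its values are hypothetical. Rows of the result come from `index`, columns from `columns`, and each cell holds the aggregated `values` (the mean by default).\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data, one row per configuration\n",
    "df_toy = pd.DataFrame({\n",
    "    \"Nodes\": [1, 1, 2, 2],\n",
    "    \"Tasks/Node\": [2, 4, 2, 4],\n",
    "    \"Sim / s\": [8.0, 4.0, 4.0, 2.0],\n",
    "})\n",
    "# Nodes become the row index, Tasks/Node values become the columns\n",
    "pivot = df_toy.pivot_table(index=\"Nodes\", columns=\"Tasks/Node\", values=\"Sim / s\")\n",
    "```\n",
    "\n",
    "Calling `pivot.plot(kind=\"bar\")` on such a table should draw one group of bars per row and one bar per column. Passing a list to `columns` gives one column per combination."
   ]
  },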
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "task",
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": []
   },
   "source": [
    "## Task 7B (like <em>B</em>onus)\n",
    "<a name=\"task7b\"></a>\n",
    "<span class=\"task\" style=\"padding: 2px 8px; color: white; background-color: #b9d25f; float: right; text-weight: bolder;\">TASK</em></span>\n",
    "\n",
    "- Same pivot table as before (that is, `x` with nodes, and columns for Tasks/Node and Threads/Task)\n",
    "- But now, use `Sim. Time / s` and `Presim. Time / s` as values to show\n",
    "- Show them as a **stack** of those two values inside the pivot table\n",
    "- Use Panda's functionality as much as possible!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "task",
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Task 8 (Super Bonus)\n",
    "<a name=\"task8\"></a>\n",
    "<span class=\"task\" style=\"padding: 2px 8px; color: white; background-color: #b9d25f; float: right; text-weight: bolder;\">TASK</em></span>\n",
    "\n",
    "* Create bar chart of top 10 actors (on `x`) and average ratings of their top movies (`y`) based on IMDb data (only if they play in at least two movies)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "task",
    "slideshow": {
     "slide_type": "fragment"
    },
    "tags": []
   },
   "source": [
    "* IMDb provides data sets at [datasets.imdbws.com](https://datasets.imdbws.com)\n",
    "* Can directly be loaded like\n",
    "```python\n",
    "pd.read_table('https://datasets.imdbws.com/dataset.tsv.gz', sep=\"\\t\", low_memory=False, na_values=[\"\\\\N\",\"nan\"])\n",
    "```\n",
    "* Needed:\n",
    "  * `name.basics.tsv.gz` (for names of actors and movies they are known for)\n",
    "  * `title.ratings.tsv.gz` (for ratings of titles)\n",
    "* Strategy _suggestions_:\n",
    "  * Use `df.apply()` with custom function\n",
    "  * Custom function: Compute average rating and determine if this entry is eligible for plotting (this _can_ be done at once, but does not need to be)\n",
    "  * Average rating: Look up title IDs as listed in `knownForTitles` in titles dataframe"
   ]
  },
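  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "nopresentation",
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "A minimal sketch of the suggested strategy on tiny made-up frames, so it runs without downloading anything. The frames `names` and `ratings`, the helper `mean_rating`, and all values are hypothetical; the column names `primaryName`, `knownForTitles`, `tconst`, and `averageRating` are assumed to match the IMDb files, with `knownForTitles` assumed to be a comma-separated string of title IDs. The sketch uses `Series.apply()` on a single column rather than `df.apply()`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-ins for name.basics and title.ratings\n",
    "names = pd.DataFrame({\n",
    "    \"primaryName\": [\"A\", \"B\"],\n",
    "    \"knownForTitles\": [\"tt1,tt2\", \"tt3\"],\n",
    "})\n",
    "ratings = pd.DataFrame(\n",
    "    {\"averageRating\": [8.0, 6.0, 9.0]},\n",
    "    index=pd.Index([\"tt1\", \"tt2\", \"tt3\"], name=\"tconst\"),\n",
    ")\n",
    "\n",
    "def mean_rating(title_ids, min_titles=2):\n",
    "    # Returns NaN for entries that are missing or have too few rated titles\n",
    "    if pd.isna(title_ids):\n",
    "        return float(\"nan\")\n",
    "    ids = [t for t in title_ids.split(\",\") if t in ratings.index]\n",
    "    if len(ids) < min_titles:\n",
    "        return float(\"nan\")\n",
    "    return ratings.loc[ids, \"averageRating\"].mean()\n",
    "\n",
    "names[\"meanRating\"] = names[\"knownForTitles\"].apply(mean_rating)\n",
    "# Drop ineligible entries, then keep the 10 highest averages\n",
    "top = names.dropna(subset=[\"meanRating\"]).nlargest(10, \"meanRating\")\n",
    "```\n",
    "\n",
    "Returning `NaN` for ineligible entries computes the average and the eligibility check in one pass; a separate boolean column would work as well."
   ]
  },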
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "task",
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": []
   },
   "source": [
    "## Task 8B (<em>B</em>onuseption)\n",
    "<a name=\"task8b\"></a>\n",
    "<span class=\"task\" style=\"padding: 2px 8px; color: white; background-color: #b9d25f; float: right; text-weight: bolder;\">TASK</em></span>\n",
    "\n",
    "All of the following are ideas for unique sub-tasks, which can be done individually\n",
    "* In addition to Task 8, restrict the top titles to those with more than 10000 votes\n",
    "* For 30 top-rated actors, plot rating vs. age\n",
    "* For 30 top-rated actors, plot rating vs. average runtime of the known-for-titles (using `title.basics.tsv.gz`)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "exercise": "task"
   },
   "source": [
    "<span class=\"feedback\">Feedback to <a href=\"mailto:a.herten@fz-juelich.de\">a.herten@fz-juelich.de</a></span>\n",
    "\n",
    "_Next slide: Further reading_"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  },
  "toc-autonumbering": false,
  "toc-showcode": false,
  "toc-showmarkdowntxt": false,
  "toc-showtags": true
 },
 "nbformat": 4,
 "nbformat_minor": 4
}