{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/tonegas/nnodely/blob/main/examples/dataset.ipynb\" target=\"_blank\">\n",
    "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\">\n",
    "</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load Data\n",
    "\n",
    "Listed here are all the modalitites by which you can load data inside the nnodely framework.\n",
    "There are three modalities to load a dataset inside nnodely:\n",
    "1. Using a directory, each file represents a simulation, with time coherence between lines.\n",
    "2. Using a dictionary, each element in the dictionary represents a variable.\n",
    "3. Using a pandas dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-26T15:13:12.900246Z",
     "start_time": "2025-05-26T15:13:11.882058Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>-- nnodely_v1.5.0 --<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<\n"
     ]
    }
   ],
   "source": [
    "# uncomment the command below to install the nnodely package\n",
    "#!pip install nnodely\n",
    "\n",
    "from nnodely import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the following lines a network is created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-26T15:13:13.003845Z",
     "start_time": "2025-05-26T15:13:12.998180Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[1;32m================================ nnodely Model =================================\u001b[0m\n",
      "\u001b[32m{'Constants': {},\n",
      " 'Functions': {},\n",
      " 'Info': {'SampleTime': 0.01,\n",
      "          'nnodely_version': '1.5.0',\n",
      "          'ns': [5, 0],\n",
      "          'ntot': 5,\n",
      "          'num_parameters': 5},\n",
      " 'Inputs': {'in1': {'dim': 1, 'ns': [5, 0], 'ntot': 5, 'tw': [-0.05, 0]},\n",
      "            'target': {'dim': 1, 'ns': [1, 0], 'ntot': 1, 'sw': [-1, 0]}},\n",
      " 'Minimizers': {'out': {'A': 'Fir2', 'B': 'SamplePart4', 'loss': 'mse'}},\n",
      " 'Outputs': {'out': 'Fir2'},\n",
      " 'Parameters': {'PFir3W': {'dim': 1,\n",
      "                           'tw': 0.05,\n",
      "                           'values': [[0.7577804327011108],\n",
      "                                      [0.1862850785255432],\n",
      "                                      [0.5226411819458008],\n",
      "                                      [0.8208074569702148],\n",
      "                                      [0.10860830545425415]]}},\n",
      " 'Relations': {'Fir2': ['Fir', ['TimePart1'], 'PFir3W', None, 0],\n",
      "               'SamplePart4': ['SamplePart', ['target'], -1, [-1, 0]],\n",
      "               'TimePart1': ['TimePart', ['in1'], -1, [-0.05, 0]]}}\u001b[0m\n",
      "\u001b[32m================================================================================\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "in1 = Input('in1')\n",
    "target = Input('target')\n",
    "relation = Fir(in1.tw(0.05))\n",
    "output = Output('out', relation)\n",
    "\n",
    "model = Modely(visualizer=TextVisualizer())\n",
    "model.addMinimize('out', output, target.last())\n",
    "model.neuralizeModel(0.01)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load a dataset using a directory\n",
    "\n",
    "Load a dataset inside the framework using a directory.\n",
    "\n",
    "You must specify a name for the dataset, the folder path and also the structure of the data so that the framework will know which column must be used for every input of the network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-26T15:13:13.016813Z",
     "start_time": "2025-05-26T15:13:13.012790Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[1;32m============================ nnodely Model Dataset =============================\u001b[0m\n",
      "\u001b[32mDataset Name:                 dataset\u001b[0m\n",
      "\u001b[32mNumber of files:              1\u001b[0m\n",
      "\u001b[32mTotal number of samples:      28\u001b[0m\n",
      "\u001b[32mShape of target:              (28, 1, 1)\u001b[0m\n",
      "\u001b[32mShape of in1:                 (28, 5, 1)\u001b[0m\n",
      "\u001b[32m================================================================================\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "train_folder = 'data'\n",
    "data_struct = ['in1', '', 'target']\n",
    "model.loadData(name='dataset', source=train_folder, format=data_struct)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "you can also specify various parameters such as the number of lines to skip, the delimiter to use between data and if you want to include the header of the file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-26T15:13:13.033055Z",
     "start_time": "2025-05-26T15:13:13.029511Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[1;32m============================ nnodely Model Dataset =============================\u001b[0m\n",
      "\u001b[32mDataset Name:                 dataset_2\u001b[0m\n",
      "\u001b[32mNumber of files:              1\u001b[0m\n",
      "\u001b[32mTotal number of samples:      24\u001b[0m\n",
      "\u001b[32mShape of target:              (24, 1, 0)\u001b[0m\n",
      "\u001b[32mShape of in1:                 (24, 5, 1)\u001b[0m\n",
      "\u001b[32m================================================================================\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "model.loadData(name='dataset_2', source=train_folder, format=data_struct, skiplines=4, delimiter='\\t', header=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load a dataset from a custom dictionary\n",
    "\n",
    "you can build your own dataset with a dictionary containing all the necessary inputs of the network and passing it to the 'source' attribute"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-26T15:13:13.047127Z",
     "start_time": "2025-05-26T15:13:13.044856Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[1;32m============================ nnodely Model Dataset =============================\u001b[0m\n",
      "\u001b[32mDataset Name:                 dataset_3\u001b[0m\n",
      "\u001b[32mNumber of files:              1\u001b[0m\n",
      "\u001b[32mTotal number of samples:      6\u001b[0m\n",
      "\u001b[32mShape of target:              (6, 1, 1)\u001b[0m\n",
      "\u001b[32mShape of in1:                 (6, 5, 1)\u001b[0m\n",
      "\u001b[32m================================================================================\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "data_x = np.array(range(10))\n",
    "data_a = 2\n",
    "data_b = -3\n",
    "dataset = {'in1': data_x, 'target': (data_a*data_x) + data_b}\n",
    "\n",
    "model.loadData(name='dataset_3', source=dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load a dataset from a pandas DataFrame\n",
    "\n",
    "you can also use a pandas dataframe as source for loading a dataset inside the nnodely framework"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-26T15:13:13.063513Z",
     "start_time": "2025-05-26T15:13:13.058454Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[1;32m============================ nnodely Model Dataset =============================\u001b[0m\n",
      "\u001b[32mDataset Name:                 dataset_4\u001b[0m\n",
      "\u001b[32mNumber of files:              1\u001b[0m\n",
      "\u001b[32mTotal number of samples:      96\u001b[0m\n",
      "\u001b[32mShape of target:              (96, 1, 1)\u001b[0m\n",
      "\u001b[32mShape of in1:                 (96, 5, 1)\u001b[0m\n",
      "\u001b[32m================================================================================\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "# Create a DataFrame with random values for each input\n",
    "df = pd.DataFrame({\n",
    "    'in1': np.linspace(1,100,100, dtype=np.float32),\n",
    "    'target': np.linspace(1,100,100, dtype=np.float32)})\n",
    "\n",
    "model.loadData(name='dataset_4', source=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Resampling a pandas DataFrame\n",
    "\n",
    "if you have a column representing time you can also use those values to resample the dataset using the sample time of the neuralized network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-26T15:13:13.100302Z",
     "start_time": "2025-05-26T15:13:13.074236Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[1;32m============================ nnodely Model Dataset =============================\u001b[0m\n",
      "\u001b[32mDataset Name:                 dataset_resampled\u001b[0m\n",
      "\u001b[32mNumber of files:              1\u001b[0m\n",
      "\u001b[32mTotal number of samples:      747\u001b[0m\n",
      "\u001b[32mShape of target:              (747, 1, 1)\u001b[0m\n",
      "\u001b[32mShape of in1:                 (747, 5, 1)\u001b[0m\n",
      "\u001b[32m================================================================================\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "df = pd.DataFrame({\n",
    "    'time': np.array([1.0,1.5,2.0,4.0,4.5,5.0,7.0,7.5,8.0,8.5], dtype=np.float32),\n",
    "    'in1': np.linspace(1,10,10, dtype=np.float32),\n",
    "    'target': np.linspace(1,10,10, dtype=np.float32)})\n",
    "\n",
    "model.loadData(name='dataset_resampled', source=df, resampling=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get Samples from the Dataset\n",
    "\n",
    "Once a dataset is loaded, you can use it to get random samples from the dataset. Set the 'window' argument to choose the number of samples to get from the specific dataset, and 'index' for selecting a specific time instant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-05-26T15:13:13.116040Z",
     "start_time": "2025-05-26T15:13:13.110843Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'out': [49.65475082397461,\n",
       "  52.050872802734375,\n",
       "  54.44699478149414,\n",
       "  56.843116760253906,\n",
       "  59.23923873901367]}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample = model.getSamples(dataset='dataset_4', window=5)\n",
    "model(sample, sampled=True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv (3.11.4)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}