
I have a program that uses the TensorFlow Boston housing data. This is my first dive into deep learning. In classical ML you can just call train_test_split to assign the data after reading a CSV.

Most of the Jupyter notebooks I have seen that use TensorFlow do:

from tensorflow.keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

How would you do this part of the code if one has a CSV or .npz file and doesn't want to use the TF data source?

  • The overarching language you are writing in is Python, so one would search for how to read a CSV file in Python. There is a csv library, see here, that is a standard library built into Python, meaning it doesn't need to be installed separately. Note the title along the very top of the link above: "3.11.3 Documentation » The Python Standard Library » File Formats » csv". Often others will choose to install and use pandas, the Python Data Analysis Library, as it reads CSV files and much more and so is familiar.
    – Wayne
    Commented May 22, 2023 at 2:38
  • The .npz format is specific to the NumPy library, as can be seen here, so you would install and use that. All of that can easily be found in a search engine; for example, the last information came from searching "npz data format". Searching "", leads to here, among other options, so clearly you could use b = np.load('filename.npz'). Please read How do I ask a good question?, ...
    – Wayne
    Commented May 22, 2023 at 2:43
  • <continued> in particular the section near the top entitled 'Search, and research'. To install easily so that the extra packages will be seen by the kernel your notebook is using, in a new cell in your notebook you can enter and run %pip install numpy or %conda install -c anaconda numpy (based on here, found by searching 'anaconda numpy'), depending on your main package manager. Always use conda/Anaconda primarily if you are using Anaconda/conda, and only resort to pip in cases where there is no conda recipe. See ...
    – Wayne
    Commented May 22, 2023 at 2:49
  • <continued> here for more about the modern magic install commands for use in modern Jupyter. The magic commands ensure that the installation occurs in the environment where the kernel backing the active notebook is running.
    – Wayne
    Commented May 22, 2023 at 2:50
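
Putting the comments above together, here is a minimal, self-contained sketch of reading both formats. The file names, column names, and values are made up for illustration; point the open/load calls at your own files in practice:

```python
import csv
import numpy as np

# Create hypothetical stand-in files so the loading steps below are
# runnable; with real data these files would already exist on disk.
with open("housing.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["crim", "rm", "medv"])
    writer.writerow([0.006, 6.57, 24.0])
    writer.writerow([0.027, 6.42, 21.6])
np.savez("housing.npz", data=np.arange(6.0).reshape(3, 2),
         targets=np.array([24.0, 21.6, 34.7]))

# Reading the CSV with the standard-library csv module:
# rows come back as lists of strings.
with open("housing.csv", newline="") as f:
    rows = list(csv.reader(f))
header, records = rows[0], rows[1:]

# Reading the .npz file: the result behaves like a dict keyed
# by the names that were passed to np.savez.
b = np.load("housing.npz")
data, targets = b["data"], b["targets"]
```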

1 Answer


CSV

From 'In-memory data' in the TensorFlow documentation there is a clear example for a CSV that uses pandas, which I touched on in my comments:

"For any small CSV dataset the simplest way to train a TensorFlow model on it is to load it into memory as a pandas Dataframe or a NumPy array. A relatively simple example is the abalone dataset. The dataset is small. All the input features are all limited-range floating point values. Here is how to download the data into a pandas DataFrame:"
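
As a hedged sketch of that approach applied to your own file (the file name and column names here are invented stand-ins; a tiny CSV is written first so the example runs on its own):

```python
import pandas as pd
from pathlib import Path

# Write a tiny stand-in CSV; in practice you would already
# have your own file on disk.
Path("my_housing.csv").write_text(
    "crim,rm,medv\n0.006,6.57,24.0\n0.027,6.42,21.6\n"
)

df = pd.read_csv("my_housing.csv")
# pop removes the target column from the DataFrame,
# leaving only the input features behind.
targets = df.pop("medv")
features = df
```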

You should then be able to adapt the example to your CSV-formatted data.
Some related resources that may help:

The key thing to look out for is how to split the data. A pandas DataFrame can be shuffled with its sample method (e.g. df.sample(frac=1)), which often gets used to separate out some of the data for later testing. Or you can use the convenient train_test_split function from the scikit-learn library, which can handle pandas DataFrames as well as NumPy arrays. See 'How to Split a Dataframe into Train and Test Set with Python' under 'Splitting and saving'.

Quick summary of that from the Daily Python Tip post of January 12, 2018:

"Split a pandas DataFrame into two random subsets:"

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
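
Applied to a DataFrame like the one above, and then separating the targets from the features after the split (the column names are illustrative, and random_state is added to make the shuffle reproducible):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy DataFrame standing in for one loaded with pd.read_csv.
df = pd.DataFrame({
    "crim": [0.006, 0.027, 0.032, 0.069],
    "rm":   [6.57, 6.42, 7.18, 7.15],
    "medv": [24.0, 21.6, 34.7, 33.4],
})

# test_size=0.25 holds out a quarter of the rows for testing;
# random_state makes the random shuffle reproducible.
train, test = train_test_split(df, test_size=0.25, random_state=0)

# pop removes the target column, leaving only the features.
train, test = train.copy(), test.copy()
train_targets = train.pop("medv")
test_targets = test.pop("medv")
```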

NPZ

For .npz-formatted data, see the TensorFlow documentation here and this top answer to 'How can I import the MNIST dataset that has been manually downloaded?', and adapt the examples to your data.

NumPy arrays can also be handled by the train_test_split function from scikit-learn, as discussed above in the CSV section.
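
A sketch of that, using toy arrays in place of ones read with np.load (the shapes and values are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for ones loaded from a .npz file,
# e.g. b = np.load("file.npz"); data, targets = b["data"], b["targets"].
data = np.arange(20, dtype=float).reshape(10, 2)
targets = np.arange(10, dtype=float)

# Passing both arrays splits them in tandem, so each row of
# data stays aligned with its corresponding target.
(train_data, test_data,
 train_targets, test_targets) = train_test_split(
    data, targets, test_size=0.2, random_state=0)
```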
