
This listing selects the best features from the 1000 available columns in a given dataset.

The first three columns are dropped because they contain no useful data.

The dataset is huge, so it is read in 25 chunks.

Please focus on threshold, feature_importance, and selected_features.

Can anyone tell me whether I am doing this correctly? I am not sure.

import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.preprocessing import StandardScaler
from collections import Counter

# Path to the dataset
file_path = 'dataset.csv'
output_file_path = 'feature_selection_output.txt'

# Chunk size calculation
total_rows = 1000000
num_chunks = 25
chunk_size = total_rows // num_chunks

# Check for GPU availability
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
if len(tf.config.experimental.list_physical_devices('GPU')) > 0:
    print("Using GPU")
else:
    print("No GPU found, using CPU")

# Function to process a chunk
def process_chunk(chunk):
    # Fill empty values with zeros
    chunk.fillna(0, inplace=True)

    # Remove the first three columns
    chunk = chunk.iloc[:, 3:]

    # Filter rows based on the first column's value
    chunk = chunk[chunk.iloc[:, 0].astype(str).str.contains('A|B|C')]

    # Separate target and features
    y = chunk.iloc[:, 0]
    X = chunk.iloc[:, 1:]

    # One-hot encode the target column
    y = pd.get_dummies(y)

    return X, y

# Placeholder for selected features count
feature_counter = Counter()

for i, chunk in enumerate(pd.read_csv(file_path, chunksize=chunk_size)):
    print(f"Processing chunk {i+1}/{num_chunks}")
    X, y = process_chunk(chunk)

    # Normalize features
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    # Build and train a simple neural network model
    model = Sequential()
    model.add(Dense(128, input_dim=X.shape[1], activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(y.shape[1], activation='softmax'))

    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X, y, epochs=10, batch_size=64, verbose=0)

    # Extract feature importance from the model weights
    weights = model.layers[0].get_weights()[0]
    feature_importance = np.mean(np.abs(weights), axis=1)

    # Select features based on importance
    threshold = np.median(feature_importance)
    selected_features = np.where(feature_importance > threshold)[0]

    # Update feature counter
    feature_counter.update(selected_features)

    # Clear the Keras session to free memory
    tf.keras.backend.clear_session()

# Get the most frequently selected features and sort by frequency in descending order
sorted_features = feature_counter.most_common()

# Save the selected features and their frequencies to a file
with open(output_file_path, 'w') as f:
    for feature, count in sorted_features:
        f.write(f"{feature}: {count}\n")

print(f"Selected features saved to {output_file_path}")

1 Answer


The predominant reason for performing feature selection is to reduce redundancy in your dataset. You can think of redundancy as features that are highly correlated with one another; because of that correlation, you do not need to train your model on all of them. One approach to estimating this is to use a dimensionality reduction technique such as Principal Component Analysis (PCA).
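
As a quick way to see that redundancy directly, you can inspect pairwise correlations before reducing dimensionality. The sketch below assumes the same layout as in your question (first three columns dropped, target in the next column); the 0.95 cutoff is an arbitrary choice, not something your setup prescribes:

import pandas as pd
import numpy as np

# Load the data and drop the first three (unused) columns, as in the question
df = pd.read_csv('dataset.csv').iloc[:, 3:]
X = df.iloc[:, 1:]  # features only; column 0 holds the target

# Absolute pairwise correlations between features
corr = X.corr().abs()

# Keep the upper triangle only, so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Columns that are near-duplicates of some other column (0.95 is an arbitrary cutoff)
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(f"{len(redundant)} highly correlated columns could be dropped")

PCA takes this one step further: rather than dropping one column from each correlated pair, it combines the correlated columns into a smaller set of uncorrelated components, as in the snippet below.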

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Load dataset
dataset = pd.read_csv('path_to_dataset.csv')

# Assume the first three columns are not needed
dataset = dataset.iloc[:, 3:]

# Separate target and features
y = dataset.iloc[:, 0]
X = dataset.iloc[:, 1:]

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA to reduce to 5 components
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)

# One-hot encode the target if it's categorical
y = pd.get_dummies(y)

# Build the neural network model
model = Sequential()
model.add(Dense(128, input_dim=5, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(y.shape[1], activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_pca, y, epochs=10, batch_size=64, verbose=1)

In the example above the original feature set is reduced to five principal components, each of which is a linear combination of the initial features.
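
If what you ultimately want is per-feature importance (closer to the threshold / feature_importance / selected_features logic in your code) rather than transformed components, one option is to weight the absolute loadings in pca.components_ by pca.explained_variance_ratio_ and sum them per original column. Both attributes are standard scikit-learn, but this particular scoring scheme is just one reasonable choice, sketched here as a continuation of the snippet above:

import numpy as np

# pca.components_ has shape (n_components, n_original_features)
loadings = np.abs(pca.components_)

# Weight each component's loadings by the fraction of variance it explains,
# then sum per original column to get a rough importance score
importance = (loadings * pca.explained_variance_ratio_[:, None]).sum(axis=0)

# Indices of the original columns that contribute most to the retained components
top_features = np.argsort(importance)[::-1][:20]
print(top_features)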

  • PCA is for clustering. I am using classification.
    – user366312
    Commented Jul 16 at 10:58
  • Besides, this question is not about dimensionality reduction; it is about feature selection. Those are related but different concepts.
    – user366312
    Commented Jul 16 at 11:00
  • PCA is a dimensionality reduction technique. Dimensionality reduction is considered part of feature selection, and the explainability components of dimensionality reduction techniques can also be used for feature importance. "The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets [...]" Source: scikit-learn.org/stable/modules/feature_selection.html
    – hH1sG0n3
    Commented Jul 16 at 12:58
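
For completeness, here is a minimal sketch of the sklearn.feature_selection module mentioned in the last comment. SelectKBest with the ANOVA F-score and k=50 are assumed choices, not something either post prescribes; X and y are the features and the (non-encoded) class labels from the question:

from sklearn.feature_selection import SelectKBest, f_classif

# Score each column against the class labels and keep the 50 highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=50)
X_selected = selector.fit_transform(X, y)

# Indices of the columns the selector kept
print(selector.get_support(indices=True))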
