5

I created a .csv file using Excel and wrote the following code using Apache Tika:

public static boolean checkThatMimeTypeIsCsv(InputStream inputStream) throws IOException {
    BufferedInputStream bis = new BufferedInputStream(inputStream);
    AutoDetectParser parser = new AutoDetectParser();
    Detector detector = parser.getDetector();
    Metadata md = new Metadata();
    MediaType mediaType = detector.detect(bis, md);
    return "text/csv".equals(mediaType.toString());
}

public static void main(String[] args) throws IOException {
    System.out.println(checkThatMimeTypeIsCsv(new FileInputStream("Data.csv")));
}

But it returns false.

Is Tika really that bad, or did I miss something?

6
  • You're losing the file name, creating objects you don't need, and generally being overly complicated! Why not just do Tika.detect(File)?
    – Gagravarr
    Commented Oct 26, 2017 at 18:24
  • 3
    @Gagravarr, System.out.println(new Tika().detect(inputStream)); returns text/plain
    Commented Oct 26, 2017 at 20:11
  • 2
    @Gagravarr I don't want to provide the name, because if I rename foo.txt to foo.csv, Tika thinks that it is a CSV
    Commented Oct 26, 2017 at 20:22
  • 2
    There's no way to tell the difference between a CSV and a TXT other than by filename though!
    – Gagravarr
    Commented Oct 27, 2017 at 12:39
  • 1
    @jumping_monkey I've added an answer with the current Apache Tika version. Hope it helps
    – dcalap
    Commented Jan 9, 2023 at 11:53

2 Answers

2

Try this...

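import java.io.File;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public static String checkThatMimeTypeIsCsv(String fileName) throws Exception {
    File sourceFile = new File(fileName);
    DefaultDetector fileDetector = new DefaultDetector();
    Metadata metadata = new Metadata();
    // Pass the file name as a hint so the detector can use the extension
    metadata.set(Metadata.RESOURCE_NAME_KEY, sourceFile.getName());
    // try-with-resources closes the stream once detection is done
    try (TikaInputStream fileStream = TikaInputStream.get(sourceFile)) {
        MediaType mediaType = fileDetector.detect(fileStream, metadata);
        String fileType = mediaType.toString();
        System.out.println(fileType);
        return fileType;
    }
}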
2

Here is an example of how to do it with Apache Tika 2.6.0 (the current version at the time of writing):

// Read a CSV file.       
File file = new File("src/test/resources/testcsv/entities.csv");
String csvContent = Files.readString(file.toPath());

InputStream is = new FileInputStream(file);
BufferedInputStream bufferedInputStream = new BufferedInputStream(is);

// Prepare Tika data for detection
Metadata metadata = new Metadata();
metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, file.getName());

String detectedMimeType = MimeTypes.getDefaultMimeTypes().detect(bufferedInputStream, metadata).toString();
assertEquals("text/csv", detectedMimeType);

For a file that is not really a CSV but has a faked .csv extension:

// Read a file that is not a CSV. I downloaded https://upload.wikimedia.org/wikipedia/commons/7/74/Apache_Tika_Logo.svg and renamed it with a '.csv' extension for the test
File file = new File("src/test/resources/testcsv/Apache_Tika_Logo.csv");
String csvContent = Files.readString(file.toPath());

InputStream is = new FileInputStream(file);
BufferedInputStream bufferedInputStream = new BufferedInputStream(is);

// Prepare Tika data for detection
Metadata metadata = new Metadata();
metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, file.getName());

String detectedMimeType = MimeTypes.getDefaultMimeTypes().detect(bufferedInputStream, metadata).toString();
assertNotEquals("text/csv", detectedMimeType);

For this example file, the value of detectedMimeType is image/svg+xml.
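
The SVG case works because that file has recognisable content. As the comments on the question point out, a plain text file renamed to .csv has nothing comparable, so the name hint wins and it is still reported as a CSV. If that matters, the content has to be checked separately. Below is a minimal sketch of one such check. It is not part of Tika: CsvHeuristic and looksLikeCsv are hypothetical names, and it assumes a comma delimiter and no quoted fields containing commas or line breaks.

```java
import java.util.List;

public class CsvHeuristic {

    // Hypothetical helper, not a Tika API.
    // Returns true when there is at least one non-empty line and every
    // non-empty line contains the same, non-zero number of commas.
    // Assumption: comma delimiter, no quoted fields with embedded commas.
    public static boolean looksLikeCsv(List<String> lines) {
        int expected = -1; // comma count of the first non-empty line
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            int commas = (int) line.chars().filter(c -> c == ',').count();
            if (commas == 0) {
                return false; // a line without any delimiter
            }
            if (expected == -1) {
                expected = commas;
            } else if (commas != expected) {
                return false; // inconsistent column count
            }
        }
        return expected != -1;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCsv(List.of("id,name", "1,Alice", "2,Bob")));        // true
        System.out.println(looksLikeCsv(List.of("Hello, world", "just some prose")));    // false
    }
}
```

This only complements the Tika detection above. Ordinary prose that happens to have one comma on every line would still pass, so treat the result as a hint rather than a guarantee.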
