Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

-1 votes
0 answers
18 views

How do I set up Tika to detect corrupted PDF files?

I am trying out Tika's ability to determine whether a file is corrupted, and so far I have not been able to trigger the exceptions I would expect when I butcher a PDF file to the level even ...
JB007
  • 121
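For readers with the same goal, a cheap structural pre-check can run before any parser is involved. The sketch below is a hypothetical helper (`PdfSanityCheck` is not part of Tika) and deliberately strict: it assumes the `%PDF-` header sits at byte 0 and that an `%%EOF` marker appears within the last 1024 bytes. Passing it does not prove a file is valid, and some readers tolerate files that fail it.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not a Tika API: cheap structural checks on raw PDF bytes. */
final class PdfSanityCheck {

    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** True when the data starts with "%PDF-" and has "%%EOF" in its last 1024 bytes. */
    static boolean looksStructurallyIntact(byte[] data) {
        if (data == null || data.length < HEADER.length + EOF_MARKER.length) {
            return false;
        }
        for (int i = 0; i < HEADER.length; i++) {
            if (data[i] != HEADER[i]) {
                return false;
            }
        }
        // Only scan the tail of the file for the end-of-file marker.
        int from = Math.max(HEADER.length, data.length - 1024);
        return indexOf(data, EOF_MARKER, from) >= 0;
    }

    /** Naive byte-sequence search starting at index from; returns -1 when absent. */
    private static int indexOf(byte[] haystack, byte[] needle, int from) {
        for (int i = from; i + needle.length <= haystack.length; i++) {
            boolean match = true;
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }
}
```

A file that was truncated mid-stream fails this check, while damage in the middle of the file goes unnoticed, so it only complements a real parse.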
0 votes
0 answers
19 views

ParsingReader unable to read a file using reader.read function

I am trying to extract the content of a file, using Apache Tika's ParsingReader to read it. I was using Apache Tika version 2.3.0; there it was ...
sushree soumya lenka
1 vote
1 answer
52 views

How to extract ALT-Texts and Images from a PDF

I have a PDF that contains text and images. All images have an ALT-Text for accessibility readers. Can someone tell me how I can extract Value Pairs <BufferedImage, String>, where BufferedImage ...
Tristate
  • 1,691
0 votes
0 answers
33 views

Apache Tika: Getting ArrayIndexOutOfBoundsException: Index 10 out of bounds for length 10 in metadata.names()

We are using Apache Tika for content extraction from text/PDF files and get this error when calling metadata.names(). It does not happen consistently, only in around 2-3% of cases where this ...
user3212707
0 votes
0 answers
49 views

Tika (2.x) unable to detect CSV correctly for Excel output format (semicolon separated)

I'm trying to integrate Tika to detect file types in a content management system. Unfortunately, it fails to detect the CSV format. I've inspected this in detail, and it seems it can detect CSV if separator ...
Cjxcz Odjcayrwl
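To illustrate why the separator matters for detection, here is a minimal delimiter-sniffing sketch. It is not how Tika detects CSV; `DelimiterSniffer` and its candidate list are assumptions made for illustration. It picks the candidate character that occurs the same non-zero number of times on every sampled line, and it ignores quoting entirely.

```java
import java.util.List;

/** Hypothetical helper, not a Tika API: guesses a CSV delimiter from sample lines. */
final class DelimiterSniffer {

    // Candidate separators; Excel in many locales writes ';' instead of ','.
    private static final char[] CANDIDATES = {',', ';', '\t', '|'};

    /**
     * Returns the candidate that appears the same non-zero number of times on
     * every line (highest count wins, earlier candidate wins ties), or '\0'.
     */
    static char sniff(List<String> lines) {
        char best = '\0';
        int bestCount = 0;
        for (char candidate : CANDIDATES) {
            int expected = -1;
            boolean consistent = !lines.isEmpty();
            for (String line : lines) {
                int n = count(line, candidate);
                if (expected < 0) {
                    expected = n;
                } else if (n != expected) {
                    consistent = false;
                    break;
                }
            }
            if (consistent && expected > bestCount) {
                best = candidate;
                bestCount = expected;
            }
        }
        return best;
    }

    /** Counts occurrences of c in s; quoted fields are not treated specially. */
    private static int count(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }
}
```

Because quoted fields and decimal commas are not handled, a sniffed result is best treated as a hint to pass along with the file rather than as a final answer.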
0 votes
0 answers
19 views

Override Tika default mimetype for MPEG Transport Stream (TS) files

We have Tika detecting MPEG Transport Stream *.ts files as application/octet-stream. This is due to tika-mimetypes.xml not associating video/mp2t with *.ts files. When I attempt to override this ...
B Randall
  • 172
0 votes
0 answers
12 views

Index Page Content getting jumbled by Tika for docx file

Text on the index page is getting jumbled when parsing a docx file: the page number comes at the start of each line. For the index page text, the page number is coming at the front of the line ...
Anurag Anand
0 votes
0 answers
89 views

Tika unable to detect and parse a non-UTF-8 encoded CSV file containing non-ASCII characters

I have a file saved in .csv format that contains non-ASCII characters, but Tika is not able to detect the MIME type of the file and thus assigns application/octet-stream as the MIME type, and during ...
Anurag Anand
0 votes
0 answers
122 views

Get the file extension from byte array

Below is the code snippet to get the file extension: public static Map<String, String> getImageType(byte[] imageContent) throws Exception { Map<String, String> result = new HashMap&...
Ajit Singh
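As background on what byte-based type detection involves, the following standard-library sketch maps a few well-known magic-byte prefixes to an extension. It is not the asker's code and not Tika's detector; `MagicBytes`, its small signature table, and the `"bin"` fallback are all illustrative assumptions. A real detector such as Tika covers far more formats.

```java
/** Hypothetical helper, not a Tika API: maps a few magic-byte prefixes to extensions. */
final class MagicBytes {

    /** Returns "png", "jpg", "gif" or "pdf" for known signatures, otherwise "bin". */
    static String guessExtension(byte[] data) {
        // PNG signature: 89 50 4E 47 0D 0A 1A 0A
        if (startsWith(data, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) {
            return "png";
        }
        // JPEG files start with the SOI marker FF D8 followed by FF.
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) {
            return "jpg";
        }
        // "GIF8" covers both GIF87a and GIF89a.
        if (startsWith(data, 0x47, 0x49, 0x46, 0x38)) {
            return "gif";
        }
        // "%PDF-"
        if (startsWith(data, 0x25, 0x50, 0x44, 0x46, 0x2D)) {
            return "pdf";
        }
        return "bin"; // assumed fallback for unknown content
    }

    /** Compares the leading bytes of data (as unsigned values) with prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The `& 0xFF` mask matters because Java bytes are signed, so a raw comparison against values such as `0x89` or `0xFF` would never match.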
0 votes
1 answer
86 views

How to get the html formatting from Excel Sheet Cells in Java

We are trying to get the HTML formatting from an Excel cell (basically the cell text in HTML format), including bullet points, italics, new lines, highlights, hyperlinks etc. We are using Apache POI but it ...
gbhati
  • 543
0 votes
0 answers
29 views

Disable Image caption on Tika Server

OCR capabilities from the Tika Server include Image Caption descriptions as part of extracting content. Whenever an MS Office file is sent to Tika the response includes the resulting image analysis. ...
10010110
0 votes
1 answer
72 views

Date Format Tika output from XLSX

I have an XLSX file with this content. I have downloaded tika-app for testing: java -jar tika-app-2.9.2.jar --metadata test.xlsx Content-Length: 9217 Content-Type: application/vnd.openxmlformats-...
Daniele Grillo
0 votes
0 answers
124 views

Passing contents of a pdf file to a pyspark dataframe

I am trying to add a new column to a pyspark dataframe from PDF text extracted with tika, but I can't put the text into this new column. I have two functions, the first one extracts the text ...
hirampa
0 votes
1 answer
76 views

Is it possible to deploy apache-tika in a single jar using maven and javafx?

I've already finished a desktop application using IntelliJ, OpenJDK17 and JavaFX17 in a project built with Maven. The project runs well inside the IDE; the problems begin when I try to deploy the app in a ...
CFJR Corporativo
0 votes
1 answer
290 views

Apache Tika - NoSuchMethodError TarArchiveInputStream.getNextEntry()

I am using these versions: Spring Boot 3.2.4, Java JDK 17. The POM is set up as in the docs and based on my dependency tree: <dependency> <groupId>org.apache.tika</groupId> <artifactId>...
Marek Bernád
