Questions tagged [apache-tika]
The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.
1,300
questions
-1
votes
0
answers
18
views
How do I set up Tika to detect corrupted PDF files?
I am testing Tika's ability to determine whether a file is corrupted, but so far
I have not been able to trigger the exceptions I would expect when I butcher
a PDF file to the level even ...
0
votes
0
answers
19
views
ParsingReader unable to read a file using the reader.read function
I am trying to extract the content of a file, using the Apache Tika ParsingReader to read it.
Now, when I am trying to extract content.
I was using Apache Tika version 2.3.0, where it was ...
1
vote
1
answer
52
views
How to extract ALT-Texts and Images from a PDF
I have a PDF that contains text and images. All images have an ALT-Text for accessibility readers.
Can someone tell me how I can extract value pairs <BufferedImage, String>, where BufferedImage ...
0
votes
0
answers
33
views
Apache Tika: Getting ArrayIndexOutOfBoundsException: Index 10 out of bounds for length 10 in metadata.names()
We are using Apache Tika for content extraction from text/PDF files and get this error when calling metadata.names().
It does not happen consistently, but in around 2-3% of cases where this ...
0
votes
0
answers
49
views
Tika (2.x) unable to detect CSV correctly for Excel output format (semicolon separated)
I'm trying to integrate Tika to detect file types in a content management system.
Unfortunately, it fails to detect the CSV format.
I've inspected it in detail, and it seems it can detect CSV if the separator ...
0
votes
0
answers
19
views
Override Tika default mimetype for MPEG Transport Stream (TS) files
We have Tika detecting MPEG Transport Stream *.ts files as application/octet-stream.
This is because tika-mimetypes.xml does not associate video/mp2t with *.ts files.
When I attempt to override this ...
0
votes
0
answers
12
views
Index Page Content getting jumbled by Tika for docx file
Text on the index page is getting jumbled: the page number appears at the start of the line in a docx file.
When parsing the docx file, the index page text has the page number at the front of the line ...
0
votes
0
answers
89
views
Tika unable to detect and parse a non-UTF-8 encoded CSV file containing non-ASCII characters
I have a CSV file saved with the .csv extension that contains non-ASCII characters, but Tika is not able to detect the MIME type of the file and thus assigns it the MIME type application/octet-stream, and during ...
0
votes
0
answers
122
views
Get the file extension from byte array
Below is the code snippet to get the file extension:
public static Map<String, String> getImageType(byte[] imageContent) throws Exception {
Map<String, String> result = new HashMap&...
0
votes
1
answer
86
views
How to get the HTML formatting from Excel sheet cells in Java
We are trying to get the HTML formatting of Excel cells (basically the cell text in HTML format), including bullet points, italics, new lines, highlights, hyperlinks, etc.
We are using Apache POI but it ...
0
votes
0
answers
29
views
Disable Image caption on Tika Server
OCR capabilities from the Tika Server include Image Caption descriptions as part of extracting content. Whenever an MS Office file is sent to Tika the response includes the resulting image analysis.
...
0
votes
1
answer
72
views
Date Format Tika output from XLSX
I have an XLSX file with this content
I have downloaded tika-app for testing:
java -jar tika-app-2.9.2.jar --metadata test.xlsx
Content-Length: 9217
Content-Type: application/vnd.openxmlformats-...
0
votes
0
answers
124
views
Passing contents of a pdf file to a pyspark dataframe
I am trying to add text extracted from a PDF file with Tika as a new column in a PySpark DataFrame, but I can't get the text into the new column.
I have two functions, the first one extracts the text ...
0
votes
1
answer
76
views
Is it possible to deploy Apache Tika in a single JAR using Maven and JavaFX?
I've already finished a desktop application using IntelliJ, OpenJDK 17, and JavaFX 17 in a project built with Maven.
The project runs well inside the IDE; the problems begin when I try to deploy the app in a ...
0
votes
1
answer
290
views
Apache Tika - NoSuchMethodError TarArchiveInputStream.getNextEntry()
I am using versions:
SpringBoot: 3.2.4
Java: JDK 17
POM as used in the docs and based on my dependency tree:
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>...