Bug 2348 - Java source files misdetected as Perl
Product: Sisyphus
Version: unstable
Platform: all Linux
Importance: P5 normal
Reported: 2003-03-10 13:08 by
Modified: 2005-07-13 15:45 (History)

A file from the Jakarta log4j project (8.81 KB, text/plain)
2004-03-07 14:18, Mikhail Zabaluev

Description From 2003-03-10 13:08:18
The vast majority of Java source files contain the keyword "package".
file treats this word as an indication of a Perl package.


I believe the following line in the magic file is to blame:

0       string          package         Perl5 module source text
------- Comment #1 From 2004-03-01 19:16:45 -------
Could you attach an example, please?
------- Comment #2 From 2004-03-07 14:18:58 -------
Created an attachment (id=356)
A file from the Jakarta log4j project

file recognizes it as "Perl5 module source text".
------- Comment #3 From 2004-03-08 13:06:46 -------
If you comment out this "Perl5 module source text" rule,
file will misdetect Perl package files as "ASCII Java program text".
------- Comment #4 From 2004-03-08 14:30:42 -------
There should be a more elaborate heuristic.

Components of a Java package name are delimited with dots. Since
widely-available Java package names should be global by convention (e.g.
org.altlinux.oursoftware.ourpackage), single-component package names are
unlikely. In Perl, package names are delimited with ':' (or "'", but that's
obscure), and single-component package names are common.
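
To make the heuristic above concrete, here is a rough Python sketch. The
two patterns and the name guess_language are made up for illustration; they
are not magic-file syntax and not the regexes proposed in this comment.

```python
import re

# Hypothetical patterns, for illustration only.
# Java: at least two identifiers joined by dots, e.g. org.apache.log4j
JAVA_PACKAGE = re.compile(
    r"^\s*package\s+[A-Za-z_]\w*(\s*\.\s*[A-Za-z_]\w*)+\s*;"
)
# Perl: one or more identifiers joined by '::' (or the obscure "'")
PERL_PACKAGE = re.compile(
    r"^\s*package\s+[A-Za-z_]\w*((::|')[A-Za-z_]\w*)*\s*;"
)


def guess_language(source: str) -> str:
    """Return 'java', 'perl' or 'unknown' from the first package line found."""
    for line in source.splitlines():
        if JAVA_PACKAGE.match(line):
            return "java"
        if PERL_PACKAGE.match(line):
            return "perl"
    return "unknown"
```

Note that a single-component declaration such as "package Foo;" is guessed
as Perl, which is exactly the trade-off described above.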

I haven't had time to master the syntax of magic files, so I'll put it down
as regexes.

Here's a pattern to match Java files:


If that doesn't match, the following matches Perl modules:

------- Comment #5 From 2004-03-09 01:13:28 -------
The matcher in libmagic has a string length limit (32), so your regex is too long.
This one line seems to be enough: 
0	regex	\^package[\ \	]+[A-Za-z][^.;]*;		Perl5 module source text 
Please create an empty magic_file, add this line to it, and test with "file -m magic_file".
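
For a quick look at which package lines the pattern accepts, here is a
small Python check. It assumes the magic-file escapes in the rule (\^, "\ "
and backslash-tab) reduce to a plain ^, space and tab, and it tests one line
at a time with Python's re module, which is not the matcher libmagic uses.
The name matches_perl_rule is made up.

```python
import re

# The pattern from the rule above with the magic-file escaping removed
# (an assumption about how the escapes are interpreted).
PERL_RULE = re.compile(r"^package[ \t]+[A-Za-z][^.;]*;")


def matches_perl_rule(line: str) -> bool:
    """True if a single line would be accepted by the pattern."""
    return PERL_RULE.match(line) is not None


if __name__ == "__main__":
    for sample in ("package XML::LibXML;",
                   "package Foo;",
                   "package org.apache.log4j;"):
        print(sample, "->", matches_perl_rule(sample))
```

The [^.;]* part is what rejects dotted Java names: it cannot cross a dot, so
no ";" can follow it on such a line.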
------- Comment #6 From 2004-03-09 11:03:50 -------
Tested with the pattern as suggested.
The regex does not misdetect Java files.
But .pm files that do contain "package" are all detected as "ASCII English text"
or "ASCII C++ program text". Tested on XML-LibXML-1.56.
------- Comment #7 From 2004-03-09 14:46:59 -------
Ok, fixed in file-4.07-alt3.