From: Reuben Thomas Date: Thu, 6 Nov 2008 21:17:45 +0000 (+0000) Subject: This check-in mostly fixes a bunch of problems with MIME checks. It's X-Git-Tag: FILE5_05~287 X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=355108ab603f5b0947219c4d53772e65bb7cb1f7;p=file This check-in mostly fixes a bunch of problems with MIME checks. It's been tested, but there are probably still gotchas lurking. * Fix --mime, --mime-type and --mime-encoding under new scheme. * Return a precise ("utf-16le" or "utf-16be") MIME charset for UTF-16. * Fix error in comment caused by automatic indentation adding words! * Fix a printf with a non-literal format string. * Fix formatting and punctuation of help for "--apple". --- diff --git a/ChangeLog b/ChangeLog index c2e2213a..97e1d1f2 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,8 +1,26 @@ +2008-11-06 23:00 Reuben Thomas + + * Fix --mime, --mime-type and --mime-encoding under new scheme. + + * Rename "ascii" to "text" and add "encoding" test. + + * Return a precise ("utf-16le" or "utf-16be") MIME charset for + UTF-16. + + * Fix error in comment caused by automatic indentation adding + words! + 2008-11-06 10:35 Christos Zoulas * use memchr instead of strchr because the string might not be NUL terminated (Scott MacVicar) +2008-11-03 07:31 Reuben Thomas + + * Fix a printf with a non-literal format string. + + * Fix formatting and punctuation of help for "--apple". + 2008-10-30 11:00 Reuben Thomas * Correct words counts in comments of struct magic. @@ -109,22 +127,22 @@ 2008-05-06 00:13 Robert Byrnes - * src/Makefile.am: + * src/Makefile.am: Ensure that getopt_long and [v]asprintf are included in libmagic, as needed. Remove unnecessary EXTRA_DIST. - * src/Makefile.in: + * src/Makefile.in: Rerun automake. - * src/vasprintf.c (dispatch): + * src/vasprintf.c (dispatch): Fix variable precision bug: be sure to step past '*'. - * src/vasprintf.c (core): + * src/vasprintf.c (core): Remove unreachable code. 
- * src/apprentice.c (set_test_type): + * src/apprentice.c (set_test_type): Add cast to avoid compiler warning. 2008-04-22 23:45 Christos Zoulas @@ -135,12 +153,12 @@ 2008-04-04 11:00 Christos Zoulas - * >= <= is not supported, so fix the magic and warn about it. + * >= <= is not supported, so fix the magic and warn about it. reported by: Thien-Thi Nguyen 2008-03-27 16:16 Robert Byrnes - * src/readelf.c (donote): + * src/readelf.c (donote): ELF core file command name/line bug fixes and enhancements: Try larger offsets first to avoid false matches @@ -166,7 +184,7 @@ * Clarify UTF-8 BOM message (Reuben Thomas) * Add HTML comment to token list in names.h - + 2007-02-04 15:50 Christos Zoulas * Debian fixes (Reuben Thomas) @@ -206,7 +224,7 @@ 2007-10-28 20:48 Christos Zoulas - * float and double magic support (Behan Webster) + * float and double magic support (Behan Webster) 2007-10-28 20:48 Christos Zoulas @@ -253,7 +271,7 @@ be easily parsed: mimetype [charset=character-set] [encoding=encoding-mime-type] - Remove spurious extra text from some MIME type printouts + Remove spurious extra text from some MIME type printouts (mostly in is_tar). Fix one case where -i produced nothing at all (for a 1-byte file, @@ -283,7 +301,7 @@ 2007-03-15 10:51 Christos Zoulas * fix fortran and nroff reversed tests (Dmitry V. Levin) - + * fix exclude option (Dmitry V. Levin) 2007-02-08 17:30 Christos Zoulas @@ -302,7 +320,7 @@ * Add exclude flag. 2007-01-18 05:29 Anon Ymous - + * Move the "type" detection code from parse() into its own table driven routine. This avoids maintaining multiple lists in file.h. @@ -310,7 +328,7 @@ * Add an optional conditional field (ust before the type field). This code is wrapped in "#ifdef ENABLE_CONDITIONALS" as it is likely to go away. - + 2007-01-16 23:24 Anon Ymous * Fix an initialization bug in check_mem(). @@ -381,7 +399,7 @@ 2006-12-08 16:32 Christos Zoulas * store and print the line number of the magic - entry for debugging. + entry for debugging. 
* if the magic entry did not print anything, don't treat it as a match @@ -396,7 +414,7 @@ file_softmagic. 2006-11-25 13:35 Christos Zoulas - + * Don't store the current offset in the magic struct, because it needs to be restored and it was not done properly all the time. Bug @@ -486,7 +504,7 @@ * Look for note sections in non executables. 2005-09-20 13:33 Christos Zoulas - + * Don't print SVR4 Style in core files multiple times (Radek Vokál) @@ -497,9 +515,9 @@ 2005-08-18 09:53 Christos Zoulas * Remove erroreous mention of /etc/magic in the file man page - This is gentoo bug 101639. (Mike Frysinger) + This is gentoo bug 101639. (Mike Frysinger) - * Cross-compile support and detection (Mike Frysinger) + * Cross-compile support and detection (Mike Frysinger) 2005-08-12 10:17 Christos Zoulas @@ -531,20 +549,20 @@ * Avoid NULL pointer dereference in time conversion. 2005-03-06 00:00 Joerg Walter - + * Add indirect magic offset support, and search mode. 2005-01-12 00:00 Stepan Kasal - * src/ascmagic.c (file_ascmagic): Fix three bugs about text files: - If a CRLF text file happens to have CR at offset HOWMANY - 1 - (currently 0xffff), it should not be counted as CR line - terminator. - If a line has length exactly MAXLINELEN, it should not yet be - treated as a ``very long line'', as MAXLINELEN is ``longest sane - line length''. - With CRLF, the line length was not computed correctly, and even - lines of length MAXLINELEN - 1 were treated as ``very long''. + * src/ascmagic.c (file_ascmagic): Fix three bugs about text files: + If a CRLF text file happens to have CR at offset HOWMANY - 1 + (currently 0xffff), it should not be counted as CR line + terminator. + If a line has length exactly MAXLINELEN, it should not yet be + treated as a ``very long line'', as MAXLINELEN is ``longest sane + line length''. + With CRLF, the line length was not computed correctly, and even + lines of length MAXLINELEN - 1 were treated as ``very long''. 
2004-12-07 14:15 Christos Zoulas @@ -579,12 +597,12 @@ * Remove 3rd and 4th copyright clause; approved by Ian Darwin. - * Fix small memory leaks; caught by: Tamas Sarlos + * Fix small memory leaks; caught by: Tamas Sarlos 2004-07-24 16:33 Christos Zoulas - * magic.mime update Danny Milosavljevic + * magic.mime update Danny Milosavljevic * FreeBSD version update Oliver Eikemeier diff --git a/doc/file.man b/doc/file.man index 2a32b286..bd3be256 100644 --- a/doc/file.man +++ b/doc/file.man @@ -1,4 +1,4 @@ -.\" $File: file.man,v 1.77 2008/10/30 10:50:24 rrt Exp $ +.\" $File: file.man,v 1.78 2008/11/06 21:17:45 rrt Exp $ .Dd October 9, 2008 .Dt FILE __CSECTION__ .Os @@ -41,12 +41,12 @@ characters and is probably safe to read on an terminal), .Em executable (the file contains the result of compiling a program -in a form understandable to some +in a form understandable to some .Dv UNIX kernel or another), or .Em data -meaning anything else (data is usually +meaning anything else (data is usually .Sq binary or non-printable). Exceptions are well-known file formats (core files, tar archives) @@ -54,12 +54,12 @@ that are known to contain binary data. When modifying magic files or the program itself, make sure to .Em "preserve these keywords" . Users depend on knowing that all the readable files in a directory -have the word +have the word .Dq text printed. -Don't do as Berkeley did and change +Don't do as Berkeley did and change .Dq shell commands text -to +to .Dq shell script . .Pp The filesystem tests are based on examining the return from a @@ -78,16 +78,16 @@ The magic tests are used to check for files with data in particular fixed formats. The canonical example of this is a binary executable (compiled program) .Dv a.out -file, whose format is defined in +file, whose format is defined in .In elf.h , .In a.out.h and possibly .In exec.h in the standard include directory. 
-These files have a +These files have a .Sq "magic number" stored in a particular place -near the beginning of the file that tells the +near the beginning of the file that tells the .Dv UNIX operating system that the file is a binary executable, and which of several types thereof. The concept of a @@ -116,10 +116,10 @@ ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified -as +as .Dq text because they will be mostly readable on nearly any terminal; -UTF-16 and EBCDIC are only +UTF-16 and EBCDIC are only .Dq character data because, while they contain text, it is text that will require translation @@ -144,13 +144,13 @@ For example, the keyword .Em .br indicates that the file is most likely a .Xr troff 1 -input file, just as the keyword +input file, just as the keyword .Em struct indicates a C program. These tests are less reliable than the previous two groups, so they are performed last. The language test routines also test for some miscellany -(such as +(such as .Xr tar 1 archives). .Pp @@ -177,38 +177,39 @@ from the list of tests made to determine the file type. Valid test names are: .Bl -tag -width .It apptype -Check for .Dv EMX application type (only on EMX). -.It ascii -Check for various types of ascii files. +.It text +Various types of text files. +.It encoding +Different text encodings. +.It tokens +Looks for known tokens inside text files. .It cdf -Don't look for Compound Document Files. +Prints details of Compound Document Files. .It compress -Don't look for, or inside compressed files. +Checks for, and looks inside, compressed files. .It elf -Don't print elf details. +Prints ELF file details. .It soft -Don't consult magic files. +Consults magic files. .It tar -Don't examine tar files. -.It tokens -Don't look for known tokens inside ascii files. +Examines tar files. 
.El .It Fl f , -files-from Ar namefile -Read the names of the files to be examined from +Read the names of the files to be examined from .Ar namefile -(one per line) +(one per line) before the argument list. -Either +Either .Ar namefile or at least one filename argument must be present; -to test the standard input, use +to test the standard input, use .Sq - as a filename argument. .It Fl F , -separator Ar separator Use the specified string as the separator between the filename and the -file result returned. Defaults to +file result returned. Defaults to .Sq \&: . .It Fl h , -no-dereference option causes symlinks not to be followed @@ -345,47 +346,47 @@ options. .Sh STANDARDS CONFORMANCE This program is believed to exceed the System V Interface Definition of FILE(CMD), as near as one can determine from the vague language -contained therein. +contained therein. Its behavior is mostly compatible with the System V program of the same name. This version knows more magic, however, so it will produce -different (albeit more accurate) output in many cases. +different (albeit more accurate) output in many cases. .\" URL: http://www.opengroup.org/onlinepubs/009695399/utilities/file.html .Pp -The one significant difference +The one significant difference between this version and System V is that this version treats any white space as a delimiter, so that spaces in pattern strings must be escaped. For example, -.Bd -literal -offset indent +.Bd -literal -offset indent >10 string language impress\ (imPRESS data) .Ed .Pp in an existing magic file would have to be changed to -.Bd -literal -offset indent +.Bd -literal -offset indent >10 string language\e impress (imPRESS data) .Ed .Pp In addition, in this version, if a pattern string contains a backslash, it must be escaped. 
For example -.Bd -literal -offset indent +.Bd -literal -offset indent 0 string \ebegindata Andrew Toolkit document .Ed .Pp in an existing magic file would have to be changed to -.Bd -literal -offset indent +.Bd -literal -offset indent 0 string \e\ebegindata Andrew Toolkit document .Ed .Pp SunOS releases 3.2 and later from Sun Microsystems include a -.Nm +.Nm command derived from the System V one, but with some extensions. My version differs from Sun's only in minor ways. -It includes the extension of the +It includes the extension of the .Sq & operator, used as, for example, -.Bd -literal -offset indent +.Bd -literal -offset indent >16 long&0x7fffffff >0 not stripped .Ed .Sh MAGIC DIRECTORY @@ -393,7 +394,7 @@ The magic file entries have been collected from various sources, mainly USENET, and contributed by various authors. Christos Zoulas (address below) will collect additional or corrected magic file entries. -A consolidation of magic file entries +A consolidation of magic file entries will be distributed periodically. .Pp The order of entries in the magic file is significant. @@ -403,14 +404,14 @@ If your old .Nm command uses a magic file, keep the old magic file around for comparison purposes -(rename it to +(rename it to .Pa __MAGIC__.orig ). .Sh EXAMPLES -.Bd -literal -offset indent +.Bd -literal -offset indent $ file file.c file /dev/{wd0a,hda} file.c: C program text file: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), - dynamically linked (uses shared libs), stripped + dynamically linked (uses shared libs), stripped /dev/wd0a: block special (0/0) /dev/hda: block special (3/0) @@ -439,9 +440,9 @@ file: application/x-executable .Ed .Sh HISTORY -There has been a -.Nm -command in every +There has been a +.Nm +command in every .Dv UNIX since at least Research Version 4 (man page dated November, 1973). 
The System V version introduced one significant major change: @@ -464,7 +465,7 @@ Primary development and maintenance from 1990 to the present by Christos Zoulas (christos@astron.com). .Pp Altered by Chris Lowth, chris@lowth.com, 2000: -Handle the +Handle the .Fl i option to output mime type strings, using an alternative magic file and internal logic. @@ -510,10 +511,10 @@ files. The support for text files (primarily for programming languages) is simplistic, inefficient and requires recompilation to update. .Pp -The list of keywords in +The list of keywords in .Dv ascmagic probably belongs in the Magic file. -This could be done by using some keyword like +This could be done by using some keyword like .Sq * for the offset value. .Pp @@ -521,20 +522,20 @@ Complain about conflicts in the magic file entries. Make a rule that the magic entries sort based on file offset rather than position within the magic file? .Pp -The program should provide a way to give an estimate -of +The program should provide a way to give an estimate +of .Dq how good a guess is. -We end up removing guesses (e.g. -.Dq From\ +We end up removing guesses (e.g. +.Dq From\ as first 5 chars of file) because -they are not as good as other guesses (e.g. +they are not as good as other guesses (e.g. .Dq Newsgroups: versus .Dq Return-Path: ). Still, if the others don't pan out, it should be possible to use the -first guess. +first guess. .Pp This manual page, and particularly this section, is too long. 
.Sh RETURN CODE diff --git a/src/Makefile.am b/src/Makefile.am index 33a2d406..db78d96e 100644 --- a/src/Makefile.am +++ b/src/Makefile.am @@ -8,7 +8,7 @@ AM_CPPFLAGS = -DMAGIC='"$(MAGIC)"' AM_CFLAGS = @WARNINGS@ libmagic_la_SOURCES = magic.c apprentice.c softmagic.c ascmagic.c \ - compress.c is_tar.c readelf.c print.c fsmagic.c \ + encoding.c compress.c is_tar.c readelf.c print.c fsmagic.c \ funcs.c file.h names.h patchlevel.h readelf.h tar.h apptype.c \ file_opts.h elfclass.h mygetopt.h cdf.c cdf_time.c readcdf.c cdf.h libmagic_la_LDFLAGS = -no-undefined -version-info 1:0:0 diff --git a/src/ascmagic.c b/src/ascmagic.c index f07cb3fd..cc19d27b 100644 --- a/src/ascmagic.c +++ b/src/ascmagic.c @@ -2,7 +2,7 @@ * Copyright (c) Ian F. Darwin 1986-1995. * Software written by Ian F. Darwin and others; * maintained 1995-present by Christos Zoulas and others. - * + * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: @@ -12,7 +12,7 @@ * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. - * + * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE @@ -31,15 +31,12 @@ * * Extensively modified by Eric Fischer in July, 2000, * to handle character codes other than ASCII on a unified basis. - * - * Joerg Wunsch wrote the original support for 8-bit - * international characters, now subsumed into this file. 
*/ #include "file.h" #ifndef lint -FILE_RCSID("@(#)$File: ascmagic.c,v 1.69 2008/11/04 16:38:28 christos Exp $") +FILE_RCSID("@(#)$File: ascmagic.c,v 1.70 2008/11/06 21:17:45 rrt Exp $") #endif /* lint */ #include "magic.h" @@ -57,32 +54,63 @@ FILE_RCSID("@(#)$File: ascmagic.c,v 1.69 2008/11/04 16:38:28 christos Exp $") #define ISSPC(x) ((x) == ' ' || (x) == '\t' || (x) == '\r' || (x) == '\n' \ || (x) == 0x85 || (x) == '\f') -private int looks_ascii(const unsigned char *, size_t, unichar *, size_t *); -private int looks_utf8_with_BOM(const unsigned char *, size_t, unichar *, - size_t *); -private int looks_ucs16(const unsigned char *, size_t, unichar *, size_t *); -private int looks_latin1(const unsigned char *, size_t, unichar *, size_t *); -private int looks_extended(const unsigned char *, size_t, unichar *, size_t *); -private void from_ebcdic(const unsigned char *, size_t, unsigned char *); private int ascmatch(const unsigned char *, const unichar *, size_t); private unsigned char *encode_utf8(unsigned char *, size_t, unichar *, size_t); +private size_t trim_nuls(const unsigned char *, size_t); +/* + * Undo the NUL-termination kindly provided by process() + * but leave at least one byte to look at + */ +private size_t +trim_nuls(const unsigned char *buf, size_t nbytes) +{ + while (nbytes > 1 && buf[nbytes - 1] == '\0') + nbytes--; + + return nbytes; +} protected int file_ascmagic(struct magic_set *ms, const unsigned char *buf, size_t nbytes) { - size_t i; - unsigned char *nbuf = NULL, *utf8_buf = NULL, *utf8_end; - unichar *ubuf = NULL; - size_t ulen, mlen; - const struct names *p; - const char *encoding = "binary"; + unichar *ubuf = NULL; + size_t ulen; int rv = -1; - int mime = ms->flags & MAGIC_MIME; const char *code = NULL; const char *code_mime = NULL; const char *type = NULL; + + if (ms->flags & MAGIC_APPLE) + return 0; + + nbytes = trim_nuls(buf, nbytes); + + /* If file doesn't look like any sort of text, give up. 
*/ + if (file_encoding(ms, buf, nbytes, &ubuf, &ulen, &code, &code_mime, &type) == 0) { + rv = 0; + goto done; + } + + rv = file_ascmagic_with_encoding(ms, buf, nbytes, ubuf, ulen, code, code_mime, type); + + done: + if (ubuf) + free(ubuf); + + return rv; +} + +protected int +file_ascmagic_with_encoding(struct magic_set *ms, const unsigned char *buf, size_t nbytes, unichar *ubuf, size_t ulen, const char *code, const char *code_mime, const char *type) +{ + unsigned char *utf8_buf = NULL, *utf8_end; + size_t mlen, i; + const struct names *p; + int rv = -1; + int mime = ms->flags & MAGIC_MIME; + const char *subtype = NULL; const char *subtype_mime = NULL; @@ -101,90 +129,17 @@ file_ascmagic(struct magic_set *ms, const unsigned char *buf, size_t nbytes) if (ms->flags & MAGIC_APPLE) return 0; - /* - * Undo the NUL-termination kindly provided by process() - * but leave at least one byte to look at - */ - while (nbytes > 1 && buf[nbytes - 1] == '\0') - nbytes--; - - mlen = (nbytes + 1) * sizeof(nbuf[0]); - if ((nbuf = CAST(unsigned char *, calloc((size_t)1, mlen))) == NULL) { - file_oomem(ms, mlen); - goto done; - } - mlen = (nbytes + 1) * sizeof(ubuf[0]); - if ((ubuf = CAST(unichar *, calloc((size_t)1, mlen))) == NULL) { - file_oomem(ms, mlen); - goto done; - } - - /* - * Then try to determine whether it's any character code we can - * identify. Each of these tests, if it succeeds, will leave - * the text converted into one-unichar-per-character Unicode in - * ubuf, and the number of characters converted in ulen. 
- */ - if (looks_ascii(buf, nbytes, ubuf, &ulen)) { - code = "ASCII"; - code_mime = "us-ascii"; - type = "text"; - encoding = "7bit"; - } else if (looks_utf8_with_BOM(buf, nbytes, ubuf, &ulen) > 0) { - code = "UTF-8 Unicode (with BOM)"; - code_mime = "utf-8"; - type = "text"; - } else if (file_looks_utf8(buf, nbytes, ubuf, &ulen) > 1) { - code = "UTF-8 Unicode"; - code_mime = "utf-8"; - type = "text"; - } else if ((i = looks_ucs16(buf, nbytes, ubuf, &ulen)) != 0) { - if (i == 1) - code = "Little-endian UTF-16 Unicode"; - else - code = "Big-endian UTF-16 Unicode"; - - type = "character data"; - code_mime = "utf-16"; /* is this defined? */ - } else if (looks_latin1(buf, nbytes, ubuf, &ulen)) { - code = "ISO-8859"; - type = "text"; - code_mime = "iso-8859-1"; - } else if (looks_extended(buf, nbytes, ubuf, &ulen)) { - code = "Non-ISO extended-ASCII"; - type = "text"; - code_mime = "unknown"; - } else { - from_ebcdic(buf, nbytes, nbuf); - - if (looks_ascii(nbuf, nbytes, ubuf, &ulen)) { - code = "EBCDIC"; - type = "character data"; - code_mime = "ebcdic"; - } else if (looks_latin1(nbuf, nbytes, ubuf, &ulen)) { - code = "International EBCDIC"; - type = "character data"; - code_mime = "ebcdic"; - } else { - if (mime == MAGIC_MIME_ENCODING) - if (file_printf(ms, "%s", encoding) == -1) - goto done; - rv = 0; - goto done; /* doesn't look like text at all */ - } - } + nbytes = trim_nuls(buf, nbytes); + /* If we have fewer than 2 bytes, give up. */ if (nbytes <= 1) { rv = 0; goto done; } /* Convert ubuf to UTF-8 and try text soft magic */ - /* If original was ASCII or UTF-8, could use nbuf instead of - re-converting. */ /* malloc size is a conservative overestimate; could be - re-converting improved, or at least realloced after - re-converting conversion. */ + improved, or at least realloced after conversion. 
*/ mlen = ulen * 6; if ((utf8_buf = CAST(unsigned char *, malloc(mlen))) == NULL) { file_oomem(ms, mlen); @@ -270,7 +225,8 @@ subtype_identified: n_cr++; if (mime) { - if (mime & MAGIC_MIME_TYPE) { + if ((mime & MAGIC_MIME_TYPE) && + !(ms->event_flags && EVENT_WROTE_MIME_TYPE)) { if (subtype_mime) { if (file_printf(ms, "%s", subtype_mime) == -1) goto done; @@ -289,22 +245,18 @@ subtype_identified: } if (mime == MAGIC_MIME_ENCODING) - if (file_printf(ms, "%s", encoding) == -1) + if (file_printf(ms, "%s", code_mime) == -1) goto done; } else { if (file_printf(ms, "%s", code) == -1) goto done; if (subtype) { - if (file_printf(ms, " ") == -1) - goto done; - if (file_printf(ms, "%s", subtype) == -1) + if (file_printf(ms, " %s", subtype) == -1) goto done; } - if (file_printf(ms, " ") == -1) - goto done; - if (file_printf(ms, "%s", type) == -1) + if (file_printf(ms, " %s", type) == -1) goto done; if (has_long_lines) @@ -320,7 +272,7 @@ subtype_identified: if (file_printf(ms, ", with") == -1) goto done; - if (n_crlf == 0 && n_cr == 0 && n_nel == 0 && n_lf == 0) { + if (n_crlf == 0 && n_cr == 0 && n_nel == 0 && n_lf == 0) { if (file_printf(ms, " no") == -1) goto done; } else { @@ -363,10 +315,6 @@ subtype_identified: } rv = 1; done: - if (nbuf) - free(nbuf); - if (ubuf) - free(ubuf); if (utf8_buf) free(utf8_buf); @@ -389,144 +337,6 @@ ascmatch(const unsigned char *s, const unichar *us, size_t ulen) return 1; } -/* - * This table reflects a particular philosophy about what constitutes - * "text," and there is room for disagreement about it. - * - * Version 3.31 of the file command considered a file to be ASCII if - * each of its characters was approved by either the isascii() or - * isalpha() function. On most systems, this would mean that any - * file consisting only of characters in the range 0x00 ... 
0x7F - * would be called ASCII text, but many systems might reasonably - * consider some characters outside this range to be alphabetic, - * so the file command would call such characters ASCII. It might - * have been more accurate to call this "considered textual on the - * local system" than "ASCII." - * - * It considered a file to be "International language text" if each - * of its characters was either an ASCII printing character (according - * to the real ASCII standard, not the above test), a character in - * the range 0x80 ... 0xFF, or one of the following control characters: - * backspace, tab, line feed, vertical tab, form feed, carriage return, - * escape. No attempt was made to determine the language in which files - * of this type were written. - * - * - * The table below considers a file to be ASCII if all of its characters - * are either ASCII printing characters (again, according to the X3.4 - * standard, not isascii()) or any of the following controls: bell, - * backspace, tab, line feed, form feed, carriage return, esc, nextline. - * - * I include bell because some programs (particularly shell scripts) - * use it literally, even though it is rare in normal text. I exclude - * vertical tab because it never seems to be used in real text. I also - * include, with hesitation, the X3.64/ECMA-43 control nextline (0x85), - * because that's what the dd EBCDIC->ASCII table maps the EBCDIC newline - * character to. It might be more appropriate to include it in the 8859 - * set instead of the ASCII set, but it's got to be included in *something* - * we recognize or EBCDIC files aren't going to be considered textual. - * Some old Unix source files use SO/SI (^N/^O) to shift between Greek - * and Latin characters, so these should possibly be allowed. But they - * make a real mess on VT100-style displays if they're not paired properly, - * so we are probably better off not calling them text. 
- * - * A file is considered to be ISO-8859 text if its characters are all - * either ASCII, according to the above definition, or printing characters - * from the ISO-8859 8-bit extension, characters 0xA0 ... 0xFF. - * - * Finally, a file is considered to be international text from some other - * character code if its characters are all either ISO-8859 (according to - * the above definition) or characters in the range 0x80 ... 0x9F, which - * ISO-8859 considers to be control characters but the IBM PC and Macintosh - * consider to be printing characters. - */ - -#define F 0 /* character never appears in text */ -#define T 1 /* character appears in plain ASCII text */ -#define I 2 /* character appears in ISO-8859 text */ -#define X 3 /* character appears in non-ISO extended ASCII (Mac, IBM PC) */ - -private char text_chars[256] = { - /* BEL BS HT LF FF CR */ - F, F, F, F, F, F, F, T, T, T, T, F, T, T, F, F, /* 0x0X */ - /* ESC */ - F, F, F, F, F, F, F, F, F, F, F, T, F, F, F, F, /* 0x1X */ - T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, /* 0x2X */ - T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, /* 0x3X */ - T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, /* 0x4X */ - T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, /* 0x5X */ - T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, /* 0x6X */ - T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, F, /* 0x7X */ - /* NEL */ - X, X, X, X, X, T, X, X, X, X, X, X, X, X, X, X, /* 0x8X */ - X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, /* 0x9X */ - I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, /* 0xaX */ - I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, /* 0xbX */ - I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, /* 0xcX */ - I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, /* 0xdX */ - I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, /* 0xeX */ - I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I /* 0xfX */ -}; - -private int -looks_ascii(const unsigned char *buf, size_t nbytes, unichar *ubuf, - size_t *ulen) -{ - size_t i; - - *ulen = 
0; - - for (i = 0; i < nbytes; i++) { - int t = text_chars[buf[i]]; - - if (t != T) - return 0; - - ubuf[(*ulen)++] = buf[i]; - } - - return 1; -} - -private int -looks_latin1(const unsigned char *buf, size_t nbytes, unichar *ubuf, size_t *ulen) -{ - size_t i; - - *ulen = 0; - - for (i = 0; i < nbytes; i++) { - int t = text_chars[buf[i]]; - - if (t != T && t != I) - return 0; - - ubuf[(*ulen)++] = buf[i]; - } - - return 1; -} - -private int -looks_extended(const unsigned char *buf, size_t nbytes, unichar *ubuf, - size_t *ulen) -{ - size_t i; - - *ulen = 0; - - for (i = 0; i < nbytes; i++) { - int t = text_chars[buf[i]]; - - if (t != T && t != I && t != X) - return 0; - - ubuf[(*ulen)++] = buf[i]; - } - - return 1; -} - /* * Encode Unicode string as UTF-8, returning pointer to character * after end of string, or NULL if an invalid character is found. @@ -583,226 +393,3 @@ encode_utf8(unsigned char *buf, size_t len, unichar *ubuf, size_t ulen) return buf; } - -/* - * Decide whether some text looks like UTF-8. Returns: - * - * -1: invalid UTF-8 - * 0: uses odd control characters, so doesn't look like text - * 1: 7-bit text - * 2: definitely UTF-8 text (valid high-bit set bytes) - * - * If ubuf is non-NULL on entry, text is decoded into ubuf, *ulen; - * ubuf must be big enough! - */ -protected int -file_looks_utf8(const unsigned char *buf, size_t nbytes, unichar *ubuf, size_t *ulen) -{ - size_t i; - int n; - unichar c; - int gotone = 0, ctrl = 0; - - if (ubuf) - *ulen = 0; - - for (i = 0; i < nbytes; i++) { - if ((buf[i] & 0x80) == 0) { /* 0xxxxxxx is plain ASCII */ - /* - * Even if the whole file is valid UTF-8 sequences, - * still reject it if it uses weird control characters. 
- */ - - if (text_chars[buf[i]] != T) - ctrl = 1; - - if (ubuf) - ubuf[(*ulen)++] = buf[i]; - } else if ((buf[i] & 0x40) == 0) { /* 10xxxxxx never 1st byte */ - return -1; - } else { /* 11xxxxxx begins UTF-8 */ - int following; - - if ((buf[i] & 0x20) == 0) { /* 110xxxxx */ - c = buf[i] & 0x1f; - following = 1; - } else if ((buf[i] & 0x10) == 0) { /* 1110xxxx */ - c = buf[i] & 0x0f; - following = 2; - } else if ((buf[i] & 0x08) == 0) { /* 11110xxx */ - c = buf[i] & 0x07; - following = 3; - } else if ((buf[i] & 0x04) == 0) { /* 111110xx */ - c = buf[i] & 0x03; - following = 4; - } else if ((buf[i] & 0x02) == 0) { /* 1111110x */ - c = buf[i] & 0x01; - following = 5; - } else - return -1; - - for (n = 0; n < following; n++) { - i++; - if (i >= nbytes) - goto done; - - if ((buf[i] & 0x80) == 0 || (buf[i] & 0x40)) - return -1; - - c = (c << 6) + (buf[i] & 0x3f); - } - - if (ubuf) - ubuf[(*ulen)++] = c; - gotone = 1; - } - } -done: - return ctrl ? 0 : (gotone ? 2 : 1); -} - -/* - * Decide whether some text looks like UTF-8 with BOM. If there is no - * BOM, return -1; otherwise return the result of looks_utf8 on the - * rest of the text. 
- */ -private int -looks_utf8_with_BOM(const unsigned char *buf, size_t nbytes, unichar *ubuf, - size_t *ulen) -{ - if (nbytes > 3 && buf[0] == 0xef && buf[1] == 0xbb && buf[2] == 0xbf) - return file_looks_utf8(buf + 3, nbytes - 3, ubuf, ulen); - else - return -1; -} - -private int -looks_ucs16(const unsigned char *buf, size_t nbytes, unichar *ubuf, - size_t *ulen) -{ - int bigend; - size_t i; - - if (nbytes < 2) - return 0; - - if (buf[0] == 0xff && buf[1] == 0xfe) - bigend = 0; - else if (buf[0] == 0xfe && buf[1] == 0xff) - bigend = 1; - else - return 0; - - *ulen = 0; - - for (i = 2; i + 1 < nbytes; i += 2) { - /* XXX fix to properly handle chars > 65536 */ - - if (bigend) - ubuf[(*ulen)++] = buf[i + 1] + 256 * buf[i]; - else - ubuf[(*ulen)++] = buf[i] + 256 * buf[i + 1]; - - if (ubuf[*ulen - 1] == 0xfffe) - return 0; - if (ubuf[*ulen - 1] < 128 && - text_chars[(size_t)ubuf[*ulen - 1]] != T) - return 0; - } - - return 1 + bigend; -} - -#undef F -#undef T -#undef I -#undef X - -/* - * This table maps each EBCDIC character to an (8-bit extended) ASCII - * character, as specified in the rationale for the dd(1) command in - * draft 11.2 (September, 1991) of the POSIX P1003.2 standard. - * - * Unfortunately it does not seem to correspond exactly to any of the - * five variants of EBCDIC documented in IBM's _Enterprise Systems - * Architecture/390: Principles of Operation_, SA22-7201-06, Seventh - * Edition, July, 1999, pp. I-1 - I-4. - * - * Fortunately, though, all versions of EBCDIC, including this one, agree - * on most of the printing characters that also appear in (7-bit) ASCII. - * Of these, only '|', '!', '~', '^', '[', and ']' are in question at all. - * - * Fortunately too, there is general agreement that codes 0x00 through - * 0x3F represent control characters, 0x41 a nonbreaking space, and the - * remainder printing characters. 
- *
- * This is sufficient to allow us to identify EBCDIC text and to distinguish
- * between old-style and internationalized examples of text.
- */
-
-private unsigned char ebcdic_to_ascii[] = {
- 0, 1, 2, 3, 156, 9, 134, 127, 151, 141, 142, 11, 12, 13, 14, 15,
- 16, 17, 18, 19, 157, 133, 8, 135, 24, 25, 146, 143, 28, 29, 30, 31,
-128, 129, 130, 131, 132, 10, 23, 27, 136, 137, 138, 139, 140, 5, 6, 7,
-144, 145, 22, 147, 148, 149, 150, 4, 152, 153, 154, 155, 20, 21, 158, 26,
-' ', 160, 161, 162, 163, 164, 165, 166, 167, 168, 213, '.', '<', '(', '+', '|',
-'&', 169, 170, 171, 172, 173, 174, 175, 176, 177, '!', '$', '*', ')', ';', '~',
-'-', '/', 178, 179, 180, 181, 182, 183, 184, 185, 203, ',', '%', '_', '>', '?',
-186, 187, 188, 189, 190, 191, 192, 193, 194, '`', ':', '#', '@', '\'','=', '"',
-195, 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 196, 197, 198, 199, 200, 201,
-202, 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', '^', 204, 205, 206, 207, 208,
-209, 229, 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 210, 211, 212, '[', 214, 215,
-216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, ']', 230, 231,
-'{', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 232, 233, 234, 235, 236, 237,
-'}', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 238, 239, 240, 241, 242, 243,
-'\\',159, 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 244, 245, 246, 247, 248, 249,
-'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 250, 251, 252, 253, 254, 255
-};
-
-#ifdef notdef
-/*
- * The following EBCDIC-to-ASCII table may relate more closely to reality,
- * or at least to modern reality. It comes from
- *
- * http://ftp.s390.ibm.com/products/oe/bpxqp9.html
- *
- * and maps the characters of EBCDIC code page 1047 (the code used for
- * Unix-derived software on IBM's 390 systems) to the corresponding
- * characters from ISO 8859-1.
- *
- * If this table is used instead of the above one, some of the special
- * cases for the NEL character can be taken out of the code.
- */
-
-private unsigned char ebcdic_1047_to_8859[] = {
-0x00,0x01,0x02,0x03,0x9C,0x09,0x86,0x7F,0x97,0x8D,0x8E,0x0B,0x0C,0x0D,0x0E,0x0F,
-0x10,0x11,0x12,0x13,0x9D,0x0A,0x08,0x87,0x18,0x19,0x92,0x8F,0x1C,0x1D,0x1E,0x1F,
-0x80,0x81,0x82,0x83,0x84,0x85,0x17,0x1B,0x88,0x89,0x8A,0x8B,0x8C,0x05,0x06,0x07,
-0x90,0x91,0x16,0x93,0x94,0x95,0x96,0x04,0x98,0x99,0x9A,0x9B,0x14,0x15,0x9E,0x1A,
-0x20,0xA0,0xE2,0xE4,0xE0,0xE1,0xE3,0xE5,0xE7,0xF1,0xA2,0x2E,0x3C,0x28,0x2B,0x7C,
-0x26,0xE9,0xEA,0xEB,0xE8,0xED,0xEE,0xEF,0xEC,0xDF,0x21,0x24,0x2A,0x29,0x3B,0x5E,
-0x2D,0x2F,0xC2,0xC4,0xC0,0xC1,0xC3,0xC5,0xC7,0xD1,0xA6,0x2C,0x25,0x5F,0x3E,0x3F,
-0xF8,0xC9,0xCA,0xCB,0xC8,0xCD,0xCE,0xCF,0xCC,0x60,0x3A,0x23,0x40,0x27,0x3D,0x22,
-0xD8,0x61,0x62,0x63,0x64,0x65,0x66,0x67,0x68,0x69,0xAB,0xBB,0xF0,0xFD,0xFE,0xB1,
-0xB0,0x6A,0x6B,0x6C,0x6D,0x6E,0x6F,0x70,0x71,0x72,0xAA,0xBA,0xE6,0xB8,0xC6,0xA4,
-0xB5,0x7E,0x73,0x74,0x75,0x76,0x77,0x78,0x79,0x7A,0xA1,0xBF,0xD0,0x5B,0xDE,0xAE,
-0xAC,0xA3,0xA5,0xB7,0xA9,0xA7,0xB6,0xBC,0xBD,0xBE,0xDD,0xA8,0xAF,0x5D,0xB4,0xD7,
-0x7B,0x41,0x42,0x43,0x44,0x45,0x46,0x47,0x48,0x49,0xAD,0xF4,0xF6,0xF2,0xF3,0xF5,
-0x7D,0x4A,0x4B,0x4C,0x4D,0x4E,0x4F,0x50,0x51,0x52,0xB9,0xFB,0xFC,0xF9,0xFA,0xFF,
-0x5C,0xF7,0x53,0x54,0x55,0x56,0x57,0x58,0x59,0x5A,0xB2,0xD4,0xD6,0xD2,0xD3,0xD5,
-0x30,0x31,0x32,0x33,0x34,0x35,0x36,0x37,0x38,0x39,0xB3,0xDB,0xDC,0xD9,0xDA,0x9F
-};
-#endif
-
-/*
- * Copy buf[0 ... nbytes-1] into out[], translating EBCDIC to ASCII.
- */
-private void
-from_ebcdic(const unsigned char *buf, size_t nbytes, unsigned char *out)
-{
-	size_t i;
-
-	for (i = 0; i < nbytes; i++) {
-		out[i] = ebcdic_to_ascii[buf[i]];
-	}
-}
diff --git a/src/encoding.c b/src/encoding.c
new file mode 100644
index 00000000..44320452
--- /dev/null
+++ b/src/encoding.c
@@ -0,0 +1,487 @@
+/*
+ * Copyright (c) Ian F. Darwin 1986-1995.
+ * Software written by Ian F. Darwin and others;
+ * maintained 1995-present by Christos Zoulas and others.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice immediately at the beginning of the file, without modification,
+ *    this list of conditions, and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR
+ * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+/*
+ * Encoding -- determine the character encoding of a text file.
+ *
+ * Joerg Wunsch wrote the original support for 8-bit
+ * international characters.
+ */
+
+#include "file.h"
+#include "magic.h"
+#include
+#include
+#include
+
+#ifndef lint
+FILE_RCSID("@(#)$File: encoding.c,v 1.1 2008/11/06 21:17:45 rrt Exp $")
+#endif /* lint */
+
+private int looks_ascii(const unsigned char *, size_t, unichar *, size_t *);
+private int looks_utf8_with_BOM(const unsigned char *, size_t, unichar *,
+    size_t *);
+private int looks_ucs16(const unsigned char *, size_t, unichar *, size_t *);
+private int looks_latin1(const unsigned char *, size_t, unichar *, size_t *);
+private int looks_extended(const unsigned char *, size_t, unichar *, size_t *);
+private void from_ebcdic(const unsigned char *, size_t, unsigned char *);
+
+/*
+ * Try to determine whether text is in some character code we can
+ * identify. Each of these tests, if it succeeds, will leave
+ * the text converted into one-unichar-per-character Unicode in
+ * ubuf, and the number of characters converted in ulen.
+ */
+protected int
+file_encoding(struct magic_set *ms, const unsigned char *buf, size_t nbytes, unichar **ubuf, size_t *ulen, const char **code, const char **code_mime, const char **type)
+{
+	size_t mlen;
+	int rv = 0, ucs_type;
+	unsigned char *nbuf = NULL;
+
+	mlen = (nbytes + 1) * sizeof(nbuf[0]);
+	if ((nbuf = CAST(unsigned char *, calloc((size_t)1, mlen))) == NULL) {
+		file_oomem(ms, mlen);
+		goto done;
+	}
+	mlen = (nbytes + 1) * sizeof((*ubuf)[0]);
+	if ((*ubuf = CAST(unichar *, calloc((size_t)1, mlen))) == NULL) {
+		file_oomem(ms, mlen);
+		goto done;
+	}
+
+	if (looks_ascii(buf, nbytes, *ubuf, ulen)) {
+		*code = "ASCII";
+		*code_mime = "us-ascii";
+		*type = "text";
+	} else if (looks_utf8_with_BOM(buf, nbytes, *ubuf, ulen) > 0) {
+		*code = "UTF-8 Unicode (with BOM)";
+		*code_mime = "utf-8";
+		*type = "text";
+	} else if (file_looks_utf8(buf, nbytes, *ubuf, ulen) > 1) {
+		*code = "UTF-8 Unicode";
+		*code_mime = "utf-8";
+		*type = "text";
+	} else if ((ucs_type = looks_ucs16(buf, nbytes, *ubuf, ulen)) != 0) {
+		if (ucs_type == 1) {
+			*code = "Little-endian UTF-16 Unicode";
+			*code_mime = "utf-16le";
+		} else {
+			*code = "Big-endian UTF-16 Unicode";
+			*code_mime = "utf-16be";
+		}
+		*type = "character data";
+	} else if (looks_latin1(buf, nbytes, *ubuf, ulen)) {
+		*code = "ISO-8859";
+		*type = "text";
+		*code_mime = "iso-8859-1";
+	} else if (looks_extended(buf, nbytes, *ubuf, ulen)) {
+		*code = "Non-ISO extended-ASCII";
+		*type = "text";
+		*code_mime = "unknown-8bit";
+	} else {
+		from_ebcdic(buf, nbytes, nbuf);
+
+		if (looks_ascii(nbuf, nbytes, *ubuf, ulen)) {
+			*code = "EBCDIC";
+			*type = "character data";
+			*code_mime = "ebcdic";
+		} else if (looks_latin1(nbuf, nbytes, *ubuf, ulen)) {
+			*code = "International EBCDIC";
+			*type = "character data";
+			*code_mime = "ebcdic";
+		} else /* Doesn't look like text at all */
+			rv = -1;
+	}
+
+ done:
+	if (nbuf)
+		free(nbuf);
+
+	return rv;
+}
+
+/*
+ * This table reflects a particular philosophy about what constitutes
+ * "text," and there is room for disagreement about it.
+ *
+ * Version 3.31 of the file command considered a file to be ASCII if
+ * each of its characters was approved by either the isascii() or
+ * isalpha() function. On most systems, this would mean that any
+ * file consisting only of characters in the range 0x00 ... 0x7F
+ * would be called ASCII text, but many systems might reasonably
+ * consider some characters outside this range to be alphabetic,
+ * so the file command would call such characters ASCII. It might
+ * have been more accurate to call this "considered textual on the
+ * local system" than "ASCII."
+ *
+ * It considered a file to be "International language text" if each
+ * of its characters was either an ASCII printing character (according
+ * to the real ASCII standard, not the above test), a character in
+ * the range 0x80 ... 0xFF, or one of the following control characters:
+ * backspace, tab, line feed, vertical tab, form feed, carriage return,
+ * escape. No attempt was made to determine the language in which files
+ * of this type were written.
+ *
+ *
+ * The table below considers a file to be ASCII if all of its characters
+ * are either ASCII printing characters (again, according to the X3.4
+ * standard, not isascii()) or any of the following controls: bell,
+ * backspace, tab, line feed, form feed, carriage return, esc, nextline.
+ *
+ * I include bell because some programs (particularly shell scripts)
+ * use it literally, even though it is rare in normal text. I exclude
+ * vertical tab because it never seems to be used in real text. I also
+ * include, with hesitation, the X3.64/ECMA-43 control nextline (0x85),
+ * because that's what the dd EBCDIC->ASCII table maps the EBCDIC newline
+ * character to. It might be more appropriate to include it in the 8859
+ * set instead of the ASCII set, but it's got to be included in *something*
+ * we recognize or EBCDIC files aren't going to be considered textual.
+ * Some old Unix source files use SO/SI (^N/^O) to shift between Greek
+ * and Latin characters, so these should possibly be allowed. But they
+ * make a real mess on VT100-style displays if they're not paired properly,
+ * so we are probably better off not calling them text.
+ *
+ * A file is considered to be ISO-8859 text if its characters are all
+ * either ASCII, according to the above definition, or printing characters
+ * from the ISO-8859 8-bit extension, characters 0xA0 ... 0xFF.
+ *
+ * Finally, a file is considered to be international text from some other
+ * character code if its characters are all either ISO-8859 (according to
+ * the above definition) or characters in the range 0x80 ... 0x9F, which
+ * ISO-8859 considers to be control characters but the IBM PC and Macintosh
+ * consider to be printing characters.
+ */
+
+#define F 0 /* character never appears in text */
+#define T 1 /* character appears in plain ASCII text */
+#define I 2 /* character appears in ISO-8859 text */
+#define X 3 /* character appears in non-ISO extended ASCII (Mac, IBM PC) */
+
+private char text_chars[256] = {
+	/*                  BEL BS HT LF    FF CR    */
+	F, F, F, F, F, F, F, T, T, T, T, F, T, T, F, F, /* 0x0X */
+	/*                              ESC          */
+	F, F, F, F, F, F, F, F, F, F, F, T, F, F, F, F, /* 0x1X */
+	T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, /* 0x2X */
+	T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, /* 0x3X */
+	T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, /* 0x4X */
+	T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, /* 0x5X */
+	T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, /* 0x6X */
+	T, T, T, T, T, T, T, T, T, T, T, T, T, T, T, F, /* 0x7X */
+	/*            NEL                            */
+	X, X, X, X, X, T, X, X, X, X, X, X, X, X, X, X, /* 0x8X */
+	X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, X, /* 0x9X */
+	I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, /* 0xaX */
+	I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, /* 0xbX */
+	I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, /* 0xcX */
+	I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, /* 0xdX */
+	I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, /* 0xeX */
+	I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I /* 0xfX */
+};
+
+private int
+looks_ascii(const unsigned char *buf, size_t nbytes, unichar *ubuf,
+    size_t *ulen)
+{
+	size_t i;
+
+	*ulen = 0;
+
+	for (i = 0; i < nbytes; i++) {
+		int t = text_chars[buf[i]];
+
+		if (t != T)
+			return 0;
+
+		ubuf[(*ulen)++] = buf[i];
+	}
+
+	return 1;
+}
+
+private int
+looks_latin1(const unsigned char *buf, size_t nbytes, unichar *ubuf, size_t *ulen)
+{
+	size_t i;
+
+	*ulen = 0;
+
+	for (i = 0; i < nbytes; i++) {
+		int t = text_chars[buf[i]];
+
+		if (t != T && t != I)
+			return 0;
+
+		ubuf[(*ulen)++] = buf[i];
+	}
+
+	return 1;
+}
+
+private int
+looks_extended(const unsigned char *buf, size_t nbytes, unichar *ubuf,
+    size_t *ulen)
+{
+	size_t i;
+
+	*ulen = 0;
+
+	for (i = 0; i < nbytes; i++) {
+		int t = text_chars[buf[i]];
+
+		if (t != T && t != I && t != X)
+			return 0;
+
+		ubuf[(*ulen)++] = buf[i];
+	}
+
+	return 1;
+}
+
+/*
+ * Decide whether some text looks like UTF-8. Returns:
+ *
+ * -1: invalid UTF-8
+ * 0: uses odd control characters, so doesn't look like text
+ * 1: 7-bit text
+ * 2: definitely UTF-8 text (valid high-bit set bytes)
+ *
+ * If ubuf is non-NULL on entry, text is decoded into ubuf, *ulen;
+ * ubuf must be big enough!
+ */
+protected int
+file_looks_utf8(const unsigned char *buf, size_t nbytes, unichar *ubuf, size_t *ulen)
+{
+	size_t i;
+	int n;
+	unichar c;
+	int gotone = 0, ctrl = 0;
+
+	if (ubuf)
+		*ulen = 0;
+
+	for (i = 0; i < nbytes; i++) {
+		if ((buf[i] & 0x80) == 0) { /* 0xxxxxxx is plain ASCII */
+			/*
+			 * Even if the whole file is valid UTF-8 sequences,
+			 * still reject it if it uses weird control characters.
+			 */
+
+			if (text_chars[buf[i]] != T)
+				ctrl = 1;
+
+			if (ubuf)
+				ubuf[(*ulen)++] = buf[i];
+		} else if ((buf[i] & 0x40) == 0) { /* 10xxxxxx never 1st byte */
+			return -1;
+		} else { /* 11xxxxxx begins UTF-8 */
+			int following;
+
+			if ((buf[i] & 0x20) == 0) { /* 110xxxxx */
+				c = buf[i] & 0x1f;
+				following = 1;
+			} else if ((buf[i] & 0x10) == 0) { /* 1110xxxx */
+				c = buf[i] & 0x0f;
+				following = 2;
+			} else if ((buf[i] & 0x08) == 0) { /* 11110xxx */
+				c = buf[i] & 0x07;
+				following = 3;
+			} else if ((buf[i] & 0x04) == 0) { /* 111110xx */
+				c = buf[i] & 0x03;
+				following = 4;
+			} else if ((buf[i] & 0x02) == 0) { /* 1111110x */
+				c = buf[i] & 0x01;
+				following = 5;
+			} else
+				return -1;
+
+			for (n = 0; n < following; n++) {
+				i++;
+				if (i >= nbytes)
+					goto done;
+
+				if ((buf[i] & 0x80) == 0 || (buf[i] & 0x40))
+					return -1;
+
+				c = (c << 6) + (buf[i] & 0x3f);
+			}
+
+			if (ubuf)
+				ubuf[(*ulen)++] = c;
+			gotone = 1;
+		}
+	}
+done:
+	return ctrl ? 0 : (gotone ? 2 : 1);
+}
+
+/*
+ * Decide whether some text looks like UTF-8 with BOM. If there is no
+ * BOM, return -1; otherwise return the result of looks_utf8 on the
+ * rest of the text.
+ */
+private int
+looks_utf8_with_BOM(const unsigned char *buf, size_t nbytes, unichar *ubuf,
+    size_t *ulen)
+{
+	if (nbytes > 3 && buf[0] == 0xef && buf[1] == 0xbb && buf[2] == 0xbf)
+		return file_looks_utf8(buf + 3, nbytes - 3, ubuf, ulen);
+	else
+		return -1;
+}
+
+private int
+looks_ucs16(const unsigned char *buf, size_t nbytes, unichar *ubuf,
+    size_t *ulen)
+{
+	int bigend;
+	size_t i;
+
+	if (nbytes < 2)
+		return 0;
+
+	if (buf[0] == 0xff && buf[1] == 0xfe)
+		bigend = 0;
+	else if (buf[0] == 0xfe && buf[1] == 0xff)
+		bigend = 1;
+	else
+		return 0;
+
+	*ulen = 0;
+
+	for (i = 2; i + 1 < nbytes; i += 2) {
+		/* XXX fix to properly handle chars > 65536 */
+
+		if (bigend)
+			ubuf[(*ulen)++] = buf[i + 1] + 256 * buf[i];
+		else
+			ubuf[(*ulen)++] = buf[i] + 256 * buf[i + 1];
+
+		if (ubuf[*ulen - 1] == 0xfffe)
+			return 0;
+		if (ubuf[*ulen - 1] < 128 &&
+		    text_chars[(size_t)ubuf[*ulen - 1]] != T)
+			return 0;
+	}
+
+	return 1 + bigend;
+}
+
+#undef F
+#undef T
+#undef I
+#undef X
+
+/*
+ * This table maps each EBCDIC character to an (8-bit extended) ASCII
+ * character, as specified in the rationale for the dd(1) command in
+ * draft 11.2 (September, 1991) of the POSIX P1003.2 standard.
+ *
+ * Unfortunately it does not seem to correspond exactly to any of the
+ * five variants of EBCDIC documented in IBM's _Enterprise Systems
+ * Architecture/390: Principles of Operation_, SA22-7201-06, Seventh
+ * Edition, July, 1999, pp. I-1 - I-4.
+ *
+ * Fortunately, though, all versions of EBCDIC, including this one, agree
+ * on most of the printing characters that also appear in (7-bit) ASCII.
+ * Of these, only '|', '!', '~', '^', '[', and ']' are in question at all.
+ *
+ * Fortunately too, there is general agreement that codes 0x00 through
+ * 0x3F represent control characters, 0x41 a nonbreaking space, and the
+ * remainder printing characters.
+ *
+ * This is sufficient to allow us to identify EBCDIC text and to distinguish
+ * between old-style and internationalized examples of text.
+ */
+
+private unsigned char ebcdic_to_ascii[] = {
+ 0, 1, 2, 3, 156, 9, 134, 127, 151, 141, 142, 11, 12, 13, 14, 15,
+ 16, 17, 18, 19, 157, 133, 8, 135, 24, 25, 146, 143, 28, 29, 30, 31,
+128, 129, 130, 131, 132, 10, 23, 27, 136, 137, 138, 139, 140, 5, 6, 7,
+144, 145, 22, 147, 148, 149, 150, 4, 152, 153, 154, 155, 20, 21, 158, 26,
+' ', 160, 161, 162, 163, 164, 165, 166, 167, 168, 213, '.', '<', '(', '+', '|',
+'&', 169, 170, 171, 172, 173, 174, 175, 176, 177, '!', '$', '*', ')', ';', '~',
+'-', '/', 178, 179, 180, 181, 182, 183, 184, 185, 203, ',', '%', '_', '>', '?',
+186, 187, 188, 189, 190, 191, 192, 193, 194, '`', ':', '#', '@', '\'','=', '"',
+195, 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 196, 197, 198, 199, 200, 201,
+202, 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', '^', 204, 205, 206, 207, 208,
+209, 229, 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 210, 211, 212, '[', 214, 215,
+216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, ']', 230, 231,
+'{', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 232, 233, 234, 235, 236, 237,
+'}', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 238, 239, 240, 241, 242, 243,
+'\\',159, 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 244, 245, 246, 247, 248, 249,
+'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 250, 251, 252, 253, 254, 255
+};
+
+#ifdef notdef
+/*
+ * The following EBCDIC-to-ASCII table may relate more closely to reality,
+ * or at least to modern reality. It comes from
+ *
+ * http://ftp.s390.ibm.com/products/oe/bpxqp9.html
+ *
+ * and maps the characters of EBCDIC code page 1047 (the code used for
+ * Unix-derived software on IBM's 390 systems) to the corresponding
+ * characters from ISO 8859-1.
+ *
+ * If this table is used instead of the above one, some of the special
+ * cases for the NEL character can be taken out of the code.
+ */
+
+private unsigned char ebcdic_1047_to_8859[] = {
+0x00,0x01,0x02,0x03,0x9C,0x09,0x86,0x7F,0x97,0x8D,0x8E,0x0B,0x0C,0x0D,0x0E,0x0F,
+0x10,0x11,0x12,0x13,0x9D,0x0A,0x08,0x87,0x18,0x19,0x92,0x8F,0x1C,0x1D,0x1E,0x1F,
+0x80,0x81,0x82,0x83,0x84,0x85,0x17,0x1B,0x88,0x89,0x8A,0x8B,0x8C,0x05,0x06,0x07,
+0x90,0x91,0x16,0x93,0x94,0x95,0x96,0x04,0x98,0x99,0x9A,0x9B,0x14,0x15,0x9E,0x1A,
+0x20,0xA0,0xE2,0xE4,0xE0,0xE1,0xE3,0xE5,0xE7,0xF1,0xA2,0x2E,0x3C,0x28,0x2B,0x7C,
+0x26,0xE9,0xEA,0xEB,0xE8,0xED,0xEE,0xEF,0xEC,0xDF,0x21,0x24,0x2A,0x29,0x3B,0x5E,
+0x2D,0x2F,0xC2,0xC4,0xC0,0xC1,0xC3,0xC5,0xC7,0xD1,0xA6,0x2C,0x25,0x5F,0x3E,0x3F,
+0xF8,0xC9,0xCA,0xCB,0xC8,0xCD,0xCE,0xCF,0xCC,0x60,0x3A,0x23,0x40,0x27,0x3D,0x22,
+0xD8,0x61,0x62,0x63,0x64,0x65,0x66,0x67,0x68,0x69,0xAB,0xBB,0xF0,0xFD,0xFE,0xB1,
+0xB0,0x6A,0x6B,0x6C,0x6D,0x6E,0x6F,0x70,0x71,0x72,0xAA,0xBA,0xE6,0xB8,0xC6,0xA4,
+0xB5,0x7E,0x73,0x74,0x75,0x76,0x77,0x78,0x79,0x7A,0xA1,0xBF,0xD0,0x5B,0xDE,0xAE,
+0xAC,0xA3,0xA5,0xB7,0xA9,0xA7,0xB6,0xBC,0xBD,0xBE,0xDD,0xA8,0xAF,0x5D,0xB4,0xD7,
+0x7B,0x41,0x42,0x43,0x44,0x45,0x46,0x47,0x48,0x49,0xAD,0xF4,0xF6,0xF2,0xF3,0xF5,
+0x7D,0x4A,0x4B,0x4C,0x4D,0x4E,0x4F,0x50,0x51,0x52,0xB9,0xFB,0xFC,0xF9,0xFA,0xFF,
+0x5C,0xF7,0x53,0x54,0x55,0x56,0x57,0x58,0x59,0x5A,0xB2,0xD4,0xD6,0xD2,0xD3,0xD5,
+0x30,0x31,0x32,0x33,0x34,0x35,0x36,0x37,0x38,0x39,0xB3,0xDB,0xDC,0xD9,0xDA,0x9F
+};
+#endif
+
+/*
+ * Copy buf[0 ... nbytes-1] into out[], translating EBCDIC to ASCII.
+ */
+private void
+from_ebcdic(const unsigned char *buf, size_t nbytes, unsigned char *out)
+{
+	size_t i;
+
+	for (i = 0; i < nbytes; i++) {
+		out[i] = ebcdic_to_ascii[buf[i]];
+	}
+}
diff --git a/src/file.c b/src/file.c
index e0cd74d4..22d10063 100644
--- a/src/file.c
+++ b/src/file.c
@@ -2,7 +2,7 @@
  * Copyright (c) Ian F. Darwin 1986-1995.
  * Software written by Ian F. Darwin and others;
  * maintained 1995-present by Christos Zoulas and others.
- *
+ *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
@@ -12,7 +12,7 @@
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
- *
+ *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
@@ -32,7 +32,7 @@
 #include "file.h"
 
 #ifndef lint
-FILE_RCSID("@(#)$File: file.c,v 1.127 2008/11/04 16:38:28 christos Exp $")
+FILE_RCSID("@(#)$File: file.c,v 1.128 2008/11/06 21:17:45 rrt Exp $")
 #endif /* lint */
 
 #include "magic.h"
@@ -149,6 +149,7 @@ main(int argc, char *argv[])
 		{ "cdf", MAGIC_NO_CHECK_CDF },
 		{ "compress", MAGIC_NO_CHECK_COMPRESS },
 		{ "elf", MAGIC_NO_CHECK_ELF },
+		{ "encoding", MAGIC_NO_CHECK_ENCODING },
 		{ "soft", MAGIC_NO_CHECK_SOFT },
 		{ "tar", MAGIC_NO_CHECK_TAR },
 		{ "tokens", MAGIC_NO_CHECK_TOKENS },
@@ -228,7 +229,7 @@ main(int argc, char *argv[])
 			else
 				flags |= nv[i].value;
 			break;
-
+
 		case 'f':
 			if(action)
 				usage();
@@ -343,7 +344,7 @@ main(int argc, char *argv[])
 			process(argv[optind], wid);
 	}
 
-	c = magic->haderr ? 1 : 0;
+	c = (magic->event_flags & EVENT_HAD_ERR) ? 1 : 0;
 	magic_close(magic);
 	return c;
 }
@@ -483,9 +484,9 @@ help(void)
 "Determine type of FILEs.\n"
 "\n", stderr);
 #define OPT(shortname, longname, opt, doc) \
-	fprintf(stderr, " -%c, --" longname doc, shortname);
+	fprintf(stderr, " -%c, --" longname doc, shortname);
 #define OPT_LONGONLY(longname, opt, doc) \
-	fprintf(stderr, " --" longname doc);
+	fprintf(stderr, " --" longname doc);
 #include "file_opts.h"
 #undef OPT
 #undef OPT_LONGONLY
diff --git a/src/file.h b/src/file.h
index 375477d7..164ab0ab 100644
--- a/src/file.h
+++ b/src/file.h
@@ -2,7 +2,7 @@
  * Copyright (c) Ian F. Darwin 1986-1995.
  * Software written by Ian F. Darwin and others;
  * maintained 1995-present by Christos Zoulas and others.
- *
+ *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
@@ -12,7 +12,7 @@
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
- *
+ *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
@@ -27,7 +27,7 @@
  */
 /*
  * file.h - definitions for file(1) program
- * @(#)$File: file.h,v 1.113 2008/11/04 16:48:42 christos Exp $
+ * @(#)$File: file.h,v 1.114 2008/11/06 21:17:45 rrt Exp $
  */
 
 #ifndef __file_h__
@@ -122,7 +122,7 @@ union VALUETYPE {
 	unsigned char us[MAXstring];
 	float f;
 	double d;
-};
+};
 
 struct magic {
 	/* Word 1 */
@@ -134,7 +134,7 @@ struct magic {
 #define UNSIGNED 0x08 /* comparison is unsigned */
 #define NOSPACE 0x10 /* suppress space character before output */
 #define BINTEST 0x20 /* test is for a binary type (set only
-			for top-level tests) */
+			for top-level tests) */
 #define TEXTTEST 0 /* for passing to file_softmagic */
 
 	uint8_t factor;
@@ -209,7 +209,7 @@ struct magic {
 #else
 	uint8_t dummy;
 #endif
-	uint8_t factor_op;
+	uint8_t factor_op;
 #define FILE_FACTOR_OP_PLUS '+'
 #define FILE_FACTOR_OP_MINUS '-'
 #define FILE_FACTOR_OP_TIMES '*'
@@ -317,8 +317,10 @@ struct magic_set {
 	} o;
 	uint32_t offset;
 	int error;
-	int flags;
-	int haderr;
+	int flags; /* Control magic tests. */
+	int event_flags; /* Note things that happened. */
+#define EVENT_HAD_ERR 0x01
+#define EVENT_WROTE_MIME_TYPE 0x02
 	const char *file;
 	size_t line; /* current magic line number */
 
@@ -355,6 +357,8 @@ protected int file_trycdf(struct magic_set *, int, const unsigned char *,
 protected int file_zmagic(struct magic_set *, int, const char *,
     const unsigned char *, size_t);
 protected int file_ascmagic(struct magic_set *, const unsigned char *, size_t);
+protected int file_ascmagic_with_encoding(struct magic_set *, const unsigned char *, size_t, unichar *, size_t, const char *, const char *, const char *);
+protected int file_encoding(struct magic_set *, const unsigned char *, size_t, unichar **, size_t *, const char **, const char **, const char **);
 protected int file_is_tar(struct magic_set *, const unsigned char *, size_t);
 protected int file_softmagic(struct magic_set *, const unsigned char *, size_t, int);
 protected struct mlist *file_apprentice(struct magic_set *, const char *, int);
diff --git a/src/funcs.c b/src/funcs.c
index bb9bb4f1..95fe34c7 100644
--- a/src/funcs.c
+++ b/src/funcs.c
@@ -1,7 +1,7 @@
 /*
  * Copyright (c) Christos Zoulas 2003.
  * All Rights Reserved.
- *
+ *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
@@ -11,7 +11,7 @@
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
- *
+ *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
@@ -27,7 +27,7 @@
 #include "file.h"
 
 #ifndef lint
-FILE_RCSID("@(#)$File: funcs.c,v 1.47 2008/11/04 16:38:28 christos Exp $")
+FILE_RCSID("@(#)$File: funcs.c,v 1.48 2008/11/06 21:17:45 rrt Exp $")
 #endif /* lint */
 
 #include "magic.h"
@@ -98,17 +98,17 @@ file_error_core(struct magic_set *ms, int error, const char *f, va_list va,
     uint32_t lineno)
 {
 	/* Only the first error is ok */
-	if (ms->haderr)
+	if (ms->event_flags & EVENT_HAD_ERR)
 		return;
 	if (lineno != 0) {
 		free(ms->o.buf);
 		ms->o.buf = NULL;
 		file_printf(ms, "line %u: ", lineno);
 	}
-	file_vprintf(ms, f, va);
+	file_vprintf(ms, f, va);
 	if (error > 0)
 		file_printf(ms, " (%s)", strerror(error));
-	ms->haderr++;
+	ms->event_flags |= EVENT_HAD_ERR;
 	ms->error = error;
 }
 
@@ -158,9 +158,15 @@ protected int
 file_buffer(struct magic_set *ms, int fd, const char *inname, const void *buf,
     size_t nb)
 {
-	int m;
+	int m = 0, rv = 0;
 	int mime = ms->flags & MAGIC_MIME;
 	const unsigned char *ubuf = CAST(const unsigned char *, buf);
+	unichar *u8buf = NULL;
+	size_t ulen;
+
+	const char *code = NULL;
+	const char *code_mime = NULL;
+	const char *type = NULL;
 
 	if (nb == 0) {
 		if ((!mime || (mime & MAGIC_MIME_TYPE)) &&
@@ -198,24 +204,35 @@ file_buffer(struct magic_set *ms, int fd, const char *inname, const void *buf,
 		/* Check if we have a CDF file */
 		if ((ms->flags & MAGIC_NO_CHECK_CDF) != 0 ||
 		    (m = file_trycdf(ms, fd, ubuf, nb)) == 0) {
-			/* try tests in /etc/magic (or surrogate magic file) */
+			/* try to discover text encoding */
+			if ((ms->flags & MAGIC_NO_CHECK_ENCODING) == 0)
+				file_encoding(ms, ubuf, nb, &u8buf, &ulen, &code, &code_mime, &type);
+			/* try soft magic tests */
 			if ((ms->flags & MAGIC_NO_CHECK_SOFT) != 0 ||
 			    (m = file_softmagic(ms, ubuf, nb, BINTEST)) == 0) {
-				/* try known keywords, check whether it is ASCII */
-				if ((ms->flags & MAGIC_NO_CHECK_ASCII) != 0 ||
-				    (m = file_ascmagic(ms, ubuf, nb)) == 0) {
-					/* abandon hope, all ye who remain here */
+				/* try text properties (and possibly text tokens) */
+				if ((ms->flags & MAGIC_NO_CHECK_TEXT) != 0 ||
+				    (m = file_ascmagic_with_encoding(ms, ubuf, nb, u8buf, ulen, code, code_mime, type)) == 0) {
+					/* give up */
 					if ((!mime || (mime & MAGIC_MIME_TYPE)) &&
 					    file_printf(ms, mime ?
-						"application/octet-stream" :
-						"data") == -1)
-						return -1;
-					m = 1;
+						"application/octet-stream" :
+						"data") == -1) {
+						rv = -1;
+						goto done;
+					}
 				}
+				m = 1;
 			}
 		}
 	}
 }
+ done:
+	if (u8buf)
+		free(u8buf);
+	if (rv)
+		return rv;
+
 #ifdef BUILTIN_ELF
 	if ((ms->flags & MAGIC_NO_CHECK_ELF) == 0 && m == 1 &&
 	    nb > 5 && fd != -1) {
@@ -242,7 +259,7 @@ file_reset(struct magic_set *ms)
 		return -1;
 	}
 	ms->o.buf = NULL;
-	ms->haderr = 0;
+	ms->event_flags &= ~EVENT_HAD_ERR;
 	ms->error = -1;
 	return 0;
 }
@@ -261,7 +278,7 @@ file_getbuffer(struct magic_set *ms)
 	char *pbuf, *op, *np;
 	size_t psize, len;
 
-	if (ms->haderr)
+	if (ms->event_flags & EVENT_HAD_ERR)
 		return NULL;
 
 	if (ms->flags & MAGIC_RAW)
@@ -324,7 +341,7 @@ file_getbuffer(struct magic_set *ms)
 
 	for (np = ms->o.pbuf, op = ms->o.buf; *op; op++) {
 		if (isprint((unsigned char)*op)) {
-			*np++ = *op;
+			*np++ = *op;
 		} else {
 			OCTALIFY(np, op);
 		}
diff --git a/src/magic.c b/src/magic.c
index 644ecb04..719a8fcb 100644
--- a/src/magic.c
+++ b/src/magic.c
@@ -28,7 +28,7 @@
 #include "file.h"
 
 #ifndef lint
-FILE_RCSID("@(#)$File: magic.c,v 1.55 2008/11/04 16:38:28 christos Exp $")
+FILE_RCSID("@(#)$File: magic.c,v 1.56 2008/11/06 21:17:45 rrt Exp $")
 #endif /* lint */
 
 #include "magic.h"
@@ -117,7 +117,7 @@ magic_open(int flags)
 	if ((ms->c.li = CAST(struct level_info *, malloc(len))) == NULL)
 		goto free;
 
-	ms->haderr = 0;
+	ms->event_flags = 0;
 	ms->error = -1;
 	ms->mlist = NULL;
 	ms->file = "unknown";
@@ -386,13 +386,13 @@ magic_buffer(struct magic_set *ms, const void *buf, size_t nb)
 public const char *
 magic_error(struct magic_set *ms)
 {
-	return ms->haderr ? ms->o.buf : NULL;
+	return (ms->event_flags & EVENT_HAD_ERR) ? ms->o.buf : NULL;
 }
 
 public int
 magic_errno(struct magic_set *ms)
 {
-	return ms->haderr ? ms->error : 0;
+	return (ms->event_flags & EVENT_HAD_ERR) ? ms->error : 0;
 }
 
 public int
diff --git a/src/magic.h b/src/magic.h
index 5864f266..a664e9aa 100644
--- a/src/magic.h
+++ b/src/magic.h
@@ -1,7 +1,7 @@
 /*
  * Copyright (c) Christos Zoulas 2003.
  * All Rights Reserved.
- *
+ *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
@@ -11,7 +11,7 @@
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
- *
+ *
  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
@@ -34,13 +34,13 @@
 #define MAGIC_SYMLINK 0x000002 /* Follow symlinks */
 #define MAGIC_COMPRESS 0x000004 /* Check inside compressed files */
 #define MAGIC_DEVICES 0x000008 /* Look at the contents of devices */
-#define MAGIC_MIME_TYPE 0x000010 /* Return only the MIME type */
+#define MAGIC_MIME_TYPE 0x000010 /* Return the MIME type */
 #define MAGIC_CONTINUE 0x000020 /* Return all matches */
 #define MAGIC_CHECK 0x000040 /* Print warnings to stderr */
 #define MAGIC_PRESERVE_ATIME 0x000080 /* Restore access time on exit */
-#define MAGIC_RAW 0x000100 /* Don't translate unprint chars */
+#define MAGIC_RAW 0x000100 /* Don't translate unprintable chars */
 #define MAGIC_ERROR 0x000200 /* Handle ENOENT etc as real errors */
-#define MAGIC_MIME_ENCODING 0x000400 /* Return only the MIME encoding */
+#define MAGIC_MIME_ENCODING 0x000400 /* Return the MIME encoding */
 #define MAGIC_MIME (MAGIC_MIME_TYPE|MAGIC_MIME_ENCODING)
 #define MAGIC_APPLE 0x000800 /* Return the Apple creator and type */
 #define MAGIC_NO_CHECK_COMPRESS 0x001000 /* Don't check for compressed files */
@@ -48,9 +48,13 @@
 #define MAGIC_NO_CHECK_SOFT 0x004000 /* Don't check magic entries */
 #define MAGIC_NO_CHECK_APPTYPE 0x008000 /* Don't check application type */
 #define MAGIC_NO_CHECK_ELF 0x010000 /* Don't check for elf details */
-#define MAGIC_NO_CHECK_ASCII 0x020000 /* Don't check for ascii files */
+#define MAGIC_NO_CHECK_TEXT 0x020000 /* Don't check for text files */
 #define MAGIC_NO_CHECK_CDF 0x040000 /* Don't check for cdf files */
-#define MAGIC_NO_CHECK_TOKENS 0x100000 /* Don't check ascii/tokens */
+#define MAGIC_NO_CHECK_TOKENS 0x100000 /* Don't check tokens */
+#define MAGIC_NO_CHECK_ENCODING 0x200000 /* Don't check text encodings */
+
+/* Defined for backwards compatibility (renamed) */
+#define MAGIC_NO_CHECK_ASCII MAGIC_NO_CHECK_TEXT
 
 /* Defined for backwards compatibility; do nothing */
 #define MAGIC_NO_CHECK_FORTRAN 0x000000 /* Don't check ascii/fortran */
diff --git a/src/softmagic.c b/src/softmagic.c
index e9e49513..0cebe2db 100644
--- a/src/softmagic.c
+++ b/src/softmagic.c
@@ -2,7 +2,7 @@
  * Copyright (c) Ian F. Darwin 1986-1995.
  * Software written by Ian F. Darwin and others;
  * maintained 1995-present by Christos Zoulas and others.
- *
+ *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions
  * are met:
@@ -12,7 +12,7 @@
  * 2. Redistributions in binary form must reproduce the above copyright
  *    notice, this list of conditions and the following disclaimer in the
  *    documentation and/or other materials provided with the distribution.
- * + * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE @@ -32,7 +32,7 @@ #include "file.h" #ifndef lint -FILE_RCSID("@(#)$File: softmagic.c,v 1.127 2008/11/06 15:38:28 christos Exp $") +FILE_RCSID("@(#)$File: softmagic.c,v 1.128 2008/11/06 21:17:45 rrt Exp $") #endif /* lint */ #include "magic.h" @@ -69,9 +69,6 @@ file_softmagic(struct magic_set *ms, const unsigned char *buf, size_t nbytes, in { struct mlist *ml; int rv; - if (ms->flags & MAGIC_MIME_ENCODING) - /* Let ascmagic do the work */ - return 0; for (ml = ms->mlist->next; ml != ms->mlist; ml = ml->next) if ((rv = match(ms, ml->magic, ml->nmagic, buf, nbytes, mode)) != 0) return rv; @@ -140,7 +137,7 @@ match(struct magic_set *ms, struct magic *magic, uint32_t nmagic, if (flush) { if (m->reln == '!') flush = 0; - } else { + } else { switch (magiccheck(ms, m)) { case -1: return -1; @@ -152,7 +149,7 @@ match(struct magic_set *ms, struct magic *magic, uint32_t nmagic, } } if (flush) { - /* + /* * main entry didn't match, * flush its continuations */ @@ -169,7 +166,7 @@ match(struct magic_set *ms, struct magic *magic, uint32_t nmagic, if (*m->desc) { need_separator = 1; printed_something = 1; - if ((e = handle_annotation(ms, m)) != 0) + if ((e = handle_annotation(ms, m)) != -2) return e; if (print_sep(ms, firstline) == -1) return -1; @@ -212,7 +209,7 @@ match(struct magic_set *ms, struct magic *magic, uint32_t nmagic, flush = !mget(ms, s, m, nbytes, cont_level); if (flush && m->reln != '!') continue; - + switch (flush ? 
1 : magiccheck(ms, m)) { case -1: return -1; @@ -237,7 +234,7 @@ match(struct magic_set *ms, struct magic *magic, uint32_t nmagic, */ if (*m->desc) { printed_something = 1; - if ((e = handle_annotation(ms, m)) != 0) + if ((e = handle_annotation(ms, m)) != -2) return e; if (print_sep(ms, firstline) == -1) return -1; @@ -277,7 +274,7 @@ match(struct magic_set *ms, struct magic *magic, uint32_t nmagic, } if ((ms->flags & MAGIC_CONTINUE) == 0 && printed_something) { return 1; /* don't keep searching */ - } + } } return returnval; /* This is hit if -k is set or there is no match */ } @@ -826,7 +823,7 @@ mcopy(struct magic_set *ms, union VALUETYPE *p, int type, int indir, } if (lines) last = (const char *)s + nbytes; - + ms->search.s = buf; ms->search.s_len = last - buf; ms->search.offset = offset; @@ -839,10 +836,10 @@ mcopy(struct magic_set *ms, union VALUETYPE *p, int type, int indir, const unsigned char *esrc = s + nbytes; char *dst = p->s; char *edst = &p->s[sizeof(p->s) - 1]; - + if (type == FILE_BESTRING16) src++; - + /* check for pointer overflow */ if (src < s) { file_magerror(ms, "invalid offset %u in mcopy()", @@ -1389,14 +1386,14 @@ mget(struct magic_set *ms, const unsigned char *s, if (nbytes < (offset + 1)) /* should alway be true */ return 0; break; - + case FILE_SHORT: case FILE_BESHORT: case FILE_LESHORT: if (nbytes < (offset + 2)) return 0; break; - + case FILE_LONG: case FILE_BELONG: case FILE_LELONG: @@ -1415,7 +1412,7 @@ mget(struct magic_set *ms, const unsigned char *s, if (nbytes < (offset + 4)) return 0; break; - + case FILE_DOUBLE: case FILE_BEDOUBLE: case FILE_LEDOUBLE: @@ -1465,7 +1462,7 @@ file_strncmp(const char *s1, const char *s2, size_t len, uint32_t flags) if (0L == flags) { /* normal string: do it fast */ while (len-- > 0) if ((v = *b++ - *a++) != '\0') - break; + break; } else { /* combine the others */ while (len-- > 0) { @@ -1479,8 +1476,8 @@ file_strncmp(const char *s1, const char *s2, size_t len, uint32_t flags) if ((v = 
toupper(*b++) - *a++) != '\0') break; } - else if ((flags & STRING_COMPACT_BLANK) && - isspace(*a)) { + else if ((flags & STRING_COMPACT_BLANK) && + isspace(*a)) { a++; if (isspace(*b++)) { while (isspace(*b)) @@ -1575,23 +1572,23 @@ magiccheck(struct magic_set *ms, struct magic *m) case 'x': matched = 1; break; - + case '!': matched = fv != fl; break; - + case '=': matched = fv == fl; break; - + case '>': matched = fv > fl; break; - + case '<': matched = fv < fl; break; - + default: matched = 0; file_magerror(ms, "cannot happen with float: invalid relation `%c'", @@ -1609,23 +1606,23 @@ magiccheck(struct magic_set *ms, struct magic *m) case 'x': matched = 1; break; - + case '!': matched = dv != dl; break; - + case '=': matched = dv == dl; break; - + case '>': matched = dv > dl; break; - + case '<': matched = dv < dl; break; - + default: matched = 0; file_magerror(ms, "cannot happen with double: invalid relation `%c'", m->reln); @@ -1825,18 +1822,23 @@ magiccheck(struct magic_set *ms, struct magic *m) private int handle_annotation(struct magic_set *ms, struct magic *m) { - int ret = 0; if (ms->flags & MAGIC_APPLE) { if (file_printf(ms, "%.8s", m->apple) == -1) return -1; - ret = 1; + return 1; } - if (ms->flags & MAGIC_MIME_TYPE) { - if (file_printf(ms, "%s", m->mimetype) == -1) - return -1; - ret = 1; + if (ms->flags & MAGIC_MIME) { + if (ms->flags & MAGIC_MIME_TYPE) { + if (file_printf(ms, "%s", m->mimetype) == -1) + return -1; + } + if (ms->flags & MAGIC_MIME_ENCODING) { + ms->event_flags |= EVENT_WROTE_MIME_TYPE; + return 0; /* Let ascmagic find the encoding */ + } + return 1; } - return ret; + return -2; } private int @@ -1845,7 +1847,7 @@ print_sep(struct magic_set *ms, int firstline) if (firstline) return 0; /* - * we found another match + * we found another match * put a newline and '-' to do some simple formatting */ return file_printf(ms, "\n- ");